Home Blog Using Materialized Views to Implement Efficient Reports in Django

DEV

Using Materialized Views to Implement Efficient Reports in Django

Posted by gavin on March 20, 2018, 4:32 p.m.

I am working with a client that frequently requests summary reports over their database. These reports were originally seldom-requested so they were implemented by manually running SQL queries when necessary. Ideally there would just be a reporting page on their site that gave them all the reports they needed, but the reports are quite slow to generate. Generating them on demand would mean the user is waiting for several minutes. Even storing the generated reports in memcached would not be sufficient, because there's always the possibility of hitting an empty cache.

We would like a system where the report is generated on a schedule, guaranteed to be available at all times, and stored in a structured format that allows easy processing to both display the report and provide exports of the data. Materialized views are a feature of postgres and other advanced SQL databases that provide exactly these requirements.

A materialized view looks like a normal table to SELECT queries, but actually contains the pre-computed results from another, usually slower, query. Once created, the materialized view can be refreshed on demand. The fact that the materialized view appears as a normal table also allows us to use the familiar Django ORM to retrieve and format our report.

For our example, we'll have a report that counts user signups per month for the lifetime of our site. This is slow when we have a lot of users.

Since the Django ORM is only able to query the materialized view, not create or refresh it, some raw SQL is necessary. To set up querying, we define a model with fields that match the results of our materialized view, and set managed = False. Making the model unmanaged tells Django not to try to create the table on its own, because we're going to create a view with raw SQL instead.

1
2
3
4
5
6
class UserSignup(models.Model)
    date = models.DateField()
    count = models.IntegerField()

    class Meta:
        managed = False

I also create a regular model to track when the reports were last updated:

1
2
3
class RefreshLog(models.Model):
    created_at = models.DateTimeField(auto_now_add=True, db_index=True)
    duration = models.DurationField()

I show the most recent refresh date on the reports page to give the user confidence that they are seeing up-to-date data.

Then create a migration (Django needs a migration whenever you add a model, even if it's not managed and doesn't have any changes to apply to the database). To the generated migration we'll add a RunSQL operation to create and drop our materialized view.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
migrations.RunSQL(
    """
    CREATE MATERIALIZED VIEW reports_usersignup AS
    SELECT date_trunc('month', date_joined) AS "date", count(*) AS "count"
    FROM auth_user
    GROUP BY 1;
    CREATE UNIQUE INDEX reports_usersignup_pk ON reports_usersignup(date);
    """,
    """
    DROP MATERIALIZED VIEW reports_usersignup;
    """
)

The unique index is not strictly necessary, but will allow us to use CONCURRENTLY option when refreshing the materialized view.

So far we've set up the materialized view, but it will never get refreshed. I have a refresh_reports function that does the refreshing:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
import datetime
import time

from django.db import transaction, connection

from .models import RefreshLog

@transaction.atomic
def refresh_reports():
    start_time = time.monotonic()
    with connection.cursor() as cursor:
        cursor.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY reports_usersignup")
    end_time = time.monotonic()
    RefreshLog.objects.create(duration=datetime.timedelta(seconds=end_time - start_time))

Performing the update in a transaction ensures that the RefreshLog is always consistent with the actual last refresh of the materialized view. This function can be called from a celery beat task or management command that's triggered by a cron job.

We use the CONCURRENTLY option to REFRESH MATERIALIZED VIEW to ensure that the reports page stays available while our report is generating. Otherwise, postgres takes a lock on the materialized view that prevents querying it, which would force the user to wait if the tried to view the report while it was regenerating. This way we can refresh the report as often as we want without interfering with users viewing the report.

All that's left is to write a view that allows querying the report. I retrieve all the UserSignup records with UserSignup.objects.order_by('date'), and the most recent update with RefreshLog.objects.order_by('-created_at').first(). The template shows the most recent update and renders the report nicely in a table. Storing the report as structured data in a table makes it easy to add a CSV export of the report if desired.

This approach is fast and easily extensible to include more than one report or multiple views of the same report.