Using Materialized Views to Implement Efficient Reports in Django
Posted by gavin on March 20, 2018, 4:32 p.m.
I am working with a client that frequently requests summary reports over their database. These reports were originally seldom-requested so they were implemented by manually running SQL queries when necessary. Ideally there would just be a reporting page on their site that gave them all the reports they needed, but the reports are quite slow to generate. Generating them on demand would mean the user is waiting for several minutes. Even storing the generated reports in memcached would not be sufficient, because there's always the possibility of hitting an empty cache.
We would like a system where the report is generated on a schedule, guaranteed to be available at all times, and stored in a structured format that allows easy processing to both display the report and provide exports of the data. Materialized views are a feature of postgres and other advanced SQL databases that provide exactly these requirements.
A materialized view looks like a normal table to
SELECT queries, but
actually contains the pre-computed results from another, usually slower,
query. Once created, the materialized view can be refreshed on demand.
The fact that the materialized view appears as a normal table also
allows us to use the familiar Django ORM to retrieve and format our
For our example, we'll have a report that counts user signups per month for the lifetime of our site. This is slow when we have a lot of users.
Since the Django ORM is only able to query the materialized view, not create
or refresh it, some raw SQL is necessary. To set up querying, we define a
model with fields that match the results of our materialized view, and set
managed = False. Making the model unmanaged tells Django not to try to
create the table on its own, because we're going to create a view with raw SQL
1 2 3 4 5 6
class UserSignup(models.Model) date = models.DateField() count = models.IntegerField() class Meta: managed = False
I also create a regular model to track when the reports were last updated:
1 2 3
class RefreshLog(models.Model): created_at = models.DateTimeField(auto_now_add=True, db_index=True) duration = models.DurationField()
I show the most recent refresh date on the reports page to give the user confidence that they are seeing up-to-date data.
Then create a migration (Django needs a migration whenever you add a model,
even if it's not managed and doesn't have any changes to apply to the
database). To the generated migration we'll add a
RunSQL operation to create
and drop our materialized view.
1 2 3 4 5 6 7 8 9 10 11 12
migrations.RunSQL( """ CREATE MATERIALIZED VIEW reports_usersignup AS SELECT date_trunc('month', date_joined) AS "date", count(*) AS "count" FROM auth_user GROUP BY 1; CREATE UNIQUE INDEX reports_usersignup_pk ON reports_usersignup(date); """, """ DROP MATERIALIZED VIEW reports_usersignup; """ )
The unique index is not strictly necessary, but will allow us to use
CONCURRENTLY option when refreshing the materialized view.
So far we've set up the materialized view, but it will never get refreshed. I
refresh_reports function that does the refreshing:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
import datetime import time from django.db import transaction, connection from .models import RefreshLog @transaction.atomic def refresh_reports(): start_time = time.monotonic() with connection.cursor() as cursor: cursor.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY reports_usersignup") end_time = time.monotonic() RefreshLog.objects.create(duration=datetime.timedelta(seconds=end_time - start_time))
Performing the update in a transaction ensures that the RefreshLog is always consistent with the actual last refresh of the materialized view. This function can be called from a celery beat task or management command that's triggered by a cron job.
We use the
CONCURRENTLY option to
REFRESH MATERIALIZED VIEW to ensure that
the reports page stays available while our report is generating. Otherwise,
postgres takes a lock on the materialized view that prevents querying it,
which would force the user to wait if the tried to view the report while it
was regenerating. This way we can refresh the report as often as we want
without interfering with users viewing the report.
All that's left is to write a view that allows querying the report. I retrieve
all the UserSignup records with
UserSignup.objects.order_by('date'), and the
most recent update with
The template shows the most recent update and renders the report nicely in a
table. Storing the report as structured data in a table makes it easy to add a
CSV export of the report if desired.
This approach is fast and easily extensible to include more than one report or multiple views of the same report.