Full-text Search with Django and PostgreSQL

Django has several PostgreSQL database functions to support full-text search. If you are using PostgreSQL as the database backend, then it is easy to add full-text search to your Django app.

In this post, I would like to build an demo app with full-text search. This post covers several PostgreSQL-specific features, such as SearchVectorField, SearchVector, SearchQuery, and SearchRank. In addition, this post uses PostgreSQL triggers to update SearchVectorField automatically and explains how to manage triggers with Django migrations.

This post is organized as follows:

  1. Setup environment
  2. SearchVectorField and GinIndex
  3. Update search vectors
  4. Full-text search queries
  5. Update search vectors with a trigger
  6. Rank and order
  7. Add weights to search vectors
  8. Derive search vectors from related objects
  9. Epilogue

The source code of the demo app is in the GitHub repository loganchien/django-fts-demo.

Setup Environment

To build the demo app, PostgreSQL and Django are required. This section includes the instructions to set up the environment.

PostgreSQL

Install PostgreSQL on Ubuntu with:

$ sudo apt-get install postgresql

To create users and databases, enter PostgreSQL interactive shell with:

$ sudo -u postgres psql

In the PostgreSQL interactive shell, create a user with:

postgres=# CREATE USER [username] PASSWORD '[password]' CREATEDB;

And then, create a database with:

postgres=# CREATE DATABASE [db_name] OWNER [username] ENCODING 'utf-8';

Leave the interactive shell with \q:

postgres=# \q

Test whether it works or not:

$ psql -h localhost -U [username] [db_name]
Password for user [username]: [password]
postgres=# \q

Django

Create and enter a Python virtual environment with:

$ mkvirtualenv fts_demo -p /usr/bin/python3

Install django and psycopg2 package with pip install:

$ pip install django==2.0 psycopg2==2.7.3.2

Create a project named fts_demo and an app named fts:

$ django-admin startproject fts_demo
$ cd fts_demo
$ ./manage.py startapp fts

Open fts_demo/settings.py and edit the settings:

$ vim fts_demo/settings.py

Add fts to the end of INSTALLED_APPS:

INSTALLED_APPS = [
    'django.contrib.admin',
    # ...
    'django.contrib.staticfiles',
    'fts',  # ADDED
]

Change the default database settings in the DATABASES variable:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': '[db_name]',
        'USER': '[username]',
        'PASSWORD': '[password]',
        'HOST': '127.0.0.1',
        'PORT': '5432',
    }
}

SearchVectorField and GinIndex

To perform full-text searches efficiently, a database has to preprocess the data and summerize them as search vectors. Different search techniques may have different search vector structures. In PostgreSQL, a search vector consists of several vocabularies and their locations. For example, in the table below, the inputs on the left column are converted to search vectors on the right column:

Id Input Search Vector
1 John is reading a book. 'book':5 'john':1 'read':3
2 Johnson reads an article. 'articl':4 'johnson':1 'read':2

Because it takes time to convert strings into search vectors, it would be better to save these search vectors in the database as well. Since Django 1.10, you may add SearchVectorField to a model and save the search vector to this column. It will be converted to tsvector, which is a PostgreSQL built-in text search type.

A search vector column summerizes a row. However, there are many rows in a database table, thus indexes should be created so that the database can select matching rows efficiently. For example, a naive index implementation may map words to row IDs, so that the database can easily answer 1 and 2 when a user searches for read:

Words Rows
articl 2
book 1
john 1
johnson 2
read 1, 2

In the real world, the structures of database indexes are much more complex. PostgreSQL provides GinIndex and GistIndex. GinIndex is based on Generalized Inverted Index (GIN) and GistIndex is based on Generalized Search Tree (GiST). They have different trade-offs and performance characteristics. Django supports both of them and their usages are similar. For the sake of brevity, GinIndex is chosen for our demo app.

Define a Model

In our demo app, we would like to save several articles in the database. Each article consists of a headline and some content. And we would like to search these articles by some keywords.

Let's open fts/models.py and add a Article model:

from django.db import models
from django.contrib.postgres.search import SearchVectorField
from django.contrib.postgres.indexes import GinIndex

class Article(models.Model):
    headline = models.TextField()
    content = models.TextField()
    search_vector = SearchVectorField(null=True)

    class Meta(object):
        indexes = [GinIndex(fields=['search_vector'])]

Then create a migration for this model:

$ ./manage.py makemigrations fts

And migrate the database with:

$ ./manage.py migrate fts

As described earlier, the search_vector field is created to hold the preprocessed search vectors. The search_vector field has to be nullable (null=True) because due to some limitation SearchVectorField has to be updated after an object is created.

In addition, a GinIndex is created for the search_vector field to enable efficient search queries. It specified in the Meta class.

Update Search Vectors

Before performing any full-text search, you should update search vectors. If you don't update search vectors, no results will be found.

For example, if you run the following code in ./manage.py shell:

from fts.models import Article

Article.objects.create(
    headline='Progressive tesne', content='John is reading a book.')
Article.objects.create(
    headline='Present tense', content='Johnson reads an article.')
Article.objects.create(
    headline='Article for piano', content='Bob plays a piano.')

Article.objects.filter(search_vector='book').values_list('content')

Then, you will see this output (no results):

<QuerySet []>

To update the search vectors, run the code below:

from django.contrib.postgres.search import SearchVector
Article.objects.update(search_vector=SearchVector('headline', 'content'))

The arguments of SearchVector() are the names of the columns that contain the input strings. The search vectors will be derived from these columns. In our example, we would like to derive search vectors the from headline and content columns.

Now, the full-text search should work:

>>> Article.objects.filter(search_vector='book').values_list('content')
<QuerySet [('John is reading a book.',)]>

Note

SearchVectorField must be updated after an object is created because SearchVector() is an F() expression, which refers another column. It is not available when you are creating a new object.

Full-text Search Queries

To search for a keyword, add filter(search_vector=query) to the database query. filter() will select the relevant objects that match with the query.

For example, the example below searches for the objects with book:

>>> Article.objects.filter(search_vector='book').values_list('content')
<QuerySet [('John is reading a book.',)]>

The example below shows that prefixes don't match (i.e. book in the content column doesn't match with the query keyword boo):

>>> Article.objects.filter(search_vector='boo').values_list('content')
<QuerySet []>

The example below shows that John and Johnson are different:

>>> Article.objects.filter(search_vector='john').values_list('content')
<QuerySet [('John is reading a book.',)]>

The example below shows how stemming works. A query with the keyword read matches both reading and reads:

>>> Article.objects.filter(search_vector='read').values_list('content')
<QuerySet [
    ('John is reading a book.',),
    ('Johnson reads an article.',)
]>

Using SearchQuery

More advanced queries can be built with SearchQuery:

from django.contrib.postgres.search import SearchQuery

First of all, wrapping a string with SearchQuery() is equal to the queries mentioned earlier:

>>> query = SearchQuery('book')
>>> Article.objects.filter(search_vector=query).values_list('content')
<QuerySet [('John is reading a book.',)]>

Two SearchQuery() instances can be combined with an OR operator (|). The example below matches all articles with either read or piano:

>>> query = SearchQuery('read') | SearchQuery('piano')
>>> Article.objects.filter(search_vector=query).values_list('content')
<QuerySet [
    ('John is reading a book.',),
    ('Johnson reads an article.',),
    ('Bob plays a piano.',)
]>

Two SearchQuery() instances can be combined with an AND operator (&). The example below matches all articles with both read and article):

>>> query = SearchQuery('read') & SearchQuery('article')
>>> Article.objects.filter(search_vector=query).values_list('content')
<QuerySet [('Johnson reads an article.',)]>

A SearchQuery() instance can be qualified by a NOT operator (~). The example below matches all articles without johnson:

>>> query = ~SearchQuery('johnson')
>>> Article.objects.filter(search_vector=query).values_list('content')
<QuerySet [('John is reading a book.',), ('Bob plays a piano.',)]>

Of course, operators can be combined together. For example, the example below matches all articles with read but without johnson:

>>> query = SearchQuery('read') & ~SearchQuery('johnson')
>>> Article.objects.filter(search_vector=query).values_list('content')
<QuerySet [('John is reading a book.',)]>

Finally, after testing, you may clear the database by rolling back the migrations:

$ ./manage.py migrate fts zero

Update Search Vectors with a Trigger

A trigger is a database hook that calls the registered procedure when the specified events occur. An application of triggers is to keep the database in a consistent state. For example, one may create a trigger to restore the invariant when a query invalidates the database invariant. If the trigger fails to restore the invariant, the query will fail and the database will remain unchanged. In our demo app, we would like to create a trigger which always keeps search_vector in sync with title and content.

This is the SQL syntax to create a trigger:

CREATE TRIGGER [trigger_name]
[BEFORE | AFTER] [INSERT | UPDATE | UPDATE OF column| DELETE]
ON [table_name]
FOR EACH [ROW | STATEMENT]
EXECUTE PROCEDURE [procedure];

A trigger may listen to following events:

  • INSERT -- A new row is inserted into the table.
  • UPDATE -- A row in the table is updated.
  • UPDATE OF column1, column2 -- The value of column1 or column2 of a row in the table is updated.
  • DELETE -- A row is deleted from a table.

A trigger may listen to multiple events. To listen to multiple events, combine the events with OR operators. For example, a trigger with INSERT OR UPDATE listens to both INSERT and UPDATE events.

A trigger may call the procedure BEFORE or AFTER the event is effective. A trigger that runs before the insert event won't find the to-be-inserted row in the table. Similarly, a trigger that runs after the delete event will not find the deleted row in the table.

A SQL statement may insert, update, or delete multiple rows. If a trigger specifies FOR EACH STATEMENT, then the callback procedure is called only once per query. If a trigger specifies FOR EACH ROW, then the callback procedure is called for each rows.

tsvector_update_trigger() is a PostgreSQL built-in function that reads the strings from the specified columns and write the computed search vector to the destination search vector column:

tsvector_update_trigger(tsvector, lang_config, column[, ...])

The first argument is the destination column to save the computed search vector. The second argument is the language of the input strings. The rest of the arguments are the source columns from which the search vector is derived.

For example, the code below collects the strings from the headline and content, treat them as English, convert them into a search vector, and save the result to the search_vector column:

tsvector_update_trigger(
    search_vector, 'pg_catalog.english', headline, content)

Combining the knowledge above, the SQL statement below will create a trigger that updates the search vector:

CREATE TRIGGER article_update_trigger
BEFORE INSERT OR UPDATE OF headline, content, search_vector
ON fts_article
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(
    search_vector, 'pg_catalog.english', headline, content);

This trigger will run tsvector_update_trigger() when a row is inserted into fts_article or when the headline, content, or search_vector column of a row is updated.

Manage Triggers with Django Migrations

To integrate the SQL statement into our Django app, let's create an empty migration:

$ ./manage.py makemigrations fts -n create_trigger --empty

And then, open fts/migrations/0002_create_trigger.py and add a RunSQL operation:

from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [
        ('fts', '0001_initial'),
    ]

    operations = [
        migrations.RunSQL(
            sql='''
            CREATE TRIGGER article_update_trigger
            BEFORE INSERT OR UPDATE OF headline, content, search_vector
            ON fts_article
            FOR EACH ROW EXECUTE PROCEDURE
            tsvector_update_trigger(
              search_vector, 'pg_catalog.english', headline, content);

            UPDATE fts_article SET search_vector = NULL;
            ''',

            reverse_sql='''
            DROP TRIGGER IF EXISTS article_update_trigger
            ON fts_article;
            '''),
    ]

The forward migration SQL statement (sql) creates a trigger named article_update_trigger and updates all existing search vectors by triggering the article_update_trigger.

The backward migration SQL statement simply drops the article_update_trigger.

Run the command below to test the forward migration:

$ ./manage.py migrate fts 0002_create_trigger
Operations to perform:
  Target specific migration: 0002_create_trigger, from fts
Running migrations:
  Applying fts.0002_create_trigger... OK

And run the command below to test the rollback migration:

$ ./manage.py migrate fts 0001_initial
Operations to perform:
  Target specific migration: 0001_initial, from fts
Running migrations:
  Rendering model states... DONE
  Unapplying fts.0002_create_trigger... OK

Finally, run the forward migration again so that you can run some test code later:

$ ./manage.py migrate fts

Now, run the following code in ./manage.py shell:

from fts.models import Article

Article.objects.create(
    headline='Progressive tesne', content='John is reading a book.')
Article.objects.create(
    headline='Present tense', content='Johnson reads an article.')
Article.objects.create(
    headline='Article for piano', content='Bob plays a piano.')

Article.objects.filter(search_vector='book').values_list('content')

The output will be:

<QuerySet [('John is reading a book.',)]>

The trigger will update the search vector column whenever a row is inserted or updated, thus you don't have to update the search vector manually anymore. Comparing to the search results in the previous section, the search results shall remain unchanged. You may run the search queries in the Full-text Search Queries section and verify it.

Rank and Order

The filter() function mentioned above only selects the matching objects. It would be helpful to sort the matching objects by relevance. For example, objects with more matching terms should be displayed before objects with less matching terms.

In Django, you can annotate a query with SearchRank() and then sort the matching objects by the value in the annotated field:

from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F

query = SearchQuery('hello')
rank = SearchRank(F('search_vector'), query)

Article.objects.annotate(rank=rank) \
    .filter(search_vector=query).order_by('-rank') \
    .values_list('id', 'content', 'rank')

To compute the rank of each objects, the SearchRank class takes three arguments:

  • The first argument should be a search vector field. It can be either a SearchVector instance or an F() expression. A SearchVector instance computes search vectors from other columns on-the-fly. On the other hand, an F() expression refers an existing search vector field in a model.
  • The second argument should be a search query.

After creating a SearchRank instance, the function annotate(rank=rank) is called to create a temporary field named rank which contains the ranks computed by SearchRank. It can be further sorted by the order_by('-rank') function.

To test the code above, let's create some test data in ./manage.py shell:

from fts.models import Article

Article.objects.create(headline='x4', content='hello hello hello hello')
Article.objects.create(headline='x1', content='hello')
Article.objects.create(headline='x3', content='hello hello hello')
Article.objects.create(headline='x2', content='hello hello')

The code below shows how to sort matching objects by relevance:

query = SearchQuery('hello')
rank = SearchRank(F('search_vector'), query)

Article.objects.annotate(rank=rank) \
    .filter(search_vector=query).order_by('-rank') \
    .values_list('id', 'content', 'rank')

The output looks like:

<QuerySet [
    (1, 'hello hello hello hello', 0.0865452),
    (3, 'hello hello hello', 0.0827456),
    (4, 'hello hello', 0.0759909),
    (2, 'hello', 0.0607927)
]>

That's all. In the next section, we will add weights to different columns.

Add Weights to Search Vectors

So far, all columns are equal. An occurrance in one column is equal to an occurrance from another column. However, in the real world, the importance of columns are unequal. An occurrance in one column may outweigh occurrances from other columns. In our demo app, headline is much more important than content, thus a heavier weight should be given to headline.

Add Weights in Django

To update the search vector manually, rollback the migration which adds the trigger:

$ ./manage.py migrate fts 0001_initial

Run the code below in ./manage.py shell:

from django.contrib.postgres.search import SearchVector

sv = SearchVector('headline', weight='A') + \
     SearchVector('content', weight='B')

Article.objects.update(search_vector=sv)

And then, list the updated search vectors with:

>>> Article.objects.all().values_list('id', 'search_vector')
<QuerySet [
    (1, "'book':7B 'john':3B 'progress':1A 'read':5B 'tesn':2A"),
    (2, "'articl':6B 'johnson':3B 'present':1A 'read':4B 'tens':2A"),
    (3, "'articl':1A 'bob':4B 'piano':3A,7B 'play':5B")
]>

The biggest difference is that the weights are appended to each locations. For example, articl in the second row is 6B and articl in the third row is 1A. Similarly, piano in the third row includes both 3A and 7B.

With these weights, the third row will have higher rank in the following query:

from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F

query = SearchQuery('article')
rank = SearchRank(F('search_vector'), query)

Article.objects.annotate(rank=rank) \
    .filter(search_vector=query).order_by('-rank')\
    .values_list('id', 'rank')

The output becomes:

<QuerySet [(3, 0.607927), (2, 0.243171)]>

Note

Before adding weights, both rows have the same rank 0.0607927.

Add Weights in Triggers

It is much more difficult to add weights in triggers because tsvector_update_trigger() does not support weights. To build search vectors with weights, you have to write a PostgreSQL function. Writing functions for triggers is not an easy task. We will only cover some topics as needed. Please read the PostgreSQL manual for further information.

First of all, this is the SQL syntax to create a PostgreSQL trigger function:

CREATE [OR REPLACE] FUNCTION function_name()
RETURNS TRIGGER
LANGUAGE plpgsql AS $$
BEGIN
  -- ... code ...
  RETURN NEW;
END;
$$;

This SQL statement creates a function named function_name, without arguments, and returns a trigger. The LANGUAGE clause specifies that the code in the quotation $$ is PL/pgSQL, which is a SQL extension defined by PostgreSQL.

NEW and OLD are special global variables if a trigger has the FOR EACH ROW clause. NEW refers the row which will be inserted into the table by an insert query or the new row which will replace the old row in an update query. On the other hand, OLD refers the old row which will be replaced by an update query or the row which will be deleted from the table by a delete query.

If the procedure is triggered by an insert query or an update query, then the procedure should return NEW. If the procedure is triggered by a delete query, then the procedure should return OLD. It is fine to return NULL but you will get different results. In our demo app, we don't want such behavior, thus we won't return NULL.

To compute search vectors, three functions are required:

  • to_tsvector([lang_config, ]string) -- This function converts a string into a search vector. An optional language configuration may be given. The search vector can be concatenated with || operators.
  • setweight(tsvector, weight) -- This function adds weights to a search vector. The weight may be A, B, C, or D.
  • coalesce(value, alt) -- This function returns alt if value is NULL. This function is important because to_tsvector() will return NULL if the input string is NULL.

To save the computed search vector into a column, SELECT ... INTO ... statement may be used. It computes the expression after SELECT and save the result to the column specified after INTO.

Combining the knowledge above, update_article_search_vector() can be created with the SQL statement below:

CREATE OR REPLACE FUNCTION update_article_search_vector()
RETURNS TRIGGER
LANGUAGE plpgsql AS $$
BEGIN
  SELECT
    setweight(to_tsvector(coalesce(NEW.headline, '')), 'A') ||
    setweight(to_tsvector(coalesce(NEW.content, '')), 'B')
  INTO NEW.search_vector;
  RETURN NEW;
END;
$$;

And the article_update_trigger should call update_article_search_vector() instead:

CREATE TRIGGER article_update_trigger
BEFORE INSERT OR UPDATE OF headline, content, search_vector
ON fts_article
FOR EACH ROW EXECUTE PROCEDURE update_article_search_vector();

To integrate these SQL statements into our demo app, create an empty migration with:

$ ./manage.py makemigrations fts -n add_weights --empty

Then open fts/migrations/0003_add_weights.py and add the code below:

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('fts', '0002_create_trigger'),
    ]

    operations = [
        migrations.RunSQL(
            sql='''
            DROP TRIGGER article_update_trigger ON fts_article;

            CREATE OR REPLACE FUNCTION update_article_search_vector()
            RETURNS TRIGGER
            LANGUAGE plpgsql AS $$
            BEGIN
              SELECT
                setweight(to_tsvector(
                  coalesce(NEW.headline, '')), 'A') ||
                setweight(to_tsvector(
                  coalesce(NEW.content, '')), 'B')
              INTO NEW.search_vector;
              RETURN NEW;
            END;
            $$;

            CREATE TRIGGER article_update_trigger
            BEFORE INSERT OR UPDATE OF headline, content, search_vector
            ON fts_article
            FOR EACH ROW
            EXECUTE PROCEDURE update_article_search_vector();

            UPDATE fts_article SET search_vector = NULL;
            ''',
            reverse_sql='''
            DROP TRIGGER article_update_trigger ON fts_article;

            DROP FUNCTION update_article_search_vector();

            CREATE TRIGGER article_update_trigger
            BEFORE INSERT OR UPDATE OF headline, content, search_vector
            ON fts_article
            FOR EACH ROW EXECUTE PROCEDURE
            tsvector_update_trigger(
              search_vector, 'pg_catalog.english', headline, content);

            UPDATE fts_article SET search_vector = NULL;
            '''),
    ]

The forward migration drops the existing article_update_trigger, creates a new function named update_article_search_vector(), create a new article_update_trigger which calls update_article_search_vector(), and then update all search vectors.

Finally, you may migrate the database:

$ ./manage.py migrate fts

And run the following code in ./manage.py shell to test whether the trigger works:

from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F

query = SearchQuery('article')
rank = SearchRank(F('search_vector'), query)

Article.objects.annotate(rank=rank) \
    .filter(search_vector=query).order_by('-rank')\
    .values_list('id', 'rank')

The output must be:

<QuerySet [(3, 0.607927), (2, 0.243171)]>

Map to Numeric Weights

Weights in search vectors are represented by alphabets A, B, C, and D. This representation gives the flexibility to map them into different numeric weights. For example, some users may think headlines are more important than contents but other users may think conversely. An option may be provided to users so that users can pick their preferences.

A weights argument may be passed to SearchRank(). The weights argument should be a list with 4 floating point numbers ranging from 0.0 to 1.0. The floating point numbers stand for the numeric weight for D, C, B, and A respectively. If the weights is not specified, then the default values are 0.1, 0.2, 0.4, and 1.0 respectively.

For example, if headlines are more important than contents, then you may use the default weights:

from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F

query = SearchQuery('article')
rank = SearchRank(F('search_vector'), query,
                  weights=[0.1, 0.2, 0.4, 1.0])

Article.objects.annotate(rank=rank) \
    .filter(search_vector=query).order_by('-rank')\
    .values_list('id', 'rank')

And the output will show that row 3 comes before row 2:

<QuerySet [(3, 0.607927), (2, 0.243171)]>

Conversely, if contents are more important than headlines, then you may swap the third and fourth weight:

query = SearchQuery('article')
rank = SearchRank(F('search_vector'), query,
                  weights=[0.1, 0.2, 1.0, 0.4])  # SWAPPED

Article.objects.annotate(rank=rank) \
    .filter(search_vector=query).order_by('-rank')\
    .values_list('id', 'rank')

And the output will show that row 2 comes before row 3:

<QuerySet [(2, 0.607927), (3, 0.243171)]>

Epilogue

In this post, I started from explaining the purpose of SearchVectorField and GinIndex. Next, I discussed how to update SearchVectorField and make queries with SearchQuery. Then, I went through more advanced topics, such as updating search vectors with PostgreSQL triggers, sorting objects by relevance, and adding weights. Finally, I ended up with a technique to derive search vectors from related objects. I believe this article covered all common use cases. All of the code can be found in the GitHub repository loganchien/django-fts-demo.

This is a long article. I hope this article is helpful and enjoyable. Feel free to write me e-mails if you have any comments.