Django has several PostgreSQL database functions to support full-text search. If you are using PostgreSQL as the database backend, then it is easy to add full-text search to your Django app.
In this post, I would like to build an demo app with full-text search. This post covers several PostgreSQL-specific features, such as SearchVectorField, SearchVector, SearchQuery, and SearchRank. In addition, this post uses PostgreSQL triggers to update SearchVectorField automatically and explains how to manage triggers with Django migrations.
This post is organized as follows:
- Setup environment
- SearchVectorField and GinIndex
- Update search vectors
- Full-text search queries
- Update search vectors with a trigger
- Rank and order
- Add weights to search vectors
- Derive search vectors from related objects
- Epilogue
The source code of the demo app is in the GitHub repository loganchien/django-fts-demo.
Setup Environment
To build the demo app, PostgreSQL and Django are required. This section includes the instructions to set up the environment.
PostgreSQL
Install PostgreSQL on Ubuntu with:
$ sudo apt-get install postgresql
To create users and databases, enter PostgreSQL interactive shell with:
$ sudo -u postgres psql
In the PostgreSQL interactive shell, create a user with:
postgres=# CREATE USER [username] PASSWORD '[password]' CREATEDB;
And then, create a database with:
postgres=# CREATE DATABASE [db_name] OWNER [username] ENCODING 'utf-8';
Leave the interactive shell with \q
:
postgres=# \q
Test whether it works or not:
$ psql -h localhost -U [username] [db_name]
Password for user [username]: [password]
postgres=# \q
Django
Create and enter a Python virtual environment with:
$ mkvirtualenv fts_demo -p /usr/bin/python3
Install django
and psycopg2
package with pip install
:
$ pip install django==2.0 psycopg2==2.7.3.2
Create a project named fts_demo
and an app named fts
:
$ django-admin startproject fts_demo
$ cd fts_demo
$ ./manage.py startapp fts
Open fts_demo/settings.py
and edit the settings:
$ vim fts_demo/settings.py
Add fts
to the end of INSTALLED_APPS
:
INSTALLED_APPS = [
'django.contrib.admin',
# ...
'django.contrib.staticfiles',
'fts', # ADDED
]
Change the default database settings in the DATABASES
variable:
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.postgresql',
'NAME': '[db_name]',
'USER': '[username]',
'PASSWORD': '[password]',
'HOST': '127.0.0.1',
'PORT': '5432',
}
}
SearchVectorField and GinIndex
To perform full-text searches efficiently, a database has to preprocess the data and summerize them as search vectors. Different search techniques may have different search vector structures. In PostgreSQL, a search vector consists of several vocabularies and their locations. For example, in the table below, the inputs on the left column are converted to search vectors on the right column:
Id | Input | Search Vector |
---|---|---|
1 | John is reading a book. | 'book':5 'john':1 'read':3 |
2 | Johnson reads an article. | 'articl':4 'johnson':1 'read':2 |
Because it takes time to convert strings into search vectors, it would be better
to save these search vectors in the database as well. Since Django 1.10, you
may add SearchVectorField
to a model and save the search vector to this
column. It will be converted to tsvector
, which is a PostgreSQL
built-in text search type.
A search vector column summerizes a row. However, there are many rows in a
database table, thus indexes should be created so that the database can
select matching rows efficiently. For example, a naive index implementation may
map words to row IDs, so that the database can easily answer 1
and
2
when a user searches for read
:
Words | Rows |
---|---|
articl | 2 |
book | 1 |
john | 1 |
johnson | 2 |
read | 1, 2 |
In the real world, the structures of database indexes are much more complex.
PostgreSQL provides GinIndex and GistIndex. GinIndex is based on
Generalized Inverted Index (GIN) and GistIndex is based on
Generalized Search Tree (GiST). They have
different trade-offs and performance characteristics. Django supports both
of them and their usages are similar. For the sake of brevity, GinIndex
is chosen for our demo app.
Define a Model
In our demo app, we would like to save several articles in the database. Each article consists of a headline and some content. And we would like to search these articles by some keywords.
Let's open fts/models.py
and add a Article
model:
from django.db import models
from django.contrib.postgres.search import SearchVectorField
from django.contrib.postgres.indexes import GinIndex
class Article(models.Model):
headline = models.TextField()
content = models.TextField()
search_vector = SearchVectorField(null=True)
class Meta(object):
indexes = [GinIndex(fields=['search_vector'])]
Then create a migration for this model:
$ ./manage.py makemigrations fts
And migrate the database with:
$ ./manage.py migrate fts
As described earlier, the search_vector
field is created to hold the
preprocessed search vectors. The search_vector
field has to be nullable
(null=True
) because due to some limitation SearchVectorField
has
to be updated after an object is created.
In addition, a GinIndex
is created for the search_vector
field
to enable efficient search queries. It specified in the Meta
class.
Update Search Vectors
Before performing any full-text search, you should update search vectors. If you don't update search vectors, no results will be found.
For example, if you run the following code in ./manage.py shell
:
from fts.models import Article
Article.objects.create(
headline='Progressive tesne', content='John is reading a book.')
Article.objects.create(
headline='Present tense', content='Johnson reads an article.')
Article.objects.create(
headline='Article for piano', content='Bob plays a piano.')
Article.objects.filter(search_vector='book').values_list('content')
Then, you will see this output (no results):
<QuerySet []>
To update the search vectors, run the code below:
from django.contrib.postgres.search import SearchVector
Article.objects.update(search_vector=SearchVector('headline', 'content'))
The arguments of SearchVector()
are the names of the columns that
contain the input strings. The search vectors will be derived from these
columns. In our example, we would like to derive search vectors the from
headline and content columns.
Now, the full-text search should work:
>>> Article.objects.filter(search_vector='book').values_list('content')
<QuerySet [('John is reading a book.',)]>
Note
SearchVectorField
must be updated after an object is created
because SearchVector()
is an F() expression,
which refers another column. It is not available when you are creating a
new object.
Full-text Search Queries
To search for a keyword, add filter(search_vector=query)
to the database
query. filter()
will select the relevant objects that match with the
query.
For example, the example below searches for the objects with book
:
>>> Article.objects.filter(search_vector='book').values_list('content')
<QuerySet [('John is reading a book.',)]>
The example below shows that prefixes don't match (i.e. book
in the
content
column doesn't match with the query keyword boo
):
>>> Article.objects.filter(search_vector='boo').values_list('content')
<QuerySet []>
The example below shows that John
and Johnson
are different:
>>> Article.objects.filter(search_vector='john').values_list('content')
<QuerySet [('John is reading a book.',)]>
The example below shows how stemming works. A query with the keyword
read
matches both reading
and reads
:
>>> Article.objects.filter(search_vector='read').values_list('content')
<QuerySet [
('John is reading a book.',),
('Johnson reads an article.',)
]>
Using SearchQuery
More advanced queries can be built with SearchQuery:
from django.contrib.postgres.search import SearchQuery
First of all, wrapping a string with SearchQuery()
is equal to the
queries mentioned earlier:
>>> query = SearchQuery('book')
>>> Article.objects.filter(search_vector=query).values_list('content')
<QuerySet [('John is reading a book.',)]>
Two SearchQuery()
instances can be combined with an OR operator
(|
). The example below matches all articles with either read
or piano
:
>>> query = SearchQuery('read') | SearchQuery('piano')
>>> Article.objects.filter(search_vector=query).values_list('content')
<QuerySet [
('John is reading a book.',),
('Johnson reads an article.',),
('Bob plays a piano.',)
]>
Two SearchQuery()
instances can be combined with an AND operator
(&
). The example below matches all articles with both read
and article
):
>>> query = SearchQuery('read') & SearchQuery('article')
>>> Article.objects.filter(search_vector=query).values_list('content')
<QuerySet [('Johnson reads an article.',)]>
A SearchQuery()
instance can be qualified by a NOT operator
(~
). The example below matches all articles without johnson
:
>>> query = ~SearchQuery('johnson')
>>> Article.objects.filter(search_vector=query).values_list('content')
<QuerySet [('John is reading a book.',), ('Bob plays a piano.',)]>
Of course, operators can be combined together. For example, the example below
matches all articles with read
but without johnson
:
>>> query = SearchQuery('read') & ~SearchQuery('johnson')
>>> Article.objects.filter(search_vector=query).values_list('content')
<QuerySet [('John is reading a book.',)]>
Finally, after testing, you may clear the database by rolling back the migrations:
$ ./manage.py migrate fts zero
Update Search Vectors with a Trigger
A trigger is a database hook that calls the registered procedure when the
specified events occur. An application of triggers is to keep the database in a
consistent state. For example, one may create a trigger to restore the
invariant when a query invalidates the database invariant. If the trigger fails
to restore the invariant, the query will fail and the database will remain
unchanged. In our demo app, we would like to create a trigger which always
keeps search_vector
in sync with title
and content
.
This is the SQL syntax to create a trigger:
CREATE TRIGGER [trigger_name]
[BEFORE | AFTER] [INSERT | UPDATE | UPDATE OF column| DELETE]
ON [table_name]
FOR EACH [ROW | STATEMENT]
EXECUTE PROCEDURE [procedure];
A trigger may listen to following events:
INSERT
-- A new row is inserted into the table.UPDATE
-- A row in the table is updated.UPDATE OF column1, column2
-- The value ofcolumn1
orcolumn2
of a row in the table is updated.DELETE
-- A row is deleted from a table.
A trigger may listen to multiple events. To listen to multiple events, combine
the events with OR
operators. For example, a trigger with
INSERT OR UPDATE
listens to both INSERT
and UPDATE
events.
A trigger may call the procedure BEFORE
or AFTER
the
event is effective. A trigger that runs before the insert event won't find
the to-be-inserted row in the table. Similarly, a trigger that runs
after the delete event will not find the deleted row in the table.
A SQL statement may insert, update, or delete multiple rows. If a trigger
specifies FOR EACH STATEMENT
, then the callback procedure is called only
once per query. If a trigger specifies FOR EACH ROW
, then the callback
procedure is called for each rows.
tsvector_update_trigger()
is a PostgreSQL built-in function that reads
the strings from the specified columns and write the computed search vector to
the destination search vector column:
tsvector_update_trigger(tsvector, lang_config, column[, ...])
The first argument is the destination column to save the computed search vector. The second argument is the language of the input strings. The rest of the arguments are the source columns from which the search vector is derived.
For example, the code below collects the strings from the headline
and
content
, treat them as English, convert them into a search vector, and
save the result to the search_vector
column:
tsvector_update_trigger(
search_vector, 'pg_catalog.english', headline, content)
Combining the knowledge above, the SQL statement below will create a trigger that updates the search vector:
CREATE TRIGGER article_update_trigger
BEFORE INSERT OR UPDATE OF headline, content, search_vector
ON fts_article
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(
search_vector, 'pg_catalog.english', headline, content);
This trigger will run tsvector_update_trigger()
when a row is inserted
into fts_article
or when the headline
, content
, or
search_vector
column of a row is updated.
Manage Triggers with Django Migrations
To integrate the SQL statement into our Django app, let's create an empty migration:
$ ./manage.py makemigrations fts -n create_trigger --empty
And then, open fts/migrations/0002_create_trigger.py
and add a
RunSQL
operation:
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('fts', '0001_initial'),
]
operations = [
migrations.RunSQL(
sql='''
CREATE TRIGGER article_update_trigger
BEFORE INSERT OR UPDATE OF headline, content, search_vector
ON fts_article
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(
search_vector, 'pg_catalog.english', headline, content);
UPDATE fts_article SET search_vector = NULL;
''',
reverse_sql='''
DROP TRIGGER IF EXISTS article_update_trigger
ON fts_article;
'''),
]
The forward migration SQL statement (sql
) creates a trigger named
article_update_trigger
and updates all existing search vectors by
triggering the article_update_trigger
.
The backward migration SQL statement simply drops the
article_update_trigger
.
Run the command below to test the forward migration:
$ ./manage.py migrate fts 0002_create_trigger
Operations to perform:
Target specific migration: 0002_create_trigger, from fts
Running migrations:
Applying fts.0002_create_trigger... OK
And run the command below to test the rollback migration:
$ ./manage.py migrate fts 0001_initial
Operations to perform:
Target specific migration: 0001_initial, from fts
Running migrations:
Rendering model states... DONE
Unapplying fts.0002_create_trigger... OK
Finally, run the forward migration again so that you can run some test code later:
$ ./manage.py migrate fts
Now, run the following code in ./manage.py shell
:
from fts.models import Article
Article.objects.create(
headline='Progressive tesne', content='John is reading a book.')
Article.objects.create(
headline='Present tense', content='Johnson reads an article.')
Article.objects.create(
headline='Article for piano', content='Bob plays a piano.')
Article.objects.filter(search_vector='book').values_list('content')
The output will be:
<QuerySet [('John is reading a book.',)]>
The trigger will update the search vector column whenever a row is inserted or updated, thus you don't have to update the search vector manually anymore. Comparing to the search results in the previous section, the search results shall remain unchanged. You may run the search queries in the Full-text Search Queries section and verify it.
Rank and Order
The filter()
function mentioned above only selects the matching objects.
It would be helpful to sort the matching objects by relevance. For example,
objects with more matching terms should be displayed before objects with less
matching terms.
In Django, you can annotate a query with SearchRank()
and then sort the
matching objects by the value in the annotated field:
from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F
query = SearchQuery('hello')
rank = SearchRank(F('search_vector'), query)
Article.objects.annotate(rank=rank) \
.filter(search_vector=query).order_by('-rank') \
.values_list('id', 'content', 'rank')
To compute the rank of each objects, the SearchRank
class takes three
arguments:
- The first argument should be a search vector field. It can be either a SearchVector instance or an F() expression. A SearchVector instance computes search vectors from other columns on-the-fly. On the other hand, an F() expression refers an existing search vector field in a model.
- The second argument should be a search query.
After creating a SearchRank
instance, the function
annotate(rank=rank)
is called to create a temporary field named
rank
which contains the ranks computed by SearchRank
. It can
be further sorted by the order_by('-rank')
function.
To test the code above, let's create some test data in
./manage.py shell
:
from fts.models import Article
Article.objects.create(headline='x4', content='hello hello hello hello')
Article.objects.create(headline='x1', content='hello')
Article.objects.create(headline='x3', content='hello hello hello')
Article.objects.create(headline='x2', content='hello hello')
The code below shows how to sort matching objects by relevance:
query = SearchQuery('hello')
rank = SearchRank(F('search_vector'), query)
Article.objects.annotate(rank=rank) \
.filter(search_vector=query).order_by('-rank') \
.values_list('id', 'content', 'rank')
The output looks like:
<QuerySet [
(1, 'hello hello hello hello', 0.0865452),
(3, 'hello hello hello', 0.0827456),
(4, 'hello hello', 0.0759909),
(2, 'hello', 0.0607927)
]>
That's all. In the next section, we will add weights to different columns.
Add Weights to Search Vectors
So far, all columns are equal. An occurrance in one column is equal to an occurrance from another column. However, in the real world, the importance of columns are unequal. An occurrance in one column may outweigh occurrances from other columns. In our demo app, headline is much more important than content, thus a heavier weight should be given to headline.
Add Weights in Django
To update the search vector manually, rollback the migration which adds the trigger:
$ ./manage.py migrate fts 0001_initial
Run the code below in ./manage.py shell
:
from django.contrib.postgres.search import SearchVector
sv = SearchVector('headline', weight='A') + \
SearchVector('content', weight='B')
Article.objects.update(search_vector=sv)
And then, list the updated search vectors with:
>>> Article.objects.all().values_list('id', 'search_vector')
<QuerySet [
(1, "'book':7B 'john':3B 'progress':1A 'read':5B 'tesn':2A"),
(2, "'articl':6B 'johnson':3B 'present':1A 'read':4B 'tens':2A"),
(3, "'articl':1A 'bob':4B 'piano':3A,7B 'play':5B")
]>
The biggest difference is that the weights are appended to each locations. For
example, articl
in the second row is 6B
and articl
in
the third row is 1A
. Similarly, piano
in the third row includes
both 3A
and 7B
.
With these weights, the third row will have higher rank in the following query:
from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F
query = SearchQuery('article')
rank = SearchRank(F('search_vector'), query)
Article.objects.annotate(rank=rank) \
.filter(search_vector=query).order_by('-rank')\
.values_list('id', 'rank')
The output becomes:
<QuerySet [(3, 0.607927), (2, 0.243171)]>
Note
Before adding weights, both rows have the same rank 0.0607927
.
Add Weights in Triggers
It is much more difficult to add weights in triggers because
tsvector_update_trigger()
does not support weights. To build search
vectors with weights, you have to write a PostgreSQL function. Writing
functions for triggers is not an easy task. We will only cover some topics as
needed. Please read the PostgreSQL manual for further information.
First of all, this is the SQL syntax to create a PostgreSQL trigger function:
CREATE [OR REPLACE] FUNCTION function_name()
RETURNS TRIGGER
LANGUAGE plpgsql AS $$
BEGIN
-- ... code ...
RETURN NEW;
END;
$$;
This SQL statement creates a function named function_name
, without
arguments, and returns a trigger. The LANGUAGE
clause specifies that
the code in the quotation $$
is PL/pgSQL, which is a SQL extension
defined by PostgreSQL.
NEW
and OLD
are special global variables if a trigger has the
FOR EACH ROW
clause. NEW
refers the row which will be inserted
into the table by an insert query or the new row which will replace the old row
in an update query. On the other hand, OLD
refers the old row which
will be replaced by an update query or the row which will be deleted from the
table by a delete query.
If the procedure is triggered by an insert query or an update query, then
the procedure should return NEW
. If the procedure is triggered by a
delete query, then the procedure should return OLD
. It is fine to
return NULL
but you will get different results. In our demo app, we
don't want such behavior, thus we won't return NULL
.
To compute search vectors, three functions are required:
to_tsvector([lang_config, ]string)
-- This function converts a string into a search vector. An optional language configuration may be given. The search vector can be concatenated with||
operators.setweight(tsvector, weight)
-- This function adds weights to a search vector. The weight may beA
,B
,C
, orD
.coalesce(value, alt)
-- This function returnsalt
ifvalue
isNULL
. This function is important becauseto_tsvector()
will returnNULL
if the input string isNULL
.
To save the computed search vector into a column, SELECT ... INTO ...
statement may be used. It computes the expression after SELECT
and save
the result to the column specified after INTO
.
Combining the knowledge above, update_article_search_vector()
can be
created with the SQL statement below:
CREATE OR REPLACE FUNCTION update_article_search_vector()
RETURNS TRIGGER
LANGUAGE plpgsql AS $$
BEGIN
SELECT
setweight(to_tsvector(coalesce(NEW.headline, '')), 'A') ||
setweight(to_tsvector(coalesce(NEW.content, '')), 'B')
INTO NEW.search_vector;
RETURN NEW;
END;
$$;
And the article_update_trigger
should call
update_article_search_vector()
instead:
CREATE TRIGGER article_update_trigger
BEFORE INSERT OR UPDATE OF headline, content, search_vector
ON fts_article
FOR EACH ROW EXECUTE PROCEDURE update_article_search_vector();
To integrate these SQL statements into our demo app, create an empty migration with:
$ ./manage.py makemigrations fts -n add_weights --empty
Then open fts/migrations/0003_add_weights.py
and add the code below:
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('fts', '0002_create_trigger'),
]
operations = [
migrations.RunSQL(
sql='''
DROP TRIGGER article_update_trigger ON fts_article;
CREATE OR REPLACE FUNCTION update_article_search_vector()
RETURNS TRIGGER
LANGUAGE plpgsql AS $$
BEGIN
SELECT
setweight(to_tsvector(
coalesce(NEW.headline, '')), 'A') ||
setweight(to_tsvector(
coalesce(NEW.content, '')), 'B')
INTO NEW.search_vector;
RETURN NEW;
END;
$$;
CREATE TRIGGER article_update_trigger
BEFORE INSERT OR UPDATE OF headline, content, search_vector
ON fts_article
FOR EACH ROW
EXECUTE PROCEDURE update_article_search_vector();
UPDATE fts_article SET search_vector = NULL;
''',
reverse_sql='''
DROP TRIGGER article_update_trigger ON fts_article;
DROP FUNCTION update_article_search_vector();
CREATE TRIGGER article_update_trigger
BEFORE INSERT OR UPDATE OF headline, content, search_vector
ON fts_article
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(
search_vector, 'pg_catalog.english', headline, content);
UPDATE fts_article SET search_vector = NULL;
'''),
]
The forward migration drops the existing article_update_trigger
, creates
a new function named update_article_search_vector()
, create a new
article_update_trigger
which calls
update_article_search_vector()
, and then update all search vectors.
Finally, you may migrate the database:
$ ./manage.py migrate fts
And run the following code in ./manage.py shell
to test whether the
trigger works:
from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F
query = SearchQuery('article')
rank = SearchRank(F('search_vector'), query)
Article.objects.annotate(rank=rank) \
.filter(search_vector=query).order_by('-rank')\
.values_list('id', 'rank')
The output must be:
<QuerySet [(3, 0.607927), (2, 0.243171)]>
Map to Numeric Weights
Weights in search vectors are represented by alphabets A
, B
,
C
, and D
. This representation gives the flexibility to map them
into different numeric weights. For example, some users may think headlines
are more important than contents but other users may think conversely. An
option may be provided to users so that users can pick their preferences.
A weights
argument may be passed to SearchRank()
. The
weights
argument should be a list with 4 floating point numbers ranging
from 0.0 to 1.0. The floating point numbers stand for the numeric weight for
D
, C
, B
, and A
respectively. If the
weights
is not specified, then the default values are 0.1
,
0.2
, 0.4
, and 1.0
respectively.
For example, if headlines are more important than contents, then you may use the default weights:
from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F
query = SearchQuery('article')
rank = SearchRank(F('search_vector'), query,
weights=[0.1, 0.2, 0.4, 1.0])
Article.objects.annotate(rank=rank) \
.filter(search_vector=query).order_by('-rank')\
.values_list('id', 'rank')
And the output will show that row 3
comes before row 2
:
<QuerySet [(3, 0.607927), (2, 0.243171)]>
Conversely, if contents are more important than headlines, then you may swap the third and fourth weight:
query = SearchQuery('article')
rank = SearchRank(F('search_vector'), query,
weights=[0.1, 0.2, 1.0, 0.4]) # SWAPPED
Article.objects.annotate(rank=rank) \
.filter(search_vector=query).order_by('-rank')\
.values_list('id', 'rank')
And the output will show that row 2
comes before row 3
:
<QuerySet [(2, 0.607927), (3, 0.243171)]>
Epilogue
In this post, I started from explaining the purpose of SearchVectorField and GinIndex. Next, I discussed how to update SearchVectorField and make queries with SearchQuery. Then, I went through more advanced topics, such as updating search vectors with PostgreSQL triggers, sorting objects by relevance, and adding weights. Finally, I ended up with a technique to derive search vectors from related objects. I believe this article covered all common use cases. All of the code can be found in the GitHub repository loganchien/django-fts-demo.
This is a long article. I hope this article is helpful and enjoyable. Feel free to write me e-mails if you have any comments.
References
- Django 2.0 Documentation, Migrations
- Django 2.0 Documentation, PostgreSQL specific features, Full text search
- PostgreSQL 9.6 Manual, Chapter 12. Full Text Search
- PostgreSQL 9.6 Manual, Chapter 41. PL/pgSQL -- SQL Procedural Language
- PostgreSQL 9.6 Manual, Reference, SQL Commands, CREATE TRIGGER
- Postgres Full-Text Search With Django