Knowledge

10 Jan 2024

Full-text search engine with PostgreSQL

#databases #nlp #postgresql

PostgreSQL contains a series of mechanisms that can be used to build a high-speed, easily maintainable, and customizable full-text search engine. Such a solution should be relevant for anyone in need of such an engine, and especially for those who are already using Postgres in their system. It is also a great example of the wealth of features that make PostgreSQL much more than a traditional relational database management system.

In this article, we present an overview of the advantages of this approach, and help you build an understanding of how it works in practice by analyzing the most interesting implementation details.

Table of contents

  • I. Introduction
  • II. Trigram-based retrieval
  • III. NLP-enriched text search
  • IV. Conclusion

I. Introduction

(back to ToC ↑)

Semi-formally defined, full-text search is the process of finding text documents that match a user-provided query. It is easy to see that such engines have a multitude of applications - from advanced searching in e-commerce, through interactive retrieval of posts in blogs, all the way up to responsive fuzzy navigation in web applications.

Consequently, a plethora of solutions for this problem have been developed, both open-source and commercial. However, having a separate, dedicated system to solve a specific problem always comes with headaches. Most importantly, your data has to be ingested. For this, you will either have to rely on your tool's ability to directly query your data sources (which will probably introduce another bit of maintenance overhead), or you will have to build an internal ETL pipeline. In addition, if you go the open-source, self-hosted way, you will have to make your way through the initial setup (which will probably be non-trivial), and later on maintain the extra infrastructure. If you rely on a commercial solution, you will have to mind the costs. They will probably be pretty inviting at first, but will most likely skyrocket once you're properly locked into the given vendor's services.

Quite naturally, all of the aforementioned drawbacks sound pretty typical - if you want to enrich your platform with new features, you have to pay for them somehow. However, keeping your architecture simple facilitates future improvements. One of the ways to avoid unnecessary complexity is a continuous, but reasonable, reuse of the existing parts of your tech stack.

This is why PostgreSQL is a relevant contender in the text-searching world. Mostly known as a relational database, but also a true software engineering powerhouse, it is full of useful features. One of these features happens to be native, production-ready, and extremely efficient support for text searching and natural language processing.

An assumption that the source data for the search engine lives in the main relational database is obviously heavily application-dependent, but it is definitely not unreasonable. And if that relational database happens to be PostgreSQL (which is the case for ~44% of systems, according to developer surveys), you get a chance to implement a proper text search engine with minimal effort.

Obviously, you might wonder whether such an engine could be an actual solution to your searching needs, or whether it is just a fun challenge for devoted PostgreSQL enthusiasts. Here at Fibertide, we have extensive experience using this approach in practice, and we are convinced that with all the tools PostgreSQL puts at your disposal, the final engine can easily:

  • be incredibly fast. The order of magnitude we are talking about here is searches over millions of documents taking single-digit milliseconds, which makes it perfectly suitable for handling big data;
  • provide highly relevant results, tailored to your business logic.

II. Trigram-based retrieval

(back to ToC ↑)

To get a clear view of the historical context and understand the limitations we’re trying to combat, we’ll first analyze the more basic approaches for text-searching in relational databases. If you are already familiar with basic text comparison operators and trigram-based searching in PostgreSQL, feel free to skip to the next section, where we utilize more advanced mechanisms to overcome their inherent shortcomings.

Every database has some kind of textual comparison operators - in Postgres' case, those are ~, ~*, LIKE and ILIKE. Unfortunately, most of the time, they have no support for indexing. In the worst case, this results in the need to process every stored document on every search. And as we're aiming to build solutions capable of handling big data (so at least millions of documents), our database would run out of breath pretty quickly. In addition, these operators have no support for result ranking - which might render them useless in a wide variety of use-cases where we aim to obtain high-quality matches, rather than just any matches.
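As a quick illustration - these operators are trivial to use, but on a real table a pattern like this generally forces a sequential scan over every row:

postgres=# select 'Fibertide loves PostgreSQL' ilike '%postgres%' as match;
 match
-------
 t
(1 row)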

Luckily, this does not mean that text searching in Postgres has to stop at these basic operators.

PostgreSQL lets you implement a more effective, yet still simple, text search by exposing a series of utilities for working on trigrams (packaged in the pg_trgm extension). A trigram is a sequence of 3 consecutive written units - in Postgres' case, we assume this unit to be a text character. For instance, here is the set of trigrams of the word “Fibertide”:

postgres=# select unnest(show_trgm('Fibertide')) as trigrams order by trigrams;
 trigrams
----------
 ber
 de
 ert
   f
  fi
 fib
 ibe
 ide
 rti
 tid
(10 rows)

As you can see, Postgres has automatically padded the word with spaces. This marks word boundaries, which makes it easier to locate the word in the middle of larger documents.

In practice, it turns out that the count of shared trigrams is an effective gauge of string similarity. For two strings, we can define their similarity as the number of their shared trigrams, normalized by the total number of their unique trigrams. PostgreSQL contains a utility function for this measure. For two similar strings, it returns a non-zero value:

postgres=# select similarity('Fibertide', 'fibers of the tides');
 similarity
------------
  0.3181818
(1 row)

But for two strings that have nothing to do with each other, it returns zero:

postgres=# select similarity('Fibertide', 'production outages');
 similarity
------------
          0
(1 row)

PostgreSQL defines two operators on top of this function:

  • % - a boolean operator returning true if the similarity of the strings exceeds a predefined threshold (globally configurable via pg_trgm.similarity_threshold, 0.3 by default);
  • <-> - the distance between the strings (defined as 1 - similarity).
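Here is a quick demonstration - a minimal sketch assuming the default threshold of 0.3 (recall that the similarity of these two strings was ~0.32, just above it):

postgres=# show pg_trgm.similarity_threshold;
 pg_trgm.similarity_threshold
------------------------------
 0.3
(1 row)

postgres=# select 'Fibertide' % 'fibers of the tides' as similar,
                  'Fibertide' <-> 'fibers of the tides' as distance;
 similar | distance
---------+-----------
 t       | 0.6818182
(1 row)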

But the most crucial feature of the trigram-based search is the indexing support. There exist index types that allow an efficient reverse lookup of words containing a given trigram. These index types are GIN and GiST - we'll analyze them more closely later.

With a careful combination of those indexes, trigram-based queries can run efficiently even on large datasets. However, trigrams can only support exact or fuzzy matching. This can be useful when querying strings that are not always easily human-readable, such as product SKUs, filenames or social media handles. It can also be utilized when it is reasonable to assume that the user might easily remember the exact name of the searched entity - such as a movie title.
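To make this concrete, here is a minimal sketch of an indexed fuzzy search. The products table and its name column are hypothetical stand-ins for your own schema:

postgres=# CREATE EXTENSION IF NOT EXISTS pg_trgm;
postgres=# CREATE INDEX products_name_trgm_idx ON products USING gist (name gist_trgm_ops);
postgres=# SELECT name, similarity(name, 'fibertide') AS sml
FROM products
WHERE name % 'fibertide'          -- trigram similarity filter, served by the index
ORDER BY name <-> 'fibertide'     -- nearest-neighbour ordering, which GiST can serve directly
LIMIT 10;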

The idea of leveraging trigrams in a relational database is not exclusive to Postgres. For instance, MySQL supports text-searching using n-grams (n-grams are a generalization of trigrams with a parametrized token length).

III. NLP-enriched text search

Why?

(back to ToC ↑)

In reality, n-grams are not enough. Exact matching is not human-like. Human speech, irrespective of language, is usually a mashup of convoluted rules, plagued with exceptions. This means that organic, exact matches are rare. Consider the common case of data retrieval for sentiment analysis. If we look for “love”, we probably want to see “loved”, “lovable”, “lovely” and “beloved” as well. Obviously, we can achieve this by issuing a separate query for every form, but maintainability, performance, and explainability will start to suffer very quickly. Most large-volume text datasets containing business-relevant information, such as blog posts, consumer reviews or product descriptions, consist of heavily “natural” speech.

In addition, the ranking capabilities of the trigram approach are pretty lackluster. It can fall victim to false positives (“automation” and “tomato” share trigrams while not being strongly related), and there is no out-of-the-box support for customizable weighting. When querying a product list, we probably want to favor matches in the product name over ones in the product description, without completely neglecting the latter.

How?

(back to ToC ↑)

Luckily, the Postgres developers have realized that too, and we can easily extend our queries with basic NLP (natural language processing) without leaving our DBMS, and without losing any of the remarkable speed.

The NLP support in PostgreSQL is based on the notion of a lexeme. Formally, a lexeme is the most basic abstract unit of meaning. In practice, we want to map every word to a “representative” (which, in Postgres' convention, is called a lexeme). We obviously want words conveying similar meaning to be assigned the same representative.

In practice, this boils down to reversing a word's inflection, or substituting a more general synonym. Here is a quick example:

postgres=# \t
Tuples only is on.

postgres=# select ts_lexize('english_stem', 'chlorinated');
 {chlorin}

Stopwords (words that are commonly used, but considered semantically insignificant in natural language processing) are assumed not to convey any meaning at all, so they have no corresponding lexemes:

postgres=# select ts_lexize('english_stem', stopword)
from unnest(
        array['the', 'themselves', 'a', 'should', 'can', 'both', 'further', 'more']
) as stopword;

 {}
 {}
 {}
 {}
 {}
 {}
 {}
 {}

“But how can Postgres do that?”, one may ask. PostgreSQL actually ships with stemming rules for whole languages. At the time of writing this article (and Postgres version 16.1), a raw, default distribution of Postgres contains 29 such text search configurations (28 language-specific ones, plus the simple configuration, which performs no stemming):

postgres=# \dF
               List of text search configurations
   Schema   |    Name    |              Description
------------+------------+---------------------------------------
 pg_catalog | arabic     | configuration for arabic language
 pg_catalog | armenian   | configuration for armenian language
 pg_catalog | basque     | configuration for basque language
 pg_catalog | catalan    | configuration for catalan language
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | greek      | configuration for greek language
 pg_catalog | hindi      | configuration for hindi language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | indonesian | configuration for indonesian language
 pg_catalog | irish      | configuration for irish language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | lithuanian | configuration for lithuanian language
 pg_catalog | nepali     | configuration for nepali language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | serbian    | configuration for serbian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | tamil      | configuration for tamil language
 pg_catalog | turkish    | configuration for turkish language
 pg_catalog | yiddish    | configuration for yiddish language
(29 rows)

You can add your own languages, but the existing list is pretty impressive on its own.

With such a tool at our disposal, it is straightforward to see how a trigram-based search can be significantly enhanced - we could use lexemes to generate the underlying trigrams.

However, the relative positions of words are also relevant in natural speech. To incorporate that observation, Postgres introduces a dedicated datatype: tsvector. A tsvector is a sorted map of lexemes to their absolute positions in the text (measured in words):

postgres=# select to_tsvector(
        'english',
        'fibertide - cloud optimization, cloud computing, research software optimization'
);

'cloud':2,4 'comput':5 'fibertid':1 'optim':3,8 'research':6 'softwar':7

Initializing such a structure is not computationally expensive - a linear parse with a series of constant-time lookups is all it really needs. An interesting property of tsvectors is that their concatenation is not commutative - the operation carries the semantics of text concatenation, so the positions in the right-hand operand are shifted:

postgres=# select to_tsvector(
        'english',
        'fibertide - cloud optimization, cloud computing, research software optimization'
) || to_tsvector('english', 'fibertide - machine learning operations');

'cloud':2,4 'comput':5 'fibertid':1,9 'learn':11 'machin':10 'oper':12 'optim':3,8 'research':6 'softwar':7

Having such a datatype creates a lot of opportunities for querying. To capture this abundance of possibilities, Postgres provides an analogous datatype on the query side - tsquery.

A query can be a single word, but we can use operators to build more complex query trees. Currently, the following operators are supported by tsquery:

  • AND (&) - query A & B will match documents that satisfy both queries A and B;
  • OR (|) - query A | B will match documents that satisfy either query A or query B;
  • NOT (!) - query !A will match documents that do not satisfy the query A;
  • FOLLOWED BY (<->) - query A <-> B will match documents where a match of query B is found right after a match of query A. The distance between those matches can also be customized: for instance, the query A <5> B will look for matches that are exactly 5 words apart.

During the creation of tsqueries, words are also automatically translated into lexemes:

postgres=# select to_tsquery('english', 'fibertide & (cloud | sciops)');

'fibertid' & ( 'cloud' | 'sciop' )

This touch increases the number of relevant matches.

To match a query against a tsvector, we can use the @@ operator. Here are a couple of examples:

postgres=# select to_tsvector('english',
	'fibertide - cloud optimization, cloud computing, research software optimization'
) @@ to_tsquery('english', 'software & fibertide');

 t
postgres=# select to_tsvector('english',
	'fibertide - cloud optimization, cloud computing, research software optimization'
) @@ to_tsquery('english', 'software & bad');

 f
postgres=# select to_tsvector(
	'english',
	'fibertide - cloud optimization, cloud computing, research software optimization')
@@ to_tsquery('english', 'fibertide <5> research');

 t
postgres=# select to_tsvector(
	'english',
	'fibertide - cloud optimization, cloud computing, research software optimization')
@@ to_tsquery('english', 'fibertide <3> research');

 f

In addition, this query mechanism can boast high levels of explainability. The ts_headline utility function can be used to obtain a basic highlight of the matching parts of the document:

postgres=# SELECT ts_headline('english',
	'fibertide - cloud optimization, cloud computing, research software optimization',
	to_tsquery('english', 'cloud')
);
                                          ts_headline
-----------------------------------------------------------------------------------------------
 fibertide - <b>cloud</b> optimization, <b>cloud</b> computing, research software optimization
(1 row)
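The markup is customizable, too - ts_headline accepts an options string (StartSel and StopSel are standard options; the bracket markers below are just an arbitrary choice for this sketch):

postgres=# SELECT ts_headline('english',
	'fibertide - cloud optimization, cloud computing, research software optimization',
	to_tsquery('english', 'cloud'),
	'StartSel=[, StopSel=]'
);
                                      ts_headline
--------------------------------------------------------------------------------------
 fibertide - [cloud] optimization, [cloud] computing, research software optimization
(1 row)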

Speed

(back to ToC ↑)

The common theme of the solutions we're considering is efficiency at scale. tsvectors surely have a lot of potential when it comes to querying, but how fast are they? The answer is that they are extremely fast - thanks to indexing.

There are two index types that can be utilized to boost performance on the text columns that are frequently searched over:

  • GIN (Generalized Inverted Index) - a basic reverse lookup of the lexemes in the rows. It maintains a map from lexemes to lists of pointers to the rows containing their occurrences. A quick lookup can significantly reduce the number of candidates. Slower to update and generate, but ensures faster query times.
  • GiST (Generalized Search Tree) - also implements a reverse lookup, but hashes the lexemes beforehand. This ensures a constant memory usage of the index and faster update times, at the cost of possible false positives. The query engine can only rule out false positives by loading the data of the candidate rows from disk.
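Creating such an index is a one-liner. A minimal sketch, assuming the reviews table (with a tsvector column named vector) that also appears in the query plan below:

postgres=# CREATE INDEX vector_search_idx ON reviews USING gin (vector);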

It is easy to see that such lookups turn basic queries into (almost) constant-time operations. For instance, here's an analysis of a basic query run (on a consumer-grade laptop) against a table with 10 million rows, of which only 10 match:

postgres=# explain analyze select reviewer from reviews where vector @@ to_tsquery('english', 'awesome');
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on reviews  (cost=366.64..132620.87 rows=50003 width=52) (actual time=0.032..0.038 rows=10 loops=1)
   Recheck Cond: (vector @@ '''awesom'''::tsquery)
   Heap Blocks: exact=2
   ->  Bitmap Index Scan on vector_search_idx  (cost=0.00..354.14 rows=50003 width=0) (actual time=0.023..0.023 rows=10 loops=1)
         Index Cond: (vector @@ '''awesom'''::tsquery)
 Planning Time: 0.147 ms
 Execution Time: 6.430 ms
(7 rows)

Crucially, in the case of more complicated queries, these indexes open up a lot of possibilities for the query optimizer.

Consider the query red & blue. We can utilize the index to obtain the sets of rows that contain each of the words, and then compute their intersection. In the case of the GIN index, this result is ready to return. In the case of the GiST index, the database engine needs to load the remaining candidates and verify them for false matches. However, on realistic data, this might still significantly decrease the number of disk reads.

The set intersection approach is only one way to speed up the query red & blue. Another possibility is to reverse look-up the rows containing the word red, and then check their tsvectors for the presence of blue. In the general case, we risk an increased number of disk reads, but there are situations where the query planner might deduce that this way will be quicker.

This kind of optimization is what makes PostgreSQL's text searching so performant - capable of handling millions of documents in milliseconds.

Maintenance

(back to ToC ↑)

There is a plethora of functions that simplify the creation of tsqueries:

  • to_tsquery allows the usage of a shorthand syntax for the querying operators. This also means that the query text has to be parsed to ensure correctness. For instance, we can very concisely say to_tsquery('english', 'fibertide & (software | computation)');
  • plainto_tsquery treats the input as a set of words and concatenates them with the & operator (see the sketch after this list);
  • phraseto_tsquery preserves the original distance between words to try and maintain the meaning as faithfully as possible. For instance:
postgres=# select phraseto_tsquery('english', 'fibertide can help you utilize the cloud');
               phraseto_tsquery
----------------------------------------------
 'fibertid' <2> 'help' <2> 'util' <2> 'cloud'
(1 row)
  • websearch_to_tsquery supports syntax typical for web searching. For instance, quotes are given a special meaning, and so are or and -.
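Here is a quick sketch of how plainto_tsquery and websearch_to_tsquery interpret their input (outputs as produced by Postgres 16):

postgres=# select plainto_tsquery('english', 'fibertide cloud computing');
         plainto_tsquery
---------------------------------
 'fibertid' & 'cloud' & 'comput'
(1 row)

postgres=# select websearch_to_tsquery('english', 'cloud or computing -research');
      websearch_to_tsquery
----------------------------------
 'cloud' | 'comput' & !'research'
(1 row)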

On top of that, there is a number of operators that allow queries to be combined in many different ways.

For tsvectors, the actual creation is handled very well by the to_tsvector function and the concatenation operator (||).

However, to query across a large volume of documents, we need access to the corresponding tsvectors. The most natural implementation is to keep them in a separate, dedicated column. Obviously, with such an approach, we need to update this column on every change to the original document.

In newer versions of Postgres (12 and higher), this can be greatly simplified with generated columns. For example, we can store a tsvector of the document column by using the following column definition:

document_vector tsvector GENERATED ALWAYS AS (to_tsvector('english', document)) STORED

What’s more, we can use this mechanism to concatenate tsvectors of multiple columns:

document_vector tsvector GENERATED ALWAYS AS (to_tsvector('english', document_title) || to_tsvector('english', document)) STORED

For older versions, there are trigger utilities: tsvector_update_trigger and tsvector_update_trigger_column. You can find more information on those in the documentation.
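A minimal sketch of the trigger-based alternative, assuming a hypothetical documents table with document_title and document text columns alongside a document_vector tsvector column:

postgres=# CREATE TRIGGER document_vector_update
BEFORE INSERT OR UPDATE ON documents
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(document_vector, 'pg_catalog.english', document_title, document);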

Ranking

(back to ToC ↑)

In text-searching at scale, the queries will very quickly start returning more results than the consuming application needs or can handle. That is why a ranking mechanism, which will sort the results by relevance to the original query, is usually necessary.

And that is why the flexibility and customizability of the ranking in the ts_*-style querying is another huge advantage.

To get you started, PostgreSQL exposes two functions:

  • ts_rank - a simple frequency measure of the pattern in the document;
  • ts_rank_cd - a “cover density” measure, which, intuitively, tries to establish how tightly the document is covered by the pattern.

These existing ranking functions offer a high level of customizability, which means you can perform some fine-tuning with little effort.

First of all, we can control the normalization of the results. Most notably, we can divide each result by the total length of its document, which lets us focus on relative frequency - for collections of documents of highly varying lengths, the absolute count of a phrase can be unrepresentative.
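The normalization is selected with an integer bitmask passed as the last argument of ts_rank. A brief sketch, reusing the reviews table defined further below - flag 2 divides the rank by the document length, flag 1 by one plus the logarithm of that length:

postgres=# SELECT reviewer, ts_rank(vector, query, 2) AS rank
FROM reviews, to_tsquery('english', 'important') query
ORDER BY rank DESC;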

Moreover, PostgreSQL gives us some flexibility in weighting different matches of the pattern, which is extremely powerful. Typically, you leverage it in a 2-phase process:

  • during the creation of a tsvector, we assign labels to lexemes. We can use 4 labels: A, B, C and D. This is done with the function setweight(tsvector, label), which assigns the provided label to all of the lexemes in the passed tsvector;
  • at query time, we assign numerical values to the labels. Every match will be weighted with the numerical value assigned to the label of the matching lexeme.

Let’s take a quick look at an example. Consider a basic reviews table with the following schema:

postgres=# CREATE TABLE reviews (
	reviewer text, 
	text text,
	vector tsvector GENERATED ALWAYS AS (setweight(to_tsvector('english', reviewer), 'A')
	|| setweight(to_tsvector('english', text), 'B')) STORED
);

We use the generated-column mechanism to create the tsvector instances in this table. Notice how we assign label A to the reviewer, and label B to the review itself.

We can now pass, alongside the query, an array of numerical values that will be used to weight the matches. Note that the array is ordered {D, C, B, A} - so the array below assigns weight 1 to label B and weight 0.01 to label A:

postgres=# \t
Tuples only is off.

postgres=# SELECT reviewer, text, ts_rank('{0, 0, 1, 0.01}', vector, query) AS rank
FROM reviews, to_tsquery('english', 'important') query
ORDER BY rank DESC;
                   reviewer                   |                         text                          |     rank
----------------------------------------------+-------------------------------------------------------+--------------
 crucial customer                             | fibertide is an important partner                     |    0.6079271
 important customer from an important company | fibertide is great, and can only be beat by fibertide | 0.0075990884
 important customer                           | fibertide is awesome                                  |  0.006079271
(3 rows)

In this case, we used this mechanism to heavily promote the matches in the review itself, rather than in the name of the reviewer.

This feature clearly addresses the main shortcoming of the ranking capabilities in the trigram-based search.

However, we need to mention that the ranking operation is expensive, as it can require a full read of the document itself from disk. To mitigate this, we recommend structuring the queries hierarchically - iteratively expand the searched domain if you are unable to find good enough matches in a smaller subset of the most relevant documents, as sketched below.
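A sketch of this staged approach, reusing the reviews table from the ranking example - first look for matches in only one of the labeled sections, and fall back to a full search only if too few rows come back:

postgres=# SELECT reviewer, text FROM reviews
WHERE vector @@ to_tsquery('english', 'important:A')  -- phase 1: only lexemes labeled A
LIMIT 10;

postgres=# SELECT reviewer, text FROM reviews
WHERE vector @@ to_tsquery('english', 'important')    -- phase 2: any label
LIMIT 10;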

Additional features

(back to ToC ↑)

The truth is, we've barely scratched the surface of Postgres' NLP-enriched text searching. Here is a concise list of topics that you might wish to consider for your implementation:

  • prefix-searching. Considering the exact implementation of GIN indexes (B-trees), it is a natural corollary that efficient prefix searching is available (see the example after this list);
  • label-based matching. Not only can we use the labels to weigh the matches, but we can use them to outright filter the results at query level (fibertide:AB will only match lexemes with labels A or B);
  • advanced query manipulation. We can compose queries, replacing parts of an existing query with other queries. For instance, this can help normalize synonymous phrases in more complex queries;
  • document statistics. There are utilities to gather information on lexeme distribution in whole tables;
  • dictionaries. The dictionary mechanisms control the recognition and handling of stop words, synonyms and stemming. There is also a thesaurus dictionary, which allows the creation of custom phrase substitutions;
  • (almost) infinite customizability. It is possible to inject your own logic into pretty much any mechanism mentioned in this article. If that’s not enough, you can also capitalize on the fact that Postgres is open-source, and inject your logic at an even lower level.
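As an example of the prefix searching mentioned above, tsquery supports the :* suffix:

postgres=# select to_tsvector('english', 'fibertide') @@ to_tsquery('english', 'fiber:*') as match;
 match
-------
 t
(1 row)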

IV. Conclusion

(back to ToC ↑)

Simplicity is highly desirable in the modern world of rapid software development. As dedicated PostgreSQL fans, we are delighted to spread awareness of the wealth of its features, which can translate into a noticeable simplification of your architecture.

This time, we've shown how you can build a sophisticated text search engine that is incredibly fast, has reasonable support for natural language processing, and boasts excellent maintainability. This approach can remove the need for another data ingestion pipeline in your system, and spare you the effort of sharing your data with an external vendor. All of these advantages come with little to no compromise on the quality of the engine itself. Surely, use-cases not covered by Postgres do exist, but they are (most likely) so advanced that introducing a new, dedicated system is fully justified.

If you would like to learn about more of PostgreSQL's lesser-known, but production-ready features, stay tuned - we'll be covering them in the future.
