Semantic Matchmaking

This blog post describes the approach adopted in AI4PublicPolicy for querying the database that stores the content of several Data Models. The quality of this step determines how efficiently the matchmaking solution can retrieve similar content, regardless of how that content is structured in each Data Model. For instance, the algorithm is expected to identify units of temperature across several Data Models and to unify duplicate units despite their distinct symbols or representations.

When it comes to search algorithms, vector search emerges as one of the most promising technologies for four key reasons:

Semantic: Users increasingly expect search engines to deliver results according to the meaning of their input, rather than comparing the words or characters themselves. This stands as a great challenge because people often find different ways to ask for the same thing.

Input flexibility: It is sometimes preferable to provide input in formats other than text (e.g., images or even speech).

Context and domain specificity: Users expect relevance to be tightly coupled with the subject at hand, or with the context in which a query is issued.

Precise and unique: The search results should be as accurate as possible. The less time it takes users to find what they are looking for, the better.

To achieve its results, vector search relies on vector similarity. This essentially consists of encoding both the universe of results and the user input with the same Machine Learning model (usually a Deep Learning Neural Network) and then returning the results with the highest similarity score, as represented in the following figure:

As the figure shows, a preparation stage is required before queries can be made:

Firstly, each piece of data must contain the vectorial representation, also known as embedding, of the field(s) that will be used for searching. Producing it requires a Machine Learning model. Identifying the context of the data to be searched is an important part of this step, as it influences the choice of the model, the quality of the embeddings and, consequently, the accuracy of the results.

Once the model is selected, it is fed with the values of the fields we want to be searchable and returns a mathematical representation for each one, i.e., a dense vector. Depending on the model, the generated vectors can have different dimensions (i.e., positions). In general, the higher the vector’s dimensionality, the more information each vector can encode, which tends to make the results more precise. On the other hand, more dimensions require more storage and computational resources and lead to increased processing time.
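To give a rough feel for the resource side of this trade-off, the following sketch estimates the raw storage needed for the embeddings alone. The function name, the item count and the two dimensionalities (384 and 768) are illustrative assumptions, not figures from AI4PublicPolicy, and the estimate assumes each vector position is stored as a 32-bit float.

```python
def embedding_storage_bytes(num_items: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of the embeddings only (assumes 32-bit floats by default)."""
    return num_items * dim * bytes_per_value


# Hypothetical collection of one million searchable fields.
small = embedding_storage_bytes(1_000_000, 384)
large = embedding_storage_bytes(1_000_000, 768)

print(f"384 dimensions: {small / 1e9:.2f} GB")
print(f"768 dimensions: {large / 1e9:.2f} GB")
```

Doubling the dimensionality doubles the storage, and the cost of comparing a query against every stored vector grows in the same proportion.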

Finally, the data is inserted into the database along with the corresponding embeddings for the desired fields. After this setup, the infrastructure is ready to deliver meaningful results.
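The preparation stage described above can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_embed` is a hypothetical stand-in for the Machine Learning model (it only hashes character trigrams, so it captures spelling overlap rather than meaning), the "database" is a plain in-memory list, and the records and field names are invented for the example.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Hypothetical stand-in for a real embedding model.

    Hashes character trigrams into `dim` buckets and normalises the result
    to unit length. A real model would encode meaning, not spelling.
    """
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        digest = hashlib.md5(trigram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def build_index(records: list[dict], field: str, dim: int = 16) -> list[dict]:
    """Return copies of the records, each with an added 'embedding' key."""
    return [{**record, "embedding": toy_embed(record[field], dim)} for record in records]


# Invented example records; 'description' is the field made searchable.
records = [
    {"id": 1, "description": "temperature in degrees Celsius"},
    {"id": 2, "description": "air temperature (°C)"},
    {"id": 3, "description": "number of parking spaces"},
]
index = build_index(records, "description")
```

In a real deployment, the toy function would be replaced by the selected model and the list by a database that supports vector fields. The overall shape stays the same: embed once at insertion time and store the vector next to the data.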

Similarly, when it comes to making queries, the search input provided by the user is run through the same Machine Learning model. As soon as the embedding of the search query is available, it is possible to compute its similarity with each of the embeddings stored in the database during the preparation stage. This calculation can be carried out with different measures, such as cosine similarity, dot product or Euclidean distance.
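As a minimal sketch of the query stage, the example below ranks stored embeddings against a query embedding using cosine similarity, one common choice of formula. The three-dimensional vectors, the identifiers and the `search` helper are made up for illustration; real embeddings come from the model and have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec: list[float], index: list[dict], top_k: int = 2) -> list[tuple]:
    """Score every stored embedding against the query and keep the best top_k."""
    scored = [
        (cosine_similarity(query_vec, item["embedding"]), item["id"])
        for item in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hand-written vectors standing in for embeddings stored during preparation.
index = [
    {"id": "celsius", "embedding": [0.9, 0.1, 0.0]},
    {"id": "fahrenheit", "embedding": [0.8, 0.3, 0.1]},
    {"id": "parking", "embedding": [0.0, 0.2, 0.9]},
]
query_vec = [1.0, 0.2, 0.0]  # assumed embedding of the user's search input

results = search(query_vec, index)
for score, item_id in results:
    print(f"{item_id}: {score:.3f}")
```

With these vectors, the two temperature entries rank above the unrelated one because their vectors point in nearly the same direction as the query. This exhaustive comparison is the simplest form of the calculation. Vector databases typically speed it up with approximate nearest-neighbour indexes.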

The described approach is expected to deliver accurate matches for the AI4PublicPolicy use case because it relies on vectorial representations of the data, which allow the algorithm to find semantically equivalent sentences, i.e., similarities that go beyond the words themselves or the characters that form them.