Accessing datasets is essential for AI models used in policymaking. Having more complete and diverse datasets can lead to more accurate policy evaluations. To increase the number of datasets, a Dataset Discovery mechanism is required to identify relevant sources and make use of their data.
However, there’s a challenge when dealing with datasets from different sources that use varying data models and semantic concepts. The issue is how to bring together data from these diverse sources that meet a query’s criteria, how to access that data, and how to present it in a way that aligns with the query’s semantic model. This ensures that the results are understandable to the user or machine that initiated the query, regardless of the data’s original semantic model.
This blog post explains how AI4PublicPolicy deals with this challenge. It focuses on finding equivalent data concepts described in other semantic models, the required data conversions, and the collection of data from semantically heterogeneous data sources.
The approach followed in AI4PublicPolicy to support the execution of semantic data queries starts with the use of GraphQL to express the desired queries. GraphQL makes it possible to express queries in an easy-to-understand way, using a JSON-like language, and comes with a broad ecosystem of tools that support the handling and execution of such queries. Moreover, GraphQL gives users a flexible way to define both a query and the desired structure of its output. Once the GraphQL query for relevant datasets has been defined, it needs to be processed to support the collection of the datasets that meet the query conditions.
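To make the idea of a GraphQL query more concrete, here is a minimal sketch in Python. The schema is an assumption: the field names (airQuality, no2, pm10) and their arguments are hypothetical and are not taken from the AI4PublicPolicy platform. The sketch only shows what such a query could look like and how it is commonly wrapped in a JSON body when sent over HTTP.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical query: field names and arguments are illustrative only,
# not the actual AI4PublicPolicy schema.
QUERY = """
query {
  airQuality(location: "Lisbon", from: "2023-01-01") {
    no2(aggregate: AVG)
    pm10
  }
}
"""


def build_request_body(query: str, variables: Optional[Dict[str, Any]] = None) -> str:
    """Serialize a GraphQL query into the JSON body commonly used over HTTP."""
    return json.dumps({"query": query, "variables": variables or {}})


body = build_request_body(QUERY)
```

The nested selection (no2 and pm10 inside airQuality) is what lets the requester define the shape of the output as well as the data being asked for.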
The figure below presents an overview of the Semantic Query System. It is a customized setup, strongly inspired by the typical architecture of a GraphQL system, and is composed of three main modules: the Query Server, the Semantic Matcher and Data Converter, and the Model Data Resolver.
Query Server
This module is responsible for receiving the query request and performing the first processing steps on it. It starts by identifying the Data Model that serves as context for the requested data. Then, as a typical GraphQL server would, it analyses the query and extracts the Measurable Quantities, i.e., the types of data described in that Data Model. Alongside the Measurable Quantities, it also extracts any filtering or constraint information that applies to the collection of data for each Measurable Quantity (e.g., locations, thresholds) and any operation that should be applied to the collected datasets (average, minimum value, maximum value, etc.). This module is also responsible for aggregating the data of the query response according to the structure defined in the GraphQL query. Before the response is provided to the query requester, any operation requested in the query (such as an average or a sum) is applied to the data.
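The extraction step described above can be sketched as follows. This is not the project's implementation: the already-parsed query structure, the QuantityRequest class, the "aggregate" argument name and the model name "ModelA" are all assumptions made for illustration. A real server would obtain this structure from a GraphQL parser.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class QuantityRequest:
    """One Measurable Quantity with its filters and optional operation."""
    name: str
    filters: Dict[str, Any] = field(default_factory=dict)
    operation: Optional[str] = None


# Hypothetical, already-parsed form of a GraphQL query.
PARSED_QUERY = {
    "data_model": "ModelA",
    "fields": [
        {"name": "no2", "args": {"location": "Lisbon", "aggregate": "avg"}},
        {"name": "pm10", "args": {"location": "Lisbon"}},
    ],
}

# Operations the sketch knows how to apply to a list of collected values.
OPERATIONS: Dict[str, Callable[[List[float]], float]] = {
    "avg": mean,
    "min": min,
    "max": max,
    "sum": sum,
}


def extract_requests(parsed: Dict[str, Any]) -> Tuple[str, List[QuantityRequest]]:
    """Return the Data Model and one QuantityRequest per requested field."""
    requests = []
    for item in parsed["fields"]:
        args = dict(item["args"])            # copy so the input is not mutated
        operation = args.pop("aggregate", None)
        requests.append(QuantityRequest(item["name"], args, operation))
    return parsed["data_model"], requests


def apply_operation(request: QuantityRequest, values: List[float]):
    """Apply the requested operation, or return the raw values if none was asked."""
    if request.operation is None:
        return values
    return OPERATIONS[request.operation](values)
```

Separating the operation from the filters mirrors the text: filters travel with the request to the data sources, while operations are applied only once the data has been collected.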
Semantic Matcher and Data Converter
This module is responsible for performing the semantic matching between different data models and for executing any data conversion required when matching data between them. When the Query Server analyses the GraphQL query, it extracts all the Measurable Quantities and any constraints that should be applied during the process of data discovery. Then, the Semantic Matcher receives the Measurable Quantities and discovers their correspondence (if it exists) in other Data Models. This process is represented in the following figure.
To make this matching possible, the module uses a database with a list of semantic mappings. This database records all known correspondences between pairs of (Data Model, Measurable Quantity). When a (Data Model, Measurable Quantity) pair is extracted from the query, the database returns all the semantic pairs that match it.
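A minimal sketch of such a lookup is shown here, with an in-memory dictionary standing in for the mappings database. The model names and quantity names are hypothetical examples, and the real storage technology is not specified in this post.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (Data Model, Measurable Quantity)

# Hypothetical mapping entries, used only to illustrate the lookup.
SEMANTIC_MAPPINGS: Dict[Pair, List[Pair]] = {
    ("ModelA", "no2"): [
        ("ModelB", "nitrogenDioxide"),
        ("ModelC", "NO2_concentration"),
    ],
    ("ModelA", "pm10"): [("ModelB", "particulateMatter10")],
}


def find_equivalents(model: str, quantity: str) -> List[Pair]:
    """Return every known pair equivalent to (model, quantity), or an empty list."""
    return list(SEMANTIC_MAPPINGS.get((model, quantity), []))
```

Returning an empty list when no mapping exists keeps the "if it exists" case explicit: the query then runs only against sources of the original Data Model.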
As a result of this matching process, a query initially set to search for data sources that match a specific data model can be translated into other data models. This broadens the range of data sources that can be accessed. Additionally, filtering and constraint information must be converted to fit the structure of the mapped data model. This ensures that the restrictions can be applied to the data collected from the sources. This conversion can be as simple as changing property names or more complex, like altering the structure and units used to define the restrictions.
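The conversion of filters can be sketched as a table of rules, each giving the target property name and a function that converts the value. The property names (location, city, max_temp_c, maxTempF) and the temperature example are assumptions chosen for illustration, not rules from the project.

```python
from typing import Any, Callable, Dict, Tuple

# Rule: source property -> (target property, value converter).
Rule = Tuple[str, Callable[[Any], Any]]

# Hypothetical rules for translating filters from "ModelA" to "ModelB".
MODEL_A_TO_B: Dict[str, Rule] = {
    # Simple case: only the property name changes.
    "location": ("city", lambda value: value),
    # More complex case: the name and the unit (Celsius to Fahrenheit) change.
    "max_temp_c": ("maxTempF", lambda celsius: celsius * 9 / 5 + 32),
}


def convert_filters(filters: Dict[str, Any], rules: Dict[str, Rule]) -> Dict[str, Any]:
    """Translate each filter into the target model, failing on unknown properties."""
    converted = {}
    for prop, value in filters.items():
        if prop not in rules:
            raise KeyError(f"no conversion rule for filter {prop!r}")
        target_prop, convert = rules[prop]
        converted[target_prop] = convert(value)
    return converted
```

Raising an error for a filter without a rule is one possible design choice: silently dropping a constraint would return data the requester explicitly asked to exclude.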
When the identified information is extracted from the data sources, the results are sent back to the Semantic Matcher and Data Converter, which converts them according to the specification of the Data Model of the original query. This conversion may involve converting data units or more complex operations, such as changing the way the information is structured.
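As an illustration of this return path, the sketch below converts records from a hypothetical "ModelB" shape (nested value and unit) into a flat "ModelA" shape expressed in micrograms per cubic metre. All record fields, names and units here are assumptions made for the example.

```python
from typing import Any, Dict, List

# Conversion factors into the unit assumed for the original model (ug/m3).
UNIT_FACTORS_TO_UG_M3 = {"ug/m3": 1.0, "mg/m3": 1000.0}

# Hypothetical name mapping from the source model back to the original model.
NAME_B_TO_A = {"nitrogenDioxide": "no2"}


def to_model_a(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten ModelB records and normalize their units for ModelA."""
    converted = []
    for record in records:
        row: Dict[str, Any] = {"timestamp": record["observedAt"]}
        for b_name, a_name in NAME_B_TO_A.items():
            if b_name in record:
                measurement = record[b_name]
                factor = UNIT_FACTORS_TO_UG_M3[measurement["unit"]]
                row[a_name] = measurement["value"] * factor
        converted.append(row)
    return converted


model_b_records = [
    {"observedAt": "2023-01-01T00:00:00Z",
     "nitrogenDioxide": {"value": 0.5, "unit": "mg/m3"}},
]
model_a_records = to_model_a(model_b_records)
```

The example covers both kinds of change mentioned in the text: a unit conversion (mg/m3 to ug/m3) and a structural change (a nested measurement object becomes a plain number).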
Model Data Resolver
After the Semantic Matcher and Data Converter has generated equivalent representations of the query in other data models, the original query and its equivalents need to be executed. Query execution is carried out by a component called the Model Data Resolver. This resolver implements the logic for collecting data from a specific data source and presenting it according to a specific Data Model.
Each Data Model can be associated with multiple pairs of Model Data Resolver and Data Source. This information is recorded in an index that identifies, for each Model, all the available Data Resolvers and the Measurable Quantities that each Data Resolver can resolve. To get the list of resolvers, a query identifying the (Model, Measurable Quantity) pair of interest must be submitted to this index. This process is described in the figure below.
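The index lookup can be sketched with a nested dictionary: Data Model, then resolver identifier, then the set of Measurable Quantities that resolver supports. The resolver identifiers and model names are hypothetical, and the real index may be stored and queried differently.

```python
from typing import Dict, List, Set

# Hypothetical index: Data Model -> resolver id -> resolvable quantities.
RESOLVER_INDEX: Dict[str, Dict[str, Set[str]]] = {
    "ModelA": {
        "resolver-city-api": {"no2", "pm10"},
        "resolver-sensor-db": {"no2"},
    },
    "ModelB": {
        "resolver-open-data": {"nitrogenDioxide"},
    },
}


def find_resolvers(model: str, quantity: str) -> List[str]:
    """Return the ids of all resolvers able to resolve (model, quantity)."""
    resolvers = RESOLVER_INDEX.get(model, {})
    # Sorted so the result is deterministic regardless of dictionary order.
    return sorted(rid for rid, quantities in resolvers.items() if quantity in quantities)
```

For example, find_resolvers("ModelA", "no2") returns both resolvers registered for that model, while a pair with no registered resolver yields an empty list.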
Each data source has its own Resolver, which contains information on how to access the data source. This includes the method for obtaining information (like using specific queries, REST endpoints, etc.) and any required authentication methods (such as usernames and passwords, authentication tokens, certificates, etc.).
Since some data sources offer data aggregations for multiple measurable quantities, the Data Resolver is also responsible for extracting the information relevant to the initial query from the received data aggregation. The details of the operation of a Data Resolver are presented in the figure below.
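A Data Resolver along these lines could be sketched as follows. The class layout, the injected fetch function, the endpoint URL and the token are all assumptions for illustration. The fetch function is a stand-in for whatever access method a real data source requires, so the example runs without any network access.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

# Signature of the access method: (endpoint, auth token) -> aggregated payload.
FetchFn = Callable[[str, Optional[str]], Dict[str, Any]]


@dataclass
class DataResolver:
    """Holds how to reach one data source and which model it answers in."""
    data_model: str
    endpoint: str               # e.g. a REST URL (hypothetical here)
    auth_token: Optional[str]   # one of several possible authentication methods
    fetch: FetchFn              # injected so the access method can vary per source

    def resolve(self, quantities: List[str]) -> Dict[str, Any]:
        """Fetch the source's aggregation and keep only the requested quantities."""
        payload = self.fetch(self.endpoint, self.auth_token)
        return {name: payload[name] for name in quantities if name in payload}


def fake_fetch(endpoint: str, token: Optional[str]) -> Dict[str, Any]:
    """Stand-in for a real source that returns several quantities at once."""
    return {"no2": [40.0, 42.5], "pm10": [18.0], "o3": [61.0]}


resolver = DataResolver("ModelA", "https://example.org/air", "secret-token", fake_fetch)
result = resolver.resolve(["no2"])
```

Here the source returns three quantities in a single aggregation, and the resolver keeps only the one relevant to the initial query.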