Policy and Data Management Implementation

Policy and Data Management Component Architecture

The Policy and Data Management component is made up of five subcomponents: the Web Interface, the Dataset Management component, the Policy Management component, a database (DB) and EGI DataHub, as depicted in the following figure.

The Web Interface lets non-technical users define, update and use datasets and policies through a set of web forms. The Dataset Management and Policy Management components manage datasets and policies, respectively. The dataset, policy and AI model definitions are stored in a database. Finally, EGI DataHub is a remote file system, provided by EGI and based on EOSC, that stores all AI4PublicPolicy dataset files.

Dataset Management

The Dataset Management subcomponent provides the tools needed to define new datasets and to upload data files to, and download them from, the AI4PublicPolicy platform. A dataset is defined by its schema, and all data files in a dataset must conform to that schema.

Dataset Management Architecture

The Dataset Management component is made up of four subcomponents: a REST API, the dataset management subcomponent, a database and EGI DataHub. The following figure shows the subcomponents and the interactions between them.

The REST API can be used both by end users and by other components of the AI4PublicPolicy platform to submit CRUD operations on the datasets used by the policies defined in the project. The Dataset Management component handles all requests received through the REST API, stores the dataset schemas in a database, and supports uploading and downloading the dataset files stored in EGI DataHub. The database (DB), in this case MySQL, stores the dataset schemas.
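To make the request flow a little more concrete, here is a minimal sketch of how a client might prepare a dataset definition for the REST API using only Python's standard library. It builds the request without sending it. The base URL, the /datasetManagement/schema path and the example field values are hypothetical placeholders chosen for illustration; only the use of POST and PUT with a JSON body comes from the description later in this post.

```python
import json
import urllib.request

# Hypothetical base URL; the real address of the platform's REST API is not given here.
BASE_URL = "https://example.org"


def build_schema_request(schema: dict, method: str = "POST") -> urllib.request.Request:
    """Build (but do not send) an HTTP request carrying a dataset schema as JSON.

    POST would define a new schema and PUT would update an existing one.
    The path used below is an assumption made for illustration only.
    """
    if method not in ("POST", "PUT"):
        raise ValueError("schemas are defined with POST or updated with PUT")
    body = json.dumps(schema).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/datasetManagement/schema",  # hypothetical path
        data=body,
        method=method,
        headers={"Content-Type": "application/json"},
    )


# Illustrative payload; the values are invented for this example.
example_schema = {
    "name": "air_quality",
    "owner": "demo-user",
    "description": "Hourly air quality readings",
}

request = build_schema_request(example_schema)
# urllib.request.urlopen(request) would send it; omitted because the URL is a placeholder.
```

Keeping request construction separate from sending makes the payload easy to inspect and test before it reaches the service.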

Dataset Definition Persistence

Dataset schemas are defined and updated using the POST and PUT operations, respectively. Each operation receives a JSON object as a parameter, and that object is stored in the MySQL database to ensure persistence. Two tables hold the dataset schema: dataset and field. The dataset table stores the metadata associated with a given dataset, as provided in the JSON object: name, source_type, owner, pilot, description, image_format and language. The field table stores the actual schema of the dataset, that is, the description of its fields: name, type, description, default_value and dataset_id. The figure below shows the entity relationship diagram (ERD) of the dataset and field tables.
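The two tables can be sketched roughly as follows. This is an illustration rather than the project's actual DDL: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, and the column types, the id primary keys, the foreign key and the "fields" key in the JSON object are all assumptions. Only the table and column names are taken from the text above.

```python
import sqlite3

# Column names follow the text; types, primary keys and the foreign key are assumptions.
DDL = """
CREATE TABLE dataset (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    source_type  TEXT,
    owner        TEXT,
    pilot        TEXT,
    description  TEXT,
    image_format TEXT,
    language     TEXT
);
CREATE TABLE field (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    type          TEXT NOT NULL,
    description   TEXT,
    default_value TEXT,
    dataset_id    INTEGER NOT NULL REFERENCES dataset(id)
);
"""

META_COLUMNS = ["name", "source_type", "owner", "pilot",
                "description", "image_format", "language"]


def store_schema(conn: sqlite3.Connection, schema: dict) -> int:
    """Insert one dataset row plus one field row per schema field; return the dataset id."""
    placeholders = ", ".join("?" for _ in META_COLUMNS)
    cursor = conn.execute(
        f"INSERT INTO dataset ({', '.join(META_COLUMNS)}) VALUES ({placeholders})",
        [schema.get(column) for column in META_COLUMNS],
    )
    dataset_id = cursor.lastrowid
    # The "fields" key is a hypothetical shape for the JSON object.
    for field in schema.get("fields", []):
        conn.execute(
            "INSERT INTO field (name, type, description, default_value, dataset_id) "
            "VALUES (?, ?, ?, ?, ?)",
            (field["name"], field["type"], field.get("description"),
             field.get("default_value"), dataset_id),
        )
    conn.commit()
    return dataset_id


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
new_id = store_schema(conn, {
    "name": "air_quality",
    "owner": "demo-user",
    "fields": [
        {"name": "station", "type": "string"},
        {"name": "reading", "type": "float", "default_value": "0.0"},
    ],
})
```

Each field row points back to its dataset through dataset_id, so a single dataset row can own any number of field descriptions.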

Dataset Persistence

When a dataset schema is defined, a directory is created in EGI DataHub. All datasets with that schema are stored in that directory, either through the PUT /datasetManagement/dataset operation or by uploading the files directly through EGI DataHub. When a dataset is uploaded into EGI DataHub using the PUT operation, the Dataset Management component receives a notification, accesses the dataset and checks whether it satisfies the dataset schema. To receive these events, the Dataset Management component subscribes to changes in a particular EGI DataHub space.
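The schema check can be pictured with a small sketch like the one below. It is not the component's actual validation logic: it assumes the data file is CSV with a header row, and the type names ("string", "integer", "float") and the validate_rows helper are hypothetical names introduced for this example.

```python
import csv
import io

# Hypothetical mapping from schema type names to Python converters.
CONVERTERS = {"string": str, "integer": int, "float": float}


def validate_rows(schema_fields: list, csv_text: str) -> list:
    """Return a list of error messages; an empty list means the file satisfies the schema.

    A type name missing from CONVERTERS raises KeyError, since that indicates
    a problem with the schema rather than with the data file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = [field["name"] for field in schema_fields]
    if reader.fieldnames != expected:
        return [f"header {reader.fieldnames} does not match schema {expected}"]

    errors = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for field in schema_fields:
            convert = CONVERTERS[field["type"]]
            value = row[field["name"]]
            try:
                convert(value)
            except (ValueError, TypeError):
                errors.append(
                    f"line {line_no}: {value!r} is not a valid "
                    f"{field['type']} for field {field['name']!r}"
                )
    return errors


fields = [
    {"name": "station", "type": "string"},
    {"name": "reading", "type": "float"},
]
good_file = "station,reading\nA1,3.5\nB2,4\n"
bad_file = "station,reading\nA1,n/a\n"

good_errors = validate_rows(fields, good_file)
bad_errors = validate_rows(fields, bad_file)
```

Returning a list of messages instead of raising on the first problem lets the caller report every violation in an uploaded file at once.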

Both the process of configuring the Dataset Management component to access the AI4PublicPolicy EGI DataHub space and the process of subscribing to changes will be explained in more detail in our upcoming blog post next week!