DataGraft (https://datagraft.io) provides a collection of tools for the integrated management of data transformations and for hosting and accessing graph data. It is organized as a set of cloud services that are delivered through the DataGraft portal. Its latest version has been extended with new features and capabilities aimed at easing the burden on data workers and data scientists.
Features of the DataGraft platform
Building a knowledge graph using the EW-Shopp toolkit – Grafterizer, ASIA and ABSTAT
Did you know that data preparation accounts for about 80% of the work of data scientists? Preparing and transforming large amounts of data from a raw tabular format to semantically enriched data can be time-consuming and difficult. Most data scientists also find this task to be one of the least enjoyable. Moreover, integrating business data in EW-Shopp with events and weather data requires specific knowledge about the content of the knowledge graph and about how to map data schemas to shared vocabularies that can enrich the data.
To ease this process of preparing and enriching data, three tools have been developed as part of EW-Shopp:
- Grafterizer – Cleaning and transforming business data from tabular format to linked data
- ASIA – Enriching tabular business data with events and weather data that semantically enhance the content of a knowledge graph
- ABSTAT – Understanding the content of the knowledge graphs (the linked product data) by providing statistical profiles and data quality insights
Let’s have a closer look at how the different tools can contribute to more effective work processes and free up time for data scientists to focus on more important tasks such as data analysis. After all, this is where we want to spend more time in EW-Shopp, to really understand how events and weather can be used to target marketing along the shopper journey.
Our focus has been on providing users with an integrated solution that can clean and prepare data, semantically enrich it, and give useful insights about data quality. The result is a data preparation and enrichment service that combines all three functionalities in one user interface. That is three needs met by one solution. The process of onboarding data to the knowledge graph starts with cleaning and transforming the raw tabular data to a schema and format that can be mapped to a data model.
New interactive user interface in Grafterizer
By selecting the first tab of the user interface, you will see that Grafterizer features interactive specification of data transformations along with a back-end for management and execution of data transformations. Transformation steps on rows (add, drop, filter, duplicate detection, etc.), columns (add, drop, rename, merge, etc.) and the entire data set (sort, aggregate, etc.) are provided together with a visualization of the result after each step. To further assist the user in understanding the data, we have added visual data profiling capabilities that analyse and determine data quality based on the statistical properties, semantics and structure of the data. The data quality assessment is presented to the user by means of statistical and scientific charts and visualizations:
Visual profiling of data selections in Grafterizer
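To make the kinds of steps listed above more concrete, here is a minimal sketch in plain Python of a row step (duplicate removal, filtering), a column step (rename), a data set step (sort) and a very simple column profile. It is not Grafterizer's implementation or API: the function names, the sample table and its column names are all assumptions made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical raw table; column names and values are invented for illustration.
RAW_CSV = """product,region,units
umbrella,Oslo,12
umbrella,Oslo,12
sunscreen,Milan,
raincoat,Oslo,7
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def drop_duplicates(rows):
    """Row step: keep only the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def filter_rows(rows, keep):
    """Row step: keep the rows for which keep(row) is true."""
    return [row for row in rows if keep(row)]


def rename_column(rows, old, new):
    """Column step: rename one column in every row."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]


def sort_rows(rows, column, cast=str):
    """Data set step: sort all rows by one column, after casting its values."""
    return sorted(rows, key=lambda row: cast(row[column]))


def profile(rows, column):
    """Tiny statistical profile of one column (counts only, no charts)."""
    values = [row[column] for row in rows]
    return {
        "count": len(values),
        "missing": sum(1 for v in values if v == ""),
        "distinct": len(set(values)),
        "most_common": Counter(values).most_common(1),
    }


def run_pipeline(text):
    """Apply the steps in order, as a transformation pipeline would."""
    rows = load_rows(text)
    rows = drop_duplicates(rows)
    rows = filter_rows(rows, lambda row: row["units"] != "")
    rows = rename_column(rows, "units", "units_sold")
    return sort_rows(rows, "units_sold", cast=int)
```

Calling `profile(load_rows(RAW_CSV), "units")` before `run_pipeline(RAW_CSV)` shows why profiling first is useful: the missing value and the duplicate row are visible in the counts before any step is chosen.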
After the user has finished cleaning and transforming the business data, the time has come to transform the tabular data to a graph format that defines the semantic relations and properties in the knowledge graph. In the RDF Mapping tab, the tabular data prepared in the first step of the process can easily be mapped to a graph format by building a tree structure of RDF triples:
RDF mapping in Grafterizer
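As a rough illustration of what mapping a table row to RDF triples means, the sketch below turns one cleaned row into N-Triples lines: one subject per row, with one triple per mapped column. The `example.org` namespaces, the property names and the row layout are hypothetical, and the code does not reflect how Grafterizer generates RDF internally.

```python
from urllib.parse import quote

# Hypothetical namespaces and vocabulary terms, chosen only for this example.
BASE = "http://example.org/product/"
VOCAB = "http://example.org/vocab#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value, datatype=None):
    """Build an N-Triples literal, escaping the most common special characters."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    if datatype:
        return f'"{escaped}"^^<{datatype}>'
    return f'"{escaped}"'


def row_to_triples(row):
    """Map one table row to a small tree of triples rooted at the product."""
    subject = iri(BASE + quote(row["product"]))
    return [
        f"{subject} {iri(RDF_TYPE)} {iri(VOCAB + 'Product')} .",
        f"{subject} {iri(VOCAB + 'soldIn')} {literal(row['region'])} .",
        f"{subject} {iri(VOCAB + 'unitsSold')} {literal(row['units_sold'], XSD_INTEGER)} .",
    ]


if __name__ == "__main__":
    example_row = {"product": "raincoat", "region": "Oslo", "units_sold": "7"}
    print("\n".join(row_to_triples(example_row)))
```

The subject IRI plays the role of the tree root in the mapping, and each column becomes a branch with either a literal or another IRI as its leaf.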
Since an important aspect of EW-Shopp is the integration of business data with event and weather data, ASIA provides an interface that guides the user through the semantic annotation, reconciliation and enrichment of the tabular data. Semantic annotations are used to generate mappings from the table of data to a knowledge graph (in RDF or in the ArangoDB JSON format) using one or more vocabularies. ASIA adopts a column-wise approach to semantic annotation, allowing users to define annotations based on smart suggestions provided by the tool. Currently, ASIA incorporates suggestions from ABSTAT, the knowledge graph profiling tool, but it can be configured to use other terminology recommendation services such as those based on LOV (https://lov.linkeddata.es/dataset/lov/). The profiles created by ABSTAT, also named summaries, describe the content of RDF datasets in a synthetic manner, and have proved to be helpful for a variety of application domains such as data understanding, quality assessment, analytical modelling, and vocabulary suggestion. Moreover, ASIA links data values to shared systems of identifiers, which enables the extraction of additional data from third-party sources and their fusion into the original tabular data. ASIA supports both schema-level and instance-level annotations of a table:
Schema-level linking widget in ASIA
Finally, Grafterizer, through the ASIA tool, now enables users to reconcile and extend data in various ways using knowledge graphs (GeoNames, Google GeoTargets, Wikifier) and weather data repositories (ECMWF). More additions and extensions to this feature will be coming in the future!
Column reconciliation in Grafterizer
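The idea behind reconciliation and extension can be sketched in a few lines of Python: values in a column are first linked to shared identifiers, and those identifiers are then used to pull in additional data. Everything below is an assumption for illustration only: the lookup tables stand in for external services, the identifiers are placeholders rather than real GeoNames IDs, the temperature is an invented number, and the function names are not part of ASIA.

```python
# Hypothetical stand-in for a reconciliation service: label -> shared identifier.
# The identifiers are placeholders, not real GeoNames IDs.
GAZETTEER = {"oslo": "place:001", "milan": "place:002"}

# Hypothetical stand-in for a weather repository keyed by (place id, date).
# The values are invented, not real measurements.
WEATHER = {("place:001", "2019-03-01"): 2.5}


def reconcile(rows, column, lookup):
    """Add '<column>_id' holding the matched identifier, or None if unmatched."""
    reconciled = []
    for row in rows:
        key = row[column].strip().lower()
        reconciled.append({**row, column + "_id": lookup.get(key)})
    return reconciled


def extend(rows, id_column, date_column, source, new_column):
    """Add a new column fetched from 'source' using the reconciled identifier."""
    return [
        {**row, new_column: source.get((row[id_column], row[date_column]))}
        for row in rows
    ]


if __name__ == "__main__":
    table = [
        {"city": " Oslo", "date": "2019-03-01"},
        {"city": "Atlantis", "date": "2019-03-01"},
    ]
    linked = reconcile(table, "city", GAZETTEER)
    for row in extend(linked, "city_id", "date", WEATHER, "temp_c"):
        print(row)
```

Keeping unmatched values as `None` rather than dropping the row leaves the decision about how to handle them to a later cleaning step.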
Managing heterogeneous data using the EW-Shopp toolkit – ArangoDB support
DataGraft originally targeted RDF data stored in triple stores using GraphDB and the GraphDB Cloud database-as-a-service. As part of the EW-Shopp toolkit, the services have now been extended to provide transformation and hosting of graph data in the ArangoDB multi-model store (https://www.arangodb.com/). In ArangoDB, node data in tabular form and edge data (graph relationships) can be stored and queried in the same database.
The transformation of data into the ArangoDB graph format differs from the transformation to standardised triple store formats because of the structure of the database. Grafterizer is now able to produce transformed collection data (i.e., node and edge collections) in JSON format that can be downloaded and stored directly in ArangoDB.
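As a sketch of what node and edge collections in JSON can look like, the Python below splits a small table into two node collections and one edge collection. It relies on ArangoDB's document conventions, where `_key` identifies a document and edges carry `_from` and `_to` in the form `collection/key`. The collection names, the attribute names and the input table are hypothetical, and this is not the exact output format produced by Grafterizer.

```python
import json


def build_collections(rows):
    """Split table rows into node collections and one edge collection."""
    products, regions, edges = {}, {}, []
    for row in rows:
        # Assumes the values are already valid ArangoDB keys (simple ASCII names).
        product_key, region_key = row["product"], row["region"]
        # Dicts keyed by _key de-duplicate nodes that appear in several rows.
        products[product_key] = {"_key": product_key, "name": row["product"]}
        regions[region_key] = {"_key": region_key, "name": row["region"]}
        edges.append(
            {
                "_from": f"products/{product_key}",
                "_to": f"regions/{region_key}",
                "units_sold": int(row["units_sold"]),
            }
        )
    return {
        "products": list(products.values()),
        "regions": list(regions.values()),
        "sold_in": edges,
    }


if __name__ == "__main__":
    table = [
        {"product": "raincoat", "region": "Oslo", "units_sold": "7"},
        {"product": "umbrella", "region": "Oslo", "units_sold": "12"},
    ]
    for name, documents in build_collections(table).items():
        print(name, json.dumps(documents, indent=2))
```

Each of the three lists could be written to its own JSON file, one per collection.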
The DataGraft portal can now manage ArangoDB database instances by using administrative credentials for a database. The user registers these login credentials and the databases themselves using the ArangoDB Web interface and copies them into the DBMS admin page.
Adding a new ArangoDB instance in DataGraft
Two sets of user credentials are handled: full access and read access (these are automatically generated by DataGraft). Full access to the database (read and write) is only available to the asset owner, while read-only access can be used for public ArangoDB databases in DataGraft (i.e., when exposing a database as a public asset on DataGraft). Using the DataGraft asset creation and editing features, users are now able to directly upload JSON collections (either ones produced by Grafterizer, or others) to their managed ArangoDB instances, as well as provide metadata, descriptions and other details.
Editing an ArangoDB DataGraft asset
Acknowledgement: The work on the EW-Shopp toolkit (specifically the Grafterizer 2.0 tool) has been conducted in cooperation with the euBusinessGraph project (http://eubusinessgraph.eu/) also co-funded by the EC under HORIZON 2020, The EU Framework Programme for Research and Innovation.