Extracting value from data has become one of the hottest topics in ICT over the past few years. Data-driven insights have the potential to steer decision-making and even generate completely new lines of revenue for businesses. Although there are many ways to extract knowledge from data (for example, using analytics or machine learning), it is often a challenge to do so.
With the inconceivably large amounts of heterogeneous data being generated and stored daily on a global scale, finding the best ways to handle diverse data is becoming one of the most critical issues to address before attempting to apply modern data analysis approaches. Furthermore, the domain-specific semantic descriptions of entities and their attributes are often implicit and, therefore, not readily available for machine processing. This hinders the sharing and reuse of data across application boundaries.
Representing meaning in data
Understanding the meaning of entities and properties of data in a machine-readable and traversable way is a crucial prerequisite of knowledge extraction. The Linked Data paradigm (using the RDF format) enables publishing semantically enriched data on the Web through the use of self-describing data and relations, interlinked by associating the data’s globally unique identifiers.
Every entity in RDF is represented by a Uniform Resource Identifier (URI) that can be dereferenced, which allows data to be integrated in a cross-domain graph. The dereferenceable URIs point either to the described entities themselves, or to ontologies or vocabularies that provide a machine-readable description of the domain and semantics. Data are organised in triples with a subject (representing an entity), a predicate (representing a relationship or attribute of the entity) and an object (representing a connected entity or attribute value).
RDF uses a standard query language called SPARQL, which relies on describing graph patterns that are matched against the underlying triple store.
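To make the triple structure and the idea of graph pattern matching concrete, here is a minimal Python sketch. It is not how a SPARQL engine is implemented: the example.org URIs, the sample facts and the `match` helper are all hypothetical, and only a single triple pattern is supported.

```python
# Illustrative sketch only: a toy in-memory triple list and a one-pattern matcher.
# The URIs, the sample facts and the `match` helper are hypothetical.

EX = "http://example.org/"

# Each triple is (subject, predicate, object).
triples = [
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "alice", EX + "name", "Alice"),
    (EX + "bob", EX + "name", "Bob"),
]


def match(pattern, store):
    """Yield variable bindings for one triple pattern.

    Pattern terms starting with '?' are variables; any other term must
    equal the corresponding term of the triple.
    """
    for triple in store:
        bindings = {}
        for term, value in zip(pattern, triple):
            if isinstance(term, str) and term.startswith("?"):
                # A variable that occurs twice must bind to the same value.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield bindings


# Roughly analogous to: SELECT ?s ?n WHERE { ?s ex:name ?n }
names = list(match(("?s", EX + "name", "?n"), triples))
```

Even in this toy form, every pattern is checked against every triple, which hints at why matching cost grows with the number of facts stored per entity.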
The adoption of the Linked Data paradigm and the RDF format has grown significantly over the past decade: the Linked Open Data (LOD) cloud initiative reports close to 1200 datasets (up from just 32 in 2008), and the current total size of the Data Web is estimated at almost 3000 distinct datasets and around 150 billion triples. Even though RDF data is gaining wider acceptance, there are still challenges with the practical use of RDF. Graph pattern matching, regardless of the underlying implementation, always comes with a performance penalty. This is especially evident when it comes to querying large numbers of relatively homogeneous entities for facts about their attributes. In such cases, the graph model implementation of Linked Data becomes cumbersome and query times suffer the most.
As businesses tend to use very large amounts of messy data in complex ways, there is a need for databases and data structures that can handle more diverse data. This is where (non-RDF) NoSQL databases come in.
NoSQL databases are able to handle large volumes of data (some of them are built specifically to address issues of distribution and scalability) without restricting value types and data structure. This enables new possibilities for database storage and representation where the data model is not restricted to specific value types within specific columns.
Furthermore, many NoSQL solutions are able to distribute a query over multiple nodes of a cluster and even let each node process the entire query locally before returning the result. This decreases the amount of processing power needed on a single database node. These functionalities of NoSQL databases address some of the issues of flexible data storage, but not all. Most NoSQL databases support a single data representation: document, key-value, or graph. Therefore, they either cannot handle relations between data very well (in the case of key-value and document stores), or they do not perform as well when it comes to querying large amounts of homogeneous data (in the case of graph stores). Semantics of data, which are an inseparable part of RDF data, are not natively supported by the data model, so NoSQL databases provide a flexible approach only when it comes to modelling data within one domain.
Triple stores are similar to NoSQL graph stores in that they both rely on a graph model for representing data. In RDF, most relationships tend to express attributes of entities, which results in a very granular graph where most nodes are primitive values (e.g., strings, numbers, etc.). NoSQL graph databases use relationships more sparingly – a node in a NoSQL graph database represents an object that contains its attributes and relationships are used only to associate entities through their database ID.
Over the past few years, new types of NoSQL solutions have emerged, referred to as multi-model databases, which attempt to combine the benefits of multiple storage methods from traditional NoSQL databases. Allowing a combination of a “flat” representation of data (i.e., as key-value pairs or documents) and inter-node/document associations (thereby producing graphs) opens new opportunities for building efficient data storage solutions.
Multi-model databases support multiple connected storage models – some of the more recent examples of which are ArangoDB and OrientDB. Both ArangoDB and OrientDB support document, key/value and graph data representations. Thereby, such databases are able to simultaneously support the benefits of all data models – the scalability and query performance of document and key/value databases, as well as the flexibility and easy extensibility of graph databases. This allows performing queries over large volumes of data without the performance penalty of graph traversal/matching. On the other hand, it is still possible to formulate graphs and take advantage of their expressiveness, flexibility and extensibility.
These multi-model solutions organise data in collections of documents and, based on them, provide support for all three NoSQL storage methods. Whereas in RDF each entity and property is by design in itself a node, in the data model adopted by multi-model databases, this is not necessarily true. Entities are represented by JSON objects with a (numeric) key attribute, and optionally other attribute-value pairs. This allows for the properties of an entity to be stored as attributes of the JSON object that is representing it. A graph is formed by connecting individual JSON objects through their keys. These connections are declared in separate JSON objects in special collections called edge/link collections.
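The document-plus-edge layout described above can be sketched in Python with plain dictionaries. The `_key`, `_from` and `_to` attribute names follow ArangoDB's conventions; the collection name `nodes`, the attribute values and the `neighbours` helper are hypothetical and serve only as an illustration.

```python
import json

# Hypothetical collection contents, for illustration only.
# ArangoDB keeps a document's key in `_key` and an edge's endpoints in
# `_from`/`_to`, each endpoint written as "<collection>/<key>".
nodes = [
    {"_key": "1001", "label": "Report A", "sessions": 120, "bounceRate": 0.42},
    {"_key": "1002", "label": "Website X"},
]
edges = [
    {"_from": "nodes/1001", "_to": "nodes/1002", "predicate": "describes"},
]


def neighbours(key, node_docs, edge_docs, collection="nodes"):
    """Return the documents reachable from `key` over one outgoing edge."""
    by_id = {f"{collection}/{doc['_key']}": doc for doc in node_docs}
    source = f"{collection}/{key}"
    return [by_id[e["_to"]] for e in edge_docs if e["_from"] == source]


# Attributes are read straight from the document, with no traversal involved.
report = nodes[0]
print(report["sessions"], report["bounceRate"])

# Relationships are followed only when they are actually needed.
print(json.dumps(neighbours("1001", nodes, edges)))
```

The point of the layout is that attribute lookups stay document-local, while the edge collection is consulted only for genuine entity-to-entity links.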
OrientDB also supports embedded relationships that can be declared directly as part of the documents. Apart from traditional indexing and query functions that are typical for the different NoSQL storage models, multi-model database solutions support geospatial indexing, as well as respective querying functionality.
Furthermore, both ArangoDB and OrientDB provide a library of standard graph traversal functions that can be used to take advantage of the graph data model. The query languages of both solutions are standardised: ArangoDB uses its own query language called AQL, whereas OrientDB uses an extension of SQL.
Although OrientDB supports declaring schemas of data (ArangoDB is fully schema-less), multi-model stores do not support semantic enrichment and representation out-of-the-box. However, an appropriate approach to store self-describing, semantically enriched multi-domain RDF data in a multi-model database can be used to combine the benefits of both back-ends.
Consider the following example RDF graph:
The RDF graph contains facts about one entity (in this case, a set of Google Analytics matches) including its type, a representative label, and a set of attributes with their values. With the RDF structure, this is a graph with seven nodes and six edges between nodes. Querying for all of the data about all sample entities would require a graph pattern description query and subsequent matching against an underlying database. If there are a large number of such entities, the query performance would be sub-optimal.
However, we can map the same data to a multi-model format (we use ArangoDB for this example) using the following set of rules:
- URI nodes are mapped to JSON objects, which serve as nodes in ArangoDB.
- Edges between nodes are generated based only on the links between URI nodes (not literals) in the RDF mapping template.
- URIs, which in RDF uniquely identify nodes, are used to generate unique numeric keys for the JSON object (numeric keys enable more efficient storage and lookup). This can be done, for example, using a standard hash function.
- Exception: rdf:type mappings – in RDF, these are used to specify type mappings for RDF entities. Types in RDF are URI nodes, which point to the semantic classes in an ontology or vocabulary. The classes specified are instead stored in a ‘type’ attribute, which is an array of types.
- RDF literals are mapped to JSON attributes for the URI node objects.
- Exception: rdfs:label – in RDF, these mappings are used to attach textual labels to entities. In the ArangoDB mapping, the values of these mappings are stored in a ‘label’ attribute and used to display labels in the Graph interface of ArangoDB.
- Prefixes and fully qualified RDF URIs are also stored in the resulting JSON object in ArangoDB. The specified prefixes in the mapping are additionally kept in separate JSON objects in the node collection to avoid overlaps with other prefixes and for enabling namespace-based lookups (based on the RDF namespaces defined in the mapping).
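The rules above can be sketched as a small Python function. This is a minimal illustration under stated assumptions, not the implementation used for the validation: the input format (4-tuples with a flag marking URI objects), the helper names, the truncated SHA-256 hash used for keys, the `nodes` collection name and the example.org data are all hypothetical. Prefix documents are left out for brevity, and only the fully qualified URI is kept on each node.

```python
import hashlib

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def uri_key(uri):
    """Derive a numeric key from a URI (first 8 bytes of SHA-256, an arbitrary choice)."""
    digest = hashlib.sha256(uri.encode("utf-8")).digest()
    return str(int.from_bytes(digest[:8], "big"))


def local_name(uri):
    """Use the part after the last '#' or '/' as the JSON attribute name."""
    return uri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]


def map_triples(triples, node_collection="nodes"):
    """Map (subject, predicate, object, object_is_uri) tuples to node and edge documents."""
    docs = {}
    edges = []

    def node(uri):
        # One JSON object per URI node, created on first use.
        return docs.setdefault(uri, {"_key": uri_key(uri), "uri": uri})

    for s, p, o, o_is_uri in triples:
        doc = node(s)
        if p == RDF_TYPE:
            # Exception: types are collected in an array attribute, not linked as nodes.
            doc.setdefault("type", []).append(o)
        elif p == RDFS_LABEL:
            # Exception: labels go into a dedicated 'label' attribute.
            doc["label"] = o
        elif o_is_uri:
            # Links between URI nodes become edge documents.
            target = node(o)
            edges.append({
                "_from": f"{node_collection}/{doc['_key']}",
                "_to": f"{node_collection}/{target['_key']}",
                "predicate": p,
            })
        else:
            # Literals become plain attributes of the subject's document.
            doc[local_name(p)] = o
    return list(docs.values()), edges


# Hypothetical input data.
EX = "http://example.org/"
example = [
    (EX + "entity/1", RDF_TYPE, EX + "Report", True),
    (EX + "entity/1", RDFS_LABEL, "Report 1", False),
    (EX + "entity/1", EX + "sessions", 120, False),
    (EX + "entity/1", EX + "source", EX + "site/1", True),
]
node_docs, edge_docs = map_triples(example)
```

Using the local name as the attribute name is a simplification: predicates from different namespaces that share a local name would collide, which is one reason the prefix information in the last rule matters.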
Applying these rules on the example data would result in the following single JSON document:
Querying a large set of such documents would naturally be much more efficient. Indeed, our preliminary validation shows that equivalent queries over the data in ArangoDB take an order of magnitude less time than querying a SPARQL server (Apache Jena Fuseki).
With the emergence of NoSQL multi-model databases, which natively support scalable and unified storage and querying of various data models such as graph, document and key-value, new opportunities arise for effectively representing and efficiently storing RDF data. The appropriate usage of the technology can help overcome the limitations of the individual data models and representations. Multi-model databases can thus be used to produce scalable and semantically rich representations of data that can help in extracting the most out of the sea of information available in the modern enterprise.