Session, FRI 10:30 - 12:00

Semantic Information Management II

DataID: Towards Semantically Rich Metadata For Complex Datasets

The constantly growing number of Linked Open Data (LOD) datasets creates a need for rich metadata descriptions that enable users to discover, understand and process the available data. This metadata is often created, maintained and stored in diverse data repositories with disparate data models, which frequently cannot provide the metadata needed to process the described datasets automatically.

This paper proposes DataID, a best practice for LOD dataset descriptions that uses RDF files hosted together with the datasets, under the same domain. We describe the data model, which is based on the widely used DCAT and VoID vocabularies, as well as supporting tools to create and publish DataIDs, and use cases that show the benefits of providing semantically rich metadata for complex datasets. As a proof of concept, we generated a DataID for the DBpedia dataset, which we present in the paper.

Martin Brümmer, Ciro Baron, Ivan Ermilov, Markus Freudenberg and Sebastian Hellmann

Representing Dataset Quality Metadata using Multi-Dimensional Views

Data quality is commonly defined as fitness for use. Many data consumers face the problem of identifying the quality of data, while data publishers do not have the means to identify quality problems in their own data. To make the task easier for both stakeholders, we extend the Dataset Quality Ontology (daQ) with multi-dimensional and statistical properties from the Data Cube vocabulary.

daQ is a lightweight, extensible vocabulary for attaching the results of quality benchmarking of a linked open dataset to the dataset itself. We discuss the design considerations, give examples of extending daQ with custom quality metrics, and present use cases such as analysing data versions, browsing datasets by quality and link identification. We also discuss how visualisation tools enable data publishers to better analyse the quality of their data.