Managing statistical classifications with OWL

Some explorations and noodlings on adding more explicit semantics to statistical classifications as a way to help manage the implications of relating classifications.

We've been representing statistical classifications in CSV and then converting them to SKOS Concept Schemes by way of table2qb and csv2rdf. The RDF Data Cube vocabulary uses SKOS concepts as the values of dimensions of observations in a data cube. SKOS, by design, doesn't provide much in the way of semantics, leaving it to an application to decide what skos:Concepts and the relations between them logically mean.
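
To make that concrete, here's a minimal sketch of the shape involved, with hypothetical URIs and an illustrative population figure rather than anything from the real datasets:

    @prefix qb:   <http://purl.org/linked-data/cube#> .
    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix ex:   <http://example.org/> .

    # A code from the classification, represented as a skos:Concept.
    ex:wales a skos:Concept ;
        skos:broader ex:united-kingdom .

    # An observation in the cube; the geography dimension takes the
    # concept as its value, but nothing says what that logically means.
    ex:obs1 a qb:Observation ;
        ex:geography  ex:wales ;
        ex:population 3100000 .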

The issue is that these semantics (and their logical implications) end up coded directly into applications, rather than being explicit, separate logical rules. As such, it's hard to reason about what the implications are when we want to relate classifications to each other.

Since a statistical classification divides a statistical population into subsets, normally MECE (mutually exclusive and collectively exhaustive), it makes sense to model the classification (and its hierarchy) as disjoint subsets (of subsets, etc.). OWL gives us the tools to model sets and the relations between them, and to reason about the consequences of any restrictions.
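
In OWL, that view of a classification might be sketched like this (hypothetical class names, not taken from the generated files):

    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/> .

    # Two subsets of a parent set...
    ex:UnitedKingdom     rdfs:subClassOf ex:BritishIslands .
    ex:CrownDependencies rdfs:subClassOf ex:BritishIslands .

    # ...that are mutually exclusive...
    ex:UnitedKingdom owl:disjointWith ex:CrownDependencies .

    # ...and collectively exhaustive.
    ex:BritishIslands owl:equivalentClass
        [ owl:unionOf ( ex:UnitedKingdom ex:CrownDependencies ) ] .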

By way of example, we've taken two overlapping breakdowns of geography, the British Isles and the British Islands, and created simple datasets about the populations of their various parts.

An Euler diagram with an overview of the terminology (public domain, TWCarlson, via Wikipedia)

This directory contains the following:

population-british-isles.csv contains the observations as Tidy Data in the style acceptable to table2qb.

population-british-islands.csv contains the observations as Tidy Data in a simplified style.

population-british-isles.csv-metadata.json gives the CSVW metadata needed to convert the data into an RDF data cube using the W3C standard csv2rdf.

population-british-islands.csv-metadata.json is similar, with some changes to cope with the simpler representation.

owl_classification.py takes a typical CSV file as above, representing a statistical classification that is expected to be MECE, and creates a hierarchy of classes (just sets really) of disjoint subclasses to represent the classification. Each class is defined "intensionally": its instances are those qb:Observations whose dimension property has the corresponding SKOS concept as its value (see the sketch after the usage summary below).

usage: owl_classification.py [-h] codelist classification codes property

Create statistical classification as OWL

positional arguments:
  codelist        Codelist CSV file.
  classification  Base URI for this classification.
  codes           Base URI for the codelist.
  property        Defining property.
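
For a single code, then, the generated class comes out along roughly these lines (a sketch with hypothetical URIs, not verbatim tool output):

    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix ex:  <http://example.org/> .

    # The set of observations whose geography dimension
    # has the concept ex:wales as its value.
    ex:Wales owl:equivalentClass
        [ a owl:Restriction ;
          owl:onProperty ex:geography ;
          owl:hasValue   ex:wales ] .

A reasoner can then place each qb:Observation into the class hierarchy from its dimension values alone.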

codelists/british-islands.csv and codelists/british-isles.csv provide separate breakdowns of the two overlapping hierarchies in a table2qb style.

codelists-metadata.json, columns.csv and the blank components.csv are configuration files used by table2qb.

prefixes.ttl is used to make the Turtle files more readable.

skos.rdf is a copy of the SKOS ontology with some small changes to remove the "lints" that might break reasoners.

Makefile records the various steps used to create the following targets:

british-isles.ttl and british-islands.ttl using table2qb to create SKOS concept schemes, with some sed to reference Wikidata URIs and to remove some straggling skos:member statements that shouldn't be there, and Apache Jena's riot to tidy things into readable Turtle.

population-british-isles.ttl and population-british-islands.ttl using csv2rdf.clj.

british-isles-owl.ttl and british-islands-owl.ttl using owl_classification.py.

test: checks the resulting data against the Data Cube Integrity Constraints.

Using Pellet

  • as a general ontology lint tool (lint)

  • running queries (query)

  • checking unsatisfiability (unsat)

  • showing the class hierarchy (classify) and where the instances fit (realization)

  • materializing inferences (extract; see the sketch after this list)

  • as a query engine in Fuseki to act as a small SPARQL server with reasoning
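
Tying the realization and extract steps back to the sketches above: from the hasValue definition and the subclass hierarchy, a reasoner would materialize membership facts like these (hypothetical URIs again):

    @prefix ex: <http://example.org/> .

    # Inferred from the hasValue restriction on ex:Wales...
    ex:obs1 a ex:Wales .
    # ...and, if ex:Wales rdfs:subClassOf ex:UnitedKingdom, from the
    # subclass hierarchy too.
    ex:obs1 a ex:UnitedKingdom , ex:BritishIslands .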