Alex Tucker, 5 Jul 2019

Managing statistical classifications with OWL
===========================================

*Some explorations and noodlings on adding more explicit semantics to
statistical classifications as a way to help manage the implications
of relating classifications.*

We've been representing statistical classifications in CSV and then
converting to SKOS Concept Schemes by way of [table2qb][table2qb] and
CSV2RDF. The RDF Data Cube vocabulary uses SKOS concepts as the values
of dimensions of observations in a data cube. SKOS, by design, doesn't
provide much in the way of semantics, leaving it to an application to
decide what skos:Concepts and relations between them logically mean.

The issue is that these semantics (and their logical implications) are
directly coded into applications, rather than being explicit, separate
logical rules. As such, it's hard to reason about what the
implications are when we want to relate classifications to each other.

Since a statistical classification divides a statistical population
into subsets, normally
[MECE](https://en.wikipedia.org/wiki/MECE_principle), it makes sense
to model the classification (and hierarchy) as disjoint subsets (of
subsets, etc.). OWL gives us the tools to model with sets and
relations between them and to reason about the consequences of any
restrictions.
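As a rough illustration of what MECE means in set terms, the following Python sketch checks that a set of named subsets is pairwise disjoint (mutually exclusive) and covers the whole population (collectively exhaustive). The function names `mece_report` and `is_mece`, and the sample identifiers, are invented for this example and are not part of this repository.

```python
from itertools import combinations


def mece_report(population, subsets):
    """Return (overlaps, missing) for named subsets of a population.

    overlaps: list of (name_a, name_b, shared_members) for intersecting subsets
    missing:  members of the population that no subset covers
    """
    overlaps = [
        (a, b, subsets[a] & subsets[b])
        for a, b in combinations(sorted(subsets), 2)
        if subsets[a] & subsets[b]
    ]
    covered = set().union(*subsets.values())
    missing = set(population) - covered
    return overlaps, missing


def is_mece(population, subsets):
    """True when the subsets are pairwise disjoint and cover the population."""
    overlaps, missing = mece_report(population, subsets)
    return not overlaps and not missing


# Hypothetical population of observation identifiers, split into two regions.
population = {"obs1", "obs2", "obs3", "obs4"}
by_region = {
    "north": {"obs1", "obs2"},
    "south": {"obs3", "obs4"},
}
print(is_mece(population, by_region))  # True
```

An OWL reasoner does the equivalent job declaratively: disjointness axioms make any overlap show up as an inconsistency rather than as a check written into application code.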

By way of example, we've taken two overlapping breakdowns of
geography, the [British
Isles](https://en.wikipedia.org/wiki/Terminology_of_the_British_Isles)
and the British Islands and created simple datasets about the
populations of the various parts.

![An Euler diagram with an overview of the terminology (public domain, TWCarlson, via Wikipedia)](https://upload.wikimedia.org/wikipedia/commons/2/28/British_Isles_Euler_diagram_15.svg)

This directory contains the following:

[population-british-isles.csv](population-british-isles.csv) contains
the observations as Tidy Data in the style acceptable to [table2qb][table2qb].

[population-british-islands.csv](population-british-islands.csv)
contains the observations as Tidy Data in a simplified style.

[population-british-isles.csv-metadata.json](population-british-isles.csv-metadata.json)
gives the CSVW needed to convert the data into an RDF data cube using
the W3C standard csv2rdf.

[population-british-islands.csv-metadata.json](population-british-islands.csv-metadata.json)
is similar, with some changes to cope with the simpler representation.

[owl_classification.py](owl_classification.py) takes a typical CSV
file as above, representing a statistical classification that is
expected to be MECE, and creates a hierarchy of classes (just sets
really) with disjoint subclasses to represent the classification. Each
class is defined "intensionally": its instances are those
qb:Observations whose dimension property has the corresponding SKOS
concept as its value.

```
usage: owl_classification.py [-h] codelist classification codes property

Create statistical classification as OWL

positional arguments:
  codelist        Codelist CSV file.
  classification  Base URI for this classification.
  codes           Base URI for the codelist.
  property        Defining property.
```
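To make the idea concrete, here is a minimal sketch of how such a class hierarchy could be written out as Turtle. It is not the code in `owl_classification.py`. The function name `classification_to_turtle`, the `(notation, parent)` input shape, the choice of `owl:equivalentClass` with an `owl:hasValue` restriction, and all `http://example.org/` URIs are assumptions made for illustration.

```python
from collections import defaultdict


def classification_to_turtle(codes, classification, code_base, prop):
    """Render (notation, parent_notation_or_None) pairs as OWL classes in Turtle.

    Each class is equivalent to "things whose `prop` has this concept as value";
    sibling classes under the same parent are declared mutually disjoint.
    """
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    children = defaultdict(list)
    for notation, parent in codes:
        children[parent].append(notation)
        lines.append(f"<{classification}{notation}> a owl:Class ;")
        if parent is not None:
            lines.append(f"    rdfs:subClassOf <{classification}{parent}> ;")
        lines.append("    owl:equivalentClass [")
        lines.append("        a owl:Restriction ;")
        lines.append(f"        owl:onProperty <{prop}> ;")
        lines.append(f"        owl:hasValue <{code_base}{notation}>")
        lines.append("    ] .")
        lines.append("")
    # One disjointness axiom per group of two or more siblings.
    for siblings in children.values():
        if len(siblings) > 1:
            members = " ".join(f"<{classification}{n}>" for n in siblings)
            lines.append(
                f"[] a owl:AllDisjointClasses ; owl:members ( {members} ) ."
            )
    return "\n".join(lines) + "\n"


# Hypothetical codelist and URIs, for illustration only.
ttl = classification_to_turtle(
    [("uk", None), ("england", "uk"), ("scotland", "uk")],
    classification="http://example.org/class/",
    code_base="http://example.org/concept/",
    prop="http://example.org/dimension/area",
)
print(ttl)
```

The four arguments mirror the script's positional arguments (codelist, classification base URI, codes base URI, defining property), but the output of the real script may differ in detail.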

[codelists/british-islands.csv](codelists/british-islands.csv) and
[codelists/british-isles.csv](codelists/british-isles.csv) provide
separate breakdowns of the two overlapping hierarchies in a
[table2qb][table2qb] style.

[codelists-metadata.json](codelists-metadata.json),
[columns.csv](columns.csv) and the blank
[components.csv](components.csv) are configuration files used by [table2qb][table2qb].

[prefixes.ttl](prefixes.ttl) is used to make the Turtle files more readable.

[skos.rdf](skos.rdf) is a copy of the SKOS ontology with some small
changes to remove the "lints" that might break reasoners.

[Makefile](Makefile) records the steps used to build the following
targets:

[british-isles.ttl](british-isles.ttl) and
[british-islands.ttl](british-islands.ttl) using [table2qb][table2qb]
to create SKOS concept schemes, then some `sed` to reference Wikidata
URIs and remove some straggling `skos:member` statements that
shouldn't be there, and Apache Jena's `riot` to tidy things into
readable Turtle.

[population-british-isles.ttl](population-british-isles.ttl) and
[population-british-islands.ttl](population-british-islands.ttl) using [csv2rdf.clj][csv2rdf.clj].

[british-isles-owl.ttl](british-isles-owl.ttl) and [british-islands-owl.ttl](british-islands-owl.ttl) using `owl_classification.py`.

test: checks the resulting data against the Data Cube Integrity Constraints.
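For a feel of what one of these constraints checks, the sketch below tests tidy-data rows for duplicate observations, i.e. two rows that share the same value for every dimension (IC-12 in the Data Cube specification, as far as I recall the numbering). This is a plain-Python stand-in, not how the `test` target works. The function name `duplicate_keys`, the column names and the figures are made up.

```python
import csv
import io
from collections import Counter


def duplicate_keys(rows, dimensions):
    """Return the dimension-value tuples that occur in more than one row."""
    counts = Counter(tuple(row[d] for d in dimensions) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Invented sample data: the first and third rows clash on (area, year).
sample = """area,year,population
england,2018,100
scotland,2018,50
england,2018,101
"""
rows = list(csv.DictReader(io.StringIO(sample)))
print(duplicate_keys(rows, ["area", "year"]))  # [('england', '2018')]
```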

**Using Pellet**

* as a general ontology lint tool (lint)

* running queries (query)

* checking unsatisfiability (unsat)

* showing the class hierarchy (classify) and where the instances fit (realization)

* materializing inferences (extract)

* as a query engine in Fuseki to act as a small SPARQL server with reasoning.

[table2qb]: https://github.com/Swirrl/table2qb
[csv2rdf.clj]: https://github.com/Swirrl/csv2rdf