diff --git a/README.md b/README.md
new file mode 100644
index 0000000..dfb69c1
--- /dev/null
+++ b/README.md
@@ -0,0 +1,83 @@
+Managing statistical classifications in OWL
+===========================================
+
+*Some explorations and noodlings on adding more explicit semantics to
+statistical classifications as a way to help manage the implications
+of relating classifications.*
+
+We've been representing statistical classifications in CSV and then
+converting to SKOS Concept Schemes by way of table2qb and CSV2RDF. The
+RDF Data Cube vocabulary uses SKOS concepts as the values of
+dimensions of observations in a data cube. SKOS, by design, doesn't
+provide much in the way of semantics, leaving it to an application to
+decide what skos:Concepts and the relations between them logically mean.
+
+The issue is that these semantics (and their logical implications) are
+directly coded into applications, rather than being explicit, separate
+logical rules. As such, it's hard to reason about what the
+implications are when we want to relate classifications to each other.
+
+Since a statistical classification divides a statistical population
+into subsets, normally
+[MECE](https://en.wikipedia.org/wiki/MECE_principle), it makes sense
+to model the classification (and hierarchy) as disjoint subsets (of
+subsets, etc.). OWL gives us the tools to model with sets and
+relations between them and to reason about the consequences of any
+restrictions.
+
+By way of example, we've taken two overlapping breakdowns of
+geography, the [British Isles](https://en.wikipedia.org/wiki/Terminology_of_the_British_Isles) and the British Islands, and created
+simple datasets about the populations of the various parts.
+
+![An Euler diagram with an overview of the terminology (public domain, TWCarlson, via Wikipedia)](https://en.wikipedia.org/wiki/Terminology_of_the_British_Isles#/media/File:British_Isles_Euler_diagram_15.svg)
+
+This directory contains the following:
+
+[population-british-isles.csv](population-british-isles.csv) contains
+the observations as Tidy Data in the style acceptable to table2qb.
+
+[population-british-islands.csv](population-british-islands.csv)
+contains the observations as Tidy Data in a simplified style.
+
+[population-british-isles.csv-metadata.json](population-british-isles.csv-metadata.json)
+gives the CSVW needed to convert the data into an RDF data cube using
+the W3C standard csv2rdf.
+
+[population-british-islands.csv-metadata.json](population-british-islands.csv-metadata.json) is similar, with some changes to cope with the simpler representation.
+
+[owl_classification.py](owl_classification.py) takes a typical CSV
+file as above, representing a statistical classification, expected to
+be MECE, and creates a hierarchy of classes (just sets really) of
+disjoint subclasses to represent the classification. Each class is
+defined "intensionally" as having as instances those qb:Observations
+whose dimension property has the corresponding SKOS concept as its
+value.
+
+```
+usage: owl_classification.py [-h] codelist classification codes property
+
+Create statistical classification as OWL
+
+positional arguments:
+  codelist        Codelist CSV file.
+  classification  Base URI for this classification.
+  codes           Base URI for the codelist.
+  property        Defining property.
+```
+
+[codelists/british-islands.csv](codelists/british-islands.csv) and
+[codelists/british-isles.csv](codelists/british-isles.csv) provide
+separate breakdowns of the two overlapping hierarchies in a table2qb
+style.
+
+[codelists-metadata.json](codelists-metadata.json),
+[columns.csv](columns.csv) and the blank
+[components.csv](components.csv) are configuration files used by table2qb.
+
+[prefixes.ttl](prefixes.ttl) is used to make the Turtle files more readable.
+
+[skos.rdf](skos.rdf) is a copy of the SKOS ontology with some small
+changes to remove the "lints" that might break reasoners.
+
+[Makefile](Makefile) is used to record the various steps used to create
+the following: