Newer
Older
DataReport / family-scope / family-scope.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Investigating the coverage of the trade datasets family\n",
    "## Setup SPARQL API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from SPARQLWrapper import SPARQLWrapper2\n",
    "import pandas as pd\n",
    "from IPython.display import HTML\n",
    "\n",
    "endpoint = \"https://production-drafter-ons-alpha.publishmydata.com/v1/sparql/live\"\n",
    "sparql = SPARQLWrapper2(endpoint)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Retreive Data Structure Definitions\n",
    "This retreives the codelist(s) for each dimension, for each dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('./dataset-structure.sparql', 'r') as query_file:\n",
    "    query_str = query_file.read()\n",
    "\n",
    "sparql.setQuery(query_str)\n",
    "\n",
    "results = pd.DataFrame(sparql.query().bindings)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create URI presenters\n",
    "We can create presenters that will provide, for each URI, an html link to the resource (i.e. the URI itself) with the `rdfs:label` used as the text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def uri_to_label(uri_column,label_column):\n",
    "    return(dict(zip(results[uri_column].map(lambda x: x.value),\n",
    "                    results[label_column].map(lambda x: x.value))))\n",
    "\n",
    "labeller = {**uri_to_label('dataset_uri','dataset'),\n",
    "            **uri_to_label('dimension_uri','dimension'),\n",
    "            **uri_to_label('codelist_uri','codelist')}\n",
    "\n",
    "def presenter(uri):\n",
    "    return(f'<a href=\"{uri}\">{labeller[uri]}</a> ')\n",
    "\n",
    "def lister(presenters):\n",
    "    return(presenters.str.join(''))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Present Codelists in a matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Prepare a matrix of datasets vs dimensions, with each cell summarising the codelists being used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "codelists = results[['dataset_uri','dimension_uri','codelist_uri']].applymap(lambda x: presenter(x.value))\n",
    "codelists_pivot = codelists.pivot_table(index='dataset_uri',columns='dimension_uri',values='codelist_uri',aggfunc=lister)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sort the matrix such that the most commonly re-used dimensions are to the left hand side."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>dimension_uri</th>\n",
       "      <th><a href=\"http://gss-data.org.uk/def/dimension/flow\">Flow</a></th>\n",
       "      <th><a href=\"http://gss-data.org.uk/def/dimension/product\">Product</a></th>\n",
       "      <th><a href=\"http://gss-data.org.uk/def/dimension/trade-partner-geography\">Trade Partner Geography</a></th>\n",
       "      <th><a href=\"http://gss-data.org.uk/def/dimension/seasonal-adjustment\">Seasonal Adjustment</a></th>\n",
       "      <th><a href=\"http://gss-data.org.uk/def/dimension/price-classification\">Price Classification</a></th>\n",
       "      <th><a href=\"http://gss-data.org.uk/def/dimension/employment\">Employment</a></th>\n",
       "      <th><a href=\"http://gss-data.org.uk/def/dimension/age-of-business\">Age of Business</a></th>\n",
       "      <th><a href=\"http://gss-data.org.uk/def/dimension/turnover\">Turnover</a></th>\n",
       "      <th><a href=\"http://gss-data.org.uk/def/dimension/trade-reporter-geography\">Trade Reporter Geography</a></th>\n",
       "      <th><a href=\"http://gss-data.org.uk/def/dimension/international-trade-basis\">International Trade Basis</a></th>\n",
       "      <th><a href=\"http://gss-data.org.uk/def/dimension/export-and-import-activity\">Export and Import Activity</a></th>\n",
       "      <th><a href=\"http://gss-data.org.uk/def/dimension/country-of-ownership\">Country of Ownership</a></th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dataset_uri</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th><a href=\"http://gss-data.org.uk/data/hmrc-overseas-trade-statistics\">HMRC Overseas Trade Statistics</a></th>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/flow-directions\">Flow Directions</a></td>\n",
       "      <td>[<a href=\"https://trade.ec.europa.eu/def/cn_2012\">Combined Nomenclature, 2012</a> , <a href=\"https://trade.ec.europa.eu/def/cn_2013\">Combined Nomenclature, 2013</a> , <a href=\"https://trade.ec.europa.eu/def/cn_2014\">Combined Nomenclature, 2014</a> , <a href=\"https://trade.ec.europa.eu/def/cn_2015\">Combined Nomenclature, 2015</a> , <a href=\"https://trade.ec.europa.eu/def/cn_2016\">Combined Nomenclature, 2016</a> ]</td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/hmrc-geographies\">HMRC Geographies</a></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th><a href=\"http://gss-data.org.uk/data/hmrc-regional-trade-statistics\">HMRC Regional Trade Statistics</a></th>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/flow-directions\">Flow Directions</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/sitc-4\">SITC 4</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/hmrc-geographies\">HMRC Geographies</a></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/hmrc-regions\">HMRC Regions</a></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th><a href=\"http://gss-data.org.uk/data/hmrc-uk-trade-in-goods-statistics-by-business-characteristics-2015\">HMRC UK Trade in Goods Statistics by Business Characteristics 2015</a></th>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/flow-directions\">Flow Directions</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/hmrc-industry-groups\">HMRC Industry Groups</a></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/employment-size-bands\">Employment Size Bands</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/age-of-business\">Age of Business</a></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th><a href=\"http://gss-data.org.uk/data/ons-abs\">ONS ABS</a></th>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/employment-size-bands\">Employment Size Bands</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/age-of-business\">Age of Business</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/turnover-size-bands\">Turnover Size Bands</a></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/export-and-import-activity\">Export and Import Activity</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/countries-of-ownership\">Countries of Ownership</a></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th><a href=\"http://gss-data.org.uk/data/ons-balance-of-payments\">ONS Balance of Payments</a></th>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/flow-directions\">Flow Directions</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/bop-services\">BOP Services</a></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th><a href=\"http://gss-data.org.uk/data/ons-bop-individual-country-data\">ONS BoP Individual Country Data</a></th>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/flow-directions\">Flow Directions</a></td>\n",
       "      <td></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/ons-trade-areas-cord\">ONS Trade Areas CORD</a></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th><a href=\"http://gss-data.org.uk/data/ons-cpa\">ONS CPA</a></th>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/flow-directions\">Flow Directions</a></td>\n",
       "      <td><a href=\"http://stamina-project.org/codes/cpav2008/cpa\">Statistical Classification of Products by Activity in the European Economic Community, 2008 version</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/ons-trade-areas-legacy\">ONS Trade Areas Legacy</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/seasonal-adjustments\">Seasonal Adjustments</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/price-classifications\">Price Classifications</a></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/international-trade-basis\">International Trade Basis</a></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th><a href=\"http://gss-data.org.uk/data/ons-pink-book-chapter-3\">ONS Pink Book Chapter 3</a></th>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/flow-directions\">Flow Directions</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/pink-book-services\">Pink Book Services</a></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th><a href=\"http://gss-data.org.uk/data/ons-trade-in-goods-mrets\">ONS Trade in goods MRETS</a></th>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/flow-directions\">Flow Directions</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/cord-sitc\">CORD SITC</a></td>\n",
       "      <td>[<a href=\"http://gss-data.org.uk/def/concept-scheme/ons-trade-areas-cord\">ONS Trade Areas CORD</a> , <a href=\"http://gss-data.org.uk/def/concept-scheme/ons-trade-areas-legacy\">ONS Trade Areas Legacy</a> ]</td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/seasonal-adjustments\">Seasonal Adjustments</a></td>\n",
       "      <td><a href=\"http://gss-data.org.uk/def/concept-scheme/price-classifications\">Price Classifications</a></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dimension_counts = codelists_pivot.count('rows').sort_values(ascending=False)\n",
    "codelists_pivot = codelists_pivot.reindex(columns=dimension_counts.index).fillna('')\n",
    "pd.set_option('max_colwidth', -1) # don't truncate cells\n",
    "with open(\"dataset-dimensions.html\", \"w\") as f:\n",
    "    f.write(codelists_pivot.to_html(escape=False))\n",
    "HTML(codelists_pivot.to_html(escape=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that although multiple datasets re-use the same common dimensions, they often use their own codelists. We know that there exist correspondances between these (e.g. BPM6 codes can be used to relate BOP and PB definitions) although we've not published the links for those yet."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}