Ingesting the Reference Geoscape Datasets
This post forms part of the ongoing #TagJob project.
In the previous post I introduced two Geoscape datasets that have been made available on the Australian Government's data.gov.au website: National Roads and Administrative Boundaries. The datasets are distributed in two different formats, neither of which is optimal for my intended spatial processing model. A first task is then to transform the data to a common format, and one that has the right performance characteristics for the project.
My spatial processing model will cache required meta-data and geometry in RAM, trading significantly higher memory requirements in exchange for significantly faster data access. Loading data into memory requires a high-performance storage engine, and for that I have selected DuckDB.
DuckDB is highly performant in terms of storage and execution, and is further recommended for this application by a trait that might often be seen as a limitation: it's an embedded database. So, while it can't do the client-server dance, DuckDB will deliver data to my application without an intermediate network and the overheads that brings. Better still, DuckDB's spatial extension, and particularly its GDAL integration, makes it fairly straightforward to ingest both National Roads in GDB format and Administrative Boundaries in SHP format. For example:
create table map_feature_state_polygon as
select * from ST_Read('ACT_STATE_POLYGON_shp.dbf');
However, the code I've written does get just a little more complicated. Firstly, the datasets are distributed in a hierarchical directory structure, sometimes with separate files (or actually sets of files) for each state or territory. So I'm fishing through the directory hierarchy for those files, and then joining their contents into single tables.
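To make that "fishing" step concrete, here is a minimal Python sketch. It is not the code from the repository: the function names, the `*_STATE_POLYGON_shp.dbf` file pattern and the use of `union all` to stitch the per-state files together are all assumptions made for illustration. It only builds the SQL text; actually running the statement would still need DuckDB with the spatial extension loaded.

```python
from pathlib import Path


def find_layer_files(root: Path, pattern: str) -> list[Path]:
    """Recursively collect files under root whose name matches pattern.

    The result is sorted so the generated SQL is stable between runs.
    """
    return sorted(p for p in root.rglob(pattern) if p.is_file())


def build_union_sql(table: str, files: list[Path]) -> str:
    """Build one CREATE TABLE statement that reads every file via ST_Read.

    Assumes all files share the same columns, so a plain UNION ALL is enough.
    """
    if not files:
        raise ValueError(f"no source files found for {table}")
    selects = [
        # Double any single quote so the path stays a valid SQL string literal.
        "select * from ST_Read('{}')".format(f.as_posix().replace("'", "''"))
        for f in files
    ]
    return f"create table {table} as\n" + "\nunion all\n".join(selects) + ";"
```

Given a (hypothetical) download directory, `build_union_sql("map_feature_state_polygon", find_layer_files(download_dir, "*_STATE_POLYGON_shp.dbf"))` would return a single statement with one `ST_Read` call per state or territory file.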
And secondly, I have elected to “normalise out” coded values and recurring text values (e.g. road names), replacing them with integer foreign keys. My rationale is thus:
- this is how I'll store the data in memory once loaded (to save space), and so it avoids doing any such conversion during the load
- notwithstanding DuckDB's storage smarts, I'm still hoping for space efficiency on disk
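The idea of "normalising out" a recurring text value can be sketched in a few lines of Python. This is an illustration only, under assumptions of my own: the function name, the `_id` suffix and the zero-based keys are hypothetical, and the repository may well do the equivalent work in SQL rather than in application code.

```python
def normalise_column(rows: list[dict], column: str) -> tuple[list[str], list[dict]]:
    """Replace a repeated text column with integer keys into a lookup list.

    Returns (lookup, new_rows): lookup[key] is the original text, and each
    new row carries `<column>_id` in place of `<column>`.
    """
    ids: dict[str, int] = {}  # text value -> integer key, in first-seen order
    lookup: list[str] = []    # integer key -> text value
    new_rows: list[dict] = []
    for row in rows:
        value = row[column]
        key = ids.get(value)
        if key is None:
            # First time this value is seen: give it the next free key.
            key = len(lookup)
            ids[value] = key
            lookup.append(value)
        # Copy the row without the text column, then add the foreign key.
        new_row = {k: v for k, v in row.items() if k != column}
        new_row[f"{column}_id"] = key
        new_rows.append(new_row)
    return lookup, new_rows
```

Each distinct string is stored once in the lookup list, and every row keeps only a small integer, which is where the space saving comes from when a value such as a road name repeats across many rows.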
I am initially focusing on the following datasets / layers, but may add more down the track:
- National Roads (4,340,757 rows)
- Administrative Boundaries
  - State Polygon (12,844 rows)
  - Local Government Area Polygon (2,210 rows)
  - Locality Polygon (15,782 rows)
The code for this article is in the TagJobSpatial repository here.
On my MacBook Pro (M1 Max), the load takes approximately 1 minute, and the resulting DuckDB database is about 2.5 gigabytes.
Tags: #TagJob #Geospatial #DuckDB