joto created an issue (osm2pgsql-dev/osm2pgsql#2366)

osm2pgsql by default "clusters" all output tables by geometry. I have put 
"clusters" in quotes because this doesn't use the CLUSTER functionality of the 
database; we do it our own way.

How this works:

1. Data is first written into a kind of temporary table. It isn't a TEMP table 
in the SQL sense, but it is created as UNLOGGED, so no WAL is written, which 
speeds up writing. If the database crashes or is shut down uncleanly during the 
import, the table contents are automatically truncated by the database, but 
that's fine, we have to restart the import from scratch anyway.
2. After the data import is finished, a new "final table" is created as a 
normal table.
3. Then we do an INSERT into the new table, selecting all data from the 
temporary table with ORDER BY [geomcolumn]. So all data is copied over, but in 
sorted order.
4. We delete the temporary table.
5. Then we build a geometry index on the final table.
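
In SQL terms this looks roughly like the following sketch (table and column 
names are placeholders, not the exact names osm2pgsql generates):

```sql
-- Step 1: import target, UNLOGGED so no WAL is written during the bulk load.
CREATE UNLOGGED TABLE planet_osm_line_tmp (osm_id bigint, tags jsonb, way geometry);
COPY planet_osm_line_tmp FROM STDIN;  -- bulk import of the processed OSM data

-- Step 2: the final table is a normal (logged) table.
CREATE TABLE planet_osm_line (LIKE planet_osm_line_tmp);

-- Step 3: copy everything over, sorted by the geometry column.
INSERT INTO planet_osm_line SELECT * FROM planet_osm_line_tmp ORDER BY way;

-- Step 4: the unsorted copy is no longer needed.
DROP TABLE planet_osm_line_tmp;

-- Step 5: build the geometry index on the sorted data.
CREATE INDEX ON planet_osm_line USING gist (way);
```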

(You can see here why we are not using the PostgreSQL CLUSTER command. For 
CLUSTER we'd have to build the table, build the index and then run the CLUSTER 
command. The effect is that a) we could not use an UNLOGGED table for the first 
version of the table, because it is the same as the final one, and b) the index 
basically has to be built twice. This makes our approach more efficient, albeit 
a bit more complex.)
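
For comparison, the CLUSTER-based variant would look roughly like this (again 
with placeholder names). The table has to be a normal logged table from the 
start, and CLUSTER rewrites the table and then rebuilds its indexes, so the 
index work is effectively done twice:

```sql
CREATE TABLE planet_osm_line (osm_id bigint, tags jsonb, way geometry);
COPY planet_osm_line FROM STDIN;                        -- bulk import into the final (logged) table

CREATE INDEX planet_osm_line_way_idx
    ON planet_osm_line USING gist (way);                -- first index build, needed by CLUSTER

CLUSTER planet_osm_line USING planet_osm_line_way_idx;  -- rewrites the table and rebuilds the index
```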

## Problems

What we are doing here has two problems:

A. While we are copying the table in step (3), we temporarily need twice the 
amount of disk space, because we have the old and the new table around at the 
same time. A table with, say, all of the planet's roughly 1 billion ways in it 
can easily need 300 GB or more, so that's quite a lot of extra disk space. This 
is made worse because usually (if `--disable-parallel-indexing` is not used) we 
do step (3) for several tables in parallel. So this step means we need a lot 
more disk space during the import phase than is needed in later day-to-day 
operation of the database, which is an inefficient use of the available disk 
space.
B. We also need to do the sorting of the table contents in step (3). As long as 
the data being sorted fits into the amount of memory set with `work_mem`, the 
database will do the sort in memory. If it does not fit, the database will sort 
the data in chunks and write those out to disk. That means we'll have *another* 
temporary copy of the data on disk, compounding the problem explained above. It 
also means a lot more disk I/O, slowing down the whole process.
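
A minimal sketch of how `work_mem` comes into play during the copy (whether 
raising it per session is feasible depends on available RAM and on how many 
tables are being copied in parallel):

```sql
-- Hypothetical: raise the per-session sort memory before the ordered copy,
-- so the ORDER BY in step (3) is more likely to run fully in memory.
SET work_mem = '2GB';

INSERT INTO planet_osm_line SELECT * FROM planet_osm_line_tmp ORDER BY way;

RESET work_mem;
```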

A way to improve this would be to write the data in step (1) not into one table 
but into many tables, doing some kind of presorting, and then in step (3) sort 
and copy each of those temporary tables one after the other, dropping each one 
right after it has been copied. This would solve problem A, because we never 
have a full temporary copy and a full final copy on disk at the same time; at 
any point the extra space needed is roughly the size of the one small table 
currently being copied. And it can help with problem B, because ideally the 
sorting in step (3) can now be done in memory.

## Why are we doing the clustering?

It probably makes sense to quickly talk about why we are doing this 
["clustering"](https://osm2pgsql.org/doc/manual.html#clustering-by-geometry).

The idea is that data that will often be queried together (because it is 
geographically "near" each other) ends up in the same or adjacent blocks on 
disk. This makes reading the data much more efficient.

Note that this doesn't have to be perfect. In normal operation, because data is 
changed all the time, the table ordering will "degrade" over time anyway. 
Functionality is not affected if the ordering is imperfect; a good ordering 
just gives us a nice performance bonus.

## What does "ordering" mean for a geometry?

I have glossed over the question of what exactly ordering means in this 
context. How are geometries ordered? The answer is that modern PostGIS orders 
geometries along a [Hilbert curve](https://en.wikipedia.org/wiki/Hilbert_curve) 
based on the center of the bounding box of each geometry. You can read about 
the details 
[here](https://www.crunchydata.com/blog/waiting-for-postgis-3-hilbert-geometry-sorting).

This is actually pretty convenient, because if we want to split the large 
temporary tables into small ones, we only have to split the world into 
quadrants (or 16 tiles or 64 tiles and so on) and write the data for each one 
into one table. Then sorting each of those tables individually and appending 
them to the final table in the order given by the Hilbert curve should produce 
the right order.
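
A hedged sketch of what such a tile calculation could look like for a simple 
2×2 split, assuming the default EPSG:3857 output projection (the tile numbering 
here is purely illustrative; the real calculation would have to map the tiles 
onto the same Hilbert order PostGIS uses, so that copying the tiles in that 
order produces the right overall ordering):

```sql
-- Hypothetical 2x2 tiling over the Web Mercator extent: width_bucket() maps
-- the bounding box center of each geometry into column tx and row ty (0 or 1).
-- The (tx, ty) pair would then be mapped to its position on the Hilbert curve.
SELECT osm_id,
       width_bucket(ST_X(ST_Centroid(ST_Envelope(way))), -20037508.34, 20037508.34, 2) - 1 AS tx,
       width_bucket(ST_Y(ST_Centroid(ST_Envelope(way))), -20037508.34, 20037508.34, 2) - 1 AS ty
FROM planet_osm_line_tmp;
```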

## Proposal

We can use 
[partitioning](https://www.postgresql.org/docs/17/ddl-partitioning.html) on the 
output tables. We create the partitioned table with one extra column that we 
fill with the "tile" the output falls into. We'll write to the partitioned 
table as usual, adding the "tile" from our own Hilbert curve calculation, and 
PostgreSQL will put the data into the correct partition. After the import we 
copy the ordered content of each partition in turn (in the order given by the 
Hilbert curve) into the final table. After each copy we remove the partition 
that was just copied.
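
A minimal sketch of the proposed setup, assuming four tiles and placeholder 
names (the real number of tiles, the tile numbering and the copy order would 
come from the Hilbert curve calculation):

```sql
-- Import side: a partitioned table with the extra "tile" column and one
-- partition per tile; the partitions are created UNLOGGED so no WAL is
-- written during the import (the partitioned parent itself holds no data).
CREATE TABLE planet_osm_line_import (
    osm_id bigint,
    tags   jsonb,
    way    geometry,
    tile   int
) PARTITION BY LIST (tile);

CREATE UNLOGGED TABLE planet_osm_line_import_t0
    PARTITION OF planet_osm_line_import FOR VALUES IN (0);
CREATE UNLOGGED TABLE planet_osm_line_import_t1
    PARTITION OF planet_osm_line_import FOR VALUES IN (1);
-- ... and so on for the remaining tiles.

-- osm2pgsql writes to planet_osm_line_import as usual, filling "tile" from
-- its own Hilbert curve calculation; PostgreSQL routes each row to the
-- correct partition.

-- Final table, without the helper column.
CREATE TABLE planet_osm_line (osm_id bigint, tags jsonb, way geometry);

-- After the import: copy one tile at a time, in Hilbert order, and drop
-- each partition as soon as it has been copied.
INSERT INTO planet_osm_line
    SELECT osm_id, tags, way FROM planet_osm_line_import_t0 ORDER BY way;
DROP TABLE planet_osm_line_import_t0;

INSERT INTO planet_osm_line
    SELECT osm_id, tags, way FROM planet_osm_line_import_t1 ORDER BY way;
DROP TABLE planet_osm_line_import_t1;
-- ... then build the geometry index on the final table as before.
```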

Data is not spread uniformly around the globe, so the largest partition will 
need more than (all data / number of partitions) extra disk space, but with 16 
or 64 partitions the extra space should be reduced enough that it doesn't 
matter any more.

What about the number of tables we are creating here? It seems there are no 
limits we have any chance of hitting. From all I have read, having thousands of 
tables in PostgreSQL is totally fine and people routinely use that in 
production. And we'd only need them during the import anyway. So we should not 
run into any problems even if we have, say, 50 output tables and split each of 
them into 64 or even more subtables. Empty tables seem to consume less than 
1 kByte of disk space, so having thousands of them around unnecessarily also 
doesn't seem to be a problem.

## Options

Other options for getting a similar result:

* Use several tables as described, but without the partitioning feature. In 
this case we have to write to all the individual tables ourselves. That would 
mean a lot more changes in our code.
* Don't use temporary tables at all. Write the COPY output into files instead 
of directly into the database, using the same partitioning scheme. After all 
data is processed, sort each file ourselves and write all of them to the 
database in one go. This gives us a bit more control and should have the least 
overhead, because the database isn't involved at all until the end. But it 
needs more work and more code.


