Re: [Geocoding] how to estimate hardware needed for nominatim osm2pgsql

2018-03-02 Thread Paul Norman

On 3/2/2018 2:01 AM, Josip Rodin wrote:

> On Thu, Mar 01, 2018 at 09:41:54AM -0800, Paul Norman wrote:

> > > when I import the nodes for all of Europe, the ways get processed at a
> > > rate of 30/s

> > It's slow during the osm2pgsql import stage. General advice for
> > osm2pgsql applies here. For a large import, you want more RAM. Ideally,
> > you should have enough cache to fit all the node positions in RAM. For
> > Europe, this is probably 20GB to 25GB on a machine with 32GB of RAM.

> Yesterday the Europe import told me that it processed 2045878k nodes.
> At 8 bytes per lat and 8 per long, that sounds more like 30.5 GB? Not sure
> where osm2pgsql reads it from... st_memsize(place.geometry) seems to return
> 32 bytes actually, would that imply 41 GB? That would seem to match the size
> of the flatnode file, too.


Node positions take 8 bytes per node, and cache efficiency is about 85% 
for the full planet. I haven't done an import for Europe recently, but 
taking 60% as a guess, that would give 26GB cache needed for all node 
positions.


Because flat nodes are persistent, they're designed differently, and 
take 8 bytes * maximum node ID + a few hundred bytes for headers.
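
To put rough numbers on those two rules of thumb, a quick back-of-the-envelope
sketch (the Europe node count is the one from your log; the maximum node ID and
the 60% efficiency are only approximate figures for early 2018):

# Back-of-the-envelope sizing for the osm2pgsql node cache and the flat-node
# file, using the rules of thumb above. The node count is the one reported by
# the Europe import; the maximum node ID and the 60% efficiency are only
# rough guesses for early 2018.

def node_cache_gib(node_count, efficiency):
    # RAM needed to hold all node positions at 8 bytes per node,
    # scaled up by the cache efficiency.
    return node_count * 8 / efficiency / 1024 ** 3

def flat_node_file_gib(max_node_id, header_bytes=400):
    # Flat-node file: 8 bytes per possible node ID plus a small header,
    # independent of which extract you import.
    return (max_node_id * 8 + header_bytes) / 1024 ** 3

europe_nodes = 2_045_878_000    # processed nodes reported by the Europe import
max_node_id = 5_500_000_000     # approximate highest node ID in the planet, early 2018

print(f"Europe node cache at 60% efficiency: {node_cache_gib(europe_nodes, 0.60):.1f} GiB")
print(f"Flat-node file:                      {flat_node_file_gib(max_node_id):.1f} GiB")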



> Anyway, a more pertinent point would be how the size of the osm2pgsql cache
> correlates to that, i.e. how do we estimate that it would organize itself
> in a way that 20 to 25 GB would be enough to get a good hit rate?


The easiest way to get cache efficiency is to look at the log output 
after an import. You could write external software that calculates the 
efficiency for a given list of nodes, but it's easier to run osm2pgsql 
with excess cache (using -O null if you're doing it a lot). Using my 
data from 2015 and https://github.com/openstreetmap/osm2pgsql/pull/441 I 
got 84.5% efficiency for the planet, 62% for Europe, and 50-59% for PBFs 
of 2GB and smaller.
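
If you do want to estimate it without a full run, the idea behind such an
external calculation is roughly the following. This is only a simplified model
of how the cache trades dense blocks against sparse entries, not the real
osm2pgsql code; the block size, entry sizes and cutoff are illustrative values.

# Simplified model of osm2pgsql cache efficiency: node positions are stored
# either in dense blocks of consecutive IDs or as individual sparse entries,
# and efficiency is useful bytes divided by allocated bytes. The constants
# below are illustrative, not the values the real cache uses.
from collections import Counter

BLOCK_SIZE = 1024      # assumed node IDs per dense block
DENSE_ENTRY = 8        # bytes per node position in a dense block
SPARSE_ENTRY = 16      # assumed bytes per sparse entry (node ID + position)

def cache_efficiency(node_ids):
    per_block = Counter(nid // BLOCK_SIZE for nid in node_ids)
    allocated = sum(min(BLOCK_SIZE * DENSE_ENTRY, occupied * SPARSE_ENTRY)
                    for occupied in per_block.values())
    useful = sum(per_block.values()) * DENSE_ENTRY
    return useful / allocated

# Contiguously numbered nodes pack well, scattered ones do not.
print(cache_efficiency(range(1, 1_000_001)))        # close to 1.0
print(cache_efficiency(range(1, 10_000_000, 50)))   # around 0.5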


___
Geocoding mailing list
Geocoding@openstreetmap.org
https://lists.openstreetmap.org/listinfo/geocoding


Re: [Geocoding] how to estimate hardware needed for nominatim osm2pgsql

2018-03-02 Thread Josip Rodin
On Thu, Mar 01, 2018 at 09:41:54AM -0800, Paul Norman wrote:
> > when I import the nodes for all of Europe, the ways get processed at a
> > rate of 30/s
> 
> It's slow during the osm2pgsql import stage. General advice for 
> osm2pgsql applies here. For a large import, you want more RAM. Ideally, 
> you should have enough cache to fit all the node positions in RAM. For 
> Europe, this is probably 20GB to 25GB on a machine with 32GB of RAM.

Yesterday the Europe import told me that it processed 2045878k nodes.
At 8 bytes per lat and 8 per long, that sounds more like 30.5 GB? Not sure
where osm2pgsql reads it from... st_memsize(place.geometry) seems to return
32 bytes actually, would that imply 41 GB? That would seem to match the size
of the flatnode file, too.

Anyway, a more pertinent point would be how the size of the osm2pgsql cache
correlates to that, i.e. how do we estimate that it would organize itself
in a way that 20 to 25 GB would be enough to get a good hit rate?

And, conversely, if we know that it will order the operations in a way
that produces a good hit rate, what are the parameters behind that -
maybe going beyond 16 GB won't reduce the import time significantly...?

In retrospect, my 5 GB cache for 41 GB of data does seem way too optimistic.

> Keep in mind that even with regular use, database workloads like Nominatim
> perform best with plenty of RAM.

It would be useful to have some more info beforehand on that, too, like what
are the most relevant indexes for each use case (geocoding, reverse
geocoding, ...), what is their pg_total_relation_size(), how fragmented do they
get over time, ...
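
A first look at the sizes is easy enough to get out of PostgreSQL itself, e.g.
something like the sketch below (assuming psycopg2 and a locally reachable
database called nominatim; adjust the DSN to taste) - it's the per-use-case
relevance and the fragmentation behavior that I can't get that way.

# List the largest indexes in the database by on-disk size.
# Assumes psycopg2 is installed and the database is reachable locally as
# "nominatim"; adjust the DSN for your setup.
import psycopg2

conn = psycopg2.connect("dbname=nominatim")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT c.relname,
               pg_size_pretty(pg_relation_size(c.oid)) AS size
        FROM pg_class c
        WHERE c.relkind = 'i'
        ORDER BY pg_relation_size(c.oid) DESC
        LIMIT 20
    """)
    for name, size in cur.fetchall():
        print(f"{size:>12}  {name}")
conn.close()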

-- 
 2. That which causes joy or happiness.



Re: [Geocoding] how to estimate hardware needed for nominatim osm2pgsql

2018-03-01 Thread Paul Norman

On 3/1/2018 4:46 AM, Josip Rodin wrote:

> I observed an issue with Nominatim import performance that is described
> at https://github.com/openstreetmap/Nominatim/issues/954

> Long story short - on the same machine, ways processing is egregiously
> slower than node processing; when I import the nodes for all of Europe,
> the ways get processed at a rate of 30/s; when I import the nodes for
> just France, the ways get processed at 27700/s.

> How does one go about debugging that? The documentation doesn't help much.

> I could let it drag along once at 30/s, but I don't want the same situation
> to persist with the updates later, which could render the whole setup useless.

> There are reports online that avoiding OVH Ceph storage would help. Would it?
> There doesn't seem to be any obvious variation in its behavior that could
> be directly attributed to it.

> There's an implication in the hardware requirements that having more memory
> would be helpful. Would that do the trick? I'm seeing the same memory usage
> graph with both inputs.


It's slow during the osm2pgsql import stage. General advice for 
osm2pgsql applies here. For a large import, you want more RAM. Ideally, 
you should have enough cache to fit all the node positions in RAM. For 
Europe, this is probably 20GB to 25GB on a machine with 32GB of RAM.


You cannot compare nodes processed per second to ways processed per 
second and say one is faster than the other. They're measuring different 
things.


If you're using some kind of cloud storage, IO latency is likely a big 
issue. Cloud storage typically can only support good IOPS with a high 
queue depth and/or many requests in parallel. Fortunately, if you're 
using cloud storage, it's normally easy to get a machine with enough RAM 
for the import, then switch it for regular use. Keep in mind that even 
with regular use, database workloads like Nominatim perform best with 
plenty of RAM.
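
If you want to check whether storage latency is what's hurting you, a crude
single-threaded probe like the one below gives a first number. The path is a
placeholder - point it at a large existing file (e.g. the flat-node file) on
the disk in question; note it measures latency only, not the parallel IOPS
cloud storage is sized for, and the OS page cache can flatter the result.

# Crude random-read latency probe for the storage backing the database.
# The path is a placeholder; point it at a large existing file on the disk
# under test. Single-threaded, so it shows latency rather than the IOPS the
# storage can reach at high queue depth; the OS page cache can make the
# numbers look better than the raw device.
import os
import random
import time

PATH = "/path/to/large/file"   # placeholder
READ_SIZE = 8192
SAMPLES = 1000

fd = os.open(PATH, os.O_RDONLY)
file_size = os.fstat(fd).st_size
start = time.perf_counter()
for _ in range(SAMPLES):
    offset = random.randrange(0, max(file_size - READ_SIZE, 1))
    os.pread(fd, READ_SIZE, offset)
elapsed = time.perf_counter() - start
os.close(fd)

print(f"average random read latency: {elapsed / SAMPLES * 1000:.2f} ms")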

