[Bug 60461] Replicate OSM to a database server accessible by Labs users

bugzilla-daemon Tue, 11 Feb 2014 10:00:52 -0800

https://bugzilla.wikimedia.org/show_bug.cgi?id=60461


--- Comment #8 from kakrue...@gmail.com ---
Ptolemy both hosts the db and does the rendering, which is a fairly common
setup, as the resource requirements of rendering and the db are nicely
complementary.

The toolserver (ptolemy), however, has always had a massive performance
problem, and we never managed to figure out quite why. Many of the database
operations are a factor of 10 or more slower than on the OSM tileserver or
other tileservers.

My guess (and I never managed to get real proof for it) is that a combination
of factors causes the problems.

1) Part of it is likely the SSDs. Experience with the OSMF tileserver and with
many other tileserver installations has shown that moving from HDD arrays to
SSDs is a huge boost for the type of workload generated by the postgres db.
Granted, most people have moved from arrays of perhaps 4 - 8 HDDs to a single
SSD, so with HDD arrays of e.g. 16 disks or more, performance might be on par
with a single SSD. However, given that the OSM postgres db is currently only
about 400GB, a single consumer 512GB SSD is much cheaper than a 16 disk array.
So value for money is simply better with SSDs for the OSM workload. Using
consumer SSDs appears to be perfectly fine as well, despite a reasonably heavy
constant write load. E.g. the OSM tileserver has a single 600GB Intel 320 SSD.
I can't remember exactly when it was installed, but I think it might be coming
up to 3 years now, and the media wearout indicator shows that only 4% of its
lifetime has been used. So at the current rate, the consumer SSD would support
roughly 60 years of writes.

2) Part of it, however, I believe is also software driven, and my guess would
be that the solaris kernel or filesystem has something to do with it (this is
more or less pure speculation, as I have no way of validating the hypothesis).
If you look at the average read throughput in the munin graphs, the values for
ptolemy are actually roughly on par with those of OSMF's tileservers. Sure,
latency has a lot to do with things, but I would expect that to still manifest
itself in lower average read rates if the processes were constantly waiting
for disk access. So my guess, given that the disk read rates are roughly
comparable but the resulting database performance is something like an order
of magnitude worse, is that either the disk cache in RAM is extremely
inefficient on ptolemy, or that there is something like read-ahead going on
and a lot of unnecessary data is read from disk.

3) The toolserver runs a number of stylesheets, some of which I believe are
much less efficient than the main osm style, adding extra load on the db.
However, as the main style is also much slower, and individual SQL queries run
directly in postgres show the huge slowdown as well, the performance problems
don't stem just from overload of the system.
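
As a rough check on the wear figure in point 1, the extrapolation can be
sketched as below. This is only a minimal sketch: it assumes wear accumulates
linearly over time, and since the install date is uncertain ("coming up to 3
years"), it tries both 2.5 and 3 years. The function name is made up for
illustration.

```python
def projected_total_lifetime_years(years_in_service: float,
                                   wear_used_fraction: float) -> float:
    """Linearly extrapolate total drive lifetime from the wear used so far.

    Assumption: wear accumulates at a constant rate, which is a simplification.
    """
    if not 0 < wear_used_fraction <= 1:
        raise ValueError("wear_used_fraction must be in (0, 1]")
    return years_in_service / wear_used_fraction


# 4% wear is the figure from the media wearout indicator; the service time
# is uncertain, so both ends of the plausible range are shown.
for years in (2.5, 3.0):
    total = projected_total_lifetime_years(years, 0.04)
    print(f"{years} years at 4% wear: about {total:.1f} years total, "
          f"{total - years:.1f} remaining")
```

Depending on the assumed install date this lands somewhere in the range of
roughly 60 to 75 years, so the exact number matters less than the conclusion
that write endurance is not the limiting factor here.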

Overall, despite the huge performance problems and the extra load due to more
styles, ptolemy can just about manage to do what is needed, probably because,
despite the maps being included in a number of desktop Wikipedias (e.g. de, ru
via the osm-gadget and en via the WMA), the load they put on the servers is
surprisingly tiny. Reducing the update frequency, and a huge problem with
sockets in solaris and perl that rejects something like 80% of rendering
requests due to failed socket initialisation, have also helped keep the
workload just about manageable on ptolemy. But it would be nice to do better.

Finally, regarding the 90Mb/s spikes: I can't say for sure what the cause is,
but rendering low zoom tiles (i.e. huge geographical areas) is much more
demanding in terms of pulling in data than rendering high zoom tiles, where
the spatial indexes make sure one only needs to read a tiny fraction of the
database. So Z0 - Z7 likely do seqscans instead of index scans and need to read
10s of GB of data per tile. Because of that, and because at a world-scale view
the map seldom changes, they are rendered infrequently and are probably
responsible for those spikes.
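
To illustrate why low zoom tiles touch so much more of the database, here is a
small sketch of how much of the world a single tile covers per zoom level. It
assumes the usual slippy-map tiling scheme, where zoom z has 2^z x 2^z tiles,
so each tile covers 1/4^z of the map. It only looks at geographic area and
says nothing about how much data a given style actually selects. The function
names are made up for illustration.

```python
def tiles_per_axis(zoom: int) -> int:
    """Number of tiles along one axis at a zoom level (assumed slippy-map scheme)."""
    return 2 ** zoom


def world_fraction_per_tile(zoom: int) -> float:
    """Fraction of the whole map area covered by one tile at this zoom level."""
    return 1.0 / (tiles_per_axis(zoom) ** 2)


# Compare a few low zoom levels with typical high zoom levels.
for zoom in (0, 4, 7, 12, 18):
    fraction = world_fraction_per_tile(zoom)
    print(f"Z{zoom:<2}: {tiles_per_axis(zoom) ** 2:>12} tiles, "
          f"each covering {fraction:.3e} of the world")
```

A Z0 tile covers the entire world, while a Z18 tile covers a vanishingly small
part of it, which is where a spatial index pays off.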

-- 
You are receiving this mail because:
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
