[josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM
I've been hacking on the JOSM validator plugin for a while. One of the repeating hard problems that comes up are doing the UnconnectedWays tests. You need to do searches for every segment in a way to see if there are any nearby nodes. This generally means that you do a number of searches on the same order as the number of nodes that you have. If you have 100k way nodes, you end up doing ~90k searches (one for each segment) each looking at a sizable number of those 100k nodes. Any way you slice it, you end up doing billions of searches since it's an O(n^2) algorithm. A single validation run with 50k nodes takes me 183 seconds. I indexed 500,000 nodes from Oregon in the US. I then took another 3,000,000 nodes and searched to find which of those 500,000 were near them. Each iteration over 500,000 nodes for me takes ~0.28 seconds. If I were to do 3,000,000 searches, it would take 1.63 days! So, I attempted to optimize this. The messages is lengthy one, but feel free to stop and look at the code now. If you want to know how I got that 183-second run down to 2 seconds, read down. :) -- My first attempt to optimize was to do binary searches. I indexed all of the nodes in two arrays, one by x coordinate and one by y. I did a binary search for each axis, iterated out from the place that the binary search found (representing the radius of the search). I then combined the two searches. Algorithmically, it's still an O(n^2) algorithm since there's an iteration, but it was significantly faster since the number of nodes that were touched was smaller. But, I still ran into problems with large data sets because I could still get thousands of nodes returned from each axis search. Optimizing, I instead kept just a single index x-axis. I then sorted the result of the x-axis binary search by the y-coordinate and binary searched on that. That's as fast as I ever got that algorithm to go. It ended up way better than the linear search (~600x!) but still takes ~200s for the 3,000,000 node search. Can we make the binary search better, or is there was still a fundamental limitation in that algorithm? I think I was up against a pretty fundamental limitation if I chose to sort by a single coordinate at a time: during the search, it touched a *LOT* of virtually random data. Since I indexed all of the coordinates by longitude, a point in Greenwich could be in the binary x-axis index next to a point in Algeria and next to another in Antarctica (all at 0 longitude). There's a lot of jumping around to pseudo-random points all over the data set. That thrashes the CPU caches and TLB. I was intrigued by the quadtiling algorithm that we currently use on the database servers. http://wiki.openstreetmap.org/wiki/QuadTiles If we were to make a data structure that represents the tiles, we could chop up the world into much smaller pieces. Searches would also potentially be faster since we could operate on subsets of the data much more easily. We would not be dealing with the whole world of random data. So, I went and implemented it. I call it QuadBuckets, and it's basically an unbalanced 4-way radix tree structure. We radix on the bits in the quad index. It stores individual data in quad bucket levels (QBLevel)s that can either contain pointers to 4 more child QBLevels or may act as a leaf node and store arbitrarily long lists of OSM Nodes. When a leaf node gets too large (my testing shows that 16 or 32 elements is the best maximum size for performance.. see todo list), the leaf is split into its 4 constituent child tiles and its contents are redistributed among them. Several splits (and new tree levels) may be required if the stored coordinates were very close together. The quad tiling in the database uses 16 bits of precision. This means that the least-significant quadtile bit represents an area about 610m (40,000km / 2^16) across (and half that tall) at the equator. On real data sets, this means the buckets are just too large. So, I added another 8 bits of precision so that we have buckets that are now ~2.4m across. Making sure buckets don't get too big is one of the keys to this algorithm, and these buckets are small enough. I also implemented a simple search cache. It records the last QBLevel node from which the last search found results. When the next search request is made, we check to see if this tree node can satisfy the search, and avoid walking too far down the tree each time. If the tree node can not satisfy the search, we walk back up the tree until the search can be satisfied. This optimization speeds things up by ~25%. For testing, I first used random data spanning the globe, basically creating nodes with random latitudes and longitudes. This is a near worst-case for QuadBuckets. Despite this, I got the QuadBuckets approach to be about 30% faster than the binary searches. But, using real OSM data, it really shines. I threw the 3,000,000 search workload at it. Remember,
[josm-dev] [PATCH 2/2] use QuadBuckets as node storage
QuadBuckets also happen to implement CollectionNode. So, we can just plug it in for Collection like in the DataSet class. --- core-dave/src/org/openstreetmap/josm/data/osm/DataSet.java |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff -puN src/org/openstreetmap/josm/data/osm/DataSet.java~QuadBuckets-as-node-storage src/org/openstreetmap/josm/data/osm/DataSet.java --- core/src/org/openstreetmap/josm/data/osm/DataSet.java~QuadBuckets-as-node-storage 2009-09-12 08:42:07.0 -0700 +++ core-dave/src/org/openstreetmap/josm/data/osm/DataSet.java 2009-09-12 08:42:07.0 -0700 @@ -16,6 +16,7 @@ import java.util.List; import java.util.Set; import org.openstreetmap.josm.data.SelectionChangedListener; +import org.openstreetmap.josm.data.osm.QuadBuckets; /** * DataSet is the data behind the application. It can consists of only a few points up to the whole @@ -37,7 +38,7 @@ public class DataSet implements Cloneabl * All nodes goes here, even when included in other data (ways etc). This enables the instant * conversion of the whole DataSet by iterating over this data structure. */ -public CollectionNode nodes = new LinkedListNode(); +public CollectionNode nodes = new QuadBuckets(); /** * All ways (Streets etc.) in the DataSet. _ ___ josm-dev mailing list josm-dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/josm-dev
Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM
Dave Hansen writes: So, I went and implemented it. I call it QuadBuckets, and it's basically an unbalanced 4-way radix tree structure. Well done, Dave! -- --my blog is athttp://blog.russnelson.com Crynwr supports open source software 521 Pleasant Valley Rd. | +1 315-323-1241 Potsdam, NY 13676-3213 | Sheepdog ___ josm-dev mailing list josm-dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/josm-dev
Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM
On Saturday 12 September 2009, Dave Hansen wrote: I've been hacking on the JOSM validator plugin for a while. One of the repeating hard problems that comes up are doing the UnconnectedWays tests. You need to do searches for every segment in a way to see if there are any nearby nodes. This generally means that you do a number of searches on the same order as the number of nodes that you have. ... Dave, I realize this is quite a crude reply to such a detailed email. Did you investigate kd-trees at all? http://en.wikipedia.org/wiki/Kd_tree robert. ___ josm-dev mailing list josm-dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/josm-dev
Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM
On Sat, 2009-09-12 at 17:58 +0100, Robert Scott wrote: On Saturday 12 September 2009, Dave Hansen wrote: I've been hacking on the JOSM validator plugin for a while. One of the repeating hard problems that comes up are doing the UnconnectedWays tests. You need to do searches for every segment in a way to see if there are any nearby nodes. This generally means that you do a number of searches on the same order as the number of nodes that you have. ... I realize this is quite a crude reply to such a detailed email. Did you investigate kd-trees at all? http://en.wikipedia.org/wiki/Kd_tree Nope. I did some searching to find multidimensional search algorithms before I started this, but none of them seemed too horribly nice. kd-trees do like quite nice. I do believe what I've implemented is relatively close to what they do, though. They have the advantage that they split in a single dimension at each level in the tree. QuadBuckets are hard-coded for 2 dimensions and just basically split into those two dimensions at once. But, I think there is a lot of similarity. It also looks like kd-trees are intelligent about the values at which they decide to split. This means that if you have a bunch of entries all bunched up into one end of a bucket and you split it, you'll always get them split in half. QuadBuckets will instead force a bunch more splits until the buckets are well-split. QuadBuckets are stupid about where the split: they only do it into geometric quarters of the original bucket. I basically chose the algorithm that I did since it modeled something else that I and other OSMers are familiar with. It's even possible that we could use the quadtile indexes in a way to make communication with the DB server faster, although I'm using slightly different calculations than it is as it stands now. If someone knows of any existing Java kd-tree implementations, I'd be happy to look into it and see if it could be applied here. I love nothing more than to throw my own code away. Seriously. ;) -- Dave ___ josm-dev mailing list josm-dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/josm-dev
Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM
On Sat, Sep 12, 2009 at 5:11 PM, Dave Hansen d...@sr71.net wrote: If someone knows of any existing Java kd-tree implementations, I'd be happy to look into it and see if it could be applied here. I love nothing more than to throw my own code away. Seriously. ;) Google turned this up for kd-tree java: http://stackoverflow.com/questions/253767/kdtree-implementation-in-java ___ josm-dev mailing list josm-dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/josm-dev
Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM
On Sat, 2009-09-12 at 17:25 +, Ævar Arnfjörð Bjarmason wrote: On Sat, Sep 12, 2009 at 5:11 PM, Dave Hansen d...@sr71.net wrote: If someone knows of any existing Java kd-tree implementations, I'd be happy to look into it and see if it could be applied here. I love nothing more than to throw my own code away. Seriously. ;) Google turned this up for kd-tree java: http://stackoverflow.com/questions/253767/kdtree-implementation-in-java One of those links is a 404, and the other looks like a C++ class that has a Java API. I'm not sure that'd be very easy to integrate into JOSM. Thanks for the links, though! -- Dave ___ josm-dev mailing list josm-dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/josm-dev
Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM
On Sat, Sep 12, 2009 at 5:31 PM, Dave Hansen d...@sr71.net wrote: On Sat, 2009-09-12 at 17:25 +, Ęvar Arnfjörš Bjarmason wrote: On Sat, Sep 12, 2009 at 5:11 PM, Dave Hansen d...@sr71.net wrote: If someone knows of any existing Java kd-tree implementations, I'd be happy to look into it and see if it could be applied here. I love nothing more than to throw my own code away. Seriously. ;) Google turned this up for kd-tree java: http://stackoverflow.com/questions/253767/kdtree-implementation-in-java One of those links is a 404, and the other looks like a C++ class that has a Java API. I'm not sure that'd be very easy to integrate into JOSM. Thanks for the links, though! You can find that 404-ed class on Google Code: http://google.com/codesearch?q=KDTree.java ___ josm-dev mailing list josm-dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/josm-dev
Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM
On Sat, 2009-09-12 at 21:04 +0200, Petr Nejedlý wrote: Dave Hansen napsal(a): Ooh, I forgot about josm-ng. That one looks very usable. If mine doesn't pan out, I'll certainly look at that one. Well, the QTree in josm-ng is similar to yours in the way it distributes the content, though I did pay quite some attention to memory usage (hint: e.g. LinkedList is not the collection you'd like to use unless you have specific reason to) For the leaf nodes you mean? I actually got better performance out of it than I did ArrayList. Honestly, I've been looking at performance a lot more than pure memory usage, so I bet you're right. But, when you have an absolute ton of ArrayLists around that you're iterating over a lot ArrayList.size() actually shows up in the profiles pretty high. and support also 2d entities (node is zero-d). The problem with importing it directly into josm is that it uses (in -ng) the projected (and integer) coordinates. While it doesn't care whether the coordinates are projected or not, it hugely benefit from their signed integer nature. You just mean that they're cheaper to deal with than doubles/floats? Dave Hansen napsal(a): QuadBuckets also happen to implement CollectionNode. So, we can just plug it in for Collection like in the DataSet class. I have not looked at the josm codebase for a while, but as long as Node has publicly mutable coordinates, you can't honestly do it. If anything moves a node, it has no way to automatically jump into the new bucket. And it makes no sense to try patching all the places which can move a node (MoveCommand is not the only one). Just recently, Node/Way/Relation require access to be via accessor functions. That should help out quite a bit. The one thing that we do need is for a list of PrimitiveChangeListeners or something to call when primitives do change. You're right, though. This can't simply be dropped in for the node list and retain its searching abilities. -- Dave ___ josm-dev mailing list josm-dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/josm-dev
Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM
Just recently, Node/Way/Relation require access to be via accessor functions. That should help out quite a bit. The one thing that we do need is for a list of PrimitiveChangeListeners or something to call when primitives do change. This is the very motivation Jiri and myself have been working on replacing direct field access with accessors. Now we can start to work on features which will really improve JOSM, a spatial index for instance. -- Karl -Ursprüngliche Nachricht- Von: Dave Hansen [mailto:d...@sr71.net] Gesendet: Samstag, 12. September 2009 21:19 An: Petr Nejedlý Cc: karl.guggisb...@guggis.ch; 'Ævar Arnfjörð Bjarmason'; josm-dev@openstreetmap.org Betreff: Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM On Sat, 2009-09-12 at 21:04 +0200, Petr Nejedlý wrote: Dave Hansen napsal(a): Ooh, I forgot about josm-ng. That one looks very usable. If mine doesn't pan out, I'll certainly look at that one. Well, the QTree in josm-ng is similar to yours in the way it distributes the content, though I did pay quite some attention to memory usage (hint: e.g. LinkedList is not the collection you'd like to use unless you have specific reason to) For the leaf nodes you mean? I actually got better performance out of it than I did ArrayList. Honestly, I've been looking at performance a lot more than pure memory usage, so I bet you're right. But, when you have an absolute ton of ArrayLists around that you're iterating over a lot ArrayList.size() actually shows up in the profiles pretty high. and support also 2d entities (node is zero-d). The problem with importing it directly into josm is that it uses (in -ng) the projected (and integer) coordinates. While it doesn't care whether the coordinates are projected or not, it hugely benefit from their signed integer nature. You just mean that they're cheaper to deal with than doubles/floats? Dave Hansen napsal(a): QuadBuckets also happen to implement CollectionNode. So, we can just plug it in for Collection like in the DataSet class. I have not looked at the josm codebase for a while, but as long as Node has publicly mutable coordinates, you can't honestly do it. If anything moves a node, it has no way to automatically jump into the new bucket. And it makes no sense to try patching all the places which can move a node (MoveCommand is not the only one). Just recently, Node/Way/Relation require access to be via accessor functions. That should help out quite a bit. The one thing that we do need is for a list of PrimitiveChangeListeners or something to call when primitives do change. You're right, though. This can't simply be dropped in for the node list and retain its searching abilities. -- Dave ___ josm-dev mailing list josm-dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/josm-dev
Re: [josm-dev] QuadBuckets (using quad tiling) for node storage in JOSM
On Sat, 12 Sep 2009, Karl Guggisberg wrote: Just recently, Node/Way/Relation require access to be via accessor functions. That should help out quite a bit. The one thing that we do need is for a list of PrimitiveChangeListeners or something to call when primitives do change. This is the very motivation Jiri and myself have been working on replacing direct field access with accessors. Now we can start to work on features which will really improve JOSM, a spatial index for instance. I really like some of the last patches which improve speed in JOSM and plugins. We are going into the right direction even if it is still a long way until JOSM is able to handle large datasets reliable and fast. Ciao -- http://www.dstoecker.eu/ (PGP key available) ___ josm-dev mailing list josm-dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/josm-dev
Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM
On Sat, 2009-09-12 at 21:36 +0200, Karl Guggisberg wrote: Just recently, Node/Way/Relation require access to be via accessor functions. That should help out quite a bit. The one thing that we do need is for a list of PrimitiveChangeListeners or something to call when primitives do change. This is the very motivation Jiri and myself have been working on replacing direct field access with accessors. Now we can start to work on features which will really improve JOSM, a spatial index for instance. Any idea how you want this to look? I'm starting to code some primitive things, and it would be nice if I could get whatever I do merged eventually. * Object updates are rare, and it is OK if the notification process is relatively slow * Rather than storing a list of all listeners that should be consulted on each object, store a (relatively) global list of listeners. Each listener would get all object updates for a particular type. That keeps us from having to store anything in *each* primitive. I also want to have a ReverseLookup (node - ways) cache. I've had it sitting in the validator, but it's hard to keep it coherent and I end up throwing it away too often to make it useful. Thinking about it, though... If we had ways in QuadBuckets, I probably wouldn't need the ReverseLookup cache. I'd just do a search for ways around a node and I'd just get the ways that are using the node back. -- Dave ___ josm-dev mailing list josm-dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/josm-dev