[josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM

2009-09-12 Thread Dave Hansen

I've been hacking on the JOSM validator plugin for a while.  One of the
repeating hard problems that comes up are doing the UnconnectedWays tests.
You need to do searches for every segment in a way to see if there are any
nearby nodes.  This generally means that you do a number of searches on the
same order as the number of nodes that you have.

If you have 100k way nodes, you end up doing ~90k searches (one for each
segment) each looking at a sizable number of those 100k nodes.  Any way you
slice it, you end up doing billions of searches since it's an O(n^2) algorithm.
A single validation run with 50k nodes takes me 183 seconds.

I indexed 500,000 nodes from Oregon in the US.  I then took another 3,000,000
nodes and searched to find which of those 500,000 were near them.  Each
iteration over 500,000 nodes for me takes ~0.28 seconds.  If I were to do
3,000,000 searches, it would take 1.63 days!  So, I attempted to optimize this.

The messages is lengthy one, but feel free to stop and look at the code now.
If you want to know how I got that 183-second run down to 2 seconds, read
down. :)

--

My first attempt to optimize was to do binary searches.  I indexed all of the
nodes in two arrays, one by x coordinate and one by y.  I did a binary search
for each axis, iterated out from the place that the binary search found
(representing the radius of the search).  I then combined the two searches.
Algorithmically, it's still an O(n^2) algorithm since there's an iteration, but
it was significantly faster since the number of nodes that were touched was
smaller.

But, I still ran into problems with large data sets because I could still get
thousands of nodes returned from each axis search.  Optimizing, I instead kept
just a single index x-axis.  I then sorted the result of the x-axis binary
search by the y-coordinate and binary searched on that.  That's as fast as I
ever got that algorithm to go.  It ended up way better than the linear search
(~600x!) but still takes ~200s for the 3,000,000 node search.

Can we make the binary search better, or is there was still a fundamental
limitation in that algorithm?  I think I was up against a pretty fundamental
limitation if I chose to sort by a single coordinate at a time: during the
search, it touched a *LOT* of virtually random data.  Since I indexed all of
the coordinates by longitude, a point in Greenwich could be in the binary
x-axis index next to a point in Algeria and next to another in Antarctica (all
at 0 longitude).  There's a lot of jumping around to pseudo-random points all
over the data set.  That thrashes the CPU caches and TLB.  

I was intrigued by the quadtiling algorithm that we currently use on the
database servers.

http://wiki.openstreetmap.org/wiki/QuadTiles

If we were to make a data structure that represents the tiles, we could chop up
the world into much smaller pieces.  Searches would also potentially be faster
since we could operate on subsets of the data much more easily.  We would not
be dealing with the whole world of random data.

So, I went and implemented it.  I call it QuadBuckets, and it's basically an
unbalanced 4-way radix tree structure.  We radix on the bits in the quad index.
It stores individual data in quad bucket levels (QBLevel)s that can either
contain pointers to 4 more child QBLevels or may act as a leaf node and store
arbitrarily long lists of OSM Nodes.

When a leaf node gets too large (my testing shows that 16 or 32 elements is the
best maximum size for performance.. see todo list), the leaf is split into its
4 constituent child tiles and its contents are redistributed among them.
Several splits (and new tree levels) may be required if the stored coordinates
were very close together.

The quad tiling in the database uses 16 bits of precision.  This means that the
least-significant quadtile bit represents an area about 610m (40,000km / 2^16)
across (and half that tall) at the equator.  On real data sets, this means the
buckets are just too large.  So, I added another 8 bits of precision so that we
have buckets that are now ~2.4m across.  Making sure buckets don't get too big
is one of the keys to this algorithm, and these buckets are small enough.

I also implemented a simple search cache.  It records the last QBLevel node
from which the last search found results.  When the next search request is
made, we check to see if this tree node can satisfy the search, and avoid
walking too far down the tree each time.  If the tree node can not satisfy the
search, we walk back up the tree until the search can be satisfied.  This
optimization speeds things up by ~25%.

For testing, I first used random data spanning the globe, basically creating
nodes with random latitudes and longitudes.  This is a near worst-case for
QuadBuckets.  Despite this, I got the QuadBuckets approach to be about 30%
faster than the binary searches.  But, using real OSM data, it really shines.

I threw the 3,000,000 search workload at it.  Remember, 

[josm-dev] [PATCH 2/2] use QuadBuckets as node storage

2009-09-12 Thread Dave Hansen

QuadBuckets also happen to implement CollectionNode.  So, we can
just plug it in for Collection like in the DataSet class.

---

 core-dave/src/org/openstreetmap/josm/data/osm/DataSet.java |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff -puN 
src/org/openstreetmap/josm/data/osm/DataSet.java~QuadBuckets-as-node-storage 
src/org/openstreetmap/josm/data/osm/DataSet.java
--- 
core/src/org/openstreetmap/josm/data/osm/DataSet.java~QuadBuckets-as-node-storage
   2009-09-12 08:42:07.0 -0700
+++ core-dave/src/org/openstreetmap/josm/data/osm/DataSet.java  2009-09-12 
08:42:07.0 -0700
@@ -16,6 +16,7 @@ import java.util.List;
 import java.util.Set;
 
 import org.openstreetmap.josm.data.SelectionChangedListener;
+import org.openstreetmap.josm.data.osm.QuadBuckets;
 
 /**
  * DataSet is the data behind the application. It can consists of only a few 
points up to the whole
@@ -37,7 +38,7 @@ public class DataSet implements Cloneabl
  * All nodes goes here, even when included in other data (ways etc). This 
enables the instant
  * conversion of the whole DataSet by iterating over this data structure.
  */
-public CollectionNode nodes = new LinkedListNode();
+public CollectionNode nodes = new QuadBuckets();
 
 /**
  * All ways (Streets etc.) in the DataSet.
_

___
josm-dev mailing list
josm-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/josm-dev


Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM

2009-09-12 Thread Russ Nelson
Dave Hansen writes:
  So, I went and implemented it.  I call it QuadBuckets, and it's basically an
  unbalanced 4-way radix tree structure.

Well done, Dave!

-- 
--my blog is athttp://blog.russnelson.com
Crynwr supports open source software
521 Pleasant Valley Rd. | +1 315-323-1241
Potsdam, NY 13676-3213  | Sheepdog   

___
josm-dev mailing list
josm-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/josm-dev


Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM

2009-09-12 Thread Robert Scott
On Saturday 12 September 2009, Dave Hansen wrote:
 
 I've been hacking on the JOSM validator plugin for a while.  One of the
 repeating hard problems that comes up are doing the UnconnectedWays tests.
 You need to do searches for every segment in a way to see if there are any
 nearby nodes.  This generally means that you do a number of searches on the
 same order as the number of nodes that you have.
...

Dave,

I realize this is quite a crude reply to such a detailed email.

Did you investigate kd-trees at all?

http://en.wikipedia.org/wiki/Kd_tree


robert.

___
josm-dev mailing list
josm-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/josm-dev


Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM

2009-09-12 Thread Dave Hansen
On Sat, 2009-09-12 at 17:58 +0100, Robert Scott wrote:
 On Saturday 12 September 2009, Dave Hansen wrote:
  
  I've been hacking on the JOSM validator plugin for a while.  One of the
  repeating hard problems that comes up are doing the UnconnectedWays tests.
  You need to do searches for every segment in a way to see if there are any
  nearby nodes.  This generally means that you do a number of searches on the
  same order as the number of nodes that you have.
 ...
 
 I realize this is quite a crude reply to such a detailed email.
 
 Did you investigate kd-trees at all?
 
 http://en.wikipedia.org/wiki/Kd_tree

Nope.  I did some searching to find multidimensional search algorithms
before I started this, but none of them seemed too horribly nice.

kd-trees do like quite nice.  I do believe what I've implemented is
relatively close to what they do, though.  They have the advantage that
they split in a single dimension at each level in the tree.  QuadBuckets
are hard-coded for 2 dimensions and just basically split into those two
dimensions at once.  But, I think there is a lot of similarity.

It also looks like kd-trees are intelligent about the values at which
they decide to split.  This means that if you have a bunch of entries
all bunched up into one end of a bucket and you split it, you'll always
get them split in half.  QuadBuckets will instead force a bunch more
splits until the buckets are well-split.  QuadBuckets are stupid about
where the split: they only do it into geometric quarters of the original
bucket.

I basically chose the algorithm that I did since it modeled something
else that I and other OSMers are familiar with.  It's even possible that
we could use the quadtile indexes in a way to make communication with
the DB server faster, although I'm using slightly different calculations
than it is as it stands now.

If someone knows of any existing Java kd-tree implementations, I'd be
happy to look into it and see if it could be applied here.  I love
nothing more than to throw my own code away.  Seriously. ;)

-- Dave


___
josm-dev mailing list
josm-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/josm-dev


Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM

2009-09-12 Thread Ævar Arnfjörð Bjarmason
On Sat, Sep 12, 2009 at 5:11 PM, Dave Hansen d...@sr71.net wrote:
 If someone knows of any existing Java kd-tree implementations, I'd be
 happy to look into it and see if it could be applied here.  I love
 nothing more than to throw my own code away.  Seriously. ;)

Google turned this up for kd-tree java:
http://stackoverflow.com/questions/253767/kdtree-implementation-in-java

___
josm-dev mailing list
josm-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/josm-dev


Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM

2009-09-12 Thread Dave Hansen
On Sat, 2009-09-12 at 17:25 +, Ævar Arnfjörð Bjarmason wrote:
 On Sat, Sep 12, 2009 at 5:11 PM, Dave Hansen d...@sr71.net wrote:
  If someone knows of any existing Java kd-tree implementations, I'd be
  happy to look into it and see if it could be applied here.  I love
  nothing more than to throw my own code away.  Seriously. ;)
 
 Google turned this up for kd-tree java:
 http://stackoverflow.com/questions/253767/kdtree-implementation-in-java

One of those links is a 404, and the other looks like a C++ class that
has a Java API.  I'm not sure that'd be very easy to integrate into
JOSM.  Thanks for the links, though!

-- Dave


___
josm-dev mailing list
josm-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/josm-dev


Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM

2009-09-12 Thread Ævar Arnfjörð Bjarmason
On Sat, Sep 12, 2009 at 5:31 PM, Dave Hansen d...@sr71.net wrote:
 On Sat, 2009-09-12 at 17:25 +, Ęvar Arnfjörš Bjarmason wrote:
 On Sat, Sep 12, 2009 at 5:11 PM, Dave Hansen d...@sr71.net wrote:
  If someone knows of any existing Java kd-tree implementations, I'd be
  happy to look into it and see if it could be applied here.  I love
  nothing more than to throw my own code away.  Seriously. ;)

 Google turned this up for kd-tree java:
 http://stackoverflow.com/questions/253767/kdtree-implementation-in-java

 One of those links is a 404, and the other looks like a C++ class that
 has a Java API.  I'm not sure that'd be very easy to integrate into
 JOSM.  Thanks for the links, though!

You can find that 404-ed class on Google Code:
http://google.com/codesearch?q=KDTree.java

___
josm-dev mailing list
josm-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/josm-dev


Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM

2009-09-12 Thread Dave Hansen
On Sat, 2009-09-12 at 21:04 +0200, Petr Nejedlý wrote:
 Dave Hansen napsal(a):
   Ooh, I forgot about josm-ng.  That one looks very usable.  If mine
   doesn't pan out, I'll certainly look at that one.
 
 Well, the QTree in josm-ng is similar to yours in the way it distributes 
 the content, though I did pay quite some attention to memory usage 
 (hint: e.g. LinkedList is not the collection you'd like to use unless 
 you have specific reason to)

For the leaf nodes you mean?  I actually got better performance out of
it than I did ArrayList.  Honestly, I've been looking at performance a
lot more than pure memory usage, so I bet you're right.  But, when you
have an absolute ton of ArrayLists around that you're iterating over a
lot ArrayList.size() actually shows up in the profiles pretty high.

 and support also 2d entities (node is 
 zero-d). The problem with importing it directly into josm is that it 
 uses (in -ng) the projected (and integer) coordinates. While it doesn't 
 care whether the coordinates are projected or not, it hugely benefit 
 from their signed integer nature.

You just mean that they're cheaper to deal with than doubles/floats?

 Dave Hansen napsal(a):
   QuadBuckets also happen to implement CollectionNode.  So, we can
   just plug it in for Collection like in the DataSet class.
 
 I have not looked at the josm codebase for a while, but as long as
 Node has publicly mutable coordinates, you can't honestly do it.
 If anything moves a node, it has no way to automatically jump into the 
 new bucket. And it makes no sense to try patching all the places which 
 can move a node (MoveCommand is not the only one).

Just recently, Node/Way/Relation require access to be via accessor
functions.  That should help out quite a bit.  The one thing that we do
need is for a list of PrimitiveChangeListeners or something to call when
primitives do change.

You're right, though.  This can't simply be dropped in for the node list
and retain its searching abilities.

-- Dave


___
josm-dev mailing list
josm-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/josm-dev


Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM

2009-09-12 Thread Karl Guggisberg
 Just recently, Node/Way/Relation require access to be via accessor
functions.  That should help out quite a bit.  
 The one thing that we do need is for a list of PrimitiveChangeListeners or
something to call when primitives do change.
This is the very motivation Jiri and myself have been working on replacing
direct field access with accessors. Now we can start to work on features
which will really improve JOSM, a spatial index for instance.

-- Karl  

-Ursprüngliche Nachricht-
Von: Dave Hansen [mailto:d...@sr71.net] 
Gesendet: Samstag, 12. September 2009 21:19
An: Petr Nejedlý
Cc: karl.guggisb...@guggis.ch; 'Ævar Arnfjörð Bjarmason';
josm-dev@openstreetmap.org
Betreff: Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node
storage in JOSM

On Sat, 2009-09-12 at 21:04 +0200, Petr Nejedlý wrote:
 Dave Hansen napsal(a):
   Ooh, I forgot about josm-ng.  That one looks very usable.  If mine  
  doesn't pan out, I'll certainly look at that one.
 
 Well, the QTree in josm-ng is similar to yours in the way it 
 distributes the content, though I did pay quite some attention to 
 memory usage
 (hint: e.g. LinkedList is not the collection you'd like to use unless 
 you have specific reason to)

For the leaf nodes you mean?  I actually got better performance out of it
than I did ArrayList.  Honestly, I've been looking at performance a lot more
than pure memory usage, so I bet you're right.  But, when you have an
absolute ton of ArrayLists around that you're iterating over a lot
ArrayList.size() actually shows up in the profiles pretty high.

 and support also 2d entities (node is zero-d). The problem with 
 importing it directly into josm is that it uses (in -ng) the projected 
 (and integer) coordinates. While it doesn't care whether the 
 coordinates are projected or not, it hugely benefit from their signed 
 integer nature.

You just mean that they're cheaper to deal with than doubles/floats?

 Dave Hansen napsal(a):
   QuadBuckets also happen to implement CollectionNode.  So, we can  
  just plug it in for Collection like in the DataSet class.
 
 I have not looked at the josm codebase for a while, but as long as 
 Node has publicly mutable coordinates, you can't honestly do it.
 If anything moves a node, it has no way to automatically jump into the 
 new bucket. And it makes no sense to try patching all the places which 
 can move a node (MoveCommand is not the only one).

Just recently, Node/Way/Relation require access to be via accessor
functions.  That should help out quite a bit.  The one thing that we do need
is for a list of PrimitiveChangeListeners or something to call when
primitives do change.

You're right, though.  This can't simply be dropped in for the node list and
retain its searching abilities.

-- Dave


___
josm-dev mailing list
josm-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/josm-dev


Re: [josm-dev] QuadBuckets (using quad tiling) for node storage in JOSM

2009-09-12 Thread Dirk Stöcker
On Sat, 12 Sep 2009, Karl Guggisberg wrote:

 Just recently, Node/Way/Relation require access to be via accessor 
 functions.  That should help out quite a bit. The one thing that we do 
 need is for a list of PrimitiveChangeListeners or something to call 
 when primitives do change.

 This is the very motivation Jiri and myself have been working on replacing
 direct field access with accessors. Now we can start to work on features
 which will really improve JOSM, a spatial index for instance.

I really like some of the last patches which improve speed in JOSM and 
plugins. We are going into the right direction even if it is still a long 
way until JOSM is able to handle large datasets reliable and fast.

Ciao
-- 
http://www.dstoecker.eu/ (PGP key available)


___
josm-dev mailing list
josm-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/josm-dev


Re: [josm-dev] [PATCH 1/2] QuadBuckets (using quad tiling) for node storage in JOSM

2009-09-12 Thread Dave Hansen
On Sat, 2009-09-12 at 21:36 +0200, Karl Guggisberg wrote:
  Just recently, Node/Way/Relation require access to be via accessor
 functions.  That should help out quite a bit.  
  The one thing that we do need is for a list of PrimitiveChangeListeners or
 something to call when primitives do change.
 This is the very motivation Jiri and myself have been working on replacing
 direct field access with accessors. Now we can start to work on features
 which will really improve JOSM, a spatial index for instance.

Any idea how you want this to look?  I'm starting to code some primitive
things, and it would be nice if I could get whatever I do merged
eventually.

* Object updates are rare, and it is OK if the notification process is
  relatively slow
* Rather than storing a list of all listeners that should be consulted
  on each object, store a (relatively) global list of listeners.  Each
  listener would get all object updates for a particular type.  That
  keeps us from having to store anything in *each* primitive.

I also want to have a ReverseLookup (node - ways) cache.  I've had it
sitting in the validator, but it's hard to keep it coherent and I end up
throwing it away too often to make it useful.

Thinking about it, though...  If we had ways in QuadBuckets, I probably
wouldn't need the ReverseLookup cache.  I'd just do a search for ways
around a node and I'd just get the ways that are using the node back.

-- Dave


___
josm-dev mailing list
josm-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/josm-dev