I think Daniel's questions are very relevant, and not just to OSM. Any large graph (of which OSM is simply a good example) will be affected by fragmentation, and that can affect performance. I was recently hit by poor performance of GIS queries (not OSM) caused by fragmentation of the index tree. I will describe that problem below, but first let me give my view on Daniel's question.
It is true that if parts of the graph that are geographically close are also close on disk, the load time for bounding-box queries will be faster. However, this is not a problem that is easy to solve in a generic way, because it requires knowledge of the domain. I can see two ways to create a less fragmented graph:

- Have a de-fragmenting algorithm that re-organizes an existing graph according to some rules. This does not exist in Neo4j (yet?), but is probably easier to generalize, since it should be possible to first analyse the connectedness of the graph and then defragment based on that. This means a reasonable solution might be possible without knowing the domain.
- Be sure to load domain-specific data in the order you wish to query it. In other words, create a graph that is already de-fragmented.

The second approach is the route I have started following (at least I've taken one or two tiny baby steps in that direction, and plan for more). In the case of the OSM model produced by the OSMImporter in Neo4j-Spatial, we do not do much here. We import the data in the order it was created in the original Postgres database (i.e. in the order it was originally added to OpenStreetMap). However, since the XML format puts all ways after all nodes, we also store all ways after all nodes. This means that loading any particular way completely from the database requires hitting disk in at least two very different places: the location of the way node and the interconnections between the nodes, and the location(s) of the original location nodes. This multiple hit occurs on the node, relationship and property stores in a similar way.

This also answers a question Daniel asked about the ids: Neo4j nodes, relationships and properties each have their own id space, so you can have node 1, relationship 1 and property 1.

Let's consider a real example: a street made of 5 points, added early to OSM (so low ids in both Postgres and Neo4j).
The OSM file will have these nodes near the top, but the way that connects them together will be near the bottom of the file. In Postgres the nodes and ways are in different tables, and both will be near the top of their tables. In Neo4j, both osm-ways and osm-nodes are neo4j-nodes (in the same 'table'). The osm-nodes will have low ids, but the way will have a high id. We also use proxy nodes to connect osm-ways to osm-nodes, and these are created together with the way. So we end up with 5 nodes with low ids and 8 nodes with high ids (5 proxy nodes, 1 way node, 1 geometry node and 1 tags node). If the way were big and/or edited multiple times, we could get even higher fragmentation.

Personally I think that fragmenting one geometry into a few specific locations is not a big problem for the Neo4j caches. However, when we are talking about a result set or traversal of thousands or hundreds of thousands of geometries, then doubling or tripling the number of disk hits due to fragmentation can definitely have a big impact.

How can this fragmentation be improved? One idea is to load the data in two passes. The current loader tries to optimize OSM import speed, which is difficult already (and slower than in an RDBMS due to the increased explicit structure), so we have a single-pass loader with a Lucene index for reconnecting ways to nodes. However, I think we could change this to a two-pass loader: the first pass would read and index the point nodes into a unique-id index (for fast Postgres id lookup), and the second pass would connect the ways and store both the nodes and ways to the database at the same time, in contiguous disk space. This would improve query performance, and if we make a good unique-id index faster than Lucene, we would actually also improve import speed :-)

Now, all of the above does not answer the original question regarding bounding-box queries.
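Before returning to bounding boxes, the two-pass idea could be sketched roughly like this. This is a thought experiment only, not the actual OSMImporter code: the class, the record types and the simulated "store" are all hypothetical, with a list index standing in for a disk offset.

```java
import java.util.*;

// Hypothetical sketch of a two-pass import: pass 1 only indexes point
// locations by OSM id (nothing is written yet); pass 2 writes each way
// together with its member nodes, so a way and its nodes land in
// contiguous store positions instead of two distant disk regions.
public class TwoPassSketch {
    public record OsmNode(long osmId, double lat, double lon) {}
    public record OsmWay(long osmId, List<Long> memberIds) {}

    public static List<String> importData(List<OsmNode> nodes, List<OsmWay> ways) {
        // Pass 1: build the unique-id index of point nodes.
        Map<Long, OsmNode> byId = new HashMap<>();
        for (OsmNode n : nodes) byId.put(n.osmId(), n);

        // Pass 2: write ways and their nodes interleaved into the store.
        List<String> store = new ArrayList<>();
        Set<Long> written = new HashSet<>();
        for (OsmWay w : ways) {
            store.add("way:" + w.osmId());
            for (long id : w.memberIds()) {
                // Each node is written once, next to the first way using it.
                if (written.add(id)) {
                    OsmNode n = byId.get(id);
                    store.add("node:" + n.osmId());
                }
            }
        }
        return store;
    }

    public static void main(String[] args) {
        List<OsmNode> nodes = List.of(new OsmNode(1, 59.0, 18.0),
                new OsmNode(2, 59.1, 18.1), new OsmNode(3, 59.2, 18.2));
        List<OsmWay> ways = List.of(new OsmWay(100, List.of(1L, 2L, 3L)));
        System.out.println(importData(nodes, ways));
        // prints [way:100, node:1, node:2, node:3]
    }
}
```

The point of the sketch is only the write order: the way record is immediately followed by its member nodes, so loading the complete geometry later touches one contiguous region rather than two.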
All we will have done with this is improve the load time for complete OSM geometries (by reducing geometry fragmentation). But what about the index itself? We store the index as part of the graph. Today, Neo4j-Spatial uses an RTree index that is created at the end of the load in OSMImporter. This means we load the complete OSM file and then we index it. This is a good idea because it stores the entire RTree in contiguous disk space. Sort of... there is one issue with RTree node splitting that causes slight fragmentation, but I think it is not too serious.

Now, when performing bounding-box queries, the main work done by the RTree will hit the minimum amount of disk, until we get to the final geometry tests, when we start evaluating the geometries themselves (no longer just the RTree bounding boxes). Then we are faced with the geometry fragmentation: not just the fragmenting of individual geometries into multiple disk locations, but the fact that geographically close geometries are not close on disk.

To summarise the fragmentation issues with regard to OSMImporter (Neo4j-Spatial):

- In-geometry fragmentation: not ideal; we have two or more fragments per connected way; can be improved with a two-pass import.
- RTree fragmentation: not bad; we store the RTree in mostly contiguous space.
- Close-geometries fragmentation: bad; we do not consider this at all in the import.

So how do we solve this last issue? I think that if we do write a two-pass importer, we should create an external temporary index of locations during the first pass, and then import into Neo4j in approximate location order. This will reduce fragmentation overall.

And finally, a disclaimer: all of the above is based on my knowledge of the Neo4j-Spatial code and on thought experiments about what impact alternative load orders would have on graph performance. We have not demonstrated how much improvement will occur with any of these ideas.
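One cheap way to get an "approximate location order" would be a space-filling curve, for example a Z-order (Morton) key that interleaves the bits of quantized lat/lon. This is again only a sketch of one possible technique, nothing like it exists in Neo4j-Spatial today, and the 16-bit quantization is an arbitrary assumption:

```java
// Hypothetical sketch: derive a Z-order (Morton) key from lat/lon so that
// sorting the first-pass location index by this key yields an approximate
// location order for the second import pass.
public class MortonOrder {
    // Spread the low 16 bits of v so there is a zero bit between each bit.
    public static long spread(long v) {
        v &= 0xFFFFL;
        v = (v | (v << 8)) & 0x00FF00FFL;
        v = (v | (v << 4)) & 0x0F0F0F0FL;
        v = (v | (v << 2)) & 0x33333333L;
        v = (v | (v << 1)) & 0x55555555L;
        return v;
    }

    // Quantize lat/lon to 16 bits each and interleave the bits.
    public static long mortonKey(double lat, double lon) {
        long x = (long) ((lon + 180.0) / 360.0 * 65535.0);
        long y = (long) ((lat + 90.0) / 180.0 * 65535.0);
        return spread(x) | (spread(y) << 1);
    }

    public static void main(String[] args) {
        // Two points in Stockholm and one in New York.
        double[][] pts = { {59.33, 18.06}, {40.71, -74.0}, {59.34, 18.07} };
        java.util.Arrays.sort(pts,
                java.util.Comparator.comparingLong((double[] p) -> mortonKey(p[0], p[1])));
        // After sorting, the two Stockholm points are adjacent in the order.
        for (double[] p : pts) System.out.println(p[0] + "," + p[1]);
    }
}
```

Sorting the temporary first-pass index by such a key before writing would tend to place geographically close nodes close together on disk, at the cost of one extra sort over the data. Z-order is not perfect (it has occasional long jumps across the curve), but it is trivial to compute and would likely be good enough for this purpose.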
It could help a lot, or perhaps not much at all. There might be easier ways to make things faster. But honestly, right now we are not suffering badly on query time; import time is still the bigger issue, IMHO.

On Fri, Oct 7, 2011 at 3:59 PM, Peter Neubauer <[email protected]> wrote:
> Daniel,
> for OSM data and GIS, have you looked at
> https://github.com/neo4j/spatial, especially the OSM examples at
> https://github.com/neo4j/spatial/blob/master/src/test/java/org/neo4j/gis/spatial/pipes/GeoPipesTest.java
> and
> https://github.com/neo4j/spatial/blob/master/src/test/java/org/neo4j/gis/spatial/OsmAnalysisTest.java
> ?
>
> Cheers,
>
> /peter neubauer
>
> GTalk: neubauer.peter
> Skype: peter.neubauer
> Phone: +46 704 106975
> LinkedIn: http://www.linkedin.com/in/neubauer
> Twitter: http://twitter.com/peterneubauer
>
> http://www.neo4j.org - Your high performance graph database.
> http://startupbootcamp.org/ - Öresund - Innovation happens HERE.
> http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.
>
> On Fri, Oct 7, 2011 at 3:46 PM, danielb <[email protected]> wrote:
> > Hello Chris,
> >
> > thanks for your postings, they are a great starting point for me to assume a
> > good performance of this database. However I have some questions left.
> > Lets say I have a node store and a property store. Both of them are
> > individual files. I am going to implement a GIS application which fetches
> > OSM data from the hard disk. When I load a bounding box I want to avoid too
> > many random reads from the disk.
> > More precisely I want to load nodes and properties inside a given bounding
> > box. It would be great if both the nodes and the properties are organized in
> > successive blocks. Is there one id-pool for both nodes and properties, so
> > that I can load for example the nodes with id 1 and 2 and the properties 3,
> > 4 and 5 with one block read?
> > I can be totally wrong because if I save a new
> > node file with id 1, 2 and then save a new property file with id 3, it will
> > start on a new block (windows block size like 4K). When then writing a new
> > node id it would be saved in the first block I guess. What about
> > fragmentation? And is there an improvement when using a Linux system
> > (inodes? I don't know Linux well)? When I am finished with saving the nodes
> > and properties is there some way of reorganization on the hard disk? Lets
> > say I want to enter a new node which is connected to a low id. Will it get
> > the first free id (and it will be saved on the other end of the harddisk
> > perhaps) or does it just get an already used id and the following records
> > will be reorganized (insert performance)?
> > Maybe I am totally wrong about this, but I would appreciate an efficient way
> > of storage for GIS data.
> >
> > best regards, Daniel
> >
> > --
> > View this message in context:
> > http://neo4j-community-discussions.438527.n3.nabble.com/Neo4j-low-level-data-storage-tp3336483p3402827.html
> > Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
> > _______________________________________________
> > Neo4j mailing list
> > [email protected]
> > https://lists.neo4j.org/mailman/listinfo/user

