I think Daniel's questions are very relevant, and not just to OSM. Any large graph (of which OSM is simply a good example) will be affected by fragmentation, and that can affect performance. I was recently hit by poor performance of GIS queries (not OSM) caused by fragmentation of the index tree. I will describe that problem below, but first let me give my view on Daniel's question.
It is true that if parts of the graph that are geographically close are also close on disk, the load time for bounding-box queries will be faster. However, this is not a problem that is easy to solve in a generic way, because it requires knowledge of the domain. I can see two ways to create a less fragmented graph:

- Have a de-fragmenting algorithm that re-organizes an existing graph according to some rules. This does not exist in Neo4j (yet?), but is probably easier to generalize, since it should be possible to first analyse the connectedness of the graph and then defragment based on that. This means a reasonable solution might be possible without knowing the domain.
- Be sure to load domain-specific data in the order you wish to query it. In other words, create a graph that is already de-fragmented.

The second approach is the route I have started following (at least I've taken one or two tiny baby steps in that direction, and plan for more). In the case of the OSM model produced by the OSMImporter in Neo4j-Spatial, we do not do much here. We import the data in the order it was created in the original Postgres database (i.e. in the order it was originally added to OpenStreetMap). However, since the XML format puts all ways after all nodes, we also store all ways after all nodes. This means that loading any particular way completely from the database requires hitting disk in at least two very different places: the location of the way node and the interconnections between the nodes, and the location(s) of the original location nodes. This multiple hit occurs on the node, relationship and property stores in a similar way.

This also answers a question Daniel asked about the ids: Neo4j nodes, relationships and properties each have their own id space, so you can have node 1, relationship 1 and property 1.

Let's consider a real example: a street made of 5 points, added early to OSM (so low ids in both Postgres and Neo4j).
The OSM file will have these nodes near the top, but the way that connects them together will be near the bottom of the file. In Postgres the nodes and ways are in different tables, and both will be near the top of their tables. In Neo4j, both osm-ways and osm-nodes are neo4j-nodes (in the same 'table'). The osm-nodes will have low ids, but the way will have a high id. We also use proxy nodes to connect osm-ways to osm-nodes, and these are created together with the way. So we end up with 5 nodes with low ids and 8 nodes with high ids (5 proxy nodes, 1 way node, 1 geometry node and 1 tags node). If the way were big and/or edited multiple times, we could get even higher fragmentation.

Personally I think that fragmenting one geometry into a few specific locations is not a big problem for the Neo4j caches. However, when we are talking about a result set or traversal of thousands or hundreds of thousands of geometries, then doubling or tripling the number of disk hits due to fragmentation can definitely have a big impact.

How can this fragmentation be improved? One idea is to load the data in two passes. The current loader tries to optimize OSM import speed, which is difficult already (and slower than in an RDBMS due to the increased explicit structure), so we have a single-pass loader with a Lucene index for reconnecting ways to nodes. However, I think we could change this to a two-pass loader: the first pass would read and index the point nodes into a unique-id index (for fast Postgres id lookup), and the second pass would connect the ways and store both the nodes and ways to the database at the same time, in contiguous disk space. This would improve query performance, and if we make a good unique-id index faster than Lucene, we would actually also improve import speed :-)

Now, all of the above does not answer the original question regarding bounding-box queries.
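Before returning to bounding boxes, the two-pass idea could be sketched roughly like this. This is a thought experiment only, not the actual OSMImporter code: the class, the record types and the simulated "store" are all hypothetical, with a list index standing in for a disk offset.

```java
import java.util.*;

// Hypothetical sketch of a two-pass import: pass 1 only indexes point
// locations by OSM id (nothing is written yet); pass 2 writes each way
// together with its member nodes, so a way and its nodes land in
// contiguous store positions instead of two distant disk regions.
public class TwoPassSketch {
    public record OsmNode(long osmId, double lat, double lon) {}
    public record OsmWay(long osmId, List<Long> memberIds) {}

    public static List<String> importData(List<OsmNode> nodes, List<OsmWay> ways) {
        // Pass 1: build the unique-id index of point nodes.
        Map<Long, OsmNode> byId = new HashMap<>();
        for (OsmNode n : nodes) byId.put(n.osmId(), n);

        // Pass 2: write ways and their nodes interleaved into the store.
        List<String> store = new ArrayList<>();
        Set<Long> written = new HashSet<>();
        for (OsmWay w : ways) {
            store.add("way:" + w.osmId());
            for (long id : w.memberIds()) {
                // Each node is written once, next to the first way using it.
                if (written.add(id)) {
                    OsmNode n = byId.get(id);
                    store.add("node:" + n.osmId());
                }
            }
        }
        return store;
    }

    public static void main(String[] args) {
        List<OsmNode> nodes = List.of(new OsmNode(1, 59.0, 18.0),
                new OsmNode(2, 59.1, 18.1), new OsmNode(3, 59.2, 18.2));
        List<OsmWay> ways = List.of(new OsmWay(100, List.of(1L, 2L, 3L)));
        System.out.println(importData(nodes, ways));
        // prints [way:100, node:1, node:2, node:3]
    }
}
```

The point of the sketch is only the write order: the way record is immediately followed by its member nodes, so loading the complete geometry later touches one contiguous region rather than two.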
All we will have done with this is improve the load time for complete OSM geometries (by reducing geometry fragmentation). But what about the index itself? We store the index as part of the graph. Today, Neo4j-Spatial uses an RTree index that is created at the end of the load in OSMImporter. This means we load the complete OSM file and then we index it. This is a good idea because it stores the entire RTree in contiguous disk space. Sort of... there is one issue with RTree node splitting that causes slight fragmentation, but I think it is not too serious.

Now, when performing bounding-box queries, the main work done by the RTree will hit the minimum amount of disk, until we get to the final geometry tests, when we start evaluating the geometries themselves (no longer just the RTree bounding boxes). Then we are faced with the geometry fragmentation: not just the fragmenting of individual geometries into multiple disk locations, but the fact that geographically close geometries are not close on disk.

To summarise the fragmentation issues with regard to OSMImporter (Neo4j-Spatial):

- In-geometry fragmentation: not ideal; we have two or more fragments per connected way; can be improved with a two-pass import.
- RTree fragmentation: not bad; we store the RTree in mostly contiguous space.
- Close-geometries fragmentation: bad; we do not consider this at all in the import.

So how do we solve this last issue? I think that if we do write a two-pass importer, we should create an external temporary index of locations during the first pass, and then import into Neo4j in approximate location order. This will reduce fragmentation overall.

And finally, a disclaimer: all of the above is based on my knowledge of the Neo4j-Spatial code and on thought experiments about what impact alternative load orders would have on graph performance. We have not demonstrated how much improvement will occur with any of these ideas.
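One cheap way to get an "approximate location order" would be a space-filling curve, for example a Z-order (Morton) key that interleaves the bits of quantized lat/lon. This is again only a sketch of one possible technique, nothing like it exists in Neo4j-Spatial today, and the 16-bit quantization is an arbitrary assumption:

```java
// Hypothetical sketch: derive a Z-order (Morton) key from lat/lon so that
// sorting the first-pass location index by this key yields an approximate
// location order for the second import pass.
public class MortonOrder {
    // Spread the low 16 bits of v so there is a zero bit between each bit.
    public static long spread(long v) {
        v &= 0xFFFFL;
        v = (v | (v << 8)) & 0x00FF00FFL;
        v = (v | (v << 4)) & 0x0F0F0F0FL;
        v = (v | (v << 2)) & 0x33333333L;
        v = (v | (v << 1)) & 0x55555555L;
        return v;
    }

    // Quantize lat/lon to 16 bits each and interleave the bits.
    public static long mortonKey(double lat, double lon) {
        long x = (long) ((lon + 180.0) / 360.0 * 65535.0);
        long y = (long) ((lat + 90.0) / 180.0 * 65535.0);
        return spread(x) | (spread(y) << 1);
    }

    public static void main(String[] args) {
        // Two points in Stockholm and one in New York.
        double[][] pts = { {59.33, 18.06}, {40.71, -74.0}, {59.34, 18.07} };
        java.util.Arrays.sort(pts,
                java.util.Comparator.comparingLong((double[] p) -> mortonKey(p[0], p[1])));
        // After sorting, the two Stockholm points are adjacent in the order.
        for (double[] p : pts) System.out.println(p[0] + "," + p[1]);
    }
}
```

Sorting the temporary first-pass index by such a key before writing would tend to place geographically close nodes close together on disk, at the cost of one extra sort over the data. Z-order is not perfect (it has occasional long jumps across the curve), but it is trivial to compute and would likely be good enough for this purpose.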
It could help a lot, or perhaps not much at all. There might be easier ways to make things faster. But honestly, right now we are not suffering badly on query time; import time is still the bigger issue, IMHO.

On Fri, Oct 7, 2011 at 3:59 PM, Peter Neubauer <[email protected]> wrote:
> Daniel,
> for OSM data and GIS, have you looked at
> https://github.com/neo4j/spatial, especially the OSM examples at
> https://github.com/neo4j/spatial/blob/master/src/test/java/org/neo4j/gis/spatial/pipes/GeoPipesTest.java
> and
> https://github.com/neo4j/spatial/blob/master/src/test/java/org/neo4j/gis/spatial/OsmAnalysisTest.java
> ?
>
> Cheers,
>
> /peter neubauer
>
> GTalk: neubauer.peter
> Skype: peter.neubauer
> Phone: +46 704 106975
> LinkedIn: http://www.linkedin.com/in/neubauer
> Twitter: http://twitter.com/peterneubauer
>
> http://www.neo4j.org - Your high performance graph database.
> http://startupbootcamp.org/ - Öresund - Innovation happens HERE.
> http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.
>
> On Fri, Oct 7, 2011 at 3:46 PM, danielb <[email protected]> wrote:
> > Hello Chris,
> >
> > thanks for your postings, they are a great starting point for me to assume a
> > good performance of this database. However I have some questions left.
> > Lets say I have a node store and a property store. Both of them are
> > individual files. I am going to implement a GIS application which fetches
> > OSM data from the hard disk. When I load a bounding box I want to avoid too
> > many random reads from the disk.
> > More precisely I want to load nodes and properties inside a given bounding
> > box. It would be great if both the nodes and the properties are organized in
> > successive blocks. Is there one id-pool for both nodes and properties, so
> > that I can load for example the nodes with id 1 and 2 and the properties 3,
> > 4 and 5 with one block read?
> > I can be totally wrong because if I save a new
> > node file with id 1, 2 and then save a new property file with id 3, it will
> > start on a new block (windows block size like 4K). When then writing a new
> > node id it would be saved in the first block I guess. What about
> > fragmentation? And is there an improvement when using a Linux system
> > (inodes? I don't know Linux well)? When I am finished with saving the nodes
> > and properties is there some way of reorganization on the hard disk? Lets
> > say I want to enter a new node which is connected to a low id. Will it get
> > the first free id (and it will be saved on the other end of the harddisk
> > perhaps) or does it just get an already used id and the following records
> > will be reorganized (insert performance)?
> > Maybe I am totally wrong about this, but I would appreciate an efficient way
> > of storage for GIS data.
> >
> > best regards, Daniel
> >
> > --
> > View this message in context:
> > http://neo4j-community-discussions.438527.n3.nabble.com/Neo4j-low-level-data-storage-tp3336483p3402827.html
> > Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
> > _______________________________________________
> > Neo4j mailing list
> > [email protected]
> > https://lists.neo4j.org/mailman/listinfo/user

