Lots of thoughts here; I just want to add a side note on store "defragmentation".
Mattias and I wrote a little tool for a customer who needed to recreate his store with some properties/relationships omitted, while keeping the original node ids. We did that by iterating through the existing store and using the batch inserter's createNode and createRelationship methods, which take an _explicit_ id value. So you can control which ids are assigned to which nodes. That would allow, for example, the R-Trees to be laid out so that they fit in a single persistence window and can be read in one go (that's just the index); it would probably be interesting to co-locate the way-nodes inside the bounding-box frame too.

Cheers,
Michael

On 07.10.2011 at 22:52, Craig Taverner wrote:

> I think Daniel's questions are very relevant, and not just to OSM. Any large
> graph (of which OSM is simply a good example) will be affected by
> fragmentation, and that can affect performance. I was recently hit by poor
> performance of GIS queries (not OSM) related to fragmentation of the index
> tree. I will describe that problem below, but first let me describe my view
> on Daniel's question.
>
> It is true that if parts of the graph that are geographically close are also
> close on disk, the load time for bounding box queries will be faster.
> However, this is not a problem that is easy to solve in a generic way,
> because it requires knowledge of the domain. I can see two ways to create a
> less fragmented graph:
>
> - Have a defragmenting algorithm that reorganizes an existing graph
> according to some rules. This does not exist in Neo4j (yet?), but is
> probably easier to generalize, since it should be possible to first analyse
> the connectedness of the graph and then defragment based on that. This
> means a reasonable solution might be possible without knowing the domain.
> - Be sure to load domain-specific data in the order you wish to query it.
> In other words, create a graph that is already defragmented.
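[Editor's note: the explicit-id store copy Michael describes can be sketched roughly as below. This is illustrative Python with hypothetical stand-in classes, not Neo4j's actual BatchInserter API.]

```python
# Sketch of the store-copy approach: iterate an existing store and rebuild it
# with an inserter that accepts explicit ids, omitting unwanted properties and
# relationship types. BatchInserter here is a hypothetical stand-in.

class BatchInserter:
    """Minimal stand-in: records nodes/relationships under caller-chosen ids."""
    def __init__(self):
        self.nodes = {}
        self.rels = {}

    def create_node(self, node_id, properties):
        self.nodes[node_id] = dict(properties)

    def create_relationship(self, rel_id, start, end, rel_type, properties):
        self.rels[rel_id] = (start, end, rel_type, dict(properties))


def copy_store(old_nodes, old_rels, inserter, omit_props=(), omit_rel_types=()):
    """Recreate a store with the original ids, dropping selected properties
    and relationship types."""
    for node_id, props in old_nodes.items():
        kept = {k: v for k, v in props.items() if k not in omit_props}
        inserter.create_node(node_id, kept)   # same id as in the source store
    for rel_id, (start, end, rel_type, props) in old_rels.items():
        if rel_type in omit_rel_types:
            continue
        inserter.create_relationship(rel_id, start, end, rel_type, props)


old_nodes = {0: {"name": "a", "tmp": 1}, 1: {"name": "b"}}
old_rels = {0: (0, 1, "KNOWS", {}), 1: (1, 0, "TEMP", {})}
bi = BatchInserter()
copy_store(old_nodes, old_rels, bi, omit_props={"tmp"}, omit_rel_types={"TEMP"})
```

Because the caller controls id assignment, the same mechanism could lay out an R-Tree, or the nodes inside a bounding box, in contiguous id ranges.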
> This second approach is the route I have started following (at least I've
> taken one or two tiny baby steps in that direction, but plan for more). In
> the case of the OSM model produced by the OSMImporter in Neo4j-Spatial, we
> do not do much here. We are importing the data in the order it was created
> in the original Postgres database (i.e. in the order it was originally added
> to OpenStreetMap). However, since the XML format puts ways after all nodes,
> we actually also store all ways after all nodes, which means that loading
> any particular way completely from the database requires hitting disk at at
> least two very different locations: the location of the way node and the
> interconnects between the nodes, and the location(s) of the original
> location nodes. This multiple hit will occur on the node, relationship and
> property stores in a similar way.
>
> So I can also answer a question Daniel asked about the ids. The Neo4j nodes,
> relationships and properties have their own id spaces, so you can have node
> 1, relationship 1 and property 1. Let's consider a real example: a street
> made of 5 points, added early to OSM (so low ids in both Postgres and in
> Neo4j). The OSM file will have these nodes near the top, but the way that
> connects them together will be near the bottom of the file. In Postgres the
> nodes and ways are in different tables, and will both be near the top. In
> Neo4j both osm-ways and osm-nodes are neo4j-nodes (in the same 'table'). The
> osm-nodes will have low ids, but the ways will have high ids. Also, we use
> proxy nodes to connect osm-ways to osm-nodes, and these are created together
> with the way. So we will have 5 nodes with low ids and 8 nodes with high ids
> (5 proxy nodes, 1 way node, 1 geometry node and 1 tags node). If the way was
> big and/or edited multiple times, we could get even higher fragmentation.
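[Editor's note: Craig's 5-point-street example can be made concrete by counting how many distinct id "clusters" (and hence roughly how many distinct disk regions) a traversal of one way must touch. The gap threshold and ids below are illustrative only.]

```python
# An early 5-point way: the osm-nodes got low neo4j node ids, while the way
# node, 5 proxy nodes, geometry node and tags node got high ids. Grouping
# record ids into contiguous clusters approximates distinct disk regions.

def id_clusters(ids, gap=1000):
    """Group sorted record ids into clusters; ids closer than `gap` are
    assumed to share a disk region (purely illustrative threshold)."""
    ids = sorted(ids)
    clusters = [[ids[0]]]
    for i in ids[1:]:
        if i - clusters[-1][-1] <= gap:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters

osm_node_ids = [10, 11, 12, 13, 14]                # stored early in the file
way_side_ids = [5_000_000 + i for i in range(8)]   # way, 5 proxies, geometry, tags
print(len(id_clusters(osm_node_ids + way_side_ids)))  # -> 2 distinct regions
```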
> Personally I think that fragmenting one geometry into a few specific
> locations is not a big problem for the Neo4j caches. However, when we are
> talking about a result set or traversal of thousands or hundreds of
> thousands of geometries, then doubling or tripling the number of disk hits
> due to fragmentation can definitely have a big impact.
>
> How can this fragmentation situation be improved? One idea is to load the
> data in two passes. The current loader tries to optimize OSM import speed,
> which is difficult already (and slower than in an RDBMS due to the increased
> explicit structure), and so we have a single-pass loader, with a Lucene
> index for reconnecting ways to nodes. However, I think we could change this
> to a two-pass loader: the first pass would read and index the point nodes
> into a unique id-index (for fast Postgres id lookup), and the second pass
> would connect the ways, storing both the nodes and ways to the database at
> the same time, in contiguous disk space. This would improve query
> performance, and if we make a good unique-id index faster than Lucene, we
> will actually also improve import speed ... :-)
>
> Now, all of the above does not answer the original question regarding
> bounding box queries. All we will have done with this is improve the load
> time for complete OSM geometries (by reducing geometry fragmentation). But
> what about the index itself? We store the index as part of the graph.
> Today, Neo4j-Spatial uses an RTree index that is created at the end of the
> load in OSMImporter. This means we load the complete OSM file, and then we
> index it. This is a good idea because it will store the entire RTree in
> contiguous disk space. Sort of ... there is one issue with the RTree node
> splitting that will cause slight fragmentation, but I think it is not too
> serious.
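[Editor's note: the proposed two-pass loader can be sketched as follows. The function and data shapes are illustrative, not Neo4j-Spatial's actual importer; a plain dict stands in for the proposed fast unique-id index.]

```python
# Two-pass import sketch: pass 1 reads OSM point nodes and builds an id index;
# pass 2 walks the ways and writes each way together with its point nodes, so
# a way and its geometry land in contiguous id (and hence disk) space.

def two_pass_import(osm_nodes, osm_ways):
    node_index = {}          # pass 1: osm node id -> (lon, lat)
    for osm_id, lonlat in osm_nodes:
        node_index[osm_id] = lonlat

    store = []               # pass 2: append way + its points consecutively
    next_id = 0
    for way_id, node_refs in osm_ways:
        store.append((next_id, "way", way_id)); next_id += 1
        for ref in node_refs:
            store.append((next_id, "point", node_index[ref])); next_id += 1
    return store

nodes = [(101, (11.97, 57.70)), (102, (11.98, 57.71))]
ways = [(7, [101, 102])]
print(two_pass_import(nodes, ways))
```

The point is the write order: each way record is immediately followed by its point records, instead of all points first and all ways much later.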
> Now, when performing bounding box queries, the main work done by the RTree
> will hit the minimum amount of disk space, until we get to the final
> geometry tests when we start evaluating the geometries themselves (no longer
> just the RTree bounding boxes), and then we are faced with the geometry
> fragmentation: not just the fragmenting of individual geometries into
> multiple disk locations, but the fact that geographically close geometries
> are not close on disk.
>
> To summarise the fragmentation issues with regard to OSMImporter
> (Neo4j-Spatial):
>
> - In-geometry fragmentation: not ideal; we have two or more fragments per
> connected way. Can be improved with a two-pass import.
> - RTree fragmentation: not bad; we store the RTree in mostly contiguous
> space.
> - Close-geometries fragmentation: bad; we do not consider this at all in
> the importing.
>
> So how do we solve this last issue? I think that if we do write a two-pass
> importer, we should create an external temporary index of locations during
> the first pass, and then import into Neo4j in approximate location order.
> This will reduce fragmentation overall.
>
> And finally, a disclaimer: all of the above is based on my knowledge of the
> code of Neo4j-Spatial and thought experiments on what impact an alternative
> load order would have on graph performance. We have not demonstrated how
> much improvement will occur with any of these ideas. It could help a lot, or
> perhaps not much at all. There might be easier ways to make things faster.
> But honestly, right now we are not suffering badly on query time. Import
> time is still a bigger issue, IMHO.
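[Editor's note: one common way to realise "import in approximate location order" is to sort geometries by a space-filling-curve key of their centroid, e.g. a Morton (Z-order) key, before assigning ids. The key scheme and grid resolution below are illustrative choices, not anything Neo4j-Spatial implements.]

```python
# Sort geometries by a coarse Morton (Z-order) key so that geographically
# close geometries receive adjacent ids and end up close on disk.

def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` of x and y into a Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

def morton_key(lon, lat, bits=16):
    # Scale lon/lat into an unsigned grid before interleaving.
    gx = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
    gy = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    return interleave_bits(gx, gy, bits)

geoms = [("a", 11.97, 57.70), ("b", 139.69, 35.69), ("c", 11.98, 57.71)]
ordered = sorted(geoms, key=lambda g: morton_key(g[1], g[2]))
print([name for name, _, _ in ordered])  # nearby "a" and "c" end up adjacent
```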
> On Fri, Oct 7, 2011 at 3:59 PM, Peter Neubauer <[email protected]> wrote:
>
>> Daniel,
>> for OSM data and GIS, have you looked at
>> https://github.com/neo4j/spatial, especially the OSM examples at
>> https://github.com/neo4j/spatial/blob/master/src/test/java/org/neo4j/gis/spatial/pipes/GeoPipesTest.java
>> and
>> https://github.com/neo4j/spatial/blob/master/src/test/java/org/neo4j/gis/spatial/OsmAnalysisTest.java
>> ?
>>
>> Cheers,
>>
>> /peter neubauer
>>
>> GTalk: neubauer.peter
>> Skype: peter.neubauer
>> Phone: +46 704 106975
>> LinkedIn: http://www.linkedin.com/in/neubauer
>> Twitter: http://twitter.com/peterneubauer
>>
>> http://www.neo4j.org - Your high performance graph database.
>> http://startupbootcamp.org/ - Öresund - Innovation happens HERE.
>> http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.
>>
>> On Fri, Oct 7, 2011 at 3:46 PM, danielb <[email protected]> wrote:
>>> Hello Chris,
>>>
>>> thanks for your postings, they are a great starting point for me to
>>> assume a good performance of this database. However, I have some
>>> questions left.
>>> Let's say I have a node store and a property store. Both of them are
>>> individual files. I am going to implement a GIS application which fetches
>>> OSM data from the hard disk. When I load a bounding box I want to avoid
>>> too many random reads from the disk.
>>> More precisely, I want to load nodes and properties inside a given
>>> bounding box. It would be great if both the nodes and the properties were
>>> organized in successive blocks. Is there one id pool for both nodes and
>>> properties, so that I can load, for example, the nodes with ids 1 and 2
>>> and the properties 3, 4 and 5 with one block read? I could be totally
>>> wrong here, because if I save a new node file with ids 1 and 2, and then
>>> save a new property file with id 3, it will start on a new block (Windows
>>> block size, e.g. 4K).
>>> When then writing a new node id, it would be saved in the first block, I
>>> guess. What about fragmentation? And is there an improvement when using a
>>> Linux system (inodes? I don't know Linux well)? When I am finished with
>>> saving the nodes and properties, is there some way of reorganization on
>>> the hard disk? Let's say I want to enter a new node which is connected to
>>> a low id. Will it get the first free id (and be saved at the other end of
>>> the hard disk, perhaps), or does it just get an already-used id, with the
>>> following records being reorganized (insert performance)?
>>> Maybe I am totally wrong about this, but I would appreciate an efficient
>>> way of storing GIS data.
>>>
>>> best regards, Daniel
>>>
>>> --
>>> View this message in context:
>>> http://neo4j-community-discussions.438527.n3.nabble.com/Neo4j-low-level-data-storage-tp3336483p3402827.html
>>> Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.

_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user
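[Editor's note: on Daniel's block-read question, the answer hinges on the stores being arrays of fixed-size records (as Craig notes, nodes and properties live in separate files with separate id spaces). A record id then maps directly to a byte offset, so neighbouring ids within one store share OS blocks. The record and block sizes below are assumed for illustration only.]

```python
# Map a fixed-size record id to the filesystem block that holds it. Ids that
# land in the same block are fetched with a single block read; an id far away
# in the same store needs another read, and records in a *different* store
# (e.g. properties) always live in a different file entirely.

RECORD_SIZE = 9        # assumed fixed record size in bytes
BLOCK_SIZE = 4096      # typical filesystem block

def block_of(record_id):
    return (record_id * RECORD_SIZE) // BLOCK_SIZE

# Records 1 and 2 share block 0; a distant record needs a second read.
print(block_of(1), block_of(2), block_of(100_000))  # -> 0 0 219
```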

