Thanks for taking the time to look over my example, Johan. I was hoping that the batch inserter's memory cost would not grow linearly with the volume of data inserted - it sounds like it does? My assumption was that the indexing service had the comparatively hard task: random lookups on arbitrary keys in an ever-changing index, with sub-linear memory cost, sub-linear lookup speed and background merge tasks to avoid fragmentation over time. I had hoped the graph db could show similar qualities in its tasks of allocating new node ids and storing/retrieving related edges.
Cheers,
Mark

On Mon, Feb 21, 2011 at 2:30 PM, Johan Svensson <[email protected]> wrote:
> Mark,
>
> I had a look at this and you try to inject 130M relationships with a
> relationship store configured to 700M. That will not be an efficient
> insert. If your relationships and data are not sorted the batch
> inserter would have to unload and load blocks of data as soon as you
> get over around 22M relationships. To inject 130M relationships at
> full speed with random connections would require around 4G for the
> relationship store.
>
> -Johan
>
> On Fri, Feb 18, 2011 at 8:07 AM, Mark @ Gmail <[email protected]> wrote:
>> Hi Johan and others
>>>>I am having a hard time to follow what the problems really are since
>>>>conversation is split up in several thread
>> My fault, sorry. I was replying to a message posted before I subscribed to
>> the list so didn't have the original poster's email.
>>
>>>>as I understand it you are saying that it is the index lookups that are
>>>>taking to long time?
>>
>> In your current implementation, "Yes" - in the indexing implementation I
>> provide on that Google code project there is no performance issue.
>> However, having fixed the Lucene indexing issue it only reveals that the
>> *database* is now the bottleneck and blows up after 30 million edge inserts.
>> That is now the issue here.
>>
>> See the test results here :
>> http://code.google.com/p/graphdb-load-tester/wiki/TestResults
>>
>>>>For example inserting 500M relationships
>>>>requiring 1B index lookups (one for each node) with an avg index
>>>>lookup time of 1ms is 11 days worth of index lookup time.
>> That is why I suggested to Peter when he asked for help with indexing that a
>> Bloom filter helps "know what you don't know" and an LRU Cache helps hang
>> onto popular nodes. These are in my implementation and both avoid reads.
>> Re your suggestion about avoiding indexes by inserting in batches - I can't
>> see how that will help because you can sort input data by from node key or
>> to node key but will not necessarily end up with node pairs that are joined
>> by edges conveniently located in the same batch and will therefore need an
>> index service to add any edges - but as I say this is fixed in my
>> implementation and indexing is not the remaining issue - the database is.
>> I do encourage you to try run it.
>>
>> Cheers,
>> Mark
> _______________________________________________
> Neo4j mailing list
> [email protected]
> https://lists.neo4j.org/mailman/listinfo/user
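For readers following the thread, the Bloom filter plus LRU cache idea mentioned above can be sketched roughly as follows. This is only an illustration of the general technique, not the code from the graphdb-load-tester project: the class names, the dict standing in for the on-disk index, and the filter sizing are all assumptions made for the example.

```python
import hashlib
from collections import OrderedDict


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class LruCache:
    """Keeps the most recently used key -> node id pairs in memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


class NodeIdResolver:
    """Hypothetical helper: maps external keys to node ids, reading the
    slow index only when neither the cache nor the filter can answer."""

    def __init__(self, slow_index, cache_size=100_000):
        self.slow_index = slow_index  # assumption: a dict stands in for the real index
        self.bloom = BloomFilter()
        self.cache = LruCache(cache_size)
        self.next_id = 0
        self.index_reads = 0

    def resolve(self, key):
        cached = self.cache.get(key)
        if cached is not None:
            return cached  # popular node: no index read
        node_id = None
        if self.bloom.might_contain(key):
            self.index_reads += 1  # only now pay for a real lookup
            node_id = self.slow_index.get(key)
        if node_id is None:
            # Key never seen before: allocate a new node id and record it.
            node_id = self.next_id
            self.next_id += 1
            self.slow_index[key] = node_id
            self.bloom.add(key)
        self.cache.put(key, node_id)
        return node_id
```

The filter answers "definitely not seen yet" without touching the index, and the cache answers repeat requests for frequently used keys, so a read is only paid for keys that were probably seen before and have since dropped out of the cache.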

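The figures quoted in the thread can be checked with some quick arithmetic. The sketch below assumes a fixed relationship record size of 33 bytes; that number is not stated anywhere in the thread and is used here only because it is consistent with both the "around 22M" and the "around 4G" figures. The function names are made up for the example.

```python
SECONDS_PER_DAY = 86_400

# ASSUMPTION: fixed-size relationship records of 33 bytes each.
# The thread does not state the record size.
RECORD_BYTES = 33


def lookup_days(num_lookups, millis_per_lookup):
    """Total time spent on sequential index lookups, in days."""
    return num_lookups * millis_per_lookup / 1000 / SECONDS_PER_DAY


def store_bytes(num_relationships):
    """Bytes needed to hold every relationship record in memory."""
    return num_relationships * RECORD_BYTES


def records_that_fit(memory_bytes):
    """How many relationship records fit in the configured memory."""
    return memory_bytes // RECORD_BYTES


if __name__ == "__main__":
    # 500M relationships -> 1B lookups at 1 ms each
    print(f"{lookup_days(1_000_000_000, 1):.1f} days of index lookups")
    # 130M relationships held fully in memory
    print(f"{store_bytes(130_000_000) / 1024**3:.2f} GiB for the relationship store")
    # 700M configured for the relationship store
    print(f"{records_that_fit(700 * 1024**2):,} relationships fit in 700M")
```

Under that assumption the required memory is simply proportional to the number of relationships, which is the linear cost asked about at the top of this message.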
