Re: [Neo4j] Batch Inserter - db scaling issue (not index scaling issue)

Mark Harwood Mon, 21 Feb 2011 09:35:17 -0800

Thanks for taking the time to look over my example, Johan.

I was hoping that the batch inserter's memory cost would not grow
linearly with the volume of data inserted, but it sounds like it does?
My assumption was that the indexing service had the comparatively
harder task: random lookups on arbitrary keys in an ever-changing
index, with sub-linear memory cost, sub-linear lookup time and
background merge tasks to avoid fragmentation over time. I had hoped
the graph db could have similar qualities in its tasks of allocating
new node ids and storing/retrieving related edges.
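
A rough check of the linear cost, using the figures Johan quotes below
(a 700M store filling up at around 22M relationships, and around 4G
needed for 130M). This is only a back-of-envelope sketch. The 33-byte
record size is an assumption inferred from those two figures, not
something stated in this thread, and the function names are made up
for illustration.

```python
# Back-of-envelope check of the store sizes quoted in this thread.
# ASSUMPTION: a fixed-size relationship record of 33 bytes. The thread does
# not state the record size; it is inferred from "700M holds about 22M".
RECORD_BYTES = 33
MIB = 1024 * 1024
GIB = 1024 ** 3


def relationships_that_fit(store_mib, record_bytes=RECORD_BYTES):
    """How many fixed-size records fit in a store of the given size."""
    return (store_mib * MIB) // record_bytes


def store_gib_needed(relationships, record_bytes=RECORD_BYTES):
    """Store size (in GiB) needed to hold every record in memory."""
    return relationships * record_bytes / GIB


fit = relationships_that_fit(700)
need = store_gib_needed(130_000_000)
print(f"700M store holds about {fit / 1e6:.1f}M relationships")
print(f"130M relationships need about {need:.1f}G")
```

With a fixed record size the memory needed for full-speed random
inserts scales directly with the number of relationships, which is the
linear behaviour asked about above.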


Cheers,
Mark

On Mon, Feb 21, 2011 at 2:30 PM, Johan Svensson <[email protected]> wrote:
> Mark,
>
> I had a look at this and you are trying to insert 130M relationships
> with a relationship store configured to 700M. That will not be an
> efficient insert. If your relationships and data are not sorted, the
> batch inserter has to unload and load blocks of data as soon as you
> get past around 22M relationships. Inserting 130M relationships at
> full speed with random connections would require around 4G for the
> relationship store.
>
> -Johan
>
> On Fri, Feb 18, 2011 at 8:07 AM, Mark @ Gmail <[email protected]> wrote:
>> Hi Johan and others
>>>>I am having a hard time following what the problems really are, since
>>>>the conversation is split up across several threads
>> My fault, sorry. I was replying to a message posted before I subscribed to
>> the list, so I didn't have the original poster's email.
>>
>>>>as I understand it you are saying that it is the index lookups that are
>>>>taking too long?
>>
>> In your current implementation, "Yes". In the indexing implementation I
>> provide on that Google code project there is no performance issue.
>> However, fixing the Lucene indexing issue only reveals that the
>> *database* is now the bottleneck: it blows up after 30 million edge inserts.
>> That is now the issue here.
>>
>> See the test results here : 
>> http://code.google.com/p/graphdb-load-tester/wiki/TestResults
>>
>>>>For example inserting 500M relationships
>>>>requiring 1B index lookups (one for each node) with an avg index
>>>>lookup time of 1ms is 11 days worth of index lookup time.
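
The 11-day figure above can be reproduced with simple arithmetic. The
sketch below is illustrative only: the function name and the
`avoided_fraction` parameter are hypothetical, added to show how the
total shrinks if some share of lookups never reaches the index.

```python
SECONDS_PER_DAY = 86_400


def index_lookup_days(relationships, avg_lookup_ms, avoided_fraction=0.0):
    """Total index lookup time in days.

    Each relationship needs two lookups (start node and end node).
    avoided_fraction is a hypothetical share of lookups that are answered
    without touching the index at all.
    """
    lookups = 2 * relationships * (1.0 - avoided_fraction)
    return lookups * (avg_lookup_ms / 1000.0) / SECONDS_PER_DAY


baseline = index_lookup_days(500_000_000, avg_lookup_ms=1.0)
print(f"{baseline:.1f} days")  # about 11.6 days for 1B lookups at 1ms each
```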
>> That is why I suggested to Peter, when he asked for help with indexing, that
>> a Bloom filter helps "know what you don't know" and an LRU cache helps hang
>> onto popular nodes. Both are in my implementation and both avoid reads.
>> Re your suggestion about avoiding indexes by inserting in batches: I can't
>> see how that will help. You can sort the input data by from-node key or
>> to-node key, but you will not necessarily end up with node pairs joined
>> by an edge conveniently located in the same batch, so you will still need
>> an index service to add any edges. As I say, this is fixed in my
>> implementation and indexing is not the remaining issue; the database is.
>> I do encourage you to try running it.
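
To illustrate the Bloom filter plus LRU cache idea described above,
here is a minimal sketch. It is not the code from the Google code
project mentioned in this thread. The class names, the sizes and the
plain dict standing in for the slow on-disk index are all assumptions
made for the example.

```python
import hashlib
from collections import OrderedDict


class BloomFilter:
    """Answers "definitely not seen" or "possibly seen" for a key."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(key))


class LruCache:
    """Keeps the most recently used entries, evicting the oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)


class NodeIdLookup:
    """Key -> node id lookup that tries to avoid reads of a slow index."""

    def __init__(self, slow_index, cache_size=1000):
        self.slow_index = slow_index  # stand-in for an on-disk index
        self.bloom = BloomFilter()
        self.cache = LruCache(cache_size)
        self.slow_reads = 0           # counts reads that hit the slow index

    def put(self, key, node_id):
        self.slow_index[key] = node_id
        self.bloom.add(key)
        self.cache.put(key, node_id)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None               # never inserted: no read needed
        cached = self.cache.get(key)
        if cached is not None:
            return cached             # popular node: no read needed
        self.slow_reads += 1
        node_id = self.slow_index.get(key)
        if node_id is not None:
            self.cache.put(key, node_id)
        return node_id


lookup = NodeIdLookup(slow_index={}, cache_size=2)
for node_id, key in enumerate(["alice", "bob", "carol"], start=1):
    lookup.put(key, node_id)          # "alice" is evicted from the cache
for key in ("carol", "alice", "zed"):
    print(key, lookup.get(key), "slow reads so far:", lookup.slow_reads)
```

Only the lookup for "alice", which has been evicted from the cache,
should reach the slow index here. "carol" is served from the cache and
"zed" is rejected by the Bloom filter, barring a false positive.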
>>
>> Cheers,
>> Mark
> _______________________________________________
> Neo4j mailing list
> [email protected]
> https://lists.neo4j.org/mailman/listinfo/user
>