Re: [Neo] Batch inserter performance

2010-05-20 Thread Johan Svensson
I did some benchmarking of SSD vs. "mechanical disk" using the batch
inserter, injecting a dataset (a social graph) that would need ~10GB to
fit in RAM, on a 4GB machine. Preliminary results indicate that there
is no difference.

The SSD used has about the same sequential write speed as the
mechanical disk, and I think that is why: the batch inserter tries to
read and write large chunks of data sequentially.

(In normal, non-batch-inserter mode the SSD is 50-100x faster on
non-cached reads.)

-Johan



Re: [Neo] Batch inserter performance

2010-05-18 Thread Lorin Halpert
I'm curious how performance would differ (or degrade) when using SSDs
instead of conventional HDDs once RAM is saturated. Anyone have numbers?



Re: [Neo] Batch inserter performance

2010-05-18 Thread Johan Svensson
Working with a 250M-relationship graph you need better hardware (more
RAM) to get good performance. The batch inserter tries to buffer as
much as possible in memory and then write sequentially to disk, but
with so little RAM it cannot do that and will instead have to load
data from disk and write it out whenever needed. You may even get
better performance by not using the batch inserter at all for the
last 150M relationships.
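
By that I mean switching to a normal EmbeddedGraphDatabase and
committing in chunks. A rough, untested sketch (the relationship type,
chunk size, and the edge source are placeholders for your own code):

    // uses org.neo4j.graphdb.* and org.neo4j.kernel.EmbeddedGraphDatabase
    GraphDatabaseService db = new EmbeddedGraphDatabase( "neodb" );
    Transaction tx = db.beginTx();
    try
    {
        int count = 0;
        for ( long[] edge : remainingEdges() ) // { startNodeId, endNodeId }
        {
            Node start = db.getNodeById( edge[0] );
            Node end = db.getNodeById( edge[1] );
            start.createRelationshipTo( end,
                DynamicRelationshipType.withName( "FOLLOWS" ) );
            if ( ++count % 50000 == 0 ) // commit in chunks
            {
                tx.success();
                tx.finish();
                tx = db.beginTx();
            }
        }
        tx.success();
    }
    finally
    {
        tx.finish();
    }
    db.shutdown();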

When there is enough RAM you should get more than 100k relationship
inserts/s on "standard server hardware" using the batch inserter (at
that rate, all 250M relationships would load in under an hour).

-Johan



Re: [Neo] Batch inserter performance

2010-05-18 Thread Alex Averbuch
Hi Johan,
Thanks.
At the moment I'm not using the property file but I'll start doing so next
time I load a graph like this.

The machine that's creating the database is actually my old one (my new
one's power supply died a few days ago so I'm waiting on a new one to
arrive), so I only have 2GB of RAM. Heap is set to 1.5GB at the moment.

Given my configuration, is the performance I described typical?

Alex



Re: [Neo] Batch inserter performance

2010-05-18 Thread Johan Svensson
Alex,

How large a heap and what configuration settings do you use? Injecting
250M random relationships at the highest possible speed would require
at least an 8GB heap, with most of it assigned to the relationship
store.
See 
http://wiki.neo4j.org/content/Batch_Insert#How_to_configure_the_batch_inserter_properly
for more information.
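
As a starting point, something along these lines in the properties
file should work for this dataset (values from memory, meant as a
sketch to tune against your actual store file sizes, not exact
numbers):

    neostore.nodestore.db.mapped_memory=100M
    neostore.relationshipstore.db.mapped_memory=5G
    neostore.propertystore.db.mapped_memory=100M
    neostore.propertystore.db.strings.mapped_memory=100M
    neostore.propertystore.db.arrays.mapped_memory=10M

Then pass it in when creating the inserter instead of using the
one-argument constructor:

    BatchInserter inserter = new BatchInserterImpl( "neodb",
        BatchInserterImpl.loadProperties( "neo.props" ) );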

-Johan



Re: [Neo] Batch inserter performance

2010-05-18 Thread Alex Averbuch
Correction, the performance had degraded from ~3500 Relationships/Second
to ~1500 Relationships/Second (1,000,000 in ~5 minutes is ~3,300/s; at
10-12 minutes it's ~1,400-1,700/s). Sloppy math... :)



[Neo] Batch inserter performance

2010-05-18 Thread Alex Averbuch
Hey,
I'm loading a graph from a proprietary binary file format into Neo4j using
the batch inserter.
The graph (Twitter crawl results) has 2,500,000 Nodes & 250,000,000
Relationships.

Here's what I'm doing:

(1) Insert all Nodes first. While doing so I also add 1 property (let's
call it CUSTOM_ID) and index it with Lucene.

(2) Call "optimize()" on the index

(3) Insert all the Relationships. I use CUSTOM_ID to look up the start & end
Nodes. Relationships have no properties.
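
Stripped down, the loading code looks roughly like this (the file
reading and ID plumbing are omitted, and FOLLOWS is just what I'm
calling the relationship type here):

    // uses org.neo4j.kernel.impl.batchinsert.* and org.neo4j.index.lucene.*
    BatchInserter inserter = new BatchInserterImpl( "neodb" );
    LuceneIndexBatchInserter index =
        new LuceneIndexBatchInserterImpl( inserter );

    // (1) nodes, each with CUSTOM_ID set and indexed in Lucene
    Map<String, Object> props = new HashMap<String, Object>();
    for ( long customId : nodeIds ) // read from the binary file
    {
        props.put( "CUSTOM_ID", customId );
        long nodeId = inserter.createNode( props );
        index.index( nodeId, "CUSTOM_ID", customId );
    }

    // (2) make subsequent lookups fast
    index.optimize();

    // (3) relationships, endpoints looked up by CUSTOM_ID
    for ( long[] edge : edges ) // { startCustomId, endCustomId }
    {
        long start = index.getSingleNode( "CUSTOM_ID", edge[0] );
        long end = index.getSingleNode( "CUSTOM_ID", edge[1] );
        inserter.createRelationship( start, end,
            DynamicRelationshipType.withName( "FOLLOWS" ), null );
    }

    index.shutdown();
    inserter.shutdown();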

The problem is that the insertion performance seems to decay quite quickly
as the size increases.
I'm keeping track of how long it takes to insert the records.
In the beginning it took about 5 minutes to insert 1,000,000 Relationships.
After about 50,000,000 inserted Relationships it was close to 10 minutes to
insert 1,000,000 Relationships.
By the time I was up to 70,000,000 it was taking 12 minutes to insert
1,000,000 Relationships.
That's a drop from ~7,000 Relationships/Second to ~3,000 Relationships/Second
and I'm worried that if this continues it could take over a week to load
this dataset.

Can you think of anything that I'm doing wrong?

I have a neo.prop file but I'm not using it... I create the batch inserter
with only 1 parameter (database directory).

Cheers,
Alex
___
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user