Re: [Neo] Batch inserter performance
I did some benchmarking of SSD vs. mechanical disk using the batch inserter, injecting a dataset (a social graph) that required ~10GB to fit in RAM, on a 4GB machine. Preliminary results indicate that there is no difference. The SSD used has about the same sequential write speed as the mechanical disk, and I think that is why: the batch inserter tries to read and write large chunks of data sequentially. (In normal, non-batch-inserter mode the SSD is 50-100x faster on non-cached reads.)

-Johan

On Tue, May 18, 2010 at 7:48 PM, Lorin Halpert wrote:
> I'm curious how performance would differ/degrade when using SSDs instead of
> old standard HDDs after RAM is saturated. Anyone have numbers?

___
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
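A rough sanity check on why the drives tie here: with a purely sequential access pattern, total time is just data size over sequential bandwidth. The ~10GB figure is from the test above; the 100 MB/s sequential rate is an assumed placeholder, roughly typical for both drive types at the time:

```java
public class SeqWriteEta {
    // Seconds to stream `bytes` to disk at `bytesPerSec`, assuming purely
    // sequential I/O (the batch inserter's access pattern, per the thread).
    static long secondsToWrite(long bytes, long bytesPerSec) {
        return bytes / bytesPerSec;
    }

    public static void main(String[] args) {
        long tenGb = 10L * 1024 * 1024 * 1024;
        long rate = 100L * 1024 * 1024; // assumed ~100 MB/s for BOTH drives
        // Equal sequential rates => equal total time, SSD or not.
        System.out.println(secondsToWrite(tenGb, rate) + " s"); // ~102 s
    }
}
```

The SSD's random-read advantage (the 50-100x Johan mentions) simply never enters this calculation.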
Re: [Neo] Batch inserter performance
I'm curious how performance would differ/degrade when using SSDs instead of old standard HDDs after RAM is saturated. Anyone have numbers?

On Tue, May 18, 2010 at 8:30 AM, Johan Svensson wrote:
> Working with a 250M-relationship graph you need better hardware (more RAM)
> to get good performance. The batch inserter tries to write as much as
> possible to memory, then write sequentially to disk, but since you have so
> little RAM it cannot do that and will instead have to load data from disk
> and write out whenever needed. You may even get better performance not
> using the batch inserter at all to insert the last 150M relationships.
>
> When there is enough RAM you should get more than 100k relationship
> inserts/s on "standard server hardware" using the batch inserter.
>
> -Johan
Re: [Neo] Batch inserter performance
Working with a 250M-relationship graph you need better hardware (more RAM) to get good performance. The batch inserter tries to write as much as possible to memory, then write sequentially to disk, but since you have so little RAM it cannot do that and will instead have to load data from disk and write out whenever needed. You may even get better performance not using the batch inserter at all to insert the last 150M relationships.

When there is enough RAM you should get more than 100k relationship inserts/s on "standard server hardware" using the batch inserter.

-Johan

On Tue, May 18, 2010 at 2:03 PM, Alex Averbuch wrote:
> Hi Johan,
> Thanks.
> At the moment I'm not using the property file but I'll start doing so next
> time I load a graph like this.
>
> The machine that's creating the database is actually my old one (my new
> one's power supply died a few days ago so I'm waiting on a new one to
> arrive), so I only have 2GB of RAM. Heap is set to 1.5GB at the moment.
>
> Given my configuration, is the performance I described typical?
>
> Alex
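A quick back-of-the-envelope check of those figures (the rates come from the messages in this thread; the helper itself is only illustrative):

```java
public class InsertEta {
    // Seconds to insert `remaining` relationships at `ratePerSec` inserts/s.
    static long etaSeconds(long remaining, long ratePerSec) {
        return remaining / ratePerSec;
    }

    public static void main(String[] args) {
        long remaining = 150_000_000L; // the ~150M relationships still to load
        // At the ~1,500 rels/s observed on the 2GB machine: ~28 hours.
        System.out.println(etaSeconds(remaining, 1_500) / 3600.0 + " h");
        // At the 100k rels/s quoted for well-provisioned hardware: 25 minutes.
        System.out.println(etaSeconds(remaining, 100_000) / 60 + " min");
    }
}
```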
Re: [Neo] Batch inserter performance
Hi Johan,
Thanks.
At the moment I'm not using the property file but I'll start doing so next time I load a graph like this.

The machine that's creating the database is actually my old one (my new one's power supply died a few days ago so I'm waiting on a new one to arrive), so I only have 2GB of RAM. Heap is set to 1.5GB at the moment.

Given my configuration, is the performance I described typical?

Alex

On Tue, May 18, 2010 at 1:50 PM, Johan Svensson wrote:
> Alex,
>
> How large a heap and what configuration settings do you use? To inject
> 250M random relationships at the highest possible speed would require at
> least an 8GB heap with most of it assigned to the relationship store. See
> http://wiki.neo4j.org/content/Batch_Insert#How_to_configure_the_batch_inserter_properly
> for more information.
>
> -Johan
Re: [Neo] Batch inserter performance
Alex,

How large a heap and what configuration settings do you use? To inject 250M random relationships at the highest possible speed would require at least an 8GB heap with most of it assigned to the relationship store. See
http://wiki.neo4j.org/content/Batch_Insert#How_to_configure_the_batch_inserter_properly
for more information.

-Johan

On Tue, May 18, 2010 at 10:50 AM, Alex Averbuch wrote:
> Correction, the performance had degraded from ~3,500 Relationships/Second
> to ~1,500 Relationships/Second. Sloppy math... :)
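For reference, a batch-inserter configuration along the lines the wiki page above describes might look like the sketch below. The keys are the standard neostore mapped-memory settings; the sizes are illustrative placeholders for a relationship-heavy graph like this one, not recommendations:

```
neostore.nodestore.db.mapped_memory=100M
neostore.relationshipstore.db.mapped_memory=5G
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.index.keys.mapped_memory=1M
neostore.propertystore.db.index.mapped_memory=1M
```

Note the file does nothing on its own: if your BatchInserter version accepts a config map as a second constructor argument, load these properties into a Map and pass it there, rather than using the directory-only constructor.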
Re: [Neo] Batch inserter performance
Correction: the performance had degraded from ~3,500 Relationships/Second to ~1,500 Relationships/Second. Sloppy math... :)

On Tue, May 18, 2010 at 10:46 AM, Alex Averbuch wrote:
> Hey,
> I'm loading a graph from a proprietary binary file format into Neo4j using
> the batch inserter.
> The graph (Twitter crawl results) has 2,500,000 Nodes & 250,000,000
> Relationships.
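The corrected figures follow directly from the timings in the original message (1M relationships in ~5 minutes early on, ~12 minutes later):

```java
public class InsertRate {
    // Relationships per second when `count` inserts take `minutes` minutes.
    static long ratePerSec(long count, long minutes) {
        return count / (minutes * 60);
    }

    public static void main(String[] args) {
        // 1M relationships in 5 minutes => ~3,333/s (hence "~3,500")
        System.out.println(ratePerSec(1_000_000L, 5));
        // 1M relationships in 12 minutes => ~1,388/s (hence "~1,500")
        System.out.println(ratePerSec(1_000_000L, 12));
    }
}
```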
[Neo] Batch inserter performance
Hey,
I'm loading a graph from a proprietary binary file format into Neo4j using the batch inserter.
The graph (Twitter crawl results) has 2,500,000 Nodes & 250,000,000 Relationships.

Here's what I'm doing:

(1) Insert all Nodes first. While doing so I also add 1 property (let's call it CUSTOM_ID) and index it with Lucene.

(2) Call "optimize()" on the index.

(3) Insert all the Relationships. I use CUSTOM_ID to look up the start & end Nodes. Relationships have no properties.

The problem is that the insertion performance seems to decay quite quickly as the size increases.
I'm keeping track of how long it takes to insert the records.
In the beginning it took about 5 minutes to insert 1,000,000 Relationships.
After about 50,000,000 inserted Relationships it was close to 10 minutes to insert 1,000,000 Relationships.
By the time I was up to 70,000,000 it was taking 12 minutes to insert 1,000,000 Relationships.
That's a drop from ~7,000 Relationships/Second to ~3,000 Relationships/Second and I'm worried that if this continues it could take over a week to load this dataset.

Can you think of anything that I'm doing wrong?

I have a neo.prop file but I'm not using it... I create the batch inserter with only 1 parameter (the database directory).

Cheers,
Alex
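The per-million timing described above can be captured with a small helper like the sketch below. It is self-contained and illustrative only; the loop body stands in for the actual batch-inserter call:

```java
import java.util.ArrayList;
import java.util.List;

public class ThroughputTracker {
    private final long reportEvery;   // records per reporting window
    private long count = 0;
    private long windowStart = System.nanoTime();
    private final List<Long> ratesPerSec = new ArrayList<>();

    ThroughputTracker(long reportEvery) { this.reportEvery = reportEvery; }

    // Call once per inserted record; logs the rate at each window boundary.
    void tick() {
        if (++count % reportEvery == 0) {
            long now = System.nanoTime();
            long elapsedNanos = Math.max(1, now - windowStart);
            long rate = reportEvery * 1_000_000_000L / elapsedNanos;
            ratesPerSec.add(rate);
            System.out.println(count + " records, ~" + rate + " records/s");
            windowStart = now;
        }
    }

    List<Long> rates() { return ratesPerSec; }

    public static void main(String[] args) {
        ThroughputTracker t = new ThroughputTracker(100_000);
        for (int i = 0; i < 300_000; i++) {
            t.tick(); // in real use: insert one relationship here
        }
        // Three windows of 100k ticks each were recorded.
        System.out.println(t.rates().size());
    }
}
```

Watching the per-window rates (rather than a single overall average) is what makes the decay from ~3,333/s toward ~1,389/s visible while the load is still running.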