Re: [Neo4j] OutOfMemory while populating large graph
Great, so maybe neo4j-index should be updated to depend on Lucene 2.9.3.

2010/7/9 Bill Janssen jans...@parc.com:
> Note that a couple of memory issues are fixed in Lucene 2.9.3: a leak when indexing big documents, and slow reclamation of space from the FieldCache.

--
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com
Re: [Neo4j] OutOfMemory while populating large graph
Modifications in a transaction are kept in memory so that the transaction can be rolled back completely if something goes wrong. There could of course be a solution (I'm just speculating here) where a transaction that grows big enough gets converted into its own graph database, or some other on-disk data structure, which would then be merged into the main database on commit.

Would it actually be worth something to be able to begin a transaction which auto-commits every X write operations, like a batch-inserter mode that can be used in a normal EmbeddedGraphDatabase? Kind of like:

    graphDb.beginTx( Mode.BATCH_INSERT )

...so that you can start such a transaction and then just insert data without having to care about restarting it every now and then?

Another view of this is that such big transactions (I'm assuming here) are mostly used for the first-time insertion of a big data set, which is exactly what the BatchInserter is for: it flushes to disk whenever it feels like it, and you can just keep feeding it more and more data.

2010/7/8 Rick Bullotta rick.bullo...@burningskysoftware.com:
> Paul, I also would like to see automatic swapping/paging to disk as part of Neo4j, minimally when in bulk-insert mode, and ideally in every usage scenario. I don't fully understand why the in-memory logs get so large and/or aren't backed by the on-disk log, or, if they are, why they need to be kept in memory as well. Perhaps it isn't the transaction stuff that is taking up memory, but the graph itself?

--
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com
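[Editor's note: since the BatchInserter is the recommended tool here, below is a minimal sketch of first-time bulk loading with it. It assumes the org.neo4j.kernel.impl.batchinsert API of the 1.x kernel from this era; the store path "target/graphdb" and the property values are made up.]

import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

public class BulkLoad
{
    public static void main( String[] args )
    {
        // Non-transactional inserter: it flushes to disk on its own,
        // so there is no transaction state to grow and no OutOfMemoryError
        // from keeping modifications in memory.
        BatchInserter inserter = new BatchInserterImpl( "target/graphdb" );
        try
        {
            Map<String, Object> props = new HashMap<String, Object>();
            props.put( "name", "node 1" );
            long node1 = inserter.createNode( props );

            props.put( "name", "node 2" );
            long node2 = inserter.createNode( props );

            inserter.createRelationship( node1, node2,
                    DynamicRelationshipType.withName( "KNOWS" ), null );
        }
        finally
        {
            // shutdown() is what leaves the store consistent; killing
            // the JVM before this point can corrupt the database.
            inserter.shutdown();
        }
    }
}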
Re: [Neo4j] OutOfMemory while populating large graph
Hi,

2010/7/9 Mattias Persson matt...@neotechnology.com:
> Would it actually be worth something to be able to begin a transaction which auto-commits every X write operations, like a batch-inserter mode that can be used in a normal EmbeddedGraphDatabase?

That's cool! Does that already exist? In my code (like others on the list, it seems) I have a counter that commits every 20,000 inserts (some made-up number that is not going to throw an OutOfMemoryError) and then reopens a new transaction. Sorta sux.

Thanks,
Marko.
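[Editor's note: for reference, a minimal sketch of the counter pattern Marko describes, assuming the 1.x embedded API (tx.success() / tx.finish()); createOneRecord() is a hypothetical stand-in for the actual insert logic.]

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

public class PeriodicCommit
{
    private static final int BATCH_SIZE = 20000; // made-up number, tune to your heap

    public static void load( GraphDatabaseService graphDb, int totalRecords )
    {
        Transaction tx = graphDb.beginTx();
        try
        {
            for ( int i = 1; i <= totalRecords; i++ )
            {
                createOneRecord( graphDb ); // hypothetical: your insert/index logic

                // Commit and reopen so transaction state (including the
                // Lucene tx cache) never holds more than BATCH_SIZE operations.
                if ( i % BATCH_SIZE == 0 )
                {
                    tx.success();
                    tx.finish();
                    tx = graphDb.beginTx();
                }
            }
            tx.success();
        }
        finally
        {
            tx.finish();
        }
    }

    private static void createOneRecord( GraphDatabaseService graphDb )
    {
        graphDb.createNode(); // placeholder for real work
    }
}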
Re: [Neo4j] OutOfMemory while populating large graph
I've a similar problem. Although I'm not going out of memory yet, I can see the heap constantly growing, and JProfiler says most of it is due to the Lucene indexing. And even if I commit after every X operations, once the population is finished, the final commit is done, and the graph db is closed - the heap stays like that, almost full. An explicit GC will clean up some of it, but not all.

Arijit

On 9 July 2010 17:00, Mattias Persson matt...@neotechnology.com wrote:
> A cool thing with just telling it to do a batch-insert-mode transaction (rather than giving an actual commit interval) is that it could look at how much memory it has to play with and commit whenever that would be most efficient, even changing the limit on the fly if memory suddenly ran low.

--
And when the night is cloudy, There is still a light that shines on me, Shine on until tomorrow, let it be.
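[Editor's note: one thing worth checking in this situation - an assumption on my part, not something Arijit confirmed - is shutdown order. With the standalone neo4j-index component of this era, the IndexService held its own caches and Lucene writer and had to be shut down explicitly, before the graph database. A sketch:]

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.index.IndexService;
import org.neo4j.index.lucene.LuceneIndexService;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class ShutdownOrder
{
    public static void main( String[] args )
    {
        GraphDatabaseService graphDb = new EmbeddedGraphDatabase( "target/graphdb" );
        IndexService indexService = new LuceneIndexService( graphDb );
        try
        {
            // ... populate and index ...
        }
        finally
        {
            // The index service holds its own caches and index writer;
            // shutting it down releases them. Order matters: index service
            // first, then the database.
            indexService.shutdown();
            graphDb.shutdown();
        }
    }
}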
Re: [Neo4j] OutOfMemory while populating large graph
Short answer is maybe. ;-) There are some cases where the transaction is an all-or-nothing scenario, and others where incremental commits are OK. Having the ability to do incremental auto-commits would be useful, however. In a perfect world, it could be based on a count (e.g. every X operations), a time interval (e.g. every 30 seconds), or a memory-usage rule.

-----Original Message-----
From: Mattias Persson
Sent: Friday, July 09, 2010 7:30 AM
To: Neo4j user discussions
Subject: Re: [Neo4j] OutOfMemory while populating large graph

> Would it actually be worth something to be able to begin a transaction which auto-commits every X write operations, like a batch-inserter mode that can be used in a normal EmbeddedGraphDatabase?
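[Editor's note: nothing like this existed in the API being discussed. Purely to illustrate Rick's idea, here is a hypothetical wrapper (all names made up) that commits on whichever of a count or time threshold trips first; a memory rule could be added the same way, e.g. via Runtime.getRuntime().freeMemory().]

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

// Hypothetical helper, not part of Neo4j: auto-commits the underlying
// transaction every maxOps write operations or every maxMillis milliseconds,
// whichever comes first.
public class AutoCommitter
{
    private final GraphDatabaseService graphDb;
    private final int maxOps;
    private final long maxMillis;

    private Transaction tx;
    private int ops;
    private long lastCommit;

    public AutoCommitter( GraphDatabaseService graphDb, int maxOps, long maxMillis )
    {
        this.graphDb = graphDb;
        this.maxOps = maxOps;
        this.maxMillis = maxMillis;
        this.tx = graphDb.beginTx();
        this.lastCommit = System.currentTimeMillis();
    }

    /** Call once after each write operation. */
    public void operationPerformed()
    {
        ops++;
        long now = System.currentTimeMillis();
        if ( ops >= maxOps || now - lastCommit >= maxMillis )
        {
            // Commit and reopen so pending state stays bounded.
            tx.success();
            tx.finish();
            tx = graphDb.beginTx();
            ops = 0;
            lastCommit = now;
        }
    }

    /** Commit whatever is pending and stop. */
    public void close()
    {
        tx.success();
        tx.finish();
    }
}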
Re: [Neo4j] OutOfMemory while populating large graph
I confess I had not investigated the batch inserter. From the description it fits my requirements exactly.

With respect to auto-commits, it seems there are two use cases. The first is everyday operations that might run out of memory. In this case it might be nice for Neo4j to swap memory out to temporary disk as needed; if this performs acceptably, I think it should be the default behavior. The second is the initial population of a graph, where there is no need for rollback and so no need to commit to a temporary location. In this case, having Neo4j decide when to commit seems ideal.

My concern with the first use case is that swapping to temporary storage at ideal intervals may still be less efficient than having the user commit to permanent storage at less-than-ideal intervals. If that is the case, then the only real justification for committing to temporary storage would be a requirement to roll back a transaction larger than memory can hold.

-Paul

-----Original Message-----
From: Mattias Persson
Sent: Friday, July 09, 2010 7:30 AM
To: Neo4j user discussions
Subject: Re: [Neo4j] OutOfMemory while populating large graph

> Would it actually be worth something to be able to begin a transaction which auto-commits every X write operations, like a batch-inserter mode that can be used in a normal EmbeddedGraphDatabase?
Re: [Neo4j] OutOfMemory while populating large graph
Note that a couple of memory issues are fixed in Lucene 2.9.3: a leak when indexing big documents, and slow reclamation of space from the FieldCache.

Bill

Arijit Mukherjee ariji...@gmail.com wrote:
> I've a similar problem. Although I'm not going out of memory yet, I can see the heap constantly growing, and JProfiler says most of it is due to the Lucene indexing.
[Neo4j] OutOfMemory while populating large graph
I have seen people discuss committing transactions after some micro-batch of a few hundred records, but I thought this was optional. I thought Neo4j would automatically write out to disk as memory became full. Well, I encountered an OOM and want to make sure that I understand the reason. Was my understanding incorrect, is there a parameter that I need to set to some limit, or is the problem that I am indexing as I go? The stack trace, FWIW, is:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.HashMap.<init>(HashMap.java:209)
        at java.util.HashSet.<init>(HashSet.java:86)
        at org.neo4j.index.lucene.LuceneTransaction$TxCache.add(LuceneTransaction.java:334)
        at org.neo4j.index.lucene.LuceneTransaction.insert(LuceneTransaction.java:93)
        at org.neo4j.index.lucene.LuceneTransaction.index(LuceneTransaction.java:59)
        at org.neo4j.index.lucene.LuceneXaConnection.index(LuceneXaConnection.java:94)
        at org.neo4j.index.lucene.LuceneIndexService.indexThisTx(LuceneIndexService.java:220)
        at org.neo4j.index.impl.GenericIndexService.index(GenericIndexService.java:54)
        at org.neo4j.index.lucene.LuceneIndexService.index(LuceneIndexService.java:209)
        at JiraLoader$JiraExtractor$Item.setNodeProperty(JiraLoader.java:321)
        at JiraLoader$JiraExtractor$Item.updateGraph(JiraLoader.java:240)

Thanks,
Paul Jackson
Re: [Neo4j] OutOfMemory while populating large graph
Paul, I also would like to see automatic swapping/paging to disk as part of Neo4j, minimally when in bulk-insert mode, and ideally in every usage scenario. I don't fully understand why the in-memory logs get so large and/or aren't backed by the on-disk log, or, if they are, why they need to be kept in memory as well. Perhaps it isn't the transaction stuff that is taking up memory, but the graph itself? Can any of the Neo team help provide some insight? Thanks!

-----Original Message-----
From: Paul A. Jackson
Sent: Thursday, July 08, 2010 1:35 PM
To: User@lists.neo4j.org
Subject: [Neo4j] OutOfMemory while populating large graph

> I thought Neo4j would automatically write out to disk as memory became full. Well, I encountered an OOM and want to make sure that I understand the reason.