Re: [Neo4j] Lucene index commit rate and NoSuchElementException
On Wed, Feb 2, 2011 at 1:20 PM, Tobias Ivarsson tobias.ivars...@neotechnology.com wrote:
> You are doing I/O bound work. More than two threads is most likely just going to add overhead and make things slower!

I'm certainly doing something weird, because the performance of my tests isn't linear. I've run the app and let it process two big data files, and these are the results: 298 rows in 3465742 ms and 2897177 rows in 3483767 ms, which are values not in line with the other test, which used only 10 rows. With more rows the performance dropped to between 1.160404947 ms and 1.202469507 ms per row.

The overall DB size (I measure only the db directory) has grown far beyond the expected values as well: 4.9G ... Do you or anyone else have any clue?

Thanks
--
Massimo
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
2011/2/3 Massimo Lusetti mluse...@gmail.com:
> I'm certainly doing something wired cause the performance of my tests aren't linear. [...] The overall DB size (I measure only the db directory) has grown as well far behond the expected values: 4.9G ... Do you or anyone else have any clue?

Lucene lookup performance degrades the bigger the index gets. That may be a reason.

--
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
On Thu, Feb 3, 2011 at 11:30 AM, Mattias Persson matt...@neotechnology.com wrote:
> Lucene lookup performance degrades the bigger the index gets. That may be a reason.

I don't think Lucene is unable to handle an index with 6/7 million entries. Maybe there are some logs around?

Cheers
--
Massimo
http://meridio.blogspot.com
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
Massimo,
yesterday I tried to import the Germany OpenStreetMap dataset into Neo4j using Lucene indexing. There are around 60M nodes that are all indexed into Lucene, and then looked up when the Ways, each consisting of a number of nodes, are calculated. Lucene is not fast, but it works on these 60M entries. And on inserts it scales reasonably even after a couple of million inserts, which wasn't the case when I tried BerkeleyDB (Java), JDBM or other K/V stores.

Actually, if anyone knows a better index for exact lookups that is configurable to have good insert and lookup performance, any hint would be greatly appreciated. This is what right now limits insert performance when you use the Neo4j BatchInserter. The BatchInserter itself can insert around 200K-400K nodes or relationships per second, but the indexing subsystems are just not up to that speed. Help is greatly appreciated!

Cheers,
/peter neubauer

GTalk: neubauer.peter
Skype: peter.neubauer
Phone: +46 704 106975
LinkedIn: http://www.linkedin.com/in/neubauer
Twitter: http://twitter.com/peterneubauer

http://www.neo4j.org - Your high performance graph database.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.

On Thu, Feb 3, 2011 at 11:52 AM, Massimo Lusetti mluse...@gmail.com wrote:
> I don't think Lucene cannot handle an index with 6/7 million of entries. Maybe are some logs around?
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
On Thu, Feb 3, 2011 at 2:01 PM, Peter Neubauer peter.neuba...@neotechnology.com wrote:
> I yesterday just tried to import the Germany OpenStreetMap dataset into Neo4j using Lucene indexing. There are around 60M nodes that all are indexed into Lucene [...] Lucene is not fast, but it works on these 60M entries.

Happy to see good results, so the question seems to be how to fine-tune Lucene, or the use Neo4j makes of Lucene...

> Actually, if anyone knows a better index for exact lookups that is configurable to have good insert and lookup performance, any hint would be greatly appreciated. [...] The BatchInserter itself can insert around 200K-400K nodes or relationships per second, but the indexing subsystems are just not up to that speed.

I'm not using the BatchInserter because I will need to do this work on a daily basis, and the normal operations are a lot slower.

> Help is greatly appreciated!

Indeed. Thanks in advance.

Cheers
--
Massimo
http://meridio.blogspot.com
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
2011/2/3 Massimo Lusetti mluse...@gmail.com:
> Happy to see good results so the question seems how to fine tune Lucene or the usage which neo4j make of Lucene...

That was just one suggestion I had; maybe there are other things affecting the performance. I mean, we should stop guessing and actually measure the bottlenecks.

--
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
On Tue, Feb 1, 2011 at 10:19 PM, Tobias Ivarsson tobias.ivars...@neotechnology.com wrote:
> For getting a performance boost out of writes, doing multiple operations in one transaction will give a much bigger gain than multiple threads though. For your use case, I think two writer threads and a few hundred elements per transaction is an appropriate size.

I got some numbers. On a base of 10 rounds of 10 rows each, I got an average of 111.1811 sec to crunch a chunk of data, which means it takes 1.111811 ms to process a single row. A single file contains data for a single day and has an average of 250 rows, so it would take approximately 46 minutes to crunch. The final db size is 588034K (574M) for 100 rows, so we can estimate that the final DB size would be 132307650K (126G). The current SQL DB is 60G, and the app takes 4 and 1/2 hours to crunch a month of logs on an identical machine.

The test was conducted starting from an empty db, and the progression of chunk times is: 80307 ms, 83444 ms, 97162 ms, 131703 ms, 134647 ms, 104602 ms, 115944 ms, 112489 ms, 115660 ms, 135853 ms. There are 20 threads (default config) and they're processing 200 rows each within a single Transaction.

The JVM is OpenJDK Runtime Environment (build 1.6.0-b20) and was started with these options: -Djava.awt.headless=true -Xms40m -Xmx1536m -XX:+UseConcMarkSweepGC

And ... what do you think? Thanks for any info you could give
--
Massimo
http://meridio.blogspot.com
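For reference, the 200-rows-per-transaction scheme described above can be sketched with plain JDK concurrency primitives. This is only an illustration of the chunking structure: the class and method names are made up, and the Neo4j transaction calls are replaced by counters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ChunkedInsertSketch {
    static final AtomicInteger commits = new AtomicInteger();
    static final AtomicInteger rowsWritten = new AtomicInteger();

    // Stand-in for one Neo4j transaction wrapping a few hundred rows.
    static void insertChunk(List<String> chunk) {
        // In real code: tx = graphDb.beginTx(); ... tx.success(); tx.finish();
        rowsWritten.addAndGet(chunk.size());
        commits.incrementAndGet();
    }

    /** Partitions rows into chunkSize-row transactions, run on writerThreads threads. */
    public static int[] run(List<String> rows, int chunkSize, int writerThreads) {
        commits.set(0);
        rowsWritten.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(writerThreads);
        for (int i = 0; i < rows.size(); i += chunkSize) {
            final List<String> chunk =
                new ArrayList<>(rows.subList(i, Math.min(i + chunkSize, rows.size())));
            pool.submit(() -> insertChunk(chunk));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] { commits.get(), rowsWritten.get() };
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) rows.add("row-" + i);
        int[] result = run(rows, 200, 2); // 200 rows per transaction, 2 writer threads
        System.out.println(result[0] + " commits, " + result[1] + " rows"); // 5 commits, 1000 rows
    }
}
```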
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
More threads != faster.

You are doing I/O bound work. More than two threads is most likely just going to add overhead and make things slower!

Also, I'm wondering: what does crunch mean in this context? Is it the write operations we have been talking about, or is it some other operation? I'm thinking that if it takes 45 minutes to insert a day's worth of data, that is fast enough to do in real time. If Neo4j allows you to *process* that data faster and in new ways, then that would be gain enough, and probably warrant slightly longer insert times.

-tobias

On Wed, Feb 2, 2011 at 12:17 PM, Massimo Lusetti mluse...@gmail.com wrote:
> I got some numbers. On a base of 10 rounds of 10 rows each I got an average of 111.1811 sec to crunch a chunk of data [...] There are 20 threads (default config) and they're processing 200 rows each within a single Transaction.

--
Tobias Ivarsson tobias.ivars...@neotechnology.com
Hacker, Neo Technology
www.neotechnology.com
Cellphone: +46 706 534857
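As a sanity check, the timing figures quoted above are mutually consistent; the row counts in the message appear to have lost their magnitude suffixes, so the chunk and per-day sizes below are back-computed from the timings, an inference rather than something stated in the thread.

```java
public class ThroughputCheck {
    // 111.1811 s per chunk at 1.111811 ms/row implies a chunk of ~100,000 rows.
    static long rowsPerChunk(double chunkSeconds, double msPerRow) {
        return Math.round(chunkSeconds * 1000.0 / msPerRow);
    }

    // 46 minutes at 1.111811 ms/row implies roughly 2.48M rows per daily file.
    static long rowsPerDay(double minutes, double msPerRow) {
        return Math.round(minutes * 60.0 * 1000.0 / msPerRow);
    }

    public static void main(String[] args) {
        System.out.println("rows per chunk ~ " + rowsPerChunk(111.1811, 1.111811));
        System.out.println("rows per day   ~ " + rowsPerDay(46.0, 1.111811));
    }
}
```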
[Neo4j] Lucene index commit rate and NoSuchElementException
Hi everyone, I'm new to Neo4j and I'm gaining experience with it. I have a fairly big table (in my current db) which consists of something more than 220 million rows. I want to put that in a graph db, for instance Neo4j, and graph it to do some statistics on it. In my vision every row will be a node with just one property, and will have an index to check — since I'm using small chunks of data to test — whether I've already imported that specific row.

I'm pretty sure there aren't two identical rows in the DB table, due to db constraints, but I still get these errors as the import proceeds:

java.util.NoSuchElementException: More than one element in
org.neo4j.index.impl.lucene.LuceneIndex$1@77435b. First element is
'Node[201728]' and the second element is 'Node[201744]'

Does this have something to do with the commit rate of the Neo4j internal Lucene index? Or am I doing something wrong? BTW I'm committing on every row imported:

String rowMD5 = md5Source.digest(logRow.getRaw().getBytes());
Transaction tx = graphDb.beginTx();
Index<Node> md5Index = graphDb.index().forNodes("md5-rows");
if (md5Index.get("md5", rowMD5).getSingle() == null) {
    Node rowNode = graphDb.createNode();
    String[] logRowArray = LogRow.asStringArray(logRow);
    rowNode.setProperty("logRow", logRowArray);
    md5Index.add(rowNode, "md5", rowMD5);
    tx.success();
}
tx.finish();

Thanks for any clue... Regards
--
Massimo
http://meridio.blogspot.com
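For what it's worth, the check-then-add flow above can be exercised outside Neo4j with an in-memory map standing in for the Lucene index. The class below is purely illustrative (the map replaces the index and transaction APIs), so it only demonstrates the single-threaded logic, not the real index behaviour.

```java
import java.util.HashMap;
import java.util.Map;

public class DedupFlowSketch {
    // Stand-in for the index: md5 -> node id, mimicking index.get("md5", rowMD5).getSingle().
    static final Map<String, Long> index = new HashMap<>();
    static long nextNodeId = 0;

    /** Returns true if a new "node" was created, false if the row was already present. */
    static boolean insertIfAbsent(String rowMD5) {
        if (index.get(rowMD5) == null) {   // md5Index.get("md5", rowMD5).getSingle() == null
            long nodeId = nextNodeId++;    // graphDb.createNode()
            index.put(rowMD5, nodeId);     // md5Index.add(rowNode, "md5", rowMD5)
            return true;                   // tx.success()
        }
        return false;                      // transaction ends without success -> rolled back
    }

    public static void main(String[] args) {
        System.out.println(insertIfAbsent("d41d8cd9")); // true: first time seen
        System.out.println(insertIfAbsent("d41d8cd9")); // false: duplicate skipped
    }
}
```

With a single thread this logic never produces duplicates; the exception in the thread only appears once several threads interleave the check and the add.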
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
Since you are checking for existence before inserting, the conflict you are getting is strange. Are you running multiple insertion threads?

-Tobias

On Tue, Feb 1, 2011 at 6:19 PM, Massimo Lusetti mluse...@gmail.com wrote:
> I'm pretty sure there aren't two identical rows in the DB table due to db constraints but I still get this errors as import proceed:
>
> java.util.NoSuchElementException: More than one element in
> org.neo4j.index.impl.lucene.LuceneIndex$1@77435b. First element is
> 'Node[201728]' and the second element is 'Node[201744]'
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
Also, have you been running this insert multiple times without cleaning up the database between runs?

Cheers,
/peter neubauer

On Tue, Feb 1, 2011 at 6:36 PM, Tobias Ivarsson tobias.ivars...@neotechnology.com wrote:
> Since you are checking for existence before inserting the conflict you are getting is strange. Are you running multiple insertion threads?
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
Hmm, MD5 is not a unique hashing function, so it might be that you get the same hash for different byte arrays. Can you output the MD5 of the multiple logRows that are returned by the index?

Michael

On 01.02.2011 at 18:19, Massimo Lusetti wrote:
> I'm pretty sure there aren't two identical rows in the DB table due to db constraints but I still get this errors as import proceed:
>
> java.util.NoSuchElementException: More than one element in org.neo4j.index.impl.lucene.LuceneIndex$1@77435b. First element is 'Node[201728]' and the second element is 'Node[201744]'
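One way to dump comparable digests, assuming the rows are UTF-8 strings. Note that MessageDigest.digest returns a byte[], so using it as a readable index key needs an explicit hex encoding; the helper name here is made up.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Hex {
    /** Hex-encoded MD5 of a string, suitable for printing or as an index key. */
    static String md5(String raw) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(raw.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(32);
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available in the JDK
        }
    }

    public static void main(String[] args) {
        // Printing the digest of each conflicting logRow would show whether two
        // different rows really hashed to the same value.
        System.out.println(md5("")); // d41d8cd98f00b204e9800998ecf8427e
    }
}
```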
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
Seems a little weird; the commit rate won't affect the end result, just performance (more operations per commit means faster performance). Your code seems correct for single-threaded use, btw.

On Tuesday, 1 February 2011, Michael Hunger michael.hun...@neotechnology.com wrote:
> Hmm MD5 is not a unique hashing function so it might be that you get the same hash for different byte arrays. Can you output the MD5 of the multiple logRow's that are returned by the index.

--
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
On Tue, Feb 1, 2011 at 6:36 PM, Tobias Ivarsson tobias.ivars...@neotechnology.com wrote:
> Since you are checking for existence before inserting the conflict you are getting is strange. Are you running multiple insertion threads?

Yep, I've got 20 concurrent threads doing the job. I'd forgotten about them, thanks!

Cheers
--
Massimo
http://meridio.blogspot.com
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
On Tue, Feb 1, 2011 at 6:43 PM, Peter Neubauer peter.neuba...@neotechnology.com wrote:
> Also, have you been running this insert multiple times without cleaning up the database between runs?

Nope, for the tests I wipe (rm -rf) the db dir every run.

Cheers
--
Massimo
http://meridio.blogspot.com
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
On Tue, Feb 1, 2011 at 8:02 PM, Mattias Persson matt...@neotechnology.com wrote:
> Seems a little weird, the commit rate won't affect the end result, just performance (more operations per commit means faster performance). Your code seems correct for single threaded use btw.

Does it mean that I cannot access the graph db from multiple threads? That code is in a singleton service which exposes the GraphDatabaseService through an addNode() method, from which I run that code. The singleton service is called by a thread pool which can fire at most 20 concurrent threads. Any hint is really appreciated.

Cheers
--
Massimo
http://meridio.blogspot.com
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
No, it means that you have to synchronize the threads so that they don't insert the same data concurrently.

Perhaps a ConcurrentHashMap<MD5, Token> where you would putIfAbsent(md5, new Object()) when you start working on a new hash. If the token Object you get back is not the same as the one you put in, you know that another thread is working on that md5, which means this thread should move on to another one. When the transaction is done you remove the md5 from the Map, to ensure that you don't leak memory. That's a simple locking-on-arbitrary-key implementation. The reason you cannot just do synchronized(md5) {...} is of course that your hashes are computed, and thus will not be the same object every time, even though they are equals().

For getting a performance boost out of writes, doing multiple operations in one transaction will give a much bigger gain than multiple threads, though. For your use case, I think two writer threads and a few hundred elements per transaction is an appropriate size.

-tobias

On Tue, Feb 1, 2011 at 9:06 PM, Massimo Lusetti mluse...@gmail.com wrote:
> Does it means that I cannot access the graphdb from multiple threads? That code is on a singleton service which expose the GraphDatabaseService through a method addNode() from where I run that code. The singleton service is called by a thread pool which can fire at maximum 20 concurrent threads.
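Tobias's locking-on-arbitrary-key idea can be sketched with the JDK alone. This is a slightly condensed variant of his description: putIfAbsent returns null exactly when the claim succeeds, so the token comparison can be folded into a null check. Class and method names are invented for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class Md5Claims {
    // Tracks only the hashes currently being worked on, so it stays small.
    private final ConcurrentMap<String, Object> inFlight = new ConcurrentHashMap<>();

    /** Returns true if this thread won the claim and may insert the row. */
    public boolean claim(String md5) {
        return inFlight.putIfAbsent(md5, Boolean.TRUE) == null;
    }

    /** Call once the transaction has committed (or failed), so the map doesn't leak. */
    public void release(String md5) {
        inFlight.remove(md5);
    }

    public static void main(String[] args) {
        Md5Claims claims = new Md5Claims();
        System.out.println(claims.claim("abc"));  // true: we own this hash
        System.out.println(claims.claim("abc"));  // false: another "thread" owns it
        claims.release("abc");
        System.out.println(claims.claim("abc"));  // true again after release
    }
}
```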
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
What about batch insertion of the nodes and indexing them after the fact?

And I agree with Tobias that a CHM should be a better claim-checking mechanism than using the index for that. The index, as well as the inserted nodes, will only be visible to other threads after the commit (ACID; please, TI, correct me if I'm wrong), so it is entirely possible that you accidentally insert the same data twice.

Cheers,
Michael

On 01.02.2011 at 22:19, Tobias Ivarsson wrote:
> No, it means that you have to synchronize the threads so that they don't insert the same data concurrently. Perhaps a ConcurrentHashMap<MD5, Token> where you would putIfAbsent(md5, new Object()) when you start working on a new hash. [...] When the transaction is done you remove the md5 from the Map, to ensure that you don't leak memory.
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
That is correct; the Isolation in ACID means that data isn't visible to other threads until after commit.

The CHM should not replace the index check, though. Since you want to limit the number of items in the CHM, you only want it to reflect the elements currently being worked on; the index check should still be there for elements processed before.

-tobias

On Tue, Feb 1, 2011 at 10:25 PM, Michael Hunger michael.hun...@neotechnology.com wrote:
> What about batch insertion of the nodes and indexing them after the fact? And I agree with Tobias that a CHM should be a better claim checking algorithm than using indexing for that. The index as well as the insertion of the nodes will only be visible to other threads after the commit (ACID, please TI correct me if I'm wrong), so it is surely possible that you accidentally insert the same data twice.
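Putting Tobias's two levels together: a small CHM claim for rows in flight, plus a check against the durable store for rows committed earlier. In this sketch a concurrent set stands in for the Lucene index, and all names are illustrative.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class TwoLevelDedup {
    // Level 1: rows being inserted right now (small and bounded).
    private final ConcurrentHashMap<String, Boolean> inFlight = new ConcurrentHashMap<>();
    // Level 2: stand-in for the Lucene index of already-committed rows.
    private final Set<String> committed = ConcurrentHashMap.newKeySet();

    /** Returns true if the row was inserted by this call. */
    public boolean insert(String md5) {
        if (inFlight.putIfAbsent(md5, Boolean.TRUE) != null) {
            return false;                  // another thread is inserting this row now
        }
        try {
            if (committed.contains(md5)) { // index check for rows committed long ago
                return false;
            }
            committed.add(md5);            // createNode + index.add + tx.success in real code
            return true;
        } finally {
            inFlight.remove(md5);          // always release the claim
        }
    }

    public static void main(String[] args) {
        TwoLevelDedup dedup = new TwoLevelDedup();
        System.out.println(dedup.insert("row1")); // true: inserted
        System.out.println(dedup.insert("row1")); // false: already committed
    }
}
```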
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
On Tue, Feb 1, 2011 at 10:19 PM, Tobias Ivarsson tobias.ivars...@neotechnology.com wrote:
> No, it means that you have to synchronize the threads so that they don't insert the same data concurrently.

That would be a typical issue, but I'm sure mine are not duplicated, since they come from the (old) db, which has constraints on that. I can see the problem of data visibility, as Michael suggests, but I need a way to recognize whether I've already put in a data row even a lot later in time — let's say 2/3 days after — so I came to the conclusion of using the Lucene index.

> Perhaps a ConcurrentHashMap<MD5, Token> where you would putIfAbsent(md5, new Object()) when you start working on a new hash. [...] For your use case, I think two writer threads and a few hundred elements per transaction is an appropriate size.

Wow, thanks for the hint, really appreciated. I'm going to give it a try and let you know. BTW, since a log row is similar to an Apache log file entry with some additional information, how much disk space do you think it will occupy?

Again thanks a lot! Cheers
--
Massimo
http://meridio.blogspot.com
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
On Tue, Feb 1, 2011 at 10:25 PM, Michael Hunger michael.hun...@neotechnology.com wrote:
> What about batch insertion of the nodes and indexing them after the fact?

The data to be entered will change values in other nodes (statistics), so I absolutely need to be sure not to insert data twice, even across a long period of time. To be more precise, I have to accept data that goes back almost a year and check whether it still has to be written or has already been written.

Cheers
--
Massimo
http://meridio.blogspot.com
Re: [Neo4j] Lucene index commit rate and NoSuchElementException
On Tue, Feb 1, 2011 at 10:50 PM, Tobias Ivarsson tobias.ivars...@neotechnology.com wrote:
> That is correct, the Isolation of ACID says that data isn't visible to other threads until after commit. The CHM should not replace the index check though, since you want to limit the number of items in the CHM [...] the index check should still be there for elements processed before.

So if I read you right, you suggest using a buffer (the CHM) just to mitigate the visibility effects of ACID on the Lucene/node db. Is that right?

Cheers
--
Massimo
http://meridio.blogspot.com