Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-03 Thread Massimo Lusetti
On Wed, Feb 2, 2011 at 1:20 PM, Tobias Ivarsson
tobias.ivars...@neotechnology.com wrote:

 You are doing I/O bound work. More than two threads is most likely just
 going to add overhead and make things slower!

I'm certainly doing something weird, because the performance of my tests
isn't linear.

I've run the app and let it process two big data files, and these are
the results: 298 rows in 3465742 ms and 2897177 rows in 3483767 ms,
values not in line with the performance of the other test,
which used just 1000000 rows. With more rows the performance
dropped to values between 1.160404947 ms per row and 1.202469507
ms per row. The overall DB size (I measure only the db directory) has
also grown far beyond the expected values: 4.9G ...

Do you or anyone else have any clue?

Thanks
-- 
Massimo


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-03 Thread Mattias Persson
2011/2/3 Massimo Lusetti mluse...@gmail.com

 On Wed, Feb 2, 2011 at 1:20 PM, Tobias Ivarsson
 tobias.ivars...@neotechnology.com wrote:

  You are doing I/O bound work. More than two threads is most likely just
  going to add overhead and make things slower!

 I'm certainly doing something weird, because the performance of my tests
 isn't linear.

 I've run the app and let it process two big data files, and these are
 the results: 298 rows in 3465742 ms and 2897177 rows in 3483767 ms,
 values not in line with the performance of the other test,
 which used just 1000000 rows. With more rows the performance
 dropped to values between 1.160404947 ms per row and 1.202469507
 ms per row. The overall DB size (I measure only the db directory) has
 also grown far beyond the expected values: 4.9G ...

 Do you or anyone else have any clue?


Lucene lookup performance degrades the bigger the index gets. That may be a
reason.


 Thanks
 --
 Massimo




-- 
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-03 Thread Massimo Lusetti
On Thu, Feb 3, 2011 at 11:30 AM, Mattias Persson
matt...@neotechnology.com wrote:

 Lucene lookup performance degrades the bigger the index gets. That may be a
 reason.

I don't think the problem is that Lucene can't handle an index with 6/7 million
entries. Maybe there are some logs around?

Cheers
-- 
Massimo
http://meridio.blogspot.com


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-03 Thread Peter Neubauer
Massimo,
Just yesterday I tried to import the Germany OpenStreetMap dataset
into Neo4j using Lucene indexing. There are around 60M nodes that are
all indexed into Lucene, and then looked up when the Ways, each
consisting of a number of nodes, are calculated. Lucene is not fast, but it
works on these 60M entries. And on inserts it scales reasonably
even after a couple of million inserts, which wasn't the case when I
tried BerkeleyDB (Java), JDBM or other K/V stores.

Actually, if anyone knows a better index for exact lookups that is
configurable to have good insert and lookup performance, any hint
would be greatly appreciated. This is what currently limits insert
performance when you use the Neo4j BatchInserter. The BatchInserter
itself can insert around 200K-400K nodes or relationships per second,
but the indexing subsystems are just not up to that speed.

Help is greatly appreciated!

Cheers,

/peter neubauer

GTalk:      neubauer.peter
Skype       peter.neubauer
Phone       +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter      http://twitter.com/peterneubauer

http://www.neo4j.org               - Your high performance graph database.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.



On Thu, Feb 3, 2011 at 11:52 AM, Massimo Lusetti mluse...@gmail.com wrote:
 On Thu, Feb 3, 2011 at 11:30 AM, Mattias Persson
 matt...@neotechnology.com wrote:

 Lucene lookup performance degrades the bigger the index gets. That may be a
 reason.

 I don't think the problem is that Lucene can't handle an index with 6/7 million
 entries. Maybe there are some logs around?

 Cheers
 --
 Massimo
 http://meridio.blogspot.com



Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-03 Thread Massimo Lusetti
On Thu, Feb 3, 2011 at 2:01 PM, Peter Neubauer
peter.neuba...@neotechnology.com wrote:

 Massimo,
 Just yesterday I tried to import the Germany OpenStreetMap dataset
 into Neo4j using Lucene indexing. There are around 60M nodes that are
 all indexed into Lucene, and then looked up when the Ways, each
 consisting of a number of nodes, are calculated. Lucene is not fast, but it
 works on these 60M entries. And on inserts it scales reasonably
 even after a couple of million inserts, which wasn't the case when I
 tried BerkeleyDB (Java), JDBM or other K/V stores.

Happy to see good results, so the question seems to be how to fine-tune
Lucene, or the use Neo4j makes of Lucene...

 Actually, if anyone knows a better index for exact lookups that is
 configurable to have good insert and lookup performance, any hint
 would be greatly appreciated. This is what currently limits insert
 performance when you use the Neo4j BatchInserter. The BatchInserter
 itself can insert around 200K-400K nodes or relationships per second,
 but the indexing subsystems are just not up to that speed.

I'm not using the BatchInserter because I will need to do this work on a
daily basis, and the normal operations are a lot slower.

 Help is greatly appreciated!

Indeed. Thanks in advance

Cheers
-- 
Massimo
http://meridio.blogspot.com


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-03 Thread Mattias Persson
2011/2/3 Massimo Lusetti mluse...@gmail.com

 On Thu, Feb 3, 2011 at 2:01 PM, Peter Neubauer
 peter.neuba...@neotechnology.com wrote:

  Massimo,
  Just yesterday I tried to import the Germany OpenStreetMap dataset
  into Neo4j using Lucene indexing. There are around 60M nodes that are
  all indexed into Lucene, and then looked up when the Ways, each
  consisting of a number of nodes, are calculated. Lucene is not fast, but it
  works on these 60M entries. And on inserts it scales reasonably
  even after a couple of million inserts, which wasn't the case when I
  tried BerkeleyDB (Java), JDBM or other K/V stores.

 Happy to see good results, so the question seems to be how to fine-tune
 Lucene, or the use Neo4j makes of Lucene...


That was just one suggestion I had; maybe there are other things affecting the
performance. I mean, we should stop guessing and actually measure the bottlenecks.


  Actually, if anyone knows a better index for exact lookups that is
  configurable to have good insert and lookup performance, any hint
  would be greatly appreciated. This is what currently limits insert
  performance when you use the Neo4j BatchInserter. The BatchInserter
  itself can insert around 200K-400K nodes or relationships per second,
  but the indexing subsystems are just not up to that speed.

 I'm not using the BatchInserter because I will need to do this work on a
 daily basis, and the normal operations are a lot slower.

  Help is greatly appreciated!

 Indeed. Thanks in advance

 Cheers
 --
 Massimo
 http://meridio.blogspot.com




-- 
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-02 Thread Massimo Lusetti
On Tue, Feb 1, 2011 at 10:19 PM, Tobias Ivarsson
tobias.ivars...@neotechnology.com wrote:

 For getting a performance boost out of writes, doing multiple operations in
 one transaction will give a much bigger gain than multiple threads though.
 For your use case, I think two writer threads and a few hundred elements per
 transaction is an appropriate size.

I got some numbers.

On a base of 10 rounds of 100000 rows each I got an average of
111.1811 sec to crunch a chunk of data, which means it takes
1.111811 ms to process a single row.
A single file contains the data for a single day and has an
average of 2500000 rows, so it would take approximately 46 minutes to
crunch.

The final db size is 588034K (574M) for 1000000 rows, so we can
estimate that the final DB size would be 132307650K (126G).
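(That estimate scales the 588034K measured for 1000000 test rows up to the
roughly 225 million rows of the full table: 588034K x 225 = 132307650K,
i.e. about 126G.)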

The current SQL DB is 60G, and the app takes 4 and 1/2 hours to crunch
a month of logs on an identical machine.

The test was conducted starting from an empty db, and the chunk times
progressed as follows: 80307 ms, 83444 ms, 97162 ms, 131703 ms,
134647 ms, 104602 ms, 115944 ms, 112489 ms, 115660 ms, 135853 ms.

There are 20 threads (default config) and they're processing 200 rows
each within a single Transaction.

The JVM is:
OpenJDK Runtime Environment (build 1.6.0-b20) and was started with these options:
-Djava.awt.headless=true -Xms40m -Xmx1536m -XX:+UseConcMarkSweepGC

And ...

What do you think?

Thanks for any info you could give
-- 
Massimo
http://meridio.blogspot.com


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-02 Thread Tobias Ivarsson
More threads != faster

You are doing I/O bound work. More than two threads is most likely just
going to add overhead and make things slower!

Also, I'm wondering, what does "crunch" mean in this context? Is it the
write operations we have been talking about, or is it some other operation?
I'm thinking that if it takes 45 minutes to insert a day's worth of data,
that is fast enough to do in real time. If Neo4j allows you to *process*
that data faster and in new ways, then that would be gain enough, and
probably warrant slightly longer insert times.

-tobias

On Wed, Feb 2, 2011 at 12:17 PM, Massimo Lusetti mluse...@gmail.com wrote:

 On Tue, Feb 1, 2011 at 10:19 PM, Tobias Ivarsson
 tobias.ivars...@neotechnology.com wrote:

  For getting a performance boost out of writes, doing multiple operations in
  one transaction will give a much bigger gain than multiple threads though.
  For your use case, I think two writer threads and a few hundred elements per
  transaction is an appropriate size.

 I got some numbers.

 On a base of 10 rounds of 100000 rows each I got an average of
 111.1811 sec to crunch a chunk of data, which means it takes
 1.111811 ms to process a single row.
 A single file contains the data for a single day and has an
 average of 2500000 rows, so it would take approximately 46 minutes to
 crunch.

 The final db size is 588034K (574M) for 1000000 rows, so we can
 estimate that the final DB size would be 132307650K (126G).

 The current SQL DB is 60G, and the app takes 4 and 1/2 hours to crunch
 a month of logs on an identical machine.

 The test was conducted starting from an empty db, and the chunk times
 progressed as follows: 80307 ms, 83444 ms, 97162 ms, 131703 ms,
 134647 ms, 104602 ms, 115944 ms, 112489 ms, 115660 ms, 135853 ms.

 There are 20 threads (default config) and they're processing 200 rows
 each within a single Transaction.

 The JVM is:
 OpenJDK Runtime Environment (build 1.6.0-b20) and was started with these
 options:
 -Djava.awt.headless=true -Xms40m -Xmx1536m -XX:+UseConcMarkSweepGC

 And ...

 What do you think?

 Thanks for any info you could give
 --
 Massimo
 http://meridio.blogspot.com




-- 
Tobias Ivarsson tobias.ivars...@neotechnology.com
Hacker, Neo Technology
www.neotechnology.com
Cellphone: +46 706 534857


[Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Massimo Lusetti
Hi everyone,
  I'm new to Neo4j and I'm gaining experience with it. I have a fairly
big table (in my current db) which consists of somewhat more than 220
million rows.

I want to put that in a graphdb, for instance Neo4j, and graph it to
do some statistics on it. In my vision every row will be a node with
just one property, plus an index to check against, since I'm using
small chunks of data to test and want to know if I've already imported
a specific row.

I'm pretty sure there aren't two identical rows in the DB table due to
db constraints, but I still get these errors as the import proceeds:
java.util.NoSuchElementException: More than one element in
org.neo4j.index.impl.lucene.LuceneIndex$1@77435b. First element is
'Node[201728]' and the second element is 'Node[201744]'

Does this have something to do with the commit rate of the Neo4j
internal Lucene index? Or am I doing something wrong? BTW I'm
committing on every row imported:

String rowMD5 = md5Source.digest(logRow.getRaw().getBytes());

Transaction tx = graphDb.beginTx();

Index<Node> md5Index = graphDb.index().forNodes("md5-rows");

if (md5Index.get("md5", rowMD5).getSingle() == null)
{
    Node rowNode = graphDb.createNode();
    String[] logRowArray = LogRow.asStringArray(logRow);
    rowNode.setProperty("logRow", logRowArray);
    md5Index.add(rowNode, "md5", rowMD5);

    tx.success();
}

tx.finish();


Thanks for any clue...

Regards
-- 
Massimo
http://meridio.blogspot.com


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Tobias Ivarsson
Since you are checking for existence before inserting, the conflict you are
getting is strange. Are you running multiple insertion threads?

-Tobias

On Tue, Feb 1, 2011 at 6:19 PM, Massimo Lusetti mluse...@gmail.com wrote:

 Hi everyone,
  I'm new to Neo4j and I'm gaining experience with it. I have a fairly
 big table (in my current db) which consists of somewhat more than 220
 million rows.

 I want to put that in a graphdb, for instance Neo4j, and graph it to
 do some statistics on it. In my vision every row will be a node with
 just one property, plus an index to check against, since I'm using
 small chunks of data to test and want to know if I've already imported
 a specific row.

 I'm pretty sure there aren't two identical rows in the DB table due to
 db constraints, but I still get these errors as the import proceeds:
 java.util.NoSuchElementException: More than one element in
 org.neo4j.index.impl.lucene.LuceneIndex$1@77435b. First element is
 'Node[201728]' and the second element is 'Node[201744]'

 Does this have something to do with the commit rate of the Neo4j
 internal Lucene index? Or am I doing something wrong? BTW I'm
 committing on every row imported:

 String rowMD5 = md5Source.digest(logRow.getRaw().getBytes());

 Transaction tx = graphDb.beginTx();

 Index<Node> md5Index = graphDb.index().forNodes("md5-rows");

 if (md5Index.get("md5", rowMD5).getSingle() == null)
 {
     Node rowNode = graphDb.createNode();
     String[] logRowArray = LogRow.asStringArray(logRow);
     rowNode.setProperty("logRow", logRowArray);
     md5Index.add(rowNode, "md5", rowMD5);

     tx.success();
 }

 tx.finish();


 Thanks for any clue...

 Regards
 --
 Massimo
 http://meridio.blogspot.com




-- 
Tobias Ivarsson tobias.ivars...@neotechnology.com
Hacker, Neo Technology
www.neotechnology.com
Cellphone: +46 706 534857


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Peter Neubauer
Also,
have you been running this insert multiple times without cleaning up
the database between runs?

Cheers,

/peter neubauer

GTalk:      neubauer.peter
Skype       peter.neubauer
Phone       +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter      http://twitter.com/peterneubauer

http://www.neo4j.org               - Your high performance graph database.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.



On Tue, Feb 1, 2011 at 6:36 PM, Tobias Ivarsson
tobias.ivars...@neotechnology.com wrote:
 Since you are checking for existence before inserting, the conflict you are
 getting is strange. Are you running multiple insertion threads?

 -Tobias

 On Tue, Feb 1, 2011 at 6:19 PM, Massimo Lusetti mluse...@gmail.com wrote:

 Hi everyone,
  I'm new to Neo4j and I'm gaining experience with it. I have a fairly
 big table (in my current db) which consists of somewhat more than 220
 million rows.

 I want to put that in a graphdb, for instance Neo4j, and graph it to
 do some statistics on it. In my vision every row will be a node with
 just one property, plus an index to check against, since I'm using
 small chunks of data to test and want to know if I've already imported
 a specific row.

 I'm pretty sure there aren't two identical rows in the DB table due to
 db constraints, but I still get these errors as the import proceeds:
 java.util.NoSuchElementException: More than one element in
 org.neo4j.index.impl.lucene.LuceneIndex$1@77435b. First element is
 'Node[201728]' and the second element is 'Node[201744]'

 Does this have something to do with the commit rate of the Neo4j
 internal Lucene index? Or am I doing something wrong? BTW I'm
 committing on every row imported:

  String rowMD5 = md5Source.digest(logRow.getRaw().getBytes());

                Transaction tx = graphDb.beginTx();

                Index<Node> md5Index = graphDb.index().forNodes("md5-rows");

                if (md5Index.get("md5", rowMD5).getSingle() == null)
                {
                        Node rowNode = graphDb.createNode();
                        String[] logRowArray = LogRow.asStringArray(logRow);
                        rowNode.setProperty("logRow", logRowArray);
                        md5Index.add(rowNode, "md5", rowMD5);

                        tx.success();
                }

                tx.finish();


 Thanks for any clue...

 Regards
 --
 Massimo
 http://meridio.blogspot.com




 --
 Tobias Ivarsson tobias.ivars...@neotechnology.com
 Hacker, Neo Technology
 www.neotechnology.com
 Cellphone: +46 706 534857



Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Michael Hunger
Hmm, MD5 is not a unique hashing function, so it might be that you get the same
hash for different byte arrays.

Can you output the MD5s of the multiple logRows that are returned by the index?
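
For example, something like this would print a stable hex form of each hash
(just a sketch; I'm assuming your md5Source wraps java.security.MessageDigest
and that you compare the digest as a String):

import java.security.MessageDigest;

// Hex-encode an MD5 digest so equal hashes are easy to compare in a log.
static String md5Hex(byte[] data) throws Exception
{
    byte[] digest = MessageDigest.getInstance("MD5").digest(data);
    StringBuilder sb = new StringBuilder(digest.length * 2);
    for (byte b : digest)
    {
        sb.append(String.format("%02x", b)); // unsigned, zero-padded hex
    }
    return sb.toString();
}

// e.g. System.out.println(md5Hex(logRow.getRaw().getBytes()));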

Michael

On 01.02.2011 at 18:19, Massimo Lusetti wrote:

 Hi everyone,
  I'm new to Neo4j and I'm gaining experience with it. I have a fairly
 big table (in my current db) which consists of somewhat more than 220
 million rows.

 I want to put that in a graphdb, for instance Neo4j, and graph it to
 do some statistics on it. In my vision every row will be a node with
 just one property, plus an index to check against, since I'm using
 small chunks of data to test and want to know if I've already imported
 a specific row.
 
 I'm pretty sure there aren't two identical rows in the DB table due to
 db constraints, but I still get these errors as the import proceeds:
 java.util.NoSuchElementException: More than one element in
 org.neo4j.index.impl.lucene.LuceneIndex$1@77435b. First element is
 'Node[201728]' and the second element is 'Node[201744]'
 
 Does this have something to do with the commit rate of the Neo4j
 internal Lucene index? Or am I doing something wrong? BTW I'm
 committing on every row imported:
 
  String rowMD5 = md5Source.digest(logRow.getRaw().getBytes());

    Transaction tx = graphDb.beginTx();

    Index<Node> md5Index = graphDb.index().forNodes("md5-rows");

    if (md5Index.get("md5", rowMD5).getSingle() == null)
    {
        Node rowNode = graphDb.createNode();
        String[] logRowArray = LogRow.asStringArray(logRow);
        rowNode.setProperty("logRow", logRowArray);
        md5Index.add(rowNode, "md5", rowMD5);

        tx.success();
    }

    tx.finish();
 
 
 Thanks for any clue...
 
 Regards
 -- 
 Massimo
 http://meridio.blogspot.com



Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Mattias Persson
Seems a little weird; the commit rate won't affect the end result,
just performance (more operations per commit means faster
performance). Your code seems correct for single-threaded use, btw.

On Tuesday 1 February 2011, Michael Hunger
michael.hun...@neotechnology.com wrote:
 Hmm, MD5 is not a unique hashing function, so it might be that you get the same
 hash for different byte arrays.

 Can you output the MD5s of the multiple logRows that are returned by the
 index?

 Michael

 On 01.02.2011 at 18:19, Massimo Lusetti wrote:

 Hi everyone,
  I'm new to Neo4j and I'm gaining experience with it. I have a fairly
 big table (in my current db) which consists of somewhat more than 220
 million rows.

 I want to put that in a graphdb, for instance Neo4j, and graph it to
 do some statistics on it. In my vision every row will be a node with
 just one property, plus an index to check against, since I'm using
 small chunks of data to test and want to know if I've already imported
 a specific row.

 I'm pretty sure there aren't two identical rows in the DB table due to
 db constraints, but I still get these errors as the import proceeds:
 java.util.NoSuchElementException: More than one element in
 org.neo4j.index.impl.lucene.LuceneIndex$1@77435b. First element is
 'Node[201728]' and the second element is 'Node[201744]'

 Does this have something to do with the commit rate of the Neo4j
 internal Lucene index? Or am I doing something wrong? BTW I'm
 committing on every row imported:

 String rowMD5 = md5Source.digest(logRow.getRaw().getBytes());

               Transaction tx = graphDb.beginTx();

               Index<Node> md5Index = graphDb.index().forNodes("md5-rows");

               if (md5Index.get("md5", rowMD5).getSingle() == null)
               {
                       Node rowNode = graphDb.createNode();
                       String[] logRowArray = LogRow.asStringArray(logRow);
                       rowNode.setProperty("logRow", logRowArray);
                       md5Index.add(rowNode, "md5", rowMD5);

                       tx.success();
               }

               tx.finish();


 Thanks for any clue...

 Regards
 --
 Massimo
 http://meridio.blogspot.com



-- 
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Massimo Lusetti
On Tue, Feb 1, 2011 at 6:36 PM, Tobias Ivarsson
tobias.ivars...@neotechnology.com wrote:

 Since you are checking for existence before inserting, the conflict you are
 getting is strange. Are you running multiple insertion threads?

Yep, I've got 20 concurrent threads doing the job. I'd forgotten about them, thanks!

Cheers
-- 
Massimo
http://meridio.blogspot.com


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Massimo Lusetti
On Tue, Feb 1, 2011 at 6:43 PM, Peter Neubauer
peter.neuba...@neotechnology.com wrote:

 Also,
 have you been running this insert multiple times without cleaning up
 the database between runs?

Nope, for the tests I wipe (rm -rf) the db dir every run.

Cheers
-- 
Massimo
http://meridio.blogspot.com


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Massimo Lusetti
On Tue, Feb 1, 2011 at 8:02 PM, Mattias Persson
matt...@neotechnology.com wrote:

 Seems a little weird; the commit rate won't affect the end result,
 just performance (more operations per commit means faster
 performance). Your code seems correct for single-threaded use, btw.

Does it mean that I cannot access the graphdb from multiple threads?
That code is in a singleton service which exposes the
GraphDatabaseService through a method addNode(), from which I run that
code.

The singleton service is called by a thread pool which can fire at
most 20 concurrent threads.

Any hint is really appreciated.

Cheers
-- 
Massimo
http://meridio.blogspot.com


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Tobias Ivarsson
No, it means that you have to synchronize the threads so that they don't
insert the same data concurrently.
Perhaps a ConcurrentHashMap<MD5, token> where you would putIfAbsent(md5, new
Object()) when you start working on a new hash. If the token Object you get
back is not the same as you put in, you know that another thread is working
on that md5, which means this thread should move on to another one. When the
transaction is done you remove the md5 from the Map, to ensure that you
don't leak memory.

That's a simple "locking on arbitrary key" implementation. The reason you
cannot just do synchronized(md5) {...} is of course that your hashes are
computed, and thus will not be the same object every time, even though they
are equals().
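
In code, the claim/release could look something like this (just a minimal
sketch, assuming your hashes are hex-encoded Strings):

import java.util.concurrent.ConcurrentHashMap;

public class Md5Claims
{
    private final ConcurrentHashMap<String, Object> inFlight =
            new ConcurrentHashMap<String, Object>();

    // Returns true if this thread won the claim on the hash, false if
    // another thread is already working on it.
    public boolean claim(String md5)
    {
        return inFlight.putIfAbsent(md5, new Object()) == null;
    }

    // Call this from a finally block once the transaction is done, so the
    // map only ever holds the hashes currently being worked on.
    public void release(String md5)
    {
        inFlight.remove(md5);
    }
}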

For getting a performance boost out of writes, doing multiple operations in
one transaction will give a much bigger gain than multiple threads though.
For your use case, I think two writer threads and a few hundred elements per
transaction is an appropriate size.
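
Reworking your own snippet along those lines (a sketch only; graphDb,
md5Index, md5Source and LogRow are the objects from your code), it would be
roughly:

import java.util.List;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

// One transaction per chunk of a few hundred rows, instead of one per row.
void insertChunk(GraphDatabaseService graphDb, Index<Node> md5Index,
        List<LogRow> chunk)
{
    Transaction tx = graphDb.beginTx();
    try
    {
        for (LogRow logRow : chunk)
        {
            String rowMD5 = md5Source.digest(logRow.getRaw().getBytes());
            // This index check still races between threads; combine it
            // with the claim map above when running more than one writer.
            if (md5Index.get("md5", rowMD5).getSingle() == null)
            {
                Node rowNode = graphDb.createNode();
                rowNode.setProperty("logRow", LogRow.asStringArray(logRow));
                md5Index.add(rowNode, "md5", rowMD5);
            }
        }
        tx.success();
    }
    finally
    {
        tx.finish();
    }
}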

-tobias

On Tue, Feb 1, 2011 at 9:06 PM, Massimo Lusetti mluse...@gmail.com wrote:

 On Tue, Feb 1, 2011 at 8:02 PM, Mattias Persson
 matt...@neotechnology.com wrote:

  Seems a little weird; the commit rate won't affect the end result,
  just performance (more operations per commit means faster
  performance). Your code seems correct for single-threaded use, btw.

 Does it mean that I cannot access the graphdb from multiple threads?
 That code is in a singleton service which exposes the
 GraphDatabaseService through a method addNode(), from which I run that
 code.

 The singleton service is called by a thread pool which can fire at
 most 20 concurrent threads.

 Any hint is really appreciated.

 Cheers
 --
 Massimo
 http://meridio.blogspot.com




-- 
Tobias Ivarsson tobias.ivars...@neotechnology.com
Hacker, Neo Technology
www.neotechnology.com
Cellphone: +46 706 534857


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Michael Hunger
What about batch insertion of the nodes and indexing them after the fact?
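
Roughly like this, with the batch-inserter index API (a sketch from memory,
so class and method names may differ slightly between Neo4j versions):

import org.neo4j.graphdb.index.BatchInserterIndex;
import org.neo4j.graphdb.index.BatchInserterIndexProvider;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.index.impl.lucene.LuceneBatchInserterIndexProvider;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

BatchInserter inserter = new BatchInserterImpl("path/to/db");
BatchInserterIndexProvider provider =
        new LuceneBatchInserterIndexProvider(inserter);
BatchInserterIndex md5Index =
        provider.nodeIndex("md5-rows", MapUtil.stringMap("type", "exact"));

// Create the nodes first...
long nodeId = inserter.createNode(MapUtil.map("logRow", logRowArray));
// ...then index them; flush() makes the additions visible to lookups.
md5Index.add(nodeId, MapUtil.map("md5", rowMD5));
md5Index.flush();

provider.shutdown();
inserter.shutdown();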

And I agree with Tobias that a CHM should be a better claim-checking
algorithm than using indexing for that. The index, as well as the insertion
of the nodes, will only be visible to other threads after the commit (ACID,
please correct me if I'm wrong, Tobias), so it is surely possible that you
accidentally insert the same data twice.

Cheers

Michael

On 01.02.2011 at 22:19, Tobias Ivarsson wrote:

 No, it means that you have to synchronize the threads so that they don't
 insert the same data concurrently.
 Perhaps a ConcurrentHashMap<MD5, token> where you would putIfAbsent(md5, new
 Object()) when you start working on a new hash. If the token Object you get
 back is not the same as you put in, you know that another thread is working
 on that md5, which means this thread should move on to another one. When the
 transaction is done you remove the md5 from the Map, to ensure that you
 don't leak memory.

 That's a simple "locking on arbitrary key" implementation. The reason you
 cannot just do synchronized(md5) {...} is of course that your hashes are
 computed, and thus will not be the same object every time, even though they
 are equals().
 
 For getting a performance boost out of writes, doing multiple operations in
 one transaction will give a much bigger gain than multiple threads though.
 For your use case, I think two writer threads and a few hundred elements per
 transaction is an appropriate size.
 
 -tobias
 
 On Tue, Feb 1, 2011 at 9:06 PM, Massimo Lusetti mluse...@gmail.com wrote:
 
 On Tue, Feb 1, 2011 at 8:02 PM, Mattias Persson
 matt...@neotechnology.com wrote:
 
 Seems a little weird; the commit rate won't affect the end result,
 just performance (more operations per commit means faster
 performance). Your code seems correct for single-threaded use, btw.

 Does it mean that I cannot access the graphdb from multiple threads?
 That code is in a singleton service which exposes the
 GraphDatabaseService through a method addNode(), from which I run that
 code.

 The singleton service is called by a thread pool which can fire at
 most 20 concurrent threads.

 Any hint is really appreciated.
 
 Cheers
 --
 Massimo
 http://meridio.blogspot.com
 
 
 
 
 -- 
 Tobias Ivarsson tobias.ivars...@neotechnology.com
 Hacker, Neo Technology
 www.neotechnology.com
 Cellphone: +46 706 534857



Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Tobias Ivarsson
That is correct; the Isolation in ACID says that data isn't visible to other
threads until after commit.

The CHM should not replace the index check though. Since you want to limit
the number of items in the CHM, you only want it to reflect the elements
currently being worked on; the index check should still be there for
elements processed before.
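
Combining the two (a sketch; claim/release is the hypothetical map from my
earlier mail, md5Index is the index from Massimo's code):

// Claim the hash first, then consult the index for rows committed by
// earlier transactions.
if (claims.claim(rowMD5))
{
    try
    {
        if (md5Index.get("md5", rowMD5).getSingle() == null)
        {
            // create and index the node inside the transaction ...
        }
    }
    finally
    {
        claims.release(rowMD5); // only after the transaction has finished
    }
}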

-tobias

On Tue, Feb 1, 2011 at 10:25 PM, Michael Hunger 
michael.hun...@neotechnology.com wrote:

 What about batch insertion of the nodes and indexing them after the fact?

 And I agree with Tobias that a CHM should be a better claim-checking
 algorithm than using indexing for that. The index, as well as the insertion
 of the nodes, will only be visible to other threads after the commit (ACID,
 please correct me if I'm wrong, Tobias), so it is surely possible that you
 accidentally insert the same data twice.

 Cheers

 Michael

 On 01.02.2011 at 22:19, Tobias Ivarsson wrote:

  No, it means that you have to synchronize the threads so that they don't
  insert the same data concurrently.
  Perhaps a ConcurrentHashMap<MD5, token> where you would putIfAbsent(md5, new
  Object()) when you start working on a new hash. If the token Object you get
  back is not the same as you put in, you know that another thread is working
  on that md5, which means this thread should move on to another one. When the
  transaction is done you remove the md5 from the Map, to ensure that you
  don't leak memory.

  That's a simple "locking on arbitrary key" implementation. The reason you
  cannot just do synchronized(md5) {...} is of course that your hashes are
  computed, and thus will not be the same object every time, even though they
  are equals().
 
  For getting a performance boost out of writes, doing multiple operations in
  one transaction will give a much bigger gain than multiple threads though.
  For your use case, I think two writer threads and a few hundred elements per
  transaction is an appropriate size.
 
  -tobias
 
  On Tue, Feb 1, 2011 at 9:06 PM, Massimo Lusetti mluse...@gmail.com
 wrote:
 
  On Tue, Feb 1, 2011 at 8:02 PM, Mattias Persson
  matt...@neotechnology.com wrote:
 
  Seems a little weird; the commit rate won't affect the end result,
  just performance (more operations per commit means faster
  performance). Your code seems correct for single-threaded use, btw.

  Does it mean that I cannot access the graphdb from multiple threads?
  That code is in a singleton service which exposes the
  GraphDatabaseService through a method addNode(), from which I run that
  code.

  The singleton service is called by a thread pool which can fire at
  most 20 concurrent threads.

  Any hint is really appreciated.
 
  Cheers
  --
  Massimo
  http://meridio.blogspot.com
 
 
 
 
  --
  Tobias Ivarsson tobias.ivars...@neotechnology.com
  Hacker, Neo Technology
  www.neotechnology.com
  Cellphone: +46 706 534857





-- 
Tobias Ivarsson tobias.ivars...@neotechnology.com
Hacker, Neo Technology
www.neotechnology.com
Cellphone: +46 706 534857


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Massimo Lusetti
On Tue, Feb 1, 2011 at 10:19 PM, Tobias Ivarsson
tobias.ivars...@neotechnology.com wrote:

 No, it means that you have to synchronize the threads so that they don't
 insert the same data concurrently.

That would be a typical issue, but I'm sure mine are not duplicated, since
they come from the (old) db, which has constraints on that. I can see
the problem of data visibility as Michael suggests, but I need a way to
recognize whether I've already put in a data row even a lot
later in time, let's say 2/3 days after, so I came to the conclusion to
use the Lucene index.

 Perhaps a ConcurrentHashMap<MD5, token> where you would putIfAbsent(md5, new
 Object()) when you start working on a new hash. If the token Object you get
 back is not the same as you put in, you know that another thread is working
 on that md5, which means this thread should move on to another one. When the
 transaction is done you remove the md5 from the Map, to ensure that you
 don't leak memory.

 That's a simple "locking on arbitrary key" implementation. The reason you
 cannot just do synchronized(md5) {...} is of course that your hashes are
 computed, and thus will not be the same object every time, even though they
 are equals().

 For getting a performance boost out of writes, doing multiple operations in
 one transaction will give a much bigger gain than multiple threads though.
 For your use case, I think two writer threads and a few hundred elements per
 transaction is an appropriate size.

Wow, thanks for the hint, really appreciated. I'm going to give it a
try and let you know. BTW, since a log row is similar to an Apache
log file entry with some additional information, how much disk space do
you think it will occupy?

 -tobias

Again thanks a lot!

Cheers
-- 
Massimo
http://meridio.blogspot.com


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Massimo Lusetti
On Tue, Feb 1, 2011 at 10:25 PM, Michael Hunger
michael.hun...@neotechnology.com wrote:

 What about batch insertion of the nodes and indexing them after the fact?

The data being entered will change values in other nodes (statistics),
so I absolutely need to be sure not to insert data twice, even across
a long period of time. To be more precise, I have to accept
data that goes back almost a year and check whether it has to be written
or has already been written.

Cheers
-- 
Massimo
http://meridio.blogspot.com


Re: [Neo4j] Lucene index commit rate and NoSuchElementException

2011-02-01 Thread Massimo Lusetti
On Tue, Feb 1, 2011 at 10:50 PM, Tobias Ivarsson
tobias.ivars...@neotechnology.com wrote:

 That is correct; the Isolation in ACID says that data isn't visible to other
 threads until after commit.

 The CHM should not replace the index check though. Since you want to limit
 the number of items in the CHM, you only want it to reflect the elements
 currently being worked on; the index check should still be there for
 elements processed before.

So if I read you right, you suggest using a buffer (the CHM) just to
mitigate the visibility effects of ACID on the Lucene/node db. Is that right?

Cheers
-- 
Massimo
http://meridio.blogspot.com