Re: [Neo4j] Using EmbeddedGraphDatabase, possible to stop node caching eating ram?

2010-10-01 Thread Mattias Persson
2010/9/30 Garrett Barton garrett.bar...@gmail.com

 Thanks for the reply!

 Root nodes are found via:

 private Node getNewNode(NeoTypes entity) {
  Node n = graphDb.createNode();
  n.createRelationshipTo(getRootNode(entity), entity);
  return n;
 }

 private Node getRootNode(NeoTypes entity) {
  Node root = rootMap.get(entity);
  if(root == null) {
    root = indexService.getSingleNode(eid, entity.toString());
    if(root == null) {
      Node entityNode = graphDb.createNode();
      entityNode.createRelationshipTo(graphDb.getReferenceNode(), entity);
      root = entityNode;
      indexService.index(root, eid, entity.toString());
    }
    rootMap.put(entity, root);
  }
  return root;
 }

 Where rootMap = new HashMap<NeoTypes,Node>();
 Thus I create the root entity once, attach it to the reference node,
 and then look it up via the rootMap. My loads load one entity at a
 time, which from what you said will single-thread me, since every
 relationship that attaches to one of my root nodes (the same one per
 run) locks that node until the transaction completes.


 I was creating root nodes in order to provide entry points, but this
 may be undesirable now that I think about it, since each root entity
 could easily have 500M nodes hanging off of it. Neo probably would not
 be able to traverse that very well, correct? If I remove this
 restriction and load nodes individually, I should be able to thread
 out again. Only when I do the relationships will I occasionally run
 into locks, which I can try to mitigate with more threads and smaller
 transaction sizes (10k maybe?). Is there any documentation on what
 operations will take out a lock?

It depends on what kind of traversal you are doing... could you give some
examples?


 I know it's not the db (postgres), as the same code that reads this
 result set also drives my full Lucene indexing layer, and with it I
 can pull well over 100k/s per thread. (My Lucene implementation
 indexes on average 400-500k/s with 4 threads and peaks once in a
 while over 1mil/s.) HUGE hack right now, but instead of just calling
 tx.finish() I am also shutting down and starting the db back up again
 every 150k. This has brought the rates (including start/stop time) up
 to about 15k/s, and it stays at that level now. I need to figure out
 why I run out of ram so I can avoid doing this.

(See my answer below on batch insertion as well.) With its default
configuration neo4j will try to cache pretty much everything in your
heap, so if your other database also caches things your heap will
pretty soon be full. You can try lowering the caching levels. Fiddle
around with the cache settings; see the example at
http://dist.neo4j.org/neo_default.props , e.g.
adaptive_cache_heap_ratio=0.77, and maybe lower it a bit.
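As a rough illustration, lowering that ratio in the properties file could
look like this (the 0.5 value is just an illustrative guess, not a tuned
recommendation):

```properties
# Fraction of the JVM heap the adaptive cache may grow to.
# The default in neo_default.props is 0.77; lowering it leaves
# more heap for your own application's data.
adaptive_cache_heap_ratio=0.5
```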


 In general, I assume creating nodes is expensive? Can I create a
 batch of nodes, close the transaction, update all of their
 properties, and then reopen a transaction to attach relationships?
 What is the bottleneck when doing insertions?


If this is a one-time batch insertion then you should really use the batch
inserter http://wiki.neo4j.org/content/Batch_Insert which is optimized for
these things. It's much faster for imports of this sort, and you don't
need (in fact, cannot use) multiple threads to insert your data.
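For reference, a minimal sketch of what such an import might look like,
assuming the 1.x batch inserter API described on the linked wiki page
(class and package names may differ between versions, and the store path
"target/batch-db" is just an example):

```java
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

public class BatchImport {
    public static void main(String[] args) {
        // Single-threaded, non-transactional insertion straight into the store files.
        BatchInserter inserter = new BatchInserterImpl("target/batch-db");
        try {
            Map<String, Object> props = new HashMap<String, Object>();
            props.put("name", "example");
            long parent = inserter.createNode(props);  // returns the new node's id
            long child = inserter.createNode(null);    // null means no properties
            inserter.createRelationship(child, parent,
                    DynamicRelationshipType.withName("CHILD_OF"), null);
        } finally {
            inserter.shutdown(); // flushes to disk; required before normal use of the db
        }
    }
}
```

The inserter bypasses transactions and locking entirely, which is why it is
single-threaded and why the store must not be opened by a normal
EmbeddedGraphDatabase at the same time.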



Re: [Neo4j] Using EmbeddedGraphDatabase, possible to stop node caching eating ram?

2010-09-30 Thread Mattias Persson
2010/9/29 Garrett Barton garrett.bar...@gmail.com

 Hey all,

 I have an issue similar to this post
 http://www.mail-archive.com/user@lists.neo4j.org/msg04942.html

 I am following the advice under the Big Transactions page on the wiki
 and my code looks something like this:

 public void doBigBatchJob(NeoType entity) {
     Transaction tx = null;
     try {
         int counter = 0;
         while(rs.next()) {
             if(tx == null)
                 tx = graphDb.beginTx();

             Node n = getNewNode(entity);
             for(String col : columnList)
                 if(rs.getString(col) != null)
                     n.setProperty(col, rs.getString(col));

             counter++;

             // commit every 10k rows; the "% 1" in the archived post
             // appears to be a mangled 10000
             if(counter % 10000 == 0) {
                 tx.success();
                 tx.finish();
                 tx = null;
             }
         }
     }
     finally {
         if(tx != null) {
             tx.success();
             tx.finish();
         }
     }
 }

 It looks correct to me.


 Where getNewNode creates a node and gives it a relationship to the
 parent entity. Parent nodes are cached, that helped a whole bunch.

How are you looking up parent nodes?


 I have timers throughout the code as well. I know I eat some time
 pulling from the db, but if I take out the node creation and do a pull
 test of the db I can sustain 100k/s rates easily. When I start this
 process up, I get an initial 12-14k/s rate that works well for the
 first 500k or so, then the drop-off is huge. By the time it's done the
 next 500k it's down to under 3k/s.

 Watching with JProfiler, I see the ram I gave the vm max out and stay
 there; as soon as it peaks, rates tank.
 Current setup is:
 -Xms2048m -Xmx2048m -XX:+UseConcMarkSweepGC

 Box has about 8GB of ram free for this, its own storage for the neo
 db, and I have already watched nr_dirty and nr_writeback and they
 never get over 2k/10 respectively.

 neo config options:
 nodestore.db.mapped_memory= 500M
 relationshipstore.db.mapped_memory= 1G
 propertystore.db.mapped_memory= 500M
 propertystore.db.strings.mapped_memory= 2G
 propertystore.db.arrays.mapped_memory= 0M

 I have not run through a complete initial node load, as the first set
 of nodes is ~16M, the second set is about 20M, and there's a good 30M
 relationships between the two I haven't gotten to yet.

 Am I configuring something wrong? I read that neo will cache all the
 nodes I create; is that what's hurting me? I do not really want to
 use the batchinserter because I think its lucene part is bugged, and
 I will be ingesting hundreds of millions of nodes live daily when
 this thing works anyway. (Yes, I have the potential to see what the
 upper limits of Neo are.)


It might be that the SQL database you're reading from causes the slowdowns...
I've seen this before a couple of times, so try to do it in two steps:

1) Extract data from your SQL database and store in a CSV file or something.
2) Import from that file into neo4j.

If you do it this way, do you experience these slowdowns?
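Step 1 can be as simple as writing each result-set row out as one CSV
line. A rough sketch of the quoting half (pure Java, no JDBC wiring
shown; the class name CsvDump is hypothetical, and RFC 4180-style
escaping is an assumption about what the import side expects):

```java
import java.util.Arrays;
import java.util.List;

// Step 1 sketch: turn each result-set row into a CSV line
// before importing into neo4j.
public class CsvDump {

    // Quote a field only when it contains a comma, a quote, or a
    // newline, doubling any embedded quotes (RFC 4180 style).
    static String escape(String field) {
        if (field == null) return "";
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Join one row's columns into a single CSV line.
    static String toLine(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(escape(fields.get(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: a,"b,c","d""e"
        System.out.println(toLine(Arrays.asList("a", "b,c", "d\"e")));
    }
}
```

Separating the export this way also makes it easy to re-run just the
neo4j import while measuring whether the slowdown is really in neo4j.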



 Also, is neo single write transaction based? My injest code is
 actually threadable and I noticed in JProfiler that only 1 thread
 would be inserting at a time.


It might be that you always create relationships to some parent node(s), so
that locks are taken on them. Those locks are held until the owning thread
has committed its transaction, which will make it look like only one thread
at a time is committing stuff.
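To make the serialization effect concrete, here is a plain-Java analogy,
with a ReentrantLock standing in for the node's write lock (none of this
is Neo4j API; it only models the queuing behaviour):

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustration only: a node's write lock behaves like a mutex held
// until the transaction commits. When every thread attaches children
// to the SAME parent node, they all queue on that one lock and
// effectively run one at a time.
public class ParentLockDemo {
    static final ReentrantLock parentWriteLock = new ReentrantLock();
    static int relationshipsCreated = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread[] writers = new Thread[4];
        for (int i = 0; i < writers.length; i++) {
            writers[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    parentWriteLock.lock();       // tx touches the shared parent
                    try {
                        relationshipsCreated++;   // "create relationship to parent"
                    } finally {
                        parentWriteLock.unlock(); // lock released only at "commit"
                    }
                }
            });
            writers[i].start();
        }
        for (Thread t : writers) t.join();
        System.out.println(relationshipsCreated); // prints: 4000
    }
}
```

Spreading inserts over many distinct parents (or deferring the parent
relationships, as discussed above) is what lets the threads proceed in
parallel.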


 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user




-- 
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com