Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

Tim Jones Wed, 28 Jul 2010 09:19:07 -0700

> > I turned off log rotation and I commit after each item A.

Reading it again, I think this is the most likely culprit, especially if you 
only have 7500 items. Try everything in a single transaction.
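
One way this could look, as a rough Python sketch rather than code against
the real Neo4j API: count the co-purchases in an in-memory map first, then
write each edge once and commit in large batches. `RecordingStore` and its
`create_edge`/`commit` methods are made up for illustration and only stand in
for whatever transaction API you actually use. The sketch also assumes that a
repeat purchase counts once per customer, that an item is not paired with
itself, and that (A, B) and (B, A) are kept as separate pairs.

```python
from collections import Counter
from itertools import islice


def count_co_purchases(purchases_by_customer):
    """Count, for each ordered pair (A, B), how many customers bought both.

    purchases_by_customer: dict mapping customer id -> iterable of item ids.
    Returns a Counter keyed by (item_a, item_b) with item_a != item_b.
    """
    counts = Counter()
    for items in purchases_by_customer.values():
        unique_items = set(items)  # assumption: repeats count once per customer
        for a in unique_items:
            for b in unique_items:
                if a != b:  # assumption: no self-edges
                    counts[(a, b)] += 1
    return counts


class RecordingStore:
    """Hypothetical stand-in for a graph store; this is NOT the Neo4j API."""

    def __init__(self):
        self.edges = {}      # committed edges: (a, b) -> weight
        self.commits = 0     # number of commits performed
        self._pending = {}   # edges created since the last commit

    def create_edge(self, a, b, weight):
        self._pending[(a, b)] = weight

    def commit(self):
        self.edges.update(self._pending)
        self._pending.clear()
        self.commits += 1


def write_edges(store, counts, batch_size=50_000):
    """Write every counted edge once, committing after each batch."""
    pairs = iter(counts.items())
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            break
        for (a, b), weight in batch:
            store.create_edge(a, b, weight)
        store.commit()
```

With this shape the inner loop never has to find and update an existing edge;
every edge is created exactly once with its final weight, and the number of
commits depends on `batch_size` rather than on the number of items.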




----- Original Message ----
> From: Tim Jones <[email protected]>
> To: Neo4j user discussions <[email protected]>
> Sent: Wed, July 28, 2010 4:53:25 PM
> Subject: Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!
> 
> I can't give too much help on this unfortunately, but as far as possibility
> 1) goes, my database contains around 8 million nodes, and I traverse them in
> about 15 seconds for retrievals. It's 2.8GB on disk, and the machine has 4GB
> of RAM. I allocate a 1GB heap to the JVM.
> 
> Inserts take a little longer because of the approach I use - inserting 200K
> nodes now takes a few minutes. I then have a separate step to remove
> duplicates that takes about 10-15 minutes.
> 
> It seems to me that you might be better off doing something similar:
> creating a new Relationship PURCHASED_BOTH with an attribute 'count: 1' and
> always adding this relationship between products in catalogues A and B.
> 
> Then run a post-processing job that retrieves all PURCHASED_BOTH
> relationships for each product in catalogue A, and build an in-memory map so
> you only keep one of these relationships, and update the 'count' attribute
> in memory for that relationship. Delete the duplicates and commit. This way
> you get your desired result in 2 passes instead of doing everything in one
> go.
> 
> It seems a bit of a fiddle and I can't guarantee it'll improve performance
> (just to stress - I may be waaay off the mark here, but it works for me). I
> think it will though, because it'll mean that your loop only has to create
> relationships instead of performing updates. Oh, and make sure that you
> aren't performing one operation per transaction - you could group together
> several tens of thousands before committing (I do 50,000 inserts before
> committing when I'm running this post-processing operation, and it's fine).
> 
> Tim
> 
> 
> 
> ----- Original Message  ----
> > From: Jeff Klann <[email protected]>
> > To: Neo4j user discussions <[email protected]>
> > Sent: Wed, July 28, 2010 4:20:28 PM
> > Subject: [Neo4j] Stumped by performance issue in traversal - would take a month to run!
> > 
> > Hi, I have an algorithm running on my little server that is very, very
> > slow. It's a recommendation traversal (for all A and B in the catalog of
> > items: for each item A, how many customers also purchased another item B
> > in the catalog). It's processed 90 items in about 8 hours so far! Before I
> > dive deeper into trying to figure out the performance problem, I thought
> > I'd email the list to see if more experienced people have ideas.
> > 
> > Some characteristics of my datastore: its size is pretty moderate for a
> > database application. 7500 items, not sure how many customers and
> > purchases (how can I find the size of an index?) but probably ~1 million
> > customers. The relationshipstore + nodestore < 500MB. (Propertystore is
> > huge but I don't access it much in traversals.)
> > 
> > The possibilities I see are:
> > 
> > 1) *Neo4j is just slow.* Probably not slower than Postgres, which I was
> > using previously, but maybe I need to switch to a distributed map-reduce
> > db in the cloud and give up the very nice graph modeling approach? I
> > didn't think this would be a problem, because my data size is pretty
> > moderate and Neo4j is supposed to be fast.
> > 
> > 2) *I just need more RAM.* I definitely need more RAM - I have a measly
> > 1GB currently. But would this get my 20-day traversal down to a few hours?
> > Doesn't seem like it'd have THAT much impact. I'm running Linux and
> > nothing much else besides Neo4j, so I've got 650MB physical RAM. Using
> > 300MB heap, about 300MB memory-map.
> > 
> > 3) *There's some secret about Neo4j performance I don't know.* Is there
> > something I'm unaware of that Neo4j is doing? When I access a property,
> > does it load a chunk of properties I don't care about? For the current
> > node/edge or others? I turned off log rotation and I commit after each
> > item A. Are there other performance tips I might have missed?
> > 
> > 4) *My algorithm is inefficient.* It's a fairly naive algorithm and maybe
> > there are some optimizations I can do. It looks like:
> > 
> > >  For each item A in the catalog:
> > >    For each customer C that has purchased that item:
> > >      For each item B that customer purchased:
> > >        Update the co-occurrence edge between A&B.
> > >        (If the edge exists, add one to its weight. If it doesn't exist,
> > >        create it with weight one.)
> > >
> > This is O(n^2) worst case, but practically it'll be much better due to the
> > sparseness of purchases. The large number of customers slows it down,
> > though. The slowest part, I suspect, is the last line. It's a lot of
> > finding and re-finding edges between As and Bs and updating the edge
> > properties. I don't see much way around it, though. I wrote another
> > version that avoids this but is always O(n^2), and it takes about 15
> > minutes per A to check against all B (which would also take a month). The
> > version above seems to be averaging 3 customers/sec, which doesn't seem
> > that slow until you realize that some of these items were purchased by
> > thousands of customers.
> > 
> > I'd hate to give up on Neo4j. I really like the graph database concept.
> > But can it handle data? I hope someone sees something I'm doing wrong.
> > 
> > Thanks,
> > Jeff Klann
> > _______________________________________________
> > Neo4j mailing list
> > [email protected]
> > https://lists.neo4j.org/mailman/listinfo/user
> > 
> 


      

