> > I turned off log rotation and I commit after each item A. Reading it again, I think this is the most likely culprit, especially if you only have 7500 items. Try everything in a single transaction.
----- Original Message ---- > From: Tim Jones <[email protected]> > To: Neo4j user discussions <[email protected]> > Sent: Wed, July 28, 2010 4:53:25 PM > Subject: Re: [Neo4j] Stumped by performance issue in traversal - would take a >month to run! > > I can't give too much help on this unfortunately, but as far as possibility > 1) > goes, my database contains around 8 million nodes, and I traverse them in >about > > 15 seconds for retrievals. It's 2.8GB on disk, and the machine has 4GB of > RAM. >I > > allocate a 1GB heap to the JDK. > > Inserts take a little longer because of the approach I use - inserting 200K > nodes now takes a few minutes. I then have a separate step to remove >duplicates > > that takes about 10-15 minutes. > > It seems to me that you might be better off doing something similar: > creating >a > > new Relationship PURCHASED_BOTH with an attribute 'count: 1' and always add >this > > relationship between products in catalogues A and B. > > Then run a post-processing job that retrieves all PURCHASED_BOTH > relationships > > for each product in catalogue A, and build an in-memory map so you only keep >one > > of these relationships, and update the 'count' attribute in memory for that > relationship. Delete the duplicates and commit. This way to get your desired > result in 2 passes instead of doing everything in one go. > > It seems a bit of a fiddle and I can't guarantee it'll improve performance >(just > > to stress - I may be waaay off the mark here, but it works for me). I think > it > > will though because it'll mean that your loop only has to create > relationships > > instead of performing updates. Oh, and make sure that you aren't performing >one > > operation per transaction - you could group together several tens of > thousands > > before committing (I do 50,000 inserts before committing when I'm running > this > > post-processing operation, and it's fine). > > Tim > > > > ----- Original Message ---- > > From: Jeff Klann <[email protected]> > > To: Neo4j user discussions <[email protected]> > > Sent: Wed, July 28, 2010 4:20:28 PM > > Subject: [Neo4j] Stumped by performance issue in traversal - would take a >month > > >to run! > > > > Hi, I have an algorithm running on my little server that is very very >slow. > > It's a recommendation traversal (for all A and B in the catalog of items: > > for each item A, how many customers also purchased another item in the > > catalog B). It's processed 90 items in about 8 hours so far! Before I dive > > deeper into trying to figure out the performance problem, I thought I'd > > email the list to see if more experienced people have ideas. > > > > Some characteristics of my datastore: it's size is pretty moderate for a > > database application. 7500 items, not sure how many customers and purchases > > (how can I find the size of an index?) but probably ~1 million customers. > > The relationshipstore + nodestore < 500mb. (Propertystore is huge but I > > don't access it much in traversals.) > > > > The possibilities I see are: > > > > 1) *Neo4J is just slow.* Probably not slower than Postgres which I was >using > > previously, but maybe I need to switch to a distributed map-reduce db in >the > > cloud and give up the very nice graph modeling approach? I didn't think >this > > would be a problem, because my data size is pretty moderate and Neo4J is > > supposed to be fast. > > > > 2) *I just need more RAM.* I definitely need more RAM - I have a measly 1GB > > currently. But would this get my 20day traversal down to a few hours? > > Doesn't seem like it'd have THAT much impact. I'm running Linux and nothing > > much else besides Neo4j, so I've got 650m physical RAM. Using 300m heap, > > about 300m memory-map. > > > > 3) *There's some secret about Neo4J performance I don't know.* Is there > > something I'm unaware that Neo4J is doing? When I access a property, does >it > > load a chunk of properties I don't care about? For the current node/edge or > > others? I turned off log rotation and I commit after each item A. Are there > > other performance tips I might have missed? > > > > 4) *My algorithm is inefficient.* It's a fairly naive algorithm and maybe > > there's some optimizations I can do. It looks like: > > > > > For each item A in the catalog: > > > For each customer C that has purchased that item: > > > For each item B that customer purchased: > > > Update the co-occurrence edge between A&B. > > > > > (If the edge exists, add one to its weight. If it doesn't exist, > > > create it with weight one.) > > > > > This is O(n^2) worst case, but practically it'll be much better due to the > > sparseness of purchases. The large number of customers slows it down, > > though. The slowest part, I suspect, is the last line. It's a lot of >finding > > and re-finding edges between As and Bs and updating the edge properties. I > > don't see much way around it, though. I wrote another version that avoids > > this but is always O(n^2), and it takes about 15 minutes per A to check > > against all B (which would also take a month). The version above seems to >be > > averaging 3 customers/sec, which doesn't seem that slow until you realize > > that some of these items were purchased by thousands of customers. > > > > I'd hate to give up on Neo4J. I really like the graph database concept. But > > can it handle data? I hope someone sees something I'm doing wrong. > > > > Thanks, > > Jeff Klann > > _______________________________________________ > > Neo4j mailing list > > [email protected] > > https://lists.neo4j.org/mailman/listinfo/user > > > > > > > _______________________________________________ > Neo4j mailing list > [email protected] > https://lists.neo4j.org/mailman/listinfo/user > _______________________________________________ Neo4j mailing list [email protected] https://lists.neo4j.org/mailman/listinfo/user

