Hi, Jeff. If you are committing after each item, it definitely will slow down performance. Start a single transaction, commit when you're all done the entire traversal, and report back the results. You will still "see" the changes you've made prior to committing the transaction, as long as you're on the same execution thread.
Rick -----Original Message----- From: [email protected] [mailto:[email protected]] On Behalf Of Jeff Klann Sent: Wednesday, July 28, 2010 5:43 PM To: Neo4j user discussions Subject: Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run! Thank you both for your responses. - I will get some more RAM tomorrow and give Neo4J another shot. Hopefully that's a huge factor. - Tim, I like your algorithm trick! Would save a lot of reading/writing but would definitely require more memory due to the massive increase in # of edges. - Transactions are not the issue, unless reading AFTER comitting a transaction is somehow slower? I'm only committing after each of 7,000 items and like I said it took 8 hours to run through 90-some items... committing is not where the time is being spent. To gauge the performance problem, I wanted to see how many customers are purchasing each item and I'm concerned that even this query is taking a really long time. It's simple: > For each item A > Count the number of relationships to a customer > It took 15 minutes to do 200 items. That's almost 5 seconds an item just to count the number of customers who purchased an item! (Looks like on average about 5,000 customers each, ranging from 300 to 200,000.) That's a NINE HOUR query! Considering that Neo4J advertises it can traverse 1m relationships/sec on "commodity hardware", I would expect this to be much faster. (Even if it were 50k customers per item, that'd be 7000items * 50000customers / 1m traversals = 350 seconds. 6 minutes would be much more reasonable.) My "commodity hardware" will have a lot more memory tomorrow, hopefully that'll solve these problems! Thanks, Jeff Klann p.s. My propertystore is big because I was naive on import and stored everything as string properties (this will change). How does that affect performance? On Wed, Jul 28, 2010 at 11:53 AM, Tim Jones <[email protected]> wrote: > I can't give too much help on this unfortunately, but as far as possibility > 1) > goes, my database contains around 8 million nodes, and I traverse them in > about > 15 seconds for retrievals. It's 2.8GB on disk, and the machine has 4GB of > RAM. I > allocate a 1GB heap to the JDK. > > Inserts take a little longer because of the approach I use - inserting 200K > nodes now takes a few minutes. I then have a separate step to remove > duplicates > that takes about 10-15 minutes. > > It seems to me that you might be better off doing something similar: > creating a > new Relationship PURCHASED_BOTH with an attribute 'count: 1' and always add > this > relationship between products in catalogues A and B. > > Then run a post-processing job that retrieves all PURCHASED_BOTH > relationships > for each product in catalogue A, and build an in-memory map so you only > keep one > of these relationships, and update the 'count' attribute in memory for that > relationship. Delete the duplicates and commit. This way to get your > desired > result in 2 passes instead of doing everything in one go. > > It seems a bit of a fiddle and I can't guarantee it'll improve performance > (just > to stress - I may be waaay off the mark here, but it works for me). I think > it > will though because it'll mean that your loop only has to create > relationships > instead of performing updates. Oh, and make sure that you aren't performing > one > operation per transaction - you could group together several tens of > thousands > before committing (I do 50,000 inserts before committing when I'm running > this > post-processing operation, and it's fine). > > Tim > > > > ----- Original Message ---- > > From: Jeff Klann <[email protected]> > > To: Neo4j user discussions <[email protected]> > > Sent: Wed, July 28, 2010 4:20:28 PM > > Subject: [Neo4j] Stumped by performance issue in traversal - would take a > month > >to run! > > > > Hi, I have an algorithm running on my little server that is very very > slow. > > It's a recommendation traversal (for all A and B in the catalog of > items: > > for each item A, how many customers also purchased another item in the > > catalog B). It's processed 90 items in about 8 hours so far! Before I > dive > > deeper into trying to figure out the performance problem, I thought I'd > > email the list to see if more experienced people have ideas. > > > > Some characteristics of my datastore: it's size is pretty moderate for a > > database application. 7500 items, not sure how many customers and > purchases > > (how can I find the size of an index?) but probably ~1 million > customers. > > The relationshipstore + nodestore < 500mb. (Propertystore is huge but I > > don't access it much in traversals.) > > > > The possibilities I see are: > > > > 1) *Neo4J is just slow.* Probably not slower than Postgres which I was > using > > previously, but maybe I need to switch to a distributed map-reduce db in > the > > cloud and give up the very nice graph modeling approach? I didn't think > this > > would be a problem, because my data size is pretty moderate and Neo4J is > > supposed to be fast. > > > > 2) *I just need more RAM.* I definitely need more RAM - I have a measly > 1GB > > currently. But would this get my 20day traversal down to a few hours? > > Doesn't seem like it'd have THAT much impact. I'm running Linux and > nothing > > much else besides Neo4j, so I've got 650m physical RAM. Using 300m heap, > > about 300m memory-map. > > > > 3) *There's some secret about Neo4J performance I don't know.* Is there > > something I'm unaware that Neo4J is doing? When I access a property, > does it > > load a chunk of properties I don't care about? For the current node/edge > or > > others? I turned off log rotation and I commit after each item A. Are > there > > other performance tips I might have missed? > > > > 4) *My algorithm is inefficient.* It's a fairly naive algorithm and > maybe > > there's some optimizations I can do. It looks like: > > > > > For each item A in the catalog: > > > For each customer C that has purchased that item: > > > For each item B that customer purchased: > > > Update the co-occurrence edge between A&B. > > > > > (If the edge exists, add one to its weight. If it doesn't exist, > > > create it with weight one.) > > > > > This is O(n^2) worst case, but practically it'll be much better due to > the > > sparseness of purchases. The large number of customers slows it down, > > though. The slowest part, I suspect, is the last line. It's a lot of > finding > > and re-finding edges between As and Bs and updating the edge properties. > I > > don't see much way around it, though. I wrote another version that > avoids > > this but is always O(n^2), and it takes about 15 minutes per A to check > > against all B (which would also take a month). The version above seems > to be > > averaging 3 customers/sec, which doesn't seem that slow until you > realize > > that some of these items were purchased by thousands of customers. > > > > I'd hate to give up on Neo4J. I really like the graph database concept. > But > > can it handle data? I hope someone sees something I'm doing wrong. > > > > Thanks, > > Jeff Klann > > _______________________________________________ > > Neo4j mailing list > > [email protected] > > https://lists.neo4j.org/mailman/listinfo/user > > > > > > > _______________________________________________ > Neo4j mailing list > [email protected] > https://lists.neo4j.org/mailman/listinfo/user > _______________________________________________ Neo4j mailing list [email protected] https://lists.neo4j.org/mailman/listinfo/user _______________________________________________ Neo4j mailing list [email protected] https://lists.neo4j.org/mailman/listinfo/user

