Thank you both for your responses. - I will get some more RAM tomorrow and give Neo4J another shot. Hopefully that's a huge factor. - Tim, I like your algorithm trick! Would save a lot of reading/writing but would definitely require more memory due to the massive increase in # of edges. - Transactions are not the issue, unless reading AFTER comitting a transaction is somehow slower? I'm only committing after each of 7,000 items and like I said it took 8 hours to run through 90-some items... committing is not where the time is being spent.
To gauge the performance problem, I wanted to see how many customers are purchasing each item and I'm concerned that even this query is taking a really long time. It's simple: > For each item A > Count the number of relationships to a customer > It took 15 minutes to do 200 items. That's almost 5 seconds an item just to count the number of customers who purchased an item! (Looks like on average about 5,000 customers each, ranging from 300 to 200,000.) That's a NINE HOUR query! Considering that Neo4J advertises it can traverse 1m relationships/sec on "commodity hardware", I would expect this to be much faster. (Even if it were 50k customers per item, that'd be 7000items * 50000customers / 1m traversals = 350 seconds. 6 minutes would be much more reasonable.) My "commodity hardware" will have a lot more memory tomorrow, hopefully that'll solve these problems! Thanks, Jeff Klann p.s. My propertystore is big because I was naive on import and stored everything as string properties (this will change). How does that affect performance? On Wed, Jul 28, 2010 at 11:53 AM, Tim Jones <[email protected]> wrote: > I can't give too much help on this unfortunately, but as far as possibility > 1) > goes, my database contains around 8 million nodes, and I traverse them in > about > 15 seconds for retrievals. It's 2.8GB on disk, and the machine has 4GB of > RAM. I > allocate a 1GB heap to the JDK. > > Inserts take a little longer because of the approach I use - inserting 200K > nodes now takes a few minutes. I then have a separate step to remove > duplicates > that takes about 10-15 minutes. > > It seems to me that you might be better off doing something similar: > creating a > new Relationship PURCHASED_BOTH with an attribute 'count: 1' and always add > this > relationship between products in catalogues A and B. > > Then run a post-processing job that retrieves all PURCHASED_BOTH > relationships > for each product in catalogue A, and build an in-memory map so you only > keep one > of these relationships, and update the 'count' attribute in memory for that > relationship. Delete the duplicates and commit. This way to get your > desired > result in 2 passes instead of doing everything in one go. > > It seems a bit of a fiddle and I can't guarantee it'll improve performance > (just > to stress - I may be waaay off the mark here, but it works for me). I think > it > will though because it'll mean that your loop only has to create > relationships > instead of performing updates. Oh, and make sure that you aren't performing > one > operation per transaction - you could group together several tens of > thousands > before committing (I do 50,000 inserts before committing when I'm running > this > post-processing operation, and it's fine). > > Tim > > > > ----- Original Message ---- > > From: Jeff Klann <[email protected]> > > To: Neo4j user discussions <[email protected]> > > Sent: Wed, July 28, 2010 4:20:28 PM > > Subject: [Neo4j] Stumped by performance issue in traversal - would take a > month > >to run! > > > > Hi, I have an algorithm running on my little server that is very very > slow. > > It's a recommendation traversal (for all A and B in the catalog of > items: > > for each item A, how many customers also purchased another item in the > > catalog B). It's processed 90 items in about 8 hours so far! Before I > dive > > deeper into trying to figure out the performance problem, I thought I'd > > email the list to see if more experienced people have ideas. > > > > Some characteristics of my datastore: it's size is pretty moderate for a > > database application. 7500 items, not sure how many customers and > purchases > > (how can I find the size of an index?) but probably ~1 million > customers. > > The relationshipstore + nodestore < 500mb. (Propertystore is huge but I > > don't access it much in traversals.) > > > > The possibilities I see are: > > > > 1) *Neo4J is just slow.* Probably not slower than Postgres which I was > using > > previously, but maybe I need to switch to a distributed map-reduce db in > the > > cloud and give up the very nice graph modeling approach? I didn't think > this > > would be a problem, because my data size is pretty moderate and Neo4J is > > supposed to be fast. > > > > 2) *I just need more RAM.* I definitely need more RAM - I have a measly > 1GB > > currently. But would this get my 20day traversal down to a few hours? > > Doesn't seem like it'd have THAT much impact. I'm running Linux and > nothing > > much else besides Neo4j, so I've got 650m physical RAM. Using 300m heap, > > about 300m memory-map. > > > > 3) *There's some secret about Neo4J performance I don't know.* Is there > > something I'm unaware that Neo4J is doing? When I access a property, > does it > > load a chunk of properties I don't care about? For the current node/edge > or > > others? I turned off log rotation and I commit after each item A. Are > there > > other performance tips I might have missed? > > > > 4) *My algorithm is inefficient.* It's a fairly naive algorithm and > maybe > > there's some optimizations I can do. It looks like: > > > > > For each item A in the catalog: > > > For each customer C that has purchased that item: > > > For each item B that customer purchased: > > > Update the co-occurrence edge between A&B. > > > > > (If the edge exists, add one to its weight. If it doesn't exist, > > > create it with weight one.) > > > > > This is O(n^2) worst case, but practically it'll be much better due to > the > > sparseness of purchases. The large number of customers slows it down, > > though. The slowest part, I suspect, is the last line. It's a lot of > finding > > and re-finding edges between As and Bs and updating the edge properties. > I > > don't see much way around it, though. I wrote another version that > avoids > > this but is always O(n^2), and it takes about 15 minutes per A to check > > against all B (which would also take a month). The version above seems > to be > > averaging 3 customers/sec, which doesn't seem that slow until you > realize > > that some of these items were purchased by thousands of customers. > > > > I'd hate to give up on Neo4J. I really like the graph database concept. > But > > can it handle data? I hope someone sees something I'm doing wrong. > > > > Thanks, > > Jeff Klann > > _______________________________________________ > > Neo4j mailing list > > [email protected] > > https://lists.neo4j.org/mailman/listinfo/user > > > > > > > _______________________________________________ > Neo4j mailing list > [email protected] > https://lists.neo4j.org/mailman/listinfo/user > _______________________________________________ Neo4j mailing list [email protected] https://lists.neo4j.org/mailman/listinfo/user

