Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

Jeff Klann Wed, 28 Jul 2010 14:43:00 -0700

Thank you both for your responses.

- I will get some more RAM tomorrow and give Neo4J another shot. Hopefully
that's a huge factor.
- Tim, I like your algorithm trick! Would save a lot of reading/writing but
would definitely require more memory due to the massive increase in # of
edges.
- Transactions are not the issue, unless reading AFTER comitting a
transaction is somehow slower? I'm only committing after each of 7,000 items
and like I said it took 8 hours to run through 90-some items... committing
is not where the time is being spent.


To gauge the performance problem, I wanted to see how many customers are
purchasing each item and I'm concerned that even this query is taking a
really long time. It's simple:

> For each item A
>   Count the number of relationships to a customer
>
It took 15 minutes to do 200 items. That's almost 5 seconds an item just to
count the number of customers who purchased an item! (Looks like on average
about 5,000 customers each, ranging from 300 to 200,000.) That's a NINE HOUR
query! Considering that Neo4J advertises it can traverse 1m
relationships/sec on "commodity hardware", I would expect this to be much
faster. (Even if it were 50k customers per item, that'd be 7000items *
50000customers / 1m traversals = 350 seconds. 6 minutes would be much more
reasonable.)

My "commodity hardware" will have a lot more memory tomorrow, hopefully
that'll solve these problems!

Thanks,
Jeff Klann
p.s. My propertystore is big because I was naive on import and stored
everything as string properties (this will change). How does that affect
performance?

On Wed, Jul 28, 2010 at 11:53 AM, Tim Jones <[email protected]> wrote:

> I can't give too much help on this unfortunately, but as far as possibility
> 1)
> goes, my database contains around 8 million nodes, and I traverse them in
> about
> 15 seconds for retrievals. It's 2.8GB on disk, and the machine has 4GB of
> RAM. I
> allocate a 1GB heap to the JDK.
>
> Inserts take a little longer because of the approach I use - inserting 200K
> nodes now takes a few minutes. I then have a separate step to remove
> duplicates
> that takes about 10-15 minutes.
>
> It seems to me that you might be better off doing something similar:
> creating a
> new Relationship PURCHASED_BOTH with an attribute 'count: 1' and always add
> this
> relationship between products in catalogues A and B.
>
> Then run a post-processing job that retrieves all PURCHASED_BOTH
> relationships
> for each product in catalogue A, and build an in-memory map so you only
> keep one
> of these relationships, and update the 'count' attribute in memory for that
> relationship. Delete the duplicates and commit. This way to get your
> desired
> result in 2 passes instead of doing everything in one go.
>
> It seems a bit of a fiddle and I can't guarantee it'll improve performance
> (just
> to stress - I may be waaay off the mark here, but it works for me). I think
> it
> will though because it'll mean that your loop only has to create
> relationships
> instead of performing updates. Oh, and make sure that you aren't performing
> one
> operation per transaction - you could group together several tens of
> thousands
> before committing (I do 50,000 inserts before committing when I'm running
> this
> post-processing operation, and it's fine).
>
> Tim
>
>
>
> ----- Original Message ----
> > From: Jeff Klann <[email protected]>
> > To: Neo4j user discussions <[email protected]>
> > Sent: Wed, July 28, 2010 4:20:28 PM
> > Subject: [Neo4j] Stumped by performance issue in traversal - would take a
> month
> >to run!
> >
> > Hi, I have an algorithm running on my little server that is very very
>  slow.
> > It's a recommendation traversal (for all A and B in the catalog of
>  items:
> > for each item A, how many customers also purchased another item in  the
> > catalog B). It's processed 90 items in about 8 hours so far! Before I
>  dive
> > deeper into trying to figure out the performance problem, I thought  I'd
> > email the list to see if more experienced people have ideas.
> >
> > Some  characteristics of my datastore: it's size is pretty moderate for a
> > database  application. 7500 items, not sure how many customers and
> purchases
> > (how can I  find the size of an index?) but probably ~1 million
> customers.
> > The  relationshipstore + nodestore < 500mb. (Propertystore is huge but I
> > don't  access it much in traversals.)
> >
> > The possibilities I see are:
> >
> > 1)  *Neo4J is just slow.* Probably not slower than Postgres which I was
>  using
> > previously, but maybe I need to switch to a distributed map-reduce db  in
> the
> > cloud and give up the very nice graph modeling approach? I didn't  think
> this
> > would be a problem, because my data size is pretty moderate and  Neo4J is
> > supposed to be fast.
> >
> > 2) *I just need more RAM.* I definitely  need more RAM - I have a measly
> 1GB
> > currently. But would this get my 20day  traversal down to a few hours?
> > Doesn't seem like it'd have THAT much impact.  I'm running Linux and
> nothing
> > much else besides Neo4j, so I've got 650m  physical RAM. Using 300m heap,
> > about 300m memory-map.
> >
> > 3) *There's some  secret about Neo4J performance I don't know.* Is there
> > something I'm unaware  that Neo4J is doing? When I access a property,
> does it
> > load a chunk of  properties I don't care about? For the current node/edge
> or
> > others? I turned  off log rotation and I commit after each item A. Are
> there
> > other performance  tips I might have missed?
> >
> > 4) *My algorithm is inefficient.* It's a fairly  naive algorithm and
> maybe
> > there's some optimizations I can do. It looks  like:
> >
> > > For each item A in the catalog:
> > >   For each  customer C that has purchased that item:
> > >    For each item B  that customer purchased:
> > >       Update the co-occurrence  edge between A&B.
> > >
> >       (If the edge exists, add  one to its weight. If it doesn't exist,
> > > create it with weight  one.)
> > >
> > This is O(n^2) worst case, but practically it'll be much better  due to
> the
> > sparseness of purchases. The large number of customers slows it  down,
> > though. The slowest part, I suspect, is the last line. It's a lot of
>  finding
> > and re-finding edges between As and Bs and updating the edge  properties.
> I
> > don't see much way around it, though. I wrote another version  that
> avoids
> > this but is always O(n^2), and it takes about 15 minutes per A to  check
> > against all B (which would also take a month). The version above seems
>  to be
> > averaging 3 customers/sec, which doesn't seem that slow until you
>  realize
> > that some of these items were purchased by thousands of  customers.
> >
> > I'd hate to give up on Neo4J. I really like the graph database  concept.
> But
> > can it handle data? I hope someone sees something I'm doing  wrong.
> >
> > Thanks,
> > Jeff  Klann
> > _______________________________________________
> > Neo4j mailing  list
> > [email protected]
> > https://lists.neo4j.org/mailman/listinfo/user
> >
>
>
>
>
> _______________________________________________
> Neo4j mailing list
> [email protected]
> https://lists.neo4j.org/mailman/listinfo/user
>
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

Reply via email to