Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

Rick Bullotta Wed, 28 Jul 2010 14:48:16 -0700

Hi, Jeff.

If you are committing after each item, that will definitely slow things
down.  Start a single transaction, commit once when you're done with the
entire traversal, and report back the results.  You will still "see" the
changes you've made before committing the transaction, as long as you're
on the same execution thread.
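
To make the difference concrete, here is a rough sketch that only counts how
many commits each strategy would issue. It is plain Python, not the Neo4j
API: `run_batched`, `work`, and `commit` are hypothetical names, and `commit`
stands in for whatever actually ends your transaction.

```python
def run_batched(items, work, commit, batch_size):
    """Apply work(item) to every item, calling commit() once per full batch
    and once more for any leftover items. Returns the number of commits."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    commits = 0
    pending = 0
    for item in items:
        work(item)
        pending += 1
        if pending >= batch_size:
            commit()
            commits += 1
            pending = 0
    if pending:
        # Flush the final, partially filled batch.
        commit()
        commits += 1
    return commits


def noop(*_args):
    """Placeholder for real work and a real commit (assumption: both are cheap here)."""


# 7,000 items, as in the thread: commit per item vs. one commit at the end.
per_item_commits = run_batched(range(7000), noop, noop, batch_size=1)
single_commit = run_batched(range(7000), noop, noop, batch_size=7000)
```

With the batch size equal to the number of items this is the
single-transaction approach; anything in between is a middle ground.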


Rick

-----Original Message-----
From: [email protected] [mailto:[email protected]] On
Behalf Of Jeff Klann
Sent: Wednesday, July 28, 2010 5:43 PM
To: Neo4j user discussions
Subject: Re: [Neo4j] Stumped by performance issue in traversal - would take
a month to run!

Thank you both for your responses.

- I will get some more RAM tomorrow and give Neo4j another shot. Hopefully
that's a huge factor.
- Tim, I like your algorithm trick! It would save a lot of reading/writing
but would definitely require more memory due to the massive increase in # of
edges.
- Transactions are not the issue, unless reading AFTER committing a
transaction is somehow slower? I'm only committing after each of 7,000 items
and, like I said, it took 8 hours to run through 90-some items... committing
is not where the time is being spent.
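
For reference, the in-memory counting step could look roughly like the
sketch below. It is an assumption-laden illustration, not anyone's actual
code: it walks customers rather than items, counts each unordered item pair
once, and uses a made-up `purchases` list instead of a real graph. The
function name `cooccurrence_counts` is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(purchases):
    """purchases: iterable of (customer, item) pairs.
    Returns a Counter mapping (item_a, item_b), with item_a < item_b,
    to the number of customers who purchased both items."""
    items_by_customer = defaultdict(set)
    for customer, item in purchases:
        items_by_customer[customer].add(item)

    counts = Counter()
    for items in items_by_customer.values():
        # Every unordered pair in one customer's basket co-occurs once.
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts


# Toy data (invented for illustration only).
purchases = [
    ("c1", "apple"), ("c1", "bread"),
    ("c2", "apple"), ("c2", "bread"), ("c2", "milk"),
    ("c3", "milk"),
]
counts = cooccurrence_counts(purchases)
```

Each A-B edge would then be written once with its final weight, instead of
being found and updated repeatedly. The memory cost is one map entry per
distinct pair, which is at most n(n-1)/2 entries for n items.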

To gauge the performance problem, I wanted to see how many customers are
purchasing each item and I'm concerned that even this query is taking a
really long time. It's simple:

> For each item A
>   Count the number of relationships to a customer
>
It took 15 minutes to do 200 items. That's almost 5 seconds an item just to
count the number of customers who purchased an item! (Looks like on average
about 5,000 customers each, ranging from 300 to 200,000.) That's a NINE HOUR
query! Considering that Neo4j advertises it can traverse 1M
relationships/sec on "commodity hardware", I would expect this to be much
faster. (Even if it were 50k customers per item, that'd be 7,000 items *
50,000 customers / 1M traversals per second = 350 seconds. 6 minutes would
be much more reasonable.)
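
Spelled out as a quick sanity check, using only the figures quoted above
(the variable names are just illustrative):

```python
# Measured: 200 items counted in 15 minutes, out of 7,000 items total.
items_total = 7000
items_timed = 200
minutes_timed = 15

seconds_per_item = minutes_timed * 60 / items_timed       # observed cost per item
projected_hours = items_total * seconds_per_item / 3600   # full run at that pace

# Expected: the advertised rate of 1M relationships traversed per second.
advertised_rate = 1_000_000
pessimistic_customers = 50_000   # deliberately high per-item figure
average_customers = 5_000        # rough observed average per item

pessimistic_seconds = items_total * pessimistic_customers / advertised_rate
average_seconds = items_total * average_customers / advertised_rate

print(f"observed:  {seconds_per_item} s/item, about {projected_hours} hours in total")
print(f"expected:  {pessimistic_seconds} s (pessimistic), {average_seconds} s (average)")
```

At the observed average of about 5,000 customers per item, the advertised
rate would put the whole count at well under a minute.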

My "commodity hardware" will have a lot more memory tomorrow, hopefully
that'll solve these problems!

Thanks,
Jeff Klann
p.s. My propertystore is big because I was naive on import and stored
everything as string properties (this will change). How does that affect
performance?

On Wed, Jul 28, 2010 at 11:53 AM, Tim Jones <[email protected]> wrote:

> I can't give too much help on this unfortunately, but as far as
> possibility 1) goes, my database contains around 8 million nodes, and I
> traverse them in about 15 seconds for retrievals. It's 2.8GB on disk, and
> the machine has 4GB of RAM. I allocate a 1GB heap to the JDK.
>
> Inserts take a little longer because of the approach I use - inserting
> 200K nodes now takes a few minutes. I then have a separate step to remove
> duplicates that takes about 10-15 minutes.
>
> It seems to me that you might be better off doing something similar:
> creating a new Relationship PURCHASED_BOTH with an attribute 'count: 1'
> and always adding this relationship between products in catalogues A and B.
>
> Then run a post-processing job that retrieves all PURCHASED_BOTH
> relationships for each product in catalogue A, and build an in-memory map
> so you only keep one of these relationships, and update the 'count'
> attribute in memory for that relationship. Delete the duplicates and
> commit. This way you get your desired result in 2 passes instead of doing
> everything in one go.
>
> It seems a bit of a fiddle and I can't guarantee it'll improve performance
> (just to stress - I may be waaay off the mark here, but it works for me).
> I think it will though, because it'll mean that your loop only has to
> create relationships instead of performing updates. Oh, and make sure that
> you aren't performing one operation per transaction - you could group
> together several tens of thousands before committing (I do 50,000 inserts
> before committing when I'm running this post-processing operation, and
> it's fine).
>
> Tim
>
>
>
> ----- Original Message ----
> > From: Jeff Klann <[email protected]>
> > To: Neo4j user discussions <[email protected]>
> > Sent: Wed, July 28, 2010 4:20:28 PM
> > Subject: [Neo4j] Stumped by performance issue in traversal - would take
> > a month to run!
> >
> > Hi, I have an algorithm running on my little server that is very very
> > slow. It's a recommendation traversal (for all A and B in the catalog of
> > items: for each item A, how many customers also purchased another item
> > in the catalog B). It's processed 90 items in about 8 hours so far!
> > Before I dive deeper into trying to figure out the performance problem,
> > I thought I'd email the list to see if more experienced people have
> > ideas.
> >
> > Some characteristics of my datastore: its size is pretty moderate for a
> > database application. 7500 items, not sure how many customers and
> > purchases (how can I find the size of an index?) but probably ~1 million
> > customers. The relationshipstore + nodestore < 500mb. (Propertystore is
> > huge but I don't access it much in traversals.)
> >
> > The possibilities I see are:
> >
> > 1) *Neo4j is just slow.* Probably not slower than Postgres which I was
> > using previously, but maybe I need to switch to a distributed map-reduce
> > db in the cloud and give up the very nice graph modeling approach? I
> > didn't think this would be a problem, because my data size is pretty
> > moderate and Neo4j is supposed to be fast.
> >
> > 2) *I just need more RAM.* I definitely need more RAM - I have a measly
> > 1GB currently. But would this get my 20-day traversal down to a few
> > hours? Doesn't seem like it'd have THAT much impact. I'm running Linux
> > and nothing much else besides Neo4j, so I've got 650m physical RAM.
> > Using 300m heap, about 300m memory-map.
> >
> > 3) *There's some secret about Neo4j performance I don't know.* Is there
> > something Neo4j is doing that I'm unaware of? When I access a property,
> > does it load a chunk of properties I don't care about? For the current
> > node/edge or others? I turned off log rotation and I commit after each
> > item A. Are there other performance tips I might have missed?
> >
> > 4) *My algorithm is inefficient.* It's a fairly naive algorithm and
> > maybe there are some optimizations I can do. It looks like:
> >
> > > For each item A in the catalog:
> > >   For each customer C that has purchased that item:
> > >     For each item B that customer purchased:
> > >       Update the co-occurrence edge between A&B.
> > >       (If the edge exists, add one to its weight. If it doesn't
> > >       exist, create it with weight one.)
> > >
> > This is O(n^2) worst case, but practically it'll be much better due to
> > the sparseness of purchases. The large number of customers slows it
> > down, though. The slowest part, I suspect, is the last line. It's a lot
> > of finding and re-finding edges between As and Bs and updating the edge
> > properties. I don't see much way around it, though. I wrote another
> > version that avoids this but is always O(n^2), and it takes about 15
> > minutes per A to check against all B (which would also take a month).
> > The version above seems to be averaging 3 customers/sec, which doesn't
> > seem that slow until you realize that some of these items were purchased
> > by thousands of customers.
> >
> > I'd hate to give up on Neo4j. I really like the graph database concept.
> > But can it handle data? I hope someone sees something I'm doing wrong.
> >
> > Thanks,
> > Jeff Klann
> > _______________________________________________
> > Neo4j mailing  list
> > [email protected]
> > https://lists.neo4j.org/mailman/listinfo/user
> >
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user
