Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

David Montag Mon, 02 Aug 2010 05:44:30 -0700

Hi Jeff,

If I'm not mistaken, Neo4j loads all properties for a node or relationship
when you invoke any operation that touches a property. As for the
performance of traversals, it is highly dependent on how deep you traverse,
and what you do during the traversal, so ymmv.


Using a traverser is slower than doing getRelationships, as the traverser
does extra processing to keep state around. Are you using Node#traverse() or
the new traversal framework? Is your code available somewhere?

Are you saying that checking whether there's a relationship between A and B
takes over 20s? How many relationships do A and B have? What does your neo
config look like (params)? Edge indexing might be a solution, you can look
at the new indexing component for that. (
https://svn.neo4j.org/laboratory/components/lucene-index/)

As for the incrementing of a property - while you're within a transaction,
couldn't you increment a variable and then write that variable at the end of
the transaction?

David

On Fri, Jul 30, 2010 at 8:10 PM, Jeff Klann <[email protected]> wrote:

> Hi, so I got 2GB more RAM and noticed that after adding some more memory
> map
> and increasing the heap space, my small query went from 6hrs to 3min. Quite
> reasonable!
>
> But the larger one that would take a month would still take a month. So
> I've
> been performance testing parts of it:
>
> The algorithm as in my first post showed *no* performance improvement on
> more RAM.
> But individual parts....
>   - Traversing only (first three lines) was much speedier, but still seems
> slow. 1.5 million traversals (15 out of 7000 items) took 23sec. It shaves
> off a few seconds if I run this twice and time it the second time, or if I
> don't print any node properties as I traverse. (Does Neo4J load ALL the
> properties for a node if one is accessed?) Even with a double run and not
> reading node properties, it still takes 16sec, which would make traversal
> take two hours. I thought Neo4J was suppposed to do ~1m traversals/sec,
> this
> is doing about 100k. Why? (And in fact on the other query it was getting
> about 800,000 traversals/sec.) Is one of Traversers vs. getRelationship
> iterators faster when getting all relationships of a type at depth 1?
>   - Searching for relationships between A & B (but not writing to them)
> takes it from 20s to 91s. Yuck. Maybe edge indexing is the way to avoid
> that?
>   - Incrementing a property on the root node for every A & B takes it from
> 20s to 61s (57s if it's all in one transaction). THAT seems weird. I
> imagine
> it has something to do with logging changes? Any way that can be turned off
> for a particular property (like it could be marked 'volatile' during a
> transaction or something)?
>
> I'm much more hopeful with the extra RAM but it's still kind of slow.
> Suggestions?
>
> Thanks,
> Jeff Klann
>
> On Wed, Jul 28, 2010 at 11:20 AM, Jeff Klann <[email protected]> wrote:
>
> > Hi, I have an algorithm running on my little server that is very very
> slow.
> > It's a recommendation traversal (for all A and B in the catalog of items:
> > for each item A, how many customers also purchased another item in the
> > catalog B). It's processed 90 items in about 8 hours so far! Before I
> dive
> > deeper into trying to figure out the performance problem, I thought I'd
> > email the list to see if more experienced people have ideas.
> >
> > Some characteristics of my datastore: it's size is pretty moderate for a
> > database application. 7500 items, not sure how many customers and
> purchases
> > (how can I find the size of an index?) but probably ~1 million customers.
> > The relationshipstore + nodestore < 500mb. (Propertystore is huge but I
> > don't access it much in traversals.)
> >
> > The possibilities I see are:
> >
> > 1) *Neo4J is just slow.* Probably not slower than Postgres which I was
> > using previously, but maybe I need to switch to a distributed map-reduce
> db
> > in the cloud and give up the very nice graph modeling approach? I didn't
> > think this would be a problem, because my data size is pretty moderate
> and
> > Neo4J is supposed to be fast.
> >
> > 2) *I just need more RAM.* I definitely need more RAM - I have a measly
> > 1GB currently. But would this get my 20day traversal down to a few hours?
> > Doesn't seem like it'd have THAT much impact. I'm running Linux and
> nothing
> > much else besides Neo4j, so I've got 650m physical RAM. Using 300m heap,
> > about 300m memory-map.
> >
> > 3) *There's some secret about Neo4J performance I don't know.* Is there
> > something I'm unaware that Neo4J is doing? When I access a property, does
> it
> > load a chunk of properties I don't care about? For the current node/edge
> or
> > others? I turned off log rotation and I commit after each item A. Are
> there
> > other performance tips I might have missed?
> >
> > 4) *My algorithm is inefficient.* It's a fairly naive algorithm and maybe
> > there's some optimizations I can do. It looks like:
> >
> >> For each item A in the catalog:
> >>   For each customer C that has purchased that item:
> >>    For each item B that customer purchased:
> >>       Update the co-occurrence edge between A&B.
> >>
> >       (If the edge exists, add one to its weight. If it doesn't exist,
> >> create it with weight one.)
> >>
> > This is O(n^2) worst case, but practically it'll be much better due to
> the
> > sparseness of purchases. The large number of customers slows it down,
> > though. The slowest part, I suspect, is the last line. It's a lot of
> finding
> > and re-finding edges between As and Bs and updating the edge properties.
> I
> > don't see much way around it, though. I wrote another version that avoids
> > this but is always O(n^2), and it takes about 15 minutes per A to check
> > against all B (which would also take a month). The version above seems to
> be
> > averaging 3 customers/sec, which doesn't seem that slow until you realize
> > that some of these items were purchased by thousands of customers.
> >
> > I'd hate to give up on Neo4J. I really like the graph database concept.
> But
> > can it handle data? I hope someone sees something I'm doing wrong.
> >
> > Thanks,
> > Jeff Klann
> >
> _______________________________________________
> Neo4j mailing list
> [email protected]
> https://lists.neo4j.org/mailman/listinfo/user
>
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

Reply via email to