Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

Peter Neubauer Fri, 06 Aug 2010 23:04:31 -0700

Good Jeff,
let the community know how that goes!

Cheers,


/peter neubauer

COO and Sales, Neo Technology

GTalk:      neubauer.peter
Skype       peter.neubauer
Phone       +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter      http://twitter.com/peterneubauer

http://www.neo4j.org               - Your high performance graph database.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.



On Sat, Aug 7, 2010 at 6:40 AM, Jeff Klann <[email protected]> wrote:
> To anyone interested, after reading Martin's e-mail I ran a test of a small
> (75mb) Neo4J graph and traversed it (~500k edges) with memory mapping turned
> off in about 1.5 minutes on my hard drive and 30sec when copied to a
> (probably slow) flash drive. Which incidentally is the same time it took
> with memory mapping turned on with the HDD. Very nice. I think I will pick
> up a larger flash drive (or SSD if I can find one inexpensively) and try it
> out on a more sizable graph.
>
> Thanks,
> Jeff Klann
>
> On Thu, Aug 5, 2010 at 7:13 PM, Martin Neumann 
> <[email protected]>wrote:
>
>> > >
>> > > - Martin, I'm confused a bit about SSDs. I read up on them after I read
>> > > your
>> > > post. You said flash drives are best, but I read that even the highest
>> > > performing flash drives are about 30MB/s read, whereas modern hard
>> drives
>> > > are at least 50MB/s. True SSDs claim to be 50MB/s too but they're quite
>> > > expensive. So why is a flash drive best? I could definitely spring for
>> > one
>> > > big enough to hold my db if it'd help a lot, but it has that slower
>> read
>> > > speed. Does the faster seek time really make that much of a difference?
>> > Any
>> > > brands you'd recommend?
>> > >
>>
>> Neo4j stores the data as Graph on HD.
>>
>> An example: e = (n1,n2)
>> e at location 1000
>> n1 at location 1
>> n2 at location 5
>>
>> A traversal, assuming nothing is cached, would result in moving the head to
>> 1 then to 1000 then back to 5.
>> Normal HD take a while to move to the locations before it can start to read
>> data. SSD does not have these delays. If you read little data that is
>> spread
>> widely over the storage, like in a traversal, SSD are much faster then HD
>> even if they are slower to retrieve the data.
>> I don't have performance data on that myself but I heard rumors of around
>> 20-40 times speedup.
>>
>> cheers Martin
>>
>>
>> On Thu, Aug 5, 2010 at 9:02 PM, Jeff Klann <[email protected]> wrote:
>>
>> > Thanks for the answers.
>> >
>> > Yes, I can do online updates of the datastore, but while this is in R&D I
>> > will need to rerun the main loop when I change the algorithm and just for
>> > personal benefit I don't want to wait hours to see the changes. Seems to
>> be
>> > running acceptably now, though. However, I haven't benchmarked it against
>> > doing JOINS in Postgres. Are there any good performance stats out there?
>> > The
>> > speed is about the same as I'd expect from SQL.
>> >
>> > The graph will probably be nearly a complete graph in the end. The edges
>> > between orders will eventually store various stats on the relationships
>> > between pairs of items. It'd be nice if I can query an index for outgoing
>> > edges from nodes with certain properties. Is this possible? I'll have a
>> > look
>> > at the edge indexer component.
>> >
>> > Thanks,
>> > Jeff Klann
>> >
>> > On Mon, Aug 2, 2010 at 2:40 PM, David Montag <
>> > [email protected]
>> > > wrote:
>> >
>> > > Hi Jeff,
>> > >
>> > > Please see answers below.
>> > >
>> > > On Mon, Aug 2, 2010 at 5:47 PM, Jeff Klann <[email protected]> wrote:
>> > >
>> > > > Thank you all for your continued interest in helping me. I tweaked
>> the
>> > > code
>> > > > more to minimize writes to the database and it now looks like:
>> > > > For each item A
>> > > >   For each customer that purchased A
>> > > >      For each item B (with id>A) that A purchased
>> > > >         Increment (in memory) the weight of (A-B)
>> > > >   Write out the edges [(A-B):weight] to disk and clear the in-memory
>> > map
>> > > >
>> > > > This actually (if I'm not mistaken) covers all relationships and does
>> > > 7500
>> > > > items in about 45 minutes! Not too bad, especially due to (I think)
>> > > > avoiding
>> > > > edge-checking, and I think it's usable for my application, though
>> it's
>> > > > still
>> > > > ~200k traversals/sec on average, which is a few times slower than I
>> > > hoped.
>> > > > I
>> > > > doubt that's much faster than a two-table join in SQL, though deeper
>> > > > traversals should show benefits.
>> > > >
>> > >
>> > > Do you need to do this computation on the full graph all the time?
>> Maybe
>> > it
>> > > would be enough to do it once, and then update it when a customer buys
>> > > something? Usually, high one-time costs can be tolerated, and with
>> Neo4j
>> > > you
>> > > can actually do the updating for a customer performing a purchase at
>> > > runtime
>> > > without performance problems.
>> > >
>> > >
>> > > >
>> > > > - David, thank you for your answers on traversers vs.
>> getRelationships
>> > > and
>> > > > on property-loading. I imported some properties I don't really need,
>> > > > perhaps
>> > > > if I delete them it'll speed things up? Also I'm using the old
>> > > > Node.traverse(). How is the new framework better? I expect it has a
>> > nicer
>> > > > syntax, which I would like to try, but does it improve performance
>> too?
>> > > >
>> > >
>> > > Well, depending on your setup you should be able to theoretically
>> improve
>> > > performance compared to the old traversal framework. The old framework
>> > > keeps
>> > > track of visited nodes, so that you don't traverse to the same node
>> > twice.
>> > > This behavior is customizable in the new framework. Please see
>> > > http://wiki.neo4j.org/content/Traversal_Framework and check the
>> > Uniqueness
>> > > constraints. If you know exactly when to stop, then you should be able
>> to
>> > > use Uniqueness.NONE, meaning that the framework does not keep track of
>> > > visited nodes, meaning that you could end up traversing in a cycle. In
>> > your
>> > > network however, you might know that you always traverse (item)
>> > <--BOUGHT--
>> > > (customer) --BOUGHT--> (item) --CORRELATION--> (item)*  and no further
>> > than
>> > > that, so then you know that you won't end up in a cycle. But yeah, then
>> > you
>> > > need to programmatically make sure you don't go too far. And I don't
>> know
>> > > if
>> > > this gives you any performance benefits what so ever.
>> > >
>> > > Also, as I understand it, all properties for a node are loaded when
>> they
>> > > are
>> > > first touched. Then they're kept in memory, so if you update properties
>> > > later on the same node, and it is still cached, it won't reread
>> > everything.
>> > >
>> > >
>> > > >
>> > > > - David, on checking relationships, I said checking 15 nodes for
>> > > > relationships to n other nodes (where n might be large, I'm not sure
>> > > large,
>> > > > but <<7500), takes 71s. The nodes are a highly-connected graph and
>> also
>> > > > with
>> > > > edges going out to customers. So in the end the max & edges for a
>> node
>> > > > would
>> > > > be very high, up to around 7500 items and 300,000 customers.
>> > > >
>> > >
>> > > Just so I understand your data model: if a customer buys N products A1
>> -
>> > > AN,
>> > > will there be be a complete graph between the nodes A1 - AN? When in
>> your
>> > > algorithm do you need to check for the occurrence of a relationship
>> > between
>> > > A and B?
>> > >
>> > >
>> > > >
>> > > > - Martin, I'm confused a bit about SSDs. I read up on them after I
>> read
>> > > > your
>> > > > post. You said flash drives are best, but I read that even the
>> highest
>> > > > performing flash drives are about 30MB/s read, whereas modern hard
>> > drives
>> > > > are at least 50MB/s. True SSDs claim to be 50MB/s too but they're
>> quite
>> > > > expensive. So why is a flash drive best? I could definitely spring
>> for
>> > > one
>> > > > big enough to hold my db if it'd help a lot, but it has that slower
>> > read
>> > > > speed. Does the faster seek time really make that much of a
>> difference?
>> > > Any
>> > > > brands you'd recommend?
>> > > >
>> > >
>> > > I think the general consensus is that an SSD is usually the single best
>> > > upgrade you can get for a computer or server. The blazingly fast seeks
>> > make
>> > > all the difference. If you have a big file with data spread out over it
>> > and
>> > > you need to read and write to different locations of the file rapidly,
>> > that
>> > > means a lot of work for the heads in a conventional hard drive. The SSD
>> > > nails this. Know when you start an application or do something
>> processing
>> > > heavy, and you hear your hard drive "work"? It's seeking.
>> > >
>> > > As for brands, I've heard good things about the Intel X25 ones. I have
>> an
>> > > SSD in my mac, but I don't know what brand it is. All I know is that
>> it's
>> > > ridiculously fast.
>> > >
>> > > David
>> > >
>> > >
>> > > >
>> > > > I will post some code snippets. Looks like there are a lot of sites
>> for
>> > > > sharing codes snippets. Any recommendation?
>> > > >
>> > > > Thanks all,
>> > > > Jeff Klann
>> > > >
>> > > > On Mon, Aug 2, 2010 at 8:44 AM, David Montag <
>> > > > [email protected]
>> > > > > wrote:
>> > > >
>> > > > > Hi Jeff,
>> > > > >
>> > > > > If I'm not mistaken, Neo4j loads all properties for a node or
>> > > > relationship
>> > > > > when you invoke any operation that touches a property. As for the
>> > > > > performance of traversals, it is highly dependent on how deep you
>> > > > traverse,
>> > > > > and what you do during the traversal, so ymmv.
>> > > > >
>> > > > > Using a traverser is slower than doing getRelationships, as the
>> > > traverser
>> > > > > does extra processing to keep state around. Are you using
>> > > Node#traverse()
>> > > > > or
>> > > > > the new traversal framework? Is your code available somewhere?
>> > > > >
>> > > > > Are you saying that checking whether there's a relationship between
>> A
>> > > and
>> > > > B
>> > > > > takes over 20s? How many relationships do A and B have? What does
>> > your
>> > > > neo
>> > > > > config look like (params)? Edge indexing might be a solution, you
>> can
>> > > > look
>> > > > > at the new indexing component for that. (
>> > > > > https://svn.neo4j.org/laboratory/components/lucene-index/)
>> > > > >
>> > > > > As for the incrementing of a property - while you're within a
>> > > > transaction,
>> > > > > couldn't you increment a variable and then write that variable at
>> the
>> > > end
>> > > > > of
>> > > > > the transaction?
>> > > > >
>> > > > > David
>> > > > >
>> > > > > On Fri, Jul 30, 2010 at 8:10 PM, Jeff Klann <[email protected]>
>> > wrote:
>> > > > >
>> > > > > > Hi, so I got 2GB more RAM and noticed that after adding some more
>> > > > memory
>> > > > > > map
>> > > > > > and increasing the heap space, my small query went from 6hrs to
>> > 3min.
>> > > > > Quite
>> > > > > > reasonable!
>> > > > > >
>> > > > > > But the larger one that would take a month would still take a
>> > month.
>> > > So
>> > > > > > I've
>> > > > > > been performance testing parts of it:
>> > > > > >
>> > > > > > The algorithm as in my first post showed *no* performance
>> > improvement
>> > > > on
>> > > > > > more RAM.
>> > > > > > But individual parts....
>> > > > > >   - Traversing only (first three lines) was much speedier, but
>> > still
>> > > > > seems
>> > > > > > slow. 1.5 million traversals (15 out of 7000 items) took 23sec.
>> It
>> > > > shaves
>> > > > > > off a few seconds if I run this twice and time it the second
>> time,
>> > or
>> > > > if
>> > > > > I
>> > > > > > don't print any node properties as I traverse. (Does Neo4J load
>> ALL
>> > > the
>> > > > > > properties for a node if one is accessed?) Even with a double run
>> > and
>> > > > not
>> > > > > > reading node properties, it still takes 16sec, which would make
>> > > > traversal
>> > > > > > take two hours. I thought Neo4J was suppposed to do ~1m
>> > > traversals/sec,
>> > > > > > this
>> > > > > > is doing about 100k. Why? (And in fact on the other query it was
>> > > > getting
>> > > > > > about 800,000 traversals/sec.) Is one of Traversers vs.
>> > > getRelationship
>> > > > > > iterators faster when getting all relationships of a type at
>> depth
>> > 1?
>> > > > > >   - Searching for relationships between A & B (but not writing to
>> > > them)
>> > > > > > takes it from 20s to 91s. Yuck. Maybe edge indexing is the way to
>> > > avoid
>> > > > > > that?
>> > > > > >   - Incrementing a property on the root node for every A & B
>> takes
>> > it
>> > > > > from
>> > > > > > 20s to 61s (57s if it's all in one transaction). THAT seems
>> weird.
>> > I
>> > > > > > imagine
>> > > > > > it has something to do with logging changes? Any way that can be
>> > > turned
>> > > > > off
>> > > > > > for a particular property (like it could be marked 'volatile'
>> > during
>> > > a
>> > > > > > transaction or something)?
>> > > > > >
>> > > > > > I'm much more hopeful with the extra RAM but it's still kind of
>> > slow.
>> > > > > > Suggestions?
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Jeff Klann
>> > > > > >
>> > > > > > On Wed, Jul 28, 2010 at 11:20 AM, Jeff Klann <[email protected]>
>> > > wrote:
>> > > > > >
>> > > > > > > Hi, I have an algorithm running on my little server that is
>> very
>> > > very
>> > > > > > slow.
>> > > > > > > It's a recommendation traversal (for all A and B in the catalog
>> > of
>> > > > > items:
>> > > > > > > for each item A, how many customers also purchased another item
>> > in
>> > > > the
>> > > > > > > catalog B). It's processed 90 items in about 8 hours so far!
>> > Before
>> > > I
>> > > > > > dive
>> > > > > > > deeper into trying to figure out the performance problem, I
>> > thought
>> > > > I'd
>> > > > > > > email the list to see if more experienced people have ideas.
>> > > > > > >
>> > > > > > > Some characteristics of my datastore: it's size is pretty
>> > moderate
>> > > > for
>> > > > > a
>> > > > > > > database application. 7500 items, not sure how many customers
>> and
>> > > > > > purchases
>> > > > > > > (how can I find the size of an index?) but probably ~1 million
>> > > > > customers.
>> > > > > > > The relationshipstore + nodestore < 500mb. (Propertystore is
>> huge
>> > > but
>> > > > I
>> > > > > > > don't access it much in traversals.)
>> > > > > > >
>> > > > > > > The possibilities I see are:
>> > > > > > >
>> > > > > > > 1) *Neo4J is just slow.* Probably not slower than Postgres
>> which
>> > I
>> > > > was
>> > > > > > > using previously, but maybe I need to switch to a distributed
>> > > > > map-reduce
>> > > > > > db
>> > > > > > > in the cloud and give up the very nice graph modeling approach?
>> I
>> > > > > didn't
>> > > > > > > think this would be a problem, because my data size is pretty
>> > > > moderate
>> > > > > > and
>> > > > > > > Neo4J is supposed to be fast.
>> > > > > > >
>> > > > > > > 2) *I just need more RAM.* I definitely need more RAM - I have
>> a
>> > > > measly
>> > > > > > > 1GB currently. But would this get my 20day traversal down to a
>> > few
>> > > > > hours?
>> > > > > > > Doesn't seem like it'd have THAT much impact. I'm running Linux
>> > and
>> > > > > > nothing
>> > > > > > > much else besides Neo4j, so I've got 650m physical RAM. Using
>> > 300m
>> > > > > heap,
>> > > > > > > about 300m memory-map.
>> > > > > > >
>> > > > > > > 3) *There's some secret about Neo4J performance I don't know.*
>> Is
>> > > > there
>> > > > > > > something I'm unaware that Neo4J is doing? When I access a
>> > > property,
>> > > > > does
>> > > > > > it
>> > > > > > > load a chunk of properties I don't care about? For the current
>> > > > > node/edge
>> > > > > > or
>> > > > > > > others? I turned off log rotation and I commit after each item
>> A.
>> > > Are
>> > > > > > there
>> > > > > > > other performance tips I might have missed?
>> > > > > > >
>> > > > > > > 4) *My algorithm is inefficient.* It's a fairly naive algorithm
>> > and
>> > > > > maybe
>> > > > > > > there's some optimizations I can do. It looks like:
>> > > > > > >
>> > > > > > >> For each item A in the catalog:
>> > > > > > >>   For each customer C that has purchased that item:
>> > > > > > >>    For each item B that customer purchased:
>> > > > > > >>       Update the co-occurrence edge between A&B.
>> > > > > > >>
>> > > > > > >       (If the edge exists, add one to its weight. If it doesn't
>> > > > exist,
>> > > > > > >> create it with weight one.)
>> > > > > > >>
>> > > > > > > This is O(n^2) worst case, but practically it'll be much better
>> > due
>> > > > to
>> > > > > > the
>> > > > > > > sparseness of purchases. The large number of customers slows it
>> > > down,
>> > > > > > > though. The slowest part, I suspect, is the last line. It's a
>> lot
>> > > of
>> > > > > > finding
>> > > > > > > and re-finding edges between As and Bs and updating the edge
>> > > > > properties.
>> > > > > > I
>> > > > > > > don't see much way around it, though. I wrote another version
>> > that
>> > > > > avoids
>> > > > > > > this but is always O(n^2), and it takes about 15 minutes per A
>> to
>> > > > check
>> > > > > > > against all B (which would also take a month). The version
>> above
>> > > > seems
>> > > > > to
>> > > > > > be
>> > > > > > > averaging 3 customers/sec, which doesn't seem that slow until
>> you
>> > > > > realize
>> > > > > > > that some of these items were purchased by thousands of
>> > customers.
>> > > > > > >
>> > > > > > > I'd hate to give up on Neo4J. I really like the graph database
>> > > > concept.
>> > > > > > But
>> > > > > > > can it handle data? I hope someone sees something I'm doing
>> > wrong.
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > > Jeff Klann
>> > > > > > >
>> > > > > > _______________________________________________
>> > > > > > Neo4j mailing list
>> > > > > > [email protected]
>> > > > > > https://lists.neo4j.org/mailman/listinfo/user
>> > > > > >
>> > > > > _______________________________________________
>> > > > > Neo4j mailing list
>> > > > > [email protected]
>> > > > > https://lists.neo4j.org/mailman/listinfo/user
>> > > > >
>> > > > _______________________________________________
>> > > > Neo4j mailing list
>> > > > [email protected]
>> > > > https://lists.neo4j.org/mailman/listinfo/user
>> > > >
>> > > _______________________________________________
>> > > Neo4j mailing list
>> > > [email protected]
>> > > https://lists.neo4j.org/mailman/listinfo/user
>> > >
>> > _______________________________________________
>> > Neo4j mailing list
>> > [email protected]
>> > https://lists.neo4j.org/mailman/listinfo/user
>> >
>> _______________________________________________
>> Neo4j mailing list
>> [email protected]
>> https://lists.neo4j.org/mailman/listinfo/user
>>
> _______________________________________________
> Neo4j mailing list
> [email protected]
> https://lists.neo4j.org/mailman/listinfo/user
>
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

Reply via email to