Good Jeff, let the community know how that goes! Cheers,
/peter neubauer COO and Sales, Neo Technology GTalk: neubauer.peter Skype peter.neubauer Phone +46 704 106975 LinkedIn http://www.linkedin.com/in/neubauer Twitter http://twitter.com/peterneubauer http://www.neo4j.org - Your high performance graph database. http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party. On Sat, Aug 7, 2010 at 6:40 AM, Jeff Klann <[email protected]> wrote: > To anyone interested, after reading Martin's e-mail I ran a test of a small > (75mb) Neo4J graph and traversed it (~500k edges) with memory mapping turned > off in about 1.5 minutes on my hard drive and 30sec when copied to a > (probably slow) flash drive. Which incidentally is the same time it took > with memory mapping turned on with the HDD. Very nice. I think I will pick > up a larger flash drive (or SSD if I can find one inexpensively) and try it > out on a more sizable graph. > > Thanks, > Jeff Klann > > On Thu, Aug 5, 2010 at 7:13 PM, Martin Neumann > <[email protected]>wrote: > >> > > >> > > - Martin, I'm confused a bit about SSDs. I read up on them after I read >> > > your >> > > post. You said flash drives are best, but I read that even the highest >> > > performing flash drives are about 30MB/s read, whereas modern hard >> drives >> > > are at least 50MB/s. True SSDs claim to be 50MB/s too but they're quite >> > > expensive. So why is a flash drive best? I could definitely spring for >> > one >> > > big enough to hold my db if it'd help a lot, but it has that slower >> read >> > > speed. Does the faster seek time really make that much of a difference? >> > Any >> > > brands you'd recommend? >> > > >> >> Neo4j stores the data as Graph on HD. >> >> An example: e = (n1,n2) >> e at location 1000 >> n1 at location 1 >> n2 at location 5 >> >> A traversal, assuming nothing is cached, would result in moving the head to >> 1 then to 1000 then back to 5. >> Normal HD take a while to move to the locations before it can start to read >> data. SSD does not have these delays. If you read little data that is >> spread >> widely over the storage, like in a traversal, SSD are much faster then HD >> even if they are slower to retrieve the data. >> I don't have performance data on that myself but I heard rumors of around >> 20-40 times speedup. >> >> cheers Martin >> >> >> On Thu, Aug 5, 2010 at 9:02 PM, Jeff Klann <[email protected]> wrote: >> >> > Thanks for the answers. >> > >> > Yes, I can do online updates of the datastore, but while this is in R&D I >> > will need to rerun the main loop when I change the algorithm and just for >> > personal benefit I don't want to wait hours to see the changes. Seems to >> be >> > running acceptably now, though. However, I haven't benchmarked it against >> > doing JOINS in Postgres. Are there any good performance stats out there? >> > The >> > speed is about the same as I'd expect from SQL. >> > >> > The graph will probably be nearly a complete graph in the end. The edges >> > between orders will eventually store various stats on the relationships >> > between pairs of items. It'd be nice if I can query an index for outgoing >> > edges from nodes with certain properties. Is this possible? I'll have a >> > look >> > at the edge indexer component. >> > >> > Thanks, >> > Jeff Klann >> > >> > On Mon, Aug 2, 2010 at 2:40 PM, David Montag < >> > [email protected] >> > > wrote: >> > >> > > Hi Jeff, >> > > >> > > Please see answers below. >> > > >> > > On Mon, Aug 2, 2010 at 5:47 PM, Jeff Klann <[email protected]> wrote: >> > > >> > > > Thank you all for your continued interest in helping me. I tweaked >> the >> > > code >> > > > more to minimize writes to the database and it now looks like: >> > > > For each item A >> > > > For each customer that purchased A >> > > > For each item B (with id>A) that A purchased >> > > > Increment (in memory) the weight of (A-B) >> > > > Write out the edges [(A-B):weight] to disk and clear the in-memory >> > map >> > > > >> > > > This actually (if I'm not mistaken) covers all relationships and does >> > > 7500 >> > > > items in about 45 minutes! Not too bad, especially due to (I think) >> > > > avoiding >> > > > edge-checking, and I think it's usable for my application, though >> it's >> > > > still >> > > > ~200k traversals/sec on average, which is a few times slower than I >> > > hoped. >> > > > I >> > > > doubt that's much faster than a two-table join in SQL, though deeper >> > > > traversals should show benefits. >> > > > >> > > >> > > Do you need to do this computation on the full graph all the time? >> Maybe >> > it >> > > would be enough to do it once, and then update it when a customer buys >> > > something? Usually, high one-time costs can be tolerated, and with >> Neo4j >> > > you >> > > can actually do the updating for a customer performing a purchase at >> > > runtime >> > > without performance problems. >> > > >> > > >> > > > >> > > > - David, thank you for your answers on traversers vs. >> getRelationships >> > > and >> > > > on property-loading. I imported some properties I don't really need, >> > > > perhaps >> > > > if I delete them it'll speed things up? Also I'm using the old >> > > > Node.traverse(). How is the new framework better? I expect it has a >> > nicer >> > > > syntax, which I would like to try, but does it improve performance >> too? >> > > > >> > > >> > > Well, depending on your setup you should be able to theoretically >> improve >> > > performance compared to the old traversal framework. The old framework >> > > keeps >> > > track of visited nodes, so that you don't traverse to the same node >> > twice. >> > > This behavior is customizable in the new framework. Please see >> > > http://wiki.neo4j.org/content/Traversal_Framework and check the >> > Uniqueness >> > > constraints. If you know exactly when to stop, then you should be able >> to >> > > use Uniqueness.NONE, meaning that the framework does not keep track of >> > > visited nodes, meaning that you could end up traversing in a cycle. In >> > your >> > > network however, you might know that you always traverse (item) >> > <--BOUGHT-- >> > > (customer) --BOUGHT--> (item) --CORRELATION--> (item)* and no further >> > than >> > > that, so then you know that you won't end up in a cycle. But yeah, then >> > you >> > > need to programmatically make sure you don't go too far. And I don't >> know >> > > if >> > > this gives you any performance benefits what so ever. >> > > >> > > Also, as I understand it, all properties for a node are loaded when >> they >> > > are >> > > first touched. Then they're kept in memory, so if you update properties >> > > later on the same node, and it is still cached, it won't reread >> > everything. >> > > >> > > >> > > > >> > > > - David, on checking relationships, I said checking 15 nodes for >> > > > relationships to n other nodes (where n might be large, I'm not sure >> > > large, >> > > > but <<7500), takes 71s. The nodes are a highly-connected graph and >> also >> > > > with >> > > > edges going out to customers. So in the end the max & edges for a >> node >> > > > would >> > > > be very high, up to around 7500 items and 300,000 customers. >> > > > >> > > >> > > Just so I understand your data model: if a customer buys N products A1 >> - >> > > AN, >> > > will there be be a complete graph between the nodes A1 - AN? When in >> your >> > > algorithm do you need to check for the occurrence of a relationship >> > between >> > > A and B? >> > > >> > > >> > > > >> > > > - Martin, I'm confused a bit about SSDs. I read up on them after I >> read >> > > > your >> > > > post. You said flash drives are best, but I read that even the >> highest >> > > > performing flash drives are about 30MB/s read, whereas modern hard >> > drives >> > > > are at least 50MB/s. True SSDs claim to be 50MB/s too but they're >> quite >> > > > expensive. So why is a flash drive best? I could definitely spring >> for >> > > one >> > > > big enough to hold my db if it'd help a lot, but it has that slower >> > read >> > > > speed. Does the faster seek time really make that much of a >> difference? >> > > Any >> > > > brands you'd recommend? >> > > > >> > > >> > > I think the general consensus is that an SSD is usually the single best >> > > upgrade you can get for a computer or server. The blazingly fast seeks >> > make >> > > all the difference. If you have a big file with data spread out over it >> > and >> > > you need to read and write to different locations of the file rapidly, >> > that >> > > means a lot of work for the heads in a conventional hard drive. The SSD >> > > nails this. Know when you start an application or do something >> processing >> > > heavy, and you hear your hard drive "work"? It's seeking. >> > > >> > > As for brands, I've heard good things about the Intel X25 ones. I have >> an >> > > SSD in my mac, but I don't know what brand it is. All I know is that >> it's >> > > ridiculously fast. >> > > >> > > David >> > > >> > > >> > > > >> > > > I will post some code snippets. Looks like there are a lot of sites >> for >> > > > sharing codes snippets. Any recommendation? >> > > > >> > > > Thanks all, >> > > > Jeff Klann >> > > > >> > > > On Mon, Aug 2, 2010 at 8:44 AM, David Montag < >> > > > [email protected] >> > > > > wrote: >> > > > >> > > > > Hi Jeff, >> > > > > >> > > > > If I'm not mistaken, Neo4j loads all properties for a node or >> > > > relationship >> > > > > when you invoke any operation that touches a property. As for the >> > > > > performance of traversals, it is highly dependent on how deep you >> > > > traverse, >> > > > > and what you do during the traversal, so ymmv. >> > > > > >> > > > > Using a traverser is slower than doing getRelationships, as the >> > > traverser >> > > > > does extra processing to keep state around. Are you using >> > > Node#traverse() >> > > > > or >> > > > > the new traversal framework? Is your code available somewhere? >> > > > > >> > > > > Are you saying that checking whether there's a relationship between >> A >> > > and >> > > > B >> > > > > takes over 20s? How many relationships do A and B have? What does >> > your >> > > > neo >> > > > > config look like (params)? Edge indexing might be a solution, you >> can >> > > > look >> > > > > at the new indexing component for that. ( >> > > > > https://svn.neo4j.org/laboratory/components/lucene-index/) >> > > > > >> > > > > As for the incrementing of a property - while you're within a >> > > > transaction, >> > > > > couldn't you increment a variable and then write that variable at >> the >> > > end >> > > > > of >> > > > > the transaction? >> > > > > >> > > > > David >> > > > > >> > > > > On Fri, Jul 30, 2010 at 8:10 PM, Jeff Klann <[email protected]> >> > wrote: >> > > > > >> > > > > > Hi, so I got 2GB more RAM and noticed that after adding some more >> > > > memory >> > > > > > map >> > > > > > and increasing the heap space, my small query went from 6hrs to >> > 3min. >> > > > > Quite >> > > > > > reasonable! >> > > > > > >> > > > > > But the larger one that would take a month would still take a >> > month. >> > > So >> > > > > > I've >> > > > > > been performance testing parts of it: >> > > > > > >> > > > > > The algorithm as in my first post showed *no* performance >> > improvement >> > > > on >> > > > > > more RAM. >> > > > > > But individual parts.... >> > > > > > - Traversing only (first three lines) was much speedier, but >> > still >> > > > > seems >> > > > > > slow. 1.5 million traversals (15 out of 7000 items) took 23sec. >> It >> > > > shaves >> > > > > > off a few seconds if I run this twice and time it the second >> time, >> > or >> > > > if >> > > > > I >> > > > > > don't print any node properties as I traverse. (Does Neo4J load >> ALL >> > > the >> > > > > > properties for a node if one is accessed?) Even with a double run >> > and >> > > > not >> > > > > > reading node properties, it still takes 16sec, which would make >> > > > traversal >> > > > > > take two hours. I thought Neo4J was suppposed to do ~1m >> > > traversals/sec, >> > > > > > this >> > > > > > is doing about 100k. Why? (And in fact on the other query it was >> > > > getting >> > > > > > about 800,000 traversals/sec.) Is one of Traversers vs. >> > > getRelationship >> > > > > > iterators faster when getting all relationships of a type at >> depth >> > 1? >> > > > > > - Searching for relationships between A & B (but not writing to >> > > them) >> > > > > > takes it from 20s to 91s. Yuck. Maybe edge indexing is the way to >> > > avoid >> > > > > > that? >> > > > > > - Incrementing a property on the root node for every A & B >> takes >> > it >> > > > > from >> > > > > > 20s to 61s (57s if it's all in one transaction). THAT seems >> weird. >> > I >> > > > > > imagine >> > > > > > it has something to do with logging changes? Any way that can be >> > > turned >> > > > > off >> > > > > > for a particular property (like it could be marked 'volatile' >> > during >> > > a >> > > > > > transaction or something)? >> > > > > > >> > > > > > I'm much more hopeful with the extra RAM but it's still kind of >> > slow. >> > > > > > Suggestions? >> > > > > > >> > > > > > Thanks, >> > > > > > Jeff Klann >> > > > > > >> > > > > > On Wed, Jul 28, 2010 at 11:20 AM, Jeff Klann <[email protected]> >> > > wrote: >> > > > > > >> > > > > > > Hi, I have an algorithm running on my little server that is >> very >> > > very >> > > > > > slow. >> > > > > > > It's a recommendation traversal (for all A and B in the catalog >> > of >> > > > > items: >> > > > > > > for each item A, how many customers also purchased another item >> > in >> > > > the >> > > > > > > catalog B). It's processed 90 items in about 8 hours so far! >> > Before >> > > I >> > > > > > dive >> > > > > > > deeper into trying to figure out the performance problem, I >> > thought >> > > > I'd >> > > > > > > email the list to see if more experienced people have ideas. >> > > > > > > >> > > > > > > Some characteristics of my datastore: it's size is pretty >> > moderate >> > > > for >> > > > > a >> > > > > > > database application. 7500 items, not sure how many customers >> and >> > > > > > purchases >> > > > > > > (how can I find the size of an index?) but probably ~1 million >> > > > > customers. >> > > > > > > The relationshipstore + nodestore < 500mb. (Propertystore is >> huge >> > > but >> > > > I >> > > > > > > don't access it much in traversals.) >> > > > > > > >> > > > > > > The possibilities I see are: >> > > > > > > >> > > > > > > 1) *Neo4J is just slow.* Probably not slower than Postgres >> which >> > I >> > > > was >> > > > > > > using previously, but maybe I need to switch to a distributed >> > > > > map-reduce >> > > > > > db >> > > > > > > in the cloud and give up the very nice graph modeling approach? >> I >> > > > > didn't >> > > > > > > think this would be a problem, because my data size is pretty >> > > > moderate >> > > > > > and >> > > > > > > Neo4J is supposed to be fast. >> > > > > > > >> > > > > > > 2) *I just need more RAM.* I definitely need more RAM - I have >> a >> > > > measly >> > > > > > > 1GB currently. But would this get my 20day traversal down to a >> > few >> > > > > hours? >> > > > > > > Doesn't seem like it'd have THAT much impact. I'm running Linux >> > and >> > > > > > nothing >> > > > > > > much else besides Neo4j, so I've got 650m physical RAM. Using >> > 300m >> > > > > heap, >> > > > > > > about 300m memory-map. >> > > > > > > >> > > > > > > 3) *There's some secret about Neo4J performance I don't know.* >> Is >> > > > there >> > > > > > > something I'm unaware that Neo4J is doing? When I access a >> > > property, >> > > > > does >> > > > > > it >> > > > > > > load a chunk of properties I don't care about? For the current >> > > > > node/edge >> > > > > > or >> > > > > > > others? I turned off log rotation and I commit after each item >> A. >> > > Are >> > > > > > there >> > > > > > > other performance tips I might have missed? >> > > > > > > >> > > > > > > 4) *My algorithm is inefficient.* It's a fairly naive algorithm >> > and >> > > > > maybe >> > > > > > > there's some optimizations I can do. It looks like: >> > > > > > > >> > > > > > >> For each item A in the catalog: >> > > > > > >> For each customer C that has purchased that item: >> > > > > > >> For each item B that customer purchased: >> > > > > > >> Update the co-occurrence edge between A&B. >> > > > > > >> >> > > > > > > (If the edge exists, add one to its weight. If it doesn't >> > > > exist, >> > > > > > >> create it with weight one.) >> > > > > > >> >> > > > > > > This is O(n^2) worst case, but practically it'll be much better >> > due >> > > > to >> > > > > > the >> > > > > > > sparseness of purchases. The large number of customers slows it >> > > down, >> > > > > > > though. The slowest part, I suspect, is the last line. It's a >> lot >> > > of >> > > > > > finding >> > > > > > > and re-finding edges between As and Bs and updating the edge >> > > > > properties. >> > > > > > I >> > > > > > > don't see much way around it, though. I wrote another version >> > that >> > > > > avoids >> > > > > > > this but is always O(n^2), and it takes about 15 minutes per A >> to >> > > > check >> > > > > > > against all B (which would also take a month). The version >> above >> > > > seems >> > > > > to >> > > > > > be >> > > > > > > averaging 3 customers/sec, which doesn't seem that slow until >> you >> > > > > realize >> > > > > > > that some of these items were purchased by thousands of >> > customers. >> > > > > > > >> > > > > > > I'd hate to give up on Neo4J. I really like the graph database >> > > > concept. >> > > > > > But >> > > > > > > can it handle data? I hope someone sees something I'm doing >> > wrong. >> > > > > > > >> > > > > > > Thanks, >> > > > > > > Jeff Klann >> > > > > > > >> > > > > > _______________________________________________ >> > > > > > Neo4j mailing list >> > > > > > [email protected] >> > > > > > https://lists.neo4j.org/mailman/listinfo/user >> > > > > > >> > > > > _______________________________________________ >> > > > > Neo4j mailing list >> > > > > [email protected] >> > > > > https://lists.neo4j.org/mailman/listinfo/user >> > > > > >> > > > _______________________________________________ >> > > > Neo4j mailing list >> > > > [email protected] >> > > > https://lists.neo4j.org/mailman/listinfo/user >> > > > >> > > _______________________________________________ >> > > Neo4j mailing list >> > > [email protected] >> > > https://lists.neo4j.org/mailman/listinfo/user >> > > >> > _______________________________________________ >> > Neo4j mailing list >> > [email protected] >> > https://lists.neo4j.org/mailman/listinfo/user >> > >> _______________________________________________ >> Neo4j mailing list >> [email protected] >> https://lists.neo4j.org/mailman/listinfo/user >> > _______________________________________________ > Neo4j mailing list > [email protected] > https://lists.neo4j.org/mailman/listinfo/user > _______________________________________________ Neo4j mailing list [email protected] https://lists.neo4j.org/mailman/listinfo/user

