Re: [Neo4j] Querying for nodes that have no relationship to a specific node
Hi everyone,

I would have an SQL db for the app alongside the graph db. I would store users as nodes within the graph as well as in SQL. On those nodes I would store attributes like male/female, age or date of birth, etc.

I would have one kind of relationship for friendship, which doesn't present any kind of problem, and I would do the standard types of queries neo4jr-social provides (e.g. friend suggestions, degrees of separation, friends in common, ...).

We want to measure the compatibility/taste match/whatever between users in the background, meaning for instance how much you have in common. This is done in Ruby. The result will be an integer between 0 and 100. BTW, this value is symmetric, meaning it could be modelled as a bidirectional relationship.

Let's say I have 10k users and for every user I calculate the match between him and 10 other users. If I store all the results I calculate, that's potentially up to 100k relationships every day / 3m relationships every month. If I store this in SQL it can turn into a bottleneck very fast: the table will soon grow too big and the queries will get slower and slower. That's when I started thinking of storing those relationships in Neo4j, because it's meant to handle a very large number of nodes and relationships really efficiently. I can model that as a relationship and either store the value inside the relationship or encode it in the relationship names, as in 'match_high, match_medium, match_low'.

Now back to step 1: selecting the users I'll be calculating new relationships with. They must match certain criteria, e.g. female/male, similar age, etc., and the choice could be pseudo-random. The first step, if you think in SQL, is to query for all users that match the criteria and don't have a relationship with user A. Yesterday, looking at the Neo4j docs, I concluded this kind of query cannot be done. I could select all the users that match the criteria from SQL, then query all the relationships for A from Neo4j, subtract those from the array of valid users and pick n users at random. Because n is a low value, perhaps 10, this looks to me like a very inefficient way of doing this. Also, it will be fast at the beginning but get slower as the relationship density grows with time... Maybe I should consider a different strategy.

I've also been considering storing only high or interesting values, but it would be more interesting to have the top n users for A ordered by relationship value. If I go ahead with this then I could just store it in SQL. This is not what we strive for, but if I don't find a better way I guess we'll have to live with that.

Also, the solution should be easily scalable. It should still work with, for instance, 100k users. Any thoughts or comments? What would you recommend?

Thanks for your help guys!
Alberto.
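[Editor's sketch] Alberto's subtract-and-sample step could look roughly like the code below. It is a minimal sketch against the Neo4j Java API of that era, assuming a hypothetical MATCH relationship type and that candidateIds has already been fetched from SQL; the class and method names are illustrative, not an existing API.

    import java.util.*;
    import org.neo4j.graphdb.*;

    public class MatchCandidatePicker {
        // Hypothetical relationship type for the computed compatibility score.
        private static final RelationshipType MATCH =
                DynamicRelationshipType.withName("MATCH");

        public static List<Long> pickUnmatched(Node userA,
                List<Long> candidateIds, int n) {
            // Collect the node ids of everyone A already has a MATCH with.
            Set<Long> matched = new HashSet<Long>();
            for (Relationship rel : userA.getRelationships(MATCH, Direction.BOTH)) {
                matched.add(rel.getOtherNode(userA).getId());
            }
            // Subtract them from the SQL candidate list, then sample n at random.
            List<Long> unmatched = new ArrayList<Long>();
            for (long id : candidateIds) {
                if (!matched.contains(id)) {
                    unmatched.add(id);
                }
            }
            Collections.shuffle(unmatched);
            return unmatched.subList(0, Math.min(n, unmatched.size()));
        }
    }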
[Neo4j] Querying for nodes that have no relationship to a specific node
Hi Alberto,

Okay, interesting. You want to calculate some metric between pairs of users, so it's not a friend-of-a-friend scenario or anything like that, which would have been great in a graph db. This is just all/some pairs of random users. That you can do with your SQL db or Neo4j or whatever db you want.

But then you need to store the result. You can store these metrics as relationships in Neo4j, and then just update them for each user when you recompute. You can find the user nodes via indexing. Maybe it's acceptable that some metrics are out of date, so you can just background-process them continuously.

Depending on your scenario, if your users know each other, it might be interesting to start computing in a foaf-style order (breadth first). Remember, the power is in the relationships. Isolated nodes are not interesting.

David
-- Sent from cell, excuse typos.
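[Editor's sketch] David's store-and-update suggestion might look something like this minimal sketch, assuming the neo4j-index component's IndexService with a hypothetical "userId" index key, and a hypothetical MATCH relationship carrying a "score" property:

    import org.neo4j.graphdb.*;
    import org.neo4j.index.IndexService;

    public class MatchUpdater {
        private static final RelationshipType MATCH =
                DynamicRelationshipType.withName("MATCH");

        public static void storeMatch(GraphDatabaseService db, IndexService index,
                long userIdA, long userIdB, int score) {
            Transaction tx = db.beginTx();
            try {
                // Look up both user nodes via the index, as David suggests.
                Node a = index.getSingleNode("userId", userIdA);
                Node b = index.getSingleNode("userId", userIdB);
                // Reuse the existing relationship if this pair was scored before.
                Relationship match = null;
                for (Relationship rel : a.getRelationships(MATCH, Direction.BOTH)) {
                    if (rel.getOtherNode(a).equals(b)) {
                        match = rel;
                        break;
                    }
                }
                if (match == null) {
                    match = a.createRelationshipTo(b, MATCH);
                }
                match.setProperty("score", score);
                tx.success();
            } finally {
                tx.finish();
            }
        }
    }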
[Neo4j] Neo4j and JPA refactoring
Hi there, I noticed the big update to the JPA implementation for Neo4j at https://svn.neo4j.org/laboratory/components/neo-persistence/trunk/. Avishay, could you shed some light on what has changed and improved, and on the current state of the project?

Cheers,
/peter neubauer
COO and Sales, Neo Technology
GTalk: neubauer.peter
Skype: peter.neubauer
Phone: +46 704 106975
LinkedIn: http://www.linkedin.com/in/neubauer
Twitter: http://twitter.com/peterneubauer
http://www.neo4j.org - Your high performance graph database.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.
[Neo4j] Read-only transactions?
Hi, Is it possible to mark a transaction as being read-only? It's taking a while for my transaction to shut down, even though there are no writes to commit. Thanks, Tim
[Neo4j] Unable to build a master-slave system
Hello, I'm trying to build a high availability system with Neo4j as explained here: http://wiki.neo4j.org/content/Online_Backup_HA. In theory everything looks pretty simple and straightforward... but once I try to run the slave process I'm getting the following exception:

    Throwing away org.neo4j.onlinebackup.net.connecttomaster...@1f1fba0
    java.nio.channels.NotYetConnectedException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
        at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
        at org.neo4j.onlinebackup.net.Connection.write(Connection.java:238)
        at org.neo4j.onlinebackup.net.ConnectToMasterJob.sendGreeting(ConnectToMasterJob.java:55)
        at org.neo4j.onlinebackup.net.ConnectToMasterJob.performJob(ConnectToMasterJob.java:141)
        at org.neo4j.onlinebackup.net.JobEater.run(JobEater.java:32)

...followed by this exception on the master side:

    Connection closed Connection[slave_ip_address:11587]
    org.neo4j.onlinebackup.net.SocketException: Connection[slave_ip_address:11587] error reading
    Throwing away org.neo4j.onlinebackup.net.handleincommingslave...@fd13b5
        at org.neo4j.onlinebackup.net.Connection.read(Connection.java:210)
        at org.neo4j.onlinebackup.net.HandleIncommingSlaveJob.getGreeting(HandleIncommingSlaveJob.java:41)
        at org.neo4j.onlinebackup.net.HandleIncommingSlaveJob.performJob(HandleIncommingSlaveJob.java:160)
        at org.neo4j.onlinebackup.net.JobEater.run(JobEater.java:93)
    Caused by: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureReadOpen(Unknown Source)
        at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
        at org.neo4j.onlinebackup.net.Connection.read(Connection.java:205)
        ... 3 more
    null chain job

Any idea of what might be wrong here? (I'm running everything on 64-bit Windows (7 or Server 2008 R2), neo4j-kernel 1.0 and online-backup 0.5.)

Thank you, George
Re: [Neo4j] Querying for nodes that have no relationship to a specific node
Hi David,

> But then you need to store the result. You can store these metrics as
> relationships in Neo4j, and then just update them for each user when you
> recompute. You can find the user nodes via indexing. Maybe it's acceptable
> that some metrics are out of date, so you can just background-process them
> continuously.

I already have background processes that go through all users and calculate new pairs. But in order to do that I do need to exclude the pairs I already have, because recomputing them would be silly, and as the relationship density grows the probability of hitting a pair again would get higher and higher... Would I be able to do that kind of query using indexing?

> Depending on your scenario, if your users know each other, it might be
> interesting to start computing in a foaf-style order (breadth first).
> Remember, the power is in the relationships. Isolated nodes are not
> interesting.

You mean I look first for possible pairs among users that are friends of friends, instead of picking randomly? We are also interested in storing friendship relationships, so that sounds interesting. That would be a different type of query: traverse the graph from node A to nodes which are friends of friends of A and have no match relationship with A. I guess that is not difficult to implement using Neo4j?

Thanks for your input David!
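[Editor's sketch] The friends-of-friends candidate query Alberto describes could be a plain two-hop expansion; a minimal sketch assuming hypothetical FRIEND and MATCH relationship types:

    import java.util.*;
    import org.neo4j.graphdb.*;

    public class FoafCandidates {
        private static final RelationshipType FRIEND =
                DynamicRelationshipType.withName("FRIEND");
        private static final RelationshipType MATCH =
                DynamicRelationshipType.withName("MATCH");

        // Friends-of-friends of userA that have no MATCH relationship to userA.
        public static Set<Node> foafWithoutMatch(Node userA) {
            Set<Node> friends = new HashSet<Node>();
            for (Relationship r : userA.getRelationships(FRIEND, Direction.BOTH)) {
                friends.add(r.getOtherNode(userA));
            }
            Set<Node> candidates = new HashSet<Node>();
            for (Node friend : friends) {
                for (Relationship r : friend.getRelationships(FRIEND, Direction.BOTH)) {
                    Node foaf = r.getOtherNode(friend);
                    // Skip A itself and A's direct friends.
                    if (foaf.equals(userA) || friends.contains(foaf)) continue;
                    if (!hasMatchWith(foaf, userA)) candidates.add(foaf);
                }
            }
            return candidates;
        }

        private static boolean hasMatchWith(Node node, Node other) {
            for (Relationship r : node.getRelationships(MATCH, Direction.BOTH)) {
                if (r.getOtherNode(node).equals(other)) return true;
            }
            return false;
        }
    }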
[Neo4j] Stumped by performance issue in traversal - would take a month to run!
Hi, I have an algorithm running on my little server that is very, very slow. It's a recommendation traversal (for all A and B in the catalog of items: for each item A, how many customers also purchased another item B in the catalog). It's processed 90 items in about 8 hours so far! Before I dive deeper into trying to figure out the performance problem, I thought I'd email the list to see if more experienced people have ideas.

Some characteristics of my datastore: its size is pretty moderate for a database application. 7,500 items, not sure how many customers and purchases (how can I find the size of an index?) but probably ~1 million customers. The relationship store + node store are about 500 MB. (The property store is huge, but I don't access it much in traversals.)

The possibilities I see are:

1) *Neo4J is just slow.* Probably not slower than Postgres, which I was using previously, but maybe I need to switch to a distributed map-reduce db in the cloud and give up the very nice graph modeling approach? I didn't think this would be a problem, because my data size is pretty moderate and Neo4J is supposed to be fast.

2) *I just need more RAM.* I definitely need more RAM - I have a measly 1GB currently. But would this get my 20-day traversal down to a few hours? Doesn't seem like it'd have THAT much impact. I'm running Linux and nothing much else besides Neo4j, so I've got 650m physical RAM. Using 300m heap, about 300m memory-map.

3) *There's some secret about Neo4J performance I don't know.* Is there something I'm unaware that Neo4J is doing? When I access a property, does it load a chunk of properties I don't care about? For the current node/edge or others? I turned off log rotation and I commit after each item A. Are there other performance tips I might have missed?

4) *My algorithm is inefficient.* It's a fairly naive algorithm and maybe there are some optimizations I can do. It looks like:

    For each item A in the catalog:
      For each customer C that has purchased that item:
        For each item B that customer purchased:
          Update the co-occurrence edge between A and B.
          (If the edge exists, add one to its weight.
           If it doesn't exist, create it with weight one.)

This is O(n^2) worst case, but practically it'll be much better due to the sparseness of purchases. The large number of customers slows it down, though. The slowest part, I suspect, is the last line: it's a lot of finding and re-finding edges between As and Bs and updating the edge properties. I don't see much way around it, though. I wrote another version that avoids this but is always O(n^2), and it takes about 15 minutes per A to check against all B (which would also take a month). The version above seems to be averaging 3 customers/sec, which doesn't seem that slow until you realize that some of these items were purchased by thousands of customers.

I'd hate to give up on Neo4J. I really like the graph database concept. But can it handle data? I hope someone sees something I'm doing wrong.

Thanks, Jeff Klann
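[Editor's sketch] Jeff's pseudocode maps directly onto the Neo4j Java API. This minimal sketch commits once per item A, as Jeff says he does; the PURCHASED and ALSO_PURCHASED relationship types, the "weight" property, and the edge directions are assumptions, not his actual schema:

    import org.neo4j.graphdb.*;

    public class CoOccurrence {
        private static final RelationshipType PURCHASED =
                DynamicRelationshipType.withName("PURCHASED");
        private static final RelationshipType ALSO_PURCHASED =
                DynamicRelationshipType.withName("ALSO_PURCHASED");

        public static void processItem(GraphDatabaseService db, Node itemA) {
            Transaction tx = db.beginTx();
            try {
                for (Relationship p : itemA.getRelationships(PURCHASED, Direction.INCOMING)) {
                    Node customer = p.getOtherNode(itemA);
                    for (Relationship q : customer.getRelationships(PURCHASED, Direction.OUTGOING)) {
                        Node itemB = q.getOtherNode(customer);
                        if (itemB.equals(itemA)) continue;
                        incrementEdge(itemA, itemB);
                    }
                }
                tx.success();
            } finally {
                tx.finish();
            }
        }

        // The slow part Jeff suspects: re-finding the A-B edge on every update.
        private static void incrementEdge(Node a, Node b) {
            for (Relationship r : a.getRelationships(ALSO_PURCHASED, Direction.BOTH)) {
                if (r.getOtherNode(a).equals(b)) {
                    r.setProperty("weight", (Integer) r.getProperty("weight") + 1);
                    return;
                }
            }
            a.createRelationshipTo(b, ALSO_PURCHASED).setProperty("weight", 1);
        }
    }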
Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!
Jeff, when you're doing your traversal/update process, how often do you commit the transactions?
Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!
Oh, and you DEFINITELY need more RAM!
Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!
I can't give too much help on this unfortunately, but as far as possibility 1) goes, my database contains around 8 million nodes, and I traverse them in about 15 seconds for retrievals. It's 2.8GB on disk, and the machine has 4GB of RAM. I allocate a 1GB heap to the JDK. Inserts take a little longer because of the approach I use - inserting 200K nodes now takes a few minutes. I then have a separate step to remove duplicates that takes about 10-15 minutes.

It seems to me that you might be better off doing something similar: create a new relationship PURCHASED_BOTH with an attribute 'count: 1' and always add this relationship between items A and B. Then run a post-processing job that retrieves all PURCHASED_BOTH relationships for each item A, builds an in-memory map so you only keep one of these relationships, and updates the 'count' attribute in memory for that relationship. Delete the duplicates and commit. This way you get your desired result in 2 passes instead of doing everything in one go.

It seems a bit of a fiddle and I can't guarantee it'll improve performance (just to stress - I may be waaay off the mark here, but it works for me). I think it will, though, because it'll mean that your loop only has to create relationships instead of performing updates.

Oh, and make sure that you aren't performing one operation per transaction - you can group together several tens of thousands before committing (I do 50,000 inserts before committing when I'm running this post-processing operation, and it's fine).

Tim
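[Editor's sketch] A variant of Tim's idea collapses his two passes into one by aggregating counts in memory per item A and writing each A-B edge exactly once - the same effect as create-then-dedupe, sketched here under assumed PURCHASED and PURCHASED_BOTH relationship types and directions:

    import java.util.*;
    import org.neo4j.graphdb.*;

    public class AggregatedCoOccurrence {
        private static final RelationshipType PURCHASED =
                DynamicRelationshipType.withName("PURCHASED");
        private static final RelationshipType PURCHASED_BOTH =
                DynamicRelationshipType.withName("PURCHASED_BOTH");

        public static void processItem(GraphDatabaseService db, Node itemA) {
            Map<Node, Integer> counts = new HashMap<Node, Integer>();
            Transaction tx = db.beginTx();
            try {
                for (Relationship p : itemA.getRelationships(PURCHASED, Direction.INCOMING)) {
                    Node customer = p.getOtherNode(itemA);
                    for (Relationship q : customer.getRelationships(PURCHASED, Direction.OUTGOING)) {
                        Node itemB = q.getOtherNode(customer);
                        if (itemB.equals(itemA)) continue;
                        Integer c = counts.get(itemB);
                        counts.put(itemB, c == null ? 1 : c + 1);
                    }
                }
                // One write per co-purchased item instead of one per co-occurrence.
                for (Map.Entry<Node, Integer> e : counts.entrySet()) {
                    itemA.createRelationshipTo(e.getKey(), PURCHASED_BOTH)
                            .setProperty("count", e.getValue());
                }
                tx.success();
            } finally {
                tx.finish();
            }
        }
    }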
Re: [Neo4j] Querying for nodes that have no relationship to a specific node
One benefit of Neo4j is that you can get rid of these pesky background jobs and instead calculate such things on the fly quite fast, without needing to store that calculated info at all. Tried it?

--
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com
Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!
Thank you both for your responses.

- I will get some more RAM tomorrow and give Neo4J another shot. Hopefully that's a huge factor.
- Tim, I like your algorithm trick! It would save a lot of reading/writing but would definitely require more memory due to the massive increase in # of edges.
- Transactions are not the issue, unless reading AFTER committing a transaction is somehow slower? I'm only committing after each of 7,000 items, and like I said it took 8 hours to run through 90-some items... committing is not where the time is being spent.

To gauge the performance problem, I wanted to see how many customers are purchasing each item, and I'm concerned that even this query is taking a really long time. It's simple:

    For each item A:
      Count the number of relationships to a customer.

It took 15 minutes to do 200 items. That's almost 5 seconds an item just to count the number of customers who purchased an item! (Looks like on average about 5,000 customers each, ranging from 300 to 200,000.) That's a NINE HOUR query! Considering that Neo4J advertises it can traverse 1m relationships/sec on commodity hardware, I would expect this to be much faster. (Even if it were 50k customers per item, that'd be 7,000 items * 50,000 customers / 1m traversals = 350 seconds. 6 minutes would be much more reasonable.)

My commodity hardware will have a lot more memory tomorrow; hopefully that'll solve these problems!

Thanks, Jeff Klann

p.s. My property store is big because I was naive on import and stored everything as string properties (this will change). How does that affect performance?
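[Editor's sketch] The counting loop Jeff times is a single relationship scan per item; a minimal sketch assuming a hypothetical PURCHASED relationship type pointing from customer to item:

    import org.neo4j.graphdb.*;

    public class CustomerCounter {
        private static final RelationshipType PURCHASED =
                DynamicRelationshipType.withName("PURCHASED");

        // Count customers per item by iterating its incoming PURCHASED edges.
        public static int countCustomers(Node item) {
            int count = 0;
            for (Relationship r : item.getRelationships(PURCHASED, Direction.INCOMING)) {
                count++;
            }
            return count;
        }
    }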
Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!
Hi, Jeff. If you are committing after each item, it definitely will slow down performance. Start a single transaction, commit when you're all done with the entire traversal, and report back the results. You will still see the changes you've made prior to committing the transaction, as long as you're on the same execution thread.

Rick
Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!
I don't think that's the problem. Here's why: when I was importing my data, it eventually slowed down to a crawl (though it was pretty fast at first). Someone pointed out that since I was trying to do it all in one transaction, it was filling the Java heap too much. They suggested I commit after every 40,000 node/edge creations (that's empirically when the slowdown happened). I did, and then the import zipped along just fine.

I'm only committing after the outer pass through an item, which is again only after tens of thousands of writes/property updates. Hmm, that makes me wonder if it's possible I'm not committing often enough. Well, when I have more memory we'll see how it does.

- Jeff Klann

p.s. The simple counter in my last post isn't using transactions at all.
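[Editor's sketch] The commit-every-N-writes pattern Jeff describes might look like this minimal sketch; the 40,000-operation batch size comes from his import experience, while BatchedWriter and the Runnable-based write feed are illustrative, not an existing Neo4j API:

    import org.neo4j.graphdb.*;

    public class BatchedWriter {
        private static final int BATCH_SIZE = 40000;

        public static void writeAll(GraphDatabaseService db, Iterable<Runnable> writes) {
            Transaction tx = db.beginTx();
            int ops = 0;
            try {
                for (Runnable write : writes) {
                    write.run();
                    if (++ops % BATCH_SIZE == 0) {
                        // Commit and start a fresh transaction so the heap
                        // doesn't fill up with uncommitted transaction state.
                        tx.success();
                        tx.finish();
                        tx = db.beginTx();
                    }
                }
                tx.success();
            } finally {
                tx.finish();
            }
        }
    }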
Re: [Neo4j] property value encoding
On Tue, Jul 27, 2010 at 22:29, Craig Taverner cr...@amanzi.com wrote:

> Mapping property values to a discrete set, and referring to them using
> their 'id' is quite reminiscent of a foreign key in a relational database.

Yes, with a relational database I would create foreign keys and maybe bitmap indexes on columns used for search. But I was thinking about a compression algorithm like these: http://www.ibm.com/developerworks/data/library/techarticle/dm-0605ahuja/index.html

> Why not take the next step and make a node for each value, and link all
> data nodes to the value nodes?

I've thought about using nodes and relationships to index these values, but, as you've said, this would generate a big number of relationships. I have 200 properties indexed, and the dictionary of encoded values contains 15k entries (many properties have a set of hundreds of possible values). Right now I have only 600k nodes, but each node has from one to several encoded properties. Considering that maintaining an index is expensive, maybe an acceptable trade-off is to put the dictionary of encoded values in the graph, but create relationships from dictionary entries to nodes only for the OpenStreetMap properties used for search.

> In cases where very many data nodes link to very few index nodes, there is
> another trick I'm fond of, and that is the composite index.

I need more information on the implementation of this composite index :)

--
Davide Savazzi
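[Editor's sketch] The trade-off Davide proposes - an in-graph value dictionary, with relationships only for searchable properties - could be sketched roughly as below. The HAS_VALUE relationship type, the "key"/"value" properties, and the class are all assumptions for illustration; this is not Craig's composite index. The caller is expected to run setEncoded inside a transaction:

    import java.util.HashMap;
    import java.util.Map;
    import org.neo4j.graphdb.*;

    public class ValueDictionary {
        private static final RelationshipType HAS_VALUE =
                DynamicRelationshipType.withName("HAS_VALUE");

        // In-memory cache from "key=value" to its dictionary node.
        private final Map<String, Node> cache = new HashMap<String, Node>();
        private final GraphDatabaseService db;

        public ValueDictionary(GraphDatabaseService db) {
            this.db = db;
        }

        // Store an encoded property: the dictionary node's id goes on the data
        // node; a relationship is added only for searchable properties.
        public void setEncoded(Node dataNode, String key, String value,
                boolean searchable) {
            String dictKey = key + "=" + value;
            Node entry = cache.get(dictKey);
            if (entry == null) {
                entry = db.createNode();
                entry.setProperty("key", key);
                entry.setProperty("value", value);
                cache.put(dictKey, entry);
            }
            dataNode.setProperty(key, entry.getId());
            if (searchable) {
                dataNode.createRelationshipTo(entry, HAS_VALUE);
            }
        }
    }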