Re: [Neo4j] finding all shortest paths between one node and all other nodes in a large scale databse
Yes, you can probably do this thing in one traversal. Shortest path will give you the shortest path(s) between two given nodes, but are interested in any path, right? And you can find paths to several different end nodes in one traversal. Just specify an Evaluator which knows about that, or let loose a breadth-first traversal and filter your paths afterwards. 2011/5/11 Michael Hunger michael.hun...@neotechnology.com Hey JueiTing, I'm not sure if Hadoop is needed here. What is the current performance characteristics for the shortest path you are using? You could take a decent machine and just fire up, e.g. blocks of 10k node pairs to a ThreadPoolExecutor with cores*2 threads. Each of those tasks only has to return the sum of the path lengths (and you know the block size) so you can sum the whole thing up onto a long and divide it by the number of pairs processed at the end? Perhaps instead of brute force looping one and a half times over all nodes it is perhaps better to do a breath first traversal over all nodes (regardless of relationships, just outgoing rels) and returning unique paths from the traverser and calculating the path lenghts from the start nodes to each _connected_ node on the paths (start node - path[0..n].nodes[1..n]) Cheers Michael Am 11.05.2011 um 21:12 schrieb Peter Neubauer: Hi JueiTing, I think this is a typical case for a massive Map/Reduce job. I am thinking of combining Hadoop works with replicas of the graph and then do the computation. I believe Paddy Fitzgerald has been working with these approaches and can give some feedback. Of course, given the size of the graph, that might prove a problem. OTOH, if there are no modifications during the computation, you could run the calculations on read-only databases from the same store. Would that work? Cheers, /peter neubauer GTalk: neubauer.peter Skype peter.neubauer Phone +46 704 106975 LinkedIn http://www.linkedin.com/in/neubauer Twitter http://twitter.com/peterneubauer http://www.neo4j.org - Your high performance graph database. http://startupbootcamp.org/- Öresund - Innovation happens HERE. http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party. On Wed, May 11, 2011 at 12:47 AM, 翁瑞廷 wshir...@gmail.com wrote: Hi, I'm trying to use Neo4j graph database to store a large social network(more than 5,000,000 nodes) for academic research. I need to compute the separation degree(path length) between any two nodes in the graph then get the average degree of whole database. The solution I'm using use now is archieved by executing API GraphAlgoFactory.shortestPath, but it means I need to execute (n*(n-1))/2 times to get all path length. I don't think it's a very good idea :( So, I'm wondering that is there any function which can assign one node and whole DB as Input, and return the paths or only path lengths between the node and all other nodes in the graph as Output? Just like Dijkstra algorithm did. I know there is a Dijkstra function in the API, but it only return the path between two nodes. It'll be really helpful for me if there is any function can return all the result in one computation(even the whole result array computing by Dijkstra) Or does anyone has better idea to deal with this issue? I tried to implement Dijkstra by myself, but it was impossible to declare such big 2D array in the program...(OutOfMemoryException) If there is any advice, I'll really appreciate it. Thanks. ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user -- Mattias Persson, [matt...@neotechnology.com] Hacker, Neo Technology www.neotechnology.com ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] finding all shortest paths between one node and all other nodes in a large scale databse
Thanks for all your response, Here is the size of the grapth db: NodesSize - 100,000 97MB 200,000 182MB 300,000 267MB ... 5,000,000 expect 5GB I've tried to use 5 virtual machines, each one has 2 cores and 1G memory, Running 2 threads on each VM. Nodes Time -- 50,000 20mins 150,00040hrs Obviously, when the amount of node increases, the spending time of executing GraphAlgoFactory.shortestPath increases heavily and non-linearly. It's hard to estimate how many machines I need if I want to deal with 5,000,000 nodes or even 10,000,000, I think Map/Reduce will be one solution for me, and I'll try to use BFS traversal which may reduce some duplicate procedures. Nice to discuss with you, Thanks. 2011/5/12 Michael Hunger michael.hun...@neotechnology.com Hey JueiTing, I'm not sure if Hadoop is needed here. What is the current performance characteristics for the shortest path you are using? You could take a decent machine and just fire up, e.g. blocks of 10k node pairs to a ThreadPoolExecutor with cores*2 threads. Each of those tasks only has to return the sum of the path lengths (and you know the block size) so you can sum the whole thing up onto a long and divide it by the number of pairs processed at the end? Perhaps instead of brute force looping one and a half times over all nodes it is perhaps better to do a breath first traversal over all nodes (regardless of relationships, just outgoing rels) and returning unique paths from the traverser and calculating the path lenghts from the start nodes to each _connected_ node on the paths (start node - path[0..n].nodes[1..n]) Cheers Michael Am 11.05.2011 um 21:12 schrieb Peter Neubauer: Hi JueiTing, I think this is a typical case for a massive Map/Reduce job. I am thinking of combining Hadoop works with replicas of the graph and then do the computation. I believe Paddy Fitzgerald has been working with these approaches and can give some feedback. Of course, given the size of the graph, that might prove a problem. OTOH, if there are no modifications during the computation, you could run the calculations on read-only databases from the same store. Would that work? Cheers, /peter neubauer GTalk: neubauer.peter Skype peter.neubauer Phone +46 704 106975 LinkedIn http://www.linkedin.com/in/neubauer Twitter http://twitter.com/peterneubauer http://www.neo4j.org - Your high performance graph database. http://startupbootcamp.org/- Öresund - Innovation happens HERE. http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party. On Wed, May 11, 2011 at 12:47 AM, 翁瑞廷 wshir...@gmail.com wrote: Hi, I'm trying to use Neo4j graph database to store a large social network(more than 5,000,000 nodes) for academic research. I need to compute the separation degree(path length) between any two nodes in the graph then get the average degree of whole database. The solution I'm using use now is archieved by executing API GraphAlgoFactory.shortestPath, but it means I need to execute (n*(n-1))/2 times to get all path length. I don't think it's a very good idea :( So, I'm wondering that is there any function which can assign one node and whole DB as Input, and return the paths or only path lengths between the node and all other nodes in the graph as Output? Just like Dijkstra algorithm did. I know there is a Dijkstra function in the API, but it only return the path between two nodes. It'll be really helpful for me if there is any function can return all the result in one computation(even the whole result array computing by Dijkstra) Or does anyone has better idea to deal with this issue? I tried to implement Dijkstra by myself, but it was impossible to declare such big 2D array in the program...(OutOfMemoryException) If there is any advice, I'll really appreciate it. Thanks. ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] finding all shortest paths between one node and all other nodes in a large scale databse
Hi, I had the same problem and solved it by assigning a distance label to any node. The procedure is: 1-take the starting node N and add it to a set A, define the set B 2-set the value d=1 3-for any node M in A: 3.1 set the label distance of M to d 3.2 for any node X which is connected to M and doesn't have the label distance add X to the set B 4-increase d of 1 5-empty A and put all the entries of B in it, then empty B 6-go to step 3 until A is empty I have an implementation of it, but it's a little bit old (I created it for a Neo4j 0.1 graph) - Original Message Da: Neo4j user discussions lt;user@lists.neo4j.orggt; To: Neo4j user discussions lt;user@lists.neo4j.orggt; Oggetto: Re: [Neo4j] finding all shortest paths between one node and all other nodes in a large scale databse Data: 12/05/11 09:30 Thanks for all your response, Here is the size of the grapth db: NodesSize - 100,000 97MB 200,000 182MB 300,000 267MB ... 5,000,000 expect 5GB I've tried to use 5 virtual machines, each one has 2 cores and 1G memory, Running 2 threads on each VM. Nodes Time -- 50,000 gt;20mins 150,000gt;40hrs Obviously, when the amount of node increases, the spending time of executing GraphAlgoFactory.shortestPath increases heavily and non-linearly. It's hard to estimate how many machines I need if I want to deal with 5,000,000 nodes or even 10,000,000, I think Map/Reduce will be one solution for me, and I'll try to use BFS traversal which may reduce some duplicate procedures. Nice to discuss with you, Thanks. 2011/5/12 Michael Hunger lt;michael.hun...@neotechnology.comgt; gt; Hey JueiTing, gt; gt; I'm not sure if Hadoop is needed here. gt; What is the current performance characteristics for the shortest path you gt; are using? gt; gt; You could take a decent machine and just fire up, e.g. blocks of 10k node gt; pairs to a ThreadPoolExecutor with cores*2 threads. gt; Each of those tasks only has to return the sum of the path lengths (and you gt; know the block size) so you can sum the whole thing up onto a long and gt; divide it by the gt; number of pairs processed at the end? gt; gt; Perhaps instead of brute force looping one and a half times over all nodes gt; it is perhaps better to do a breath first traversal over all nodes gt; (regardless of relationships, just outgoing rels) and returning unique paths gt; from the traverser gt; and calculating the path lenghts from the start nodes to each _connected_ gt; node on the paths (start node -gt; path[0..n].nodes[1..n]) gt; gt; Cheers gt; gt; Michael gt; gt; Am 11.05.2011 um 21:12 schrieb Peter Neubauer: gt; gt; gt; Hi JueiTing, gt; gt; I think this is a typical case for a massive Map/Reduce job. I am gt; thinking gt; gt; of combining Hadoop works with replicas of the graph and then do the gt; gt; computation. I believe Paddy Fitzgerald has been working with these gt; gt; approaches and can give some feedback. gt; gt; gt; gt; Of course, given the size of the graph, that might prove a problem. OTOH, gt; if gt; gt; there are no modifications during the computation, you could run the gt; gt; calculations on read-only databases from the same store. Would that work? gt; gt; gt; gt; Cheers, gt; gt; gt; gt; /peter neubauer gt; gt; gt; gt; GTalk: neubauer.peter gt; gt; Skype peter.neubauer gt; gt; Phone +46 704 106975 gt; gt; LinkedIn http://www.linkedin.com/in/neubauer gt; gt; Twitter http://twitter.com/peterneubauer gt; gt; gt; gt; http://www.neo4j.org - Your high performance graph gt; database. gt; gt; http://startupbootcamp.org/ - Ouml;resund - Innovation happens HERE. gt; gt; http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party. gt; gt; gt; gt; gt; gt; On Wed, May 11, 2011 at 12:47 AM, ccedil;iquest;ccedil;lsquo;aring;raquo;middot; lt;wshir...@gmail.comgt; wrote: gt; gt; gt; gt;gt; Hi, gt; gt;gt; gt; gt;gt; I'm trying to use Neo4j graph database to store a gt; gt;gt; large social network(more than 5,000,000 nodes) for academic research. gt; gt;gt; gt; gt;gt; I need to compute the separation degree(path length) between any two gt; nodes gt; gt;gt; in the graph then get the average degree of whole database. gt; gt;gt; The solution I'm using use now is archieved by executing API gt; gt;gt; GraphAlgoFactory.shortestPath, gt; gt;gt; but it means I need to execute (n*(n-1))/2 times to get all path length. gt; I gt; gt;gt; don't think it's a very good idea :( gt; gt;gt; gt; gt;gt; So, I'm wondering that is there any function which can assign one node gt; and gt; gt;gt; whole DB as Input, and return the gt; gt;gt; paths or only path lengths between the node and all other nodes
Re: [Neo4j] finding all shortest paths between one node and all other nodes in a large scale databse
Hey JueiTing, I'm not sure if Hadoop is needed here. What is the current performance characteristics for the shortest path you are using? You could take a decent machine and just fire up, e.g. blocks of 10k node pairs to a ThreadPoolExecutor with cores*2 threads. Each of those tasks only has to return the sum of the path lengths (and you know the block size) so you can sum the whole thing up onto a long and divide it by the number of pairs processed at the end? Perhaps instead of brute force looping one and a half times over all nodes it is perhaps better to do a breath first traversal over all nodes (regardless of relationships, just outgoing rels) and returning unique paths from the traverser and calculating the path lenghts from the start nodes to each _connected_ node on the paths (start node - path[0..n].nodes[1..n]) Cheers Michael Am 11.05.2011 um 21:12 schrieb Peter Neubauer: Hi JueiTing, I think this is a typical case for a massive Map/Reduce job. I am thinking of combining Hadoop works with replicas of the graph and then do the computation. I believe Paddy Fitzgerald has been working with these approaches and can give some feedback. Of course, given the size of the graph, that might prove a problem. OTOH, if there are no modifications during the computation, you could run the calculations on read-only databases from the same store. Would that work? Cheers, /peter neubauer GTalk: neubauer.peter Skype peter.neubauer Phone +46 704 106975 LinkedIn http://www.linkedin.com/in/neubauer Twitter http://twitter.com/peterneubauer http://www.neo4j.org - Your high performance graph database. http://startupbootcamp.org/- Öresund - Innovation happens HERE. http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party. On Wed, May 11, 2011 at 12:47 AM, 翁瑞廷 wshir...@gmail.com wrote: Hi, I'm trying to use Neo4j graph database to store a large social network(more than 5,000,000 nodes) for academic research. I need to compute the separation degree(path length) between any two nodes in the graph then get the average degree of whole database. The solution I'm using use now is archieved by executing API GraphAlgoFactory.shortestPath, but it means I need to execute (n*(n-1))/2 times to get all path length. I don't think it's a very good idea :( So, I'm wondering that is there any function which can assign one node and whole DB as Input, and return the paths or only path lengths between the node and all other nodes in the graph as Output? Just like Dijkstra algorithm did. I know there is a Dijkstra function in the API, but it only return the path between two nodes. It'll be really helpful for me if there is any function can return all the result in one computation(even the whole result array computing by Dijkstra) Or does anyone has better idea to deal with this issue? I tried to implement Dijkstra by myself, but it was impossible to declare such big 2D array in the program...(OutOfMemoryException) If there is any advice, I'll really appreciate it. Thanks. ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user