Re: [Neo4j] finding all shortest paths between one node and all other nodes in a large scale databse

2011-05-12 Thread Mattias Persson
Yes, you can probably do this thing in one traversal. Shortest path will
give you the shortest path(s) between two given nodes, but are interested in
any path, right? And you can find paths to several different end nodes in
one traversal. Just specify an Evaluator which knows about that, or let
loose a breadth-first traversal and filter your paths afterwards.

2011/5/11 Michael Hunger michael.hun...@neotechnology.com

 Hey JueiTing,

 I'm not sure if Hadoop is needed here.
 What is the current performance characteristics for the shortest path you
 are using?

 You could take a decent machine and just fire up, e.g. blocks of 10k node
 pairs to a ThreadPoolExecutor with cores*2 threads.
 Each of those tasks only has to return the sum of the path lengths (and you
 know the block size) so you can sum the whole thing up onto a long and
 divide it by the
 number of pairs processed at the end?

 Perhaps instead of brute force looping one and a half times over all nodes
 it is perhaps better to do a breath first traversal over all nodes
 (regardless of relationships, just outgoing rels) and returning unique paths
 from the traverser
 and calculating the path lenghts from the start nodes to each _connected_
 node on the paths (start node - path[0..n].nodes[1..n])

 Cheers

 Michael

 Am 11.05.2011 um 21:12 schrieb Peter Neubauer:

  Hi JueiTing,
  I think this is a typical case for a massive Map/Reduce job. I am
 thinking
  of combining Hadoop works with replicas of the graph and then do the
  computation. I believe Paddy Fitzgerald has been working with these
  approaches and can give some feedback.
 
  Of course, given the size of the graph, that might prove a problem. OTOH,
 if
  there are no modifications during the computation, you could run the
  calculations on read-only databases from the same store. Would that work?
 
  Cheers,
 
  /peter neubauer
 
  GTalk:  neubauer.peter
  Skype   peter.neubauer
  Phone   +46 704 106975
  LinkedIn   http://www.linkedin.com/in/neubauer
  Twitter  http://twitter.com/peterneubauer
 
  http://www.neo4j.org   - Your high performance graph
 database.
  http://startupbootcamp.org/- Öresund - Innovation happens HERE.
  http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.
 
 
  On Wed, May 11, 2011 at 12:47 AM, 翁瑞廷 wshir...@gmail.com wrote:
 
  Hi,
 
  I'm trying to use Neo4j graph database to store a
  large social network(more than 5,000,000 nodes) for academic research.
 
  I need to compute the separation degree(path length) between any two
 nodes
  in the graph then get the average degree of whole database.
  The solution I'm using use now is archieved by executing API
  GraphAlgoFactory.shortestPath,
  but it means I need to execute (n*(n-1))/2 times to get all path length.
 I
  don't think it's a very good idea :(
 
  So, I'm wondering that is there any function which can assign one node
 and
  whole DB as Input, and return the
  paths or only path lengths between the node and all other nodes in the
  graph
  as Output? Just like Dijkstra algorithm did.
  I know there is a Dijkstra function in the API, but it only return the
 path
  between two nodes.
  It'll be really helpful for me if there is any function can return all
 the
  result in one computation(even the whole result array computing by
  Dijkstra)
 
  Or does anyone has better idea to deal with this issue?
  I tried to implement Dijkstra by myself, but it was impossible to
 declare
  such big 2D array in the program...(OutOfMemoryException)
 
  If there is any advice, I'll really appreciate it.
 
  Thanks.
  ___
  Neo4j mailing list
  User@lists.neo4j.org
  https://lists.neo4j.org/mailman/listinfo/user
 
  ___
  Neo4j mailing list
  User@lists.neo4j.org
  https://lists.neo4j.org/mailman/listinfo/user

 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user




-- 
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] finding all shortest paths between one node and all other nodes in a large scale databse

2011-05-12 Thread 翁瑞廷
Thanks for all your response,

Here is the size of the grapth db:

NodesSize
-
100,000  97MB
200,000  182MB
300,000  267MB
...
5,000,000   expect 5GB

I've tried to use 5 virtual machines, each one has 2 cores and 1G memory,
Running 2 threads on each VM.

Nodes   Time
--
50,000  20mins
150,00040hrs

Obviously, when the amount of node increases, the spending time of executing
GraphAlgoFactory.shortestPath increases heavily and non-linearly.
It's hard to estimate how many machines I need if I want to deal with
5,000,000 nodes or even 10,000,000,
I think Map/Reduce will be one solution for me, and I'll try to use BFS
traversal which may reduce some duplicate procedures.


Nice to discuss with you,

Thanks.

2011/5/12 Michael Hunger michael.hun...@neotechnology.com

 Hey JueiTing,

 I'm not sure if Hadoop is needed here.
 What is the current performance characteristics for the shortest path you
 are using?

 You could take a decent machine and just fire up, e.g. blocks of 10k node
 pairs to a ThreadPoolExecutor with cores*2 threads.
 Each of those tasks only has to return the sum of the path lengths (and you
 know the block size) so you can sum the whole thing up onto a long and
 divide it by the
 number of pairs processed at the end?

 Perhaps instead of brute force looping one and a half times over all nodes
 it is perhaps better to do a breath first traversal over all nodes
 (regardless of relationships, just outgoing rels) and returning unique paths
 from the traverser
 and calculating the path lenghts from the start nodes to each _connected_
 node on the paths (start node - path[0..n].nodes[1..n])

 Cheers

 Michael

 Am 11.05.2011 um 21:12 schrieb Peter Neubauer:

  Hi JueiTing,
  I think this is a typical case for a massive Map/Reduce job. I am
 thinking
  of combining Hadoop works with replicas of the graph and then do the
  computation. I believe Paddy Fitzgerald has been working with these
  approaches and can give some feedback.
 
  Of course, given the size of the graph, that might prove a problem. OTOH,
 if
  there are no modifications during the computation, you could run the
  calculations on read-only databases from the same store. Would that work?
 
  Cheers,
 
  /peter neubauer
 
  GTalk:  neubauer.peter
  Skype   peter.neubauer
  Phone   +46 704 106975
  LinkedIn   http://www.linkedin.com/in/neubauer
  Twitter  http://twitter.com/peterneubauer
 
  http://www.neo4j.org   - Your high performance graph
 database.
  http://startupbootcamp.org/- Öresund - Innovation happens HERE.
  http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.
 
 
  On Wed, May 11, 2011 at 12:47 AM, 翁瑞廷 wshir...@gmail.com wrote:
 
  Hi,
 
  I'm trying to use Neo4j graph database to store a
  large social network(more than 5,000,000 nodes) for academic research.
 
  I need to compute the separation degree(path length) between any two
 nodes
  in the graph then get the average degree of whole database.
  The solution I'm using use now is archieved by executing API
  GraphAlgoFactory.shortestPath,
  but it means I need to execute (n*(n-1))/2 times to get all path length.
 I
  don't think it's a very good idea :(
 
  So, I'm wondering that is there any function which can assign one node
 and
  whole DB as Input, and return the
  paths or only path lengths between the node and all other nodes in the
  graph
  as Output? Just like Dijkstra algorithm did.
  I know there is a Dijkstra function in the API, but it only return the
 path
  between two nodes.
  It'll be really helpful for me if there is any function can return all
 the
  result in one computation(even the whole result array computing by
  Dijkstra)
 
  Or does anyone has better idea to deal with this issue?
  I tried to implement Dijkstra by myself, but it was impossible to
 declare
  such big 2D array in the program...(OutOfMemoryException)
 
  If there is any advice, I'll really appreciate it.
 
  Thanks.
  ___
  Neo4j mailing list
  User@lists.neo4j.org
  https://lists.neo4j.org/mailman/listinfo/user
 
  ___
  Neo4j mailing list
  User@lists.neo4j.org
  https://lists.neo4j.org/mailman/listinfo/user

 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user

___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] finding all shortest paths between one node and all other nodes in a large scale databse

2011-05-12 Thread jacopo . farina
Hi,

I had the same problem and solved it by assigning a distance label to any
node.

The procedure is:

1-take the starting node N and add it to a set A, define the set B

2-set the value d=1

3-for any node M in A:

 3.1 set the label distance of M to d

3.2 for any node X which is connected to M and doesn't have the label
distance add X to the set B

4-increase d of 1

5-empty A and put all the entries of B in it, then empty B

 6-go to step 3 until A is empty


I have an implementation of it, but it's a little bit old (I created it for
a Neo4j 0.1 graph)

- Original Message 

 Da: Neo4j user discussions lt;user@lists.neo4j.orggt;

 To: Neo4j user discussions lt;user@lists.neo4j.orggt;

 Oggetto: Re: [Neo4j] finding all shortest paths between one node and all
other nodes in a large scale databse

 Data: 12/05/11 09:30

 

  Thanks for all your response,

 

 Here is the size of the grapth db:

 

 NodesSize

 -

 100,000  97MB

 200,000  182MB

 300,000  267MB

 ...

 5,000,000   expect 5GB

 

 I've tried to use 5 virtual machines, each one has 2 cores and 1G memory,

 Running 2 threads on each VM.

 

 Nodes   Time

 --

 50,000  gt;20mins

 150,000gt;40hrs

 

 Obviously, when the amount of node increases, the spending time of
executing

 GraphAlgoFactory.shortestPath increases heavily and non-linearly.

 It's hard to estimate how many machines I need if I want to deal with

 5,000,000 nodes or even 10,000,000,

 I think Map/Reduce will be one solution for me, and I'll try to use BFS

 traversal which may reduce some duplicate procedures.

 

 

 Nice to discuss with you,

 

 Thanks.

 

 2011/5/12 Michael Hunger lt;michael.hun...@neotechnology.comgt;

 

 gt; Hey JueiTing,

 gt;

 gt; I'm not sure if Hadoop is needed here.

 gt; What is the current performance characteristics for the shortest path
you

 gt; are using?

 gt;

 gt; You could take a decent machine and just fire up, e.g. blocks of 10k
node

 gt; pairs to a ThreadPoolExecutor with cores*2 threads.

 gt; Each of those tasks only has to return the sum of the path lengths
(and you

 gt; know the block size) so you can sum the whole thing up onto a long and

 gt; divide it by the

 gt; number of pairs processed at the end?

 gt;

 gt; Perhaps instead of brute force looping one and a half times over all
nodes

 gt; it is perhaps better to do a breath first traversal over all nodes

 gt; (regardless of relationships, just outgoing rels) and returning unique
paths

 gt; from the traverser

 gt; and calculating the path lenghts from the start nodes to each
_connected_

 gt; node on the paths (start node -gt; path[0..n].nodes[1..n])

 gt;

 gt; Cheers

 gt;

 gt; Michael

 gt;

 gt; Am 11.05.2011 um 21:12 schrieb Peter Neubauer:

 gt;

 gt; gt; Hi JueiTing,

 gt; gt; I think this is a typical case for a massive Map/Reduce job. I am

 gt; thinking

 gt; gt; of combining Hadoop works with replicas of the graph and then do
the

 gt; gt; computation. I believe Paddy Fitzgerald has been working with
these

 gt; gt; approaches and can give some feedback.

 gt; gt;

 gt; gt; Of course, given the size of the graph, that might prove a
problem. OTOH,

 gt; if

 gt; gt; there are no modifications during the computation, you could run
the

 gt; gt; calculations on read-only databases from the same store. Would
that work?

 gt; gt;

 gt; gt; Cheers,

 gt; gt;

 gt; gt; /peter neubauer

 gt; gt;

 gt; gt; GTalk:  neubauer.peter

 gt; gt; Skype   peter.neubauer

 gt; gt; Phone   +46 704 106975

 gt; gt; LinkedIn   http://www.linkedin.com/in/neubauer

 gt; gt; Twitter  http://twitter.com/peterneubauer

 gt; gt;

 gt; gt; http://www.neo4j.org - Your high performance graph

 gt; database.

 gt; gt; http://startupbootcamp.org/ - Ouml;resund - Innovation happens
HERE.

 gt; gt; http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing
party.

 gt; gt;

 gt; gt;

 gt; gt; On Wed, May 11, 2011 at 12:47 AM,
ccedil;iquest;ccedil;lsquo;žaring;raquo;middot;
lt;wshir...@gmail.comgt; wrote:

 gt; gt;

 gt; gt;gt; Hi,

 gt; gt;gt;

 gt; gt;gt; I'm trying to use Neo4j graph database to store a

 gt; gt;gt; large social network(more than 5,000,000 nodes) for academic
research.

 gt; gt;gt;

 gt; gt;gt; I need to compute the separation degree(path length) between
any two

 gt; nodes

 gt; gt;gt; in the graph then get the average degree of whole database.

 gt; gt;gt; The solution I'm using use now is archieved by executing API

 gt; gt;gt; GraphAlgoFactory.shortestPath,

 gt; gt;gt; but it means I need to execute (n*(n-1))/2 times to get all
path length.

 gt; I

 gt; gt;gt; don't think it's a very good idea :(

 gt; gt;gt;

 gt; gt;gt; So, I'm wondering that is there any function which can assign
one node

 gt; and

 gt; gt;gt; whole DB as Input, and return the

 gt; gt;gt; paths or only path lengths between the node and all other
nodes

Re: [Neo4j] finding all shortest paths between one node and all other nodes in a large scale databse

2011-05-11 Thread Michael Hunger
Hey JueiTing,

I'm not sure if Hadoop is needed here.
What is the current performance characteristics for the shortest path you are 
using?

You could take a decent machine and just fire up, e.g. blocks of 10k node pairs 
to a ThreadPoolExecutor with cores*2 threads.
Each of those tasks only has to return the sum of the path lengths (and you 
know the block size) so you can sum the whole thing up onto a long and divide 
it by the
number of pairs processed at the end? 

Perhaps instead of brute force looping one and a half times over all nodes it 
is perhaps better to do a breath first traversal over all nodes (regardless of 
relationships, just outgoing rels) and returning unique paths from the 
traverser 
and calculating the path lenghts from the start nodes to each _connected_ node 
on the paths (start node - path[0..n].nodes[1..n])

Cheers

Michael

Am 11.05.2011 um 21:12 schrieb Peter Neubauer:

 Hi JueiTing,
 I think this is a typical case for a massive Map/Reduce job. I am thinking
 of combining Hadoop works with replicas of the graph and then do the
 computation. I believe Paddy Fitzgerald has been working with these
 approaches and can give some feedback.
 
 Of course, given the size of the graph, that might prove a problem. OTOH, if
 there are no modifications during the computation, you could run the
 calculations on read-only databases from the same store. Would that work?
 
 Cheers,
 
 /peter neubauer
 
 GTalk:  neubauer.peter
 Skype   peter.neubauer
 Phone   +46 704 106975
 LinkedIn   http://www.linkedin.com/in/neubauer
 Twitter  http://twitter.com/peterneubauer
 
 http://www.neo4j.org   - Your high performance graph database.
 http://startupbootcamp.org/- Öresund - Innovation happens HERE.
 http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.
 
 
 On Wed, May 11, 2011 at 12:47 AM, 翁瑞廷 wshir...@gmail.com wrote:
 
 Hi,
 
 I'm trying to use Neo4j graph database to store a
 large social network(more than 5,000,000 nodes) for academic research.
 
 I need to compute the separation degree(path length) between any two nodes
 in the graph then get the average degree of whole database.
 The solution I'm using use now is archieved by executing API
 GraphAlgoFactory.shortestPath,
 but it means I need to execute (n*(n-1))/2 times to get all path length. I
 don't think it's a very good idea :(
 
 So, I'm wondering that is there any function which can assign one node and
 whole DB as Input, and return the
 paths or only path lengths between the node and all other nodes in the
 graph
 as Output? Just like Dijkstra algorithm did.
 I know there is a Dijkstra function in the API, but it only return the path
 between two nodes.
 It'll be really helpful for me if there is any function can return all the
 result in one computation(even the whole result array computing by
 Dijkstra)
 
 Or does anyone has better idea to deal with this issue?
 I tried to implement Dijkstra by myself, but it was impossible to declare
 such big 2D array in the program...(OutOfMemoryException)
 
 If there is any advice, I'll really appreciate it.
 
 Thanks.
 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user
 
 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user

___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user