Re: GraphX path traversal

2015-03-03 Thread Madabhattula Rajesh Kumar
Hi, Could you please let me know how to do this? Any suggestions? Regards, Rajesh On Mon, Mar 2, 2015 at 4:47 PM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I have the edge list below. How can I find the parent path for every vertex? > > Example : > > Vertex 1 path : 2, 3,

Re: [GraphX] Excessive value recalculations during aggregateMessages cycles

2015-02-15 Thread Takeshi Yamamuro
Hi, I tried quick and simple tests though, ISTM the vertices below were correctly cached. Could you give me the differences between my codes and yours? import org.apache.spark.graphx._ import org.apache.spark.graphx.lib._ object Prog { def processInt(d: Int) = d * 2 } val g = GraphLoader.edge

Re: [GraphX] Excessive value recalculations during aggregateMessages cycles

2015-02-08 Thread Kyle Ellrott
I changed the curGraph = curGraph.outerJoinVertices(curMessages)( (vid, vertex, message) => vertex.process(message.getOrElse(List[Message]()), ti) ).cache() to curGraph = curGraph.outerJoinVertices(curMessages)( (vid, vertex, message) => (vertex, message.getOrElse(Lis

Re: GraphX pregel: getting the current iteration number

2015-02-03 Thread Daniil Osipov
I don't think its possible to access. What I've done before is send the current or next iteration index with the message, where the message is a case class. HTH Dan On Tue, Feb 3, 2015 at 10:20 AM, Matthew Cornell wrote: > Hi Folks, > > I'm new to GraphX and Scala and my sendMsg function needs
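
Dan's suggestion can be sketched as follows: since Pregel does not expose the iteration counter, carry the superstep index inside the message (a case class) and stash it in the vertex state so `sendMsg` can read it. The names `Msg` and the payload type are hypothetical, and the update logic is a placeholder.

```scala
import org.apache.spark.graphx._

// Hypothetical message type: carries the superstep index with the payload.
case class Msg(step: Int, payload: Double)

// Vertex state = (value, last superstep seen), so sendMsg can read the step.
def run(g: Graph[Double, Double]): Graph[(Double, Int), Double] = {
  val init = g.mapVertices((_, v) => (v, 0))
  Pregel(init, Msg(0, 0.0), maxIterations = 10)(
    (id, vd, msg) => (vd._1 + msg.payload, msg.step),   // record the step we are in
    triplet => Iterator((triplet.dstId, Msg(triplet.srcAttr._2 + 1, triplet.srcAttr._1))),
    (a, b) => Msg(math.max(a.step, b.step), a.payload + b.payload)
  )
}
```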

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-02-03 Thread Jay Hutfles
I think this is a separate issue with how the EdgeRDDImpl partitions edges. If you can merge this change in and rebuild, it should work: https://github.com/apache/spark/pull/4136/files If you can't, I just called the Graph.partitionBy() method right after constructing my graph but before perfo
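
The workaround can be sketched as below: repartition the edges immediately after construction, before running the algorithm. The input path and landmark IDs are illustrative.

```scala
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.ShortestPaths

// Repartition right after loading so EdgeRDDImpl's partitioning is
// well-defined before ShortestPaths runs.
val graph = GraphLoader.edgeListFile(sc, "hdfs://.../grid-edges.txt")
  .partitionBy(PartitionStrategy.RandomVertexCut)

val landmarks = Seq(1L, 2L)                      // illustrative landmark vertices
val result = ShortestPaths.run(graph, landmarks)
```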

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-02-02 Thread NicolasC
On 01/29/2015 08:31 PM, Ankur Dave wrote: Thanks for the reminder. I just created a PR: https://github.com/apache/spark/pull/4273 Ankur Hello, Thanks for the patch. I applied it on Pregel.scala (in Spark 1.2.0 sources) and rebuilt Spark. During execution, at the 25th iteration of Pregel, che

Re: [Graphx & Spark] Error of "Lost executor" and TimeoutException

2015-02-02 Thread Yifan LI
I think this broadcast cleaning(memory block remove?) timeout exception was caused by: 15/02/02 11:48:49 ERROR TaskSchedulerImpl: Lost executor 13 on small18-tap1.common.lip6.fr: remote Akka client disassociated 15/02/02 11:48:49 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent e

Re: [Graphx & Spark] Error of "Lost executor" and TimeoutException

2015-02-02 Thread Yifan LI
Thanks, Sonal. But it seems to be an error happened when “cleaning broadcast”? BTW, what is the timeout of “[30 seconds]”? can I increase it? Best, Yifan LI > On 02 Feb 2015, at 11:12, Sonal Goyal wrote: > > That may be the cause of your issue. Take a look at the tuning guide[1] and >

Re: [Graphx & Spark] Error of "Lost executor" and TimeoutException

2015-02-02 Thread Sonal Goyal
That may be the cause of your issue. Take a look at the tuning guide[1] and maybe also profile your application. See if you can reuse your objects. 1. http://spark.apache.org/docs/latest/tuning.html Best Regards, Sonal Founder, Nube Technologies

Re: [Graphx & Spark] Error of "Lost executor" and TimeoutException

2015-01-30 Thread Yifan LI
Yes, I think so, esp. for a pregel application… have any suggestion? Best, Yifan LI > On 30 Jan 2015, at 22:25, Sonal Goyal wrote: > > Is your code hitting frequent garbage collection? > > Best Regards, > Sonal > Founder, Nube Technologies > >

Re: [Graphx & Spark] Error of "Lost executor" and TimeoutException

2015-01-30 Thread Sonal Goyal
Is your code hitting frequent garbage collection? Best Regards, Sonal Founder, Nube Technologies On Fri, Jan 30, 2015 at 7:52 PM, Yifan LI wrote: > > > > Hi, > > I am running my graphx application on Spark 1.2.0(11 nodes cluster

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-29 Thread Ankur Dave
Thanks for the reminder. I just created a PR: https://github.com/apache/spark/pull/4273 Ankur On Thu, Jan 29, 2015 at 7:25 AM, Jay Hutfles wrote: > Just curious, is this set to be merged at some point? - To unsubscribe, e-mail:

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-29 Thread Jay Hutfles
Just curious, is this set to be merged at some point? On Thu Jan 22 2015 at 4:34:46 PM Ankur Dave wrote: > At 2015-01-22 02:06:37 -0800, NicolasC wrote: > > I try to execute a simple program that runs the ShortestPaths algorithm > > (org.apache.spark.graphx.lib.ShortestPaths) on a small grid gr

Re: [GraphX] Integration with TinkerPop3/Gremlin

2015-01-26 Thread Nicolas Colson
TinkerPop has become an Apache Incubator project and seems to have Spark in mind in their proposal . That's good news! I hope there will be nice collaborations between the communities. On Wed, Jan 7, 2015 at 11:31 AM, Nicolas Colson wrote: >

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-22 Thread Ankur Dave
At 2015-01-22 02:06:37 -0800, NicolasC wrote: > I try to execute a simple program that runs the ShortestPaths algorithm > (org.apache.spark.graphx.lib.ShortestPaths) on a small grid graph. > I use Spark 1.2.0 downloaded from spark.apache.org. > > This program runs more than 2 hours when the grid s

RE: GraphX vs GraphLab

2015-01-13 Thread Buttler, David
would be if the AMP Lab or Databricks maintained a set of benchmarks on the web that showed how much each successive version of Spark improved. Dave From: Madabhattula Rajesh Kumar [mailto:mrajaf...@gmail.com] Sent: Monday, January 12, 2015 9:24 PM To: Buttler, David Subject: Re: GraphX vs

Re: [Graphx] which way is better to access faraway neighbors?

2014-12-05 Thread Ankur Dave
At 2014-12-05 02:26:52 -0800, Yifan LI wrote: > I have a graph in where each vertex keep several messages to some faraway > neighbours(I mean, not to only immediate neighbours, at most k-hops far, e.g. > k = 5). > > now, I propose to distribute these messages to their corresponding > destinatio

Re: GraphX Pregel halting condition

2014-12-04 Thread Ankur Dave
There's no built-in support for doing this, so the best option is to copy and modify Pregel to check the accumulator at the end of each iteration. This is robust and shouldn't be too hard, since the Pregel code is short and only uses public GraphX APIs. Ankur At 2014-12-03 09:37:01 -0800, Jay

Re: GraphX / PageRank with edge weights

2014-11-18 Thread Ankur Dave
At 2014-11-13 21:28:52 +, "Ommen, Jurgen" wrote: > I'm using GraphX and playing around with its PageRank algorithm. However, I > can't see from the documentation how to use edge weight when running PageRank. > Is this possible to consider edge weights and how would I do it? There's no built-
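
Since there is no built-in weighted PageRank, one common workaround is to normalize each edge weight by its source's total outgoing weight and iterate manually. This is a sketch, not the library API; it assumes Spark >= 1.2 for `aggregateMessages`, and the damping factor and iteration count are illustrative.

```scala
import org.apache.spark.graphx._

// Sketch: use the normalized edge weight in place of 1/outDegree.
def weightedPageRank(g: Graph[Double, Double], iters: Int, d: Double = 0.85): Graph[Double, Double] = {
  // Total outgoing weight per source vertex.
  val outWeights: VertexRDD[Double] =
    g.aggregateMessages[Double](ctx => ctx.sendToSrc(ctx.attr), _ + _)
  // Turn each edge weight into a transition probability, reset ranks to 1.0.
  var ranks = g.outerJoinVertices(outWeights)((_, _, w) => w.getOrElse(0.0))
    .mapTriplets(t => t.attr / t.srcAttr)
    .mapVertices((_, _) => 1.0)
  for (_ <- 1 to iters) {
    val contribs = ranks.aggregateMessages[Double](
      ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), _ + _)
    ranks = ranks.outerJoinVertices(contribs)((_, _, c) => (1 - d) + d * c.getOrElse(0.0))
  }
  ranks
}
```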

Re: GraphX: Get edges for a vertex

2014-11-13 Thread Takeshi Yamamuro
Hi, I think that there are two solutions; 1. Invalid edges send Iterator.empty messages in sendMsg of the Pregel API. These messages make no effect on corresponding vertices. 2. Use GraphOps.(collectNeighbors/collectNeighborIds), not the Pregel API so as to handle active edge lists by yourself.

Re: GraphX and Spark

2014-11-04 Thread Kamal Banga
GraphX is built on *top* of Spark, so Spark can achieve whatever GraphX can. On Wed, Nov 5, 2014 at 9:41 AM, Deep Pradhan wrote: > Hi, > Can Spark achieve whatever GraphX can? > Keeping aside the performance comparison between Spark and GraphX, if I > want to implement any graph algorithm and I

Re: GraphX StackOverflowError

2014-10-28 Thread Ankur Dave
At 2014-10-28 16:27:20 +0300, Zuhair Khayyat wrote: > I am using the connected components function of GraphX (on Spark 1.0.2) on some > graph. However, for some reason it fails with StackOverflowError. The graph > is not too big; it contains 1 vertices and 50 edges. > > [...] > 14/10/28 16:08:

Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
great, thanks! On Tue, Oct 14, 2014 at 5:08 PM, Ankur Dave wrote: > On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh > wrote: > >> a related question, what is the best way to update the values of existing >> vertices and edges? >> > > Many of the Graph methods deal with updating the existing values

Re: graphx - mutable?

2014-10-14 Thread Ankur Dave
On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh wrote: > a related question, what is the best way to update the values of existing > vertices and edges? > Many of the Graph methods deal with updating the existing values in bulk, including mapVertices, mapEdges, mapTriplets, mapReduceTriplets, and out
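
The bulk-update methods mentioned above can be sketched like this, assuming a hypothetical `graph: Graph[Double, Double]`; each call returns a new immutable `Graph` rather than mutating in place.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Transform every vertex / every edge (returns a new graph each time).
val doubled = graph.mapVertices((id, attr) => attr * 2)
val halved  = graph.mapEdges(e => e.attr * 0.5)

// Merge in new values for a subset of vertices from an RDD of updates.
val updates: RDD[(VertexId, Double)] = sc.parallelize(Seq((1L, 42.0)))
val merged = graph.outerJoinVertices(updates) {
  (id, old, upd) => upd.getOrElse(old)   // keep the old value where no update exists
}
```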

Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
thanks ankur. indexedrdd sounds super helpful! a related question, what is the best way to update the values of existing vertices and edges? On Tue, Oct 14, 2014 at 4:30 PM, Ankur Dave wrote: > On Tue, Oct 14, 2014 at 12:36 PM, ll wrote: > >> hi again. just want to check in again to see if a

Re: graphx - mutable?

2014-10-14 Thread Ankur Dave
On Tue, Oct 14, 2014 at 12:36 PM, ll wrote: > hi again. just want to check in again to see if anyone could advise on how > to implement a "mutable, growing graph" with graphx? > > we're building a graph that is growing over time. it adds more vertices and > edges every iteration of our algorithm. >

Re: graphx - mutable?

2014-10-14 Thread ll
hi again. just want to check in again to see if anyone could advise on how to implement a "mutable, growing graph" with graphx? we're building a graph that is growing over time. it adds more vertices and edges every iteration of our algorithm. it doesn't look like there is an obvious way to add a

Re: GraphX: Types for the Nodes and Edges

2014-10-08 Thread Oshi
Also, 3) can a union operation be done when the list of attributes are different for each vertex type? Would really appreciate a basic example to create graph with multiple node types :) Oshi wrote > 1. vP and vA are RDDs, how do I convert them to vertexRDDs and perform the > union? > 2. Should

Re: GraphX: Types for the Nodes and Edges

2014-10-07 Thread Oshi
Hi again, Thank you for your suggestion :) I've tried to implement this method but I'm stuck trying to union the payload before creating the graph. Below is a really simplified snippet of what has worked so far. //Reading the articles given in json format val articles = sqlContext.jsonFile(pa

Re: GraphX Java API Timeline

2014-10-02 Thread Adams, Jeremiah
Thank you Ankur. Is there a branch you are working out of in github? *Jeremiah Adams* On Thu, Oct 2, 2014 at 1:02 PM, Ankur Dave wrote: > Yes, I'm working on a Java API for Spark 1.2. Here's the issue to track > progress: https://issues.apache.org/jira/browse/SPARK-3665 > > Ankur

Re: GraphX Java API Timeline

2014-10-02 Thread Ankur Dave
Yes, I'm working on a Java API for Spark 1.2. Here's the issue to track progress: https://issues.apache.org/jira/browse/SPARK-3665 Ankur On Thu, Oct 2, 2014 at 11:10 AM, Adams, Jeremiah wrote: > Are there any plans to create a java api for GraphX? If so, what is the

Re: GraphX: Types for the Nodes and Edges

2014-10-01 Thread Oshi
Excellent! Thanks Andy. I will give it a go. On Thu, Oct 2, 2014 at 12:42 AM, andy petrella [via Apache Spark User List] wrote: > I'll try my best ;-). > > 1/ you could create an abstract type for the types (1 on top of Vs, 1 other > on top of Es types) then use the subclasses as payload in your

Re: GraphX: Types for the Nodes and Edges

2014-10-01 Thread andy petrella
I'll try my best ;-). 1/ you could create an abstract type for the types (1 on top of Vs, 1 other on top of Es types) then use the subclasses as payload in your VertexRDD or in your Edge. Regarding storage and files, it doesn't really matter (unless you want to use the OOTB loading method, thus you
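
Andy's first suggestion can be sketched as below: a sealed supertype lets two different vertex payload types live in one graph, so the two vertex RDDs can be unioned before construction. The `Paper`/`Author` names and attributes are made up for illustration.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Common supertype over the two vertex payload types.
sealed trait VAttr
case class Paper(title: String) extends VAttr
case class Author(name: String) extends VAttr

val vP: RDD[(VertexId, VAttr)] = sc.parallelize(Seq((1L, Paper("GraphX"))))
val vA: RDD[(VertexId, VAttr)] = sc.parallelize(Seq((2L, Author("Ann"))))

// Union the payloads, then build a single heterogeneous graph.
val graph: Graph[VAttr, String] =
  Graph(vP.union(vA), sc.parallelize(Seq(Edge(2L, 1L, "wrote"))))
```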

Re: GraphX : AssertionError

2014-09-22 Thread Keith Massey
The triangle count also failed for me when I ran it on more than one node. There is this assertion in TriangleCount.scala that causes the failure: // double count should be even (divisible by two) assert((dblCount & 1) == 0) That did not hold true when I ran this on multiple nodes,

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-08 Thread Ankur Dave
At 2014-09-05 12:13:18 +0200, Yifan LI wrote: > But how to assign the storage level to a new vertices RDD that mapped from > an existing vertices RDD, > e.g. > *val newVertexRDD = > graph.collectNeighborIds(EdgeDirection.Out).map{case(id:VertexId, > a:Array[VertexId]) => (id, initialHashMap(a))}*
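
An RDD produced by `map` is a plain derived RDD, so its storage level can simply be set with `persist` before first use. A minimal sketch of the quoted line, with `toSet` standing in for the user's `initialHashMap` helper:

```scala
import org.apache.spark.graphx._
import org.apache.spark.storage.StorageLevel

// Derived RDDs are unpersisted by default; set the level before first use.
val newVertexRDD = graph.collectNeighborIds(EdgeDirection.Out)
  .map { case (id, nbrs) => (id, nbrs.toSet) }   // toSet stands in for initialHashMap
  .persist(StorageLevel.MEMORY_AND_DISK)
```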

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-05 Thread Yifan LI
Thank you, Ankur! :) But how to assign the storage level to a new vertices RDD that mapped from an existing vertices RDD, e.g. *val newVertexRDD = graph.collectNeighborIds(EdgeDirection.Out).map{case(id:VertexId, a:Array[VertexId]) => (id, initialHashMap(a))}* the new one will be combined with th

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-03 Thread Ankur Dave
At 2014-09-03 17:58:09 +0200, Yifan LI wrote: > val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = > numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK) > > Error: java.lang.UnsupportedOperationException: Cannot change storage l
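
The error arises because GraphLoader has already cached the graph at the default level before `persist` is called. A sketch of the usual fix, assuming Spark >= 1.1 where `edgeListFile` accepts storage-level parameters (and where `numEdgePartitions` replaced the older `minEdgePartitions` name):

```scala
import org.apache.spark.graphx._
import org.apache.spark.storage.StorageLevel

// Pass the target storage levels at load time instead of calling
// persist() afterwards, which throws UnsupportedOperationException.
val graph = GraphLoader.edgeListFile(sc, edgesFile,
    numEdgePartitions = numPartitions,
    edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
    vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
  .partitionBy(PartitionStrategy.EdgePartition2D)
```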

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-03 Thread Yifan LI
Hi Ankur, Thanks so much for your advice. But it failed when I tried to set the storage level in constructing a graph. val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK) Erro

Re: Graphx: undirected graph support

2014-08-28 Thread FokkoDriesprong
By analogy with a singly linked list versus a doubly linked list: it might introduce overhead in terms of memory usage, but you could use two directed edges to substitute for each undirected edge. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-undirected-graph-
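
That substitution can be sketched in a few lines: emit each edge in both directions before constructing the graph, roughly doubling edge storage.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Emulate an undirected graph by storing every edge in both directions.
def undirected[VD: ClassTag, ED: ClassTag](
    vertices: RDD[(VertexId, VD)], edges: RDD[Edge[ED]]): Graph[VD, ED] = {
  val both = edges.flatMap(e => Iterator(e, Edge(e.dstId, e.srcId, e.attr)))
  Graph(vertices, both)
}
```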

Re: GraphX usecases

2014-08-25 Thread Ankur Dave
At 2014-08-25 11:23:37 -0700, Sunita Arvind wrote: > Does this "We introduce GraphX, which combines the advantages of both > data-parallel and graph-parallel systems by efficiently expressing graph > computation within the Spark data-parallel framework. We leverage new ideas > in distributed graph

Re: GraphX usecases

2014-08-25 Thread Sunita Arvind
Thanks for the clarification Ankur Appreciate it. Regards Sunita On Monday, August 25, 2014, Ankur Dave wrote: > At 2014-08-25 11:23:37 -0700, Sunita Arvind > wrote: > > Does this "We introduce GraphX, which combines the advantages of both > > data-parallel and graph-parallel systems by effici

Re: GraphX question about graph traversal

2014-08-20 Thread Cesar Arevalo
Hi Ankur, thank you for your response. I already looked at the sample code you sent. And I think the modification you are referring to is on the "tryMatch" function of the PartialMatch class. I noticed you have a case in there that checks for a pattern match, and I think that's the code I need to m

Re: GraphX question about graph traversal

2014-08-20 Thread Ankur Dave
At 2014-08-20 10:34:50 -0700, Cesar Arevalo wrote: > I would like to get the type B vertices that are connected through type A > vertices where the edges have a score greater than 5. So, from the example > above I would like to get V1 and V4. It sounds like you're trying to find paths in the grap

Re: GraphX question about graph traversal

2014-08-20 Thread Cesar Arevalo
Hey, thanks for your response. And I had seen the triplets, but I'm not quite sure how the triplets would get me that V1 is connected to V4. Maybe I need to spend more time understanding it, I guess. -Cesar On Wed, Aug 20, 2014 at 10:56 AM, glxc wrote: > I don't know if Pregel would be neces

Re: GraphX question about graph traversal

2014-08-20 Thread glxc
I don't know if Pregel would be necessary since it's not iterative You could filter the graph by looking at edge triplets, and testing if source =B, dest =A, and edge value > 5 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-question-about-graph-trav
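
The triplet filter described above might look like the following, assuming the vertex attribute is a type tag (here a `String`) and the edge attribute is a numeric score:

```scala
// Keep triplets whose source is type "B", destination is type "A",
// and whose edge score exceeds 5; collect the matching endpoint pairs.
val matches = graph.triplets
  .filter(t => t.srcAttr == "B" && t.dstAttr == "A" && t.attr > 5)
  .map(t => (t.srcId, t.dstId))
```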

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-08-18 Thread Ankur Dave
On Mon, Aug 18, 2014 at 6:29 AM, Yifan LI wrote: > I am testing our application(similar to "personalised page rank" using > Pregel, and note that each vertex property will need pretty much more space > to store after new iteration) [...] But when we ran it on larger graph(e.g. LiveJournal), it

Re: GraphX Pagerank application

2014-08-15 Thread Ankur Dave
On Wed, Aug 6, 2014 at 11:37 AM, AlexanderRiggers < alexander.rigg...@gmail.com> wrote: > To perform the page rank I have to create a graph object, adding the edges > by setting sourceID=id and distID=brand. In GraphLab there is a function: g = > SGraph().add_edges(data, src_field='id', dst_field='b

Re:[GraphX] Can't zip RDDs with unequal numbers of partitions

2014-08-07 Thread Bin
OK, I think I've figured it out. It seems to be a bug which has been reported at: https://issues.apache.org/jira/browse/SPARK-2823 and https://github.com/apache/spark/pull/1763. As it says: "If the users set “spark.default.parallelism” and the value is different with the EdgeRDD partition n

Re: [GraphX] How spark parameters relate to Pregel implementation

2014-08-04 Thread Ankur Dave
At 2014-08-04 20:52:26 +0800, Bin wrote: > I wonder how spark parameters, e.g., degree of parallelism, affect Pregel > performance? Specifically, sendmessage, mergemessage, and vertexprogram? > > I have tried label propagation on a 300,000 edges graph, and I found that no > parallelism is much f

Re: GraphX runs without Spark?

2014-08-03 Thread Deep Pradhan
We need to pass the URL only when we are using the interactive shell right? Now, I am not using the interactive shell, I am just doing ./bin/run-example.. when I am in the Spark directory. >>If not, Spark may be ignoring your single-node cluster and defaulting to local mode. What does this

Re: GraphX runs without Spark?

2014-08-03 Thread Ankur Dave
At 2014-08-03 13:14:52 +0530, Deep Pradhan wrote: > I have a single node cluster on which I have Spark running. I ran some > graphx codes on some data set. Now when I stop all the workers in the > cluster (sbin/stop-all.sh), the codes still run and gives the answers. Why > is it so? I mean does gr

Re: GraphX

2014-08-02 Thread Deep Pradhan
I am aware of how to run LiveJournalPageRank. However, I tried what Ankur had suggested, and I got the result. I have one question on that. Running either by bin/run-examples or by invoking the Analytics in GraphX, both of them finally call Analytics, right? So why not club all the codes in the

Re: [GraphX] how to compute only a subset of vertices in the whole graph?

2014-08-02 Thread Ankur Dave
At 2014-08-02 19:04:22 +0200, Yifan LI wrote: > But I am thinking of if I can compute only some selected vertexes(hubs), not > to do "update" on every vertex… > > is it possible to do this using Pregel API? The Pregel API already only runs vprog on vertices that received messages in the previou

Re: GraphX

2014-08-02 Thread Ankur Dave
At 2014-08-02 21:29:33 +0530, Deep Pradhan wrote: > How should I run graphx codes? At the moment it's a little more complicated to run the GraphX algorithms than the Spark examples due to SPARK-1986 [1]. There is a driver program in org.apache.spark.graphx.lib.Analytics which you can invoke usi

Re: GraphX

2014-08-02 Thread Yifan LI
Try this: ./bin/run-example graphx.LiveJournalPageRank <…> On Aug 2, 2014, at 5:55 PM, Deep Pradhan wrote: > Hi, > I am running Spark in a single node cluster. I am able to run the codes in > Spark like SparkPageRank.scala, SparkKMeans.scala by the following command, > bin/run-examples org.ap

Re: [GraphX] The best way to construct a graph

2014-08-01 Thread Ankur Dave
At 2014-08-01 11:23:49 +0800, Bin wrote: > I am wondering what is the best way to construct a graph? > > Say I have some attributes for each user, and specific weight for each user > pair. The way I am currently doing is first read user information and edge > triple into two arrays, then use sc.

Re: [GraphX] The best way to construct a graph

2014-07-31 Thread shijiaxin
I think you can try GraphLoader.edgeListFile, and then use join to associate the attributes with each vertex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-The-best-way-to-construct-a-graph-tp11122p11127.html Sent from the Apache Spark User List mail
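
That approach might look like the sketch below: load the topology with `edgeListFile` (vertex attributes default to 1), then join per-vertex attributes in afterwards. The paths and the attribute parsing are illustrative.

```scala
import org.apache.spark.graphx._

// Load topology only; attributes come from a separate file.
val topology = GraphLoader.edgeListFile(sc, "hdfs://.../edges.txt")

// Hypothetical attribute file: "vertexId,name" per line.
val userAttrs = sc.textFile("hdfs://.../users.txt").map { line =>
  val parts = line.split(",")
  (parts(0).toLong, parts(1))
}

// Attach attributes; vertices with no attribute row get a default.
val graph = topology.outerJoinVertices(userAttrs) {
  (id, _, attrOpt) => attrOpt.getOrElse("unknown")
}
```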

Re: GraphX Pragel implementation

2014-07-31 Thread Ankur Dave
On Wed, Jul 30, 2014 at 04:55 PM, Arun Kumar wrote: > For my implementation to work the vprog function which is responsible for > handling in coming messages and the sendMsg function should be aware of > which super step they are in. > Is it possible to pass super step information in this methods

Re: GraphX Connected Components

2014-07-30 Thread Ankur Dave
On Wed, Jul 30, 2014 at 11:32 PM, Jeffrey Picard wrote: > That worked! The entire thing ran in about an hour and a half, thanks! Great! > Is there by chance an easy way to build spark apps using the master branch > build of spark? I’ve been having to use the spark-shell. The easiest way is pro

Re: GraphX Connected Components

2014-07-30 Thread Jeffrey Picard
On Jul 30, 2014, at 4:39 PM, Ankur Dave wrote: > Jeffrey Picard writes: >> I tried unpersisting the edges and vertices of the graph by hand, then >> persisting the graph with persist(StorageLevel.MEMORY_AND_DISK). I still see >> the same behavior in connected components however, and the same th

Re: Graphx : Perfomance comparison over cluster

2014-07-30 Thread Ankur Dave
ShreyanshB writes: >> The version with in-memory shuffle is here: >> https://github.com/amplab/graphx2/commits/vldb. > > It'd be great if you can tell me how to configure and invoke this spark > version. Sorry for the delay on this. Assuming you're planning to launch an EC2 cluster, here's how t

Re: GraphX Connected Components

2014-07-30 Thread Ankur Dave
Jeffrey Picard writes: > I tried unpersisting the edges and vertices of the graph by hand, then > persisting the graph with persist(StorageLevel.MEMORY_AND_DISK). I still see > the same behavior in connected components however, and the same thing you > described in the storage page. Unfortunately

Re: GraphX Connected Components

2014-07-30 Thread Jeffrey Picard
On Jul 30, 2014, at 5:18 AM, Ankur Dave wrote: > Jeffrey Picard writes: >> As the program runs I’m seeing each iteration take longer and longer to >> complete, this seems counter intuitive to me, especially since I am seeing >> the shuffle read/write amounts decrease with each iteration. I wo

Re: GraphX Pragel implementation

2014-07-30 Thread Arun Kumar
Hello Ankur, For my implementation to work the vprog function which is responsible for handling in coming messages and the sendMsg function should be aware of which super step they are in. Is it possible to pass super step information in this methods? Can you throw some light on how to approach t

Re: GraphX Connected Components

2014-07-30 Thread Ankur Dave
Jeffrey Picard writes: > As the program runs I’m seeing each iteration take longer and longer to > complete, this seems counter intuitive to me, especially since I am seeing > the shuffle read/write amounts decrease with each iteration. I would think > that as more and more vertices converged t

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread andy petrella
👍thx! Le 29 juil. 2014 22:09, "Ankur Dave" a écrit : > andy petrella writes: > > Oh I was almost sure that lookup was optimized using the partition info > > It does use the partitioner to run only one task, but within that task it > has to scan the entire partition: > > https://github.com/apache

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread Ankur Dave
andy petrella writes: > Oh I was almost sure that lookup was optimized using the partition info It does use the partitioner to run only one task, but within that task it has to scan the entire partition: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDD

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread andy petrella
Oh I was almost sure that lookup was optimized using the partition info Le 29 juil. 2014 21:25, "Ankur Dave" a écrit : > Yifan LI writes: > > Maybe you could get the vertex, for instance, which id is 80, by using: > > > > graph.vertices.filter{case(id, _) => id==80}.collect > > > > but I am not

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread Ankur Dave
Yifan LI writes: > Maybe you could get the vertex, for instance, which id is 80, by using: > > graph.vertices.filter{case(id, _) => id==80}.collect > > but I am not sure this is the exactly efficient way.(it will scan the whole > table? if it can not get benefit from index of VertexRDD table) Un
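
The two options under discussion, side by side; per the thread, both are linear scans within a partition, but `lookup` at least uses the partitioner to touch only one partition:

```scala
// filter: scans every partition of the vertex RDD.
val byFilter = graph.vertices.filter { case (id, _) => id == 80L }.collect()

// lookup: runs one task on the single partition owning key 80, but still
// scans that partition linearly (VertexRDD's hash index is not consulted).
val byLookup = graph.vertices.lookup(80L)
```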

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread andy petrella
'lookup' on RDD (pair) maybe? Le 29 juil. 2014 12:04, "Yifan LI" a écrit : > Hi Bin, > > Maybe you could get the vertex, for instance, which id is 80, by using: > > *graph.vertices.filter{case(id, _) => id==80}.collect* > > but I am not sure this is the exactly efficient way.(it will scan the > w

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread Yifan LI
Hi Bin, Maybe you could get the vertex, for instance, which id is 80, by using: graph.vertices.filter{case(id, _) => id==80}.collect but I am not sure this is the exactly efficient way.(it will scan the whole table? if it can not get benefit from index of VertexRDD table) @Ankur, is there any

Re: graphx cached partitions wont go away

2014-07-26 Thread Koert Kuipers
Never mind, I think it's just the GC taking its time while I have many gigabytes of unused cached RDDs that I cannot get rid of easily On Jul 26, 2014 4:44 PM, "Koert Kuipers" wrote: > i have graphx queries running inside a service where i collect the results > to the driver and do not hold any refe

Re: GraphX Pragel implementation

2014-07-25 Thread Arun Kumar
Hi Thanks for the quick response.I am new to scala and some help will be required Regards -Arun On Fri, Jul 25, 2014 at 10:37 AM, Ankur Dave wrote: > On Thu, Jul 24, 2014 at 9:52 AM, Arun Kumar wrote: > >> While using pregel API for Iterations how to figure out which super step >> the itera

Re: GraphX Pragel implementation

2014-07-24 Thread Ankur Dave
On Thu, Jul 24, 2014 at 9:52 AM, Arun Kumar wrote: > While using pregel API for Iterations how to figure out which super step > the iteration currently in. The Pregel API doesn't currently expose this, but it's very straightforward to modify Pregel.scala

Re: GraphX Pragel implementation

2014-07-24 Thread Arun Kumar
Hi While using pregel API for Iterations how to figure out which super step the iteration currently in. Regards Arun On Thu, Jul 17, 2014 at 4:24 PM, Arun Kumar wrote: > Hi > > > > I am trying to implement the belief propagation algorithm in GraphX using the > pregel API. > > *def* pregel[A] > >

Re: Graphx : Perfomance comparison over cluster

2014-07-23 Thread ShreyanshB
Thanks Ankur. The version with in-memory shuffle is here: https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has changed a lot since then, and the way to configure and invoke Spark is different. I can send you the correct configuration/invocation for this if you're interested in b

Re: Graphx : Perfomance comparison over cluster

2014-07-20 Thread Ankur Dave
On Fri, Jul 18, 2014 at 9:07 PM, ShreyanshB wrote: > > Does the suggested version with in-memory shuffle affects performance too > much? We've observed a 2-3x speedup from it, at least on larger graphs like twitter-2010 and uk-2007-05

Re: Graphx : Perfomance comparison over cluster

2014-07-18 Thread ShreyanshB
Thanks a lot Ankur. The version with in-memory shuffle is here: https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has changed a lot since then, and the way to configure and invoke Spark is different. I can send you the correct configuration/invocation for this if you're intereste

Re: Graphx : Perfomance comparison over cluster

2014-07-18 Thread Ankur Dave
Thanks for your interest. I should point out that the numbers in the arXiv paper are from GraphX running on top of a custom version of Spark with an experimental in-memory shuffle prototype. As a result, if you benchmark GraphX at the current master, it's expected that it will be 2-3x slower than G

Re: GraphX Pragel implementation

2014-07-18 Thread Arun Kumar
Thanks On Fri, Jul 18, 2014 at 12:22 AM, Ankur Dave wrote: > If your sendMsg function needs to know the incoming messages as well as > the vertex value, you could define VD to be a tuple of the vertex value and > the last received message. The vprog function would then store the incoming > mess

Re: GraphX Pragel implementation

2014-07-17 Thread Ankur Dave
If your sendMsg function needs to know the incoming messages as well as the vertex value, you could define VD to be a tuple of the vertex value and the last received message. The vprog function would then store the incoming messages into the tuple, allowing sendMsg to access them. For example, if
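
Ankur's suggested pattern can be sketched as follows: make `VD` a tuple of the real vertex value and the last received message, stash incoming messages in `vprog`, and read them back in `sendMsg`. The message type, initial graph, and update rules here are placeholders, assuming a hypothetical `graph: Graph[Double, Int]`.

```scala
import org.apache.spark.graphx._

// Hypothetical message payload.
type Msgs = List[Double]

// VD = (value, messages received in the previous superstep).
val init: Graph[(Double, Msgs), Int] = graph.mapVertices((_, v) => (v, Nil))

val result = Pregel(init, List.empty[Double], maxIterations = 5)(
  (id, vd, incoming) => (vd._1, incoming),            // stash incoming messages
  t => if (t.srcAttr._2.nonEmpty)                     // sendMsg can now see them
         Iterator((t.dstId, t.srcAttr._2.map(_ * 0.5)))
       else Iterator((t.dstId, List(t.srcAttr._1))),
  (a, b) => a ++ b
)
```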

Re: Graphx traversal and merge interesting edges

2014-07-14 Thread HHB
Hi Ankur, FYI - in a naive attempt to enhance your solution, managed to create MergePatternPath. I think it works in expected way (atleast for the traversing problem in last email). I modified your code a bit. Also instead of EdgePattern I used List of Functions that match the whole edge trip

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Perfect! Thanks Ankur. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-optimal-partitions-for-a-graph-and-error-in-logs-tp9455p9488.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread Ankur Dave
Spark just opens up inter-slave TCP connections for message passing during shuffles (I think the relevant code is in ConnectionManager). Since TCP automatically determines the optimal sending rate, Spark doesn't need any configu

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Great! Thanks a lot. Hate to say this but I promise this is the last quickie I looked at the configurations but I didn't find any parameter to tune for network bandwidth i.e. Is there any way to tell graphx (spark) that I'm using a 1G network or 10G network or InfiniBand? Does it figure out on its ow

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread Ankur Dave
I don't think it should affect performance very much, because GraphX doesn't serialize ShippableVertexPartition in the "fast path" of mapReduceTriplets execution (instead it calls ShippableVertexPartition.shipVertexAttributes and serializes the result). I think it should only get serialized for spe

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Thanks a lot Ankur, I'll follow that. A last quick Does that error affect performance? ~Shreyansh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-optimal-partitions-for-a-graph-and-error-in-logs-tp9455p9462.html Sent from the Apache Spark User List

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread Ankur Dave
On Fri, Jul 11, 2014 at 2:23 PM, ShreyanshB wrote: > > -- Is it a correct way to load file to get best performance? Yes, edgeListFile should be efficient at loading the edges. -- What should be the partition size? =computing node or =cores? In general it should be a multiple of the number of
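The loading advice above can be sketched as follows. This is a hedged, illustrative snippet (the file path and partition count are assumptions, and it needs a running Spark installation): load an edge list with an explicit number of edge partitions, picked as a multiple of the total core count, and cache the graph before repeated queries.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader

// Hedged sketch: edgeListFile with an explicit edge-partition count.
// "local[4]" and the path are illustrative placeholders.
val sc = new SparkContext("local[4]", "edge-list-demo")
val graph = GraphLoader.edgeListFile(
  sc,
  "hdfs:///data/edges.txt",      // one "srcId dstId" pair per line
  numEdgePartitions = 16         // e.g. 4 workers x 4 cores
).cache()                         // cache before repeated queries
println(s"vertices: ${graph.vertices.count}, edges: ${graph.edges.count}")
```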

Re: GraphX: how to specify partition strategy?

2014-07-10 Thread Ankur Dave
On Thu, Jul 10, 2014 at 8:20 AM, Yifan LI wrote: > > - how to "build the latest version of Spark from the master branch, which > contains a fix"? Instead of downloading a prebuilt Spark release from http://spark.apache.org/downloads.html, follow the instructions under "Development Version" on th

Re: Graphx traversal and merge interesting edges

2014-07-08 Thread HHB
Hi Ankur, I was trying out the PatternMatcher. It works for smaller paths, but I see that for the longer ones it continues to run forever... Here's what I am trying: https://gist.github.com/hihellobolke/dd2dc0fcebba485975d1 (The example of 3 share traders transacting in appl shares) The first e

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-07 Thread Koert Kuipers
you could do the deep check only if the hashcodes are the same, and design hashcodes that do not take all elements into account. the alternative seems to be putting cache statements all over graphx, as is currently the case, which is trouble for any long lived application where caching is carefully
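Koert's suggestion can be sketched in plain Scala: guard the expensive deep array comparison with a cheap sampled hash that looks at only a few elements. The function names are illustrative, not GraphX internals.

```scala
// Hedged sketch: cheap O(1) hash over a few sampled elements, used to
// short-circuit the O(n) deep equality check on large index arrays.
def sampledHash(a: Array[Long]): Int =
  if (a.isEmpty) 0
  else (a.length, a(0), a(a.length / 2), a(a.length - 1)).hashCode

def indexesEqual(a: Array[Long], b: Array[Long]): Boolean =
  (a eq b) ||                                   // fast path: same object
    (sampledHash(a) == sampledHash(b) &&        // cheap filter
      java.util.Arrays.equals(a, b))            // deep check only on hash match
```

Structurally equal arrays still compare equal, but most unequal pairs are rejected by the sampled hash before the full scan.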

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-06 Thread Ankur Dave
Well, the alternative is to do a deep equality check on the index arrays, which would be somewhat expensive since these are pretty large arrays (one element per vertex in the graph). But, in case the reference equality check fails, it actually might be a good idea to do the deep check before resort

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-06 Thread Koert Kuipers
probably a dumb question, but why is reference equality used for the indexes? On Sun, Jul 6, 2014 at 12:43 AM, Ankur Dave wrote: > When joining two VertexRDDs with identical indexes, GraphX can use a fast > code path (a zip join without any hash lookups). However, the check for > identical inde

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-05 Thread Ankur Dave
When joining two VertexRDDs with identical indexes, GraphX can use a fast code path (a zip join without any hash lookups). However, the check for identical indexes is performed using reference equality. Without caching, two copies of the index are created. Although the two indexes are structurally

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-05 Thread Koert Kuipers
thanks for replying. why is joining two VertexRDDs without caching slow? what is recomputed unnecessarily? i am not sure what is different here from joining 2 regular RDDs (where nobody seems to recommend caching before joining, i think...) On Thu, Jul 3, 2014 at 10:52 PM, Ankur Dave wrote: > O

Re: Graphx traversal and merge interesting edges

2014-07-05 Thread HHB
Thanks Ankur, Cannot thank you enough for this!!! I am still reading your example, digesting & grokking it :-) I was breaking my head over this for the past few hours. In my last futile attempts I was looking at Pregel... E.g if that could be used to see at what step

Re: Graphx traversal and merge interesting edges

2014-07-05 Thread Ankur Dave
Interesting problem! My understanding is that you want to (1) find paths matching a particular pattern, and (2) add edges between the start and end vertices of the matched paths. For (1), I implemented a pattern matcher for GraphX

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-03 Thread Ankur Dave
Oh, I just read your message more carefully and noticed that you're joining a regular RDD with a VertexRDD. In that case I'm not sure why the warning is occurring, but it might be worth caching both operands (graph.vertices and the regular RDD) just to be sure. Ankur

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-03 Thread Ankur Dave
A common reason for the "Joining ... is slow" message is that you're joining VertexRDDs without having cached them first. This will cause Spark to recompute unnecessarily, and as a side effect, the same index will get created twice and GraphX won't be able to do an efficient zip join. For example,
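The fix described here can be sketched as follows. This is a hedged, illustrative fragment (it assumes a `graph`, a regular `otherRdd` of vertex updates, and a hypothetical `combine` function, and it only runs inside a Spark application): cache both operands so the VertexRDD index is materialized once and the fast zip join can apply.

```scala
// Hedged sketch: cache both join operands before joining, so the same
// index is not rebuilt twice. Names are illustrative placeholders.
val vertices = graph.vertices.cache()   // materialize the VertexRDD index
val updates  = otherRdd.cache()         // the regular RDD[(VertexId, U)] operand
val joined   = vertices.innerJoin(updates) {
  (vid, oldAttr, update) => combine(oldAttr, update)
}
```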
