Re: [GraphX]: Prevent recomputation of DAG

2024-03-18 Thread Mich Talebzadeh
Hi,

I must admit I don't know much about this Fruchterman-Reingold (call
it FR) visualization using GraphX and Kubernetes. But you are saying
this slowdown issue starts after the second iteration, and that
caching/persisting the graph after each iteration does not help. FR
involves many computations between vertex pairs. In MapReduce (or
shuffle) steps, data may be shuffled across the network, which hurts
performance for large graphs. The usual way to verify this is through
the Spark UI, in the Stages, SQL and Executors tabs, where you can see
the time taken for each step and the amount of data read/written, etc.
Also, repeatedly creating and destroying GraphX graphs in each
iteration may lead to garbage collection (GC) overhead. So you should
consider profiling your application to identify bottlenecks and
pinpoint which part of the code is causing the slowdown. As I
mentioned, Spark offers profiling tools like the Spark UI, or
third-party libraries, for this purpose. A rough sketch of one way to
keep the RDD lineage from growing across iterations is below.
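For illustration only (a sketch, not your actual code -- `iterate`,
`initialGraph` and `maxIterations` are placeholders, and the checkpoint
interval of 10 is an assumption): persist the new graph, checkpoint it
periodically to truncate the growing lineage, force materialisation, then
unpersist the previous iteration so its cached data can be reclaimed.

import org.apache.spark.graphx.Graph

sc.setCheckpointDir("hdfs:///tmp/fr-checkpoints")     // assumed checkpoint location

var graph: Graph[GeneralVertex, EdgeProperty] = initialGraph
for (i <- 0 until maxIterations) {
  val next = iterate(graph)                           // builds the next graph lazily
  next.persist()
  if (i % 10 == 0) next.checkpoint()                  // cut the lineage every few iterations
  next.vertices.count(); next.edges.count()           // force materialisation now
  graph.unpersist(blocking = false)                   // release the previous iteration's data
  graph = next
  // collect() the new positions and push them to RabbitMQ as before
}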

HTH


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".


On Sun, 17 Mar 2024 at 18:45, Marek Berith  wrote:
>
> Dear community,
> for my diploma thesis, we are implementing a distributed version of
> Fruchterman-Reingold visualization algorithm, using GraphX and Kubernetes. Our
> solution is a backend that continously computes new positions of vertices in a
> graph and sends them via RabbitMQ to a consumer. Fruchterman-Reingold is an
> iterative algorithm, meaning that in each iteration repulsive and attractive
> forces between vertices are computed and then new positions of vertices based
> on those forces are computed. Graph vertices and edges are stored in a GraphX
> graph structure. Forces between vertices are computed using MapReduce(between
> each pair of vertices) and aggregateMessages(for vertices connected via
> edges). After an iteration of the algorithm, the recomputed positions from the
> RDD are serialized using collect and sent to the RabbitMQ queue.
>
> Here comes the issue. The first two iterations of the algorithm seem to be
> quick, but at the third iteration, the algorithm is very slow until it reaches
> a point at which it cannot finish an iteration in real time. It seems like
> caching of the graph may be an issue, because if we serialize the graph after
> each iteration in an array and create new graph from the array in the new
> iteration, we get a constant usage of memory and each iteration takes the same
> amount of time. We had already tried to cache/persist/checkpoint the graph
> after each iteration but it didn't help, so maybe we are doing something
> wrong. We do not think that serializing the graph into an array should be the
> solution for such a complex library like Apache Spark. I'm also not very
> confident how this fix will affect performance for large graphs or in parallel
> environment. We are attaching a short example of code that shows doing an
> iteration of the algorithm, input and output example.
>
> We would appreciate if you could help us fix this issue or give us any
> meaningful ideas, as we had tried everything that came to mind.
>
> We look forward to your reply.
> Thank you, Marek Berith
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[GraphX]: Prevent recomputation of DAG

2024-03-17 Thread Marek Berith

Dear community,
for my diploma thesis, we are implementing a distributed version of 
Fruchterman-Reingold visualization algorithm, using GraphX and Kubernetes. Our 
solution is a backend that continuously computes new positions of vertices in a 
graph and sends them via RabbitMQ to a consumer. Fruchterman-Reingold is an 
iterative algorithm, meaning that in each iteration repulsive and attractive 
forces between vertices are computed and then new positions of vertices based 
on those forces are computed. Graph vertices and edges are stored in a GraphX 
graph structure. Forces between vertices are computed using MapReduce(between 
each pair of vertices) and aggregateMessages(for vertices connected via 
edges). After an iteration of the algorithm, the recomputed positions from the 
RDD are serialized using collect and sent to the RabbitMQ queue.


Here comes the issue. The first two iterations of the algorithm seem to be 
quick, but at the third iteration, the algorithm is very slow until it reaches 
a point at which it cannot finish an iteration in real time. It seems like 
caching of the graph may be an issue, because if we serialize the graph after 
each iteration in an array and create new graph from the array in the new 
iteration, we get a constant usage of memory and each iteration takes the same 
amount of time. We had already tried to cache/persist/checkpoint the graph 
after each iteration but it didn't help, so maybe we are doing something 
wrong. We do not think that serializing the graph into an array should be the 
solution for such a complex library like Apache Spark. I'm also not very 
confident how this fix will affect performance for large graphs or in parallel 
environment. We are attaching a short example of code that shows doing an 
iteration of the algorithm, input and output example.


We would appreciate if you could help us fix this issue or give us any 
meaningful ideas, as we had tried everything that came to mind.


We look forward to your reply.
Thank you, Marek Berith
 def iterate(
     sc: SparkContext,
     graph: graphx.Graph[GeneralVertex, EdgeProperty],
     metaGraph: graphx.Graph[GeneralVertex, EdgeProperty])
     : (graphx.Graph[GeneralVertex, EdgeProperty], graphx.Graph[GeneralVertex, EdgeProperty]) = {
   val attractiveDisplacement: VertexRDD[(VertexId, Vector)] =
     this.calculateAttractiveForces(graph)
   val repulsiveDisplacement: RDD[(VertexId, Vector)] =
     this.calculateRepulsiveForces(graph)
   val metaVertexDisplacement: RDD[(VertexId, Vector)] =
     this.calculateMetaVertexForces(graph, metaGraph.vertices)
   val metaEdgeDisplacement: RDD[(VertexId, Vector)] =
     this.calculateMetaEdgeForces(metaGraph)
   val displacements: RDD[(VertexId, Vector)] = this.combineDisplacements(
     attractiveDisplacement,
     repulsiveDisplacement,
     metaVertexDisplacement,
     metaEdgeDisplacement)
   val newVertices: RDD[(VertexId, GeneralVertex)] =
     this.displaceVertices(graph, displacements)
   val newGraph = graphx.Graph(newVertices, graph.edges)
   // persist or checkpoint or cache? or something else?
   newGraph.persist()
   metaGraph.persist()
   (newGraph, metaGraph)
 }

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

State of GraphX and GraphFrames

2023-04-23 Thread g
Hello,

I am currently doing my Master thesis on data provenance on Apache Spark and 
would like to extend the provenance capabilities to include GraphX/GraphFrames. 
I am curious what the current status of both GraphX and GraphFrames is. It 
seems that GraphX is no longer being updated (but still supported) as noted in 
the excellent High Performance Spark book by Rachel Warren & Holden Karau. As 
for GraphFrames, it also seems that it is in a similar situation seeing that 
the pace of commits to the repo <https://github.com/graphframes/graphframes> 
has also been quite low over the last years.

Could anyone enlighten me on what the current state of graph processing is in 
Apache Spark?

Kind regards,
Gilles Magalhaes

Re: What is the range of the PageRank value of graphx

2023-03-28 Thread lee
That is, every pagerank value has no relationship to 1, right? As long as we 
focus on the size of each pagerank value in Graphx, we don't need to focus on 
the range, is that right?


李杰
leedd1...@163.com

 Replied Message 
From: Sean Owen
Date: 3/28/2023 22:33
To: lee
Cc: user@spark.apache.org
Subject: Re: What is the range of the PageRank value of graphx
From the docs:


 * Note that this is not the "normalized" PageRank and as a consequence pages
 * that have no inlinks will have a PageRank of alpha. In particular, the
 * pageranks may have some values greater than 1.



On Tue, Mar 28, 2023 at 9:11 AM lee  wrote:

When I calculate pagerank using HugeGraph, each pagerank value is less than 1, 
and the total of pageranks is 1. However, the PageRank value of graphx is often 
greater than 1, so what is the range of the PageRank value of graphx?




李杰
leedd1...@163.com

Re: What is the range of the PageRank value of graphx

2023-03-28 Thread Sean Owen
From the docs:

 * Note that this is not the "normalized" PageRank and as a consequence pages
 * that have no inlinks will have a PageRank of alpha. In particular, the
 * pageranks may have some values greater than 1.
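(If scores that sum to 1 are needed, the GraphX output can simply be rescaled
afterwards; a minimal sketch with illustrative names, assuming some existing
Graph[VD, ED] called `graph`:)

val ranks = graph.pageRank(0.0001).vertices     // (VertexId, Double) pairs, may exceed 1
val total = ranks.values.sum()                  // sum of all raw ranks
val normalized = ranks.mapValues(_ / total)     // values now sum to 1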

On Tue, Mar 28, 2023 at 9:11 AM lee  wrote:

> When I calculate pagerank using HugeGraph, each pagerank value is less
> than 1, and the total of pageranks is 1. However, the PageRank value of
> graphx is often greater than 1, so what is the range of the PageRank value
> of graphx?
>
>
>
>
>
>
> 李杰
> leedd1...@163.com
>  
>
>


What is the range of the PageRank value of graphx

2023-03-28 Thread lee
When I calculate pagerank using HugeGraph, each pagerank value is less than 1, 
and the total of pageranks is 1. However, the PageRank value of graphx is often 
greater than 1, so what is the range of the PageRank value of graphx?




李杰
leedd1...@163.com

Re: GraphX Support

2022-03-25 Thread Bjørn Jørgensen
Yes, MLlib <https://spark.apache.org/mllib/> is actively developed. You can
have a look at github and filter on closed PRs with the ML label
<https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aclosed+label%3AML>



fre. 25. mar. 2022 kl. 22:15 skrev Bitfox :

> BTW , is MLlib still in active development?
>
> Thanks
>
> On Tue, Mar 22, 2022 at 07:11 Sean Owen  wrote:
>
>> GraphX is not active, though still there and does continue to build and
>> test with each Spark release. GraphFrames kind of superseded it, but is
>> also not super active FWIW.
>>
>> On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez
>>  wrote:
>>
>>> Hello!
>>>
>>>
>>>
>>> My team and I are evaluating GraphX as a possible solution. Would
>>> someone be able to speak to the support of this Spark feature? Is there
>>> active development or is GraphX in maintenance mode (e.g. updated to ensure
>>> functionality with new Spark releases)?
>>>
>>>
>>>
>>> Thanks in advance for your help!
>>>
>>>
>>>
>>> --
>>>
>>> Jacob H. Marquez
>>>
>>> He/Him
>>>
>>> Data & Applied Scientist
>>>
>>> Microsoft Cloud Data Sciences
>>>
>>>
>>>
>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: GraphX Support

2022-03-25 Thread Bitfox
BTW , is MLlib still in active development?

Thanks

On Tue, Mar 22, 2022 at 07:11 Sean Owen  wrote:

> GraphX is not active, though still there and does continue to build and
> test with each Spark release. GraphFrames kind of superseded it, but is
> also not super active FWIW.
>
> On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez 
> wrote:
>
>> Hello!
>>
>>
>>
>> My team and I are evaluating GraphX as a possible solution. Would someone
>> be able to speak to the support of this Spark feature? Is there active
>> development or is GraphX in maintenance mode (e.g. updated to ensure
>> functionality with new Spark releases)?
>>
>>
>>
>> Thanks in advance for your help!
>>
>>
>>
>> --
>>
>> Jacob H. Marquez
>>
>> He/Him
>>
>> Data & Applied Scientist
>>
>> Microsoft Cloud Data Sciences
>>
>>
>>
>


Re: [EXTERNAL] Re: GraphX Support

2022-03-25 Thread Bjørn Jørgensen
One alternative can be to use Spark and ArangoDB <https://www.arangodb.com>

Introducing the new ArangoDB Datasource for Apache Spark
<https://www.arangodb.com/2022/03/introducing-the-new-arangodb-datasource-for-apache-spark/>


ArangoDB is an open source graph DB with a lot of good graph utilities and
documentation <https://www.arangodb.com/docs/stable/graphs.html>

tir. 22. mar. 2022 kl. 00:49 skrev Jacob Marquez
:

> Awesome, thank you!
>
>
>
> *From:* Sean Owen 
> *Sent:* Monday, March 21, 2022 4:11 PM
> *To:* Jacob Marquez 
> *Cc:* user@spark.apache.org
> *Subject:* [EXTERNAL] Re: GraphX Support
>
>
>
>
> GraphX is not active, though still there and does continue to build and
> test with each Spark release. GraphFrames kind of superseded it, but is
> also not super active FWIW.
>
>
>
> On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez <
> jac...@microsoft.com.invalid> wrote:
>
> Hello!
>
>
>
> My team and I are evaluating GraphX as a possible solution. Would someone
> be able to speak to the support of this Spark feature? Is there active
> development or is GraphX in maintenance mode (e.g. updated to ensure
> functionality with new Spark releases)?
>
>
>
> Thanks in advance for your help!
>
>
>
> --
>
> Jacob H. Marquez
>
> He/Him
>
> Data & Applied Scientist
>
> Microsoft Cloud Data Sciences
>
>
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: GraphX Support

2022-03-22 Thread Enrico Minack
Right, GraphFrames is not very active and maintainers don't even have 
the capacity to make releases.


Enrico


Am 22.03.22 um 00:10 schrieb Sean Owen:
GraphX is not active, though still there and does continue to build 
and test with each Spark release. GraphFrames kind of superseded it, 
but is also not super active FWIW.


On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez 
 wrote:


Hello!

My team and I are evaluating GraphX as a possible solution. Would
someone be able to speak to the support of this Spark feature? Is
there active development or is GraphX in maintenance mode (e.g.
updated to ensure functionality with new Spark releases)?

Thanks in advance for your help!

--

Jacob H. Marquez

He/Him

Data & Applied Scientist

Microsoft Cloud Data Sciences



RE: [EXTERNAL] Re: GraphX Support

2022-03-21 Thread Jacob Marquez
Awesome, thank you!

From: Sean Owen 
Sent: Monday, March 21, 2022 4:11 PM
To: Jacob Marquez 
Cc: user@spark.apache.org
Subject: [EXTERNAL] Re: GraphX Support

GraphX is not active, though still there and does continue to build and test 
with each Spark release. GraphFrames kind of superseded it, but is also not 
super active FWIW.

On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez 
mailto:jac...@microsoft.com.invalid>> wrote:
Hello!

My team and I are evaluating GraphX as a possible solution. Would someone be 
able to speak to the support of this Spark feature? Is there active development 
or is GraphX in maintenance mode (e.g. updated to ensure functionality with new 
Spark releases)?

Thanks in advance for your help!

--
Jacob H. Marquez
He/Him
Data & Applied Scientist
Microsoft Cloud Data Sciences



Re: GraphX Support

2022-03-21 Thread Sean Owen
GraphX is not active, though still there and does continue to build and
test with each Spark release. GraphFrames kind of superseded it, but is
also not super active FWIW.

On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez 
wrote:

> Hello!
>
>
>
> My team and I are evaluating GraphX as a possible solution. Would someone
> be able to speak to the support of this Spark feature? Is there active
> development or is GraphX in maintenance mode (e.g. updated to ensure
> functionality with new Spark releases)?
>
>
>
> Thanks in advance for your help!
>
>
>
> --
>
> Jacob H. Marquez
>
> He/Him
>
> Data & Applied Scientist
>
> Microsoft Cloud Data Sciences
>
>
>


GraphX Support

2022-03-21 Thread Jacob Marquez
Hello!

My team and I are evaluating GraphX as a possible solution. Would someone be 
able to speak to the support of this Spark feature? Is there active development 
or is GraphX in maintenance mode (e.g. updated to ensure functionality with new 
Spark releases)?

Thanks in advance for your help!

--
Jacob H. Marquez
He/Him
Data & Applied Scientist
Microsoft Cloud Data Sciences



GraphX Pregel: Access current iteration i

2021-11-08 Thread Jannik Rau
Hi all,

I'm using the Pregel.apply function in an iterative graph algorithm.

In the algorithm, the update of a vertex's value is treated differently in the first and last of the n total iterations.
Calling Pregel.apply with maxIterations = 1, then with maxIterations = n - 2 and again with maxIterations = 1 results in three times going through the initial procedure.
If one could access the current iteration i, one could distinguish in the vertex program, how to update a vertex's value.

I've searched for similar questions and came across this one:

https://mail-archives.apache.org/mod_mbox/spark-user/201502.mbox/%3cca+hoc9nzp38slmpb8r1tpa_nb8ivfhcm8jhmm6vwon4o6x1...@mail.gmail.com%3E

The answer suggests to include the current iteration in every message, which adds a lot of memory when working with large graphs.
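For reference, a rough sketch of that suggestion with illustrative types: the vertex
attribute carries (value, iteration) and every message carries (iteration, payload),
so the vertex program can branch on the superstep index (e.g. to special-case the
first and last iterations):

import org.apache.spark.graphx._

def runWithIterationIndex(graph: Graph[Double, Double], numIter: Int): Graph[(Double, Int), Double] = {
  val initial = graph.mapVertices((_, v) => (v, 0))
  Pregel(initial, initialMsg = (0, 0.0), maxIterations = numIter)(
    vprog = (_, attr, msg) => {
      val (value, _) = attr
      val (iter, payload) = msg
      val updated = if (iter == 0 || iter == numIter - 1) value else value + payload
      (updated, iter)                                    // remember which superstep we are in
    },
    sendMsg = triplet => {
      val (srcValue, srcIter) = triplet.srcAttr
      Iterator((triplet.dstId, (srcIter + 1, srcValue))) // forward the incremented index
    },
    mergeMsg = (a, b) => (a._1, a._2 + b._2)
  )
}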

Do you plan to adjust the apply function, or maybe refactor "var i" as an object variable of the Pregel object?
Or don't you plan to do this and rather recommend me using a different Graphx utility, which is designed for such a scenario?

Thanks for any answer in advance!


Kind Regards,

 

Jannik

 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Is there a good way for Spark GraphX to pull JanusGraph data?

2020-10-26 Thread Lucien
Hi all.
As the title,Is there any good plan? Or other suggestions, thanks for all
answers.

-- 

Best regards
Lucien


Connected components using GraphFrames is significantly slower than GraphX?

2020-02-16 Thread kant kodali
Hi All,

Trying to understand why connected components algorithms runs much slower
than the graphX equivalent?

Graphx code creates 16 stages.

GraphFrame graphFrame = GraphFrame.fromEdges(edges);
Dataset<Row> connectedComponents =
    graphFrame.connectedComponents().setAlgorithm("graphx").run();

and the GraphFrames code below creates 55 stages.

GraphFrame graphFrame = GraphFrame.fromEdges(edges);
Dataset<Row> connectedComponents =
    graphFrame.connectedComponents().run();

Any ideas on how to make GraphFrames faster? Also, what is the latest
graph processing library/framework I should be using? I feel like
there isn't a lot of work going on in either GraphFrames or GraphX, so I
am just curious what I should use for the long term.
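For what it is worth, the DataFrame-based connected components in GraphFrames is an
iterative join algorithm that periodically checkpoints its intermediate results, which
is part of why it shows many more stages than the GraphX (Pregel-based) variant. A
sketch of the knobs usually involved (setCheckpointInterval and the checkpoint path
are taken as assumptions from the GraphFrames docs):

import org.graphframes.GraphFrame

spark.sparkContext.setCheckpointDir("/tmp/graphframes-ckpt")  // required by the algorithm

val gf = GraphFrame.fromEdges(edges)
val components = gf.connectedComponents()
  .setCheckpointInterval(2)    // checkpoint every 2 iterations to keep lineage short
  .run()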

Thanks!


Re: GraphX performance feedback

2019-11-28 Thread mahzad kalantari
Ok thanks!

Le jeu. 28 nov. 2019 à 11:27, Phillip Henry  a
écrit :

> I saw a large improvement in my GraphX processing by:
>
> - using fewer partitions
> - using fewer executors but with much more memory.
>
> YMMV.
>
> Phillip
>
> On Mon, 25 Nov 2019, 19:14 mahzad kalantari, 
> wrote:
>
>> Thanks for your answer, my use case is friend recommandation for 200
>> million profils.
>>
>> Le lun. 25 nov. 2019 à 14:10, Jörn Franke  a
>> écrit :
>>
>>> I think it depends what you want do. Interactive big data graph
>>> analytics are probably better of in Janusgraph or similar.
>>> Batch processing (once-off) can be still fine in graphx - you have
>>> though to carefully design the process.
>>>
>>> Am 25.11.2019 um 20:04 schrieb mahzad kalantari <
>>> mahzad.kalant...@gmail.com>:
>>>
>>> 
>>> Hi all
>>>
>>> My question is about GraphX, I 'm looking for user feedbacks on the
>>> performance.
>>>
>>> I read this paper written by Facebook team that says Graphx has very
>>> poor performance.
>>>
>>> https://engineering.fb.com/core-data/a-comparison-of-state-of-the-art-graph-processing-systems/
>>>
>>>
>>> Has anyone already encountered performance problems with Graphx, and is
>>> it a good choice if I want to do large scale graph modelling?
>>>
>>>
>>> Thanks!
>>>
>>> Mahzad
>>>
>>>


Re: GraphX performance feedback

2019-11-28 Thread Phillip Henry
I saw a large improvement in my GraphX processing by:

- using fewer partitions
- using fewer executors but with much more memory.

YMMV.

Phillip

On Mon, 25 Nov 2019, 19:14 mahzad kalantari, 
wrote:

> Thanks for your answer, my use case is friend recommandation for 200
> million profils.
>
> Le lun. 25 nov. 2019 à 14:10, Jörn Franke  a écrit :
>
>> I think it depends what you want do. Interactive big data graph analytics
>> are probably better of in Janusgraph or similar.
>> Batch processing (once-off) can be still fine in graphx - you have though
>> to carefully design the process.
>>
>> Am 25.11.2019 um 20:04 schrieb mahzad kalantari <
>> mahzad.kalant...@gmail.com>:
>>
>> 
>> Hi all
>>
>> My question is about GraphX, I 'm looking for user feedbacks on the
>> performance.
>>
>> I read this paper written by Facebook team that says Graphx has very poor
>> performance.
>>
>> https://engineering.fb.com/core-data/a-comparison-of-state-of-the-art-graph-processing-systems/
>>
>>
>> Has anyone already encountered performance problems with Graphx, and is
>> it a good choice if I want to do large scale graph modelling?
>>
>>
>> Thanks!
>>
>> Mahzad
>>
>>


Re: GraphX performance feedback

2019-11-25 Thread mahzad kalantari
Thanks for your answer, my use case is friend recommendation for 200
million profiles.

Le lun. 25 nov. 2019 à 14:10, Jörn Franke  a écrit :

> I think it depends what you want do. Interactive big data graph analytics
> are probably better of in Janusgraph or similar.
> Batch processing (once-off) can be still fine in graphx - you have though
> to carefully design the process.
>
> Am 25.11.2019 um 20:04 schrieb mahzad kalantari <
> mahzad.kalant...@gmail.com>:
>
> 
> Hi all
>
> My question is about GraphX, I 'm looking for user feedbacks on the
> performance.
>
> I read this paper written by Facebook team that says Graphx has very poor
> performance.
>
> https://engineering.fb.com/core-data/a-comparison-of-state-of-the-art-graph-processing-systems/
>
>
> Has anyone already encountered performance problems with Graphx, and is it
> a good choice if I want to do large scale graph modelling?
>
>
> Thanks!
>
> Mahzad
>
>


Re: GraphX performance feedback

2019-11-25 Thread Jörn Franke
I think it depends what you want to do. Interactive big data graph analytics are 
probably better off in Janusgraph or similar. 
Batch processing (once-off) can still be fine in graphx - you have though to 
carefully design the process. 

> Am 25.11.2019 um 20:04 schrieb mahzad kalantari :
> 
> 
> Hi all
> 
> My question is about GraphX, I 'm looking for user feedbacks on the 
> performance.
> 
> I read this paper written by Facebook team that says Graphx has very poor 
> performance.
> https://engineering.fb.com/core-data/a-comparison-of-state-of-the-art-graph-processing-systems/
>   
> 
> Has anyone already encountered performance problems with Graphx, and is it a 
> good choice if I want to do large scale graph modelling?
> 
> 
> Thanks!
> 
> Mahzad 


GraphX performance feedback

2019-11-25 Thread mahzad kalantari
Hi all

My question is about GraphX, I 'm looking for user feedbacks on the
performance.

I read this paper written by Facebook team that says Graphx has very poor
performance.
https://engineering.fb.com/core-data/a-comparison-of-state-of-the-art-graph-processing-systems/


Has anyone already encountered performance problems with Graphx, and is it
a good choice if I want to do large scale graph modelling?


Thanks!

Mahzad


Re: graphx vs graphframes

2019-10-17 Thread Nicolas Paris
Hi Alastair

Cypher support looks like promising and the dev list thread discussion
is interesting. 
thanks for your feedback. 

On Thu, Oct 17, 2019 at 09:19:28AM +0100, Alastair Green wrote:
> Hi Nicolas, 
> 
> I was following the current thread on the dev channel about Spark
> Graph, including Cypher support, 
> 
> http://apache-spark-developers-list.1001551.n3.nabble.com/
> Add-spark-dependency-on-on-org-opencypher-okapi-shade-okapi-td28118.html
> 
> and I remembered your post.
> 
> Actually, GraphX and GraphFrames are both not being developed actively, so far
> as I can tell. 
> 
> The only activity on GraphX in the last two years was a fix for Scala 2.13
> functionality: to quote the PR 
> 
> 
> ### Does this PR introduce any user-facing change?
> 
> No behavior change at all.
> 
> The only activity on GraphFrames since the addition of Pregel support in Scala
> back in December 2018, has been build/test improvements and recent builds
> against 2.4 and 3.0 snapshots. I’m not sure there was a lot of functional
> change before that either. 
> 
> The efforts to provide graph processing in Spark with the more full-featured
> Cypher query language that you can see in the proposed 3.0 changes discussed 
> in
> the dev list, and the related openCypher/morpheus project (which among many
> other things allows you to cast a Morpheus graph into a GraphX graph) and
> extends the proposed 3.0 changes in a compatible way, are active. 
> 
> Yrs, 
> 
> Alastair
> 
> 
> Alastair Green
> 
> Query Languages Standards and Research
> 
> 
> Neo4j UK Ltd
> 
> Union House
> 182-194 Union Street
> London, SE1 0LH
> 
> 
> +44 795 841 2107
> 
> 
> On Sun, Sep 22, 2019 at 21:17, Nicolas Paris  wrote:
> 
> hi all
> 
> graphframes was intended to replace graphx.
> 
> however the former looks not maintained anymore while the latter is
> still active.
> 
> any thought ?
> --
> nicolas
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 

-- 
nicolas

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: graphx vs graphframes

2019-10-17 Thread Alastair Green
Hi Nicolas,
I was following the current thread on the dev channel about Spark Graph, 
including Cypher support,
http://apache-spark-developers-list.1001551.n3.nabble.com/Add-spark-dependency-on-on-org-opencypher-okapi-shade-okapi-td28118.html
 
and I remembered your post.
Actually, GraphX and GraphFrames are both not being developed actively, so far 
as I can tell.
The only activity on GraphX in the last two years was a fix for Scala 2.13 
functionality: to quote the PR
### Does this PR introduce any user-facing change?
No behavior change at all.

The only activity on GraphFrames since the addition of Pregel support in Scala 
back in December 2018, has been build/test improvements and recent builds 
against 2.4 and 3.0 snapshots. I’m not sure there was a lot of functional 
change before that either.
The efforts to provide graph processing in Spark with the more full-featured 
Cypher query language that you can see in the proposed 3.0 changes discussed in 
the dev list, and the related openCypher/morpheus project (which among many 
other things allows you to cast a Morpheus graph into a GraphX graph) and 
extends the proposed 3.0 changes in a compatible way, are active.
Yrs,
Alastair
Alastair Green

Query Languages Standards and Research




Neo4j UK Ltd

Union House
182-194 Union Street
London, SE1 0LH




+44 795 841 2107


On Sun, Sep 22, 2019 at 21:17, Nicolas Paris  wrote:
hi all

graphframes was intended to replace graphx.

however the former looks not maintained anymore while the latter is
still active.

any thought ?
--
nicolas

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

graphx vs graphframes

2019-09-22 Thread Nicolas Paris
hi all

graphframes was intended to replace graphx.

however the former looks not maintained anymore while the latter is
still active.

any thought ?
-- 
nicolas

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



How to Load a Graphx Graph from a parquet file?

2019-08-29 Thread Alexander Czech
 Hey all,
I want to load a parquet containing my edges into an Graph my code so far
looks like this:

val edgesDF = spark.read.parquet("/path/to/edges/parquet/")
val edgesRDD = edgesDF.rdd
val graph = Graph.fromEdgeTuples(edgesRDD, 1)

But simply this produces an error:

[error]  found   : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error]  required: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.VertexId)]
[error]     (which expands to)  org.apache.spark.rdd.RDD[(Long, Long)]
[error] Error occurred in an application involving default arguments.
[error]     val graph = Graph.fromEdgeTuples(edgesRDD, 1)

I tried to declare the edgesRDD like the following code but this just
moves the error by doing this:
val edgesDF = spark.read.parquet("/path/to/edges/parquet/")
val edgesRDD : RDD[(Long,Long)] = edgesDF.rdd
val graph = Graph.fromEdgeTuples(edgesRDD, 1)

[error] /home/alex/ownCloud/JupyterNotebooks/Diss_scripte/Webgraph_analysis/pagerankscala/src/main/scala/pagerank.scala:17:44: type mismatch;
[error]  found   : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error]  required: org.apache.spark.rdd.RDD[(Long, Long)]
[error]     val edgesRDD : RDD[(Long,Long)] = edgesDF.rdd

So I guess I have to transform org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
into org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId,
org.apache.spark.graphx.VertexId)] (which expands to
org.apache.spark.rdd.RDD[(Long, Long)]).

how can I achieve this ?
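One way to do it (a sketch, assuming the first two parquet columns hold the source
and destination vertex ids as 64-bit integers; if they are 32-bit ints, use
getInt(...).toLong instead):

import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

val edgesDF = spark.read.parquet("/path/to/edges/parquet/")
val edgesRDD: RDD[(VertexId, VertexId)] =
  edgesDF.rdd.map(row => (row.getLong(0), row.getLong(1)))   // Row -> (Long, Long)
val graph = Graph.fromEdgeTuples(edgesRDD, defaultValue = 1)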


GraphX parameters tuning

2019-05-16 Thread muaz-32
Hi everyone. 
I am doing my master thesis on the topic of automatic parameter tuning of
graph processing frameworks. Now, we are aiming to optimize GraphX jobs. I
have an initial list of parameters which we would like to tune:
spark.memory.fraction
spark.executor.memory
spark.shuffle.compress
spark.default.parallelism
spark.shuffle.file.buffer
spark.io.compress.codec
spark.speculation
Plus some parameters related to java GC. 
I would like to hear your feedback: If anyone thinks that there's a missing
parameter which should be tuned or if there is a parameter in the list which
is not very important.

Best,
Muaz Twaty
EURA NOVA



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Request for a working example of using Pregel API in GraphX using Spark Scala

2019-05-05 Thread Basavaraj
Hello All

I am a beginner in Spark, trying to use GraphX for iterative processing by 
connecting to Kafka Stream Processing.

Looking for any git reference to a real application example, in Scala.

Please revert with any reference to it, or if someone is trying to build one, I 
could join them.

Regards
Basavaraj K N






Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-25 Thread M Bilal
If I understand correctly this would set the split size in the Hadoop
configuration when reading files. I can see that being useful when you want
to create more partitions than what the block size in HDFS might dictate.
Instead, what I want to do is to create a single partition for each file
written by a task (from, say, a previous job), i.e. data in part-0 forms
partition 1, part-1 forms partition 2, and so on and so forth. One possible
approach is sketched below.
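For reference, a sketch (untested) of that approach: Hadoop's FileInputFormat never
lets a single input split span two files, so raising the minimum split size above
the size of any part-file should yield exactly one split -- and hence one
partition -- per file, without repartitioning:

// Property name per the Hadoop 2.x naming; the 10 GB value is an assumption that
// merely has to exceed the size of the largest part-file.
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.minsize",
  (10L * 1024 * 1024 * 1024).toString)
val edges = sc.textFile("/path/to/partitioned/edges")
// edges.getNumPartitions should now match the number of part-files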

- Bilal

On Tue, Apr 16, 2019, 6:00 AM Manu Zhang  wrote:

> You may try
> `sparkContext.hadoopConfiguration().set("mapred.max.split.size",
> "33554432")` to tune the partition size when reading from HDFS.
>
> Thanks,
> Manu Zhang
>
> On Mon, Apr 15, 2019 at 11:28 PM M Bilal  wrote:
>
>> Hi,
>>
>> I have implemented a custom partitioning algorithm to partition graphs in
>> GraphX. Saving the partitioning graph (the edges) to HDFS creates separate
>> files in the output folder with the number of files equal to the number of
>> Partitions.
>>
>> However, reading back the edges creates number of partitions that are
>> equal to the number of blocks in the HDFS folder. Is there a way to instead
>> create the same number of partitions as the number of files written to HDFS
>> while preserving the original partitioning?
>>
>> I would like to avoid repartitioning.
>>
>> Thanks.
>> - Bilal
>>
>


Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-15 Thread Manu Zhang
You may try
`sparkContext.hadoopConfiguration().set("mapred.max.split.size",
"33554432")` to tune the partition size when reading from HDFS.

Thanks,
Manu Zhang

On Mon, Apr 15, 2019 at 11:28 PM M Bilal  wrote:

> Hi,
>
> I have implemented a custom partitioning algorithm to partition graphs in
> GraphX. Saving the partitioning graph (the edges) to HDFS creates separate
> files in the output folder with the number of files equal to the number of
> Partitions.
>
> However, reading back the edges creates number of partitions that are
> equal to the number of blocks in the HDFS folder. Is there a way to instead
> create the same number of partitions as the number of files written to HDFS
> while preserving the original partitioning?
>
> I would like to avoid repartitioning.
>
> Thanks.
> - Bilal
>


[GraphX] Preserving Partitions when reading from HDFS

2019-04-15 Thread M Bilal
Hi,

I have implemented a custom partitioning algorithm to partition graphs in
GraphX. Saving the partitioning graph (the edges) to HDFS creates separate
files in the output folder with the number of files equal to the number of
Partitions.

However, reading back the edges creates number of partitions that are equal
to the number of blocks in the HDFS folder. Is there a way to instead
create the same number of partitions as the number of files written to HDFS
while preserving the original partitioning?

I would like to avoid repartitioning.

Thanks.
- Bilal


[GraphX] - OOM Java Heap Space

2018-10-28 Thread Thodoris Zois
Hello,

I have the edges of a graph stored as parquet files (about 3GB). I am loading 
the graph and trying to compute the total number of triplets and triangles. 
Here is my code:

val edges_parq = sqlContext.read.option("header", "true")
  .parquet(args(0) + "/year=" + year)
val edges: RDD[Edge[Int]] =
  edges_parq.rdd.map(row => Edge(row(0).asInstanceOf[Int].toInt, row(1).asInstanceOf[Int].toInt))
val graph = Graph.fromEdges(edges, 1.toInt)
  .partitionBy(PartitionStrategy.RandomVertexCut)

// The actual computation
var numberOfTriplets = graph.triplets.count
val tmp = graph.triangleCount().vertices.filter { case (vid, count) => count > 0 }
var numberOfTriangles = tmp.map(a => a._2).sum()

Even though it manages to compute the number of triplets, I can’t compute the 
number of triangles. Every time I get an exception OOM - Java Heap Space on 
some executors and the application fails.
I am using 100 executors (1 core and 6GBs per executor). I have tried to use 
'hdfsConf.set("mapreduce.input.fileinputformat.split.maxsize", "33554432”)’ in 
the code but still no results.

Here are some of my configurations:
--conf spark.driver.memory=20G 
--conf spark.driver.maxResultSize=20G 
--conf spark.yarn.executor.memoryOverhead=6144 

- Thodoris
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[Spark-GraphX] Conductance, Bridge Ratio & Diameter

2018-10-18 Thread Thodoris Zois
Hello, 

I am trying to compute conductance, bridge ratio and diameter on a given graph 
but I face some problems.

- For the conductance my problem is how to compute the cuts so that they are 
kinda semi-clustered. Is the partitioningBy from GraphX related to dividing a 
graph into multiple subgraphs that are clustered or semi-clustered together? If 
not, the could you please provide me some help? (an open-source implementation 
or something so I can proceed).

- For the bridge ratio I have made an implementation but it is too naive and it 
takes a lot of time to finish. So if anybody can provide some help then I would 
be really grateful. 

- For the diameter i found: (https://github.com/Cecca/graphx-diameter 
<https://github.com/Cecca/graphx-diameter>) but after a certain point even if 
the graph is 300MB it hangs on reducing and it keeps “running” for 14 hours. 
After a certain point I sometimes get SparkListenerBus has stopped. Any ideas?

Thank you very much for your help!
- Thodoris 

what is the query language used for graphX?

2018-05-02 Thread kant kodali
Hi All,

what is the query language used for graphX? are there any plans to
introduce gremlin or is that idea being dropped and go with Spark SQL?

Thanks!


Depth First Search in GraphX

2018-04-22 Thread abagavat
Has anyone come across an implementation of Depth First Search in Spark GraphX?

Just wondering if that could be possible with Spark GraphX. I searched a
lot, but found only results for BFS. If someone has an idea about it, please
share it with me. I would love to learn about its feasibility in Spark GraphX.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Log analysis with GraphX

2018-02-22 Thread JB Data
A new one created with my basic *datayse.*

@*JB*Δ <http://jbigdata.fr>


2018-02-21 13:14 GMT+01:00 Philippe de Rochambeau :

> Hi JB,
> which column in the 8 line DS do you regress ?
>
>
> Le 21 févr. 2018 09:47, JB Data  a écrit :
>
> Hi,
>
> Interesting discussion, let me add my *shell* point of view.
> My focus is only Prediction, to avoid pure DS to "crier aux loups", I warn
> how simple my *datayse* of the problem is :
> - No use of the button in the model, only page navigation.
> - User navigation 're-init' when click on a page ever clicked
> - My dataset is only 8 rows large !
> 2018-01-02 12:00:00;OKK;PAG1;1234555
> 2018-01-02 12:01:01;NEX;PAG2;1234555
> 2018-01-02 12:00:02;OKK;PAG1;5556667
> 2018-01-02 12:01:03;NEX;PAG3;5556667
> 2018-01-02 12:01:04;OKK;PAG3;1234555
> 2018-01-02 12:01:04;NEX;PAG1;1234555
> 2018-01-02 12:01:04;NEX;PAG3;1234555
> 2018-01-02 12:01:04;OKK;PAG2;5556667
>
> Anyway... After 250 python lines...
> *Regression with SKlearn*
> Mean squared error: 2.39
> Variance score: -0.54
> *Regression with Keras*
> Results: -4.60 (3.83) MSE
>
> No doubt that increasing *wc -l* will increase *MSE*.
> DL is the nowdays magic wand that everyone wants to shake above data
> But mastering the wand is not for everyone (myself included), use wand
> with parsimony...
>
> I create this group <https://github.com/dev2score/python/issues/1> as a
> prerequisite of -1 feedback  :-)
>
>
> @*JB*Δ <http://jbigdata.fr>
>
>
> 2018-02-10 16:28 GMT+01:00 Philippe de Rochambeau :
>
> Hi Jörn,
> thank you for replying.
> By « path analysis », I mean « the user’s navigation from page to page on
> the website » and by « clicking trends »  I mean «  which buttons does
> he/she click and in what order ». In other words, I’d like to measure, make
> sense out of, and perhaps, predict user behavior.
>
> Philippe
>
>
> > Le 10 févr. 2018 à 16:03, Jörn Franke  a écrit :
> >
> > What do you mean by path analysis and clicking trends?
> >
> > If you want to use typical graph algorithm such as longest path,
> shortest path (to detect issues with your navigation page) or page rank
> then probably yes. Similarly if you do a/b testing to compare if you sell
> more with different navigation or product proposals.
> >
> > Really depends your analysis. Only if it looks like a graph does not
> mean you need to do graph analysis .
> > Then another critical element is how to visualize the results of your
> graph analysis (does not have to be a graph to visualize, but it could be
> also a table with if/then rules , eg if product placed at top right then
> 50% more people buy it).
> >
> > However if you want to do some other  analysis such as random forests or
> Markov chains then graphx alone will not help you much.
> >
> >> On 10. Feb 2018, at 15:49, Philippe de Rochambeau 
> wrote:
> >>
> >> Hello,
> >>
> >> Let’s say a website log is structured as follows:
> >>
> >> ;;;
> >>
> >> eg.
> >>
> >> 2018-01-02 12:00:00;OKK;PAG1;1234555
> >> 2018-01-02 12:01:01;NEX;PAG1;1234555
> >> 2018-01-02 12:00:02;OKK;PAG1;5556667
> >> 2018-01-02 12:01:03;NEX;PAG1;5556667
> >>
> >> where OKK stands for the OK Button on Page 1, NEX, the Next Button on
> Page 2, …
> >>
> >> Is GraphX the appropriate tool to analyse the website users’ paths and
> clicking trends,
> >>
> >> Many thanks.
> >>
> >> Philippe
> >>
> >> -
> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>


Re: Log analysis with GraphX

2018-02-21 Thread JB Data
Hi,

Interesting discussion, let me add my *shell* point of view.
My focus is only Prediction, to avoid pure DS to "crier aux loups", I warn
how simple my *datayse* of the problem is :
- No use of the button in the model, only page navigation.
- User navigation 're-init' when click on a page ever clicked
- My dataset is only 8 rows large !
2018-01-02 12:00:00;OKK;PAG1;1234555
2018-01-02 12:01:01;NEX;PAG2;1234555
2018-01-02 12:00:02;OKK;PAG1;5556667
2018-01-02 12:01:03;NEX;PAG3;5556667
2018-01-02 12:01:04;OKK;PAG3;1234555
2018-01-02 12:01:04;NEX;PAG1;1234555
2018-01-02 12:01:04;NEX;PAG3;1234555
2018-01-02 12:01:04;OKK;PAG2;5556667

Anyway... After 250 python lines...
*Regression with SKlearn*
Mean squared error: 2.39
Variance score: -0.54
*Regression with Keras*
Results: -4.60 (3.83) MSE

No doubt that increasing *wc -l* will increase *MSE*.
DL is the nowdays magic wand that everyone wants to shake above data
But mastering the wand is not for everyone (myself included), use wand with
parsimony...

I create this group <https://github.com/dev2score/python/issues/1> as a
prerequisite of -1 feedback  :-)


@*JB*Δ <http://jbigdata.fr>


2018-02-10 16:28 GMT+01:00 Philippe de Rochambeau :

> Hi Jörn,
> thank you for replying.
> By « path analysis », I mean « the user’s navigation from page to page on
> the website » and by « clicking trends »  I mean «  which buttons does
> he/she click and in what order ». In other words, I’d like to measure, make
> sense out of, and perhaps, predict user behavior.
>
> Philippe
>
>
> > Le 10 févr. 2018 à 16:03, Jörn Franke  a écrit :
> >
> > What do you mean by path analysis and clicking trends?
> >
> > If you want to use typical graph algorithm such as longest path,
> shortest path (to detect issues with your navigation page) or page rank
> then probably yes. Similarly if you do a/b testing to compare if you sell
> more with different navigation or product proposals.
> >
> > Really depends your analysis. Only if it looks like a graph does not
> mean you need to do graph analysis .
> > Then another critical element is how to visualize the results of your
> graph analysis (does not have to be a graph to visualize, but it could be
> also a table with if/then rules , eg if product placed at top right then
> 50% more people buy it).
> >
> > However if you want to do some other  analysis such as random forests or
> Markov chains then graphx alone will not help you much.
> >
> >> On 10. Feb 2018, at 15:49, Philippe de Rochambeau 
> wrote:
> >>
> >> Hello,
> >>
> >> Let’s say a website log is structured as follows:
> >>
> >> ;;;
> >>
> >> eg.
> >>
> >> 2018-01-02 12:00:00;OKK;PAG1;1234555
> >> 2018-01-02 12:01:01;NEX;PAG1;1234555
> >> 2018-01-02 12:00:02;OKK;PAG1;5556667
> >> 2018-01-02 12:01:03;NEX;PAG1;5556667
> >>
> >> where OKK stands for the OK Button on Page 1, NEX, the Next Button on
> Page 2, …
> >>
> >> Is GraphX the appropriate tool to analyse the website users’ paths and
> clicking trends,
> >>
> >> Many thanks.
> >>
> >> Philippe
> >>
> >> -
> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Does Pyspark Support Graphx?

2018-02-19 Thread xiaobo
When using the --jars option, we have to include it every time we submit a job; 
it seems that adding the jars to the classpath on every slave node is the only way 
to "install" spark packages.




-- Original --
From: Nicholas Hakobian 
Date: Tue,Feb 20,2018 3:37 AM
To: xiaobo 
Cc: Denny Lee , user@spark.apache.org 

Subject: Re: Does Pyspark Support Graphx?



If you copy the Jar file and all of the dependencies to the machines, you can 
manually add them to the classpath. If you are using Yarn and HDFS you can 
alternatively use --jars and point it to the hdfs locations of the jar files 
and it will (in most cases) distribute them to the worker nodes at job 
submission time.

Nicholas Szandor Hakobian, Ph.D.Staff Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com














On Sun, Feb 18, 2018 at 7:24 PM, xiaobo  wrote:
Another question is how to install graphframes permanently when the spark nodes 
can not connect to the internet.




-- Original --
From: Denny Lee 
Date: Mon,Feb 19,2018 10:23 AM
To: xiaobo 
Cc: user@spark.apache.org 
Subject: Re: Does Pyspark Support Graphx?



Note the --packages option works for both PySpark and Spark (Scala).  For the 
SparkLauncher class, you should be able to include packages ala:

spark.addSparkArg("--packages", "graphframes:0.5.0-spark2.0-s_2.11")


On Sun, Feb 18, 2018 at 3:30 PM xiaobo  wrote:

Hi Denny,
The pyspark script uses the --packages option to load graphframe library, what 
about the SparkLauncher class? 




-- Original --
From: Denny Lee 
Date: Sun,Feb 18,2018 11:07 AM
To: 94035420 
Cc: user@spark.apache.org 



Subject: Re: Does Pyspark Support Graphx?



That's correct - you can use GraphFrames though as it does support PySpark.  
On Sat, Feb 17, 2018 at 17:36 94035420  wrote:

I can not find anything for graphx module in the python API document, does it 
mean it is not supported yet?

Re: Does Pyspark Support Graphx?

2018-02-19 Thread Nicholas Hakobian
If you copy the Jar file and all of the dependencies to the machines, you
can manually add them to the classpath. If you are using Yarn and HDFS you
can alternatively use --jars and point it to the hdfs locations of the jar
files and it will (in most cases) distribute them to the worker nodes at
job submission time.
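As a concrete (hypothetical) illustration of the same idea through SparkLauncher,
with placeholder paths, class names and jar versions:

import org.apache.spark.launcher.SparkLauncher

new SparkLauncher()
  .setAppResource("/opt/jobs/my-graph-job.jar")                       // the application jar
  .setMainClass("com.example.MyGraphJob")
  .addJar("hdfs:///libs/graphframes-0.5.0-spark2.0-s_2.11.jar")       // pre-copied dependency
  .setMaster("yarn")
  .startApplication()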


Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com


On Sun, Feb 18, 2018 at 7:24 PM, xiaobo  wrote:

> Another question is how to install graphframes permanently when the spark
> nodes can not connect to the internet.
>
>
>
> -- Original --
> *From:* Denny Lee 
> *Date:* Mon,Feb 19,2018 10:23 AM
> *To:* xiaobo 
> *Cc:* user@spark.apache.org 
> *Subject:* Re: Does Pyspark Support Graphx?
>
> Note the --packages option works for both PySpark and Spark (Scala).  For
> the SparkLauncher class, you should be able to include packages ala:
>
> spark.addSparkArg("--packages", "graphframes:0.5.0-spark2.0-s_2.11")
>
>
> On Sun, Feb 18, 2018 at 3:30 PM xiaobo  wrote:
>
>> Hi Denny,
>> The pyspark script uses the --packages option to load graphframe library,
>> what about the SparkLauncher class?
>>
>>
>>
>> -- Original --
>> *From:* Denny Lee 
>> *Date:* Sun,Feb 18,2018 11:07 AM
>> *To:* 94035420 
>> *Cc:* user@spark.apache.org 
>> *Subject:* Re: Does Pyspark Support Graphx?
>> That’s correct - you can use GraphFrames though as it does support
>> PySpark.
>> On Sat, Feb 17, 2018 at 17:36 94035420  wrote:
>>
>>> I can not find anything for graphx module in the python API document,
>>> does it mean it is not supported yet?
>>>
>>


Re: Does Pyspark Support Graphx?

2018-02-18 Thread xiaobo
Another question is how to install graphframes permanently when the spark nodes 
can not connect to the internet.




-- Original --
From: Denny Lee 
Date: Mon,Feb 19,2018 10:23 AM
To: xiaobo 
Cc: user@spark.apache.org 
Subject: Re: Does Pyspark Support Graphx?



Note the --packages option works for both PySpark and Spark (Scala).  For the 
SparkLauncher class, you should be able to include packages ala:

spark.addSparkArg("--packages", "graphframes:0.5.0-spark2.0-s_2.11")


On Sun, Feb 18, 2018 at 3:30 PM xiaobo  wrote:

Hi Denny,
The pyspark script uses the --packages option to load graphframe library, what 
about the SparkLauncher class? 




-- Original --
From: Denny Lee 
Date: Sun,Feb 18,2018 11:07 AM
To: 94035420 
Cc: user@spark.apache.org 



Subject: Re: Does Pyspark Support Graphx?



That's correct - you can use GraphFrames though as it does support PySpark.  
On Sat, Feb 17, 2018 at 17:36 94035420  wrote:

I can not find anything for graphx module in the python API document, does it 
mean it is not supported yet?

Re: Does Pyspark Support Graphx?

2018-02-18 Thread Denny Lee
Note the --packages option works for both PySpark and Spark (Scala).  For
the SparkLauncher class, you should be able to include packages ala:

spark.addSparkArg("--packages", "graphframes:0.5.0-spark2.0-s_2.11")


On Sun, Feb 18, 2018 at 3:30 PM xiaobo  wrote:

> Hi Denny,
> The pyspark script uses the --packages option to load graphframe library,
> what about the SparkLauncher class?
>
>
>
> -- Original --
> *From:* Denny Lee 
> *Date:* Sun,Feb 18,2018 11:07 AM
> *To:* 94035420 
> *Cc:* user@spark.apache.org 
> *Subject:* Re: Does Pyspark Support Graphx?
> That’s correct - you can use GraphFrames though as it does support
> PySpark.
> On Sat, Feb 17, 2018 at 17:36 94035420  wrote:
>
>> I can not find anything for graphx module in the python API document,
>> does it mean it is not supported yet?
>>
>


Re: Does Pyspark Support Graphx?

2018-02-18 Thread xiaobo
Hi Denny,
The pyspark script uses the --packages option to load graphframe library, what 
about the SparkLauncher class? 




-- Original --
From: Denny Lee 
Date: Sun,Feb 18,2018 11:07 AM
To: 94035420 
Cc: user@spark.apache.org 
Subject: Re: Does Pyspark Support Graphx?



That's correct - you can use GraphFrames though as it does support PySpark.  
On Sat, Feb 17, 2018 at 17:36 94035420  wrote:

I can not find anything for graphx module in the python API document, does it 
mean it is not supported yet?

Re: Does Pyspark Support Graphx?

2018-02-18 Thread Felix Cheung
Hi - I’m maintaining it. As of now there is an issue with 2.2 that breaks 
personalized page rank, and that’s largely the reason there isn’t a release for 
2.2 support.

There are attempts to address this issue - if you are interested we would love 
for your help.


From: Nicolas Paris 
Sent: Sunday, February 18, 2018 12:31:27 AM
To: Denny Lee
Cc: xiaobo; user@spark.apache.org
Subject: Re: Does Pyspark Support Graphx?

> Most likely not as most of the effort is currently on GraphFrames  - a great
> blog post on the what GraphFrames offers can be found at: https://

Is the graphframes package still active ? The github repository
indicates it's not extremelly active. Right now, there is no available
package for spark-2.2 so that one need to compile it from sources.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Does Pyspark Support Graphx?

2018-02-18 Thread Nicolas Paris
> Most likely not as most of the effort is currently on GraphFrames  - a great
> blog post on the what GraphFrames offers can be found at: https://

Is the graphframes package still active ? The github repository
indicates it's not extremely active. Right now, there is no available
package for spark-2.2 so that one need to compile it from sources.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Does Pyspark Support Graphx?

2018-02-17 Thread Denny Lee
Most likely not as most of the effort is currently on GraphFrames  - a
great blog post on the what GraphFrames offers can be found at:
https://databricks.com/blog/2016/03/03/introducing-graphframes.html.   Is
there a particular scenario or situation that you're addressing that
requires GraphX vs. GraphFrames?

On Sat, Feb 17, 2018 at 8:26 PM xiaobo  wrote:

> Thanks Denny, will it be supported in the near future?
>
>
>
> -- Original --
> *From:* Denny Lee 
> *Date:* Sun,Feb 18,2018 11:05 AM
> *To:* 94035420 
> *Cc:* user@spark.apache.org 
> *Subject:* Re: Does Pyspark Support Graphx?
>
> That’s correct - you can use GraphFrames though as it does support
> PySpark.
> On Sat, Feb 17, 2018 at 17:36 94035420  wrote:
>
>> I can not find anything for graphx module in the python API document,
>> does it mean it is not supported yet?
>>
>


Re: Does Pyspark Support Graphx?

2018-02-17 Thread xiaobo
Thanks Denny, will it be supported in the near future?




-- Original --
From: Denny Lee 
Date: Sun,Feb 18,2018 11:05 AM
To: 94035420 
Cc: user@spark.apache.org 
Subject: Re: Does Pyspark Support Graphx?



That's correct - you can use GraphFrames though as it does support PySpark.  
On Sat, Feb 17, 2018 at 17:36 94035420  wrote:

I can not find anything for graphx module in the python API document, does it 
mean it is not supported yet?

Re: Does Pyspark Support Graphx?

2018-02-17 Thread Denny Lee
That’s correct - you can use GraphFrames though as it does support PySpark.
On Sat, Feb 17, 2018 at 17:36 94035420  wrote:

> I can not find anything for graphx module in the python API document, does
> it mean it is not supported yet?
>


Does Pyspark Support Graphx?

2018-02-17 Thread 94035420
I cannot find anything for the graphx module in the Python API documentation; does it 
mean it is not supported yet?

[Spark GraphX pregel] default value for EdgeDirection not consistent between programming guide and API documentation

2018-02-13 Thread Ramon Bejar Torres

Hi,

I just wanted to note that in the API doc page for the pregel operator 
(graphX API for spark 2.2.1):


http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.GraphOps@pregel[A](A,Int,EdgeDirection)((VertexId,VD,A)%E2%87%92VD,(EdgeTriplet[VD,ED])%E2%87%92Iterator[(VertexId,A)],(A,A)%E2%87%92A)(ClassTag[A]):Graph[VD,ED] 



It says that the default value for the activeDir parameter is 
EdgeDirection.Either



However, in the GraphX programming guide:

http://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api 



in the type signature for the pregel operator the default value 
indicated is EdgeDirection.Out. For consistency, I think that it should 
be changed in the programming guide to the actual default value (Either).



Regards,

Ramon Béjar

DIEI - UdL


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Log analysis with GraphX

2018-02-10 Thread Philippe de Rochambeau
Hi Jörn,
thank you for replying.
By « path analysis », I mean « the user’s navigation from page to page on the 
website » and by « clicking trends »  I mean «  which buttons does he/she click 
and in what order ». In other words, I’d like to measure, make sense out of, 
and perhaps, predict user behavior.

Philippe


> Le 10 févr. 2018 à 16:03, Jörn Franke  a écrit :
> 
> What do you mean by path analysis and clicking trends?
> 
> If you want to use typical graph algorithm such as longest path, shortest 
> path (to detect issues with your navigation page) or page rank then probably 
> yes. Similarly if you do a/b testing to compare if you sell more with 
> different navigation or product proposals. 
> 
> Really depends your analysis. Only if it looks like a graph does not mean you 
> need to do graph analysis . 
> Then another critical element is how to visualize the results of your graph 
> analysis (does not have to be a graph to visualize, but it could be also a 
> table with if/then rules , eg if product placed at top right then 50% more 
> people buy it). 
> 
> However if you want to do some other  analysis such as random forests or 
> Markov chains then graphx alone will not help you much.
> 
>> On 10. Feb 2018, at 15:49, Philippe de Rochambeau  wrote:
>> 
>> Hello,
>> 
>> Let’s say a website log is structured as follows:
>> 
>> ;;;
>> 
>> eg.
>> 
>> 2018-01-02 12:00:00;OKK;PAG1;1234555
>> 2018-01-02 12:01:01;NEX;PAG1;1234555
>> 2018-01-02 12:00:02;OKK;PAG1;5556667
>> 2018-01-02 12:01:03;NEX;PAG1;5556667
>> 
>> where OKK stands for the OK Button on Page 1, NEX, the Next Button on Page 
>> 2, …
>> 
>> Is GraphX the appropriate tool to analyse the website users’ paths and 
>> clicking trends, 
>> 
>> Many thanks.
>> 
>> Philippe
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Log analysis with GraphX

2018-02-10 Thread Jörn Franke
What do you mean by path analysis and clicking trends?

If you want to use typical graph algorithms such as longest path, shortest path 
(to detect issues with your navigation pages) or PageRank, then probably yes. 
Similarly if you do A/B testing to compare whether you sell more with different 
navigation or product proposals. 

It really depends on your analysis. Just because the data looks like a graph does 
not mean you need to do graph analysis. 
Then another critical element is how to visualize the results of your graph 
analysis (it does not have to be a graph visualization; it could also be a table 
of if/then rules, e.g. if a product is placed at the top right then 50% more 
people buy it). 

However, if you want to do some other analysis such as random forests or Markov 
chains, then GraphX alone will not help you much.

> On 10. Feb 2018, at 15:49, Philippe de Rochambeau  wrote:
> 
> Hello,
> 
> Let’s say a website log is structured as follows:
> 
> <date-time>;<button>;<page>;<visitor-id>
> 
> eg.
> 
> 2018-01-02 12:00:00;OKK;PAG1;1234555
> 2018-01-02 12:01:01;NEX;PAG1;1234555
> 2018-01-02 12:00:02;OKK;PAG1;5556667
> 2018-01-02 12:01:03;NEX;PAG1;5556667
> 
> where OKK stands for the OK Button on Page 1, NEX, the Next Button on Page 2, 
> …
> 
> Is GraphX the appropriate tool to analyse the website users’ paths and 
> clicking trends?
> 
> Many thanks.
> 
> Philippe
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Log analysis with GraphX

2018-02-10 Thread Philippe de Rochambeau
Hello,

Let’s say a website log is structured as follows:

<date-time>;<button>;<page>;<visitor-id>

eg.

2018-01-02 12:00:00;OKK;PAG1;1234555
2018-01-02 12:01:01;NEX;PAG1;1234555
2018-01-02 12:00:02;OKK;PAG1;5556667
2018-01-02 12:01:03;NEX;PAG1;5556667

where OKK stands for the OK Button on Page 1, NEX, the Next Button on Page 2, …

Is GraphX the appropriate tool to analyse the website users’ paths and clicking 
trends?

Many thanks.

Philippe

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Apache Spark GraphX: java.lang.ArrayIndexOutOfBoundsException: -1

2017-10-16 Thread Andy Long
We have hit a bug with GraphX when calling the connectedComponents function,
where it fails with the following error:
java.lang.ArrayIndexOutOfBoundsException: -1

I've found this bug report: https://issues.apache.org/jira/browse/SPARK-5480

Has anyone else hit this issue and, if so, how did you fix it or work
around it? I'm running on Spark 1.6.2 with Scala 2.10

Stack trace from the Shell:

17/10/13 17:05:58 ERROR TaskSetManager: Task 12 in stage 2036.0 failed 4
times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12
in stage 2036.0 failed 4 times, most recent failure: Lost task 12.3 in stage
2036.0 (TID 106840, cl-bigdata5.hosting.dbg.internal):
java.lang.ArrayIndexOutOfBoundsException: -1
at
org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
at
org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
at
org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
at
org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at
org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:120)
at
org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:118)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
at org.apache.spark.util.EventLo

Re: Spark 2.1.1 Graphx graph loader GC overhead error

2017-07-11 Thread Aritra Mandal
yncxcw wrote
> hi,
> 
> I think if the OOM occurs before the computation begins, the input data is
> probably too big to fit in memory. I remembered that the graph data would
> expand when loading the data input memory. And the scale of expanding is
> pretty huge( based on my experiment on Pagerank).
> 
> 
> Wei  Chen

Hello Wei,

Thanks for the suggestions.

I tried this small piece of code with StorageLevel.MEMORY_AND_DISK I removed
the pregel call just to test.
But still the code failed with OOM in the graphload stage

val ygraph = GraphLoader.edgeListFile(sc, args(1), true, 32,
  StorageLevel.MEMORY_AND_DISK, StorageLevel.MEMORY_AND_DISK)
  .partitionBy(PartitionStrategy.RandomVertexCut)

println(ygraph.vertices.count())


Is there a way to calculate the maximum size of a graph that a given
configuration of the cluster can process?

Aritra





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-1-1-Graphx-graph-loader-GC-overhead-error-tp28841p28851.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.1.1 Graphx graph loader GC overhead error

2017-07-10 Thread Aritra Mandal
yncxcw wrote
> hi,
> 
> It highly depends on the algorithms you are going to apply to your data
> sets.  Graph applications are usually memory hungry and probably cause
> long
> GC or even OOM.
> 
> Suggestions include:  1. make some highly reused RDDs
> StorageLevel.MEMORY_ONLY
> and leave the rest MEMORY_AND_DISK.
> 
> 2. slightly decrease the parallelism for
> each executor.
> 
> 
> Wei Chen


Thanks for the response. I have an implementation of k-core decomposition
running using the Pregel framework.

I will try constructing the graph with StorageLevel.MEMORY_AND_DISK and 
post the outcome here.

The GC overhead error is happening even before the algorithm starts its
Pregel iterations; it is failing in the GraphLoader.edgeListFile stage.

Aritra



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-1-1-Graphx-graph-loader-GC-overhead-error-tp28841p28843.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is GraphX really deprecated?

2017-05-15 Thread Sergey Zhemzhitsky
GraphFrames seems promising but it still has a lot of algorithms which, in one
way or another, involve GraphX or run on top of GraphX according to the github
repo (
https://github.com/graphframes/graphframes/tree/master/src/main/scala/org/graphframes/lib),
and in the case of RDDs and semistructured data it's not really necessary to
include another library that will just delegate to GraphX, which is still
shipped with Spark as the default graph-processing module.

Also, doesn't the Pregel-like programming abstraction of GraphX (although it is
built on top of RDD joins) seem more natural than a number of join steps in
GraphFrames? I believe such an abstraction wouldn't hurt GraphFrames either.
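
As a rough, hedged sketch (assuming the external graphframes package is on the classpath and a SparkSession named spark), the two views convert into each other, so using GraphFrames does not lock you out of the GraphX/Pregel abstraction:

import org.apache.spark.graphx.Graph
import org.apache.spark.sql.Row
import org.graphframes.GraphFrame

val v = spark.createDataFrame(Seq((1L, "a"), (2L, "b"))).toDF("id", "attr")
val e = spark.createDataFrame(Seq((1L, 2L, "follows"))).toDF("src", "dst", "relationship")
val gf = GraphFrame(v, e)                 // DataFrame view

val gx: Graph[Row, Row] = gf.toGraphX     // drop down to the GraphX / Pregel view
println(gx.vertices.count())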



On May 14, 2017 19:07, "Jules Damji"  wrote:

GraphFrames is not part of Spark core the way Structured Streaming is; it's
still open-source and distributed as a Spark package. But as it
reaches parity with GraphX in algorithms & functionality, it's
not unreasonable to anticipate its wide adoption and preference.

To get a flavor have a go at it
https://databricks.com/blog/2016/03/03/introducing-graphframes.html

Cheers
Jules

Sent from my iPhone
Pardon the dumb thumb typos :)

On May 13, 2017, at 2:01 PM, Jacek Laskowski  wrote:

Hi,

I'd like to hear the official statement too.

My take on GraphX and Spark Streaming is that they are long dead projects
with GraphFrames and Structured Streaming taking their place, respectively.

Jacek

On 13 May 2017 3:00 p.m., "Sergey Zhemzhitsky"  wrote:

> Hello Spark users,
>
> I just would like to know whether the GraphX component should be
> considered deprecated and no longer actively maintained
> and should not be considered when starting new graph-processing projects
> on top of Spark in favour of other
> graph-processing frameworks?
>
> I'm asking because
>
> 1. According to some discussions in GitHub pull requests, there are
> thoughts that GraphX is not under active development and
> can probably be deprecated soon.
>
> https://github.com/apache/spark/pull/15125
>
> 2. According to Jira activities GraphX component seems to be not very
> active and quite a lot of improvements are
> resolved as "Won't fix" even with pull requests provided.
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20GraphX%20AND%20resolution%20in%20(%22Unresolved%22%2C%20%22Won%27t%20Fix%22%2C%20%22Won%27t%20Do%22%2C%20Later%2C%20%22Not%20A%20Bug%22%2C%20%22Not%20A%20Problem%22)%20ORDER%20BY%20created%20DESC
>
> So, I'm wondering what the community who uses GraphX, and commiters who
> develop it think regarding this Spark component?
>
> Kind regards,
> Sergey
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Is GraphX really deprecated?

2017-05-13 Thread Jacek Laskowski
Hi,

I'd like to hear the official statement too.

My take on GraphX and Spark Streaming is that they are long dead projects
with GraphFrames and Structured Streaming taking their place, respectively.

Jacek

On 13 May 2017 3:00 p.m., "Sergey Zhemzhitsky"  wrote:

> Hello Spark users,
>
> I just would like to know whether the GraphX component should be
> considered deprecated and no longer actively maintained
> and should not be considered when starting new graph-processing projects
> on top of Spark in favour of other
> graph-processing frameworks?
>
> I'm asking because
>
> 1. According to some discussions in GitHub pull requests, there are
> thoughts that GraphX is not under active development and
> can probably be deprecated soon.
>
> https://github.com/apache/spark/pull/15125
>
> 2. According to Jira activities GraphX component seems to be not very
> active and quite a lot of improvements are
> resolved as "Won't fix" even with pull requests provided.
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20GraphX%20AND%20resolution%20in%20(%22Unresolved%22%2C%20%22Won%27t%20Fix%22%2C%20%22Won%27t%20Do%22%2C%20Later%2C%20%22Not%20A%20Bug%22%2C%20%22Not%20A%20Problem%22)%20ORDER%20BY%20created%20DESC
>
> So, I'm wondering what the community who uses GraphX, and commiters who
> develop it think regarding this Spark component?
>
> Kind regards,
> Sergey
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Is GraphX really deprecated?

2017-05-13 Thread Sergey Zhemzhitsky
Hello Spark users,

I just would like to know whether the GraphX component should be considered 
deprecated and no longer actively maintained
and should not be considered when starting new graph-processing projects on top 
of Spark in favour of other
graph-processing frameworks?

I'm asking because

1. According to some discussions in GitHub pull requests, there are thoughts 
that GraphX is not under active development and
can probably be deprecated soon.

https://github.com/apache/spark/pull/15125

2. According to Jira activities GraphX component seems to be not very active 
and quite a lot of improvements are
resolved as "Won't fix" even with pull requests provided.

https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20GraphX%20AND%20resolution%20in%20(%22Unresolved%22%2C%20%22Won%27t%20Fix%22%2C%20%22Won%27t%20Do%22%2C%20Later%2C%20%22Not%20A%20Bug%22%2C%20%22Not%20A%20Problem%22)%20ORDER%20BY%20created%20DESC

So, I'm wondering what the community who uses GraphX, and commiters who develop 
it think regarding this Spark component?

Kind regards,
Sergey



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX subgraph from list of VertexIds

2017-05-12 Thread Robineast
it would be listVertices.contains(vid) wouldn't it?
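
A minimal sketch of that, assuming an existing graph and a set of wanted vertex ids (the names here are illustrative):

import org.apache.spark.graphx._

val listVertices: Set[VertexId] = Set(1L, 2L, 5L)        // the wanted ids
val wanted = sc.broadcast(listVertices)

// subgraph keeps only vertices passing vpred (and edges whose endpoints both remain)
val sub = graph.subgraph(vpred = (vid, _) => wanted.value.contains(vid))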



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-subgraph-from-list-of-VertexIds-tp28677p28679.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Research paper used in GraphX

2017-03-31 Thread Md. Rezaul Karim
Hi All,

Could anyone please tell me which research paper(s) was/were used to
implement the metrics like strongly connected components, page rank,
triangle count, closeness centrality, clustering coefficient etc. in Spark
GraphX?




Regards,
_
*Md. Rezaul Karim*, BSc, MSc
Ph.D. Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html



Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
From the section on Pregel API in the GraphX programming guide: '... the
Pregel operator in GraphX is a bulk-synchronous parallel messaging
abstraction constrained to the topology of the graph.' Does that answer
your question? Did you read the programming guide?
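
Pregel itself cannot change the topology; a hedged sketch of how vertices and edges can instead be added between Pregel runs by rebuilding the graph (the helper name is illustrative):

import scala.reflect.ClassTag
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// union the extra vertices/edges onto the old graph and rebuild it
def grow[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED],
    newVertices: RDD[(VertexId, VD)],
    newEdges: RDD[Edge[ED]]): Graph[VD, ED] =
  Graph(g.vertices.union(newVertices), g.edges.union(newEdges))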



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-API-add-vertices-and-edges-tp28519p28529.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
GraphX is not synonymous with Pregel. To quote the  GraphX programming guide
<http://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api>  
'GraphX exposes a variant of the Pregel API.'. There is no compute()
function in GraphX - see the Pregel API section of the programming guide for
details on how GraphX implements a Pregel-like API



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-API-add-vertices-and-edges-tp28519p28527.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
Not that I'm aware of. Where did you read that?



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-API-add-vertices-and-edges-tp28519p28523.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Graphx Examples for ALS

2017-02-17 Thread Irving Duran
Not sure I follow your question.  Do you want to use ALS or GraphX?
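
If it is ALS you are after, note that the recommender itself lives in Spark MLlib rather than GraphX. A minimal sketch (the file path and parameters are made up for illustration):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.textFile("ratings.csv").map(_.split(",")).map {
  case Array(user, item, rating) => Rating(user.toInt, item.toInt, rating.toDouble)
}
val model = ALS.train(ratings, 10, 10, 0.01)      // rank, iterations, lambda
model.recommendProducts(42, 5).foreach(println)   // top 5 items for user 42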


Thank You,

Irving Duran

On Fri, Feb 17, 2017 at 7:07 AM, balaji9058  wrote:

> Hi,
>
> Where can I find the ALS recommendation algorithm for a large data set?
>
> Please feel free to share your ideas/algorithms/logic to build a recommendation
> engine using Spark GraphX
>
> Thanks in advance.
>
> Thanks,
> Balaji
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Graphx-Examples-for-ALS-tp28401.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Graphx Examples for ALS

2017-02-17 Thread balaji9058
Hi,

Where can I find the ALS recommendation algorithm for a large data set?

Please feel free to share your ideas/algorithms/logic to build a recommendation
engine using Spark GraphX

Thanks in advance.

Thanks,
Balaji



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-Examples-for-ALS-tp28401.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Bipartite projection with Graphx

2017-02-03 Thread balaji9058
Hi,

Is bipartite projection possible with GraphX?

Rdd1
#id name
1   x1
2   x2
3   x3
4   x4
5   x5
6   x6
7   x7
8   x8

Rdd2
#id name
10001   y1
10002   y2
10003   y3
10004   y4
10005   y5
10006   y6

EdgeList
#src id Dest id
1   10001
1   10002
2   10001
2   10002
2   10004
3   10003
3   10005
4   10001
4   10004
5   10003
5   10005
6   10003
6   10006
7   10005
7   10006
8   10005

  val nodes = Rdd1 ++ Rdd2
  val network = Graph(nodes, links)

With the above network I need to create projection graphs like x1-x2 with a
weight (see the image at the wiki link below).
example:

https://en.wikipedia.org/wiki/Bipartite_network_projection
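
A hedged sketch of one way to compute the projection onto the x-side, assuming `links` is the edge RDD (RDD[Edge[Int]]) used in Graph(nodes, links) above and Rdd1 holds the x-side vertices keyed by VertexId: two x-nodes get a projected edge weighted by the number of y-neighbours they share.

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val projected: RDD[Edge[Int]] = links
  .map(e => (e.dstId, e.srcId))                   // group the x's under each y
  .groupByKey()
  .flatMap { case (_, xs) =>
    val s = xs.toSeq.distinct.sorted
    for (i <- s.indices; j <- (i + 1) until s.size) yield ((s(i), s(j)), 1)
  }
  .reduceByKey(_ + _)                             // shared-neighbour count = edge weight
  .map { case ((a, b), w) => Edge(a, b, w) }

val xProjection = Graph(Rdd1, projected)          // graph over the x-side only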





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Bipartite-projection-with-Graphx-tp28360.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Examples in graphx

2017-01-30 Thread Ankur Srivastava
The one issue with using Neo4j is that you need to persist the whole graph
on one single machine i.e you can not shard the graph. I am not sure what
is the size of your graph but if it is huge one way to shard could be to
use the component id to shard. You can generate a component id by running
connectedComponents on your graph in GraphX or GraphFrames.

But GraphX and GraphFrames expect the data as DataFrames (or RDDs) of vertices
and edges, and they really rely on the relational nature of these entities
to run any algorithm. AFAIK the same is the case with Giraph too, so if you want
to use GraphFrames as your processing engine you can choose to persist your
data in Hive tables and not in native graph format.
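
A minimal sketch of the sharding idea, assuming an existing Graph[VD, ED] named graph (the partitioner and the output step are illustrative):

import org.apache.spark.HashPartitioner
import org.apache.spark.graphx._

val cc = graph.connectedComponents()              // Graph[VertexId, ED]
val keyedByComponent = graph.vertices
  .join(cc.vertices)                              // (vid, (attr, componentId))
  .map { case (vid, (attr, componentId)) => (componentId, (vid, attr)) }

// e.g. one output partition per shard:
// keyedByComponent.partitionBy(new HashPartitioner(16)).saveAsTextFile("shards")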

Hope this helps.

Thanks
Ankur

On Sun, Jan 29, 2017 at 10:27 AM, Felix Cheung 
wrote:

> Which graph DB are you thinking about?
> Here's one for neo4j
>
> https://neo4j.com/blog/neo4j-3-0-apache-spark-connector/
>
> --
> *From:* Deepak Sharma 
> *Sent:* Sunday, January 29, 2017 4:28:19 AM
> *To:* spark users
> *Subject:* Examples in graphx
>
> Hi There,
> Are there any examples of using GraphX along with any graph DB?
> I am looking to persist the graph in graph based DB and then read it back
> in spark , process using graphx.
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: Examples in graphx

2017-01-29 Thread Felix Cheung
Which graph DB are you thinking about?
Here's one for neo4j

https://neo4j.com/blog/neo4j-3-0-apache-spark-connector/


From: Deepak Sharma 
Sent: Sunday, January 29, 2017 4:28:19 AM
To: spark users
Subject: Examples in graphx

Hi There,
Are there any examples of using GraphX along with any graph DB?
I am looking to persist the graph in graph based DB and then read it back in 
spark , process using graphx.

--
Thanks
Deepak
www.bigdatabig.com<http://www.bigdatabig.com>
www.keosha.net<http://www.keosha.net>


Invitation to speak about GraphX at London Opensource Graph Technologies Meetup

2017-01-29 Thread haikal
Hi everyone,

We're starting a new meetup in London: Opensource Graph Technologies. Our
goal is to increase the awareness of opensource graph technologies and their
applications to the London developer community. In the past week, there have
been 119 people signing up to the group, and we're hoping to continue
growing the community with the series of talks that we'll be holding.

The first meetup we're planning to host is during the week of the 6th of
March, in Central London. We would like to include GraphX as one of the
technologies being introduced to the London developer community.

Is anyone interested in giving a 20-minute talk, introducing GraphX (any
general or specific application of it) to the community?

Below is the link to the meetup group and page that we've set up. We'll be
updating the details and speakers list in the coming days, but I would very
much love to reserve at least one talk for GraphX!

https://www.meetup.com/graphs/events/237191885/




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/nvitation-to-speak-about-GraphX-at-London-Opensource-Graph-Technologies-Meetup-tp28342.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Examples in graphx

2017-01-29 Thread Deepak Sharma
Hi There,
Are there any examples of using GraphX along with any graph DB?
I am looking to persist the graph in graph based DB and then read it back
in spark , process using graphx.

-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Shortest path performance in Graphx with Spark

2017-01-11 Thread Irving Duran
Hi Gerard,
How are you starting Spark? Are you allocating enough RAM for processing? I
think the default is 512 MB. Try doing the following and see if it helps
(based on the size of your dataset, you might not need all 8 GB).

$SPARK_HOME/bin/spark-shell \
  --master local[4] \
  --executor-memory 8G \
  --driver-memory 8G



Thank You,

Irving Duran

On Tue, Jan 10, 2017 at 12:20 PM, Gerard Casey 
wrote:

> Hello everyone,
>
> I am creating a graph from a `gz` compressed `json` file of `edge` and
> `vertices` type.
>
> I have put the files in a dropbox folder [here][1]
>
> I load and map these `json` records to create the `vertices` and `edge`
> types required by `graphx` like this:
>
> val vertices_raw = sqlContext.read.json("path/vertices.json.gz")
> val vertices = vertices_raw.rdd.map(row=> ((row.getAs[String]("toid").
> stripPrefix("osgb").toLong),row.getAs[Long]("index")))
> val verticesRDD: RDD[(VertexId, Long)] = vertices
> val edges_raw = sqlContext.read.json("path/edges.json.gz")
> val edgesRDD = edges_raw.rdd.map(row=>(Edge(row.getAs[String]("
> positiveNode").stripPrefix("osgb").toLong, row.getAs[String]("
> negativeNode").stripPrefix("osgb").toLong, row.getAs[Double]("length"
> val my_graph: Graph[(Long),Double] = Graph.apply(verticesRDD,
> edgesRDD).partitionBy(PartitionStrategy.RandomVertexCut)
>
> I then use this `dijkstra` implementation I found to compute a shortest
> path between two vertices:
>
> def dijkstra[VD](g: Graph[VD, Double], origin: VertexId) = {
>   var g2 = g.mapVertices(
> (vid, vd) => (false, if (vid == origin) 0 else
> Double.MaxValue, List[VertexId]())
>   )
>   for (i <- 1L to g.vertices.count - 1) {
> val currentVertexId: VertexId =
> g2.vertices.filter(!_._2._1)
>   .fold((0L, (false, Double.MaxValue, List[VertexId]())))(
> (a, b) => if (a._2._2 < b._2._2) a else b)
>   ._1
>
> val newDistances: VertexRDD[(Double, List[VertexId])] =
>   g2.aggregateMessages[(Double, List[VertexId])](
> ctx => if (ctx.srcId == currentVertexId) {
>   ctx.sendToDst((ctx.srcAttr._2 + ctx.attr, ctx.srcAttr._3
> :+ ctx.srcId))
> },
> (a, b) => if (a._1 < b._1) a else b
>   )
> g2 = g2.outerJoinVertices(newDistances)((vid, vd, newSum) => {
>   val newSumVal = newSum.getOrElse((Double.MaxValue,
> List[VertexId]()))
>   (
> vd._1 || vid == currentVertexId,
> math.min(vd._2, newSumVal._1),
> if (vd._2 < newSumVal._1) vd._3 else newSumVal._2
> )
> })
> }
>
>   g.outerJoinVertices(g2.vertices)((vid, vd, dist) =>
> (vd, dist.getOrElse((false, Double.MaxValue, List[VertexId]()))
>   .productIterator.toList.tail
>   ))
> }
>
> I take two random vertex id's:
>
> val v1 = 400028222916L
> val v2 = 400031019012L
>
> and compute the path between them:
>
> val results = dijkstra(my_graph, v1).vertices.map(_._2).collect
>
> I am unable to compute this locally on my laptop without getting a
> stackoverflow error. I have 8GB RAM and 2.6 GHz Intel Core i5 processor. I
> can see that it is using 3 out of 4 cores available. I can load this graph
> and compute shortest on average around 10 paths per second with the
> `igraph` library in Python on exactly the same graph. Is this an
> inefficient means of computing paths? At scale, on multiple nodes the paths
> will compute (no stackoverflow error) but it is still 30/40seconds per path
> computation. I must be missing something.
>
> Thanks
>
>   [1]: https://www.dropbox.com/sh/9ug5ikr6j357q7j/AACDBR9UdM0g_
> ck_ykB8KXPXa?dl=0
>


Shortest path performance in Graphx with Spark

2017-01-10 Thread Gerard Casey
Hello everyone,

I am creating a graph from a `gz` compressed `json` file of `edge` and 
`vertices` type.

I have put the files in a dropbox folder [here][1]

I load and map these `json` records to create the `vertices` and `edge` types 
required by `graphx` like this:

val vertices_raw = sqlContext.read.json("path/vertices.json.gz")
val vertices = vertices_raw.rdd.map(row=> 
((row.getAs[String]("toid").stripPrefix("osgb").toLong),row.getAs[Long]("index")))
val verticesRDD: RDD[(VertexId, Long)] = vertices
val edges_raw = sqlContext.read.json("path/edges.json.gz")
val edgesRDD = 
edges_raw.rdd.map(row=>(Edge(row.getAs[String]("positiveNode").stripPrefix("osgb").toLong,
 row.getAs[String]("negativeNode").stripPrefix("osgb").toLong, 
row.getAs[Double]("length"))))
val my_graph: Graph[(Long),Double] = Graph.apply(verticesRDD, 
edgesRDD).partitionBy(PartitionStrategy.RandomVertexCut)

I then use this `dijkstra` implementation I found to compute a shortest path 
between two vertices:

def dijkstra[VD](g: Graph[VD, Double], origin: VertexId) = {
  var g2 = g.mapVertices(
(vid, vd) => (false, if (vid == origin) 0 else Double.MaxValue, 
List[VertexId]())
  )
  for (i <- 1L to g.vertices.count - 1) {
val currentVertexId: VertexId = g2.vertices.filter(!_._2._1)
  .fold((0L, (false, Double.MaxValue, List[VertexId]())))(
    (a, b) => if (a._2._2 < b._2._2) a else b)
  ._1

val newDistances: VertexRDD[(Double, List[VertexId])] =
  g2.aggregateMessages[(Double, List[VertexId])](
ctx => if (ctx.srcId == currentVertexId) {
  ctx.sendToDst((ctx.srcAttr._2 + ctx.attr, ctx.srcAttr._3 :+ 
ctx.srcId))
},
(a, b) => if (a._1 < b._1) a else b
  )
g2 = g2.outerJoinVertices(newDistances)((vid, vd, newSum) => {
  val newSumVal = newSum.getOrElse((Double.MaxValue, 
List[VertexId]()))
  (
vd._1 || vid == currentVertexId,
math.min(vd._2, newSumVal._1),
if (vd._2 < newSumVal._1) vd._3 else newSumVal._2
)
})
}

  g.outerJoinVertices(g2.vertices)((vid, vd, dist) =>
(vd, dist.getOrElse((false, Double.MaxValue, List[VertexId]()))
  .productIterator.toList.tail
  ))
}

I take two random vertex id's:

val v1 = 400028222916L
val v2 = 400031019012L

and compute the path between them:

val results = dijkstra(my_graph, v1).vertices.map(_._2).collect

I am unable to compute this locally on my laptop without getting a 
stack overflow error. I have 8GB RAM and a 2.6 GHz Intel Core i5 processor. I can 
see that it is using 3 out of 4 available cores. I can load this graph and 
compute shortest paths at around 10 paths per second on average with the `igraph` 
library in Python on exactly the same graph. Is this an inefficient means of 
computing paths? At scale, on multiple nodes, the paths will compute (no 
stack overflow error) but it is still 30-40 seconds per path computation. I must 
be missing something. 

Thanks 

  [1]: https://www.dropbox.com/sh/9ug5ikr6j357q7j/AACDBR9UdM0g_ck_ykB8KXPXa?dl=0
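
For reference, a hedged alternative sketch close to the SSSP example in the GraphX programming guide: a Pregel loop converges in a number of supersteps bounded by the graph diameter and avoids the driver-side loop over g.vertices.count iterations above, which is what typically causes the stack overflow and the per-path cost (this version returns distances only; path reconstruction is omitted).

import org.apache.spark.graphx._

def sssp(g: Graph[Long, Double], origin: VertexId): Graph[Double, Double] = {
  // start every vertex at infinity except the origin
  val init = g.mapVertices((vid, _) => if (vid == origin) 0.0 else Double.PositiveInfinity)
  init.pregel(Double.PositiveInfinity)(
    (_, dist, msg) => math.min(dist, msg),                   // keep the shortest distance seen
    t => if (t.srcAttr + t.attr < t.dstAttr)                 // relax outgoing edges
           Iterator((t.dstId, t.srcAttr + t.attr))
         else Iterator.empty,
    (a, b) => math.min(a, b))                                // merge candidate distances
}

// e.g. sssp(my_graph, v1).vertices.filter(_._1 == v2).collect()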

[Spark GraphX] Graph Aggregation

2017-01-04 Thread Will Swank
Hi All - I'm new to Spark and GraphX and I'm trying to perform a
simple sum operation for a graph.  I have posted this question to
StackOverflow and also on the gitter channel to no avail.  I'm
wondering if someone can help me out.  The StackOverflow link is here:
http://stackoverflow.com/questions/41451947/spark-graphx-aggregation-summation

The problem as posted is included below (formatting is better on SO):

I'm trying to compute the sum of node values in a spark graphx graph.
In short the graph is a tree and the top node (root) should sum all
children and their children. My graph is actually a tree that looks
like this and the expected summed value should be 1850:

                 VertexId 20
                 Value: sum of 11 & 911   (expected: 1850)
                /            \
      VertexId 11             VertexId 911
      Value: sum of 14 & 24   Value: 300
      /             \
VertexId 14          VertexId 24
Value: 1000          Value: 550

The first stab at this looks like this:

val vertices: RDD[(VertexId, Int)] =
  sc.parallelize(Array((20L, 0)
, (11L, 0)
, (14L, 1000)
, (24L, 550)
, (911L, 300)
  ))

  //note that the last value in the edge is for factor (positive or negative)
val edges: RDD[Edge[Int]] =
  sc.parallelize(Array(
Edge(14L, 11L, 1),
Edge(24L, 11L, 1),
Edge(11L, 20L, 1),
Edge(911L, 20L, 1)
  ))

val dataItemGraph = Graph(vertices, edges)


val sum: VertexRDD[(Int, BigDecimal, Int)] =
dataItemGraph.aggregateMessages[(Int, BigDecimal, Int)](
  sendMsg = { triplet => triplet.sendToDst(1, triplet.srcAttr, 1) },
  mergeMsg = { (a, b) => (a._1, a._2 * a._3 + b._2 * b._3, 1) }
)

sum.collect.foreach(println)

This returns the following:

(20,(1,300,1))
(11,(1,1550,1))

It's doing the sum for vertex 11 but it's not rolling up to the root
node (vertex 20). What am I missing or is there a better way of doing
this? Of course the tree can be of arbitrary size and each vertex can
have an arbitrary number of children edges.
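
One observation plus a hedged sketch: aggregateMessages is a single message round, so a value can only move one hop; rolling the whole tree up to the root needs an iterative computation such as a Pregel loop in which a vertex reports upward once all of its children have reported. The sketch below uses the toy graph above but not the exact semantics of the post (BigDecimal and the sign factor are left out).

import org.apache.spark.graphx._
import org.apache.spark.{SparkConf, SparkContext}

object TreeSum {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tree-sum").setMaster("local[*]"))

    val vertices = sc.parallelize(Seq(
      (20L, 0L), (11L, 0L), (14L, 1000L), (24L, 550L), (911L, 300L)))
    val edges = sc.parallelize(Seq(
      Edge(14L, 11L, 1), Edge(24L, 11L, 1), Edge(11L, 20L, 1), Edge(911L, 20L, 1)))
    val g = Graph(vertices, edges)

    // vertex state: (subtree sum so far, number of children that have not reported yet)
    val init = g.outerJoinVertices(g.inDegrees) { (_, value, inDeg) => (value, inDeg.getOrElse(0)) }

    val summed = init.pregel((0L, 0), activeDirection = EdgeDirection.Out)(
      // fold the reported child sums in and note how many children have reported
      (_, state, msg) => (state._1 + msg._1, state._2 - msg._2),
      // a vertex reports to its parent once all of its children have reported
      t => if (t.srcAttr._2 == 0) Iterator((t.dstId, (t.srcAttr._1, 1))) else Iterator.empty,
      // combine sibling reports
      (a, b) => (a._1 + b._1, a._2 + b._2))

    summed.vertices.collect.foreach(println)   // vertex 20 ends up with sum 1850
    sc.stop()
  }
}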

Thanks

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Graphx with Database

2016-12-30 Thread Felix Cheung
You might want to check out GraphFrames - to load database data (as Spark 
DataFrame) and build graphs with them


https://github.com/graphframes/graphframes
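
A minimal sketch of that (the JDBC URL, table and column names are made up; it assumes a SparkSession named spark and the graphframes package on the classpath):

import org.graphframes.GraphFrame

val vertices = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/graphdb")
  .option("dbtable", "vertices")          // must expose an "id" column
  .load()
val edges = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/graphdb")
  .option("dbtable", "edges")             // must expose "src" and "dst" columns
  .load()

val g = GraphFrame(vertices, edges)
g.pageRank.resetProbability(0.15).maxIter(10).run().vertices.show()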

_
From: balaji9058 <kssb...@gmail.com>
Sent: Monday, December 26, 2016 9:27 PM
Subject: Spark Graphx with Database
To: <user@spark.apache.org>


Hi All,

I would like to know about Spark GraphX execution/processing with a
database. Yes, I understand Spark GraphX does in-memory processing, and to some
extent we can manage querying, but I would like to do much more complex queries
or processing. Please suggest a use case or steps for the same.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Graphx-with-Database-tp28253.html
Sent from the Apache Spark User List mailing list archive at 
Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org





Spark Graphx with Database

2016-12-26 Thread balaji9058
Hi All,

I would like to know about Spark GraphX execution/processing with a
database. Yes, I understand Spark GraphX does in-memory processing, and to some
extent we can manage querying, but I would like to do much more complex queries
or processing. Please suggest a use case or steps for the same.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Graphx-with-Database-tp28253.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to load edge with properties file useing GraphX

2016-12-15 Thread Felix Cheung
Have you checked out https://github.com/graphframes/graphframes?

It might be easier to work with DataFrame.



From: zjp_j...@163.com 
Sent: Thursday, December 15, 2016 7:23:57 PM
To: user
Subject: How to load edge with properties file useing GraphX

Hi,
   I want to load an edge file and a vertex attribute file as follows; how can I
use these two files to create a Graph?
  edge file -> "SrcId,DestId,properties..."   vertex attribute file -> "VID,
properties..."

   I learned that there is a GraphLoader object that can load an edge file
with no properties and then join vertex properties to create a Graph. So the
issue is how to then attach edge properties.

   Thanks.


zjp_j...@163.com


How to load edge with properties file useing GraphX

2016-12-15 Thread zjp_j...@163.com
Hi,
   I want to load an edge file and a vertex attribute file as follows; how can I
use these two files to create a Graph?
  edge file -> "SrcId,DestId,properties..."   vertex attribute file -> "VID,
properties..."

   I learned that there is a GraphLoader object that can load an edge file
with no properties and then join vertex properties to create a Graph. So the
issue is how to then attach edge properties.
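
A hedged sketch of one way to stay in GraphX: parse both files into RDDs and use Graph.apply, which keeps the edge properties, instead of GraphLoader (the file layout is assumed from the description above; paths are made up):

import org.apache.spark.graphx._

val edges = sc.textFile("edges.csv").map { line =>
  val f = line.split(",", 3)
  Edge(f(0).trim.toLong, f(1).trim.toLong, f(2))    // SrcId, DestId, edge properties
}
val verts = sc.textFile("vertices.csv").map { line =>
  val f = line.split(",", 2)
  (f(0).trim.toLong, f(1))                          // VID, vertex properties
}
val graph: Graph[String, String] = Graph(verts, edges)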

   Thanks.



zjp_j...@163.com


Re: Graphx triplet comparison

2016-12-14 Thread Robineast
You are trying to invoke 1 RDD action inside another, that won't work. If you
want to do what you are attempting you need to .collect() each triplet to
the driver and iterate over that.

HOWEVER you almost certainly don't want to do that, not if your data are
anything other than a trivial size. In essence you are doing a cartesian
join followed by a filter - that doesn't scale. You might want to consider
joining one triplet RDD to another and then evaluating the condition.
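
A rough sketch of that join approach, assuming the subGraph/mainGraph and the `name` vertex attribute from the code later in this thread: key both triplet RDDs by the value being compared and join them, instead of nesting one triplets loop inside another.

val subByName  = subGraph.triplets.map(t => (t.dstAttr.name, t))
val mainByName = mainGraph.triplets.map(t => (t.dstAttr.name, t))

// one pair per matching destination name; evaluate "condition3" inside the closure
subByName.join(mainByName).foreach { case (name, (subT, mainT)) =>
  println(s"matched destination $name")
}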



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-triplet-comparison-tp28198p28208.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Graphx triplet comparison

2016-12-13 Thread balaji9058
Hi, thanks for the reply.

Here is my code:
class BusStopNode(val name: String,val mode:String,val maxpasengers :Int)
extends Serializable
case class busstop(override val name: String,override val mode:String,val
shelterId: String, override val maxpasengers :Int) extends
BusStopNode(name,mode,maxpasengers) with Serializable
case class busNodeDetails(override val name: String,override val
mode:String,val srcId: Int,val destId :Int,val arrivalTime :Int,override val
maxpasengers :Int) extends BusStopNode(name,mode,maxpasengers) with
Serializable
case class routeDetails(override val name: String,override val
mode:String,val srcId: Int,val destId :Int,override val maxpasengers :Int)
extends BusStopNode(name,mode,maxpasengers) with Serializable

val busstopRDD: RDD[(VertexId, BusStopNode)] =
  sc.textFile("\\BusStopNameMini.txt").filter(!_.startsWith("#")).
map { line =>
  val row = line split ","
  (row(0).toInt, new
busstop(row(0),row(3),row(1)+row(0),row(2).toInt))
}

busstopRDD.foreach(println)

val busNodeDetailsRdd: RDD[(VertexId, BusStopNode)] =
  sc.textFile("\\RouteDetails.txt").filter(!_.startsWith("#")).
map { line =>
  val row = line split ","
  (row(0).toInt, new
busNodeDetails(row(0),row(4),row(1).toInt,row(2).toInt,row(3).toInt,0))
}
busNodeDetailsRdd.foreach(println)

 val detailedStats: RDD[Edge[BusStopNode]] =
sc.textFile("\\routesEdgeNew.txt").
filter(! _.startsWith("#")).
map {line =>
val row = line split ','
Edge(row(0).toInt, row(1).toInt,new BusStopNode(row(2),
row(3),1)
   )}

val busGraph = busstopRDD ++ busNodeDetailsRdd
busGraph.foreach(println)
val mainGraph = Graph(busGraph, detailedStats)
mainGraph.triplets.foreach(println)
 val subGraph = mainGraph subgraph (epred = _.srcAttr.name == "101")
 //Working Fine
 for (subTriplet <- subGraph.triplets) {
 println(subTriplet.dstAttr.name)
 }
 
 //Working fine
  for (mainTriplet <- mainGraph.triplets) {
 println(mainTriplet.dstAttr.name)
 }
 
 //causing error while iterating both at same time
 for (subTriplet <- subGraph.triplets) {
for (mainTriplet <- mainGraph.triplets) {   //Nullpointer exception
is causing here
   if
(subTriplet.dstAttr.name.toString.equals(mainTriplet.dstAttr.name)) {

  println("hello")//success case on both destination names of of
subgraph and maingraph
}
  }
}
}

BusStopNameMini.txt
101,bs,10,B
102,bs,10,B
103,bs,20,B
104,bs,14,B
105,bs,8,B


RouteDetails.txt

#101,102,104  4 5 6
#102,103 3 4
#103,105,104 2 3 4
#104,102,101  4 5 6
#104,1015
#105,104,102 5 6 2
1,101,104,5,R
2,102,103,5,R
3,103,104,5,R
4,102,103,5,R
5,104,101,5,R
6,105,102,5,R

routesEdgeNew.txt it contains two types of edges are bus to bus with edge
value is distance and bus to route with edge value as time
#101,102,104  4 5 6
#102,103 3 4
#103,105,104 2 3 4
#104,102,101  4 5 6
#104,1015
#105,104,102 5 6 2
101,102,4,BS
102,104,5,BS
102,103,3,BS
103,105,4,BS
105,104,3,BS
104,102,4,BS
102,101,5,BS
104,101,5,BS
105,104,5,BS
104,102,6,BS
101,1,4,R,102
101,1,4,R,103
102,2,5,R
103,3,6,R
103,3,5,R
104,4,7,R
105,5,4,Z
101,2,9,R
105,5,4,R
105,2,5,R
104,2,5,R
103,1,4,R
101,103,4,BS
101,104,4,BS
101,105,4,BS
101,103,5,BS
101,104,5,BS
101,105,5,BS
1,101,4,R







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-triplet-comparison-tp28198p28205.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Graphx triplet comparison

2016-12-13 Thread Robineast
Not sure what you are asking. What's wrong with:

triplet1.filter(condition3)
triplet2.filter(condition3)




-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-triplet-comparison-tp28198p28202.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Graphx triplet comparison

2016-12-12 Thread balaji9058
Hi,

I would like to know how to do GraphX triplet comparison in Scala.

For example, there are two triplet sets:

val triplet1 = mainGraph.triplets.filter(condition1)
val triplet2 = mainGraph.triplets.filter(condition2)

Now I want to compare triplet1 & triplet2 with condition3





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-triplet-comparison-tp28198.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Graphx triplet loops causing null pointer exception

2016-12-12 Thread balaji9058
Hi,

I am getting a null pointer exception when I execute a triplet loop
inside another triplet loop.

works fine below :
 for (mainTriplet <- mainGraph.triplets) {
println(mainTriplet.dstAttr.name) 
   }

works fine below :
 for (subTriplet <- subGrapgh.triplets) {
println(subTriplet .dstAttr.name) 
   }

Below is the code causing the issue:
1. for (subTriplet <- subGrapgh.triplets) {
2. println(subTriplet .dstAttr.name)  
3. for (mainTriplet <- mainGraph.triplets) {
 4.println(mainTriplet.dstAttr.name)   
// here i want to do operation like
subTriplet.dstAttr.name.toString.equals(mainTriplet.dstAttr.name)
6.}
7.}  

Line number 2 prints only once and I get the error below immediately:

ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 10)
java.lang.NullPointerException
at
org.apache.spark.graphx.impl.GraphImpl.triplets$lzycompute(GraphImpl.scala:50)
at org.apache.spark.graphx.impl.GraphImpl.triplets(GraphImpl.scala:49)


Please help, and let me know if anything more is required in my explanation.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-triplet-loops-causing-null-pointer-exception-tp28195.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [GraphX] Extreme scheduler delay

2016-12-06 Thread Sean Owen
(For what it is worth, I happened to look into this with Anton earlier and
am also pretty convinced it's related to GraphX rather than the app. It's
somewhat difficult to debug what gets sent in the closure AFAICT.)
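
A hedged sketch of the usual fix for the "task of very large size" warning, i.e. broadcasting the big read-only structure instead of letting every task closure capture it (the matrix and sizes below are stand-ins, not Anton's actual code):

import org.apache.spark.graphx._

val rows = 100
val cols = 100
val matrix = Array.fill(rows, cols)(scala.util.Random.nextDouble())
val matrixBc = sc.broadcast(matrix)                  // shipped once per executor, not per task

val vertices = sc.parallelize(0 until rows * cols).map { i =>
  (i.toLong, matrixBc.value(i / cols)(i % cols))     // read the value via the broadcast
}
val edges = sc.parallelize(0 until rows * cols - 1).map(i => Edge(i.toLong, i.toLong + 1, 1.0))
val graph = Graph(vertices, edges)
println(graph.edges.count())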

On Tue, Dec 6, 2016 at 7:49 PM AntonIpp  wrote:

> Hi everyone,
>
> I have a small Scala test project which uses GraphX and for some reason has
> extreme scheduler delay when executed on the cluster. The problem is not
> related to the cluster configuration, as other GraphX applications run
> without any issue.
> I have attached the source code ( MatrixTest.scala
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n28162/MatrixTest.scala
> >
> ), it creates a sort of a  GraphGenerators.gridGraph
> <
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.util.GraphGenerators$
> >
> (but with diagonal edges too) using data from a matrix inside the Map
> class.
> There are in reality only 4 lines related to GraphX itself: creating a
> VertexRDD, creating an EdgeRDD, creating a Graph and then calling
> graph.edges.count.
> As you can see on the  Spark History Server
> <
> http://cdhdns-mn0.westeurope.cloudapp.azure.com:18088/history/application_1480677653852_0050/jobs/
> >
> , the task has very significant scheduler delay. There is also the
> following
> warning in the logs (I have attached them too:  MatrixTest.log
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n28162/MatrixTest.log
> >
> ) : "WARN scheduler.TaskSetManager: Stage 0 contains a task of very large
> size (2905 KB). The maximum recommended task size is 100 KB."
> This also happens with .aggregateMessages.collect and Pregel. I have tested
> with Spark 1.6 and 2.0, different levels of parallelism, different number
> of
> executors, etc but the scheduler delay is still there and grows more and
> more extreme as the number of vertices and edges grows.
>
> Does anyone have any idea as to what could be the source of the issue?
> Thank you!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Extreme-scheduler-delay-tp28162.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


[GraphX] Extreme scheduler delay

2016-12-06 Thread AntonIpp
Hi everyone,

I have a small Scala test project which uses GraphX and for some reason has
extreme scheduler delay when executed on the cluster. The problem is not
related to the cluster configuration, as other GraphX applications run
without any issue.
I have attached the source code ( MatrixTest.scala
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28162/MatrixTest.scala>
 
), it creates a sort of a  GraphGenerators.gridGraph
<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.util.GraphGenerators$>
  
(but with diagonal edges too) using data from a matrix inside the Map class.
There are in reality only 4 lines related to GraphX itself: creating a
VertexRDD, creating an EdgeRDD, creating a Graph and then calling
graph.edges.count. 
As you can see on the  Spark History Server
<http://cdhdns-mn0.westeurope.cloudapp.azure.com:18088/history/application_1480677653852_0050/jobs/>
 
, the task has very significant scheduler delay. There is also the following
warning in the logs (I have attached them too:  MatrixTest.log
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28162/MatrixTest.log>
 
) : "WARN scheduler.TaskSetManager: Stage 0 contains a task of very large
size (2905 KB). The maximum recommended task size is 100 KB."
This also happens with .aggregateMessages.collect and Pregel. I have tested
with Spark 1.6 and 2.0, different levels of parallelism, different number of
executors, etc but the scheduler delay is still there and grows more and
more extreme as the number of vertices and edges grows.

Does anyone have any idea as to what could be the source of the issue?
Thank you!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Extreme-scheduler-delay-tp28162.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX Pregel not update vertex state properly, cause messages loss

2016-11-28 Thread rohit13k
Found the exact issue. If the vertex attribute is a complex object with
mutable members, the edge triplet does not pick up the new state once the
vertex attributes have already been shipped, but if the vertex attributes are
immutable objects then there is no issue. Below is code for the same. Just
changing the mutable HashMap to an immutable HashMap solves the issue. (This is
not a fix for the bug; either this limitation should be made known to users or
the bug needs to be fixed for mutable objects.)

import org.apache.spark.graphx._
import com.alibaba.fastjson.JSONObject
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.log4j.Logger
import org.apache.log4j.Level
import scala.collection.mutable.HashMap


object PregelTest {
  val logger = Logger.getLogger(getClass().getName());
  def run(graph: Graph[HashMap[String, Int], HashMap[String, Int]]):
Graph[HashMap[String, Int], HashMap[String, Int]] = {

def vProg(v: VertexId, attr: HashMap[String, Int], msg: Integer):
HashMap[String, Int] = {
  var updatedAttr = attr
  
  if (msg < 0) {
// init message received 
if (v.equals(0.asInstanceOf[VertexId])) updatedAttr =
attr.+=("LENGTH" -> 0)
else updatedAttr = attr.+=("LENGTH" -> Integer.MAX_VALUE)
  } else {
updatedAttr = attr.+=("LENGTH" -> (msg + 1))
  }
  updatedAttr
}

def sendMsg(triplet: EdgeTriplet[HashMap[String, Int], HashMap[String,
Int]]): Iterator[(VertexId, Integer)] = {
  val len = triplet.srcAttr.get("LENGTH").get
  // send a msg if last hub is reachable 
  if (len < Integer.MAX_VALUE) Iterator((triplet.dstId, len))
  else Iterator.empty
}

def mergeMsg(msg1: Integer, msg2: Integer): Integer = {
  if (msg1 < msg2) msg1 else msg2
}

Pregel(graph, new Integer(-1), 3, EdgeDirection.Either)(vProg, sendMsg,
mergeMsg)
  }

  def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val conf = new SparkConf().setAppName("Pregel Test")
conf.set("spark.master", "local")
val sc = new SparkContext(conf)
val test = new HashMap[String, Int]

// create a simplest test graph with 3 nodes and 2 edges 
val vertexList = Array(
  (0.asInstanceOf[VertexId], new HashMap[String, Int]),
  (1.asInstanceOf[VertexId], new HashMap[String, Int]),
  (2.asInstanceOf[VertexId], new HashMap[String, Int]))
val edgeList = Array(
  Edge(0.asInstanceOf[VertexId], 1.asInstanceOf[VertexId], new
HashMap[String, Int]),
  Edge(1.asInstanceOf[VertexId], 2.asInstanceOf[VertexId], new
HashMap[String, Int]))

val vertexRdd = sc.parallelize(vertexList)
val edgeRdd = sc.parallelize(edgeList)
val g = Graph[HashMap[String, Int], HashMap[String, Int]](vertexRdd,
edgeRdd)

// run test code 
val lpa = run(g)
lpa.vertices.collect().map(println)
  }
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-not-update-vertex-state-properly-cause-messages-loss-tp28100p28139.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX Pregel not update vertex state properly, cause messages loss

2016-11-24 Thread 吴 郎
Thank you, Dale, I've realized in what situation this bug is triggered. 
Actually, it seems that any user-defined class with mutable fields (such as Map, 
List...) cannot be used as a message, or it will be lost in the next supersteps. To 
work around this, I tried to deep-copy a new message object every time the 
vertex program runs, and it has worked so far, though it's obviously not an 
elegant way. 

fuz woo
 
--
Best regards,
吴 郎 (Wu Lang)

---
College of Computer, National University of Defense Technology

Kaifu District, Changsha, Hunan Province, 410073
Email: fuz@qq.com






 




-- Original --
From: "Dale Wang"; 
Date: Thursday, 24 November 2016, 11:10 AM
To: "吴 郎"; 
Cc: "user"; 
Subject: Re: GraphX Pregel not update vertex state properly, cause messages loss




The problem comes from the inconsistency between a graph’s triplet view and 
vertex view. The message may not be lost; the message is just not sent in the 
sendMsg function, because sendMsg gets the wrong value of srcAttr! 
 
 It is not a new bug. I met a similar bug that appeared in version 1.2.1 
according to SPARK-6378 before. I can reproduce that inconsistency bug with a 
small and simple program (see that JIRA issue for more details). It seems that 
in some situations the triplet view of a Graph object does not update 
consistently with the vertex view. The GraphX Pregel API heavily relies on the 
mapReduceTriplets (old)/aggregateMessages (new) API, which heavily relies on the 
correct behavior of the triplet view of a graph. Thus this bug influences the 
behavior of the Pregel API.
 
 Though I cannot figure out why the bug appears either, I suspect that the 
bug has some connection with the data type of the vertex property. If you use 
primitive types such as Double and Long, it is OK. But if you use some 
self-defined type with mutable fields, such as a mutable Map or a mutable 
ArrayBuffer, the bug appears. In your case I notice that you use JSONObject as 
your vertex’s data type. After looking up the definition of JSONObject, 
JSONObject has a Java map as its field to store data, which is mutable. To 
temporarily avoid the bug, you can modify the data type of your vertex 
property to avoid any mutable data type, by replacing mutable data collections 
with the immutable data collections provided by Scala and replacing var fields 
with val fields. At least, that suggestion works for me.
 
 Zhaokang Wang
 ​



2016-11-18 11:47 GMT+08:00 fuz_woo :
hi,everyone, I encountered a strange problem these days when i'm attempting
 to use the GraphX Pregel interface to implement a simple
 single-source-shortest-path algorithm.
 below is my code:
 
 import com.alibaba.fastjson.JSONObject
 import org.apache.spark.graphx._
 
 import org.apache.spark.{SparkConf, SparkContext}
 
 object PregelTest {
 
   def run(graph: Graph[JSONObject, JSONObject]): Graph[JSONObject,
 JSONObject] = {
 
 def vProg(v: VertexId, attr: JSONObject, msg: Integer): JSONObject = {
   if ( msg < 0 ) {
 // init message received
 if ( v.equals(0.asInstanceOf[VertexId]) ) attr.put("LENGTH", 0)
 else attr.put("LENGTH", Integer.MAX_VALUE)
   } else {
 attr.put("LENGTH", msg+1)
   }
   attr
 }
 
 def sendMsg(triplet: EdgeTriplet[JSONObject, JSONObject]):
 Iterator[(VertexId, Integer)] = {
   val len = triplet.srcAttr.getInteger("LENGTH")
   // send a msg if last hub is reachable
   if ( len < Integer.MAX_VALUE ) Iterator((triplet.dstId, len))
   else Iterator.empty
 }
 [...]
 
 After running the code I got an incorrect result in which vertex 2 has a
 "LENGTH" label valued Integer.MAX_VALUE; it seems that the
 messages sent to vertex 2 were lost unexpectedly. I then tracked the debugger
 to file Pregel.scala, where I saw the code:
 
 
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28100/%E7%B2%98%E8%B4%B4%E5%9B%BE%E7%89%87.png>
 
 In the first iteration 0, the variable messages in line 138 is reconstructed,
 and then recomputed in line 143, where activeMessages got a value of 0,
 which means the messages are lost.
 Then I set a breakpoint in line 138, and before its execution I executed the
 expression " g.triplets().collect() ", which just collects the updated graph
 data. After I did this and executed the rest of the code, the messages were no
 longer empty and activeMessages got the value 1 as expected.
 
 I have tested the code with both Spark/GraphX 1.4 and 1.6 on Scala 2.10,
 and got the same result.
 
 I must say this problem leaves me really confused; I've spent almost 2 weeks
 trying to resolve it and I have no idea how to do it now. If this is not a bug, I
 totally can't understand why just executing a non-intrusive expression
 ( g.triplets().collect(), which just collects the data and does no computation )
 could change the outcome; it's really ridiculous.
 
 
 
 --
 View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-not-update-vertex-state-properly-cause-messages-loss-tp28100.html
 Sent from the Apache Spark User List mailing list a

Re: GraphX Pregel not update vertex state properly, cause messages loss

2016-11-23 Thread Dale Wang
The problem comes from the inconsistency between graph’s triplet view and
vertex view. The message may not be lost but the message is just not sent
in sendMsgfunction because sendMsg function gets wrong value of srcAttr!

It is not a new bug. I met a similar bug that appeared in version 1.2.1
according to JIAR-6378 <https://issues.apache.org/jira/browse/SPARK-6378>
before. I can reproduce that inconsistency bug with a small and simple
program (See that JIRA issue for more details). It seems that in some
situation the triplet view of a Graph object does not update consistently
with vertex view. The GraphX Pregel API heavily relies on
mapReduceTriplets (old)/aggregateMessages (new) API, which heavily relies on the
correct behavior of the triplet view of a graph. Thus this bug influences
the behavior of the Pregel API.

Though I cannot figure out why the bug appears either, but I suspect that
the bug has some connection with the data type of the vertex property. If
you use *primitive* types such as Double and Long, it is OK. But if you use
some self-defined type with mutable fields such as mutable Map and mutable
ArrayBuffer, the bug appears. In your case I notice that you use JSONObject
as your vertex’s data type. After looking up the definition ofJSONObject,
JSONObject has a java map as its field to store data which is mutable. To
temporarily avoid the bug, you can modify the data type of your vertex
property to avoid any mutable data type, by replacing mutable data
collections with the immutable data collections provided by Scala and replacing var
fields with val fields. At least, that suggestion works for me.
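
A minimal sketch of that workaround applied to the vProg from the post: keep the vertex state in an immutable value (a case class with val fields / immutable collections) and return a new instance instead of mutating the old one.

import org.apache.spark.graphx.VertexId

case class VState(length: Int)                           // immutable state, val field

def vProg(v: VertexId, attr: VState, msg: Int): VState =
  if (msg < 0) VState(if (v == 0L) 0 else Int.MaxValue)  // init message received
  else VState(msg + 1)                                   // return a new instance, no mutation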

Zhaokang Wang
​

2016-11-18 11:47 GMT+08:00 fuz_woo :

> hi,everyone, I encountered a strange problem these days when i'm attempting
> to use the GraphX Pregel interface to implement a simple
> single-source-shortest-path algorithm.
> below is my code:
>
> import com.alibaba.fastjson.JSONObject
> import org.apache.spark.graphx._
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> object PregelTest {
>
>   def run(graph: Graph[JSONObject, JSONObject]): Graph[JSONObject,
> JSONObject] = {
>
> def vProg(v: VertexId, attr: JSONObject, msg: Integer): JSONObject = {
>   if ( msg < 0 ) {
> // init message received
> if ( v.equals(0.asInstanceOf[VertexId]) ) attr.put("LENGTH", 0)
> else attr.put("LENGTH", Integer.MAX_VALUE)
>   } else {
> attr.put("LENGTH", msg+1)
>   }
>   attr
> }
>
> def sendMsg(triplet: EdgeTriplet[JSONObject, JSONObject]):
> Iterator[(VertexId, Integer)] = {
>   val len = triplet.srcAttr.getInteger("LENGTH")
>   // send a msg if last hub is reachable
>   if ( len < Integer.MAX_VALUE ) Iterator((triplet.dstId, len)) else Iterator.empty
> }
>
> def mergeMsg(msg1: Integer, msg2: Integer): Integer = {
>   if ( msg1 < msg2 ) msg1 else msg2
> }
>
> Pregel(graph, new Integer(-1), 3, EdgeDirection.Out)(vProg, sendMsg,
> mergeMsg)
>   }
>
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("Pregel Test")
> conf.set("spark.master", "local")
> val sc = new SparkContext(conf)
>
> // create a simplest test graph with 3 nodes and 2 edges
> val vertexList = Array(
>   (0.asInstanceOf[VertexId], new JSONObject()),
>   (1.asInstanceOf[VertexId], new JSONObject()),
>   (2.asInstanceOf[VertexId], new JSONObject()))
> val edgeList = Array(
>   Edge(0.asInstanceOf[VertexId], 1.asInstanceOf[VertexId], new
> JSONObject()),
>   Edge(1.asInstanceOf[VertexId], 2.asInstanceOf[VertexId], new
> JSONObject()))
>
> val vertexRdd = sc.parallelize(vertexList)
> val edgeRdd = sc.parallelize(edgeList)
> val g = Graph[JSONObject, JSONObject](vertexRdd, edgeRdd)
>
> // run test code
> val lpa = run(g)
> lpa
>   }
> }
>
> and after i run the code, I got a incorrect result in which the vertex 2
> has
> a "LENGTH" label valued <Integer.MAX_VALUE>, it seems that the
> messages sent to vertex 2 was lost unexpectedly. I then tracked the
> debugger
> to file Pregel.scala,  where I saw the code:
>
> <http://apache-spark-user-list.1001560.n3.nabble.com/
> file/n28100/%E7%B2%98%E8%B4%B4%E5%9B%BE%E7%89%87.png>
>
> In the first iteration 0, the variable messages in line 138 is
> reconstructed
> , and then recomputed in line 143, in where activeMessages got a value 0,
> which means the messages is lost.
> then I set a breakpoint in line 138, and before its execution I execute an
> expression " g.triplets().collect() " which just collects the updated graph
> data. after I done this and execute the rest code, the messages is no
> longer

Re: GraphX Pregel not update vertex state properly, cause messages loss

2016-11-23 Thread rohit13k
Created a JIRA for the same

https://issues.apache.org/jira/browse/SPARK-18568






Re: GraphX Pregel not update vertex state properly, cause messages loss

2016-11-23 Thread rohit13k
Hi 

I am facing a similar issue. It's not that the message is getting lost as
such: the vertex 1 attributes change in superstep 1, but when sendMsg gets
the vertex attribute from the edge triplet in the 2nd superstep, it still
has the old value of vertex 1 and not the latest one. So, as per your code,
no new message will be generated in that superstep. I think the bug is in
the ReplicatedVertexView, where the srcAttr and dstAttr of the edge triplet
are supposed to be updated from the latest version of the vertex after each
superstep.

How can this bug get raised? I am struggling to find an exact solution for it,
other than recreating the graph after every superstep so that the edge
triplets pick up the latest vertex values, but that is not a good
solution performance-wise.
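
For what it's worth, a rough sketch of that rebuild-per-superstep workaround
(the names and structure below are illustrative, not anyone's actual code):

import scala.reflect.ClassTag
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Rebuild the graph from the updated vertices so srcAttr/dstAttr in the
// triplets are refreshed; correctness at the cost of extra work per superstep.
def rebuild[VD: ClassTag, ED: ClassTag](
    old: Graph[VD, ED],
    updatedVertices: RDD[(VertexId, VD)]): Graph[VD, ED] = {
  val fresh = Graph(updatedVertices, old.edges).cache()
  fresh.vertices.count()            // materialise before dropping the old graph
  old.unpersist(blocking = false)
  fresh
}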








GraphX Pregel not update vertex state properly, cause messages loss

2016-11-17 Thread fuz_woo
Hi everyone, I encountered a strange problem these days while attempting
to use the GraphX Pregel interface to implement a simple
single-source shortest-path algorithm.
Below is my code:

import com.alibaba.fastjson.JSONObject
import org.apache.spark.graphx._

import org.apache.spark.{SparkConf, SparkContext}

object PregelTest {

  def run(graph: Graph[JSONObject, JSONObject]): Graph[JSONObject,
JSONObject] = {

def vProg(v: VertexId, attr: JSONObject, msg: Integer): JSONObject = {
  if ( msg < 0 ) {
// init message received
if ( v.equals(0.asInstanceOf[VertexId]) ) attr.put("LENGTH", 0)
else attr.put("LENGTH", Integer.MAX_VALUE)
  } else {
attr.put("LENGTH", msg+1)
  }
  attr
}

    def sendMsg(triplet: EdgeTriplet[JSONObject, JSONObject]): Iterator[(VertexId, Integer)] = {
      val len = triplet.srcAttr.getInteger("LENGTH")
      // send a msg if last hub is reachable
      if ( len < Integer.MAX_VALUE ) Iterator((triplet.dstId, len))
      else Iterator.empty
    }

    def mergeMsg(msg1: Integer, msg2: Integer): Integer = {
      if ( msg1 < msg2 ) msg1 else msg2
    }

    Pregel(graph, new Integer(-1), 3, EdgeDirection.Out)(vProg, sendMsg, mergeMsg)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Pregel Test")
    conf.set("spark.master", "local")
    val sc = new SparkContext(conf)

    // create a simplest test graph with 3 nodes and 2 edges
    val vertexList = Array(
      (0.asInstanceOf[VertexId], new JSONObject()),
      (1.asInstanceOf[VertexId], new JSONObject()),
      (2.asInstanceOf[VertexId], new JSONObject()))
    val edgeList = Array(
      Edge(0.asInstanceOf[VertexId], 1.asInstanceOf[VertexId], new JSONObject()),
      Edge(1.asInstanceOf[VertexId], 2.asInstanceOf[VertexId], new JSONObject()))

    val vertexRdd = sc.parallelize(vertexList)
    val edgeRdd = sc.parallelize(edgeList)
    val g = Graph[JSONObject, JSONObject](vertexRdd, edgeRdd)

    // run test code
    val lpa = run(g)
    lpa
  }
}

After I run the code, I get an incorrect result in which vertex 2 has a
"LENGTH" label of Integer.MAX_VALUE; it seems that the messages sent to
vertex 2 were lost unexpectedly. I then tracked it in the debugger
to file Pregel.scala, where I saw the code:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28100/%E7%B2%98%E8%B4%B4%E5%9B%BE%E7%89%87.png>
 
 
In the first iteration (iteration 0), the variable messages at line 138 is
reconstructed and then recomputed at line 143, where activeMessages gets the
value 0, which means the messages are lost.
I then set a breakpoint at line 138, and before it executed I evaluated the
expression " g.triplets().collect() ", which just collects the updated graph
data. After doing this and running the rest of the code, messages is no longer
empty and activeMessages gets the value 1, as expected.

I have tested the code with both Spark/GraphX 1.4 and 1.6 on Scala 2.10,
and got the same result.

I must say this problem leaves me really confused. I've spent almost 2 weeks
trying to resolve it and still have no idea how. If this is not a bug, I
totally can't understand why merely executing a side-effect-free expression
( g.triplets().collect(), which just collects the data and computes nothing )
could change the outcome; it seems really strange.






GraphX updating vertex property

2016-11-15 Thread Saliya Ekanayake
Hi,

I have created a property graph using GraphX. Each vertex has an integer
array as a property. I'd like to update the values of these arrays without
creating new graph objects.

Is this possible in Spark?

Thank you,
Saliya

-- 
Saliya Ekanayake, Ph.D
Applied Computer Scientist
Network Dynamics and Simulation Science Laboratory (NDSSL)
Virginia Tech, Blacksburg
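
Not an authoritative answer, but for context: GraphX graphs are immutable, so
the usual pattern is to derive a new graph with mapVertices (or joinVertices),
which returns a new Graph object while reusing the unchanged edge structure
rather than copying it. A tiny sketch with made-up names:

import org.apache.spark.graphx._

// assume g: Graph[Array[Int], Double], one integer array per vertex
def bumpAll(g: Graph[Array[Int], Double]): Graph[Array[Int], Double] =
  g.mapVertices { (id, arr) => arr.map(_ + 1) }  // new graph, reused edges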


GraphX and Public Transport Shortest Paths

2016-11-08 Thread Gerard Casey
Hi all,

I’m doing a quick lit review.

Suppose I have a graph whose link weights depend on time, i.e. a bus on a given
road gives a journey time (link weight) of x at time y. This is a classic
public-transport shortest-path problem.

This is a weighted, directed, time-dependent graph. Are there any
resources relating to a shortest-path algorithm for such a graph? I suspect
someone may have done this using GTFS data in some way.

Cheers

Geroid



Re: GraphX Connected Components

2016-11-08 Thread Robineast
Have you tried this?
https://spark.apache.org/docs/2.0.1/api/scala/index.html#org.apache.spark.graphx.GraphLoader$
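
For readers following the link, a minimal usage sketch (the SparkContext sc
and the file path are assumed, not taken from the thread):

import org.apache.spark.graphx.GraphLoader

// Load an edge list (one "srcId dstId" pair per line) and run connected
// components; each vertex is labelled with the smallest id in its component.
val graph = GraphLoader.edgeListFile(sc, "data/edges.txt")
val components = graph.connectedComponents().vertices
components.take(10).foreach(println)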



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action




Need to know about GraphX and Streaming

2016-11-02 Thread Md. Mahedi Kaysar
Hi All,
I am new to Spark GraphX. I am trying to understand it for analysing graph
streaming data. I know Spark has streaming modules that work on both the
tabular (DataFrame) and DStream mechanisms.
I am wondering whether it is possible to leverage the streaming APIs in GraphX
for analysing real-time graph streams. What would the overhead and overall
performance be if I used it? I would also like to know about ongoing research
on graph streaming.

It would be very helpful if someone could explain this to me. Thanks in advance.


Kind Regards,

Mahedi


Re: java.lang.NoSuchMethodError - GraphX

2016-10-25 Thread Brian Wilson
I have discovered that this Dijkstra function was written for Spark 1.6 (Scala 2.10),
while the remainder of my code is Scala 2.11.

I have checked the functions within the dijkstra function and can’t see any 
that are illegal. For example `mapVertices`, `aggregateMessages` and 
`outerJoinVertices` are all being used correctly.

What else could this be?

Thanks

Brian
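
One hedged observation: scala.runtime.ObjectRef.create only exists from Scala
2.11 onwards, so this particular NoSuchMethodError usually means code compiled
against Scala 2.11 is running on a Scala 2.10 build of Spark (or vice versa).
A build.sbt sketch of the usual fix, with illustrative version numbers:

// Keep the project's Scala version aligned with the Scala build of the Spark
// artifacts (and of the cluster you submit to).
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-graphx" % "1.6.1" % "provided"
)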

> On 25 Oct 2016, at 08:47, Brian Wilson  wrote:
> 
> Thank you Michael! This looks perfect but I have a `NoSuchMethodError` that I 
> cannot understand. 
> 
> I am trying to implement a weighted shortest path algorithm from your `Spark 
> GraphX in Action` book. The part in question is Listing 6.4 "Executing the 
> shortest path algorithm that uses breadcrumbs"  from Chapter 6 [here][1].
> 
> I have my own graph that I create from two RDDs. There are `344436` vertices 
> and `772983` edges. I can perform an unweighted shortest path computation 
> using the native GraphX library and I'm confident in the graph construction. 
> 
> In this case I use their Dijkstra's implementation as follows:
> 
> val my_graph: Graph[(Long),Double] = Graph.apply(verticesRDD, 
> edgesRDD).cache()
>   
>   def dijkstra[VD](g:Graph[VD,Double], origin:VertexId) = {
>   var g2 = g.mapVertices(
>   (vid,vd) => (false, if (vid == origin) 0 else 
> Double.MaxValue,
>   
> List[VertexId]()))
> 
>   for (i <- 1L to g.vertices.count-1) {
>   val currentVertexId =
>   g2.vertices.filter(!_._2._1)
>   
> .fold((0L,(false,Double.MaxValue,List[VertexId]())))((a,b) =>
>   if (a._2._2 < b._2._2) 
> a else b)
>   ._1
> 
>   val newDistances = 
> g2.aggregateMessages[(Double,List[VertexId])](
>   ctx => if (ctx.srcId == 
> currentVertexId)
>
> ctx.sendToDst((ctx.srcAttr._2 + ctx.attr,
>   
> ctx.srcAttr._3 :+ ctx.srcId)),
>   (a,b) => if (a._1 < b._1) a 
> else b)
> 
>   g2 = g2.outerJoinVertices(newDistances)((vid, 
> vd, newSum) => {
>   val newSumVal =
>   
> newSum.getOrElse((Double.MaxValue,List[VertexId]()))
>   (vd._1 || vid == currentVertexId,
>   math.min(vd._2, newSumVal._1),
>   if (vd._2 < newSumVal._1) vd._3 else 
> newSumVal._2)})
>   }
>   
>   g.outerJoinVertices(g2.vertices)((vid, vd, dist) =>
>   (vd, 
> dist.getOrElse((false,Double.MaxValue,List[VertexId]()))
>.productIterator.toList.tail))
>   }
> 
>   //  Path Finding - random node from which to find all paths
>   val v1 = 400028222916L
> 
> I then call their function with my graph and a random vertex ID. Previously I 
> had issues with `v1` not being recognised as `long` type and the `L` suffix 
> solved this.
> 
>   val results = dijkstra(my_graph, 1L).vertices.map(_._2).collect
>   
>   println(results)
>   
> However, this returns the following:
> 
> Error: Exception in thread "main" java.lang.NoSuchMethodError: 
> scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
>   at GraphX$.dijkstra$1(GraphX.scala:51)
>   at GraphX$.main(GraphX.scala:85)
>   at GraphX.main(GraphX.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSu

Re: java.lang.NoSuchMethodError - GraphX

2016-10-24 Thread Brian Wilson
Thank you Michael! This looks perfect but I have a `NoSuchMethodError` that I 
cannot understand. 

I am trying to implement a weighted shortest path algorithm from your `Spark 
GraphX in Action` book. The part in question is Listing 6.4 "Executing the 
shortest path algorithm that uses breadcrumbs"  from Chapter 6 [here][1].

I have my own graph that I create from two RDDs. There are `344436` vertices 
and `772983` edges. I can perform an unweighted shortest path computation using 
the native GraphX library and I'm confident in the graph construction. 

In this case I use their Dijkstra's implementation as follows:

val my_graph: Graph[(Long),Double] = Graph.apply(verticesRDD, edgesRDD).cache()

def dijkstra[VD](g: Graph[VD, Double], origin: VertexId) = {
  var g2 = g.mapVertices(
    (vid, vd) => (false,
                  if (vid == origin) 0 else Double.MaxValue,
                  List[VertexId]()))

  for (i <- 1L to g.vertices.count - 1) {
    val currentVertexId =
      g2.vertices.filter(!_._2._1)
        .fold((0L, (false, Double.MaxValue, List[VertexId]())))((a, b) =>
          if (a._2._2 < b._2._2) a else b)
        ._1

    val newDistances = g2.aggregateMessages[(Double, List[VertexId])](
      ctx => if (ctx.srcId == currentVertexId)
               ctx.sendToDst((ctx.srcAttr._2 + ctx.attr,
                              ctx.srcAttr._3 :+ ctx.srcId)),
      (a, b) => if (a._1 < b._1) a else b)

    g2 = g2.outerJoinVertices(newDistances)((vid, vd, newSum) => {
      val newSumVal =
        newSum.getOrElse((Double.MaxValue, List[VertexId]()))
      (vd._1 || vid == currentVertexId,
       math.min(vd._2, newSumVal._1),
       if (vd._2 < newSumVal._1) vd._3 else newSumVal._2)})
  }

  g.outerJoinVertices(g2.vertices)((vid, vd, dist) =>
    (vd, dist.getOrElse((false, Double.MaxValue, List[VertexId]()))
           .productIterator.toList.tail))
}

//  Path Finding - random node from which to find all paths
val v1 = 400028222916L

I then call their function with my graph and a random vertex ID. Previously I 
had issues with `v1` not being recognised as `long` type and the `L` suffix 
solved this.

val results = dijkstra(my_graph, 1L).vertices.map(_._2).collect

println(results)

However, this returns the following:

Error: Exception in thread "main" java.lang.NoSuchMethodError: 
scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
at GraphX$.dijkstra$1(GraphX.scala:51)
at GraphX$.main(GraphX.scala:85)
at GraphX.main(GraphX.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Line 51 refers to the line `var g2 = g.mapVertices(`
Line 85 refers to the line `val results = dijkstra(my_graph, 
1L).vertices.map(_._2).collect`

What method is this exception referring to? I am able to package with `sbt` 
without error, and I cannot see what method I am calling which does not exist. 

Many thanks!

Brian

  [1]: https://www.manning.com/books/spark-graphx-in-action#downloads 
<https://www.manning.com/books/spark-graphx-in-action#downloads>

> On 24 Oct 2016, at 16:54, Michael Malak  wrote:
> 
> Chapter 6 of my book implements Dijkstra's Algorithm. The source code is 
> available to download for free. 
> https://www.manning.com/books/spark-graphx-in-action 
> <https://www.manning.com/books/s

Re: GraphX drawing algorithm

2016-09-11 Thread Michael Malak
In chapter 10 of Spark GraphX In Action, we describe how to use Zeppelin with 
d3.js to render graphs using d3's force-directed rendering algorithm. The 
source code can be downloaded for free from 
https://www.manning.com/books/spark-graphx-in-action
  From: agc studio 
 To: user@spark.apache.org 
 Sent: Sunday, September 11, 2016 5:59 PM
 Subject: GraphX drawing algorithm
   
Hi all,
I was wondering if a force-directed graph drawing algorithm has been 
implemented for graphX?

Thanks

   

GraphX drawing algorithm

2016-09-11 Thread agc studio
Hi all,

I was wondering if a force-directed graph drawing algorithm has been
implemented for graphX?

Thanks


Problem with Graphx and number of partitions

2016-08-31 Thread alvarobrandon
Hello everyone:

I have a problem when setting the number of partitions inside GraphX with
the ConnectedComponents function. When I launch the application with the
default number of partitions, everything runs smoothly. However, when I
increase the number of partitions to 150, for example (it happens with
bigger values as well), it gets stuck in stage 5 on the last task.

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27629/Screen_Shot_2016-08-31_at_13.png>
 

with the following error


[Stage 5:=> (190 + 10) / 200] 241.445: [GC [PSYoungGen: 118560K->800K(233472K)] 418401K->301406K(932864K), 0.0029430 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
[Stage 5:=> (199 + 1) / 200] 16/08/31 11:09:23 ERROR spark.ContextCleaner: Error cleaning broadcast 4
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120
seconds. This timeout is controlled by spark.rpc.askTimeout
at
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Failure.recover(Try.scala:185)
at
scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at
scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at
org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at
scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
at
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at
scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at
scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
at
scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
at
scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
at
scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
at
scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at
scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
at
scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
at
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
at
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at
scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at
org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:241)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply
in 120 seconds
at
org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242)
... 7 more

The way I set the number of partitions is when reading the graph through:

val graph = GraphLoader.edgeListFile(sc, input, true, minEdge,
StorageLevel.MEMORY_AND_DISK, StorageLevel.MEMORY_AND_DISK)
val res = graph.connectedComponents().vertices

The version of 
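
For reference, a sketch of the same call with the parameters named (the input
path below is hypothetical): the fourth argument of GraphLoader.edgeListFile is
numEdgePartitions, so minEdge above is the edge-partition count.

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
import org.apache.spark.storage.StorageLevel

val graph = GraphLoader.edgeListFile(
    sc,
    "hdfs:///path/to/edgelist.txt",               // hypothetical input path
    canonicalOrientation = true,
    numEdgePartitions = 150,
    edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
    vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
  .partitionBy(PartitionStrategy.EdgePartition2D)  // optional: rebalance edges

val res = graph.connectedComponents().vertices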

Future of GraphX

2016-08-24 Thread mas
Hi, 

I am wondering whether there is any work currently going on to optimize
GraphX.
I am aware of GraphFrames, which is built on DataFrames. However, is there any
plan to build a version of GraphX on the newer Spark APIs, i.e., Datasets or
Spark 2.0?

Furthermore, is there any plan to incorporate graph streaming?

Thanks,






GraphX VerticesRDD issue - java.lang.ArrayStoreException: java.lang.Long

2016-08-18 Thread Gerard Casey
Dear all,

I am building a graph from two JSON files.

Spark version 1.6.1

Creating Edge and Vertex RDDs from JSON files.

The vertex JSON files looks like this:

{"toid": "osgb400031043205", "index": 1, "point": [508180.748, 
195333.973]}
{"toid": "osgb400031043206", "index": 2, "point": [508163.122, 
195316.627]}
{"toid": "osgb400031043207", "index": 3, "point": [508172.075, 
195325.719]}
{"toid": "osgb400031043208", "index": 4, "point": [508513, 196023]}

val vertices_raw = sqlContext.read.json("vertices.json.gz")

val vertices = vertices_raw.rdd.map(row=> 
((row.getAs[String]("toid").stripPrefix("osgb").toLong),row.getAs[String]("index")))

val verticesRDD: RDD[(VertexId, String)] = vertices

The edges JSON file looks like this:

{"index": 1, "term": "Private Road - Restricted Access", "nature": 
"Single Carriageway", "negativeNode": "osgb400023183407", "toid": 
"osgb400023296573", "length": 112.8275895775762, "polyline": [492019.481, 
156567.076, 492028, 156567, 492041.667, 156570.536, 492063.65, 156578.067, 
492126.5, 156602], "positiveNode": "osgb400023183409"}
{"index": 2, "term": "Private Road - Restricted Access", "nature": 
"Single Carriageway", "negativeNode": "osgb400023763485", "toid": 
"osgb400023296574", "length": 141.57731318733806, "polyline": [492144.493, 
156762.059, 492149.35, 156750, 492195.75, 156630], "positiveNode": 
"osgb400023183408"}

val edges_raw = sqlContext.read.json("edges.json.gz")

val edgesRDD = edges_raw.rdd.map(row => Edge(
  row.getAs[String]("positiveNode").stripPrefix("osgb").toLong,
  row.getAs[String]("negativeNode").stripPrefix("osgb").toLong,
  row.getAs[Double]("length")))

I have an EdgesRDD that I can inspect

[IN] edgesRDD
res10: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Double]] = 
MapPartitionsRDD[19] at map at :38
[IN] edgesRDD.foreach(println)

Edge(505125036254,505125036231,42.26548472559799)
Edge(505125651333,505125651330,29.557979625165135)
Edge(505125651329,505125651330,81.9310872300414)

I have a verticesRDD

[IN] verticesRDD
res12: org.apache.spark.rdd.RDD[(Long, String)] = MapPartitionsRDD[9] at 
map at :38

[IN] verticesRDD.foreach(println)
(505125651331,343722)
(505125651332,343723)
(505125651333,343724)

I then combine these to create a graph.

[IN] val graph: Graph[(String),Double] = Graph(verticesRDD, edgesRDD)
graph: org.apache.spark.graphx.Graph[String,Double] = 
org.apache.spark.graphx.impl.GraphImpl@303bbd02

I can inspect the edgesRDD within the graph object:

[IN] graph.edges.foreach(println)

Edge(505125774813,400029917080,72.9742898009203)
Edge(505125774814,505125774813,49.87951589790352)
Edge(505125775080,400029936370,69.62871049042008)

However, when I inspect the verticesRDD:

[IN] graph.vertices.foreach(println)
Is there an issue with my graph construction? 

ERROR Executor: Exception in task 0.0 in stage 15.0 (TID 13)
java.lang.ArrayStoreException: java.lang.Long
at 
scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:88)
at 
org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap.setMerge(GraphXPrimitiveKeyOpenHashMap.scala:87)
at 
org.apache.spark.graphx.impl.ShippableVertexPartition$$anonfun$apply$5.apply(ShippableVertexPartition.scala:61)
at 
org.apache.spark.graphx.impl.ShippableVertexPartition$$anonfun$apply$5.apply(ShippableVertexPartition.scala:60)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
org.apache.spark.graphx.impl.ShippableVertexPartition$.apply(ShippableVertexPartition.scala:60)
at 
org.apache.spark.graphx.VertexRDD$$anonfun$2.apply(VertexRDD.scala:328)
at 
org.apache.spark.graphx.VertexRDD$$anonfun$2.apply(VertexRDD.scala:325)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at 
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.graphx.VertexRDD.compute(VertexRDD.scala:71)
at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExec
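
A hedged sketch of one possible cause, not a confirmed diagnosis: "index" is
numeric in the JSON, so Spark infers it as a Long, and
row.getAs[String]("index") is an unchecked cast that actually yields a boxed
java.lang.Long at runtime; GraphX then throws ArrayStoreException when it
stores that value into the String-typed vertex attribute array. Converting
explicitly keeps the attribute a real String:

// assumes vertices_raw and the RDD/VertexId imports from the post above
val vertices = vertices_raw.rdd.map { row =>
  val id = row.getAs[String]("toid").stripPrefix("osgb").toLong
  (id, row.getAs[Long]("index").toString)  // explicit Long -> String conversion
}
val verticesRDD: RDD[(VertexId, String)] = vertices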
