Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-18 Thread Jianshi Huang
Ok, I'll wait until -Pscala-2.11 is more stable and used by more people.

Thanks for the help!

Jianshi

On Tue, Nov 18, 2014 at 3:49 PM, Ye Xianjin advance...@gmail.com wrote:

 Hi Prashant Sharma,

 It's not even OK to build with the scala-2.11 profile on my machine.

 I just checked out master (c6e0c2ab1c29c184a9302d23ad75e4ccd8060242) and
 ran sbt/sbt -Pscala-2.11 clean assembly:

 .. skip the normal part
 [info] Resolving org.scalamacros#quasiquotes_2.11;2.0.1 ...
 [warn] module not found: org.scalamacros#quasiquotes_2.11;2.0.1
 [warn]  local: tried
 [warn]
 /Users/yexianjin/.ivy2/local/org.scalamacros/quasiquotes_2.11/2.0.1/ivys/ivy.xml
 [warn]  public: tried
 [warn]
 https://repo1.maven.org/maven2/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
 [warn]  central: tried
 [warn]
 https://repo1.maven.org/maven2/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
 [warn]  apache-repo: tried
 [warn]
 https://repository.apache.org/content/repositories/releases/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
 [warn]  jboss-repo: tried
 [warn]
 https://repository.jboss.org/nexus/content/repositories/releases/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
 [warn]  mqtt-repo: tried
 [warn]
 https://repo.eclipse.org/content/repositories/paho-releases/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
 [warn]  cloudera-repo: tried
 [warn]
 https://repository.cloudera.com/artifactory/cloudera-repos/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
 [warn]  mapr-repo: tried
 [warn]
 http://repository.mapr.com/maven/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
 [warn]  spring-releases: tried
 [warn]
 https://repo.spring.io/libs-release/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
 [warn]  spark-staging: tried
 [warn]
 https://oss.sonatype.org/content/repositories/orgspark-project-1085/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
 [warn]  spark-staging-hive13: tried
 [warn]
 https://oss.sonatype.org/content/repositories/orgspark-project-1089/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
 [warn]  apache.snapshots: tried
 [warn]
 http://repository.apache.org/snapshots/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
 [warn]  Maven2 Local: tried
 [warn]
 file:/Users/yexianjin/.m2/repository/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
 [info] Resolving jline#jline;2.12 ...
 [warn] ::
 [warn] ::  UNRESOLVED DEPENDENCIES ::
 [warn] ::
 [warn] :: org.scalamacros#quasiquotes_2.11;2.0.1: not found
 [warn] ::
 [info] Resolving org.scala-lang#scala-library;2.11.2 ...
 [warn]
 [warn] Note: Unresolved dependencies path:
 [warn] org.scalamacros:quasiquotes_2.11:2.0.1
 ((com.typesafe.sbt.pom.MavenHelper) MavenHelper.scala#L76)
 [warn]   +- org.apache.spark:spark-catalyst_2.11:1.2.0-SNAPSHOT
 [info] Resolving jline#jline;2.12 ...
 [info] Done updating.
 [info] Updating {file:/Users/yexianjin/spark/}streaming-twitter...
 [info] Updating {file:/Users/yexianjin/spark/}streaming-zeromq...
 [info] Updating {file:/Users/yexianjin/spark/}streaming-flume...
 [info] Updating {file:/Users/yexianjin/spark/}streaming-mqtt...
 [info] Resolving jline#jline;2.12 ...
 [info] Done updating.
 [info] Resolving com.esotericsoftware.minlog#minlog;1.2 ...
 [info] Updating {file:/Users/yexianjin/spark/}streaming-kafka...
 [info] Resolving jline#jline;2.12 ...
 [info] Done updating.
 [info] Resolving jline#jline;2.12 ...
 [info] Done updating.
 [info] Resolving jline#jline;2.12 ...
 [info] Done updating.
 [info] Resolving org.apache.kafka#kafka_2.11;0.8.0 ...
 [warn] module not found: org.apache.kafka#kafka_2.11;0.8.0
 [warn]  local: tried
 [warn]
 /Users/yexianjin/.ivy2/local/org.apache.kafka/kafka_2.11/0.8.0/ivys/ivy.xml
 [warn]  public: tried
 [warn]
 https://repo1.maven.org/maven2/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
 [warn]  central: tried
 [warn]
 https://repo1.maven.org/maven2/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
 [warn]  apache-repo: tried
 [warn]
 https://repository.apache.org/content/repositories/releases/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
 [warn]  jboss-repo: tried
 [warn]
 https://repository.jboss.org/nexus/content/repositories/releases/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
 [warn]  mqtt-repo: tried
 [warn]
 https://repo.eclipse.org/content/repositories/paho-releases/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
 [warn]  cloudera-repo: tried
 [warn]
 https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
 [warn]  mapr-repo: tried
 [warn]
 

Logging problem in Spark when using Flume Log4jAppender

2014-11-18 Thread QiaoanChen
Hi,
I want to do log aggregation in a Spark standalone mode cluster, using
Apache Flume. But something weird happened. Here are my operations:

(1) Start a flume agent, listening on port 3. ( flume-conf.properties
http://apache-spark-user-list.1001560.n3.nabble.com/file/n19140/flume-conf.properties
 
)
(2) In log4j.properties, I added a Flume Log4jAppender to make Spark able to
send its log to a flume agent. 
 ( log4j.properties
http://apache-spark-user-list.1001560.n3.nabble.com/file/n19140/log4j.properties
 
)
(3) sbin/start-all.sh. 
 In this stage, all logs were aggregated into the flume agent and no
exception happened. 
 (flume-ng-log4jappender version: 1.5.0.1)
(4) bin/spark-submit --class org.apache.examples.SparkPi
../lib/spark-examples-1.1.0-hadoop2.4.0.jar 1
 At the beginning, logs were aggregated normally. However, when the
driver needed to stop the SparkContext sc, it threw some exceptions and the driver
exited abnormally:

sparkPi.log
http://apache-spark-user-list.1001560.n3.nabble.com/file/n19140/sparkPi.log  
/
...
14/11/18 23:06:24 INFO
SparkDeploySchedulerBackend:org.apache.spark.Logging$class.logInfo(Logging.scala:59)
- Asking each executor to shut down
14/11/18 23:06:25 INFO
MapOutputTrackerMasterActor:org.apache.spark.Logging$class.logInfo(Logging.scala:59)
- MapOutputTrackerActor stopped!
14/11/18 23:06:25 INFO
ConnectionManager:org.apache.spark.Logging$class.logInfo(Logging.scala:59) -
Selector thread was interrupted!
log4j:ERROR Flume append() failed.
14/11/18 23:06:25 ERROR
ConnectionManager:org.apache.spark.Logging$class.logError(Logging.scala:96)
- Error in select loop
org.apache.flume.FlumeException: Flume append() failed. Exception follows.
...
/


I also tried setting the log4j.appender.flume.unsafeMode property to true. This
time, the driver finished the job but didn't exit:

sparkPi_snd.log
http://apache-spark-user-list.1001560.n3.nabble.com/file/n19140/sparkPi_snd.log
  
/

14/11/18 23:36:49 WARN
NettyAvroRpcClient:org.apache.spark.Logging$class.logInfo(Logging.scala:59)
- Using default maxIOWorkers
log4j:ERROR Cannot Append to Appender! Appender either closed or not setup
correctly!
14/11/18 23:36:49 INFO
ConnectionManager:org.apache.spark.Logging$class.logInfo(Logging.scala:59) -
Removing ReceivingConnection to ConnectionManagerId(pc215,59272)
14/11/18 23:36:49 ERROR
ConnectionManager:org.apache.spark.Logging$class.logError(Logging.scala:75)
- Corresponding SendingConnection to ConnectionManagerId(pc215,59272) not
found
14/11/18 23:36:49 WARN
ConnectionManager:org.apache.spark.Logging$class.logWarning(Logging.scala:71)
- All connections not cleaned up
14/11/18 23:36:49 INFO
ConnectionManager:org.apache.spark.Logging$class.logInfo(Logging.scala:59) -
ConnectionManager stopped
14/11/18 23:36:49 INFO
MemoryStore:org.apache.spark.Logging$class.logInfo(Logging.scala:59) -
MemoryStore cleared
14/11/18 23:36:49 INFO
BlockManager:org.apache.spark.Logging$class.logInfo(Logging.scala:59) -
BlockManager stopped
14/11/18 23:36:49 INFO
BlockManagerMaster:org.apache.spark.Logging$class.logInfo(Logging.scala:59)
- BlockManagerMaster stopped
14/11/18 23:36:49 INFO
RemoteActorRefProvider$RemotingTerminator:akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3.apply$mcV$sp(Slf4jLogger.scala:74)
- Shutting down remote daemon.
14/11/18 23:36:49 INFO
RemoteActorRefProvider$RemotingTerminator:akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3.apply$mcV$sp(Slf4jLogger.scala:74)
- Remote daemon shut down; proceeding with flushing remote transports.
14/11/18 23:36:49 INFO
Remoting:akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3.apply$mcV$sp(Slf4jLogger.scala:74)
- Remoting shut down
14/11/18 23:36:49 INFO
SparkContext:org.apache.spark.Logging$class.logInfo(Logging.scala:59) -
Successfully stopped SparkContext
/


I have searched Google with keywords like "Spark log aggregation", "Spark
Flume Log4jAppender", and "ERROR Flume append() failed", but I can't figure out
how to solve this problem.
Does anybody have an idea?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Logging-problem-in-Spark-when-using-Flume-Log4jAppender-tp19140.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



how to know the Spark worker Mechanism

2014-11-18 Thread tangweihan
I'm a newbie in Spark. I know that what the worker should do is written in the RDD.
But I want to make the worker load a native lib so that I can do something to
change the content of the lib in memory.
How can I do this? I can do it on the driver, but not on the worker. I always get a
fatal error. The JVM reports a fatal error:
 C  [libstdc++.so.6+0x64d24] 
std::_Rb_tree_rotate_left(std::_Rb_tree_node_base*,
std::_Rb_tree_node_base*)+0x4



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-know-the-Spark-worker-Mechanism-tp19141.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark On Yarn Issue: Initial job has not accepted any resources

2014-11-18 Thread Ritesh Kumar Singh
Not sure how to solve this, but spotted these lines in the logs:

14/11/18 14:28:23 INFO YarnAllocationHandler: Container marked as
*failed*: container_1415961020140_0325_01_02

14/11/18 14:28:38 INFO YarnAllocationHandler: Container marked as
*failed*: container_1415961020140_0325_01_03

And the lines following them say it's trying to allocate some space (1408 MB)
but is failing to do so. You might want to look into that.


On Tue, Nov 18, 2014 at 1:23 PM, LinCharlie lin_q...@outlook.com wrote:

 Hi All:
 I was submitting a spark_program.jar to a `spark on yarn cluster` from a
 driver machine in yarn-client mode. Here is the spark-submit command I
 used:

 ./spark-submit --master yarn-client --class
 com.charlie.spark.grax.OldFollowersExample --queue dt_spark
 ~/script/spark-flume-test-0.1-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.1.jar

 The queue `dt_spark` was free, and the program was submitted successfully
 and ran on the cluster. But the console showed repeatedly that:

 14/11/18 15:11:48 WARN YarnClientClusterScheduler: Initial job has not
 accepted any resources; check your cluster UI to ensure that workers are
 registered and have sufficient memory

 Checking the cluster UI logs, I found no errors:

 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in 
 [jar:file:/home/disk5/yarn/usercache/linqili/filecache/6957209742046754908/spark-assembly-1.0.2-hadoop2.0.0-cdh4.2.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in 
 [jar:file:/home/hadoop/hadoop-2.0.0-cdh4.2.1/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
 explanation.
 SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
 14/11/18 14:28:16 INFO SecurityManager: Changing view acls to: hadoop,linqili
 14/11/18 14:28:16 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(hadoop, linqili)
 14/11/18 14:28:17 INFO Slf4jLogger: Slf4jLogger started
 14/11/18 14:28:17 INFO Remoting: Starting remoting
 14/11/18 14:28:17 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkyar...@longzhou-hdp3.lz.dscc:37187]
 14/11/18 14:28:17 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkyar...@longzhou-hdp3.lz.dscc:37187]
 14/11/18 14:28:17 INFO ExecutorLauncher: ApplicationAttemptId: 
 appattempt_1415961020140_0325_01
 14/11/18 14:28:17 INFO ExecutorLauncher: Connecting to ResourceManager at 
 longzhou-hdpnn.lz.dscc/192.168.19.107:12032
 14/11/18 14:28:17 INFO ExecutorLauncher: Registering the ApplicationMaster
 14/11/18 14:28:18 INFO ExecutorLauncher: Waiting for spark driver to be 
 reachable.
 14/11/18 14:28:18 INFO ExecutorLauncher: Master now available: 
 192.168.59.90:36691
 14/11/18 14:28:18 INFO ExecutorLauncher: Listen to driver: 
 akka.tcp://spark@192.168.59.90:36691/user/CoarseGrainedScheduler
 14/11/18 14:28:18 INFO ExecutorLauncher: Allocating 1 executors.
 14/11/18 14:28:18 INFO YarnAllocationHandler: Allocating 1 executor 
 containers with 1408 of memory each.
 14/11/18 14:28:18 INFO YarnAllocationHandler: ResourceRequest (host : *, num 
 containers: 1, priority = 1 , capability : memory: 1408)
 14/11/18 14:28:18 INFO YarnAllocationHandler: Allocating 1 executor 
 containers with 1408 of memory each.
 14/11/18 14:28:18 INFO YarnAllocationHandler: ResourceRequest (host : *, num 
 containers: 1, priority = 1 , capability : memory: 1408)
 14/11/18 14:28:18 INFO RackResolver: Resolved longzhou-hdp3.lz.dscc to /rack1
 14/11/18 14:28:18 INFO YarnAllocationHandler: launching container on 
 container_1415961020140_0325_01_02 host longzhou-hdp3.lz.dscc
 14/11/18 14:28:18 INFO ExecutorRunnable: Starting Executor Container
 14/11/18 14:28:18 INFO ExecutorRunnable: Connecting to ContainerManager at 
 longzhou-hdp3.lz.dscc:12040
 14/11/18 14:28:18 INFO ExecutorRunnable: Setting up ContainerLaunchContext
 14/11/18 14:28:18 INFO ExecutorRunnable: Preparing Local resources
 14/11/18 14:28:18 INFO ExecutorLauncher: All executors have launched.
 14/11/18 14:28:18 INFO ExecutorLauncher: Started progress reporter thread - 
 sleep time : 5000
 14/11/18 14:28:18 INFO YarnAllocationHandler: ResourceRequest (host : *, num 
 containers: 0, priority = 1 , capability : memory: 1408)
 14/11/18 14:28:18 INFO ExecutorRunnable: Prepared Local resources 
 Map(__spark__.jar - resource {, scheme: hdfs, host: 
 longzhou-hdpnn.lz.dscc, port: 11000, file: 
 /user/linqili/.sparkStaging/application_1415961020140_0325/spark-assembly-1.0.2-hadoop2.0.0-cdh4.2.1.jar,
  }, size: 134859131, timestamp: 1416292093988, type: FILE, visibility: 
 PRIVATE, )
 14/11/18 14:28:18 INFO ExecutorRunnable: Setting up executor with commands: 
 List($JAVA_HOME/bin/java, -server, -XX:OnOutOfMemoryError='kill %p', 
 -Xms1024m -Xmx1024m , 
 

Slave Node Management in Standalone Cluster

2014-11-18 Thread Kenichi Maehashi
Hi,

I'm operating Spark in a standalone cluster configuration (3 slaves) and
have some questions.

1. How can I stop the slave on a specific node?
   Under the `sbin/` directory, there are
`start-{all,master,slave,slaves}` and `stop-{all,master,slaves}`, but
no `stop-slave`. Is there any way to stop a specific (e.g., the
2nd) slave via the command line?

2. How can I check the cluster status from the command line?
   Is there any way to confirm that all Masters/Workers are up and
working without using the Web UI?

Thanks in advance!

-- 
Kenichi Maehashi

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how to know the Spark worker Mechanism

2014-11-18 Thread Yanbo Liang
Did you set spark.executor.extraLibraryPath to the directory where your
native library exists?
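
For reference, setting that property programmatically looks roughly like this
(a sketch; the library path is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Point executors at the directory holding the native .so files.
// "/opt/native-libs" is a placeholder, not a path from this thread.
val conf = new SparkConf()
  .setAppName("native-lib-example")
  .set("spark.executor.extraLibraryPath", "/opt/native-libs")
val sc = new SparkContext(conf)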

2014-11-18 16:13 GMT+08:00 tangweihan tangwei...@360.cn:

 I'm a newbee in Spark. I know that what the work should do is written in
 RDD.
 But I want to make the worker load a native lib and I can do something to
 change the content of the lib in memory.
 So how can I do. I can do it on driver, but not worker. I always get a
 fatal
 error. The jvm report A fatal error
  C  [libstdc++.so.6+0x64d24]
 std::_Rb_tree_rotate_left(std::_Rb_tree_node_base*,
 std::_Rb_tree_node_base*)+0x4



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/how-to-know-the-Spark-worker-Mechanism-tp19141.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




[Spark/ Spark Streaming] Spark 1.1.0 fails working with akka 2.3.6

2014-11-18 Thread Sourav Chandra
Hi,

I have created a Spark Streaming application based on spark-1.1.0. While
running, it failed with an akka jar version mismatch. Some other projects are
using akka 2.3.6, so I cannot change the akka version, as it would affect them.

What should I do?

*Caused by: akka.ConfigurationException: Akka JAR version [2.3.6] does not
match the provided config version [2.2.3]*
* at akka.actor.ActorSystem$Settings.init(ActorSystem.scala:209)
~[analytics-engine.jar:1.0.0]*
* at akka.actor.ActorSystemImpl.init(ActorSystem.scala:504)
~[analytics-engine.jar:1.0.0]*
* at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
~[analytics-engine.jar:1.0.0]*
* at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
~[analytics-engine.jar:1.0.0]*
* at
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
~[analytics-engine.jar:1.0.0]*
* at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
~[analytics-engine.jar:1.0.0]*
* at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
~[analytics-engine.jar:1.0.0]*
* at
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1446)
~[analytics-engine.jar:1.0.0]*
* at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
~[analytics-engine.jar:1.0.0]*
* at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1442)
~[analytics-engine.jar:1.0.0]*
* at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
~[analytics-engine.jar:1.0.0]*
* at org.apache.spark.SparkEnv$.create(SparkEnv.scala:153)
~[analytics-engine.jar:1.0.0]*
* at org.apache.spark.SparkContext.init(SparkContext.scala:203)
~[analytics-engine.jar:1.0.0]*
* at
com.livestream.analytics.engine.context.ApplicationContextFactory$.createSimpleContext(ApplicationContext.scala:63)
~[analytics-engine.jar:1.0.0]*
* ... 13 common frames omitted*

Thanks,
-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

sourav.chan...@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com


Re: how to know the Spark worker Mechanism

2014-11-18 Thread tangweihan
Ok. I didn't put it in the path, because this is not a lib I want to use
permanently.
Here is my code in the RDD:
val fileaddr = SparkFiles.get("segment.so")
System.load(fileaddr)
val config = SparkFiles.get("qsegconf.ini")
val segment = new Segment   // this is the native class
segment.init(config)        // this fails if the driver doesn't load the lib

I just use System.load to load this lib. But now I also call some
functions in the lib that change some objects inside the lib, and that results
in the fatal error. After I first load it in the driver, it works again in
standalone mode. I want to know how the job gets from the driver to the
worker, and how the native lib is loaded into the worker's memory. I also took
another sample lib: if its functions do not change the objects in the lib, the
lib can be loaded only on the workers. Another thing I want to know is
whether there is something like the cache archive in Hadoop Streaming.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-know-the-Spark-worker-Mechanism-tp19141p19146.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: inconsistent edge counts in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-11 01:51:43 +, Buttler, David buttl...@llnl.gov wrote:
 I am building a graph from a large CSV file.  Each record contains a couple 
 of nodes and about 10 edges.  When I try to load a large portion of the 
 graph, using multiple partitions, I get inconsistent results in the number of 
 edges between different runs.  However, if I use a single partition, or a 
 small portion of the CSV file (say 1000 rows), then I get a consistent number 
 of edges.  Is there anything I should be aware of as to why this could be 
 happening in GraphX?

Is it possible there's some nondeterminism in the way you're reading the file? 
It would be helpful if you could post the code you're using to load the graph.

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Slave Node Management in Standalone Cluster

2014-11-18 Thread Akhil Das
1. You can comment out the rest of the workers in the conf/slaves file and run
stop-slaves.sh from that machine to stop the specific worker.

2. There is no direct command for it, but you can do something like the
following:

$ curl localhost:8080 | grep Applications -C 10 | head -n20

Where localhost is your master machine and 8080 is the web UI port.

You can also look at the metrics
http://spark.apache.org/docs/latest/monitoring.html#metrics options for
a more sophisticated version.


Thanks
Best Regards

On Tue, Nov 18, 2014 at 1:57 PM, Kenichi Maehashi 
webmas...@kenichimaehashi.com wrote:

 Hi,

 I'm operating Spark in standalone cluster configuration (3 slaves) and
 have some question.

 1. How can I stop a slave on the specific node?
Under `sbin/` directory, there are
 `start-{all,master,slave,slaves}` and `stop-{all,master,slaves}`, but
 no `stop-slave`. Are there any way to stop the specific (e.g., the
 2nd) slave via command line?

 2. How can I check cluster status from command line?
Are there any way to confirm that all Master / Workers are up and
 working without using Web UI?

 Thanks in advance!

 --
 Kenichi Maehashi

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: GraphX / PageRank with edge weights

2014-11-18 Thread Ankur Dave
At 2014-11-13 21:28:52 +, Ommen, Jurgen omme0...@stthomas.edu wrote:
 I'm using GraphX and playing around with its PageRank algorithm. However, I 
 can't see from the documentation how to use edge weight when running PageRank.
 Is this possible to consider edge weights and how would I do it?

There's no built-in way to prefer certain edges over others; edge weights are 
just set to the inverse of the outdegree of the source vertex. But it's simple 
to modify the PageRank code [1] to use custom edge weights instead: (1) copy 
the PageRank.run method body to your own project, (2) change the type signature 
of the input graph from Graph[VD, ED] to Graph[VD, Double], and (3) remove the 
calls to outerJoinVertices and mapTriplets on lines 86 and 88.
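
Alternatively, here is a rough standalone sketch of a weighted PageRank written
directly against the Pregel API. It assumes the edge attributes already hold
weights normalized so that each vertex's outgoing weights sum to 1; it is an
illustration, not the modified PageRank.run described above:

import org.apache.spark.graphx._

// weightedGraph: edge attribute = normalized weight of that edge.
def weightedPageRank[VD](weightedGraph: Graph[VD, Double],
                         numIter: Int = 10,
                         resetProb: Double = 0.15): Graph[Double, Double] = {
  // Start every vertex with rank 1.0.
  val init = weightedGraph.mapVertices((_, _) => 1.0)
  Pregel(init, initialMsg = 0.0, maxIterations = numIter)(
    // Standard PageRank update, applied to the weighted message sum.
    vprog = (_, _, msgSum) => resetProb + (1.0 - resetProb) * msgSum,
    // Each edge forwards the source rank scaled by its weight.
    sendMsg = triplet => Iterator((triplet.dstId, triplet.srcAttr * triplet.attr)),
    mergeMsg = _ + _
  )
}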

Ankur

[1] 
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L79

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Null pointer exception with larger datasets

2014-11-18 Thread Naveen Kumar Pokala
Thanks Akhil.

-Naveen.


From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, November 18, 2014 1:19 PM
To: Naveen Kumar Pokala
Cc: user@spark.apache.org
Subject: Re: Null pointer exception with larger datasets

Make sure your list is not null, if that is null then its more like doing:

JavaRDD<Student> distData = sc.parallelize(null)

distData.foreach(println)


Thanks
Best Regards

On Tue, Nov 18, 2014 at 12:07 PM, Naveen Kumar Pokala 
npok...@spcapitaliq.com wrote:
Hi,

I have a list of Students whose size is one lakh (100,000), and I am trying to
save it to a file. It is throwing a null pointer exception.

JavaRDD<Student> distData = sc.parallelize(list);

distData.saveAsTextFile("hdfs://master/data/spark/instruments.txt");


14/11/18 01:33:21 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 0.0 
(TID 5, master): java.lang.NullPointerException:
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)


How to handle this?

-Naveen



Re: Pagerank implementation

2014-11-18 Thread Ankur Dave
At 2014-11-15 18:01:22 -0700, tom85 tom.manha...@gmail.com wrote:
 This line: val newPR = oldPR + (1.0 - resetProb) * msgSum
 makes no sense to me. Should it not be:
 val newPR = resetProb/graph.vertices.count() + (1.0 - resetProb) * msgSum 
 ?

This is an unusual version of PageRank where the messages being passed around 
are deltas rather than full ranks. This occurs at line 156 in vertexProgram, 
which returns (newPR, newPR - oldPR). The second element of the tuple is the 
delta, which is subsequently used in sendMessage.

The benefit of this is that sendMessage can avoid sending when the delta drops 
below the convergence threshold `tol`, indicating that the source vertex has 
converged. But it means that to update the rank of each vertex, we have to add 
the incoming delta to its existing rank. That's why the oldPR term appears in 
the line you're looking at.
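
Putting that together, the vertex program being described has roughly this shape
(a paraphrase of the idea, with resetProb pulled out as a parameter; not the
exact source):

import org.apache.spark.graphx.VertexId

// Vertex attribute is (currentRank, lastDelta); messages carry summed deltas.
def vertexProgram(resetProb: Double)(id: VertexId, attr: (Double, Double),
                                     msgSum: Double): (Double, Double) = {
  val (oldPR, _) = attr
  // Add the incoming delta-based contribution to the existing rank...
  val newPR = oldPR + (1.0 - resetProb) * msgSum
  // ...and carry the change forward; sendMessage stops propagating once this
  // delta falls below `tol`.
  (newPR, newPR - oldPR)
}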

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



New Codes in GraphX

2014-11-18 Thread Deep Pradhan
Hi,
I am using Spark-1.0.0. There are two GraphX directories that I can see here

1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx,
which contains LiveJournalPageRank.scala

2. spark-1.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib, which
contains Analytics.scala, ConnectedComponents.scala, etc.

Now, if I want to add my own code to GraphX i.e., if I want to write a
small application on GraphX, in which directory should I add my code, in 1
or 2 ? And what is the difference?

Can anyone tell me something on this?

Thank You


Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-17 14:47:50 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
 I was going through the graphx section in the Spark API in
 https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$

 Here, I find the word landmark. Can anyone explain to me what is landmark
 means. Is it a simple English word or does it mean something else in graphx.

The landmarks in the context of the shortest-paths algorithm are just the 
vertices of interest. For each vertex in the graph, the algorithm will return 
the distance to each of the landmark vertices.
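
A minimal usage sketch (the tiny graph and landmark IDs below are made up, and 
sc is assumed to be an existing SparkContext):

import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.ShortestPaths

// A toy chain graph 1 -> 2 -> 3 -> 4.
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// The landmarks are simply the vertices of interest.
val result = ShortestPaths.run(graph, Seq(1L, 4L))

// Each vertex attribute is now a Map[VertexId, Int] of hop counts to the
// landmarks it can reach.
result.vertices.collect().foreach(println)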

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Deep Pradhan
So landmarks can contain just one vertex, right?
Which algorithm is used to compute the shortest paths?

Thank You

On Tue, Nov 18, 2014 at 2:53 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-11-17 14:47:50 +0530, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  I was going through the graphx section in the Spark API in
 
 https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$
 
  Here, I find the word landmark. Can anyone explain to me what is
 landmark
  means. Is it a simple English word or does it mean something else in
 graphx.

 The landmarks in the context of the shortest-paths algorithm are just
 the vertices of interest. For each vertex in the graph, the algorithm will
 return the distance to each of the landmark vertices.

 Ankur



Re: Running PageRank in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 12:02:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
 I just ran the PageRank code in GraphX with some sample data. What I am
 seeing is that the total rank changes drastically if I change the number of
 iterations from 10 to 100. Why is that so?

As far as I understand, the total rank should asymptotically approach the 
number of vertices in the graph, assuming there are no vertices of zero 
outdegree. Does that seem to be the case for your graph?

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Running PageRank in GraphX

2014-11-18 Thread Deep Pradhan
There are no vertices of zero outdegree.
The total rank for the graph with numIter = 10 is 4.99, and for the graph
with numIter = 100 it is 5.99.
I do not know why there is so much variation.

On Tue, Nov 18, 2014 at 3:22 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-11-18 12:02:52 +0530, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  I just ran the PageRank code in GraphX with some sample data. What I am
  seeing is that the total rank changes drastically if I change the number
 of
  iterations from 10 to 100. Why is that so?

 As far as I understand, the total rank should asymptotically approach the
 number of vertices in the graph, assuming there are no vertices of zero
 outdegree. Does that seem to be the case for your graph?

 Ankur



Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-18 14:59:20 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
 So landmark can contain just one vertex right?

Right.

 Which algorithm has been used to compute the shortest path?

It's distributed Bellman-Ford.

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Deep Pradhan
Does Bellman-Ford give the best solution?

On Tue, Nov 18, 2014 at 3:27 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-11-18 14:59:20 +0530, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  So landmark can contain just one vertex right?

 Right.

  Which algorithm has been used to compute the shortest path?

 It's distributed Bellman-Ford.

 Ankur



Re: New Codes in GraphX

2014-11-18 Thread Deep Pradhan
The codes that are present in 2 can be run with the command

*$SPARK_HOME/bin/spark-submit --master local[*] --class
org.apache.spark.graphx.lib.Analytics
$SPARK_HOME/assembly/target/scala-2.10/spark-assembly-*.jar pagerank
/edge-list-file.txt --numEPart=8 --numIter=10
--partStrategy=EdgePartition2D*

Now, how do I run the LiveJournalPageRank.scala that is there in 1?



On Tue, Nov 18, 2014 at 2:51 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 I am using Spark-1.0.0. There are two GraphX directories that I can see
 here

 1. spark-1.0.0/examples/src/main/scala/org/apache/sprak/examples/graphx
 which contains LiveJournalPageRank,scala

 2. spark-1.0.0/graphx/src/main/scala/org/apache/sprak/graphx/lib which
 contains   Analytics.scala, ConnectedComponenets.scala etc etc

 Now, if I want to add my own code to GraphX i.e., if I want to write a
 small application on GraphX, in which directory should I add my code, in 1
 or 2 ? And what is the difference?

 Can anyone tell me something on this?

 Thank You



Re: New Codes in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 14:51:54 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
 I am using Spark-1.0.0. There are two GraphX directories that I can see here

 1. spark-1.0.0/examples/src/main/scala/org/apache/sprak/examples/graphx
 which contains LiveJournalPageRank,scala

 2. spark-1.0.0/graphx/src/main/scala/org/apache/sprak/graphx/lib which
 contains   Analytics.scala, ConnectedComponenets.scala etc etc

 Now, if I want to add my own code to GraphX i.e., if I want to write a
 small application on GraphX, in which directory should I add my code, in 1
 or 2 ? And what is the difference?

If you want to add an algorithm which you can call from the Spark shell and 
submit as a pull request, you should add it to org.apache.spark.graphx.lib 
(#2). To run it from the command line, you'll also have to modify 
Analytics.scala.

If you want to write a separate application, the ideal way is to do it in a 
separate project that links in Spark as a dependency [1]. It will also work to 
put it in either #1 or #2, but this will be worse in the long term because each 
build cycle will require you to rebuild and restart all of Spark rather than 
just building your application and calling spark-submit on the new JAR.
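
For what it's worth, a minimal build.sbt for such a separate project might look
like this (the version numbers are only examples):

// build.sbt for a standalone GraphX application; versions are illustrative.
name := "my-graphx-app"

version := "0.1"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "1.0.2" % "provided",
  "org.apache.spark" %% "spark-graphx" % "1.0.2" % "provided"
)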

Ankur

[1] http://spark.apache.org/docs/1.0.2/quick-start.html#standalone-applications

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Slave Node Management in Standalone Cluster

2014-11-18 Thread Kousuke Saruta

Hi Kenichi


1. How can I stop a slave on the specific node?
   Under `sbin/` directory, there are
`start-{all,master,slave,slaves}` and `stop-{all,master,slaves}`, but
no `stop-slave`. Are there any way to stop the specific (e.g., the
2nd) slave via command line?


You can use sbin/spark-daemon.sh on the machine where the worker you'd
like to stop is running.
First, find the PID of the worker you'd like to stop, and second, find the
PID file of that worker.

The PID file is under /tmp/ by default, and the file name looks like the following:

xxx.org.apache.spark.deploy.worker.Worker-WorkerID.pid

After you find the PID file, run the following command:

sbin/spark-daemon.sh stop org.apache.spark.deploy.worker.Worker WorkerID


2. How can I check cluster status from command line?
   Are there any way to confirm that all Master / Workers are up and
working without using Web UI?


AFAIK, there are no command line tools for checking the status of a
standalone cluster.

Instead, you can use a special URL like the following:

http://<master or worker hostname>:<webui-port>/json

You can get the Master and Worker status as JSON-formatted data.
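
For example, from the Spark shell or any Scala REPL you could pull that JSON
with something like this (host and port are placeholders):

// Fetch the standalone master's status as JSON.
val masterStatus = scala.io.Source.fromURL("http://spark-master:8080/json").mkString
println(masterStatus)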

- Kousuke

(2014/11/18 0:27), Kenichi Maehashi wrote:

Hi,

I'm operating Spark in standalone cluster configuration (3 slaves) and
have some question.

1. How can I stop a slave on the specific node?
Under `sbin/` directory, there are
`start-{all,master,slave,slaves}` and `stop-{all,master,slaves}`, but
no `stop-slave`. Are there any way to stop the specific (e.g., the
2nd) slave via command line?

2. How can I check cluster status from command line?
Are there any way to confirm that all Master / Workers are up and
working without using Web UI?

Thanks in advance!




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:29:08 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
 Does Bellman-Ford give the best solution?

It gives the same solution as any other algorithm, since there's only one 
correct solution for shortest paths and it's guaranteed to find it eventually. 
There are probably faster distributed algorithms for single-source shortest 
paths, but I'm not familiar with them. As far as I can tell, distributed 
Bellman-Ford is the most widely-used one.

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Streaming with Kafka is failing with Error

2014-11-18 Thread Sourav Chandra
Hi,

While running my Spark Streaming application built on Spark 1.1.0, I am
getting the error below.

*14/11/18 15:35:30 ERROR ReceiverTracker: Deregistered receiver for stream
0: Error starting receiver 0 - java.lang.AbstractMethodError*
* at org.apache.spark.Logging$class.log(Logging.scala:52)*
* at
org.apache.spark.streaming.kafka.KafkaReceiver.log(KafkaInputDStream.scala:66)*
* at org.apache.spark.Logging$class.logInfo(Logging.scala:59)*
* at
org.apache.spark.streaming.kafka.KafkaReceiver.logInfo(KafkaInputDStream.scala:66)*
* at
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:86)*
* at
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)*
* at
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)*
* at
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)*
* at
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)*
* at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)*
* at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)*
* at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)*
* at org.apache.spark.scheduler.Task.run(Task.scala:54)*
* at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)*
* at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)*
* at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)*
* at java.lang.Thread.run(Thread.java:722)*



Can you guys please help me out here?
-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

sourav.chan...@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com


Re: New Codes in GraphX

2014-11-18 Thread Deep Pradhan
What command should I use to run the LiveJournalPageRank.scala?

 If you want to write a separate application, the ideal way is to do it in
a separate project that links in Spark as a dependency [1].
But even for this, I have to do the build every time I change the code,
right?

Thank You

On Tue, Nov 18, 2014 at 3:35 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-11-18 14:51:54 +0530, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  I am using Spark-1.0.0. There are two GraphX directories that I can see
 here
 
  1. spark-1.0.0/examples/src/main/scala/org/apache/sprak/examples/graphx
  which contains LiveJournalPageRank,scala
 
  2. spark-1.0.0/graphx/src/main/scala/org/apache/sprak/graphx/lib which
  contains   Analytics.scala, ConnectedComponenets.scala etc etc
 
  Now, if I want to add my own code to GraphX i.e., if I want to write a
  small application on GraphX, in which directory should I add my code, in
 1
  or 2 ? And what is the difference?

 If you want to add an algorithm which you can call from the Spark shell
 and submit as a pull request, you should add it to
 org.apache.spark.graphx.lib (#2). To run it from the command line, you'll
 also have to modify Analytics.scala.

 If you want to write a separate application, the ideal way is to do it in
 a separate project that links in Spark as a dependency [1]. It will also
 work to put it in either #1 or #2, but this will be worse in the long term
 because each build cycle will require you to rebuild and restart all of
 Spark rather than just building your application and calling spark-submit
 on the new JAR.

 Ankur

 [1]
 http://spark.apache.org/docs/1.0.2/quick-start.html#standalone-applications



Re: New Codes in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:35:13 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
 Now, how do I run the LiveJournalPageRank.scala that is there in 1?

I think it should work to use

MASTER=local[*] $SPARK_HOME/bin/run-example graphx.LiveJournalPageRank 
/edge-list-file.txt --numEPart=8 --numIter=10
 --partStrategy=EdgePartition2D

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: New Codes in GraphX

2014-11-18 Thread Deep Pradhan
Yes, the above command works, but there is this problem: most of the time,
the total rank is NaN (Not a Number). Why is that so?

Thank You

On Tue, Nov 18, 2014 at 3:48 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 What command should I use to run the LiveJournalPageRank.scala?

  If you want to write a separate application, the ideal way is to do it
 in a separate project that links in Spark as a dependency [1].
 But even for this, I have to do the build every time I change the code,
 right?

 Thank You

 On Tue, Nov 18, 2014 at 3:35 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-11-18 14:51:54 +0530, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  I am using Spark-1.0.0. There are two GraphX directories that I can see
 here
 
  1. spark-1.0.0/examples/src/main/scala/org/apache/sprak/examples/graphx
  which contains LiveJournalPageRank,scala
 
  2. spark-1.0.0/graphx/src/main/scala/org/apache/sprak/graphx/lib which
  contains   Analytics.scala, ConnectedComponenets.scala etc etc
 
  Now, if I want to add my own code to GraphX i.e., if I want to write a
  small application on GraphX, in which directory should I add my code,
 in 1
  or 2 ? And what is the difference?

 If you want to add an algorithm which you can call from the Spark shell
 and submit as a pull request, you should add it to
 org.apache.spark.graphx.lib (#2). To run it from the command line, you'll
 also have to modify Analytics.scala.

 If you want to write a separate application, the ideal way is to do it in
 a separate project that links in Spark as a dependency [1]. It will also
 work to put it in either #1 or #2, but this will be worse in the long term
 because each build cycle will require you to rebuild and restart all of
 Spark rather than just building your application and calling spark-submit
 on the new JAR.

 Ankur

 [1]
 http://spark.apache.org/docs/1.0.2/quick-start.html#standalone-applications





Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:44:31 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
 I meant to ask whether it gives the solution faster than other algorithms.

No, it's just that it's much simpler and easier to implement than the others. 
Section 5.2 of the Pregel paper [1] justifies using it for a graph (a binary 
tree) with 1 billion vertices on 300 machines:

More advanced parallel algorithms exist, e.g., Thorup [44] or the ∆-stepping
method [37], and have been used as the basis for special-purpose parallel
shortest paths implementations [12, 32]. Such advanced algorithms can also
be expressed in the Pregel framework. The simplicity of the implementation
in Figure 5, however, together with the already acceptable performance (see
Section 6), may appeal to users who can't do extensive tuning or
customization.

 What do you mean by distributed algorithms? Can we not use any algorithm on
 a distributed environment?

Any algorithm can be split up and run in a distributed environment, but because 
inter-node coordination is expensive, that can be very inefficient. Distributed 
algorithms in this context are ones that reduce coordination.

Ankur

[1] http://db.cs.berkeley.edu/cs286/papers/pregel-sigmod2010.pdf

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: New Codes in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:51:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
 Yes the above command works, but there is this problem. Most of the times,
 the total rank is Nan (Not a Number). Why is it so?

I've also seen this, but I'm not sure why it happens. If you could find out 
which vertices are getting the NaN rank, it might be helpful in tracking down 
the problem.

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Kestrel and Spark Stream

2014-11-18 Thread Eduardo Alfaia
Hi guys,
Has anyone already tried doing this work?

Thanks
-- 
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Kestrel and Spark Stream

2014-11-18 Thread Akhil Das
You can implement a custom receiver
http://spark.apache.org/docs/latest/streaming-custom-receivers.html to
connect to Kestrel and use it. I think someone has already tried it; not
sure if it is working though. Here's the link:
https://github.com/prabeesh/Spark-Kestrel/blob/master/streaming/src/main/scala/spark/streaming/dstream/KestrelInputDStream.scala
.
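
As a starting point, the skeleton of such a receiver would be roughly the
following. The Kestrel-specific polling is left as a placeholder; only the
Receiver API calls are real:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Skeleton only: the actual Kestrel client calls are placeholders.
class KestrelReceiver(host: String, port: Int, queue: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a background thread that polls Kestrel and hands items to Spark.
    new Thread("Kestrel Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Close the Kestrel connection here (omitted in this sketch).
  }

  private def receive(): Unit = {
    while (!isStopped()) {
      // val item = kestrelClient.get(queue)   // placeholder for a real client call
      val item: Option[String] = None
      item.foreach(x => store(x))              // push each received item into Spark
    }
  }
}

// Usage: val stream = ssc.receiverStream(new KestrelReceiver("host", 22133, "queue"))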

Thanks
Best Regards

On Tue, Nov 18, 2014 at 4:23 PM, Eduardo Alfaia e.costaalf...@unibs.it
wrote:

 Hi guys,
 Has anyone already tried doing this work?

 Thanks

 Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Kestrel and Spark Stream

2014-11-18 Thread prabeesh k
You can refer to the following link:
https://github.com/prabeesh/Spark-Kestrel

On Tue, Nov 18, 2014 at 3:51 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 You can implement a custom receiver
 http://spark.apache.org/docs/latest/streaming-custom-receivers.html to
 connect to Kestrel and use it. I think someone have already tried it, not
 sure if it is working though. Here's the link
 https://github.com/prabeesh/Spark-Kestrel/blob/master/streaming/src/main/scala/spark/streaming/dstream/KestrelInputDStream.scala
 .

 Thanks
 Best Regards

 On Tue, Nov 18, 2014 at 4:23 PM, Eduardo Alfaia e.costaalf...@unibs.it
 wrote:

 Hi guys,
 Has anyone already tried doing this work?

 Thanks

 Informativa sulla Privacy: http://www.unibs.it/node/8155





How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread shahab
Hi,

In my Spark application, I am loading some rows from a database into Spark
RDDs.
Each row has several fields and a string key. Due to my requirements, I
need to work with consecutive numeric ids (from 1 to N, where N is
the number of unique keys) instead of string keys. Also, several rows can
have the same string key.

In the Spark context, how can I map each row into (Numeric_Key, OriginalRow) with
map/reduce tasks such that rows with the same original string key get the same
consecutive numeric key?

Any hints?

best,
/Shahab


Re: Building Spark with hive does not work

2014-11-18 Thread Cheng Lian

Ah... Thanks Ted! And Hao, sorry for being the original trouble maker :)

On 11/18/14 1:50 AM, Ted Yu wrote:

Looks like this was where you got that commandline:

http://search-hadoop.com/m/JW1q5RlPrl

Cheers

On Mon, Nov 17, 2014 at 9:44 AM, Hao Ren inv...@gmail.com wrote:


Sry for spamming,

Just after my previous post, I noticed that the command used is:

./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly

thriftserver* (not thirftserver)

The typo was the evil. Stupid me.

I believe I just copy-pasted it from somewhere else, but never even
checked it; meanwhile no error msg, such as "no such option", was
displayed, which made me think the flags were correct.

Sry for the carelessness.

Hao




--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Building-Spark-with-hive-does-not-work-tp19072p19087.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






ReduceByKey but with different functions depending on key

2014-11-18 Thread jelgh
Hello everyone,

I'm new to Spark and I have the following problem:

I have this large JavaRDD<MyClass> collection, which I group by
creating a hashcode from some fields in MyClass:

JavaRDD<MyClass> collection = ...;
JavaPairRDD<Integer, Iterable<MyClass>> grouped =
collection.groupBy(...); // the group function just creates a hashcode
                         // from some fields in MyClass

Now I want to reduce the variable grouped. However, I want to reduce it with
different functions depending on the key in the JavaPairRDD. So basically a
reduceByKey but with multiple functions.

The only solution I've come up with is filtering grouped for each reduce
function and applying it to the filtered subsets. This feels kinda hackish
though.

Is there a better way? 

Best regards,
Johannes



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-but-with-different-functions-depending-on-key-tp19177.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Cheng Lian

A not so efficient way can be this:

val r0: RDD[OriginalRow] = ...
val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
val r2 = r1.keys.distinct().zipWithIndex()
val r3 = r2.join(r1).values

On 11/18/14 8:54 PM, shahab wrote:


Hi,

In my spark application, I am loading some rows from database into 
Spark RDDs
Each row has several fields, and a string key. Due to my requirements 
I need to work with consecutive numeric ids (starting from 1 to N, 
where N is the number of unique keys) instead of string keys . Also 
several rows can have same string key .


In spark context, how I can map each row into (Numeric_Key, 
OriginalRow) as map/reduce  tasks such that rows with same original 
string key get same numeric consecutive key?


Any hints?

best,
/Shahab


​


Getting spark job progress programmatically

2014-11-18 Thread Aniket Bhatnagar
I am writing yet another Spark job server and have been able to submit jobs
and return/save results. I let multiple jobs use the same spark context but
I set job group while firing each job so that I can in future cancel jobs.
Further, what I'd like to do is provide some kind of status
update/progress on running jobs (a % completion would be awesome), but I am
unable to figure out the appropriate Spark API to use. I do however see status
reporting in spark UI so there must be a way to get status of various
stages per job group. Any hints on what APIs should I look at?


Is sorting persisted after pair rdd transformations?

2014-11-18 Thread Aniket Bhatnagar
I am trying to figure out if sorting is persisted after applying Pair RDD
transformations and I am not able to decisively tell after reading the
documentation.

For example:
val numbers = .. // RDD of numbers
val pairedNumbers = numbers.map(number => (number % 100, number))
val sortedPairedNumbers = pairedNumbers.sortBy(pairedNumber =>
pairedNumber._2) // Sort by values in the pair
val aggregates = sortedPairedNumbers.combineByKey(..)

In this example, will the combine functions see values in sorted order?
What if I had done groupByKey and then combineByKey? What transformations
can unsort already sorted data?


Re: Getting spark job progress programmatically

2014-11-18 Thread andy petrella
I started some quick hack for that in the notebook, you can head to:
https://github.com/andypetrella/spark-notebook/blob/master/common/src/main/scala/notebook/front/widgets/SparkInfo.scala

On Tue Nov 18 2014 at 2:44:48 PM Aniket Bhatnagar 
aniket.bhatna...@gmail.com wrote:

 I am writing yet another Spark job server and have been able to submit
 jobs and return/save results. I let multiple jobs use the same spark
 context but I set job group while firing each job so that I can in future
 cancel jobs. Further, what I deserve to do is provide some kind of status
 update/progress on running jobs (a % completion but be awesome) but I am
 unable to figure out appropriate spark API to use. I do however see status
 reporting in spark UI so there must be a way to get status of various
 stages per job group. Any hints on what APIs should I look at?


Re: Building Spark with hive does not work

2014-11-18 Thread Hao Ren
nvm, it would be better if the correctness of flags could be checked by sbt
during the build.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Building-Spark-with-hive-does-not-work-tp19072p19183.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Getting spark job progress programmatically

2014-11-18 Thread Aniket Bhatnagar
Thanks Andy. This is very useful. This gives me all active stages & their
percentage completion but I am unable to tie stages to job group (or
specific job). I looked at Spark's code and to me, it
seems org.apache.spark.scheduler.ActiveJob's group ID should get propagated
to StageInfo (possibly in the StageInfo.fromStage method). For now, I will
have to write my own version JobProgressListener that stores stageId to
group Id mapping.
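
Roughly, the kind of listener I have in mind looks like this (a sketch only;
tying stages back to a job group is exactly the part that still needs the group
id to be available here):

import scala.collection.mutable
import org.apache.spark.scheduler._

// Tracks per-stage task completion so a rough % progress can be reported.
class StageProgressListener extends SparkListener {
  private val totalTasks     = mutable.Map[Int, Int]()
  private val completedTasks = mutable.Map[Int, Int]()

  override def onStageSubmitted(s: SparkListenerStageSubmitted): Unit = synchronized {
    totalTasks(s.stageInfo.stageId) = s.stageInfo.numTasks
  }

  override def onTaskEnd(t: SparkListenerTaskEnd): Unit = synchronized {
    completedTasks(t.stageId) = completedTasks.getOrElse(t.stageId, 0) + 1
  }

  def progress(stageId: Int): Double = synchronized {
    val total = totalTasks.getOrElse(stageId, 0)
    if (total == 0) 0.0 else completedTasks.getOrElse(stageId, 0).toDouble / total
  }
}

// Registered once on the shared context: sc.addSparkListener(new StageProgressListener)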

I will submit a JIRA ticket and seek spark dev's opinion on this. Many
thanks for your prompt help Andy.

Thanks,
Aniket

On Tue Nov 18 2014 at 19:40:06 andy petrella andy.petre...@gmail.com
wrote:

 I started some quick hack for that in the notebook, you can head to:
 https://github.com/andypetrella/spark-notebook/
 blob/master/common/src/main/scala/notebook/front/widgets/SparkInfo.scala

 On Tue Nov 18 2014 at 2:44:48 PM Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 I am writing yet another Spark job server and have been able to submit
 jobs and return/save results. I let multiple jobs use the same spark
 context but I set job group while firing each job so that I can in future
 cancel jobs. Further, what I deserve to do is provide some kind of status
 update/progress on running jobs (a % completion but be awesome) but I am
 unable to figure out appropriate spark API to use. I do however see status
 reporting in spark UI so there must be a way to get status of various
 stages per job group. Any hints on what APIs should I look at?




Re: Getting spark job progress programmatically

2014-11-18 Thread andy petrella
yep, we should also propose adding this stuff to the public API.

Any other ideas?

On Tue Nov 18 2014 at 4:03:35 PM Aniket Bhatnagar 
aniket.bhatna...@gmail.com wrote:

 Thanks Andy. This is very useful. This gives me all active stages  their
 percentage completion but I am unable to tie stages to job group (or
 specific job). I looked at Spark's code and to me, it
 seems org.apache.spark.scheduler.ActiveJob's group ID should get propagated
 to StageInfo (possibly in the StageInfo.fromStage method). For now, I will
 have to write my own version JobProgressListener that stores stageId to
 group Id mapping.

 I will submit a JIRA ticket and seek spark dev's opinion on this. Many
 thanks for your prompt help Andy.

 Thanks,
 Aniket


 On Tue Nov 18 2014 at 19:40:06 andy petrella andy.petre...@gmail.com
 wrote:

 I started some quick hack for that in the notebook, you can head to:
 https://github.com/andypetrella/spark-notebook/
 blob/master/common/src/main/scala/notebook/front/widgets/SparkInfo.scala

 On Tue Nov 18 2014 at 2:44:48 PM Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 I am writing yet another Spark job server and have been able to submit
 jobs and return/save results. I let multiple jobs use the same spark
 context but I set job group while firing each job so that I can in future
 cancel jobs. Further, what I deserve to do is provide some kind of status
 update/progress on running jobs (a % completion but be awesome) but I am
 unable to figure out appropriate spark API to use. I do however see status
 reporting in spark UI so there must be a way to get status of various
 stages per job group. Any hints on what APIs should I look at?




sum/avg group by specified ranges

2014-11-18 Thread tridib
Hello Experts,
I need to get the total of an amount field for specified date ranges. Now that
group by on a calculated field does not work
(https://issues.apache.org/jira/browse/SPARK-4296), what is the best way to
get this done?

I thought to do it using spark, but I suspect I will miss the performance of
spark sql on top of parquet file. Any suggestion?
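
One workaround sketch in plain Spark (Scala, untested): compute the range key
on the RDD side and aggregate with reduceByKey instead of grouping by a
calculated field in SQL. The row type and the monthly bucketing rule below are
just assumptions:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

case class Txn(date: String, amount: Double)            // hypothetical row type

// bucket each date into the range of interest; by month here as an example
def bucket(date: String): String = date.substring(0, 7)

def totalsByRange(txns: RDD[Txn]): RDD[(String, Double)] =
  txns.map(t => (bucket(t.date), t.amount)).reduceByKey(_ + _)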

Thanks & Regards
Tridib




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sum-avg-group-by-specified-ranges-tp19187.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Status of MLLib exporting models to PMML

2014-11-18 Thread Charles Earl
Yes,
The case is convincing for PMML with Oryx. I will also investigate
parameter server.
Cheers,
Charles

On Tuesday, November 18, 2014, Sean Owen so...@cloudera.com wrote:

 I'm just using PMML. I haven't hit any limitation of its
 expressiveness for the model types it supports. I don't think there
 is a point in defining a new format for models, except that PMML
 can get very big. Still, just compressing the XML gets it down to a
 manageable size for just about any realistic model.*

 I can imagine some kind of translation from PMML-in-XML to
 PMML-in-something-else that is more compact. I've not seen anyone do
 this.

 * there still aren't formats for factored matrices and probably won't
 ever quite be, since they're just too large for a file format.

 On Tue, Nov 18, 2014 at 5:34 AM, Manish Amde manish...@gmail.com
 javascript:; wrote:
  Hi Charles,
 
  I am not aware of other storage formats. Perhaps Sean or Sandy can
 elaborate
  more given their experience with Oryx.
 
  There is work by Smola et al at Google that talks about large scale model
  update and deployment.
 
 https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu
 
  -Manish
 



-- 
- Charles


Nightly releases

2014-11-18 Thread Arun Ahuja
Are nightly releases posted anywhere?  There are quite a few vital bugfixes
and performance improvements being committed to Spark, and using the latest
commits is useful (or even necessary for some jobs).

Is there a place to post them? It doesn't seem like it would be difficult to
run make-dist nightly and place the result somewhere.

Is it possible to extract this from the Jenkins builds?

Thanks,
Arun
 ​


Re: Nightly releases

2014-11-18 Thread Arun Ahuja
Of course we can run this ourselves to get the latest, but the build is
fairly long and this seems like a resource many would need.

On Tue, Nov 18, 2014 at 10:21 AM, Arun Ahuja aahuj...@gmail.com wrote:

 Are nightly releases posted anywhere?  There are quite a few vital
 bugfixes and performance improvements being committed to Spark, and using the
 latest commits is useful (or even necessary for some jobs).

 Is there a place to post them? It doesn't seem like it would be difficult to
 run make-dist nightly and place the result somewhere.

 Is it possible to extract this from the Jenkins builds?

 Thanks,
 Arun
  ​



Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread Yanbo
First use groupByKey(); you get a pair RDD of (key: K, value: ArrayBuffer[V]).
Then use map() on this RDD with a function that applies different operations
depending on the key, which acts as a parameter of this function.
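
A quick sketch of this in Scala (untested; the String/Double types and the
per-key operations below are only for illustration):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def reduceWithPerKeyLogic(pairs: RDD[(String, Double)]): RDD[(String, Double)] =
  pairs.groupByKey().map { case (k, values) =>
    k match {
      case "sum-me" => (k, values.sum)   // one operation for this key
      case _        => (k, values.max)   // a different one for the rest
    }
  }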


 On 2014-11-18, at 8:59 PM, jelgh johannes.e...@gmail.com wrote:
 
 Hello everyone,
 
 I'm new to Spark and I have the following problem:
 
 I have this large JavaRDD<MyClass> collection, which I group by
 creating a hashcode from some fields in MyClass:
 
 JavaRDD<MyClass> collection = ...;
 JavaPairRDD<Integer, Iterable<MyClass>> grouped =
 collection.groupBy(...); // the group-function just creates a hashcode
 from some fields in MyClass.
 
 Now I want to reduce the variable grouped. However, I want to reduce it with
 different functions depending on the key in the JavaPairRDD. So basically a
 reduceByKey but with multiple functions.
 
 The only solution I've come up with is filtering grouped for each reduce
 function and applying it on the filtered subsets. This feels kinda hackish
 though.
 
 Is there a better way? 
 
 Best regards,
 Johannes
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-but-with-different-functions-depending-on-key-tp19177.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread Debasish Das
groupByKey does not run a combiner, so be careful about the
performance... groupByKey shuffles even for local groups...

reduceByKey and aggregateByKey do run a combiner, but if you want a
separate function for each key, you can have a key-to-closure map that you
broadcast and use in reduceByKey, provided you have access to the key in
reduceByKey/aggregateByKey...

I have not had the need to access the key in reduceByKey/aggregateByKey yet,
but there should be a way...
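
A rough sketch of the broadcast idea (Scala, untested): since the function
passed to reduceByKey never sees the key, carry the key inside the value. The
concrete types and the example reducers are only illustrative, and a real
version would need a default for keys missing from the map:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def reducePerKey(sc: SparkContext, data: RDD[(String, Double)]): RDD[(String, Double)] = {
  val reducers: Map[String, (Double, Double) => Double] =
    Map("a" -> (_ + _), "b" -> ((x, y) => math.max(x, y)))
  val reducersBc = sc.broadcast(reducers)
  data
    .map { case (k, v) => (k, (k, v)) }                              // keep the key next to the value
    .reduceByKey((x, y) => (x._1, reducersBc.value(x._1)(x._2, y._2)))
    .mapValues(_._2)
}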

On Tue, Nov 18, 2014 at 7:24 AM, Yanbo yanboha...@gmail.com wrote:

 First use groupByKey(); you get a pair RDD of
 (key: K, value: ArrayBuffer[V]).
 Then use map() on this RDD with a function that applies different operations
 depending on the key, which acts as a parameter of this function.


  On 2014-11-18, at 8:59 PM, jelgh johannes.e...@gmail.com wrote:
 
  Hello everyone,
 
  I'm new to Spark and I have the following problem:
 
  I have this large JavaRDD<MyClass> collection, which I group by
  creating a hashcode from some fields in MyClass:
 
  JavaRDD<MyClass> collection = ...;
  JavaPairRDD<Integer, Iterable<MyClass>> grouped =
  collection.groupBy(...); // the group-function just creates a
 hashcode
  from some fields in MyClass.
 
  Now I want to reduce the variable grouped. However, I want to reduce it
 with
  different functions depending on the key in the JavaPairRDD. So
 basically a
  reduceByKey but with multiple functions.
 
  The only solution I've come up with is filtering grouped for each reduce
  function and applying it on the filtered subsets. This feels kinda hackish
  though.
 
  Is there a better way?
 
  Best regards,
  Johannes
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-but-with-different-functions-depending-on-key-tp19177.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Daniel Siegmann
I think zipWithIndex is zero-based, so if you want 1 to N, you'll need to
increment them like so:

val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)

If the number of distinct keys is relatively small, you might consider
collecting them into a map and broadcasting them rather than using a join,
like so:

val keyIndices = sc.broadcast(r2.collect.toMap)
val r3 = r1.map { case (k, v) => (keyIndices.value(k), v) }

On Tue, Nov 18, 2014 at 8:16 AM, Cheng Lian lian.cs@gmail.com wrote:

  A not so efficient way can be this:

 val r0: RDD[OriginalRow] = ...
 val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
 val r2 = r1.keys.distinct().zipWithIndex()
 val r3 = r2.join(r1).values

 On 11/18/14 8:54 PM, shahab wrote:

   Hi,

  In my Spark application, I am loading some rows from a database into Spark
 RDDs.
 Each row has several fields, and a string key. Due to my requirements I
 need to work with consecutive numeric ids (starting from 1 to N, where N is
 the number of unique keys) instead of string keys. Also, several rows can
 have the same string key.

  In the Spark context, how can I map each row into (Numeric_Key, OriginalRow)
 with map/reduce tasks such that rows with the same original string key get the
 same consecutive numeric key?

  Any hints?

  best,
 /Shahab

   ​




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io


Re: Exception in spark sql when running a group by query

2014-11-18 Thread Sadhan Sood
ah makes sense - Thanks Michael!

On Mon, Nov 17, 2014 at 6:08 PM, Michael Armbrust mich...@databricks.com
wrote:

 You are perhaps hitting an issue that was fixed by #3248
 https://github.com/apache/spark/pull/3248?

 On Mon, Nov 17, 2014 at 9:58 AM, Sadhan Sood sadhan.s...@gmail.com
 wrote:

 While testing sparkSQL, we were running this group by with expression
 query and got an exception. The same query worked fine on hive.

 SELECT from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
   '/MM/dd') as pst_date,
 count(*) as num_xyzs
   FROM
 all_matched_abc
   GROUP BY
 from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
   '/MM/dd')

 14/11/17 17:41:46 ERROR thriftserver.SparkSQLDriver: Failed in [SELECT 
 from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
   '/MM/dd') as pst_date,
 count(*) as num_xyzs
   FROM
 all_matched_abc
   GROUP BY
 from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
   '/MM/dd')
 ]
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
 not in GROUP BY: 
 HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived
  AS whenrequestreceived#187L, DoubleType) / 1000.0) - CAST(25200, 
 DoubleType))),/MM/dd) AS pst_date#179, tree:

 Aggregate 
 [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived,
  DoubleType) / 1000.0) - CAST(25200, DoubleType))),/MM/dd)], 
 [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived
  AS whenrequestreceived#187L, DoubleType) / 1000.0) - CAST(25200, 
 DoubleType))),/MM/dd) AS pst_date#179,COUNT(1) AS num_xyzs#180L]

  MetastoreRelation default, all_matched_abc, None
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:127)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:125)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113)
 at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
 at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
 at 
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
 at 
 scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
 at 
 scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
 at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
 at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
 at 
 

Re: Nightly releases

2014-11-18 Thread Andrew Ash
I can see this being valuable for users wanting to live on the cutting edge
without building CI infrastructure themselves, myself included.  I think
Patrick's recent work on the build scripts for 1.2.0 will make delivering
nightly builds to a public maven repo easier.

On Tue, Nov 18, 2014 at 10:22 AM, Arun Ahuja aahuj...@gmail.com wrote:

 Of course we can run this as well to get the latest, but the build is
 fairly long and this seems like a resource many would need.

 On Tue, Nov 18, 2014 at 10:21 AM, Arun Ahuja aahuj...@gmail.com wrote:

 Are nightly releases posted anywhere?  There are quite a few vital
 bugfixes and performance improvements being committed to Spark, and using the
 latest commits is useful (or even necessary for some jobs).

 Is there a place to post them? It doesn't seem like it would be difficult to
 run make-dist nightly and place the result somewhere.

 Is it possible to extract this from the Jenkins builds?

 Thanks,
 Arun
  ​





Is there a way to create key based on counts in Spark

2014-11-18 Thread Blind Faith
As it is difficult to explain this, I will show what I want. Let us say
I have an RDD A with the following value:

A = [word1, word2, word3]

I want to have an RDD with the following value

B = [(1, word1), (2, word2), (3, word3)]

That is, it gives a unique number to each entry as a key. Can we do
such a thing with Python or Scala?


Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread lordjoe
Map the key/value into a (key, Tuple2<key, value>) pair and process that.
Also ask the Spark maintainers for a version of the keyed operations where the
key is passed in as an argument - I run into these cases all the time.

/**
 * Map a tuple into a key/tuple pair to ensure subsequent processing has
 * access to both key and value.
 * @param inp input pair RDD
 * @param <K> key type
 * @param <V> value type
 * @return output where the value holds both key and value
 */
@Nonnull
public static <K extends Serializable, V extends Serializable>
JavaPairRDD<K, Tuple2<K, V>> toKeyedTuples(@Nonnull JavaPairRDD<K, V> inp) {
    return inp.flatMapToPair(new PairFlatMapFunction<Tuple2<K, V>, K, Tuple2<K, V>>() {
        @Override
        public Iterable<Tuple2<K, Tuple2<K, V>>> call(final Tuple2<K, V> t) throws Exception {
            // wrap the single output pair in a list so it satisfies the Iterable contract
            return java.util.Collections.singletonList(
                    new Tuple2<K, Tuple2<K, V>>(t._1(), new Tuple2<K, V>(t._1(), t._2())));
        }
    });
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-but-with-different-functions-depending-on-key-tp19177p19198.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



JavaKafkaWordCount

2014-11-18 Thread Eduardo Costa Alfaia
Hi Guys,

I am doing some tests with JavaKafkaWordCount. My cluster is composed of 8
workers and 1 driver with spark-1.1.0. I am using Kafka too and I have some
questions about it.

1 - When I launch the command:
bin/spark-submit --class org.apache.spark.examples.streaming.JavaKafkaWordCount 
--master spark://computer8:7077 --driver-memory 1g --executor-memory 2g 
--executor-cores 2 
examples/target/scala-2.10/spark-examples-1.1.0-hadoop1.0.4.jar computer49:2181 
test-consumer-group test 2

I see in the Spark WebAdmin that only 1 worker work. Why?

2 - In Kafka I can see the same thing:

Group                Topic  Pid  Offset  logSize  Lag  Owner
test-consumer-group  test   0    147092  147092   0    test-consumer-group_computer1-1416319543858-817b566f-0
test-consumer-group  test   1    232183  232183   0    test-consumer-group_computer1-1416319543858-817b566f-0
test-consumer-group  test   2    186805  186805   0    test-consumer-group_computer1-1416319543858-817b566f-0
test-consumer-group  test   3    0       0        0    test-consumer-group_computer1-1416319543858-817b566f-1
test-consumer-group  test   4    0       0        0    test-consumer-group_computer1-1416319543858-817b566f-1
test-consumer-group  test   5    0       0        0    test-consumer-group_computer1-1416319543858-817b566f-1

I would like to understand this behavior. Is it normal? Am I doing something
wrong?

Thanks Guys
-- 
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Debasish Das
Use zipWithIndex, but cache the data before you run zipWithIndex... that way
your ordering will be consistent (unless the bug has been fixed so that you
don't have to cache the data)...

Normally these operations are used for dictionary building, so I am
hoping you can cache the dictionary RDD[String] before you run
zipWithIndex...

Indices run from 0 to maxIndex-1... if you want them to start at 1 you have to
map each index to index + 1 afterwards.
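
A minimal spark-shell sketch of this (Scala; the words are just the ones from
the question):

val a = sc.parallelize(Seq("word1", "word2", "word3")).cache()  // cache for a stable ordering
val b = a.zipWithIndex().map { case (w, i) => (i + 1, w) }      // shift to 1-based keys
// b.collect() would give Array((1,word1), (2,word2), (3,word3))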

On Tue, Nov 18, 2014 at 8:56 AM, Blind Faith person.of.b...@gmail.com
wrote:

 As it is difficult to explain this, I would show what I want. Lets us say,
 I have an RDD A with the following value

 A = [word1, word2, word3]

 I want to have an RDD with the following value

 B = [(1, word1), (2, word2), (3, word3)]

 That is, it gives a unique number to each entry as a key value. Can we do
 such thing with Python or Scala?



Spark on YARN

2014-11-18 Thread Alan Prando
Hi Folks!

I'm running Spark on YARN cluster installed with Cloudera Manager Express.
The cluster has 1 master and 3 slaves, each machine with 32 cores and 64G
RAM.

My Spark job is working fine; however, it seems that just 2 of 3 slaves
are working (htop shows 2 slaves working at 100% on 32 cores, and 1 slave
without any processing).

I'm using this command:
./spark-submit --master yarn --num-executors 3 --executor-cores 32
 --executor-memory 32g feature_extractor.py -r 390

Additionally, Spark's log shows communication with only 2 slaves:
14/11/18 17:19:38 INFO YarnClientSchedulerBackend: Registered executor:
Actor[akka.tcp://sparkExecutor@ip-172-31-13-180.ec2.internal:33177/user/Executor#-113177469]
with ID 1
14/11/18 17:19:38 INFO RackResolver: Resolved ip-172-31-13-180.ec2.internal
to /default
14/11/18 17:19:38 INFO YarnClientSchedulerBackend: Registered executor:
Actor[akka.tcp://sparkExecutor@ip-172-31-13-179.ec2.internal:51859/user/Executor#-323896724]
with ID 2
14/11/18 17:19:38 INFO RackResolver: Resolved ip-172-31-13-179.ec2.internal
to /default
14/11/18 17:19:38 INFO BlockManagerMasterActor: Registering block manager
ip-172-31-13-180.ec2.internal:50959 with 16.6 GB RAM
14/11/18 17:19:39 INFO BlockManagerMasterActor: Registering block manager
ip-172-31-13-179.ec2.internal:53557 with 16.6 GB RAM
14/11/18 17:19:51 INFO YarnClientSchedulerBackend: SchedulerBackend is
ready for scheduling beginning after waiting
maxRegisteredResourcesWaitingTime: 3(ms)

Is there a configuration to call spark's job on YARN cluster with all
slaves?

Thanks in advance! =]

---
Regards
Alan Vidotti Prando.


Re: Spark on YARN

2014-11-18 Thread Debasish Das
I run my Spark on YARN jobs as:

HADOOP_CONF_DIR=/etc/hadoop/conf/ /app/data/v606014/dist/bin/spark-submit
--master yarn --jars test-job.jar --executor-cores 4 --num-executors 10
--executor-memory 16g --driver-memory 4g --class TestClass test.jar

It uses HADOOP_CONF_DIR to schedule executors and I get the number I ask
for (assuming other MapReduce jobs are not taking the cluster)...

Large memory intensive jobs like ALS still get issues on YARN but simple
jobs run fine...

Mine is also internal CDH cluster...

On Tue, Nov 18, 2014 at 10:03 AM, Alan Prando a...@scanboo.com.br wrote:

 Hi Folks!

 I'm running Spark on YARN cluster installed with Cloudera Manager Express.
 The cluster has 1 master and 3 slaves, each machine with 32 cores and 64G
 RAM.

 My spark's job is working fine, however it seems that just 2 of 3 slaves
 are working (htop shows 2 slaves working 100% on 32 cores, and 1 slaves
 without any processing).

 I'm using this command:
 ./spark-submit --master yarn --num-executors 3 --executor-cores 32
  --executor-memory 32g feature_extractor.py -r 390

 Additionaly, spark's log testify communications with 2 slaves only:
 14/11/18 17:19:38 INFO YarnClientSchedulerBackend: Registered executor:
 Actor[akka.tcp://sparkExecutor@ip-172-31-13-180.ec2.internal:33177/user/Executor#-113177469]
 with ID 1
 14/11/18 17:19:38 INFO RackResolver: Resolved
 ip-172-31-13-180.ec2.internal to /default
 14/11/18 17:19:38 INFO YarnClientSchedulerBackend: Registered executor:
 Actor[akka.tcp://sparkExecutor@ip-172-31-13-179.ec2.internal:51859/user/Executor#-323896724]
 with ID 2
 14/11/18 17:19:38 INFO RackResolver: Resolved
 ip-172-31-13-179.ec2.internal to /default
 14/11/18 17:19:38 INFO BlockManagerMasterActor: Registering block manager
 ip-172-31-13-180.ec2.internal:50959 with 16.6 GB RAM
 14/11/18 17:19:39 INFO BlockManagerMasterActor: Registering block manager
 ip-172-31-13-179.ec2.internal:53557 with 16.6 GB RAM
 14/11/18 17:19:51 INFO YarnClientSchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after waiting
 maxRegisteredResourcesWaitingTime: 3(ms)

 Is there a configuration to call spark's job on YARN cluster with all
 slaves?

 Thanks in advance! =]

 ---
 Regards
 Alan Vidotti Prando.





Re: Spark on YARN

2014-11-18 Thread Marcelo Vanzin
Can you check in your RM's web UI how much of each resource Yarn
thinks you have available? You can also check that in the Yarn
configuration directly.

Perhaps it's not configured to use all of the available resources. (If
it was set up with Cloudera Manager, CM will reserve some room for
daemons that need to run on each machine, so it won't tell Yarn to
make all 32 cores / 64 GB available for applications.)

Also remember that Spark needs to start num executors + 1 containers
when adding up all the needed resources. The extra container generally
requires less resources than the executors, but it still needs to
allocate resources from the RM.



On Tue, Nov 18, 2014 at 10:03 AM, Alan Prando a...@scanboo.com.br wrote:
 Hi Folks!

 I'm running Spark on YARN cluster installed with Cloudera Manager Express.
 The cluster has 1 master and 3 slaves, each machine with 32 cores and 64G
 RAM.

 My spark's job is working fine, however it seems that just 2 of 3 slaves are
 working (htop shows 2 slaves working 100% on 32 cores, and 1 slaves without
 any processing).

 I'm using this command:
 ./spark-submit --master yarn --num-executors 3 --executor-cores 32
 --executor-memory 32g feature_extractor.py -r 390

 Additionaly, spark's log testify communications with 2 slaves only:
 14/11/18 17:19:38 INFO YarnClientSchedulerBackend: Registered executor:
 Actor[akka.tcp://sparkExecutor@ip-172-31-13-180.ec2.internal:33177/user/Executor#-113177469]
 with ID 1
 14/11/18 17:19:38 INFO RackResolver: Resolved ip-172-31-13-180.ec2.internal
 to /default
 14/11/18 17:19:38 INFO YarnClientSchedulerBackend: Registered executor:
 Actor[akka.tcp://sparkExecutor@ip-172-31-13-179.ec2.internal:51859/user/Executor#-323896724]
 with ID 2
 14/11/18 17:19:38 INFO RackResolver: Resolved ip-172-31-13-179.ec2.internal
 to /default
 14/11/18 17:19:38 INFO BlockManagerMasterActor: Registering block manager
 ip-172-31-13-180.ec2.internal:50959 with 16.6 GB RAM
 14/11/18 17:19:39 INFO BlockManagerMasterActor: Registering block manager
 ip-172-31-13-179.ec2.internal:53557 with 16.6 GB RAM
 14/11/18 17:19:51 INFO YarnClientSchedulerBackend: SchedulerBackend is ready
 for scheduling beginning after waiting maxRegisteredResourcesWaitingTime:
 3(ms)

 Is there a configuration to call spark's job on YARN cluster with all
 slaves?

 Thanks in advance! =]

 ---
 Regards
 Alan Vidotti Prando.





-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark on YARN

2014-11-18 Thread Sandy Ryza
Hey Alan,

Spark's application master will take up 1 core on one of the nodes on the
cluster.  This means that that node will only have 31 cores remaining, not
enough to fit your third executor.
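
With 32-core nodes that means asking for at most 31 cores per executor (or a
bit less if the node managers reserve capacity for other daemons). Something
along these lines, with the numbers only as an illustration:

./spark-submit --master yarn --num-executors 3 --executor-cores 31 \
  --executor-memory 32g feature_extractor.py -r 390
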

-Sandy

On Tue, Nov 18, 2014 at 10:03 AM, Alan Prando a...@scanboo.com.br wrote:

 Hi Folks!

 I'm running Spark on YARN cluster installed with Cloudera Manager Express.
 The cluster has 1 master and 3 slaves, each machine with 32 cores and 64G
 RAM.

 My spark's job is working fine, however it seems that just 2 of 3 slaves
 are working (htop shows 2 slaves working 100% on 32 cores, and 1 slaves
 without any processing).

 I'm using this command:
 ./spark-submit --master yarn --num-executors 3 --executor-cores 32
  --executor-memory 32g feature_extractor.py -r 390

 Additionaly, spark's log testify communications with 2 slaves only:
 14/11/18 17:19:38 INFO YarnClientSchedulerBackend: Registered executor:
 Actor[akka.tcp://sparkExecutor@ip-172-31-13-180.ec2.internal:33177/user/Executor#-113177469]
 with ID 1
 14/11/18 17:19:38 INFO RackResolver: Resolved
 ip-172-31-13-180.ec2.internal to /default
 14/11/18 17:19:38 INFO YarnClientSchedulerBackend: Registered executor:
 Actor[akka.tcp://sparkExecutor@ip-172-31-13-179.ec2.internal:51859/user/Executor#-323896724]
 with ID 2
 14/11/18 17:19:38 INFO RackResolver: Resolved
 ip-172-31-13-179.ec2.internal to /default
 14/11/18 17:19:38 INFO BlockManagerMasterActor: Registering block manager
 ip-172-31-13-180.ec2.internal:50959 with 16.6 GB RAM
 14/11/18 17:19:39 INFO BlockManagerMasterActor: Registering block manager
 ip-172-31-13-179.ec2.internal:53557 with 16.6 GB RAM
 14/11/18 17:19:51 INFO YarnClientSchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after waiting
 maxRegisteredResourcesWaitingTime: 3(ms)

 Is there a configuration to call spark's job on YARN cluster with all
 slaves?

 Thanks in advance! =]

 ---
 Regards
 Alan Vidotti Prando.





Re: Spark on YARN

2014-11-18 Thread Sean Owen
My guess is you're asking for all cores of all machines but the driver
needs at least one core, so one executor is unable to find a machine to fit
on.
On Nov 18, 2014 7:04 PM, Alan Prando a...@scanboo.com.br wrote:

 Hi Folks!

 I'm running Spark on YARN cluster installed with Cloudera Manager Express.
 The cluster has 1 master and 3 slaves, each machine with 32 cores and 64G
 RAM.

 My spark's job is working fine, however it seems that just 2 of 3 slaves
 are working (htop shows 2 slaves working 100% on 32 cores, and 1 slaves
 without any processing).

 I'm using this command:
 ./spark-submit --master yarn --num-executors 3 --executor-cores 32
  --executor-memory 32g feature_extractor.py -r 390

 Additionaly, spark's log testify communications with 2 slaves only:
 14/11/18 17:19:38 INFO YarnClientSchedulerBackend: Registered executor:
 Actor[akka.tcp://sparkExecutor@ip-172-31-13-180.ec2.internal:33177/user/Executor#-113177469]
 with ID 1
 14/11/18 17:19:38 INFO RackResolver: Resolved
 ip-172-31-13-180.ec2.internal to /default
 14/11/18 17:19:38 INFO YarnClientSchedulerBackend: Registered executor:
 Actor[akka.tcp://sparkExecutor@ip-172-31-13-179.ec2.internal:51859/user/Executor#-323896724]
 with ID 2
 14/11/18 17:19:38 INFO RackResolver: Resolved
 ip-172-31-13-179.ec2.internal to /default
 14/11/18 17:19:38 INFO BlockManagerMasterActor: Registering block manager
 ip-172-31-13-180.ec2.internal:50959 with 16.6 GB RAM
 14/11/18 17:19:39 INFO BlockManagerMasterActor: Registering block manager
 ip-172-31-13-179.ec2.internal:53557 with 16.6 GB RAM
 14/11/18 17:19:51 INFO YarnClientSchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after waiting
 maxRegisteredResourcesWaitingTime: 3(ms)

 Is there a configuration to call spark's job on YARN cluster with all
 slaves?

 Thanks in advance! =]

 ---
 Regards
 Alan Vidotti Prando.





Re: Spark streaming cannot receive any message from Kafka

2014-11-18 Thread Bill Jay
Hi Jerry,

I looked at the KafkaUtils.createStream API and found that
spark.default.parallelism is actually specified in SparkConf instead. I do not
remember the exact stack trace of the exception, but the exception was thrown
when createStream was called without spark.default.parallelism being set. The
error message basically shows a failure to parse an empty string into an Int
when spark.default.parallelism is not specified.
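
For reference, a rough sketch of the kind of setup I mean (Scala; the ZK host,
group, topic and parallelism value are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("KafkaWordCount")
  .set("spark.default.parallelism", "8")            // the setting that was missing
val ssc = new StreamingContext(conf, Seconds(2))
val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group",
  Map("test" -> 2))                                 // topic -> receiver threads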

Bill

On Mon, Nov 17, 2014 at 4:45 PM, Shao, Saisai saisai.s...@intel.com wrote:

  Hi Bill,



 Would you mind describing what you found a little more specifically? I'm
 not sure there is a parameter in KafkaUtils.createStream where you can specify
 the Spark parallelism; also, what is the exception stack?



 Thanks

 Jerry



 *From:* Bill Jay [mailto:bill.jaypeter...@gmail.com]
 *Sent:* Tuesday, November 18, 2014 2:47 AM
 *To:* Helena Edelson
 *Cc:* Jay Vyas; u...@spark.incubator.apache.org; Tobias Pfeiffer; Shao,
 Saisai

 *Subject:* Re: Spark streaming cannot receive any message from Kafka



 Hi all,



 I found the reason for this issue. It seems that in the new version, if I do
 not specify spark.default.parallelism for KafkaUtils.createStream, there will
 be an exception from the Kafka stream creation stage. In the previous
 versions, it seems Spark would use the default value.



 Thanks!



 Bill



 On Thu, Nov 13, 2014 at 5:00 AM, Helena Edelson 
 helena.edel...@datastax.com wrote:

 I encounter no issues with streaming from kafka to spark in 1.1.0. Do you
 perhaps have a version conflict?

 Helena

 On Nov 13, 2014 12:55 AM, Jay Vyas jayunit100.apa...@gmail.com wrote:

  Yup, very important that n > 1 for spark streaming jobs. If local, use
 local[2].



 The thing to remember is that your spark receiver will take a thread to
 itself and produce data, so you need another thread to consume it.



 In a cluster manager like yarn or mesos, the word "thread" is not used
 anymore, or I guess has a different meaning - you need 2 or more free compute
 slots, and that should be guaranteed by looking to see how many free node
 managers are running, etc.


 On Nov 12, 2014, at 7:53 PM, Shao, Saisai saisai.s...@intel.com wrote:

  Did you configure the Spark master as local? It should be local[n], n > 1,
 for local mode. Besides, there's a Kafka wordcount example in the Spark
 Streaming examples; you can try that. I've tested with the latest master, and
 it's OK.



 Thanks

 Jerry



 *From:* Tobias Pfeiffer [mailto:t...@preferred.jp t...@preferred.jp]
 *Sent:* Thursday, November 13, 2014 8:45 AM
 *To:* Bill Jay
 *Cc:* u...@spark.incubator.apache.org
 *Subject:* Re: Spark streaming cannot receive any message from Kafka



 Bill,



   However, when I am currently using Spark 1.1.0. the Spark streaming job
 cannot receive any messages from Kafka. I have not made any change to the
 code.



 Do you see any suspicious messages in the log output?



 Tobias







Re: Pyspark Error

2014-11-18 Thread Shannon Quinn
My best guess would be a networking issue--it looks like the Python 
socket library isn't able to connect to whatever hostname you're 
providing Spark in the configuration.


On 11/18/14 9:10 AM, amin mohebbi wrote:

Hi there,

*I have already downloaded the pre-built spark-1.1.0. I want to run
pyspark by typing ./bin/pyspark, but I got the following error:*


*The Scala shell is up and working fine:*

hduser@master:~/Downloads/spark-1.1.0$ ./bin/spark-shell
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; 
support was removed in 8.0
Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties

.
.
14/11/18 04:33:13 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkDriver@master:34937/user/HeartbeatReceiver

14/11/18 04:33:13 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala hduser@master:~/Downloads/spark-1.1.0$


*
*
*But python shell does not work:*

hduser@master:~/Downloads/spark-1.1.0$
hduser@master:~/Downloads/spark-1.1.0$
hduser@master:~/Downloads/spark-1.1.0$ ./bin/pyspark
Python 2.7.3 (default, Feb 27 2014, 20:00:17)
[GCC 4.6.3] on linux2
Type help, copyright, credits or license for more information.
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; 
support was removed in 8.0
Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties

14/11/18 04:36:06 INFO SecurityManager: Changing view acls to: hduser,
14/11/18 04:36:06 INFO SecurityManager: Changing modify acls to: hduser,
14/11/18 04:36:06 INFO SecurityManager: SecurityManager: 
authentication disabled; ui acls disabled; users with view 
permissions: Set(hduser, ); users with modify permissions: Set(hduser, )

14/11/18 04:36:06 INFO Slf4jLogger: Slf4jLogger started
14/11/18 04:36:06 INFO Remoting: Starting remoting
14/11/18 04:36:06 INFO Remoting: Remoting started; listening on 
addresses :[akka.tcp://sparkDriver@master:52317]
14/11/18 04:36:06 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sparkDriver@master:52317]
14/11/18 04:36:06 INFO Utils: Successfully started service 
'sparkDriver' on port 52317.

14/11/18 04:36:06 INFO SparkEnv: Registering MapOutputTracker
14/11/18 04:36:06 INFO SparkEnv: Registering BlockManagerMaster
14/11/18 04:36:06 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20141118043606-c346
14/11/18 04:36:07 INFO Utils: Successfully started service 'Connection 
manager for block manager' on port 47507.
14/11/18 04:36:07 INFO ConnectionManager: Bound socket to port 47507 
with id = ConnectionManagerId(master,47507)
14/11/18 04:36:07 INFO MemoryStore: MemoryStore started with capacity 
267.3 MB

14/11/18 04:36:07 INFO BlockManagerMaster: Trying to register BlockManager
14/11/18 04:36:07 INFO BlockManagerMasterActor: Registering block 
manager master:47507 with 267.3 MB RAM

14/11/18 04:36:07 INFO BlockManagerMaster: Registered BlockManager
14/11/18 04:36:07 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-8b29544a-c74b-4a3e-88e0-13801c8dcc65

14/11/18 04:36:07 INFO HttpServer: Starting HTTP Server
14/11/18 04:36:07 INFO Utils: Successfully started service 'HTTP file 
server' on port 40029.
14/11/18 04:36:12 INFO Utils: Successfully started service 'SparkUI' 
on port 4040.
14/11/18 04:36:12 INFO SparkUI: Started SparkUI at http://master:4040 
http://master:4040/
14/11/18 04:36:12 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkDriver@master:52317/user/HeartbeatReceiver
14/11/18 04:36:12 INFO SparkUI: Stopped Spark web UI at 
http://master:4040 http://master:4040/

14/11/18 04:36:12 INFO DAGScheduler: Stopping DAGScheduler
14/11/18 04:36:13 INFO MapOutputTrackerMasterActor: 
MapOutputTrackerActor stopped!

14/11/18 04:36:13 INFO ConnectionManager: Selector thread was interrupted!
14/11/18 04:36:13 INFO ConnectionManager: ConnectionManager stopped
14/11/18 04:36:13 INFO MemoryStore: MemoryStore cleared
14/11/18 04:36:13 INFO BlockManager: BlockManager stopped
14/11/18 04:36:13 INFO BlockManagerMaster: BlockManagerMaster stopped
14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: 
Shutting down remote daemon.

14/11/18 04:36:13 INFO SparkContext: Successfully stopped SparkContext
14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: 
Remote daemon shut down; proceeding with flushing remote transports.

14/11/18 04:36:13 INFO Remoting: Remoting shut down
14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: 
Remoting shut down.

Traceback (most recent call last):
  File /home/hduser/Downloads/spark-1.1.0/python/pyspark/shell.py, 
line 44, in module

sc = SparkContext(appName=PySparkShell, pyFiles=add_files)
  File /home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py, 
line 107, in __init__

conf)
  File /home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py, 
line 159, in _do_init

self._accumulatorServer = accumulators._start_update_server()
  File 

Iterative transformations over RDD crashes in phantom reduce

2014-11-18 Thread Shannon Quinn

Hi all,

This is somewhat related to my previous question ( 
http://apache-spark-user-list.1001560.n3.nabble.com/Iterative-changes-to-RDD-and-broadcast-variables-tt19042.html 
, for additional context) but for all practical purposes this is its own 
issue.


As in my previous question, I'm making iterative changes to an RDD, 
where each iteration depends on the results of the previous one. I've 
stripped down what was previously a loop to just be two sequential edits 
to try and nail down where the problem is. It looks like this:


index = 0
INDEX = sc.broadcast(index)
M = M.flatMap(func1).reduceByKey(func2)
M.foreach(debug_output)
index = 1
INDEX = sc.broadcast(index)
M = M.flatMap(func1)
M.foreach(debug_output)

M is basically a row-indexed matrix, where each index points to a 
dictionary (sparse matrix more or less, with some domain-specific 
modifications). This program crashes on the second-to-last (7th) line; 
the creepy part is that it says the crash happens in func2 with the 
broadcast variable INDEX == 1 (it attempts to access an entry that 
doesn't exist in a dictionary of one of the rows).


How is that even possible? Am I missing something fundamental about how 
Spark works under the hood?


Thanks for your help!

Shannon

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Iterative transformations over RDD crashes in phantom reduce

2014-11-18 Thread Shannon Quinn
To clarify about what, precisely, is impossible: the crash happens with 
INDEX == 1 in func2, but func2 is only called in the reduceByKey 
transformation when INDEX == 0. And according to the output of the 
foreach() in line 4, that reduceByKey(func2) works just fine. How is it 
then invoked again with INDEX == 1 when there clearly isn't another 
reduce call at line 7?


On 11/18/14 1:58 PM, Shannon Quinn wrote:

Hi all,

This is somewhat related to my previous question ( 
http://apache-spark-user-list.1001560.n3.nabble.com/Iterative-changes-to-RDD-and-broadcast-variables-tt19042.html 
, for additional context) but for all practical purposes this is its 
own issue.


As in my previous question, I'm making iterative changes to an RDD, 
where each iteration depends on the results of the previous one. I've 
stripped down what was previously a loop to just be two sequential 
edits to try and nail down where the problem is. It looks like this:


index = 0
INDEX = sc.broadcast(index)
M = M.flatMap(func1).reduceByKey(func2)
M.foreach(debug_output)
index = 1
INDEX = sc.broadcast(index)
M = M.flatMap(func1)
M.foreach(debug_output)

M is basically a row-indexed matrix, where each index points to a 
dictionary (sparse matrix more or less, with some domain-specific 
modifications). This program crashes on the second-to-last (7th) line; 
the creepy part is that it says the crash happens in func2 with the 
broadcast variable INDEX == 1 (it attempts to access an entry that 
doesn't exist in a dictionary of one of the rows).


How is that even possible? Am I missing something fundamental about 
how Spark works under the hood?


Thanks for your help!

Shannon



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



unsubscribe

2014-11-18 Thread Abdul Hakeem




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Pyspark Error

2014-11-18 Thread Davies Liu
It seems that `localhost` can not be resolved in your machines, I had
filed https://issues.apache.org/jira/browse/SPARK-4475 to track it.

On Tue, Nov 18, 2014 at 6:10 AM, amin mohebbi
aminn_...@yahoo.com.invalid wrote:
 Hi there,

 I have already downloaded Pre-built spark-1.1.0, I want to run pyspark by
 try typing ./bin/pyspark but I got the following error:








 scala shell is up and working fine

 hduser@master:~/Downloads/spark-1.1.0$ ./bin/spark-shell
 Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m;
 support was removed in 8.0
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 .
 .
 14/11/18 04:33:13 INFO AkkaUtils: Connecting to HeartbeatReceiver:
 akka.tcp://sparkDriver@master:34937/user/HeartbeatReceiver
 14/11/18 04:33:13 INFO SparkILoop: Created spark context..
 Spark context available as sc.

 scala hduser@master:~/Downloads/spark-1.1.0$



 But python shell does not work:

 hduser@master:~/Downloads/spark-1.1.0$
 hduser@master:~/Downloads/spark-1.1.0$
 hduser@master:~/Downloads/spark-1.1.0$ ./bin/pyspark
 Python 2.7.3 (default, Feb 27 2014, 20:00:17)
 [GCC 4.6.3] on linux2
 Type help, copyright, credits or license for more information.
 Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m;
 support was removed in 8.0
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 14/11/18 04:36:06 INFO SecurityManager: Changing view acls to: hduser,
 14/11/18 04:36:06 INFO SecurityManager: Changing modify acls to: hduser,
 14/11/18 04:36:06 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(hduser, );
 users with modify permissions: Set(hduser, )
 14/11/18 04:36:06 INFO Slf4jLogger: Slf4jLogger started
 14/11/18 04:36:06 INFO Remoting: Starting remoting
 14/11/18 04:36:06 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@master:52317]
 14/11/18 04:36:06 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://sparkDriver@master:52317]
 14/11/18 04:36:06 INFO Utils: Successfully started service 'sparkDriver' on
 port 52317.
 14/11/18 04:36:06 INFO SparkEnv: Registering MapOutputTracker
 14/11/18 04:36:06 INFO SparkEnv: Registering BlockManagerMaster
 14/11/18 04:36:06 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20141118043606-c346
 14/11/18 04:36:07 INFO Utils: Successfully started service 'Connection
 manager for block manager' on port 47507.
 14/11/18 04:36:07 INFO ConnectionManager: Bound socket to port 47507 with id
 = ConnectionManagerId(master,47507)
 14/11/18 04:36:07 INFO MemoryStore: MemoryStore started with capacity 267.3
 MB
 14/11/18 04:36:07 INFO BlockManagerMaster: Trying to register BlockManager
 14/11/18 04:36:07 INFO BlockManagerMasterActor: Registering block manager
 master:47507 with 267.3 MB RAM
 14/11/18 04:36:07 INFO BlockManagerMaster: Registered BlockManager
 14/11/18 04:36:07 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-8b29544a-c74b-4a3e-88e0-13801c8dcc65
 14/11/18 04:36:07 INFO HttpServer: Starting HTTP Server
 14/11/18 04:36:07 INFO Utils: Successfully started service 'HTTP file
 server' on port 40029.
 14/11/18 04:36:12 INFO Utils: Successfully started service 'SparkUI' on port
 4040.
 14/11/18 04:36:12 INFO SparkUI: Started SparkUI at http://master:4040
 14/11/18 04:36:12 INFO AkkaUtils: Connecting to HeartbeatReceiver:
 akka.tcp://sparkDriver@master:52317/user/HeartbeatReceiver
 14/11/18 04:36:12 INFO SparkUI: Stopped Spark web UI at http://master:4040
 14/11/18 04:36:12 INFO DAGScheduler: Stopping DAGScheduler
 14/11/18 04:36:13 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor
 stopped!
 14/11/18 04:36:13 INFO ConnectionManager: Selector thread was interrupted!
 14/11/18 04:36:13 INFO ConnectionManager: ConnectionManager stopped
 14/11/18 04:36:13 INFO MemoryStore: MemoryStore cleared
 14/11/18 04:36:13 INFO BlockManager: BlockManager stopped
 14/11/18 04:36:13 INFO BlockManagerMaster: BlockManagerMaster stopped
 14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
 down remote daemon.
 14/11/18 04:36:13 INFO SparkContext: Successfully stopped SparkContext
 14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: Remote
 daemon shut down; proceeding with flushing remote transports.
 14/11/18 04:36:13 INFO Remoting: Remoting shut down
 14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
 shut down.
 Traceback (most recent call last):
   File /home/hduser/Downloads/spark-1.1.0/python/pyspark/shell.py, line
 44, in module
 sc = SparkContext(appName=PySparkShell, pyFiles=add_files)
   File /home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py, line
 107, in __init__
 conf)
   File /home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py, line
 159, in _do_init
 self._accumulatorServer = accumulators._start_update_server()
   File 

Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Davies Liu
On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com wrote:
 Use zipWithIndex but cache the data before you run zipWithIndex...that way
 your ordering will be consistent (unless the bug has been fixed where you
 don't have to cache the data)...

Could you point some link about the bug?

 Normally these operations are used for dictionary building and so I am
 hoping you can cache the dictionary of RDD[String] before you can run
 zipWithIndex...

 indices are within 0 till maxIndex-1...if you want 1 you have to later map
 the index to index + 1

 On Tue, Nov 18, 2014 at 8:56 AM, Blind Faith person.of.b...@gmail.com
 wrote:

 As it is difficult to explain this, I would show what I want. Lets us say,
 I have an RDD A with the following value

 A = [word1, word2, word3]

 I want to have an RDD with the following value

 B = [(1, word1), (2, word2), (3, word3)]

 That is, it gives a unique number to each entry as a key value. Can we do
 such thing with Python or Scala?



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Iterative transformations over RDD crashes in phantom reduce

2014-11-18 Thread Shannon Quinn
Sorry everyone--turns out an oft-forgotten single line of code was 
required to make this work:


index = 0
INDEX = sc.broadcast(index)
M = M.flatMap(func1).reduceByKey(func2)
M.foreach(debug_output)
*M.cache()*
index = 1
INDEX = sc.broadcast(index)
M = M.flatMap(func1)
M.foreach(debug_output)

It works as expected now, and I understand why it was failing before: without
the cache, Spark lazily recomputed the RDD's lineage for the second action, so
func2 was invoked again, this time with index == 1.


On 11/18/14 2:02 PM, Shannon Quinn wrote:
To clarify about what, precisely, is impossible: the crash happens 
with INDEX == 1 in func2, but func2 is only called in the reduceByKey 
transformation when INDEX == 0. And according to the output of the 
foreach() in line 4, that reduceByKey(func2) works just fine. How is 
it then invoked again with INDEX == 1 when there clearly isn't another 
reduce call at line 7?


On 11/18/14 1:58 PM, Shannon Quinn wrote:

Hi all,

This is somewhat related to my previous question ( 
http://apache-spark-user-list.1001560.n3.nabble.com/Iterative-changes-to-RDD-and-broadcast-variables-tt19042.html 
, for additional context) but for all practical purposes this is its 
own issue.


As in my previous question, I'm making iterative changes to an RDD, 
where each iteration depends on the results of the previous one. I've 
stripped down what was previously a loop to just be two sequential 
edits to try and nail down where the problem is. It looks like this:


index = 0
INDEX = sc.broadcast(index)
M = M.flatMap(func1).reduceByKey(func2)
M.foreach(debug_output)
index = 1
INDEX = sc.broadcast(index)
M = M.flatMap(func1)
M.foreach(debug_output)

M is basically a row-indexed matrix, where each index points to a 
dictionary (sparse matrix more or less, with some domain-specific 
modifications). This program crashes on the second-to-last (7th) 
line; the creepy part is that it says the crash happens in func2 
with the broadcast variable INDEX == 1 (it attempts to access an 
entry that doesn't exist in a dictionary of one of the rows).


How is that even possible? Am I missing something fundamental about 
how Spark works under the hood?


Thanks for your help!

Shannon






Re: SparkSQL exception on spark.sql.codegen

2014-11-18 Thread Michael Armbrust
Those are probably related.  It looks like we are somehow not being thread
safe when initializing various parts of the scala compiler.

Since code gen is pretty experimental we probably won't have the resources
to investigate backporting a fix.  However, if you can reproduce the
problem in Spark 1.2 then please file a JIRA.

On Mon, Nov 17, 2014 at 9:37 PM, Eric Zhen zhpeng...@gmail.com wrote:

 Yes, it always appears on a subset of the tasks in a stage (i.e. 100/100
 (65 failed)), and sometimes causes the stage to fail.

 And there is another error that I'm not sure if there is a correlation.

 java.lang.NoClassDefFoundError: Could not initialize class
 org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$
 at
 org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:114)
 at
 org.apache.spark.sql.execution.Filter.conditionEvaluator$lzycompute(basicOperators.scala:55)
 at
 org.apache.spark.sql.execution.Filter.conditionEvaluator(basicOperators.scala:55)
 at
 org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:58)
 at
 org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:57)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)

 On Tue, Nov 18, 2014 at 11:41 AM, Michael Armbrust mich...@databricks.com
  wrote:

 Interesting, I believe we have run that query with version 1.1.0 with
 codegen turned on and not much has changed there.  Is the error
 deterministic?

 On Mon, Nov 17, 2014 at 7:04 PM, Eric Zhen zhpeng...@gmail.com wrote:

 Hi Michael,

 We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4.

 On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust 
 mich...@databricks.com wrote:

 What version of Spark SQL?

 On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com
 wrote:

 Hi all,

 We run SparkSQL on TPCDS benchmark Q19 with  spark.sql.codegen=true,
 we got exceptions as below, has anyone else saw these before?

 java.lang.ExceptionInInitializerError
 at
 org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
 at
 org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51)
 at
 org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.lang.NullPointerException
 at
 scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358)
 at
 scala.reflect.internal.Types$UniqueType.init(Types.scala:1304)
 at
 scala.reflect.internal.Types$TypeRef.init(Types.scala:2341)
 at
 scala.reflect.internal.Types$NoArgsTypeRef.init(Types.scala:2137)
 at
 scala.reflect.internal.Types$TypeRef$$anon$6.init(Types.scala:2544)
  

Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Sean Owen
On Tue, Nov 18, 2014 at 8:26 PM, Davies Liu dav...@databricks.com wrote:
 On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com 
 wrote:
 Use zipWithIndex but cache the data before you run zipWithIndex...that way
 your ordering will be consistent (unless the bug has been fixed where you
 don't have to cache the data)...

 Could you point some link about the bug?

I think it's this:

https://issues.apache.org/jira/browse/SPARK-3098

... but it's resolved as not a bug.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Problems launching 1.2.0-SNAPSHOT cluster with Hive support on EC2

2014-11-18 Thread curtkohler
I've developed a Spark application using the 1.2.0-SNAPSHOT branch that
leverages Spark Streaming and Hive and can run it locally with no problem (I
need some fixes in the 1.2.0 branch). I successfully launched my EC2 cluster
by specifying a git commit hash from the 1.2.0-SNAPSHOT branch as the
version (as documented in the scripts) and was able to run non-Hive Spark
jobs against it.  However, when I try to deploy my Hive application to it,
I am getting an exception (java.lang.ArrayStoreException:
org.apache.spark.sql.execution.SparkStrategies$CommandStrategy?) creating my
HiveContext.  I suspect that after the spark-ec2 launch script clones the
source code of the git repo at that commit point, the assembly isn't being
built with Hive support (e.g. the -Phive profile on the mvn build command, as
described in "Building Spark with Maven" in the documentation).


Is anyone aware of a way I can get Hive support added into my compiled
assembly on the master/slaves using the spark-ec2 scripts (or where it would
make sense to hack the existing scripts to get it working)?


Thanks,

Curt



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problems-launching-1-2-0-SNAPSHOT-cluster-with-Hive-support-on-EC2-tp19221.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



GraphX twitter

2014-11-18 Thread tom85
I'm having problems running the twitter graph on a cluster with 4 nodes, each
having over 100GB of RAM and 32 virtual cores per node. 

I do have a pre-installed spark version (built against hadoop 2.3, because
it didn't compile on my system), but I'm loading my graph file from disk
without hdfs. The twitter graph is around 25GB big and I'm loading the graph
with GraphLoader.edgeListFile(..,.., minEdgePartitions = 128 ).
I assume that 128 partitions is optimal because that's the total number of
cores that I have. 

Now I've started one executor on each node, which has 100GB of RAM and
spark.executor.memory=32 to enjoy full parallelism. 4 workers, each having
one executor, each executor using 32 cores -- 128 cores, 128 partitions.

Is there any configuration that can be used for replicating the results given
in this paper: http://arxiv.org/pdf/1402.2394v1.pdf ?

The paper states a running time of under 500s. I can't get any results after
more than 1 hour. I'm running the built-in algorithm with 15 iterations.

Am I using too many partitions? Is there a bottleneck I can't see? 

I turned on GC logging, and it looks like this:

3.785: [GC [PSYoungGen: 62914560K->7720K(73400320K)]
62914560K->7792K(241172480K), 0.0151130 secs] [Times: user=0.27 sys=0.02,
real=0.02 secs]
9.209: [GC [PSYoungGen: 62922280K->1943393K(73400320K)]
62922352K->1943473K(241172480K), 0.6108790 secs] [Times: user=5.95 sys=8.04,
real=0.62 secs]
13.316: [GC [PSYoungGen: 64857953K->4283906K(73400320K)]
64858033K->4283994K(241172480K), 1.1567380 secs] [Times: user=10.84
sys=15.73, real=1.16 secs]
17.931: [GC [PSYoungGen: 67198466K->6808418K(73400320K)]
67198554K->6808514K(241172480K), 1.9807690 secs] [Times: user=16.21
sys=29.29, real=1.99 secs]
26.112: [GC [PSYoungGen: 69722978K->7211955K(73400320K)]
69723074K->7212059K(241172480K), 2.1325980 secs] [Times: user=15.66
sys=33.33, real=2.14 secs]
64.833: [GC [PSYoungGen: 70126515K->5378991K(74105216K)]
70126619K->5379103K(241877376K), 0.3315500 secs] [Times: user=7.53 sys=0.00,
real=0.33 secs]


In the stderr log, I wonder about these logs:
INFO HadoopRDD: Input split: file:/.../twitter.edge:25065160704+33554432
INFO HadoopRDD: Input split: file:/.../twitter.edge:19696451584+33554432

Why does it even split the file as a HadoopRDD?

I also wonder about this error:

 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get
block(s) from ConnectionManagerId($SPARK_MASTER,59331)
java.io.IOException: sendMessageReliably failed without being ACK'd


Any help would be highly appreciated.
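
For context, a minimal sketch of the load-and-run pattern described above (the file path and the choice of static PageRank with 15 iterations are assumptions, not the poster's exact code):

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Hedged sketch: load the edge list with 128 edge partitions, repartition the
// edges, and run 15 iterations of static PageRank.
val graph = GraphLoader
  .edgeListFile(sc, "file:///data/twitter.edge", minEdgePartitions = 128)
  .partitionBy(PartitionStrategy.EdgePartition2D)
val ranks = graph.staticPageRank(15).vertices
ranks.take(5).foreach(println)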







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-twitter-tp19222.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: unsubscribe

2014-11-18 Thread Corey Nolet
Abdul,

Please send an email to user-unsubscr...@spark.apache.org

On Tue, Nov 18, 2014 at 2:05 PM, Abdul Hakeem alhak...@gmail.com wrote:





 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Lost executors

2014-11-18 Thread Pala M Muthaia
Hi,

I am using Spark 1.0.1 on Yarn 2.5, and doing everything through spark
shell.

I am running a job that essentially reads a bunch of HBase keys, looks up
HBase data, and performs some filtering and aggregation. The job works fine
in smaller datasets, but when i try to execute on the full dataset, the job
never completes. The few symptoms i notice are:

a. The job shows progress for a while and then starts throwing lots of the
following errors:

2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor
906 disconnected, so removing it*
2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR
org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost
executor 906 on machine name: remote Akka client disassociated*

2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN
 org.apache.spark.storage.BlockManagerMasterActor - *Removing BlockManager
BlockManagerId(9186, machine name, 54600, 0) with no recent heart beats:
82313ms exceeds 45000ms*

Looking at the logs, the job never recovers from these errors, and
continues to show errors about lost executors and launching new executors,
and this just continues for a long time.

Could this be because the executors are running out of memory?

In terms of memory usage, the intermediate data could be large (after the
HBase lookup), but partial and fully aggregated data set size should be
quite small - essentially a bunch of ids and counts ( 1 mil in total).



b. In the Spark UI, i am seeing the following errors (redacted for
brevity), not sure if they are transient or real issue:

java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read
timed out}
...
org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
...
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:724)




I was trying to get more data to investigate but haven't been able to
figure out how to enable logging on the executors. The Spark UI appears
stuck and i only see driver side logs in the jobhistory directory specified
in the job.


Thanks,
pala


Re: Lost executors

2014-11-18 Thread Sandy Ryza
Hi Pala,

Do you have access to your YARN NodeManager logs?  Are you able to check
whether they report killing any containers for exceeding memory limits?

-Sandy

On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia mchett...@rocketfuelinc.com
 wrote:

 Hi,

 I am using Spark 1.0.1 on Yarn 2.5, and doing everything through spark
 shell.

 I am running a job that essentially reads a bunch of HBase keys, looks up
 HBase data, and performs some filtering and aggregation. The job works fine
 in smaller datasets, but when i try to execute on the full dataset, the job
 never completes. The few symptoms i notice are:

 a. The job shows progress for a while and then starts throwing lots of the
 following errors:

 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO
  org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor
 906 disconnected, so removing it*
 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR
 org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost
 executor 906 on machine name: remote Akka client disassociated*

 2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN
  org.apache.spark.storage.BlockManagerMasterActor - *Removing
 BlockManager BlockManagerId(9186, machine name, 54600, 0) with no recent
 heart beats: 82313ms exceeds 45000ms*

 Looking at the logs, the job never recovers from these errors, and
 continues to show errors about lost executors and launching new executors,
 and this just continues for a long time.

 Could this be because the executors are running out of memory?

 In terms of memory usage, the intermediate data could be large (after the
 HBase lookup), but partial and fully aggregated data set size should be
 quite small - essentially a bunch of ids and counts ( 1 mil in total).



 b. In the Spark UI, i am seeing the following errors (redacted for
 brevity), not sure if they are transient or real issue:

 java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read timed 
 out}
 ...
 org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
 org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
 org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
 ...
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:724)




 I was trying to get more data to investigate but haven't been able to
 figure out how to enable logging on the executors. The Spark UI appears
 stuck and i only see driver side logs in the jobhistory directory specified
 in the job.


 Thanks,
 pala





spark-shell giving me error of unread block data

2014-11-18 Thread Anson Abraham
I'm essentially loading a file and saving output to another location:

val source = sc.textFile("/tmp/testfile.txt")
source.saveAsTextFile("/tmp/testsparkoutput")

when i do so, i'm hitting this error:
14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
console:15
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
0.0 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException:
unread block data

java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)

java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)

java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Can't figure out what the issue is.  I'm running CDH 5.2 with Spark 1.1.  The
file I'm loading is literally just 7 MB.  I thought it was a jar mismatch,
but I did a compare and they're all identical.  And seeing as they were all
installed through CDH parcels, I'm not sure how there could be a version
mismatch between the nodes and the master.  Oh yeah, it's 1 master node with
2 worker nodes, running standalone, not through YARN.  Just in case, I copied
the jars from the master to the 2 worker nodes, and it's still the same issue.
The weird thing is, the first time I installed and tested it out it worked,
but now it doesn't.

Any help here would be greatly appreciated.


Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Davies Liu
I see, thanks!

On Tue, Nov 18, 2014 at 12:12 PM, Sean Owen so...@cloudera.com wrote:
 On Tue, Nov 18, 2014 at 8:26 PM, Davies Liu dav...@databricks.com wrote:
 On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com 
 wrote:
 Use zipWithIndex but cache the data before you run zipWithIndex...that way
 your ordering will be consistent (unless the bug has been fixed where you
 don't have to cache the data)...

 Could you point some link about the bug?

 I think it's this:

 https://issues.apache.org/jira/browse/SPARK-3098

 ... but it's resolved as not a bug.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-shell giving me error of unread block data

2014-11-18 Thread Ritesh Kumar Singh
It could be a serialization issue.
That happens when there are different versions installed on the same system.
What do you mean by "the first time you installed and tested it out"?

On Wed, Nov 19, 2014 at 3:29 AM, Anson Abraham anson.abra...@gmail.com
wrote:

 I'm essentially loading a file and saving output to another location:

 val source = sc.textFile("/tmp/testfile.txt")
 source.saveAsTextFile("/tmp/testsparkoutput")

 when i do so, i'm hitting this error:
 14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
 console:15
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
 0.0 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException:
 unread block data

 java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)

 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)

 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:744)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Can't figure out what the issue is.  I'm running CDH 5.2 with Spark 1.1.  The
 file I'm loading is literally just 7 MB.  I thought it was a jar mismatch,
 but I did a compare and they're all identical.  And seeing as they were all
 installed through CDH parcels, I'm not sure how there could be a version
 mismatch between the nodes and the master.  Oh yeah, it's 1 master node with
 2 worker nodes, running standalone, not through YARN.  Just in case, I copied
 the jars from the master to the 2 worker nodes, and it's still the same issue.
 The weird thing is, the first time I installed and tested it out it worked,
 but now it doesn't.

 Any help here would be greatly appreciated.



Cores on Master

2014-11-18 Thread Pat Ferrel
I see the default and max cores settings but these seem to control total cores 
per cluster.

My cobbled together home cluster needs the Master to not use all its cores or 
it may lock up (it does other things). Is there a way to control max cores used 
for a particular cluster machine in standalone mode?
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cores on Master

2014-11-18 Thread Pat Ferrel
Looks like I can do this by not using start-all.sh but starting each worker 
separately passing in a '--cores n' to the master? No config/env way?
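
(For reference, the per-machine config/env route being asked about would look something like the sketch below; the values are illustrative and assume the Worker running on the master machine is the one to throttle.)

# conf/spark-env.sh on the master machine (hypothetical values)
export SPARK_WORKER_CORES=4    # cores this machine's Worker offers to applications
export SPARK_WORKER_MEMORY=8g  # memory this machine's Worker offers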

On Nov 18, 2014, at 3:14 PM, Pat Ferrel p...@occamsmachete.com wrote:

I see the default and max cores settings but these seem to control total cores 
per cluster.

My cobbled together home cluster needs the Master to not use all its cores or 
it may lock up (it does other things). Is there a way to control max cores used 
for a particular cluster machine in standalone mode?
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



JdbcRDD

2014-11-18 Thread Krishna
Hi,

Are there any examples of using JdbcRDD in java available?

It's not clear what the last argument is in this example (
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala
):

sc = new SparkContext("local", "test")
val rdd = new JdbcRDD(
  sc,
  () => { DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb") },
  "SELECT DATA FROM FOO WHERE ? <= ID AND ID <= ?",
  1, 100, 3,
  (r: ResultSet) => { r.getInt(1) }
).cache()

Thanks
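
To spell out the constructor arguments, here is the same call with comments (the annotations follow the JdbcRDD signature; the last argument is the mapRow function that converts each ResultSet row into an RDD element):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

val sc = new SparkContext("local", "test")
val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb"), // getConnection: opens a JDBC connection per partition
  "SELECT DATA FROM FOO WHERE ? <= ID AND ID <= ?", // the two ?s are bound to each partition's ID sub-range
  1,   // lowerBound of the ID range
  100, // upperBound of the ID range
  3,   // numPartitions: the [1, 100] range is split into 3 sub-ranges
  (r: ResultSet) => r.getInt(1) // mapRow: the "last argument", maps each row to a value
).cache()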


Re: Cores on Master

2014-11-18 Thread Pat Ferrel
This seems to work only on a ‘worker’ not the master? So I’m back to having no 
way to control cores on the master?
 
On Nov 18, 2014, at 3:24 PM, Pat Ferrel p...@occamsmachete.com wrote:

Looks like I can do this by not using start-all.sh but starting each worker 
separately passing in a '--cores n' to the master? No config/env way?

On Nov 18, 2014, at 3:14 PM, Pat Ferrel p...@occamsmachete.com wrote:

I see the default and max cores settings but these seem to control total cores 
per cluster.

My cobbled together home cluster needs the Master to not use all its cores or 
it may lock up (it does other things). Is there a way to control max cores used 
for a particular cluster machine in standalone mode?
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: JdbcRDD

2014-11-18 Thread mykidong
I also had the same problem using JdbcRDD from Java.
In my case, I wrote a class in Scala that builds the JdbcRDD, and I call this
from Java.

For instance, JdbcRDDWrapper.scala looks like this:

...

import java.sql._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD
import com.gsshop.polaris.domain.event._

class JdbcRDDWrapper(sc: SparkContext, rowCount: Long, from: Long, to: Long)
 {

  def getItemViewEventJdbcRdd(): JdbcRDD[ItemViewEvent] =
  {
val sql =  +
SELECT  +
i.ID as id, +
i.ITEM_ID as \itemViewEvent.itemId\, +
i.BRAND_ID as \itemViewEvent.brandId\, +
i.ITEM_TYPE as \itemViewEvent.itemType\, +
i.PROMOTION_ID as \itemViewEvent.promotionId\, +
i.PRICE as \itemViewEvent.price\, +
i.ITEM_TITLE as \itemViewEvent.itemTitle\, +
i.ITEM_DESCRIPTION as \itemViewEvent.itemDescription\, +
i.THUMB_NAIL_URL as \itemViewEvent.thumbnailUrl\, +
i.LOAD_DATE as loadDate, +
b.EVENT_TYPE as \itemViewEvent.baseProperties.eventType\, +
b.TIMESTAMP as \itemViewEvent.baseProperties.timestamp\, +
b.URL as \itemViewEvent.baseProperties.url\, +
b.REFERER as \itemViewEvent.baseProperties.referer\, +
b.UID as \itemViewEvent.baseProperties.uid\, +
b.PCID as \itemViewEvent.baseProperties.pcid\, +
b.SERVICE_ID as \itemViewEvent.baseProperties.serviceId\, +
b.VERSION as \itemViewEvent.baseProperties.version\, +
b.DEVICE_TYPE as \itemViewEvent.baseProperties.deviceType\, +
b.DOMAIN as \itemViewEvent.baseProperties.domain\, +
b.SITE as \itemViewEvent.baseProperties.site\  +
FROM ITEM_VIEW_EVENT AS i  +
INNER JOIN BASE_PROPERTIES AS b  +
ON i.ID = b.EVENT_ID  +
WHERE b.TIMESTAMP != ? AND  + from +
 = b.TIMESTAMP AND b.TIMESTAMP   + to +  LIMIT ? 

val rdd = new JdbcRDD(
  sc,
  () => {
    Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
    DriverManager.getConnection("jdbc:phoenix:xx:/hbase-unsecure")
  },
  sql,
  0, rowCount, 5,
  (rs: ResultSet) => {
val baseProperties = new BaseProperties()
baseProperties.setEventType(rs.getString(
itemViewEvent.baseProperties.eventType))
baseProperties.setTimestamp(rs.getLong(
itemViewEvent.baseProperties.timestamp))
baseProperties.setUrl(rs.getString(
itemViewEvent.baseProperties.url))
baseProperties.setReferer(rs.getString(
itemViewEvent.baseProperties.referer))
baseProperties.setUid(rs.getString(
itemViewEvent.baseProperties.uid))
baseProperties.setPcid(rs.getString(
itemViewEvent.baseProperties.pcid))
baseProperties.setServiceId(rs.getString(
itemViewEvent.baseProperties.serviceId))
baseProperties.setVersion(rs.getString(
itemViewEvent.baseProperties.version))
baseProperties.setDeviceType(rs.getString(
itemViewEvent.baseProperties.deviceType))
baseProperties.setDomain(rs.getString(
itemViewEvent.baseProperties.domain))
baseProperties.setSite(rs.getString(
itemViewEvent.baseProperties.site))

val itemViewEvent = new ItemViewEvent()
itemViewEvent.setItemId(rs.getString(itemViewEvent.itemId))
itemViewEvent.setBrandId(rs.getString(itemViewEvent.brandId))
itemViewEvent.setItemType(rs.getString(itemViewEvent.itemType))
itemViewEvent.setPromotionId(rs.getString(
itemViewEvent.promotionId))
itemViewEvent.setPrice(rs.getLong(itemViewEvent.price))
itemViewEvent.setItemTitle(rs.getString(itemViewEvent.itemTitle))
itemViewEvent.setItemDescription(rs.getString(
itemViewEvent.itemDescription))
itemViewEvent.setThumbnailUrl(rs.getString(
itemViewEvent.thumbnailUrl))
itemViewEvent.setBaseProperties(baseProperties)


itemViewEvent
  })

rdd
  }

}

and from java, JdbcRdd can be received:

import scala.reflect.ClassManifestFactory$;

...
JdbcRDD<ItemViewEvent> jdbcRddItemViewEvent = new
JdbcRDDWrapper(JavaSparkContext.toSparkContext(ctx), rowCountItemViewEvent,
fromTime, toTime).getItemViewEventJdbcRdd();
JavaRDD<ItemViewEvent> javaRddItemViewEvent =
JavaRDD.fromRDD(jdbcRddItemViewEvent,
ClassManifestFactory$.MODULE$.fromClass(ItemViewEvent.class));


- Kidong.





2014-11-19 8:58 GMT+09:00 sparkling [via Apache Spark User List] 
ml-node+s1001560n19233...@n3.nabble.com:

 Hi,

 Are there any examples of using JdbcRDD in java available?

 Its not clear what is the last argument in this example (
 https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala
 ):

 sc = new SparkContext(local, test)
 val rdd = new JdbcRDD(
 sc,
 () = { DriverManager.getConnection(jdbc:derby:target/JdbcRDDSuiteDb) },
 SELECT DATA FROM FOO WHERE ? = ID AND ID = ?,
 1, 100, 3,
 (r: ResultSet) = { r.getInt(1) }
 ).cache()

 Thanks



 --
  If you reply to this email, your message will be added to the discussion
 below:
 

Re: JdbcRDD

2014-11-18 Thread Krishna
Thanks Kidong. I'll try your approach.

On Tue, Nov 18, 2014 at 4:22 PM, mykidong mykid...@gmail.com wrote:

 I had also same problem to use JdbcRDD in java.
 For me, I have written a class in scala to get JdbcRDD, and I call this
 instance from java.

 for instance, JdbcRDDWrapper.scala like this:

 ...

 import java.sql._
 import org.apache.spark.SparkContext
 import org.apache.spark.rdd.JdbcRDD
 import com.gsshop.polaris.domain.event._

 class JdbcRDDWrapper(sc: SparkContext, rowCount: Long, from: Long, to:
 Long) {

   def getItemViewEventJdbcRdd(): JdbcRDD[ItemViewEvent] =
   {
 val sql =  +
 SELECT  +
 i.ID as id, +
 i.ITEM_ID as \itemViewEvent.itemId\, +
 i.BRAND_ID as \itemViewEvent.brandId\, +
 i.ITEM_TYPE as \itemViewEvent.itemType\, +
 i.PROMOTION_ID as \itemViewEvent.promotionId\, +
 i.PRICE as \itemViewEvent.price\, +
 i.ITEM_TITLE as \itemViewEvent.itemTitle\, +
 i.ITEM_DESCRIPTION as \itemViewEvent.itemDescription\, +
 i.THUMB_NAIL_URL as \itemViewEvent.thumbnailUrl\, +
 i.LOAD_DATE as loadDate, +
 b.EVENT_TYPE as \itemViewEvent.baseProperties.eventType\, +
 b.TIMESTAMP as \itemViewEvent.baseProperties.timestamp\, +
 b.URL as \itemViewEvent.baseProperties.url\, +
 b.REFERER as \itemViewEvent.baseProperties.referer\, +
 b.UID as \itemViewEvent.baseProperties.uid\, +
 b.PCID as \itemViewEvent.baseProperties.pcid\, +
 b.SERVICE_ID as \itemViewEvent.baseProperties.serviceId\, +
 b.VERSION as \itemViewEvent.baseProperties.version\, +
 b.DEVICE_TYPE as \itemViewEvent.baseProperties.deviceType\, +

 b.DOMAIN as \itemViewEvent.baseProperties.domain\, +
 b.SITE as \itemViewEvent.baseProperties.site\  +
 FROM ITEM_VIEW_EVENT AS i  +
 INNER JOIN BASE_PROPERTIES AS b  +
 ON i.ID = b.EVENT_ID  +
 WHERE b.TIMESTAMP != ? AND  + from +
  = b.TIMESTAMP AND b.TIMESTAMP   + to +  LIMIT ? 

 val rdd = new JdbcRDD(
   sc,
   () = {
 Class.forName(org.apache.phoenix.jdbc.PhoenixDriver)
 DriverManager.getConnection(jdbc:phoenix:xx:/hbase-unsecure)
   },
   sql,
   0, rowCount, 5,
   (rs: ResultSet) = {
 val baseProperties = new BaseProperties()
 baseProperties.setEventType(rs.getString(
 itemViewEvent.baseProperties.eventType))
 baseProperties.setTimestamp(rs.getLong(
 itemViewEvent.baseProperties.timestamp))
 baseProperties.setUrl(rs.getString(
 itemViewEvent.baseProperties.url))
 baseProperties.setReferer(rs.getString(
 itemViewEvent.baseProperties.referer))
 baseProperties.setUid(rs.getString(
 itemViewEvent.baseProperties.uid))
 baseProperties.setPcid(rs.getString(
 itemViewEvent.baseProperties.pcid))
 baseProperties.setServiceId(rs.getString(
 itemViewEvent.baseProperties.serviceId))
 baseProperties.setVersion(rs.getString(
 itemViewEvent.baseProperties.version))
 baseProperties.setDeviceType(rs.getString(
 itemViewEvent.baseProperties.deviceType))
 baseProperties.setDomain(rs.getString(
 itemViewEvent.baseProperties.domain))
 baseProperties.setSite(rs.getString(
 itemViewEvent.baseProperties.site))

 val itemViewEvent = new ItemViewEvent()
 itemViewEvent.setItemId(rs.getString(itemViewEvent.itemId))
 itemViewEvent.setBrandId(rs.getString(itemViewEvent.brandId))
 itemViewEvent.setItemType(rs.getString(itemViewEvent.itemType))
 itemViewEvent.setPromotionId(rs.getString(
 itemViewEvent.promotionId))
 itemViewEvent.setPrice(rs.getLong(itemViewEvent.price))
 itemViewEvent.setItemTitle(rs.getString(itemViewEvent.itemTitle
 ))
 itemViewEvent.setItemDescription(rs.getString(
 itemViewEvent.itemDescription))
 itemViewEvent.setThumbnailUrl(rs.getString(
 itemViewEvent.thumbnailUrl))
 itemViewEvent.setBaseProperties(baseProperties)


 itemViewEvent
   })

 rdd
   }

 }

 and from java, JdbcRdd can be received:

 import scala.reflect.ClassManifestFactory$;

 ...
 JdbcRDDItemViewEvent jdbcRddItemViewEvent = new
 JdbcRDDWrapper(JavaSparkContext.toSparkContext(ctx), rowCountItemViewEvent,
 fromTime, toTime).getItemViewEventJdbcRdd();
 JavaRDDItemViewEvent javaRddItemViewEvent =
 JavaRDD.fromRDD(jdbcRddItemViewEvent,
 ClassManifestFactory$.MODULE$.fromClass(ItemViewEvent.class));


 - Kidong.





 2014-11-19 8:58 GMT+09:00 sparkling [via Apache Spark User List] [hidden
 email] http://user/SendEmail.jtp?type=nodenode=19235i=0:

 Hi,

 Are there any examples of using JdbcRDD in java available?

 Its not clear what is the last argument in this example (
 https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala
 ):

 sc = new SparkContext(local, test)
 val rdd = new JdbcRDD(
 sc,
 () = { DriverManager.getConnection(jdbc:derby:target/JdbcRDDSuiteDb) },
 

Spark streaming: java.io.IOException: Version Mismatch (Expected: 28, Received: 18245 )

2014-11-18 Thread Bill Jay
Hi all,

I am running a Spark Streaming job. It was able to produce the correct
results up to some time. Later on, the job was still running but producing
no result. I checked the Spark streaming UI and found that 4 tasks of a
stage failed.

The error messages showed that Job aborted due to stage failure: Task 0 in
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0
(TID 400048, ip-172-31-13-130.ec2.internal): ExecutorLostFailure (executor
lost)
Driver stacktrace:

I further clicked the stage and found 4 executors running the stages had
the error message:

ExecutorLostFailure (executor lost)


The stage that failed was actually runJob at ReceiverTracker.scala:275
http://ec2-54-172-118-237.compute-1.amazonaws.com:9046/proxy/application_1415902783817_0019/stages/stage?id=2attempt=0,
which is the stage that keeps receiving messages from Kafka. I guess that is
why the job does not produce results any more.

To investigate it, I logged into one of the executor machine and checked
the hadoop log. The log file contains a lot of exception message:

*java.io.IOException: Version Mismatch (Expected: 28, Received: 18245 )*


This streaming job is reading from Kafka and producing aggregation results.
After this stage failure, the job is still running but there is no data
shuffle as seen in the Spark UI.

The amount of time this job can run correctly varies from job to job.
Does anyone have an idea why this Spark Streaming job had this exception,
and why it cannot recover from the stage failure?

Thanks!

Bill


Re: Cores on Master

2014-11-18 Thread Pat Ferrel
OK hacking the start-slave.sh did it 

On Nov 18, 2014, at 4:12 PM, Pat Ferrel p...@occamsmachete.com wrote:

This seems to work only on a ‘worker’ not the master? So I’m back to having no 
way to control cores on the master?

On Nov 18, 2014, at 3:24 PM, Pat Ferrel p...@occamsmachete.com wrote:

Looks like I can do this by not using start-all.sh but starting each worker 
separately passing in a '--cores n' to the master? No config/env way?

On Nov 18, 2014, at 3:14 PM, Pat Ferrel p...@occamsmachete.com wrote:

I see the default and max cores settings but these seem to control total cores 
per cluster.

My cobbled together home cluster needs the Master to not use all its cores or 
it may lock up (it does other things). Is there a way to control max cores used 
for a particular cluster machine in standalone mode?
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Parsing a large XML file using Spark

2014-11-18 Thread Soumya Simanta
If there is one big XML file (e.g., the Wikipedia dump, 44 GB, or the larger
dump that also contains all revision information) stored in HDFS, is it possible
to parse it in parallel/faster using Spark? Or do we have to use something
like a PullParser or Iteratee?

My current solution is to read the single XML file in a first pass,
write it back to HDFS as smaller files, and then read those small files in
parallel on the Spark workers.

Thanks
-Soumya


Re: Lost executors

2014-11-18 Thread Pala M Muthaia
Sandy,

Good point - i forgot about NM logs.

When I looked up the NM logs, I only see the following statements, which
align with the driver-side log about the lost executor. Many executors show the
same log statement at the same time, so it seems like the decision to kill
many if not all executors happened centrally, and all executors got
notified somehow:

14/11/18 00:18:25 INFO Executor: Executor is trying to kill task 2013
14/11/18 00:18:25 INFO Executor: Executor killed task 2013


In general, I also see quite a few instances of the following
exception across many executors/nodes:

14/11/17 23:58:00 INFO HadoopRDD: Input split: hdfs dir
path/sorted_keys-1020_3-r-00255.deflate:0+415841

14/11/17 23:58:00 WARN BlockReaderLocal: error creating DomainSocket
java.net.ConnectException: connect(2) error: Connection refused when
trying to connect to '/srv/var/hadoop/runs/hdfs/dn_socket'
at org.apache.hadoop.net.unix.DomainSocket.connect0(Native Method)
at 
org.apache.hadoop.net.unix.DomainSocket.connect(DomainSocket.java:250)
at 
org.apache.hadoop.hdfs.DomainSocketFactory.createSocket(DomainSocketFactory.java:158)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.nextDomainPeer(BlockReaderFactory.java:721)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:441)
at 
org.apache.hadoop.hdfs.client.ShortCircuitCache.create(ShortCircuitCache.java:780)
at 
org.apache.hadoop.hdfs.client.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:714)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:395)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:303)
at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:567)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:790)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
at java.io.DataInputStream.read(DataInputStream.java:149)
at 
org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at 
org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at 
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at 
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:201)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:184)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
at 
$line57.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:51)
at 
$line57.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:50)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
at 

Re: SparkSQL exception on spark.sql.codegen

2014-11-18 Thread Eric Zhen
Okay, thank you Micheal.

On Wed, Nov 19, 2014 at 3:45 AM, Michael Armbrust mich...@databricks.com
wrote:

 Those are probably related.  It looks like we are somehow not being thread
 safe when initializing various parts of the scala compiler.

 Since code gen is pretty experimental we probably won't have the resources
 to investigate backporting a fix.  However, if you can reproduce the
 problem in Spark 1.2 then please file a JIRA.

 On Mon, Nov 17, 2014 at 9:37 PM, Eric Zhen zhpeng...@gmail.com wrote:

 Yes, it always appears on a subset of the tasks in a stage (i.e. 100/100
 (65 failed)), and it sometimes causes the stage to fail.

 And there is another error that I'm not sure is related:

 java.lang.NoClassDefFoundError: Could not initialize class
 org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$
 at
 org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:114)
 at
 org.apache.spark.sql.execution.Filter.conditionEvaluator$lzycompute(basicOperators.scala:55)
 at
 org.apache.spark.sql.execution.Filter.conditionEvaluator(basicOperators.scala:55)
 at
 org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:58)
 at
 org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:57)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)

 On Tue, Nov 18, 2014 at 11:41 AM, Michael Armbrust 
 mich...@databricks.com wrote:

 Interesting, I believe we have run that query with version 1.1.0 with
 codegen turned on and not much has changed there.  Is the error
 deterministic?

 On Mon, Nov 17, 2014 at 7:04 PM, Eric Zhen zhpeng...@gmail.com wrote:

 Hi Michael,

 We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4.

 On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust 
 mich...@databricks.com wrote:

 What version of Spark SQL?

 On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com
 wrote:

 Hi all,

 We run SparkSQL on TPCDS benchmark Q19 with  spark.sql.codegen=true,
 we got exceptions as below, has anyone else saw these before?

 java.lang.ExceptionInInitializerError
 at
 org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
 at
 org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51)
 at
 org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.lang.NullPointerException
 at
 scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358)
 at
 scala.reflect.internal.Types$UniqueType.init(Types.scala:1304)
 at
 scala.reflect.internal.Types$TypeRef.init(Types.scala:2341)
 at
 

Re: Spark Streaming with Kafka is failing with Error

2014-11-18 Thread Tobias Pfeiffer
Hi,

do you have some logging backend (log4j, logback) on your classpath? This
seems a bit like there is no particular implementation of the abstract
`log()` method available.

Tobias


Re: Parsing a large XML file using Spark

2014-11-18 Thread Tobias Pfeiffer
Hi,

see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for one
solution.

One issue with those XML files is that they cannot be processed line by
line in parallel; plus you inherently need shared/global state to parse XML
or check for well-formedness, I think. (Same issue with multi-line JSON, by
the way.)

Tobias
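
For what it's worth, one commonly used approach is to let Hadoop's TextInputFormat split the single file on a custom record delimiter, so each record (e.g. one <page>...</page> block) can be parsed independently on the workers. A hedged sketch follows; the delimiter tag and the file path are assumptions, and this only works when a single closing tag safely delimits records:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Split the single XML file into one record per </page>-terminated block,
// so the blocks can be parsed in parallel on the workers.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "</page>")
val pages = sc
  .newAPIHadoopFile("hdfs:///wiki/enwiki-pages-articles.xml", // hypothetical path
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString + "</page>" } // re-append the delimiter
pages.take(1).foreach(p => println(p.take(200)))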


Re: Sourcing data from RedShift

2014-11-18 Thread Gary Malouf
Hi guys,

We ultimately needed to add 8 EC2 xl's to get better performance.  As was
suspected, we could not fit all the data into RAM.

This worked great with files of around 100-350 MB, which is what our initial
export task produced.  Unfortunately, with the partition settings that we
were able to get to work with GraphX (we are unable to change parallelism due
to a bug), we can't keep writing files at this size - our output ends up
being closer to 1 GB per file.

As a result, our job seems to struggle working with 100 GB worth of these
files.  We are in a rough spot because upgrading Spark right now is not
reasonable for us, but this bug prevents solving the issue.

On Fri, Nov 14, 2014 at 9:29 PM, Gary Malouf malouf.g...@gmail.com wrote:

 I'll try this out and follow up with what I find.

 On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng m...@databricks.com
 wrote:

 For each node, if the CSV reader is implemented efficiently, you should
 be able to hit at least half of the theoretical network bandwidth, which is
 about 60MB/second/node. So if you just do counting, the expect time should
 be within 3 minutes.

 Note that your cluster have 15GB * 12 = 180GB RAM in total. If you use
 the default spark.storage.memoryFraction, it can barely cache 100GB of
 data, not considering the overhead. So if your operation need to cache the
 data to be efficient, you may need a larger cluster or change the storage
 level to MEMORY_AND_DISK.

 -Xiangrui

 On Nov 14, 2014, at 5:32 PM, Gary Malouf malouf.g...@gmail.com wrote:

 Hmm, we actually read the CSV data in S3 now and were looking to avoid
 that.  Unfortunately, we've experienced dreadful performance reading 100GB
 of text data for a job directly from S3 - our hope had been connecting
 directly to Redshift would provide some boost.

 We had been using 12 m3.xlarges, but increasing default parallelism (to
 2x # of cpus across cluster) and increasing partitions during reading did
 not seem to help.

 On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng m...@databricks.com
 wrote:

 Michael is correct. Using direct connection to dump data would be slow
 because there is only a single connection. Please use UNLOAD with ESCAPE
 option to dump the table to S3. See instructions at

 http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html

 And then load them back using the redshift input format we wrote:
 https://github.com/databricks/spark-redshift (we moved the
 implementation to github/databricks). Right now all columns are loaded as
 string columns, and you need to do type casting manually. We plan to add a
 parser that can translate Redshift table schema directly to Spark SQL
 schema, but no ETA yet.

 -Xiangrui

 On Nov 14, 2014, at 3:46 PM, Michael Armbrust mich...@databricks.com
 wrote:

 I'd guess that its an s3n://key:secret_key@bucket/path from the UNLOAD
 command used to produce the data.  Xiangrui can correct me if I'm wrong
 though.

 On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf malouf.g...@gmail.com
 wrote:

 We have a bunch of data in RedShift tables that we'd like to pull in
 during job runs to Spark.  What is the path/url format one uses to pull
 data from there?  (This is in reference to using the
 https://github.com/mengxr/redshift-input-format)












Re: Slave Node Management in Standalone Cluster

2014-11-18 Thread Kenichi Maehashi
Hi Akhil and Kousuke,

Thank you for your quick response.
Monitoring through JSON API seems straightforward and cool.

Thanks again!




2014-11-18 19:06 GMT+09:00 Kousuke Saruta saru...@oss.nttdata.co.jp:
 Hi Kenichi

 1. How can I stop a slave on the specific node?
Under `sbin/` directory, there are
 `start-{all,master,slave,slaves}` and `stop-{all,master,slaves}`, but
 no `stop-slave`. Are there any way to stop the specific (e.g., the
 2nd) slave via command line?


 You can use sbin/spark-daemon.sh on the machine where the worker you'd like
 to stop runs.
 First, you find PID of the worker you'd like to stop and second, you find
 PID file of the worker.
 The PID file is on /tmp/ by default and the file name is like as follows.

 xxx.org.apache.spark.deploy.worker.Worker-WorkerID.pid

 After you find the PID file, you run the following command.

 sbin/spark-daemon.sh stop org.apache.spark.deploy.worker.Worker WorkerID

 2. How can I check cluster status from command line?
Are there any way to confirm that all Master / Workers are up and
 working without using Web UI?


 AFAIK, there are no command line tools for checking statuses of standalone
 cluster.
 Instead of that, you can use special URL like as follows.

 http://<master or worker's hostname>:<webui-port>/json

 You can get Master and Worker status as JSON format data.

 - Kousuke


 (2014/11/18 0:27), Kenichi Maehashi wrote:

 Hi,

 I'm operating Spark in standalone cluster configuration (3 slaves) and
 have some question.

 1. How can I stop a slave on the specific node?
 Under `sbin/` directory, there are
 `start-{all,master,slave,slaves}` and `stop-{all,master,slaves}`, but
 no `stop-slave`. Are there any way to stop the specific (e.g., the
 2nd) slave via command line?

 2. How can I check cluster status from command line?
 Are there any way to confirm that all Master / Workers are up and
 working without using Web UI?

 Thanks in advance!




-- 
Kenichi Maehashi

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Converting a json struct to map

2014-11-18 Thread Daniel Haviv
Hi,
I'm loading a JSON file into an RDD and then saving that RDD as Parquet.
One of the fields is a map of keys and values, but it is being translated and
stored as a struct.

How can I convert the field into a map?


Thanks,
Daniel

k-means clustering

2014-11-18 Thread amin mohebbi
Hi there,
I would like to do text clustering using k-means and Spark on a massive
dataset. As you know, before running k-means I have to apply pre-processing
steps such as TF-IDF and NLTK to my big dataset. The following is my code in
Python:

if __name__ == '__main__':
    # Cluster a bunch of text documents.
    import re
    import sys

    k = 6
    vocab = {}
    xs = []
    ns = []
    cat = []
    filename = '2013-01.csv'
    with open(filename, newline='') as f:
        try:
            newsreader = csv.reader(f)
            for row in newsreader:
                ns.append(row[3])
                cat.append(row[4])
        except csv.Error as e:
            sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, e))

    remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))  # regex to remove special characters
    remove_num = re.compile('[\d]+')
    # nltk.download()
    stop_words = nltk.corpus.stopwords.words('english')

    for a in ns:
        x = defaultdict(float)

        a1 = a.strip().lower()
        a2 = remove_spl_char_regex.sub(" ", a1)  # Remove special characters
        a3 = remove_num.sub("", a2)  # Remove numbers
        # Remove stop words
        words = a3.split()
        filter_stop_words = [w for w in words if not w in stop_words]
        stemed = [PorterStemmer().stem_word(w) for w in filter_stop_words]
        ws = sorted(stemed)

        # ws = re.findall(r"\w+", a1)
        for w in ws:
            vocab.setdefault(w, len(vocab))
            x[vocab[w]] += 1
        xs.append(x.items())



Can anyone explain to me how I can do these pre-processing steps before running
k-means using Spark?
 
Best Regards

...

Amin Mohebbi

PhD candidate in Software Engineering 
 at university of Malaysia  

Tel : +60 18 2040 017



E-Mail : tp025...@ex.apiit.edu.my

  amin_...@me.com
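
For what it's worth, a minimal Scala sketch of the Spark side of the pipeline asked about above, using MLlib's HashingTF/IDF and KMeans (tokenization is deliberately simplified -- no stop-word removal or stemming -- and the input path is hypothetical):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF}

// Tokenize, weight terms with TF-IDF, then cluster the documents into k = 6 groups.
val docs = sc.textFile("hdfs:///data/2013-01.csv")
  .map(_.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq)
val tf = new HashingTF().transform(docs)
tf.cache()
val tfidf = new IDF().fit(tf).transform(tf)
val model = KMeans.train(tfidf, 6, 20) // k = 6, 20 iterations
model.clusterCenters.foreach(c => println(c.size))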

SparkSQL and Hive/Hive metastore testing - LocalHiveContext

2014-11-18 Thread Night Wolf
Hi,

Just to give some context: we are using the Hive metastore with CSV & Parquet
files as part of our ETL pipeline. We query these with SparkSQL to do
some downstream work.

I'm curious what's the best way to go about testing Hive & SparkSQL? I'm
using 1.1.0.

I see that the LocalHiveContext has been depreciated.
https://issues.apache.org/jira/browse/SPARK-2397

My testing strategy is: as part of my Before block I basically create the
HiveContext, then create the databases/tables and map them to some test
sample data files in my test resources directory.

The LocalHiveContext was useful because I could inject it as part of the
test setup and it would take care of creating the metastore and warehouse
directories for Hive for me (local to my project). If I just create a
HiveContext it does create the metastore_db folder locally, but the warehouse
directory is not created! Thus running a command like hc.sql("CREATE
DATABASE myDb") results in a Hive error. I also can't supply a test
hive-site.xml because it won't allow relative paths, which means that there
is some shared directory that everyone needs to have. The only other option
is to call the setConf method like LocalHiveContext does.

Since LocalHiveContext is on the way out, I'm wondering if I'm doing
something stupid.

Is there a better way to mock this out and test Hive/metastore with
SparkSQL?

Cheers,
~N
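
For reference, a hedged sketch of what LocalHiveContext effectively did, which can be replicated per test setup (directory names are illustrative; the two conf keys mirror the deprecated class's configuration):

import java.io.File
import org.apache.spark.sql.hive.HiveContext

// Point the metastore and warehouse at throw-away directories local to the
// project so each test run is self-contained.
val warehouse = new File("target/test-hive-warehouse").getCanonicalPath
val metastore = new File("target/test-hive-metastore").getCanonicalPath
val hc = new HiveContext(sc)
hc.setConf("hive.metastore.warehouse.dir", warehouse)
hc.setConf("javax.jdo.option.ConnectionURL",
  s"jdbc:derby:;databaseName=$metastore;create=true")
hc.sql("CREATE DATABASE IF NOT EXISTS myDb")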


A partitionBy problem

2014-11-18 Thread Tao Xiao
Hi all,

 I tested the *partitionBy* feature in a word-count application, and I'm
puzzled by a phenomenon. In this application, I created an RDD from some
text files in HDFS (about 100 GB in size), each of which has lines composed
of words separated by the character #. I wanted to count the occurrences of
each distinct word. *All lines have the same contents, so the final result
should be very small in bytes*.  The code is as follows:

  val text = sc.textFile(inputDir)
  val tuples = text.flatMap(line => line.split("#"))
                   .map((_, 1))
                   .reduceByKey(_ + _)
  tuples.collect.foreach { case (word, count) => println(word + " - " +
count) }

I submitted the application to a Spark cluster of 5 nodes and ran it in
standalone mode. From the application UI
http://imgbin.org/index.php?page=imageid=20976, we can see that the
shuffle process for *collect* and *reduceByKey* occupied small bandwidth
(766.4KB for *collect*'s shuffle read and 961KB for *reduceByKey*'s shuffle
write).

*However, the shuffle process occupied quite a large bandwidth when I
added partitionBy like this:*

  val text = sc.textFile(inputDir)
  val tuples = text.flatMap(line => line.split("#"))
                   .map((_, 1))
                   .partitionBy(new HashPartitioner(100))
                   .reduceByKey(_ + _)
  tuples.collect.foreach { case (word, count) => println(word + " - " +
count) }

From the application UI http://imgbin.org/index.php?page=imageid=20977,
we can see that the shuffle read for *collect* is 2.8GB and the shuffle
write for *map* is 3.5GB.

The *map* transformations are applied on 5 nodes of the cluster because the
HDFS blocks are distributed among these 5 nodes. The *map*
transformation is applied to each element of the RDD on different nodes
and shouldn't need to shuffle the new RDD. *So my first question is: why did
the map transformation occupy such a large bandwidth (3.5GB) when I added
partitionBy to the code?*

When *collect* is applied, it needs to collect the results, namely (*word*,
*totalCount*) tuples, from the 5 nodes to the driver. That process should occupy
very small bandwidth because all lines have the same contents, like
AAA#BBB#CCC#DDD, which means the final results that *collect* retrieves
should be very small in bytes (for example, hundreds of KB). *So my second
question is: why did the collect action occupy such a large bandwidth (2.8GB)
when I added partitionBy to the code?*

*And the third question: when I add partitionBy to an RDD, it returns
a new RDD. Does that mean the RDD will be immediately shuffled
across nodes to meet the requirement specified by the supplied partitioner,
or is the supplied partitioner merely a hint indicating how to
partition the RDD later?*

Thanks.
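
For what it's worth, a sketch with the same word-count logic that hands the partitioner to reduceByKey directly, so the (word, 1) pairs are still combined map-side before any shuffle instead of being shuffled raw by partitionBy:

import org.apache.spark.HashPartitioner

val text = sc.textFile(inputDir)
val tuples = text.flatMap(line => line.split("#"))
                 .map((_, 1))
                 .reduceByKey(new HashPartitioner(100), _ + _) // combine locally, then shuffle into 100 partitions
tuples.collect.foreach { case (word, count) => println(word + " - " + count) }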


Re: Converting a json struct to map

2014-11-18 Thread Akhil Das
Something like this?

   // Needs jackson-module-scala on the classpath (assumption: Jackson 2.x)
   import com.fasterxml.jackson.databind.ObjectMapper
   import com.fasterxml.jackson.module.scala.DefaultScalaModule
   import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

   val map_rdd = json_rdd.map(json => {
     val mapper = new ObjectMapper() with ScalaObjectMapper
     mapper.registerModule(DefaultScalaModule)

     val myMap = mapper.readValue[Map[String, String]](json)

     myMap
   })

Thanks
Best Regards

On Wed, Nov 19, 2014 at 11:01 AM, Daniel Haviv danielru...@gmail.com
wrote:

 Hi,
 I'm loading a json file into a RDD and then save that RDD as parquet.
 One of the fields is a map of keys and values but it is being translated
 and stored as a struct.

 How can I convert the field into a map?


 Thanks,
 Daniel