RE: Spark streaming on standalone cluster

2015-07-01 Thread prajod.vettiyattil
Spark Streaming needs at least two threads on the worker/slave side. I have 
seen this issue when (to test the behavior) I set the thread count for Spark 
Streaming to 1. It should be at least 2: one for the receiver adapter (Kafka, 
Flume, etc.) and the second for processing the data.

But I tested that in local mode: “--master local[2]”. The same issue could 
happen on the worker as well. If you set “--master local[1]”, the streaming 
worker/slave blocks due to starvation.

Which conf parameter sets the worker thread count in cluster mode? Is it 
spark.akka.threads?
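
For reference, here is a minimal sketch of the local-mode case described above (PySpark, Spark 1.4-era API). The ZooKeeper address, consumer group and topic name are placeholders, and the spark-streaming-kafka assembly is assumed to be on the classpath. In cluster mode the analogous knob is, as far as I can tell, the number of cores the application reserves (e.g. spark.cores.max on standalone) rather than a thread-count parameter such as spark.akka.threads.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# "local[1]" starves the job: the single thread is taken by the Kafka receiver
# and no batches are ever processed. "local[2]" leaves one thread for processing.
conf = SparkConf().setAppName("kafka-print").setMaster("local[2]")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)  # 5-second batches

# One receiver-based input DStream -> one core/thread is dedicated to receiving.
# "zookeeper-host:2181", "print-group" and "some-topic" are placeholders.
lines = KafkaUtils.createStream(ssc, "zookeeper-host:2181", "print-group", {"some-topic": 1})
lines.pprint()

ssc.start()
ssc.awaitTermination()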

From: Tathagata Das [mailto:t...@databricks.com]
Sent: 01 July 2015 01:32
To: Borja Garrido Bear
Cc: user
Subject: Re: Spark streaming on standalone cluster

How many receivers do you have in the streaming program? Your Spark application 
has to reserve more cores than the number of receivers. That would explain why 
you only receive the output after stopping.

TD

On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear 
kazebo...@gmail.com wrote:
Hi all,

I'm running a Spark standalone cluster with one master and one slave (different 
machines, both on version 1.4.0). The thing is, I have a Spark Streaming job 
that gets data from Kafka and then just prints it.

To configure the cluster I just started the master and then the slaves pointing 
to it, as everything appears in the web interface I assumed everything was 
fine, but maybe I missed some configuration.

When I run it locally there is no problem, it works.
When I run it in the cluster the worker state appears as loading
 - If the job is a Scala one, when I stop it I receive all the output
 - If the job is Python, when I stop it I receive a bunch of these exceptions

\\\

ERROR JobScheduler: Error running job streaming job 143567542 ms.0
py4j.Py4JException: An exception was raised by the Python Proxy. Return 
Message: null
at py4j.Protocol.getReturnValue(Protocol.java:417)
at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113)
at com.sun.proxy.$Proxy14.call(Unknown Source)
at 
org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63)
at 
org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
at 
org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

\\\

Is there any known issue with Spark Streaming and standalone mode, or with 
Python?



Re: Spark streaming on standalone cluster

2015-07-01 Thread Wojciech Pituła
Hi,
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Points to remember:

   - When running a Spark Streaming program locally, do not use “local” or
   “local[1]” as the master URL. Either of these means that only one thread
   will be used for running tasks locally. If you are using an input DStream
   based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single
   thread will be used to run the receiver, leaving no thread for processing
   the received data. Hence, when running locally, always use “local[n]”
   as the master URL, where n > number of receivers to run (see Spark
   Properties
   https://spark.apache.org/docs/latest/configuration.html#spark-properties
   for information on how to set the master).


On Wed, Jul 1, 2015 at 11:25 AM, Borja Garrido Bear kazebo...@gmail.com
wrote:

 Hi all,

 Thanks for the answers, yes, my problem was I was using just one worker
 with one core, so it was starving and I never got the job to run; now
 it seems it's working properly.

 One question, is this information in the docs? (because maybe I misread it)

 On Wed, Jul 1, 2015 at 10:30 AM, prajod.vettiyat...@wipro.com wrote:

  Spark Streaming needs at least two threads on the worker/slave side. I
 have seen this issue when (to test the behavior) I set the thread count for
 Spark Streaming to 1. It should be at least 2: one for the receiver
 adapter (Kafka, Flume, etc.) and the second for processing the data.



 But I tested that in local mode: “--master local[2]”. The same issue
 could happen on the worker as well. If you set “--master local[1]”, the
 streaming worker/slave blocks due to starvation.



 Which conf parameter sets the worker thread count in cluster mode? Is it
 spark.akka.threads?



 *From:* Tathagata Das [mailto:t...@databricks.com]
 *Sent:* 01 July 2015 01:32
 *To:* Borja Garrido Bear
 *Cc:* user
 *Subject:* Re: Spark streaming on standalone cluster



 How many receivers do you have in the streaming program? Your Spark
 application has to reserve more cores than the number of receivers. That
 would explain why you only receive the output after stopping.



 TD



 On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear kazebo...@gmail.com
 wrote:

  Hi all,



 I'm running a Spark standalone cluster with one master and one slave
 (different machines, both on version 1.4.0). The thing is, I have a Spark
 Streaming job that gets data from Kafka and then just prints it.



 To configure the cluster I just started the master and then the slaves
 pointing to it, as everything appears in the web interface I assumed
 everything was fine, but maybe I missed some configuration.



 When I run it locally there is no problem, it works.

 When I run it in the cluster the worker state appears as loading

  - If the job is a Scala one, when I stop it I receive all the output

  - If the job is Python, when I stop it I receive a bunch of these
 exceptions




 \\\



 ERROR JobScheduler: Error running job streaming job 143567542 ms.0

 py4j.Py4JException: An exception was raised by the Python Proxy. Return
 Message: null

 at py4j.Protocol.getReturnValue(Protocol.java:417)

 at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113)

 at com.sun.proxy.$Proxy14.call(Unknown Source)

 at
 org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63)

 at
 org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)

 at
 org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)

 at
 org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)

 at scala.util.Try$.apply(Try.scala:161)

 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)

 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)

 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)

 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)

 at scala.util.DynamicVariable.withValue

Re: Spark streaming on standalone cluster

2015-07-01 Thread Borja Garrido Bear
Hi all,

Thanks for the answers, yes, my problem was I was using just one worker
with one core, so it was starving and I never got the job to run; now
it seems it's working properly.
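
For anyone hitting the same thing, here is a hedged sketch of how I understand the fix on the application side: make sure the application ends up with at least one more core than it has receivers. On a standalone cluster the application grabs all available cores by default, and spark.cores.max caps that request; giving the worker itself more cores (e.g. SPARK_WORKER_CORES in spark-env.sh) achieves the same. The app name and master URL below are placeholders.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# With one receiver the application needs at least 2 cores:
# 1 for the receiver + at least 1 for processing the batches.
conf = (SparkConf()
        .setAppName("kafka-print")              # placeholder app name
        .setMaster("spark://master-host:7077")  # placeholder master URL
        .set("spark.cores.max", "2"))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)
# ... create the Kafka DStream and start the context as usual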

One question, is this information in the docs? (because maybe I misread it)

On Wed, Jul 1, 2015 at 10:30 AM, prajod.vettiyat...@wipro.com wrote:

  Spark Streaming needs at least two threads on the worker/slave side. I
 have seen this issue when (to test the behavior) I set the thread count for
 Spark Streaming to 1. It should be at least 2: one for the receiver
 adapter (Kafka, Flume, etc.) and the second for processing the data.



 But I tested that in local mode: “--master local[2]”. The same issue
 could happen on the worker as well. If you set “--master local[1]”, the
 streaming worker/slave blocks due to starvation.



 Which conf parameter sets the worker thread count in cluster mode? Is it
 spark.akka.threads?



 *From:* Tathagata Das [mailto:t...@databricks.com]
 *Sent:* 01 July 2015 01:32
 *To:* Borja Garrido Bear
 *Cc:* user
 *Subject:* Re: Spark streaming on standalone cluster



 How many receivers do you have in the streaming program? Your Spark
 application has to reserve more cores than the number of receivers. That
 would explain why you only receive the output after stopping.



 TD



 On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear kazebo...@gmail.com
 wrote:

  Hi all,



 I'm running a Spark standalone cluster with one master and one slave
 (different machines, both on version 1.4.0). The thing is, I have a Spark
 Streaming job that gets data from Kafka and then just prints it.



 To configure the cluster I just started the master and then the slaves
 pointing to it, as everything appears in the web interface I assumed
 everything was fine, but maybe I missed some configuration.



 When I run it locally there is no problem, it works.

 When I run it in the cluster the worker state appears as loading

  - If the job is a Scala one, when I stop it I receive all the output

  - If the job is Python, when I stop it I receive a bunch of these
 exceptions




 \\\



 ERROR JobScheduler: Error running job streaming job 143567542 ms.0

 py4j.Py4JException: An exception was raised by the Python Proxy. Return
 Message: null

 at py4j.Protocol.getReturnValue(Protocol.java:417)

 at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113)

 at com.sun.proxy.$Proxy14.call(Unknown Source)

 at
 org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63)

 at
 org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)

 at
 org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)

 at
 org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)

 at scala.util.Try$.apply(Try.scala:161)

 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)

 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)

 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)

 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)

 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)

 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)

 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 at java.lang.Thread.run(Thread.java:745)




 \\\



 Is there any known issue with Spark Streaming and standalone mode, or
 with Python?



Spark streaming on standalone cluster

2015-06-30 Thread Borja Garrido Bear
Hi all,

I'm running a Spark standalone cluster with one master and one slave
(different machines, both on version 1.4.0). The thing is, I have a Spark
Streaming job that gets data from Kafka and then just prints it.

To configure the cluster I just started the master and then the slaves
pointing to it, as everything appears in the web interface I assumed
everything was fine, but maybe I missed some configuration.

When I run it locally there is no problem, it works.
When I run it in the cluster the worker state appears as loading
 - If the job is a Scala one, when I stop it I receive all the output
 - If the job is Python, when I stop it I receive a bunch of these
exceptions

\\\

ERROR JobScheduler: Error running job streaming job 143567542 ms.0
py4j.Py4JException: An exception was raised by the Python Proxy. Return
Message: null
at py4j.Protocol.getReturnValue(Protocol.java:417)
at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113)
at com.sun.proxy.$Proxy14.call(Unknown Source)
at
org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63)
at
org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
at
org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

\\\

Is there any known issue with Spark Streaming and standalone mode, or
with Python?


Re: Spark streaming on standalone cluster

2015-06-30 Thread Tathagata Das
How many receivers do you have in the streaming program? Your Spark application
has to reserve more cores than the number of receivers. That would explain why
you only receive the output after stopping.

TD
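
To illustrate the point, a hedged sketch (PySpark, Spark 1.4-era API; the ZooKeeper address, consumer group and topic are placeholders): with three receiver-based Kafka streams, three cores are pinned by receiving alone, so the application needs at least four cores before any batch gets processed.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Three receivers occupy three cores permanently, so reserve at least four.
sc = SparkContext("local[4]", "multi-receiver-example")
ssc = StreamingContext(sc, 5)

streams = [
    KafkaUtils.createStream(ssc, "zookeeper-host:2181", "print-group", {"some-topic": 1})
    for _ in range(3)
]
ssc.union(*streams).pprint()

ssc.start()
ssc.awaitTermination()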

On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear kazebo...@gmail.com
wrote:

 Hi all,

 I'm running a Spark standalone cluster with one master and one slave
 (different machines, both on version 1.4.0). The thing is, I have a Spark
 Streaming job that gets data from Kafka and then just prints it.

 To configure the cluster I just started the master and then the slaves
 pointing to it, as everything appears in the web interface I assumed
 everything was fine, but maybe I missed some configuration.

 When I run it locally there is no problem, it works.
 When I run it in the cluster the worker state appears as loading
  - If the job is a Scala one, when I stop it I receive all the output
  - If the job is Python, when I stop it I receive a bunch of these
 exceptions


 \\\

 ERROR JobScheduler: Error running job streaming job 143567542 ms.0
 py4j.Py4JException: An exception was raised by the Python Proxy. Return
 Message: null
 at py4j.Protocol.getReturnValue(Protocol.java:417)
 at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113)
 at com.sun.proxy.$Proxy14.call(Unknown Source)
 at
 org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63)
 at
 org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
 at
 org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
 at
 org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at scala.util.Try$.apply(Try.scala:161)
 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)


 \\\

 Is there any known issue with Spark Streaming and standalone mode, or
 with Python?