Re: Spark streaming on standalone cluster

2015-07-01 Thread Borja Garrido Bear
Hi all,

Thanks for the answers. Yes, my problem was that I was using just one worker
with one core, so the job was starving and never got to run; now it seems to be
working properly.

One question: is this information in the docs? (Maybe I just misread it.)

On Wed, Jul 1, 2015 at 10:30 AM, prajod.vettiyat...@wipro.com wrote:

  Spark streaming needs at least two threads on the worker/slave side. I
 have seen this issue when (to test the behavior) I set the thread count for
 Spark streaming to 1. It should be at least 2: one for the receiver
 adapter (Kafka, Flume, etc.) and the second for processing the data.



 But I tested that in local mode (“--master local[2]”). The same issue
 could happen on the worker as well: if you set “--master local[1]”, the
 streaming worker/slave blocks due to starvation.
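
 For illustration, a minimal Scala sketch of the difference (the app name,
 host and port are just placeholders, not code from this thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LocalThreadsExample {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing the batches.
    // With local[1] the receiver occupies the only slot and no batch is ever processed.
    val conf = new SparkConf().setAppName("streaming-threads-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // socketTextStream is receiver-based, so it needs a dedicated thread/core.
    ssc.socketTextStream("localhost", 9999).print()

    ssc.start()
    ssc.awaitTermination()
  }
}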



 Which conf parameter sets the worker thread count in cluster mode? Is it
 spark.akka.threads?



 *From:* Tathagata Das [mailto:t...@databricks.com]
 *Sent:* 01 July 2015 01:32
 *To:* Borja Garrido Bear
 *Cc:* user
 *Subject:* Re: Spark streaming on standalone cluster



 How many receivers do you have in the streaming program? You have to have
 more cores reserved by your Spark application than the number of receivers.
 That would explain you only receiving the output after stopping.
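
 On a standalone cluster, a minimal sketch of what “reserving more cores than
 receivers” looks like, with illustrative values (spark.cores.max, or equivalently
 the --total-executor-cores flag on spark-submit, caps the cores the app takes):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CoreReservationExample {
  def main(args: Array[String]): Unit = {
    // One receiver in this hypothetical app, so reserve at least 2 cores in total:
    // 1 core is pinned by the receiver, the rest are free to process the batches.
    val conf = new SparkConf()
      .setAppName("streaming-core-reservation")
      .set("spark.cores.max", "2")
    val ssc = new StreamingContext(conf, Seconds(10))

    // A single receiver-based stream (socket source as a stand-in); host/port are placeholders.
    ssc.socketTextStream("stream-source-host", 9999).print()

    ssc.start()
    ssc.awaitTermination()
  }
}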



 TD



 On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear kazebo...@gmail.com
 wrote:

  Hi all,



  I'm running a Spark standalone cluster with one master and one slave
 (different machines, both on version 1.4.0). The thing is, I have a Spark
 streaming job that gets data from Kafka and just prints it.



 To configure the cluster I just started the master and then the slaves
 pointing to it; as everything appears in the web interface I assumed
 everything was fine, but maybe I missed some configuration.



 When I run it locally there is no problem; it works.

 When I run it in the cluster the worker state appears as "loading"

  - If the job is a Scala one, when I stop it I receive all the output

  - If the job is Python, when I stop it I receive a bunch of these
 exceptions




 \\\



 ERROR JobScheduler: Error running job streaming job 143567542 ms.0

 py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: null
   at py4j.Protocol.getReturnValue(Protocol.java:417)
   at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113)
   at com.sun.proxy.$Proxy14.call(Unknown Source)
   at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63)
   at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
   at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
   at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
   at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
   at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
   at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
   at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
   at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
   at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
   at scala.util.Try$.apply(Try.scala:161)
   at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
   at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
   at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
   at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
   at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)




 \\\



 Is there any known issue with Spark streaming and standalone mode, or
 with Python?



Spark streaming on standalone cluster

2015-06-30 Thread Borja Garrido Bear
Hi all,

I'm running a Spark standalone cluster with one master and one slave
(different machines, both on version 1.4.0). The thing is, I have a Spark
streaming job that gets data from Kafka and just prints it.
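
The job is roughly the following (a sketch, not the exact code; the ZooKeeper
quorum, consumer group and topic names are placeholders, and the Python version
does the same through the pyspark.streaming.kafka API):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaPrintJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-print")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Receiver-based Kafka stream (spark-streaming-kafka in Spark 1.4);
    // the receiver itself pins one core on the worker.
    val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "print-group", Map("my-topic" -> 1))
    stream.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()
  }
}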

To configure the cluster I just started the master and then the slaves
pointing to it; as everything appears in the web interface I assumed
everything was fine, but maybe I missed some configuration.

When I run it locally there is no problem; it works.
When I run it in the cluster the worker state appears as "loading"
 - If the job is a Scala one, when I stop it I receive all the output
 - If the job is Python, when I stop it I receive a bunch of these
exceptions

\\\

ERROR JobScheduler: Error running job streaming job 143567542 ms.0
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: null
  at py4j.Protocol.getReturnValue(Protocol.java:417)
  at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113)
  at com.sun.proxy.$Proxy14.call(Unknown Source)
  at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63)
  at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
  at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
  at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at scala.util.Try$.apply(Try.scala:161)
  at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)

\\\

Is there any known issue with Spark streaming and standalone mode, or
with Python?


Re: Spark standalone mode and kerberized cluster

2015-06-16 Thread Borja Garrido Bear
Thank you for the answer. It doesn't seem to work either (I haven't logged
into the machine as the spark user, but I ran kinit inside the spark-env
script), and I also tried it inside the job.

I've noticed that when I run pyspark the Kerberos token is used for
something, but the same behavior doesn't show up when I start a worker,
so maybe those aren't meant to use Kerberos...

On Tue, Jun 16, 2015 at 12:10 PM, Steve Loughran ste...@hortonworks.com
wrote:


  On 15 Jun 2015, at 15:43, Borja Garrido Bear kazebo...@gmail.com wrote:

  I tried running the job in a standalone cluster and I'm getting this:

  java.io.IOException: Failed on local exception: java.io.IOException:
 org.apache.hadoop.security.AccessControlException: Client cannot
 authenticate via:[TOKEN, KERBEROS]; Host Details : local host is:
 worker-node/0.0.0.0; destination host is: hdfs:9000;


 Both nodes can access HDFS when running Spark locally, and both have valid
 Kerberos credentials. I know keytabs are not supported for standalone mode at
 the moment, but as long as the tokens I had when starting the workers and the
 master are valid, this should work, shouldn't it?




 I don't know anything about tokens on standalone. In YARN what we have to
 do is something called delegation tokens: the client asks (something) for
 tokens granting access to HDFS and attaches them to the YARN container
 creation request, which is then handed off to the app master, which then
 gets to deal with (a) passing them down to launched workers and (b) dealing
 with token refresh (which is where keytabs come into play).
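
 A rough sketch of the first half of that handshake, using the plain Hadoop API
 (the renewer name and everything else here is illustrative; it shows the kind of
 call the client makes, not Spark's actual submission code):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

object DelegationTokenSketch {
  def main(args: Array[String]): Unit = {
    val hadoopConf = new Configuration()
    // Requires an existing Kerberos TGT (e.g. obtained via kinit) on the client side.
    UserGroupInformation.setConfiguration(hadoopConf)

    // Ask the NameNode for HDFS delegation tokens, naming who may renew them
    // ("yarn" here); in the YARN case these credentials get attached to the
    // container launch context that is handed to the application master.
    val creds = new Credentials()
    FileSystem.get(hadoopConf).addDelegationTokens("yarn", creds)
    println(s"Obtained ${creds.numberOfTokens()} delegation token(s)")
  }
}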

  Why not try sshing in to the worker node as the spark user and running kinit
 there, to see if the problem goes away once you've logged in with Kerberos?
 If that works, you're going to have to automate that process across the
 cluster.



Re: Spark standalone mode and kerberized cluster

2015-06-15 Thread Borja Garrido Bear
I tried running the job in a standalone cluster and I'm getting this:

java.io.IOException: Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN, KERBEROS]; Host Details : local host is:
worker-node/0.0.0.0; destination host is: hdfs:9000;


Both nodes can access HDFS when running Spark locally, and both have valid
Kerberos credentials. I know keytabs are not supported for standalone mode
at the moment, but as long as the tokens I had when starting the workers
and the master are valid, this should work, shouldn't it?



On Thu, Jun 11, 2015 at 10:22 AM, Steve Loughran ste...@hortonworks.com
wrote:

  That's spark on YARN in Kerberos

  In Spark 1.3 you can submit work to a Kerberized Hadoop cluster; once
 the tokens you passed up with your app submission expire (~72 hours) your
 job can't access HDFS any more.

  That's been addressed in Spark 1.4, where you can now specify a kerberos
 keytab for the application master; the AM will then give the workers
 updated tokens when needed.
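
 In Spark 1.4 that is driven by the principal/keytab pair supplied at submission
 time (if I remember right, spark.yarn.principal / spark.yarn.keytab, or the
 --principal / --keytab spark-submit options). A sketch with a placeholder
 principal, keytab path and HDFS input directory:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KerberizedStreamingJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kerberized-streaming")
      // Lets the YARN AM re-obtain HDFS tokens for long-running jobs (Spark 1.4+ on YARN).
      .set("spark.yarn.principal", "spark-user@EXAMPLE.COM")
      .set("spark.yarn.keytab", "/etc/security/keytabs/spark-user.keytab")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Receiver-less file stream reading from (Kerberized) HDFS; the path is a placeholder.
    ssc.textFileStream("hdfs:///tmp/streaming-input").print()

    ssc.start()
    ssc.awaitTermination()
  }
}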

  The Kerberos authentication is all related to the HDFS interaction, YARN
 itself, and the way Kerberized YARN runs your work under your user ID, not
 mapred or yarn. It will also handle SPNEGO authentication between your web
 browser and the Spark UI (which is redirected via the YARN RM proxy to
 achieve this).

  It does not do anything about Akka-based IPC between your client code
 and the Spark application.

  -steve

  On 11 Jun 2015, at 06:47, Akhil Das ak...@sigmoidanalytics.com wrote:

  This might help
 http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_installing-kerb-spark-quickstart.html

  Thanks
 Best Regards

 On Wed, Jun 10, 2015 at 6:49 PM, kazeborja kazebo...@gmail.com wrote:

 Hello all.

 I've been reading some old mails and noticed that the use of Kerberos in a
 standalone cluster was not supported. Is this still the case?

 Thanks.
 Borja.



