Re: submitting spark job with kerberized Hadoop issue

2016-08-06 Thread Wojciech Pituła
What I can say is that we successfully use Spark on YARN with a kerberized
cluster. One of my coworkers also tried using it the same way as you
are (Spark standalone with a kerberized cluster) but, as far as I remember, he
didn't succeed. I may be wrong, because I was not personally involved in
this use case, but I think he concluded that every executor of a Spark
standalone cluster must also be kinit'ed.
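
If going down that route, one thing worth trying (only a rough sketch, untested
here; the principal and keytab path are just the ones from the spark-submit
command quoted below) is to log the JVM in from a keytab explicitly via the
Hadoop UGI API, instead of relying on an external kinit ticket cache:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// Log this JVM in from a keytab before any HDFS access.
val hadoopConf = new Configuration()
hadoopConf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(hadoopConf)
UserGroupInformation.loginUserFromKeytab(
  "spark/hadoop-master@platalyticsrealm",  // principal from the command below
  "/etc/hadoop/conf/spark.keytab")         // keytab from the command below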

On Fri, 5 Aug 2016 at 15:54, Aneela Saleem wrote:

> Hi all,
>
> I'm trying to connect to a kerberized Hadoop cluster from a Spark job. I have
> kinit'd from the command line. When I run the following job, i.e.,
>
> *./bin/spark-submit --keytab /etc/hadoop/conf/spark.keytab --principal
> spark/hadoop-master@platalyticsrealm --class
> com.platalytics.example.spark.App --master spark://hadoop-master:7077
> /home/vm6/project-1-jar-with-dependencies.jar
> hdfs://hadoop-master:8020/text*
>
> I get the error:
>
> Caused by: java.io.IOException:
> org.apache.hadoop.security.AccessControlException: Client cannot
> authenticate via:[TOKEN, KERBEROS]
> at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:680)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>
> Following are the contents of *spark-defaults.conf* file:
>
> spark.master spark://hadoop-master:7077
> spark.eventLog.enabled   true
> spark.eventLog.dir   hdfs://hadoop-master:8020/spark/logs
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.yarn.access.namenodes hdfs://hadoop-master:8020/
> spark.yarn.security.tokens.hbase.enabled true
> spark.yarn.security.tokens.hive.enabled true
> spark.yarn.principal yarn/hadoop-master@platalyticsrealm
> spark.yarn.keytab /etc/hadoop/conf/yarn.keytab
>
>
> Also I have added the following in the *spark-env.sh* file:
>
> HOSTNAME=`hostname -f`
> export SPARK_HISTORY_OPTS="-Dspark.history.kerberos.enabled=true
> -Dspark.history.kerberos.principal=spark/${HOSTNAME}@platalyticsrealm
> -Dspark.history.kerberos.keytab=/etc/hadoop/conf/spark.keytab"
>
>
> Please guide me on how to trace this issue.
>
> Thanks
>
>


Dynamic (de)allocation with Spark Streaming

2015-11-04 Thread Wojciech Pituła
Hi,

I have some doubts about dynamic resource allocation with Spark Streaming.

If Spark has allocated 5 executors for me, it will dispatch each batch's
tasks on all of them roughly equally. So if batchInterval <
spark.dynamicAllocation.executorIdleTimeout,
then Spark will never free any executor. Moreover, to free an executor the
processing time of a batch must be lower than (batchInterval -
spark.dynamicAllocation.executorIdleTimeout),
e.g. for
batchInterval = 30s
spark.dynamicAllocation.executorIdleTimeout = 25s
my batch would have to be processed in under 5 seconds to free executors.

If everything I have written above is true, this is not a great mechanism,
because I would like to free an executor when, for example, the processing
time is lower than half the batch interval. I could set
spark.dynamicAllocation.executorIdleTimeout
to batchInterval/2, but then Spark would probably free all my executors...
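
For concreteness, a minimal sketch of the configuration being discussed
(property names are the standard ones; the values just mirror the example
above, and this is not a recommendation):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("dynamic-allocation-sketch")        // master is supplied by spark-submit
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")   // external shuffle service is required for dynamic allocation
  .set("spark.dynamicAllocation.executorIdleTimeout", "25s")  // executors released after 25s idle
val ssc = new StreamingContext(conf, Seconds(30)) // 30s batch interval, as in the example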

Has anyone worked out a sensible solution to this problem?


Re: Java 8 vs Scala

2015-07-16 Thread Wojciech Pituła
IMHO only Scala is an option. Once you're familiar with it, you just can't
even look at Java code.

On Thu, 16 Jul 2015 at 07:20, spark user spark_u...@yahoo.com.invalid wrote:

 I struggled a lot with Scala, almost 10 days with no improvement, but when I
 switched to Java 8 things were so smooth, and I used DataFrames with
 Redshift and Hive and all are looking good.
 If you are very good in Scala then go with Scala; otherwise Java is the best
 fit.

 This is just my opinion because I am a Java guy.



   On Wednesday, July 15, 2015 12:33 PM, vaquar khan vaquar.k...@gmail.com
 wrote:


 My choice is Java 8
 On 15 Jul 2015 18:03, Alan Burlison alan.burli...@oracle.com wrote:

 On 15/07/2015 08:31, Ignacio Blasco wrote:

  The main advantage of using Scala vs Java 8 is being able to use a console
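
To make the point about the console concrete, a quick sketch of the kind of
ad-hoc exploration spark-shell (the Scala REPL with sc already bound) allows;
the HDFS path is made up:

// typed straight into spark-shell, no compile/submit cycle
val logs = sc.textFile("hdfs:///tmp/app/logs/*.log")       // hypothetical path
logs.filter(_.contains("ERROR")).take(10).foreach(println)
logs.map(_.split(" ")(0)).countByValue()                   // quick ad-hoc aggregation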


 https://bugs.openjdk.java.net/browse/JDK-8043364

 --
 Alan Burlison
 --







Number of runJob at SparkPlan.scala:122 in Spark SQL

2015-07-09 Thread Wojciech Pituła
Hey,

I was wondering if it is possible to tune the number of jobs generated by
Spark SQL. Currently my query generates over 80 "runJob at SparkPlan.scala:122"
jobs; each of them executes in ~4 seconds and contains only 5 tasks.
As a result, most of my cores do nothing.
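
Not an answer from the thread, but a sketch of the usual knobs for this
situation (property and method names are standard; the input path and query
are made up): raising spark.sql.shuffle.partitions and/or repartitioning the
input gives each job more tasks, so more cores stay busy.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// More shuffle partitions => more tasks per shuffle stage.
sqlContext.setConf("spark.sql.shuffle.partitions", "200")
val df = sqlContext.read.parquet("hdfs:///data/events")   // hypothetical input
df.repartition(64).registerTempTable("events")            // spread the data over more partitions
sqlContext.sql("SELECT key, count(*) FROM events GROUP BY key").show()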


Re: Spark streaming on standalone cluster

2015-07-01 Thread Wojciech Pituła
Hi,
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Points to remember:

   - When running a Spark Streaming program locally, do not use “local” or
   “local[1]” as the master URL. Either of these means that only one thread
   will be used for running tasks locally. If you are using an input DStream
   based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single
   thread will be used to run the receiver, leaving no thread for processing
   the received data. Hence, when running locally, always use “local[n]” as
   the master URL, where n > number of receivers to run (see Spark Properties
   https://spark.apache.org/docs/latest/configuration.html#spark-properties
   for information on how to set the master). See the sketch below.
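
A minimal sketch of the starvation point above (host and port are
placeholders): with one socket receiver you need at least local[2], so one
thread can run the receiver and another can process the received batches.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("local[2]")                              // "local[1]" would leave no thread for processing
  .setAppName("receiver-starvation-sketch")
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)   // the single receiver occupies one thread
lines.count().print()
ssc.start()
ssc.awaitTermination()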


On Wed, 1 Jul 2015 at 11:25, Borja Garrido Bear kazebo...@gmail.com wrote:

 Hi all,

 Thanks for the answers. Yes, my problem was that I was using just one worker
 with one core, so it was starving and I never got the job to run; now
 it seems it's working properly.

 One question: is this information in the docs? (Maybe I misread it.)

 On Wed, Jul 1, 2015 at 10:30 AM, prajod.vettiyat...@wipro.com wrote:

  Spark Streaming needs at least two threads on the worker/slave side. I
 have seen this issue when (to test the behavior) I set the thread count for
 Spark Streaming to 1. It should be at least 2: one for the receiver
 adapter (Kafka, Flume, etc.) and the second for processing the data.



 But I tested that in local mode: “--master local[2]”. The same issue
 could happen on a worker also. If you set “--master local[1]” the streaming
 worker/slave blocks due to starvation.



 Which conf parameter sets the worker thread count in cluster mode? Is it
 spark.akka.threads?



 *From:* Tathagata Das [mailto:t...@databricks.com]
 *Sent:* 01 July 2015 01:32
 *To:* Borja Garrido Bear
 *Cc:* user
 *Subject:* Re: Spark streaming on standalone cluster



 How many receivers do you have in the streaming program? You have to have
 more cores reserved by your Spark application than the number of
 receivers. That would explain receiving the output only after stopping.



 TD



 On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear kazebo...@gmail.com
 wrote:

  Hi all,



 I'm running a Spark standalone cluster with one master and one slave
 (different machines, both on version 1.4.0). The thing is, I have a Spark
 streaming job that gets data from Kafka and then just prints it.



 To configure the cluster I just started the master and then the slaves
 pointing to it; as everything appears in the web interface I assumed
 everything was fine, but maybe I missed some configuration.



 When I run it locally there is no problem, it works.

 When I run it in the cluster the worker state appears as loading

  - If the job is a Scala one, when I stop it I receive all the output

  - If the job is Python, when I stop it I receive a bunch of these
 exceptions







 ERROR JobScheduler: Error running job streaming job 143567542 ms.0

 py4j.Py4JException: An exception was raised by the Python Proxy. Return
 Message: null

 at py4j.Protocol.getReturnValue(Protocol.java:417)

 at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113)

 at com.sun.proxy.$Proxy14.call(Unknown Source)

 at
 org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63)

 at
 org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)

 at
 org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)

 at
 org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)

 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)

 at scala.util.Try$.apply(Try.scala:161)

 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)

 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)

 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)

 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)

 at 

Re: Spark-Submit / Spark-Shell Error Standalone cluster

2015-06-28 Thread Wojciech Pituła
I assume that /usr/bin/load-spark-env.sh exists. Have you got the rights to
execute it?

On Sun, 28 Jun 2015 at 04:53, Ashish Soni asoni.le...@gmail.com wrote:

 Not sure what the issue is, but when I run spark-submit or spark-shell
 I am getting the below error:

 /usr/bin/spark-class: line 24: /usr/bin/load-spark-env.sh: No such file or
 directory

 Can someone please help?

 Thanks,



Re: Spark Streaming: limit number of nodes

2015-06-24 Thread Wojciech Pituła
Ok, thanks. I have 1 worker process on each machine but I would like to run
my app on only 3 of them. Is it possible?

On Wed, 24 Jun 2015 at 11:44, Evo Eftimov evo.efti...@isecc.com wrote:

 There is no direct one-to-one mapping between an Executor and a Node.



 An Executor is simply the Spark framework's term for a JVM instance with some
 Spark framework system code running in it.



 A node is a physical server machine



 You can have more than one JVM per node



 And vice versa, you can have nodes without any JVM running on them. How? By
 specifying the number of executors to be less than the number of nodes.



 So if you specify the number of executors to be 1 and you have 5 nodes, ONE
 executor will run on only one of them.



 The above is valid for Spark on YARN



 For Spark in standalone mode, the number of executors is equal to the
 number of Spark worker processes (daemons) running on each node.



 *From:* Wojciech Pituła [mailto:w.pit...@gmail.com]
 *Sent:* Tuesday, June 23, 2015 12:38 PM
 *To:* user@spark.apache.org
 *Subject:* Spark Streaming: limit number of nodes



 I have set up a small standalone cluster: 5 nodes, every node has 5GB of
 memory and 8 cores. As you can see, the nodes don't have much RAM.



 I have 2 streaming apps; the first one is configured to use 3GB of memory per
 node and the second one uses 2GB per node.



 My problem is that the smaller app could easily run on 2 or 3 nodes instead
 of 5, so I could launch a third app.



 Is it possible to limit the number of nodes (executors) that an app will get
 from the standalone cluster?



Re: Spark Streaming: limit number of nodes

2015-06-23 Thread Wojciech Pituła
I cannot. I've already limited the number of cores to 10, so it gets 5
executors with 2 cores each...

On Tue, 23 Jun 2015 at 13:45, Akhil Das ak...@sigmoidanalytics.com wrote:

 Use *spark.cores.max* to limit the CPU per job, then you can easily
 accommodate your third job also.
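
A sketch of that suggestion (spark.cores.max and spark.executor.memory are
standard properties; the master URL and values are only illustrative): capping
total cores, together with per-executor memory, bounds how much of the
standalone cluster a single app can take.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("spark://master:7077")     // hypothetical master URL
  .setAppName("capped-streaming-app")
  .set("spark.cores.max", "10")         // at most 10 cores in total for this app
  .set("spark.executor.memory", "2g")   // memory requested per executor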

 Thanks
 Best Regards

 On Tue, Jun 23, 2015 at 5:07 PM, Wojciech Pituła w.pit...@gmail.com
 wrote:

 I have set up a small standalone cluster: 5 nodes, every node has 5GB of
 memory and 8 cores. As you can see, the nodes don't have much RAM.

 I have 2 streaming apps; the first one is configured to use 3GB of memory per
 node and the second one uses 2GB per node.

 My problem is that the smaller app could easily run on 2 or 3 nodes instead
 of 5, so I could launch a third app.

 Is it possible to limit the number of nodes (executors) that an app will get
 from the standalone cluster?





Spark Streaming: limit number of nodes

2015-06-23 Thread Wojciech Pituła
I have set up a small standalone cluster: 5 nodes, every node has 5GB of
memory and 8 cores. As you can see, the nodes don't have much RAM.

I have 2 streaming apps; the first one is configured to use 3GB of memory per
node and the second one uses 2GB per node.

My problem is that the smaller app could easily run on 2 or 3 nodes instead
of 5, so I could launch a third app.

Is it possible to limit the number of nodes (executors) that an app will get
from the standalone cluster?