Re: submitting spark job with kerberized Hadoop issue
What I can say is that we successfully use Spark on YARN with a kerberized cluster. One of my coworkers also tried using it the same way you are (Spark standalone with a kerberized cluster), but as far as I remember he didn't succeed. I may be wrong, because I was not personally involved in this use case, but I think he concluded that every executor of a Spark standalone cluster must also be kinit'ed.

On Fri, 5 Aug 2016 at 15:54, Aneela Saleem wrote:
> Hi all,
>
> I'm trying to connect to a kerberized Hadoop cluster from a Spark job. I have
> kinit'd from the command line. When I run the following job, i.e.,
>
> ./bin/spark-submit --keytab /etc/hadoop/conf/spark.keytab \
>   --principal spark/hadoop-master@platalyticsrealm \
>   --class com.platalytics.example.spark.App \
>   --master spark://hadoop-master:7077 \
>   /home/vm6/project-1-jar-with-dependencies.jar \
>   hdfs://hadoop-master:8020/text
>
> I get the error:
>
> Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException:
> Client cannot authenticate via:[TOKEN, KERBEROS]
>     at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:680)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>
> Following are the contents of spark-defaults.conf:
>
> spark.master                              spark://hadoop-master:7077
> spark.eventLog.enabled                    true
> spark.eventLog.dir                        hdfs://hadoop-master:8020/spark/logs
> spark.serializer                          org.apache.spark.serializer.KryoSerializer
> spark.yarn.access.namenodes               hdfs://hadoop-master:8020/
> spark.yarn.security.tokens.hbase.enabled  true
> spark.yarn.security.tokens.hive.enabled   true
> spark.yarn.principal                      yarn/hadoop-master@platalyticsrealm
> spark.yarn.keytab                         /etc/hadoop/conf/yarn.keytab
>
> I have also added the following in spark-env.sh:
>
> HOSTNAME=`hostname -f`
> export SPARK_HISTORY_OPTS="-Dspark.history.kerberos.enabled=true
>   -Dspark.history.kerberos.principal=spark/${HOSTNAME}@platalyticsrealm
>   -Dspark.history.kerberos.keytab=/etc/hadoop/conf/spark.keytab"
>
> Please guide me on how to trace the issue.
>
> Thanks
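Since the tentative conclusion above is that every standalone executor host needs its own Kerberos ticket, here is a sketch of what that would mean in practice. The keytab path and principal are taken from the quoted message; whether this actually fixes standalone mode is exactly what the thread leaves open, so treat it as an illustration, not a verified fix:

```python
# Hypothetical helper that builds the kinit command each standalone worker host
# would need to run, using the keytab/principal from the quoted spark-submit
# call. It only constructs the command string; it does not talk to a KDC.
def kinit_command(keytab, principal):
    return "kinit -kt {} {}".format(keytab, principal)

cmd = kinit_command("/etc/hadoop/conf/spark.keytab",
                    "spark/hadoop-master@platalyticsrealm")
print(cmd)
# -> kinit -kt /etc/hadoop/conf/spark.keytab spark/hadoop-master@platalyticsrealm
```

On YARN this is unnecessary because `--keytab`/`--principal` let the application master renew tickets for the executors, which matches the report above that Spark on YARN works.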
Dynamic (de)allocation with Spark Streaming
Hi, I have some doubts about dynamic resource allocation with Spark Streaming. If Spark has allocated 5 executors for me, then it will dispatch every batch's tasks across all of them roughly equally. So if batchSize < spark.dynamicAllocation.executorIdleTimeout, Spark will never free any executor. Moreover, to free an executor the processing time of a batch must be lower than (batchSize - spark.dynamicAllocation.executorIdleTimeout); e.g., for batchSize = 30s and spark.dynamicAllocation.executorIdleTimeout = 25s, my batch would have to be processed in under 5 seconds for executors to be freed. If everything I have written above is true, it is not such a great mechanism, because I would like to free an executor when, for example, processing time is lower than batchSize/2. I could set spark.dynamicAllocation.executorIdleTimeout to batchSize/2, but then Spark would probably free all my executors... Has anyone worked out a sensible solution to this problem?
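The arithmetic in the question can be sketched as a quick check. This is just my reading of the dynamic-allocation semantics described above (an executor is released only after sitting idle for executorIdleTimeout seconds, and within one batch it can be idle for at most batchSize minus processing time), not Spark's actual scheduler code:

```python
# Illustrative model of the idle-timeout arithmetic from the question:
# within one batch, an executor's maximum idle window is
# (batch size - processing time); it is released only if that window
# exceeds spark.dynamicAllocation.executorIdleTimeout.
def executor_can_be_freed(batch_size_s, processing_time_s, idle_timeout_s):
    idle_window = batch_size_s - processing_time_s
    return idle_window > idle_timeout_s

# The example from the question: 30s batches, 25s idle timeout.
print(executor_can_be_freed(30, 4, 25))   # 4s processing -> 26s idle -> True
print(executor_can_be_freed(30, 10, 25))  # 10s processing -> 20s idle -> False
```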
Re: Java 8 vs Scala
IMHO only Scala is an option. Once you're familiar with it you just can't even look at Java code.

On Thu, 16 Jul 2015 at 07:20, spark user spark_u...@yahoo.com.invalid wrote: I struggled a lot with Scala; after almost 10 days, no improvement. But when I switched to Java 8 things were smooth, and I used DataFrames with Redshift and Hive and all look good. If you are very good in Scala then go with Scala, otherwise Java is the best fit. This is just my opinion, because I am a Java guy.

On Wednesday, July 15, 2015 12:33 PM, vaquar khan vaquar.k...@gmail.com wrote: My choice is Java 8.

On 15 Jul 2015 18:03, Alan Burlison alan.burli...@oracle.com wrote: On 15/07/2015 08:31, Ignacio Blasco wrote: The main advantage of using Scala vs Java 8 is being able to use a console. https://bugs.openjdk.java.net/browse/JDK-8043364 -- Alan Burlison
Number of runJob at SparkPlan.scala:122 in Spark SQL
Hey, I was wondering if it is possible to tune the number of jobs generated by Spark SQL. Currently my query generates over 80 "runJob at SparkPlan.scala:122" jobs; each of them executes in ~4 seconds and contains only 5 tasks. As a result, most of my cores do nothing.
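Not an answer from the thread, but one hypothesis worth testing: 5-task jobs suggest the plan only has 5 partitions to work with. A spark-defaults.conf sketch (the value is illustrative, and whether partitioning is actually the bottleneck here is an assumption):

```
# Illustrative fragment (assumed relevant, not confirmed by the thread):
# more partitions after shuffles means more tasks per stage, so more cores
# can participate. The default is 200, so 5-task jobs may instead come from
# the input source's own partitioning, which would call for repartitioning
# the input rather than changing this setting.
spark.sql.shuffle.partitions  64
```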
Re: Spark streaming on standalone cluster
Hi,

https://spark.apache.org/docs/latest/streaming-programming-guide.html

Points to remember:

- When running a Spark Streaming program locally, do not use "local" or "local[1]" as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then that single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use "local[n]" as the master URL, where n is greater than the number of receivers to run (see Spark Properties https://spark.apache.org/docs/latest/configuration.html#spark-properties for information on how to set the master).

On Wed, 1 Jul 2015 at 11:25, Borja Garrido Bear kazebo...@gmail.com wrote:

Hi all, thanks for the answers. Yes, my problem was that I was using just one worker with one core, so it was starving and the job never ran; now it seems to be working properly. One question: is this information in the docs? (Maybe I misread it.)

On Wed, Jul 1, 2015 at 10:30 AM, prajod.vettiyat...@wipro.com wrote:

Spark Streaming needs at least two threads on the worker/slave side. I have seen this issue when (to test the behavior) I set the thread count for Spark Streaming to 1. It should be at least 2: one for the receiver adapter (Kafka, Flume, etc.) and the second for processing the data. But I tested that in local mode: "--master local[2]". The same issue could happen on a worker too. If you set "--master local[1]", the streaming worker/slave blocks due to starvation. Which conf parameter sets the worker thread count in cluster mode? Is it spark.akka.threads?

From: Tathagata Das [mailto:t...@databricks.com]
Sent: 01 July 2015 01:32
To: Borja Garrido Bear
Cc: user
Subject: Re: Spark streaming on standalone cluster

How many receivers do you have in the streaming program?
You have to have more cores reserved by your Spark application than the number of receivers. That would explain you receiving the output only after stopping.

TD

On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear kazebo...@gmail.com wrote:

Hi all, I'm running a Spark standalone cluster with one master and one slave (different machines, both on version 1.4.0). The thing is, I have a Spark Streaming job that gets data from Kafka and then just prints it. To configure the cluster I just started the master and then the slave pointing to it; as everything appeared in the web interface I assumed everything was fine, but maybe I missed some configuration. When I run it locally there is no problem, it works. When I run it in the cluster, the worker state appears as "loading":

- If the job is a Scala one, when I stop it I receive all the output.
- If the job is Python, when I stop it I receive a bunch of these exceptions:

ERROR JobScheduler: Error running job streaming job 143567542 ms.0
py4j.Py4JException: An exception was raised by the Python Proxy.
Return Message: null
    at py4j.Protocol.getReturnValue(Protocol.java:417)
    at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113)
    at com.sun.proxy.$Proxy14.call(Unknown Source)
    at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63)
    at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
    at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
    at
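TD's rule of thumb in this thread (the cores reserved by the app must exceed the number of receivers, or no slots remain to process the received data) can be sketched as a quick sanity check. The function is illustrative, not a Spark API:

```python
# Illustrative check of the rule from this thread: each active receiver
# permanently occupies one core (task slot), so at least one additional
# core must be left over for processing the received batches.
def enough_cores_for_streaming(total_cores, num_receivers):
    return total_cores > num_receivers

print(enough_cores_for_streaming(1, 1))  # local[1] + one Kafka receiver -> False
print(enough_cores_for_streaming(2, 1))  # local[2] + one Kafka receiver -> True
```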
Re: Spark-Submit / Spark-Shell Error Standalone cluster
I assume that /usr/bin/load-spark-env.sh exists. Do you have the rights to execute it?

On Sun, 28 Jun 2015 at 04:53, Ashish Soni asoni.le...@gmail.com wrote: Not sure what the issue is, but when I run spark-submit or spark-shell I get the error below:

/usr/bin/spark-class: line 24: /usr/bin/load-spark-env.sh: No such file or directory

Can someone please help?

Thanks,
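Both questions (does the file exist, and is it executable) can be checked in one go. In this sketch a temp file stands in for /usr/bin/load-spark-env.sh so the snippet is self-contained and runnable anywhere:

```shell
# Stand-in for /usr/bin/load-spark-env.sh so this runs anywhere.
f=$(mktemp)

[ -f "$f" ] && echo "exists"          # the file is present
[ -x "$f" ] || echo "not executable"  # mktemp creates files without +x
chmod +x "$f"                         # what a permissions fix could look like
[ -x "$f" ] && echo "executable now"

rm -f "$f"
```

Note that the original error says "No such file or directory", so the first check is the one to run on the real path: the script may simply be missing from /usr/bin rather than unexecutable.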
Re: Spark Streaming: limit number of nodes
Ok, thanks. I have 1 worker process on each machine, but I would like to run my app on only 3 of them. Is that possible?

On Wed, 24 Jun 2015 at 11:44, Evo Eftimov evo.efti...@isecc.com wrote:

There is no direct one-to-one mapping between Executor and Node.

An Executor is simply the Spark framework term for a JVM instance with some Spark framework system code running in it. A Node is a physical server machine. You can have more than one JVM per node, and vice versa you can have nodes without any JVM running on them. How? By specifying the number of executors to be less than the number of nodes. So if you specify the number of executors to be 1 and you have 5 nodes, ONE executor will run on only one of them.

The above is valid for Spark on YARN. For Spark in standalone mode, the number of executors is equal to the number of Spark worker processes (daemons) running on each node.

From: Wojciech Pituła [mailto:w.pit...@gmail.com]
Sent: Tuesday, June 23, 2015 12:38 PM
To: user@spark.apache.org
Subject: Spark Streaming: limit number of nodes

I have set up a small standalone cluster: 5 nodes, every node has 5GB of memory and 8 cores. As you can see, a node doesn't have much RAM. I have 2 streaming apps; the first one is configured to use 3GB of memory per node and the second one uses 2GB per node. My problem is that the smaller app could easily run on 2 or 3 nodes instead of 5, so that I could launch a third app. Is it possible to limit the number of nodes (executors) that an app will get from the standalone cluster?
Re: Spark Streaming: limit number of nodes
I can not. I've already limited the number of cores to 10, so it gets 5 executors with 2 cores each...

On Tue, 23 Jun 2015 at 13:45, Akhil Das ak...@sigmoidanalytics.com wrote:

Use spark.cores.max to limit the CPU per job, then you can easily accommodate your third job also.

Thanks
Best Regards

On Tue, Jun 23, 2015 at 5:07 PM, Wojciech Pituła w.pit...@gmail.com wrote:

I have set up a small standalone cluster: 5 nodes, every node has 5GB of memory and 8 cores. As you can see, a node doesn't have much RAM. I have 2 streaming apps; the first one is configured to use 3GB of memory per node and the second one uses 2GB per node. My problem is that the smaller app could easily run on 2 or 3 nodes instead of 5, so that I could launch a third app. Is it possible to limit the number of nodes (executors) that an app will get from the standalone cluster?
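For reference, a spark-defaults.conf sketch of the setting Akhil mentions (values are illustrative). As this reply shows, spark.cores.max caps the app's total cores cluster-wide, but by default the standalone master still spreads those cores across all workers; the spreadOut comment below points at the master-side setting that changes that behavior:

```
# Illustrative fragment: cap this app at 10 cores total across the cluster.
# The standalone master may still spread them across all 5 workers
# (2 cores each), as observed in this thread.
spark.cores.max  10

# Master-side (not per app): setting spark.deploy.spreadOut to false in the
# master's configuration makes the standalone master consolidate an app onto
# as few nodes as possible instead of spreading it out.
```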
Spark Streaming: limit number of nodes
I have set up a small standalone cluster: 5 nodes, every node has 5GB of memory and 8 cores. As you can see, a node doesn't have much RAM. I have 2 streaming apps; the first one is configured to use 3GB of memory per node and the second one uses 2GB per node. My problem is that the smaller app could easily run on 2 or 3 nodes instead of 5, so that I could launch a third app. Is it possible to limit the number of nodes (executors) that an app will get from the standalone cluster?