Re: How to use spark-on-k8s pod template?
Are you using Spark 2.3 or above? See the documentation:
https://spark.apache.org/docs/latest/running-on-kubernetes.html

It looks like you do not need:

    --conf spark.kubernetes.driver.podTemplateFile='/spark-pod-template.yaml' \
    --conf spark.kubernetes.executor.podTemplateFile='/spark-pod-template.yaml' \

Are your service account and namespace properly set up?

Cluster mode:

    $ bin/spark-submit \
        --master k8s://https://: \
        --deploy-mode cluster \
        --name spark-pi \
        --class org.apache.spark.examples.SparkPi \
        --conf spark.executor.instances=5 \
        --conf spark.kubernetes.container.image= \
        local:///path/to/examples.jar

On Tue, Nov 5, 2019 at 6:37 AM sora wrote:
> Hi all,
> I am looking for the usage of the spark-on-k8s pod template.
> I want to set some toleration rules for the driver and executor pods.
> I tried to set --conf spark.kubernetes.driver.podTemplateFile=/spark-pod-template.yaml, but it didn't work.
> The driver pod started without the toleration rules and stayed pending because no node was available.
> Could anyone please show me the correct usage?
>
> The template file is below.
>
>     apiVersion: extensions/v1beta1
>     kind: Pod
>     spec:
>       template:
>         spec:
>           tolerations:
>           - effect: NoSchedule
>             key: project
>             operator: Equal
>             value: name
>
> My full command is below.
> /opt/spark/bin/spark-submit \
>     --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT \
>     --conf spark.kubernetes.driver.podTemplateFile='/spark-pod-template.yaml' \
>     --conf spark.kubernetes.executor.podTemplateFile='/spark-pod-template.yaml' \
>     --conf spark.scheduler.mode=FAIR \
>     --conf spark.driver.memory=2g \
>     --conf spark.driver.cores=1 \
>     --conf spark.executor.cores=1 \
>     --conf spark.executor.memory=1g \
>     --conf spark.executor.instances=4 \
>     --conf spark.kubernetes.container.image=job-image \
>     --conf spark.kubernetes.namespace=nc \
>     --conf spark.kubernetes.authenticate.driver.serviceAccountName=sa \
>     --conf spark.kubernetes.report.interval=5 \
>     --conf spark.kubernetes.submission.waitAppCompletion=false \
>     --deploy-mode cluster \
>     --name job-name \
>     --class job.class job.jar job-args

### Confidential e-mail, for recipient's (or recipients') eyes only, not for distribution. ###
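A likely reason the tolerations were ignored: Spark's podTemplateFile expects a plain Pod manifest, not a Deployment-style wrapper with a nested template block. A minimal sketch of the same tolerations in that form (the key/value pair is carried over from the post above; verify the accepted fields against your Spark version's docs):

```yaml
# Sketch: a bare Pod spec — tolerations sit directly under spec,
# with apiVersion: v1, not extensions/v1beta1.
apiVersion: v1
kind: Pod
spec:
  tolerations:
  - effect: NoSchedule
    key: project
    operator: Equal
    value: name
```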
Re: What benefits do we really get out of colocation?
To get a node-local read from Spark to Cassandra, one has to use a read consistency level of LOCAL_ONE. For some use cases, this is not an option. For example, if you need a read consistency level of LOCAL_QUORUM, as many use cases demand, then you are not going to get a node-local read. Also, to ensure a node-local read, one has to set spark.locality.wait to zero. Whether a partition will be streamed to another node or computed locally depends on the spark.locality.wait parameter; setting it to 0 forces all partitions to be computed only on local nodes. If you do some testing, please post your performance numbers.
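As a sketch, both settings mentioned above would be passed at submit time. The spark.cassandra.input.consistency.level property comes from the spark-cassandra-connector; treat the exact property name as an assumption for your connector version:

```
$ spark-submit \
    --conf spark.locality.wait=0 \
    --conf spark.cassandra.input.consistency.level=LOCAL_ONE \
    ...
```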
Re: How to avoid Spark shuffle spill memory?
Hi unk1102,

Try adding more memory to your nodes. Are you running Spark in the cloud? If so, increase the memory on your servers.

Do you have default parallelism set (spark.default.parallelism)? If so, unset it, and let Spark decide how many partitions to allocate.

You can also try refactoring your code to make it use less memory.

David

On Tue, Oct 6, 2015 at 3:19 PM, unk1102 wrote:
> Hi, I have a Spark job which runs for around 4 hours with a shared SparkContext and many child jobs. When I look at each job in the UI, I see a shuffle spill of around 30 to 40 GB, and because of that executors often get lost for using physical memory beyond limits. How do I avoid shuffle spill? I have tried almost all optimisations and nothing is helping. I don't cache anything. I am using Spark 1.4.1, with Tungsten, codegen, etc. I am using spark.shuffle.storage as 0.2 and spark.storage.memory as 0.2. I tried to increase shuffle memory to 0.6, but then it halts in GC pauses, causing my executor to time out and eventually get lost.
>
> Please guide. Thanks in advance.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-avoid-Spark-shuffle-spill-memory-tp24960.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
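The fraction settings quoted above look like shorthand for the Spark 1.x properties spark.shuffle.memoryFraction and spark.storage.memoryFraction. A hedged example of shifting the balance toward shuffle when nothing is cached (the values are illustrative, not a recommendation):

```
$ spark-submit \
    --conf spark.shuffle.memoryFraction=0.4 \
    --conf spark.storage.memoryFraction=0.1 \
    ...
```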
Re: submit_spark_job_to_YARN
Hi Ajay,

Are you trying to save to your local file system or to HDFS?

    // This would save to HDFS under /user/hadoop/counter
    counter.saveAsTextFile("/user/hadoop/counter")

David

On Sun, Aug 30, 2015 at 11:21 AM, Ajay Chander itsche...@gmail.com wrote:
> Hi Everyone,
> Recently we installed Spark on YARN in a Hortonworks cluster. I am trying to run a word-count program from my Eclipse, and with setMaster("local") I see the results as expected. Now I want to submit the same job to my YARN cluster from Eclipse. In Storm I did this by using the StormSubmitter class and passing the Nimbus and ZooKeeper hosts to the Config object, and I was looking for something similar here. The documentation online read that I am supposed to export HADOOP_HOME_DIR pointing to the conf dir. So I copied the conf folder from one of Spark's gateway nodes to my local Unix box and exported that dir:
>
>     export HADOOP_HOME_DIR=/Users/user1/Documents/conf/
>
> I did the same in .bash_profile too, and echo $HADOOP_HOME_DIR prints the path at the command prompt. My assumption is that when I change setMaster("local") to setMaster("yarn-client"), my program should pick up the resource manager (i.e., YARN cluster) info from the directory I exported, and the job should get submitted to the resource manager from Eclipse. But somehow it's not happening. Please tell me if my assumption is wrong or if I am missing anything here. I have attached the word-count program that I was using. Any help is highly appreciated.
>
> Thank you,
> Ajay
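A hedged sketch of the usual route: the environment variable Spark actually reads for cluster configuration is HADOOP_CONF_DIR (HADOOP_HOME_DIR above may be the culprit), and submission is typically done with spark-submit rather than from the IDE. The class and jar names here are placeholders:

```
$ export HADOOP_CONF_DIR=/Users/user1/Documents/conf/
$ spark-submit \
    --master yarn-client \
    --class com.example.WordCount \
    wordcount.jar
```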
Re: No. of Task vs No. of Executors
This is likely due to data skew. If you are using key-value pairs, one key has many more records than the other keys. Do you have any groupBy operations?

David

On Tue, Jul 14, 2015 at 9:43 AM, shahid sha...@trialx.com wrote:
> hi
> I have a 10-node cluster. I loaded the data onto HDFS, so the number of partitions I get is 9. I am running a Spark application, and it gets stuck on one of the tasks. Looking at the UI, it seems the application is not using all nodes to do calculations. Attached is a screenshot of the tasks; it seems tasks are put on each node more than once. 8 tasks get completed in 7-8 minutes, and one task takes around 30 minutes, causing the delay in results.
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n23824/Screen_Shot_2015-07-13_at_9.png
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/No-of-Task-vs-No-of-Executors-tp23824.html
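A quick way to confirm skew (a sketch — assumes a pair RDD named rdd; adjust to your data) is to look at the heaviest keys. If one count dwarfs the rest, that key's partition is the straggler:

```scala
// Count records per key, then print the ten largest counts.
val perKey = rdd.mapValues(_ => 1L).reduceByKey(_ + _)
perKey.sortBy(_._2, ascending = false).take(10).foreach(println)
```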
Re: Spark performance
You can certainly query over 4 TB of data with Spark. However, you will get an answer in minutes or hours, not in milliseconds or seconds. OLTP databases are used for web applications and typically return responses in milliseconds. Analytic databases tend to operate on large data sets and return responses in seconds, minutes, or hours. When running batch jobs over large data sets, Spark can be a replacement for analytic databases like Greenplum or Netezza.

On Sat, Jul 11, 2015 at 8:53 AM, Roman Sokolov ole...@gmail.com wrote:
> Hello. Had the same question. What if I need to store 4-6 TB and do queries? Can't find any clue in the documentation.

On 11.07.2015 03:28, Mohammed Guller moham...@glassbeam.com wrote:
> Hi Ravi,
> First, neither Spark nor Spark SQL is a database. Both are compute engines, which need to be paired with a storage system. Second, they are designed for processing large distributed datasets. If you have only 100,000 records, or even a million records, you don't need Spark. An RDBMS will perform much better for that volume of data.
> Mohammed
>
> From: Ravisankar Mani [mailto:rrav...@gmail.com]
> Sent: Friday, July 10, 2015 3:50 AM
> To: user@spark.apache.org
> Subject: Spark performance
>
> Hi everyone,
> I have planned to move from MS SQL Server to Spark. I am using around 50,000 to 1 lakh (100,000) records. The Spark performance is slow compared to MS SQL Server. Which is the better place (Spark or SQL) to store and retrieve around 50,000 to 100,000 records?
> regards,
> Ravi
Re: spark sql - reading data from sql tables having space in column names
I am having the same problem reading JSON. There does not seem to be a way of selecting a field that has a space in it, such as "Executor Info" from the Spark logs. I suggest that we open a JIRA ticket to address this issue.

On Jun 2, 2015 10:08 AM, ayan guha guha.a...@gmail.com wrote:
> I would think the easiest way would be to create a view in the DB with column names that have no spaces. In fact, you can pass a SQL query in place of a real table. From the documentation: "The JDBC table that should be read. Note that anything that is valid in a `FROM` clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses."
> Kindly let the community know if this works.

On Tue, Jun 2, 2015 at 6:43 PM, Sachin Goyal sachin.go...@jabong.com wrote:
> Hi,
> We are using Spark SQL (1.3.1) to load data from Microsoft SQL Server using JDBC (as described in https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases). It is working fine except when there is a space in column names (we can't modify the schemas to remove the spaces, as it is a legacy database). Sqoop is able to handle such scenarios by enclosing column names in '[ ]', the method recommended by Microsoft SQL Server (https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/manager/SQLServerManager.java, line 319).
> Is there a way to handle this in Spark SQL?
> Thanks,
> sachin

--
Best Regards,
Ayan Guha
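Before a JIRA, it may be worth trying backtick quoting, which Spark SQL accepts for identifiers containing spaces (a sketch — assumes a table registered as logs with a column literally named Executor Info; whether this helps the JDBC push-down case above depends on the Spark version):

```scala
// Backticks quote identifiers with spaces in Spark SQL.
val df = sqlContext.sql("SELECT `Executor Info` FROM logs")
```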
ORCFiles
Does anyone know in which version of Spark there will be support for ORC files via spark.sql.hive? Will it be in 1.4?

David
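For later readers: ORC support did arrive in the DataFrame API with Spark 1.4, via HiveContext. A sketch of the read/write calls (df stands for any existing DataFrame; the path is a placeholder):

```scala
// Write and read ORC with the Spark 1.4+ DataFrame API (requires HiveContext).
df.write.format("orc").save("/tmp/people.orc")
val orcDF = hiveContext.read.format("orc").load("/tmp/people.orc")
```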
Re: Spark Release 1.3.0 DataFrame API
Thank you for your help. toDF() solved my first problem. And the second issue was a non-issue, since the second example worked without any modification.

David

On Sun, Mar 15, 2015 at 1:37 AM, Rishi Yadav ri...@infoobjects.com wrote:
> Programmatically specifying the schema needs import org.apache.spark.sql.types._ for StructType and StructField to resolve.

On Sat, Mar 14, 2015 at 10:07 AM, Sean Owen so...@cloudera.com wrote:
> Yes, I think this was already just fixed by:
> https://github.com/apache/spark/pull/4977
> A .toDF() is missing.

On Sat, Mar 14, 2015 at 4:16 PM, Nick Pentreath nick.pentre...@gmail.com wrote:
> I've found people.toDF gives you a DataFrame (roughly equivalent to the previous Row RDD), and you can then call registerTempTable on that DataFrame. So people.toDF.registerTempTable("people") should work.
> — Sent from Mailbox

On Sat, Mar 14, 2015 at 5:33 PM, David Mitchell jdavidmitch...@gmail.com wrote:
> I am pleased with the release of the DataFrame API. However, I started playing with it, and neither of the two main examples in the documentation work:
> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html
> Specifically:
> - Inferring the Schema Using Reflection
> - Programmatically Specifying the Schema
>
> Scala 2.11.6, Spark 1.3.0 prebuilt for Hadoop 2.4 and later
>
> Inferring the Schema Using Reflection:
>
>     scala> people.registerTempTable("people")
>     <console>:31: error: value registerTempTable is not a member of org.apache.spark.rdd.RDD[Person]
>            people.registerTempTable("people")
>                   ^
>
> Programmatically Specifying the Schema:
>
>     scala> val peopleDataFrame = sqlContext.createDataFrame(people, schema)
>     <console>:41: error: overloaded method value createDataFrame with alternatives:
>       (rdd: org.apache.spark.api.java.JavaRDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame and
>       (rdd: org.apache.spark.rdd.RDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame and
>       (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], columns: java.util.List[String])org.apache.spark.sql.DataFrame and
>       (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame and
>       (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
>      cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.sql.types.StructType)
>            val df = sqlContext.createDataFrame(people, schema)
>
> Any help would be appreciated.
>
> David
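The fix discussed in this thread, gathered in one place (a sketch for Spark 1.3 — assumes a case class Person and people: RDD[Person], as in the programming guide's example):

```scala
// toDF() comes from the implicits; without this import,
// "value registerTempTable is not a member of RDD[Person]" appears.
import sqlContext.implicits._

val peopleDF = people.toDF()
peopleDF.registerTempTable("people")
```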
Spark Release 1.3.0 DataFrame API
I am pleased with the release of the DataFrame API. However, I started playing with it, and neither of the two main examples in the documentation work:
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html

Specifically:
- Inferring the Schema Using Reflection
- Programmatically Specifying the Schema

Scala 2.11.6, Spark 1.3.0 prebuilt for Hadoop 2.4 and later

*Inferring the Schema Using Reflection*

    scala> people.registerTempTable("people")
    <console>:31: error: value registerTempTable is not a member of org.apache.spark.rdd.RDD[Person]
           people.registerTempTable("people")
                  ^

*Programmatically Specifying the Schema*

    scala> val peopleDataFrame = sqlContext.createDataFrame(people, schema)
    <console>:41: error: overloaded method value createDataFrame with alternatives:
      (rdd: org.apache.spark.api.java.JavaRDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame and
      (rdd: org.apache.spark.rdd.RDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame and
      (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], columns: java.util.List[String])org.apache.spark.sql.DataFrame and
      (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame and
      (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
     cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.sql.types.StructType)
           val df = sqlContext.createDataFrame(people, schema)

Any help would be appreciated.

David