Re: How to use spark-on-k8s pod template?

2019-11-08 Thread David Mitchell
Are you using Spark 2.3 or above?

See the documentation:
https://spark.apache.org/docs/latest/running-on-kubernetes.html

It looks like you do not need:
--conf spark.kubernetes.driver.podTemplateFile='/spark-pod-template.yaml' \
--conf spark.kubernetes.executor.podTemplateFile='/spark-pod-template.yaml' \

Are your service account and namespace properly set up?

Cluster mode:

$ bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
local:///path/to/examples.jar


On Tue, Nov 5, 2019 at 6:37 AM sora  wrote:

> Hi all,
> I am looking for guidance on using the spark-on-k8s pod template.
> I want to set some toleration rules for the driver and executor pods.
> I tried setting --conf
> spark.kubernetes.driver.podTemplateFile=/spark-pod-template.yaml, but it
> didn't work.
> The driver pod started without the toleration rules and stays pending
> because no node is available.
> Could anyone please show me how to use it correctly?
>
> The template file is below.
>
> apiVersion: extensions/v1beta1
> kind: Pod
> spec:
>   template:
>     spec:
>       tolerations:
>         - effect: NoSchedule
>           key: project
>           operator: Equal
>           value: name
>
>
> My full command is below.
>
> /opt/spark/bin/spark-submit --master 
> k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT \
> --conf spark.kubernetes.driver.podTemplateFile='/spark-pod-template.yaml' \
> --conf spark.kubernetes.executor.podTemplateFile='/spark-pod-template.yaml' \
> --conf spark.scheduler.mode=FAIR \
> --conf spark.driver.memory=2g \
> --conf spark.driver.cores=1 \
> --conf spark.executor.cores=1 \
> --conf spark.executor.memory=1g \
> --conf spark.executor.instances=4 \
> --conf spark.kubernetes.container.image=job-image \
> --conf spark.kubernetes.namespace=nc \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=sa \
> --conf spark.kubernetes.report.interval=5 \
> --conf spark.kubernetes.submission.waitAppCompletion=false \
> --deploy-mode cluster \
> --name job-name \
> --class job.class job.jar job-args



Re: What benefits do we really get out of colocation?

2016-12-03 Thread David Mitchell
To get a node local read from Spark to Cassandra, one has to use a read
consistency level of LOCAL_ONE.  For some use cases, this is not an
option.  For example, if you need to use a read consistency level
of LOCAL_QUORUM, as many use cases demand, then one is not going to get a
node local read.

Also, to ensure a node-local read, one has to set spark.locality.wait to
zero.  Whether a partition is streamed to another node or computed locally
depends on the spark.locality.wait parameter; setting it to 0 forces all
partitions to be computed only on local nodes.
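
As a rough sketch of where those two settings live (the app, keyspace and
table names below are placeholders, and spark.cassandra.input.consistency.level
is the DataStax spark-cassandra-connector's read consistency property):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: placeholder names, values taken from the advice above.
val conf = new SparkConf()
  .setAppName("local-read-sketch")
  .set("spark.locality.wait", "0")                              // never wait for a non-local slot
  .set("spark.cassandra.input.consistency.level", "LOCAL_ONE")  // needed for node-local reads

val sc = new SparkContext(conf)
val rows = sc.cassandraTable("my_keyspace", "my_table")         // reads should now stay node-local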

If you do some testing, please post your performance numbers.


Re: How to avoid Spark shuffle spill memory?

2015-10-06 Thread David Mitchell
Hi unk1102,

Try adding more memory to your nodes.  Are you running Spark in the cloud?
If so, increase the memory on your servers.
Do you have default parallelism set (spark.default.parallelism)?  If so,
unset it, and let Spark decide how many partitions to allocate.
You can also try refactoring your code to make it use less memory.
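
As a rough sketch of where those knobs live in Spark 1.x (the values below
are placeholders to illustrate the settings, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: illustrative values, tune for your own cluster.
val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  .set("spark.executor.memory", "8g")            // more memory per executor
  .set("spark.shuffle.memoryFraction", "0.4")    // Spark 1.x: fraction of heap for shuffle
  .set("spark.storage.memoryFraction", "0.3")    // Spark 1.x: fraction of heap for cached blocks
// spark.default.parallelism is deliberately left unset so Spark picks partition counts.
val sc = new SparkContext(conf)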

David

On Tue, Oct 6, 2015 at 3:19 PM, unk1102  wrote:

> Hi, I have a Spark job which runs for around 4 hours; it shares a
> SparkContext and runs many child jobs. When I look at each job in the UI, I
> see a shuffle spill of around 30 to 40 GB, and because of that executors
> often get lost for using physical memory beyond limits. How do I avoid
> shuffle spill? I have tried almost all optimisations and nothing is
> helping. I don't cache anything. I am using Spark 1.4.1, with Tungsten,
> codegen, etc. I am using spark.shuffle.storage as 0.2 and
> spark.storage.memory as 0.2. I tried to increase shuffle memory to 0.6, but
> then it halts in GC pauses, causing my executor to time out and eventually
> get lost.
>




Re: submit_spark_job_to_YARN

2015-08-30 Thread David Mitchell
Hi Ajay,

Are you trying to save to your local file system or to HDFS?

// This would save to HDFS under /user/hadoop/counter
counter.saveAsTextFile("/user/hadoop/counter")
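
A short sketch of the contrast, with placeholder paths; an explicit URI
scheme removes any dependence on the cluster's fs.defaultFS setting:

counter.saveAsTextFile("hdfs:///user/hadoop/counter")  // HDFS, via the cluster's default namenode
counter.saveAsTextFile("file:///tmp/counter")          // local file system of each worker node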

David


On Sun, Aug 30, 2015 at 11:21 AM, Ajay Chander itsche...@gmail.com wrote:

 Hi Everyone,

 Recently we installed Spark on YARN in a Hortonworks cluster. Now I am
 trying to run a wordcount program in my Eclipse, and with setMaster(local)
 I see the results as expected. Now I want to submit the same job to my YARN
 cluster from Eclipse. In Storm I was basically doing the same thing by
 using the StormSubmitter class and by passing the nimbus and zookeeper
 hosts to the Config object. I was looking for something exactly like that.

 When I went through the documentation online, it read that I am supposed to
 export HADOOP_HOME_DIR=path to the conf dir. So I copied the conf folder
 from one of Spark's gateway nodes to my local Unix box and exported that
 dir:

 export HADOOP_HOME_DIR=/Users/user1/Documents/conf/

 I did the same in .bash_profile too. Now when I do echo $HADOOP_HOME_DIR, I
 see the path printed in the command prompt. My assumption is that when I
 change setMaster(local) to setMaster(yarn-client), my program should pick
 up the resource manager, i.e. the YARN cluster info, from the directory I
 exported, and the job should get submitted to the resource manager from my
 Eclipse. But somehow it's not happening. Please tell me if my assumption is
 wrong or if I am missing anything here.

 I have attached the word count program that I was using. Any help is
 highly appreciated.

 Thank you,
 Ajay









Re: No. of Task vs No. of Executors

2015-07-18 Thread David Mitchell
This is likely due to data skew.  If you are using key-value pairs, one key
has a lot more records than the other keys.  Do you have any groupBy
operations?
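
One quick way to confirm skew is to count records per key before the
groupBy; a minimal sketch, where pairs stands in for your key-value RDD:

// Sketch only: "pairs" is a stand-in for the (key, value) RDD feeding the groupBy.
val pairs: org.apache.spark.rdd.RDD[(String, String)] = ???  // replace with your real RDD

val keyCounts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)
keyCounts.sortBy(_._2, ascending = false)
  .take(10)                                                  // the heaviest keys
  .foreach(println)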

David


On Tue, Jul 14, 2015 at 9:43 AM, shahid sha...@trialx.com wrote:

 Hi,

 I have a 10 node cluster, and I loaded the data onto HDFS, so the number of
 partitions I get is 9. I am running a Spark application, and it gets stuck
 on one of the tasks; looking at the UI, it seems the application is not
 using all nodes to do calculations. Attached is a screenshot of the tasks;
 it seems tasks are put on each node more than once. 8 tasks complete in
 under 7-8 minutes, and one task takes around 30 minutes, causing the delay
 in the results.

 http://apache-spark-user-list.1001560.n3.nabble.com/file/n23824/Screen_Shot_2015-07-13_at_9.png









Re: Spark performance

2015-07-11 Thread David Mitchell
You can certainly query over 4 TB of data with Spark.  However, you will
get an answer in minutes or hours, not in milliseconds or seconds.  OLTP
databases are used for web applications, and typically return responses in
milliseconds.  Analytic databases tend to operate on large data sets, and
return responses in seconds, minutes or hours.  When running batch jobs
over large data sets, Spark can be a replacement for analytic databases
like Greenplum or Netezza.



On Sat, Jul 11, 2015 at 8:53 AM, Roman Sokolov ole...@gmail.com wrote:

 Hello. I had the same question. What if I need to store 4-6 TB and run
 queries? I can't find any clue in the documentation.
 On 11.07.2015 at 03:28, Mohammed Guller moham...@glassbeam.com wrote:

  Hi Ravi,

  First, neither Spark nor Spark SQL is a database. Both are compute
  engines, which need to be paired with a storage system. Second, they are
  designed for processing large distributed datasets. If you have only
  100,000 records or even a million records, you don't need Spark. An RDBMS
  will perform much better for that volume of data.



 Mohammed



 *From:* Ravisankar Mani [mailto:rrav...@gmail.com]
 *Sent:* Friday, July 10, 2015 3:50 AM
 *To:* user@spark.apache.org
 *Subject:* Spark performance



 Hi everyone,

  I have planned to move from MS SQL Server to Spark. I am using around
  50,000 to 1 lakh (100,000) records.

  The Spark performance is slow when compared to MS SQL Server.

  Which is the better database (Spark or SQL Server) for storing and
  retrieving around 50,000 to 100,000 records?

 regards,

 Ravi






-- 
### Confidential e-mail, for recipient's (or recipients') eyes only, not
for distribution. ###


Re: spark sql - reading data from sql tables having space in column names

2015-06-02 Thread David Mitchell
I am having the same problem reading JSON.  There does not seem to be a way
of selecting a field that has a space in its name, such as "Executor Info"
from the Spark logs.

I suggest that we open a JIRA ticket to address this issue.
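
One workaround worth trying in the meantime (a sketch only, assuming a
DataFrame df that really has a column named Executor Info): select the
column through the DataFrame API, which accepts the space, and rename it so
nothing downstream has to deal with it. Backtick-quoting the name in a SQL
string is another option, at least with HiveContext.

// Sketch only: df is assumed to have a column literally named "Executor Info".
val cleaned = df.select(df("Executor Info").as("executor_info"))
cleaned.registerTempTable("spark_logs")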
 On Jun 2, 2015 10:08 AM, ayan guha guha.a...@gmail.com wrote:

 I would think the easiest way would be to create a view in DB with column
 names with no space.

 In fact, you can pass a sql in place of a real table.

  From the documentation: "The JDBC table that should be read. Note that
  anything that is valid in a `FROM` clause of a SQL query can be used. For
  example, instead of a full table you could also use a subquery in
  parentheses."

 Kindly let the community know if this works
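
 A minimal sketch of that subquery-as-table approach against the Spark 1.3
 JDBC source (the URL, table and column names here are made up; the square
 brackets are SQL Server's own quoting):

 // Sketch only: hypothetical connection details and column names.
 // The bracketed column is renamed inside the subquery, so Spark never sees the space.
 val df = sqlContext.load("jdbc", Map(
   "url"     -> "jdbc:sqlserver://dbhost:1433;databaseName=legacy",
   "dbtable" -> "(SELECT [First Name] AS first_name FROM dbo.Customers) AS t"))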

 On Tue, Jun 2, 2015 at 6:43 PM, Sachin Goyal sachin.go...@jabong.com
 wrote:

 Hi,

 We are using spark sql (1.3.1) to load data from Microsoft sql server
 using jdbc (as described in
 https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
 ).

 It is working fine except when there is a space in column names (we can't
 modify the schemas to remove space as it is a legacy database).

 Sqoop is able to handle such scenarios by enclosing column names in '[ ]'
 - the method recommended by Microsoft SQL Server. (
 https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/manager/SQLServerManager.java
 - line no 319)

 Is there a way to handle this in spark sql?

 Thanks,
 sachin




 --
 Best Regards,
 Ayan Guha



ORCFiles

2015-04-24 Thread David Mitchell
Does anyone know in which version of Spark there will be support for ORC
files via spark.sql.hive?  Will it be in 1.4?

David


Re: Spark Release 1.3.0 DataFrame API

2015-03-15 Thread David Mitchell
Thank you for your help.  toDF() solved my first problem.  And the second
issue was a non-issue, since the second example worked without any
modification.
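
For anyone hitting the same error, a minimal sketch of the working form
under Spark 1.3 (the Person case class and people.txt path follow the
programming guide's example):

import sqlContext.implicits._            // brings toDF() into scope for RDDs of case classes

case class Person(name: String, age: Int)

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()                                // RDD[Person] -> DataFrame

people.registerTempTable("people")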

David


On Sun, Mar 15, 2015 at 1:37 AM, Rishi Yadav ri...@infoobjects.com wrote:

 Programmatically specifying the schema needs

  import org.apache.spark.sql.types._

 for StructType and StructField to resolve.

 On Sat, Mar 14, 2015 at 10:07 AM, Sean Owen so...@cloudera.com wrote:

 Yes I think this was already just fixed by:

 https://github.com/apache/spark/pull/4977

 a .toDF() is missing

 On Sat, Mar 14, 2015 at 4:16 PM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
  I've found people.toDF gives you a DataFrame (roughly equivalent to the
  previous Row RDD),

  And you can then call registerTempTable on that DataFrame.

  So people.toDF.registerTempTable("people") should work
 
 
 
  —
  Sent from Mailbox
 
 
  On Sat, Mar 14, 2015 at 5:33 PM, David Mitchell 
 jdavidmitch...@gmail.com
  wrote:
 
 
  I am pleased with the release of the DataFrame API.  However, I started
  playing with it, and neither of the two main examples in the
 documentation
  work: http://spark.apache.org/docs/1.3.0/sql-programming-guide.html
 
  Specifically:
 
  Inferring the Schema Using Reflection
  Programmatically Specifying the Schema
 
 
  Scala 2.11.6
  Spark 1.3.0 prebuilt for Hadoop 2.4 and later
 
  Inferring the Schema Using Reflection
  scala> people.registerTempTable("people")
  <console>:31: error: value registerTempTable is not a member of
  org.apache.spark.rdd.RDD[Person]
         people.registerTempTable("people")
                ^

  Programmatically Specifying the Schema
  scala> val peopleDataFrame = sqlContext.createDataFrame(people, schema)
  <console>:41: error: overloaded method value createDataFrame with
  alternatives:
    (rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
    (rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
    (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],columns: java.util.List[String])org.apache.spark.sql.DataFrame <and>
    (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
    (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
   cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.sql.types.StructType)
     val df = sqlContext.createDataFrame(people, schema)
 
  Any help would be appreciated.
 
  David
 
 








Spark Release 1.3.0 DataFrame API

2015-03-14 Thread David Mitchell
I am pleased with the release of the DataFrame API.  However, I started
playing with it, and neither of the two main examples in the documentation
work: http://spark.apache.org/docs/1.3.0/sql-programming-guide.html

Specifically:

   - Inferring the Schema Using Reflection
   - Programmatically Specifying the Schema


Scala 2.11.6
Spark 1.3.0 prebuilt for Hadoop 2.4 and later

*Inferring the Schema Using Reflection*
scala> people.registerTempTable("people")
<console>:31: error: value registerTempTable is not a member of
org.apache.spark.rdd.RDD[Person]
       people.registerTempTable("people")
              ^

*Programmatically Specifying the Schema*
scala> val peopleDataFrame = sqlContext.createDataFrame(people, schema)
<console>:41: error: overloaded method value createDataFrame with
alternatives:
  (rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
  (rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
  (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],columns: java.util.List[String])org.apache.spark.sql.DataFrame <and>
  (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
  (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.sql.types.StructType)
       val df = sqlContext.createDataFrame(people, schema)

Any help would be appreciated.

David
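
For completeness, a sketch of the programmatic-schema path as the 1.3 guide
intends it: createDataFrame wants an RDD[Row] plus a StructType, so the
RDD[String] has to be mapped to Rows first (people is the guide's RDD of
comma-separated lines):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Build the schema: every field as a nullable string.
val schema = StructType(
  "name age".split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// createDataFrame needs an RDD[Row], not an RDD[String].
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")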