Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-12 Thread Srinath C
Hi Aakash, Glad to know that repartition helped! The overall number of tasks actually depends on the kind of operations you are performing and also on how the DF is partitioned. I can't comment on the former but can provide some pointers on the latter. The default value of spark.sql.shuffle.partitions is 200.
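A minimal sketch (hypothetical app name and partition count) of overriding that default when building a SparkSession:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partition-tuning-demo") // hypothetical name
      // lower the post-shuffle partition count from the default of 200,
      // e.g. to a small multiple of the total cores in the cluster
      .config("spark.sql.shuffle.partitions", "24")
      .getOrCreate()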

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-12 Thread Aakash Basu
Hi Srinath, Thanks for such an elaborate reply. How do I reduce the overall number of tasks? I found that simply repartitioning the CSV file into 8 parts and converting it to Parquet with Snappy compression helped not only in distributing the tasks evenly across all nodes, but also in
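A minimal sketch of that repartition-and-convert step, assuming spark-shell's predefined `spark` session and made-up paths:

    // read the CSV, repartition into 8 parts, write Parquet with Snappy
    val df = spark.read.option("header", "true").csv("/data/input.csv") // hypothetical path
    df.repartition(8)
      .write
      .option("compression", "snappy") // Snappy is also Parquet's default codec
      .parquet("/data/input_parquet")  // hypothetical path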

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-12 Thread Srinath C
Hi Aakash, Can you check the logs for Executor ID 0? It was restarted on worker 192.168.49.39, perhaps due to an OOM or something similar. I also observed that the number of tasks is high and unevenly distributed across the workers. Check if there are too many partitions in the RDD and tune it using
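A minimal sketch of inspecting and reducing the partition count, assuming a DataFrame `df` and an arbitrary target of 24 partitions:

    val numParts = df.rdd.getNumPartitions // hypothetical DataFrame `df`
    println(s"current partitions: $numParts")
    // coalesce reduces the partition count without a full shuffle;
    // use repartition(n) instead if the data also needs rebalancing
    val tuned = if (numParts > 24) df.coalesce(24) else df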

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-12 Thread Aakash Basu
Yes, but when I did increase my executor memory, the Spark job halts after running a few steps, even though the executor isn't dying. Data - 60,000 data points, 230 columns (60 MB of data). Any input on why it behaves like that? On Tue, Jun 12, 2018 at 8:15 AM, Vamshi Talla wrote: >

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Vamshi Talla
Aakash, Like Jörn suggested, did you increase your test data set? If so, did you also update your executor-memory setting? It seems like you might be exceeding the executor memory threshold. Thanks Vamshi Talla Sent from my iPhone On Jun 11, 2018, at 8:54 AM, Aakash Basu

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Aakash Basu
Hi Jörn/Others, Thanks for your help. Now the data is being distributed properly, but the challenge is that after a certain point I get this error, after which everything stops moving ahead - 2018-06-11 18:14:56 ERROR TaskSchedulerImpl:70 - Lost executor 0 on 192.168.49.39: Remote RPC

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Jörn Franke
If it is in kB then Spark will always schedule it to one node. As soon as it gets bigger you will see more nodes being used. Hence, increase your testing dataset. > On 11. Jun 2018, at 12:22, Aakash Basu wrote: > > Jörn - The code is a series of feature engineering and model tuning >

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread akshay naidu
try --num-executors 3 --executor-cores 4 --executor-memory 2G --conf spark.scheduler.mode=FAIR On Mon, Jun 11, 2018 at 2:43 PM, Aakash Basu wrote: > Hi, > > I have submitted a job on a *4 node cluster*, where I see most of the > operations happening at one of the worker nodes and the other two are
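For jobs that build their own session, a sketch of the same settings expressed as Spark confs (--num-executors maps to spark.executor.instances on YARN); the app name is hypothetical and the values are just the ones suggested above:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("balanced-load-demo") // hypothetical name
      .config("spark.executor.instances", "3") // --num-executors 3
      .config("spark.executor.cores", "4")     // --executor-cores 4
      .config("spark.executor.memory", "2g")   // --executor-memory 2G
      .config("spark.scheduler.mode", "FAIR")
      .getOrCreate()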

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Jörn Franke
What is your code? Maybe it does an operation which is bound to a single host, or your data volume is too small for multiple hosts. > On 11. Jun 2018, at 11:13, Aakash Basu wrote: > > Hi, > > I have submitted a job on a 4 node cluster, where I see most of the operations > happening at

[Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Aakash Basu
Hi, I have submitted a job on a *4 node cluster*, where I see most of the operations happening at one of the worker nodes while the other two are simply chilling out. The picture below sheds light on that - How do I properly distribute the load? My cluster conf (4 node cluster [1 driver; 3 slaves]) -

Re: Spark Optimization

2018-04-26 Thread CPC
I would recommend UseParallelGC since this is a batch job. Parallelism should be 2-3x the number of cores. Also, if those are physical machines, I would recommend a network MTU of 9000. Is it 128 GB per node or 64 GB per node? On Thu, Apr 26, 2018, 7:40 PM vincent gromakowski < vincent.gromakow...@gmail.com>
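A minimal sketch of wiring those suggestions into Spark confs; the app name and the 3-node x 16-core sizing are assumptions, not from the thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("batch-gc-demo") // hypothetical name
      // JVM flag suggested above, applied to each executor at launch
      .config("spark.executor.extraJavaOptions", "-XX:+UseParallelGC")
      // 2-3x of total cores per the rule of thumb (here: 2 x 3 nodes x 16 cores)
      .config("spark.default.parallelism", (2 * 3 * 16).toString)
      .getOrCreate()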

Spark Optimization

2018-04-26 Thread Pallavi Singh
Hi Team, We are currently working on a POC based on Spark and Scala. We have to read 18 million records from a Parquet file and perform 25 user-defined aggregations based on grouping keys. We have used Spark's high-level DataFrame API for the aggregation. On a cluster of two nodes we could finish end
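A minimal sketch of such a grouped aggregation with the DataFrame API; the path, grouping keys, and measures are hypothetical, and `spark` is the usual session (as predefined in spark-shell):

    import org.apache.spark.sql.functions._

    val records = spark.read.parquet("/data/records.parquet") // hypothetical path
    // grouped aggregation over made-up keys and measures
    val result = records
      .groupBy("key1", "key2")
      .agg(
        sum("amount").as("total_amount"),
        avg("amount").as("avg_amount"),
        count(lit(1)).as("row_count")
      )
    result.write.parquet("/data/aggregates.parquet") // hypothetical path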

spark optimization

2016-11-07 Thread maitraythaker
Why are those two stages in Apache Spark computing the same thing? <http://stackoverflow.com/questions/40192302/why-those-two-stages-in-apache-spark-are-computing-same-thing>

Fwd: Spark optimization problem

2016-10-22 Thread Maitray Thaker
Hi, I have a query regarding Spark stage optimization. I have asked the question in more detail on Stack Overflow; please find the following link: http://stackoverflow.com/questions/40192302/why-is-that-two-stages-in-apache-spark-are-computing-same-thing

SparkOscope: Enabling Spark Optimization through Cross-stack Monitoring and Visualization

2016-02-03 Thread Yiannis Gkoufas
that shows those capabilities, which you can find here: https://ibm.app.box.com/s/vyaedlyb444a4zna1215c7puhxliqxdg There is a blog post which gives more details on the functionality here: www.spark.tc/sparkoscope-enabling-spark-optimization-through-cross-stack-monitoring-and-visualization-2

Re: Does Spark optimization might miss to run transformation?

2015-08-13 Thread Michael Armbrust
-dev If you want to guarantee that the side effects happen you should use foreach or foreachPartition. A `take`, for example, might only evaluate a subset of the partitions until it finds enough results. On Wed, Aug 12, 2015 at 7:06 AM, Eugene Morozov fathers...@list.ru wrote: Hi! I’d like to
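A minimal sketch of the difference, runnable in spark-shell (where the `spark` session is predefined); the printlns stand in for any side effect:

    val rdd = spark.sparkContext.parallelize(1 to 1000, 8)

    // NOT guaranteed to run for every element: take(5) may evaluate
    // only as many partitions as it needs to collect 5 results
    rdd.map { x => println(s"side effect for $x"); x }.take(5)

    // Guaranteed: foreach is an action that visits every element
    rdd.foreach(x => println(s"side effect for $x"))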

Does Spark optimization might miss to run transformation?

2015-08-12 Thread Eugene Morozov
Hi! I’d like to perform an action (store / print something) inside a transformation (map or mapPartitions). This approach has some flaws, but here is the question: might it happen that Spark will optimise the (RDD or DataFrame) processing so that my mapPartitions simply won’t happen? -- Eugene Morozov

Re: Spark optimization

2014-10-27 Thread Akhil Das

Spark optimization

2014-10-26 Thread Morbious
to sp...@spark-s4.test.org:34546 Best regards, Morbious