Re: No. of Task vs No. of Executors
Thanks all! Thanks, Ayan! I did a repartition to 20 so the job used all the cores in
the cluster and finished in 3 minutes. It seems the data was skewed into that one
partition.

On Tue, Jul 14, 2015 at 8:05 PM, ayan guha wrote:

> Hi
>
> As you can see, Spark has taken data locality into consideration and thus
> scheduled all tasks as node-local. Because Spark could run each task on a
> node where its data is present, it went ahead and scheduled the tasks
> there. That is actually good for reading. If you really want to fan out
> the processing, you can do a repartition(n).
>
> Regarding slowness: as you can see, another task completed successfully in
> 6 minutes on executor id 2, so it does not seem that the node itself is
> slow. It is possible the computation on that one node is skewed. You may
> want to switch on speculative execution to see whether the same task
> completes faster on another node. If yes, it is a node issue; otherwise it
> is most likely a data issue.
>
> On Tue, Jul 14, 2015 at 11:43 PM, shahid wrote:
>
>> hi
>>
>> I have a 10-node cluster. I loaded the data onto HDFS, so the number of
>> partitions I get is 9. I am running a Spark application, and it gets
>> stuck on one of the tasks. Looking at the UI, it seems the application
>> is not using all nodes to do calculations. Attached is a screenshot of
>> the tasks; tasks seem to be put on each node more than once. Looking at
>> the tasks, 8 of them complete in under 7-8 minutes, while one task takes
>> around 30 minutes, causing the delay in results.
>>
>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n23824/Screen_Shot_2015-07-13_at_9.png>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/No-of-Task-vs-No-of-Executors-tp23824.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>
> --
> Best Regards,
> Ayan Guha

--
with Regards
Shahid Ashraf
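The effect of the repartition can be sketched without a cluster. The snippet below is plain Python, not Spark, with made-up keys and counts: it shows why a key-based hash shuffle piles a hot key into one partition, while the round-robin redistribution that repartition(n) performs spreads the same records evenly.

```python
from collections import Counter

# Made-up keys, heavily skewed toward one value, standing in for the
# real data (which we have not seen).
records = ["A"] * 900 + ["B"] * 50 + ["C"] * 50

# A key-based (hash) shuffle sends every "A" record to the same
# partition, so one task carries most of the work.
hash_parts = Counter(hash(k) % 9 for k in records)

# repartition(n) redistributes records round-robin, ignoring keys, so
# the same 1000 records spread evenly across 20 partitions.
rr_parts = Counter(i % 20 for i, _ in enumerate(records))

print(max(hash_parts.values()))  # >= 900: one partition holds the hot key
print(max(rr_parts.values()))    # 50: perfectly balanced
```

In PySpark this corresponds to calling `rdd.repartition(20)` before the expensive stage. Note that repartition itself triggers a full shuffle, so it only pays off when the downstream computation dominates, as it did here.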
Re: No. of Task vs No. of Executors
This is likely due to data skew. If you are using key-value pairs, one key has a
lot more records than the other keys. Do you have any groupBy operations?

David

On Tue, Jul 14, 2015 at 9:43 AM, shahid wrote:

> hi
>
> I have a 10-node cluster. I loaded the data onto HDFS, so the number of
> partitions I get is 9. I am running a Spark application, and it gets stuck
> on one of the tasks. Looking at the UI, it seems the application is not
> using all nodes to do calculations. Attached is a screenshot of the tasks;
> tasks seem to be put on each node more than once. Looking at the tasks, 8
> of them complete in under 7-8 minutes, while one task takes around 30
> minutes, causing the delay in results.
>
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n23824/Screen_Shot_2015-07-13_at_9.png>

--
### Confidential e-mail, for recipient's (or recipients') eyes only, not for distribution. ###
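One quick way to confirm this diagnosis is to count records per key before the groupBy, e.g. with `rdd.countByKey()` on a pair RDD (ideally after a `sample()` on large data). The sketch below uses plain Python with invented keys to show the check itself:

```python
from collections import Counter

# Invented sample of keys; in practice these would come from something
# like rdd.keys().take(100000) or a sample() of the data.
sample_keys = ["user_42"] * 800 + ["user_7"] * 100 + ["user_9"] * 100

counts = Counter(sample_keys)
top_key, top_count = counts.most_common(1)[0]
mean_count = len(sample_keys) / len(counts)

# If one key's count is a large multiple of the mean, a groupBy on that
# key will funnel most records through a single task.
print(top_key, top_count / mean_count)  # hot key is 2.4x the mean here
```

A ratio well above 1 means the partition holding the hot key will dominate the stage's runtime, which matches the one 30-minute task in the UI.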
Re: No. of Task vs No. of Executors
You could even try changing the block size of the input data on HDFS (it can be
set on a per-file basis); that would get all the workers going right from the
start in Spark.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-of-Task-vs-No-of-Executors-tp23824p23896.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
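One way to do this (paths here are hypothetical) is to re-upload the file with a smaller `dfs.blocksize`: more HDFS blocks mean more input splits, and Spark creates one input task per split. Not tested against this cluster; a sketch only.

```shell
# Hypothetical paths; dfs.blocksize is in bytes (64 MB here). The -D
# generic option applies only to this upload, not cluster-wide.
hdfs dfs -D dfs.blocksize=67108864 -put data.csv /user/shahid/data.csv

# Verify the block size and block count that actually applied:
hdfs fsck /user/shahid/data.csv -files -blocks
```

With a 10-node cluster, aiming for at least as many blocks as total cores gives every worker a first-wave task.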
Re: No. of Task vs No. of Executors
Hi

As you can see, Spark has taken data locality into consideration and thus
scheduled all tasks as node-local. Because Spark could run each task on a node
where its data is present, it went ahead and scheduled the tasks there. That is
actually good for reading. If you really want to fan out the processing, you can
do a repartition(n).

Regarding slowness: as you can see, another task completed successfully in 6
minutes on executor id 2, so it does not seem that the node itself is slow. It
is possible the computation on that one node is skewed. You may want to switch
on speculative execution to see whether the same task completes faster on
another node. If yes, it is a node issue; otherwise it is most likely a data
issue.

On Tue, Jul 14, 2015 at 11:43 PM, shahid wrote:

> hi
>
> I have a 10-node cluster. I loaded the data onto HDFS, so the number of
> partitions I get is 9. I am running a Spark application, and it gets stuck
> on one of the tasks. Looking at the UI, it seems the application is not
> using all nodes to do calculations. Attached is a screenshot of the tasks;
> tasks seem to be put on each node more than once. Looking at the tasks, 8
> of them complete in under 7-8 minutes, while one task takes around 30
> minutes, causing the delay in results.
>
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n23824/Screen_Shot_2015-07-13_at_9.png>

--
Best Regards,
Ayan Guha
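Speculative execution can be switched on per job; a sketch (the application name is hypothetical, and the multiplier/quantile values shown are simply the Spark 1.x defaults, not a recommendation):

```shell
# Spark re-launches tasks that run much slower than the median of their
# stage; if the speculative copy on another node wins, suspect the node,
# otherwise suspect the data.
spark-submit \
  --conf spark.speculation=true \
  --conf spark.speculation.multiplier=1.5 \
  --conf spark.speculation.quantile=0.75 \
  my_app.py
```

The quantile controls what fraction of a stage's tasks must finish before speculation starts, and the multiplier sets how many times slower than the median a task must be to be re-launched.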