Re: Good idea to do multi-threading in spark job?
Thanks for the answer Sean!

On Sun, May 3, 2020 at 10:35 AM Sean Owen wrote:
> Spark will by default assume each task needs 1 CPU. On an executor
> with 16 cores and 16 slots, you'd schedule 16 tasks. If each is using
> 4 cores, then 64 threads are trying to run. If you're CPU-bound, that
> could slow things down. But to the extent some of tasks take some time
> blocking on I/O, it could increase overall utilization. You shouldn't
> have to worry about Spark there, but, you do have to consider that N
> tasks, each with its own concurrency, maybe executing your code in one
> JVM, and whatever synchronization that implies.
>
> On Sun, May 3, 2020 at 11:32 AM Ruijing Li wrote:
> >
> > Hi all,
> >
> > We have a spark job (spark 2.4.4, hadoop 2.7, scala 2.11.12) where we
> > use semaphores / parallel collections within our spark job. We definitely
> > notice a huge speedup in our job from doing this, but were wondering if
> > this could cause any unintended side effects? Particularly I’m worried
> > about any deadlocks and if it could mess with the fixes for issues such as
> > this
> > https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-26961
> >
> > We do run with multiple cores.
> >
> > Thanks!
> > --
> > Cheers,
> > Ruijing Li

--
Cheers,
Ruijing Li
Re: Good idea to do multi-threading in spark job?
Spark will by default assume each task needs 1 CPU. On an executor with 16 cores and 16 slots, you'd schedule 16 tasks. If each task runs 4 threads of its own, then 64 threads are competing for 16 cores. If you're CPU-bound, that could slow things down. But to the extent that some of the tasks spend time blocking on I/O, it could increase overall utilization.

You shouldn't have to worry about Spark itself there, but you do have to consider that N tasks, each with its own concurrency, may be executing your code in one JVM, along with whatever synchronization that implies.

On Sun, May 3, 2020 at 11:32 AM Ruijing Li wrote:
>
> Hi all,
>
> We have a spark job (spark 2.4.4, hadoop 2.7, scala 2.11.12) where we use
> semaphores / parallel collections within our spark job. We definitely notice
> a huge speedup in our job from doing this, but were wondering if this could
> cause any unintended side effects? Particularly I’m worried about any
> deadlocks and if it could mess with the fixes for issues such as this
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-26961
>
> We do run with multiple cores.
>
> Thanks!
> --
> Cheers,
> Ruijing Li
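
To make the point about N tasks sharing one JVM concrete, here is a minimal sketch in plain Scala (no Spark dependency, so it runs on its own). It is not code from this thread. `SharedIoPool`, `OversubscriptionSketch`, `peakThreads`, `processPartition` and the cap of 8 threads are all hypothetical names and values chosen for illustration. The idea is to keep a single bounded pool per executor JVM, so the total thread count stays fixed no matter how many tasks run there, instead of multiplying as in the 16 x 4 = 64 example. If the work is CPU-bound rather than I/O-bound, Spark's `spark.task.cpus` setting (default 1) may be the better tool, since it tells the scheduler to reserve more than one core per task.

```scala
import java.util.concurrent.{ExecutorService, Executors, ThreadFactory}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical helper, not part of Spark: a Scala object is initialized once
// per JVM, so every task on the same executor shares this one bounded pool.
object SharedIoPool {
  val maxThreads: Int = 8 // assumed cap; tune it for your own workload

  private val pool: ExecutorService = Executors.newFixedThreadPool(
    maxThreads,
    new ThreadFactory {
      // Daemon threads, so an idle pool never keeps the JVM alive.
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r)
        t.setDaemon(true)
        t
      }
    }
  )

  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
}

object OversubscriptionSketch {
  // Threads that can be runnable at once if every task spawns its own threads.
  def peakThreads(taskSlots: Int, threadsPerTask: Int): Int =
    taskSlots * threadsPerTask

  // Stand-in for the function you would pass to mapPartitions. `fetch` is
  // assumed to be a blocking, thread-safe call (for example an HTTP lookup).
  def processPartition(rows: Iterator[Int])(fetch: Int => Int): Iterator[Int] = {
    import SharedIoPool.ec
    // toList submits one future per row up front, which assumes the
    // partition fits comfortably in memory.
    val futures = rows.map(r => Future(fetch(r))).toList
    // The calling (task) thread waits here; it is not a pool thread, so
    // blocking on the result cannot starve the pool.
    Await.result(Future.sequence(futures), 1.minute).iterator
  }
}
```

Usage would look like `OversubscriptionSketch.processPartition(Iterator(1, 2, 3))(_ * 2)`, with results coming back in input order. The trade-off is that tasks now contend for the shared pool, so one slow partition can delay others on the same executor.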