Re: Good idea to do multi-threading in spark job?
Thanks for the answer Sean!

On Sun, May 3, 2020 at 10:35 AM Sean Owen wrote:
> Spark will by default assume each task needs 1 CPU. On an executor
> with 16 cores and 16 slots, you'd schedule 16 tasks. If each is using
> 4 cores, then 64 threads are trying to run. If you're CPU-bound, that
> could slow things down. But to the extent some of the tasks spend time
> blocking on I/O, it could increase overall utilization. You shouldn't
> have to worry about Spark there, but you do have to consider that N
> tasks, each with its own concurrency, may be executing your code in one
> JVM, and whatever synchronization that implies.
>
> On Sun, May 3, 2020 at 11:32 AM Ruijing Li wrote:
> > Hi all,
> >
> > We have a Spark job (Spark 2.4.4, Hadoop 2.7, Scala 2.11.12) where we
> > use semaphores / parallel collections within our Spark job. We definitely
> > notice a huge speedup in our job from doing this, but we were wondering
> > if this could cause any unintended side effects. In particular, I'm
> > worried about deadlocks, and whether it could interfere with the fixes
> > for issues such as this:
> > https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-26961
> >
> > We do run with multiple cores.
> >
> > Thanks!
> > --
> > Cheers,
> > Ruijing Li

--
Cheers,
Ruijing Li
Re: Good idea to do multi-threading in spark job?
Spark will by default assume each task needs 1 CPU. On an executor with 16 cores and 16 slots, you'd schedule 16 tasks. If each is using 4 cores, then 64 threads are trying to run. If you're CPU-bound, that could slow things down. But to the extent some of the tasks spend time blocking on I/O, it could increase overall utilization. You shouldn't have to worry about Spark there, but you do have to consider that N tasks, each with its own concurrency, may be executing your code in one JVM, and whatever synchronization that implies.

On Sun, May 3, 2020 at 11:32 AM Ruijing Li wrote:
> Hi all,
>
> We have a Spark job (Spark 2.4.4, Hadoop 2.7, Scala 2.11.12) where we
> use semaphores / parallel collections within our Spark job. We definitely
> notice a huge speedup in our job from doing this, but we were wondering
> if this could cause any unintended side effects. In particular, I'm
> worried about deadlocks, and whether it could interfere with the fixes
> for issues such as this:
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-26961
>
> We do run with multiple cores.
>
> Thanks!
> --
> Cheers,
> Ruijing Li
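To illustrate the "N tasks in one JVM" point: because all tasks on an executor share one JVM, a Scala `object` is a per-executor singleton, so a semaphore held in one gives a cap per executor, not per task. A minimal sketch, where the permit count of 8 and the `fetch` call in the comment are assumptions for illustration:

```scala
import java.util.concurrent.Semaphore

// Sketch: cap total concurrent I/O calls across ALL tasks running in
// one executor JVM. A Scala object is a JVM-level singleton, so every
// task on this executor shares the same Semaphore instance.
object IoLimiter {
  private val permits = new Semaphore(8) // assumed per-executor cap

  def withPermit[A](body: => A): A = {
    permits.acquire()
    try body
    finally permits.release() // always release, even if body throws
  }
}

// Usage inside a task might look like (fetch is hypothetical):
//   rdd.mapPartitions { iter =>
//     iter.map(x => IoLimiter.withPermit(fetch(x)))
//   }
```

Note the flip side Sean alludes to: if the work inside `withPermit` can itself block waiting on another shared lock, a JVM-wide semaphore like this is exactly where a deadlock could arise, so keep the critical section small.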
Good idea to do multi-threading in spark job?
Hi all,

We have a Spark job (Spark 2.4.4, Hadoop 2.7, Scala 2.11.12) where we use semaphores / parallel collections within our Spark job. We definitely notice a huge speedup in our job from doing this, but we were wondering if this could cause any unintended side effects. In particular, I'm worried about deadlocks, and whether it could interfere with the fixes for issues such as this: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-26961

We do run with multiple cores.

Thanks!
--
Cheers,
Ruijing Li
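For reference, a minimal sketch of the parallel-collection pattern described above, assuming Scala 2.11 (where `ForkJoinTaskSupport` takes a `scala.concurrent.forkjoin.ForkJoinPool`). Giving the collection its own bounded pool, rather than the default JVM-wide pool, keeps concurrent tasks in one executor from all piling onto the same shared pool; the pool width of 4 here is an assumption:

```scala
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// Sketch: run per-element work on a parallel collection with an
// explicitly bounded pool instead of the shared default pool.
val items = Vector(1, 2, 3, 4, 5, 6, 7, 8).par
items.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4)) // assumed width

// map runs elements concurrently on the 4-thread pool; result order
// is still preserved by parallel collections.
val doubled = items.map(_ * 2).seq
```

Inside a Spark task the same pattern would typically wrap the I/O-bound per-record work, which is where the speedup in the question likely comes from.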