Re: Good idea to do multi-threading in spark job?

2020-05-06 Thread Ruijing Li
Thanks for the answer Sean!

-- 
Cheers,
Ruijing Li


Re: Good idea to do multi-threading in spark job?

2020-05-03 Thread Sean Owen
Spark will by default assume each task needs 1 CPU. On an executor
with 16 cores and 16 slots, you'd schedule 16 tasks. If each is using
4 cores, then 64 threads are trying to run. If you're CPU-bound, that
could slow things down. But to the extent some of the tasks spend time
blocking on I/O, it could increase overall utilization. You shouldn't
have to worry about Spark there, but you do have to consider that N
tasks, each with its own concurrency, may be executing your code in one
JVM, and whatever synchronization that implies.
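
For illustration, a rough, untested sketch of that pattern for a
Scala 2.11 / Spark 2.4 job (the object and value names here are made
up): it bounds each task's parallel collection to a fixed pool,
throttles JVM-wide concurrency with a shared semaphore, and tells the
scheduler about the extra threads via spark.task.cpus.

import java.util.concurrent.Semaphore
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool
import org.apache.spark.sql.SparkSession

// JVM-wide throttle: all tasks on an executor share one JVM, so this
// single semaphore caps concurrent calls across every task's threads.
object Throttle {
  val permits = new Semaphore(8)
}

object BoundedConcurrencySketch {
  def main(args: Array[String]): Unit = {
    val threadsPerTask = 4

    val spark = SparkSession.builder()
      .appName("bounded-concurrency-sketch")
      // Tell the scheduler each task occupies 4 slots, so a 16-core
      // executor runs 4 such tasks (~16 threads) instead of 16 (~64).
      .config("spark.task.cpus", threadsPerTask.toString)
      .getOrCreate()

    val doubled = spark.sparkContext
      .parallelize(1 to 1000)
      .mapPartitions { iter =>
        val records = iter.toVector.par
        // Size the parallel collection's pool explicitly; the default
        // pool is sized to all cores on the machine, not to this task.
        records.tasksupport =
          new ForkJoinTaskSupport(new ForkJoinPool(threadsPerTask))
        records.map { x =>
          Throttle.permits.acquire()
          try {
            Thread.sleep(10) // stand-in for a blocking I/O call
            x * 2
          } finally {
            Throttle.permits.release()
          }
        }.seq.iterator
      }
      .collect()

    println(s"processed ${doubled.length} records")
    spark.stop()
  }
}

With spark.task.cpus=4, a 16-core executor schedules only 4 such tasks
at a time, so the total thread count stays near the core count; the
shared semaphore is one way to handle the synchronization across tasks
within one JVM mentioned above.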

On Sun, May 3, 2020 at 11:32 AM Ruijing Li  wrote:
>
> Hi all,
>
> We have a Spark job (Spark 2.4.4, Hadoop 2.7, Scala 2.11.12) where we use
> semaphores / parallel collections. We definitely notice a huge speedup in
> our job from doing this, but we were wondering if it could cause any
> unintended side effects. In particular, I'm worried about deadlocks, and
> whether it could interfere with the fixes for issues such as this:
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-26961
>
> We do run with multiple cores.
>
> Thanks!
> --
> Cheers,
> Ruijing Li
