Re: Good idea to do multi-threading in spark job?

2020-05-06 Thread Ruijing Li
Thanks for the answer, Sean!

-- 
Cheers,
Ruijing Li


Re: Good idea to do multi-threading in spark job?

2020-05-03 Thread Sean Owen
Spark will by default assume each task needs 1 CPU. On an executor
with 16 cores and 16 slots, you'd schedule 16 tasks. If each is using
4 cores, then 64 threads are trying to run. If you're CPU-bound, that
could slow things down. But to the extent some of the tasks spend time
blocked on I/O, it could increase overall utilization. You shouldn't
have to worry about Spark there, but you do have to consider that N
tasks, each with its own concurrency, may be executing your code in one
JVM, and whatever synchronization that implies.
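
For example, spark.task.cpus is the setting that tells the scheduler a
task will occupy more than one core. A minimal sketch, assuming each
task fans out to about 4 threads (the app name and the values here are
illustrative, not from your job):

  import org.apache.spark.sql.SparkSession

  // Sketch only: reserving 4 CPUs per task keeps the executor from
  // oversubscribing. With spark.executor.cores=16 and spark.task.cpus=4,
  // at most 4 tasks (hence ~16 of your threads) run per executor at once.
  val spark = SparkSession.builder()
    .appName("multithreaded-tasks")       // hypothetical app name
    .config("spark.executor.cores", "16")
    .config("spark.task.cpus", "4")
    .getOrCreate()

The trade-off is fewer concurrent tasks per executor, which is what you
want when each task is genuinely CPU-bound across several threads.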





Good idea to do multi-threading in spark job?

2020-05-03 Thread Ruijing Li
Hi all,

We have a Spark job (Spark 2.4.4, Hadoop 2.7, Scala 2.11.12) in which we
use semaphores / parallel collections. We definitely notice a huge
speedup in the job from doing this, but we were wondering whether it
could cause any unintended side effects. In particular, I'm worried
about deadlocks, and about whether it could interfere with the fixes for
issues such as this one:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-26961

We do run with multiple cores.
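
For concreteness, a minimal sketch of the kind of pattern I mean
(fetchRecord, the sample IDs, and the limit of 4 concurrent calls per
task are illustrative stand-ins, not our actual code):

  import java.util.concurrent.Semaphore
  import org.apache.spark.sql.SparkSession

  // Illustrative only: overlap blocking I/O inside each task with a
  // parallel collection, bounded by a semaphore so a single task never
  // has more than 4 calls in flight.
  def fetchRecord(id: String): String = s"payload-for-$id" // stand-in for blocking I/O

  val spark = SparkSession.builder().appName("parallel-fetch-sketch").getOrCreate()
  val ids = spark.sparkContext.parallelize((1 to 100).map(_.toString))

  val fetched = ids.mapPartitions { iter =>
    val gate = new Semaphore(4) // cap in-flight calls within this task
    iter.toVector.par.map { id =>
      gate.acquire()
      try fetchRecord(id) finally gate.release()
    }.seq.iterator
  }
  println(fetched.count()) // forces evaluation

Note the semaphore is per task, so with several tasks running in one
executor the JVM still has (concurrent tasks x 4) calls in flight.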

Thanks!
-- 
Cheers,
Ruijing Li