Re: [Spark Launcher] How to launch parallel jobs?

2017-02-14 Thread Cosmin Posteuca
Hi,

Egor is right: for every partition Spark creates a task, and every task runs
on a single core. But with different configurations Spark gives different
results:

1 executor with 4 cores takes 120 seconds
2 executors with 2 cores each took 60 seconds in two runs and 120 seconds in one run
4 executors with 1 core each takes 60 seconds

Why does this happen? Why is it non-deterministic?

Thanks

2017-02-14 10:29 GMT+02:00 Cosmin Posteuca <cosmin.poste...@gmail.com>:

> Memory seems to be enough. My cluster has 22.5 gb total memory and my job
> uses 6.88 gb. If I run this job twice, together they use 13.75 gb, but
> sometimes the cluster has a memory spike of 19.5 gb.
>
> Thanks,
> Cosmin
>
> 2017-02-14 10:03 GMT+02:00 Mendelson, Assaf <assaf.mendel...@rsa.com>:
>
>> You should also check your memory usage.
>>
>> Let’s say for example you have 16 cores and 8 GB. And that you use 4
>> executors with 1 core each.
>>
>> When you use an executor, Spark reserves it from YARN and YARN allocates
>> the number of cores (e.g. 1 in our case) and the memory. The memory is
>> actually more than you asked for: if you ask for 1GB, it will in fact
>> allocate almost 1.5GB with overhead. In addition, it will probably allocate
>> a container for the driver (probably with 1024MB of memory).
>>
>> When you run your program and look at port 8088, you should look not only
>> at the VCores used out of the VCores total but also at the Memory used and
>> Memory total. You should also navigate to the executors (e.g.
>> Applications -> Running on the left, then choose your application and
>> navigate all the way down to a single container). You can see the
>> actual usage there.
>>
>>
>>
>> BTW, it doesn’t matter how much memory your program wants but how much it
>> reserves. In your example it will not take the 50MB of the test but the
>> ~1.5GB (after overhead) per executor.
>>
>> Hope this helps,
>>
>> Assaf.
>>
>>
>>
>> *From:* Cosmin Posteuca [mailto:cosmin.poste...@gmail.com]
>> *Sent:* Tuesday, February 14, 2017 9:53 AM
>> *To:* Egor Pahomov
>> *Cc:* user
>> *Subject:* Re: [Spark Launcher] How to launch parallel jobs?
>>
>>
>>
>> Hi Egor,
>>
>>
>>
>> About the first problem, I think you are right; it makes sense.
>>
>>
>>
>> About the second problem, I checked the available resources on port 8088 and
>> it shows 16 available cores. I start my job with 4 executors with 1 core
>> each and 1gb per executor. My job uses at most 50mb of memory (just for
>> testing). From my point of view the resources are enough, so I think the
>> problem is in the YARN configuration files, but I don't know what is missing.
>>
>>
>>
>> Thank you
>>
>>
>>
>> 2017-02-13 21:14 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
>>
>> About the second problem: I understand this can happen in two cases: (1) one job
>> prevents the other one from getting resources for its executors, or (2) the
>> bottleneck is reading from disk, which you cannot really parallelize. I have no
>> experience with the second case, but it's easy to verify the first one:
>> just look at your Hadoop UI and verify that both jobs get enough resources.
>>
>>
>>
>> 2017-02-13 11:07 GMT-08:00 Egor Pahomov <pahomov.e...@gmail.com>:
>>
>> "But if i increase only executor-cores the finish time is the same".
>> More experienced ones can correct me, if I'm wrong, but as far as I
>> understand that: one partition processed by one spark task. Task is always
>> running on 1 core and not parallelized among cores. So if you have 5
>> partitions and you increased totall number of cores among cluster from 7 to
>> 10 for example - you have not gained anything. But if you repartition you
>> give an opportunity to process thing in more threads, so now more tasks can
>> execute in parallel.
>>
>>
>>
>> 2017-02-13 7:05 GMT-08:00 Cosmin Posteuca <cosmin.poste...@gmail.com>:
>>
>> Hi,
>>
>>
>>
>> I think I don't understand well enough how to launch jobs.
>>
>>
>>
>> I have one job which takes 60 seconds to finish. I run it with the following
>> command:
>>
>>
>>
>> spark-submit --executor-cores 1 \
>>
>>  --executor-memory 1g \
>>
>>  --driver-memory 1g \
>>
>>  --master yarn \
>>
>>  --deploy-mode cluster \
>>
>>  --conf spark.dynamicAllocation.enabled=true \
>>
>>  --conf spark.shuffle.service.enabled=true \
>>
>>  --conf spark.dynamicAllocation.minExecutors=1 \
>>
>>  --conf spark.dynamicAllocation.maxExecutors=4 \
>>
>>  --conf spark.dynamicAllocation.initialExecutors=4 \
>>
>>  --conf spark.executor.instances=4 \

Re: [Spark Launcher] How to launch parallel jobs?

2017-02-14 Thread Cosmin Posteuca
Memory seems to be enough. My cluster has 22.5 gb total memory and my job
uses 6.88 gb. If I run this job twice, together they use 13.75 gb, but
sometimes the cluster has a memory spike of 19.5 gb.

Thanks,
Cosmin

2017-02-14 10:03 GMT+02:00 Mendelson, Assaf <assaf.mendel...@rsa.com>:

> You should also check your memory usage.
>
> Let’s say for example you have 16 cores and 8 GB. And that you use 4
> executors with 1 core each.
>
> When you use an executor, Spark reserves it from YARN and YARN allocates
> the number of cores (e.g. 1 in our case) and the memory. The memory is
> actually more than you asked for: if you ask for 1GB, it will in fact
> allocate almost 1.5GB with overhead. In addition, it will probably allocate
> a container for the driver (probably with 1024MB of memory).
>
> When you run your program and look at port 8088, you should look not only
> at the VCores used out of the VCores total but also at the Memory used and
> Memory total. You should also navigate to the executors (e.g.
> Applications -> Running on the left, then choose your application and
> navigate all the way down to a single container). You can see the
> actual usage there.
>
>
>
> BTW, it doesn’t matter how much memory your program wants but how much it
> reserves. In your example it will not take the 50MB of the test but the
> ~1.5GB (after overhead) per executor.
>
> Hope this helps,
>
> Assaf.
>
>
>
> *From:* Cosmin Posteuca [mailto:cosmin.poste...@gmail.com]
> *Sent:* Tuesday, February 14, 2017 9:53 AM
> *To:* Egor Pahomov
> *Cc:* user
> *Subject:* Re: [Spark Launcher] How to launch parallel jobs?
>
>
>
> Hi Egor,
>
>
>
> About the first problem, I think you are right; it makes sense.
>
>
>
> About the second problem, I checked the available resources on port 8088 and
> it shows 16 available cores. I start my job with 4 executors with 1 core
> each and 1gb per executor. My job uses at most 50mb of memory (just for
> testing). From my point of view the resources are enough, so I think the
> problem is in the YARN configuration files, but I don't know what is missing.
>
>
>
> Thank you
>
>
>
> 2017-02-13 21:14 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
>
> About the second problem: I understand this can happen in two cases: (1) one job
> prevents the other one from getting resources for its executors, or (2) the
> bottleneck is reading from disk, which you cannot really parallelize. I have no
> experience with the second case, but it's easy to verify the first one:
> just look at your Hadoop UI and verify that both jobs get enough resources.
>
>
>
> 2017-02-13 11:07 GMT-08:00 Egor Pahomov <pahomov.e...@gmail.com>:
>
> "But if i increase only executor-cores the finish time is the same". More
> experienced ones can correct me, if I'm wrong, but as far as I understand
> that: one partition processed by one spark task. Task is always running on
> 1 core and not parallelized among cores. So if you have 5 partitions and
> you increased totall number of cores among cluster from 7 to 10 for example
> - you have not gained anything. But if you repartition you give an
> opportunity to process thing in more threads, so now more tasks can execute
> in parallel.
>
>
>
> 2017-02-13 7:05 GMT-08:00 Cosmin Posteuca <cosmin.poste...@gmail.com>:
>
> Hi,
>
>
>
> I think I don't understand well enough how to launch jobs.
>
>
>
> I have one job which takes 60 seconds to finish. I run it with the following
> command:
>
>
>
> spark-submit --executor-cores 1 \
>
>  --executor-memory 1g \
>
>  --driver-memory 1g \
>
>  --master yarn \
>
>  --deploy-mode cluster \
>
>  --conf spark.dynamicAllocation.enabled=true \
>
>  --conf spark.shuffle.service.enabled=true \
>
>  --conf spark.dynamicAllocation.minExecutors=1 \
>
>  --conf spark.dynamicAllocation.maxExecutors=4 \
>
>  --conf spark.dynamicAllocation.initialExecutors=4 \
>
>  --conf spark.executor.instances=4 \
>
> If I increase the number of partitions from code and the number of executors, the
> app finishes faster, which is fine. But if I increase only executor-cores, the
> finish time stays the same, and I don't understand why. I expect the time to be
> lower than the initial time.
>
> My second problem: if I launch the above job twice, I expect both jobs to
> finish in 60 seconds, but this doesn't happen. Both jobs finish after 120
> seconds and I don't understand why.
>
> I run this code on AWS EMR, on 2 instances (4 CPUs each, and each CPU has 2
> threads). From what I saw in the default EMR configuration, YARN is set to
> FIFO (the default) mode with the CapacityScheduler.
>
> What do you think about these problems?
>
> Thanks,
>
> Cosmin
>
>
>
>
>
> --
>
>
> *Sincerely yours Egor Pakhomov*
>
>
>
>
>
> --
>
>
> *Sincerely yours Egor Pakhomov*
>
>
>


RE: [Spark Launcher] How to launch parallel jobs?

2017-02-14 Thread Mendelson, Assaf
You should also check your memory usage.
Let’s say for example you have 16 cores and 8 GB. And that you use 4 executors 
with 1 core each.
When you use an executor, Spark reserves it from YARN and YARN allocates the 
number of cores (e.g. 1 in our case) and the memory. The memory is actually 
more than you asked for: if you ask for 1GB, it will in fact allocate almost 
1.5GB with overhead. In addition, it will probably allocate a container for the 
driver (probably with 1024MB of memory).
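
As a rough sketch of that arithmetic (assuming Spark's default overhead rule of 
max(384MB, 10% of executor memory), and that YARN then rounds the request up to a 
multiple of yarn.scheduler.minimum-allocation-mb; the exact numbers depend on your 
cluster settings):

// Approximate YARN container size for one executor, defaults assumed.
def containerSizeMb(executorMemoryMb: Int, yarnMinAllocMb: Int = 1024): Int = {
  // spark.yarn.executor.memoryOverhead default: max(384MB, 10% of executor memory)
  val overheadMb = math.max(384, (0.10 * executorMemoryMb).toInt)
  val requestedMb = executorMemoryMb + overheadMb
  // YARN rounds the request up to a multiple of yarn.scheduler.minimum-allocation-mb
  math.ceil(requestedMb.toDouble / yarnMinAllocMb).toInt * yarnMinAllocMb
}

containerSizeMb(1024, 128)  // 1024 + 384 = 1408MB requested -> roughly 1.4GB per executor
containerSizeMb(1024)       // 2048MB if the minimum allocation increment is the default 1024MB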
When you run your program and look at port 8088, you should look not only at 
the VCores used out of the VCores total but also at the Memory used and Memory 
total. You should also navigate to the executors (e.g. Applications -> Running on 
the left, then choose your application and navigate all the way down to a 
single container). You can see the actual usage there.

BTW, it doesn’t matter how much memory your program wants but how much it 
reserves. In your example it will not take the 50MB of the test but the ~1.5GB 
(after overhead) per executor.
Hope this helps,
Assaf.

From: Cosmin Posteuca [mailto:cosmin.poste...@gmail.com]
Sent: Tuesday, February 14, 2017 9:53 AM
To: Egor Pahomov
Cc: user
Subject: Re: [Spark Launcher] How to launch parallel jobs?

Hi Egor,

About the first problem, I think you are right; it makes sense.

About the second problem, I checked the available resources on port 8088 and it 
shows 16 available cores. I start my job with 4 executors with 1 core each and 
1gb per executor. My job uses at most 50mb of memory (just for testing). From my 
point of view the resources are enough, so I think the problem is in the YARN 
configuration files, but I don't know what is missing.

Thank you

2017-02-13 21:14 GMT+02:00 Egor Pahomov 
<pahomov.e...@gmail.com<mailto:pahomov.e...@gmail.com>>:
About the second problem: I understand this can happen in two cases: (1) one job 
prevents the other one from getting resources for its executors, or (2) the bottleneck 
is reading from disk, which you cannot really parallelize. I have no experience 
with the second case, but it's easy to verify the first one: just look at your Hadoop 
UI and verify that both jobs get enough resources.

2017-02-13 11:07 GMT-08:00 Egor Pahomov 
<pahomov.e...@gmail.com<mailto:pahomov.e...@gmail.com>>:
"But if i increase only executor-cores the finish time is the same". More 
experienced ones can correct me, if I'm wrong, but as far as I understand that: 
one partition processed by one spark task. Task is always running on 1 core and 
not parallelized among cores. So if you have 5 partitions and you increased 
totall number of cores among cluster from 7 to 10 for example - you have not 
gained anything. But if you repartition you give an opportunity to process 
thing in more threads, so now more tasks can execute in parallel.

2017-02-13 7:05 GMT-08:00 Cosmin Posteuca 
<cosmin.poste...@gmail.com<mailto:cosmin.poste...@gmail.com>>:
Hi,

I think I don't understand well enough how to launch jobs.

I have one job which takes 60 seconds to finish. I run it with the following 
command:


spark-submit --executor-cores 1 \

 --executor-memory 1g \

 --driver-memory 1g \

 --master yarn \

 --deploy-mode cluster \

 --conf spark.dynamicAllocation.enabled=true \

 --conf spark.shuffle.service.enabled=true \

 --conf spark.dynamicAllocation.minExecutors=1 \

 --conf spark.dynamicAllocation.maxExecutors=4 \

 --conf spark.dynamicAllocation.initialExecutors=4 \

 --conf spark.executor.instances=4 \

If I increase the number of partitions from code and the number of executors, the app 
finishes faster, which is fine. But if I increase only executor-cores, the 
finish time stays the same, and I don't understand why. I expect the time to be 
lower than the initial time.

My second problem: if I launch the above job twice, I expect both jobs to 
finish in 60 seconds, but this doesn't happen. Both jobs finish after 120 seconds 
and I don't understand why.

I run this code on AWS EMR, on 2 instances (4 CPUs each, and each CPU has 2 
threads). From what I saw in the default EMR configuration, YARN is set to 
FIFO (the default) mode with the CapacityScheduler.

What do you think about these problems?

Thanks,

Cosmin



--
Sincerely yours
Egor Pakhomov



--
Sincerely yours
Egor Pakhomov



Re: [Spark Launcher] How to launch parallel jobs?

2017-02-13 Thread Cosmin Posteuca
Hi Egor,

About the first problem, I think you are right; it makes sense.

About the second problem, I checked the available resources on port 8088 and it
shows 16 available cores. I start my job with 4 executors with 1 core each,
and 1gb per executor. My job uses at most 50mb of memory (just for testing).
From my point of view the resources are enough, so I think the problem is
in the YARN configuration files, but I don't know what is missing.

Thank you

2017-02-13 21:14 GMT+02:00 Egor Pahomov :

> About the second problem: I understand this can happen in two cases: (1) one job
> prevents the other one from getting resources for its executors, or (2) the
> bottleneck is reading from disk, which you cannot really parallelize. I have no
> experience with the second case, but it's easy to verify the first one:
> just look at your Hadoop UI and verify that both jobs get enough resources.
>
> 2017-02-13 11:07 GMT-08:00 Egor Pahomov :
>
>> "But if i increase only executor-cores the finish time is the same".
>> More experienced ones can correct me, if I'm wrong, but as far as I
>> understand that: one partition processed by one spark task. Task is always
>> running on 1 core and not parallelized among cores. So if you have 5
>> partitions and you increased totall number of cores among cluster from 7 to
>> 10 for example - you have not gained anything. But if you repartition you
>> give an opportunity to process thing in more threads, so now more tasks can
>> execute in parallel.
>>
>> 2017-02-13 7:05 GMT-08:00 Cosmin Posteuca :
>>
>>> Hi,
>>>
>>> I think I don't understand well enough how to launch jobs.
>>>
>>> I have one job which takes 60 seconds to finish. I run it with the following
>>> command:
>>>
>>> spark-submit --executor-cores 1 \
>>>  --executor-memory 1g \
>>>  --driver-memory 1g \
>>>  --master yarn \
>>>  --deploy-mode cluster \
>>>  --conf spark.dynamicAllocation.enabled=true \
>>>  --conf spark.shuffle.service.enabled=true \
>>>  --conf spark.dynamicAllocation.minExecutors=1 \
>>>  --conf spark.dynamicAllocation.maxExecutors=4 \
>>>  --conf spark.dynamicAllocation.initialExecutors=4 \
>>>  --conf spark.executor.instances=4 \
>>>
>>> If I increase the number of partitions from code and the number of executors, the 
>>> app finishes faster, which is fine. But if I increase only 
>>> executor-cores, the finish time stays the same, and I don't understand why. I 
>>> expect the time to be lower than the initial time.
>>>
>>> My second problem: if I launch the above job twice, I expect both jobs 
>>> to finish in 60 seconds, but this doesn't happen. Both jobs finish after 120 
>>> seconds and I don't understand why.
>>>
>>> I run this code on AWS EMR, on 2 instances (4 CPUs each, and each CPU has 2 
>>> threads). From what I saw in the default EMR configuration, YARN is set to 
>>> FIFO (the default) mode with the CapacityScheduler.
>>>
>>> What do you think about these problems?
>>>
>>> Thanks,
>>>
>>> Cosmin
>>>
>>>
>>
>>
>> --
>>
>>
>> *Sincerely yours Egor Pakhomov*
>>
>
>
>
> --
>
>
> *Sincerely yours Egor Pakhomov*
>


Re: [Spark Launcher] How to launch parallel jobs?

2017-02-13 Thread Egor Pahomov
About the second problem: I understand this can happen in two cases: (1) one job
prevents the other one from getting resources for its executors, or (2) the
bottleneck is reading from disk, which you cannot really parallelize. I have no
experience with the second case, but it's easy to verify the first one:
just look at your Hadoop UI and verify that both jobs get enough resources.
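
If you prefer to check this programmatically rather than in the UI, the
ResourceManager REST API exposes the same numbers; a minimal sketch (assuming
the default web port 8088 and a placeholder hostname):

// Fetch cluster-wide resource metrics from the YARN ResourceManager REST API
// and compare available vs. allocated memory and vcores while both jobs run.
import scala.io.Source

object ClusterMetrics {
  def main(args: Array[String]): Unit = {
    val rm = "http://resourcemanager-host:8088"   // placeholder ResourceManager address
    val json = Source.fromURL(s"$rm/ws/v1/cluster/metrics").mkString
    // Look at availableMB / allocatedMB and availableVirtualCores / allocatedVirtualCores
    println(json)
  }
}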

2017-02-13 11:07 GMT-08:00 Egor Pahomov :

> "But if i increase only executor-cores the finish time is the same". More
> experienced ones can correct me, if I'm wrong, but as far as I understand
> that: one partition processed by one spark task. Task is always running on
> 1 core and not parallelized among cores. So if you have 5 partitions and
> you increased totall number of cores among cluster from 7 to 10 for example
> - you have not gained anything. But if you repartition you give an
> opportunity to process thing in more threads, so now more tasks can execute
> in parallel.
>
> 2017-02-13 7:05 GMT-08:00 Cosmin Posteuca :
>
>> Hi,
>>
>> I think I don't understand well enough how to launch jobs.
>>
>> I have one job which takes 60 seconds to finish. I run it with the following
>> command:
>>
>> spark-submit --executor-cores 1 \
>>  --executor-memory 1g \
>>  --driver-memory 1g \
>>  --master yarn \
>>  --deploy-mode cluster \
>>  --conf spark.dynamicAllocation.enabled=true \
>>  --conf spark.shuffle.service.enabled=true \
>>  --conf spark.dynamicAllocation.minExecutors=1 \
>>  --conf spark.dynamicAllocation.maxExecutors=4 \
>>  --conf spark.dynamicAllocation.initialExecutors=4 \
>>  --conf spark.executor.instances=4 \
>>
>> If I increase the number of partitions from code and the number of executors, the app 
>> finishes faster, which is fine. But if I increase only executor-cores, the 
>> finish time stays the same, and I don't understand why. I expect the time to be 
>> lower than the initial time.
>>
>> My second problem: if I launch the above job twice, I expect both jobs to 
>> finish in 60 seconds, but this doesn't happen. Both jobs finish after 120 
>> seconds and I don't understand why.
>>
>> I run this code on AWS EMR, on 2 instances (4 CPUs each, and each CPU has 2 
>> threads). From what I saw in the default EMR configuration, YARN is set to 
>> FIFO (the default) mode with the CapacityScheduler.
>>
>> What do you think about these problems?
>>
>> Thanks,
>>
>> Cosmin
>>
>>
>
>
> --
>
>
> *Sincerely yours Egor Pakhomov*
>



-- 


*Sincerely yours Egor Pakhomov*


Re: [Spark Launcher] How to launch parallel jobs?

2017-02-13 Thread Egor Pahomov
"But if i increase only executor-cores the finish time is the same". More
experienced ones can correct me, if I'm wrong, but as far as I understand
that: one partition processed by one spark task. Task is always running on
1 core and not parallelized among cores. So if you have 5 partitions and
you increased totall number of cores among cluster from 7 to 10 for example
- you have not gained anything. But if you repartition you give an
opportunity to process thing in more threads, so now more tasks can execute
in parallel.
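
A toy sketch of what I mean (placeholder computation, assuming an existing
SparkSession called spark): with 5 partitions at most 5 tasks run at once no
matter how many cores are available, while repartitioning lets more tasks run
in parallel.

// The number of partitions bounds the number of concurrently running tasks,
// so extra cores stay idle until the data is repartitioned.
def slowComputation(i: Int): Int = { Thread.sleep(1); i }   // placeholder work

val data = spark.sparkContext.parallelize(1 to 1000000, numSlices = 5)
println(data.getNumPartitions)        // 5 -> at most 5 tasks in flight

val widened = data.repartition(10)
println(widened.getNumPartitions)     // 10 -> up to 10 tasks in flight
widened.map(slowComputation).count()  // can now use up to 10 cores in parallel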

2017-02-13 7:05 GMT-08:00 Cosmin Posteuca :

> Hi,
>
> I think I don't understand well enough how to launch jobs.
>
> I have one job which takes 60 seconds to finish. I run it with the following
> command:
>
> spark-submit --executor-cores 1 \
>  --executor-memory 1g \
>  --driver-memory 1g \
>  --master yarn \
>  --deploy-mode cluster \
>  --conf spark.dynamicAllocation.enabled=true \
>  --conf spark.shuffle.service.enabled=true \
>  --conf spark.dynamicAllocation.minExecutors=1 \
>  --conf spark.dynamicAllocation.maxExecutors=4 \
>  --conf spark.dynamicAllocation.initialExecutors=4 \
>  --conf spark.executor.instances=4 \
>
> If I increase the number of partitions from code and the number of executors, the app 
> finishes faster, which is fine. But if I increase only executor-cores, the 
> finish time stays the same, and I don't understand why. I expect the time to be 
> lower than the initial time.
>
> My second problem: if I launch the above job twice, I expect both jobs to 
> finish in 60 seconds, but this doesn't happen. Both jobs finish after 120 
> seconds and I don't understand why.
>
> I run this code on AWS EMR, on 2 instances (4 CPUs each, and each CPU has 2 
> threads). From what I saw in the default EMR configuration, YARN is set to 
> FIFO (the default) mode with the CapacityScheduler.
>
> What do you think about these problems?
>
> Thanks,
>
> Cosmin
>
>


-- 


*Sincerely yours Egor Pakhomov*


[Spark Launcher] How to launch parallel jobs?

2017-02-13 Thread Cosmin Posteuca
Hi,

I think I don't understand well enough how to launch jobs.

I have one job which takes 60 seconds to finish. I run it with the following
command:

spark-submit --executor-cores 1 \
 --executor-memory 1g \
 --driver-memory 1g \
 --master yarn \
 --deploy-mode cluster \
 --conf spark.dynamicAllocation.enabled=true \
 --conf spark.shuffle.service.enabled=true \
 --conf spark.dynamicAllocation.minExecutors=1 \
 --conf spark.dynamicAllocation.maxExecutors=4 \
 --conf spark.dynamicAllocation.initialExecutors=4 \
 --conf spark.executor.instances=4 \

If I increase the number of partitions from code and the number of executors,
the app finishes faster, which is fine. But if I increase only
executor-cores, the finish time stays the same, and I don't understand
why. I expect the time to be lower than the initial time.

My second problem: if I launch the above job twice, I expect both
jobs to finish in 60 seconds, but this doesn't happen. Both jobs finish
after 120 seconds and I don't understand why.
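
For reference, a minimal sketch of how the two parallel submissions could look
with SparkLauncher (the jar path and main class below are placeholders, and the
conf values mirror the spark-submit command above):

// Launch the same job twice, in parallel, via SparkLauncher and wait for both.
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object ParallelLaunch {
  def launchJob(): SparkAppHandle =
    new SparkLauncher()
      .setAppResource("/path/to/my-job.jar")          // placeholder jar
      .setMainClass("com.example.MyJob")              // placeholder main class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setConf(SparkLauncher.EXECUTOR_CORES, "1")
      .setConf(SparkLauncher.EXECUTOR_MEMORY, "1g")
      .setConf(SparkLauncher.DRIVER_MEMORY, "1g")
      .setConf("spark.dynamicAllocation.enabled", "true")
      .setConf("spark.shuffle.service.enabled", "true")
      .setConf("spark.dynamicAllocation.minExecutors", "1")
      .setConf("spark.dynamicAllocation.maxExecutors", "4")
      .setConf("spark.dynamicAllocation.initialExecutors", "4")
      .setConf("spark.executor.instances", "4")
      .startApplication()

  def main(args: Array[String]): Unit = {
    val handles = Seq(launchJob(), launchJob())       // both submitted immediately
    while (!handles.forall(_.getState.isFinal)) Thread.sleep(1000)
    handles.foreach(h => println(s"${h.getAppId}: ${h.getState}"))
  }
}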

I run this code on AWS EMR, on 2 instances (4 CPUs each, and each CPU
has 2 threads). From what I saw in the default EMR configuration, YARN is
set to FIFO (the default) mode with the CapacityScheduler.

What do you think about these problems?

Thanks,

Cosmin