Hi Jeroen,

could you try a few things?

- Try EMR release 5.10 or 5.11 instead (see the first sketch below).
- Try selecting a subnet that is in a different availability zone.
- If possible, increase the number of task instances and see whether it makes a difference.
- If you are using caching, check the total amount of space being used. In the worst case you may also want to persist intermediate data to S3 as Parquet (the default format) and then work through the steps you think are failing in a Jupyter or Spark notebook (see the second sketch below).
- Could you also report the number of containers your job is creating, by looking at the metrics in the EMR console?
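For the first point, here is a rough sketch of launching a transient cluster with a pinned release label, an explicit subnet (i.e. a specific availability zone) and extra task instances via boto3. The region, bucket, subnet ID, instance types and counts are placeholders, not recommendations:

    # Sketch only: transient EMR cluster with a pinned release label,
    # an explicit subnet and extra task instances. All IDs, names and
    # sizes below are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="eu-west-1")  # region is an assumption

    response = emr.run_job_flow(
        Name="jeroen-spark-job",                # hypothetical name
        ReleaseLabel="emr-5.11.0",              # or "emr-5.10.0"
        Applications=[{"Name": "Spark"}],
        LogUri="s3://my-bucket/emr-logs/",      # hypothetical bucket
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        Instances={
            "Ec2SubnetId": "subnet-0123456789abcdef0",  # pick one in another AZ
            "KeepJobFlowAliveWhenNoSteps": False,       # transient cluster
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m4.xlarge",
                 "InstanceCount": 1, "Market": "ON_DEMAND"},
                {"InstanceRole": "CORE", "InstanceType": "m4.xlarge",
                 "InstanceCount": 2, "Market": "ON_DEMAND"},
                # Bump this count to test whether more task instances help.
                {"InstanceRole": "TASK", "InstanceType": "m4.xlarge",
                 "InstanceCount": 4, "Market": "ON_DEMAND"},
            ],
        },
    )
    print(response["JobFlowId"])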
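And for persisting intermediate data, a minimal sketch of checkpointing a DataFrame to S3 as Parquet so that you can reload it in a Jupyter or Spark notebook and replay only the steps you suspect. The bucket, paths and column names are made up:

    # Sketch only: write an intermediate result to S3 as Parquet, then
    # reload it in a notebook to debug the downstream steps in isolation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("debug-intermediate").getOrCreate()

    intermediate = (
        spark.read.parquet("s3://my-bucket/input/")   # hypothetical input
             .filter("event_date >= '2017-12-01'")    # hypothetical step
    )

    # Persist the intermediate result instead of keeping it only in cache.
    intermediate.write.mode("overwrite").parquet("s3://my-bucket/intermediate/")

    # Later, in a notebook, resume from the checkpoint:
    resumed = spark.read.parquet("s3://my-bucket/intermediate/")
    resumed.groupBy("user_id").count().show()         # hypothetical downstream step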

Also, if you look at the Spark UI you can easily see which particular step is taking the longest; you just have to drill in a bit to see it. In general, if shuffling is an issue it shows up clearly in the Spark UI when you drill into the stages and check which one is taking the longest.
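Since your cluster is transient, one assumption on my part that may help: if you enable Spark event logging to an S3 path, the stage and shuffle timings remain inspectable (e.g. through a Spark history server) even after the cluster is gone. A minimal sketch, with a placeholder bucket:

    # Sketch only: write Spark event logs to S3 so stage/shuffle timings can
    # still be inspected after a transient cluster terminates. On EMR the same
    # keys could also be set through a cluster configuration.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("job-with-event-logs")
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "s3://my-bucket/spark-event-logs/")
        .getOrCreate()
    )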


Since you do not have a long-running cluster (I mistook your mention of a long-running job for a long-running cluster), things should be fine on that front.


Regards,
Gourav Sengupta


On Thu, Dec 28, 2017 at 7:43 PM, Jeroen Miller <bluedasya...@gmail.com>
wrote:

> On 28 Dec 2017, at 19:42, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
> > In the EMR cluster what are the other applications that you have enabled
> (like HIVE, FLUME, Livy, etc).
>
> Nothing that I can think of, just a Spark step (unless EMR is doing fancy
> stuff behind my back).
>
> > Are you using SPARK Session?
>
> Yes.
>
> > If yes is your application using cluster mode or client mode?
>
> Cluster mode.
>
> > Have you read the EC2 service level agreement?
>
> I did not -- I doubt it has the answer to my problem though! :-)
>
> > Is your cluster on auto scaling group?
>
> Nope.
>
> > Are you scheduling your job by adding another new step into the EMR
> cluster? Or is it the same job running always triggered by some background
> process?
> > Since EMR are supposed to be ephemeral, have you tried creating a new
> cluster and trying your job in that?
>
> I'm creating a new cluster on demand, specifically for that job. No other
> application runs on it.
>
> JM
>
>
