>
>    - The Spark UI shows the number of succeeded tasks as more than the
>    total number of tasks, e.g. 3500/3000. There are no failed tasks. At
>    this stage the computation keeps running for a long time without
>    returning an answer.
>
> No sign of resubmitted tasks in the command line logs either?
You might want to get more information on what is going on in the JVM.
I don't know what others use, but jvmtop is easy to install on EC2 and lets
you monitor individual JVM processes.

>
>    - The only way to get an answer from the application is to keep
>    rerunning it until, by some luck, it converges.
>
> I was not able to reproduce this with a minimal example, as it seems some
> random factors affect this behavior. I have a suspicion, but I'm not sure,
> that the use of one or more groupByKey() calls intensifies this problem.
>
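On the groupByKey() suspicion: groupByKey() ships every value for a key
across the network to a single task, so on large or skewed data the shuffle
can get very expensive. If what follows the grouping is a reduction,
reduceByKey() does the same job but combines values map-side first. A rough
sketch (the RDD and values here are made up, not from your code):

    // pairs: a hypothetical RDD[(String, Int)]; sc is the SparkContext
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey shuffles every value, then aggregates on the reduce side
    val sums1 = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines partial sums before the shuffle,
    // so much less data crosses the network
    val sums2 = pairs.reduceByKey(_ + _)

If you really need the grouped values themselves, aggregateByKey() into a
smaller structure is another option.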
Is this related to the amount of data you are processing? Is it more likely
to happen on large data?
My experience on EC2 is that whenever the memory/partitioning/timeout
settings are reasonable, the output is quite consistent, even after stopping
and restarting the cluster.
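
For reference, by "reasonable settings" I mean things along these lines (the
keys are standard Spark configuration properties; the values are placeholders
you would tune for your own cluster and data volume):

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder values -- tune for your own cluster size and data volume
    val conf = new SparkConf()
      .setAppName("MyApp")                      // hypothetical app name
      .set("spark.executor.memory", "6g")       // heap per executor
      .set("spark.default.parallelism", "200")  // shuffle partitions for RDDs
      .set("spark.network.timeout", "600s")     // raise if GC pauses trip timeouts
    val sc = new SparkContext(conf)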
