tez.task.progress.stuck.interval-ms is broken before Tez 0.9, so do not turn it on if you are on 0.8.
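For anyone who wants a concrete starting point, here is a rough sketch of applying the values Sid suggests below programmatically (the class and method names are made up for illustration; the property keys and suggested values come straight from the quoted mail, and the same keys can just as well go into tez-site.xml):

    import org.apache.tez.dag.api.TezConfiguration;

    public class LargeClusterTezConf {

        // Returns a TezConfiguration tuned for a larger (~500-node) cluster.
        // TezConfiguration extends Hadoop's Configuration, so the plain
        // string-keyed setters below work as-is.
        public static TezConfiguration create() {
            TezConfiguration conf = new TezConfiguration();

            // Leave the stuck-progress check disabled (its default) on 0.8.x;
            // as noted above, the property is broken before Tez 0.9.
            conf.setLong("tez.task.progress.stuck.interval-ms", -1L);

            // Framework-driven ping timeout; the default is 5 minutes.
            conf.setLong("tez.task.timeout-ms", 5 * 60 * 1000L);

            // Relax the heartbeat intervals per the suggestions quoted below.
            conf.setInt("tez.task.get-task.sleep.interval-ms.max", 3000);        // default 200
            conf.setInt("tez.task.am.heartbeat.interval-ms.max", 3000);          // default 100
            conf.setInt("tez.task.am.heartbeat.counter.interval-ms.max", 30000); // default 4000

            return conf;
        }
    }

Either way, keep an eye on the AM logs after changing these, per Sid's note below about the event queue backing up.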
On Tue, Aug 1, 2017 at 11:25 AM, Siddharth Seth <ss...@apache.org> wrote:

> There are several configs which control timeouts and heartbeats. For
> larger clusters, lowering the frequency at which some of these messages
> are sent out would likely help.
>
> For controlling the timeouts:
> tez.task.progress.stuck.interval-ms - user code needs to send heartbeats
> for this property. (Default is -1, which disables this.)
> tez.task.timeout-ms - the framework takes care of sending pings for this
> property. (Defaults to a 5-minute timeout.)
>
> Some of the heartbeat intervals are a little aggressive out of the box
> (100-200ms). This can cause network congestion, as well as an event
> backup on the AM.
> tez.task.get-task.sleep.interval-ms.max - default 200. Set to 3s or more
> on a larger cluster.
> tez.task.am.heartbeat.interval-ms.max - default 100. Set to 3s or more
> on a larger cluster.
> tez.task.am.heartbeat.counter.interval-ms.max - default 4000. Set to 30s
> or more on a larger cluster.
>
> (You can try tuning the suggested values to see how well things work.)
> A problem to look for is the AM event queue backing up (it will show up
> as messages in the AM logs; also look at the GC logs for the AM), which
> can lead to GC pressure on the AM, as well as a general delay in
> processing events, which can lead to a timeout.
> Also, look at GC for the tasks that are running. Are heartbeats actually
> going out?
>
>
> On Tue, Aug 1, 2017 at 11:07 AM, Scott McCarty <smcca...@apixio.com>
> wrote:
>
>> Hi,
>>
>> We're running a Cloudera Hadoop cluster (5.8.x, I believe) that we
>> scale up and down as needed. The only jobs running on the cluster are
>> Tez (version 0.8.4). When the cluster is small (about 200 nodes or
>> fewer), things work reasonably well, but when the cluster is scaled up
>> to, say, 500 nodes, a very large percentage of the jobs fail due to
>> either container timeout or attempt timeout, and we're trying to figure
>> out what might be causing this problem.
>>
>> The timeout(s) are set to 30 minutes, and from looking at the Tez code
>> that raises that timeout error, it looks like the ping that's supposed
>> to be coming from the attempt/container JVM isn't happening. It has
>> happened on the initial input node in the DAG, so it's not necessarily
>> failing due to intra-DAG communication problems.
>>
>> The Tez jobs are custom code--we're not using Tez for Hive queries--and
>> some of the processing on key/value records can take quite a while, but
>> that doesn't cause problems when the cluster is smaller. Also, we have
>> Tez sessions and container reuse turned off.
>>
>> Does anyone know if this is/was a problem with Tez 0.8.4? Or maybe it's
>> a Cloudera/RM/cluster issue? Any suggestions on what to look for? (For
>> sure it would be good to upgrade to a more recent version of Tez, but
>> that might have to wait for a short while.)
>>
>> Thanks in advance for any help/suggestions.
>>
>> --Scott