tez.task.progress.stuck.interval-ms is broken before Tez 0.9, so do not turn it on if you are on 0.8.
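For anyone who wants a concrete starting point, here is a rough sketch of applying the values Sid suggests below programmatically (the class and method names are made up for illustration; the property keys and suggested values come straight from the quoted mail, and the same keys can just as well go into tez-site.xml):

    import org.apache.tez.dag.api.TezConfiguration;

    public class LargeClusterTezConf {

        // Returns a TezConfiguration tuned for a larger (~500-node) cluster.
        // TezConfiguration extends Hadoop's Configuration, so the plain
        // string-keyed setters below work as-is.
        public static TezConfiguration create() {
            TezConfiguration conf = new TezConfiguration();

            // Leave the stuck-progress check disabled (its default) on 0.8.x;
            // as noted above, the property is broken before Tez 0.9.
            conf.setLong("tez.task.progress.stuck.interval-ms", -1L);

            // Framework-driven ping timeout; the default is 5 minutes.
            conf.setLong("tez.task.timeout-ms", 5 * 60 * 1000L);

            // Relax the heartbeat intervals per the suggestions quoted below.
            conf.setInt("tez.task.get-task.sleep.interval-ms.max", 3000);        // default 200
            conf.setInt("tez.task.am.heartbeat.interval-ms.max", 3000);          // default 100
            conf.setInt("tez.task.am.heartbeat.counter.interval-ms.max", 30000); // default 4000

            return conf;
        }
    }

Either way, keep an eye on the AM logs after changing these, per Sid's note below about the event queue backing up.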
On Tue, Aug 1, 2017 at 11:25 AM, Siddharth Seth <ss...@apache.org> wrote:

> There are several configs which control timeouts and heartbeats. For
> larger clusters, lowering the frequency at which some of these messages
> are sent out would likely help.
>
> For controlling the timeouts:
> tez.task.progress.stuck.interval-ms - user code needs to send heartbeats
> for this property. (Default is -1, which disables this.)
> tez.task.timeout-ms - the framework takes care of sending pings for this
> property. (Defaults to a 5-minute timeout.)
>
> Some of the heartbeat intervals are a little aggressive out of the box
> (100-200ms). This can cause network congestion, as well as an event
> backup on the AM.
> tez.task.get-task.sleep.interval-ms.max - default 200. Set to 3s or more
> on a larger cluster.
> tez.task.am.heartbeat.interval-ms.max - default 100. Set to 3s or more
> on a larger cluster.
> tez.task.am.heartbeat.counter.interval-ms.max - default 4000. Set to 30s
> or more on a larger cluster.
>
> (You can try tuning the suggested values to see how well things work.)
> A problem to look for is the AM event queue backing up (it will show up
> as messages in the AM logs; also look at the GC logs for the AM), which
> can lead to GC pressure on the AM, as well as a general delay in
> processing events, which can lead to a timeout.
> Also, look at GC for the tasks that are running. Are heartbeats actually
> going out?
>
>
> On Tue, Aug 1, 2017 at 11:07 AM, Scott McCarty <smcca...@apixio.com>
> wrote:
>
>> Hi,
>>
>> We're running a Cloudera Hadoop cluster (5.8.x, I believe) that we
>> scale up and down as needed. The only jobs running on the cluster are
>> Tez (version 0.8.4). When the cluster is small (about 200 nodes or
>> fewer), things work reasonably well, but when the cluster is scaled up
>> to, say, 500 nodes, a very large percentage of the jobs fail due to
>> either container timeout or attempt timeout, and we're trying to figure
>> out what might be causing this problem.
>>
>> The timeout(s) are set to 30 minutes, and from looking at the Tez code
>> that raises that timeout error, it looks like the ping that's supposed
>> to be coming from the attempt/container JVM isn't happening. It has
>> happened on the initial input node in the DAG, so it's not necessarily
>> failing due to intra-DAG communication problems.
>>
>> The Tez jobs are custom code--we're not using Tez for Hive queries--and
>> some of the processing on key/value records can take quite a while, but
>> that doesn't cause problems when the cluster is smaller. Also, we have
>> Tez sessions and container reuse turned off.
>>
>> Does anyone know if this is/was a problem with Tez 0.8.4? Or maybe it's
>> a Cloudera/RM/cluster issue? Any suggestions on what to look for? (For
>> sure it would be good to upgrade to a more recent version of Tez, but
>> that might have to wait for a short while.)
>>
>> Thanks in advance for any help/suggestions.
>>
>> --Scott