Jason Lowe created MAPREDUCE-6263:
-------------------------------------

             Summary: Large jobs can lose history when killed due to brief client timeout
                 Key: MAPREDUCE-6263
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6263
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: client
    Affects Versions: 2.6.0
            Reporter: Jason Lowe


YARNRunner connects to the AM to send the kill job command, then waits a 
hardcoded 10 seconds for the job to enter a terminal state.  If the job does 
not reach a terminal state in that time, YARNRunner tells YARN to kill the 
application forcefully.  A forceful kill usually leaves no job history, 
because the AM process is terminated before it can finish writing it.

Ten seconds can be too short for large jobs in a large cluster, as it takes 
time to connect to all the nodemanagers, process the state machine events, and 
copy a large jhist file.  The timeout should be more lenient or configurable.
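
To make the requested behavior concrete, here is a minimal sketch of a 
kill-then-wait loop whose timeout is a parameter rather than a hardcoded 
value.  It is not the YARNRunner code: the JobHandle interface, its method 
names, the FakeJob class, and the polling interval are all hypothetical 
stand-ins chosen for illustration, and the real client would presumably read 
the timeout from job configuration.

```java
public class KillWaitSketch {

    /** Hypothetical stand-in for the client's view of a running job. */
    public interface JobHandle {
        void sendKillToAm();    // graceful kill request sent to the AM
        boolean isTerminal();   // true once the job reached a terminal state
        void forceKillViaRm();  // forceful application kill through YARN
    }

    /**
     * Sends the graceful kill, then polls until the job is terminal or the
     * timeout expires. Returns true if the job finished on its own, false if
     * the forceful kill had to be issued.
     */
    public static boolean killJob(JobHandle job, long timeoutMs, long pollMs) {
        job.sendKillToAm();
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!job.isTerminal()) {
            if (System.nanoTime() - deadline >= 0) {
                break; // timeout expired before a terminal state was seen
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Assumption: an interrupt ends the wait early and is
                // handled the same way as a timeout.
                Thread.currentThread().interrupt();
                break;
            }
        }
        if (job.isTerminal()) {
            return true;
        }
        job.forceKillViaRm();
        return false;
    }

    /** Hypothetical fake job that becomes terminal after a number of polls. */
    public static class FakeJob implements JobHandle {
        private final int pollsUntilTerminal;
        private int polls = 0;
        public boolean killSent = false;
        public boolean forced = false;

        public FakeJob(int pollsUntilTerminal) {
            this.pollsUntilTerminal = pollsUntilTerminal;
        }

        @Override public void sendKillToAm() { killSent = true; }
        @Override public boolean isTerminal() { return ++polls >= pollsUntilTerminal; }
        @Override public void forceKillViaRm() { forced = true; }
    }
}
```

With a short timeout, a job that needs many polls to wind down gets the 
forceful kill; with a longer one, the same loop lets it finish and keep its 
history.  That difference is the argument for a more lenient or configurable 
value.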



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
