Peter Bacsko created MAPREDUCE-7015:
---------------------------------------

             Summary: Possible race condition in JHS if the job is not loaded
                 Key: MAPREDUCE-7015
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7015
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: jobhistoryserver
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko


There could be a race condition inside JHS. In our build environment, 
{{TestMRJobClient.testJobClient()}} failed with this exception:

{noformat}
java.io.FileNotFoundException: File does not exist: hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
        at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
        at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
        at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
        at org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
        at org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
{noformat}

Root cause:
1. MapReduce job completes
2. CLI calls {{cluster.getJob(jobid)}}
3. The job is finished and the client side gets redirected to JHS
4. The job data is missing from CachedHistoryStorage so JHS tries to find the 
job
5. First it scans the intermediate directory and finds the job
6. A call to {{moveToDone()}} is scheduled on a separate thread inside 
{{moveToDoneExecutor}}, but it does not get the chance to run immediately
7. The RPC invocation returns with the path pointing to 
/tmp/hadoop-yarn/staging/history/done_intermediate
8. The call to {{moveToDone()}} completes, which moves the contents of 
done_intermediate to done
9. The Hadoop CLI tries to download the config file from done_intermediate, but 
it is no longer there

Usually the {{moveToDone()}} call scheduled in step #6 is fast enough to 
complete before step #7, but sometimes it lags behind, causing this race 
condition.
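
The interleaving in steps 5-9 can be reproduced outside Hadoop with a small stand-alone sketch. This is not JHS code: it uses local temporary directories instead of HDFS, and the class name, method name, file name and the latch that forces the ordering are all hypothetical, introduced only for illustration. It is meant to show why a path handed out before an asynchronous move completes may be stale by the time the caller reads it.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-alone model of the race; it does not use any Hadoop classes.
class MoveToDoneRaceSketch {

    // Returns true if the "client" still finds the conf file at the path
    // that the "server" handed back before the asynchronous move ran.
    static boolean clientFindsConf(boolean moveRunsBeforeClientReads) throws Exception {
        Path root = Files.createTempDirectory("jhs-race-sketch");
        Path intermediate = Files.createDirectory(root.resolve("done_intermediate"));
        Path done = Files.createDirectory(root.resolve("done"));
        Path conf = Files.createFile(intermediate.resolve("job_0001_conf.xml"));
        Path moved = done.resolve(conf.getFileName());

        ExecutorService moveToDoneExecutor = Executors.newSingleThreadExecutor();
        // The latch stands in for scheduler timing so that each ordering is deterministic.
        CountDownLatch moveMayRun = new CountDownLatch(1);
        try {
            // Step 6: the move is scheduled but does not run immediately.
            Future<Path> move = moveToDoneExecutor.submit(() -> {
                moveMayRun.await();
                return Files.move(conf, moved);
            });

            // Step 7: the "RPC" returns the intermediate path to the client.
            Path returnedPath = conf;

            boolean found;
            if (moveRunsBeforeClientReads) {
                // Step 8 happens before step 9: the returned path is now stale.
                moveMayRun.countDown();
                move.get();
                found = Files.exists(returnedPath);
            } else {
                // The client reads before the move runs: the file is still there.
                found = Files.exists(returnedPath);
                moveMayRun.countDown();
                move.get();
            }
            return found;
        } finally {
            moveToDoneExecutor.shutdownNow();
            // Best-effort cleanup of the temporary files and directories.
            Files.deleteIfExists(moved);
            Files.deleteIfExists(conf);
            Files.deleteIfExists(done);
            Files.deleteIfExists(intermediate);
            Files.deleteIfExists(root);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("move runs first, client finds conf: " + clientFindsConf(true));
        System.out.println("client reads first, client finds conf: " + clientFindsConf(false));
    }
}
```

In the first call the move completes between returning the path and reading it, which corresponds to the failing test run; in the second the client reads before the move and succeeds. In the real system nothing forces either ordering, which is why the failure is intermittent.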



