[jira] [Updated] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-2358:
----------------------------------
    Attachment: TEZ-2358.3.patch

Added a Preconditions check in MergeManager.closeOnDiskFile(). Since we need to consider only the file path and offset, we need to iterate through all items in onDiskMapOutputs (as FileChunk includes file path, offset, and length). This is still fine, as it won't be expensive and it makes debugging easier.

[~gopalv] - Please have a look at the latest patch when you find time.

Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
-------------------------------------------------------------------------

                 Key: TEZ-2358
                 URL: https://issues.apache.org/jira/browse/TEZ-2358
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.7.0
            Reporter: Gopal V
            Assignee: Rajesh Balamohan
         Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, TEZ-2358.3.patch, syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2

The Tez MergeManager code assumes that the src-task-id is unique between merge operations; this results in some confusion when two merge sequences have to process output from the same src-task-id.

{code}
private TezRawKeyValueIterator finalMerge(Configuration job, FileSystem fs,
    List<MapOutput> inMemoryMapOutputs,
    List<FileChunk> onDiskMapOutputs
...
  if (inMemoryMapOutputs.size() > 0) {
    int srcTaskId = inMemoryMapOutputs.get(0).getAttemptIdentifier()
        .getInputIdentifier().getInputIndex();
    ...
    // must spill to disk, but can't retain in-mem for intermediate merge
    final Path outputPath = mapOutputFile.getInputFileForWrite(srcTaskId,
        inMemToDiskBytes).suffix(Constants.MERGED_OUTPUT_PREFIX);
    ...
{code}

This, or some scenario related to this, results in the following FileChunks list, which contains identically named paths with different lengths.

{code}
2015-04-23 03:28:50,983 INFO [MemtoDiskMerger [Map_1]] orderedgrouped.MergeManager: Initiating in-memory merge with 6 segments...
2015-04-23 03:28:50,987 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Merging 6 sorted segments
2015-04-23 03:28:50,988 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Down to the last merge-pass, with 6 segments left of total size: 1165944755 bytes
2015-04-23 03:28:58,495 INFO [MemtoDiskMerger [Map_1]] orderedgrouped.MergeManager: attempt_1429683757595_0141_1_01_000143_0_10027 Merge of the 6 files in-memory complete. Local file is /grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out.merged of size 785583965
2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: finalMerge called with 0 in-memory map-outputs and 5 on-disk map-outputs
2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 365232290 += 365232290for/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_1023.out
2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 730529899 += 365297609for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 1095828683 += 365298784for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
{code}

The multiple instances of 404.out indicate that we pulled two pipelined chunks of the same shuffle src id, once into memory and twice onto disk.
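The Preconditions check described in Rajesh's comment above can be sketched as follows. This is a minimal, self-contained illustration, not the actual Tez code: FileChunkSketch and MergeManagerSketch are hypothetical stand-ins for FileChunk and MergeManager, and the real patch should be consulted for the actual behavior. The point is that closeOnDiskFile() must scan all existing on-disk chunks, because only the (path, offset) pair identifies a true duplicate once pipelined shuffle can legally deliver several chunks of the same source file at different offsets:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Tez's FileChunk: path + offset + length.
class FileChunkSketch {
    final String path;
    final long offset;
    final long length;
    FileChunkSketch(String path, long offset, long length) {
        this.path = path;
        this.offset = offset;
        this.length = length;
    }
}

// Hypothetical stand-in for MergeManager's on-disk output tracking.
class MergeManagerSketch {
    private final List<FileChunkSketch> onDiskMapOutputs = new ArrayList<>();

    void closeOnDiskFile(FileChunkSketch chunk) {
        // Iterate all chunks: only (path, offset) identifies a duplicate,
        // so a full scan is needed. The list is small, so this is cheap,
        // and failing fast here makes debugging easier.
        for (FileChunkSketch existing : onDiskMapOutputs) {
            if (existing.path.equals(chunk.path)
                    && existing.offset == chunk.offset) {
                throw new IllegalStateException(
                    "Duplicate on-disk chunk: " + chunk.path
                    + " @ offset " + chunk.offset);
            }
        }
        onDiskMapOutputs.add(chunk);
    }

    int size() { return onDiskMapOutputs.size(); }
}
```

Under this scheme the spill_404.out condition seen in the log above would be caught at registration time instead of silently double-counting bytes in finalMerge().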
{code}
2015-04-23 03:28:08,256 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: attempt_1429683757595_0141_1_00_000404_0_10009_0, runDuration: 0]
2015-04-23 03:28:08,270 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: attempt_1429683757595_0141_1_00_000404_0_10009_1, runDuration: 0]
2015-04-23 03:28:08,272 INFO
Success: TEZ-2358 PreCommit Build #540
Jira: https://issues.apache.org/jira/browse/TEZ-2358
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/540/

###
## LAST 60 LINES OF THE CONSOLE ###
[...truncated 2773 lines...]
[INFO] Final Memory: 73M/933M
[INFO]

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12728348/TEZ-2358.3.patch
against master revision 2935ef4.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/540//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/540//console

This message is automatically generated.

==
== Adding comment to Jira.
==
Comment added.
16b4bbd9396f2e0b5bd1dd21e0a5589578247c5b logged out
==
== Finished build.
==
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #537
Archived 44 artifacts
Archive block size is 32768
Received 6 blocks and 2576597 bytes
Compression is 7.1%
Took 1.4 sec
Description set: TEZ-2358
Recording test results
Email was triggered for: Success
Sending email for trigger: Success

###
## FAILED TESTS (if any) ##
All tests passed
[jira] [Created] (TEZ-2370) Add stages information to RM UI for debugging / visibility on job progress
Hari Sekhon created TEZ-2370:
--------------------------------

             Summary: Add stages information to RM UI for debugging / visibility on job progress
                 Key: TEZ-2370
                 URL: https://issues.apache.org/jira/browse/TEZ-2370
             Project: Apache Tez
          Issue Type: Improvement
          Components: UI
    Affects Versions: 0.5.2
         Environment: HDP 2.2.0
            Reporter: Hari Sekhon
            Priority: Minor

Something that has been bugging me since last year is the difficulty of debugging Tez jobs compared to MapReduce jobs. This is because the Resource Manager / Application Master does not display the job stats and stages that we are used to seeing in MapReduce, e.g. Map and Reduce task counts and progress.

I appreciate that Tez is a more flexible framework with a DAG, but it would be nice if it could surface information on the different stages (number of tasks running, completed, failed, killed, successful, etc.), similar to how Spark does. The stage breakdown would also be useful in understanding what the job is doing at different times, which stage is getting stuck or failing, etc. At the moment the only options are to trawl the logs or to hope for console output where some of that information is available, both of which are non-ideal when debugging other people's jobs after the fact.

Hari Sekhon
http://www.linkedin.com/in/harisekhon

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513802#comment-14513802 ]

TezQA commented on TEZ-2358:
----------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12728348/TEZ-2358.3.patch
against master revision 2935ef4.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/540//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/540//console

This message is automatically generated.

Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
-------------------------------------------------------------------------

                 Key: TEZ-2358
                 URL: https://issues.apache.org/jira/browse/TEZ-2358
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.7.0
            Reporter: Gopal V
            Assignee: Rajesh Balamohan
         Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, TEZ-2358.3.patch, TEZ-2358.4.patch, syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2
Success: TEZ-2358 PreCommit Build #541
Jira: https://issues.apache.org/jira/browse/TEZ-2358
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/541/

###
## LAST 60 LINES OF THE CONSOLE ###
[...truncated 2772 lines...]
[INFO] Final Memory: 76M/1274M
[INFO]

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12728354/TEZ-2358.4.patch
against master revision 2935ef4.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/541//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/541//console

This message is automatically generated.

==
== Adding comment to Jira.
==
Comment added.
6c74121472d38c3e18d73d3532e9348a11a7079a logged out
==
== Finished build.
==
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #540
Archived 44 artifacts
Archive block size is 32768
Received 6 blocks and 2550818 bytes
Compression is 7.2%
Took 1.3 sec
Description set: TEZ-2358
Recording test results
Email was triggered for: Success
Sending email for trigger: Success

###
## FAILED TESTS (if any) ##
All tests passed
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513830#comment-14513830 ]

TezQA commented on TEZ-2358:
----------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12728354/TEZ-2358.4.patch
against master revision 2935ef4.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/541//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/541//console

This message is automatically generated.

Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
-------------------------------------------------------------------------

                 Key: TEZ-2358
                 URL: https://issues.apache.org/jira/browse/TEZ-2358
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.7.0
            Reporter: Gopal V
            Assignee: Rajesh Balamohan
         Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, TEZ-2358.3.patch, TEZ-2358.4.patch, syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2
[jira] [Commented] (TEZ-2303) ConcurrentModificationException while processing recovery
[ https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513894#comment-14513894 ]

Jeff Zhang commented on TEZ-2303:
---------------------------------

[~hitesh] I didn't find a way to stop accepting connections from clients after the DAG is recovered. Uploaded another patch that uses a different approach:
* Register with the RM only after recovery is done, so that clients get the host/port after recovery has completed.
* One potential issue remains: if recovery fails, the AM would unregister from the RM without having registered first; I'm not sure whether this would cause a YarnException.

ConcurrentModificationException while processing recovery
---------------------------------------------------------

                 Key: TEZ-2303
                 URL: https://issues.apache.org/jira/browse/TEZ-2303
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.6.0
            Reporter: Jason Lowe
            Assignee: Jeff Zhang
         Attachments: TEZ-2303-1.patch, TEZ-2303-2.patch, TEZ-2303-4.patch

Saw a Tez AM log a few ConcurrentModificationException messages while trying to recover from a previous attempt that crashed. Exception details to follow.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
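The ordering proposed in the comment above can be sketched as below. All names (RecoveringAppMasterSketch, recoverDAG, registerWithRM, unregisterFromRM) are hypothetical stand-ins, not the real Tez AM API. The AM publishes its host/port to the RM only after recovery completes, so clients cannot connect mid-recovery; the guard in the catch block corresponds to the open question about unregistering when recovery fails before any registration has happened:

```java
// Hypothetical sketch of "register with the RM only after recovery",
// as discussed in the comment above. Not the actual Tez AM code.
class RecoveringAppMasterSketch {
    boolean recovered = false;
    boolean registered = false;

    void recoverDAG() { recovered = true; }        // replay the recovery log
    void registerWithRM() { registered = true; }   // publish host/port to RM
    void unregisterFromRM() { registered = false; }

    void start() {
        try {
            recoverDAG();       // 1. finish recovery first
            registerWithRM();   // 2. only then become visible to clients
        } catch (RuntimeException e) {
            // Guard against the potential issue from the comment:
            // never unregister without having registered first.
            if (registered) {
                unregisterFromRM();
            }
            throw e;
        }
    }
}
```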
[jira] [Commented] (TEZ-2372) TestAMRecovery failing in latest build
[ https://issues.apache.org/jira/browse/TEZ-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514815#comment-14514815 ]

Hitesh Shah commented on TEZ-2372:
----------------------------------

\cc [~zjffdu]

TestAMRecovery failing in latest build
--------------------------------------

                 Key: TEZ-2372
                 URL: https://issues.apache.org/jira/browse/TEZ-2372
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Hitesh Shah

https://builds.apache.org/job/Tez-Build/1018/console

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (TEZ-2372) TestAMRecovery failing in latest build
Hitesh Shah created TEZ-2372:
--------------------------------

             Summary: TestAMRecovery failing in latest build
                 Key: TEZ-2372
                 URL: https://issues.apache.org/jira/browse/TEZ-2372
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Hitesh Shah

https://builds.apache.org/job/Tez-Build/1018/console

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Failed: TEZ-2363 PreCommit Build #550
Jira: https://issues.apache.org/jira/browse/TEZ-2363
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/550/

###
## LAST 60 LINES OF THE CONSOLE ###
[...truncated 2770 lines...]

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12727922/TEZ-2363.1.patch
against master revision 21d4e2d.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:red}-1 javac{color}. The applied patch generated 161 javac compiler warnings (more than the master's current 160 warnings).
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/550//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/550//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/550//console

This message is automatically generated.

==
== Adding comment to Jira.
==
Comment added.
4e398a5babc65b05d4c5d541ee8a9d840188d4b6 logged out
==
== Finished build.
==
Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #549
Archived 45 artifacts
Archive block size is 32768
Received 6 blocks and 2561573 bytes
Compression is 7.1%
Took 0.57 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure

###
## FAILED TESTS (if any) ##
All tests passed
Failed: TEZ-993 PreCommit Build #552
Jira: https://issues.apache.org/jira/browse/TEZ-993
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/552/

###
## LAST 60 LINES OF THE CONSOLE ###
[...truncated 20 lines...]
[PreCommit-TEZ-Build] $ /bin/bash /tmp/hudson5966714899514388262.sh
Running in Jenkins mode
==
== Testing patch for TEZ-993.
==
HEAD is now at 21d4e2d TEZ-2342. TestFaultTolerance.testRandomFailingTasks fails due to timeout. (Jeff Zhang via hitesh)
error: pathspec 'master' did not match any file(s) known to git.
From https://git-wip-us.apache.org/repos/asf/tez
 * branch HEAD -> FETCH_HEAD
Current branch HEAD is up to date.
TEZ-993 patch is being downloaded at Mon Apr 27 19:47:57 UTC 2015 from
http://issues.apache.org/jira/secure/attachment/12695488/TEZ-993-5.patch
The patch does not appear to apply with p0 to p2
PATCH APPLICATION FAILED

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12695488/TEZ-993-5.patch
against master revision 21d4e2d.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/552//console

This message is automatically generated.

==
== Adding comment to Jira.
==
Comment added.
7f80fb7451e0d38bad82b83e672093ce2b7d989d logged out
==
== Finished build.
==
Build step 'Execute shell' marked build as failure
Archiving artifacts
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure

###
## FAILED TESTS (if any) ##
No tests ran.
[jira] [Commented] (TEZ-993) Remove application logic from RecoveryService
[ https://issues.apache.org/jira/browse/TEZ-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514828#comment-14514828 ]

TezQA commented on TEZ-993:
---------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12695488/TEZ-993-5.patch
against master revision 21d4e2d.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/552//console

This message is automatically generated.

Remove application logic from RecoveryService
---------------------------------------------

                 Key: TEZ-993
                 URL: https://issues.apache.org/jira/browse/TEZ-993
             Project: Apache Tez
          Issue Type: Sub-task
            Reporter: Bikas Saha
            Assignee: Jeff Zhang
         Attachments: TEZ-993-3.patch, TEZ-993-4.patch, TEZ-993-5.patch, Tez-993-2.patch, Tez-993.patch

Currently the RecoveryService storage logic knows a lot about the DAG, e.g. which DAG is pre-warm and does not need to be stored, which events need special treatment, etc. This kind of logic couples the DAG and the storage more than is probably necessary and can be a source of complications down the road. The storage should ideally be simply storing a sequence of arbitrary records delimited by a marker.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
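The storage model suggested at the end of the description (a log that just appends arbitrary, marker-delimited records with no DAG knowledge) could be sketched like this. The class and method names here are hypothetical illustrations, not the actual RecoveryService API; length-prefix framing is used as one possible delimiting marker:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch of a DAG-agnostic record log: each record is written as a
// 4-byte length prefix followed by opaque bytes, so the storage layer
// needs no knowledge of which DAG or event type it is persisting.
class RecordLogSketch {
    private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(buf);
    private int count = 0;

    void append(byte[] record) {
        try {
            out.writeInt(record.length); // delimiting marker: length prefix
            out.write(record);           // opaque payload, no interpretation
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot occur for an in-memory buffer
        }
        count++;
    }

    int recordCount() { return count; }
    int byteSize() { return buf.size(); }
}
```

The point of the sketch is the separation of concerns: decisions like "skip pre-warm DAGs" would live in the caller, while the log itself only frames and appends bytes.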
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514826#comment-14514826 ]

TezQA commented on TEZ-1019:
----------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch
against master revision 21d4e2d.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/551//console

This message is automatically generated.

Re-factor routing of events to use common code path for normal and recovery flow.
---------------------------------------------------------------------------------

                 Key: TEZ-1019
                 URL: https://issues.apache.org/jira/browse/TEZ-1019
             Project: Apache Tez
          Issue Type: Sub-task
            Reporter: Hitesh Shah
            Assignee: Jeff Zhang
         Attachments: TEZ-1019-2.patch, TEZ-1019-3.patch, TEZ-1019-4.patch, TEZ-1019-5.patch, Tez-1019.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Failed: TEZ-1019 PreCommit Build #551
Jira: https://issues.apache.org/jira/browse/TEZ-1019
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/551/

###
## LAST 60 LINES OF THE CONSOLE ###
[...truncated 20 lines...]
[PreCommit-TEZ-Build] $ /bin/bash /tmp/hudson7789757465151017088.sh
Running in Jenkins mode
==
== Testing patch for TEZ-1019.
==
HEAD is now at 21d4e2d TEZ-2342. TestFaultTolerance.testRandomFailingTasks fails due to timeout. (Jeff Zhang via hitesh)
error: pathspec 'master' did not match any file(s) known to git.
From https://git-wip-us.apache.org/repos/asf/tez
 * branch HEAD -> FETCH_HEAD
Current branch HEAD is up to date.
TEZ-1019 patch is being downloaded at Mon Apr 27 19:47:50 UTC 2015 from
http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch
The patch does not appear to apply with p0 to p2
PATCH APPLICATION FAILED

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch
against master revision 21d4e2d.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/551//console

This message is automatically generated.

==
== Adding comment to Jira.
==
Comment added.
dcd65c91387ecf0b5d9971fa42f19824ecd6d36b logged out
==
== Finished build.
==
Build step 'Execute shell' marked build as failure
Archiving artifacts
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure

###
## FAILED TESTS (if any) ##
No tests ran.
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514797#comment-14514797 ] Gopal V commented on TEZ-2358: -- [~hitesh]: marking this as blocker for 0.7.x, because it causes task failures for long running jobs. [~rajesh.balamohan]: Patch LGTM - +1 Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task - Key: TEZ-2358 URL: https://issues.apache.org/jira/browse/TEZ-2358 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Gopal V Assignee: Rajesh Balamohan Priority: Blocker Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, TEZ-2358.3.patch, TEZ-2358.4.patch, syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2 The Tez MergeManager code assumes that the src-task-id is unique between merge operations, this results in some confusion when two merge sequences have to process output from the same src-task-id. {code} private TezRawKeyValueIterator finalMerge(Configuration job, FileSystem fs, ListMapOutput inMemoryMapOutputs, ListFileChunk onDiskMapOutputs ... if (inMemoryMapOutputs.size() 0) { int srcTaskId = inMemoryMapOutputs.get(0).getAttemptIdentifier().getInputIdentifier().getInputIndex(); ... // must spill to disk, but can't retain in-mem for intermediate merge final Path outputPath = mapOutputFile.getInputFileForWrite(srcTaskId, inMemToDiskBytes).suffix( Constants.MERGED_OUTPUT_PREFIX); ... {code} This or some scenario related to this, results in the following FileChunks list which contains identical named paths with different lengths. {code} 2015-04-23 03:28:50,983 INFO [MemtoDiskMerger [Map_1]] orderedgrouped.MergeManager: Initiating in-memory merge with 6 segments... 
2015-04-23 03:28:50,987 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Merging 6 sorted segments 2015-04-23 03:28:50,988 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Down to the last merge-pass, with 6 segments left of total size: 1165944755 bytes 2015-04-23 03:28:58,495 INFO [MemtoDiskMerger [Map_1]] orderedgrouped.MergeManager: attempt_1429683757595_0141_1_01_000143_0_10027 Merge of the 6 files in-memory complete. Local file is /grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out.merged of size 785583965 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: finalMerge called with 0 in-memory map-outputs and 5 on-disk map-outputs 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 365232290 += 365232290for/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_1023.out 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 730529899 += 365297609for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 1095828683 += 365298784for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out {code} The multiple instances of 404.out indicates that we pulled two pipelined chunks of the same shuffle src id, once into memory and twice onto disk. 
{code}
2015-04-23 03:28:08,256 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: attempt_1429683757595_0141_1_00_000404_0_10009_0, runDuration: 0]
2015-04-23 03:28:08,270 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: attempt_1429683757595_0141_1_00_000404_0_10009_1, runDuration: 0]
2015-04-23 03:28:08,272 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host:
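The naming collision above can be reproduced in miniature: when a merged-output path is derived from srcTaskId alone, every merge pass over the same source task maps to the same file name, so two merges of pipelined chunks from one source task step on each other. The sketch below is illustrative only; MergePathDemo, pathBySrcTaskId, and the merge-sequence suffix are assumed names, not the Tez API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch (not the Tez MergeManager): shows why naming a merged
// output by srcTaskId alone collides when pipelined shuffle delivers several
// spill chunks for the same source task.
public class MergePathDemo {

    // Mirrors the shape of mapOutputFile.getInputFileForWrite(srcTaskId, ...):
    // the resulting name depends only on the source task id.
    static String pathBySrcTaskId(int srcTaskId) {
        return "spill_" + srcTaskId + ".out.merged";
    }

    // A uniquified variant: fold a per-merge sequence number into the name so
    // two merge passes over the same source task never share a file.
    static String pathBySrcTaskIdAndSeq(int srcTaskId, int mergeSeq) {
        return "spill_" + srcTaskId + "_" + mergeSeq + ".out.merged";
    }

    public static void main(String[] args) {
        int srcTaskId = 404; // same source task, three pipelined merge passes
        Set<String> bySrcOnly = new HashSet<>();
        Set<String> withSeq = new HashSet<>();
        for (int mergeSeq = 0; mergeSeq < 3; mergeSeq++) {
            bySrcOnly.add(pathBySrcTaskId(srcTaskId));
            withSeq.add(pathBySrcTaskIdAndSeq(srcTaskId, mergeSeq));
        }
        // Three merges collapse to one name: later merges reuse the same path.
        System.out.println("by srcTaskId only: " + bySrcOnly.size() + " distinct name(s)");
        // With a merge sequence folded in, every merge gets its own file.
        System.out.println("with merge seq:    " + withSeq.size() + " distinct name(s)");
    }
}
```

This also motivates the Preconditions check in the patch: once identical paths with different lengths can coexist in onDiskMapOutputs, membership has to be decided on (path, offset, length), not path alone.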
Failed: TEZ-2259 PreCommit Build #547
Jira: https://issues.apache.org/jira/browse/TEZ-2259 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/547/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2770 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728478/TEZ-2259.4.patch against master revision 21d4e2d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in : org.apache.tez.test.Tests org.apache.tez.test.TestTests org.apache.tez.teTests org.apache.tez.test.TestDAGRecovery Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/547//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/547//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 11903cfaab5ad5323a5b4f7d0c03e59963a8e89e logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #543 Archived 44 artifacts Archive block size is 32768 Received 2 blocks and 2705121 bytes Compression is 2.4% Took 2.2 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
Success: TEZ-2325 PreCommit Build #549
Jira: https://issues.apache.org/jira/browse/TEZ-2325 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/549/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2770 lines...] [INFO] Final Memory: 77M/1181M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728394/TEZ-2325.4.patch against master revision 21d4e2d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/549//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/549//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 9bb668dafec1a33e18412c90b0a387f135d4eec6 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #548 Archived 44 artifacts Archive block size is 32768 Received 0 blocks and 2752759 bytes Compression is 0.0% Took 0.59 sec Description set: TEZ-2325 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2325) Route status update event directly to the attempt
[ https://issues.apache.org/jira/browse/TEZ-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514914#comment-14514914 ] TezQA commented on TEZ-2325: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728394/TEZ-2325.4.patch against master revision 21d4e2d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/549//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/549//console This message is automatically generated. Route status update event directly to the attempt -- Key: TEZ-2325 URL: https://issues.apache.org/jira/browse/TEZ-2325 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Prakash Ramachandran Attachments: TEZ-2325.1.patch, TEZ-2325.2.patch, TEZ-2325.3.patch, TEZ-2325.4.patch Today, all events from the attempt heartbeat are routed to the vertex. then the vertex routes (if any) status update events to the attempt. This is unnecessary and potentially creates out of order scenarios. We could route the status update events directly to attempts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2369) Add a few unit tests for RootInputInitializerManager
[ https://issues.apache.org/jira/browse/TEZ-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514926#comment-14514926 ] TezQA commented on TEZ-2369: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728484/TEZ-2369.2.txt against master revision 21d4e2d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/553//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/553//console This message is automatically generated. Add a few unit tests for RootInputInitializerManager Key: TEZ-2369 URL: https://issues.apache.org/jira/browse/TEZ-2369 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: TEZ-2369.1.txt, TEZ-2369.2.txt {code} - Integer successfulAttempt = vertexSuccessfulAttemptMap.get(taskId); + Integer successfulAttempt = vertexSuccessfulAttemptMap.get(taskId.getId()); {code} This could cause events to be sent multiple times. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
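The one-character diff quoted above is a key-type mismatch: if the map is keyed by the task's int id, a lookup with the task object itself can never match, so the "already succeeded" check always misses and the event is re-sent. A minimal sketch of the failure mode, with a hypothetical TaskId wrapper standing in for the real class:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the TEZ-2369 snippet's bug: a lookup by the wrong
// key type silently returns null instead of the stored attempt number.
public class KeyMismatchDemo {
    static class TaskId {
        final int id;
        TaskId(int id) { this.id = id; }
        int getId() { return id; }
        // No equals()/hashCode() overrides: identity semantics, so even a
        // TaskId-keyed map would miss on a freshly constructed wrapper.
    }

    public static void main(String[] args) {
        Map<Object, Integer> successfulAttempt = new HashMap<>();
        successfulAttempt.put(7, 0); // stored under the boxed int id

        TaskId taskId = new TaskId(7);
        // Lookup by the object: wrong key type, always null.
        System.out.println(successfulAttempt.get(taskId));
        // Lookup by the id, as in the fix: finds the recorded attempt.
        System.out.println(successfulAttempt.get(taskId.getId()));
    }
}
```

A null here reads as "no successful attempt recorded yet", which is exactly how the same event could be dispatched multiple times.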
[jira] [Updated] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-2358: - Priority: Blocker (was: Major)
[jira] [Commented] (TEZ-2259) Push additional data to Timeline for Recovery for better consumption in UI
[ https://issues.apache.org/jira/browse/TEZ-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514964#comment-14514964 ] Hitesh Shah commented on TEZ-2259: -- Not sure what is off with the build. Console logs show a successful run: {code} INFO] tez ... SUCCESS [ 1.048 s] [INFO] tez-api ... SUCCESS [ 39.822 s] [INFO] tez-common SUCCESS [ 3.114 s] [INFO] tez-runtime-internals . SUCCESS [ 10.713 s] [INFO] tez-runtime-library ... SUCCESS [01:13 min] [INFO] tez-mapreduce . SUCCESS [ 25.778 s] [INFO] tez-examples .. SUCCESS [ 0.330 s] [INFO] tez-dag ... SUCCESS [02:04 min] [INFO] tez-tests . SUCCESS [24:57 min] [INFO] tez-ui SUCCESS [ 14.317 s] [INFO] tez-plugins ... SUCCESS [ 0.033 s] [INFO] tez-yarn-timeline-history . SUCCESS [ 54.781 s] [INFO] tez-yarn-timeline-history-with-acls ... SUCCESS [01:03 min] [INFO] tez-mbeans-resource-calculator SUCCESS [ 0.970 s] [INFO] tez-dist .. SUCCESS [ 12.728 s] [INFO] Tez ... SUCCESS [ 0.032 s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 32:04 min [INFO] Finished at: 2015-04-27T20:13:33+00:00 [INFO] Final Memory: 68M/814M [INFO] {code} {code} [INFO] --- maven-surefire-plugin:2.14.1:test (default-test) @ tez-tests --- [INFO] Surefire report directory: /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build@2/tez-tests/target/surefire-reports --- T E S T S --- --- T E S TTests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 209.311 sec Running org.apache.tez.test.Tests run: 2, FaTests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 91.521 sec Running org.apache.tez.test.TestTests run: 22, FailurTests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 143.722 sec Running org.apache.tez.teTests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 79.795 sec Running org.apache.tez.test.TestAMRecovery Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 267.144 sec Running org.apache.tez.test.TestTezJobs Tests run: 15, Failures: 0, Errors: 0, 
Skipped: 0, Time elapsed: 211.42 sec Running org.apache.tez.test.TestDAGRecovery ests run: 77, Failures: 0, Errors: 0, Skipped: 0 {code} - some munging of the output seems to exist. Ran tests locally and confirmed no failures. Push additional data to Timeline for Recovery for better consumption in UI -- Key: TEZ-2259 URL: https://issues.apache.org/jira/browse/TEZ-2259 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-2259.1.patch, TEZ-2259.2.patch, TEZ-2259.3.patch, TEZ-2259.4.patch Some things I can think of: - applicationAttemptId in which the dag was submitted - appAttemptId in which the dag was completed Above provides implicit information on how many app attempts the dag spanned ( and therefore recovered how many times ). - Maybe an implicit event mentioning that the DAG was recovered and in which attempt it was recovered. Possibly add information on what state was recovered? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2259) Push additional data to Timeline for Recovery for better consumption in UI
[ https://issues.apache.org/jira/browse/TEZ-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2259: - Attachment: TEZ-2259.branch-06.patch branch 0.6 patch as master patch conflicts.
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14515040#comment-14515040 ] Bikas Saha commented on TEZ-391: [~zjffdu] Can you make a call on whether this is for 0.7.0 or not? IMO, if this was close to being done then perhaps yes. SharedEdge - Support for passing same output from a vertex as input to two different vertices - Key: TEZ-391 URL: https://issues.apache.org/jira/browse/TEZ-391 Project: Apache Tez Issue Type: Sub-task Reporter: Rohini Palaniswamy Assignee: Jeff Zhang Attachments: Shared Edge Design.pdf, TEZ-391-WIP-1.patch, TEZ-391-WIP-2.patch, TEZ-391-WIP-3.patch, TEZ-391-WIP-4.patch, TEZ-391-WIP-5.patch, TEZ-391-WIP-6.patch We need this for lot of usecases. For cases where multi-query is turned off and for optimizing unions. Currently those are BROADCAST or ONE-ONE edges and we write the output multiple times. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-2369 PreCommit Build #553
Jira: https://issues.apache.org/jira/browse/TEZ-2369 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/553/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2768 lines...] [INFO] Final Memory: 69M/917M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728484/TEZ-2369.2.txt against master revision 21d4e2d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/553//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/553//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 9cc787fd89de51b13fc38b77319d6f7994b14111 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #549 Archived 44 artifacts Archive block size is 32768 Received 6 blocks and 2586172 bytes Compression is 7.1% Took 0.6 sec Description set: TEZ-2369 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Updated] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-2358: -- Attachment: TEZ-2358.4.patch Sure, addressing review comments in the latest patch.
Failed: TEZ-1752 PreCommit Build #539
Jira: https://issues.apache.org/jira/browse/TEZ-1752 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/539/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 1991 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728349/TEZ-1752.2.patch against master revision 2935ef4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.runtime.task.TestTaskExecution Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/539//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/539//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 5b797557a09dfb5b9d9eb69d330cf57cdea75535 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #537 Archived 44 artifacts Archive block size is 32768 Received 6 blocks and 2493908 bytes Compression is 7.3% Took 1.4 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## 2 tests failed. 
REGRESSION: org.apache.tez.runtime.task.TestTaskExecution.testHeartbeatShouldDie Error Message: test timed out after 5000 milliseconds Stack Trace: java.lang.Exception: test timed out after 5000 milliseconds at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425) at java.util.concurrent.FutureTask.get(FutureTask.java:187) at org.apache.tez.runtime.task.TestTaskExecution.testHeartbeatShouldDie(TestTaskExecution.java:317) REGRESSION: org.apache.tez.runtime.task.TestTaskExecution.testHeartbeatException Error Message: test timed out after 5000 milliseconds Stack Trace: java.lang.Exception: test timed out after 5000 milliseconds at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425) at java.util.concurrent.FutureTask.get(FutureTask.java:187) at org.apache.tez.runtime.task.TestTaskExecution.testHeartbeatException(TestTaskExecution.java:278)
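Both regressions park inside FutureTask.get until the JUnit-level 5-second timeout kills the test thread. One hedge, sketched below with assumed names (BoundedGetDemo and awaitBounded are not the Tez test code), is to bound the get itself so a hung task surfaces as a TimeoutException quickly and can be interrupted:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedGetDemo {
    // Wait for a task result only up to the given bound; a hung task surfaces
    // as "timed out" (and is interrupted) instead of parking indefinitely.
    static String awaitBounded(Future<String> f, long millis) {
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true); // interrupt the hung task
            return "timed out";
        } catch (Exception e) {
            return "failed: " + e;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Simulate a task that never responds, like the hung heartbeat above.
        Future<String> hung = pool.submit(() -> {
            Thread.sleep(60_000);
            return "done";
        });
        System.out.println(awaitBounded(hung, 200));
        pool.shutdownNow();
    }
}
```

With a bounded get, the failing assertion points at the hung task itself rather than at a generic "test timed out after 5000 milliseconds".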
[jira] [Commented] (TEZ-1752) Inputs / Outputs in the Runtime library should be interruptable
[ https://issues.apache.org/jira/browse/TEZ-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513760#comment-14513760 ] TezQA commented on TEZ-1752: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728349/TEZ-1752.2.patch against master revision 2935ef4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.runtime.task.TestTaskExecution Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/539//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/539//console This message is automatically generated. Inputs / Outputs in the Runtime library should be interruptable --- Key: TEZ-1752 URL: https://issues.apache.org/jira/browse/TEZ-1752 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Attachments: TEZ-1752.1.patch, TEZ-1752.2.patch Not possible to preempt tasks without killing containers without this. There's still the problem of Processors not supporting interrupts. We may need API enhancements to query IPOs on whether they are interruptible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2342) TestFaultTolerance.testRandomFailingTasks fails due to timeout
[ https://issues.apache.org/jira/browse/TEZ-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513676#comment-14513676 ] Jeff Zhang commented on TEZ-2342: - [~bikassaha] No other issue after running many times, and check the logs on the windows jenkins server, it is failed due to timeout. TestFaultTolerance.testRandomFailingTasks fails due to timeout -- Key: TEZ-2342 URL: https://issues.apache.org/jira/browse/TEZ-2342 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Minor Attachments: TEZ-2342-1.patch, syslog_dag_1429582868137_0001_1 {code} Error Message test timed out after 12 milliseconds Stacktrace java.lang.Exception: test timed out after 12 milliseconds at java.lang.Thread.sleep(Native Method) at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:126) at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:114) at org.apache.tez.test.TestFaultTolerance.testRandomFailingTasks(TestFaultTolerance.java:723) Standard Output 2015-04-17 07:46:10,952 INFO [main] test.TestFaultTolerance (TestFaultTolerance.java:setup(65)) - Starting mini clusters 2015-04-17 07:46:11,508 INFO [main] hdfs.MiniDFSCluster (MiniDFSCluster.java:init(446)) - starting cluster: numNameNodes=1, numDataNodes=1 Formatting using clusterid: testClusterID 2015-04-17 07:46:12,919 INFO [main] namenode.FSNamesystem (FSNamesystem.java:init(716)) - No KeyProvider found. 2015-04-17 07:46:12,920 INFO [main] namenode.FSNamesystem (FSNamesystem.java:init(726)) - fsLock is fair:true 2015-04-17 07:46:13,021 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1173)) - hadoop.configured.node.mapping is deprecated. 
[jira] [Updated] (TEZ-1752) Inputs / Outputs in the Runtime library should be interruptable
[ https://issues.apache.org/jira/browse/TEZ-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1752: -- Attachment: TEZ-1752.2.patch In LogicalIOProcessorRuntimeTask - it would be useful to log the interrupt status between each close invocation, and potentially set it if the I/O/P being closed ends up unsetting it. - Done Cleanup would behave differently if initialize hasn't been invoked. We may need to track which I/O/Ps have been initialized - and close just those. - In the MergeManager, InterruptedException thrown by MergeThread.close likely needs to be handled (otherwise it'll end up skipping cleanup?) - Done In the invocation of finalMerge - an IOException is caught; are there specific cases here where this IOException is actually masking an interrupt? (and as a result the interrupt status needs to be set) - Done The TezMerger change - should we just change the interface to throw InterruptedException, instead of setting the flag? That's a private method, and will force consumers within the IOs to handle it. - Modified TezMerger to throw InterruptedException UnorderedPartitionedKVWriter / others - in the close method, instead of returning an empty event list - should this just throw an InterruptedException back? - Done Is the change in the TaskReporter required? taskFailed shouldn't be invoked after the currentTask has been unregistered. - No, added that since spurious logs (NPE) were coming up which made it difficult to debug. Master already has the fix for it. Removed the changes in the patch. We likely need to ensure that cleanup / close methods aren't called twice - once during regular cleanup, and a second time during an interrupt while the cleanup is in progress. - Tracking the close() of each IPO. This takes care of not making the call twice.
Not directly related to interrupts - but an invocation of Task.close() (regular flow) can cause exceptions during Processor close or Input / Output close - which would prevent subsequent Inputs / Outputs from being closed. Do we need to make sure that close() gets invoked on subsequent Inputs / Outputs despite a prior exception? - Yes, this is needed. Tracking the IPO close() and task.cleanup() in the patch takes care of this. Inputs / Outputs in the Runtime library should be interruptable --- Key: TEZ-1752 URL: https://issues.apache.org/jira/browse/TEZ-1752 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Attachments: TEZ-1752.1.patch, TEZ-1752.2.patch Not possible to preempt tasks without killing containers without this. There's still the problem of Processors not supporting interrupts. We may need API enhancements to query IPOs on whether they are interruptible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
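The close-tracking idea discussed in this review - close each I/O/P at most once, and keep closing the remaining ones even when an earlier close throws - can be sketched as below. This is an illustrative sketch, not the actual Tez implementation: `AutoCloseable` stands in for the Input/Processor/Output interfaces, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CloseTracker {
  // Remembers which I/O/Ps have already been closed, so a cleanup pass that
  // runs during an interrupt does not re-close what regular cleanup handled.
  private final Set<AutoCloseable> closed = new LinkedHashSet<>();

  // Close each element exactly once; a failing close must not prevent the
  // remaining elements from being closed, so failures are collected instead
  // of propagated.
  public List<Exception> closeAll(List<? extends AutoCloseable> ipos) {
    List<Exception> failures = new ArrayList<>();
    for (AutoCloseable ipo : ipos) {
      if (!closed.add(ipo)) {
        continue; // already closed by an earlier cleanup pass
      }
      try {
        ipo.close();
      } catch (Exception e) {
        failures.add(e); // record and keep going
      }
    }
    return failures;
  }
}
```

Calling `closeAll` a second time (e.g. from an interrupt-triggered cleanup) is a no-op for everything already closed, which is the double-close guarantee the patch discussion asks for.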
[jira] [Commented] (TEZ-2226) Disable writing history to timeline if domain creation fails.
[ https://issues.apache.org/jira/browse/TEZ-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513602#comment-14513602 ] Jeff Zhang commented on TEZ-2226: - I saw that TEZ_DAG_HISTORY_LOGGING is set in the dag's configuration. So it should be possible to restore this value when recovering. [~lichangleo] I think you need to update the following code in RecoveryParser.java when recovering from DAGSubmittedEvent. (Also update the skippedDAGs of ATSHistoryLoggingService in this place) {code} case DAG_SUBMITTED: { DAGSubmittedEvent submittedEvent = (DAGSubmittedEvent) event; LOG.info("Recovering from event" + ", eventType=" + eventType + ", event=" + event.toString()); recoveredDAGData.recoveredDAG = dagAppMaster.createDAG(submittedEvent.getDAGPlan(), lastInProgressDAG); recoveredDAGData.cumulativeAdditionalResources = submittedEvent .getCumulativeAdditionalLocalResources(); recoveredDAGData.recoveredDagID = recoveredDAGData.recoveredDAG.getID(); dagAppMaster.setCurrentDAG(recoveredDAGData.recoveredDAG); if (recoveredDAGData.nonRecoverable) { skipAllOtherEvents = true; } break; {code} BTW there's no apache header for HistoryACLPolicyException.java Disable writing history to timeline if domain creation fails. - Key: TEZ-2226 URL: https://issues.apache.org/jira/browse/TEZ-2226 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Chang Li Priority: Blocker Attachments: TEZ-2226.10.patch, TEZ-2226.2.patch, TEZ-2226.3.patch, TEZ-2226.4.patch, TEZ-2226.5.patch, TEZ-2226.6.patch, TEZ-2226.7.patch, TEZ-2226.8.patch, TEZ-2226.9.patch, TEZ-2226.patch, TEZ-2226.wip.2.patch, TEZ-2226.wip.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
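The point above - that TEZ_DAG_HISTORY_LOGGING travels in the DAG's own configuration - suggests the recovery path can simply re-read it from the recovered DAG rather than assume a default. A minimal sketch of that decision, with a plain Map standing in for the Hadoop Configuration; the key name and method names here are assumptions for illustration, not the Tez API.

```java
import java.util.Map;

public class RecoveredHistoryLogging {
  // Assumed key name for illustration; the real constant lives in TezConfiguration.
  static final String DAG_HISTORY_LOGGING = "tez.dag.history.logging.enabled";

  // After recovering a DAGSubmittedEvent, decide whether the history logger
  // should skip this DAG by consulting the recovered DAG's own configuration
  // instead of the AM-wide default.
  public static boolean shouldSkipHistory(Map<String, String> recoveredDagConf) {
    String value = recoveredDagConf.getOrDefault(DAG_HISTORY_LOGGING, "true");
    return !Boolean.parseBoolean(value);
  }
}
```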
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513650#comment-14513650 ] Gopal V commented on TEZ-2358: -- The patch was again tested at 10TB scale over the weekend - there seem to be no collisions in naming. I looked at the logs and noticed that some earlier tasks did succeed with the duplicate naming, due to the fact that there were only a few spills, resulting in them being split between the disks and not colliding in paths. But for the sake of preventing future breakage, it would help to have an error triggered when someone violates the no-duplicate rule for onDiskMapOutputs (i.e. no two file chunks for merging can start at the same offset of the same file). My original pre-conditions were wrong: when auto-reduce parallelism kicks in, we want to merge off a DISK_DIRECT input across two reducers, which would be different index points into the same DISK_DIRECT file. Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task - Key: TEZ-2358 URL: https://issues.apache.org/jira/browse/TEZ-2358 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Gopal V Assignee: Rajesh Balamohan Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2 The Tez MergeManager code assumes that the src-task-id is unique between merge operations; this results in some confusion when two merge sequences have to process output from the same src-task-id. {code} private TezRawKeyValueIterator finalMerge(Configuration job, FileSystem fs, List<MapOutput> inMemoryMapOutputs, List<FileChunk> onDiskMapOutputs ... if (inMemoryMapOutputs.size() > 0) { int srcTaskId = inMemoryMapOutputs.get(0).getAttemptIdentifier().getInputIdentifier().getInputIndex(); ... 
// must spill to disk, but can't retain in-mem for intermediate merge final Path outputPath = mapOutputFile.getInputFileForWrite(srcTaskId, inMemToDiskBytes).suffix( Constants.MERGED_OUTPUT_PREFIX); ... {code} This, or some scenario related to this, results in the following FileChunks list, which contains identically named paths with different lengths. {code} 2015-04-23 03:28:50,983 INFO [MemtoDiskMerger [Map_1]] orderedgrouped.MergeManager: Initiating in-memory merge with 6 segments... 2015-04-23 03:28:50,987 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Merging 6 sorted segments 2015-04-23 03:28:50,988 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Down to the last merge-pass, with 6 segments left of total size: 1165944755 bytes 2015-04-23 03:28:58,495 INFO [MemtoDiskMerger [Map_1]] orderedgrouped.MergeManager: attempt_1429683757595_0141_1_01_000143_0_10027 Merge of the 6 files in-memory complete. Local file is /grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out.merged of size 785583965 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: finalMerge called with 0 in-memory map-outputs and 5 on-disk map-outputs 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 365232290 += 365232290for/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_1023.out 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 730529899 += 365297609for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 1095828683 += 
365298784for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out {code} The multiple instances of 404.out indicate that we pulled two pipelined chunks of the same shuffle src id, once into memory and twice onto disk. {code} 2015-04-23 03:28:08,256 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: attempt_1429683757595_0141_1_00_000404_0_10009_0, runDuration: 0]
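The no-duplicate rule Gopal describes - chunks may share a file path (distinct index points into one DISK_DIRECT file are legal under auto-reduce parallelism), but no two chunks may start at the same offset of the same file - could be checked along these lines. `Chunk` is a stand-in for Tez's FileChunk; this is a hedged sketch, not the actual patch.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ChunkGuard {
  // Stand-in for Tez's FileChunk: a file path plus a (start offset, length) window.
  static final class Chunk {
    final String path;
    final long offset;
    final long length;
    Chunk(String path, long offset, long length) {
      this.path = path;
      this.offset = offset;
      this.length = length;
    }
  }

  // True iff two chunks start at the same offset of the same file - the
  // condition a precondition check in closeOnDiskFile() should reject.
  // Sharing a path alone is legal, so only the (path, offset) pair is keyed.
  static boolean hasDuplicateStart(List<Chunk> onDiskChunks) {
    Set<String> seen = new HashSet<>();
    for (Chunk c : onDiskChunks) {
      if (!seen.add(c.path + "@" + c.offset)) {
        return true;
      }
    }
    return false;
  }
}
```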
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513742#comment-14513742 ] Gopal V commented on TEZ-2358: -- [~rajesh.balamohan]: minor nit on the checkArgument (not-or-not) pattern - it gets complex to add another condition later. Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task - Key: TEZ-2358 URL: https://issues.apache.org/jira/browse/TEZ-2358 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Gopal V Assignee: Rajesh Balamohan Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, TEZ-2358.3.patch, syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2 The Tez MergeManager code assumes that the src-task-id is unique between merge operations; this results in some confusion when two merge sequences have to process output from the same src-task-id. {code} private TezRawKeyValueIterator finalMerge(Configuration job, FileSystem fs, List<MapOutput> inMemoryMapOutputs, List<FileChunk> onDiskMapOutputs ... if (inMemoryMapOutputs.size() > 0) { int srcTaskId = inMemoryMapOutputs.get(0).getAttemptIdentifier().getInputIdentifier().getInputIndex(); ... // must spill to disk, but can't retain in-mem for intermediate merge final Path outputPath = mapOutputFile.getInputFileForWrite(srcTaskId, inMemToDiskBytes).suffix( Constants.MERGED_OUTPUT_PREFIX); ... {code} This, or some scenario related to this, results in the following FileChunks list, which contains identically named paths with different lengths. {code} 2015-04-23 03:28:50,983 INFO [MemtoDiskMerger [Map_1]] orderedgrouped.MergeManager: Initiating in-memory merge with 6 segments... 
2015-04-23 03:28:50,987 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Merging 6 sorted segments 2015-04-23 03:28:50,988 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Down to the last merge-pass, with 6 segments left of total size: 1165944755 bytes 2015-04-23 03:28:58,495 INFO [MemtoDiskMerger [Map_1]] orderedgrouped.MergeManager: attempt_1429683757595_0141_1_01_000143_0_10027 Merge of the 6 files in-memory complete. Local file is /grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out.merged of size 785583965 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: finalMerge called with 0 in-memory map-outputs and 5 on-disk map-outputs 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 365232290 += 365232290for/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_1023.out 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 730529899 += 365297609for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 1095828683 += 365298784for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out {code} The multiple instances of 404.out indicate that we pulled two pipelined chunks of the same shuffle src id, once into memory and twice onto disk. 
{code} 2015-04-23 03:28:08,256 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: attempt_1429683757595_0141_1_00_000404_0_10009_0, runDuration: 0] 2015-04-23 03:28:08,270 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: attempt_1429683757595_0141_1_00_000404_0_10009_1, runDuration: 0] 2015-04-23 03:28:08,272 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent:
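On the checkArgument (not-or-not) nit raised above: negated compound preconditions get harder to read and extend with each new clause. One style that stays readable is to build the positive condition first, one clause per line, and assert it once. This is a hypothetical sketch of that style; `checkArgument` is emulated locally (rather than using Guava) so the snippet stays self-contained, and the field names are illustrative.

```java
public class GuardStyle {
  // Local stand-in for Guava's Preconditions.checkArgument.
  static void checkArgument(boolean ok, String msg) {
    if (!ok) throw new IllegalArgumentException(msg);
  }

  // Positive form: state what a valid chunk looks like. Adding a new
  // condition later is a one-line change, with no nested negations to untangle.
  public static void validateChunk(String path, long offset, long length) {
    boolean valid = path != null
        && offset >= 0
        && length > 0;
    checkArgument(valid, "invalid chunk: " + path + "@" + offset + "+" + length);
  }
}
```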
[jira] [Updated] (TEZ-1752) Inputs / Outputs in the Runtime library should be interruptable
[ https://issues.apache.org/jira/browse/TEZ-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1752: -- Attachment: TEZ-1752.3.patch Inputs / Outputs in the Runtime library should be interruptable --- Key: TEZ-1752 URL: https://issues.apache.org/jira/browse/TEZ-1752 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Attachments: TEZ-1752.1.patch, TEZ-1752.2.patch, TEZ-1752.3.patch Not possible to preempt tasks without killing containers without this. There's still the problem of Processors not supporting interrupts. We may need API enhancements to query IPOs on whether they are interruptible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2360 PreCommit Build #544
Jira: https://issues.apache.org/jira/browse/TEZ-2360 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/544/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2768 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728386/TEZ-2360.1.patch against master revision 2935ef4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/544//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/544//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-internals.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/544//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. fb6f12d03e3f1a432f3d77442a57fcf1482f7f7d logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #543 Archived 44 artifacts Archive block size is 32768 Received 4 blocks and 2626408 bytes Compression is 4.8% Took 0.78 sec [description-setter] Could not determine description. 
Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2325) Route status update event directly to the attempt
[ https://issues.apache.org/jira/browse/TEZ-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514125#comment-14514125 ] TezQA commented on TEZ-2325: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728394/TEZ-2325.4.patch against master revision 2935ef4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.test.TestFaultTolerance Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/545//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/545//console This message is automatically generated. Route status update event directly to the attempt -- Key: TEZ-2325 URL: https://issues.apache.org/jira/browse/TEZ-2325 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Prakash Ramachandran Attachments: TEZ-2325.1.patch, TEZ-2325.2.patch, TEZ-2325.3.patch, TEZ-2325.4.patch Today, all events from the attempt heartbeat are routed to the vertex. Then the vertex routes status update events (if any) to the attempt. This is unnecessary and potentially creates out-of-order scenarios. We could route the status update events directly to attempts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
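The routing change TEZ-2325 proposes - deliver a status update from the heartbeat handler straight to the attempt instead of via the vertex - amounts to removing one dispatch hop, which also removes the window in which events could be reordered. A toy sketch with hypothetical types (this is not the Tez event machinery):

```java
import java.util.ArrayList;
import java.util.List;

public class StatusRouter {
  // Stand-in for a task attempt that consumes its own status updates.
  interface Attempt {
    void onStatusUpdate(String payload);
  }

  static class RecordingAttempt implements Attempt {
    final List<String> seen = new ArrayList<>();
    public void onStatusUpdate(String p) { seen.add(p); }
  }

  // Direct route: heartbeat handler -> attempt, with no vertex in between,
  // so updates arrive in the order the handler receives them.
  public static void route(Attempt attempt, String statusPayload) {
    attempt.onStatusUpdate(statusPayload);
  }
}
```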
Success: TEZ-1752 PreCommit Build #543
Jira: https://issues.apache.org/jira/browse/TEZ-1752 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/543/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2785 lines...] [INFO] Final Memory: 72M/888M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728382/TEZ-1752.3.patch against master revision 2935ef4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/543//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/543//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. f834cc9c90f68fafaf5c4cf27d0ae42da5c03d06 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #542 Archived 44 artifacts Archive block size is 32768 Received 0 blocks and 2750825 bytes Compression is 0.0% Took 0.6 sec Description set: TEZ-1752 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2360) per-io counters flag should generate both overall and per-edge counters
[ https://issues.apache.org/jira/browse/TEZ-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514104#comment-14514104 ] TezQA commented on TEZ-2360: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728386/TEZ-2360.1.patch against master revision 2935ef4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/544//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/544//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-internals.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/544//console This message is automatically generated. per-io counters flag should generate both overall and per-edge counters Key: TEZ-2360 URL: https://issues.apache.org/jira/browse/TEZ-2360 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Prakash Ramachandran Attachments: TEZ-2360.1.patch Currently, the per-io flag disables overall per task counters and retains only per edge counters. It would be useful to have both overall and per edge counters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2371) Upgrade hive branch to latest Tez
Gopal V created TEZ-2371: Summary: Upgrade hive branch to latest Tez Key: TEZ-2371 URL: https://issues.apache.org/jira/browse/TEZ-2371 Project: Apache Tez Issue Type: Bug Reporter: Gopal V Assignee: Gopal V Upgrade hive to the upcoming tez-0.7 release -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2325) Route status update event directly to the attempt
[ https://issues.apache.org/jira/browse/TEZ-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-2325: -- Attachment: TEZ-2325.4.patch Route status update event directly to the attempt -- Key: TEZ-2325 URL: https://issues.apache.org/jira/browse/TEZ-2325 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Prakash Ramachandran Attachments: TEZ-2325.1.patch, TEZ-2325.2.patch, TEZ-2325.3.patch, TEZ-2325.4.patch Today, all events from the attempt heartbeat are routed to the vertex. Then the vertex routes status update events (if any) to the attempt. This is unnecessary and potentially creates out-of-order scenarios. We could route the status update events directly to attempts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-2303 PreCommit Build #542
Jira: https://issues.apache.org/jira/browse/TEZ-2303 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/542/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2776 lines...] [INFO] Final Memory: 69M/924M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728374/TEZ-2303-4.patch against master revision 2935ef4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/542//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/542//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 6f0cdddef6804ccd72a9b7336bfc0ab1be9c0ab0 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #541 Archived 44 artifacts Archive block size is 32768 Received 2 blocks and 2754872 bytes Compression is 2.3% Took 1.5 sec Description set: TEZ-2303 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2303) ConcurrentModificationException while processing recovery
[ https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514082#comment-14514082 ] TezQA commented on TEZ-2303: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728374/TEZ-2303-4.patch against master revision 2935ef4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/542//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/542//console This message is automatically generated. ConcurrentModificationException while processing recovery - Key: TEZ-2303 URL: https://issues.apache.org/jira/browse/TEZ-2303 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jason Lowe Assignee: Jeff Zhang Attachments: TEZ-2303-1.patch, TEZ-2303-2.patch, TEZ-2303-4.patch Saw a Tez AM log a few ConcurrentModificationException messages while trying to recover from a previous attempt that crashed. Exception details to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1752) Inputs / Outputs in the Runtime library should be interruptable
[ https://issues.apache.org/jira/browse/TEZ-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514089#comment-14514089 ] TezQA commented on TEZ-1752: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728382/TEZ-1752.3.patch against master revision 2935ef4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/543//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/543//console This message is automatically generated. Inputs / Outputs in the Runtime library should be interruptable --- Key: TEZ-1752 URL: https://issues.apache.org/jira/browse/TEZ-1752 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Attachments: TEZ-1752.1.patch, TEZ-1752.2.patch, TEZ-1752.3.patch Not possible to preempt tasks without killing containers without this. There's still the problem of Processors not supporting interrupts. We may need API enhancements to query IPOs on whether they are interruptible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2360) per-io counters flag should generate both overall and per-edge counters
[ https://issues.apache.org/jira/browse/TEZ-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-2360: -- Attachment: TEZ-2360.1.patch per-io counters flag should generate both overall and per-edge counters Key: TEZ-2360 URL: https://issues.apache.org/jira/browse/TEZ-2360 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Prakash Ramachandran Attachments: TEZ-2360.1.patch Currently, the per-io flag disables overall per task counters and retains only per edge counters. It would be useful to have both overall and per edge counters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2360) per-io counters flag should generate both overall and per-edge counters
[ https://issues.apache.org/jira/browse/TEZ-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-2360: -- Attachment: TEZ-2360.2.patch Fixed findbug warnings. per-io counters flag should generate both overall and per-edge counters Key: TEZ-2360 URL: https://issues.apache.org/jira/browse/TEZ-2360 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Prakash Ramachandran Attachments: TEZ-2360.1.patch, TEZ-2360.2.patch Currently, the per-io flag disables overall per task counters and retains only per edge counters. It would be useful to have both overall and per edge counters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2365) Update tez-ui war's license/notice to reflect OFL license correctly
[ https://issues.apache.org/jira/browse/TEZ-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514226#comment-14514226 ] Prakash Ramachandran commented on TEZ-2365: --- +1 LGTM. Update tez-ui war's license/notice to reflect OFL license correctly Key: TEZ-2365 URL: https://issues.apache.org/jira/browse/TEZ-2365 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-2365.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2366) Pig tez MiniTezCluster unit tests fail intermittently after TEZ-2333
[ https://issues.apache.org/jira/browse/TEZ-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-2366: -- Attachment: TEZ-2366.wip.1.patch [~sseth] attaching a patch which checks the port along with the host. one quick question though. the mapreduce.shuffle.port is not exposed by yarn. is it fine to rely on that conf and its default value? if the patch looks ok. i can add the tests. Pig tez MiniTezCluster unit tests fail intermittently after TEZ-2333 Key: TEZ-2366 URL: https://issues.apache.org/jira/browse/TEZ-2366 Project: Apache Tez Issue Type: Bug Reporter: Daniel Dai Priority: Critical Attachments: TEZ-2366.test.txt, TEZ-2366.wip.1.patch There are around 20 unit tests (out of around 2000) fail intermittently after TEZ-2333. Here is a stack: {code} org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find output/attempt_1429899954360_0001_1_01_00_1_10003/file.out.index in any of the configured local directories at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:449) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:164) at org.apache.tez.runtime.library.common.shuffle.Fetcher.getShuffleInputFileName(Fetcher.java:611) at org.apache.tez.runtime.library.common.shuffle.Fetcher.getTezIndexRecord(Fetcher.java:591) at org.apache.tez.runtime.library.common.shuffle.Fetcher.doLocalDiskFetch(Fetcher.java:536) at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupLocalDiskFetch(Fetcher.java:517) at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:190) at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:72) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} To reproduce this in a Pig test, use the following commands: svn co http://svn.apache.org/repos/asf/pig/trunk ant -Dhadoopversion=23 -Dtest.exec.type=tez -Dtestcase=TestTezAutoParallelism test Note in the Pig codebase, we already set TEZ_RUNTIME_OPTIMIZE_LOCAL_FETCH to true (http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezLauncher.java?view=markup). I tried changing TEZ_RUNTIME_OPTIMIZE_LOCAL_FETCH to false in Pig and it does not help. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
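The fix discussed above can be sketched generically: in a MiniTezCluster several shuffle services may run on the same host, so deciding "this output is local" by hostname alone can point the fetcher at the wrong service's directories. Comparing host and port together disambiguates. This is a minimal illustration with hypothetical names, not the actual Tez patch.

```java
// Generic sketch of a host+port local-fetch check. The method name and
// parameters are illustrative assumptions, not the Tez API.
public class LocalFetchCheck {
    static boolean isLocalFetch(String mapHost, int mapPort,
                                String localHost, int localShufflePort) {
        // Host alone is ambiguous when multiple shuffle services share a host
        // (as in a mini cluster); the port identifies the specific service.
        return mapHost.equals(localHost) && mapPort == localShufflePort;
    }

    public static void main(String[] args) {
        // Two shuffle services on the same host, different ports:
        System.out.println(isLocalFetch("node1", 13562, "node1", 13562)); // true
        System.out.println(isLocalFetch("node1", 13563, "node1", 13562)); // false: same host, different port
    }
}
```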
[jira] [Commented] (TEZ-2314) Tez task attempt failures due to bad event serialization
[ https://issues.apache.org/jira/browse/TEZ-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516076#comment-14516076 ] Siddharth Seth commented on TEZ-2314: - bq. Updates to volatile longs are atomic according to the Java language specification. That's good to know. bq. This is unrelated to the actual contents of the stats etc. This is more around having the right number of objects in the heartbeat request. There should be N stats objects for N IOs. So that code upstream (serde or non-serde) can simply work on the correct number of objects. About consistency of the objects internal state while updates are in progress, those will have to be looked at as needed. Already said this was OK to go in for now. It does however have issues when stats are added dynamically - which we will hit at a later point when this is supported. There's no relation to the code upstream requiring N objects, since we handle the absence of stats correctly. One input initialized - reports some stats - which may or may not show up in the AM. Another one blocked on initialization, we don't report stats. 
Tez task attempt failures due to bad event serialization Key: TEZ-2314 URL: https://issues.apache.org/jira/browse/TEZ-2314 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Rohini Palaniswamy Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-2314.1.patch, TEZ-2314.log.patch {code} 2015-04-13 19:21:48,516 WARN [Socket Reader #3 for port 53530] ipc.Server: Unable to read call parameters for client 10.216.13.112 on connection protocol org.apache.tez.common.TezTaskUmbilicalProtocol for rpcKind RPC_WRITABLE java.lang.ArrayIndexOutOfBoundsException: 1935896432 at org.apache.tez.runtime.api.impl.EventMetaData.readFields(EventMetaData.java:120) at org.apache.tez.runtime.api.impl.TezEvent.readFields(TezEvent.java:271) at org.apache.tez.runtime.api.impl.TezHeartbeatRequest.readFields(TezHeartbeatRequest.java:110) at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285) at org.apache.hadoop.ipc.WritableRpcEngine$Invocation.readFields(WritableRpcEngine.java:160) at org.apache.hadoop.ipc.Server$Connection.processRpcRequest(Server.java:1884) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1816) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1574) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:806) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:673) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:644) {code} cc/ [~hitesh] and [~bikassaha] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2373) Whitespace cleanup in tez codebase
Hitesh Shah created TEZ-2373: Summary: Whitespace cleanup in tez codebase Key: TEZ-2373 URL: https://issues.apache.org/jira/browse/TEZ-2373 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Priority: Trivial Found only 480 out of 790 java files need a cleanup. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2314) Tez task attempt failures due to bad event serialization
[ https://issues.apache.org/jira/browse/TEZ-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515837#comment-14515837 ] Siddharth Seth commented on TEZ-2314: - The plugin where the stats are eventually used - the VMPlugin I believe. Looks like that path is handled via null checks while accumulating statistics. One thing I did notice though, is that TaskAttempt.getStatistics is outside any lock - can be fixed here or in a separate jira since it's not related directly to the issue. On the patch itself - volatile long instead of synchronizing the updates to the values can be problematic - since operations on longs are not atomic. The approach of sending the data only after initialization is fine for now. We'll have to keep this in mind when adding user specified statistics, or stats which are not setup during initialization. Synchronization is a simpler approach though, and won't run into these potential pitfalls later. Tez task attempt failures due to bad event serialization Key: TEZ-2314 URL: https://issues.apache.org/jira/browse/TEZ-2314 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Rohini Palaniswamy Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-2314.1.patch, TEZ-2314.log.patch {code} 2015-04-13 19:21:48,516 WARN [Socket Reader #3 for port 53530] ipc.Server: Unable to read call parameters for client 10.216.13.112 on connection protocol org.apache.tez.common.TezTaskUmbilicalProtocol for rpcKind RPC_WRITABLE java.lang.ArrayIndexOutOfBoundsException: 1935896432 at org.apache.tez.runtime.api.impl.EventMetaData.readFields(EventMetaData.java:120) at org.apache.tez.runtime.api.impl.TezEvent.readFields(TezEvent.java:271) at org.apache.tez.runtime.api.impl.TezHeartbeatRequest.readFields(TezHeartbeatRequest.java:110) at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285) at org.apache.hadoop.ipc.WritableRpcEngine$Invocation.readFields(WritableRpcEngine.java:160) at 
org.apache.hadoop.ipc.Server$Connection.processRpcRequest(Server.java:1884) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1816) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1574) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:806) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:673) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:644) {code} cc/ [~hitesh] and [~bikassaha] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
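The atomicity point debated in the thread above can be shown in a few lines: per JLS §17.7, reads and writes of a volatile long are atomic (no word tearing), but a compound update like `counter++` is a read-modify-write and is not atomic, so concurrent increments can still be lost. This is a generic sketch, not Tez code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Demonstrates that volatile makes individual long reads/writes atomic,
// while increments on a volatile long are still not atomic; AtomicLong
// (or synchronization, as suggested in the review) closes that gap.
public class VolatileLongDemo {
    public static volatile long volatileCounter = 0;          // increments may race
    public static final AtomicLong atomicCounter = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        int threads = 4, perThread = 100_000;
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    volatileCounter++;                // lost updates possible here
                    atomicCounter.incrementAndGet();  // atomic read-modify-write
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        // atomicCounter is exactly 400000; volatileCounter is often less.
        System.out.println("volatile: " + volatileCounter + " atomic: " + atomicCounter.get());
    }
}
```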
[jira] [Updated] (TEZ-2226) Disable writing history to timeline if domain creation fails.
[ https://issues.apache.org/jira/browse/TEZ-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2226: - Attachment: TEZ-2226.12.patch Renamed combined patch to patch 12. Disable writing history to timeline if domain creation fails. - Key: TEZ-2226 URL: https://issues.apache.org/jira/browse/TEZ-2226 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Chang Li Priority: Blocker Attachments: TEZ-2226.10.patch, TEZ-2226.11.patch, TEZ-2226.12.patch, TEZ-2226.2.patch, TEZ-2226.3.patch, TEZ-2226.4.patch, TEZ-2226.5.patch, TEZ-2226.6.patch, TEZ-2226.7.patch, TEZ-2226.8.patch, TEZ-2226.9.patch, TEZ-2226.addon-for-patch10, TEZ-2226.addon-for-patch10-combined.full.patch, TEZ-2226.patch, TEZ-2226.wip.2.patch, TEZ-2226.wip.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-2226 PreCommit Build #555
Jira: https://issues.apache.org/jira/browse/TEZ-2226 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/555/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2780 lines...] [INFO] Final Memory: 72M/929M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728523/TEZ-2226.addon-for-patch10-combined.full.patch against master revision aa87a14. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/555//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/555//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 1c99c398084a0d570542251b364b522bae05bb99 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #554 Archived 44 artifacts Archive block size is 32768 Received 8 blocks and 2491263 bytes Compression is 9.5% Took 1.1 sec Description set: TEZ-2226 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2226) Disable writing history to timeline if domain creation fails.
[ https://issues.apache.org/jira/browse/TEZ-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14515843#comment-14515843 ] TezQA commented on TEZ-2226: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728523/TEZ-2226.addon-for-patch10-combined.full.patch against master revision aa87a14. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/555//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/555//console This message is automatically generated. Disable writing history to timeline if domain creation fails. - Key: TEZ-2226 URL: https://issues.apache.org/jira/browse/TEZ-2226 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Chang Li Priority: Blocker Attachments: TEZ-2226.10.patch, TEZ-2226.11.patch, TEZ-2226.12.patch, TEZ-2226.2.patch, TEZ-2226.3.patch, TEZ-2226.4.patch, TEZ-2226.5.patch, TEZ-2226.6.patch, TEZ-2226.7.patch, TEZ-2226.8.patch, TEZ-2226.9.patch, TEZ-2226.addon-for-patch10, TEZ-2226.addon-for-patch10-combined.full.patch, TEZ-2226.patch, TEZ-2226.wip.2.patch, TEZ-2226.wip.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2374) Fix build break against hadoop-2.2 due to TEZ-2325
[ https://issues.apache.org/jira/browse/TEZ-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2374: - Attachment: TEZ-2374.1.patch Fix build break against hadoop-2.2 due to TEZ-2325 -- Key: TEZ-2374 URL: https://issues.apache.org/jira/browse/TEZ-2374 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-2374.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2226) Disable writing history to timeline if domain creation fails.
[ https://issues.apache.org/jira/browse/TEZ-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516003#comment-14516003 ] TezQA commented on TEZ-2226: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728605/TEZ-2226.12.patch against master revision 9e9cf99. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/556//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/556//console This message is automatically generated. Disable writing history to timeline if domain creation fails. - Key: TEZ-2226 URL: https://issues.apache.org/jira/browse/TEZ-2226 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Chang Li Priority: Blocker Attachments: TEZ-2226.10.patch, TEZ-2226.11.patch, TEZ-2226.12.patch, TEZ-2226.2.patch, TEZ-2226.3.patch, TEZ-2226.4.patch, TEZ-2226.5.patch, TEZ-2226.6.patch, TEZ-2226.7.patch, TEZ-2226.8.patch, TEZ-2226.9.patch, TEZ-2226.addon-for-patch10, TEZ-2226.addon-for-patch10-combined.full.patch, TEZ-2226.patch, TEZ-2226.wip.2.patch, TEZ-2226.wip.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2359) Deadlock in DAGAppMaster
[ https://issues.apache.org/jira/browse/TEZ-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2359: Priority: Blocker (was: Critical) Deadlock in DAGAppMaster Key: TEZ-2359 URL: https://issues.apache.org/jira/browse/TEZ-2359 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Priority: Blocker {code} Found one Java-level deadlock: = Timer-1: waiting for ownable synchronizer 0x0007cd0f8a30, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync), which is held by Dispatcher thread: Central Dispatcher thread: Central: waiting to lock monitor 0x7fb829866d18 (object 0x0007cd5ab958, a org.apache.tez.dag.app.rm.YarnTaskSchedulerService), which is held by DelayedContainerManager DelayedContainerManager: waiting for ownable synchronizer 0x0007cd0f8a30, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync), which is held by Dispatcher thread: Central Java stack information for the threads listed above: === Timer-1: at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007cd0f8a30 (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197) at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:945) at org.apache.tez.dag.app.DAGAppMaster.checkAndHandleSessionTimeout(DAGAppMaster.java:2015) - locked 0x0007cd0f2ff0 (a org.apache.tez.dag.app.DAGAppMaster) at org.apache.tez.dag.app.DAGAppMaster$3.run(DAGAppMaster.java:1825) at java.util.TimerThread.mainLoop(Timer.java:555) at java.util.TimerThread.run(Timer.java:505) Dispatcher thread: Central: at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.dagComplete(YarnTaskSchedulerService.java:842) - waiting to lock 0x0007cd5ab958 (a org.apache.tez.dag.app.rm.YarnTaskSchedulerService) at org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.dagCompleted(TaskSchedulerEventHandler.java:566) at org.apache.tez.dag.app.DAGAppMaster.checkForCompletion(DAGAppMaster.java:832) at org.apache.tez.dag.app.DAGAppMaster.access$4800(DAGAppMaster.java:201) at org.apache.tez.dag.app.DAGAppMaster$DAGFinishedTransition.transition(DAGAppMaster.java:2362) at org.apache.tez.dag.app.DAGAppMaster$DAGFinishedTransition.transition(DAGAppMaster.java:2356) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) - locked 0x0007cd1d0208 (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine) at org.apache.tez.dag.app.DAGAppMaster.handle(DAGAppMaster.java:510) at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler.handle(DAGAppMaster.java:879) at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler.handle(DAGAppMaster.java:868) at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:182) at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:113) at java.lang.Thread.run(Thread.java:745) DelayedContainerManager: at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007cd0f8a30 (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731) at org.apache.tez.dag.app.DAGAppMaster.getState(DAGAppMaster.java:531) at
[jira] [Updated] (TEZ-2359) Deadlock in DAGAppMaster
[ https://issues.apache.org/jira/browse/TEZ-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2359: Target Version/s: 0.7.0 Deadlock in DAGAppMaster Key: TEZ-2359 URL: https://issues.apache.org/jira/browse/TEZ-2359 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Priority: Blocker {code} Found one Java-level deadlock: = Timer-1: waiting for ownable synchronizer 0x0007cd0f8a30, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync), which is held by Dispatcher thread: Central Dispatcher thread: Central: waiting to lock monitor 0x7fb829866d18 (object 0x0007cd5ab958, a org.apache.tez.dag.app.rm.YarnTaskSchedulerService), which is held by DelayedContainerManager DelayedContainerManager: waiting for ownable synchronizer 0x0007cd0f8a30, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync), which is held by Dispatcher thread: Central Java stack information for the threads listed above: === Timer-1: at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007cd0f8a30 (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197) at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:945) at org.apache.tez.dag.app.DAGAppMaster.checkAndHandleSessionTimeout(DAGAppMaster.java:2015) - locked 0x0007cd0f2ff0 (a org.apache.tez.dag.app.DAGAppMaster) at org.apache.tez.dag.app.DAGAppMaster$3.run(DAGAppMaster.java:1825) at java.util.TimerThread.mainLoop(Timer.java:555) at java.util.TimerThread.run(Timer.java:505) Dispatcher thread: Central: at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.dagComplete(YarnTaskSchedulerService.java:842) - waiting to lock 0x0007cd5ab958 (a org.apache.tez.dag.app.rm.YarnTaskSchedulerService) at org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.dagCompleted(TaskSchedulerEventHandler.java:566) at org.apache.tez.dag.app.DAGAppMaster.checkForCompletion(DAGAppMaster.java:832) at org.apache.tez.dag.app.DAGAppMaster.access$4800(DAGAppMaster.java:201) at org.apache.tez.dag.app.DAGAppMaster$DAGFinishedTransition.transition(DAGAppMaster.java:2362) at org.apache.tez.dag.app.DAGAppMaster$DAGFinishedTransition.transition(DAGAppMaster.java:2356) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) - locked 0x0007cd1d0208 (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine) at org.apache.tez.dag.app.DAGAppMaster.handle(DAGAppMaster.java:510) at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler.handle(DAGAppMaster.java:879) at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler.handle(DAGAppMaster.java:868) at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:182) at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:113) at java.lang.Thread.run(Thread.java:745) DelayedContainerManager: at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007cd0f8a30 (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731) at org.apache.tez.dag.app.DAGAppMaster.getState(DAGAppMaster.java:531) at
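The thread dump above is a classic lock-ordering cycle: one thread holds the DAGAppMaster read/write lock and waits on the scheduler monitor, while another holds the monitor and waits for the lock. The standard remedy is to acquire both locks in a single global order everywhere. A minimal sketch (illustrative names, not the actual Tez fix):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// With a fixed acquisition order (rwLock before monitor) in every code path,
// the wait-for cycle from the dump above cannot form, and both threads finish.
public class LockOrderingDemo {
    static final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    static final Object monitor = new Object();
    public static int sharedState = 0;

    static void updateState() {
        rwLock.writeLock().lock();         // 1st: always the read/write lock
        try {
            synchronized (monitor) {       // 2nd: always the monitor
                sharedState++;
            }
        } finally {
            rwLock.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> { for (int i = 0; i < 10_000; i++) updateState(); });
        Thread b = new Thread(() -> { for (int i = 0; i < 10_000; i++) updateState(); });
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("sharedState = " + sharedState);  // prints "sharedState = 20000"
    }
}
```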
[jira] [Commented] (TEZ-2226) Disable writing history to timeline if domain creation fails.
[ https://issues.apache.org/jira/browse/TEZ-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515120#comment-14515120 ] Chang Li commented on TEZ-2226: --- Thanks a lot for help [~zjffdu], [~hitesh]! I updated my latest patch to handle the am crash and recover scenario, have tested in my single node cluster. Could you please help review, thanks! Disable writing history to timeline if domain creation fails. - Key: TEZ-2226 URL: https://issues.apache.org/jira/browse/TEZ-2226 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Chang Li Priority: Blocker Attachments: TEZ-2226.10.patch, TEZ-2226.11.patch, TEZ-2226.2.patch, TEZ-2226.3.patch, TEZ-2226.4.patch, TEZ-2226.5.patch, TEZ-2226.6.patch, TEZ-2226.7.patch, TEZ-2226.8.patch, TEZ-2226.9.patch, TEZ-2226.patch, TEZ-2226.wip.2.patch, TEZ-2226.wip.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2368) Make the dag number available in Context classes
[ https://issues.apache.org/jira/browse/TEZ-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515150#comment-14515150 ] Hitesh Shah commented on TEZ-2368: -- Comments: typo in Get a numeric identifier for the dto which the task belongs +1 once the typo is fixed. Make the dag number available in Context classes Key: TEZ-2368 URL: https://issues.apache.org/jira/browse/TEZ-2368 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: TEZ-2368.1.txt, TEZ-2368.2.txt Provide the dag number, which is a unique number, for each dag running within an application in the TezInputContext, TezOutputContext, TezProcessorContext. When containers are re-used, or for external services, this can be used to generate intermediate data to a dag specific directory instead of an application specific directory, where it becomes difficult to differentiate between different dags. The DAG name does provide this - but is not suitable for use in a directory name. Hashing the name is an option, but can lead to collisions. Generating data into a dag specific directory will eventually only be usable when we move away from the default MR handler, or enhance it to support an additional parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2325) Route status update event directly to the attempt
[ https://issues.apache.org/jira/browse/TEZ-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515160#comment-14515160 ] Hitesh Shah commented on TEZ-2325: -- Committing shortly. Route status update event directly to the attempt -- Key: TEZ-2325 URL: https://issues.apache.org/jira/browse/TEZ-2325 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Prakash Ramachandran Attachments: TEZ-2325.1.patch, TEZ-2325.2.patch, TEZ-2325.3.patch, TEZ-2325.4.patch Today, all events from the attempt heartbeat are routed to the vertex. then the vertex routes (if any) status update events to the attempt. This is unnecessary and potentially creates out of order scenarios. We could route the status update events directly to attempts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2368) Make the dag number available in Context classes
[ https://issues.apache.org/jira/browse/TEZ-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-2368: Attachment: TEZ-2368.3.txt Fixed the typo. Thanks for the review. Committing. Make the dag number available in Context classes Key: TEZ-2368 URL: https://issues.apache.org/jira/browse/TEZ-2368 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: TEZ-2368.1.txt, TEZ-2368.2.txt, TEZ-2368.3.txt Provide the dag number, which is a unique number, for each dag running within an application in the TezInputContext, TezOutputContext, TezProcessorContext. When containers are re-used, or for external services, this can be used to generate intermediate data to a dag specific directory instead of an application specific directory, where it becomes difficult to differentiate between different dags. The DAG name does provide this - but is not suitable for use in a directory name. Hashing the name is an option, but can lead to collisions. Generating data into a dag specific directory will eventually only be usable when we move away from the default MR handler, or enhance it to support an additional parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2368) Make a dag identifier available in Context classes
[ https://issues.apache.org/jira/browse/TEZ-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-2368: Summary: Make a dag identifier available in Context classes (was: Make the dag number available in Context classes) Make a dag identifier available in Context classes -- Key: TEZ-2368 URL: https://issues.apache.org/jira/browse/TEZ-2368 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: TEZ-2368.1.txt, TEZ-2368.2.txt, TEZ-2368.3.txt Provide the dag number, which is a unique number, for each dag running within an application in the TezInputContext, TezOutputContext, TezProcessorContext. When containers are re-used, or for external services, this can be used to generate intermediate data to a dag specific directory instead of an application specific directory, where it becomes difficult to differentiate between different dags. The DAG name does provide this - but is not suitable for use in a directory name. Hashing the name is an option, but can lead to collisions. Generating data into a dag specific directory will eventually only be usable when we move away from the default MR handler, or enhance it to support an additional parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
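The rationale in the TEZ-2368 description can be sketched briefly: combining the application id with the per-application dag number yields a unique, filesystem-safe directory name, whereas a free-form DAG name is unsafe in paths and a hash of it can collide. The layout and method name below are illustrative assumptions, not the Tez implementation.

```java
// Sketch of a dag-specific scratch directory derived from a numeric dag
// identifier; the path layout is hypothetical.
public class DagDirDemo {
    static String dagScratchDir(String appId, int dagNumber) {
        return appId + "/dag_" + dagNumber;  // unique within the application
    }

    public static void main(String[] args) {
        System.out.println(dagScratchDir("application_1429683757595_0141", 1));
        System.out.println(dagScratchDir("application_1429683757595_0141", 2));
    }
}
```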
[jira] [Comment Edited] (TEZ-2226) Disable writing history to timeline if domain creation fails.
[ https://issues.apache.org/jira/browse/TEZ-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14515125#comment-14515125 ] Hitesh Shah edited comment on TEZ-2226 at 4/27/15 10:32 PM: Thanks for the patch 11 [~lichangleo]. I started making some minor mods over patch 10 in addition to recovery support. Mostly cleanup ( some renames ) but also handling one case where history events are generated that are not related to a dag ( app launched etc ). Will upload an add-on patch for .10 shortly in addition to a combined patch. was (Author: hitesh): Thanks for the patch 11 [~lichangleo]. I started making some minor mods over patch 10 in addition to recovery support. Mostly cleanup but also handling one case where history events are generated that are not related to a dag ( app launched etc ). Will upload an add-on patch for .10 shortly in addition to a combined patch. Disable writing history to timeline if domain creation fails. - Key: TEZ-2226 URL: https://issues.apache.org/jira/browse/TEZ-2226 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Chang Li Priority: Blocker Attachments: TEZ-2226.10.patch, TEZ-2226.11.patch, TEZ-2226.2.patch, TEZ-2226.3.patch, TEZ-2226.4.patch, TEZ-2226.5.patch, TEZ-2226.6.patch, TEZ-2226.7.patch, TEZ-2226.8.patch, TEZ-2226.9.patch, TEZ-2226.addon-for-patch10, TEZ-2226.addon-for-patch10-combined.full.patch, TEZ-2226.patch, TEZ-2226.wip.2.patch, TEZ-2226.wip.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2325) Route status update event directly to the attempt
[ https://issues.apache.org/jira/browse/TEZ-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2325: - Target Version/s: 0.7.0 Route status update event directly to the attempt -- Key: TEZ-2325 URL: https://issues.apache.org/jira/browse/TEZ-2325 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Prakash Ramachandran Attachments: TEZ-2325.1.patch, TEZ-2325.2.patch, TEZ-2325.3.patch, TEZ-2325.4.patch Today, all events from the attempt heartbeat are routed to the vertex. then the vertex routes (if any) status update events to the attempt. This is unnecessary and potentially creates out of order scenarios. We could route the status update events directly to attempts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2363) Counters: off by 1 error for REDUCE_INPUT_GROUPS counter
[ https://issues.apache.org/jira/browse/TEZ-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14515996#comment-14515996 ] Rajesh Balamohan commented on TEZ-2363: --- lgtm. +1. I believe the javac warning can be addressed by adding @SuppressWarnings("unchecked") near TestValuesIterator.createCountedIterator? Counters: off by 1 error for REDUCE_INPUT_GROUPS counter Key: TEZ-2363 URL: https://issues.apache.org/jira/browse/TEZ-2363 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Gopal V Assignee: Gopal V Priority: Minor Attachments: TEZ-2363.1.patch The reduce input key groups are not incremented for the first key in operation, only for the second key does it increment in moveToNext() -> nextKey() -> inputKeyCounter.increment(1); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
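The off-by-one described above can be illustrated outside Tez with a toy grouped-key counter (hypothetical code, not the actual Tez implementation): a counter bumped only on key *transitions* misses the first group, so it must also be bumped when the very first key is seen.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration (not Tez code) of the REDUCE_INPUT_GROUPS off-by-one:
// counting only key *changes* skips the first group, so the first key
// must also trigger an increment.
class GroupCounter {
    static int countGroups(List<String> sortedKeys) {
        int groups = 0;
        String prev = null;
        for (String key : sortedKeys) {
            // prev == null covers the first key; without it the count is off by 1
            if (prev == null || !key.equals(prev)) {
                groups++;
            }
            prev = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        // three distinct key groups: a, b, c
        System.out.println(countGroups(Arrays.asList("a", "a", "b", "c", "c")));
    }
}
```

The fix referenced in the issue moves the increment so it also fires for the first key group, which is what the `prev == null` branch stands in for here.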
[jira] [Updated] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-391: --- Target Version/s: 0.8.0 SharedEdge - Support for passing same output from a vertex as input to two different vertices - Key: TEZ-391 URL: https://issues.apache.org/jira/browse/TEZ-391 Project: Apache Tez Issue Type: Sub-task Reporter: Rohini Palaniswamy Assignee: Jeff Zhang Attachments: Shared Edge Design.pdf, TEZ-391-WIP-1.patch, TEZ-391-WIP-2.patch, TEZ-391-WIP-3.patch, TEZ-391-WIP-4.patch, TEZ-391-WIP-5.patch, TEZ-391-WIP-6.patch We need this for a lot of use cases: for cases where multi-query is turned off and for optimizing unions. Currently those are BROADCAST or ONE-ONE edges and we write the output multiple times. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2226) Disable writing history to timeline if domain creation fails.
[ https://issues.apache.org/jira/browse/TEZ-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2226: - Attachment: TEZ-2226.addon-for-patch10-combined.full.patch TEZ-2226.addon-for-patch10 [~lichangleo] Take a look. [~zjffdu] [~pramachandran] Mind reviewing. Disable writing history to timeline if domain creation fails. - Key: TEZ-2226 URL: https://issues.apache.org/jira/browse/TEZ-2226 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Chang Li Priority: Blocker Attachments: TEZ-2226.10.patch, TEZ-2226.11.patch, TEZ-2226.2.patch, TEZ-2226.3.patch, TEZ-2226.4.patch, TEZ-2226.5.patch, TEZ-2226.6.patch, TEZ-2226.7.patch, TEZ-2226.8.patch, TEZ-2226.9.patch, TEZ-2226.addon-for-patch10, TEZ-2226.addon-for-patch10-combined.full.patch, TEZ-2226.patch, TEZ-2226.wip.2.patch, TEZ-2226.wip.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-2226 PreCommit Build #556
Jira: https://issues.apache.org/jira/browse/TEZ-2226 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/556/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2782 lines...] [INFO] Final Memory: 68M/852M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728605/TEZ-2226.12.patch against master revision 9e9cf99. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/556//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/556//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 584b970040c28bcc7375f80bb496af75e711f4af logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #555 Archived 44 artifacts Archive block size is 32768 Received 23 blocks and 1995647 bytes Compression is 27.4% Took 1.5 sec Description set: TEZ-2226 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2374) Fix build break against hadoop-2.2 due to TEZ-2325
[ https://issues.apache.org/jira/browse/TEZ-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516094#comment-14516094 ] TezQA commented on TEZ-2374: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728610/TEZ-2374.1.patch against master revision 9e9cf99. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 160 javac compiler warnings (more than the master's current 159 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/557//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/557//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/557//console This message is automatically generated. Fix build break against hadoop-2.2 due to TEZ-2325 -- Key: TEZ-2374 URL: https://issues.apache.org/jira/browse/TEZ-2374 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-2374.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2374 PreCommit Build #557
Jira: https://issues.apache.org/jira/browse/TEZ-2374 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/557/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2770 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728610/TEZ-2374.1.patch against master revision 9e9cf99. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 160 javac compiler warnings (more than the master's current 159 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/557//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/557//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/557//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 0846e03c24003fad190fd562be96a766a151fb8b logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #556 Archived 45 artifacts Archive block size is 32768 Received 26 blocks and 1902290 bytes Compression is 30.9% Took 0.84 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Updated] (TEZ-2226) Disable writing history to timeline if domain creation fails.
[ https://issues.apache.org/jira/browse/TEZ-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated TEZ-2226: -- Attachment: TEZ-2226.11.patch Disable writing history to timeline if domain creation fails. - Key: TEZ-2226 URL: https://issues.apache.org/jira/browse/TEZ-2226 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Chang Li Priority: Blocker Attachments: TEZ-2226.10.patch, TEZ-2226.11.patch, TEZ-2226.2.patch, TEZ-2226.3.patch, TEZ-2226.4.patch, TEZ-2226.5.patch, TEZ-2226.6.patch, TEZ-2226.7.patch, TEZ-2226.8.patch, TEZ-2226.9.patch, TEZ-2226.patch, TEZ-2226.wip.2.patch, TEZ-2226.wip.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2303) ConcurrentModificationException while processing recovery
[ https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516216#comment-14516216 ] Hitesh Shah commented on TEZ-2303: -- In that case, +1 for patch 1. Please open a new jira for the long term fix. ConcurrentModificationException while processing recovery - Key: TEZ-2303 URL: https://issues.apache.org/jira/browse/TEZ-2303 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jason Lowe Assignee: Jeff Zhang Attachments: TEZ-2303-1.patch, TEZ-2303-2.patch, TEZ-2303-4.patch Saw a Tez AM log a few ConcurrentModificationException messages while trying to recover from a previous attempt that crashed. Exception details to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
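For context, the usual short-term fix for a ConcurrentModificationException of this kind can be sketched as follows (an illustrative pattern, not the actual TEZ-2303 patch): iterate over a snapshot copy of the collection so the live collection can be mutated while the loop runs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative pattern only (not the actual TEZ-2303 patch): iterating a
// snapshot copy avoids the ConcurrentModificationException that a fail-fast
// iterator throws when the underlying list is mutated mid-loop.
class SnapshotIteration {
    static int drainEvents(List<Integer> liveEvents) {
        int processed = 0;
        // Iterating liveEvents directly while adding to it would throw a
        // ConcurrentModificationException; the copy is fixed and safe.
        for (Integer event : new ArrayList<>(liveEvents)) {
            processed++;
            liveEvents.add(event + 100); // mutating the live list is now fine
        }
        return processed;
    }
}
```

The trade-off is an extra copy per pass, which is why such patches are often labeled short-term fixes, with a structural change (as discussed above) left for a follow-up jira.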
[jira] [Commented] (TEZ-2375) Don't return dag status to client when dag is still in recovering
[ https://issues.apache.org/jira/browse/TEZ-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516222#comment-14516222 ] Hitesh Shah commented on TEZ-2375: -- A different approach is to send a recovering status back to the client, and the client should be changed to cache the last seen valid progress. Using this, the user will never see a regression in progress unless recovery fails. Don't return dag status to client when dag is still in recovering -- Key: TEZ-2375 URL: https://issues.apache.org/jira/browse/TEZ-2375 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Should only return dag status to client after the whole recovery process is done (DAG/Vertex/Task/TaskAttempt are all recovered to their correct states) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2375) Don't return dag status to client when dag is still in recovering
[ https://issues.apache.org/jira/browse/TEZ-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516222#comment-14516222 ] Hitesh Shah edited comment on TEZ-2375 at 4/28/15 2:41 AM: --- A different approach is to send a recovering status back to the client, and the client should be changed to cache the last seen valid progress. Using this, the user will never see a regression in progress unless recovery fails or not all tasks are recovered from the previous attempt. was (Author: hitesh): A different approach is to send a recovering status back to the client, and the client should be changed to cache the last seen valid progress. Using this, the user will never see a regression in progress unless recovery fails. Don't return dag status to client when dag is still in recovering -- Key: TEZ-2375 URL: https://issues.apache.org/jira/browse/TEZ-2375 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Should only return dag status to client after the whole recovery process is done (DAG/Vertex/Task/TaskAttempt are all recovered to their correct states) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
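The client-side caching approach suggested in the comment could look roughly like this (a hedged sketch; the class and method names are hypothetical, not the real DAGClient API):

```java
// Sketch of the suggested client-side behavior (names are hypothetical, not
// the actual Tez DAGClient API): while the AM reports that it is still
// recovering, the client keeps returning the last valid progress it saw,
// so the user never observes progress moving backwards.
class CachedProgress {
    private float lastSeen = 0.0f;

    // reported: progress value from the AM; amRecovering: whether the AM
    // answered with a "recovering" status instead of a real dag status
    public float report(float reported, boolean amRecovering) {
        if (!amRecovering && reported >= lastSeen) {
            lastSeen = reported; // only advance on valid, monotonic updates
        }
        return lastSeen;
    }
}
```

This keeps progress monotonic across AM restarts without requiring the AM to block client RPCs until recovery completes.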
[jira] [Commented] (TEZ-1577) Recover attempt information when recovering from task desired state
[ https://issues.apache.org/jira/browse/TEZ-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516262#comment-14516262 ] Hitesh Shah commented on TEZ-1577: -- \cc [~zjffdu] Recover attempt information when recovering from task desired state --- Key: TEZ-1577 URL: https://issues.apache.org/jira/browse/TEZ-1577 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Priority: Critical TaskImpl has a TODO item for this - // TODO recover attempts if desired state is given?. InputInitializerEvent recovery will fail without this change, since the successful attempt number is important. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1522) Scheduling can result in out of order execution and slowdown of upstream work
[ https://issues.apache.org/jira/browse/TEZ-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1522: - Target Version/s: (was: 0.6.0) Scheduling can result in out of order execution and slowdown of upstream work - Key: TEZ-1522 URL: https://issues.apache.org/jira/browse/TEZ-1522 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Priority: Critical Labels: performance Attachments: TEZ-1522.1.wip.txt, TEZ-1522.2.wip.txt, TEZ-1522.am.log.gz, task_runtime.svg M2 M7 \ / (sg) \/ R3/ (b) \ / (b) \ / \ / M5 | R6 Please refer to the attachment (task runtime SVG). In this case, M5 got scheduled much earlier than R3 (green color in the diagram) and retained lots of containers. R3 got fewer containers to work with. Attaching the output from the status monitor when the job ran: Map_5 has taken up almost all of the cluster's resources, whereas Reducer_3 got a fraction of the capacity. Map_2: 1/1 Map_5: 0(+373)/1000 Map_7: 1/1 Reducer_3: 0/8000 Reducer_6: 0/1 Map_2: 1/1 Map_5: 0(+374)/1000 Map_7: 1/1 Reducer_3: 0/8000 Reducer_6: 0/1 Map_2: 1/1 Map_5: 0(+374)/1000 Map_7: 1/1 Reducer_3: 0(+1)/8000 Reducer_6: 0/1 Map_2: 1/1 Map_5: 0(+374)/1000 Map_7: 1/1 Reducer_3: 14(+7)/8000 Reducer_6: 0/1 Map_2: 1/1 Map_5: 0(+374)/1000 Map_7: 1/1 Reducer_3: 63(+14)/8000 Reducer_6: 0/1 Map_2: 1/1 Map_5: 0(+374)/1000 Map_7: 1/1 Reducer_3: 159(+22)/8000 Reducer_6: 0/1 Map_2: 1/1 Map_5: 0(+374)/1000 Map_7: 1/1 Reducer_3: 308(+29)/8000 Reducer_6: 0/1 ... Creating this JIRA as a placeholder for scheduler enhancement. One possibility could be to schedule a smaller number of tasks in downstream vertices, based on the information available for the upstream vertex. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1732) Temporary mitigation for out of order scheduling
[ https://issues.apache.org/jira/browse/TEZ-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516261#comment-14516261 ] Hitesh Shah edited comment on TEZ-1732 at 4/28/15 3:16 AM: --- [~bikassaha] [~sseth] Mind setting a target version as well as affects version was (Author: hitesh): [~bikassaha] [~sseth] Mind setting a target version Temporary mitigation for out of order scheduling Key: TEZ-1732 URL: https://issues.apache.org/jira/browse/TEZ-1732 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2375) Don't return dag status to client when dag is still in recovering
[ https://issues.apache.org/jira/browse/TEZ-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516275#comment-14516275 ] Jeff Zhang commented on TEZ-2375: - Agree. Don't return dag status to client when dag is still in recovering -- Key: TEZ-2375 URL: https://issues.apache.org/jira/browse/TEZ-2375 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Should only return dag status to client after the whole recovery process is done (DAG/Vertex/Task/TaskAttempt are all recovered to its correct state) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2314) Tez task attempt failures due to bad event serialization
[ https://issues.apache.org/jira/browse/TEZ-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516151#comment-14516151 ] Bikas Saha commented on TEZ-2314: - Thanks! Will wait for [~rohini] to confirm that this patch fixes the issue she reported. If not then I will open a separate jira for this and commit it. Tez task attempt failures due to bad event serialization Key: TEZ-2314 URL: https://issues.apache.org/jira/browse/TEZ-2314 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Rohini Palaniswamy Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-2314.1.patch, TEZ-2314.log.patch {code} 2015-04-13 19:21:48,516 WARN [Socket Reader #3 for port 53530] ipc.Server: Unable to read call parameters for client 10.216.13.112on connection protocol org.apache.tez.common.TezTaskUmbilicalProtocol for rpcKind RPC_WRITABLE java.lang.ArrayIndexOutOfBoundsException: 1935896432 at org.apache.tez.runtime.api.impl.EventMetaData.readFields(EventMetaData.java:120) at org.apache.tez.runtime.api.impl.TezEvent.readFields(TezEvent.java:271) at org.apache.tez.runtime.api.impl.TezHeartbeatRequest.readFields(TezHeartbeatRequest.java:110) at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285) at org.apache.hadoop.ipc.WritableRpcEngine$Invocation.readFields(WritableRpcEngine.java:160) at org.apache.hadoop.ipc.Server$Connection.processRpcRequest(Server.java:1884) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1816) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1574) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:806) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:673) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:644) {code} cc/ [~hitesh] and [~bikassaha] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1560) Invalid state machine transition in recovery
[ https://issues.apache.org/jira/browse/TEZ-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1560: - Target Version/s: 0.7.0 Invalid state machine transition in recovery Key: TEZ-1560 URL: https://issues.apache.org/jira/browse/TEZ-1560 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Critical Attachments: failed_tez_job.txt.gz {code} 2014-09-04 16:08:25,504 INFO [main] org.apache.tez.dag.app.dag.impl.DAGImpl: dag_1409818083015_0001_1 transitioned from NEW to RUNNING 2014-09-04 16:08:25,504 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Recovered Vertex State, vertexId=vertex_1409818083015_0001_1_00 [v1], state=NEW, numInitedSourceVertices=0, numStartedSourceVertices=0, numRecoveredSourceVertices=0, recoveredEvents=0, tasksIsNull=false, numTasks=0 2014-09-04 16:08:25,505 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Root Inputs exist for Vertex: v1 : {Input={InputName=Input}, {Descriptor=ClassName=org.apache.tez.test.dag.MultiAttemptDAG$NoOpInput, hasPayload=false}, {ControllerDescriptor=ClassName=org.apache.tez.test.dag.MultiAttemptDAG$TestRootInputInitializer, hasPayload=false}} 2014-09-04 16:08:25,505 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Starting root input initializer for input: Input, with class: [org.apache.tez.test.dag.MultiAttemptDAG$TestRootInputInitializer] 2014-09-04 16:08:25,506 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Setting user vertex manager plugin: org.apache.tez.test.dag.MultiAttemptDAG$FailOnAttemptVertexManagerPlugin on vertex: v1 2014-09-04 16:08:25,508 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Creating 2 for vertex: vertex_1409818083015_0001_1_00 [v1] 2014-09-04 16:08:25,518 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Starting root 
input initializers: 1 2014-09-04 16:08:25,520 INFO [InputInitializer [v1] #0] org.apache.tez.dag.app.dag.RootInputInitializerManager: Starting InputInitializer for Input: Input on vertex vertex_1409818083015_0001_1_00 [v1] 2014-09-04 16:08:25,522 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.RootInputInitializerManager: Succeeded InputInitializer for Input: Input on vertex vertex_1409818083015_0001_1_00 [v1] 2014-09-04 16:08:25,523 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: vertex_1409818083015_0001_1_00 [v1] transitioned from NEW to INITIALIZING due to event V_INIT 2014-09-04 16:08:25,523 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Recovered Vertex State, vertexId=vertex_1409818083015_0001_1_01 [v2], state=NEW, numInitedSourceVertices=0, numStartedSourceVertices=0, numRecoveredSourceVertices=1, tasksIsNull=false, numTasks=0 2014-09-04 16:08:25,523 ERROR [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Can't handle Invalid event V_SOURCE_VERTEX_RECOVERED on vertex v2 with vertexId vertex_1409818083015_0001_1_01 at current state NEW org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: V_SOURCE_VERTEX_RECOVERED at NEW at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1344) at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1) at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1641) at
org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) 2014-09-04 16:08:25,524 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-711) Fix memory leak when not reading from inputs due to caching
[ https://issues.apache.org/jira/browse/TEZ-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516264#comment-14516264 ] Hitesh Shah commented on TEZ-711: - [~rajesh.balamohan] [~sseth] is this still valid? Fix memory leak when not reading from inputs due to caching --- Key: TEZ-711 URL: https://issues.apache.org/jira/browse/TEZ-711 Project: Apache Tez Issue Type: Bug Affects Versions: 0.2.0 Reporter: Rohini Palaniswamy Assignee: Siddharth Seth Priority: Critical Attachments: OOM-threaddump-711-5-patch.txt, OOM-threaddump-till-TEZ-752.txt, TEZ-711.5.txt, TEZ-711.wip.1.txt, TEZ-711.wip.2.txt, TEZ-711.wip.3.txt, TEZ-711.wip.4.txt When you are reading from inputs and caching objects with vertex scope, you don't have to read the input again when the container is reused. But it allocates memory, and that leaks, causing an OOM. KeyValueReader does not have an API to close the reader to clear allotted memory without reading from it. Also, if there were an option to pre-close inputs in the Processor and not fetch input at all over the wire and do shuffle/sort, it would be a good optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
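The missing-close gap described above can be sketched with a hypothetical closeable reader (illustrative only; Tez's KeyValueReader has no such method, which is precisely what the issue reports):

```java
// Illustrative sketch only: a reader that can release its buffers via
// close() without being fully consumed. The leak in this issue stems from
// Tez's KeyValueReader lacking such an API, so cached-but-unread inputs
// pin their fetched data in memory across container reuse.
class CloseableBufferedReader implements AutoCloseable {
    private byte[] buffer = new byte[1 << 20]; // stands in for fetched shuffle data
    private boolean closed = false;

    public boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        buffer = null; // drop the reference so the memory can be reclaimed
        closed = true;
    }
}
```

With such an API, a processor that hits its cache could close the unread input up front instead of leaving the allocation alive for the lifetime of the reused container.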
[jira] [Commented] (TEZ-2372) TestAMRecovery failing in latest build
[ https://issues.apache.org/jira/browse/TEZ-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516307#comment-14516307 ] Jeff Zhang commented on TEZ-2372: - Very weird, no test info for this. https://builds.apache.org/job/Tez-Build/1018/testReport/org.apache.tez.test/ TestAMRecovery failing in latest build --- Key: TEZ-2372 URL: https://issues.apache.org/jira/browse/TEZ-2372 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah https://builds.apache.org/job/Tez-Build/1018/console -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2303) ConcurrentModificationException while processing recovery
[ https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516200#comment-14516200 ] Jeff Zhang commented on TEZ-2303: - [~hitesh] Yes, I think it makes sense as a short-term fix; at least it fixes the ConcurrentModificationException. Regarding the issue of not providing info to clients until the recovery phase is over, I think there are 2 main scenarios: * ClientHandler RPC is started but the recovery log is not read. In this case, it will throw a No running dag exception in the AM, with no effect on the client side, so I think it is OK. {code} 2015-04-28 09:32:02,054 INFO [IPC Server handler 0 on 6000] ipc.Server: IPC Server handler 0 on 6000, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.1:63539 Call#9557 Retry#0 org.apache.tez.dag.api.TezException: No running dag at present at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:89) at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:156) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:95) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7465) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) {code} * The second scenario is that even after the recovery log is read, the RecoveryTransition may not
have completed. Then the client side may still get a wrong dag status. As I mentioned, this may need some big changes to recovery. We can leave it for the future and take it into account when refactoring the recovery code. ConcurrentModificationException while processing recovery - Key: TEZ-2303 URL: https://issues.apache.org/jira/browse/TEZ-2303 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jason Lowe Assignee: Jeff Zhang Attachments: TEZ-2303-1.patch, TEZ-2303-2.patch, TEZ-2303-4.patch Saw a Tez AM log a few ConcurrentModificationException messages while trying to recover from a previous attempt that crashed. Exception details to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2305) MR compatibility sleep job fails with IOException: Undefined job output-path
[ https://issues.apache.org/jira/browse/TEZ-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516258#comment-14516258 ] Hitesh Shah commented on TEZ-2305: -- Sorry for the delay in getting back [~zjffdu]. If we are going with patch .2, would you mind adding your unit test to the patch? Would be good to have some coverage. MR compatibility sleep job fails with IOException: Undefined job output-path Key: TEZ-2305 URL: https://issues.apache.org/jira/browse/TEZ-2305 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Tassapol Athiapinya Priority: Critical Attachments: TEZ-2305-3.patch, TEZ-2305-4.patch, TEZ-2305.1.patch, TEZ-2305.2.patch Running MR sleep job has an IOException. {code} 15/04/09 20:52:25 INFO mapreduce.Job: Job job_1428612196442_0002 failed with state FAILED due to: Vertex failed, vertexName=initialmap, vertexId=vertex_1428612196442_0002_1_00, diagnostics=[Task failed, taskId=task_1428612196442_0002_1_00_01, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.io.IOException: Undefined job output-path at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:248) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:121) at org.apache.tez.mapreduce.output.MROutput.initialize(MROutput.java:401) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:436) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:415) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ], TaskAttempt 1 failed, info=[Error: Failure while running task:java.io.IOException: Undefined job output-path at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:248) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:121) at org.apache.tez.mapreduce.output.MROutput.initialize(MROutput.java:401) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:436) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:415) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ], TaskAttempt 2 failed, info=[Error: Failure while running task:java.io.IOException: Undefined job output-path at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:248) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:121) at org.apache.tez.mapreduce.output.MROutput.initialize(MROutput.java:401) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:436) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:415) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ], TaskAttempt 3 failed, info=[Error: Failure while running task:java.io.IOException: Undefined job output-path at
[jira] [Commented] (TEZ-1675) Remove deprecated keys added in TEZ-1674
[ https://issues.apache.org/jira/browse/TEZ-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516290#comment-14516290 ] Siddharth Seth commented on TEZ-1675: - Should we just remove these in 0.7? They have been in there since 0.5. Remove deprecated keys added in TEZ-1674 Key: TEZ-1675 URL: https://issues.apache.org/jira/browse/TEZ-1675 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-711) Fix memory leak when not reading from inputs due to caching
[ https://issues.apache.org/jira/browse/TEZ-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516291#comment-14516291 ] Siddharth Seth commented on TEZ-711: Yes it is. Fix memory leak when not reading from inputs due to caching --- Key: TEZ-711 URL: https://issues.apache.org/jira/browse/TEZ-711 Project: Apache Tez Issue Type: Bug Affects Versions: 0.2.0 Reporter: Rohini Palaniswamy Assignee: Siddharth Seth Priority: Critical Attachments: OOM-threaddump-711-5-patch.txt, OOM-threaddump-till-TEZ-752.txt, TEZ-711.5.txt, TEZ-711.wip.1.txt, TEZ-711.wip.2.txt, TEZ-711.wip.3.txt, TEZ-711.wip.4.txt When you are reading from inputs and caching objects with vertex scope, you don't have to read the input again when the container is reused. But the input still allocates memory, and that leaks, causing OOMs. KeyValueReader does not have an API to close the reader and clear the allotted memory without reading from it. Also, an option to pre-close inputs in the Processor, so that the input is not fetched over the wire or shuffled/sorted at all, would be a good optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
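The leak pattern described above can be sketched as follows. This is a minimal illustration, not Tez code: `Input`, its eagerly allocated `buffer`, the vertex-scoped cache, and the `close()` method are all hypothetical stand-ins for the KeyValueReader/processor interaction the report describes, with `close()` playing the role of the release API the issue asks for.

```java
import java.util.HashMap;
import java.util.Map;

public class InputLeakSketch {
    // Stands in for an input whose buffers are allocated eagerly,
    // whether or not the processor ever reads from it.
    static class Input {
        byte[] buffer = new byte[1024];
        boolean closed = false;
        void close() { buffer = null; closed = true; }  // the missing API
    }

    // Vertex-scoped cache surviving container reuse.
    static final Map<String, String> vertexCache = new HashMap<>();

    static String process(String key, Input input) {
        String cached = vertexCache.get(key);
        if (cached != null) {
            // Container reuse: result served from cache, input never read.
            // Without an explicit close(), this input's buffers would be
            // retained -- which is the leak TEZ-711 describes.
            input.close();
            return cached;
        }
        String result = "computed:" + key;   // pretend we drained the input
        vertexCache.put(key, result);
        input.close();
        return result;
    }

    public static void main(String[] args) {
        Input first = new Input(), second = new Input();
        System.out.println(process("v1", first));   // first run computes
        System.out.println(process("v1", second));  // reuse hits the cache
        System.out.println("second input released: " + second.closed);
    }
}
```

With a release API the cached path can drop the unread input's memory immediately; without one, every reused container accumulates unread buffers until the JVM OOMs.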
[jira] [Comment Edited] (TEZ-1675) Remove deprecated keys added in TEZ-1674
[ https://issues.apache.org/jira/browse/TEZ-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516317#comment-14516317 ] Hitesh Shah edited comment on TEZ-1675 at 4/28/15 4:22 AM: --- Given that we have had just one 0.6.0 release since then, it might be worth keeping around for a release more. was (Author: hitesh): Given that we have had just one 0.6.0 release, it might be worth keeping around for a release more. Remove deprecated keys added in TEZ-1674 Key: TEZ-1675 URL: https://issues.apache.org/jira/browse/TEZ-1675 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1732) Temporary mitigation for out of order scheduling
[ https://issues.apache.org/jira/browse/TEZ-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516324#comment-14516324 ] Bikas Saha commented on TEZ-1732: - Temporary is not relevant anymore. Temporary mitigation for out of order scheduling Key: TEZ-1732 URL: https://issues.apache.org/jira/browse/TEZ-1732 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-1732) Temporary mitigation for out of order scheduling
[ https://issues.apache.org/jira/browse/TEZ-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha resolved TEZ-1732. - Resolution: Won't Fix Temporary mitigation for out of order scheduling Key: TEZ-1732 URL: https://issues.apache.org/jira/browse/TEZ-1732 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2372) TestAMRecovery failing in latest build
[ https://issues.apache.org/jira/browse/TEZ-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516376#comment-14516376 ] Jeff Zhang edited comment on TEZ-2372 at 4/28/15 5:08 AM: -- [~hitesh], Yes this is the only info I can find, even no client side log. It seems TestAMRecovery is killed before it started https://builds.apache.org/job/Tez-Build/1018/testReport/org.apache.tez.test/ was (Author: zjffdu): [~hitesh], Yes this is the only info I can find. It seems TestAMRecovery is killed before it started https://builds.apache.org/job/Tez-Build/1018/testReport/org.apache.tez.test/ TestAMRecovery failing in latest build --- Key: TEZ-2372 URL: https://issues.apache.org/jira/browse/TEZ-2372 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah https://builds.apache.org/job/Tez-Build/1018/console -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-604) Revert temporary changes made in TEZ-603
[ https://issues.apache.org/jira/browse/TEZ-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-604: Target Version/s: 0.8.0 (was: 0.7.0) Revert temporary changes made in TEZ-603 Key: TEZ-604 URL: https://issues.apache.org/jira/browse/TEZ-604 Project: Apache Tez Issue Type: Task Reporter: Siddharth Seth Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1675) Remove deprecated keys added in TEZ-1674
[ https://issues.apache.org/jira/browse/TEZ-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1675: - Target Version/s: 0.8.0 (was: 0.7.0) Remove deprecated keys added in TEZ-1674 Key: TEZ-1675 URL: https://issues.apache.org/jira/browse/TEZ-1675 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-316) [Umbrella] Address findbugs warnings in tez codebase
[ https://issues.apache.org/jira/browse/TEZ-316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-316: Target Version/s: 0.8.0 (was: 0.7.0) [Umbrella] Address findbugs warnings in tez codebase Key: TEZ-316 URL: https://issues.apache.org/jira/browse/TEZ-316 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Priority: Blocker findbugs output attached to TEZ-272. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2164) Shade the guava version used by Tez
[ https://issues.apache.org/jira/browse/TEZ-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2164: - Target Version/s: 0.8.0 (was: 0.7.0) Shade the guava version used by Tez --- Key: TEZ-2164 URL: https://issues.apache.org/jira/browse/TEZ-2164 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Priority: Critical Attachments: allow-guava-16.0.1.patch Should allow us to upgrade to a newer version without shipping a guava dependency. Would be good to do this in 0.7 so that we stop shipping guava as early as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2303) ConcurrentModificationException while processing recovery
[ https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516259#comment-14516259 ] Jeff Zhang commented on TEZ-2303: - Thanks [~hitesh]. Committed to 0.5, 0.6 and master. Created TEZ-2375 for the long term fix. ConcurrentModificationException while processing recovery - Key: TEZ-2303 URL: https://issues.apache.org/jira/browse/TEZ-2303 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jason Lowe Assignee: Jeff Zhang Attachments: TEZ-2303-1.patch, TEZ-2303-2.patch, TEZ-2303-4.patch Saw a Tez AM log a few ConcurrentModificationException messages while trying to recover from a previous attempt that crashed. Exception details to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2164) Shade the guava version used by Tez
[ https://issues.apache.org/jira/browse/TEZ-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2164: - Target Version/s: 0.7.0 (was: 0.8.0) Shade the guava version used by Tez --- Key: TEZ-2164 URL: https://issues.apache.org/jira/browse/TEZ-2164 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Priority: Critical Attachments: allow-guava-16.0.1.patch Should allow us to upgrade to a newer version without shipping a guava dependency. Would be good to do this in 0.7 so that we stop shipping guava as early as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1908) Analyse and fix javac warnings in tez codebase
[ https://issues.apache.org/jira/browse/TEZ-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1908: - Target Version/s: 0.8.0 (was: 0.7.0) Analyse and fix javac warnings in tez codebase --- Key: TEZ-1908 URL: https://issues.apache.org/jira/browse/TEZ-1908 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Priority: Critical https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/patchJavacWarnings.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2266) Synchronization in VertexImpl etc. broken
[ https://issues.apache.org/jira/browse/TEZ-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2266: - Target Version/s: 0.7.0 Synchronization in VertexImpl etc. broken - Key: TEZ-2266 URL: https://issues.apache.org/jira/browse/TEZ-2266 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.3 Reporter: Bikas Saha Priority: Critical There is mixed usage of synchronized blocks and a read-write lock which are not mutually exclusive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
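The problem stated above can be demonstrated directly: a `synchronized` block and a `ReentrantReadWriteLock` provide no mutual exclusion against each other, because the two mechanisms know nothing of one another. A minimal sketch, not Tez code (the monitor object and latch are illustrative); one thread sits inside a synchronized block while another holds the write lock, and both occupy their "critical sections" at the same time:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class MixedLockingDemo {
    private static final Object monitor = new Object();
    private static final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    public static void main(String[] args) throws InterruptedException {
        // Reaches zero only when BOTH threads are inside their critical
        // sections simultaneously; if the locks excluded each other,
        // the program would deadlock here instead of terminating.
        CountDownLatch bothInside = new CountDownLatch(2);

        Thread t1 = new Thread(() -> {
            synchronized (monitor) {          // "protected" by the monitor
                bothInside.countDown();
                await(bothInside);            // wait for t2 to also be inside
            }
        });
        Thread t2 = new Thread(() -> {
            rwLock.writeLock().lock();        // "protected" by the write lock
            try {
                bothInside.countDown();
                await(bothInside);
            } finally {
                rwLock.writeLock().unlock();
            }
        });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("both critical sections entered concurrently: "
            + (bothInside.getCount() == 0));
    }

    private static void await(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Any state guarded by the synchronized blocks in one code path and by the read-write lock in another is therefore effectively unguarded; the fix is to pick one mechanism for each piece of shared state.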
[jira] [Commented] (TEZ-1732) Temporary mitigation for out of order scheduling
[ https://issues.apache.org/jira/browse/TEZ-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516261#comment-14516261 ] Hitesh Shah commented on TEZ-1732: -- [~bikassaha] [~sseth] Mind setting a target version? Temporary mitigation for out of order scheduling Key: TEZ-1732 URL: https://issues.apache.org/jira/browse/TEZ-1732 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2303) ConcurrentModificationException while processing recovery
[ https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516200#comment-14516200 ] Jeff Zhang edited comment on TEZ-2303 at 4/28/15 2:22 AM: -- [~hitesh] Yes, I think it makes sense as the short term fix; at least it fixes the ConcurrentModificationException, so the recovery process can keep going. Regarding the issue of not providing info to clients until the recovery phase is over, I think there are two main scenarios: * ClientHandler RPC is started but the recovery log is not yet read. In this case, the AM will throw a "No running dag" exception, with no effect on the client side, so I think it is OK. {code} 2015-04-28 09:32:02,054 INFO [IPC Server handler 0 on 6000] ipc.Server: IPC Server handler 0 on 6000, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.1:63539 Call#9557 Retry#0 org.apache.tez.dag.api.TezException: No running dag at present at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:89) at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:156) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:95) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7465) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) {code} * The second scenario is that even after the recovery log is read, the RecoveryTransition may not have completed. Then the client side may still get a wrong dag status. As I mentioned, this may need some big changes to the recovery code. We can leave it for the future and take it into account when refactoring the recovery code. was (Author: zjffdu): [~hitesh] Yes I think it make sense for the short term fix as least it fix the ConcurrentModificationException. Regarding the issue of not providing info to clients until the recovery phase is over, I think there are 2 main scenario: * ClientHandler RPC is started but recovery log is not read. In this case, it will throw No dag running exception in AM, no effect on the client side. so I think it is OK. {code} 2015-04-28 09:32:02,054 INFO [IPC Server handler 0 on 6000] ipc.Server: IPC Server handler 0 on 6000, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.1:63539 Call#9557 Retry#0 org.apache.tez.dag.api.TezException: No running dag at present at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:89) at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:156) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:95) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7465) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) {code} * The second scenario is that even the recovery log is read, the RecoveryTransition may not have completed. Then the client side may still get wrong dag status. As I mentioned, this may need some big change on the recovery. We can leave it in future and take it into account when refactoring the recovery code. ConcurrentModificationException while processing recovery - Key: TEZ-2303 URL: https://issues.apache.org/jira/browse/TEZ-2303 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jason Lowe Assignee: Jeff Zhang Attachments: TEZ-2303-1.patch, TEZ-2303-2.patch, TEZ-2303-4.patch Saw a Tez AM log a few ConcurrentModificationException messages while trying to recover from a previous attempt that crashed. Exception details to follow.
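For reference, the failure mode TEZ-2303 reports, a ConcurrentModificationException from mutating a collection while another code path iterates it, is easy to reproduce, and snapshotting before iteration is one common short-term remedy. A generic sketch, not the actual Tez recovery code (the event names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ConcurrentModificationException;

public class CmeDemo {
    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        events.add("VERTEX_STARTED");
        events.add("TASK_STARTED");
        boolean caught = false;
        try {
            for (String e : events) {
                // Mutating the list inside the for-each loop trips the
                // fail-fast iterator's modCount check on the next next().
                if (e.startsWith("VERTEX")) {
                    events.add("TASK_ATTEMPT_STARTED");
                }
            }
        } catch (ConcurrentModificationException ex) {
            caught = true;
        }
        System.out.println("CME caught: " + caught);

        // One common remedy: iterate over a snapshot so mutations of the
        // live list cannot invalidate the iterator.
        for (String e : new ArrayList<>(events)) {
            events.add(e + "_COPY");  // safe: we iterate the snapshot
        }
        System.out.println("final size: " + events.size());
    }
}
```

Snapshotting trades a copy per traversal for safety; concurrent collections or coarser locking are the usual longer-term fixes, which is the kind of restructuring deferred to TEZ-2375 and the recovery refactoring.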
[jira] [Created] (TEZ-2375) Don't return dag status to client when dag is still in recovering
Jeff Zhang created TEZ-2375: --- Summary: Don't return dag status to client when dag is still in recovering Key: TEZ-2375 URL: https://issues.apache.org/jira/browse/TEZ-2375 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Should only return the dag status to the client after the whole recovery process is done (DAG/Vertex/Task/TaskAttempt are all recovered to their correct states). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
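The guard this issue proposes, refusing status queries until recovery has fully completed, can be sketched as follows. All names here (RecoveryState, DagStatusService, markRecovered) are hypothetical illustrations, not actual Tez classes; the real change would live in the DAGClientHandler path shown in the stack trace above.

```java
enum RecoveryState { RECOVERING, RECOVERED }

class DagStatusService {
    // volatile: the recovery thread flips this, RPC handler threads read it
    private volatile RecoveryState state = RecoveryState.RECOVERING;

    void markRecovered() { state = RecoveryState.RECOVERED; }

    String getDAGStatus() {
        if (state != RecoveryState.RECOVERED) {
            // Mirror the existing "No running dag" behavior rather than
            // returning a possibly-wrong transitional status.
            throw new IllegalStateException("No running dag at present");
        }
        return "SUCCEEDED";  // placeholder for the real recovered status
    }
}

public class RecoveryGuardDemo {
    public static void main(String[] args) {
        DagStatusService svc = new DagStatusService();
        boolean rejected = false;
        try {
            svc.getDAGStatus();           // query arrives mid-recovery
        } catch (IllegalStateException e) {
            rejected = true;
        }
        System.out.println("query during recovery rejected: " + rejected);
        svc.markRecovered();              // recovery transition completes
        System.out.println("status after recovery: " + svc.getDAGStatus());
    }
}
```

Rejecting early queries with the same exception clients already handle keeps the change client-compatible: callers retry exactly as they do today when no dag is running.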