[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513802#comment-14513802 ]

TezQA commented on TEZ-2358:
----------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12728348/TEZ-2358.3.patch
against master revision 2935ef4.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests.

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/540//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/540//console

This message is automatically generated.

Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
-------------------------------------------------------------------------

Key: TEZ-2358
URL: https://issues.apache.org/jira/browse/TEZ-2358
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Gopal V
Assignee: Rajesh Balamohan
Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, TEZ-2358.3.patch, TEZ-2358.4.patch, syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2

The Tez MergeManager code assumes that the src-task-id is unique between merge operations; this causes confusion when two merge sequences have to process output from the same src-task-id.

{code}
private TezRawKeyValueIterator finalMerge(Configuration job, FileSystem fs,
    List<MapOutput> inMemoryMapOutputs, List<FileChunk> onDiskMapOutputs ...

  if (inMemoryMapOutputs.size() > 0) {
    int srcTaskId = inMemoryMapOutputs.get(0).getAttemptIdentifier()
        .getInputIdentifier().getInputIndex();
    ...
    // must spill to disk, but can't retain in-mem for intermediate merge
    final Path outputPath = mapOutputFile.getInputFileForWrite(srcTaskId,
        inMemToDiskBytes).suffix(Constants.MERGED_OUTPUT_PREFIX);
    ...
{code}

This, or some related scenario, results in the following FileChunks list, which contains identically named paths with different lengths.

{code}
2015-04-23 03:28:50,983 INFO [MemtoDiskMerger [Map_1]] orderedgrouped.MergeManager: Initiating in-memory merge with 6 segments...
2015-04-23 03:28:50,987 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Merging 6 sorted segments
2015-04-23 03:28:50,988 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Down to the last merge-pass, with 6 segments left of total size: 1165944755 bytes
2015-04-23 03:28:58,495 INFO [MemtoDiskMerger [Map_1]] orderedgrouped.MergeManager: attempt_1429683757595_0141_1_01_000143_0_10027 Merge of the 6 files in-memory complete. Local file is /grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out.merged of size 785583965
2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: finalMerge called with 0 in-memory map-outputs and 5 on-disk map-outputs
2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 365232290 += 365232290for/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_1023.out
2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 730529899 += 365297609for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 1095828683 += 365298784for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
{code}

The multiple instances of 404.out indicate that we pulled two pipelined chunks of the same shuffle src id, once into memory and twice onto disk.

{code}
2015-04-23 03:28:08,256 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: attempt_1429683757595_0141_1_00_000404_0_10009_0, runDuration: 0]
2015-04-23 03:28:08,270 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: attempt_1429683757595_0141_1_00_000404_0_10009_1, runDuration: 0]
{code}
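A minimal sketch of the naming collision described above, with hypothetical method names rather than the actual Tez API: when the merge-output file name is derived from the source task index alone, two pipelined spill chunks of the same source task (e.g. index 404) map to the same path; folding a spill number into the name keeps them distinct.

```java
import java.util.HashSet;
import java.util.Set;

public class MergePathSketch {
    // Mimics a path like ..._spill_404.out keyed only by the source task index.
    static String pathBySrcTask(int srcTaskId) {
        return "attempt_x_spill_" + srcTaskId + ".out";
    }

    // Variant that also folds in the spill (chunk) number, so two merges
    // consuming chunks of the same source task get distinct files.
    static String pathBySrcTaskAndSpill(int srcTaskId, int spillId) {
        return "attempt_x_spill_" + srcTaskId + "_" + spillId + ".out";
    }

    public static void main(String[] args) {
        Set<String> paths = new HashSet<>();
        // Two pipelined chunks from source task 404 (spill 0 and spill 1).
        paths.add(pathBySrcTask(404));
        boolean secondAdded = paths.add(pathBySrcTask(404));
        // Set.add returns false on a duplicate: the second chunk collides.
        System.out.println("collision without spill id: " + !secondAdded);

        paths.clear();
        paths.add(pathBySrcTaskAndSpill(404, 0));
        boolean distinct = paths.add(pathBySrcTaskAndSpill(404, 1));
        System.out.println("distinct with spill id: " + distinct);
    }
}
```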
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513830#comment-14513830 ]

TezQA commented on TEZ-2358:
----------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12728354/TEZ-2358.4.patch
against master revision 2935ef4.

{color:green}+1{color} on @author, tests included, javac, javadoc, findbugs (version 2.0.3), release audit, and core tests.

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/541//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/541//console

This message is automatically generated.
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514797#comment-14514797 ]

Gopal V commented on TEZ-2358:
------------------------------

[~hitesh]: marking this as a blocker for 0.7.x, because it causes task failures for long-running jobs. (Priority: Blocker)

[~rajesh.balamohan]: Patch LGTM - +1
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513650#comment-14513650 ]

Gopal V commented on TEZ-2358:
------------------------------

The patch was again tested at 10Tb scale over the weekend - there seem to be no collisions in naming.

I looked at the logs and noticed that some earlier tasks did succeed despite the duplicate naming, because there were only a few spills, which ended up split across different disks and so did not collide in paths.

But for the sake of preventing future breakage, it would help to trigger an error when someone violates the no-duplicate rule for onDiskMapOutputs (i.e. no two file chunks for merging can start at the same offset of the same file).

My original pre-conditions were wrong once auto-reduce parallelism kicks in, since we then want to merge off a DISK_DIRECT input across two reducers - which means different index points into the same DISK_DIRECT file.
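The guard described above could be sketched roughly as follows - a hypothetical stand-in, not Tez's actual FileChunk API: it rejects two merge inputs that start at the same offset of the same file, while still permitting different index points into one DISK_DIRECT file (the auto-reduce-parallelism case).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FileChunkGuard {
    // Hypothetical simplification of a shuffle file chunk.
    record Chunk(String path, long offset, long length) {}

    static void checkNoDuplicateChunks(List<Chunk> onDiskMapOutputs) {
        Set<String> seen = new HashSet<>();
        for (Chunk c : onDiskMapOutputs) {
            // Key on (path, offset): the same path at two different offsets
            // is legal (two index points into one DISK_DIRECT file), but the
            // same path at the same offset means a duplicate merge input.
            String key = c.path() + "@" + c.offset();
            if (!seen.add(key)) {
                throw new IllegalArgumentException(
                    "Duplicate merge input: " + key);
            }
        }
    }
}
```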
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513742#comment-14513742 ]

Gopal V commented on TEZ-2358:
------------------------------

[~rajesh.balamohan]: minor nit on the checkArgument (not-or-not) pattern - it gets complex to add another condition later.
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510894#comment-14510894 ]

Gopal V commented on TEZ-2358:
------------------------------

[~rajesh.balamohan]: The patch is looking good in tests. The ETL succeeded (Time taken: 14184.052 seconds) with this fix, and there were no task failures due to full disks - also tested with local.fetch/shared.fetch turned on.

FYI, I added a rough precondition to throw an exception if a duplicate name was added to FileChunks, which pretty much gave me the confidence that this isn't hit anymore.
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511227#comment-14511227 ]

TezQA commented on TEZ-2358:
----------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12727924/TEZ-2358.2.patch
against master revision 778c1f5.

{color:green}+1{color} on @author, tests included, javac, javadoc, findbugs (version 2.0.3), release audit, and core tests.

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/529//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/529//console

This message is automatically generated.