[jira] [Updated] (TEZ-1274) Remove Key/Value type checks in IFile
[ https://issues.apache.org/jira/browse/TEZ-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1274: -- Fix Version/s: 0.7.0 Remove Key/Value type checks in IFile - Key: TEZ-1274 URL: https://issues.apache.org/jira/browse/TEZ-1274 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Rajesh Balamohan Fix For: 0.7.0 Attachments: TEZ-1274.1.patch We check key and value types for each record - this should be removed from the tight loop. Maybe an assertion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1931) Publish tez version info to Timeline
Hitesh Shah created TEZ-1931: Summary: Publish tez version info to Timeline Key: TEZ-1931 URL: https://issues.apache.org/jira/browse/TEZ-1931 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Priority: Critical We are not publishing any version info to Timeline. This will be useful to compare different dags/apps over time and also to catch issues if needed.
[jira] [Commented] (TEZ-1931) Publish tez version info to Timeline
[ https://issues.apache.org/jira/browse/TEZ-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270392#comment-14270392 ] Hitesh Shah commented on TEZ-1931: -- [~jeagles] Should have a patch soon. Are you ok with putting this into 0.6.0?
[jira] [Created] (TEZ-1932) Add Prakash Ramachandran to team list
Prakash Ramachandran created TEZ-1932: - Summary: Add Prakash Ramachandran to team list Key: TEZ-1932 URL: https://issues.apache.org/jira/browse/TEZ-1932 Project: Apache Tez Issue Type: Bug Reporter: Prakash Ramachandran Assignee: Prakash Ramachandran Priority: Minor
[jira] [Updated] (TEZ-1932) Add Prakash Ramachandran to team list
[ https://issues.apache.org/jira/browse/TEZ-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-1932: -- Attachment: TEZ-1932.1.patch [~rajesh.balamohan] can you review?
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270281#comment-14270281 ] Siddharth Seth commented on TEZ-1923: - +1. Looks good. Thanks [~rajesh.balamohan] Minor:
{code}
+ LOG.info("Starting inMemoryMerger's merge since commitMemory=" + commitMemory + ", mergeThreshold=" + mergeThreshold + ". Current usedMemory=" + usedMemory);
{code}
This log line can be misleading, since startMemToDiskMerge may not start the merge if another is already running. It should be placed after the condition check in startMemToDiskMerge.
FetcherOrderedGrouped gets into infinite loop due to memory pressure Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch
- Ran a comparatively large job (temp table creation) at 10 TB scale.
- Turned on intermediate mem-to-mem (tez.runtime.shuffle.memory-to-memory.enable=true and tez.runtime.shuffle.memory-to-memory.segments=4)
- Some reducers get lots of data and quickly get into an infinite loop
{code}
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
{code}
Additional debug/patch statements revealed that InMemoryMerge is not invoked appropriately and does not release the memory back for fetchers to proceed.
e.g. debug/patch messages are given below
{code}
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, mergeThreshold=708669632
=== InMemoryMerge would be started in this case, as commitMemory >= mergeThreshold
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, mergeThreshold=708669632
=== InMemoryMerge would *NOT* be started in this case, as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory.
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO [fetcher [Map_1] #1] orderedgrouped.MergeManager: Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, mergeThreshold=708669632
=== InMemoryMerge would *NOT* be started in this case, as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory.
{code}
In MergeManager, in-memory merging is invoked under the following condition:
{code}
if (!inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold)
{code}
Attaching the sample hive command just for reference
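The gap in the logs above, where commitMemory stays below mergeThreshold while usedMemory has already exceeded memoryLimit, can be sketched as follows. This is a hypothetical illustration of the trigger logic only; the class, field, and method names are invented and this is not the actual MergeManager code or the committed fix.

```java
// Hypothetical sketch of the merge-trigger decision discussed above.
// Names are illustrative, not the real MergeManager API.
class MergeTrigger {
    final long usedMemory;
    final long memoryLimit;
    final long commitMemory;
    final long mergeThreshold;
    final boolean mergeInProgress;

    MergeTrigger(long used, long limit, long commit, long threshold, boolean inProgress) {
        this.usedMemory = used;
        this.memoryLimit = limit;
        this.commitMemory = commit;
        this.mergeThreshold = threshold;
        this.mergeInProgress = inProgress;
    }

    // Original condition: merge only when committed segments reach the threshold.
    boolean shouldStartMergeOriginal() {
        return !mergeInProgress && commitMemory >= mergeThreshold;
    }

    // Amended condition: also start a merge when fetchers have pushed total
    // usage past the limit, so that Status.WAIT fetchers can eventually proceed.
    boolean shouldStartMergeAmended() {
        return !mergeInProgress
            && (commitMemory >= mergeThreshold || usedMemory > memoryLimit);
    }
}
```

Plugging in the second logged case (usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, mergeThreshold=708669632), the original condition stays false while the amended one fires, which is exactly the hang described above.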
[jira] [Commented] (TEZ-1932) Add Prakash Ramachandran to team list
[ https://issues.apache.org/jira/browse/TEZ-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270516#comment-14270516 ] Rajesh Balamohan commented on TEZ-1932: --- +1
[jira] [Commented] (TEZ-1421) MRCombiner throws NPE in MapredWordCount on master branch
[ https://issues.apache.org/jira/browse/TEZ-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270711#comment-14270711 ] Tsuyoshi OZAWA commented on TEZ-1421: - Sure, wait a moment.
MRCombiner throws NPE in MapredWordCount on master branch - Key: TEZ-1421 URL: https://issues.apache.org/jira/browse/TEZ-1421 Project: Apache Tez Issue Type: Bug Reporter: Tsuyoshi OZAWA Priority: Blocker
I tested MapredWordCount against 70GB generated by RandomTextWriter. When a Combiner runs, it throws an NPE. It looks like setCombinerClass doesn't work correctly.
{quote}
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
at org.apache.tez.mapreduce.combine.MRCombiner.runOldCombiner(MRCombiner.java:122)
at org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:112)
at org.apache.tez.runtime.library.common.shuffle.impl.MergeManager.runCombineProcessor(MergeManager.java:472)
at org.apache.tez.runtime.library.common.shuffle.impl.MergeManager$InMemoryMerger.merge(MergeManager.java:605)
at org.apache.tez.runtime.library.common.shuffle.impl.MergeThread.run(MergeThread.java:89)
{quote}
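The stack trace above bottoms out in ReflectionUtils.newInstance, which is consistent with a null combiner class reaching the reflective instantiation (e.g. if the setCombinerClass setting never made it into the intermediate config). A minimal, hypothetical sketch of a defensive guard, not the actual Tez or Hadoop code, would fail fast with a descriptive error instead of an opaque NPE:

```java
// Hypothetical sketch: guard reflective instantiation against a null class,
// the failure mode suggested by the NPE above. Not the actual MRCombiner fix.
class SafeReflect {
    static <T> T newInstance(Class<T> clazz) {
        if (clazz == null) {
            // Surface the configuration problem at the call site.
            throw new IllegalArgumentException(
                "Combiner class is not set; check that setCombinerClass was propagated");
        }
        try {
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Failed to instantiate " + clazz.getName(), e);
        }
    }
}
```

The point of the guard is purely diagnostic: the job still fails, but the message names the missing setting rather than pointing deep inside the reflection utility.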
[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1923: -- Attachment: TEZ-1923.4.patch Right; incorporated it in the latest patch. Will check in to master asap. Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, TEZ-1923.4.patch
[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1923: -- Fix Version/s: 0.7.0
[jira] [Commented] (TEZ-1931) Publish tez version info to Timeline
[ https://issues.apache.org/jira/browse/TEZ-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270637#comment-14270637 ] Jonathan Eagles commented on TEZ-1931: -- This will be good to go into 0.6.0, [~hitesh]
[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1923: -- Attachment: TEZ-1923.3.patch Addressing review comments.
[jira] [Resolved] (TEZ-1573) Exception from InputInitializer and VertexManagerPlugin is not propagated to client
[ https://issues.apache.org/jira/browse/TEZ-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang resolved TEZ-1573. - Resolution: Won't Fix Fixed in other jiras. Exception from InputInitializer and VertexManagerPlugin is not propagated to client --- Key: TEZ-1573 URL: https://issues.apache.org/jira/browse/TEZ-1573 Project: Apache Tez Issue Type: Sub-task Reporter: Jeff Zhang Assignee: Jeff Zhang
[jira] [Created] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat
Rajesh Balamohan created TEZ-1929: - Summary: AM intermittently sending kill signal to running task in heartbeat Key: TEZ-1929 URL: https://issues.apache.org/jira/browse/TEZ-1929 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Observed this behavior 3 or 4 times - Ran a hive query with tez (query_17 at 10 TB scale) - Occasionally, Map_7 task will get into failed state in the middle of fetching data from other sources (only one task is available in Map_7). {code} 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: Completed fetch for attempt: InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, TimeTaken=5, Rate=1.29 MB/s 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All inputs fetched for input vertex : Map 6 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s) 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: Shutting down FetchScheduler, Was Interrupted: false 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: Scheduler thread completed 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Received should die response from AM 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked to die via task heartbeat 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while waiting for task to complete. Interrupting task 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... returning 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie notification via hearbeats. 
Shutting down 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048) at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120) at org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83) at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171) {code} From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat is unable to identify the containerId from registeredContainers. Need to verify this. I will attach the sample task log and the tez-ui details.
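The suspicion above, that the heartbeat handler answers "should die" when it cannot find the containerId among its registered containers, can be sketched as follows. This is a hypothetical illustration of that decision shape only; the class and method names are invented and this is not the actual TaskAttemptListenerImpTezDag code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the heartbeat decision described above.
// Names are illustrative, not the real Tez AM API.
class HeartbeatHandler {
    private final Set<String> registeredContainers = new HashSet<>();

    void register(String containerId)   { registeredContainers.add(containerId); }
    void unregister(String containerId) { registeredContainers.remove(containerId); }

    // Returns true when the AM tells the task to terminate ("should die").
    // If an unregister races with an in-flight heartbeat, a live task can
    // receive shouldDie — the intermittent kill reported in this issue.
    boolean shouldDie(String containerId) {
        return !registeredContainers.contains(containerId);
    }
}
```

Under this model the bug would be a registration/heartbeat race in the AM rather than anything the task did wrong, which matches the task-side logs showing a healthy task being told to die.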
[jira] [Updated] (TEZ-1274) Remove Key/Value type checks in IFile
[ https://issues.apache.org/jira/browse/TEZ-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1274: -- Attachment: TEZ-1274.1.patch [~sseth] Please review when you find time. Removed unwanted checks. It is left to the caller to ensure that proper type and length checks are done (since it's all in Tez code, it should be fine to remove these checks). We also don't want the length checks: if negative key/value lengths are passed, the output stream would throw IndexOutOfBoundsException anyway.
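The "maybe an assertion" idea from the issue description can be sketched as below: the per-record runtime type checks become Java assert statements, which cost nothing unless the JVM runs with -ea. This is a hypothetical illustration with invented names, not the actual IFile.Writer code or the attached patch.

```java
// Hypothetical sketch: replace per-record type checks in a tight write loop
// with assertions, as proposed above. Not the actual IFile code.
class RecordWriter {
    private final Class<?> keyClass;
    private final Class<?> valueClass;
    private int recordsWritten = 0;

    RecordWriter(Class<?> keyClass, Class<?> valueClass) {
        this.keyClass = keyClass;
        this.valueClass = valueClass;
    }

    void append(Object key, Object value) {
        // Disabled unless the JVM runs with -ea, so the hot loop pays nothing
        // in production; callers are trusted to pass the declared types.
        assert key.getClass() == keyClass : "wrong key class: " + key.getClass();
        assert value.getClass() == valueClass : "wrong value class: " + value.getClass();
        // ... serialize key/value to the output stream here ...
        recordsWritten++;
    }

    int getRecordsWritten() { return recordsWritten; }
}
```

This mirrors the rationale in the update: since all callers live inside Tez, a class mismatch is a programming error worth catching in test runs (with -ea) rather than a condition to re-verify on every record.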
[jira] [Updated] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat
[ https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1929: -- Attachment: tasklog.txt, Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot 2015-01-08 at 2.28.04 PM.png
[jira] [Commented] (TEZ-1904) Fix findbugs warnings in tez-runtime-library
[ https://issues.apache.org/jira/browse/TEZ-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269485#comment-14269485 ] Rajesh Balamohan commented on TEZ-1904: --- Minor comments:
- In PipelinedSorter, the comparator isn't used in SpanMerger's constructor. Should we remove it?
- In SecureShuffleUtils, toHex() is not used anywhere. Should we remove it, if relevant?
- IFileInputStream.getChecksum() is not used anywhere. Should we remove it, if relevant?
Fix findbugs warnings in tez-runtime-library Key: TEZ-1904 URL: https://issues.apache.org/jira/browse/TEZ-1904 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Siddharth Seth Attachments: TEZ-1904.1.txt https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html
[jira] [Updated] (TEZ-1904) Fix findbugs warnings in tez-runtime-library
[ https://issues.apache.org/jira/browse/TEZ-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1904: Attachment: TEZ-1904.2.txt Thanks for taking a look. Updated patch with comments addressed. Fix findbugs warnings in tez-runtime-library Key: TEZ-1904 URL: https://issues.apache.org/jira/browse/TEZ-1904 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Siddharth Seth Attachments: TEZ-1904.1.txt, TEZ-1904.2.txt https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html
[jira] [Updated] (TEZ-1904) Fix findbugs warnings in tez-runtime-library
[ https://issues.apache.org/jira/browse/TEZ-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1904: Attachment: TEZ-1904.3.txt Updated to fix some more inconsistent_sync warnings. Fix findbugs warnings in tez-runtime-library Key: TEZ-1904 URL: https://issues.apache.org/jira/browse/TEZ-1904 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Siddharth Seth Attachments: TEZ-1904.1.txt, TEZ-1904.2.txt, TEZ-1904.3.txt https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269939#comment-14269939 ] Siddharth Seth commented on TEZ-1923: - {code}
+    if (usedMemory > memoryLimit) {
+      LOG.info("Starting inMemoryMerger's merge since usedMemory=" + usedMemory
+          + " > memoryLimit=" + memoryLimit + ". commitMemory=" + commitMemory
+          + ", mergeThreshold=" + mergeThreshold);
+      startMemToDiskMerge();
+    }
{code} This will, at best, attempt to start the memToDiskMerger - there's no guarantee that it'll actually run, since one may already be in progress. It also ends up not waiting for the MemToMemMerger to complete - which would free up some memory - and could potentially trigger another merge based on thresholds. The usedMemory seen at this point is determined by a race between the current thread and the mem-to-mem merge thread (whether the unconditional reserve has been done yet or not). Meanwhile, fetchers block in any case - since memory isn't available. I think it's better to leave this section of the patch out - to be fixed in the MemToMem merger jiras. {code}
+    if ((usedMemory + mergeOutputSize) > memoryLimit) {
+      LOG.info("Not enough memory to carry out mem-to-mem merging. usedMemory=" + usedMemory
+          + ", memoryLimit=" + memoryLimit);
+      return;
+    }
{code} usedMemory may not be visible correctly here - since the check isn't inside the main MergeManager lock. This could also be part of the MemToMemMerger fixes. {code}merger.waitForShuffleToMergeMemory();{code} Would this be a problem in terms of connection timeouts, since this wait happens while the connection is established? This could be in the run() method, similar to merger.waitForInMemoryMerge(), instead.
FetcherOrderedGrouped gets into infinite loop due to memory pressure Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch - Ran a comparatively large job (temp table creation) at 10 TB scale. - Turned on intermediate mem-to-mem merging (tez.runtime.shuffle.memory-to-memory.enable=true and tez.runtime.shuffle.memory-to-memory.segments=4) - Some reducers get lots of data and quickly get into an infinite loop {code}
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms
2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ...
{code} Additional debug/patch statements revealed that InMemoryMerge is not invoked appropriately and is not releasing memory back for fetchers to proceed. e.g. debug/patch messages are
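The locking concern raised in the review comment above can be illustrated with a minimal sketch. This is not Tez's actual MergeManager; the class and method names here are invented for illustration. The point is that the memory accounting, the threshold check, and the fetcher wait all go through one lock, so the check can never observe a stale usedMemory, and blocked fetchers wake as soon as a merge releases memory instead of spinning on Status.WAIT:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch: one lock guards usedMemory, so threshold checks and
// reservations can never race against a merger thread's release.
class MergeMemoryTracker {
    private final long memoryLimit;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition memoryFreed = lock.newCondition();
    private long usedMemory = 0;

    MergeMemoryTracker(long memoryLimit) {
        this.memoryLimit = memoryLimit;
    }

    // Fetcher side: block until the reservation fits under the limit, or time
    // out. A timeout here corresponds to the Status.WAIT retry seen in the logs.
    boolean tryReserve(long size, long timeoutMs) {
        lock.lock();
        try {
            long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            while (usedMemory + size > memoryLimit) {
                long leftNanos = deadlineNanos - System.nanoTime();
                if (leftNanos <= 0) {
                    return false; // caller retries later
                }
                try {
                    memoryFreed.awaitNanos(leftNanos);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            usedMemory += size;
            return true;
        } finally {
            lock.unlock();
        }
    }

    // Merger side: release merged segments under the same lock and wake fetchers.
    void release(long size) {
        lock.lock();
        try {
            usedMemory -= size;
            memoryFreed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    long used() {
        lock.lock();
        try {
            return usedMemory;
        } finally {
            lock.unlock();
        }
    }
}
```

With this structure, a merge that never runs (or never releases) leaves fetchers timing out rather than reading a stale counter, which makes the "merge not invoked, memory never released" symptom in the bug report directly observable.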
[jira] [Commented] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat
[ https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269943#comment-14269943 ] Siddharth Seth commented on TEZ-1929: - [~rajesh.balamohan] - do you have the AM logs as well? This could be a result of pre-emption - either because the wrong task is running, or because YARN decided the application is over its resource limit. AM intermittently sending kill signal to running task in heartbeat -- Key: TEZ-1929 URL: https://issues.apache.org/jira/browse/TEZ-1929 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot 2015-01-08 at 2.28.04 PM.png, tasklog.txt