[jira] [Updated] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat

2015-01-28 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1929:

Attachment: TEZ-1929.3.patch

> AM intermittently sending kill signal to running task in heartbeat
> --
>
> Key: TEZ-1929
> URL: https://issues.apache.org/jira/browse/TEZ-1929
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Rajesh Balamohan
>Assignee: Bikas Saha
> Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot 
> 2015-01-08 at 2.28.04 PM.png, TEZ-1929.1.patch, TEZ-1929.2.patch, 
> TEZ-1929.3.patch, applog.txt.gz, tasklog.txt
>
>
> Observed this behavior 3 or 4 times
> - Ran a hive query with tez (query_17 at 10 TB scale)
> - Occasionally, Map_7 task will get into failed state in the middle of 
> fetching data from other sources (only one task is available in Map_7).  
> {code}
> 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: 
> Completed fetch for attempt: InputAttemptIdentifier 
> [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, 
> pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, 
> CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, 
> TimeTaken=5, Rate=1.29 MB/s
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All 
> inputs fetched for input vertex : Map 6
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 
> of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Shutting down FetchScheduler, Was Interrupted: false
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Scheduler thread completed
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: 
> Received should die response from AM
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked 
> to die via task heartbeat
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while 
> waiting for task to complete. Interrupting task
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... 
> returning
> 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie 
> notification via hearbeats. Shutting down
> 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: 
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>   at 
> org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
>   at 
> org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
>   at 
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
> {code}
> From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat 
> is unable to identify the containerId from registeredContainers.  Need to 
> verify this.
> I will attach the sample task log and the tez-ui details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat

2015-01-27 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1929:

Target Version/s: 0.6.1

This doesn't affect the 0.5.x line (TEZ-1773). Targeting 0.6.

> AM intermittently sending kill signal to running task in heartbeat
> --
>
> Key: TEZ-1929
> URL: https://issues.apache.org/jira/browse/TEZ-1929
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Rajesh Balamohan
>Assignee: Bikas Saha
> Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot 
> 2015-01-08 at 2.28.04 PM.png, TEZ-1929.1.patch, TEZ-1929.2.patch, 
> applog.txt.gz, tasklog.txt
>
>
> Observed this behavior 3 or 4 times
> - Ran a hive query with tez (query_17 at 10 TB scale)
> - Occasionally, Map_7 task will get into failed state in the middle of 
> fetching data from other sources (only one task is available in Map_7).  
> {code}
> 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: 
> Completed fetch for attempt: InputAttemptIdentifier 
> [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, 
> pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, 
> CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, 
> TimeTaken=5, Rate=1.29 MB/s
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All 
> inputs fetched for input vertex : Map 6
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 
> of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Shutting down FetchScheduler, Was Interrupted: false
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Scheduler thread completed
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: 
> Received should die response from AM
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked 
> to die via task heartbeat
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while 
> waiting for task to complete. Interrupting task
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... 
> returning
> 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie 
> notification via hearbeats. Shutting down
> 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: 
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>   at 
> org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
>   at 
> org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
>   at 
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
> {code}
> From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat 
> is unable to identify the containerId from registeredContainers.  Need to 
> verify this.
> I will attach the sample task log and the tez-ui details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat

2015-01-27 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1929:

Affects Version/s: 0.6.0

> AM intermittently sending kill signal to running task in heartbeat
> --
>
> Key: TEZ-1929
> URL: https://issues.apache.org/jira/browse/TEZ-1929
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Rajesh Balamohan
>Assignee: Bikas Saha
> Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot 
> 2015-01-08 at 2.28.04 PM.png, TEZ-1929.1.patch, TEZ-1929.2.patch, 
> applog.txt.gz, tasklog.txt
>
>
> Observed this behavior 3 or 4 times
> - Ran a hive query with tez (query_17 at 10 TB scale)
> - Occasionally, Map_7 task will get into failed state in the middle of 
> fetching data from other sources (only one task is available in Map_7).  
> {code}
> 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: 
> Completed fetch for attempt: InputAttemptIdentifier 
> [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, 
> pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, 
> CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, 
> TimeTaken=5, Rate=1.29 MB/s
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All 
> inputs fetched for input vertex : Map 6
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 
> of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Shutting down FetchScheduler, Was Interrupted: false
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Scheduler thread completed
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: 
> Received should die response from AM
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked 
> to die via task heartbeat
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while 
> waiting for task to complete. Interrupting task
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... 
> returning
> 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie 
> notification via hearbeats. Shutting down
> 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: 
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>   at 
> org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
>   at 
> org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
>   at 
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
> {code}
> From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat 
> is unable to identify the containerId from registeredContainers.  Need to 
> verify this.
> I will attach the sample task log and the tez-ui details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat

2015-01-09 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1929:

Attachment: TEZ-1929.2.patch

Updated patch with e2e test that should help catch such regressions in the 
future. Different parts were unit tested but one of the unit tests was not 
updated when the e2e broke.

> AM intermittently sending kill signal to running task in heartbeat
> --
>
> Key: TEZ-1929
> URL: https://issues.apache.org/jira/browse/TEZ-1929
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
> Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot 
> 2015-01-08 at 2.28.04 PM.png, TEZ-1929.1.patch, TEZ-1929.2.patch, 
> applog.txt.gz, tasklog.txt
>
>
> Observed this behavior 3 or 4 times
> - Ran a hive query with tez (query_17 at 10 TB scale)
> - Occasionally, Map_7 task will get into failed state in the middle of 
> fetching data from other sources (only one task is available in Map_7).  
> {code}
> 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: 
> Completed fetch for attempt: InputAttemptIdentifier 
> [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, 
> pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, 
> CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, 
> TimeTaken=5, Rate=1.29 MB/s
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All 
> inputs fetched for input vertex : Map 6
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 
> of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Shutting down FetchScheduler, Was Interrupted: false
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Scheduler thread completed
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: 
> Received should die response from AM
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked 
> to die via task heartbeat
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while 
> waiting for task to complete. Interrupting task
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... 
> returning
> 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie 
> notification via hearbeats. Shutting down
> 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: 
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>   at 
> org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
>   at 
> org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
>   at 
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
> {code}
> From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat 
> is unable to identify the containerId from registeredContainers.  Need to 
> verify this.
> I will attach the sample task log and the tez-ui details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat

2015-01-09 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1929:

Attachment: TEZ-1929.1.patch

Attaching a patch for the regression with test updated.

> AM intermittently sending kill signal to running task in heartbeat
> --
>
> Key: TEZ-1929
> URL: https://issues.apache.org/jira/browse/TEZ-1929
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
> Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot 
> 2015-01-08 at 2.28.04 PM.png, TEZ-1929.1.patch, applog.txt.gz, tasklog.txt
>
>
> Observed this behavior 3 or 4 times
> - Ran a hive query with tez (query_17 at 10 TB scale)
> - Occasionally, Map_7 task will get into failed state in the middle of 
> fetching data from other sources (only one task is available in Map_7).  
> {code}
> 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: 
> Completed fetch for attempt: InputAttemptIdentifier 
> [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, 
> pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, 
> CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, 
> TimeTaken=5, Rate=1.29 MB/s
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All 
> inputs fetched for input vertex : Map 6
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 
> of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Shutting down FetchScheduler, Was Interrupted: false
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Scheduler thread completed
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: 
> Received should die response from AM
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked 
> to die via task heartbeat
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while 
> waiting for task to complete. Interrupting task
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... 
> returning
> 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie 
> notification via hearbeats. Shutting down
> 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: 
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>   at 
> org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
>   at 
> org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
>   at 
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
> {code}
> From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat 
> is unable to identify the containerId from registeredContainers.  Need to 
> verify this.
> I will attach the sample task log and the tez-ui details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat

2015-01-09 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1929:
--
Attachment: applog.txt.gz

Attaching the AM log.  Appears that 2 times the task (pertaining to Map 7) got 
killed due to internal preemption.

{code}
2015-01-08 00:18:14,508 INFO [Dispatcher thread: Central] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_142126204_0233_1][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=Map 7, taskAttemptId=attempt_142126204_0233_1_07_00_0, 
startTime=1420705088342, finishTime=1420705094506, timeTaken=6164, 
status=FAILED, errorEnum=INTERNAL_PREEMPTION, diagnostics=Container 
container_142126204_0233_01_55 finished with diagnostics set to 
[Container preempted internally], counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1
...
...
2015-01-08 00:19:41,951 INFO [Dispatcher thread: Central] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_142126204_0233_1][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=Map 7, taskAttemptId=attempt_142126204_0233_1_07_00_1, 
startTime=1420705150232, finishTime=1420705181950, timeTaken=31718, 
status=FAILED, errorEnum=INTERNAL_PREEMPTION, diagnostics=Container 
container_142126204_0233_01_000107 finished with diagnostics set to 
[Container preempted internally], counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1
{code}

I didn't see task getting killed due to memory pressure in NodeManager logs.

> AM intermittently sending kill signal to running task in heartbeat
> --
>
> Key: TEZ-1929
> URL: https://issues.apache.org/jira/browse/TEZ-1929
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
> Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot 
> 2015-01-08 at 2.28.04 PM.png, applog.txt.gz, tasklog.txt
>
>
> Observed this behavior 3 or 4 times
> - Ran a hive query with tez (query_17 at 10 TB scale)
> - Occasionally, Map_7 task will get into failed state in the middle of 
> fetching data from other sources (only one task is available in Map_7).  
> {code}
> 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: 
> Completed fetch for attempt: InputAttemptIdentifier 
> [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, 
> pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, 
> CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, 
> TimeTaken=5, Rate=1.29 MB/s
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All 
> inputs fetched for input vertex : Map 6
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 
> of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Shutting down FetchScheduler, Was Interrupted: false
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Scheduler thread completed
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: 
> Received should die response from AM
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked 
> to die via task heartbeat
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while 
> waiting for task to complete. Interrupting task
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... 
> returning
> 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie 
> notification via hearbeats. Shutting down
> 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: 
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>   at 
> org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
>   at 
> org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
>   at 
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
> {code}

[jira] [Updated] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat

2015-01-08 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1929:
--
Attachment: tasklog.txt
Screen Shot 2015-01-08 at 2.09.11 PM.png
Screen Shot 2015-01-08 at 2.28.04 PM.png

> AM intermittently sending kill signal to running task in heartbeat
> --
>
> Key: TEZ-1929
> URL: https://issues.apache.org/jira/browse/TEZ-1929
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
> Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot 
> 2015-01-08 at 2.28.04 PM.png, tasklog.txt
>
>
> Observed this behavior 3 or 4 times
> - Ran a hive query with tez (query_17 at 10 TB scale)
> - Occasionally, Map_7 task will get into failed state in the middle of 
> fetching data from other sources (only one task is available in Map_7).  
> {code}
> 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: 
> Completed fetch for attempt: InputAttemptIdentifier 
> [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, 
> pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, 
> CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, 
> TimeTaken=5, Rate=1.29 MB/s
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All 
> inputs fetched for input vertex : Map 6
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 
> of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Shutting down FetchScheduler, Was Interrupted: false
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Scheduler thread completed
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: 
> Received should die response from AM
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked 
> to die via task heartbeat
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while 
> waiting for task to complete. Interrupting task
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... 
> returning
> 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie 
> notification via hearbeats. Shutting down
> 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: 
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>   at 
> org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
>   at 
> org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
>   at 
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
> {code}
> From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat 
> is unable to identify the containerId from registeredContainers.  Need to 
> verify this.
> I will attach the sample task log and the tez-ui details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)