[jira] [Updated] (HIVE-10649) LLAP: AM gets stuck completely if one node is dead

2015-05-12 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10649:
--
Assignee: (was: Siddharth Seth)

 LLAP: AM gets stuck completely if one node is dead
 --

 Key: HIVE-10649
 URL: https://issues.apache.org/jira/browse/HIVE-10649
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin

 See HIVE-10648.
 When the AM cannot connect to a node, the AM itself appears to stall. In the 
 example log below there are no other interleaved log lines, even though this 
 happens in the middle of Map 1 on TPCH q1, i.e. there are plenty of tasks 
 scheduled.
 From the "Assigning" messages I can also see that tasks are scheduled to all 
 the nodes before and after the pause, not just to the problematic node. 
 LLAP daemons have corresponding gaps in which, between two fragments, nothing 
 is run for a long time on any daemon.
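One way to check for the gaps described above is to scan a log for long silences once the connection-retry lines are set aside. The following is a minimal sketch, not part of Hive or Tez: the function name `find_gaps`, the 5-second threshold, and the `ignore` substring are all assumptions chosen for illustration, and the sample entries are taken from the excerpt with their message bodies abbreviated.

```python
import re
from datetime import datetime

# Matches the leading timestamp of a log line, e.g. "2015-05-07 12:13:46,679".
TIMESTAMP = re.compile(r"^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s")


def find_gaps(lines, threshold_s=5.0, ignore="Retrying connect to server"):
    """Return (start, end, seconds) for each silence longer than threshold_s.

    Lines without a leading timestamp and lines containing `ignore` are skipped.
    """
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None or ignore in line:
            continue
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if previous is not None:
            seconds = (current - previous).total_seconds()
            if seconds > threshold_s:
                gaps.append((previous, current, seconds))
        previous = current
    return gaps


# Entries taken from the excerpt; "..." marks message bodies abbreviated here.
sample = [
    "2015-05-07 12:13:46,679 INFO [Dispatcher thread: Central] impl.TaskImpl: ...",
    "2015-05-07 12:13:46,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: ...",
    "2015-05-07 12:13:46,955 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: ...",
    "2015-05-07 12:13:56,971 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: ...",
]
for start, end, seconds in find_gaps(sample):
    print(f"{seconds:.3f}s of silence between {start} and {end}")
```

Running the same function over the AM log and each daemon log, and comparing the reported intervals, would show whether the silences line up across processes.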
 {noformat}
 2015-05-07 12:13:46,679 INFO [Dispatcher thread: Central] impl.TaskImpl: task_1429683757595_0784_1_00_000276 Task Transitioned from SCHEDULED to RUNNING due to event T_ATTEMPT_LAUNCHED
 2015-05-07 12:13:46,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 10 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
 2015-05-07 12:13:46,955 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
 2015-05-07 12:13:47,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 11 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
 2015-05-07 12:13:48,812 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 12 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
 2015-05-07 12:13:49,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 13 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
 2015-05-07 12:13:50,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 14 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
 2015-05-07 12:13:51,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 15 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
 2015-05-07 12:13:52,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 16 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
 2015-05-07 12:13:53,815 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 17 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
 2015-05-07 12:13:54,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 18 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
 2015-05-07 12:13:55,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 19 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
 2015-05-07 12:13:56,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 20 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
 2015-05-07 12:13:56,971 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
 2015-05-07 12:13:57,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 21 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
 2015-05-07 12:13:58,818 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already 

[jira] [Updated] (HIVE-10649) LLAP: AM gets stuck completely if one node is dead

2015-05-07 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-10649:

Description: 
See HIVE-10648.
When the AM cannot connect to a node, the AM itself appears to stall. In the 
example log below there are no other interleaved log lines, even though this 
happens in the middle of Map 1 on TPCH q1, i.e. there are plenty of tasks 
scheduled.
From the "Assigning" messages I can also see that tasks are scheduled to all 
the nodes before and after the pause, not just to the problematic node. 
LLAP daemons have corresponding gaps in which, between two fragments, nothing 
is run for a long time on any daemon.
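To put a number on how long one connection attempt can keep retrying under the policy printed in the log, RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS), here is a small arithmetic sketch. The helper name `fixed_sleep_retry_budget` is hypothetical, and it assumes the "Already tried N time(s)" count is the number of retries already consumed. It counts only the sleeps between retries, not the time each connect attempt takes.

```python
from datetime import timedelta


def fixed_sleep_retry_budget(max_retries, sleep_ms, already_tried=0):
    """Upper bound on the time still spent sleeping between retries."""
    remaining = max(max_retries - already_tried, 0)
    return timedelta(milliseconds=remaining * sleep_ms)


# Values from the log: maxRetries=50, sleepTime=1000 ms, last count shown is 23.
total = fixed_sleep_retry_budget(50, 1000)
left = fixed_sleep_retry_budget(50, 1000, already_tried=23)
print(f"total sleep budget: {total.total_seconds():.0f} s")
print(f"remaining after 23 tries: {left.total_seconds():.0f} s")
```

Under these assumptions a single attempt against the dead node sleeps for up to 50 seconds in total, of which roughly 27 seconds remain at the last retry line shown in the excerpt.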
{noformat}
2015-05-07 12:13:46,679 INFO [Dispatcher thread: Central] impl.TaskImpl: task_1429683757595_0784_1_00_000276 Task Transitioned from SCHEDULED to RUNNING due to event T_ATTEMPT_LAUNCHED
2015-05-07 12:13:46,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 10 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:46,955 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
2015-05-07 12:13:47,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 11 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:48,812 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 12 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:49,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 13 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:50,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 14 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:51,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 15 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:52,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 16 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:53,815 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 17 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:54,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 18 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:55,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 19 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:56,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 20 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:56,971 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
2015-05-07 12:13:57,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 21 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:58,818 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 22 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:59,819 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 23 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:00,819 INFO