[jira] [Work logged] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down
[ https://issues.apache.org/jira/browse/HIVE-25527?focusedWorklogId=652088&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-652088 ]

ASF GitHub Bot logged work on HIVE-25527:
---

Author: ASF GitHub Bot
Created on: 17/Sep/21 03:49
Start Date: 17/Sep/21 03:49
Worklog Time Spent: 10m
Work Description: maheshk114 merged pull request #2645:
URL: https://github.com/apache/hive/pull/2645

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
---

Worklog Id: (was: 652088)
Time Spent: 1h 50m (was: 1h 40m)

> LLAP Scheduler task exits with fatal error if the executor node is down
> ---
>
> Key: HIVE-25527
> URL: https://issues.apache.org/jira/browse/HIVE-25527
> Project: Hive
> Issue Type: Sub-task
> Components: HiveServer2
> Reporter: mahesh kumar behera
> Assignee: mahesh kumar behera
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> In case the executor host has gone down, activeInstances will be updated with
> null. So we need to check for empty/null values before accessing it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Work logged] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down
[ https://issues.apache.org/jira/browse/HIVE-25527?focusedWorklogId=651785&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-651785 ]

ASF GitHub Bot logged work on HIVE-25527:
---

Author: ASF GitHub Bot
Created on: 16/Sep/21 15:06
Start Date: 16/Sep/21 15:06
Worklog Time Spent: 10m
Work Description: pgaref commented on a change in pull request #2645:
URL: https://github.com/apache/hive/pull/2645#discussion_r710210043

## File path: llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
## @@ -1820,6 +1830,8 @@ private static boolean removeFromRunningTaskMap(TreeMap
[jira] [Work logged] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down
[ https://issues.apache.org/jira/browse/HIVE-25527?focusedWorklogId=651784&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-651784 ]

ASF GitHub Bot logged work on HIVE-25527:
---

Author: ASF GitHub Bot
Created on: 16/Sep/21 15:06
Start Date: 16/Sep/21 15:06
Worklog Time Spent: 10m
Work Description: pgaref commented on a change in pull request #2645:
URL: https://github.com/apache/hive/pull/2645#discussion_r710209656

## File path: llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
## @@ -1447,23 +1454,26 @@ private SelectHostResult selectHost(TaskInfo request, Map
 if (request.shouldForceLocality()) {
   requestedHostsWillBecomeAvailable = true;
 } else {
-  LlapServiceInstance inst = activeInstances.getByHost(host).stream().findFirst().get();
-  NodeInfo nodeInfo = instanceToNodeMap.get(inst.getWorkerIdentity());
-  if (nodeInfo != null && nodeInfo.getEnableTime() > request.getLocalityDelayTimeout()
-      && nodeInfo.isDisabled() && nodeInfo.hadCommFailure()) {
-    LOG.debug("Host={} will not become available within requested timeout", nodeInfo);
-    // This node will likely be activated after the task timeout expires.
-  } else {
-    // Worth waiting for the timeout.
-    requestedHostsWillBecomeAvailable = true;
+  for (LlapServiceInstance inst : activeInstancesByHost) {
+    NodeInfo nodeInfo = instanceToNodeMap.get(inst.getWorkerIdentity());
+    if (nodeInfo == null) {
+      LOG.warn("Null NodeInfo when attempting to get host {}", host);
+      // Leave requestedHostWillBecomeAvailable as is. If some other host is found - delay,
+      // else ends up allocating to a random host immediately.
+      continue;

Review comment:
We can avoid the continue by changing the second if to an else if.

Issue Time Tracking
---

Worklog Id: (was: 651784)
Time Spent: 1.5h (was: 1h 20m)
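The `else if` suggestion in the review comment above can be illustrated with a minimal sketch. This is not the Hive code: `ElseIfSketch`, `Node`, and both method names are hypothetical, the node fields only mirror the checks visible in the removed lines of the diff (`getEnableTime()`, `isDisabled()`, `hadCommFailure()`), and the loop body after `continue;` is an assumption because the quoted hunk is cut off at that point. Under those assumptions, both shapes leave the flag untouched for a missing node and return the same result.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch comparing "continue" with "else if"; not the Hive implementation. */
class ElseIfSketch {
  /** Trimmed-down, hypothetical stand-in for the scheduler's NodeInfo. */
  static final class Node {
    final long enableTime;
    final boolean disabled;
    final boolean commFailure;

    Node(long enableTime, boolean disabled, boolean commFailure) {
      this.enableTime = enableTime;
      this.disabled = disabled;
      this.commFailure = commFailure;
    }

    /** Mirrors the condition from the removed lines of the diff. */
    boolean unlikelyBeforeTimeout(long timeout) {
      return enableTime > timeout && disabled && commFailure;
    }
  }

  /** Shape used in the patch: skip missing nodes with continue. */
  static boolean worthWaitingWithContinue(List<String> workerIds, Map<String, Node> nodes, long timeout) {
    boolean willBecomeAvailable = false;
    for (String id : workerIds) {
      Node node = nodes.get(id);
      if (node == null) {
        continue; // leave the flag as is
      }
      if (!node.unlikelyBeforeTimeout(timeout)) {
        willBecomeAvailable = true; // worth waiting for the timeout
      }
    }
    return willBecomeAvailable;
  }

  /** Shape suggested in the review: fold the second check into an else if. */
  static boolean worthWaitingWithElseIf(List<String> workerIds, Map<String, Node> nodes, long timeout) {
    boolean willBecomeAvailable = false;
    for (String id : workerIds) {
      Node node = nodes.get(id);
      if (node == null) {
        // leave the flag as is
      } else if (!node.unlikelyBeforeTimeout(timeout)) {
        willBecomeAvailable = true; // worth waiting for the timeout
      }
    }
    return willBecomeAvailable;
  }
}
```

The two methods differ only in control-flow style, so the choice between them is one of readability rather than behavior.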
[jira] [Work logged] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down
[ https://issues.apache.org/jira/browse/HIVE-25527?focusedWorklogId=651782&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-651782 ]

ASF GitHub Bot logged work on HIVE-25527:
---

Author: ASF GitHub Bot
Created on: 16/Sep/21 15:05
Start Date: 16/Sep/21 15:05
Worklog Time Spent: 10m
Work Description: pgaref commented on a change in pull request #2645:
URL: https://github.com/apache/hive/pull/2645#discussion_r710209073

## File path: llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
## @@ -1447,23 +1454,26 @@ private SelectHostResult selectHost(TaskInfo request, Map
 if (request.shouldForceLocality()) {
   requestedHostsWillBecomeAvailable = true;
 } else {
-  LlapServiceInstance inst = activeInstances.getByHost(host).stream().findFirst().get();
-  NodeInfo nodeInfo = instanceToNodeMap.get(inst.getWorkerIdentity());
-  if (nodeInfo != null && nodeInfo.getEnableTime() > request.getLocalityDelayTimeout()
-      && nodeInfo.isDisabled() && nodeInfo.hadCommFailure()) {
-    LOG.debug("Host={} will not become available within requested timeout", nodeInfo);
-    // This node will likely be activated after the task timeout expires.
-  } else {
-    // Worth waiting for the timeout.
-    requestedHostsWillBecomeAvailable = true;
+  for (LlapServiceInstance inst : activeInstancesByHost) {
+    NodeInfo nodeInfo = instanceToNodeMap.get(inst.getWorkerIdentity());
+    if (nodeInfo == null) {
+      LOG.warn("Null NodeInfo when attempting to get host {}", host);
+      // Leave requestedHostWillBecomeAvailable as is. If some other host is found - delay,
+      // else ends up allocating to a random host immediately.
+      continue;

Review comment:
We can avoid the continue by changing the second if to an else if.

Issue Time Tracking
---

Worklog Id: (was: 651782)
Time Spent: 1h 20m (was: 1h 10m)
[jira] [Work logged] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down
[ https://issues.apache.org/jira/browse/HIVE-25527?focusedWorklogId=651776&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-651776 ]

ASF GitHub Bot logged work on HIVE-25527:
---

Author: ASF GitHub Bot
Created on: 16/Sep/21 15:00
Start Date: 16/Sep/21 15:00
Worklog Time Spent: 10m
Work Description: maheshk114 commented on a change in pull request #2645:
URL: https://github.com/apache/hive/pull/2645#discussion_r710204368

## File path: llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
## @@ -1820,6 +1830,8 @@ private static boolean removeFromRunningTaskMap(TreeMap
[jira] [Work logged] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down
[ https://issues.apache.org/jira/browse/HIVE-25527?focusedWorklogId=651774&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-651774 ]

ASF GitHub Bot logged work on HIVE-25527:
---

Author: ASF GitHub Bot
Created on: 16/Sep/21 14:58
Start Date: 16/Sep/21 14:58
Worklog Time Spent: 10m
Work Description: maheshk114 commented on a change in pull request #2645:
URL: https://github.com/apache/hive/pull/2645#discussion_r710203031

## File path: llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
## @@ -1430,6 +1430,13 @@ private SelectHostResult selectHost(TaskInfo request, Map
 boolean requestedHostsWillBecomeAvailable = false;
 for (String host : requestedHosts) {
   prefHostCount++;
+
+  // Check if the host is removed from the registry after availableHostMap is created.
+  Set activeInstancesByHost = activeInstances.getByHost(host);
+  if (activeInstancesByHost == null || activeInstancesByHost.isEmpty()) {
+    continue;
+  }
+
   // Pick the first host always. Weak attempt at cache affinity.
   if (availableHostMap.containsKey(host)) {

Review comment:
I think having this separate check makes the code more readable.

Issue Time Tracking
---

Worklog Id: (was: 651774)
Time Spent: 1h (was: 50m)
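The separate guard discussed in this comment can be sketched in isolation as follows. This is a simplified illustration, not the scheduler code: `HostGuardSketch` and `usableHosts` are hypothetical names, and a plain `Map<String, Set<String>>` stands in for `activeInstances.getByHost(host)`, on the assumption stated in the issue description that a host which has gone down yields a null or empty result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the host loop in selectHost; not the Hive implementation. */
class HostGuardSketch {
  /**
   * Returns the requested hosts that still have at least one registered instance.
   * The registry map (host name to instance ids) is an assumed substitute for
   * activeInstances.getByHost(host).
   */
  static List<String> usableHosts(List<String> requestedHosts, Map<String, Set<String>> registry) {
    List<String> usable = new ArrayList<>();
    for (String host : requestedHosts) {
      Set<String> instancesByHost = registry.get(host);
      // The host may have left the registry after the availability snapshot was taken.
      if (instancesByHost == null || instancesByHost.isEmpty()) {
        continue;
      }
      usable.add(host);
    }
    return usable;
  }
}
```

With a registry that has one instance on `host1`, an empty set for `host2`, and no entry for `host3`, only `host1` survives the guard, so later code never dereferences a null or empty instance set.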
[jira] [Work logged] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down.
[ https://issues.apache.org/jira/browse/HIVE-25527?focusedWorklogId=651519&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-651519 ]

ASF GitHub Bot logged work on HIVE-25527:
---

Author: ASF GitHub Bot
Created on: 16/Sep/21 08:03
Start Date: 16/Sep/21 08:03
Worklog Time Spent: 10m
Work Description: pgaref commented on a change in pull request #2645:
URL: https://github.com/apache/hive/pull/2645#discussion_r709881332

## File path: llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
## @@ -1820,6 +1830,8 @@ private static boolean removeFromRunningTaskMap(TreeMap
[jira] [Work logged] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down.
[ https://issues.apache.org/jira/browse/HIVE-25527?focusedWorklogId=651517&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-651517 ]

ASF GitHub Bot logged work on HIVE-25527:
---

Author: ASF GitHub Bot
Created on: 16/Sep/21 08:01
Start Date: 16/Sep/21 08:01
Worklog Time Spent: 10m
Work Description: pgaref commented on a change in pull request #2645:
URL: https://github.com/apache/hive/pull/2645#discussion_r709880097

## File path: llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
## @@ -1430,6 +1430,13 @@ private SelectHostResult selectHost(TaskInfo request, Map
 boolean requestedHostsWillBecomeAvailable = false;
 for (String host : requestedHosts) {
   prefHostCount++;
+
+  // Check if the host is removed from the registry after availableHostMap is created.
+  Set activeInstancesByHost = activeInstances.getByHost(host);
+  if (activeInstancesByHost == null || activeInstancesByHost.isEmpty()) {
+    continue;
+  }
+
   // Pick the first host always. Weak attempt at cache affinity.
   if (availableHostMap.containsKey(host)) {

Review comment:
I would avoid the continue statement above and modify the condition to:
if (availableHostMap.containsKey(host) && activeInstancesByHost != null && !activeInstancesByHost.isEmpty())

Issue Time Tracking
---

Worklog Id: (was: 651517)
Time Spent: 40m (was: 0.5h)
[jira] [Work logged] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down.
[ https://issues.apache.org/jira/browse/HIVE-25527?focusedWorklogId=651135&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-651135 ]

ASF GitHub Bot logged work on HIVE-25527:
---

Author: ASF GitHub Bot
Created on: 15/Sep/21 14:53
Start Date: 15/Sep/21 14:53
Worklog Time Spent: 10m
Work Description: pgaref commented on a change in pull request #2645:
URL: https://github.com/apache/hive/pull/2645#discussion_r709269471

## File path: llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
## @@ -1447,8 +1447,14 @@ private SelectHostResult selectHost(TaskInfo request, Map
 if (request.shouldForceLocality()) {
   requestedHostsWillBecomeAvailable = true;
 } else {
-  LlapServiceInstance inst = activeInstances.getByHost(host).stream().findFirst().get();
-  NodeInfo nodeInfo = instanceToNodeMap.get(inst.getWorkerIdentity());
+  Set instanceTypes = activeInstances.getByHost(host);

Review comment:
Looks like this may happen when a node goes down between the getResourceAvailability() call and the point where selectHost() is triggered. Following the previous logic, I believe the check should be performed at the same level as ```availableHostMap.containsKey(host)```, as these types of requests should not be checking for requestedHostsWillBecomeAvailable. Does it make sense?

Issue Time Tracking
---

Worklog Id: (was: 651135)
Time Spent: 0.5h (was: 20m)
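The failure mode behind the removed line `activeInstances.getByHost(host).stream().findFirst().get()` can be reproduced with plain JDK collections. The sketch below is illustrative only: `FindFirstSketch`, `firstUnsafe`, and `firstSafe` are hypothetical names, and a `Set<String>` stands in for the set of service instances. It relies on two documented JDK behaviors: `Optional.get()` throws `NoSuchElementException` when no value is present, and calling `stream()` on a null reference throws `NullPointerException`.

```java
import java.util.Optional;
import java.util.Set;

/** Hypothetical illustration of the unguarded lookup; not the Hive implementation. */
class FindFirstSketch {
  /** Same shape as the removed line: fails on a null or empty set. */
  static String firstUnsafe(Set<String> instances) {
    return instances.stream().findFirst().get();
  }

  /** Guarded variant: reports absence instead of throwing. */
  static Optional<String> firstSafe(Set<String> instances) {
    if (instances == null || instances.isEmpty()) {
      return Optional.empty();
    }
    return instances.stream().findFirst();
  }
}
```

If the node disappears between the availability snapshot and the lookup, the unguarded form throws inside the scheduling loop, which would be consistent with the fatal exit described in the issue title; the guarded form lets the caller skip the host instead.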
[jira] [Work logged] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down.
[ https://issues.apache.org/jira/browse/HIVE-25527?focusedWorklogId=651134&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-651134 ]

ASF GitHub Bot logged work on HIVE-25527:
---

Author: ASF GitHub Bot
Created on: 15/Sep/21 14:52
Start Date: 15/Sep/21 14:52
Worklog Time Spent: 10m
Work Description: pgaref commented on a change in pull request #2645:
URL: https://github.com/apache/hive/pull/2645#discussion_r709269471

## File path: llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
## @@ -1447,8 +1447,14 @@ private SelectHostResult selectHost(TaskInfo request, Map
 if (request.shouldForceLocality()) {
   requestedHostsWillBecomeAvailable = true;
 } else {
-  LlapServiceInstance inst = activeInstances.getByHost(host).stream().findFirst().get();
-  NodeInfo nodeInfo = instanceToNodeMap.get(inst.getWorkerIdentity());
+  Set instanceTypes = activeInstances.getByHost(host);

Review comment:
Looks like this may happen when a node goes down between the getResourceAvailability() call and the point where selectHost() is triggered. Following the previous logic, I believe the check should be performed at the same level as ```availableHostMap.containsKey(host)```, as these types of requests should not be waiting for requestedHostsWillBecomeAvailable.

Issue Time Tracking
---

Worklog Id: (was: 651134)
Time Spent: 20m (was: 10m)
[jira] [Work logged] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down.
[ https://issues.apache.org/jira/browse/HIVE-25527?focusedWorklogId=651113&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-651113 ]

ASF GitHub Bot logged work on HIVE-25527:
---

Author: ASF GitHub Bot
Created on: 15/Sep/21 14:18
Start Date: 15/Sep/21 14:18
Worklog Time Spent: 10m
Work Description: maheshk114 opened a new pull request #2645:
URL: https://github.com/apache/hive/pull/2645

…
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

Issue Time Tracking
---

Worklog Id: (was: 651113)
Remaining Estimate: 0h
Time Spent: 10m

> LLAP Scheduler task exits with fatal error if the executor node is down.
>
> Key: HIVE-25527
> URL: https://issues.apache.org/jira/browse/HIVE-25527
> Project: Hive
> Issue Type: Bug
> Components: HiveServer2
> Reporter: mahesh kumar behera
> Assignee: mahesh kumar behera
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In case the executor host has gone down, activeInstances will be updated with
> null. So we need to check for empty/null values before accessing it.