[jira] [Updated] (YARN-8477) Minor code refactor on ProcfsBasedProcessTree.java
[ https://issues.apache.org/jira/browse/YARN-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Zhang updated YARN-8477: -- Attachment: YARN-8477.001.patch > Minor code refactor on ProcfsBasedProcessTree.java > -- > > Key: YARN-8477 > URL: https://issues.apache.org/jira/browse/YARN-8477 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.1.0 >Reporter: Shuai Zhang >Priority: Trivial > Attachments: YARN-8477.001.patch > > > Minor code refactor on ProcfsBasedProcessTree.java to improve readability. > Split the "read the first line of a file" functionality into a separate > function. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8477) Minor code refactor on ProcfsBasedProcessTree.java
Shuai Zhang created YARN-8477: - Summary: Minor code refactor on ProcfsBasedProcessTree.java Key: YARN-8477 URL: https://issues.apache.org/jira/browse/YARN-8477 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 3.1.0 Reporter: Shuai Zhang Minor code refactor on ProcfsBasedProcessTree.java to improve readability. Split the "read the first line of a file" functionality into a separate function.
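The refactor described above could be sketched as a small helper. This is a hypothetical illustration of the idea only, not the patch's actual code; the class name `ProcfsReadSketch` and method name `readFirstLine` are assumptions:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ProcfsReadSketch {

  // Hypothetical helper of the kind the refactor describes: return the
  // first line of the given reader, or null if the source is empty.
  static String readFirstLine(Reader in) throws IOException {
    try (BufferedReader br = new BufferedReader(in)) {
      return br.readLine();
    }
  }

  public static void main(String[] args) throws IOException {
    // A /proc/<pid>/stat-style payload: only the first line is of interest.
    String stat = "1234 (java) S 1 1234 1234 0\nignored second line";
    System.out.println(readFirstLine(new StringReader(stat)));
  }
}
```

Pulling the one-line read into its own method lets each caller in ProcfsBasedProcessTree share the open/read/close handling instead of repeating it.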
[jira] [Updated] (YARN-8467) AsyncDispatcher should have a name & display it in logs to improve debug
[ https://issues.apache.org/jira/browse/YARN-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Zhang updated YARN-8467: -- Attachment: YARN-8467.002.patch > AsyncDispatcher should have a name & display it in logs to improve debug > > > Key: YARN-8467 > URL: https://issues.apache.org/jira/browse/YARN-8467 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.1.0 >Reporter: Shuai Zhang >Priority: Trivial > Attachments: YARN-8467.001.patch, YARN-8467.002.patch > > > Currently each AbstractService has a dispatcher, but the dispatcher is not > named. Logs from different dispatchers are mixed together, which makes it quite > hard to debug any hang issue. I suggest: > # Make it possible to name an AsyncDispatcher & its thread (partially done in > YARN-6015) > # Mention the AsyncDispatcher name in all its logs
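The proposal above could be sketched as follows. This is a minimal stand-in, not the real AsyncDispatcher; the class and method names here are hypothetical:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the idea (NOT the real YARN AsyncDispatcher): the
// dispatcher carries a name, names its event-handling thread after it, and
// prefixes every log line with it, so interleaved logs from several
// dispatchers can be told apart when debugging a hang.
public class NamedDispatcherSketch {
  private final String name;
  private final BlockingQueue<Runnable> eventQueue = new LinkedBlockingQueue<>();
  private final Thread eventThread;

  public NamedDispatcherSketch(String name) {
    this.name = name;
    // Thread name includes the dispatcher name, so jstack output is readable.
    this.eventThread = new Thread(this::serviceLoop, "event-dispatcher-" + name);
    this.eventThread.setDaemon(true);
  }

  public String threadName() { return eventThread.getName(); }

  public void start() { eventThread.start(); }

  public void dispatch(Runnable event) {
    log("queueing event");
    eventQueue.add(event);
  }

  private void serviceLoop() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        eventQueue.take().run();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  // Every log line mentions the dispatcher name.
  private void log(String msg) {
    System.out.println("[dispatcher=" + name + "] " + msg);
  }
}
```

With a construct like this, a hung dispatcher shows up in thread dumps and logs under its own name instead of an anonymous event-handler thread.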
[jira] [Resolved] (YARN-8476) Should check the resource of assignment is greater than Resources.none() before submitResourceCommitRequest
[ https://issues.apache.org/jira/browse/YARN-8476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YunFan Zhou resolved YARN-8476. --- Resolution: Won't Fix > Should check the resource of assignment is greater than Resources.none() > before submitResourceCommitRequest > --- > > Key: YARN-8476 > URL: https://issues.apache.org/jira/browse/YARN-8476 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: YunFan Zhou >Assignee: YunFan Zhou >Priority: Minor >
[jira] [Updated] (YARN-8476) Should check the resource of assignment is greater than Resources.none() before submitResourceCommitRequest
[ https://issues.apache.org/jira/browse/YARN-8476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YunFan Zhou updated YARN-8476: -- Description: (was: Hi, [~leftnoteasy] We recently merged https://issues.apache.org/jira/browse/YARN-5139 into our version and found some bugs. Below is the most serious bug I've encountered:
{code:java}
LeafQueue queue = ((LeafQueue) reservedApplication.getQueue());
assignment = queue.assignContainers(getClusterResource(), candidates,
    // TODO, now we only consider limits for parent for non-labeled
    // resources, should consider labeled resources as well.
    new ResourceLimits(labelManager
        .getResourceByLabel(RMNodeLabelsManager.NO_LABEL,
            getClusterResource())),
    SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY);
if (assignment.isFulfilledReservation()) {
  if (withNodeHeartbeat) {
    // Only update SchedulerHealth in sync scheduling, existing
    // Data structure of SchedulerHealth need to be updated for
    // Async mode
    updateSchedulerHealth(lastNodeUpdateTime, node.getNodeID(),
        assignment);
  }
  schedulerHealth.updateSchedulerFulfilledReservationCounts(1);
  ActivitiesLogger.QUEUE.recordQueueActivity(activitiesManager, node,
      queue.getParent().getQueueName(), queue.getQueueName(),
      ActivityState.ACCEPTED, ActivityDiagnosticConstant.EMPTY);
  ActivitiesLogger.NODE.finishAllocatedNodeAllocation(activitiesManager,
      node, reservedContainer.getContainerId(),
      AllocationState.ALLOCATED_FROM_RESERVED);
} else {
  ActivitiesLogger.QUEUE.recordQueueActivity(activitiesManager, node,
      queue.getParent().getQueueName(), queue.getQueueName(),
      ActivityState.ACCEPTED, ActivityDiagnosticConstant.EMPTY);
  ActivitiesLogger.NODE.finishAllocatedNodeAllocation(activitiesManager,
      node, reservedContainer.getContainerId(), AllocationState.SKIPPED);
}
assignment.setSchedulingMode(
    SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY);
submitResourceCommitRequest(getClusterResource(), assignment);
}
{code}
Before we submit an assignment to the *resourceCommitterService*, we must check that the assignment is greater than *Resources.none()*. After calling *getRootQueue().assignContainers*, the assignment can be *CSAssignment(Resources.createResource(0, 0), NodeType.NODE_LOCAL)*, which is a meaningless value. But we still submit it to the *resourceCommitterService*, and a pile of meaningless assignments then blocks other meaningful event processing. I think this is a very serious bug! Any suggestions?) > Should check the resource of assignment is greater than Resources.none() > before submitResourceCommitRequest > --- > > Key: YARN-8476 > URL: https://issues.apache.org/jira/browse/YARN-8476 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: YunFan Zhou >Assignee: YunFan Zhou >Priority: Minor >
[jira] [Updated] (YARN-8476) Should check the resource of assignment is greater than Resources.none() before submitResourceCommitRequest
[ https://issues.apache.org/jira/browse/YARN-8476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YunFan Zhou updated YARN-8476: -- Priority: Minor (was: Blocker) > Should check the resource of assignment is greater than Resources.none() > before submitResourceCommitRequest > --- > > Key: YARN-8476 > URL: https://issues.apache.org/jira/browse/YARN-8476 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: YunFan Zhou >Assignee: YunFan Zhou >Priority: Minor >
[jira] [Updated] (YARN-8476) Should check the resource of assignment is greater than Resources.none() before submitResourceCommitRequest
[ https://issues.apache.org/jira/browse/YARN-8476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YunFan Zhou updated YARN-8476: -- Description: Hi, [~leftnoteasy] We recently merged https://issues.apache.org/jira/browse/YARN-5139 into our version and found some bugs; the most serious one is in the reserved-container allocation path quoted in the description above. Before we submit an assignment to the *resourceCommitterService*, we must check that the assignment is greater than *Resources.none()*. After calling *getRootQueue().assignContainers*, the assignment can be *CSAssignment(Resources.createResource(0, 0), NodeType.NODE_LOCAL)*, which is a meaningless value. But we still submit it to the *resourceCommitterService*, and a pile of meaningless assignments then blocks other meaningful event processing. I think this is a very serious bug! Any suggestions? > Should check the resource of assignment is greater than Resources.none() > before submitResourceCommitRequest > --- > > Key: YARN-8476 > URL: https://issues.apache.org/jira/browse/YARN-8476 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: YunFan Zhou >Assignee: YunFan Zhou >Priority: Blocker >
[jira] [Updated] (YARN-8476) Should check the resource of assignment is greater than Resources.none() before submitResourceCommitRequest
[ https://issues.apache.org/jira/browse/YARN-8476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YunFan Zhou updated YARN-8476: -- Description: Hi, [~leftnoteasy] We recently merged https://issues.apache.org/jira/browse/YARN-5139 into our version and found some bugs; the most serious one is in the reserved-container allocation path quoted in the description above. Before we submit an assignment to the *resourceCommitterService*, we must check that the assignment is greater than *Resources.none()*. After calling *getRootQueue().assignContainers*, the assignment can be *CSAssignment(Resources.createResource(0, 0), NodeType.NODE_LOCAL)*, which is a meaningless value. But we still submit it to the *resourceCommitterService*, and a pile of meaningless assignments then blocks other meaningful event processing. I think this is a very serious bug! Any suggestions?
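The guard the reporter asks for could look like the following sketch. It uses a tiny stand-in `Resource` type rather than the real CapacityScheduler classes, and the helper names are hypothetical:

```java
// Self-contained sketch of the proposed check (NOT the real YARN classes):
// an assignment whose resource equals "none" (0 memory, 0 vcores) should be
// dropped instead of being submitted to resourceCommitterService.
public class AssignmentGuardSketch {

  static final class Resource {
    final long memory;
    final int vcores;
    Resource(long memory, int vcores) { this.memory = memory; this.vcores = vcores; }
  }

  static final Resource NONE = new Resource(0, 0);

  // Equivalent in spirit to checking the assignment against Resources.none().
  static boolean greaterThanNone(Resource assigned) {
    return assigned.memory > NONE.memory || assigned.vcores > NONE.vcores;
  }

  public static void main(String[] args) {
    // The empty CSAssignment case from the description: skip the commit.
    System.out.println(greaterThanNone(new Resource(0, 0)));    // false -> do not submit
    // A real allocation: submit it.
    System.out.println(greaterThanNone(new Resource(1024, 1))); // true  -> submit
  }
}
```

Placing such a check before submitResourceCommitRequest would keep empty assignments from queueing work on the committer thread at all.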
[jira] [Updated] (YARN-8455) Add basic acl check for all TS v2 REST APIs
[ https://issues.apache.org/jira/browse/YARN-8455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8455: Attachment: YARN-8455.004.patch > Add basic acl check for all TS v2 REST APIs > --- > > Key: YARN-8455 > URL: https://issues.apache.org/jira/browse/YARN-8455 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8455.001.patch, YARN-8455.002.patch, > YARN-8455.003.patch, YARN-8455.004.patch > > > YARN-8319 added a filter check for the flows pages. The same behavior needs to be > added for all other REST APIs as long as ATS provides support for ACLs
[jira] [Created] (YARN-8476) Should check the resource of assignment is greater than Resources.none() before submitResourceCommitRequest
YunFan Zhou created YARN-8476: - Summary: Should check the resource of assignment is greater than Resources.none() before submitResourceCommitRequest Key: YARN-8476 URL: https://issues.apache.org/jira/browse/YARN-8476 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, capacityscheduler Reporter: YunFan Zhou Assignee: YunFan Zhou
[jira] [Created] (YARN-8475) Should check the resource of assignment is greater than Resources.none() before submitResourceCommitRequest
JackZhou created YARN-8475: -- Summary: Should check the resource of assignment is greater than Resources.none() before submitResourceCommitRequest Key: YARN-8475 URL: https://issues.apache.org/jira/browse/YARN-8475 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: JackZhou
[jira] [Commented] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527148#comment-16527148 ] genericqa commented on YARN-8474: -

(x) -1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 29s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 27m 54s | trunk passed |
| +1 | compile | 0m 24s | trunk passed |
| +1 | checkstyle | 0m 13s | trunk passed |
| +1 | mvnsite | 0m 26s | trunk passed |
| +1 | shadedclient | 11m 43s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 30s | trunk passed |
| +1 | javadoc | 0m 16s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 25s | the patch passed |
| +1 | compile | 0m 20s | the patch passed |
| +1 | javac | 0m 20s | the patch passed |
| +1 | checkstyle | 0m 10s | the patch passed |
| +1 | mvnsite | 0m 26s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 14m 33s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 45s | the patch passed |
| +1 | javadoc | 0m 18s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 1m 50s | hadoop-yarn-services-api in the patch passed. |
| +1 | asflicense | 0m 31s | The patch does not generate ASF License warnings. |
| | | 61m 47s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | YARN-8474 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12929658/YARN-8474.001.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux bce47b9a478d 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / e4d7227 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21143/testReport/ |
| Max. process+thread count | 539 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-api U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-api |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21143/console |
| Powered by |
[jira] [Comment Edited] (YARN-8471) YARN RM hangs and stops allocating resources when applications successively running
[ https://issues.apache.org/jira/browse/YARN-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527118#comment-16527118 ] tianjuan edited comment on YARN-8471 at 6/29/18 3:17 AM: - yes, it's related to YARN-8193. this is an applicable patch for 2.9.0 was (Author: jutia): yes, it's related to YARN-8193. this is a applicable patch for 2.9.0 > YARN RM hangs and stops allocating resources when applications successively > running > --- > > Key: YARN-8471 > URL: https://issues.apache.org/jira/browse/YARN-8471 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.0 >Reporter: tianjuan >Assignee: tianjuan >Priority: Major > Fix For: 2.9.0 > > Attachments: YARN-8471.001.patch > > > At some point RM just hangs and stops allocating resources. At the point the RM > hangs, YARN throws NullPointerException at > RegularContainerAllocator#allocate, > RegularContainerAllocator#preCheckForPlacementSet, and > RegularContainerAllocator#getLocalityWaitFactor.
[jira] [Commented] (YARN-8471) YARN RM hangs and stops allocating resources when applications successively running
[ https://issues.apache.org/jira/browse/YARN-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527118#comment-16527118 ] tianjuan commented on YARN-8471: yes, it's related to YARN-8193. this is a applicable patch for 2.9.0 > YARN RM hangs and stops allocating resources when applications successively > running > --- > > Key: YARN-8471 > URL: https://issues.apache.org/jira/browse/YARN-8471 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.0 >Reporter: tianjuan >Assignee: tianjuan >Priority: Major > Fix For: 2.9.0 > > Attachments: YARN-8471.001.patch > > > At some point RM just hangs and stops allocating resources. At the point the RM > hangs, YARN throws NullPointerException at > RegularContainerAllocator#allocate, > RegularContainerAllocator#preCheckForPlacementSet, and > RegularContainerAllocator#getLocalityWaitFactor.
[jira] [Assigned] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang reassigned YARN-8474: --- Assignee: Eric Yang Affects Version/s: 3.1.0 Target Version/s: 3.2.0, 3.1.1 - Fix Kerberos challenge from client side. > sleeper service fails to launch with "Authentication Required" > -- > > Key: YARN-8474 > URL: https://issues.apache.org/jira/browse/YARN-8474 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.1.0 >Reporter: Sumana Sathish >Assignee: Eric Yang >Priority: Critical > Attachments: YARN-8474.001.patch > > > Sleeper job fails with Authentication required. > {code} > yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition > from local FS: /a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required > {code}
[jira] [Updated] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8474: Attachment: YARN-8474.001.patch > sleeper service fails to launch with "Authentication Required" > -- > > Key: YARN-8474 > URL: https://issues.apache.org/jira/browse/YARN-8474 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Sumana Sathish >Priority: Critical > Attachments: YARN-8474.001.patch > > > Sleeper job fails with Authentication required. > {code} > yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition > from local FS: /a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required > {code}
[jira] [Commented] (YARN-8455) Add basic acl check for all TS v2 REST APIs
[ https://issues.apache.org/jira/browse/YARN-8455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527047#comment-16527047 ] genericqa commented on YARN-8455: -1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 34s | Docker mode activated. |
Prechecks:
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests:
| +1 | mvninstall | 32m 6s | trunk passed |
| +1 | compile | 0m 33s | trunk passed |
| +1 | checkstyle | 0m 15s | trunk passed |
| +1 | mvnsite | 0m 35s | trunk passed |
| +1 | shadedclient | 13m 19s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 39s | trunk passed |
| +1 | javadoc | 0m 24s | trunk passed |
Patch Compile Tests:
| +1 | mvninstall | 0m 33s | the patch passed |
| +1 | compile | 0m 25s | the patch passed |
| +1 | javac | 0m 25s | the patch passed |
| +1 | checkstyle | 0m 12s | the patch passed |
| +1 | mvnsite | 0m 30s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 14m 21s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 52s | the patch passed |
| +1 | javadoc | 0m 19s | the patch passed |
Other Tests:
| -1 | unit | 1m 14s | hadoop-yarn-server-timelineservice in the patch failed. |
| +1 | asflicense | 0m 29s | The patch does not generate ASF License warnings. |
| | | 67m 54s | |

Failed junit tests: hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServices

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | YARN-8455 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12929653/YARN-8455.003.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 30216ddc3a5d 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / e4d7227 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/21142/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21142/testReport/ |
| Max. process+thread count | 334 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice U:
[jira] [Commented] (YARN-8455) Add basic acl check for all TS v2 REST APIs
[ https://issues.apache.org/jira/browse/YARN-8455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527000#comment-16527000 ] Rohith Sharma K S commented on YARN-8455: - While testing, I found that one of the REST queries was not handled properly. I attached a patch fixing that API. > Add basic acl check for all TS v2 REST APIs > --- > > Key: YARN-8455 > URL: https://issues.apache.org/jira/browse/YARN-8455 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8455.001.patch, YARN-8455.002.patch, > YARN-8455.003.patch > > > YARN-8319 added a filter check for the flows pages. The same behavior needs to be added > for all other REST APIs as long as ATS provides support for ACLs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8455) Add basic acl check for all TS v2 REST APIs
[ https://issues.apache.org/jira/browse/YARN-8455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8455: Attachment: YARN-8455.003.patch > Add basic acl check for all TS v2 REST APIs > --- > > Key: YARN-8455 > URL: https://issues.apache.org/jira/browse/YARN-8455 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8455.001.patch, YARN-8455.002.patch, > YARN-8455.003.patch > > > YARN-8319 added a filter check for the flows pages. The same behavior needs to be added > for all other REST APIs as long as ATS provides support for ACLs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumana Sathish updated YARN-8474: - Description: Sleeper job fails with Authentication required. {code} yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition from local FS: /a/YarnServiceLogs/sleeper-orig.json 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required {code} was: Sleeper job fails with Authentication required. yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition from local FS: /a/YarnServiceLogs/sleeper-orig.json 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required > sleeper service fails to launch with "Authentication Required" > -- > > Key: YARN-8474 > URL: https://issues.apache.org/jira/browse/YARN-8474 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Sumana Sathish >Priority: Critical > > Sleeper job fails with Authentication required. > {code} > yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition > from local FS: /a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8474) sleeper service fails to launch with "Authentication Required"
Sumana Sathish created YARN-8474: Summary: sleeper service fails to launch with "Authentication Required" Key: YARN-8474 URL: https://issues.apache.org/jira/browse/YARN-8474 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Sleeper job fails with Authentication required. yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition from local FS: /a/YarnServiceLogs/sleeper-orig.json 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7863) Modify placement constraints to support node attributes
[ https://issues.apache.org/jira/browse/YARN-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526965#comment-16526965 ] Sunil Govindan commented on YARN-7863: -- Attaching first version for this patch. I will add more test cases and other enhancements in next patch. > Modify placement constraints to support node attributes > --- > > Key: YARN-7863 > URL: https://issues.apache.org/jira/browse/YARN-7863 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Attachments: YARN-7863.v0.patch > > > This Jira will track to *Modify existing placement constraints to support > node attributes.* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7863) Modify placement constraints to support node attributes
[ https://issues.apache.org/jira/browse/YARN-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil Govindan updated YARN-7863: - Attachment: YARN-7863.v0.patch > Modify placement constraints to support node attributes > --- > > Key: YARN-7863 > URL: https://issues.apache.org/jira/browse/YARN-7863 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Attachments: YARN-7863.v0.patch > > > This Jira will track to *Modify existing placement constraints to support > node attributes.* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-1013) CS should watch resource utilization of containers and allocate speculative containers if appropriate
[ https://issues.apache.org/jira/browse/YARN-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526926#comment-16526926 ] Haibo Chen commented on YARN-1013: -- {quote} where is the enforcement flag? {quote} It is per ResourceRequest, included in the ExecutionTypeRequest of each ResourceRequest. Essentially, a ResourceRequest can opt out of oversubscription by setting its enforcement flag to true. (G, false) requests can start eagerly as O containers, but those O containers can sometimes be preempted if the node is running hot. Applications can decide for themselves which tasks are critical enough that the risk of starting as O containers and being preempted is not acceptable. YARN-8240 added control at the queue level, that is, if a queue opts out of oversubscription, applications running in the queue will never get Opportunistic containers for their (G, false) requests. {quote}Does this considers resource usages for O container or it is just consider G container usages? {quote} The fair scheduler policy (SchedulingPolicy) is pluggable, so FairScheduler queues can be sorted with the queue's O resource usage in mind. > CS should watch resource utilization of containers and allocate speculative > containers if appropriate > - > > Key: YARN-1013 > URL: https://issues.apache.org/jira/browse/YARN-1013 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Arun C Murthy >Assignee: Weiwei Yang >Priority: Major > > CS should watch resource utilization of containers (provided by NM in > heartbeat) and allocate speculative containers (at lower OS priority) if > appropriate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
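The opt-out behavior Haibo describes can be sketched in a few lines. Note these are simplified stand-ins for the real classes in org.apache.hadoop.yarn.api.records (the real ExecutionTypeRequest carries the same executionType/enforceExecutionType pair), and decideStartType is a hypothetical helper written for illustration, not actual scheduler code:

```java
// Simplified stand-ins for org.apache.hadoop.yarn.api.records types (illustrative only).
enum ExecutionType { GUARANTEED, OPPORTUNISTIC }

class ExecutionTypeRequest {
    final ExecutionType executionType;
    final boolean enforceExecutionType; // the per-request "enforcement flag"

    ExecutionTypeRequest(ExecutionType executionType, boolean enforceExecutionType) {
        this.executionType = executionType;
        this.enforceExecutionType = enforceExecutionType;
    }
}

class OversubscriptionSketch {
    // A (GUARANTEED, enforce=false) request may be started eagerly as an
    // OPPORTUNISTIC container when the node has unused capacity; setting
    // enforce=true opts the request out of oversubscription entirely.
    static ExecutionType decideStartType(ExecutionTypeRequest req, boolean nodeHasSpareCapacity) {
        if (req.executionType == ExecutionType.GUARANTEED
                && !req.enforceExecutionType
                && nodeHasSpareCapacity) {
            return ExecutionType.OPPORTUNISTIC; // eager start; may be preempted if the node runs hot
        }
        return req.executionType;
    }
}
```

A (G, true) request always comes up GUARANTEED, which is how an application shields its critical tasks from preemption.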
[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526924#comment-16526924 ] Yufei Gu commented on YARN-8468: Sounds good to me. Thanks [~mrbillau]. > Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Labels: patch > Attachments: YARN-8468.000.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers, cannot be limited per queue, and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per-queue basis. > > The use case: A user has two pools, one for ad hoc jobs and one for enterprise > apps. The user wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. > yarn.scheduler.maximum-allocation-mb would set the default maximum > container size for all queues, and the per-queue maximum would be set with a > “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf); this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting, and we should > not allow that. > * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. > * implement getMaximumResourceCapability(String queueName) in the > FairScheduler > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue as follows > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics etc. for the queue. > * write JUnit tests. > * update the scheduler documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
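An illustrative fair-scheduler.xml fragment for the proposal above. The maxContainerResources element name is taken from this ticket's description and is a proposed setting, not an existing one — the final element name and resource-string syntax may differ once a patch lands:

```xml
<?xml version="1.0"?>
<allocations>
  <queue name="adhoc">
    <!-- Proposed per-queue cap on a single container request
         (hypothetical element; name taken from the ticket description). -->
    <maxContainerResources>4096 mb, 2 vcores</maxContainerResources>
  </queue>
  <queue name="enterprise">
    <!-- No per-queue cap: falls back to the cluster-wide
         yarn.scheduler.maximum-allocation-mb. -->
  </queue>
</allocations>
```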
[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526900#comment-16526900 ] Mike Billau commented on YARN-8468: --- Hi [~yufeigu], [~bsteinbach], and team - sorry for the delay. The motivation behind this case came from one of my customers who has a very large cluster and many different users. They are using FairScheduler and have many different rules set up. Overall they are using "yarn.scheduler.maximum-allocation-mb" to limit the size of containers that their users create - this is to gently encourage the users to write "better" jobs and not just request massive containers. This is working fine, except once in a while they actually DO need to create massive containers for enterprise jobs. Originally we were looking for ways to "exclude" these specific enterprise jobs from this maximum-allocation-mb, but since this property is set globally and applies to all queues, there was no way to do this. If we could set this property on a per-queue basis we could achieve this. Additionally, it looks like you CAN already set this maximum-allocation-mb setting on a per-queue basis for the CapacityScheduler, so this ticket would bring the FairScheduler to feature parity. Under queue properties on the CapacityScheduler doc page, we read: "The per queue maximum limit of memory to allocate to each container request at the Resource Manager. This setting overrides the cluster configuration yarn.scheduler.maximum-allocation-mb. This value must be smaller than or equal to the cluster maximum." Hopefully that is enough justification - please let me know if you guys need anything else! I don't have voting power but I agree that the naming scheme is not friendly to newcomers. 
> Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Labels: patch > Attachments: YARN-8468.000.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers, cannot be limited per queue, and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per-queue basis. > > The use case: A user has two pools, one for ad hoc jobs and one for enterprise > apps. The user wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. > yarn.scheduler.maximum-allocation-mb would set the default maximum > container size for all queues, and the per-queue maximum would be set with a > “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf); this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting, and we should > not allow that. > * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. > * implement getMaximumResourceCapability(String queueName) in the > FairScheduler > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue as follows > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics etc. for the queue. > * write JUnit tests. > * update the scheduler documentation. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
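For comparison, the existing CapacityScheduler setting Mike quotes is configured per queue path in capacity-scheduler.xml. A minimal sketch — the "adhoc"/"enterprise" queue names and the 4096/16384 values are examples, not values from this ticket:

```xml
<!-- capacity-scheduler.xml -->
<property>
  <!-- Cap ad hoc containers at 4 GB; must be smaller than or equal to the
       cluster-wide yarn.scheduler.maximum-allocation-mb. -->
  <name>yarn.scheduler.capacity.root.adhoc.maximum-allocation-mb</name>
  <value>4096</value>
</property>
<property>
  <!-- Let enterprise jobs request much larger containers. -->
  <name>yarn.scheduler.capacity.root.enterprise.maximum-allocation-mb</name>
  <value>16384</value>
</property>
```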
[jira] [Commented] (YARN-8471) YARN RM hangs and stops allocating resources when applications successively running
[ https://issues.apache.org/jira/browse/YARN-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526881#comment-16526881 ] Xiao Liang commented on YARN-8471: -- Thanks [~jutia], is it related to YARN-8193 ? > YARN RM hangs and stops allocating resources when applications successively > running > --- > > Key: YARN-8471 > URL: https://issues.apache.org/jira/browse/YARN-8471 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.0 >Reporter: tianjuan >Assignee: tianjuan >Priority: Major > Fix For: 2.9.0 > > Attachments: YARN-8471.001.patch > > > At some point RM just hangs and stops allocating resources. At the point RM > get hangs, YARN throws NullPointerException at > RegularContainerAllocator#allocate, and > RegularContainerAllocator#preCheckForPlacementSet, and > RegularContainerAllocator#getLocalityWaitFactor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526869#comment-16526869 ] Yufei Gu commented on YARN-8468: [~bsteinbach] since you filed this jira and provided the patch, you have the responsibility to justify the motivation. However, I am OK with this feature. > Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Labels: patch > Attachments: YARN-8468.000.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers, cannot be limited per queue, and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per-queue basis. > > The use case: A user has two pools, one for ad hoc jobs and one for enterprise > apps. The user wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. > yarn.scheduler.maximum-allocation-mb would set the default maximum > container size for all queues, and the per-queue maximum would be set with a > “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf); this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting, and we should > not allow that. > * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. 
> * implement getMaximumResourceCapability(String queueName) in the > FairScheduler > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue as follows > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics etc for the queue. > * write JUnit tests. > * update the scheduler documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526815#comment-16526815 ] Jason Lowe commented on YARN-8473: -- Sample error transitions from the NM log: {noformat} 2018-06-21 22:10:08,433 [AsyncDispatcher event handler] WARN application.ApplicationImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: INIT_CONTAINER at FINISHING_CONTAINERS_WAIT at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:458) at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:63) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1325) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1317) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) at java.lang.Thread.run(Thread.java:745) {noformat} {noformat} 2018-06-21 22:10:09,020 [AsyncDispatcher event handler] WARN application.ApplicationImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: INIT_CONTAINER at FINISHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:458) at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:63) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1325) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1317) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) at java.lang.Thread.run(Thread.java:745) {noformat} > Containers being launched as app tears down can leave containers in NEW state > - > > Key: YARN-8473 > URL: https://issues.apache.org/jira/browse/YARN-8473 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.4 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Major > > I saw a case where containers were stuck on a nodemanager in the NEW state > because they tried to launch just as an application was tearing down. The > container sent an INIT_CONTAINER event to the ApplicationImpl which then > executed an invalid transition since that event is not handled/expected when > the application is in the process of tearing down. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
Jason Lowe created YARN-8473: Summary: Containers being launched as app tears down can leave containers in NEW state Key: YARN-8473 URL: https://issues.apache.org/jira/browse/YARN-8473 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.8.4 Reporter: Jason Lowe Assignee: Jason Lowe I saw a case where containers were stuck on a nodemanager in the NEW state because they tried to launch just as an application was tearing down. The container sent an INIT_CONTAINER event to the ApplicationImpl which then executed an invalid transition since that event is not handled/expected when the application is in the process of tearing down. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8451) Multiple NM heartbeat thread created when a slow NM resync with RM
[ https://issues.apache.org/jira/browse/YARN-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526745#comment-16526745 ] genericqa commented on YARN-8451: +1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 32s | Docker mode activated. |
Prechecks:
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests:
| +1 | mvninstall | 27m 42s | trunk passed |
| +1 | compile | 0m 59s | trunk passed |
| +1 | checkstyle | 0m 13s | trunk passed |
| +1 | mvnsite | 0m 36s | trunk passed |
| +1 | shadedclient | 11m 50s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 53s | trunk passed |
| +1 | javadoc | 0m 23s | trunk passed |
Patch Compile Tests:
| +1 | mvninstall | 0m 35s | the patch passed |
| +1 | compile | 0m 55s | the patch passed |
| +1 | javac | 0m 55s | the patch passed |
| +1 | checkstyle | 0m 9s | the patch passed |
| +1 | mvnsite | 0m 32s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 12s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 58s | the patch passed |
| +1 | javadoc | 0m 21s | the patch passed |
Other Tests:
| +1 | unit | 17m 58s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | asflicense | 0m 24s | The patch does not generate ASF License warnings. |
| | | 77m 27s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | YARN-8451 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12929613/YARN-8451.v2.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux b2897a23a46d 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 2911943 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21141/testReport/ |
| Max. process+thread count | 302 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21141/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.

> Multiple NM heartbeat thread created when a slow NM resync
[jira] [Commented] (YARN-8451) Multiple NM heartbeat thread created when a slow NM resync with RM
[ https://issues.apache.org/jira/browse/YARN-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526655#comment-16526655 ] Botong Huang commented on YARN-8451: Good point, fixed in v2 patch! > Multiple NM heartbeat thread created when a slow NM resync with RM > -- > > Key: YARN-8451 > URL: https://issues.apache.org/jira/browse/YARN-8451 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8451.v1.patch, YARN-8451.v2.patch > > > During a NM resync with RM (say RM did a master slave switch), if NM is > running slow, more than one RESYNC event may be put into the NM dispatcher by > the existing heartbeat thread before they are processed. As a result, > multiple new heartbeat thread are later created and start to hb to RM > concurrently with their own responseId. If at some point of time, one thread > becomes more than one step behind others, RM will send back a resync signal > in this heartbeat response, killing all containers in this NM. > See comments below for details on how this can happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8451) Multiple NM heartbeat thread created when a slow NM resync with RM
[ https://issues.apache.org/jira/browse/YARN-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8451: --- Attachment: YARN-8451.v2.patch
[jira] [Commented] (YARN-8451) Multiple NM heartbeat thread created when a slow NM resync with RM
[ https://issues.apache.org/jira/browse/YARN-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526650#comment-16526650 ] Jason Lowe commented on YARN-8451: -- Ah, sorry, I missed that there was a thread earlier as well. Should have used {{diff -b}} after applying the patch. ;-) Since the AtomicBoolean is referred to as a lock, I'd like to see it treated as such where the release of it is in a {{finally}} block so it's always released even if exceptions occur. Otherwise looks good.
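The pattern Jason Lowe asks for can be sketched as below. This is a minimal illustration, not the actual NodeStatusUpdater code: the class name `ResyncGuard` and the `tryResync` helper are hypothetical, but they show an AtomicBoolean being treated as a lock whose release sits in a {{finally}} block, so an exception during resync cannot leave it held.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ResyncGuard {
    // Hypothetical guard: only one resync may run at a time; duplicate
    // RESYNC events arriving while one is in flight are ignored.
    private final AtomicBoolean resyncInProgress = new AtomicBoolean(false);

    public boolean tryResync(Runnable resyncWork) {
        // "Acquire the lock": atomically flip false -> true, or bail out.
        if (!resyncInProgress.compareAndSet(false, true)) {
            return false;
        }
        try {
            resyncWork.run();
            return true;
        } finally {
            // Released even if resyncWork throws, so a later resync can run.
            resyncInProgress.set(false);
        }
    }

    public static void main(String[] args) {
        ResyncGuard guard = new ResyncGuard();
        System.out.println(guard.tryResync(() -> { }));
        try {
            guard.tryResync(() -> { throw new RuntimeException("register failed"); });
        } catch (RuntimeException expected) {
            // The guard must still be free after the failure.
        }
        System.out.println(guard.tryResync(() -> { }));
    }
}
```

Without the {{finally}}, the failing second call would leave the flag set and every later resync attempt would be dropped silently.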
[jira] [Commented] (YARN-8451) Multiple NM heartbeat thread created when a slow NM resync with RM
[ https://issues.apache.org/jira/browse/YARN-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526640#comment-16526640 ] Botong Huang commented on YARN-8451: Hi [~jlowe], I am actually not changing this behavior (not blocking the dispatcher for resync); the existing code already creates a new thread for it. I think the reason is that resync involves draining the existing heartbeat thread and a register call to the RM, which can take a long time (say the network is slow or the RM is down during a master-slave switch). We don't want to block the entire NM for this. It may be much more involved if we want to change this behavior.
[jira] [Commented] (YARN-8453) Additional Unit tests to verify queue limit and max-limit with multiple resource types
[ https://issues.apache.org/jira/browse/YARN-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526621#comment-16526621 ] Wangda Tan commented on YARN-8453: -- +1 to the patch, thanks [~sunilg]. > Additional Unit tests to verify queue limit and max-limit with multiple resource types > --- > > Key: YARN-8453 > URL: https://issues.apache.org/jira/browse/YARN-8453 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.0.2 >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Attachments: YARN-8453.001.patch > > > After support for additional resource types other than CPU and memory, it is possible that one such new resource has exhausted its quota on a given queue while other resources such as memory/CPU are still available beyond the guaranteed limit (under the max-limit). Adding more unit tests to ensure we are not starving such allocation requests.
[jira] [Commented] (YARN-8379) Improve balancing resources in already satisfied queues by using Capacity Scheduler preemption
[ https://issues.apache.org/jira/browse/YARN-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526603#comment-16526603 ] Hudson commented on YARN-8379: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14497 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14497/]) YARN-8379. Improve balancing resources in already satisfied queues by (sunilg: rev 291194302cc1a875d6d94ea93cf1184a3f1fc2cc) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/PreemptableResourceCalculator.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/CapacitySchedulerPreemptionContext.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/CapacitySchedulerPreemptionUtils.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/FifoCandidatesSelector.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/AbstractPreemptableResourceCalculator.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/QueuePriorityContainerCandidateSelector.java * (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestPreemptionForQueueWithPriorities.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TempQueuePerPartition.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/IntraQueueCandidatesSelector.java * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicyPreemptToBalance.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/PreemptionCandidatesSelector.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ReservedContainerCandidatesSelector.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacitySchedulerSurgicalPreemption.java > Improve balancing resources in already satisfied queues by using Capacity > Scheduler preemption > -- > > Key: YARN-8379 > URL: https://issues.apache.org/jira/browse/YARN-8379 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Zian Chen >Priority: Major > Fix For: 3.2.0, 3.1.1 > > 
Attachments: YARN-8379.001.patch, YARN-8379.002.patch, YARN-8379.003.patch, YARN-8379.004.patch, YARN-8379.005.patch, YARN-8379.006.patch, ericpayne.confs.tgz > > > The existing capacity scheduler only supports preemption for an underutilized queue to reach its guaranteed resource. In addition to that, there's a requirement to get better balance between queues when all of them reach their guaranteed resource but with different fairness resource. An example is, 3 queues with capacity, queue_a = 30%, queue_b = 30%, queue_c = 40%. At time T, queue_a is using 30%, queue_b is using 70%. Existing scheduler preemption won't happen. But this is unfair to queue_a since queue_a has the same guaranteed resources. > Before YARN-5864, the capacity scheduler did additional preemption to balance queues. We
[jira] [Updated] (YARN-8472) YARN Container Phase 2
[ https://issues.apache.org/jira/browse/YARN-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8472: Description: In YARN-3611, we have implemented basic Docker container support for YARN. This story is the next phase to improve container usability. Several areas for improvement are: # Software defined network support # Interactive shell to container # User management sss/nscd integration # Runc/containerd support # Metrics/Logs integration with Timeline service v2 was: In YARN-3611, we have implemented basic Docker container support for YARN. This story is the next phase to improve container usability. Several areas for improvement are: # Software defined network support # Interactive shell to container # User management sss/nscd integration # Runc/containerd support > YARN Container Phase 2 > -- > > Key: YARN-8472 > URL: https://issues.apache.org/jira/browse/YARN-8472 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Yang >Priority: Major > > In YARN-3611, we have implemented basic Docker container support for YARN. This story is the next phase to improve container usability. Several areas for improvement are: # Software defined network support # Interactive shell to container # User management sss/nscd integration # Runc/containerd support # Metrics/Logs integration with Timeline service v2
[jira] [Created] (YARN-8472) YARN Container Phase 2
Eric Yang created YARN-8472: --- Summary: YARN Container Phase 2 Key: YARN-8472 URL: https://issues.apache.org/jira/browse/YARN-8472 Project: Hadoop YARN Issue Type: Improvement Reporter: Eric Yang In YARN-3611, we have implemented basic Docker container support for YARN. This story is the next phase to improve container usability. Several areas for improvement are: # Software defined network support # Interactive shell to container # User management sss/nscd integration # Runc/containerd support
[jira] [Commented] (YARN-8455) Add basic acl check for all TS v2 REST APIs
[ https://issues.apache.org/jira/browse/YARN-8455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526586#comment-16526586 ] Sunil Govindan commented on YARN-8455: -- Latest patch looks good to me. +1 Committing shortly. > Add basic acl check for all TS v2 REST APIs > --- > > Key: YARN-8455 > URL: https://issues.apache.org/jira/browse/YARN-8455 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8455.001.patch, YARN-8455.002.patch > > > YARN-8319 added a filter check for the flows pages. The same behavior needs to be added for all other REST APIs as long as ATS provides support for ACLs.
[jira] [Updated] (YARN-8379) Improve balancing resources in already satisfied queues by using Capacity Scheduler preemption
[ https://issues.apache.org/jira/browse/YARN-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil Govindan updated YARN-8379: - Summary: Improve balancing resources in already satisfied queues by using Capacity Scheduler preemption (was: Add an option to allow Capacity Scheduler preemption to balance satisfied queues) > Improve balancing resources in already satisfied queues by using Capacity Scheduler preemption > -- > > Key: YARN-8379 > URL: https://issues.apache.org/jira/browse/YARN-8379 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Zian Chen >Priority: Major > Attachments: YARN-8379.001.patch, YARN-8379.002.patch, YARN-8379.003.patch, YARN-8379.004.patch, YARN-8379.005.patch, YARN-8379.006.patch, ericpayne.confs.tgz > > > The existing capacity scheduler only supports preemption for an underutilized queue to reach its guaranteed resource. In addition to that, there's a requirement to get better balance between queues when all of them reach their guaranteed resource but with different fairness resource. An example is, 3 queues with capacity, queue_a = 30%, queue_b = 30%, queue_c = 40%. At time T, queue_a is using 30%, queue_b is using 70%. Existing scheduler preemption won't happen. But this is unfair to queue_a since queue_a has the same guaranteed resources. > Before YARN-5864, the capacity scheduler did additional preemption to balance queues. We changed the logic since it could preempt too many containers between queues when all queues are satisfied. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
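The unfairness in the queue_a/queue_b example can be made concrete with a toy calculation. This is a hypothetical sketch, not the actual ProportionalCapacityPreemptionPolicy math: if queue_c (40% guaranteed) is idle, its headroom would ideally be shared between the two active 30% queues in proportion to their guarantees, giving each an ideal share of 50% and leaving queue_b 20% over.

```java
public class PreemptToBalance {
    // Toy model of "balance above guarantee": an active queue's ideal share
    // is its guarantee plus a proportional slice of the idle capacity.
    // Helper name and formula are illustrative, not the scheduler's code.
    static double idealShare(double guaranteed, double totalGuaranteedOfActive,
                             double idleCapacity) {
        return guaranteed + idleCapacity * (guaranteed / totalGuaranteedOfActive);
    }

    public static void main(String[] args) {
        // queue_a = queue_b = 30% guaranteed and active; queue_c (40%) is idle.
        double ideal = idealShare(30, 60, 40);   // 30 + 40 * (30/60) = 50
        double usedByB = 70;
        // queue_b uses 70% but its balanced share is 50%, so preempt-to-balance
        // would target roughly this much of its usage:
        System.out.println("queue_b over ideal by " + (usedByB - ideal));
    }
}
```

This is the kind of rebalancing the pre-YARN-5864 logic performed, which YARN-8379 reintroduces in a bounded form so that preemption between already-satisfied queues does not churn too many containers.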
[jira] [Commented] (YARN-8409) ActiveStandbyElectorBasedElectorService is failing with NPE
[ https://issues.apache.org/jira/browse/YARN-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526578#comment-16526578 ] Chandni Singh commented on YARN-8409: - Thanks [~eyang] for reviewing and merging the patch. > ActiveStandbyElectorBasedElectorService is failing with NPE > --- > > Key: YARN-8409 > URL: https://issues.apache.org/jira/browse/YARN-8409 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: Yesha Vora >Assignee: Chandni Singh >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8409.002.patch > > > In RM-HA env, kill ZK leader and then perform RM failover. > Sometimes, active RM gets NPE and fail to come up successfully > {code:java} > 2018-06-08 10:31:03,007 INFO client.ZooKeeperSaslClient > (ZooKeeperSaslClient.java:run(289)) - Client will use GSSAPI as SASL > mechanism. > 2018-06-08 10:31:03,008 INFO zookeeper.ClientCnxn > (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server > xxx/xxx:2181. 
Will attempt to SASL-authenticate using Login Context section > 'Client' > 2018-06-08 10:31:03,009 WARN zookeeper.ClientCnxn > (ClientCnxn.java:run(1146)) - Session 0x0 for server null, unexpected error, > closing socket connection and attempting reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) > 2018-06-08 10:31:03,344 INFO service.AbstractService > (AbstractService.java:noteFailure(267)) - Service > org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService > failed in state INITED > java.lang.NullPointerException > at > org.apache.hadoop.ha.ActiveStandbyElector$3.run(ActiveStandbyElector.java:1033) > at > org.apache.hadoop.ha.ActiveStandbyElector$3.run(ActiveStandbyElector.java:1030) > at > org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1095) > at > org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1087) > at > org.apache.hadoop.ha.ActiveStandbyElector.createWithRetries(ActiveStandbyElector.java:1030) > at > org.apache.hadoop.ha.ActiveStandbyElector.ensureParentZNode(ActiveStandbyElector.java:347) > at > org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.serviceInit(ActiveStandbyElectorBasedElectorService.java:110) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:336) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1479) > 2018-06-08 10:31:03,345 INFO ha.ActiveStandbyElector > (ActiveStandbyElector.java:quitElection(409)) - Yielding from election{code}
[jira] [Commented] (YARN-8378) Missing default implementation of loading application with FileSystemApplicationHistoryStore
[ https://issues.apache.org/jira/browse/YARN-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526570#comment-16526570 ] Sunil Govindan commented on YARN-8378: -- Changes seem fine to me. [~rohithsharma] could you please help to take a look. > Missing default implementation of loading application with FileSystemApplicationHistoryStore > - > > Key: YARN-8378 > URL: https://issues.apache.org/jira/browse/YARN-8378 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, yarn >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Minor > Attachments: YARN-8378.1.patch > > > [YARN-3700|https://issues.apache.org/jira/browse/YARN-3700] and [YARN-3787|https://issues.apache.org/jira/browse/YARN-3787] added some limitations (number, time) to loading applications from the YARN timeline service. But this API is missing the default implementation when we use FileSystemApplicationHistoryStore for the applicationhistoryservice instead of using the timeline service.
[jira] [Commented] (YARN-8409) ActiveStandbyElectorBasedElectorService is failing with NPE
[ https://issues.apache.org/jira/browse/YARN-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526561#comment-16526561 ] Hudson commented on YARN-8409: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14496 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14496/]) YARN-8409. Fixed NPE in ActiveStandbyElectorBasedElectorService. (eyang: rev 384764cdeac6490bc47fa0eb7b936baa4c0d3230) * (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java * (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ActiveStandbyElector.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMEmbeddedElector.java
[jira] [Resolved] (YARN-8414) Nodemanager crashes soon if ATSv2 HBase is either down or absent
[ https://issues.apache.org/jira/browse/YARN-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang resolved YARN-8414. - Resolution: Cannot Reproduce This has not happened in the last two weeks of stress test. Close this as can not reproduce. > Nodemanager crashes soon if ATSv2 HBase is either down or absent > > > Key: YARN-8414 > URL: https://issues.apache.org/jira/browse/YARN-8414 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.1.0 >Reporter: Eric Yang >Priority: Critical > > Test cluster has 1000 apps running, and a user trigger capacity scheduler > queue changes. This crashes all node managers. It looks like node manager > encounter too many files open while aggregating logs for containers: > {code} > 2018-06-07 21:17:59,307 WARN server.AbstractConnector > (AbstractConnector.java:handleAcceptFailure(544)) - > java.io.IOException: Too many open files > at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) > at > sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422) > at > sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) > at > org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:371) > at > org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:601) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:745) > 2018-06-07 21:17:59,758 WARN util.SysInfoLinux > (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; > can't determine memory settings > 2018-06-07 21:17:59,758 WARN util.SysInfoLinux > (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; > can't determine memory settings > 2018-06-07 21:18:00,842 WARN client.ConnectionUtils > (ConnectionUtils.java:getStubKey(236)) - Can not resolve 
host12.example.com, > please check your network > java.net.UnknownHostException: host1.example.com: System error > at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) > at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928) > at > java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323) > at java.net.InetAddress.getAllByName0(InetAddress.java:1276) > at java.net.InetAddress.getAllByName(InetAddress.java:1192) > at java.net.InetAddress.getAllByName(InetAddress.java:1126) > at java.net.InetAddress.getByName(InetAddress.java:1076) > at > org.apache.hadoop.hbase.client.ConnectionUtils.getStubKey(ConnectionUtils.java:233) > at > org.apache.hadoop.hbase.client.ConnectionImplementation.getClient(ConnectionImplementation.java:1189) > at > org.apache.hadoop.hbase.client.ReversedScannerCallable.prepare(ReversedScannerCallable.java:111) > at > org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:399) > at > org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105) > at > org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Timeline service has thousands of exceptions: > {code} > 2018-06-07 21:18:34,182 ERROR client.AsyncProcess > (AsyncProcess.java:submit(291)) - Failed to get region location > java.io.InterruptedIOException > at > org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:265) > at > org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437) > at > org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312) > at > 
org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597) > at > org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:834) > at > org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732) > at > org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:281) > at > org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:236) > at >
[jira] [Commented] (YARN-7690) Expose reserved Memory/Vcores of Node Manager at WebUI
[ https://issues.apache.org/jira/browse/YARN-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526550#comment-16526550 ] genericqa commented on YARN-7690: - -1 overall
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 5s | YARN-7690 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. |
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-7690 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12903982/YARN-7690.patch |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21140/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated. > Expose reserved Memory/Vcores of Node Manager at WebUI > -- > > Key: YARN-7690 > URL: https://issues.apache.org/jira/browse/YARN-7690 > Project: Hadoop YARN > Issue Type: Improvement > Components: webapp >Reporter: tianjuan >Assignee: tianjuan >Priority: Major > Attachments: YARN-7690.patch > > > Now only total reserved memory/Vcores are exposed at RM Web UI, reserved memory/Vcores of a single nodemanager is hard to find out. It confuses users that they observe that there are available memory/Vcores at nodes page, but their jobs are stuck and waiting for resource to be allocated. It is helpful for debug to expose reserved memory/Vcores of every single nodemanager, and memory/Vcores that can be allocated (unallocated minus reserved).
[jira] [Commented] (YARN-7690) Expose reserved Memory/Vcores of Node Manager at WebUI
[ https://issues.apache.org/jira/browse/YARN-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526505#comment-16526505 ] Íñigo Goiri commented on YARN-7690: --- Thanks [~jutia] for the patch. Can you add unit tests for this? A couple screenshots would also be helpful.
[jira] [Updated] (YARN-7690) Expose reserved Memory/Vcores of Node Manager at WebUI
[ https://issues.apache.org/jira/browse/YARN-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated YARN-7690: -- Description: Now only total reserved memory/Vcores are exposed at RM Web UI, reserved memory/Vcores of a single nodemanager is hard to find out. It confuses users that they observe that there are available memory/Vcores at nodes page, but their jobs are stuck and waiting for resource to be allocated. It is helpful for debug to expose reserved memory/Vcores of every single nodemanager, and memory/Vcores that can be allocated (unallocated minus reserved). (was: now only total reserved memory/Vcores are exposed at RM webUI, reserved memory/Vcores of a single nodemanager is hard to find out. it confuses users that they obeserve that there are available memory/Vcores at nodes page, but their jobs are stuck and waiting for resouce to be allocated. It is helpful for bedug to expose reserved memory/Vcores of every single nodemanager, and memory/Vcores that can be allocated( unallocated minus reserved)) > Expose reserved Memory/Vcores of Node Manager at WebUI > -- > > Key: YARN-7690 > URL: https://issues.apache.org/jira/browse/YARN-7690 > Project: Hadoop YARN > Issue Type: Improvement > Components: webapp >Reporter: tianjuan >Assignee: tianjuan >Priority: Major > Attachments: YARN-7690.patch > > > Now only total reserved memory/Vcores are exposed at RM Web UI, reserved > memory/Vcores of a single nodemanager is hard to find out. It confuses users > that they observe that there are available memory/Vcores at nodes page, but > their jobs are stuck and waiting for resource to be allocated. It is helpful > for debug to expose reserved memory/Vcores of every single nodemanager, and > memory/Vcores that can be allocated (unallocated minus reserved). 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
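The arithmetic the YARN-7690 description asks the web UI to surface is simple enough to sketch. The class and field names below are illustrative stand-ins, not the actual Hadoop YARN API:

```java
// Hedged sketch of the per-node view YARN-7690 asks for: expose reserved
// memory/vcores per NodeManager, plus what can actually be allocated
// (unallocated minus reserved). Names here are hypothetical.
class NodeResourceView {
    final long totalMb, allocatedMb, reservedMb;

    NodeResourceView(long totalMb, long allocatedMb, long reservedMb) {
        this.totalMb = totalMb;
        this.allocatedMb = allocatedMb;
        this.reservedMb = reservedMb;
    }

    long unallocatedMb() {
        return totalMb - allocatedMb;
    }

    // The number users actually care about on the nodes page: memory that
    // can still be handed out once reservations are accounted for.
    long allocatableMb() {
        return Math.max(0, unallocatedMb() - reservedMb);
    }
}
```

Showing `allocatableMb` next to the existing unallocated column would explain the situation the description mentions, where nodes look free but jobs still wait because the headroom is reserved.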
[jira] [Commented] (YARN-8471) YARN RM hangs and stops allocating resources when applications successively running
[ https://issues.apache.org/jira/browse/YARN-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526499#comment-16526499 ] Íñigo Goiri commented on YARN-8471: --- Can you also post the stack traces and describe how this can happen? > YARN RM hangs and stops allocating resources when applications successively > running > --- > > Key: YARN-8471 > URL: https://issues.apache.org/jira/browse/YARN-8471 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.0 >Reporter: tianjuan >Assignee: tianjuan >Priority: Major > Fix For: 2.9.0 > > Attachments: YARN-8471.001.patch > > > At some point RM just hangs and stops allocating resources. At the point RM > get hangs, YARN throws NullPointerException at > RegularContainerAllocator#allocate, and > RegularContainerAllocator#preCheckForPlacementSet, and > RegularContainerAllocator#getLocalityWaitFactor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-7690) Expose reserved Memory/Vcores of Node Manager at WebUI
[ https://issues.apache.org/jira/browse/YARN-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri reassigned YARN-7690: - Assignee: tianjuan > Expose reserved Memory/Vcores of Node Manager at WebUI > -- > > Key: YARN-7690 > URL: https://issues.apache.org/jira/browse/YARN-7690 > Project: Hadoop YARN > Issue Type: Improvement > Components: webapp >Reporter: tianjuan >Assignee: tianjuan >Priority: Major > Attachments: YARN-7690.patch > > > now only total reserved memory/Vcores are exposed at RM webUI, reserved > memory/Vcores of a single nodemanager is hard to find out. it confuses users > that they obeserve that there are available memory/Vcores at nodes page, but > their jobs are stuck and waiting for resouce to be allocated. It is helpful > for bedug to expose reserved memory/Vcores of every single nodemanager, and > memory/Vcores that can be allocated( unallocated minus reserved) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7690) Expose reserved Memory/Vcores of Node Manager at WebUI
[ https://issues.apache.org/jira/browse/YARN-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated YARN-7690: -- Summary: Expose reserved Memory/Vcores of Node Manager at WebUI (was: expose reserved memory/Vcores of nodemanager at webUI) > Expose reserved Memory/Vcores of Node Manager at WebUI > -- > > Key: YARN-7690 > URL: https://issues.apache.org/jira/browse/YARN-7690 > Project: Hadoop YARN > Issue Type: Improvement > Components: webapp >Reporter: tianjuan >Priority: Major > Attachments: YARN-7690.patch > > > now only total reserved memory/Vcores are exposed at RM webUI, reserved > memory/Vcores of a single nodemanager is hard to find out. it confuses users > that they obeserve that there are available memory/Vcores at nodes page, but > their jobs are stuck and waiting for resouce to be allocated. It is helpful > for bedug to expose reserved memory/Vcores of every single nodemanager, and > memory/Vcores that can be allocated( unallocated minus reserved) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8409) ActiveStandbyElectorBasedElectorService is failing with NPE
[ https://issues.apache.org/jira/browse/YARN-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526497#comment-16526497 ] Eric Yang commented on YARN-8409: - [~csingh] Thank you for the patch. TestAppManager error seems to be Jenkins out of resource to fork. The error doesn't happen when I ran the unit test locally. RM handles ZooKeeper unavailability more gracefully with this patch. +1 on this patch, and will commit shortly. > ActiveStandbyElectorBasedElectorService is failing with NPE > --- > > Key: YARN-8409 > URL: https://issues.apache.org/jira/browse/YARN-8409 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: Yesha Vora >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8409.002.patch > > > In RM-HA env, kill ZK leader and then perform RM failover. > Sometimes, active RM gets NPE and fail to come up successfully > {code:java} > 2018-06-08 10:31:03,007 INFO client.ZooKeeperSaslClient > (ZooKeeperSaslClient.java:run(289)) - Client will use GSSAPI as SASL > mechanism. > 2018-06-08 10:31:03,008 INFO zookeeper.ClientCnxn > (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server > xxx/xxx:2181. 
Will attempt to SASL-authenticate using Login Context section > 'Client' > 2018-06-08 10:31:03,009 WARN zookeeper.ClientCnxn > (ClientCnxn.java:run(1146)) - Session 0x0 for server null, unexpected error, > closing socket connection and attempting reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) > 2018-06-08 10:31:03,344 INFO service.AbstractService > (AbstractService.java:noteFailure(267)) - Service > org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService > failed in state INITED > java.lang.NullPointerException > at > org.apache.hadoop.ha.ActiveStandbyElector$3.run(ActiveStandbyElector.java:1033) > at > org.apache.hadoop.ha.ActiveStandbyElector$3.run(ActiveStandbyElector.java:1030) > at > org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1095) > at > org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1087) > at > org.apache.hadoop.ha.ActiveStandbyElector.createWithRetries(ActiveStandbyElector.java:1030) > at > org.apache.hadoop.ha.ActiveStandbyElector.ensureParentZNode(ActiveStandbyElector.java:347) > at > org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.serviceInit(ActiveStandbyElectorBasedElectorService.java:110) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:336) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1479) > 2018-06-08 10:31:03,345 INFO ha.ActiveStandbyElector > (ActiveStandbyElector.java:quitElection(409)) - Yielding from election{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-8471) YARN RM hangs and stops allocating resources when applications successively running
[ https://issues.apache.org/jira/browse/YARN-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri reassigned YARN-8471: - Assignee: tianjuan > YARN RM hangs and stops allocating resources when applications successively > running > --- > > Key: YARN-8471 > URL: https://issues.apache.org/jira/browse/YARN-8471 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.0 >Reporter: tianjuan >Assignee: tianjuan >Priority: Major > Fix For: 2.9.0 > > Attachments: YARN-8471.001.patch > > > At some point RM just hangs and stops allocating resources. At the point RM > get hangs, YARN throws NullPointerException at > RegularContainerAllocator#allocate, and > RegularContainerAllocator#preCheckForPlacementSet, and > RegularContainerAllocator#getLocalityWaitFactor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8471) YARN RM hangs and stops allocating resources when applications successively running
[ https://issues.apache.org/jira/browse/YARN-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526496#comment-16526496 ] Íñigo Goiri commented on YARN-8471: --- Thanks [~jutia] for the patch, does this apply to trunk too? If so you need to provide one for trunk and another one for branch-2 (or branch-2.9). Can we add a unit test to reproduce this? > YARN RM hangs and stops allocating resources when applications successively > running > --- > > Key: YARN-8471 > URL: https://issues.apache.org/jira/browse/YARN-8471 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.0 >Reporter: tianjuan >Priority: Major > Fix For: 2.9.0 > > Attachments: YARN-8471.001.patch > > > At some point RM just hangs and stops allocating resources. At the point RM > get hangs, YARN throws NullPointerException at > RegularContainerAllocator#allocate, and > RegularContainerAllocator#preCheckForPlacementSet, and > RegularContainerAllocator#getLocalityWaitFactor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8471) YARN RM hangs and stops allocating resources when applications successively running
[ https://issues.apache.org/jira/browse/YARN-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated YARN-8471: -- Description: At some point RM just hangs and stops allocating resources. At the point RM get hangs, YARN throws NullPointerException at RegularContainerAllocator#allocate, and RegularContainerAllocator#preCheckForPlacementSet, and RegularContainerAllocator#getLocalityWaitFactor. (was: at some point RM just hangs and stops allocating resources. At the point RM get hangs, YARN throw NullPointerException at RegularContainerAllocator#allocate, and RegularContainerAllocator#preCheckForPlacementSet, and RegularContainerAllocator#getLocalityWaitFactor) > YARN RM hangs and stops allocating resources when applications successively > running > --- > > Key: YARN-8471 > URL: https://issues.apache.org/jira/browse/YARN-8471 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.0 >Reporter: tianjuan >Priority: Major > Fix For: 2.9.0 > > Attachments: YARN-8471.001.patch > > > At some point RM just hangs and stops allocating resources. At the point RM > get hangs, YARN throws NullPointerException at > RegularContainerAllocator#allocate, and > RegularContainerAllocator#preCheckForPlacementSet, and > RegularContainerAllocator#getLocalityWaitFactor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8103) Add CLI interface to query node attributes
[ https://issues.apache.org/jira/browse/YARN-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526412#comment-16526412 ] Bibin A Chundatt commented on YARN-8103: Thank you [~Naganarasimha] for review and commit > Add CLI interface to query node attributes > --- > > Key: YARN-8103 > URL: https://issues.apache.org/jira/browse/YARN-8103 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Major > Fix For: YARN-3409 > > Attachments: YARN-8103-YARN-3409.001.patch, > YARN-8103-YARN-3409.002.patch, YARN-8103-YARN-3409.003.patch, > YARN-8103-YARN-3409.004.patch, YARN-8103-YARN-3409.005.patch, > YARN-8103-YARN-3409.006.patch, YARN-8103-YARN-3409.WIP.patch > > > YARN-8100 will add API interface for querying the attributes. CLI interface > for querying node attributes for each nodes and list all attributes in > cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526393#comment-16526393 ] Bibin A Chundatt commented on YARN-8459: [~leftnoteasy] Can we remove this log line in CapacityScheduler#allocate? {code} LOG.info("Allocation for application " + applicationAttemptId + " : " + allocation + " with cluster resource : " + getClusterResource()); {code} Observed in one cluster that this seems to be flooding the logs, since it is printed even if the allocation is empty. Thoughts? > Improve logs of Capacity Scheduler to better debug invalid states > - > > Key: YARN-8459 > URL: https://issues.apache.org/jira/browse/YARN-8459 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.1.0 >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Attachments: YARN-8459.001.patch, YARN-8459.002.patch, > YARN-8459.003.patch > > > Improve logs in CS to better debug invalid states -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
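The guard being discussed can be sketched as follows. This is an illustrative helper, not the actual CapacityScheduler code; a real fix would wrap the existing `LOG.info` call directly:

```java
// Sketch of the suggested guard: keep the allocation message at INFO only
// when something was actually allocated, otherwise skip it (or leave it to
// debug-level logging). Method and class names are hypothetical.
class AllocationLogGuard {
    // allocatedContainers stands in for allocation.getContainers().size()
    // in the real scheduler code; debugEnabled for LOG.isDebugEnabled().
    static String logLine(String appAttemptId, int allocatedContainers,
                          boolean debugEnabled) {
        if (allocatedContainers == 0 && !debugEnabled) {
            return null; // empty allocation: not worth an INFO line
        }
        return "Allocation for application " + appAttemptId + " : "
            + allocatedContainers + " container(s)";
    }
}
```

With this shape, the common case of a heartbeat returning an empty allocation no longer produces an INFO line, which addresses the log flooding Bibin observed.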
[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526350#comment-16526350 ] Antal Bálint Steinbach commented on YARN-8468: -- [~szegedim], "yarn.scheduler.maximum-allocation-mb" is the existing property for general container resource allocation setting not a new one, but I agree the name is a bit confusing. [~yufeigu], can you please ask the details from [~mrbillau]? [~haibochen], can you please review YARN-7556? This patch is created based on that. > Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Labels: patch > Attachments: YARN-8468.000.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers and cannot be limited by queue or and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per queue basis. > > The use case: User has two pools, one for ad hoc jobs and one for enterprise > apps. User wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. Setting > yarn.scheduler.maximum-allocation-mb sets a default value for maximum > container size for all queues and setting maximum resources per queue with > “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf), this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting and we should > not allow that. 
> * make sure that queue resource cap can not be larger than scheduler max > resource cap in the config. > * implement getMaximumResourceCapability(String queueName) in the > FairScheduler > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue as follows > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics etc for the queue. > * write JUnit tests. > * update the scheduler documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
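The lookup described in the suggested solution could look roughly like this. Only `getMaximumResourceCapability(String)` is named in the ticket; the rest of the class and the use of megabytes alone are illustrative simplifications, not the actual FairScheduler API:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of a per-queue maximum container size that falls back to
// the scheduler-wide yarn.scheduler.maximum-allocation-mb and is clamped
// so a queue cap can never exceed the scheduler cap, as the ticket requires.
class QueueMaxAllocation {
    final long schedulerMaxMb;                 // global maximum-allocation-mb
    final Map<String, Long> perQueueMaxMb = new HashMap<>();

    QueueMaxAllocation(long schedulerMaxMb) {
        this.schedulerMaxMb = schedulerMaxMb;
    }

    void setQueueMax(String queue, long mb) {
        // "queue resource cap can not be larger than scheduler max"
        perQueueMaxMb.put(queue, Math.min(mb, schedulerMaxMb));
    }

    long getMaximumResourceCapability(String queue) {
        return perQueueMaxMb.getOrDefault(queue, schedulerMaxMb);
    }
}
```

This matches the use case above: the ad hoc queue gets a small cap while the enterprise queue inherits the scheduler-wide maximum.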
[jira] [Commented] (YARN-8451) Multiple NM heartbeat thread created when a slow NM resync with RM
[ https://issues.apache.org/jira/browse/YARN-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526343#comment-16526343 ] Jason Lowe commented on YARN-8451: -- Thanks for the report and patch! Are we sure it's safe to unblock the Nodemanager's async dispatcher during this reboot? I'm worried that other events could be dispatched to subsystems while they are trying to reset and cause other problems. I think it would be simpler and safer to have NodeManager#resyncWithRM check a "resyncing" boolean when it's called, avoiding redundantly resyncing if it is currently resyncing. No need for separate threads and atomic booleans. > Multiple NM heartbeat thread created when a slow NM resync with RM > -- > > Key: YARN-8451 > URL: https://issues.apache.org/jira/browse/YARN-8451 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8451.v1.patch > > > During a NM resync with RM (say RM did a master slave switch), if NM is > running slow, more than one RESYNC event may be put into the NM dispatcher by > the existing heartbeat thread before they are processed. As a result, > multiple new heartbeat thread are later created and start to hb to RM > concurrently with their own responseId. If at some point of time, one thread > becomes more than one step behind others, RM will send back a resync signal > in this heartbeat response, killing all containers in this NM. > See comments below for details on how this can happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
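Jason's "resyncing" boolean suggestion can be sketched with a compare-and-set flag. The class below is illustrative, not the actual NodeManager code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of guarding NodeManager#resyncWithRM with a flag so that RESYNC
// events dispatched while a resync is already in progress are dropped
// instead of spawning additional heartbeat threads.
class ResyncGuard {
    private final AtomicBoolean resyncing = new AtomicBoolean(false);

    // Returns true if this caller should perform the resync; false means a
    // resync is already running and this RESYNC event is redundant.
    boolean tryStartResync() {
        return resyncing.compareAndSet(false, true);
    }

    // Called once the re-registration with the RM has completed.
    void finishResync() {
        resyncing.set(false);
    }
}
```

`compareAndSet` lets exactly one resync win at a time without extra threads or locking, which is the simpler alternative to unblocking the async dispatcher mid-reboot.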
[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526331#comment-16526331 ] genericqa commented on YARN-8468: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 6s{color} | {color:red} YARN-8468 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-8468 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12929579/YARN-8468.000.patch | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21139/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Labels: patch > Attachments: YARN-8468.000.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers and cannot be limited by queue or and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per queue basis. > > The use case: User has two pools, one for ad hoc jobs and one for enterprise > apps. User wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. 
Setting > yarn.scheduler.maximum-allocation-mb sets a default value for maximum > container size for all queues and setting maximum resources per queue with > “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf), this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting and we should > not allow that. > * make sure that queue resource cap can not be larger than scheduler max > resource cap in the config. > * implement getMaximumResourceCapability(String queueName) in the > FairScheduler > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue as follows > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics etc for the queue. > * write JUnit tests. > * update the scheduler documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526328#comment-16526328 ] Manikandan R commented on YARN-4606: [~eepayne] Thank you for great explanation. I am able to understand the flow better now. I revisited "move apps" problem which i raised earlier based on new patch and don't think it requires any changes as variables required to calculate numActiveUsersWithOnlyPendingApps are already being set through submitApplication, finishApplication etc calls. However, I am seeing an minor update issue as described below: Lets say, We want to move all apps from queue, A1 to queue, B1. A1 has 4 apps (Only 2 were accommodated because of max am limit constraint. So, remaining 2 not yet activated). All these 4 apps are triggered by different users from u1 to u4. For example app1 by u1 and so on. Only for app 1 & app2, there is an allocate request in pipeline. At this point, {{numActiveUsers}} is 4 and {{numActiveUsersWithOnlyPendingApps}} is 2 in Queue, A1. Now move has been triggered. Since there were running containers for both app 1 and app 2, app3 and app4 has been activated before app 1 and app 2 in Queue, B1 as both these apps were busy in detaching and attaching containers. After the move operation and thread sleep of 5s, pulled these counts expecting u1 and u2 as ActiveUsersWithOnlyPendingApps, but couldn't able to see it. {{numActiveUsers}} is 2 as u3 and u4 had become active users and {{numActiveUsersWithOnlyPendingApps}} is 0 in Queue B1. Then, introduced an NodeUpdate event after the move operation just to force the user limit computation to see the impact on these counts. Now, can able to ActiveUsersWithOnlyPendingApps as 2 and ActiveUsers as 0 (as both u3 and u4 had become non active users by this time as there are no pending allocate request). 
So, after a move-app operation, if there are no events (which could trigger the user limit computation) for a brief period of time, I see an incorrect {{numActiveUsersWithOnlyPendingApps}} count. Is this acceptable, or should we trigger the user limit computation after the move operation, as we do in other places? Please share your thoughts and correct my understanding if you see a gap. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belonging to the same user in a LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, only two users (user1/user2) are able to allocate new > resources, so the computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
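The core idea of the fix under discussion — exclude users whose apps are all still pending from the user-limit denominator — can be sketched as follows. The method is hypothetical; only the variable names mirror the JIRA's terminology:

```java
// Hedged sketch of the YARN-4606 computation: divide queue resources only
// among users that can actually allocate, i.e. subtract the users whose
// applications are all pending from the active-user count.
class UserLimitSketch {
    static long userLimitMb(long queueCapacityMb, int numActiveUsers,
                            int numActiveUsersWithOnlyPendingApps) {
        int usersAbleToAllocate =
            Math.max(1, numActiveUsers - numActiveUsersWithOnlyPendingApps);
        return queueCapacityMb / usersAbleToAllocate;
    }
}
```

In the example from the issue description (user1/user2 active, user3/user4 pending-only), the denominator becomes 2 instead of 4, so the active users are no longer starved by a user limit computed for users who cannot allocate anyway.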
[jira] [Commented] (YARN-8467) AsyncDispatcher should have a name & display it in logs to improve debug
[ https://issues.apache.org/jira/browse/YARN-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526178#comment-16526178 ] Shuai Zhang commented on YARN-8467: --- Because it's just related to debug logs, there's no need to add new unit tests. > AsyncDispatcher should have a name & display it in logs to improve debug > > > Key: YARN-8467 > URL: https://issues.apache.org/jira/browse/YARN-8467 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.1.0 >Reporter: Shuai Zhang >Priority: Trivial > Attachments: YARN-8467.001.patch > > > Currently each AbstractService has a dispatcher, but the dispatcher is not > named. Logs from dispatcher is mixed, which is quite hard to debug any hang > issues. I suggest > # Make it possible to name AsyncDispatcher & its thread (partially done in > YARN-6015) > # Mention the AsyncDispatcher name in all its logs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
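The two suggestions in YARN-8467 — name the dispatcher and its thread, and mention the name in every log line — amount to something like the sketch below. The names are illustrative, not the actual AsyncDispatcher API:

```java
// Sketch of a named dispatcher: the name goes into the event-handling
// thread's name (so thread dumps identify it when debugging hangs) and
// into a prefix for every log message it emits.
class NamedDispatcher {
    private final String name;

    NamedDispatcher(String name) {
        this.name = name;
    }

    Thread createEventThread(Runnable eventLoop) {
        Thread t = new Thread(eventLoop);
        t.setName("AsyncDispatcher event handler [" + name + "]");
        return t;
    }

    String logPrefix(String msg) {
        return "[" + name + "] " + msg;
    }
}
```

With distinct names per service, mixed log output from multiple dispatchers becomes attributable, which is exactly the debugging pain the issue describes.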
[jira] [Commented] (YARN-8378) Missing default implementation of loading application with FileSystemApplicationHistoryStore
[ https://issues.apache.org/jira/browse/YARN-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526165#comment-16526165 ] Lantao Jin commented on YARN-8378: -- [~sunilg] Could you have a time to review? > Missing default implementation of loading application with > FileSystemApplicationHistoryStore > - > > Key: YARN-8378 > URL: https://issues.apache.org/jira/browse/YARN-8378 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, yarn >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Minor > Attachments: YARN-8378.1.patch > > > [YARN-3700|https://issues.apache.org/jira/browse/YARN-3700] and > [YARN-3787|https://issues.apache.org/jira/browse/YARN-3787] add some > limitations (number, time) to loading applications from yarn timelineservice. > But this API missing the default implementation when we use > FileSystemApplicationHistoryStore for applicationhistoryservice instead of > using timelineservice. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8471) YARN RM hangs and stops allocating resources when applications successively running
[ https://issues.apache.org/jira/browse/YARN-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tianjuan updated YARN-8471: --- Affects Version/s: 2.9.0 > YARN RM hangs and stops allocating resources when applications successively > running > --- > > Key: YARN-8471 > URL: https://issues.apache.org/jira/browse/YARN-8471 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.0 >Reporter: tianjuan >Priority: Major > Fix For: 2.9.0 > > Attachments: YARN-8471.001.patch > > > at some point RM just hangs and stops allocating resources. At the point RM > get hangs, YARN throw NullPointerException at > RegularContainerAllocator#allocate, and > RegularContainerAllocator#preCheckForPlacementSet, and > RegularContainerAllocator#getLocalityWaitFactor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8471) YARN RM hangs and stops allocating resources when applications successively running
[ https://issues.apache.org/jira/browse/YARN-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526071#comment-16526071 ] genericqa commented on YARN-8471: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 5s{color} | {color:red} YARN-8471 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-8471 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12929529/YARN-8471.001.patch | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21138/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > YARN RM hangs and stops allocating resources when applications successively > running > --- > > Key: YARN-8471 > URL: https://issues.apache.org/jira/browse/YARN-8471 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: tianjuan >Priority: Major > Fix For: 2.9.0 > > Attachments: YARN-8471.001.patch > > > at some point RM just hangs and stops allocating resources. At the point RM > get hangs, YARN throw NullPointerException at > RegularContainerAllocator#allocate, and > RegularContainerAllocator#preCheckForPlacementSet, and > RegularContainerAllocator#getLocalityWaitFactor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8471) YARN RM hangs and stops allocating resources when applications successively running
[ https://issues.apache.org/jira/browse/YARN-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

tianjuan updated YARN-8471:
---
    Summary: YARN RM hangs and stops allocating resources when applications successively running  (was: YARN throw NullPointerException at RegularContainerAllocator)
[jira] [Created] (YARN-8471) YARN throw NullPointerException at RegularContainerAllocator
tianjuan created YARN-8471:
--

             Summary: YARN throw NullPointerException at RegularContainerAllocator
                 Key: YARN-8471
                 URL: https://issues.apache.org/jira/browse/YARN-8471
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
            Reporter: tianjuan
             Fix For: 2.9.0
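The stack trace above points at scheduler lookups that assume per-application request state is still present when a container is being allocated. As a minimal, self-contained sketch (the class, field, and method names below are illustrative, modeled loosely on the scheduler's locality-wait computation; this is not the YARN-8471 patch), the defensive pattern is to treat a vanished lookup result as a neutral default instead of dereferencing it:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: mimics a locality-wait-factor lookup that can race
// with concurrent removal of the pending request entry. All names here are
// hypothetical simplifications, not the actual YARN scheduler API.
public class LocalityWaitSketch {
    // pending container asks per scheduler key; entries can vanish concurrently
    static final Map<String, Integer> pendingAsks = new ConcurrentHashMap<>();

    static float getLocalityWaitFactor(String schedulerKey, int clusterNodes) {
        // Defensive: the entry may have been removed between the caller's
        // check and this lookup; dereferencing a raw null result would NPE.
        Integer requestedNodes = pendingAsks.get(schedulerKey);
        if (requestedNodes == null || clusterNodes <= 0) {
            return 0f; // neutral default instead of a NullPointerException
        }
        return Math.min((float) requestedNodes / clusterNodes, 1f);
    }

    public static void main(String[] args) {
        pendingAsks.put("app1-priority0", 4);
        System.out.println(getLocalityWaitFactor("app1-priority0", 8)); // 0.5
        // key never registered (or already removed): no NPE, just 0
        System.out.println(getLocalityWaitFactor("app2-priority0", 8)); // 0.0
    }
}
```

The same null-guard applies at each of the three methods named in the report, since any of them can observe the request disappearing mid-allocation.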
[jira] [Commented] (YARN-8435) NPE when the same client simultaneously contact for the first time Yarn Router
[ https://issues.apache.org/jira/browse/YARN-8435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526017#comment-16526017 ]

genericqa commented on YARN-8435:
-

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|  0 | reexec | 0m 31s | Docker mode activated. |
|| Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
|| trunk Compile Tests ||
| +1 | mvninstall | 26m 36s | trunk passed |
| +1 | compile | 0m 24s | trunk passed |
| +1 | checkstyle | 0m 14s | trunk passed |
| +1 | mvnsite | 0m 26s | trunk passed |
| +1 | shadedclient | 11m 34s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 33s | trunk passed |
| +1 | javadoc | 0m 19s | trunk passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 0m 24s | the patch passed |
| +1 | compile | 0m 20s | the patch passed |
| +1 | javac | 0m 20s | the patch passed |
| +1 | checkstyle | 0m 9s | the patch passed |
| +1 | mvnsite | 0m 22s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 31s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 47s | the patch passed |
| +1 | javadoc | 0m 17s | the patch passed |
|| Other Tests ||
| +1 | unit | 1m 35s | hadoop-yarn-server-router in the patch passed. |
| +1 | asflicense | 0m 25s | The patch does not generate ASF License warnings. |
| | | 57m 54s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | YARN-8435 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12929515/YARN-8435.v5.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 6d12671ea775 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ddbff7c |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21137/testReport/ |
| Max. process+thread count | 676 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21137/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.

> NPE when the same client simultaneously contact for the first time Yarn Router
[jira] [Commented] (YARN-8435) NPE when the same client simultaneously contact for the first time Yarn Router
[ https://issues.apache.org/jira/browse/YARN-8435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525975#comment-16525975 ]

rangjiaheng commented on YARN-8435:
---

Thanks [~giovanni.fumarola] for the review. Sorry for misunderstanding what you meant yesterday. The javadoc is OK in YARN-8435.v5.patch.

> NPE when the same client simultaneously contact for the first time Yarn Router
> --
>
>                 Key: YARN-8435
>                 URL: https://issues.apache.org/jira/browse/YARN-8435
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: router
>    Affects Versions: 2.9.0, 3.0.2
>            Reporter: rangjiaheng
>            Priority: Critical
>         Attachments: YARN-8435.v1.patch, YARN-8435.v2.patch,
> YARN-8435.v3.patch, YARN-8435.v4.patch, YARN-8435.v5.patch
>
>
> When two client processes (with the same user name and the same hostname)
> connect to the YARN Router at the same time, to submit an application, kill
> an application, and so on, a java.lang.NullPointerException may be thrown
> from the Router.
[jira] [Updated] (YARN-8435) NPE when the same client simultaneously contact for the first time Yarn Router
[ https://issues.apache.org/jira/browse/YARN-8435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

rangjiaheng updated YARN-8435:
--
    Attachment: YARN-8435.v5.patch
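The race described in YARN-8435 is a classic check-then-create on a per-user entry: two first-time requests from the same user can both miss the cache and one can observe a half-created pipeline. A minimal sketch of the usual remedy, assuming a per-user pipeline map (the `Pipeline` and `getOrCreate` names are hypothetical, not the actual Router API or the committed patch), is to make creation atomic with `ConcurrentHashMap.computeIfAbsent`:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the YARN-8435 race, not the actual patch. The Router
// keeps one interceptor pipeline per user; a separate containsKey/put sequence
// would let two concurrent first-time callers race on entry creation.
public class RouterPipelineSketch {
    static final Map<String, Pipeline> pipelines = new ConcurrentHashMap<>();

    static class Pipeline {
        final String user;
        Pipeline(String user) { this.user = user; }
    }

    // computeIfAbsent runs the factory atomically per key, so concurrent
    // first-time callers for the same user all receive the same instance
    // and never see a partially initialized entry.
    static Pipeline getOrCreate(String user) {
        return pipelines.computeIfAbsent(user, Pipeline::new);
    }

    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(() -> getOrCreate("alice"));
        Thread t2 = new Thread(() -> getOrCreate("alice"));
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(pipelines.size()); // 1: both threads share one pipeline
    }
}
```

The alternative fix, synchronizing the whole lookup method, is also correct but serializes all users; the per-key atomicity of `computeIfAbsent` only blocks concurrent creators of the same entry.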