[jira] [Created] (YARN-10298) TimeLine entity information is stored in only one region when using Apache HBase as backend storage
aimahou created YARN-10298:
--
Summary: TimeLine entity information is stored in only one region when using Apache HBase as backend storage
Key: YARN-10298
URL: https://issues.apache.org/jira/browse/YARN-10298
Project: Hadoop YARN
Issue Type: Improvement
Components: ATSv2, timelineservice
Affects Versions: 3.1.1
Reporter: aimahou

h2. Issue
TimeLine entity information is stored in only one region when using Apache HBase as backend storage.

h2. Probable cause
We found in the source code that when the HBase timeline writer stores timeline entity info, the rowKey is composed of clusterId, userId, flowName, flowRunId and appId, which causes the row keys to be sorted in dictionary order. Thus timeline entities may be stored in only one region or a few adjacent regions.

h2. Related code snippet
HBaseTimelineWriterImpl.java
{code:java}
public TimelineWriteResponse write(TimelineCollectorContext context,
    TimelineEntities data, UserGroupInformation callerUgi)
    throws IOException {
  ...
  boolean isApplication = ApplicationEntity.isApplicationEntity(te);
  byte[] rowKey;
  if (isApplication) {
    ApplicationRowKey applicationRowKey =
        new ApplicationRowKey(clusterId, userId, flowName, flowRunId, appId);
    rowKey = applicationRowKey.getRowKey();
    store(rowKey, te, flowVersion, Tables.APPLICATION_TABLE);
  } else {
    EntityRowKey entityRowKey =
        new EntityRowKey(clusterId, userId, flowName, flowRunId, appId,
            te.getType(), te.getIdPrefix(), te.getId());
    rowKey = entityRowKey.getRowKey();
    store(rowKey, te, flowVersion, Tables.ENTITY_TABLE);
  }
  if (!isApplication && SubApplicationEntity.isSubApplicationEntity(te)) {
    SubApplicationRowKey subApplicationRowKey =
        new SubApplicationRowKey(subApplicationUser, clusterId,
            te.getType(), te.getIdPrefix(), te.getId(), userId);
    rowKey = subApplicationRowKey.getRowKey();
    store(rowKey, te, flowVersion, Tables.SUBAPPLICATION_TABLE);
  }
  ...
}
{code}

h2. Suggestion
We can use the hash code of the original rowKey as the rowKey to store and read timeline entity data.
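The suggestion above trades write distribution against HBase's ordered scans: hashing the whole row key spreads load across regions but breaks prefix/range reads. A common compromise in HBase schema design is salting with a fixed number of buckets. A minimal sketch, using hypothetical class and method names (this is not the ATSv2 writer code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only: prefix the original row key with a one-byte
// bucket derived from a stable hash, so writes spread across pre-split
// regions instead of piling onto one lexicographic range. Point reads
// recompute the bucket from the original key; range scans must fan out
// over all buckets, which is the trade-off of any salted-key scheme.
public class SaltedRowKey {
  private final int buckets;

  public SaltedRowKey(int buckets) {
    this.buckets = buckets;
  }

  /** Stable bucket in [0, buckets) for the original row key bytes. */
  public byte bucketFor(byte[] originalKey) {
    return (byte) Math.floorMod(Arrays.hashCode(originalKey), buckets);
  }

  /** Salted key = [bucket byte] + original key bytes. */
  public byte[] salt(byte[] originalKey) {
    byte[] salted = new byte[originalKey.length + 1];
    salted[0] = bucketFor(originalKey);
    System.arraycopy(originalKey, 0, salted, 1, originalKey.length);
    return salted;
  }

  public static void main(String[] args) {
    SaltedRowKey salter = new SaltedRowKey(16);
    byte[] key = "cluster!user!flow!1!application_1"
        .getBytes(StandardCharsets.UTF_8);
    byte[] salted = salter.salt(key);
    System.out.println("bucket=" + salted[0] + " saltedLength=" + salted.length);
  }
}
```

With this scheme, reading a single entity recomputes the bucket from the original key, while a scan over a flow has to issue one scan per bucket and merge the results, so read-side code must also change.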
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10299) TimeLine Service v1.5 with LevelDB as backend storage crashes when the data scale reaches 100GB
aimahou created YARN-10299:
--
Summary: TimeLine Service v1.5 with LevelDB as backend storage crashes when the data scale reaches 100GB
Key: YARN-10299
URL: https://issues.apache.org/jira/browse/YARN-10299
Project: Hadoop YARN
Issue Type: Bug
Components: timelineservice
Affects Versions: 3.1.1
Reporter: aimahou

h2. Issue
TimeLine Service v1.5 with LevelDB as backend storage crashes when the data scale reaches 100GB.

h2. Specific exception
{code}
2020-04-24 16:06:59,914 INFO applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore (ApplicationHistoryManagerOnTimelineStore.java:generateApplicationReport(691)) - No application attempt found for application_1587696012637_1143. Use a placeholder for its latest attempt id.
org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: The entity for application attempt appattempt_1587696012637_1143_01 doesn't exist in the timeline store
	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getApplicationAttempt(ApplicationHistoryManagerOnTimelineStore.java:183)
	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.generateApplicationReport(ApplicationHistoryManagerOnTimelineStore.java:677)
	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getApplications(ApplicationHistoryManagerOnTimelineStore.java:128)
	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getApplications(ApplicationHistoryClientService.java:195)
	at org.apache.hadoop.yarn.server.webapp.AppsBlock.getApplicationReport(AppsBlock.java:129)
	at org.apache.hadoop.yarn.server.webapp.AppsBlock.fetchData(AppsBlock.java:114)
	at org.apache.hadoop.yarn.server.webapp.AppsBlock.render(AppsBlock.java:137)
	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
	at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
	at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
	at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
	at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
	at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
	at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
	at org.apache.hadoop.yarn.webapp.Dispatcher.render(Dispatcher.java:206)
	at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:165)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:287)
	at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:277)
	at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:182)
	at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
	at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
	at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
	at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
	at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
	at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
	at org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
	at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
	at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:304)
	at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
	at org.eclipse.jetty.servlet.ServletHa
{code}
[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120982#comment-17120982 ] Prabhu Joseph commented on YARN-10293:
--
Thanks [~wangda] for reviewing. The older behavior of Allocate Container on Single Node skips scheduling on a node when it has a reserved container or no available containers.
{code}
if (calculator.computeAvailableContainers(Resources
    .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
    minimumAllocation) <= 0) {
{code}
Multi Node Placement checks the used partition capacity, which includes the reserved capacity. But there can still be nodes with available containers, which are ignored (as per the JIRA description).
{code}
if (getRootQueue().getQueueCapacities().getUsedCapacity(
    candidates.getPartition()) >= 1.0f
    && preemptionManager.getKillableResource(
{code}
This condition can be removed; I don't see any impact. [~Tao Yang], can you confirm? Other approaches are the one in the patch, or adding an extra check for available containers on any node that is part of the candidates, in addition to the above checks.

> Reserved Containers not allocated from available space of other nodes in
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.3.0
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
> Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch
>
> Reserved Containers not allocated from available space of other nodes in
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues
> related to it
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the
> same issue with slight difference in the repro. 
> *Repro:* > *Nodes : Available : Used* > Node1 - 8GB, 8vcores - 8GB. 8cores > Node2 - 8GB, 8vcores - 8GB. 8cores > Node3 - 8GB, 8vcores - 8GB. 8cores > Queues -> A and B both 50% capacity, 100% max capacity > MultiNode enabled + Preemption enabled > 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores > 2. JobB Submitted to B queue with AM size of 1GB > {code} > 2020-05-21 12:12:27,313 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest > IP=172.27.160.139 OPERATION=Submit Application Request > TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 > CALLERCONTEXT=CLI QUEUENAME=dummy > {code} > 3. Preemption happens and used capacity is lesser than 1.0f > {code} > 2020-05-21 12:12:48,222 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: > Non-AM container preempted, current > appAttemptId=appattempt_1590046667304_0004_01, > containerId=container_e09_1590046667304_0004_01_24, > resource= > {code} > 4. JobB gets a Reserved Container as part of > CapacityScheduler#allocateOrReserveNewContainer > {code} > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to > RESERVED > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > {code} > *Why RegularContainerAllocator reserved the container when the used capacity > is <= 1.0f ?* > {code} > The reason is even though the container is preempted - nodemanager has to > stop the container and heartbeat and update the available and unallocated > resources to ResourceManager. > {code} > 5. 
Now, no new allocation happens and reserved container stays at reserved. > After reservation the used capacity becomes 1.0f, below will be in a loop and > no new allocate or reserve happens. The reserved container cannot be > allocated as reserved node does not have space. node2 has space for 1GB, > 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting > called causing the Hang. > *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> > CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container > on node* > {code} > 2020-05-21 12:13:
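The extra check proposed in the comment above (before skipping scheduling on used capacity alone, verify whether any candidate node still fits a minimum allocation) could be sketched as follows. The types and names here are hypothetical stand-ins, not the actual CapacityScheduler API:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: even when used partition capacity >= 1.0f
// (which includes reserved capacity), keep scheduling alive if some
// candidate node still has room for a minimum allocation, so a reserved
// container can move to a node that has space.
public class MultiNodeGuard {

  /** Hypothetical stand-in for a node's unallocated resources. */
  public static class Node {
    public final long unallocatedMb;
    public final int unallocatedVcores;

    public Node(long unallocatedMb, int unallocatedVcores) {
      this.unallocatedMb = unallocatedMb;
      this.unallocatedVcores = unallocatedVcores;
    }
  }

  /** True if at least one candidate node can fit a minimum allocation. */
  public static boolean anyNodeFits(List<Node> candidates,
      long minMb, int minVcores) {
    for (Node n : candidates) {
      if (n.unallocatedMb >= minMb && n.unallocatedVcores >= minVcores) {
        return true;
      }
    }
    return false;
  }

  /**
   * The old logic skipped purely on used capacity; the extra check avoids
   * the hang when some node in the candidate set still has free space.
   */
  public static boolean shouldSkipScheduling(float usedPartitionCapacity,
      List<Node> candidates, long minMb, int minVcores) {
    return usedPartitionCapacity >= 1.0f
        && !anyNodeFits(candidates, minMb, minVcores);
  }

  public static void main(String[] args) {
    // Mirrors the repro: capacity looks fully used after the reservation,
    // but one node still fits a 1GB / 1-vcore ask, so don't skip.
    List<Node> nodes = Arrays.asList(new Node(0, 0), new Node(1024, 1));
    System.out.println(shouldSkipScheduling(1.0f, nodes, 1024, 1));
  }
}
```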
[jira] [Updated] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement
[ https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10259: - Fix Version/s: 3.3.1 > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement > --- > > Key: YARN-10259 > URL: https://issues.apache.org/jira/browse/YARN-10259 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.0, 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10259-001.patch, YARN-10259-002.patch, > YARN-10259-003.patch > > > Reserved Containers are not allocated from the available space of other nodes > in CandidateNodeSet in MultiNodePlacement. > *Repro:* > 1. MultiNode Placement Enabled. > 2. Two nodes h1 and h2 with 8GB > 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets > placed in h2. > 4. Submit app3 AM which is reserved in h1 > 5. Kill app2 which frees space in h2. > 6. app3 AM never gets ALLOCATED > RM logs shows YARN-8127 fix rejecting the allocation proposal for app3 AM on > h2 as it expects the assignment to be on same node where reservation has > happened. > {code} > 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] > scheduler.SchedulerApplicationAttempt > (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt > appattempt_1588684773609_0003_01 reserved container > container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 > available= used=. 
This attempt > currently has 1 reserved containers at priority 0; currentReservation > > 2020-05-05 18:49:37,264 INFO [AsyncDispatcher event handler] > fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved > container=container_1588684773609_0003_01_01, on node=host: h1:1234 > #containers=1 available= used= > with resource= >RESERVED=[(Application=appattempt_1588684773609_0003_01; > Node=h1:1234; Resource=)] > > 2020-05-05 18:49:38,283 DEBUG [Time-limited test] > allocator.RegularContainerAllocator > (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: > node=h2 application=application_1588684773609_0003 priority=0 > pendingAsk=,repeat=1> > type=OFF_SWITCH > 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp > (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate > from reserved container container_1588684773609_0003_01_01, but node is > not reserved >ALLOCATED=[(Application=appattempt_1588684773609_0003_01; > Node=h2:1234; Resource=)] > {code} > Attached testcase which reproduces the issue.
[jira] [Commented] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement
[ https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120998#comment-17120998 ] Prabhu Joseph commented on YARN-10259:
--
Have cherry-picked to branch-3.3.1. Thanks.
[jira] [Resolved] (YARN-10289) spark on yarn execption
[ https://issues.apache.org/jira/browse/YARN-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran resolved YARN-10289. --- Resolution: Invalid > spark on yarn execption > > > Key: YARN-10289 > URL: https://issues.apache.org/jira/browse/YARN-10289 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.0.3 > Environment: hadoop 3.0.0 >Reporter: huang xin >Priority: Major > > i execute spark on yarn and get the issue like this: > stderr96Error: Could not find or load main class > org.apache.spark.executor.CoarseGrainedExecutorBackend > prelaunch.out70Setting up env variables2_03? > Setting up job resources > Launching container > stderr96Error: Could not find or load main class > org.apache.spark.executor.CoarseGrainedExecutorBackend > stdout0(&container_1590115508504_0033_02_01Ωcontainer-localizer-syslog1842020-05-24 > 15:39:20,867 INFO [main] > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: > Disk Validator: yarn.nodemanager.disk-validator is loaded. > prelaunch.out70Setting up env variables > Setting up job resources > Launching container > stderr333ERROR StatusLogger No log4j2 configuration file found. Using default > configuration: logging only errors to the console. Set system property > 'org.apache.logging.log4j.simplelog.StatusLogger.level' to TRACE to show > Log4j2 internal initialization logging. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > prelaunch.out70Setting up env variables1_05? > Setting up job resources > Launching container > stderr96Error: Could not find or load main class > org.apache.spark.executor.CoarseGrainedExecutorBackend > prelaunch.out70Setting up env variables1_04? 
> Setting up job resources > Launching container > stderr96Error: Could not find or load main class > org.apache.spark.executor.CoarseGrainedExecutorBackend > stdout0 > VERSION*(&container_1590115508504_0033_01_0none??data:BCFile.indexnone? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10289) spark on yarn execption
[ https://issues.apache.org/jira/browse/YARN-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121009#comment-17121009 ] Steve Loughran commented on YARN-10289: --- # looks more like a spark error. # And a config one. So not a bug in their code. Check your classpath take it up on the spark mailing lists. > spark on yarn execption > > > Key: YARN-10289 > URL: https://issues.apache.org/jira/browse/YARN-10289 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.0.3 > Environment: hadoop 3.0.0 >Reporter: huang xin >Priority: Major > > i execute spark on yarn and get the issue like this: > stderr96Error: Could not find or load main class > org.apache.spark.executor.CoarseGrainedExecutorBackend > prelaunch.out70Setting up env variables2_03? > Setting up job resources > Launching container > stderr96Error: Could not find or load main class > org.apache.spark.executor.CoarseGrainedExecutorBackend > stdout0(&container_1590115508504_0033_02_01Ωcontainer-localizer-syslog1842020-05-24 > 15:39:20,867 INFO [main] > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: > Disk Validator: yarn.nodemanager.disk-validator is loaded. > prelaunch.out70Setting up env variables > Setting up job resources > Launching container > stderr333ERROR StatusLogger No log4j2 configuration file found. Using default > configuration: logging only errors to the console. Set system property > 'org.apache.logging.log4j.simplelog.StatusLogger.level' to TRACE to show > Log4j2 internal initialization logging. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > prelaunch.out70Setting up env variables1_05? > Setting up job resources > Launching container > stderr96Error: Could not find or load main class > org.apache.spark.executor.CoarseGrainedExecutorBackend > prelaunch.out70Setting up env variables1_04? 
> Setting up job resources > Launching container > stderr96Error: Could not find or load main class > org.apache.spark.executor.CoarseGrainedExecutorBackend > stdout0 > VERSION*(&container_1590115508504_0033_01_0none??data:BCFile.indexnone? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10290) Resourcemanager recover failed when fair scheduler queue acl changed
[ https://issues.apache.org/jira/browse/YARN-10290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YARN-10290. -- Resolution: Duplicate This issue is fixed in YARN-7913. That change fixes a number of issues around restores that fail. The change was not backported to Hadoop 2.x. > Resourcemanager recover failed when fair scheduler queue acl changed > > > Key: YARN-10290 > URL: https://issues.apache.org/jira/browse/YARN-10290 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: yehuanhuan >Priority: Blocker > > ResourceManager recovery fails when a Fair Scheduler queue ACL has changed. > Because the queue ACL changed, recovering the application (addApplication() in > FairScheduler) is rejected. Recovering the application attempt > (addApplicationAttempt() in FairScheduler) then finds the application is null, > which leaves both RMs in standby. Repro as follows: > > # A user runs a long-running application. > # Change the queue ACL (aclSubmitApps) so that the user no longer has permission. > # Restart the RM. 
> {code:java} > 2020-05-25 16:04:06,191 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating > application application_1590393162216_0005 with final state: FAILED > 2020-05-25 16:04:06,192 ERROR > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to > load/recover state > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:663) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1246) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:116) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1072) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1036) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:789) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:102) > at > 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:897) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:850) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:723) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:322) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:427) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:584) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:980) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1021) > at > org.apa
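The NPE path in the trace above can be sketched in a few lines. The names below are invented and this is not the actual FairScheduler code; it only models the failure mode: addApplication() rejects the app because the ACL now denies the user, but attempt recovery still dereferences the missing application. The defensive variant shows the kind of hardening the duplicate fix (YARN-7913) applies, failing the attempt cleanly instead of throwing.

```java
import java.util.HashMap;
import java.util.Map;

public class RecoverySketch {
    // appId -> recovered application (hypothetical stand-in for scheduler state).
    static final Map<String, Object> apps = new HashMap<>();

    static void addApplication(String appId, boolean aclAllows) {
        if (aclAllows) {
            apps.put(appId, new Object());
        }
        // else: submission rejected by the queue ACL, nothing is stored
    }

    // Defensive attempt recovery: check for the missing application instead of
    // dereferencing null (the original code NPE'd here and killed both RMs).
    static boolean addApplicationAttempt(String appId) {
        Object app = apps.get(appId);
        if (app == null) {
            return false; // attempt cannot be recovered; reject it cleanly
        }
        return true;
    }

    public static void main(String[] args) {
        addApplication("application_1590393162216_0005", false); // ACL now denies
        boolean recovered = addApplicationAttempt("application_1590393162216_0005");
        System.out.println("attempt recovered? " + recovered); // false, no NPE
    }
}
```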
[jira] [Updated] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-6492: Fix Version/s: 2.10.1 2.9.3 > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Fix For: 2.9.3, 3.2.2, 2.10.1, 3.4.0, 3.3.1, 3.1.5 > > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.10.019.patch, > YARN-6492-branch-2.8.014.patch, YARN-6492-branch-2.9.015.patch, > YARN-6492-branch-3.1.018.patch, YARN-6492-branch-3.2.017.patch, > YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, > YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, > YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, > YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, > YARN-6492.012.WIP.patch, YARN-6492.013.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
Eric Badger created YARN-10300: -- Summary: appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat Key: YARN-10300 URL: https://issues.apache.org/jira/browse/YARN-10300 Project: Hadoop YARN Issue Type: Bug Reporter: Eric Badger Assignee: Eric Badger {noformat} 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=,applicationType=MAPREDUCE {noformat} {{appMasterHost=N/A}} should have the AM hostname instead of N/A. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
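The ApplicationSummary line above is a comma-separated list of key=value pairs, which makes the missing host easy to spot mechanically. A minimal sketch (not RM code; the naive split would also break on values that themselves contain commas, such as the stripped preemptedResources value, so treat this as illustrative only):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AppSummaryParse {
    // Naive key=value parser for an ApplicationSummary log line.
    static Map<String, String> parse(String summary) {
        Map<String, String> kv = new LinkedHashMap<>();
        for (String part : summary.split(",")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                kv.put(part.substring(0, eq), part.substring(eq + 1));
            }
        }
        return kv;
    }

    public static void main(String[] args) {
        String line = "appId=application_1586003420099_12444961,state=FAILED,"
                + "appMasterHost=N/A,finalStatus=FAILED";
        Map<String, String> kv = parse(line);
        // The bug report: this field should carry the AM hostname.
        System.out.println("appMasterHost = " + kv.get("appMasterHost")); // N/A
    }
}
```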
[jira] [Created] (YARN-10301) "DIGEST-MD5: digest response format violation. Mismatched response." when network partition occurs
YCozy created YARN-10301: Summary: "DIGEST-MD5: digest response format violation. Mismatched response." when network partition occurs Key: YARN-10301 URL: https://issues.apache.org/jira/browse/YARN-10301 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.3.0 Reporter: YCozy We observed the "Mismatched response." error in RM's log when a NM gets network-partitioned after RM failover. Here's how it happens: Initially, we have a sleeper YARN service running in a cluster with two RMs (an active RM1 and a standby RM2) and one NM. At some point, we perform a RM failover from RM1 to RM2. RM1's log: {noformat} 2020-06-01 16:29:20,387 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to standby state{noformat} RM2's log: {noformat} 2020-06-01 16:29:27,818 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to active state{noformat} After the RM failover, the NM encounters a network partition and fails to register with RM2. In other words, there's no "NodeManager from node *** registered" in RM2's log. This does not affect the sleeper YARN service. The sleeper service successfully recovers after the RM failover. We can see in RM2's log: {noformat} 2020-06-01 16:30:06,703 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_6_0001_01 State change from LAUNCHED to RUNNING on event = REGISTERED{noformat} Then, we stop the sleeper service. In RM2's log, we can see that: {noformat} 2020-06-01 16:30:12,157 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: application_6_0001 unregistered successfully. ... 
2020-06-01 16:31:09,861 INFO org.apache.hadoop.yarn.service.webapp.ApiServer: Successfully stopped service sleeper1{noformat} And in AM's log, we can see that: {noformat} 2020-06-01 16:30:12,651 [shutdown-hook-0] INFO service.ServiceMaster - SHUTDOWN_MSG:{noformat} Some time later, we observe the "Mismatched response" in RM2's log: {noformat} 2020-06-01 16:43:20,699 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response. at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:376) at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:623) at org.apache.hadoop.ipc.Client$Connection.access$2400(Client.java:414) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:827) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:823) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:823) at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:414) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1667) at org.apache.hadoop.ipc.Client.call(Client.java:1483) at org.apache.hadoop.ipc.Client.call(Client.java:1436) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy102.stopContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:147) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) at com.sun.proxy.$Proxy103.stopContainers(Unknown Source) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:153) at org.apache.hadoop.yarn.se
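A DIGEST-MD5 "Mismatched response." means the two sides computed different digests from what should be a shared secret. One plausible reading of the trace above (an assumption, not a diagnosis from this report) is that the token AMLauncher.cleanup() presents was minted under a secret the partitioned NM no longer shares after the failover. The effect can be illustrated generically with plain HMAC; this is not Hadoop's token code and the key names are invented:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DigestMismatch {
    // Compute an HMAC over a message; any keyed digest shows the same effect.
    static byte[] hmac(byte[] key, byte[] msg) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(key, "HmacSHA1"));
            return mac.doFinal(msg);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] challenge = "stopContainers".getBytes(StandardCharsets.UTF_8);
        byte[] keyBefore = "secret-before-failover".getBytes(StandardCharsets.UTF_8); // invented
        byte[] keyAfter  = "secret-after-failover".getBytes(StandardCharsets.UTF_8);  // invented
        boolean match = Arrays.equals(hmac(keyBefore, challenge), hmac(keyAfter, challenge));
        // Different secrets -> different digests -> the SASL layer reports
        // "digest response format violation. Mismatched response."
        System.out.println("digests match: " + match); // false
    }
}
```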
[jira] [Updated] (YARN-10301) "DIGEST-MD5: digest response format violation. Mismatched response." when network partition occurs
[ https://issues.apache.org/jira/browse/YARN-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YCozy updated YARN-10301: - Description: We observed the "Mismatched response." error in RM's log when a NM gets network-partitioned after RM failover. Here's how it happens: Initially, we have a sleeper YARN service running in a cluster with two RMs (an active RM1 and a standby RM2) and one NM. At some point, we perform a RM failover from RM1 to RM2. RM1's log: {noformat} 2020-06-01 16:29:20,387 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to standby state{noformat} RM2's log: {noformat} 2020-06-01 16:29:27,818 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to active state{noformat} After the RM failover, the NM encounters a network partition and fails to register with RM2. In other words, there's no "NodeManager from node *** registered" in RM2's log. This does not affect the sleeper YARN service. The sleeper service successfully recovers after the RM failover. We can see in RM2's log: {noformat} 2020-06-01 16:30:06,703 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_6_0001_01 State change from LAUNCHED to RUNNING on event = REGISTERED{noformat} Then, we stop the sleeper service. In RM2's log, we can see that: {noformat} 2020-06-01 16:30:12,157 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: application_6_0001 unregistered successfully. ... 
2020-06-01 16:31:09,861 INFO org.apache.hadoop.yarn.service.webapp.ApiServer: Successfully stopped service sleeper1{noformat} And in AM's log, we can see that: {noformat} 2020-06-01 16:30:12,651 [shutdown-hook-0] INFO service.ServiceMaster - SHUTDOWN_MSG:{noformat} Some time later, we observe the "Mismatched response" in RM2's log: {noformat} 2020-06-01 16:43:20,699 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response. at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:376) at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:623) at org.apache.hadoop.ipc.Client$Connection.access$2400(Client.java:414) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:827) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:823) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:823) at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:414) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1667) at org.apache.hadoop.ipc.Client.call(Client.java:1483) at org.apache.hadoop.ipc.Client.call(Client.java:1436) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy102.stopContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:147) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) at com.sun.proxy.$Proxy103.stopContainers(Unknown Source) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:153) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:354) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
[jira] [Updated] (YARN-10251) Show extended resources on legacy RM UI.
[ https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-10251: -- Attachment: YARN-10251.branch-2.10.003.patch > Show extended resources on legacy RM UI. > > > Key: YARN-10251 > URL: https://issues.apache.org/jira/browse/YARN-10251 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Major > Attachments: Legacy RM UI With Not All Resources Shown.png, Updated > NodesPage UI With GPU columns.png, Updated RM UI With All Resources > Shown.png.png, YARN-10251.branch-2.10.001.patch, > YARN-10251.branch-2.10.002.patch, YARN-10251.branch-2.10.003.patch > > > It would be great to update the legacy RM UI to include GPU resources in the > overview and in the per-app sections. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10251) Show extended resources on legacy RM UI.
[ https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121359#comment-17121359 ] Hadoop QA commented on YARN-10251: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 17m 14s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. 
{color} | || || || || {color:brown} branch-2.10 Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 2m 12s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 12m 23s{color} | {color:green} branch-2.10 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 44s{color} | {color:green} branch-2.10 passed with JDK Oracle Corporation-1.7.0_95-b00 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 15s{color} | {color:green} branch-2.10 passed with JDK Private Build-1.8.0_252-8u252-b09-1~16.04-b09 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 40s{color} | {color:green} branch-2.10 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 18s{color} | {color:green} branch-2.10 passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 17s{color} | {color:red} hadoop-yarn-server-common in branch-2.10 failed with JDK Oracle Corporation-1.7.0_95-b00. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 18s{color} | {color:red} hadoop-yarn-server-resourcemanager in branch-2.10 failed with JDK Oracle Corporation-1.7.0_95-b00. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s{color} | {color:green} branch-2.10 passed with JDK Private Build-1.8.0_252-8u252-b09-1~16.04-b09 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 29s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 33s{color} | {color:green} branch-2.10 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 17s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 6s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 40s{color} | {color:green} the patch passed with JDK Oracle Corporation-1.7.0_95-b00 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 10s{color} | {color:green} the patch passed with JDK Private Build-1.8.0_252-8u252-b09-1~16.04-b09 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 10s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 38s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server: The patch generated 4 new + 43 unchanged - 1 fixed = 47 total (was 44) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. 
{color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 29s{color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkOracleCorporation-1.7.0_95-b00 with JDK Oracle Corporation-1.7.0_95-b00 generated 4 new + 0 unchanged - 0 fixed = 4 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s{color} | {color:green} hadoop-yarn-server-common in the patch passed wit
[jira] [Created] (YARN-10302) Support custom packing algorithm for FairScheduler
William W. Graham Jr created YARN-10302: --- Summary: Support custom packing algorithm for FairScheduler Key: YARN-10302 URL: https://issues.apache.org/jira/browse/YARN-10302 Project: Hadoop YARN Issue Type: New Feature Reporter: William W. Graham Jr The {{FairScheduler}} class allocates containers to nodes based on the node with the most available memory[0]. Create the ability to instead configure a custom packing algorithm with different logic. For instance for effective auto scaling, a bin packing algorithm might be a better choice. 0 - https://github.com/apache/hadoop/blob/56b7571131b0af03b32bf1c5673c32634652df21/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1034-L1043 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
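The idea in this ticket amounts to making node selection a pluggable ordering. A hedged sketch follows; the names are illustrative and this is not the FairScheduler API. The default behavior spreads load (most available memory first), while a bin-packing policy prefers the fullest node that still fits the request, which helps drain under-used nodes for auto scaling:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PackingPolicySketch {
    static class Node {
        final String name;
        final int availableMb;
        Node(String name, int availableMb) { this.name = name; this.availableMb = availableMb; }
    }

    // Spread: most available memory first (roughly the current FairScheduler ordering).
    static final Comparator<Node> SPREAD =
            Comparator.comparingInt((Node n) -> n.availableMb).reversed();

    // Pack: least available memory first, concentrating containers on fewer nodes.
    static final Comparator<Node> PACK =
            Comparator.comparingInt((Node n) -> n.availableMb);

    // Pick the preferred node under the given policy, skipping nodes that can't fit.
    static Node pick(List<Node> nodes, int askMb, Comparator<Node> policy) {
        Node best = null;
        for (Node n : nodes) {
            if (n.availableMb < askMb) continue; // cannot satisfy the request
            if (best == null || policy.compare(n, best) < 0) best = n;
        }
        return best;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(new Node("n1", 7000), new Node("n2", 2000));
        System.out.println(pick(nodes, 1024, SPREAD).name); // n1: emptiest node
        System.out.println(pick(nodes, 1024, PACK).name);   // n2: fullest node that fits
    }
}
```

In this framing, the PR's "custom packing algorithm" is just a different Comparator supplied via configuration, so no other scheduler logic has to change.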
[jira] [Updated] (YARN-10251) Show extended resources on legacy RM UI.
[ https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-10251: -- Attachment: YARN-10251.003.patch > Show extended resources on legacy RM UI. > > > Key: YARN-10251 > URL: https://issues.apache.org/jira/browse/YARN-10251 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Major > Attachments: Legacy RM UI With Not All Resources Shown.png, Updated > NodesPage UI With GPU columns.png, Updated RM UI With All Resources > Shown.png.png, YARN-10251.003.patch, YARN-10251.branch-2.10.001.patch, > YARN-10251.branch-2.10.002.patch, YARN-10251.branch-2.10.003.patch > > > It would be great to update the legacy RM UI to include GPU resources in the > overview and in the per-app sections. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10302) Support custom packing algorithm for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William W. Graham Jr updated YARN-10302: https://github.com/apache/hadoop/pull/2044 > Support custom packing algorithm for FairScheduler > -- > > Key: YARN-10302 > URL: https://issues.apache.org/jira/browse/YARN-10302 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: William W. Graham Jr >Priority: Major > > The {{FairScheduler}} class allocates containers to nodes based on the node > with the most available memory[0]. Create the ability to instead configure a > custom packing algorithm with different logic. For instance for effective > auto scaling, a bin packing algorithm might be a better choice. > 0 - > https://github.com/apache/hadoop/blob/56b7571131b0af03b32bf1c5673c32634652df21/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1034-L1043 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10300: --- Attachment: YARN-10300.001.patch > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121419#comment-17121419 ] Wangda Tan commented on YARN-10293: --- [~prabhujoseph], I agree with you. I think the entire {{if}} check is helpful when the cluster is full: we won't go into the allocation phase and will save some CPU cycles. However, it won't matter too much when the cluster is full – we cannot get a container allocation in any case. I suggest simplifying this logic by removing the if check; it sounds dangerous to me. If we see it cause a performance issue, we can solve it in a different way (like increasing the wait time if nothing can be allocated or reserved). > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement (YARN-10259) > > > Key: YARN-10293 > URL: https://issues.apache.org/jira/browse/YARN-10293 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10293-001.patch, YARN-10293-002.patch > > > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues > related to it > https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987 > I have found one more bug in the CapacityScheduler.java code which causes the > same issue with a slight difference in the repro. > *Repro:* > *Nodes : Available : Used* > Node1 - 8GB, 8vcores - 8GB. 8cores > Node2 - 8GB, 8vcores - 8GB. 8cores > Node3 - 8GB, 8vcores - 8GB. 8cores > Queues -> A and B both 50% capacity, 100% max capacity > MultiNode enabled + Preemption enabled > 1. JobA submitted to queue A, which used the full cluster (24 GB and 24 vcores) > 2. 
JobB Submitted to B queue with AM size of 1GB > {code} > 2020-05-21 12:12:27,313 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest > IP=172.27.160.139 OPERATION=Submit Application Request > TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 > CALLERCONTEXT=CLI QUEUENAME=dummy > {code} > 3. Preemption happens and used capacity is lesser than 1.0f > {code} > 2020-05-21 12:12:48,222 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: > Non-AM container preempted, current > appAttemptId=appattempt_1590046667304_0004_01, > containerId=container_e09_1590046667304_0004_01_24, > resource= > {code} > 4. JobB gets a Reserved Container as part of > CapacityScheduler#allocateOrReserveNewContainer > {code} > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to > RESERVED > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > {code} > *Why RegularContainerAllocator reserved the container when the used capacity > is <= 1.0f ?* > {code} > The reason is even though the container is preempted - nodemanager has to > stop the container and heartbeat and update the available and unallocated > resources to ResourceManager. > {code} > 5. Now, no new allocation happens and reserved container stays at reserved. > After reservation the used capacity becomes 1.0f, below will be in a loop and > no new allocate or reserve happens. The reserved container cannot be > allocated as reserved node does not have space. 
node2 has space for 1GB, > 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting > called causing the Hang. > *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> > CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container > on node* > {code} > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Trying to fulfill reservation for application application_1590046667304_0005 > on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > assignContainers: partition= #applications=1 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved containe
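The hang described above can be reduced to a toy model (all names below are illustrative, not real CapacityScheduler code): if the allocator only ever revisits the reserved node, a request that no longer fits there re-reserves forever, whereas a fallback pass over the other candidate nodes would find node2's free space.

```java
import java.util.List;

// Illustrative sketch only: models the infinite re-reservation loop with toy
// types. ToyNode and both allocate methods are hypothetical names.
public class ReservationLoopDemo {
    static final class ToyNode {
        final String name;
        final long freeMb;
        ToyNode(String name, long freeMb) { this.name = name; this.freeMb = freeMb; }
    }

    // Buggy shape: only the reserved node is consulted, so a 1 GB request
    // hangs even though another node has room (it just re-reserves).
    static String allocateReservedOnly(ToyNode reserved, long reqMb) {
        return reserved.freeMb >= reqMb ? reserved.name : null;
    }

    // Fixed shape: if the reserved node is full, fall back to the remaining
    // candidates (roughly what allocateOrReserveNewContainers would cover).
    static String allocateWithFallback(ToyNode reserved, List<ToyNode> candidates, long reqMb) {
        String onReserved = allocateReservedOnly(reserved, reqMb);
        if (onReserved != null) return onReserved;
        for (ToyNode n : candidates) {
            if (n != reserved && n.freeMb >= reqMb) return n.name;
        }
        return null; // genuinely no space anywhere
    }

    public static void main(String[] args) {
        ToyNode n3 = new ToyNode("node3", 0);    // reserved node, full
        ToyNode n2 = new ToyNode("node2", 1024); // has 1 GB free
        List<ToyNode> all = List.of(n3, n2);
        System.out.println(allocateReservedOnly(n3, 1024));      // null -> would loop
        System.out.println(allocateWithFallback(n3, all, 1024)); // node2
    }
}
```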
[jira] [Commented] (YARN-10251) Show extended resources on legacy RM UI.
[ https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121458#comment-17121458 ] Hadoop QA commented on YARN-10251: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 21m 38s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 10s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 30s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 20s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 5s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 39s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 45s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 25s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 35s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 54s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server: The patch generated 4 new + 34 unchanged - 0 fixed = 38 total (was 34) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 40s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 44s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 37s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 87m 31s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 33s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}181m 34s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | | Possible null pointer dere
[jira] [Commented] (YARN-10302) Support custom packing algorithm for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121484#comment-17121484 ] Zhankun Tang commented on YARN-10302: - [~billgraham], thanks for the contribution. Could you please generate a patch "git diff trunk...HEAD > YARN-10302-trunk.001.patch", upload it and click "submitPatch" to trigger the CI?
[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121495#comment-17121495 ] Hadoop QA commented on YARN-10300: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 24s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 39s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 29s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 48s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 46s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 13s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 47s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 91m 31s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}158m 48s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler | | | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26095/artifact/out/Dockerfile | | JIRA Issue | YARN-10300 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004526/YARN-10300.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux aa1ebd84c7c1 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 9fe4c37c25b | | Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/26095/a
[jira] [Created] (YARN-10303) One YARN REST API example in the YARN documentation is incorrect
bright.zhou created YARN-10303: -- Summary: One YARN REST API example in the YARN documentation is incorrect Key: YARN-10303 URL: https://issues.apache.org/jira/browse/YARN-10303 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 3.2.1, 3.1.1 Reporter: bright.zhou Attachments: image-2020-06-02-10-27-35-020.png The deSelects value should be resourceRequests. !image-2020-06-02-10-27-35-020.png!
[jira] [Resolved] (YARN-9767) PartitionQueueMetrics Issues
[ https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manikandan R resolved YARN-9767. Resolution: Fixed > PartitionQueueMetrics Issues > > > Key: YARN-9767 > URL: https://issues.apache.org/jira/browse/YARN-9767 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Manikandan R >Assignee: Manikandan R >Priority: Major > Attachments: YARN-9767.001.patch > > > The intent of the Jira is to capture the issues/observations encountered as > part of YARN-6492 development separately for ease of tracking. > Observations: > Please refer to > https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027 > 1. Since partition info is extracted from both the request and the node, there is a > problem. For example: > > Node N has been mapped to Label X (non-exclusive). Queue A has been > configured with the ANY node label. App A requested resources from Queue A and > its containers ran on Node N for some reason. During the > AbstractCSQueue#allocateResource call, the node partition (via SchedulerNode) > is used for the calculation. Let's say the allocate call has been fired for 3 > containers of 1 GB each; then the outcome is: > a. PartitionDefault * queue A -> pending MB is 3 GB > b. PartitionX * queue A -> pending MB is -3 GB > > Because the app request was fired without any label > specification, metric #a is derived. After allocation is over, pending resources > usually get decreased; this uses the node partition info, hence metric #b is > derived. > > Given this kind of situation, we will need to put some thought into achieving > the metrics correctly. > > 2. Though the intent of this jira is to do Partition Queue Metrics, we would > like to retain the existing Queue Metrics for backward compatibility (as you > can see from the jira's discussion). > With this patch and the YARN-9596 patch, queue metrics (for queues) would be > overridden either with some specific partition values or default partition > values. It could be vice versa as well. For example, after the queues (say > queue A) have been initialised with some min and max cap and also with the node > label's min and max cap, QueueMetrics (availableMB) for queue A returns values > based on the node label's cap config. > I've been working on these observations to provide a fix and attached > .005.WIP.patch. The focus of .005.WIP.patch is to ensure availableMB and > availableVcores are correct (please refer to observation #2 above). Added more > asserts in {{testQueueMetricsWithLabelsOnDefaultLabelNode}} to ensure the fix for > #2 is working properly. > Also, one more thing to note: user metrics for availableMB and availableVcores > at the root queue were not there even before. Retained the same behaviour. User > metrics for availableMB and availableVcores are available only at the child queue > level and also with partitions. >
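Observation #1 above boils down to incrementing and decrementing the same counter under two different partition keys. A minimal sketch with toy names (not the real QueueMetrics API):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the mismatch: pending is incremented under the request's
// partition (default, "") but decremented under the node's partition ("X"),
// so the per-partition counters drift apart. Names are illustrative.
public class PartitionMetricsDemo {
    static final Map<String, Long> pendingMb = new HashMap<>();

    static void incrPending(String partition, long mb) {
        pendingMb.merge(partition, mb, Long::sum);
    }

    static void decrPending(String partition, long mb) {
        pendingMb.merge(partition, -mb, Long::sum);
    }

    public static void main(String[] args) {
        // App asks for 3 x 1 GB with no label -> counted under the default partition.
        incrPending("", 3072);
        // Containers land on node N (label X); the decrement uses the node partition.
        decrPending("X", 3072);
        // default stays at 3072, X goes negative:
        System.out.println("default=" + pendingMb.get("") + " X=" + pendingMb.get("X"));
    }
}
```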
[jira] [Commented] (YARN-9767) PartitionQueueMetrics Issues
[ https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123337#comment-17123337 ] Manikandan R commented on YARN-9767: The YARN-6492 patch covered these fixes too. Hence closing this.
[jira] [Commented] (YARN-9964) Queue metrics turn negative when relabeling a node with running containers to default partition
[ https://issues.apache.org/jira/browse/YARN-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123338#comment-17123338 ] Manikandan R commented on YARN-9964: [~jhung] The YARN-6492 patch covered these fixes too. Can we close this? > Queue metrics turn negative when relabeling a node with running containers to > default partition > > > Key: YARN-9964 > URL: https://issues.apache.org/jira/browse/YARN-9964 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Priority: Major > > YARN-6467 changed the queue metrics logic to only update certain metrics for the > default partition. But if an app runs containers on a labeled node, the node is > then moved to the default partition, and the container is released, the container's > resource won't have incremented the queue's allocated resource, but it will be > decremented.
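A toy model of the sequence described (illustrative names, not the real QueueMetrics API): the allocation update is skipped because the node is labeled at allocation time, but after the relabel the release is counted against the default partition, driving the counter negative.

```java
// Illustrative sketch only: after YARN-6467-style logic, metrics are updated
// only for the DEFAULT partition, keyed by the node's *current* label.
public class RelabelMetricsDemo {
    static long allocatedMbDefault = 0;

    static void onAllocate(String nodePartition, long mb) {
        if (nodePartition.isEmpty()) allocatedMbDefault += mb; // default partition only
    }

    static void onRelease(String nodePartition, long mb) {
        if (nodePartition.isEmpty()) allocatedMbDefault -= mb;
    }

    public static void main(String[] args) {
        onAllocate("X", 1024); // container starts on a labeled node: not counted
        // ... node is relabeled from "X" to the default partition ...
        onRelease("", 1024);   // release sees the new label: decremented
        System.out.println(allocatedMbDefault); // negative: -1024
    }
}
```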
[jira] [Assigned] (YARN-9964) Queue metrics turn negative when relabeling a node with running containers to default partition
[ https://issues.apache.org/jira/browse/YARN-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manikandan R reassigned YARN-9964: -- Assignee: Manikandan R
[jira] [Updated] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
[ https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Antal updated YARN-10284: -- Attachment: YARN-10284.004.patch > Add lazy initialization of LogAggregationFileControllerFactory in LogServlet > > > Key: YARN-10284 > URL: https://issues.apache.org/jira/browse/YARN-10284 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, yarn >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Attachments: YARN-10284.001.patch, YARN-10284.002.patch, > YARN-10284.003.patch, YARN-10284.004.patch > > > Suppose the {{mapred}} user has no access to the remote folder. Pinging the > JHS every few seconds to check if it's online will produce the following entry in > the log: > {noformat} > 2020-05-19 00:17:20,331 WARN > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController: > Unable to determine if the filesystem supports append operation > java.nio.file.AccessDeniedException: test-bucket: > org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role > for the group(s) associated with the authenticated user. (user: mapred) > at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204) > [...] > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.<init>(LogAggregationFileControllerFactory.java:139) > at > org.apache.hadoop.yarn.server.webapp.LogServlet.<init>(LogServlet.java:66) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.<init>(HsWebServices.java:99) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance() > at > com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40) > [...] > at > org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938) > at java.lang.Thread.run(Thread.java:748) > {noformat} > We should only create the {{LogAggregationFileControllerFactory}} instance when we actually > need it, not every time the {{LogServlet}} object is instantiated (so > definitely not in the constructor). In this way we prevent pressure on the > S3A auth side, especially if the authentication request is a costly operation.
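The fix direction described here is plain lazy initialization: keep the servlet constructor cheap and build the factory on first use. A hedged sketch with a {{Supplier}} standing in for the real {{LogAggregationFileControllerFactory}} (the field and class names below are illustrative, not the YARN code):

```java
import java.util.function.Supplier;

// Illustrative sketch of lazy initialization: the costly factory is built on
// the first log request, not in the constructor. Double-checked locking keeps
// it thread-safe and built at most once.
public class LazyLogServletDemo {
    static int buildCount = 0; // test hook: how many times the factory was built

    private final Supplier<Object> factoryBuilder;
    private volatile Object factory; // built on demand

    LazyLogServletDemo(Supplier<Object> factoryBuilder) {
        this.factoryBuilder = factoryBuilder; // constructor stays cheap: no I/O, no auth
    }

    Object getFactory() {
        Object f = factory;
        if (f == null) {
            synchronized (this) {
                if (factory == null) {
                    factory = factoryBuilder.get(); // first real use pays the cost
                }
                f = factory;
            }
        }
        return f;
    }

    public static void main(String[] args) {
        LazyLogServletDemo servlet =
            new LazyLogServletDemo(() -> { buildCount++; return new Object(); });
        System.out.println(buildCount); // 0: nothing built at construction time
        servlet.getFactory();
        servlet.getFactory();
        System.out.println(buildCount); // 1: built once, on first request
    }
}
```

With this shape, a health-check ping that never requests logs never touches the remote filesystem, which is exactly what the Jira asks for.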
[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
[ https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123361#comment-17123361 ] Adam Antal commented on YARN-10284: --- Fixed last checkstyle in v4.
[jira] [Created] (YARN-10304) Create an endpoint for remote application log directory path query
Andras Gyori created YARN-10304: --- Summary: Create an endpoint for remote application log directory path query Key: YARN-10304 URL: https://issues.apache.org/jira/browse/YARN-10304 Project: Hadoop YARN Issue Type: Improvement Reporter: Andras Gyori Assignee: Andras Gyori The logic that determines the aggregated log directory path (currently based on configuration) is scattered around the codebase and duplicated multiple times. A separate class that builds the path for a specific user would provide a single abstraction over this logic: it could replace the duplicated code, and we could additionally expose an endpoint to query the path.
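As a rough illustration of the proposed abstraction (not the actual patch — the class name is an assumption, and the path layout follows the usual {{remote-app-log-dir}}/{user}/{{suffix}}/{appId} convention for aggregated logs):

```java
// Hypothetical sketch: one class owns the remote aggregated-log path logic
// instead of each call site rebuilding it from configuration values.
public class RemoteAppLogDirResolver {
    private final String remoteRootLogDir; // e.g. yarn.nodemanager.remote-app-log-dir
    private final String suffix;           // e.g. yarn.nodemanager.remote-app-log-dir-suffix

    public RemoteAppLogDirResolver(String remoteRootLogDir, String suffix) {
        this.remoteRootLogDir = remoteRootLogDir;
        this.suffix = suffix;
    }

    /** Base directory holding all aggregated logs for one user. */
    public String getUserLogDir(String user) {
        return remoteRootLogDir + "/" + user + "/" + suffix;
    }

    /** Directory holding the aggregated logs of a single application. */
    public String getAppLogDir(String user, String appId) {
        return getUserLogDir(user) + "/" + appId;
    }
}
```

An endpoint answering the path query would then be a thin wrapper over `getAppLogDir`, and the previously duplicated call sites would delegate to the same class.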
[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
[ https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123401#comment-17123401 ] Hadoop QA commented on YARN-10284: --

+1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 18s | Docker mode activated. |
|| Prechecks ||
| +1 | dupname | 0m 0s | No case conflicting files found. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| trunk Compile Tests ||
| +1 | mvninstall | 21m 53s | trunk passed |
| +1 | compile | 0m 32s | trunk passed |
| +1 | checkstyle | 0m 22s | trunk passed |
| +1 | mvnsite | 0m 35s | trunk passed |
| +1 | shadedclient | 16m 41s | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 26s | trunk passed |
| 0 | spotbugs | 1m 11s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 1m 8s | trunk passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 0m 31s | the patch passed |
| +1 | compile | 0m 26s | the patch passed |
| +1 | javac | 0m 26s | the patch passed |
| +1 | checkstyle | 0m 14s | the patch passed |
| +1 | mvnsite | 0m 30s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 15m 15s | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 23s | the patch passed |
| +1 | findbugs | 1m 15s | the patch passed |
|| Other Tests ||
| +1 | unit | 2m 30s | hadoop-yarn-server-common in the patch passed. |
| +1 | asflicense | 0m 28s | The patch does not generate ASF License warnings. |
| | | 64m 53s | |

|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26096/artifact/out/Dockerfile |
| JIRA Issue | YARN-10284 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004558/YARN-10284.004.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 244f95bd5d3a 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 9fe4c37c25b |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/26096/testReport/ |
| Max. process+thread count | 314 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common U: hadoop-yarn-project/had
[jira] [Created] (YARN-10305) Lost system-credentials when restarting RM
kyungwan nam created YARN-10305: --- Summary: Lost system-credentials when restarting RM Key: YARN-10305 URL: https://issues.apache.org/jira/browse/YARN-10305 Project: Hadoop YARN Issue Type: Bug Reporter: kyungwan nam Assignee: kyungwan nam System-credentials, introduced in YARN-2704, allow long-running apps to keep working by distributing up-to-date credentials to the NMs. I've run into a situation where the system-credentials were lost after restarting the RM. From then on, whenever an app's AM stopped, restarting the AM failed because the NMs no longer had the HDFS delegation token needed for resource localization. The app has several delegation tokens, including a timeline-server token and an HDFS delegation token. When the RM restarts, it requests a new HDFS delegation token for apps that were submitted long ago (this was fixed by YARN-5098). But if an app has several delegation tokens and an exception occurs while processing the first one, the remaining tokens are never processed. I think that is why the system-credentials are lost. Here are the RM's logs at the time of the restart. {code} 2020-05-19 14:25:05,712 WARN security.DelegationTokenRenewer (DelegationTokenRenewer.java:handleDTRenewerAppRecoverEvent(955)) - Unable to add the application to the delegation token renewer on recovery. 
java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, Service: 10.1.1.1:8190, Ident: (TIMELINE_DELEGATION_TOKEN owner=test-admin, renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, sequenceNumber=2193, masterKeyId=340) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:503) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: HTTP status [403], message [org.apache.hadoop.security.token.SecretManager$InvalidToken: yarn tried to renew an expired token (TIMELINE_DELEGATION_TOKEN owner=test-admin, renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, sequenceNumber=2193, masterKeyId=340) max expiration date: 2020-04-16 10:26:03,258+0900 currentTime: 2020-05-19 14:25:05,700+0900] at org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:166) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:319) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:235) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:437) at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:247) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:227) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) at org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientRetryOpForOperateDelegationToken.run(TimelineConnector.java:431) at org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientConnectionRetry.retryOn(TimelineConnector.java:334) at org.apache.hadoop.yarn.client.api.impl.TimelineConnector.operateDelegationToken(TimelineConnector.java:218) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:250) at org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81) at org.apache.hadoop.security.token.Token.renew(Token.java:512) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:629) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:626) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422
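The failure mode described above — the first token's renewal exception aborting processing of the remaining tokens — can be sketched as follows. This is an illustrative model, not the actual DelegationTokenRenewer code: renewing each token inside its own try/catch lets a still-valid HDFS token survive an expired Timeline token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical model of per-token error isolation during recovery.
public class TokenRecovery {
    /** Attempts to renew each token independently; returns those that succeeded. */
    public static List<String> renewAll(List<String> tokens, Consumer<String> renewer) {
        List<String> renewed = new ArrayList<>();
        for (String token : tokens) {
            try {
                renewer.accept(token);   // may throw, e.g. for an expired token
                renewed.add(token);
            } catch (RuntimeException e) {
                // Log and continue: one expired token must not block the rest.
                System.err.println("Skipping unrenewable token " + token
                        + ": " + e.getMessage());
            }
        }
        return renewed;
    }
}
```

In contrast, the behavior reported in the issue corresponds to letting the first exception propagate out of the loop, so the HDFS token after the expired Timeline token is never re-registered.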
[jira] [Updated] (YARN-10304) Create an endpoint for remote application log directory path query
[ https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Gyori updated YARN-10304: Attachment: YARN-10304.001.patch > Create an endpoint for remote application log directory path query > -- > > Key: YARN-10304 > URL: https://issues.apache.org/jira/browse/YARN-10304 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10304.001.patch > > > The logic of the aggregated log directory path determination (currently based > on configuration) is scattered around the codebase and duplicated multiple > times. By providing a separate class for creating the path for a specific > user, it allows for an abstraction over this logic. This could be used in > place of the previously duplicated logic, moreover, we could provide an > endpoint to query this path.