[jira] [Reopened] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist
[ https://issues.apache.org/jira/browse/YARN-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bibin Chundatt reopened YARN-10307:
---

Reopening to set the correct resolution

> /leveldb-timeline-store.ldb/LOCK not exist
> --
>
> Key: YARN-10307
> URL: https://issues.apache.org/jira/browse/YARN-10307
> Project: Hadoop YARN
> Issue Type: Bug
> Environment: Ubuntu 19.10
> Hadoop 3.1.2
> Tez 0.9.2
> Hbase 2.2.4
> Reporter: appleyuchi
> Priority: Blocker
> Fix For: 3.1.2
>
>
> $HADOOP_HOME/sbin/yarn-daemon.sh start timelineserver
>
> in hadoop-appleyuchi-timelineserver-Desktop.out I get
>
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /home/appleyuchi/file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK: No such file or directory
> at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:246)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
> 2020-06-04 17:48:21,525 INFO [main] service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state INITED
> java.io.FileNotFoundException: Source 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' does not exist
> at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
> at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368)
> at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268)
> at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237)
> at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
> 2020-06-04 17:48:21,526 INFO [main] service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.FileNotFoundException: Source 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' does not exist
> at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
> Caused by: java.io.FileNotFoundException: Source 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' does not exist
> at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
> at
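
The path in the first exception above starts with /home/appleyuchi/ and then continues with a file: URI. That is the shape a path takes when a URI string is resolved as a relative local path against the working directory. The issue does not show the reporter's configuration, so the following is only a sketch of that assumption. The class and method names (TimelineStorePathCheck, resolveAsLocalPath, stripFileScheme) are hypothetical and are not part of Hadoop, and the sketch assumes a Unix-like file system where ':' is legal in a path.

```java
import java.net.URI;
import java.nio.file.Paths;

/** Hypothetical helper, not Hadoop code: shows how a "file:" URI string behaves as a local path. */
class TimelineStorePathCheck {

    /** Resolves a configured value against a working directory, as a relative local path would be. */
    static String resolveAsLocalPath(String workingDir, String configured) {
        // A string such as "file:/home/..." does not start with '/', so it is
        // treated as relative and appended to the working directory.
        return Paths.get(workingDir).resolve(configured).toString();
    }

    /** Converts a "file:" URI string to a plain path; other values are returned unchanged. */
    static String stripFileScheme(String configured) {
        if (configured.startsWith("file:")) {
            return Paths.get(URI.create(configured)).toString();
        }
        return configured;
    }

    public static void main(String[] args) {
        // Values taken from the stack trace; the working directory is an assumption.
        String cwd = "/home/appleyuchi";
        String configured = "file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline";

        // Expected to reproduce the doubled prefix seen in the IO error.
        System.out.println(resolveAsLocalPath(cwd, configured));
        // Expected to give the plain absolute path once the scheme is removed.
        System.out.println(resolveAsLocalPath(cwd, stripFileScheme(configured)));
    }
}
```

If this is indeed the cause, setting the directory property involved (possibly hadoop.tmp.dir or yarn.timeline-service.leveldb-timeline-store.path) to a plain local path without the file: prefix would avoid the doubled path, which would also be consistent with the issue being closed as Invalid rather than fixed in code.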
[jira] [Resolved] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist
[ https://issues.apache.org/jira/browse/YARN-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bibin Chundatt resolved YARN-10307.
---
Fix Version/s: (was: 3.1.2)
Resolution: Invalid

> /leveldb-timeline-store.ldb/LOCK not exist
> --
>
> Key: YARN-10307
> URL: https://issues.apache.org/jira/browse/YARN-10307
> Project: Hadoop YARN
> Issue Type: Bug
> Environment: Ubuntu 19.10
> Hadoop 3.1.2
> Tez 0.9.2
> Hbase 2.2.4
> Reporter: appleyuchi
> Priority: Blocker
[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127867#comment-17127867 ]

Tao Yang commented on YARN-10293:
-

Thanks [~prabhujoseph] for updating the patch.
Another concern about the UT: could you complete it without changing the access level of SchedulerNode#addUnallocatedResource? Calling SchedulerNode#addUnallocatedResource directly in the UT is hard to follow.
BTW, please fix the remaining checkstyle warning; the UT failures seem unrelated to this patch.

> Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.3.0
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
> Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, YARN-10293-003-WIP.patch, YARN-10293-004.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes in the CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues related to it:
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> I have found one more bug in the CapacityScheduler.java code which causes the same issue, with a slight difference in the repro.
> *Repro:*
> *Nodes : Available : Used*
> Node1 - 8GB, 8vcores - 8GB, 8vcores
> Node2 - 8GB, 8vcores - 8GB, 8vcores
> Node3 - 8GB, 8vcores - 8GB, 8vcores
> Queues -> A and B, both with 50% capacity and 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA is submitted to queue A and uses the full cluster: 24GB and 24 vcores
> 2. JobB is submitted to queue B with an AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest IP=172.27.160.139 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 CALLERCONTEXT=CLI QUEUENAME=dummy
> {code}
> 3. Preemption happens and the used capacity drops below 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: Non-AM container preempted, current appAttemptId=appattempt_1590046667304_0004_01, containerId=container_e09_1590046667304_0004_01_24, resource=
> {code}
> 4. JobB gets a Reserved Container as part of CapacityScheduler#allocateOrReserveNewContainers
> {code}
> 2020-05-21 12:12:48,226 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to RESERVED
> 2020-05-21 12:12:48,226 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Reserved container=container_e09_1590046667304_0005_01_01, on node=host: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 available= used= with resource=
> {code}
> *Why did RegularContainerAllocator reserve the container when the used capacity is <= 1.0f?*
> {code}
> The reason is that even though the container is preempted, the nodemanager still has to stop the container, heartbeat, and update the available and unallocated resources to the ResourceManager.
> {code}
> 5. Now no new allocation happens and the reserved container stays reserved.
> After the reservation the used capacity becomes 1.0f; the sequence below then runs in a loop and no new allocation or reservation happens. The reserved container cannot be allocated because the reserved node does not have space. node2 has space for 1GB, 1vcore, but CapacityScheduler#allocateOrReserveNewContainers is not called, which causes the hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container on node*
> {code}
> 2020-05-21 12:13:33,242 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Trying to fulfill reservation for application application_1590046667304_0005 on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Reserved container=container_e09_1590046667304_0005_01_01, on node=host: