[jira] [Reopened] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist

2020-06-07 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reopened YARN-10307:
---

Reopening to set the correct resolution

> /leveldb-timeline-store.ldb/LOCK not exist
> --
>
> Key: YARN-10307
> URL: https://issues.apache.org/jira/browse/YARN-10307
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: Ubuntu 19.10
> Hadoop 3.1.2
> Tez 0.9.2
> Hbase 2.2.4
>Reporter: appleyuchi
>Priority: Blocker
> Fix For: 3.1.2
>
>
> $HADOOP_HOME/sbin/yarn-daemon.sh start timelineserver
>  
> In hadoop-appleyuchi-timelineserver-Desktop.out I get:
>  
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> /home/appleyuchi/file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:
>  No such file or directory
>  at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>  at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>  at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:246)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,525 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state 
> INITED
>  java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,526 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
>  org.apache.hadoop.service.ServiceStateException: 
> java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  Caused by: java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
>  at 

[jira] [Resolved] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist

2020-06-07 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt resolved YARN-10307.
---
Fix Version/s: (was: 3.1.2)
   Resolution: Invalid

> /leveldb-timeline-store.ldb/LOCK not exist
> --
>
> Key: YARN-10307
> URL: https://issues.apache.org/jira/browse/YARN-10307
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: Ubuntu 19.10
> Hadoop 3.1.2
> Tez 0.9.2
> Hbase 2.2.4
>Reporter: appleyuchi
>Priority: Blocker
>
> $HADOOP_HOME/sbin/yarn-daemon.sh start timelineserver
>  
> In hadoop-appleyuchi-timelineserver-Desktop.out I get:
>  
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> /home/appleyuchi/file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:
>  No such file or directory
>  at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>  at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>  at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:246)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,525 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state 
> INITED
>  java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,526 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
>  org.apache.hadoop.service.ServiceStateException: 
> java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  Caused by: java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368)
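
Note on the path in the first exception: it starts with the reporter's home
directory and then continues with a complete file: URI. The thread does not
state the cause, but this is the shape a path takes when a string such as
"file:/home/..." is handed to an API that expects a plain local path. The
string does not begin with "/", so it is treated as relative and resolved
against the working directory. The sketch below only illustrates that
behaviour with the standard Java library on a Unix-like system. The class
name, the method names, and the CONFIGURED value are hypothetical, and
whether LeveldbTimelineStore resolves its configured path this way is an
assumption, not something confirmed here.

```java
import java.io.File;
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileUriVsPath {
    // Hypothetical configured value: a store directory written as a file: URI.
    // The directory itself is taken from the path shown in the quoted trace.
    static final String CONFIGURED =
        "file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline";

    /** Treats the string as a plain path, the way java.io.File does. */
    static String asPlainPath(String value) {
        // "file:/home/..." does not start with '/', so it counts as relative
        // and is resolved against the current working directory.
        return new File(value).getAbsolutePath();
    }

    /** Parses the string as a URI first, then converts it to a local path. */
    static String asUri(String value) {
        Path p = Paths.get(URI.create(value));
        return p.toString();
    }

    public static void main(String[] args) {
        // Prints <working dir>/file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline
        System.out.println(asPlainPath(CONFIGURED));
        // Prints /home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline
        System.out.println(asUri(CONFIGURED));
    }
}
```

If this reading applies, a value without the file: scheme would avoid the
doubled prefix.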

[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-07 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127867#comment-17127867
 ] 

Tao Yang commented on YARN-10293:
-

Thanks [~prabhujoseph] for updating the patch.
One more concern about the UT: could you complete it without changing the 
access modifier of SchedulerNode#addUnallocatedResource?  I think calling 
SchedulerNode#addUnallocatedResource directly in the UT is hard to understand.
BTW, please fix the remaining checkstyle warning; the UT failures seem 
unrelated to this patch.

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch, YARN-10293-004.patch
>
>
> Reserved containers are not allocated from the available space of other 
> nodes in the CandidateNodeSet in MultiNodePlacement. YARN-10259 fixed two 
> issues related to this: 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> I have found one more bug in the CapacityScheduler.java code that causes the 
> same issue, with a slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB, 8vcores
> Node2 -  8GB, 8vcores -  8GB, 8vcores
> Node3 -  8GB, 8vcores -  8GB, 8vcores
> Queues -> A and B, both with 50% capacity and 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA is submitted to queue A and uses the full cluster (24GB and 24 vcores)
> 2. JobB is submitted to queue B with an AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and the used capacity is less than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a reserved container as part of 
> CapacityScheduler#allocateOrReserveNewContainers
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why did RegularContainerAllocator reserve the container when the used 
> capacity is <= 1.0f?*
> {code}
> The reason is that even though the container has been preempted, the 
> NodeManager still has to stop the container, heartbeat, and report the 
> updated available and unallocated resources to the ResourceManager.
> {code}
> 5. Now no new allocation happens and the reserved container stays reserved.
> After the reservation the used capacity becomes 1.0f, so the call chain 
> below repeats in a loop and no new allocation or reservation happens. The 
> reserved container cannot be allocated because the reserved node does not 
> have space. node2 has space for 1GB, 1vcore, but 
> CapacityScheduler#allocateOrReserveNewContainers is not called, which 
> causes the hang.
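
The situation in step 5 can be pictured with a small toy model. This is a
sketch under stated assumptions, not CapacityScheduler code: the class, its
fields, and both methods are hypothetical stand-ins, only memory is tracked,
and the retry cap exists only so the example terminates. It shows a
reservation pinned to node3 being retried there without success while node2
has room for the 1GB request. The second method only demonstrates that space
exists on another candidate node. It is not the patch attached to this issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReservationHangSketch {
    // Scheduler-side view of free memory per node, in MB (toy model).
    final Map<String, Integer> freeMb = new LinkedHashMap<>();
    String reservedNode;      // node holding the reservation, or null
    final int requestMb;

    ReservationHangSketch(int requestMb) { this.requestMb = requestMb; }

    /** Pass that only retries the existing reservation on its own node. */
    boolean allocateFromReservedContainer() {
        int free = freeMb.get(reservedNode);
        if (free >= requestMb) {
            freeMb.put(reservedNode, free - requestMb);
            reservedNode = null;
            return true;      // reservation fulfilled
        }
        return false;         // nothing changes: re-reserve on the same node
    }

    /** Pass that looks at every candidate node for free space. */
    String allocateOnAnyCandidate() {
        for (Map.Entry<String, Integer> e : freeMb.entrySet()) {
            if (e.getValue() >= requestMb) {
                e.setValue(e.getValue() - requestMb);
                reservedNode = null;
                return e.getKey();
            }
        }
        return null;          // no candidate has room
    }

    public static void main(String[] args) {
        ReservationHangSketch s = new ReservationHangSketch(1024);
        s.freeMb.put("node1", 0);
        s.freeMb.put("node2", 1024);   // space that became free after preemption
        s.freeMb.put("node3", 0);
        s.reservedNode = "node3";      // reservation made while node3 was full

        int attempts = 0;
        while (attempts < 5 && !s.allocateFromReservedContainer()) {
            attempts++;                // capped here; the reported loop has no cap
        }
        // Prints: retries on reserved node without success: 5
        System.out.println("retries on reserved node without success: " + attempts);
        // Prints: allocated on: node2
        System.out.println("allocated on: " + s.allocateOnAnyCandidate());
    }
}
```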
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
>