[jira] [Updated] (YARN-4165) An outstanding container request causes all nodes to be reserved, leaving all jobs pending
[ https://issues.apache.org/jira/browse/YARN-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-4165: -- Component/s: capacity scheduler > An outstanding container request causes all nodes to be reserved, leaving all > jobs pending > > > Key: YARN-4165 > URL: https://issues.apache.org/jira/browse/YARN-4165 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Weiwei Yang >Assignee: Weiwei Yang > > We have a long-running service in YARN that has an outstanding container > request YARN cannot satisfy (it requires more memory than the nodemanager can > supply). YARN then reserves all nodes for this application; when I submit > other jobs (requiring relatively little memory, which the nodemanagers can > supply), all jobs stay pending because YARN skips scheduling containers on the nodes that > have been reserved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4011) Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk
[ https://issues.apache.org/jira/browse/YARN-4011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901946#comment-14901946 ] Maysam Yabandeh commented on YARN-4011: --- We face this problem quite often in our ad hoc cluster and are thinking of implementing some basic checks to make such misbehaving jobs fail fast. Until we have a proper solution in YARN, could we have a mapreduce-specific solution in place to protect the cluster from rogue mapreduce tasks? The mapreduce task could check the BYTES_WRITTEN counter and fail fast if it is above a configured limit. It is true that bytes written is larger than the actual disk space used, but to detect a rogue task the exact value is not required, and a very large number of bytes written to local disk is a good indication that the task is misbehaving. Thoughts? > Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk > > > Key: YARN-4011 > URL: https://issues.apache.org/jira/browse/YARN-4011 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.4.0 >Reporter: Ashwin Shankar > > We observed jobs failing because tasks couldn't launch on nodes due to > "java.io.IOException No space left on device". > On digging in further, we found a rogue job which had filled up the disk. > Specifically, it wrote a lot of map spills (like > attempt_1432082376223_461647_m_000421_0_spill_1.out) to nm-local-dir, > causing the disk to fill up, and it failed/got killed but didn't clean up these > files in nm-local-dir. > So the disk remained full, causing subsequent jobs to fail. > This jira is created to address why files under nm-local-dir don't get > cleaned up when a job fails after filling up the disk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
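To make the proposal above concrete, here is a minimal sketch of such a fail-fast check. It assumes a hypothetical property name (mapreduce.task.local-fs.write-limit.bytes is not an existing Hadoop key) and the legacy "FileSystemCounters" / "FILE_BYTES_WRITTEN" counter names; it is only the rough shape a task-side check could take, not an actual MapReduce patch.

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class LocalWriteLimitChecker {
  // Hypothetical property, not an existing Hadoop key; -1 disables the check.
  private static final String LIMIT_KEY = "mapreduce.task.local-fs.write-limit.bytes";
  private final long limitBytes;

  public LocalWriteLimitChecker(Configuration conf) {
    this.limitBytes = conf.getLong(LIMIT_KEY, -1L);
  }

  /** Fail fast if the task has already written more bytes to the local FS than allowed. */
  public void check(TaskAttemptContext context) throws IOException {
    if (limitBytes < 0) {
      return;
    }
    // Legacy counter group/name for bytes written to the local file system.
    Counter written = context.getCounter("FileSystemCounters", "FILE_BYTES_WRITTEN");
    if (written != null && written.getValue() > limitBytes) {
      throw new IOException("Bytes written to local disk (" + written.getValue()
          + ") exceeded the configured limit (" + limitBytes + "); failing fast");
    }
  }
}
{code}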
[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class
[ https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901932#comment-14901932 ] Hudson commented on YARN-4188: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2339 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2339/]) YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static (cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java > MoveApplicationAcrossQueuesResponse should be an abstract class > --- > > Key: YARN-4188 > URL: https://issues.apache.org/jira/browse/YARN-4188 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Giovanni Matteo Fumarola >Assignee: Giovanni Matteo Fumarola >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-4188.v0.patch > > > MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally > the new instance should have a static modifier. Currently we are not facing > any issues because the response is empty object on success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class
[ https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901930#comment-14901930 ] Hudson commented on YARN-4188: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #401 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/401/]) YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static (cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java > MoveApplicationAcrossQueuesResponse should be an abstract class > --- > > Key: YARN-4188 > URL: https://issues.apache.org/jira/browse/YARN-4188 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Giovanni Matteo Fumarola >Assignee: Giovanni Matteo Fumarola >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-4188.v0.patch > > > MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally > the new instance should have a static modifier. Currently we are not facing > any issues because the response is empty object on success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4199) Minimize lock time in LeveldbTimelineStore.discardOldEntities
[ https://issues.apache.org/jira/browse/YARN-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiwei Guo updated YARN-4199: - Description: In the current implementation, LeveldbTimelineStore.discardOldEntities holds a writeLock on deleteLock, which blocks other put operations and eventually blocks the execution of YARN jobs (e.g. TEZ). When there are lots of historical jobs in the timeline store, the blocking time can be very long; in our observation, it blocked all the TEZ jobs for several hours or longer. The possible solutions are: - Optimize the leveldb configuration so a full scan won't take a long time. - Take a snapshot of leveldb and scan the snapshot, so we only need to hold the lock during getSnapshot. One open question is whether taking a snapshot will itself take a long time, as I have no experience with leveldb. was: I current implementation, LeveldbTimelineStore.discardOldEntities holds a writeLock on deleteLock, which will block other put operation, which eventually block the execution of YARN jobs(e.g. TEZ). When there is lots of history jobs in timelinestore, the block time will be very long. In our observation, it block all the TEZ jobs for several hours or longer. The possible solutions are: - Optimize leveldb configuration, so a full scan won't take long time. - Take a snapshot of leveldb, and scan the snapshot, so we only need to hold lock while getSnapshot. One question is that whether snapshot will take long time or not, cause I have no experience with leveldb. > Minimize lock time in LeveldbTimelineStore.discardOldEntities > - > > Key: YARN-4199 > URL: https://issues.apache.org/jira/browse/YARN-4199 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver, yarn >Reporter: Shiwei Guo > > In the current implementation, LeveldbTimelineStore.discardOldEntities holds a > writeLock on deleteLock, which blocks other put operations and eventually > blocks the execution of YARN jobs (e.g. TEZ). When there are lots of > historical jobs in the timeline store, the blocking time can be very long; in our > observation, it blocked all the TEZ jobs for several hours or longer. > The possible solutions are: > - Optimize the leveldb configuration so a full scan won't take a long time. > - Take a snapshot of leveldb and scan the snapshot, so we only need to hold > the lock during getSnapshot. One open question is whether taking a snapshot will > itself take a long time, as I have no experience with leveldb. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4199) Minimize lock time in LeveldbTimelineStore.discardOldEntities
Shiwei Guo created YARN-4199: Summary: Minimize lock time in LeveldbTimelineStore.discardOldEntities Key: YARN-4199 URL: https://issues.apache.org/jira/browse/YARN-4199 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Reporter: Shiwei Guo In the current implementation, LeveldbTimelineStore.discardOldEntities holds a writeLock on deleteLock, which blocks other put operations and eventually blocks the execution of YARN jobs (e.g. TEZ). When there are lots of historical jobs in the timeline store, the blocking time can be very long; in our observation, it blocked all the TEZ jobs for several hours or longer. The possible solutions are: - Optimize the leveldb configuration so a full scan won't take a long time. - Take a snapshot of leveldb and scan the snapshot, so we only need to hold the lock during getSnapshot. One open question is whether taking a snapshot will itself take a long time, as I have no experience with leveldb. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
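The snapshot-based option can be sketched roughly as below, using the org.iq80.leveldb API the leveldb timeline store is built on. This only illustrates the locking idea (take the snapshot under the write lock, scan without it); the actual discardOldEntities bookkeeping is elided and the method name is made up.

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import org.iq80.leveldb.ReadOptions;
import org.iq80.leveldb.Snapshot;

public class SnapshotScanSketch {
  // Hold the write lock only long enough to take a consistent snapshot,
  // then run the (slow) full scan against the snapshot without blocking puts.
  static void discardOldEntitiesUsingSnapshot(DB db, ReentrantReadWriteLock deleteLock)
      throws Exception {
    Snapshot snapshot;
    deleteLock.writeLock().lock();
    try {
      snapshot = db.getSnapshot();
    } finally {
      deleteLock.writeLock().unlock();
    }
    try (DBIterator it = db.iterator(new ReadOptions().snapshot(snapshot))) {
      for (it.seekToFirst(); it.hasNext(); it.next()) {
        // Inspect it.peekNext() and collect/delete expired entities here;
        // the deletes themselves may still need finer-grained locking.
      }
    } finally {
      snapshot.close();
    }
  }
}
{code}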
[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901897#comment-14901897 ] Jian He commented on YARN-4000: --- bq. I think this shouldn't be a problem. Actually, I think this will be a problem in the regular case where an application is killed by the user right at RM restart. This is an existing problem, though. Do you think so? > RM crashes with NPE if leaf queue becomes parent queue during restart > - > > Key: YARN-4000 > URL: https://issues.apache.org/jira/browse/YARN-4000 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-4000.01.patch, YARN-4000.02.patch, > YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch > > > This is a similar situation to YARN-2308. If an application is active in > queue A and then the RM restarts with a changed capacity scheduler > configuration where queue A becomes a parent queue to other subqueues then > the RM will crash with a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901892#comment-14901892 ] Jian He commented on YARN-4000: --- bq. In recoverContainersOnNode, we check if application is present in the scheduler or not, which will not be there. Ah, right, I missed this part. Thanks for pointing it out. bq. we consider them as orphan containers and in the next HB from NM, report these containers as the ones to be cleaned up by NM. Is this the case? I think in the current code, the RM still ignores these orphan containers? > RM crashes with NPE if leaf queue becomes parent queue during restart > - > > Key: YARN-4000 > URL: https://issues.apache.org/jira/browse/YARN-4000 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-4000.01.patch, YARN-4000.02.patch, > YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch > > > This is a similar situation to YARN-2308. If an application is active in > queue A and then the RM restarts with a changed capacity scheduler > configuration where queue A becomes a parent queue to other subqueues then > the RM will crash with a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class
[ https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901880#comment-14901880 ] Hudson commented on YARN-4188: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2366 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2366/]) YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static (cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java > MoveApplicationAcrossQueuesResponse should be an abstract class > --- > > Key: YARN-4188 > URL: https://issues.apache.org/jira/browse/YARN-4188 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Giovanni Matteo Fumarola >Assignee: Giovanni Matteo Fumarola >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-4188.v0.patch > > > MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally > the new instance should have a static modifier. Currently we are not facing > any issues because the response is empty object on success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4165) An outstanding container request causes all nodes to be reserved, leaving all jobs pending
[ https://issues.apache.org/jira/browse/YARN-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901869#comment-14901869 ] Weiwei Yang commented on YARN-4165: --- Hi Jason, we are using the capacity scheduler, and the problem can be described as follows: we have 2 nodes. If there is an outstanding container request for APP1, both nodes are reserved for that application, and the RM log looks like 2015-09-21 20:39:07,990 INFO capacity.CapacityScheduler (CapacityScheduler.java:allocateContainersToNode(1240)) - Skipping scheduling since node :45454 is reserved by application appattempt_1442889801665_0001_01 2015-09-21 20:40:10,990 INFO capacity.CapacityScheduler (CapacityScheduler.java:allocateContainersToNode(1240)) - Skipping scheduling since node :45454 is reserved by application appattempt_1442889801665_0001_01 Then, when I submit a new job, APP2, its app master cannot be allocated because all nodes are reserved. > An outstanding container request causes all nodes to be reserved, leaving all > jobs pending > > > Key: YARN-4165 > URL: https://issues.apache.org/jira/browse/YARN-4165 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Weiwei Yang >Assignee: Weiwei Yang > > We have a long-running service in YARN that has an outstanding container > request YARN cannot satisfy (it requires more memory than the nodemanager can > supply). YARN then reserves all nodes for this application; when I submit > other jobs (requiring relatively little memory, which the nodemanagers can > supply), all jobs stay pending because YARN skips scheduling containers on the nodes that > have been reserved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class
[ https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901850#comment-14901850 ] Hudson commented on YARN-4188: -- FAILURE: Integrated in Hadoop-Yarn-trunk #1160 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1160/]) YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static (cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java * hadoop-yarn-project/CHANGES.txt > MoveApplicationAcrossQueuesResponse should be an abstract class > --- > > Key: YARN-4188 > URL: https://issues.apache.org/jira/browse/YARN-4188 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Giovanni Matteo Fumarola >Assignee: Giovanni Matteo Fumarola >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-4188.v0.patch > > > MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally > the new instance should have a static modifier. Currently we are not facing > any issues because the response is empty object on success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class
[ https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901828#comment-14901828 ] Hudson commented on YARN-4188: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #428 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/428/]) YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static (cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java * hadoop-yarn-project/CHANGES.txt > MoveApplicationAcrossQueuesResponse should be an abstract class > --- > > Key: YARN-4188 > URL: https://issues.apache.org/jira/browse/YARN-4188 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Giovanni Matteo Fumarola >Assignee: Giovanni Matteo Fumarola >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-4188.v0.patch > > > MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally > the new instance should have a static modifier. Currently we are not facing > any issues because the response is empty object on success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class
[ https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901749#comment-14901749 ] Hudson commented on YARN-4188: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #420 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/420/]) YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static (cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java * hadoop-yarn-project/CHANGES.txt > MoveApplicationAcrossQueuesResponse should be an abstract class > --- > > Key: YARN-4188 > URL: https://issues.apache.org/jira/browse/YARN-4188 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Giovanni Matteo Fumarola >Assignee: Giovanni Matteo Fumarola >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-4188.v0.patch > > > MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally > the new instance should have a static modifier. Currently we are not facing > any issues because the response is empty object on success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4189) Capacity Scheduler : Improve location preference waiting mechanism
[ https://issues.apache.org/jira/browse/YARN-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901736#comment-14901736 ] Xianyin Xin commented on YARN-4189: --- [~leftnoteasy], convincing analysis. It's fine if X << Y and X is close to the heartbeat interval, so should we limit X to keep users from setting it arbitrarily? > Capacity Scheduler : Improve location preference waiting mechanism > -- > > Key: YARN-4189 > URL: https://issues.apache.org/jira/browse/YARN-4189 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4189 design v1.pdf > > > There're some issues with current Capacity Scheduler implementation of delay > scheduling: > *1) Waiting time to allocate each container highly depends on cluster > availability* > Currently, app can only increase missed-opportunity when a node has available > resource AND it gets traversed by a scheduler. There’re lots of possibilities > that an app doesn’t get traversed by a scheduler, for example: > A cluster has 2 racks (rack1/2), each rack has 40 nodes. > Node-locality-delay=40. An application prefers rack1. > Node-heartbeat-interval=1s. > Assume there are 2 nodes available on rack1, delay to allocate one container > = 40 sec. > If there are 20 nodes available on rack1, delay of allocating one container = > 2 sec. > *2) It could violate scheduling policies (Fifo/Priority/Fair)* > Assume a cluster is highly utilized, an app (app1) has higher priority, it > wants locality. And there’s another app (app2) has lower priority, but it > doesn’t care about locality. When node heartbeats with available resource, > app1 decides to wait, so app2 gets the available slot. This should be > considered as a bug that we need to fix. > The same problem could happen when we use FIFO/Fair queue policies. > Another problem similar to this is related to preemption: when preemption > policy preempts some resources from queue-A for queue-B (queue-A is > over-satisfied and queue-B is under-satisfied). But queue-B is waiting for > the node-locality-delay so queue-A will get resources back. In next round, > preemption policy could preempt this resources again from queue-A. > This JIRA is target to solve these problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3964) Support NodeLabelsProvider at Resource Manager side
[ https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901718#comment-14901718 ] Dian Fu commented on YARN-3964: --- Thanks [~leftnoteasy] for your detailed review. Makes sense to me; I will update the patch to incorporate your comments ASAP. > Support NodeLabelsProvider at Resource Manager side > --- > > Key: YARN-3964 > URL: https://issues.apache.org/jira/browse/YARN-3964 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, > YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, > YARN-3964.006.patch, YARN-3964.1.patch > > > Currently, CLI/REST API is provided in Resource Manager to allow users to > specify labels for nodes. For labels which may change over time, users will > have to start a cron job to update the labels. This has the following > limitations: > - The cron job needs to be run in the YARN admin user. > - This makes it a little complicate to maintain as users will have to make > sure this service/daemon is alive. > Adding a Node Labels Provider in Resource Manager will provide user more > flexibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class
[ https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901706#comment-14901706 ] Hudson commented on YARN-4188: -- FAILURE: Integrated in Hadoop-trunk-Commit #8496 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8496/]) YARN-4188. Make MoveApplicationAcrossQueues abstract, newInstance static (cdouglas: rev 8e01b0d97ac3d74b049a801dfa1cc6e77d8f680a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/MoveApplicationAcrossQueuesResponse.java * hadoop-yarn-project/CHANGES.txt > MoveApplicationAcrossQueuesResponse should be an abstract class > --- > > Key: YARN-4188 > URL: https://issues.apache.org/jira/browse/YARN-4188 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Giovanni Matteo Fumarola >Assignee: Giovanni Matteo Fumarola >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-4188.v0.patch > > > MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally > the new instance should have a static modifier. Currently we are not facing > any issues because the response is empty object on success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901537#comment-14901537 ] Li Lu commented on YARN-4075: - Hi [~varun_saxena]! Thanks for the work and sorry for the delayed reply. I looked at your POC.2 patch and here are some comments: - getFlows (/flows/{clusterId}): Maybe we'd like to return the "default" cluster, or the cluster the reader runs on (or the reader farm it is associated with), if the given clusterId is empty? - In TestTimelineReaderWebServicesFlowRun#testGetFlowRun, why do we compare equality by calling toString and comparing the two strings? I think we need a "deep comparison" method for timeline metrics for this case, so maybe you'd like to add this method and use it in testGetFlowRun? - The following logic: {code} + callerUGI != null && (userId == null || userId.isEmpty()) ? + callerUGI.getUserName().trim() : parseStr(userId) {code} is common enough in TimelineReaderWebServices. Since the logic is not quite trivial, maybe we'd like to put it in a standalone private method? - I just noticed that we're returning Set rather than TimelineEntities in the timeline reader. This is not consistent with the timeline writer (which uses TimelineEntities). It doesn't hurt much to have one more level of indirection, so maybe we'd like to change the readers to return TimelineEntities? In this way the reader and the writer will have the same behavior on this. - Any special reason to refactor TestHBaseTimelineStorage? Since we're merging YARN-4074 soon, I have not checked whether this patch applies to the latest YARN-2928 branch. We need to verify that after you refresh your patch. > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
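One possible shape for that extraction, reusing exactly the expression quoted in the comment above; the method name parseUser is illustrative and parseStr is the existing helper referenced in the quoted snippet, so this is only a sketch rather than part of the posted patch:

{code}
// Illustrative helper for TimelineReaderWebServices; not part of the posted patch.
private String parseUser(UserGroupInformation callerUGI, String userId) {
  return (callerUGI != null && (userId == null || userId.isEmpty()))
      ? callerUGI.getUserName().trim()
      : parseStr(userId);
}
{code}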
[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
[ https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901501#comment-14901501 ] Hudson commented on YARN-4113: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2338 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2338/]) YARN-4113. RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev b00392dd9cbb6778f2f3e669e96cf7133590dfe7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/CHANGES.txt > RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER > -- > > Key: YARN-4113 > URL: https://issues.apache.org/jira/browse/YARN-4113 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4113.patch > > > Found one issue in RMProxy how to initialize RetryPolicy: In > RMProxy#createRetryPolicy. When rmConnectWaitMS is set to -1 (wait forever), > it uses RetryPolicies.RETRY_FOREVER which doesn't respect > {{yarn.resourcemanager.connect.retry-interval.ms}} setting. > RetryPolicies.RETRY_FOREVER uses 0 as the interval, when I run the test > without properly setup localhost name: > {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}}, it wrote > 14G DEBUG exception message to system before it dies. This will be very bad > if we do the same thing in a production cluster. > We should fix two places: > - Make RETRY_FOREVER can take retry-interval as constructor parameter. > - Respect retry-interval when we uses RETRY_FOREVER policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
[ https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901486#comment-14901486 ] Hudson commented on YARN-4113: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #400 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/400/]) YARN-4113. RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev b00392dd9cbb6778f2f3e669e96cf7133590dfe7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java > RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER > -- > > Key: YARN-4113 > URL: https://issues.apache.org/jira/browse/YARN-4113 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4113.patch > > > Found one issue in RMProxy how to initialize RetryPolicy: In > RMProxy#createRetryPolicy. When rmConnectWaitMS is set to -1 (wait forever), > it uses RetryPolicies.RETRY_FOREVER which doesn't respect > {{yarn.resourcemanager.connect.retry-interval.ms}} setting. > RetryPolicies.RETRY_FOREVER uses 0 as the interval, when I run the test > without properly setup localhost name: > {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}}, it wrote > 14G DEBUG exception message to system before it dies. This will be very bad > if we do the same thing in a production cluster. > We should fix two places: > - Make RETRY_FOREVER can take retry-interval as constructor parameter. > - Respect retry-interval when we uses RETRY_FOREVER policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4191) Expose ApplicationMaster RPC port in ResourceManager REST endpoint
[ https://issues.apache.org/jira/browse/YARN-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901415#comment-14901415 ] Richard Lee commented on YARN-4191: --- Is the RPC port necessarily HTTP, tho? Seems that is not something YARN can count on and proxy for. I still think that the Samza people should put their REST endpoint under the trackingUrl, like mapreduce does. That way, it would both not require any changes to YARN, and be using the ResourceManager proxy. > Expose ApplicationMaster RPC port in ResourceManager REST endpoint > -- > > Key: YARN-4191 > URL: https://issues.apache.org/jira/browse/YARN-4191 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Richard Lee >Priority: Minor > > Currently, the ResourceManager REST endpoint returns only the trackingUrl for > the ApplicationMaster. Some AMs, however, have their REST endpoints on the > RPC port. However, the RM does not expose the AM RPC port via REST for some > reason. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3964) Support NodeLabelsProvider at Resource Manager side
[ https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901422#comment-14901422 ] Wangda Tan commented on YARN-3964: -- [~dian.fu]. Some comments: 1) I suggest making this an explicit node label configuration type: {{yarn.node-labels.configuration-type}}. Currently it has "centralized/distributed"; I think you may add a "delegated-centralized" (or a better name). The other configurations in your patch look fine to me. 2) Some comments on the organization of Updater/Provider: - Updater is a subclass of AbstractService, but it doesn't need to be abstract; I'm not sure what the purpose of adding an AbstractNodeLabelsUpdater is. Provider will be initialized by Updater, and Updater will call the Provider's method periodically and notify RMNodeLabelsManager. - Provider is an interface; minor comments on your patch: ** Why is a Configuration needed in the getNodeLabels method? ** Returns Set instead of Set 3) There are some methods / comments that include "Fetcher"; could you replace them with "Provider"? 4) Instead of adding a new checkAndThrowIfNodeLabelsFetcherConfigured, I suggest reusing checkAndThrowIfDistributedNodeLabelConfEnabled: you can rename it to something like checkAndThrowIfNodeLabelCannotBeUpdatedManually, which will check {{yarn.node-labels.configuration-type}}; we only allow manually updating labels when type=centralized is configured. 5) You can add a method to get RMNodeLabelsUpdater from RMContext, and remove it from the ResourceTrackerService constructor. 6) Add a test of RMNodeLabelsUpdater? It seems it can only update labels-on-node once for every node. 7) I think we need to make sure labels are updated *synchronously* when a node is registering; this avoids a node's labels being initialized a while after it has registered. 8) If you agree with #7, I think the wait/notify implementation of Updater could be removed; you can use a synchronized lock instead. Code using wait/notify has bad readability and will likely introduce bugs. Thanks, Wangda > Support NodeLabelsProvider at Resource Manager side > --- > > Key: YARN-3964 > URL: https://issues.apache.org/jira/browse/YARN-3964 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, > YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, > YARN-3964.006.patch, YARN-3964.1.patch > > > Currently, CLI/REST API is provided in Resource Manager to allow users to > specify labels for nodes. For labels which may change over time, users will > have to start a cron job to update the labels. This has the following > limitations: > - The cron job needs to be run in the YARN admin user. > - This makes it a little complicate to maintain as users will have to make > sure this service/daemon is alive. > Adding a Node Labels Provider in Resource Manager will provide user more > flexibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
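A minimal sketch of the renamed check from point 4, assuming only the existing {{yarn.node-labels.configuration-type}} key with its "centralized" default; the method body is illustrative rather than the actual patch, and "delegated-centralized" is the value proposed above, not an existing one:

{code}
// Illustrative only: manual (CLI/REST) label updates are allowed solely when
// the node label configuration type is "centralized".
// Assumes org.apache.hadoop.conf.Configuration and java.io.IOException.
private static void checkAndThrowIfNodeLabelCannotBeUpdatedManually(Configuration conf)
    throws IOException {
  String type = conf.get("yarn.node-labels.configuration-type", "centralized");
  if (!"centralized".equals(type)) {
    throw new IOException("Node labels cannot be updated through the CLI/REST API"
        + " because yarn.node-labels.configuration-type is set to '" + type
        + "' (e.g. 'distributed' or the proposed 'delegated-centralized')"
        + " instead of 'centralized'");
  }
}
{code}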
[jira] [Commented] (YARN-3942) Timeline store to read events from HDFS
[ https://issues.apache.org/jira/browse/YARN-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901388#comment-14901388 ] Li Lu commented on YARN-3942: - BTW, the patch applies on top of the existing YARN-3942.001.patch. > Timeline store to read events from HDFS > --- > > Key: YARN-3942 > URL: https://issues.apache.org/jira/browse/YARN-3942 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-3942-leveldb.001.patch, YARN-3942.001.patch > > > This adds a new timeline store plugin that is intended as a stop-gap measure > to mitigate some of the issues we've seen with ATS v1 while waiting for ATS > v2. The intent of this plugin is to provide a workable solution for running > the Tez UI against the timeline server on large-scale clusters running many > thousands of jobs per day. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3942) Timeline store to read events from HDFS
[ https://issues.apache.org/jira/browse/YARN-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3942: Attachment: YARN-3942-leveldb.001.patch Thanks [~jlowe] for working on this! On top of the existing patch I built a new store that moves the in-memory hash map storage into a LevelDB database. The original in-memory timeline store is not supposed to be used in production environments. The price for the new LevelDB-backed storage is latency: it generally takes more time to fully load the entities into LevelDB. I had an offline discussion with [~xgong], and it seems we need to reduce the granularity of caching to improve latency. We may want to address this problem in a separate JIRA. > Timeline store to read events from HDFS > --- > > Key: YARN-3942 > URL: https://issues.apache.org/jira/browse/YARN-3942 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-3942-leveldb.001.patch, YARN-3942.001.patch > > > This adds a new timeline store plugin that is intended as a stop-gap measure > to mitigate some of the issues we've seen with ATS v1 while waiting for ATS > v2. The intent of this plugin is to provide a workable solution for running > the Tez UI against the timeline server on large-scale clusters running many > thousands of jobs per day. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
[ https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901376#comment-14901376 ] Hudson commented on YARN-4113: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2365 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2365/]) YARN-4113. RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev b00392dd9cbb6778f2f3e669e96cf7133590dfe7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/CHANGES.txt > RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER > -- > > Key: YARN-4113 > URL: https://issues.apache.org/jira/browse/YARN-4113 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4113.patch > > > Found one issue in RMProxy how to initialize RetryPolicy: In > RMProxy#createRetryPolicy. When rmConnectWaitMS is set to -1 (wait forever), > it uses RetryPolicies.RETRY_FOREVER which doesn't respect > {{yarn.resourcemanager.connect.retry-interval.ms}} setting. > RetryPolicies.RETRY_FOREVER uses 0 as the interval, when I run the test > without properly setup localhost name: > {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}}, it wrote > 14G DEBUG exception message to system before it dies. This will be very bad > if we do the same thing in a production cluster. > We should fix two places: > - Make RETRY_FOREVER can take retry-interval as constructor parameter. > - Respect retry-interval when we uses RETRY_FOREVER policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901299#comment-14901299 ] Vrushali C commented on YARN-4074: -- Hi [~gtCarrera9] To confirm my understanding, did you mean putting all reader classes into a package like org.apache.hadoop.yarn.server.timelineservice.storage.reader ? There is a org.apache.hadoop.yarn.server.timelineservice.reader but that is for the web services related code. thanks Vrushali > [timeline reader] implement support for querying for flows and flow runs > > > Key: YARN-4074 > URL: https://issues.apache.org/jira/browse/YARN-4074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4074-YARN-2928.007.patch, > YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, > YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, > YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, > YARN-4074-YARN-2928.POC.006.patch > > > Implement support for querying for flows and flow runs. > We should be able to query for the most recent N flows, etc. > This includes changes to the {{TimelineReader}} API if necessary, as well as > implementation of the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901272#comment-14901272 ] Varun Saxena commented on YARN-4000: [~jianhe], I think this shouldn't be a problem. In recoverContainersOnNode, we check whether the application is present in the scheduler or not, and in this case it will not be there. If so, we consider the containers orphans, and in the next HB from the NM we report them as containers to be cleaned up by the NM. The NM then cleans them up (kills them) if they are running. Correct me if I am wrong. > RM crashes with NPE if leaf queue becomes parent queue during restart > - > > Key: YARN-4000 > URL: https://issues.apache.org/jira/browse/YARN-4000 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-4000.01.patch, YARN-4000.02.patch, > YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch > > > This is a similar situation to YARN-2308. If an application is active in > queue A and then the RM restarts with a changed capacity scheduler > configuration where queue A becomes a parent queue to other subqueues then > the RM will crash with a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
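For readers following the thread, the flow described in this comment would look roughly like the sketch below. It is only an illustration of the described behavior, not the actual AbstractYarnScheduler code (whether the current code really cleans up these orphans is exactly what Jian He questions in the reply above), and killOrphanContainerOnNode is used here as an assumed helper name:

{code}
// Illustrative sketch of the recovery flow described in the comment above.
private void recoverContainersOnNode(List<NMContainerStatus> containerReports, RMNode nm) {
  for (NMContainerStatus container : containerReports) {
    ApplicationAttemptId attemptId = container.getContainerId().getApplicationAttemptId();
    if (rmContext.getRMApps().get(attemptId.getApplicationId()) == null
        || getApplicationAttempt(attemptId) == null) {
      // Application unknown to the scheduler: treat the container as an orphan
      // and ask the NM to clean it up (kill it) on its next heartbeat.
      killOrphanContainerOnNode(nm, container);
      continue;
    }
    // ... otherwise recover the container into the scheduler's bookkeeping ...
  }
}
{code}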
[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
[ https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901263#comment-14901263 ] Hudson commented on YARN-4113: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #419 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/419/]) YARN-4113. RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev b00392dd9cbb6778f2f3e669e96cf7133590dfe7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java > RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER > -- > > Key: YARN-4113 > URL: https://issues.apache.org/jira/browse/YARN-4113 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4113.patch > > > Found one issue in RMProxy how to initialize RetryPolicy: In > RMProxy#createRetryPolicy. When rmConnectWaitMS is set to -1 (wait forever), > it uses RetryPolicies.RETRY_FOREVER which doesn't respect > {{yarn.resourcemanager.connect.retry-interval.ms}} setting. > RetryPolicies.RETRY_FOREVER uses 0 as the interval, when I run the test > without properly setup localhost name: > {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}}, it wrote > 14G DEBUG exception message to system before it dies. This will be very bad > if we do the same thing in a production cluster. > We should fix two places: > - Make RETRY_FOREVER can take retry-interval as constructor parameter. > - Respect retry-interval when we uses RETRY_FOREVER policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
[ https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901254#comment-14901254 ] Hudson commented on YARN-4113: -- FAILURE: Integrated in Hadoop-Yarn-trunk #1159 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1159/]) YARN-4113. RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev b00392dd9cbb6778f2f3e669e96cf7133590dfe7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java * hadoop-yarn-project/CHANGES.txt > RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER > -- > > Key: YARN-4113 > URL: https://issues.apache.org/jira/browse/YARN-4113 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4113.patch > > > Found one issue in RMProxy how to initialize RetryPolicy: In > RMProxy#createRetryPolicy. When rmConnectWaitMS is set to -1 (wait forever), > it uses RetryPolicies.RETRY_FOREVER which doesn't respect > {{yarn.resourcemanager.connect.retry-interval.ms}} setting. > RetryPolicies.RETRY_FOREVER uses 0 as the interval, when I run the test > without properly setup localhost name: > {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}}, it wrote > 14G DEBUG exception message to system before it dies. This will be very bad > if we do the same thing in a production cluster. > We should fix two places: > - Make RETRY_FOREVER can take retry-interval as constructor parameter. > - Respect retry-interval when we uses RETRY_FOREVER policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
[ https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901190#comment-14901190 ] Hudson commented on YARN-4113: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #427 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/427/]) YARN-4113. RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev b00392dd9cbb6778f2f3e669e96cf7133590dfe7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java > RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER > -- > > Key: YARN-4113 > URL: https://issues.apache.org/jira/browse/YARN-4113 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4113.patch > > > Found one issue in RMProxy how to initialize RetryPolicy: In > RMProxy#createRetryPolicy. When rmConnectWaitMS is set to -1 (wait forever), > it uses RetryPolicies.RETRY_FOREVER which doesn't respect > {{yarn.resourcemanager.connect.retry-interval.ms}} setting. > RetryPolicies.RETRY_FOREVER uses 0 as the interval, when I run the test > without properly setup localhost name: > {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}}, it wrote > 14G DEBUG exception message to system before it dies. This will be very bad > if we do the same thing in a production cluster. > We should fix two places: > - Make RETRY_FOREVER can take retry-interval as constructor parameter. > - Respect retry-interval when we uses RETRY_FOREVER policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4009) CORS support for ResourceManager REST API
[ https://issues.apache.org/jira/browse/YARN-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901169#comment-14901169 ] Hitesh Shah commented on YARN-4009: --- Thinking more on this, a global config might be something that is okay to start with (we already have a huge proliferation of configs which users do not set). If concerns are raised down the line, it should be easy enough to add yarn- and hdfs-specific configs which would override the global one in a compatible manner. [~jeagles], comments? > CORS support for ResourceManager REST API > - > > Key: YARN-4009 > URL: https://issues.apache.org/jira/browse/YARN-4009 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Prakash Ramachandran >Assignee: Varun Vasudev > Attachments: YARN-4009.001.patch, YARN-4009.002.patch, > YARN-4009.003.patch, YARN-4009.004.patch > > > Currently the REST API's do not have CORS support. This means any UI (running > in browser) cannot consume the REST API's. For ex Tez UI would like to use > the REST API for getting application, application attempt information exposed > by the API's. > It would be very useful if CORS is enabled for the REST API's. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
[ https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901149#comment-14901149 ] Hudson commented on YARN-4113: -- FAILURE: Integrated in Hadoop-trunk-Commit #8495 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8495/]) YARN-4113. RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER. (Sunil G via wangda) (wangda: rev b00392dd9cbb6778f2f3e669e96cf7133590dfe7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/CHANGES.txt > RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER > -- > > Key: YARN-4113 > URL: https://issues.apache.org/jira/browse/YARN-4113 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4113.patch > > > Found one issue in RMProxy how to initialize RetryPolicy: In > RMProxy#createRetryPolicy. When rmConnectWaitMS is set to -1 (wait forever), > it uses RetryPolicies.RETRY_FOREVER which doesn't respect > {{yarn.resourcemanager.connect.retry-interval.ms}} setting. > RetryPolicies.RETRY_FOREVER uses 0 as the interval, when I run the test > without properly setup localhost name: > {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}}, it wrote > 14G DEBUG exception message to system before it dies. This will be very bad > if we do the same thing in a production cluster. > We should fix two places: > - Make RETRY_FOREVER can take retry-interval as constructor parameter. > - Respect retry-interval when we uses RETRY_FOREVER policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4191) Expose ApplicationMaster RPC port in ResourceManager REST endpoint
[ https://issues.apache.org/jira/browse/YARN-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901135#comment-14901135 ] Steve Loughran commented on YARN-4191: -- If there's no RPC port in REST status reports, then yes, it's a bug. Samza's use of a REST API without going via the proxy is a security risk, but there's nothing YARN can do to stop it. > Expose ApplicationMaster RPC port in ResourceManager REST endpoint > -- > > Key: YARN-4191 > URL: https://issues.apache.org/jira/browse/YARN-4191 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Richard Lee >Priority: Minor > > Currently, the ResourceManager REST endpoint returns only the trackingUrl for > the ApplicationMaster. Some AMs, however, have their REST endpoints on the > RPC port. However, the RM does not expose the AM RPC port via REST for some > reason. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class
[ https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901133#comment-14901133 ] Hadoop QA commented on YARN-4188: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 19m 13s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 54s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 2s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 31s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 13s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 49s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 46s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 26s | Tests passed in hadoop-yarn-api. | | | | 44m 37s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761462/YARN-4188.v0.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c9cb6a5 | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9231/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9231/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9231/console | This message was automatically generated. > MoveApplicationAcrossQueuesResponse should be an abstract class > --- > > Key: YARN-4188 > URL: https://issues.apache.org/jira/browse/YARN-4188 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Giovanni Matteo Fumarola >Assignee: Giovanni Matteo Fumarola >Priority: Minor > Attachments: YARN-4188.v0.patch > > > MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally > the new instance should have a static modifier. Currently we are not facing > any issues because the response is empty object on success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.
[ https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901099#comment-14901099 ] zhihai xu commented on YARN-4095: - The first patch put {{NM_GOOD_LOCAL_DIRS}} and {{NM_GOOD_LOG_DIRS}} in YarnConfiguration.java, the second patch moved them to LocalDirsHandlerService.java, since they are only used inside {{LocalDirsHandlerService}}. > Avoid sharing AllocatorPerContext object in LocalDirAllocator between > ShuffleHandler and LocalDirsHandlerService. > - > > Key: YARN-4095 > URL: https://issues.apache.org/jira/browse/YARN-4095 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4095.000.patch, YARN-4095.001.patch > > > Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share > {{AllocatorPerContext}} object in {{LocalDirAllocator}} for configuration > {{NM_LOCAL_DIRS}} because {{AllocatorPerContext}} are stored in a static > TreeMap with configuration name as key > {code} > private static Map contexts = > new TreeMap(); > {code} > {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a > {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even they don't use the same > {{Configuration}} object, but they will use the same {{AllocatorPerContext}} > object. Also {{LocalDirsHandlerService}} may change {{NM_LOCAL_DIRS}} value > in its {{Configuration}} object to exclude full and bad local dirs, > {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its > {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} > is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, > {{AllocatorPerContext}} need be reinitialized because {{NM_LOCAL_DIRS}} value > is changed. This will cause some overhead. > {code} > String newLocalDirs = conf.get(contextCfgItemName); > if (!newLocalDirs.equals(savedLocalDirs)) { > {code} > So it will be a good improvement to not share the same > {{AllocatorPerContext}} instance between {{ShuffleHandler}} and > {{LocalDirsHandlerService}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
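To make the sharing described in this issue concrete: because AllocatorPerContext instances live in a static map keyed only by the config item name, two allocators built from different Configuration objects (as LocalDirsHandlerService and ShuffleHandler do) end up on the same context, and alternating calls with diverging NM_LOCAL_DIRS values force repeated reinitialization. The fragment below is a minimal illustration; the directory value and file path are placeholders:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.LocalDirAllocator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SharedAllocatorContextDemo {
  public static void main(String[] args) throws Exception {
    // Two allocators, as created by LocalDirsHandlerService and ShuffleHandler.
    LocalDirAllocator dirsHandlerAllocator =
        new LocalDirAllocator(YarnConfiguration.NM_LOCAL_DIRS);
    LocalDirAllocator shuffleAllocator =
        new LocalDirAllocator(YarnConfiguration.NM_LOCAL_DIRS);

    Configuration nmConf = new YarnConfiguration();      // dirs handler may rewrite NM_LOCAL_DIRS
    Configuration shuffleConf = new YarnConfiguration(); // ShuffleHandler keeps the original value
    nmConf.set(YarnConfiguration.NM_LOCAL_DIRS, "/tmp/nm-local-dir-good-only");

    // Both calls go through the same static AllocatorPerContext; because the
    // NM_LOCAL_DIRS values differ, each alternating call sees a "changed"
    // configuration and reinitializes the shared context.
    Path p1 = dirsHandlerAllocator.getLocalPathForWrite("placeholder", nmConf);
    Path p2 = shuffleAllocator.getLocalPathForWrite("placeholder", shuffleConf);
    System.out.println(p1 + " " + p2);
  }
}
{code}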
[jira] [Commented] (YARN-4059) Preemption should delay assignments back to the preempted queue
[ https://issues.apache.org/jira/browse/YARN-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901078#comment-14901078 ] Wangda Tan commented on YARN-4059: -- Finished the design doc for improving the delay scheduling mechanism and uploaded it to YARN-4189. > Preemption should delay assignments back to the preempted queue > --- > > Key: YARN-4059 > URL: https://issues.apache.org/jira/browse/YARN-4059 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch > > > When preempting containers from a queue it can take a while for the other > queues to fully consume the resources that were freed up, due to delays > waiting for better locality, etc. Those delays can cause the resources to be > assigned back to the preempted queue, and then the preemption cycle continues. > We should consider adding a delay, either based on node heartbeat counts or > time, to avoid granting containers to a queue that was recently preempted. > The delay should be sufficient to cover the cycles of the preemption monitor, > so we won't try to assign containers in-between preemption events for a queue. > The worst-case scenario for assigning freed resources to other queues is when all > the other queues want no locality. No locality means only one container is > assigned per heartbeat, so we need to wait for the time it takes the entire > cluster to heartbeat in, multiplied by the number of containers that could run > on a single node. > So the "penalty time" for a queue should be the max of either the preemption > monitor cycle time or the amount of time it takes to allocate the cluster > with one container per heartbeat. Guessing this will be somewhere around 2 > minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
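As a rough illustration of the penalty-time estimate described above, here is a back-of-the-envelope sketch. All numbers are illustrative assumptions, not values taken from any patch on this JIRA.
{code}
// Back-of-the-envelope sketch of the "penalty time" idea above.
public class PreemptionPenaltySketch {
  public static void main(String[] args) {
    long heartbeatIntervalMs = 1000L;           // node heartbeat interval (assumed)
    int containersPerNode = 30;                 // containers a single node can run (assumed)
    long preemptionMonitorIntervalMs = 15_000L; // preemption monitor cycle (assumed)

    // Worst case in the description: with no locality preference, at most one
    // container is assigned per node heartbeat, so filling one node's worth of
    // freed capacity takes roughly containersPerNode heartbeat intervals.
    long worstCaseAssignMs = containersPerNode * heartbeatIntervalMs;

    long penaltyMs = Math.max(preemptionMonitorIntervalMs, worstCaseAssignMs);
    System.out.println("Suggested penalty for the preempted queue: "
        + penaltyMs + " ms");
  }
}
{code}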
[jira] [Commented] (YARN-4189) Capacity Scheduler : Improve location preference waiting mechanism
[ https://issues.apache.org/jira/browse/YARN-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901075#comment-14901075 ] Wangda Tan commented on YARN-4189: -- [~xinxianyin], thanks for looking at the doc. However, I think the approach in the doc shouldn't reduce utilization: assume we limit the maximum waiting time for each container to X sec, and the average container execution time is Y sec. It will be fine if X << Y. In my mind, X is a value close to the node heartbeat interval and Y is from minutes to hours. I don't have any data to prove this is true; we need to do some benchmark tests before using it in practice. > Capacity Scheduler : Improve location preference waiting mechanism > -- > > Key: YARN-4189 > URL: https://issues.apache.org/jira/browse/YARN-4189 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4189 design v1.pdf > > > There are some issues with the current Capacity Scheduler implementation of delay > scheduling: > *1) Waiting time to allocate each container highly depends on cluster > availability* > Currently, an app can only increase its missed-opportunity count when a node has > available resources AND the app gets traversed by the scheduler. There are lots of > cases where an app doesn't get traversed by the scheduler, for example: > A cluster has 2 racks (rack1/2), each rack has 40 nodes. > Node-locality-delay=40. An application prefers rack1. > Node-heartbeat-interval=1s. > Assume there are 2 nodes available on rack1, delay to allocate one container > = 40 sec. > If there are 20 nodes available on rack1, delay of allocating one container = > 2 sec. > *2) It could violate scheduling policies (Fifo/Priority/Fair)* > Assume a cluster is highly utilized; an app (app1) has higher priority and > wants locality, and another app (app2) has lower priority but doesn't care > about locality. When a node heartbeats with available resources, > app1 decides to wait, so app2 gets the available slot. This should be > considered a bug that we need to fix. > The same problem could happen when we use FIFO/Fair queue policies. > Another, similar problem is related to preemption: the preemption > policy preempts some resources from queue-A for queue-B (queue-A is > over-satisfied and queue-B is under-satisfied). But queue-B is waiting out > the node-locality-delay, so queue-A gets the resources back. In the next round, > the preemption policy could preempt these resources from queue-A again. > This JIRA targets these problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901067#comment-14901067 ] Li Lu commented on YARN-4074: - Hi [~sjlee0] [~vrushalic], thanks for the work and sorry I could not get back earlier. Overall the patch LGTM. I like the refactor here and it's almost a must to put it in soon. One nit is, on naming and code organization, we're putting all derived readers in the storage package, but inevitably associating them with our (specific) HBase storage. If it's quick and easy, maybe we can put them in a package inside storage? If I'm missing anything here and it's hard, let proceed with this patch. Your call. > [timeline reader] implement support for querying for flows and flow runs > > > Key: YARN-4074 > URL: https://issues.apache.org/jira/browse/YARN-4074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4074-YARN-2928.007.patch, > YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, > YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, > YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, > YARN-4074-YARN-2928.POC.006.patch > > > Implement support for querying for flows and flow runs. > We should be able to query for the most recent N flows, etc. > This includes changes to the {{TimelineReader}} API if necessary, as well as > implementation of the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4178) [storage implementation] app id as string can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901057#comment-14901057 ] Varun Saxena commented on YARN-4178: [~jrottinghuis], No, I did not mean that we can use ApplicationId#toString to create a string which can be stored in rowkey, if that is what you meant. appid is already in that format. What I was suggesting was that on the write path, we can store only the cluster timestamp and sequence number(12 bytes - one long and one int) in the row key and skip storing the "application_" part. Storing as long and int or 2 longs would ensure correct ordering(although ascending). So, as you said above Long.MAX_VALUE - X should be used for ensuring descending order. ApplicationId#toString I was talking in context of read path. On the read path we can read these 12 bytes from row key and call ApplicationId#newInstance and ApplicationId#toString to change the timestamp and id to application_ prefix app id in string format, which can then be sent back to the client. And if prefix changes, ApplicationId will be changed as well(as it is used all over YARN). However your comment about storing application_ part in the end to make row key future proof makes sense. We can go with it. > [storage implementation] app id as string can cause incorrect ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > > Currently the app id is used in various places as part of row keys and in > column names. However, they are treated as strings for the most part. This > will cause a problem with ordering when the id portion of the app id rolls > over to the next digit. > For example, "app_1234567890_100" will be considered *earlier* than > "app_1234567890_99". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
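To make the 12-byte row-key idea above concrete, here is a hedged sketch of what the write and read paths could look like. The class and methods below are invented for illustration (only ApplicationId#newInstance and ApplicationId#toString are real YARN APIs); this is not the row-key layout that was actually committed.
{code}
import java.nio.ByteBuffer;
import org.apache.hadoop.yarn.api.records.ApplicationId;

// Sketch: store only the cluster timestamp and sequence number (one long +
// one int = 12 bytes), dropping the "application_" prefix on the write path,
// and rebuild the familiar string form on the read path.
public class AppIdRowKeySketch {

  // Write path: 12 bytes that order numerically (ascending) for non-negative values.
  static byte[] toRowKeyPart(ApplicationId appId) {
    return ByteBuffer.allocate(Long.BYTES + Integer.BYTES)
        .putLong(appId.getClusterTimestamp())
        .putInt(appId.getId())
        .array();
  }

  // Read path: rebuild application_<timestamp>_<seq> via ApplicationId#newInstance
  // and ApplicationId#toString, as suggested in the comment.
  static String fromRowKeyPart(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    return ApplicationId.newInstance(buf.getLong(), buf.getInt()).toString();
  }

  public static void main(String[] args) {
    ApplicationId id = ApplicationId.newInstance(1442063466801L, 99);
    byte[] key = toRowKeyPart(id);           // 12 bytes
    System.out.println(fromRowKeyPart(key)); // application_1442063466801_0099
  }
}
{code}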
[jira] [Commented] (YARN-3975) WebAppProxyServlet should not redirect to RM page if AHS is enabled
[ https://issues.apache.org/jira/browse/YARN-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901047#comment-14901047 ] Hadoop QA commented on YARN-3975: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 56s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 19s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 17s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 26s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 52s | The applied patch generated 2 new checkstyle issues (total was 16, now 18). | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 44s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 38s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 52s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 6m 59s | Tests failed in hadoop-yarn-client. | | {color:red}-1{color} | yarn tests | 0m 24s | Tests failed in hadoop-yarn-server-web-proxy. | | | | 49m 32s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.client.TestRMFailover | | | hadoop.yarn.server.webproxy.TestWebAppProxyServlet | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761457/YARN-3975.8.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c9cb6a5 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9230/artifact/patchprocess/diffcheckstylehadoop-yarn-server-web-proxy.txt | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/9230/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | hadoop-yarn-server-web-proxy test log | https://builds.apache.org/job/PreCommit-YARN-Build/9230/artifact/patchprocess/testrun_hadoop-yarn-server-web-proxy.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9230/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9230/console | This message was automatically generated. > WebAppProxyServlet should not redirect to RM page if AHS is enabled > --- > > Key: YARN-3975 > URL: https://issues.apache.org/jira/browse/YARN-3975 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-3975.2.b2.patch, YARN-3975.3.patch, > YARN-3975.4.patch, YARN-3975.5.patch, YARN-3975.6.patch, YARN-3975.7.patch, > YARN-3975.8.patch > > > WebAppProxyServlet should be updated to handle the case when the appreport > doesn't have a tracking URL and the Application History Server is eanbled. > As we would have already tried the RM and got the > ApplicationNotFoundException we should not direct the user to the RM app page. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4188) MoveApplicationAcrossQueuesResponse should be an abstract class
[ https://issues.apache.org/jira/browse/YARN-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giovanni Matteo Fumarola updated YARN-4188: --- Attachment: YARN-4188.v0.patch No test needed > MoveApplicationAcrossQueuesResponse should be an abstract class > --- > > Key: YARN-4188 > URL: https://issues.apache.org/jira/browse/YARN-4188 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Giovanni Matteo Fumarola >Assignee: Giovanni Matteo Fumarola >Priority: Minor > Attachments: YARN-4188.v0.patch > > > MoveApplicationAcrossQueuesResponse should be an abstract class. Additionally > the new instance should have a static modifier. Currently we are not facing > any issues because the response is empty object on success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
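For readers unfamiliar with the pattern being requested here, the sketch below shows the general shape of an abstract protocol-record response with a static newInstance() factory. The class body is illustrative only, not the committed YARN-4188 code; real YARN records typically return a protobuf-backed implementation from such a factory.
{code}
// Hedged sketch of an abstract response record with a static factory.
public abstract class MoveResponseSketch {

  // Static factory: callers never construct the record directly, so the
  // concrete (e.g. protobuf-backed) implementation can be swapped freely.
  public static MoveResponseSketch newInstance() {
    // The response carries no fields (it is empty on success), so an
    // anonymous subclass stands in for a generated record implementation.
    return new MoveResponseSketch() { };
  }
}
{code}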
[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901019#comment-14901019 ] Vrushali C commented on YARN-4074: -- Thanks everyone for the review, I will commit this patch in today. > [timeline reader] implement support for querying for flows and flow runs > > > Key: YARN-4074 > URL: https://issues.apache.org/jira/browse/YARN-4074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4074-YARN-2928.007.patch, > YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, > YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, > YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, > YARN-4074-YARN-2928.POC.006.patch > > > Implement support for querying for flows and flow runs. > We should be able to query for the most recent N flows, etc. > This includes changes to the {{TimelineReader}} API if necessary, as well as > implementation of the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3975) WebAppProxyServlet should not redirect to RM page if AHS is enabled
[ https://issues.apache.org/jira/browse/YARN-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-3975: Attachment: YARN-3975.8.patch > WebAppProxyServlet should not redirect to RM page if AHS is enabled > --- > > Key: YARN-3975 > URL: https://issues.apache.org/jira/browse/YARN-3975 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-3975.2.b2.patch, YARN-3975.3.patch, > YARN-3975.4.patch, YARN-3975.5.patch, YARN-3975.6.patch, YARN-3975.7.patch, > YARN-3975.8.patch > > > WebAppProxyServlet should be updated to handle the case where the app report > doesn't have a tracking URL and the Application History Server is enabled. > As we would have already tried the RM and got an > ApplicationNotFoundException, we should not direct the user to the RM app page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4178) [storage implementation] app id as string can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900927#comment-14900927 ] Joep Rottinghuis commented on YARN-4178: [~varun_saxena] if you mean o.a.h.yarn.api.records.ApplicationId then no, that will _not_ do. Its toString is defined as {code} return appIdStrPrefix + this.getClusterTimestamp() + "_" + appIdFormat.get().format(getId()); {code} The appIdFormat uses a minimum of 4 digits: fmt.setMinimumIntegerDigits(4); When the counter part wraps over to 10K or 100K or 1M (our clusters regularly run several million apps before the RM gets restarted) the sort order gets all wrong as per my comment in YARN-4074, which is why [~sangjin.park] For example, lexically application_1442351767756_1 < application_1442351767756_ We need the applications to be ordered correctly, even at those boundaries. In fact, I think we may have to store Long.MAX_VALUE - X for the timestamp and counter parts to that these will properly order in descending order for both the counter and the RM restart epoch part. The fact that all application IDs are hardcoded with application_ in yarn seems a bit silly to me. It makes much more sense to me that applications should be able to indicate an application type and that those would have a different prefix. That way one can quickly distinguish between mapreduce apps, Tez, Spark, Impala, Presto, what-have-you. This may not matter much on smaller clusters with less usage, but to make this an option for larger clusters with several tens of thousands of jobs per day this would be really really handy. Hence my suggestion to keep the application_ part at the end of the sort, to make the key-layout future proof (maybe wishful thinking in my part). > [storage implementation] app id as string can cause incorrect ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > > Currently the app id is used in various places as part of row keys and in > column names. However, they are treated as strings for the most part. This > will cause a problem with ordering when the id portion of the app id rolls > over to the next digit. > For example, "app_1234567890_100" will be considered *earlier* than > "app_1234567890_99". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
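The two points above can be illustrated with a small hedged sketch: the lexicographic rollover problem, and the Long.MAX_VALUE - X inversion that makes a byte-ordered scan return the newest entries first. The class and key layout are assumptions for illustration, not the row-key code that was eventually committed.
{code}
import java.nio.ByteBuffer;
import java.util.Arrays;

// (1) String comparison breaks once the sequence number gains a digit.
// (2) Storing Long.MAX_VALUE - value makes an ascending byte-wise scan
//     (HBase compares row keys as unsigned bytes) return newest-first.
public class DescendingKeySketch {

  static byte[] invertedKey(long clusterTimestamp, int seq) {
    return ByteBuffer.allocate(Long.BYTES + Long.BYTES)
        .putLong(Long.MAX_VALUE - clusterTimestamp)
        .putLong(Long.MAX_VALUE - seq)
        .array();
  }

  public static void main(String[] args) {
    // (1) Lexicographically, "_10000" sorts before "_9999" -- the rollover bug.
    System.out.println("application_1442351767756_10000"
        .compareTo("application_1442351767756_9999") < 0); // true, i.e. wrong order

    // (2) With inverted numeric keys the newer app (seq 10000) sorts first.
    byte[] older = invertedKey(1442351767756L, 9999);
    byte[] newer = invertedKey(1442351767756L, 10000);
    System.out.println(Arrays.compareUnsigned(newer, older) < 0); // true: newest first
  }
}
{code}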
[jira] [Commented] (YARN-3224) Notify AM with containers (on decommissioning node) could be preempted after timeout.
[ https://issues.apache.org/jira/browse/YARN-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900902#comment-14900902 ] Hadoop QA commented on YARN-3224: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 51s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 1s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 53s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 28s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 29s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 52s | The applied patch generated 3 new checkstyle issues (total was 188, now 191). | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 47s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 39s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 42s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 62m 0s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 104m 46s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761436/0002-YARN-3224.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c9cb6a5 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9229/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9229/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9229/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9229/console | This message was automatically generated. > Notify AM with containers (on decommissioning node) could be preempted after > timeout. > - > > Key: YARN-3224 > URL: https://issues.apache.org/jira/browse/YARN-3224 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Junping Du >Assignee: Sunil G > Attachments: 0001-YARN-3224.patch, 0002-YARN-3224.patch > > > We should leverage YARN preemption framework to notify AM that some > containers will be preempted after a timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4113) RM should respect retry-interval when it uses RetryPolicies.RETRY_FOREVER
[ https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900752#comment-14900752 ] Sunil G commented on YARN-4113: --- Hi [~leftnoteasy], I feel a test case is not needed as it's already covered in HADOOP-12386. Will this be fine? > RM should respect retry-interval when it uses RetryPolicies.RETRY_FOREVER > -- > > Key: YARN-4113 > URL: https://issues.apache.org/jira/browse/YARN-4113 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4113.patch > > > Found an issue in how RMProxy initializes its RetryPolicy, in > RMProxy#createRetryPolicy: when rmConnectWaitMS is set to -1 (wait forever), > it uses RetryPolicies.RETRY_FOREVER, which doesn't respect the > {{yarn.resourcemanager.connect.retry-interval.ms}} setting. > RetryPolicies.RETRY_FOREVER uses 0 as the interval; when I ran the test > {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without a > properly set up localhost name, it wrote 14G of DEBUG exception messages > before it died. This would be very bad if the same thing happened in a > production cluster. > We should fix two places: > - Make RETRY_FOREVER able to take the retry interval as a constructor parameter. > - Respect the retry interval when we use the RETRY_FOREVER policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
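The behaviour being requested can be sketched independently of the Hadoop retry classes. The snippet below is a standalone illustration (all names invented) of retrying forever while honouring a configured interval; it is not the RMProxy or RetryPolicies code.
{code}
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Retry forever, but sleep the configured retry interval between attempts
// instead of spinning with a zero delay.
public class RetryForeverWithIntervalSketch {

  // Example use: callWithRetry(() -> client.getClusterMetrics(), 30_000L);
  // where 30 000 ms stands in for yarn.resourcemanager.connect.retry-interval.ms.
  static <T> T callWithRetry(Callable<T> action, long retryIntervalMs)
      throws InterruptedException {
    while (true) {
      try {
        return action.call();
      } catch (Exception e) {
        // A zero interval here is what produced the 14G of DEBUG output the
        // JIRA describes; honouring the interval throttles the retries.
        TimeUnit.MILLISECONDS.sleep(retryIntervalMs);
      }
    }
  }
}
{code}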
[jira] [Commented] (YARN-4191) Expose ApplicationMaster RPC port in ResourceManager REST endpoint
[ https://issues.apache.org/jira/browse/YARN-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900749#comment-14900749 ] Richard Lee commented on YARN-4191: --- AFAICT, looking at the yarn source code, the RM doesn't actually serialize the ApplicationReport. It gets a lot of similar information about the ApplicationMaster and returns it on the REST /apps endpoint. One thing that seems to be missing is the RPC port, tho. In particular, I'm interested in working with the Samza Application Master. It has both a trackingUrl port and an RPC port. The REST stuff is on the RPC port at / (with, oddly no version path or anything, which seems not the best practice). Compare this to the Map Reduce ApplicationMaster, where the REST api is on the same port as the trackingUrl at /ws/v1/mapreduce. I was not aware of the other RM REST issues. However, at present, I've only been doing GET requests to retrieve information about the running cluster, and not yet trying to control it. > Expose ApplicationMaster RPC port in ResourceManager REST endpoint > -- > > Key: YARN-4191 > URL: https://issues.apache.org/jira/browse/YARN-4191 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Richard Lee >Priority: Minor > > Currently, the ResourceManager REST endpoint returns only the trackingUrl for > the ApplicationMaster. Some AMs, however, have their REST endpoints on the > RPC port. However, the RM does not expose the AM RPC port via REST for some > reason. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3224) Notify AM with containers (on decommissioning node) could be preempted after timeout.
[ https://issues.apache.org/jira/browse/YARN-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3224: -- Attachment: 0002-YARN-3224.patch Attaching an initial version of the patch. A short summary of this patch: 1. When a node is marked for DECOMMISSIONING, an event is fired to the Capacity Scheduler to preempt all containers running on that node. 2. In CS, a new event is added for this, and the existing *preemptContainer* API is invoked for every running container on the node. It's possible that a few AM containers are also marked for preemption. The current patch only gives a PREEMPT_CONTAINER notification to the AM. A timeout is not added for now, as YARN-3784 is yet to be concluded. [~djp], could you please take a look? > Notify AM with containers (on decommissioning node) could be preempted after > timeout. > - > > Key: YARN-3224 > URL: https://issues.apache.org/jira/browse/YARN-3224 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Junping Du >Assignee: Sunil G > Attachments: 0001-YARN-3224.patch, 0002-YARN-3224.patch > > > We should leverage the YARN preemption framework to notify the AM that some > containers will be preempted after a timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
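A rough, hedged sketch of the flow summarised in the patch notes follows; the interfaces and method names are placeholders rather than the actual scheduler classes touched by the patch.
{code}
import java.util.List;

// When a node moves to DECOMMISSIONING, walk its running containers and hand
// each one to the scheduler's existing preemption path so the AM receives a
// PREEMPT_CONTAINER style notification.
public class DecommissionPreemptSketch {

  interface RunningContainer { }

  interface SchedulerFacade {
    List<RunningContainer> getRunningContainers(String nodeId);
    void preemptContainer(RunningContainer container); // existing path, reused
  }

  static void onNodeDecommissioning(String nodeId, SchedulerFacade scheduler) {
    for (RunningContainer c : scheduler.getRunningContainers(nodeId)) {
      // AM containers may be included too; only a notification is sent for
      // now, since the timeout/forceful handling is deferred to YARN-3784.
      scheduler.preemptContainer(c);
    }
  }
}
{code}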
[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900705#comment-14900705 ] Bibin A Chundatt commented on YARN-4176: The checkstyle issue is due to the number of lines {noformat} File length is 2,146 lines (max allowed is 2,000). {noformat} I feel this can be skipped, as the file was already over 2K lines. > Resync NM nodelabels with RM every x interval for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch > > > This JIRA is for handling the below set of issues > # With distributed node labels, after the NM has registered with the RM, if cluster > node labels are removed and added, the NM doesn't resend labels in the heartbeat > again until there is a change in labels > # If NM registration with node labels fails, the NM should resend the labels to the RM > The above cases can be handled by resyncing node labels with the RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and resend node labels to the RM based on this config, regardless of whether > registration fails or succeeds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
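The resync behaviour described in the JIRA can be sketched as a simple periodic task. The snippet below uses the property name from the description, but the surrounding types are assumptions for illustration, not the NodeManager code.
{code}
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Resend the provider's labels to the RM every configured interval, whether
// or not they changed and whether or not the previous registration succeeded.
public class NodeLabelsResyncSketch {

  interface LabelsProvider { Set<String> getNodeLabels(); }
  interface RmClient { void sendNodeLabels(Set<String> labels); }

  // resyncIntervalMs would come from
  // yarn.nodemanager.node-labels.provider.resync-interval-ms.
  static ScheduledExecutorService startResync(LabelsProvider provider,
      RmClient rm, long resyncIntervalMs) {
    ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();
    timer.scheduleWithFixedDelay(
        () -> rm.sendNodeLabels(provider.getNodeLabels()),
        resyncIntervalMs, resyncIntervalMs, TimeUnit.MILLISECONDS);
    return timer;
  }
}
{code}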
[jira] [Commented] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType
[ https://issues.apache.org/jira/browse/YARN-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900613#comment-14900613 ] Sunil G commented on YARN-4143: --- Yes [~adhoot]. I also don't see any way other than this event handling, because RMAppAttempt needs to pass such information to the schedulers (common code), and either an event or an API is the only clean way to do that. I have no objection to the existing approach in the patch; I just thought of bringing up all possible options here and weighing which is best. > Optimize the check for AMContainer allocation needed by blacklisting and > ContainerType > -- > > Key: YARN-4143 > URL: https://issues.apache.org/jira/browse/YARN-4143 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-4143.001.patch > > > In YARN-2005 there are checks made to determine if the allocation is for an > AM container. This happens in every allocate call and should be optimized > away since it changes only once per SchedulerApplicationAttempt. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
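The optimisation under discussion amounts to caching a flag that changes at most once per attempt. The sketch below is illustrative (field and method names are invented), not the SchedulerApplicationAttempt change in the patch.
{code}
// Instead of re-deriving "is this allocation for the AM container?" on every
// allocate() call, remember the answer once it flips, since it changes at
// most once per attempt.
public class AmAllocationFlagSketch {

  private volatile boolean amContainerAllocated = false;

  // Cheap check on the hot allocate() path.
  boolean isWaitingForAmContainer() {
    return !amContainerAllocated;
  }

  // Called once, e.g. when the attempt's first (AM) container is allocated,
  // or driven by an event/API from RMAppAttempt as debated above.
  void markAmContainerAllocated() {
    amContainerAllocated = true;
  }
}
{code}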
[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition
[ https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900524#comment-14900524 ] Bibin A Chundatt commented on YARN-4140: They are related will check how to update testcases for that. > RM container allocation delayed incase of app submitted to Nodelabel partition > -- > > Key: YARN-4140 > URL: https://issues.apache.org/jira/browse/YARN-4140 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, > 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, > 0006-YARN-4140.patch, 0007-YARN-4140.patch > > > Trying to run application on Nodelabel partition I found that the > application execution time is delayed by 5 – 10 min for 500 containers . > Total 3 machines 2 machines were in same partition and app submitted to same. > After enabling debug was able to find the below > # From AM the container ask is for OFF-SWITCH > # RM allocating all containers to NODE_LOCAL as shown in logs below. > # So since I was having about 500 containers time taken was about – 6 minutes > to allocate 1st map after AM allocation. > # Tested with about 1K maps using PI job took 17 minutes to allocate next > container after AM allocation > Once 500 container allocation on NODE_LOCAL is done the next container > allocation is done on OFF_SWITCH > {code} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > /default-rack, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: *, Relax > Locality: true, Node Label Expression: 3} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-143, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-117, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > > {code} > 2015-09-09 14:35:45,467 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:45,831 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > 
usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,469 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,832 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > {code} > dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1> > cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep > "root.b.b1" | wc -l > 500 > {code} > > (Consumes about 6 minutes) > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4191) Expose ApplicationMaster RPC port in ResourceManager REST endpoint
[ https://issues.apache.org/jira/browse/YARN-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900519#comment-14900519 ] Steve Loughran commented on YARN-4191: -- Do you mean the REST API serializing the Application Report isn't including the RPC URL? Or that if an app chooses to register a REST endpoint as the port in the application report, the RM isn't redirecting to it? the RM has bigger issues with REST, namely it assumes there's a user and a browser at the far end (YARN-2084 ), not an application sending PUT requests and expecting machine-parseable status codes and error text. > Expose ApplicationMaster RPC port in ResourceManager REST endpoint > -- > > Key: YARN-4191 > URL: https://issues.apache.org/jira/browse/YARN-4191 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Richard Lee >Priority: Minor > > Currently, the ResourceManager REST endpoint returns only the trackingUrl for > the ApplicationMaster. Some AMs, however, have their REST endpoints on the > RPC port. However, the RM does not expose the AM RPC port via REST for some > reason. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900504#comment-14900504 ] Hadoop QA commented on YARN-4176: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 19m 9s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 51s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 8s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 50s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 31s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 23s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 58s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 7m 48s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 56m 40s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761395/0004-YARN-4176.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c9cb6a5 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9228/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9228/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9228/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9228/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9228/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9228/console | This message was automatically generated. 
> Resync NM nodelabels with RM every x interval for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch > > > This JIRA is for handling the below set of issues > # With distributed node labels, after the NM has registered with the RM, if cluster > node labels are removed and added, the NM doesn't resend labels in the heartbeat > again until there is a change in labels > # If NM registration with node labels fails, the NM should resend the labels to the RM > The above cases can be handled by resyncing node labels with the RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and resend node labels to the RM based on this config, regardless of whether > registration fails or succeeds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition
[ https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900498#comment-14900498 ] Bibin A Chundatt commented on YARN-4140: Will recheck {{TestNodeLabelContainerAllocation }} failures > RM container allocation delayed incase of app submitted to Nodelabel partition > -- > > Key: YARN-4140 > URL: https://issues.apache.org/jira/browse/YARN-4140 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, > 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, > 0006-YARN-4140.patch, 0007-YARN-4140.patch > > > Trying to run application on Nodelabel partition I found that the > application execution time is delayed by 5 – 10 min for 500 containers . > Total 3 machines 2 machines were in same partition and app submitted to same. > After enabling debug was able to find the below > # From AM the container ask is for OFF-SWITCH > # RM allocating all containers to NODE_LOCAL as shown in logs below. > # So since I was having about 500 containers time taken was about – 6 minutes > to allocate 1st map after AM allocation. > # Tested with about 1K maps using PI job took 17 minutes to allocate next > container after AM allocation > Once 500 container allocation on NODE_LOCAL is done the next container > allocation is done on OFF_SWITCH > {code} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > /default-rack, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: *, Relax > Locality: true, Node Label Expression: 3} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-143, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-117, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > > {code} > 2015-09-09 14:35:45,467 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:45,831 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > 
usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,469 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,832 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > {code} > dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1> > cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep > "root.b.b1" | wc -l > 500 > {code} > > (Consumes about 6 minutes) > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4152) NM crash with NPE when LogAggregationService#stopContainer called for absent container
[ https://issues.apache.org/jira/browse/YARN-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900485#comment-14900485 ] Bibin A Chundatt commented on YARN-4152: Looks like the issue exists only for {{LogAggregationService}}; in {{ContainerEventDispatcher}} it's handled. > NM crash with NPE when LogAggregationService#stopContainer called for absent > container > -- > > Key: YARN-4152 > URL: https://issues.apache.org/jira/browse/YARN-4152 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: 0001-YARN-4152.patch, 0002-YARN-4152.patch, > 0003-YARN-4152.patch > > > NM crash during log aggregation. > Ran a Pi job with 500 containers and killed the application in between. > *Logs* > {code} > 2015-09-12 18:44:25,597 WARN > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code > from container container_e51_1442063466801_0001_01_99 is : 143 > 2015-09-12 18:44:25,670 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e51_1442063466801_0001_01_000101 > 2015-09-12 18:44:25,670 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_e51_1442063466801_0001_01_000101 from application > application_1442063466801_0001 > 2015-09-12 18:44:25,670 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:422) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > 2015-09-12 18:44:25,692 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got > event CONTAINER_STOP for appId application_1442063466801_0001 > 2015-09-12 18:44:25,692 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: > Exiting, bbye.. > 2015-09-12 18:44:25,692 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dsperf > OPERATION=Container Finished - SucceededTARGET=ContainerImpl > RESULT=SUCCESS APPID=application_1442063466801_0001 > CONTAINERID=container_e51_1442063466801_0001_01_000100 > {code} > *Analysis* > Looks like {{stopContainer}} is called even for an absent container > {code} > case CONTAINER_FINISHED: > LogHandlerContainerFinishedEvent containerFinishEvent = > (LogHandlerContainerFinishedEvent) event; > stopContainer(containerFinishEvent.getContainerId(), > containerFinishEvent.getExitCode()); > break; > {code} > *Event EventType: KILL_CONTAINER sent to absent container > container_e51_1442063466801_0001_01_000101* > We should skip when {{null == context.getContainers().get(containerId)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
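The fix suggested in the analysis is essentially a null guard on the CONTAINER_FINISHED path. The sketch below illustrates it with simplified stand-in types and an slf4j logger; it is not the exact patch.
{code}
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// On CONTAINER_FINISHED, skip containers that are already absent from the NM
// context (e.g. killed before they ever started) instead of dereferencing null.
public class StopContainerGuardSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(StopContainerGuardSketch.class);

  interface Container { /* placeholder for the NM container object */ }
  interface Context { Map<String, Container> getContainers(); }

  private final Context context;

  StopContainerGuardSketch(Context context) { this.context = context; }

  void stopContainer(String containerId, int exitCode) {
    Container container = context.getContainers().get(containerId);
    if (container == null) {
      LOG.warn("CONTAINER_FINISHED for absent container {}, skipping", containerId);
      return;
    }
    // ... existing log-aggregation logic that assumed container != null ...
  }
}
{code}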
[jira] [Updated] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4176: --- Attachment: 0004-YARN-4176.patch Hi [~Naganarasimha], thanks for the review comments. Attaching a patch after handling the same. > Resync NM nodelabels with RM every x interval for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch > > > This JIRA is for handling the below set of issues > # With distributed node labels, after the NM has registered with the RM, if cluster > node labels are removed and added, the NM doesn't resend labels in the heartbeat > again until there is a change in labels > # If NM registration with node labels fails, the NM should resend the labels to the RM > The above cases can be handled by resyncing node labels with the RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and resend node labels to the RM based on this config, regardless of whether > registration fails or succeeds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4177) yarn.util.Clock should not be used to time a duration or time interval
[ https://issues.apache.org/jira/browse/YARN-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900450#comment-14900450 ] Xianyin Xin commented on YARN-4177: --- Hi [~ste...@apache.org], thanks for your comment. I've read your post and did some investigation on this. {quote} 1.Inconsistent across cores, hence non-monotonic on reads, especially reads likely to trigger thread suspend/resume (anything with sleep(), wait(), IO, accessing synchronized data under load). {quote} This was once a bug on some old OSs, but it doesn't seem to be a problem on Linux newer than 2.6 or Windows newer than XP SP2, if I understand your comment correctly. See http://stackoverflow.com/questions/510462/is-system-nanotime-completely-useless, and the referenced https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks. {quote} 2.Not actually monotonic. {quote} Can you explain in detail? As a reference, there's some discussion of clock_gettime, on which nanoTime depends, in http://stackoverflow.com/questions/4943733/is-clock-monotonic-process-or-thread-specific?rq=1, especially in the second answer, which has 4 upvotes. {quote} 3.Achieving a consistency by querying heavyweight counters with possible longer function execution time and lower granularity than the wall clock. That is: modern NUMA, multi-socket servers are essentially multiple computers wired together, and we have a term for that: distributed system {quote} Do you mean achieving a consistent time across nodes in a cluster? I think the monotonic time we plan to offer should be limited to node-local; it's hard to make it cluster-wide. {quote} I've known for a long time that CPU frequency could change its rate {quote} I recall that Linux newer than 2.6.18 takes some measures to overcome this problem. http://stackoverflow.com/questions/510462/is-system-nanotime-completely-useless#comment40382219_510940 has a little discussion on this. > yarn.util.Clock should not be used to time a duration or time interval > -- > > Key: YARN-4177 > URL: https://issues.apache.org/jira/browse/YARN-4177 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Xianyin Xin >Assignee: Xianyin Xin > Attachments: YARN-4177.001.patch, YARN-4177.002.patch > > > There are many places that use Clock to time intervals, which is dangerous, as > commented by [~ste...@apache.org] in HADOOP-12409. Instead, we should use > hadoop.util.Timer#monotonicNow() to get monotonic time, or we could provide a > MonotonicClock in yarn.util for consistency of the code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
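A minimal sketch of the MonotonicClock idea follows, assuming System.nanoTime() as the monotonic source; the committed class may differ in name, package, and interface.
{code}
// Time durations with System.nanoTime() (monotonic, unaffected by wall-clock
// adjustments) instead of System.currentTimeMillis().
public class MonotonicClockSketch {

  /** Milliseconds from a monotonic source; only meaningful for differences. */
  public long getTime() {
    return System.nanoTime() / 1_000_000L;
  }

  public static void main(String[] args) throws InterruptedException {
    MonotonicClockSketch clock = new MonotonicClockSketch();
    long start = clock.getTime();
    Thread.sleep(50);
    // Unlike currentTimeMillis(), this interval cannot go negative if NTP or
    // an admin steps the system clock while we are waiting.
    System.out.println("elapsed ms = " + (clock.getTime() - start));
  }
}
{code}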
[jira] [Commented] (YARN-4152) NM crash with NPE when LogAggregationService#stopContainer called for absent container
[ https://issues.apache.org/jira/browse/YARN-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900399#comment-14900399 ] Sunil G commented on YARN-4152: --- Thanks [~bibinchundatt]. Yes, container seems like was not present in context. And this has happened in CONTAINER_FINISHED event, so absent container scenario can be handled with this check. And looks like this case is also handled in other events, may be you could double check it and make sure similar incidents are handled for other events also. Other wise patch looks good to me. > NM crash with NPE when LogAggregationService#stopContainer called for absent > container > -- > > Key: YARN-4152 > URL: https://issues.apache.org/jira/browse/YARN-4152 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: 0001-YARN-4152.patch, 0002-YARN-4152.patch, > 0003-YARN-4152.patch > > > NM crash during of log aggregation. > Ran Pi job with 500 container and killed application in between > *Logs* > {code} > 2015-09-12 18:44:25,597 WARN > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code > from container container_e51_1442063466801_0001_01_99 is : 143 > 2015-09-12 18:44:25,670 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e51_1442063466801_0001_01_000101 > 2015-09-12 18:44:25,670 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_e51_1442063466801_0001_01_000101 from application > application_1442063466801_0001 > 2015-09-12 18:44:25,670 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:422) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > 2015-09-12 18:44:25,692 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got > event CONTAINER_STOP for appId application_1442063466801_0001 > 2015-09-12 18:44:25,692 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: > Exiting, bbye.. 
> 2015-09-12 18:44:25,692 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dsperf > OPERATION=Container Finished - SucceededTARGET=ContainerImpl > RESULT=SUCCESS APPID=application_1442063466801_0001 > CONTAINERID=container_e51_1442063466801_0001_01_000100 > {code} > *Analysis* > Looks like for absent container also {{stopContainer}} is called > {code} > case CONTAINER_FINISHED: > LogHandlerContainerFinishedEvent containerFinishEvent = > (LogHandlerContainerFinishedEvent) event; > stopContainer(containerFinishEvent.getContainerId(), > containerFinishEvent.getExitCode()); > break; > {code} > *Event EventType: KILL_CONTAINER sent to absent container > container_e51_1442063466801_0001_01_000101* > Should skip when {{null==context.getContainers().get(containerId)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3964) Support NodeLabelsProvider at Resource Manager side
[ https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900374#comment-14900374 ] Hadoop QA commented on YARN-3964: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 20m 15s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:green}+1{color} | javac | 7m 48s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 56s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 10s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:green}+1{color} | whitespace | 0m 3s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 30s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 5m 28s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 6m 57s | Tests passed in hadoop-yarn-client. | | {color:green}+1{color} | yarn tests | 1m 59s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 58m 28s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 116m 46s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761374/YARN-3964.006.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c9cb6a5 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9227/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9227/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/9227/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9227/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9227/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9227/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9227/console | This message was automatically generated. 
> Support NodeLabelsProvider at Resource Manager side > --- > > Key: YARN-3964 > URL: https://issues.apache.org/jira/browse/YARN-3964 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, > YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, > YARN-3964.006.patch, YARN-3964.1.patch > > > Currently, a CLI/REST API is provided in the Resource Manager to allow users to > specify labels for nodes. For labels which may change over time, users will > have to start a cron job to update the labels. This has the following > limitations: > - The cron job needs to be run as the YARN admin user. > - This makes it a little complicated to maintain, as users will have to make > sure this service/daemon is alive. > Adding a Node Labels Provider in the Resource Manager will give users more > flexibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
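For illustration, one hypothetical shape such a provider could take on the RM side, sketched only from the description above; the interface name and signature are assumptions, not the design in the attached doc or patches:
{code}
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical RM-side provider: the Resource Manager would poll it periodically
 * instead of relying on an external cron job driving the CLI/REST API.
 */
public interface RMNodeLabelsProvider {
  /** @return the current mapping from node host name to its set of labels. */
  Map<String, Set<String>> getNodeLabels();
}
{code}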
[jira] [Commented] (YARN-4167) NPE on RMActiveServices#serviceStop when store is null
[ https://issues.apache.org/jira/browse/YARN-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900353#comment-14900353 ] Hudson commented on YARN-4167: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #399 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/399/]) YARN-4167. NPE on RMActiveServices#serviceStop when store is null. (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev c9cb6a5960ad335a3ee93a6ee219eae5aad372f9) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java > NPE on RMActiveServices#serviceStop when store is null > -- > > Key: YARN-4167 > URL: https://issues.apache.org/jira/browse/YARN-4167 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Fix For: 2.8.0 > > Attachments: 0001-YARN-4167.patch, 0001-YARN-4167.patch, > 0002-YARN-4167.patch > > > Configure > {{yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs}} > mismatching with {{yarn.nm.liveness-monitor.expiry-interval-ms}} > On startup NPE is thrown on {{RMActiveServices#serviceStop}} > {noformat} > 2015-09-16 12:23:29,504 INFO org.apache.hadoop.service.AbstractService: > Service RMActiveServices failed in state INITED; cause: > java.lang.IllegalArgumentException: > yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should > be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms > java.lang.IllegalArgumentException: > yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should > be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms > at > org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.(RMContainerTokenSecretManager.java:82) > at > org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.createContainerTokenSecretManager(RMSecretManagerService.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.(RMSecretManagerService.java:57) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createRMSecretManagerService(ResourceManager.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:423) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193) > 2015-09-16 12:23:29,507 ERROR > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error closing > store. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:608) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193 > {noformat} > *Impact Area*: RM failover with wrong configuration -- This message was sent by Atlassian JIRA (v6.3.4#6332)
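A minimal sketch of the kind of null guard the fix implies for {{RMActiveServices#serviceStop}}, assuming the state store is obtained from the RM context; illustrative only, not necessarily the exact committed change:
{code}
@Override
protected void serviceStop() throws Exception {
  // ... stop the other active services first ...
  RMStateStore store = rmContext.getStateStore();
  try {
    // The store can still be null when serviceInit failed before creating it,
    // which is exactly the misconfiguration scenario described in this issue.
    if (store != null) {
      store.close();
    }
  } catch (Exception e) {
    LOG.error("Error closing store.", e);
  }
  super.serviceStop();
}
{code}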
[jira] [Commented] (YARN-4167) NPE on RMActiveServices#serviceStop when store is null
[ https://issues.apache.org/jira/browse/YARN-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900355#comment-14900355 ] Hudson commented on YARN-4167: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2364 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2364/]) YARN-4167. NPE on RMActiveServices#serviceStop when store is null. (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev c9cb6a5960ad335a3ee93a6ee219eae5aad372f9) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java > NPE on RMActiveServices#serviceStop when store is null > -- > > Key: YARN-4167 > URL: https://issues.apache.org/jira/browse/YARN-4167 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Fix For: 2.8.0 > > Attachments: 0001-YARN-4167.patch, 0001-YARN-4167.patch, > 0002-YARN-4167.patch > > > Configure > {{yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs}} > mismatching with {{yarn.nm.liveness-monitor.expiry-interval-ms}} > On startup NPE is thrown on {{RMActiveServices#serviceStop}} > {noformat} > 2015-09-16 12:23:29,504 INFO org.apache.hadoop.service.AbstractService: > Service RMActiveServices failed in state INITED; cause: > java.lang.IllegalArgumentException: > yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should > be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms > java.lang.IllegalArgumentException: > yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should > be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms > at > org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.(RMContainerTokenSecretManager.java:82) > at > org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.createContainerTokenSecretManager(RMSecretManagerService.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.(RMSecretManagerService.java:57) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createRMSecretManagerService(ResourceManager.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:423) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193) > 2015-09-16 12:23:29,507 ERROR > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error closing > store. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:608) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193 > {noformat} > *Impact Area*: RM failover with wrong configuration -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4167) NPE on RMActiveServices#serviceStop when store is null
[ https://issues.apache.org/jira/browse/YARN-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900347#comment-14900347 ] Hudson commented on YARN-4167: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2337 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2337/]) YARN-4167. NPE on RMActiveServices#serviceStop when store is null. (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev c9cb6a5960ad335a3ee93a6ee219eae5aad372f9) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java > NPE on RMActiveServices#serviceStop when store is null > -- > > Key: YARN-4167 > URL: https://issues.apache.org/jira/browse/YARN-4167 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Fix For: 2.8.0 > > Attachments: 0001-YARN-4167.patch, 0001-YARN-4167.patch, > 0002-YARN-4167.patch > > > Configure > {{yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs}} > mismatching with {{yarn.nm.liveness-monitor.expiry-interval-ms}} > On startup NPE is thrown on {{RMActiveServices#serviceStop}} > {noformat} > 2015-09-16 12:23:29,504 INFO org.apache.hadoop.service.AbstractService: > Service RMActiveServices failed in state INITED; cause: > java.lang.IllegalArgumentException: > yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should > be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms > java.lang.IllegalArgumentException: > yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should > be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms > at > org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.(RMContainerTokenSecretManager.java:82) > at > org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.createContainerTokenSecretManager(RMSecretManagerService.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.(RMSecretManagerService.java:57) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createRMSecretManagerService(ResourceManager.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:423) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193) > 2015-09-16 12:23:29,507 ERROR > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error closing > store. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:608) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193 > {noformat} > *Impact Area*: RM failover with wrong configuration -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3964) Support NodeLabelsProvider at Resource Manager side
[ https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900290#comment-14900290 ] Devaraj K commented on YARN-3964: - [~leftnoteasy], Sure, thanks for your interest. > Support NodeLabelsProvider at Resource Manager side > --- > > Key: YARN-3964 > URL: https://issues.apache.org/jira/browse/YARN-3964 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, > YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, > YARN-3964.006.patch, YARN-3964.1.patch > > > Currently, a CLI/REST API is provided in the Resource Manager to allow users to > specify labels for nodes. For labels which may change over time, users will > have to start a cron job to update the labels. This has the following > limitations: > - The cron job needs to be run as the YARN admin user. > - This makes it a little complicated to maintain, as users will have to make > sure this service/daemon is alive. > Adding a Node Labels Provider in the Resource Manager will give users more > flexibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.
[ https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900286#comment-14900286 ] zhihai xu commented on YARN-4095: - Hi [~jlowe], could you help review the patch? Thanks. > Avoid sharing AllocatorPerContext object in LocalDirAllocator between > ShuffleHandler and LocalDirsHandlerService. > - > > Key: YARN-4095 > URL: https://issues.apache.org/jira/browse/YARN-4095 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4095.000.patch, YARN-4095.001.patch > > > Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share the > {{AllocatorPerContext}} object in {{LocalDirAllocator}} for the configuration > {{NM_LOCAL_DIRS}}, because {{AllocatorPerContext}} objects are stored in a static > TreeMap keyed by configuration name: > {code} > private static Map<String, AllocatorPerContext> contexts = > new TreeMap<String, AllocatorPerContext>(); > {code} > {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a > {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even though they don't use the same > {{Configuration}} object, they will use the same {{AllocatorPerContext}} > object. Also, {{LocalDirsHandlerService}} may change the {{NM_LOCAL_DIRS}} value > in its {{Configuration}} object to exclude full and bad local dirs, while > {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its > {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} > is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, the > {{AllocatorPerContext}} needs to be reinitialized because the {{NM_LOCAL_DIRS}} value > has changed. This causes some overhead. > {code} > String newLocalDirs = conf.get(contextCfgItemName); > if (!newLocalDirs.equals(savedLocalDirs)) { > {code} > So it would be a good improvement not to share the same > {{AllocatorPerContext}} instance between {{ShuffleHandler}} and > {{LocalDirsHandlerService}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
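A simplified sketch of the sharing described above; the class and constant names exist in Hadoop, but the snippet is illustrative rather than the actual ShuffleHandler/LocalDirsHandlerService code:
{code}
Configuration dirsHandlerConf = new Configuration();   // may rewrite NM_LOCAL_DIRS to drop full/bad dirs
Configuration shuffleConf = new Configuration();        // keeps the original NM_LOCAL_DIRS value

// Both constructions use the same config key, so both resolve to the same
// AllocatorPerContext entry in LocalDirAllocator's static per-context map.
LocalDirAllocator dirsHandlerAllocator =
    new LocalDirAllocator(YarnConfiguration.NM_LOCAL_DIRS);
LocalDirAllocator shuffleAllocator =
    new LocalDirAllocator(YarnConfiguration.NM_LOCAL_DIRS);

// Alternating calls with the two Configurations make confChanged() see a different
// NM_LOCAL_DIRS value each time and re-initialize the shared context repeatedly.
Path localDir = dirsHandlerAllocator.getLocalPathForWrite("tmp", dirsHandlerConf);
Path mapOutput = shuffleAllocator.getLocalPathForWrite("output/file.out", shuffleConf);
{code}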
[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.
[ https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900285#comment-14900285 ] zhihai xu commented on YARN-4095: - Hi [~Jason Lowe], could you help review the patch? Thanks. > Avoid sharing AllocatorPerContext object in LocalDirAllocator between > ShuffleHandler and LocalDirsHandlerService. > - > > Key: YARN-4095 > URL: https://issues.apache.org/jira/browse/YARN-4095 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4095.000.patch, YARN-4095.001.patch > > > Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share the > {{AllocatorPerContext}} object in {{LocalDirAllocator}} for the configuration > {{NM_LOCAL_DIRS}}, because {{AllocatorPerContext}} objects are stored in a static > TreeMap keyed by configuration name: > {code} > private static Map<String, AllocatorPerContext> contexts = > new TreeMap<String, AllocatorPerContext>(); > {code} > {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a > {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even though they don't use the same > {{Configuration}} object, they will use the same {{AllocatorPerContext}} > object. Also, {{LocalDirsHandlerService}} may change the {{NM_LOCAL_DIRS}} value > in its {{Configuration}} object to exclude full and bad local dirs, while > {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its > {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} > is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, the > {{AllocatorPerContext}} needs to be reinitialized because the {{NM_LOCAL_DIRS}} value > has changed. This causes some overhead. > {code} > String newLocalDirs = conf.get(contextCfgItemName); > if (!newLocalDirs.equals(savedLocalDirs)) { > {code} > So it would be a good improvement not to share the same > {{AllocatorPerContext}} instance between {{ShuffleHandler}} and > {{LocalDirsHandlerService}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)