[jira] [Commented] (YARN-4165) An outstanding container request makes all nodes to be reserved causing all jobs pending
[ https://issues.apache.org/jira/browse/YARN-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904033#comment-14904033 ] Weiwei Yang commented on YARN-4165: --- Thanks Jason, this doesn't look like YARN-957; the reserved memory was less than what the node manager reports. > An outstanding container request makes all nodes to be reserved causing all > jobs pending > > > Key: YARN-4165 > URL: https://issues.apache.org/jira/browse/YARN-4165 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Weiwei Yang >Assignee: Weiwei Yang > > We have a long running service in YARN, it has a outstanding container > request that YARN cannot satisfy (require more memory that nodemanager can > supply). Then YARN reserves all nodes for this application, when I submit > other jobs (require relative small memory that nodemanager can supply), all > jobs are pending because YARN skips scheduling containers on the nodes that > have been reserved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4165) An outstanding container request makes all nodes to be reserved causing all jobs pending
[ https://issues.apache.org/jira/browse/YARN-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904028#comment-14904028 ] Weiwei Yang commented on YARN-4165: --- Hello Jason, thanks for looking into this. I checked YARN-957 but I think this is a different problem. I have 3 nodes: NM1 8G, NM2 8G, NM3 8G. I submitted an application that requires 4 containers, each with relatively large memory (5G), and its app master requires 1G. RM placed 3 containers and 1 app master but left 1 outstanding request, and *unexpectedly* RM reserved 1 container on all 3 nodes: NM1 - 1 container, 1 app master - 6G used - 2G left - 5G reserved; NM2 - 1 container - 5G used - 3G left - 5G reserved; NM3 - 1 container - 5G used - 3G left - 5G reserved. I am not sure yet why we run into this situation, but it might be related to YARN-1769. I am still investigating; if you have any pointers or comments, please let me know. Thanks. > An outstanding container request makes all nodes to be reserved causing all > jobs pending > > > Key: YARN-4165 > URL: https://issues.apache.org/jira/browse/YARN-4165 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Weiwei Yang >Assignee: Weiwei Yang > > We have a long running service in YARN, it has a outstanding container > request that YARN cannot satisfy (require more memory that nodemanager can > supply). Then YARN reserves all nodes for this application, when I submit > other jobs (require relative small memory that nodemanager can supply), all > jobs are pending because YARN skips scheduling containers on the nodes that > have been reserved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903979#comment-14903979 ] Varun Saxena commented on YARN-4075: bq. With proper overloading I don't see significant code duplication problems. Yeah, there aren't too many code duplication problems. It's just that the methods have so many parameters that the code itself looks quite big. Anyway, that's not a major issue. bq. TimelineEntities is not only used by the reader. It is also used by the writer and aggregation logic. Enforcing an order on this class will introduce unnecessary overhead to both writers and aggregators. If the reader needs it, we should derive it and make an ordered version, if possible. Yes, that's a fair point. We can use setEntities on the read path. Explicitly defining a derived class for readers might be an option as well (say, something like SortedTimelineEntities). I think we should make this return type explicit so that reader implementations use it. > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
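For illustration, a minimal sketch of the derived ordered-entities idea floated above. The class name SortedTimelineEntities, the TreeSet-with-comparator approach, and the TimelineEntity import path are assumptions based on this discussion, not existing code.
{code}
import java.util.Collections;
import java.util.Comparator;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;

// Hypothetical reader-side variant of TimelineEntities that keeps entities
// ordered by created time (newest first), so REST responses preserve order.
public class SortedTimelineEntities {
  private final NavigableSet<TimelineEntity> entities =
      new TreeSet<TimelineEntity>(new Comparator<TimelineEntity>() {
        @Override
        public int compare(TimelineEntity e1, TimelineEntity e2) {
          int cmp = Long.compare(e2.getCreatedTime(), e1.getCreatedTime()); // descending by created time
          return cmp != 0 ? cmp : e1.getId().compareTo(e2.getId());         // break ties by entity id
        }
      });

  public void addEntity(TimelineEntity entity) {
    entities.add(entity);
  }

  public Set<TimelineEntity> getEntities() {
    return Collections.unmodifiableSet(entities);
  }
}
{code}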
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903893#comment-14903893 ] Naganarasimha G R commented on YARN-1994: - Thanks for the explanation [~arpitagarwal]. I had the same feeling that a boolean configuration like NM_BIND_WILDCARD would have sufficed, but I thought I might be missing something, hence the query. Maybe we can capture the explanation you gave in the documentation jira YARN-2384 too? > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Fix For: 2.6.0 > > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, > YARN-1994.13.patch, YARN-1994.14.patch, YARN-1994.15-branch2.patch, > YARN-1994.15.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, > YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4157) Merge YARN-1197 back to trunk
[ https://issues.apache.org/jira/browse/YARN-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903832#comment-14903832 ] Hadoop QA commented on YARN-4157: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 24m 29s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 59 new or modified test files. | | {color:green}+1{color} | javac | 7m 55s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 2s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 3m 41s | The applied patch generated 7 new checkstyle issues (total was 29, now 27). | | {color:red}-1{color} | whitespace | 284m 14s | The patch has 180 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 31s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 9m 42s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | mapreduce tests | 9m 34s | Tests passed in hadoop-mapreduce-client-app. | | {color:green}+1{color} | tools/hadoop tests | 0m 52s | Tests passed in hadoop-sls. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 6m 55s | Tests passed in hadoop-yarn-client. | | {color:green}+1{color} | yarn tests | 2m 2s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-server-common. | | {color:green}+1{color} | yarn tests | 8m 27s | Tests passed in hadoop-yarn-server-nodemanager. | | {color:green}+1{color} | yarn tests | 55m 49s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 428m 10s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761700/YARN-1197.diff.6.patch | | Optional Tests | javac unit findbugs checkstyle javadoc | | git revision | trunk / cc2b473 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/whitespace.txt | | hadoop-mapreduce-client-app test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt | | hadoop-sls test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-sls.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9234/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9234/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9234/console | This message was automatically generated. > Merge YARN-1197 back to trunk > - > > Key: YARN-4157 > URL: https://issues.apache.org/jira/browse/YARN-4157 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-1197.diff.1.patch, YARN-1197.diff.2.patch, > YARN-1197.diff.3.patch, YARN-1197.diff.4.patch, YARN-1197.diff.5.patch, > YARN-1197.diff.6.patch > > > The purpose of this jira is to generate a uber patch from c
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903754#comment-14903754 ] Naganarasimha G R commented on YARN-3367: - Thanks [~gtCarrera9] for looking into this. There were a lot of open questions for this jira from my end, which I mentioned [earlier|https://issues.apache.org/jira/browse/YARN-3367?focusedCommentId=14732065&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14732065]; hence I just started with an initial approach. Mainly, to refactor and reuse the existing {{AsyncDispatcher}} we need to discuss the specific points below: {quote} * 3 Is it important to maintain the order of events which are sent from sync and async calls? i.e. is it required to ensure all the async events are also pushed along with the current sync event, or is it ok to send only the sync one? (The current patch just ensures async events are in order.) * 4 Is it required to merge entities of multiple async calls, as they belong to the same application? {quote} Once these are concluded I can analyze further and report back. > Replace starting a separate thread for post entity with event loop in > TimelineClient > > > Key: YARN-3367 > URL: https://issues.apache.org/jira/browse/YARN-3367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Naganarasimha G R > Attachments: YARN-3367.YARN-2928.001.patch > > > Since YARN-3039, we add loop in TimelineClient to wait for > collectorServiceAddress ready before posting any entity. In consumer of > TimelineClient (like AM), we are starting a new thread for each call to get > rid of potential deadlock in main thread. This way has at least 3 major > defects: > 1. The consumer need some additional code to wrap a thread before calling > putEntities() in TimelineClient. > 2. It cost many thread resources which is unnecessary. > 3. The sequence of events could be out of order because each posting > operation thread get out of waiting loop randomly. > We should have something like event loop in TimelineClient side, > putEntities() only put related entities into a queue of entities and a > separated thread handle to deliver entities in queue to collector via REST > call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
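For reference, a minimal sketch of the queue-plus-single-dispatcher-thread design described in this jira. The class and method names below are illustrative stand-ins rather than the actual TimelineClient API, and the REST delivery is stubbed out.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: putEntitiesAsync() just enqueues; one dispatcher thread
// drains the queue in FIFO order and performs the (stubbed) delivery, so
// callers never block and event order is preserved.
public class TimelineEventLoopSketch {
  private final BlockingQueue<Object> queue = new LinkedBlockingQueue<Object>();
  private volatile boolean stopped = false;

  private final Thread dispatcher = new Thread(new Runnable() {
    @Override
    public void run() {
      while (!stopped) {
        try {
          Object entities = queue.take();   // blocks until work arrives, keeps FIFO order
          postToCollector(entities);        // stub for the REST call to the collector
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
  }, "timeline-entity-dispatcher");

  public void start() {
    dispatcher.start();
  }

  public void putEntitiesAsync(Object entities) {
    queue.offer(entities);                  // non-blocking for the caller
  }

  public void stop() throws InterruptedException {
    stopped = true;
    dispatcher.interrupt();
    dispatcher.join();
  }

  private void postToCollector(Object entities) {
    // In the real client this would be a REST call to the collector.
    System.out.println("delivering " + entities);
  }
}
{code}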
[jira] [Commented] (YARN-4199) Minimize lock time in LeveldbTimelineStore.discardOldEntities
[ https://issues.apache.org/jira/browse/YARN-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903729#comment-14903729 ] Shiwei Guo commented on YARN-4199: -- Sorry, I hadn't noticed [YARN-3448|https://issues.apache.org/jira/browse/YARN-3448] before. I think [YARN-3448|https://issues.apache.org/jira/browse/YARN-3448] solves the problem in a better way, so I have marked this issue as a duplicate of [YARN-3448|https://issues.apache.org/jira/browse/YARN-3448]. Thanks for the reminder. > Minimize lock time in LeveldbTimelineStore.discardOldEntities > - > > Key: YARN-4199 > URL: https://issues.apache.org/jira/browse/YARN-4199 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver, yarn >Reporter: Shiwei Guo > > In current implementation, LeveldbTimelineStore.discardOldEntities holds a > writeLock on deleteLock, which will block other put operation, which > eventually block the execution of YARN jobs(e.g. TEZ). When there is lots of > history jobs in timelinestore, the block time will be very long. In our > observation, it block all the TEZ jobs for several hours or longer. > The possible solutions are: > - Optimize leveldb configuration, so a full scan won't take long time. > - Take a snapshot of leveldb, and scan the snapshot, so we only need to hold > lock while getSnapshot. One question is that whether snapshot will take long > time or not, cause I have no experience with leveldb. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
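For context, a rough sketch of the snapshot idea from the description, written against the org.iq80.leveldb API. The lock object, key layout, and age check are placeholders; the point is that only the cheap getSnapshot() call needs to hold the write lock, while the long scan runs against the immutable snapshot.
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import org.iq80.leveldb.ReadOptions;
import org.iq80.leveldb.Snapshot;

public class SnapshotScanSketch {
  private final DB db;
  private final ReentrantReadWriteLock deleteLock = new ReentrantReadWriteLock();

  public SnapshotScanSketch(DB db) {
    this.db = db;
  }

  public void discardOldEntities(long retentionThresholdMillis) throws Exception {
    Snapshot snapshot;
    deleteLock.writeLock().lock();
    try {
      // Only the snapshot creation happens under the lock; it is a cheap operation.
      snapshot = db.getSnapshot();
    } finally {
      deleteLock.writeLock().unlock();
    }

    ReadOptions readOptions = new ReadOptions().snapshot(snapshot);
    DBIterator iterator = db.iterator(readOptions);
    try {
      // The long scan reads from the immutable snapshot, so concurrent puts are not blocked.
      for (iterator.seekToFirst(); iterator.hasNext(); iterator.next()) {
        byte[] key = iterator.peekNext().getKey();
        // ... decide from 'key' whether the entity is older than retentionThresholdMillis
        //     and schedule it for deletion outside this scan ...
      }
    } finally {
      iterator.close();
      snapshot.close();
    }
  }
}
{code}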
[jira] [Resolved] (YARN-4199) Minimize lock time in LeveldbTimelineStore.discardOldEntities
[ https://issues.apache.org/jira/browse/YARN-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiwei Guo resolved YARN-4199. -- Resolution: Duplicate > Minimize lock time in LeveldbTimelineStore.discardOldEntities > - > > Key: YARN-4199 > URL: https://issues.apache.org/jira/browse/YARN-4199 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver, yarn >Reporter: Shiwei Guo > > In current implementation, LeveldbTimelineStore.discardOldEntities holds a > writeLock on deleteLock, which will block other put operation, which > eventually block the execution of YARN jobs(e.g. TEZ). When there is lots of > history jobs in timelinestore, the block time will be very long. In our > observation, it block all the TEZ jobs for several hours or longer. > The possible solutions are: > - Optimize leveldb configuration, so a full scan won't take long time. > - Take a snapshot of leveldb, and scan the snapshot, so we only need to hold > lock while getSnapshot. One question is that whether snapshot will take long > time or not, cause I have no experience with leveldb. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903614#comment-14903614 ] Arpit Agarwal commented on YARN-1994: - bq. Is it assumed that NM_BIND_HOST is configured to specific IP then NM_ADDRESS is also configured to the same IP ? Hi [~Naganarasimha], if NM_BIND_HOST is an IP address other than 0.0.0.0, then NM_ADDRESS should be set to a host that resolves to that address. Think of NM_BIND_HOST as the server side setting and NM_ADDRESS as a client side setting. If they are different the client cannot connect. I don't think we have tested setting NM_BIND_HOST to anything other than 0.0.0.0. In hindsight it may have been simpler to expose a boolean setting like NM_BIND_WILDCARD. bq. May be this a layman question why is it required to bind to all/multiple interfaces ? Depending on the routing and DNS configs, the client may connect on a different interface than the one bound by the server. Listening on all interfaces ensures connectivity. > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Fix For: 2.6.0 > > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, > YARN-1994.13.patch, YARN-1994.14.patch, YARN-1994.15-branch2.patch, > YARN-1994.15.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, > YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
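A small illustration of the server-side vs. client-side split described above, using the Hadoop {{Configuration}} API. The property names {{yarn.nodemanager.bind-host}} and {{yarn.nodemanager.address}} correspond to NM_BIND_HOST and NM_ADDRESS; the host name and port below are made-up example values.
{code}
import org.apache.hadoop.conf.Configuration;

public class NodeManagerBindHostExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Server side: bind the NM endpoint to all interfaces (wildcard).
    conf.set("yarn.nodemanager.bind-host", "0.0.0.0");

    // Client side: the address clients resolve and connect to; must not be 0.0.0.0.
    conf.set("yarn.nodemanager.address", "nm-host.example.com:45454");

    System.out.println("bind host      = " + conf.get("yarn.nodemanager.bind-host"));
    System.out.println("client address = " + conf.get("yarn.nodemanager.address"));
  }
}
{code}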
[jira] [Commented] (YARN-3975) WebAppProxyServlet should not redirect to RM page if AHS is enabled
[ https://issues.apache.org/jira/browse/YARN-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903569#comment-14903569 ] Jason Lowe commented on YARN-3975: -- Latest patch looks good to me, however it does not apply cleanly to branch-2.7. Could you provide a branch-2.7 patch as well? > WebAppProxyServlet should not redirect to RM page if AHS is enabled > --- > > Key: YARN-3975 > URL: https://issues.apache.org/jira/browse/YARN-3975 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-3975.2.b2.patch, YARN-3975.3.patch, > YARN-3975.4.patch, YARN-3975.5.patch, YARN-3975.6.patch, YARN-3975.7.patch, > YARN-3975.8.patch, YARN-3975.9.patch > > > WebAppProxyServlet should be updated to handle the case when the appreport > doesn't have a tracking URL and the Application History Server is eanbled. > As we would have already tried the RM and got the > ApplicationNotFoundException we should not direct the user to the RM app page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903527#comment-14903527 ] Li Lu commented on YARN-3367: - Also, I looked at the patch. One general comment is that the logic of {{TimelineEntityAsyncDispatcher}} is pretty similar to {{AsyncDispatcher}}. Since code that handles concurrency is normally considered non-trivial, maybe we should refactor {{AsyncDispatcher}}'s code and reuse it, rather than duplicating the logic here? Will there be any unforeseen challenges with this? Thanks! > Replace starting a separate thread for post entity with event loop in > TimelineClient > > > Key: YARN-3367 > URL: https://issues.apache.org/jira/browse/YARN-3367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Naganarasimha G R > Attachments: YARN-3367.YARN-2928.001.patch > > > Since YARN-3039, we add loop in TimelineClient to wait for > collectorServiceAddress ready before posting any entity. In consumer of > TimelineClient (like AM), we are starting a new thread for each call to get > rid of potential deadlock in main thread. This way has at least 3 major > defects: > 1. The consumer need some additional code to wrap a thread before calling > putEntities() in TimelineClient. > 2. It cost many thread resources which is unnecessary. > 3. The sequence of events could be out of order because each posting > operation thread get out of waiting loop randomly. > We should have something like event loop in TimelineClient side, > putEntities() only put related entities into a queue of entities and a > separated thread handle to deliver entities in queue to collector via REST > call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4169) jenkins trunk+java build failed in TestNodeStatusUpdaterForLabels
[ https://issues.apache.org/jira/browse/YARN-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903483#comment-14903483 ] Hadoop QA commented on YARN-4169: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 10m 26s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 8m 54s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 2m 35s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 1s | The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 38s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 41s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 2m 9s | Tests passed in hadoop-yarn-common. | | {color:red}-1{color} | yarn tests | 7m 48s | Tests failed in hadoop-yarn-server-nodemanager. | | {color:green}+1{color} | yarn tests | 56m 20s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 95m 35s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761714/YARN-4169.v1.001.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / cc2b473 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/9237/artifact/patchprocess/whitespace.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9237/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9237/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9237/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9237/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9237/console | This message was automatically generated. 
> jenkins trunk+java build failed in TestNodeStatusUpdaterForLabels > - > > Key: YARN-4169 > URL: https://issues.apache.org/jira/browse/YARN-4169 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 3.0.0 > Environment: Jenkins >Reporter: Steve Loughran >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-4169.v1.001.patch > > > Test failing in [[Jenkins build > 402|https://builds.apache.org/view/H-L/view/Hadoop/job/Hadoop-Yarn-trunk-Java8/402/testReport/junit/org.apache.hadoop.yarn.server.nodemanager/TestNodeStatusUpdaterForLabels/testNodeStatusUpdaterForNodeLabels/] > {code} > java.lang.NullPointerException: null > at java.util.HashSet.(HashSet.java:118) > at > org.apache.hadoop.yarn.nodelabels.NodeLabelTestBase.assertNLCollectionEquals(NodeLabelTestBase.java:103) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdaterForLabels.testNodeStatusUpdaterForNodeLabels(TestNodeStatusUpdaterForLabels.java:268) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903461#comment-14903461 ] Li Lu commented on YARN-3367: - Hi [~Naganarasimha], I'm trying to go over all pending JIRAs for 2928 branch, and seems like we're close on this one? Any recent progress on this JIRA? Thanks! > Replace starting a separate thread for post entity with event loop in > TimelineClient > > > Key: YARN-3367 > URL: https://issues.apache.org/jira/browse/YARN-3367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Naganarasimha G R > Attachments: YARN-3367.YARN-2928.001.patch > > > Since YARN-3039, we add loop in TimelineClient to wait for > collectorServiceAddress ready before posting any entity. In consumer of > TimelineClient (like AM), we are starting a new thread for each call to get > rid of potential deadlock in main thread. This way has at least 3 major > defects: > 1. The consumer need some additional code to wrap a thread before calling > putEntities() in TimelineClient. > 2. It cost many thread resources which is unnecessary. > 3. The sequence of events could be out of order because each posting > operation thread get out of waiting loop randomly. > We should have something like event loop in TimelineClient side, > putEntities() only put related entities into a queue of entities and a > separated thread handle to deliver entities in queue to collector via REST > call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903411#comment-14903411 ] Vrushali C commented on YARN-4074: -- Committed patch v8. Thanks [~sjlee0] for the contribution and everyone for the review! > [timeline reader] implement support for querying for flows and flow runs > > > Key: YARN-4074 > URL: https://issues.apache.org/jira/browse/YARN-4074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4074-YARN-2928.007.patch, > YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, > YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, > YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, > YARN-4074-YARN-2928.POC.006.patch > > > Implement support for querying for flows and flow runs. > We should be able to query for the most recent N flows, etc. > This includes changes to the {{TimelineReader}} API if necessary, as well as > implementation of the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903400#comment-14903400 ] Varun Saxena commented on YARN-4000: bq. Is this the case? I think in current code, RM is still ignoring these orphan containers? In recoverContainersOnNode, if we do not find the application in the scheduler, the flow in the RM (looking at trunk code) is as under: # AbstractYarnScheduler#killOrphanContainerOnNode will be called if the application is not found in the scheduler, which will in turn post a CLEANUP_CONTAINER event (for containers which have not finished). This event will be handled by RMNodeImpl. Although here we will be sending one CLEANUP_CONTAINER event for each container, even though all containers for a running app will have to be cleaned up. Maybe this can be refactored to send only one event with all the containers for an app and node, but cleaning up a lot of containers like this may be a rare scenario. # Anyway, going further, in RMNodeImpl this event will be processed in CleanUpContainerTransition. Here the container will be added to a set, containersToClean. # When a heartbeat from the NM comes, ResourceTrackerService#nodeHeartbeat will call RMNodeImpl#updateNodeHeartbeatResponseForCleanup. In this method, the response will be populated with containers to clean up from the set containersToClean, and hence these containers are reported back to the NM in the HB response. On the NM side, the flow is as under: # In NodeStatusUpdaterImpl, these containers to clean up will be retrieved from the HB response and a CMgrCompletedContainersEvent will be dispatched. # In ContainerManagerImpl, this event will be processed and a ContainerKillEvent created for each container. # Now, depending on the state of the container, ContainerImpl will send a CLEANUP_CONTAINER event to ContainersLauncher, which will then send a TERM/KILL signal to the container. > RM crashes with NPE if leaf queue becomes parent queue during restart > - > > Key: YARN-4000 > URL: https://issues.apache.org/jira/browse/YARN-4000 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-4000.01.patch, YARN-4000.02.patch, > YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch > > > This is a similar situation to YARN-2308. If an application is active in > queue A and then the RM restarts with a changed capacity scheduler > configuration where queue A becomes a parent queue to other subqueues then > the RM will crash with a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
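A simplified sketch of step 3 of the RM-side flow above, i.e. draining {{containersToClean}} into the heartbeat response on each NM heartbeat. The types below are illustrative stand-ins, not the actual RMNodeImpl or NodeHeartbeatResponse signatures.
{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-in for the heartbeat response object sent back to the NM.
class HeartbeatResponseSketch {
  private final List<String> containersToCleanup = new ArrayList<String>();

  void addAllContainersToCleanup(Set<String> containerIds) {
    containersToCleanup.addAll(containerIds);
  }

  List<String> getContainersToCleanup() {
    return containersToCleanup;
  }
}

// Stand-in for RMNodeImpl's handling of containersToClean.
class RMNodeSketch {
  // Populated by the CLEANUP_CONTAINER transition, one entry per orphan container.
  private final Set<String> containersToClean = new HashSet<String>();

  synchronized void cleanupContainer(String containerId) {
    containersToClean.add(containerId);
  }

  // Called on each NM heartbeat: move pending cleanups into the response so the
  // NM kills those containers, then clear the set.
  synchronized void updateHeartbeatResponseForCleanup(HeartbeatResponseSketch response) {
    response.addAllContainersToCleanup(containersToClean);
    containersToClean.clear();
  }
}
{code}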
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903396#comment-14903396 ] Li Lu commented on YARN-4075: - bq. This can lead to code bloating with similar methods repeating again and again. OK, here are the possible solutions: - Have two endpoints, one with a cluster name and one without. Both methods would be redirected to the same internal method getFlows(clusterId). For the endpoint that does not have a cluster id, we can figure it out on the server side. With proper overloading I don't see significant code duplication problems. Or - Always require the cluster id. Then, to allow the web apps to figure out the cluster name, we have to either implement another "client" in javascript, or let the user input the cluster name (because the web app cannot figure it out). The first approach does not introduce any duplicated code, but it does introduce duplicated logic, in two different programming languages. The second approach will cause usability problems. Am I missing anything here? bq. Well currently the entities are returned in order, sorted by created time. That is how we have documented our reader API as well. TimelineReader#getEntities is supposed to return entities sorted descendingly by created time. TimelineEntities is not only used by the reader. It is also used by the writer and aggregation logic. Enforcing an order on this class will introduce unnecessary overhead to both writers and aggregators. If the reader needs it, we should derive it and make an ordered version, if possible. bq. You want me to do this refactoring in this JIRA ? To be clear, I'm only asking for Private and VisibleForTesting annotations in this JIRA. We should refactor the UTs in the future JIRA that fully disables fs. > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
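A rough sketch of the first option above (two REST paths delegating to one internal method). The paths, class name, and the way the default cluster is resolved are assumptions for illustration, not the actual timeline reader web services code.
{code}
import java.util.Collections;
import java.util.Set;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Illustrative only: both endpoints funnel into the same internal method, so a
// URL without a cluster id falls back to a server-side default.
@Path("/ws/v2/timeline")
public class FlowsEndpointSketch {

  @GET
  @Path("/flows")
  @Produces(MediaType.APPLICATION_JSON)
  public Set<String> getFlows() {
    // Cluster id omitted by the caller: resolve it on the server side.
    return getFlowsInternal(getDefaultClusterId());
  }

  @GET
  @Path("/flows/{clusterid}")
  @Produces(MediaType.APPLICATION_JSON)
  public Set<String> getFlows(@PathParam("clusterid") String clusterId) {
    return getFlowsInternal(clusterId);
  }

  private Set<String> getFlowsInternal(String clusterId) {
    // Stub: a real implementation would query the timeline reader storage.
    return Collections.singleton("flows-for-" + clusterId);
  }

  private String getDefaultClusterId() {
    // Stub: could read something like yarn.resourcemanager.cluster-id from configuration.
    return "default-cluster";
  }
}
{code}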
[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903387#comment-14903387 ] Varun Saxena commented on YARN-4000: [~jianhe] bq. actually, I think this will be a problem in regular case. Application is being killed by user right on RM restart. This is an existing problem though. Do you think so? Do you mean the user killing the application while we are also killing it at the same time? But the RM will first do the recovery and only then open any of the ports while transitioning to active, so ClientRMService or ResourceTrackerService won't even start until recovery is done. So most probably, by the time the kill from the user comes, all the recovery-related events should have been processed. Even if they are not processed, they will be ahead in the dispatcher queue. A KILL event would be ignored by RMAppImpl if the app is already in the KILLING state. > RM crashes with NPE if leaf queue becomes parent queue during restart > - > > Key: YARN-4000 > URL: https://issues.apache.org/jira/browse/YARN-4000 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-4000.01.patch, YARN-4000.02.patch, > YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch > > > This is a similar situation to YARN-2308. If an application is active in > queue A and then the RM restarts with a changed capacity scheduler > configuration where queue A becomes a parent queue to other subqueues then > the RM will crash with a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3975) WebAppProxyServlet should not redirect to RM page if AHS is enabled
[ https://issues.apache.org/jira/browse/YARN-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903385#comment-14903385 ] Hadoop QA commented on YARN-3975: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 3s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 51s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 12s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 51s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 30s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 39s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 54s | Tests passed in hadoop-yarn-client. | | {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-server-web-proxy. | | | | 47m 26s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761709/YARN-3975.9.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / cc2b473 | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/9235/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | hadoop-yarn-server-web-proxy test log | https://builds.apache.org/job/PreCommit-YARN-Build/9235/artifact/patchprocess/testrun_hadoop-yarn-server-web-proxy.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9235/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9235/console | This message was automatically generated. > WebAppProxyServlet should not redirect to RM page if AHS is enabled > --- > > Key: YARN-3975 > URL: https://issues.apache.org/jira/browse/YARN-3975 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-3975.2.b2.patch, YARN-3975.3.patch, > YARN-3975.4.patch, YARN-3975.5.patch, YARN-3975.6.patch, YARN-3975.7.patch, > YARN-3975.8.patch, YARN-3975.9.patch > > > WebAppProxyServlet should be updated to handle the case when the appreport > doesn't have a tracking URL and the Application History Server is eanbled. > As we would have already tried the RM and got the > ApplicationNotFoundException we should not direct the user to the RM app page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition
[ https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903374#comment-14903374 ] Hadoop QA commented on YARN-4140: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 18m 44s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 7s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 10s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 49s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 29s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 29s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 54m 23s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 96m 15s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761696/0009-YARN-4140.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / cc2b473 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9233/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9233/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9233/console | This message was automatically generated. > RM container allocation delayed incase of app submitted to Nodelabel partition > -- > > Key: YARN-4140 > URL: https://issues.apache.org/jira/browse/YARN-4140 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, > 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, > 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, > 0009-YARN-4140.patch > > > Trying to run application on Nodelabel partition I found that the > application execution time is delayed by 5 – 10 min for 500 containers . > Total 3 machines 2 machines were in same partition and app submitted to same. > After enabling debug was able to find the below > # From AM the container ask is for OFF-SWITCH > # RM allocating all containers to NODE_LOCAL as shown in logs below. > # So since I was having about 500 containers time taken was about – 6 minutes > to allocate 1st map after AM allocation. 
> # Tested with about 1K maps using PI job took 17 minutes to allocate next > container after AM allocation > Once 500 container allocation on NODE_LOCAL is done the next container > allocation is done on OFF_SWITCH > {code} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > /default-rack, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: *, Relax > Locality: true, Node Label Expression: 3} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-143, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showReque
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903363#comment-14903363 ] Varun Saxena commented on YARN-4075: {quote} Or we can have 2 separate REST endpoints, with and without cluster ID. This looks good to me. Are there any specific challenges to implement this? {quote} This can lead to code bloat, with similar methods repeating again and again. Otherwise there is no other concern. I had tried doing a regex match for paths at the time, but that doesn't seem to work if the matching path element is not at the end, which in the case of cluster ID it won't be. {quote} Let's not enforce an order by default since this may be slightly more expensive? The programmer can always sort them on the client side if needed. {quote} Well, currently the entities are returned in order, sorted by created time. That is how we have documented our reader API as well. TimelineReader#getEntities is supposed to return entities in descending order of created time. We will be breaking this behavior if we use TimelineEntities and do not change the set within. {quote} However we do have problems with supporting more features with the old fs storage, so yes it's fine to make the change here. Maybe we'd like to mark them as test only? {quote} You want me to do this refactoring in this JIRA? I think I can handle this refactoring alongside some other JIRA, and we can get this in ASAP for UI-related work. I will rebase it and update the patch tomorrow morning India time. I think 4074 should be in by then. > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903342#comment-14903342 ] Li Lu commented on YARN-4075: - Hi [~varun_saxena], thanks for the note! bq. Infact in initial patches in YARN-3814 I was taking cluster ID from config if it was not supplied by user i.e. it was an optional query parameter. But Zhijie was of the opinion that this handling should be done at TimelineClient side and that is what seemed to be the consensus at that time. Hence I removed it. Sure. However, at that time we did not think about web apps. It will be a little bit non-trivial for a front-end web page to figure out which cluster it's pointing to without user input. I noticed you raised quite a helpful point in the discussion: bq. Or we can have 2 separate REST endpoints, with and without cluster ID. This looks good to me. Are there any specific challenges to implement this? bq. For ordering should we change the set inside TimelineEntities to TreeSet with comparator based on created time ? Let's not enforce an order by default since this may be slightly more expensive? The programmer can always sort them on the client side if needed. bq. I plan to combine to test webservices classes to use HBase The UTs on webservices should be independent of the storage implementations. However we do have problems with supporting more features with the old fs storage, so yes it's fine to make the change here. Maybe we'd like to mark them as test only? > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903329#comment-14903329 ] Varun Saxena commented on YARN-4075: I mean "I plan to combine two of the test webservices classes to use HBase" > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903305#comment-14903305 ] Varun Saxena commented on YARN-4075: [~gtCarrera9], thanks for the review. bq. Maybe we'd like to return the "default" cluster, or the cluster the reader runs on (or a reader farm associates to), if the given clusterId is empty? In fact, in the initial patches in YARN-3814 I was taking the cluster ID from config if it was not supplied by the user, i.e. it was an optional query parameter. But Zhijie was of the opinion that this handling should be done on the TimelineClient side, and that seemed to be the consensus at the time. Hence I removed it. If the consensus is now centering around handling it on the server side, we can do that. I am fine either way. bq. I just noticed that we're returning Set rather than TimelineEntities in timeline reader. Ok. Again, I had initially kept the API returning TimelineEntities in 3051, but opinion differed then. I would in fact prefer using TimelineEntities. For ordering, should we change the set inside TimelineEntities to a TreeSet with a comparator based on created time? Ordering might be useful on the client side. bq. In TestTimelineReaderWebServicesFlowRun#testGetFlowRun, why do we compare equality through toString and comparing two strings For the sake of simplicity; the toString outputs values as well. Anyway, I can write a static function in the test class to do the comparison if the toString approach seems confusing. That seems to be the case, so I will change it. bq. Any special reasons to refactor TestHBaseTimelineStorage Due to the visibility of TimelineSchemaCreator#createAllTables. I saw no real need to make it public; the WebServices-related test class shouldn't really need to access it directly. As I said in one of the comments above, I plan to combine to test webservices classes to use HBase. For that I will have a test class for the HBase reader implementation which will create the tables and load the data (in some before-class method). The webservices class will merely call that; it is the same arrangement as the one which exists for TestTimelineReaderWebServices and TestFileSystemTimelineReaderImpl. Then I won't need to call createAllTables from this test class. I will do that refactoring in some reader-related JIRA. At this time, getting this JIRA in is more important than that refactoring, which anyway is just for tests. > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
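A small sketch of the static comparison helper mentioned above, as an alternative to comparing toString() output. The getters used here (getId(), getCreatedTime()) are assumed from the v2 TimelineEntity API; a real helper would compare whichever fields the test cares about.
{code}
import static org.junit.Assert.assertEquals;

import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;

public final class TimelineEntityAsserts {

  private TimelineEntityAsserts() {
  }

  // Compare the fields the test actually cares about instead of relying on toString().
  public static void assertEntityEquals(TimelineEntity expected, TimelineEntity actual) {
    assertEquals("entity id mismatch", expected.getId(), actual.getId());
    assertEquals("created time mismatch", expected.getCreatedTime(), actual.getCreatedTime());
  }
}
{code}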
[jira] [Commented] (YARN-4200) Refactor reader classes in storage to nest under hbase specific package name
[ https://issues.apache.org/jira/browse/YARN-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903285#comment-14903285 ] Li Lu commented on YARN-4200: - Will flip the code quickly when there is no other interference. Right now our priority goes to YARN-4075. > Refactor reader classes in storage to nest under hbase specific package name > > > Key: YARN-4200 > URL: https://issues.apache.org/jira/browse/YARN-4200 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Li Lu >Priority: Minor > > As suggested by [~gtCarrera9] in YARN-4074, filing jira to refactor the code > to group together the reader classes under a package in storage that > indicates these are hbase specific. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903284#comment-14903284 ] Li Lu commented on YARN-4074: - Sure, please go ahead with the current patch. Thanks for the work folks! > [timeline reader] implement support for querying for flows and flow runs > > > Key: YARN-4074 > URL: https://issues.apache.org/jira/browse/YARN-4074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4074-YARN-2928.007.patch, > YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, > YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, > YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, > YARN-4074-YARN-2928.POC.006.patch > > > Implement support for querying for flows and flow runs. > We should be able to query for the most recent N flows, etc. > This includes changes to the {{TimelineReader}} API if necessary, as well as > implementation of the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-4200) Refactor reader classes in storage to nest under hbase specific package name
[ https://issues.apache.org/jira/browse/YARN-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu reassigned YARN-4200: --- Assignee: Li Lu > Refactor reader classes in storage to nest under hbase specific package name > > > Key: YARN-4200 > URL: https://issues.apache.org/jira/browse/YARN-4200 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Li Lu >Priority: Minor > > As suggested by [~gtCarrera9] in YARN-4074, filing jira to refactor the code > to group together the reader classes under a package in storage that > indicates these are hbase specific. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903246#comment-14903246 ] Vrushali C commented on YARN-4074: -- Chatted with Li offline and decided to file https://issues.apache.org/jira/browse/YARN-4200 to deal with the refactoring of package names and proceed with this patch. > [timeline reader] implement support for querying for flows and flow runs > > > Key: YARN-4074 > URL: https://issues.apache.org/jira/browse/YARN-4074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4074-YARN-2928.007.patch, > YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, > YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, > YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, > YARN-4074-YARN-2928.POC.006.patch > > > Implement support for querying for flows and flow runs. > We should be able to query for the most recent N flows, etc. > This includes changes to the {{TimelineReader}} API if necessary, as well as > implementation of the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4169) jenkins trunk+java build failed in TestNodeStatusUpdaterForLabels
[ https://issues.apache.org/jira/browse/YARN-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4169: Attachment: YARN-4169.v1.001.patch Hi [~ste...@apache.org], I was able to reproduce this test failure (NPE) during debug testing; it is caused by an improperly handled race condition. After sending the {{heartbeat}}, the test case needs to wait for a short duration for the heartbeat (HB) thread in the node status updater to go into its wait state. If no such wait is added, the notify from {{sendOutofBandHeartBeat}} is issued before the HB thread reaches the wait state and is lost. I have also addressed the other review comments you mentioned. > jenkins trunk+java build failed in TestNodeStatusUpdaterForLabels > - > > Key: YARN-4169 > URL: https://issues.apache.org/jira/browse/YARN-4169 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 3.0.0 > Environment: Jenkins >Reporter: Steve Loughran >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-4169.v1.001.patch > > > Test failing in [[Jenkins build > 402|https://builds.apache.org/view/H-L/view/Hadoop/job/Hadoop-Yarn-trunk-Java8/402/testReport/junit/org.apache.hadoop.yarn.server.nodemanager/TestNodeStatusUpdaterForLabels/testNodeStatusUpdaterForNodeLabels/] > {code} > java.lang.NullPointerException: null > at java.util.HashSet.(HashSet.java:118) > at > org.apache.hadoop.yarn.nodelabels.NodeLabelTestBase.assertNLCollectionEquals(NodeLabelTestBase.java:103) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdaterForLabels.testNodeStatusUpdaterForNodeLabels(TestNodeStatusUpdaterForLabels.java:268) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
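The race described above is a notify-before-wait: if the test triggers {{sendOutofBandHeartBeat}} before the heartbeat thread has actually parked in wait(), the notify is lost, the heartbeat never happens, and the assertion ends up comparing against a null label set. A minimal sketch of the kind of guard the comment suggests, assuming the test can reach the heartbeat thread; the helper name, polling step, and timeout are illustrative and not taken from the actual patch:
{code}
// Poll until the heartbeat thread is parked in wait() (or the timeout expires)
// before triggering the out-of-band heartbeat, so the notify cannot be lost.
private static void waitUntilThreadIsWaiting(Thread hbThread, long timeoutMs)
    throws InterruptedException {
  long deadline = System.currentTimeMillis() + timeoutMs;
  while (hbThread.getState() != Thread.State.WAITING
      && System.currentTimeMillis() < deadline) {
    Thread.sleep(50);
  }
}
{code}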
[jira] [Created] (YARN-4200) Refactor reader classes in storage to nest under hbase specific package name
Vrushali C created YARN-4200: Summary: Refactor reader classes in storage to nest under hbase specific package name Key: YARN-4200 URL: https://issues.apache.org/jira/browse/YARN-4200 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vrushali C Priority: Minor As suggested by [~gtCarrera9] in YARN-4074, filing jira to refactor the code to group together the reader classes under a package in storage that indicates these are hbase specific. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3975) WebAppProxyServlet should not redirect to RM page if AHS is enabled
[ https://issues.apache.org/jira/browse/YARN-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-3975: Attachment: YARN-3975.9.patch Somehow attached the wrong version of the patch previously. Attached the patch with the checkstyle issues fixed. > WebAppProxyServlet should not redirect to RM page if AHS is enabled > --- > > Key: YARN-3975 > URL: https://issues.apache.org/jira/browse/YARN-3975 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-3975.2.b2.patch, YARN-3975.3.patch, > YARN-3975.4.patch, YARN-3975.5.patch, YARN-3975.6.patch, YARN-3975.7.patch, > YARN-3975.8.patch, YARN-3975.9.patch > > > WebAppProxyServlet should be updated to handle the case when the appreport > doesn't have a tracking URL and the Application History Server is enabled. > As we would have already tried the RM and got the > ApplicationNotFoundException we should not direct the user to the RM app page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4180) AMLauncher does not retry on failures when talking to NM
[ https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-4180: Attachment: YARN-4180.002.patch Addressed feedback > AMLauncher does not retry on failures when talking to NM > - > > Key: YARN-4180 > URL: https://issues.apache.org/jira/browse/YARN-4180 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > Attachments: YARN-4180.001.patch, YARN-4180.002.patch > > > We see issues with RM trying to launch a container while a NM is restarting > and we get exceptions like NMNotReadyException. While YARN-3842 added retry > for other clients of NM (AMs mainly) its not used by AMLauncher in RM causing > there intermittent errors to cause job failures. This can manifest during > rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
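The fix direction here (reusing the retry behavior that YARN-3842 introduced for NM clients) amounts to talking to the NM through a proxy wrapped with a retry policy, so a transient NMNotReadyException during a rolling restart is retried instead of failing the AM launch. A rough sketch under the assumption that a raw {{ContainerManagementProtocol}} proxy is already in hand; the retry count and interval below are illustrative, not the values used by the patch:
{code}
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryProxy;
import org.apache.hadoop.yarn.api.ContainerManagementProtocol;

public final class RetryingNMProxySketch {
  // Wrap an existing NM proxy so transient failures are retried with a pause
  // between attempts rather than surfacing immediately to the caller.
  static ContainerManagementProtocol withRetries(ContainerManagementProtocol rawProxy) {
    RetryPolicy policy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(10, 1, TimeUnit.SECONDS);
    return (ContainerManagementProtocol)
        RetryProxy.create(ContainerManagementProtocol.class, rawProxy, policy);
  }
}
{code}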
[jira] [Commented] (YARN-4180) AMLauncher does not retry on failures when talking to NM
[ https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903217#comment-14903217 ] Anubhav Dhoot commented on YARN-4180: - The test failure looks unrelated. > AMLauncher does not retry on failures when talking to NM > - > > Key: YARN-4180 > URL: https://issues.apache.org/jira/browse/YARN-4180 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > Attachments: YARN-4180.001.patch > > > We see issues with RM trying to launch a container while a NM is restarting > and we get exceptions like NMNotReadyException. While YARN-3842 added retry > for other clients of NM (AMs mainly) its not used by AMLauncher in RM causing > there intermittent errors to cause job failures. This can manifest during > rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4141) Runtime Application Priority change should not throw exception for applications at finishing states
[ https://issues.apache.org/jira/browse/YARN-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903168#comment-14903168 ] Jason Lowe commented on YARN-4141: -- Thanks for updating the patch. The new constants should be marked final. Also using "active" instead of "accepted" may be a bit more clear since accepted directly maps to an existing app state. > Runtime Application Priority change should not throw exception for > applications at finishing states > --- > > Key: YARN-4141 > URL: https://issues.apache.org/jira/browse/YARN-4141 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-4141.patch, 0002-YARN-4141.patch, > 0003-YARN-4141.patch, 0004-YARN-4141.patch, 0005-YARN-4141.patch > > > As suggested by [~jlowe] in > [MAPREDUCE-5870-comment|https://issues.apache.org/jira/browse/MAPREDUCE-5870?focusedCommentId=14737035&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14737035] > , its good that if YARN can suppress exceptions during change application > priority calls for applications at its finishing stages. > Currently it will be difficult for clients to handle this. This will be > similar to kill application behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4157) Merge YARN-1197 back to trunk
[ https://issues.apache.org/jira/browse/YARN-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4157: - Attachment: YARN-1197.diff.6.patch Rebased to latest trunk (diff.6) > Merge YARN-1197 back to trunk > - > > Key: YARN-4157 > URL: https://issues.apache.org/jira/browse/YARN-4157 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-1197.diff.1.patch, YARN-1197.diff.2.patch, > YARN-1197.diff.3.patch, YARN-1197.diff.4.patch, YARN-1197.diff.5.patch, > YARN-1197.diff.6.patch > > > The purpose of this jira is to generate a uber patch from current YARN-1197 > branch and run against trunk to fix any uncaught warnings and test failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903162#comment-14903162 ] Li Lu commented on YARN-4075: - Sorry folks we're a little bit delayed on YARN-4074, but once that is in we can move forward with this JIRA quickly. > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903158#comment-14903158 ] Li Lu commented on YARN-4074: - Sorry I missed your message yesterday... I was thinking about putting those hbase reader classes (like ApplicationEntityReader) to a sub dir to indicate they only work with HBase. It's also fine to commit the patch as-is if that's troublesome. I'm OK with both. > [timeline reader] implement support for querying for flows and flow runs > > > Key: YARN-4074 > URL: https://issues.apache.org/jira/browse/YARN-4074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4074-YARN-2928.007.patch, > YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, > YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, > YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, > YARN-4074-YARN-2928.POC.006.patch > > > Implement support for querying for flows and flow runs. > We should be able to query for the most recent N flows, etc. > This includes changes to the {{TimelineReader}} API if necessary, as well as > implementation of the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
[ https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903139#comment-14903139 ] Sunil G commented on YARN-4113: --- Thank you [~leftnoteasy] for the review and commit and thank you Karthik for the review. > RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER > -- > > Key: YARN-4113 > URL: https://issues.apache.org/jira/browse/YARN-4113 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4113.patch > > > Found one issue in RMProxy how to initialize RetryPolicy: In > RMProxy#createRetryPolicy. When rmConnectWaitMS is set to -1 (wait forever), > it uses RetryPolicies.RETRY_FOREVER which doesn't respect > {{yarn.resourcemanager.connect.retry-interval.ms}} setting. > RetryPolicies.RETRY_FOREVER uses 0 as the interval, when I run the test > without properly setup localhost name: > {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}}, it wrote > 14G DEBUG exception message to system before it dies. This will be very bad > if we do the same thing in a production cluster. > We should fix two places: > - Make RETRY_FOREVER can take retry-interval as constructor parameter. > - Respect retry-interval when we uses RETRY_FOREVER policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
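For readers following the description: the problem is that {{RETRY_FOREVER}} retries with no pause at all, so the configured connect-retry interval is ignored and failures can spin in a tight loop. The shape of the remedy is a forever-style policy that still sleeps between attempts; a hedged sketch using the long-standing fixed-sleep policy (the committed fix adds a proper interval-aware forever policy, which may be named differently):
{code}
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public final class RmConnectRetrySketch {
  // "Retry forever" approximated as a very large retry count, with a sleep
  // between attempts so yarn.resourcemanager.connect.retry-interval.ms is honored.
  static RetryPolicy foreverWithInterval(long retryIntervalMs) {
    return RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        Integer.MAX_VALUE, retryIntervalMs, TimeUnit.MILLISECONDS);
  }
}
{code}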
[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903133#comment-14903133 ] Bibin A Chundatt commented on YARN-4176: Hi [~leftnoteasy] Could you please look into this issue > Resync NM nodelabels with RM every x interval for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch > > > This JIRA is for handling the below set of issue > # Distributed nodelabels after NM registered with RM if cluster nodelabels > are removed and added then NM doesnt resend labels in heartbeat again untils > any change in labels > # NM registration failed with Nodelabels should resend labels again to RM > The above cases can be handled by resync nodeLabels with RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and will resend nodelabels to RM based on config no matter what the > registration fails or success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
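The resync described in the summary is essentially a periodic timer inside the NM that re-reports the provider's labels regardless of whether the previous registration or report succeeded. A small sketch of how the proposed property could be consumed; the property name comes from the description above, and the default value here is purely illustrative:
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;

public final class NodeLabelResyncSketch {
  // Periodically re-send the NM's node labels to the RM so both sides converge
  // even if a registration failed or cluster labels were removed and re-added.
  static void scheduleResync(Configuration conf, Runnable sendLabelsToRM) {
    long intervalMs = conf.getLong(
        "yarn.nodemanager.node-labels.provider.resync-interval-ms", 120000L);
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    timer.scheduleWithFixedDelay(sendLabelsToRM, intervalMs, intervalMs,
        TimeUnit.MILLISECONDS);
  }
}
{code}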
[jira] [Commented] (YARN-4189) Capacity Scheduler : Improve location preference waiting mechanism
[ https://issues.apache.org/jira/browse/YARN-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903125#comment-14903125 ] Wangda Tan commented on YARN-4189: -- [~xinxianyin], I mentioned this in design doc: bq. To avoid application set a very high delay (such as 10 min), we shall have a global max-container-delay to cap the delay to avoid resource wastage. > Capacity Scheduler : Improve location preference waiting mechanism > -- > > Key: YARN-4189 > URL: https://issues.apache.org/jira/browse/YARN-4189 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4189 design v1.pdf > > > There're some issues with current Capacity Scheduler implementation of delay > scheduling: > *1) Waiting time to allocate each container highly depends on cluster > availability* > Currently, app can only increase missed-opportunity when a node has available > resource AND it gets traversed by a scheduler. There’re lots of possibilities > that an app doesn’t get traversed by a scheduler, for example: > A cluster has 2 racks (rack1/2), each rack has 40 nodes. > Node-locality-delay=40. An application prefers rack1. > Node-heartbeat-interval=1s. > Assume there are 2 nodes available on rack1, delay to allocate one container > = 40 sec. > If there are 20 nodes available on rack1, delay of allocating one container = > 2 sec. > *2) It could violate scheduling policies (Fifo/Priority/Fair)* > Assume a cluster is highly utilized, an app (app1) has higher priority, it > wants locality. And there’s another app (app2) has lower priority, but it > doesn’t care about locality. When node heartbeats with available resource, > app1 decides to wait, so app2 gets the available slot. This should be > considered as a bug that we need to fix. > The same problem could happen when we use FIFO/Fair queue policies. > Another problem similar to this is related to preemption: when preemption > policy preempts some resources from queue-A for queue-B (queue-A is > over-satisfied and queue-B is under-satisfied). But queue-B is waiting for > the node-locality-delay so queue-A will get resources back. In next round, > preemption policy could preempt this resources again from queue-A. > This JIRA is target to solve these problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition
[ https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4140: --- Attachment: 0009-YARN-4140.patch Hi [~sunilg] Thanks for the review and comments. Have updated the test cases and the patch as per the comments. > RM container allocation delayed incase of app submitted to Nodelabel partition > -- > > Key: YARN-4140 > URL: https://issues.apache.org/jira/browse/YARN-4140 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, > 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, > 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, > 0009-YARN-4140.patch > > > Trying to run application on Nodelabel partition I found that the > application execution time is delayed by 5 – 10 min for 500 containers . > Total 3 machines 2 machines were in same partition and app submitted to same. > After enabling debug was able to find the below > # From AM the container ask is for OFF-SWITCH > # RM allocating all containers to NODE_LOCAL as shown in logs below. > # So since I was having about 500 containers time taken was about – 6 minutes > to allocate 1st map after AM allocation. > # Tested with about 1K maps using PI job took 17 minutes to allocate next > container after AM allocation > Once 500 container allocation on NODE_LOCAL is done the next container > allocation is done on OFF_SWITCH > {code} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > /default-rack, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: *, Relax > Locality: true, Node Label Expression: 3} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-143, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-117, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > > {code} > 2015-09-09 14:35:45,467 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:45,831 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: 
root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,469 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,832 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > {code} > dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1> > cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep > "root.b.b1" | wc -l > 500 > {code} > > (Consumes about 6 minutes) > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4180) AMLauncher does not retry on failures when talking to NM
[ https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903096#comment-14903096 ] Robert Kanter commented on YARN-4180: - +1 after doing those. > AMLauncher does not retry on failures when talking to NM > - > > Key: YARN-4180 > URL: https://issues.apache.org/jira/browse/YARN-4180 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Critical > Attachments: YARN-4180.001.patch > > > We see issues with RM trying to launch a container while a NM is restarting > and we get exceptions like NMNotReadyException. While YARN-3842 added retry > for other clients of NM (AMs mainly) its not used by AMLauncher in RM causing > there intermittent errors to cause job failures. This can manifest during > rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902902#comment-14902902 ] Joep Rottinghuis commented on YARN-4075: Agreed with comments from [~gtCarrera] > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch, > YARN-4075-YARN-2928.POC.2.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902808#comment-14902808 ] Varun Saxena commented on YARN-2902: Ok. Then what I will do is NOT wait for the completion of running tasks that have been cancelled. The localizer will only try to delete directories for which the download was complete. For tasks that have failed, the directories will be deleted by FSDownload anyway. We may, however, need a config in the NM for the deletion task delay (the one I have added in the current patch), or we can simply hardcode a value of 2 minutes. Regarding System exit, it will be called after ExecutorService#shutdownNow (which only interrupts running tasks and does not wait for them) anyway. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
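The shutdown behavior described in this comment (interrupt in-flight downloads, do not block on them, and let FSDownload or a delayed NM deletion task clean up the directories) is the standard {{shutdownNow}} pattern. A minimal sketch, assuming the localizer's download pool is a plain ExecutorService; the method and parameter names are illustrative:
{code}
import java.util.List;
import java.util.concurrent.ExecutorService;

public final class LocalizerShutdownSketch {
  // Interrupt running download tasks without waiting for them to finish.
  // Deliberately no awaitTermination(): directories of failed downloads are
  // removed by FSDownload, and completed ones by a (possibly delayed)
  // deletion task, so blocking here before the process exits buys nothing.
  static void stopDownloads(ExecutorService downloadPool) {
    List<Runnable> neverStarted = downloadPool.shutdownNow();
    // neverStarted holds tasks that were queued but never began executing.
  }
}
{code}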
[jira] [Commented] (YARN-4165) An outstanding container request makes all nodes to be reserved causing all jobs pending
[ https://issues.apache.org/jira/browse/YARN-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902771#comment-14902771 ] Jason Lowe commented on YARN-4165: -- Something must be amiss then, since the capacity scheduler should not be making reservations when the node has insufficient memory to ever fill the request after YARN-957. As for reservations in general, the capacity scheduler applies reservations against the user limits within the queue. If the user has the ability to fully use the queue then yes, reservations can stall other applications within the queue since the user is allowed to fill the queue. Without that behavior the application with large requests could end up in a situation where it never runs due to indefinite postponement problems. > An outstanding container request makes all nodes to be reserved causing all > jobs pending > > > Key: YARN-4165 > URL: https://issues.apache.org/jira/browse/YARN-4165 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Weiwei Yang >Assignee: Weiwei Yang > > We have a long running service in YARN, it has a outstanding container > request that YARN cannot satisfy (require more memory that nodemanager can > supply). Then YARN reserves all nodes for this application, when I submit > other jobs (require relative small memory that nodemanager can supply), all > jobs are pending because YARN skips scheduling containers on the nodes that > have been reserved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4199) Minimize lock time in LeveldbTimelineStore.discardOldEntities
[ https://issues.apache.org/jira/browse/YARN-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902757#comment-14902757 ] Jason Lowe commented on YARN-4199: -- Have you looked at the rolling leveldb implementation from YARN-3448? One of its design goals was to solve this same problem. > Minimize lock time in LeveldbTimelineStore.discardOldEntities > - > > Key: YARN-4199 > URL: https://issues.apache.org/jira/browse/YARN-4199 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver, yarn >Reporter: Shiwei Guo > > In current implementation, LeveldbTimelineStore.discardOldEntities holds a > writeLock on deleteLock, which will block other put operation, which > eventually block the execution of YARN jobs(e.g. TEZ). When there is lots of > history jobs in timelinestore, the block time will be very long. In our > observation, it block all the TEZ jobs for several hours or longer. > The possible solutions are: > - Optimize leveldb configuration, so a full scan won't take long time. > - Take a snapshot of leveldb, and scan the snapshot, so we only need to hold > lock while getSnapshot. One question is that whether snapshot will take long > time or not, cause I have no experience with leveldb. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
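On the snapshot option from the description: with the iq80 leveldb API that the timeline store builds on, {{getSnapshot()}} itself is a short operation, so the lock would only need to cover taking the snapshot while the long scan runs against it afterwards. A rough sketch of that shape; whether the snapshot is cheap enough in practice is exactly the open question raised in the description:
{code}
import java.io.IOException;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import org.iq80.leveldb.ReadOptions;
import org.iq80.leveldb.Snapshot;

public final class SnapshotScanSketch {
  // Take a snapshot (briefly, under the store's delete lock), then iterate the
  // snapshot outside the lock so puts from running jobs are not blocked.
  static void scanForOldEntities(DB db) throws IOException {
    Snapshot snapshot = db.getSnapshot();
    try (DBIterator it = db.iterator(new ReadOptions().snapshot(snapshot))) {
      for (it.seekToFirst(); it.hasNext(); it.next()) {
        // Inspect it.peekNext() and collect keys of expired entities here;
        // issue the actual deletes in small batches afterwards.
      }
    } finally {
      snapshot.close();
    }
  }
}
{code}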
[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition
[ https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902756#comment-14902756 ] Hadoop QA commented on YARN-4140: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 41s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 59s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 16s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 25s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 49s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 28s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 28s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 57m 33s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 97m 18s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler | | | hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler | | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761638/0008-YARN-4140.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 57003fa | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9232/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9232/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9232/console | This message was automatically generated. > RM container allocation delayed incase of app submitted to Nodelabel partition > -- > > Key: YARN-4140 > URL: https://issues.apache.org/jira/browse/YARN-4140 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, > 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, > 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch > > > Trying to run application on Nodelabel partition I found that the > application execution time is delayed by 5 – 10 min for 500 containers . > Total 3 machines 2 machines were in same partition and app submitted to same. 
> After enabling debug was able to find the below > # From AM the container ask is for OFF-SWITCH > # RM allocating all containers to NODE_LOCAL as shown in logs below. > # So since I was having about 500 containers time taken was about – 6 minutes > to allocate 1st map after AM allocation. > # Tested with about 1K maps using PI job took 17 minutes to allocate next > container after AM allocation > Once 500 container allocation on NODE_LOCAL is done the next container > allocation is done on OFF_SWITCH > {code} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > /default-rack, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: *, Relax > Locality: true, Node Label Expression: 3} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority:
[jira] [Commented] (YARN-4011) Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk
[ https://issues.apache.org/jira/browse/YARN-4011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902750#comment-14902750 ] Jason Lowe commented on YARN-4011: -- bq. The mapreduce task can check for BYTES_WRITTEN counter and fail fast if it is above the configured limit. I think having the MR framework provide an optional limit for local filesystem output is a reasonable request until a more sophisticated solution can be implemented by YARN directly. > Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk > > > Key: YARN-4011 > URL: https://issues.apache.org/jira/browse/YARN-4011 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.4.0 >Reporter: Ashwin Shankar > > We observed jobs failed since tasks couldn't launch on nodes due to > "java.io.IOException No space left on device". > On digging in further, we found a rogue job which filled up disk. > Specifically it was wrote a lot of map spills(like > attempt_1432082376223_461647_m_000421_0_spill_1.out) to nm-local-dir > causing disk to fill up, and it failed/got killed, but didn't clean up these > files in nm-local-dir. > So the disk remained full, causing subsequent jobs to fail. > This jira is created to address why files under nm-local-dir doesn't get > cleaned up when job fails after filling up disk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
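One way to picture the fail-fast idea being discussed here: a task can periodically compare the bytes it has written to the local filesystem against a configured cap and abort once the cap is exceeded. The sketch below reads the per-scheme FileSystem statistics that back the FILE_BYTES_WRITTEN counter rather than the counter API itself, and the cap ({{limitBytes}}) is an assumed setting, not an existing MapReduce configuration:
{code}
import org.apache.hadoop.fs.FileSystem;

public final class LocalWriteLimitSketch {
  // Sum bytes written to the "file" scheme (local disk) by this JVM and fail
  // fast once the assumed cap is exceeded, instead of filling nm-local-dir.
  static void checkLocalWriteLimit(long limitBytes) {
    long written = 0;
    for (FileSystem.Statistics stats : FileSystem.getAllStatistics()) {
      if ("file".equals(stats.getScheme())) {
        written += stats.getBytesWritten();
      }
    }
    if (written > limitBytes) {
      throw new RuntimeException("Local filesystem writes (" + written
          + " bytes) exceeded the configured limit of " + limitBytes + " bytes");
    }
  }
}
{code}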
[jira] [Commented] (YARN-4119) Expose the NM bind address as an env, so that AM can make use of it for exposing tracking URL
[ https://issues.apache.org/jira/browse/YARN-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902709#comment-14902709 ] Naganarasimha G R commented on YARN-4119: - The only problem I can see with the above approach is that, by default, it binds to {{NM_WEBAPP_ADDRESS}} and not to all IPs, which is what we had thought of as the default behavior! > Expose the NM bind address as an env, so that AM can make use of it for > exposing tracking URL > -- > > Key: YARN-4119 > URL: https://issues.apache.org/jira/browse/YARN-4119 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > As described in MAPREDUCE-5938, In many security scanning tools its not > advisable to bind on all network addresses and would be good to bind only on > the desired address. As AM's can run on any of the nodes it would be better > for NM to share its bind address as part of Environment variables to the > container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition
[ https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902662#comment-14902662 ] Sunil G commented on YARN-4140: --- Hi [~bibinchundatt] Thanks for updating patch. Some minor nits: 1. {{incPendingResourcesForLabel}} and {{decPendingResourceForLabel}} need not have to take ResourceRequest as argument. Only label expression is to be passed along with resource. 2. In below code {code} } else { ResourceRequest anyRequest = getResourceRequest(priority, ResourceRequest.ANY); if (anyRequest != null) { request.setNodeLabelExpression(anyRequest.getNodeLabelExpression()); } } {code} for any other resource requests, label expression is set as from anyRequest. One of point here - If user is not specified any label expression, then also we forcefully set {{anyRequest.getNodeLabelExpression()}} in all requests. It can be null too. Such cases can be invalidated. 3. In testResourceRequestUpdateNodePartitions, before sending second changed AM resource request, could you also add few more NODE_LOCAL or RACK_LOCAL (some priority to ANY, and some after ANY). This can help in hitting some more areas in code. > RM container allocation delayed incase of app submitted to Nodelabel partition > -- > > Key: YARN-4140 > URL: https://issues.apache.org/jira/browse/YARN-4140 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, > 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, > 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch > > > Trying to run application on Nodelabel partition I found that the > application execution time is delayed by 5 – 10 min for 500 containers . > Total 3 machines 2 machines were in same partition and app submitted to same. > After enabling debug was able to find the below > # From AM the container ask is for OFF-SWITCH > # RM allocating all containers to NODE_LOCAL as shown in logs below. > # So since I was having about 500 containers time taken was about – 6 minutes > to allocate 1st map after AM allocation. 
> # Tested with about 1K maps using PI job took 17 minutes to allocate next > container after AM allocation > Once 500 container allocation on NODE_LOCAL is done the next container > allocation is done on OFF_SWITCH > {code} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > /default-rack, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: *, Relax > Locality: true, Node Label Expression: 3} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-143, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-117, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > > {code} > 2015-09-09 14:35:45,467 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:45,831 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,469 DEBUG > org.apache.hadoop.yarn.server.res
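The second review point in the comment above (only inherit the ANY request's label expression when one is actually set, since a null expression is legitimate) can be sketched roughly as follows; this illustrates the review comment, not the code in the attached patch:
{code}
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public final class LabelExpressionFallbackSketch {
  // Copy the label expression from the ANY request onto a node/rack request
  // only when the request has none of its own and the ANY request's is non-null.
  static void inheritLabelFromAny(ResourceRequest request, ResourceRequest anyRequest) {
    if (request.getNodeLabelExpression() == null && anyRequest != null
        && anyRequest.getNodeLabelExpression() != null) {
      request.setNodeLabelExpression(anyRequest.getNodeLabelExpression());
    }
  }
}
{code}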
[jira] [Commented] (YARN-4119) Expose the NM bind address as an env, so that AM can make use of it for exposing tracking URL
[ https://issues.apache.org/jira/browse/YARN-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902631#comment-14902631 ] Naganarasimha G R commented on YARN-4119: - Hi [~vvasudev] & [~rohithsharma], While looking into the modifications I came across a few things: # {{ContainerLaunch.sanitizeEnv}} already adds {{NM_HOST}} to the environment of a container launch script. The {{NM_HOST}} value added as an env is obtained from the NM's NodeId.getHost(), and the NodeID is set in {{ContainerManagerImpl.serviceStart}} using {{yarn.nodemanager.address}}. So I was a little skeptical about reusing this existing env param, because even when a bind address is configured, containers would still get NM_HOST's address. # As per YARN-1994, {{NM_BIND_HOST}} is generally used to set {{0.0.0.0}} in a {{Multi homing/interface}} environment on the server side, but a user can set an individual address too. So it would be ideal to expose this, but one concern I have is: what if it is not set? As per my understanding we would then need to use the address part of {{NM_WEBAPP_ADDRESS/NM_WEBAPP_HTTPS_ADDRESS}} based on the scheme. So my idea is: * expose a new ENV, {{AM_BIND_ADDR}} * set it to {{NM_BIND_HOST}} if that is set * if not set, fall back to {{NM_WEBAPP_ADDRESS/NM_WEBAPP_HTTPS_ADDRESS}} based on the scheme. Thoughts? > Expose the NM bind address as an env, so that AM can make use of it for > exposing tracking URL > -- > > Key: YARN-4119 > URL: https://issues.apache.org/jira/browse/YARN-4119 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > As described in MAPREDUCE-5938, In many security scanning tools its not > advisable to bind on all network addresses and would be good to bind only on > the desired address. As AM's can run on any of the nodes it would be better > for NM to share its bind address as part of Environment variables to the > container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
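On the consumer side, the proposal above would let an AM choose its tracking-URL bind address from its environment instead of binding to all interfaces. A small usage sketch; {{AM_BIND_ADDR}} is only the name proposed in this comment (not an existing YARN constant), and falling back to {{NM_HOST}} and then the wildcard address is an assumption for illustration:
{code}
public final class AmBindAddressSketch {
  // Resolve the address the AM's web server should bind to: the proposed
  // AM_BIND_ADDR env if present, else the NM_HOST env the NM already exports,
  // else the wildcard address.
  static String resolveBindAddress() {
    String bindAddr = System.getenv("AM_BIND_ADDR");
    if (bindAddr == null || bindAddr.isEmpty()) {
      bindAddr = System.getenv("NM_HOST");
    }
    return (bindAddr == null || bindAddr.isEmpty()) ? "0.0.0.0" : bindAddr;
  }
}
{code}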
[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition
[ https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4140: --- Attachment: 0008-YARN-4140.patch Hi [~leftnoteasy] Could you please review the attached patch? When labels are updated for *any*, the pending resource usage for the queue and the app also needs to be updated, right? I have changed the patch based on that. > RM container allocation delayed incase of app submitted to Nodelabel partition > -- > > Key: YARN-4140 > URL: https://issues.apache.org/jira/browse/YARN-4140 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, > 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, > 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch > > > Trying to run application on Nodelabel partition I found that the > application execution time is delayed by 5 – 10 min for 500 containers . > Total 3 machines 2 machines were in same partition and app submitted to same. > After enabling debug was able to find the below > # From AM the container ask is for OFF-SWITCH > # RM allocating all containers to NODE_LOCAL as shown in logs below. > # So since I was having about 500 containers time taken was about – 6 minutes > to allocate 1st map after AM allocation. > # Tested with about 1K maps using PI job took 17 minutes to allocate next > container after AM allocation > Once 500 container allocation on NODE_LOCAL is done the next container > allocation is done on OFF_SWITCH > {code} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > /default-rack, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: *, Relax > Locality: true, Node Label Expression: 3} > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-143, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > showRequests: application=application_1441791998224_0001 request={Priority: > 20, Capability: , # Containers: 500, Location: > host-10-19-92-117, Relax Locality: true, Node Label Expression: } > 2015-09-09 15:21:58,954 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > > {code} > 2015-09-09 14:35:45,467 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:45,831 DEBUG > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,469 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > 2015-09-09 14:35:46,832 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, > usedResources=, usedCapacity=0.0, > absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> vCores:0>, NODE_LOCAL > {code} > {code} > dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1> > cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep > "root.b.b1" | wc -l > 500 > {code} > > (Consumes about 6 minutes) > -- This message was sent by Atlassian JIRA (v6.3.4#6332
[jira] [Commented] (YARN-4141) Runtime Application Priority change should not throw exception for applications at finishing states
[ https://issues.apache.org/jira/browse/YARN-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902503#comment-14902503 ] Sunil G commented on YARN-4141: --- Hi [~jlowe] and [~rohithsharma] Could you please help to check the updated patch. > Runtime Application Priority change should not throw exception for > applications at finishing states > --- > > Key: YARN-4141 > URL: https://issues.apache.org/jira/browse/YARN-4141 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-4141.patch, 0002-YARN-4141.patch, > 0003-YARN-4141.patch, 0004-YARN-4141.patch, 0005-YARN-4141.patch > > > As suggested by [~jlowe] in > [MAPREDUCE-5870-comment|https://issues.apache.org/jira/browse/MAPREDUCE-5870?focusedCommentId=14737035&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14737035] > , its good that if YARN can suppress exceptions during change application > priority calls for applications at its finishing stages. > Currently it will be difficult for clients to handle this. This will be > similar to kill application behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902369#comment-14902369 ] Naganarasimha G R commented on YARN-1994: - Hi [~cwelch] & [~arpitagarwal], I have few doubts for configuring NM_BIND_HOST & NM_ADDRESS as per the existing trunk/branch 2 code {code} if (bindHost != null && !bindHost.isEmpty() && nmAddress != null && !nmAddress.isEmpty()) { hostOverride = nmAddress.split(":")[0]; } // setup node ID InetSocketAddress connectAddress; if (delayedRpcServerStart) { connectAddress = NetUtils.getConnectAddress(initialAddress); } else { server.start(); connectAddress = NetUtils.getConnectAddress(server); } NodeId nodeId = buildNodeId(connectAddress, hostOverride); {code} # IIUC if NM_BIND_HOST is 0.0.0.0 then NM_ADDRESS's host part needs to be used for NODE_ID but what if proper IP is configured for NM_BIND_HOST then is it correct to take NM_ADDRESS's host part ? Is it assumed that NM_BIND_HOST is configured to specific IP then NM_ADDRESS is also configured to the same IP ? # May be this a layman question why is it required to bind to all/multiple interfaces ? > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Fix For: 2.6.0 > > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, > YARN-1994.13.patch, YARN-1994.14.patch, YARN-1994.15-branch2.patch, > YARN-1994.15.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, > YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.3.4#6332)