[jira] [Updated] (YARN-2330) Jobs are not displaying in timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2330: -- Issue Type: Sub-task (was: Bug) Parent: YARN-321 Jobs are not displaying in timeline server after RM restart --- Key: YARN-2330 URL: https://issues.apache.org/jira/browse/YARN-2330 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.1 Environment: Nodemanagers 3 (3*8GB) Queues A = 70% Queues B = 30% Reporter: Nishan Shetty Submit jobs to queue a While job is running Restart RM Observe that those jobs are not displayed in timelineserver {code} 2014-07-22 10:11:32,084 ERROR org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: History information of application application_1406002968974_0003 is not included into the result due to the exception java.io.IOException: Cannot seek to negative offset at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1381) at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63) at org.apache.hadoop.io.file.tfile.BCFile$Reader.init(BCFile.java:624) at org.apache.hadoop.io.file.tfile.TFile$Reader.init(TFile.java:804) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore$HistoryFileReader.init(FileSystemApplicationHistoryStore.java:683) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getHistoryFileReader(FileSystemApplicationHistoryStore.java:661) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getApplication(FileSystemApplicationHistoryStore.java:146) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getAllApplications(FileSystemApplicationHistoryStore.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getAllApplications(ApplicationHistoryManagerImpl.java:103) at org.apache.hadoop.yarn.server.webapp.AppsBlock.render(AppsBlock.java:75) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56) at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) at org.apache.hadoop.yarn.webapp.Dispatcher.render(Dispatcher.java:197) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:156) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at 
com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at
[jira] [Commented] (YARN-2330) Jobs are not displaying in timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069943#comment-14069943 ] Zhijie Shen commented on YARN-2330: --- It is very likely that the history file of the given application is still not closed for writing (for example, after the RM restarts, it reopens the history file to append history information), while on the other side the reader tries to scan the file that is still being written. The following logic is broken because the writer is invoked on the RM, while the reader is invoked on the timeline server. Hence, from the reader's point of view, outstandingWriters is always empty, and it cannot be used to indicate whether a file is open for writing or not.
{code}
// The history file is still under writing
if (outstandingWriters.containsKey(appId)) {
  throw new IOException("History file for application " + appId
      + " is under writing");
}
{code}
[jira] [Commented] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069950#comment-14069950 ] Zhijie Shen commented on YARN-2262: --- How's your setup for RM restarting? Does the application continue after RM restarting? If so, will the timeline server converge to show the missing fields correctly? The wrong fields you saw here are because the finish information of the application is missing. However, if the application keeps running until it finishes after the RM restart, the RM will still write the finish information, and the wrong values should eventually be corrected. One meta comment for this and YARN-2330: it would be great if you could help improve the generic history service. On the other hand, we're looking at migrating the generic history data to the timeline store, as we've seen the limitations of the fs history store. If that is finalized, we may not continue to support this store. You can keep an eye on YARN-2033. Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart State: null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069994#comment-14069994 ] Zhijie Shen commented on YARN-2229: --- Some additional comments: 0. For those that have been marked \@Private, it should be okay to break backward compatibility. The problem that I can think of is RM and NM versions being out of sync. For example, the RM is 2.6 and takes the id as a long, while the NM is 2.4 and takes the id as an int. 1. I'm not sure it's good to mark a \@Stable method back to \@Unstable:
{code}
 @Public
-@Stable
+@Unstable
 public abstract int getId();
{code}
2. So anyway, we're going to break users that use protobuf to build a client in their own programming language, aren't we?
{code}
 optional ApplicationAttemptIdProto app_attempt_id = 2;
-optional int32 id = 3;
+optional int64 id = 3;
}
{code}
ContainerId can overflow with RM restart Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.10.patch, YARN-2229.10.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, YARN-2229.9.patch On YARN-2052, we changed the containerId format: the upper 10 bits are for the epoch, the lower 22 bits are for the sequence number of ids. This preserves the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow after the RM restarts 1024 times. To avoid the problem, it's better to make containerId a long. We need to define the new containerId format while preserving backward compatibility on this JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
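To make the 10/22-bit split and the 1024-restart overflow concrete, here is a small standalone sketch (it is not the actual ContainerId implementation):
{code}
// Illustrative layout: upper 10 bits hold the RM epoch, lower 22 bits hold
// the container sequence number within that epoch.
public class ContainerIdLayoutSketch {
  private static final int SEQUENCE_BITS = 22;
  private static final int SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1; // 0x3FFFFF

  static int pack(int epoch, int sequence) {
    return (epoch << SEQUENCE_BITS) | (sequence & SEQUENCE_MASK);
  }

  public static void main(String[] args) {
    // 10 bits of epoch hold values 0..1023, so the 1024th RM restart overflows.
    int maxEpoch = (1 << (Integer.SIZE - SEQUENCE_BITS)) - 1; // 1023
    System.out.println("max epoch before overflow = " + maxEpoch);
    System.out.println("id(epoch=1, seq=5) = " + pack(1, 5)); // 4194309
  }
}
{code}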
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070019#comment-14070019 ] Zhijie Shen commented on YARN-2319: --- Almost good to me, but would you mind fixing the indentation below? No tabs please.
{code}
+        TestRMWebServicesDelegationTokens.class.getName() + "-root");
+    testMiniKDC = new MiniKdc(MiniKdc.createConf(), testRootDir);
+    testMiniKDC.start();
+    testMiniKDC.createPrincipal(httpSpnegoKeytabFile, "HTTP/localhost",
+        "client", "client2", "client3");
{code}
Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch, YARN-2319.1.patch MiniKdc only invokes the start method, not stop, in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
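For reference, the shape of the fix under review is simply to stop the KDC after the tests. A minimal sketch, assuming field names taken from the snippet above (not the exact patch):
{code}
import java.io.File;

import org.apache.hadoop.minikdc.MiniKdc;
import org.junit.After;
import org.junit.Before;

public class MiniKdcLifecycleSketch {
  private MiniKdc testMiniKDC;

  @Before
  public void startKdc() throws Exception {
    File testRootDir = new File("target", "minikdc-root");
    testMiniKDC = new MiniKdc(MiniKdc.createConf(), testRootDir);
    testMiniKDC.start();
  }

  @After
  public void stopKdc() {
    if (testMiniKDC != null) {
      // The missing piece in the original test: shut the KDC down.
      testMiniKDC.stop();
    }
  }
}
{code}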
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070039#comment-14070039 ] Yuliya Feldman commented on YARN-796: - To everybody who was so involved in providing input over the last couple of days: I can provide support for App, Queue and Queue Label Policy expressions. I also did some performance measurements: with 1000 entries of nodes and their labels, it takes about an additional 700 ms to process 1 million requests (hot cache). If we need to re-evaluate on every ResourceRequest within an app, performance will go down. This should cover:
{quote}
label-expressions support (AND) only
app able to specify a label-expression when making a resource request - kind of (done per application at the moment, not per every resource request)
queues to AND augment the label expression with the queue label-expression
add support for OR and NOT to label-expressions
{quote}
As far as:
{quote}
RM has list of valid labels. (hot reloadable)
NMs have list of labels. (hot reloadable)
{quote}
With a file in DFS you can get hot-reloadable valid labels on the RM (unless somebody makes a typo). [~wangda] - How do you want to proceed here? Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture, etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
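As a toy illustration of the AND-only matching discussed above (not code from any posted patch; names are made up), a request's labels simply all have to be present on the node:
{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LabelMatchSketch {
  /** AND semantics: every requested label must be present on the node. */
  static boolean matchesAllLabels(Set<String> nodeLabels,
      Set<String> requestedLabels) {
    return nodeLabels.containsAll(requestedLabels);
  }

  public static void main(String[] args) {
    Set<String> node = new HashSet<>(Arrays.asList("linux", "x86_64", "gpu"));
    System.out.println(matchesAllLabels(node,
        new HashSet<>(Arrays.asList("linux", "gpu"))));   // true
    System.out.println(matchesAllLabels(node,
        new HashSet<>(Arrays.asList("windows"))));        // false
  }
}
{code}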
[jira] [Updated] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenwu Peng updated YARN-2319: - Attachment: YARN-2319.2.patch Thanks for the comments, [~zjshen]. Sorry, my IDE formatter had issues. Fixed the format issue. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch, YARN-2319.1.patch, YARN-2319.2.patch MiniKdc only invokes the start method, not stop, in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070073#comment-14070073 ] Nishan Shetty commented on YARN-2262: - [~zjshen] {quote}How's your setup for RM restarting?{quote} An RM HA setup where the active RM was restarted gracefully. {quote}does the application continue after RM restarting?{quote} Yes, the application continues after the RM restart and finally the application will be SUCCEEDED. {quote}If so, will the timeline server converge to show the missing fields correctly?{quote} No, the timeline server does not show the correct fields even after the application is SUCCEEDED. Thanks. Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart State: null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070080#comment-14070080 ] Zhijie Shen commented on YARN-2262: --- Then, it should be a bug. Would you mind sharing the RM and the timeline server log where the problem occurred? Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart State:null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2330) Jobs are not displaying in timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070081#comment-14070081 ] Nishan Shetty commented on YARN-2330: - Here the RM went down abruptly.
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070090#comment-14070090 ] Zhijie Shen commented on YARN-2319: --- +1. Will commit it once jenkins +1 as well. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch, YARN-2319.1.patch, YARN-2319.2.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070095#comment-14070095 ] Naganarasimha G R commented on YARN-2301: - bq. How about having -list only, and then parsing whether the given id is app id or app attempt id? Coding-wise I don't see any trouble with this approach, but CLI-design-wise I am not sure whether a command can take either one of the params. [~jianhe], can you please share your thoughts on this too? Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability While running the yarn container -list <Application Attempt ID> command, some observations: 1) the scheme (e.g. http/https) before LOG-URL is missing 2) the start-time is printed as milliseconds (e.g. 1405540544844). Better to print it in a time format. 3) finish-time is 0 if the container is not yet finished. Maybe N/A 4) May have an option to run as yarn container -list <appId> OR yarn application -list-containers <appId> also. As the attempt id is not shown on the console, this makes it easier for the user to just copy the appId and run it; it may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
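A rough sketch of the parsing idea discussed above (a standalone illustration, not the actual CLI code): YARN ids carry distinct string prefixes, so a single -list argument can be disambiguated by prefix.
{code}
public class ContainerListArgSketch {
  static void list(String id) {
    if (id.startsWith("application_")) {
      System.out.println("list containers for all attempts of app " + id);
    } else if (id.startsWith("appattempt_")) {
      System.out.println("list containers for attempt " + id);
    } else {
      throw new IllegalArgumentException(
          "Expected an application or application attempt id: " + id);
    }
  }

  public static void main(String[] args) {
    list("application_1405540544844_0001");
    list("appattempt_1405540544844_0001_000001");
  }
}
{code}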
[jira] [Assigned] (YARN-2330) Jobs are not displaying in timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-2330: --- Assignee: Naganarasimha G R
[jira] [Commented] (YARN-2045) Data persisted in NM should be versioned
[ https://issues.apache.org/jira/browse/YARN-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070126#comment-14070126 ] Hudson commented on YARN-2045: -- FAILURE: Integrated in Hadoop-Yarn-trunk #620 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/620/]) YARN-2045. Data persisted in NM should be versioned. Contributed by Junping Du (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612285) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/NMDBSchemaVersion.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/impl/pb * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/impl/pb/NMDBSchemaVersionPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java Data persisted in NM should be versioned Key: YARN-2045 URL: https://issues.apache.org/jira/browse/YARN-2045 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Fix For: 3.0.0, 2.6.0 Attachments: YARN-2045-v2.patch, YARN-2045-v3.patch, YARN-2045-v4.patch, YARN-2045-v5.patch, YARN-2045-v6.patch, YARN-2045-v7.patch, YARN-2045.patch As a split task from YARN-667, we want to add version info to NM related data, include: - NodeManager local LevelDB state - NodeManager directory structure -- This message was sent by Atlassian JIRA (v6.2#6252)
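The versioning idea above boils down to storing a schema version next to the NM state and comparing it on startup. A simplified sketch (names and the exact compatibility rule are illustrative assumptions, not the committed code):
{code}
public class NMSchemaVersionSketch {
  static final int CURRENT_MAJOR = 1;

  /**
   * Assumed convention: state written with the same major version is
   * loadable (minor bumps stay compatible); a different major version
   * requires explicit upgrade handling.
   */
  static boolean isCompatible(int storedMajor, int storedMinor) {
    return storedMajor == CURRENT_MAJOR;
  }

  public static void main(String[] args) {
    System.out.println(isCompatible(1, 3));  // true: minor bump tolerated
    System.out.println(isCompatible(2, 0));  // false: needs explicit handling
  }
}
{code}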
[jira] [Commented] (YARN-2321) NodeManager web UI can incorrectly report Pmem enforcement
[ https://issues.apache.org/jira/browse/YARN-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070117#comment-14070117 ] Hudson commented on YARN-2321: -- FAILURE: Integrated in Hadoop-Yarn-trunk #620 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/620/]) YARN-2321. NodeManager web UI can incorrectly report Pmem enforcement. Contributed by Leitao Guo (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612411) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/NodePage.java NodeManager web UI can incorrectly report Pmem enforcement -- Key: YARN-2321 URL: https://issues.apache.org/jira/browse/YARN-2321 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Assignee: Leitao Guo Fix For: 3.0.0, 2.6.0 Attachments: YARN-2321.patch WebUI of NodeManager get the wrong configuration of Pmem enforcement enable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes
[ https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070119#comment-14070119 ] Hudson commented on YARN-2013: -- FAILURE: Integrated in Hadoop-Yarn-trunk #620 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/620/]) YARN-2013. The diagnostics is always the ExitCodeException stack when the container crashes. (Contributed by Tsuyoshi OZAWA) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612449) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java The diagnostics is always the ExitCodeException stack when the container crashes Key: YARN-2013 URL: https://issues.apache.org/jira/browse/YARN-2013 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-2013.1.patch, YARN-2013.2.patch, YARN-2013.3-2.patch, YARN-2013.3.patch, YARN-2013.4.patch, YARN-2013.5.patch When a container crashes, ExitCodeException will be thrown from Shell. Default/LinuxContainerExecutor captures the exception and puts the exception stack into the diagnostics. Therefore, the exception stack is always the same.
{code}
String diagnostics = "Exception from container-launch: \n"
    + StringUtils.stringifyException(e) + "\n" + shExec.getOutput();
container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
    diagnostics));
{code}
In addition, it seems that the exception always has an empty message, as there's no message from stderr. Hence the diagnostics are not of much use for users to analyze the reason for the container crash. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070118#comment-14070118 ] Hudson commented on YARN-2131: -- FAILURE: Integrated in Hadoop-Yarn-trunk #620 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/620/]) YARN-2131. Addendum. Add a way to format the RMStateStore. (Robert Kanter via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612443) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070161#comment-14070161 ] Hadoop QA commented on YARN-2319: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657099/YARN-2319.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterLauncher org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication org.apache.hadoop.yarn.server.resourcemanager.TestFifoScheduler org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.TestRM org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStoreZKClientConnections org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4393//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4393//console This message is automatically generated. 
Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch, YARN-2319.1.patch, YARN-2319.2.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070165#comment-14070165 ] Zhijie Shen commented on YARN-2247: --- [~vvasudev], thanks for your patience with my comments. The new patch looks almost good to me. Just some nits: 1. This should not be necessary. Always load TimelineAuthenticationFilter; with the simple type, the pseudo handler is still used.
{code}
+    if (authType.equals("simple") && !UserGroupInformation.isSecurityEnabled()) {
+      container.addFilter("authentication",
+          AuthenticationFilter.class.getName(), filterConfig);
+      return;
+    }
{code}
2. Check for null first for testMiniKDC and rm? Same for TestRMWebappAuthentication.
{code}
+    testMiniKDC.stop();
+    rm.stop();
{code}
3. I didn't find the logic to forbid it. Anyway, is it good to mention it in the document as well?
{code}
+    // Test to make sure that we can't do delegation token
+    // functions using just delegation token auth
{code}
Allow RM web services users to authenticate using delegation tokens --- Key: YARN-2247 URL: https://issues.apache.org/jira/browse/YARN-2247 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Attachments: apache-yarn-2247.0.patch, apache-yarn-2247.1.patch, apache-yarn-2247.2.patch, apache-yarn-2247.3.patch The RM webapp should allow users to authenticate using delegation tokens to maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2319: -- Attachment: YARN-2319.2.patch Trigger the jenkins again Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch, YARN-2319.1.patch, YARN-2319.2.patch, YARN-2319.2.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070223#comment-14070223 ] Hadoop QA commented on YARN-2319: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657115/YARN-2319.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4394//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4394//console This message is automatically generated. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch, YARN-2319.1.patch, YARN-2319.2.patch, YARN-2319.2.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070239#comment-14070239 ] Hudson commented on YARN-2242: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5934 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5934/]) YARN-2242. Addendum patch. Improve exception information on AM launch crashes. (Contributed by Li Lu) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612565) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java Improve exception information on AM launch crashes -- Key: YARN-2242 URL: https://issues.apache.org/jira/browse/YARN-2242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Fix For: 2.6.0 Attachments: YARN-2242-070115-2.patch, YARN-2242-070814-1.patch, YARN-2242-070814.patch, YARN-2242-071114.patch, YARN-2242-071214.patch, YARN-2242-071414.patch Now on each time AM Container crashes during launch, both the console and the webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, but sometimes confusing. With the help of log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a pointer to the aggregated logs to the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070279#comment-14070279 ] Hudson commented on YARN-2131: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1812 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1812/]) YARN-2131. Addendum. Add a way to format the RMStateStore. (Robert Kanter via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612443) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2270) TestFSDownload#testDownloadPublicWithStatCache fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070281#comment-14070281 ] Hudson commented on YARN-2270: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1812 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1812/]) YARN-2270. Made TestFSDownload#testDownloadPublicWithStatCache be skipped when there’s no ancestor permissions. Contributed by Akira Ajisaka. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612460) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/FSDownload.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestFSDownload.java TestFSDownload#testDownloadPublicWithStatCache fails in trunk - Key: YARN-2270 URL: https://issues.apache.org/jira/browse/YARN-2270 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.1 Reporter: Ted Yu Assignee: Akira AJISAKA Priority: Minor Fix For: 2.5.0 Attachments: YARN-2270.2.patch, YARN-2270.patch From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console : {code} Running org.apache.hadoop.yarn.util.TestFSDownload Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec FAILURE! - in org.apache.hadoop.yarn.util.TestFSDownload testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 0.137 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363) {code} Similar error can be seen here: https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/ Looks like future.get() returned null. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2321) NodeManager web UI can incorrectly report Pmem enforcement
[ https://issues.apache.org/jira/browse/YARN-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070278#comment-14070278 ] Hudson commented on YARN-2321: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1812 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1812/]) YARN-2321. NodeManager web UI can incorrectly report Pmem enforcement. Contributed by Leitao Guo (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612411) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/NodePage.java NodeManager web UI can incorrectly report Pmem enforcement -- Key: YARN-2321 URL: https://issues.apache.org/jira/browse/YARN-2321 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Assignee: Leitao Guo Fix For: 3.0.0, 2.6.0 Attachments: YARN-2321.patch WebUI of NodeManager get the wrong configuration of Pmem enforcement enable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2331) Distinguish shutdown during supervision vs. shutdown for rolling upgrade
Jason Lowe created YARN-2331: Summary: Distinguish shutdown during supervision vs. shutdown for rolling upgrade Key: YARN-2331 URL: https://issues.apache.org/jira/browse/YARN-2331 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Jason Lowe When the NM is shutting down with restart support enabled there are scenarios we'd like to distinguish and behave accordingly: # The NM is running under supervision. In that case containers should be preserved so the automatic restart can recover them. # The NM is not running under supervision and a rolling upgrade is not being performed. In that case the shutdown should kill all containers since it is unlikely the NM will be restarted in a timely manner to recover them. # The NM is not running under supervision and a rolling upgrade is being performed. In that case the shutdown should not kill all containers since a restart is imminent due to the rolling upgrade and the containers will be recovered. -- This message was sent by Atlassian JIRA (v6.2#6252)
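A tiny decision sketch of the three scenarios above (purely illustrative; the flag names are made up and not from any patch): containers are preserved only when restart support is enabled and either the NM is supervised or a rolling-upgrade restart is imminent.
{code}
public class NMShutdownPolicySketch {
  static boolean keepContainersOnShutdown(boolean recoveryEnabled,
      boolean underSupervision, boolean rollingUpgradeInProgress) {
    if (!recoveryEnabled) {
      return false;  // no restart support: nothing will recover the containers
    }
    // Scenarios 1 and 3 preserve containers; scenario 2 kills them.
    return underSupervision || rollingUpgradeInProgress;
  }

  public static void main(String[] args) {
    System.out.println(keepContainersOnShutdown(true, true, false));   // true
    System.out.println(keepContainersOnShutdown(true, false, false));  // false
    System.out.println(keepContainersOnShutdown(true, false, true));   // true
  }
}
{code}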
[jira] [Commented] (YARN-2321) NodeManager web UI can incorrectly report Pmem enforcement
[ https://issues.apache.org/jira/browse/YARN-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070350#comment-14070350 ] Hudson commented on YARN-2321: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1839 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1839/]) YARN-2321. NodeManager web UI can incorrectly report Pmem enforcement. Contributed by Leitao Guo (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612411) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/NodePage.java NodeManager web UI can incorrectly report Pmem enforcement -- Key: YARN-2321 URL: https://issues.apache.org/jira/browse/YARN-2321 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Assignee: Leitao Guo Fix For: 3.0.0, 2.6.0 Attachments: YARN-2321.patch WebUI of NodeManager get the wrong configuration of Pmem enforcement enable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes
[ https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070352#comment-14070352 ] Hudson commented on YARN-2013: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1839 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1839/]) YARN-2013. The diagnostics is always the ExitCodeException stack when the container crashes. (Contributed by Tsuyoshi OZAWA) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612449) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java The diagnostics is always the ExitCodeException stack when the container crashes Key: YARN-2013 URL: https://issues.apache.org/jira/browse/YARN-2013 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-2013.1.patch, YARN-2013.2.patch, YARN-2013.3-2.patch, YARN-2013.3.patch, YARN-2013.4.patch, YARN-2013.5.patch When a container crashes, ExitCodeException will be thrown from Shell. Default/LinuxContainerExecutor captures the exception, put the exception stack into the diagnostic. Therefore, the exception stack is always the same. {code} String diagnostics = Exception from container-launch: \n + StringUtils.stringifyException(e) + \n + shExec.getOutput(); container.handle(new ContainerDiagnosticsUpdateEvent(containerId, diagnostics)); {code} In addition, it seems that the exception always has a empty message as there's no message from stderr. Hence the diagnostics is not of much use for users to analyze the reason of container crash. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2045) Data persisted in NM should be versioned
[ https://issues.apache.org/jira/browse/YARN-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070359#comment-14070359 ] Hudson commented on YARN-2045: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1839 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1839/]) YARN-2045. Data persisted in NM should be versioned. Contributed by Junping Du (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612285) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/NMDBSchemaVersion.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/impl/pb * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/impl/pb/NMDBSchemaVersionPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java Data persisted in NM should be versioned Key: YARN-2045 URL: https://issues.apache.org/jira/browse/YARN-2045 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Fix For: 3.0.0, 2.6.0 Attachments: YARN-2045-v2.patch, YARN-2045-v3.patch, YARN-2045-v4.patch, YARN-2045-v5.patch, YARN-2045-v6.patch, YARN-2045-v7.patch, YARN-2045.patch As a split task from YARN-667, we want to add version info to NM related data, include: - NodeManager local LevelDB state - NodeManager directory structure -- This message was sent by Atlassian JIRA (v6.2#6252)
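To make the intent concrete, here is a minimal, self-contained sketch of the compatibility rule such a schema version usually encodes; the class and constants are illustrative, not the API added by the patch: state written under the same major version stays readable, while a major-version bump requires explicit migration.
{code}
// Illustrative only - not the NMDBSchemaVersion API from the patch.
public final class SchemaVersionCheckDemo {
  static final int CURRENT_MAJOR = 1;
  static final int CURRENT_MINOR = 0;

  /** Stored state is loadable iff its major version matches ours. */
  static boolean isCompatible(int storedMajor, int storedMinor) {
    return storedMajor == CURRENT_MAJOR;
  }

  public static void main(String[] args) {
    System.out.println(isCompatible(1, 1)); // true: newer minor data is tolerated
    System.out.println(isCompatible(2, 0)); // false: major bump needs a migration step
  }
}
{code}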
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070351#comment-14070351 ] Hudson commented on YARN-2131: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1839 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1839/]) YARN-2131. Addendum. Add a way to format the RMStateStore. (Robert Kanter via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612443) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070383#comment-14070383 ] Zhijie Shen commented on YARN-2319: --- Committed the patch to trunk, branch-2, and branch-2.5. Thanks, [~gujilangzi]! Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 3.0.0, 2.5.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Fix For: 2.5.0 Attachments: YARN-2319.0.patch, YARN-2319.1.patch, YARN-2319.2.patch, YARN-2319.2.patch MiniKdc only invokes the start method, never stop, in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
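For reference, a minimal sketch of the kind of cleanup the patch is about, assuming the same static testMiniKDC field as the snippet above (MiniKdc provides a stop() method):
{code}
@AfterClass
public static void shutdownMiniKdc() {
  // Stop the KDC started in setup so the test does not leak the process/port.
  if (testMiniKDC != null) {
    testMiniKDC.stop();
  }
}
{code}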
[jira] [Updated] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2319: -- Target Version/s: 2.5.0 (was: 2.6.0) Affects Version/s: (was: 2.6.0) 2.5.0 3.0.0 Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 3.0.0, 2.5.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Fix For: 2.5.0 Attachments: YARN-2319.0.patch, YARN-2319.1.patch, YARN-2319.2.patch, YARN-2319.2.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070501#comment-14070501 ] Jian He commented on YARN-2229: --- Copied from protobuf guide, changing int32 to int64 seems to be compatible {code} int32, uint32, int64, uint64, and bool are all compatible – this means you can change a field from one of these types to another without breaking forwards- or backwards-compatibility. If a number is parsed from the wire which doesn't fit in the corresponding type, you will get the same effect as if you had cast the number to that type in C++ (e.g. if a 64-bit number is read as an int32, it will be truncated to 32 bits). {code} ContainerId can overflow with RM restart Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.10.patch, YARN-2229.10.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, YARN-2229.9.patch On YARN-2052, we changed containerId format: upper 10 bits are for epoch, lower 22 bits are for sequence number of Ids. This is for preserving semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM restarts 1024 times. To avoid the problem, its better to make containerId long. We need to define the new format of container Id with preserving backward compatibility on this JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070512#comment-14070512 ] Jian He commented on YARN-2301: --- bq. How about having -list only, and then parsing whether the given id is app id or app attempt id? I think it's fine to just have -list only. bq. is it able to show the containers of previous app attempt, or the finished containers of the current app attempt? finished containers are removed from schedulers. [~Naganarasimha], let's leave 4) separately as it involves more changes and discussion. could you post your patch which fixed the first 3 ? thanks! Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability While running yarn container -list Application Attempt ID command, some observations: 1) the scheme (e.g. http/https ) before LOG-URL is missing 2) the start-time is printed as milli seconds (e.g. 1405540544844). Better to print as time format. 3) finish-time is 0 if container is not yet finished. May be N/A 4) May have an option to run as yarn container -list appId OR yarn application -list-containers appId also. As attempt Id is not shown on console, this is easier for user to just copy the appId and run it, may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
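A toy, self-contained sketch of the "-list only" idea above: distinguish the argument by its id prefix. The prefixes are the standard YARN string forms (application_... and appattempt_...); the class and values are otherwise made up for illustration.
{code}
public class IdKindDemo {
  static String idKind(String id) {
    if (id.startsWith("appattempt_")) {
      return "attempt";
    }
    if (id.startsWith("application_")) {
      return "application";
    }
    throw new IllegalArgumentException("Not an application or attempt id: " + id);
  }

  public static void main(String[] args) {
    System.out.println(idKind("application_1405540544844_0001"));       // application
    System.out.println(idKind("appattempt_1405540544844_0001_000001")); // attempt
  }
}
{code}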
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070541#comment-14070541 ] Zhijie Shen commented on YARN-2229: --- bq. Copied from protobuf guide, changing int32 to int64 seems to be compatible sounds good. bq. The problem that I can think of is RM and NM versions are out of sync. For example, ContainerTokenIdentifier serializes a long (getContainerId()) at RM side, but deserializes a int (getId()) at NM side. In this case, I'm afraid it's going to be wrong ContainerId can overflow with RM restart Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.10.patch, YARN-2229.10.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, YARN-2229.9.patch On YARN-2052, we changed containerId format: upper 10 bits are for epoch, lower 22 bits are for sequence number of Ids. This is for preserving semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM restarts 1024 times. To avoid the problem, its better to make containerId long. We need to define the new format of container Id with preserving backward compatibility on this JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
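To illustrate the concern about mismatched RM/NM versions, here is a small, self-contained example of the truncation the protobuf guide describes when a 64-bit id is read back as 32 bits. The bit layout is hypothetical, not the format proposed in this JIRA:
{code}
public class ContainerIdTruncationDemo {
  public static void main(String[] args) {
    long epoch = 1;                                // RM has restarted once
    long sequence = 7;                             // 7th container of the attempt
    long containerId = (epoch << 40) | sequence;   // hypothetical 64-bit layout
    int truncated = (int) containerId;             // what a 32-bit reader sees
    System.out.println(containerId);               // 1099511627783
    System.out.println(truncated);                 // 7 - the epoch bits are silently lost
  }
}
{code}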
[jira] [Updated] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2131: Attachment: YARN-2131_addendum2.patch Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch, YARN-2131_addendum2.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070559#comment-14070559 ] Jian He commented on YARN-2295: --- patch looks good, +1 Refactor YARN distributed shell with existing public stable API --- Key: YARN-2295 URL: https://issues.apache.org/jira/browse/YARN-2295 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, YARN-2295-071514.patch, YARN-2295-072114.patch Some API calls in YARN distributed shell have been marked as unstable and private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (YARN-2317) Update documentation about how to write YARN applications
[ https://issues.apache.org/jira/browse/YARN-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reopened YARN-2317: --- Commented on the wrong jira.. reopen this. Update documentation about how to write YARN applications - Key: YARN-2317 URL: https://issues.apache.org/jira/browse/YARN-2317 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Li Lu Assignee: Li Lu Fix For: 2.6.0 Attachments: YARN-2317-071714.patch Some information in WritingYarnApplications webpage is out-dated. Need some refresh work on this document to reflect the most recent changes in YARN APIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070587#comment-14070587 ] Hudson commented on YARN-2295: -- FAILURE: Integrated in Hadoop-trunk-Commit #5939 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5939/]) YARN-2295. Refactored DistributedShell to use public APIs of protocol records. Contributed by Li Lu (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612626) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java Refactor YARN distributed shell with existing public stable API --- Key: YARN-2295 URL: https://issues.apache.org/jira/browse/YARN-2295 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Fix For: 2.6.0 Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, YARN-2295-071514.patch, YARN-2295-072114.patch Some API calls in YARN distributed shell have been marked as unstable and private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070611#comment-14070611 ] Karthik Kambatla commented on YARN-2131: My bad. Should have caught that in my earlier review. +1 for the second addendum. Committing it. Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch, YARN-2131_addendum2.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2131. Resolution: Fixed Committed addendum-2 to trunk and branch-2. Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch, YARN-2131_addendum2.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070632#comment-14070632 ] Devaraj K commented on YARN-2301: - Here it is trying to allocate this memory for the heap size, and we need to leave the remaining memory for launching the child container Java process, native memory, etc. As [~jianhe] mentioned, the RM removes completed containers and containers of completed attempts. I don't think it would be very useful for -list to accept a completed appAttemptId and display some message or an empty result. I would rather give -list an appId option (i.e. -list appId) and print the containers running for the current application attempt. Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability While running yarn container -list Application Attempt ID command, some observations: 1) the scheme (e.g. http/https ) before LOG-URL is missing 2) the start-time is printed as milli seconds (e.g. 1405540544844). Better to print as time format. 3) finish-time is 0 if container is not yet finished. May be N/A 4) May have an option to run as yarn container -list appId OR yarn application -list-containers appId also. As attempt Id is not shown on console, this is easier for user to just copy the appId and run it, may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070638#comment-14070638 ] Hudson commented on YARN-2131: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5940 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5940/]) YARN-2131. Addendum2: Document -format-state-store. Add a way to format the RMStateStore. (Robert Kanter via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612634) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/YarnCommands.apt.vm Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch, YARN-2131_addendum2.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2332) Create REST interface for app submission
[ https://issues.apache.org/jira/browse/YARN-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2332: --- Summary: Create REST interface for app submission (was: Create service interface for app submission) Create REST interface for app submission Key: YARN-2332 URL: https://issues.apache.org/jira/browse/YARN-2332 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Reporter: Jeff Hammerbacher Porting a discussion from the LinkedIn Hadoop group to the Hadoop JIRA: http://www.linkedin.com/groupAnswers?viewQuestionAndAnswers=gid=988957discussionID=2156671sik=1239077959330 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (YARN-2332) Create service interface for app submission
[ https://issues.apache.org/jira/browse/YARN-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer moved MAPREDUCE-454 to YARN-2332: -- Component/s: (was: mrv2) webapp resourcemanager Affects Version/s: (was: 0.23.0) Key: YARN-2332 (was: MAPREDUCE-454) Project: Hadoop YARN (was: Hadoop Map/Reduce) Create service interface for app submission --- Key: YARN-2332 URL: https://issues.apache.org/jira/browse/YARN-2332 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Reporter: Jeff Hammerbacher Porting a discussion from the LinkedIn Hadoop group to the Hadoop JIRA: http://www.linkedin.com/groupAnswers?viewQuestionAndAnswers=gid=988957discussionID=2156671sik=1239077959330 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1342) Recover container tokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1342: - Attachment: YARN-1342v5.patch Thanks for the review, Devaraj! bq. I think we can get the state from nmStore inside recover() instead of getting as an argument. I fixed this for NMContainerTokenSecretManager and NMTokenSecretManagerInNM. bq. Here e.getMessage() may not be required to pass as message since we are wrapping the same exception. I originally used the (message, throwable) form because the resulting exception message is subtly different than just passing the throwable. org.fusesource.leveldbjni.internal.JniDB converts exceptions into DBException using the (message, throwable) form, and I was trying to be consistent. However I don't think it really matters that much what the message is, so I went ahead and changed all the conversions from DBException to IOException to just use the throwable form. bq. Can we move the CONTAINER_TOKENS_KEY_PREFIX.length() to outside of the while loop? I'm skeptical of this change assuming any decent JVM environment. The String.length() method is just returning a member, and the JIT eats this kind of stuff up all the time. I went ahead and made the change anyway, but let me know if I'm missing the motivations for it. bq. Can we make the string container_ as a constant? Replaced it with ConverterUtils.CONTAINER_PREFIX as it's close enough in this context. bq. What do you think of having the names like RecoveredContainerTokensState, loadContainerTokensState Sounds good. For consistency I also changed the corresponding class and methods for NM tokens. Recover container tokens upon nodemanager restart - Key: YARN-1342 URL: https://issues.apache.org/jira/browse/YARN-1342 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1342.patch, YARN-1342v2.patch, YARN-1342v3-and-YARN-1987.patch, YARN-1342v4.patch, YARN-1342v5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
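For reference, a minimal sketch of the exception-wrapping form settled on above; db, key, and value stand in for the leveldb handle and byte[] arguments used elsewhere in the state store:
{code}
try {
  db.put(key, value);
} catch (DBException e) {
  // Wrap the cause directly; previously: new IOException(e.getMessage(), e)
  throw new IOException(e);
}
{code}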
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070730#comment-14070730 ] Zhijie Shen commented on YARN-2301: --- bq. I think there would not be much useful by providing completed appAttemptId for -list param and displaying some message or empty result. As I mentioned before, the command is going to be applied to both RM and timeline server. The latter is going to record the completed containers. -list appId is able to list all the containers from the side of the timeline server, and I hope it could work. So how about doing this for -list appId|appAttemptId additional opts? * appAttemptId: containers of a specific app attempt in RM/Timeline server (for the case of RM, it is likely to show empty container list, but it's fine and it is actually the current situation). * appId with no additional opt: containers of the last(current) app attempt in RM/Timeline server * appId with -last: containers of the last(current) app attempt in RM/Timeline server *appId with -all: containers of all app attempts in RM/Timeline server Does this make sense? Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability While running yarn container -list Application Attempt ID command, some observations: 1) the scheme (e.g. http/https ) before LOG-URL is missing 2) the start-time is printed as milli seconds (e.g. 1405540544844). Better to print as time format. 3) finish-time is 0 if container is not yet finished. May be N/A 4) May have an option to run as yarn container -list appId OR yarn application -list-containers appId also. As attempt Id is not shown on console, this is easier for user to just copy the appId and run it, may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2333) RM UI : show pending containers at cluster level in the application/scheduler page
Ashwin Shankar created YARN-2333: Summary: RM UI : show pending containers at cluster level in the application/scheduler page Key: YARN-2333 URL: https://issues.apache.org/jira/browse/YARN-2333 Project: Hadoop YARN Issue Type: New Feature Reporter: Ashwin Shankar It would be helpful if we could display pending containers at a cluster level to get an idea of how far behind we are with, say our ETL processing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070770#comment-14070770 ] Eric Payne commented on YARN-415: - {quote} 5. I think it's better to add a new method in SchedulerApplicationAttempt like getMemoryUtilization, which will only return memory/cpu seconds. We do this to prevent locking scheduling thread when showing application metrics on web UI. getMemoryUtilization will be used by RMAppAttemptMetrics#getFinishedMemory(VCore)Seconds to return completed+running resource utilization. And used by SchedulerApplicationAttempt#getResourceUsageReport as well. The MemoryUtilization class may contain two fields: runningContainerMemory(VCore)Seconds {quote} [~leftnoteasy], Thank you for your thorough analysis of this patch and for your detailed suggestions. I am working through them, and I think they are pretty clear, but this one is a little confusing to me. If I understand correctly, suggestion number 5 is to create SchedulerApplicationAttempt#getMemoryUtilization to be called from both SchedulerApplicationAttempt#getResourceUsageReport as well as RMAppAttemptMetrics#getFinishedMemory(VCore)Seconds. Is that correct? If so, I have a couple of questions: - RMAppAttempt can access the scheduler via the 'scheduler' variable, but that is of type YarnScheduler, which does not have all of the interfaces available that AbstractYarnScheduler has. Are you suggesting that I add the getMemoryUtilization method to the YarnScheduler interface? Or, are you suggesting that the RMAppAttempt#scheduler variable be cast-ed to AbstractYarnScheduler? Or, am I missing the point? - When you say that a new class should be added called MemoryUtilization to be passed back to SchedulerApplicationAttempt#getResourceUsageReport, are you suggesting that that same structure should be added to ApplicationResourceUsageReport as a class variable in place of the current 'long memorySeconds' and 'long vcoreSeconds'? If so, I am a little reluctant to do that, since that structure would have to be passed across the protobuf interface to the client. It's possible, but seems riskier than just adding 2 longs to the API. Thank you very much. Eric Payne Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... 
+ (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
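A small, self-contained illustration of the chargeback formula in the description, summing reserved memory times lifetime over containers; the numbers are made up:
{code}
public class MemorySecondsDemo {
  static long memorySeconds(long[] reservedMb, long[] lifetimeSec) {
    long total = 0;
    for (int i = 0; i < reservedMb.length; i++) {
      total += reservedMb[i] * lifetimeSec[i];   // reserved MB * container lifetime
    }
    return total;
  }

  public static void main(String[] args) {
    // Two containers: 2048 MB for 600 s and 1024 MB for 120 s.
    System.out.println(memorySeconds(new long[] {2048, 1024},
                                     new long[] {600, 120}));   // 1351680 MB-seconds
  }
}
{code}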
[jira] [Commented] (YARN-1342) Recover container tokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070772#comment-14070772 ] Hadoop QA commented on YARN-1342: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657163/YARN-1342v5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesContainers org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesApps {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4395//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4395//console This message is automatically generated. Recover container tokens upon nodemanager restart - Key: YARN-1342 URL: https://issues.apache.org/jira/browse/YARN-1342 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1342.patch, YARN-1342v2.patch, YARN-1342v3-and-YARN-1987.patch, YARN-1342v4.patch, YARN-1342v5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (YARN-2334) Document exit codes and their meanings used by linux task controller
[ https://issues.apache.org/jira/browse/YARN-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer moved MAPREDUCE-1318 to YARN-2334: --- Component/s: (was: documentation) documentation Assignee: (was: Anatoli Fomenko) Key: YARN-2334 (was: MAPREDUCE-1318) Project: Hadoop YARN (was: Hadoop Map/Reduce) Document exit codes and their meanings used by linux task controller Key: YARN-2334 URL: https://issues.apache.org/jira/browse/YARN-2334 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Sreekanth Ramakrishnan Attachments: HADOOP-5912.1.patch, MAPREDUCE-1318.1.patch, MAPREDUCE-1318.2.patch, MAPREDUCE-1318.patch Currently, linux task controller binary uses a set of exit code, which is not documented. These should be documented. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2334) Document exit codes and their meanings used by linux task controller
[ https://issues.apache.org/jira/browse/YARN-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070801#comment-14070801 ] Allen Wittenauer commented on YARN-2334: Moving this to YARN. The same problem appears to exist for container executor. Document exit codes and their meanings used by linux task controller Key: YARN-2334 URL: https://issues.apache.org/jira/browse/YARN-2334 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Sreekanth Ramakrishnan Attachments: HADOOP-5912.1.patch, MAPREDUCE-1318.1.patch, MAPREDUCE-1318.2.patch, MAPREDUCE-1318.patch Currently, linux task controller binary uses a set of exit code, which is not documented. These should be documented. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070818#comment-14070818 ] Craig Welch commented on YARN-1994: --- FYI, I took a look at the changes between patch 3 and patch 5. They add support for the ApplicationHistory service when an address is specified for it (when one is not specified it binds on all interfaces, but if one is, it won't bind without this change). This impacts ApplicationHistoryClientService.java, ApplicationHistoryServer.java, and WebAppUtils.java. The mapreduce configurations were not consolidated when the yarn ones were; those are also consolidated in the .5 patch. This impacts JHAdminConfig.java, MRWebAppUtil.java, HistoryClientService.java, and HSAdminServer.java. Some redundant logic in AdminService.java was also removed. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070838#comment-14070838 ] Craig Welch commented on YARN-1994: --- Re: [~arpitagarwal]'s comments - [~mipoto] made the changes requested to AdminService.java in p5, and the application master, client RM, history client, resource localization, and resource tracker services all have changes to support the bind properties - in some cases the file names are different, but the services look to be covered. I changed WebAppUtils to use RPCUtil as suggested; TIMELINE_SERVICE_BIND_HOST was used by Milan in p5 when he added support for that service. I am attaching a .6 patch, which consists of Milan's .5 patch plus the change to WebAppUtils to use RPCUtil and a separate getRMWebAppBindURLWithoutScheme() method, to make sure there is no confusion about its purpose. I returned getRMWebAppURLWithoutScheme() to its earlier functionality (it should not have the bind logic), in case it is used by external code, so that code will keep working properly. All relatively small changes, but they look worthwhile to finish. In my interactive testing everything looks to be working properly. [~arpitagarwal] (and [~mipoto], if you like), can you take one more quick look? Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1994: -- Attachment: YARN-1994.6.patch Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch reassigned YARN-2008: - Assignee: Craig Welch (was: Chen He) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Craig Welch Attachments: YARN-2008.1.patch, YARN-2008.2.patch If there are two queues, both allowed to use 100% of the actual resources in the cluster. Q1 and Q2 currently use 50% of actual cluster's resources and there is not actual space available. If we use current method to get headroom, CapacityScheduler thinks there are still available resources for users in Q1 but they have been used by Q2. If the CapacityScheduelr has a hierarchy queue structure, it may report incorrect queueMaxCap. Here is a example ||||rootQueue|| || | | / | \ | | L1ParentQueue1 | | L1ParentQueue2| | (allowed to use up 80% of its parent)| | (allowed to use 20% in minimum of its parent)| |/ | \ || | L2LeafQueue1 |L2LeafQueue2 | | |(50% of its parent) | (50% of its parent in minimum) | | When we calculate headroom of a user in L2LeafQueue2, current method will think L2LeafQueue2 can use 40% (80%*50%) of actual rootQueue resources. However, without checking L1ParentQueue1, we are not sure. It is possible that L1ParentQueue2 have used 40% of rootQueue resources right now. Actually, L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070859#comment-14070859 ] Craig Welch commented on YARN-2008: --- I believe we've resolved on [YARN-1198] to move forward with this change. [~wangda], [~airbots], can you take a look at my patch then, and provide feedback? Thanks... CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Chen He Attachments: YARN-2008.1.patch, YARN-2008.2.patch If there are two queues, both allowed to use 100% of the actual resources in the cluster. Q1 and Q2 currently use 50% of actual cluster's resources and there is not actual space available. If we use current method to get headroom, CapacityScheduler thinks there are still available resources for users in Q1 but they have been used by Q2. If the CapacityScheduelr has a hierarchy queue structure, it may report incorrect queueMaxCap. Here is a example ||||rootQueue|| || | | / | \ | | L1ParentQueue1 | | L1ParentQueue2| | (allowed to use up 80% of its parent)| | (allowed to use 20% in minimum of its parent)| |/ | \ || | L2LeafQueue1 |L2LeafQueue2 | | |(50% of its parent) | (50% of its parent in minimum) | | When we calculate headroom of a user in L2LeafQueue2, current method will think L2LeafQueue2 can use 40% (80%*50%) of actual rootQueue resources. However, without checking L1ParentQueue1, we are not sure. It is possible that L1ParentQueue2 have used 40% of rootQueue resources right now. Actually, L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
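To make the numbers in the description explicit, here is a small self-contained check of the 40% versus 30% figures: multiplying configured absolute max capacities down the hierarchy ignores what the sibling parent queue is already using.
{code}
public class QueueMaxCapDemo {
  public static void main(String[] args) {
    double configuredCeiling = 0.80 * 0.50;          // 0.4: L1ParentQueue1 max * L2LeafQueue2 share
    double parentHeadroom = 1.00 - 0.40;             // 0.6: L1ParentQueue2 already uses 40% of the cluster
    double effectiveCeiling = parentHeadroom * 0.50; // 0.3: what L2LeafQueue2 can actually reach
    System.out.println(configuredCeiling);           // 0.4
    System.out.println(effectiveCeiling);            // 0.3
  }
}
{code}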
[jira] [Updated] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2273: -- Attachment: YARN-2273-5.patch new patch adds return in the continuous scheduling thread. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-5.patch, YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
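The NPE in the trace comes from the comparator dereferencing a node that was removed between collecting the node ids and sorting them. The attached patch takes a different route (adding a return in the continuous scheduling thread, per the comment above), but a null guard in the comparator illustrates the race; the field names below follow the FairScheduler code referenced in the stack trace and should be read as a sketch, not the committed fix:
{code}
private class NodeAvailableResourceComparator implements Comparator<NodeId> {
  @Override
  public int compare(NodeId n1, NodeId n2) {
    FSSchedulerNode node1 = nodes.get(n1);
    FSSchedulerNode node2 = nodes.get(n2);
    if (node1 == null || node2 == null) {
      return 0; // node flapped away mid-sort; treat the pair as equal
    }
    return RESOURCE_CALCULATOR.compare(clusterResource,
        node2.getAvailableResource(), node1.getAvailableResource());
  }
}
{code}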
[jira] [Commented] (YARN-2304) Test*WebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070874#comment-14070874 ] Jason Lowe commented on YARN-2304: -- I noticed many (all?) of the failures occurred on the Jenkins H5 build node. Checking that node I saw a hung RM test that originally started on July 13th, and it was holding onto the port that the NM web service tests wanted to use. I've killed that hung test, so hopefully things can continue to progress in the short-term. Test*WebServices* fails intermittently -- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt TestNMWebService, TestRMWebService, and TestAMWebService get failed with address already get bind. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070875#comment-14070875 ] Craig Welch commented on YARN-2008: --- [~airbots], -re {quote} and if the application gets the low baseline headroom it will not be able to effectively use that greater capacity. {quote} right, that's what the patch is intended to do, but the approach here is to only drop that when needed based on utilization - when utilization is not an issue, allow the maxcapacity logic to continue as today and let the AM use the additional available headroom CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Craig Welch Attachments: YARN-2008.1.patch, YARN-2008.2.patch If there are two queues, both allowed to use 100% of the actual resources in the cluster. Q1 and Q2 currently use 50% of actual cluster's resources and there is not actual space available. If we use current method to get headroom, CapacityScheduler thinks there are still available resources for users in Q1 but they have been used by Q2. If the CapacityScheduelr has a hierarchy queue structure, it may report incorrect queueMaxCap. Here is a example ||||rootQueue|| || | | / | \ | | L1ParentQueue1 | | L1ParentQueue2| | (allowed to use up 80% of its parent)| | (allowed to use 20% in minimum of its parent)| |/ | \ || | L2LeafQueue1 |L2LeafQueue2 | | |(50% of its parent) | (50% of its parent in minimum) | | When we calculate headroom of a user in L2LeafQueue2, current method will think L2LeafQueue2 can use 40% (80%*50%) of actual rootQueue resources. However, without checking L1ParentQueue1, we are not sure. It is possible that L1ParentQueue2 have used 40% of rootQueue resources right now. Actually, L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070880#comment-14070880 ] Hadoop QA commented on YARN-1994: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657185/YARN-1994.6.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4396//console This message is automatically generated. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2331) Distinguish shutdown during supervision vs. shutdown for rolling upgrade
[ https://issues.apache.org/jira/browse/YARN-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070925#comment-14070925 ] Jason Lowe commented on YARN-2331: -- Another possible approach is to have the NM always try to cleanup containers on a shutdown when it is unsupervised. If a rolling upgrade needs to be performed and thus containers need to be preserved, the NM would be killed without the chance to cleanup (e.g.: kill -9 to deliver a SIGKILL). Upon restart the NM would recover the state from the state store and reacquire the containers. Distinguish shutdown during supervision vs. shutdown for rolling upgrade Key: YARN-2331 URL: https://issues.apache.org/jira/browse/YARN-2331 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Jason Lowe When the NM is shutting down with restart support enabled there are scenarios we'd like to distinguish and behave accordingly: # The NM is running under supervision. In that case containers should be preserved so the automatic restart can recover them. # The NM is not running under supervision and a rolling upgrade is not being performed. In that case the shutdown should kill all containers since it is unlikely the NM will be restarted in a timely manner to recover them. # The NM is not running under supervision and a rolling upgrade is being performed. In that case the shutdown should not kill all containers since a restart is imminent due to the rolling upgrade and the containers will be recovered. -- This message was sent by Atlassian JIRA (v6.2#6252)
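A self-contained toy of the decision described above; the flag and method names are invented for illustration and are not NM code: containers are preserved only when recovery is enabled and the NM is supervised, while a rolling upgrade avoids this path entirely by SIGKILLing the NM.
{code}
public class ShutdownPolicyDemo {
  /** True when running containers should be left alive across this shutdown. */
  static boolean preserveContainers(boolean recoveryEnabled, boolean underSupervision) {
    return recoveryEnabled && underSupervision;
  }

  public static void main(String[] args) {
    System.out.println(preserveContainers(true, true));   // supervised: keep containers for recovery
    System.out.println(preserveContainers(true, false));  // unsupervised graceful stop: clean up
  }
}
{code}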
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070967#comment-14070967 ] Li Lu commented on YARN-2314: - Hi [~jlowe], I'm interested in looking into the cache overflow side of this issue (sorry about the last comment - I slipped on the keyboard and sent it out prematurely). After checking your comments and the code, I think a quick fix would be: when adding a new proxy to a full cache, instead of relying only on (and trying to delete) the least recently used item, the cache should keep scanning through the whole list to find an item that is not being used by an RPC, and replace that one. There is one scenario where this may not actually help: when every cached item is in use by an RPC. I would like to check with you whether this is a frequent case in your cluster, and if not, whether this quick fix would work for the cache overflow problem. Thanks! ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Priority: Critical ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.2#6252)
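A toy, self-contained sketch of that eviction idea, not the ContainerManagementProtocolProxy code: on insert into a full cache, walk from the least recently used end and evict the first entry that no RPC is currently using, rather than giving up after the single LRU entry.
{code}
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class InUseAwareCache<K, V> {
  static class Entry<V> {
    final V value;
    int activeCallers;          // incremented/decremented around each RPC
    Entry(V v) { value = v; }
  }

  private final int maxSize;
  private final LinkedHashMap<K, Entry<V>> map =
      new LinkedHashMap<>(16, 0.75f, true);   // access order: iteration starts at the LRU end

  public InUseAwareCache(int maxSize) { this.maxSize = maxSize; }

  public synchronized void put(K key, V value) {
    if (map.size() >= maxSize) {
      Iterator<Map.Entry<K, Entry<V>>> it = map.entrySet().iterator();
      while (it.hasNext()) {
        if (it.next().getValue().activeCallers == 0) {   // skip entries still in use
          it.remove();
          break;
        }
      }
      // If every entry is busy the cache still grows past maxSize here,
      // which is exactly the corner case discussed in the following comments.
    }
    map.put(key, new Entry<>(value));
  }
}
{code}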
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070965#comment-14070965 ] Hadoop QA commented on YARN-2273: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657187/YARN-2273-5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4397//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4397//console This message is automatically generated. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-5.patch, YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. 
java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2314: - Attachment: nmproxycachefix.prototype.patch I was thinking along similar lines, but I am worried about the corner case where all cached proxies are in use by RPCs. I think we need to handle this case even if it's rare. An AM running on a node where it can see the RM but has a network cut to the rest of the cluster could go really bad really quickly otherwise. If we don't handle the corner case then we'll continue to grow the proxy cache beyond its boundaries as we do today, and that AM will explode with thousands of threads for what may be a temporary network outage. While debugging this I wrote up a quick prototype patch that tries to keep the cache under the configured limit. Attaching the patch for reference. However, as I mentioned above, simply keeping the NM proxy cache under its configured limit means nothing if we don't address the problems with connections remaining open in the IPC Client layer. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Priority: Critical Attachments: nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2335) Annotate all hadoop-sls APIs as @Private
Wei Yan created YARN-2335: - Summary: Annotate all hadoop-sls APIs as @Private Key: YARN-2335 URL: https://issues.apache.org/jira/browse/YARN-2335 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071018#comment-14071018 ] Li Lu commented on YARN-2314: - Thanks [~jlowe]! About the corner case, I'm wondering whether a bounded wait would be slightly better than waiting indefinitely? That way, if the timeout is triggered it means all proxies have been occupied by RPCs for a really long time, and the system could report this abnormal situation. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Priority: Critical Attachments: nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.2#6252)
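A toy sketch of that bounded-wait idea, under assumed names (this is not YARN code; the slot accounting and timeout handling are illustrative only):
{code}
/** Minimal bounded "slot" pool: wait up to a timeout for a cache slot, then report. */
class BoundedWaitSketch {
  private final int limit;
  private int cached = 0;

  BoundedWaitSketch(int limit) { this.limit = limit; }

  /** Returns true if a slot was obtained, false if the bounded wait timed out. */
  synchronized boolean acquireSlot(long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (cached >= limit) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        // Every slot stayed busy for the whole timeout: the caller can now log
        // or otherwise report the abnormal situation instead of blocking forever.
        return false;
      }
      wait(remaining);
    }
    cached++;
    return true;
  }

  /** Called when an RPC finishes with a proxy and its slot becomes reusable. */
  synchronized void releaseSlot() {
    cached--;
    notifyAll();
  }
}
{code}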
[jira] [Updated] (YARN-2273) NPE in ContinuousScheduling thread when we lose a node
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2273: --- Summary: NPE in ContinuousScheduling thread when we lose a node (was: NPE in ContinuousScheduling Thread crippled RM after DN flap) NPE in ContinuousScheduling thread when we lose a node -- Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-5.patch, YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2335) Annotate all hadoop-sls APIs as @Private
[ https://issues.apache.org/jira/browse/YARN-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2335: -- Attachment: YARN-2335-1.patch Annotate all hadoop-sls APIs as @Private Key: YARN-2335 URL: https://issues.apache.org/jira/browse/YARN-2335 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2335-1.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling thread when we lose a node
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071077#comment-14071077 ] Hudson commented on YARN-2273: -- FAILURE: Integrated in Hadoop-trunk-Commit #5945 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5945/]) YARN-2273. NPE in ContinuousScheduling thread when we lose a node. (Wei Yan via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612720) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java NPE in ContinuousScheduling thread when we lose a node -- Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2273-5.patch, YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071118#comment-14071118 ] Tsuyoshi OZAWA commented on YARN-2313: -- The test failure is not related. Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, YARN-2313.4.patch, rm-stack-trace.txt Observed a livelock in FairScheduler when there are lots of entries in the queue. After investigating the code, the following case can occur: 1. {{update()}}, called by UpdateThread, takes longer than UPDATE_INTERVAL (500ms) when there are many queues. 2. UpdateThread goes into a busy loop. 3. Other threads (AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
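To illustrate the failure mode described above, here is a rough sketch of a periodic update loop that cannot degenerate into a busy loop because it always sleeps for a minimum period, even when update() overruns its interval. This is only an illustration under assumed names, not the YARN-2313 patch:
{code}
class UpdateLoopSketch implements Runnable {
  private static final long UPDATE_INTERVAL_MS = 500;
  private static final long MIN_SLEEP_MS = 50;  // floor so an overrun cannot cause a busy loop
  private volatile boolean running = true;

  @Override
  public void run() {
    while (running) {
      long start = System.currentTimeMillis();
      update();  // may take longer than UPDATE_INTERVAL_MS under heavy load
      long elapsed = System.currentTimeMillis() - start;
      try {
        // Always yield for at least MIN_SLEEP_MS so threads waiting on the
        // scheduler are not starved when update() overruns its interval.
        Thread.sleep(Math.max(UPDATE_INTERVAL_MS - elapsed, MIN_SLEEP_MS));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  void stop() { running = false; }

  private void update() { /* scheduler bookkeeping elided in this sketch */ }
}
{code}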
[jira] [Commented] (YARN-2304) Test*WebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071217#comment-14071217 ] Tsuyoshi OZAWA commented on YARN-2304: -- [~jlowe], thanks for your investigation! Once we confirm things go well, let's close this JIRA as fixed. Test*WebServices* fails intermittently -- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt TestNMWebService, TestRMWebService, and TestAMWebService fail intermittently because the address they try to bind to is already in use. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1342) Recover container tokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071272#comment-14071272 ] Junping Du commented on YARN-1342: -- Thanks [~jlowe] for updating the patch. Continuing our practice from YARN-1341, let me start with a question: what would happen if we failed to update currentMasterKey and previousMasterKey for NMContainerTokenSecretManager? My tentative answer: a stale currentMasterKey won't break anything, as it gets updated when the NM re-registers with the RM during restart; a stale previousMasterKey will make an AM holding the original previous key fail to start containers. Could you confirm my understanding is correct? If so, the following code may not be necessary:
{code}
+    // if there was no master key, try the previous key
+    if (super.currentMasterKey == null) {
+      super.currentMasterKey = previousMasterKey;
+    }
{code}
Recover container tokens upon nodemanager restart - Key: YARN-1342 URL: https://issues.apache.org/jira/browse/YARN-1342 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1342.patch, YARN-1342v2.patch, YARN-1342v3-and-YARN-1987.patch, YARN-1342v4.patch, YARN-1342v5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
Kenji Kikushima created YARN-2336: - Summary: Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree Key: YARN-2336 URL: https://issues.apache.org/jira/browse/YARN-2336 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Kenji Kikushima Assignee: Kenji Kikushima When we have sub queues in the Fair Scheduler, the REST API returns JSON with a missing '[' bracket for childQueues. This issue was found by [~ajisakaa] in YARN-1050. -- This message was sent by Atlassian JIRA (v6.2#6252)
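For illustration of the symptom (hypothetical output, not captured from a real cluster): when a level of the queue tree contains a single child queue, the buggy response can serialize childQueues as a bare JSON object,
{code}
"childQueues": {"queueName": "root.queueA"}
{code}
whereas a well-formed response would always wrap the child queues in an array:
{code}
"childQueues": [{"queueName": "root.queueA"}]
{code}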
[jira] [Updated] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
[ https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenji Kikushima updated YARN-2336: -- Attachment: YARN-2336.patch Attached a patch. To make JSONJAXBContext treat childQueues as a collection, this patch introduces FairSchedulerQueueInfoList. I referred to MAPREDUCE-4020. Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree -- Key: YARN-2336 URL: https://issues.apache.org/jira/browse/YARN-2336 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Kenji Kikushima Assignee: Kenji Kikushima Attachments: YARN-2336.patch When we have sub queues in the Fair Scheduler, the REST API returns JSON with a missing '[' bracket for childQueues. This issue was found by [~ajisakaa] in YARN-1050. -- This message was sent by Atlassian JIRA (v6.2#6252)
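A minimal sketch of the wrapper-class approach described in the comment above. The class name follows the comment; the field name, accessor, and the assumption that it sits next to the existing FairSchedulerQueueInfo DAO are illustrative, not necessarily what YARN-2336.patch does:
{code}
import java.util.ArrayList;
import java.util.List;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlRootElement;

// A dedicated JAXB type wrapping the child-queue list. Giving the marshaller a
// named collection type lets it emit childQueues as a JSON array even when the
// list holds a single element.
@XmlRootElement
@XmlAccessorType(XmlAccessType.FIELD)
public class FairSchedulerQueueInfoList {
  private ArrayList<FairSchedulerQueueInfo> queue = new ArrayList<FairSchedulerQueueInfo>();

  public List<FairSchedulerQueueInfo> getQueueInfoList() {
    return queue;
  }
}
{code}
The existing childQueues accessor on FairSchedulerQueueInfo would then return or populate this wrapper type instead of a raw Collection, so JSONJAXBContext can tell the field is a list.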
[jira] [Commented] (YARN-1050) Document the Fair Scheduler REST API
[ https://issues.apache.org/jira/browse/YARN-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071289#comment-14071289 ] Kenji Kikushima commented on YARN-1050: --- Hi [~ajisakaa], I created YARN-2336 for the missing '[' bracket issue. Please comment if you are interested. Thanks! Document the Fair Scheduler REST API Key: YARN-1050 URL: https://issues.apache.org/jira/browse/YARN-1050 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Sandy Ryza Assignee: Kenji Kikushima Attachments: YARN-1050-2.patch, YARN-1050-3.patch, YARN-1050.patch The documentation should be placed here along with the Capacity Scheduler documentation: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2215) Add preemption info to REST/CLI
[ https://issues.apache.org/jira/browse/YARN-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenji Kikushima updated YARN-2215: -- Attachment: YARN-2215.patch Hi [~leftnoteasy], I found that Resource Preempted from Current Attempt and Number of Non-AM Containers Preempted from Current Attempt in the Web UI are not exposed in the REST apps API. For API consistency, I think the REST API should have the same elements. Please comment if you are interested. Thanks. Add preemption info to REST/CLI --- Key: YARN-2215 URL: https://issues.apache.org/jira/browse/YARN-2215 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Attachments: YARN-2215.patch As discussed in YARN-2181, we should add preemption info to the RM RESTful API/CLI so that administrators/users can better understand the preemption that happened on an app/queue, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
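As a rough sketch of what exposing those metrics in the apps REST API could look like (the class and field names are illustrative assumptions, not the attached YARN-2215.patch):
{code}
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlRootElement;

// Hypothetical DAO fragment mirroring the Web UI's per-attempt preemption metrics.
@XmlRootElement(name = "app")
@XmlAccessorType(XmlAccessType.FIELD)
public class AppPreemptionInfoSketch {
  protected long preemptedResourceMB;        // memory preempted from the current attempt
  protected long preemptedResourceVCores;    // vcores preempted from the current attempt
  protected int numNonAMContainerPreempted;  // non-AM containers preempted from the current attempt
  protected int numAMContainerPreempted;     // AM containers preempted, included for completeness
}
{code}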
[jira] [Commented] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
[ https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071324#comment-14071324 ] Hadoop QA commented on YARN-2336: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657259/YARN-2336.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4398//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4398//console This message is automatically generated. Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree -- Key: YARN-2336 URL: https://issues.apache.org/jira/browse/YARN-2336 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Kenji Kikushima Assignee: Kenji Kikushima Attachments: YARN-2336.patch When we have sub queues in the Fair Scheduler, the REST API returns JSON with a missing '[' bracket for childQueues. This issue was found by [~ajisakaa] in YARN-1050. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071361#comment-14071361 ] Sandy Ryza commented on YARN-2328: --
{code}
-    if (node != null && Resources.fitsIn(minimumAllocation,
-        node.getAvailableResource())) {
+    if (node != null
+        && Resources.fitsIn(minimumAllocation, node.getAvailableResource())) {
{code}
This looks unrelated. +1 otherwise. FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped Key: YARN-2328 URL: https://issues.apache.org/jira/browse/YARN-2328 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2328-1.patch FairScheduler threads can use a little cleanup and tests. To begin with, the update and continuous-scheduling threads should extend Thread and handle being interrupted. We should have tests for starting and stopping them as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-2313: - Summary: Livelock can occur in FairScheduler when there are lots of running apps (was: Livelock can occur on FairScheduler when there are lots of running apps) Livelock can occur in FairScheduler when there are lots of running apps --- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, YARN-2313.4.patch, rm-stack-trace.txt Observed a livelock in FairScheduler when there are lots of entries in the queue. After investigating the code, the following case can occur: 1. {{update()}}, called by UpdateThread, takes longer than UPDATE_INTERVAL (500ms) when there are many queues. 2. UpdateThread goes into a busy loop. 3. Other threads (AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071395#comment-14071395 ] Devaraj K commented on YARN-2301: - Thanks [~zjshen] for the clarification. +1 for the above approach. Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability While running the yarn container -list <Application Attempt ID> command, some observations: 1) the scheme (e.g. http/https) before the LOG-URL is missing; 2) the start-time is printed in milliseconds (e.g. 1405540544844), which would be better printed in a readable time format; 3) finish-time is 0 if the container has not finished yet, which could instead be N/A; 4) it may be worth supporting yarn container -list <appId> or yarn application -list-containers <appId> as well. Since the attempt ID is not shown on the console, it is easier for the user to just copy the appId and run it; this may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)