[jira] [Commented] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943703#comment-14943703 ] Sangjin Lee commented on YARN-4178: --- +1 with consolidating WriterUtils and ReaderUtils. > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_10" will be considered *earlier* than > "app_1234567890_9". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
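To make the ordering problem concrete, here is a minimal, self-contained sketch (not the YARN-4178 patch itself): plain string comparison puts the rolled-over app id before the earlier one, while encoding the cluster timestamp and sequence number as fixed-width big-endian bytes keeps the byte-wise row-key order identical to numeric order.
{code}
import java.nio.ByteBuffer;

public class AppIdRowKeyDemo {

  public static void main(String[] args) {
    // As plain strings, the rolled-over id sorts *before* the earlier one:
    // "app_1234567890_10" < "app_1234567890_9" lexicographically.
    System.out.println("app_1234567890_10".compareTo("app_1234567890_9") < 0); // true

    // Encoding cluster timestamp + sequence number as fixed-width big-endian
    // numbers keeps byte-wise row-key order identical to numeric order.
    byte[] key9 = encode(1234567890L, 9);
    byte[] key10 = encode(1234567890L, 10);
    System.out.println(compareBytes(key9, key10) < 0); // true: 9 now sorts first
  }

  // 8-byte cluster timestamp followed by a 4-byte sequence id.
  static byte[] encode(long clusterTimestamp, int sequenceId) {
    return ByteBuffer.allocate(12).putLong(clusterTimestamp).putInt(sequenceId).array();
  }

  // Unsigned lexicographic byte comparison, the order HBase uses for row keys.
  static int compareBytes(byte[] a, byte[] b) {
    for (int i = 0; i < Math.min(a.length, b.length); i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }
}
{code}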
[jira] [Commented] (YARN-4220) [Storage implementation] Support getEntities with only Application id but no flow and flow run ID
[ https://issues.apache.org/jira/browse/YARN-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943741#comment-14943741 ] Li Lu commented on YARN-4220: - Oh Thanks [~varun_saxena]. Having looked at the code I can see we're checking the preconditions in ApplicationEntityReader. We're obtaining container level entities from the generic reader but application level entities from the application entity reader. I think this caused the problem. I thought this was a feature but it turned out this looks like a bug :). > [Storage implementation] Support getEntities with only Application id but no > flow and flow run ID > - > > Key: YARN-4220 > URL: https://issues.apache.org/jira/browse/YARN-4220 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Li Lu >Priority: Minor > > Currently we're enforcing flow and flowrun id to be non-null values on > {{getEntities}}. We can actually query the appToFlow table to figure out an > application's flow id and flowrun id if they're missing. This will simplify > normal queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3864) Implement support for querying single app and all apps for a flow run
[ https://issues.apache.org/jira/browse/YARN-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943702#comment-14943702 ] Varun Saxena commented on YARN-3864: [~sjlee0], thanks for the review. Will address your comments and update a patch shortly. > Implement support for querying single app and all apps for a flow run > - > > Key: YARN-3864 > URL: https://issues.apache.org/jira/browse/YARN-3864 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Blocker > Attachments: YARN-3864-YARN-2928.01.patch, > YARN-3864-YARN-2928.02.patch, YARN-3864-YARN-2928.03.patch, > YARN-3864-addendum-appaggregation.patch > > > This JIRA will handle support for querying all apps for a flow run in HBase > reader implementation. > And also REST API implementation for single app and multiple apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4220) [Storage implementation] Support getEntities with only Application id but no flow and flow run ID
[ https://issues.apache.org/jira/browse/YARN-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943700#comment-14943700 ] Sangjin Lee commented on YARN-4220: --- Hmm, for any application or generic entity queries, we do not require the flow or the flow run id (or with YARN-4221 even user id). They are already populated from the app-to-flow table. Is that what you're referring to? Then it should already be working that way. > [Storage implementation] Support getEntities with only Application id but no > flow and flow run ID > - > > Key: YARN-4220 > URL: https://issues.apache.org/jira/browse/YARN-4220 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Li Lu >Priority: Minor > > Currently we're enforcing flow and flowrun id to be non-null values on > {{getEntities}}. We can actually query the appToFlow table to figure out an > application's flow id and flowrun id if they're missing. This will simplify > normal queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
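A minimal sketch of the fallback being discussed, using illustrative names (FlowContext, lookupFlowContext, readEntity) rather than the actual reader classes on the YARN-2928 branch: when the flow context is missing, derive it from the app-to-flow table instead of failing the read.
{code}
// Names here (FlowContext, lookupFlowContext, readEntity) are illustrative,
// not the actual reader API on the YARN-2928 branch.
abstract class EntityReaderSketch<T> {

  static final class FlowContext {
    final String flowId;
    final long flowRunId;
    FlowContext(String flowId, long flowRunId) {
      this.flowId = flowId;
      this.flowRunId = flowRunId;
    }
  }

  T getEntity(String clusterId, String appId, String flowId, Long flowRunId)
      throws Exception {
    if (flowId == null || flowRunId == null) {
      // Derive the missing flow context from the app-to-flow table instead of
      // rejecting the request with "flowId shouldn't be null".
      FlowContext context = lookupFlowContext(clusterId, appId);
      flowId = context.flowId;
      flowRunId = context.flowRunId;
    }
    return readEntity(clusterId, flowId, flowRunId, appId);
  }

  // Scan the appToFlow table for the (cluster, application) row.
  abstract FlowContext lookupFlowContext(String clusterId, String appId)
      throws Exception;

  abstract T readEntity(String clusterId, String flowId, long flowRunId,
      String appId) throws Exception;
}
{code}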
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943760#comment-14943760 ] Sangjin Lee commented on YARN-3367: --- Sorry for my late reply. {quote} Timelineclient async calls are only to ensure the client need not wait till the server response & just return immediately after requesting to post entity or even in server side we need to ensure some thing ? As currently we are trying to send the async parameter to the server. {quote} I think at minimum the flush should not be done on the server side. If the client is fine without the server response, it clearly implies flush is not needed (we had this discussion on another JIRA). {quote} Is it important to maintain the order of events which are sent from sync and async ? i.e. Is it req to ensure all the async events are also pushed along with the current sync event or is it ok to send only the sync ? (current patch just ensures async events are in order) . {quote} I'm not sure if it is a requirement that the timeline *client* has to ensure the order of events for both sync and async. First of all, the timestamp should be set for most of the entities, metrics, events, etc., and the server should rely on the timestamps to resolve ordering. Also, even if the client ensures a certain order, there are many situations under which the events will be received by the server out of order. {quote} Whether its req to merge entities of multiple async calls as they belong to same application ? {quote} I'm not really sure if this is something the client needs to do. If anything, this requirement should fall on the app level timeline collector. I don't see a whole lot of situations where the timeline client can do this easily and unambiguously. Thoughts, [~Naganarasimha]? > Replace starting a separate thread for post entity with event loop in > TimelineClient > > > Key: YARN-3367 > URL: https://issues.apache.org/jira/browse/YARN-3367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Naganarasimha G R > Attachments: YARN-3367.YARN-2928.001.patch > > > Since YARN-3039, we add loop in TimelineClient to wait for > collectorServiceAddress ready before posting any entity. In consumer of > TimelineClient (like AM), we are starting a new thread for each call to get > rid of potential deadlock in main thread. This way has at least 3 major > defects: > 1. The consumer need some additional code to wrap a thread before calling > putEntities() in TimelineClient. > 2. It cost many thread resources which is unnecessary. > 3. The sequence of events could be out of order because each posting > operation thread get out of waiting loop randomly. > We should have something like event loop in TimelineClient side, > putEntities() only put related entities into a queue of entities and a > separated thread handle to deliver entities in queue to collector via REST > call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
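A minimal sketch of the event-loop idea described above (not the attached patch): putEntitiesAsync only enqueues, and a single dispatcher thread drains the queue and posts batches to the collector in FIFO order.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class TimelineDispatcherSketch {

  private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
  private final Thread dispatcher;
  private volatile boolean stopped = false;

  TimelineDispatcherSketch() {
    dispatcher = new Thread(() -> {
      while (!stopped || !queue.isEmpty()) {
        try {
          Object entities = queue.take();   // FIFO keeps posting order
          postToCollector(entities);        // one REST call per batch
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }, "TimelineEntityDispatcher");
    dispatcher.setDaemon(true);
    dispatcher.start();
  }

  // Async put: the caller enqueues and returns immediately; no extra thread
  // per call and no blocking while the collector address becomes ready.
  void putEntitiesAsync(Object entities) {
    queue.add(entities);
  }

  void stop() throws InterruptedException {
    stopped = true;
    dispatcher.interrupt();
    dispatcher.join();
  }

  // Placeholder for the REST PUT to the per-app timeline collector.
  void postToCollector(Object entities) {
  }
}
{code}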
[jira] [Commented] (YARN-3864) Implement support for querying single app and all apps for a flow run
[ https://issues.apache.org/jira/browse/YARN-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943771#comment-14943771 ] Sangjin Lee commented on YARN-3864: --- The latest patch (v.4) LGTM. Unless there are additional comments, and with jenkins passing, I'll commit it soon. > Implement support for querying single app and all apps for a flow run > - > > Key: YARN-3864 > URL: https://issues.apache.org/jira/browse/YARN-3864 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Blocker > Attachments: YARN-3864-YARN-2928.01.patch, > YARN-3864-YARN-2928.02.patch, YARN-3864-YARN-2928.03.patch, > YARN-3864-YARN-2928.04.patch, YARN-3864-addendum-appaggregation.patch > > > This JIRA will handle support for querying all apps for a flow run in HBase > reader implementation. > And also REST API implementation for single app and multiple apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4220) [Storage implementation] Support getEntities with only Application id but no flow and flow run ID
[ https://issues.apache.org/jira/browse/YARN-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943773#comment-14943773 ] Varun Saxena commented on YARN-4220: This won't work even after YARN-3864. The /entities endpoint is not treated as a single-entity read, but that is what we are trying to do here: read a single application entity. Maybe we can handle it here, because passing the query params gets us the correct result. > [Storage implementation] Support getEntities with only Application id but no > flow and flow run ID > - > > Key: YARN-4220 > URL: https://issues.apache.org/jira/browse/YARN-4220 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Li Lu >Priority: Minor > > Currently we're enforcing flow and flowrun id to be non-null values on > {{getEntities}}. We can actually query the appToFlow table to figure out an > application's flow id and flowrun id if they're missing. This will simplify > normal queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943778#comment-14943778 ] Vrushali C commented on YARN-4178: -- bq. I think we can have a single class TimelineStorageUtils and remove TimelineWriterUtils and TimelineReaderUtils. Sounds good. I guess it will be a big class but shouldn't be a problem. > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_10" will be considered *earlier* than > "app_1234567890_9". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943781#comment-14943781 ] Jason Lowe commented on YARN-4216: -- The container logs should not be uploaded on NM stop if we are doing recovery. That is intentional. Decommission + nm restart doesn't make sense to me. Either we are decommissioning a node and don't expect it to return, or we are going to restart it and expect it to return shortly. For the former, we want the NM to linger a bit to try to finish log aggregation. For the latter it should not. If we are decommissioning the node then context.getDecommissioned() in the boolean clause above should be true which means shouldAbort would be false. That means it should not do the same thing as a shutdown under supervision. My apologies if I'm missing something. > Container logs not shown for newly assigned containers after NM recovery > -- > > Key: YARN-4216 > URL: https://issues.apache.org/jira/browse/YARN-4216 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml > > > Steps to reproduce > # Start 2 nodemanagers with NM recovery enabled > # Submit pi job with 20 maps > # Once 5 maps gets completed in NM 1 stop NM (yarn daemon stop nodemanager) > (Logs of all completed container gets aggregated to HDFS) > # Now start the NM1 again and wait for job completion > *The newly assigned container logs on NM1 are not shown* > *hdfs log dir state* > # When logs are aggregated to HDFS during stop its with NAME (localhost_38153) > # On log aggregation after starting NM the newly assigned container logs gets > uploaded with name (localhost_38153.tmp) > History server the logs are now shown for new task attempts -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4220) [Storage implementation] Support getEntities with only Application id but no flow and flow run ID
[ https://issues.apache.org/jira/browse/YARN-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943721#comment-14943721 ] Li Lu commented on YARN-4220: - Well this may be the thing I'm missing now: I made a call {{http://localhost:8188/ws/v2/timeline/entities/application_1443660447597_0001/YARN_APPLICATION?userid=llu&fields=ALL}} and got the exception {{java.lang.NullPointerException: flowId shouldn't be null}}. I believe we do have the ability to handle this kind of request, but for some reason we're blocking it at some intermediate level. A call like {{http://localhost:8188/ws/v2/timeline/entities/application_1443660447597_0001/YARN_APPLICATION?userid=llu&fields=ALL&flowid=flow_1443660447597_1&flowrunid=1}} would return the correct info though. Am I calling it in the wrong way? Thanks! > [Storage implementation] Support getEntities with only Application id but no > flow and flow run ID > - > > Key: YARN-4220 > URL: https://issues.apache.org/jira/browse/YARN-4220 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Li Lu >Priority: Minor > > Currently we're enforcing flow and flowrun id to be non-null values on > {{getEntities}}. We can actually query the appToFlow table to figure out an > application's flow id and flowrun id if they're missing. This will simplify > normal queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4162) Scheduler info in REST, is currently not displaying partition specific queue information similar to UI
[ https://issues.apache.org/jira/browse/YARN-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4162: Attachment: YARN-4162.v2.002.patch Hi [~wangda], I have updated the patch with test cases and did sufficient testing of the UI with and without node labels; it seems to be fine. Hope you could also try it once. Also, to ensure all the REST data related to the existing web UI is available, we need to expose the Node Label Resource information as well. Do I need to add that in this jira too? > Scheduler info in REST, is currently not displaying partition specific queue > information similar to UI > -- > > Key: YARN-4162 > URL: https://issues.apache.org/jira/browse/YARN-4162 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-4162.v1.001.patch, YARN-4162.v2.001.patch, > YARN-4162.v2.002.patch, restAndJsonOutput.zip > > > When Node Labels are enabled then REST Scheduler Information should also > provide partition specific queue information similar to the existing Web UI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4183) Enabling generic application history forces every job to get a timeline service delegation token
[ https://issues.apache.org/jira/browse/YARN-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943832#comment-14943832 ] Jonathan Eagles commented on YARN-4183: --- [~xgong], this issue is two-fold. 1) The web services publishing should trigger posting history based on generic history enablement and not timeline server enablement. 2) There is still no separation between timeline clients that require delegation tokens and those that don't. See YARN-3942. As a result, if timelineservice is enabled at the global level, then each yarn client will get a timeline delegation token which makes the timeline service a live dependency. Meaning if the timeline service is down, then the grid is down. This patch above is a clean way to avoid enabling the timeline service for all YarnClients in the cluster. > Enabling generic application history forces every job to get a timeline > service delegation token > > > Key: YARN-4183 > URL: https://issues.apache.org/jira/browse/YARN-4183 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-4183.1.patch > > > When enabling just the Generic History Server and not the timeline server, > the system metrics publisher will not publish the events to the timeline > store as it checks if the timeline server and system metrics publisher are > enabled before creating a timeline client. > To make it work, if the timeline service flag is turned on, it will force > every yarn application to get a delegation token. > Instead of checking if timeline service is enabled, we should be checking if > application history server is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
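A hedged illustration of the first point, assuming the stock YarnConfiguration constants (this is not the actual SystemMetricsPublisher code): gate history publishing on the generic application history setting rather than on the global timeline-service flag.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

class HistoryPublishCheck {
  // Decide whether to publish generic history events based on the
  // generic-application-history setting, not on yarn.timeline-service.enabled.
  static boolean shouldPublishHistory(Configuration conf) {
    return conf.getBoolean(
        YarnConfiguration.APPLICATION_HISTORY_ENABLED,
        YarnConfiguration.DEFAULT_APPLICATION_HISTORY_ENABLED);
  }
}
{code}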
[jira] [Updated] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-4178: --- Attachment: YARN-4178-YARN-2928.03.patch Updating a patch after merging the Writer and Reader utils. > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch, YARN-4178-YARN-2928.03.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_10" will be considered *earlier* than > "app_1234567890_9". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3864) Implement support for querying single app and all apps for a flow run
[ https://issues.apache.org/jira/browse/YARN-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3864: --- Attachment: YARN-3864-YARN-2928.04.patch Updating a patch addressing [~sjlee0]'s comments > Implement support for querying single app and all apps for a flow run > - > > Key: YARN-3864 > URL: https://issues.apache.org/jira/browse/YARN-3864 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Blocker > Attachments: YARN-3864-YARN-2928.01.patch, > YARN-3864-YARN-2928.02.patch, YARN-3864-YARN-2928.03.patch, > YARN-3864-YARN-2928.04.patch, YARN-3864-addendum-appaggregation.patch > > > This JIRA will handle support for querying all apps for a flow run in HBase > reader implementation. > And also REST API implementation for single app and multiple apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4220) [Storage implementation] Support getEntities with only Application id but no flow and flow run ID
[ https://issues.apache.org/jira/browse/YARN-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943785#comment-14943785 ] Li Lu commented on YARN-4220: - Sure. We can fix it here. Right now I'm passing flow and flowrun ids in the web UI POC. Not a big deal. > [Storage implementation] Support getEntities with only Application id but no > flow and flow run ID > - > > Key: YARN-4220 > URL: https://issues.apache.org/jira/browse/YARN-4220 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Li Lu >Priority: Minor > > Currently we're enforcing flow and flowrun id to be non-null values on > {{getEntities}}. We can actually query the appToFlow table to figure out an > application's flow id and flowrun id if they're missing. This will simplify > normal queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3864) Implement support for querying single app and all apps for a flow run
[ https://issues.apache.org/jira/browse/YARN-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943791#comment-14943791 ] Hadoop QA commented on YARN-3864: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 16m 9s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 7m 55s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 5s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 17s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 7s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 39s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 53s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 2m 57s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 41m 3s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12765030/YARN-3864-YARN-2928.04.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / a95b8f5 | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/9350/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9350/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9350/console | This message was automatically generated. > Implement support for querying single app and all apps for a flow run > - > > Key: YARN-3864 > URL: https://issues.apache.org/jira/browse/YARN-3864 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Blocker > Attachments: YARN-3864-YARN-2928.01.patch, > YARN-3864-YARN-2928.02.patch, YARN-3864-YARN-2928.03.patch, > YARN-3864-YARN-2928.04.patch, YARN-3864-addendum-appaggregation.patch > > > This JIRA will handle support for querying all apps for a flow run in HBase > reader implementation. > And also REST API implementation for single app and multiple apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry
[ https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943809#comment-14943809 ] Anubhav Dhoot commented on YARN-4185: - I don't think option 2 where you restart from 1 makes sense. It's also not a goal to minimize the total wait time. The goal should be to minimize the time to recover from short intermittent failures while also waiting long enough for long failures before giving up. Would it be better for us to ramp up to 10 sec exponentially and then do the n retries at 10 sec, or to do n retries in total including the ramp-up? > Retry interval delay for NM client can be improved from the fixed static > retry > --- > > Key: YARN-4185 > URL: https://issues.apache.org/jira/browse/YARN-4185 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Neelesh Srinivas Salian > > Instead of having a fixed retry interval that starts off very high and stays > there, we are better off using an exponential backoff that has the same fixed > max limit. Today the retry interval is fixed at 10 sec, which can be > unnecessarily high, especially when NMs could complete a rolling restart within a sec. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
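For reference, a minimal sketch of an exponential ramp capped at the existing 10-second maximum (the base delay and retry counts here are illustrative, not the final YARN-4185 policy):
{code}
import java.util.concurrent.TimeUnit;

class NMClientRetrySketch {

  // baseMillis * 2^attempt, capped at maxMillis (e.g. the current 10s limit).
  static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
    long delay = baseMillis << Math.min(attempt, 30);
    return Math.min(delay, maxMillis);
  }

  public static void main(String[] args) {
    long max = TimeUnit.SECONDS.toMillis(10);
    for (int attempt = 0; attempt < 8; attempt++) {
      // e.g. 100, 200, 400, 800, 1600, 3200, 6400, 10000 ms
      System.out.println("attempt " + attempt + " -> wait "
          + backoffMillis(attempt, 100, max) + " ms");
    }
  }
}
{code}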
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943849#comment-14943849 ] Naganarasimha G R commented on YARN-3367: - Thanks for the feedback [~sjlee0], bq. If the client is fine without the server response, it clearly implies flush is not needed Yes, I agree with this, but what should be the behavior of sync calls? IMO, in the wake of YARN-4061 (Fault tolerant writer for timeline v2), we need not worry about it either. Thoughts? bq. First of all, the timestamp should be set for most of the entities, metrics, events, etc., and the server should rely on the timestamps to resolve ordering. Well, I can understand that if all the events are received and the timestamp is filled in at the client side we need not worry about the order, but what about the case where the client goes down and some events are sent out of order? For example, the containerFinished event gets published but logAggregation does not succeed. And from the client app side I am not sure how important it is for the order to be maintained. bq. I don't see a whole lot of situations where the timeline client can do this easily and unambiguously. Well, the approach I thought of is to block publishing further entities/events through sync/async calls till the events in the timeline client queue are cleared. But I don't completely see the need for this unless it is really necessary to maintain the order. [~djp], can you please comment on this part, as in this jira description you have targeted getting the events in order. > Replace starting a separate thread for post entity with event loop in > TimelineClient > > > Key: YARN-3367 > URL: https://issues.apache.org/jira/browse/YARN-3367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Naganarasimha G R > Attachments: YARN-3367.YARN-2928.001.patch > > > Since YARN-3039, we add loop in TimelineClient to wait for > collectorServiceAddress ready before posting any entity. In consumer of > TimelineClient (like AM), we are starting a new thread for each call to get > rid of potential deadlock in main thread. This way has at least 3 major > defects: > 1. The consumer need some additional code to wrap a thread before calling > putEntities() in TimelineClient. > 2. It cost many thread resources which is unnecessary. > 3. The sequence of events could be out of order because each posting > operation thread get out of waiting loop randomly. > We should have something like event loop in TimelineClient side, > putEntities() only put related entities into a queue of entities and a > separated thread handle to deliver entities in queue to collector via REST > call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4170) AM need to be notified with priority in AllocateResponse
[ https://issues.apache.org/jira/browse/YARN-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-4170: -- Attachment: 0003-YARN-4170.patch Yes [~rohithsharma] Thank you for sharing the thoughts. Yes, we could send "null" when there are no changes in priority. Updated patch against this change. > AM need to be notified with priority in AllocateResponse > - > > Key: YARN-4170 > URL: https://issues.apache.org/jira/browse/YARN-4170 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-4170.patch, 0002-YARN-4170.patch, > 0003-YARN-4170.patch > > > As discussed in MAPREDUCE-5870, Application Master need to be notified with > priority in Allocate heartbeat. This will help AM to know the priority and > can update JobStatus when client asks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4220) [Storage implementation] Support getEntities with only Application id but no flow and flow run ID
[ https://issues.apache.org/jira/browse/YARN-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943729#comment-14943729 ] Varun Saxena commented on YARN-4220: [~gtCarrera9] Oh I get it. Even I was wondering what this JIRA is about. I think this will be addressed by patches in YARN-3864. There were some gaps in ApplicationEntityReader which I have fixed while working on YARN-3864. Will check this flow and confirm > [Storage implementation] Support getEntities with only Application id but no > flow and flow run ID > - > > Key: YARN-4220 > URL: https://issues.apache.org/jira/browse/YARN-4220 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Li Lu >Priority: Minor > > Currently we're enforcing flow and flowrun id to be non-null values on > {{getEntities}}. We can actually query the appToFlow table to figure out an > application's flow id and flowrun id if they're missing. This will simplify > normal queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4224) Change the REST interface to conform to current REST APIs' in YARN
[ https://issues.apache.org/jira/browse/YARN-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943751#comment-14943751 ] Allen Wittenauer commented on YARN-4224: If this is an incompatible change, then this needs to be ws/v3. > Change the REST interface to conform to current REST APIs' in YARN > -- > > Key: YARN-4224 > URL: https://issues.apache.org/jira/browse/YARN-4224 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4170) AM need to be notified with priority in AllocateResponse
[ https://issues.apache.org/jira/browse/YARN-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943797#comment-14943797 ] Hadoop QA commented on YARN-4170: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 20m 20s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 14s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 34s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 15s | The applied patch generated 1 release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 59s | The applied patch generated 1 new checkstyle issues (total was 7, now 8). | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 32s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 37s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 1m 57s | Post-patch findbugs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common compilation is broken. | | {color:red}-1{color} | findbugs | 2m 20s | Post-patch findbugs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager compilation is broken. | | {color:green}+1{color} | findbugs | 2m 20s | The patch does not introduce any new Findbugs (version ) warnings. | | {color:red}-1{color} | yarn tests | 0m 24s | Tests failed in hadoop-yarn-api. | | {color:red}-1{color} | yarn tests | 0m 23s | Tests failed in hadoop-yarn-common. | | {color:red}-1{color} | yarn tests | 0m 22s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | | 47m 37s | | \\ \\ || Reason || Tests || | Failed build | hadoop-yarn-api | | | hadoop-yarn-common | | | hadoop-yarn-server-resourcemanager | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12765029/0003-YARN-4170.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / b925cf1 | | Release Audit | https://builds.apache.org/job/PreCommit-YARN-Build/9349/artifact/patchprocess/patchReleaseAuditProblems.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9349/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/9349/artifact/patchprocess/whitespace.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9349/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9349/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9349/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9349/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9349/console | This message was automatically generated. > AM need to be notified with priority in AllocateResponse > - > > Key: YARN-4170 > URL: https://issues.apache.org/jira/browse/YARN-4170 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-4170.patch, 0002-YARN-4170.patch, > 0003-YARN-4170.patch > > > As discussed in MAPREDUCE-5870, Application Master need to be notified with > priority in Allocate heartbeat. This will help AM to know the priority and > can update JobStatus when client asks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943863#comment-14943863 ] Varun Saxena commented on YARN-4178: Depending on the order in which the patches go in, either YARN-4178 or YARN-3864 will require a rebase. > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch, YARN-4178-YARN-2928.03.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_10" will be considered *earlier* than > "app_1234567890_9". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4162) Scheduler info in REST, is currently not displaying partition specific queue information similar to UI
[ https://issues.apache.org/jira/browse/YARN-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4162: Attachment: YARN-4162.v2.003.patch > Scheduler info in REST, is currently not displaying partition specific queue > information similar to UI > -- > > Key: YARN-4162 > URL: https://issues.apache.org/jira/browse/YARN-4162 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-4162.v1.001.patch, YARN-4162.v2.001.patch, > YARN-4162.v2.002.patch, YARN-4162.v2.003.patch, restAndJsonOutput.zip > > > When Node Labels are enabled then REST Scheduler Information should also > provide partition specific queue information similar to the existing Web UI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944501#comment-14944501 ] Rohith Sharma K S commented on YARN-4209: - patch apply for branch-2.7.2 is failing.. Can you provide patch for branch-2.7? > RMStateStore FENCED state doesn’t work due to updateFencedState called by > stateMachine.doTransition > --- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-4209.000.patch, YARN-4209.001.patch, > YARN-4209.002.patch > > > RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by > {{stateMachine.doTransition}}. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even after {{notifyStoreOperationFailed}} is called. The only working > case for FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. > For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
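One possible reading of the fix, sketched with illustrative names rather than the actual RMStateStore code (this is not the committed YARN-4209 patch): the outer event handler records its transition result only if a nested notifyStoreOperationFailed has not already fenced the store.
{code}
enum StoreState { ACTIVE, FENCED }

abstract class StoreEventGuardSketch<E> {

  private volatile StoreState state = StoreState.ACTIVE;

  void handleStoreEvent(E event) {
    StoreState next = doTransition(event);   // may fence the store internally
    if (state != StoreState.FENCED) {
      // Record the outer transition result only if a nested
      // notifyStoreOperationFailed has not already fenced the store.
      state = next;
    }
  }

  void notifyStoreOperationFailed() {
    state = StoreState.FENCED;
  }

  abstract StoreState doTransition(E event);
}
{code}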
[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944498#comment-14944498 ] Rohith Sharma K S commented on YARN-4209: - committing shortly > RMStateStore FENCED state doesn’t work due to updateFencedState called by > stateMachine.doTransition > --- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-4209.000.patch, YARN-4209.001.patch, > YARN-4209.002.patch > > > RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by > {{stateMachine.doTransition}}. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even after {{notifyStoreOperationFailed}} is called. The only working > case for FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. > For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node
Wilfred Spiegelenburg created YARN-4227: --- Summary: FairScheduler: RM quits processing expired container from a removed node Key: YARN-4227 URL: https://issues.apache.org/jira/browse/YARN-4227 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.1, 2.5.0, 2.3.0 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Priority: Critical Under some circumstances the node is removed before an expired container event is processed causing the RM to exit: {code} 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1436927988321_1307950_01_12 Container Transitioned from ACQUIRED to EXPIRED 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: Completed container: container_1436927988321_1307950_01_12 in state: EXPIRED event:EXPIRE 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1436927988321_1307950 CONTAINERID=container_1436927988321_1307950_01_12 2015-10-04 21:14:01,063 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type CONTAINER_EXPIRED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585) at java.lang.Thread.run(Thread.java:745) 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {code} The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 and 2.6.0 by different customers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
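A hedged sketch of the kind of guard the stack trace points to (not necessarily the final YARN-4227 patch, and with illustrative names): the node that ran the expired container may already have been removed from the scheduler, so the completion handler has to tolerate a missing node instead of throwing an NPE that kills the event dispatcher.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CompletedContainerGuard<N> {

  private final Map<String, N> nodes = new ConcurrentHashMap<String, N>();

  // Returns true if the completion was applied, false if the node is gone.
  boolean completedContainer(String nodeId, Runnable releaseOnNode) {
    N node = nodes.get(nodeId);
    if (node == null) {
      // The node was removed (e.g. NM lost) before the CONTAINER_EXPIRED
      // event was processed; log and skip instead of dereferencing null.
      System.out.println("Skipping completed container: node " + nodeId
          + " is no longer tracked by the scheduler");
      return false;
    }
    releaseOnNode.run();   // release the container's resources on the node
    return true;
  }
}
{code}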
[jira] [Updated] (YARN-4215) RMNodeLabels Manager Need to verify and replace node labels for the only modified Node Label Mappings in the request
[ https://issues.apache.org/jira/browse/YARN-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4215: Attachment: YARN-4215.v1.002.patch > RMNodeLabels Manager Need to verify and replace node labels for the only > modified Node Label Mappings in the request > > > Key: YARN-4215 > URL: https://issues.apache.org/jira/browse/YARN-4215 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: nodelabel, resourcemanager > Attachments: YARN-4215.v1.001.patch, YARN-4215.v1.002.patch > > > Modified node Labels needs to be updated by the capacity scheduler holding a > lock hence its better to push events to scheduler only when there is actually > a change in the label mapping for a given node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM periodically for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944544#comment-14944544 ] Hudson commented on YARN-4176: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #458 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/458/]) YARN-4176. Resync NM nodelabels with RM periodically for distributed (wangda: rev 30ac69c6bd3db363248d6c742561371576006dab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdaterForLabels.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml > Resync NM nodelabels with RM periodically for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Fix For: 2.8.0 > > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch, 0005-YARN-4176.patch > > > This JIRA is for handling the below set of issue > # Distributed nodelabels after NM registered with RM if cluster nodelabels > are removed and added then NM doesnt resend labels in heartbeat again untils > any change in labels > # NM registration failed with Nodelabels should resend labels again to RM > The above cases can be handled by resync nodeLabels with RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and will resend nodelabels to RM based on config no matter what the > registration fails or success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
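An illustrative read of the new property named above; the default value used in this sketch is an assumption, so check yarn-default.xml in 2.8.0 for the shipped value.
{code}
import org.apache.hadoop.conf.Configuration;

class NodeLabelsResyncConfig {
  // Property name taken from the issue description above; the default used
  // here is an assumption, not the value shipped in yarn-default.xml.
  static long getResyncIntervalMs(Configuration conf) {
    return conf.getLong(
        "yarn.nodemanager.node-labels.provider.resync-interval-ms",
        2 * 60 * 1000L);
  }
}
{code}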
[jira] [Created] (YARN-4228) FileSystemRMStateStore use IOUtils on fs#close
Bibin A Chundatt created YARN-4228: -- Summary: FileSystemRMStateStore use IOUtils on fs#close Key: YARN-4228 URL: https://issues.apache.org/jira/browse/YARN-4228 Project: Hadoop YARN Issue Type: Improvement Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor NPE on {{FileSystemRMStateStore#closeWithRetries}} when active service initialization fails on rm start up {noformat} 2015-10-05 19:56:38,626 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore failed in state STOPPED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$13.run(FileSystemRMStateStore.java:721) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$13.run(FileSystemRMStateStore.java:718) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$FSAction.runWithRetries(FileSystemRMStateStore.java:734) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.closeWithRetries(FileSystemRMStateStore.java:718) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.closeInternal(FileSystemRMStateStore.java:169) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStop(RMStateStore.java:618) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.AbstractService.close(AbstractService.java:250) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:609) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:965) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1195) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
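A minimal sketch of the improvement in the summary, assuming the intent is a null-safe close via org.apache.hadoop.io.IOUtils rather than calling fs.close() directly (which NPEs when active-service initialization fails before the FileSystem is ever created):
{code}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;

class FsStoreCloseSketch {

  private FileSystem fs;   // may still be null if serviceInit failed early

  void closeInternal() {
    // IOUtils.closeStream is null-safe, so the close no longer NPEs when the
    // active services failed before the FileSystem was ever opened.
    IOUtils.closeStream(fs);
  }
}
{code}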
[jira] [Commented] (YARN-4162) Scheduler info in REST, is currently not displaying partition specific queue information similar to UI
[ https://issues.apache.org/jira/browse/YARN-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944484#comment-14944484 ] Naganarasimha G R commented on YARN-4162: - Thanks [~wangda] for the comments; I have incorporated them in the latest patch. The test case failures seem to be unrelated to the modifications in this jira, and the valid checkstyle comments have been addressed. Also, your thoughts on my earlier comment? bq. Also, to ensure all the REST data related to the existing web UI is available, we need to expose the Node Label Resource information as well. Do I need to add that in this jira too? > Scheduler info in REST, is currently not displaying partition specific queue > information similar to UI > -- > > Key: YARN-4162 > URL: https://issues.apache.org/jira/browse/YARN-4162 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-4162.v1.001.patch, YARN-4162.v2.001.patch, > YARN-4162.v2.002.patch, YARN-4162.v2.003.patch, restAndJsonOutput.zip > > > When Node Labels are enabled then REST Scheduler Information should also > provide partition specific queue information similar to the existing Web UI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node
[ https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-4227: Attachment: YARN-4227.patch > FairScheduler: RM quits processing expired container from a removed node > > > Key: YARN-4227 > URL: https://issues.apache.org/jira/browse/YARN-4227 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.3.0, 2.5.0, 2.7.1 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Critical > Attachments: YARN-4227.patch > > > Under some circumstances the node is removed before an expired container > event is processed causing the RM to exit: > {code} > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: > Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_1436927988321_1307950_01_12 Container Transitioned from > ACQUIRED to EXPIRED > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: > Completed container: container_1436927988321_1307950_01_12 in state: > EXPIRED event:EXPIRE > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op >OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS > APPID=application_1436927988321_1307950 > CONTAINERID=container_1436927988321_1307950_01_12 > 2015-10-04 21:14:01,063 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type CONTAINER_EXPIRED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585) > at java.lang.Thread.run(Thread.java:745) > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {code} > The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 > and 2.6.0 by different customers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944524#comment-14944524 ] Bibin A Chundatt commented on YARN-4216: {quote} That is intentional. Decommission + nm restart doesn't make sense to me. Either we are decommissioning a node and don't expect it to return, or we are going to restart it and expect it to return shortly. {quote} For a *rolling upgrade* the same scenario can happen *( decommission (logs upload) --> upgrade --> start NM --> new container assignment --> on finish log upload )* and container log loss happens. Appending logs during aggregation could be one solution in this case, right? > Container logs not shown for newly assigned containers after NM recovery > -- > > Key: YARN-4216 > URL: https://issues.apache.org/jira/browse/YARN-4216 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml > > > Steps to reproduce > # Start 2 nodemanagers with NM recovery enabled > # Submit pi job with 20 maps > # Once 5 maps gets completed in NM 1 stop NM (yarn daemon stop nodemanager) > (Logs of all completed container gets aggregated to HDFS) > # Now start the NM1 again and wait for job completion > *The newly assigned container logs on NM1 are not shown* > *hdfs log dir state* > # When logs are aggregated to HDFS during stop its with NAME (localhost_38153) > # On log aggregation after starting NM the newly assigned container logs gets > uploaded with name (localhost_38153.tmp) > History server the logs are now shown for new task attempts -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4061) [Fault tolerance] Fault tolerant writer for timeline v2
[ https://issues.apache.org/jira/browse/YARN-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943999#comment-14943999 ] Sangjin Lee commented on YARN-4061: --- Sorry it took me a while to get to this. Thanks for the proposal [~gtCarrera9]! One potential area of concern is regarding flush and the associated contract with the client. If the client wanted to write critical data synchronously and specifically wanted to block until it receives a server (storage) response, how would that work in this scheme? The proposal has the following for the flush situation: {quote} When one log segment reaches a predefined size, or a time trigger, or an explicit flush call happens, it is published to a log queue. {quote} Since the actual storage writer (HBase) always acts on this queue asynchronously, it seems that the client cannot have a synchronous write semantics. Is that a correct reading? If so, how would we implement such a synchronous write? > [Fault tolerance] Fault tolerant writer for timeline v2 > --- > > Key: YARN-4061 > URL: https://issues.apache.org/jira/browse/YARN-4061 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Li Lu > Attachments: FaulttolerantwriterforTimelinev2.pdf > > > We need to build a timeline writer that can be resistant to backend storage > down time and timeline collector failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
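One way the synchronous semantics could be layered onto the proposed log queue, sketched here as an assumption rather than something taken from the proposal: a synchronous put enqueues the segment together with a latch and blocks until the storage writer has acknowledged writing it.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class SyncWriteSketch {

  static final class Segment {
    final Object entities;
    final CountDownLatch written = new CountDownLatch(1);
    Segment(Object entities) {
      this.entities = entities;
    }
  }

  private final BlockingQueue<Segment> logQueue = new LinkedBlockingQueue<>();

  // Async path: enqueue the segment and return immediately.
  void putAsync(Object entities) {
    logQueue.add(new Segment(entities));
  }

  // Sync path: enqueue, then block until the backend writer acknowledges it.
  boolean putSync(Object entities, long timeoutMs) throws InterruptedException {
    Segment segment = new Segment(entities);
    logQueue.add(segment);
    return segment.written.await(timeoutMs, TimeUnit.MILLISECONDS);
  }

  // Storage-writer side: drain the queue, write to the backend, then ack.
  void writerLoop() throws InterruptedException {
    while (true) {
      Segment segment = logQueue.take();
      // writeToBackend(segment.entities);  // e.g. the HBase writer
      segment.written.countDown();
    }
  }
}
{code}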
[jira] [Commented] (YARN-4061) [Fault tolerance] Fault tolerant writer for timeline v2
[ https://issues.apache.org/jira/browse/YARN-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944009#comment-14944009 ] Sangjin Lee commented on YARN-4061: --- Another thing to consider is also the throughput of writing to filesystems (local or hdfs). This may or may not be a big problem for app-level timeline collector, but it would certainly be something we need to analyze rigorously for the RM timeline collector. If we go the route of writing all writes to disk, then we should ensure that we can sustain the throughput for the RM collector of a very large cluster (> 10,000 nodes, a large number of apps being created). > [Fault tolerance] Fault tolerant writer for timeline v2 > --- > > Key: YARN-4061 > URL: https://issues.apache.org/jira/browse/YARN-4061 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Li Lu > Attachments: FaulttolerantwriterforTimelinev2.pdf > > > We need to build a timeline writer that can be resistant to backend storage > down time and timeline collector failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944028#comment-14944028 ] Sangjin Lee commented on YARN-3367: --- bq. Yes i agree with this, but what should be the behavior of Sync calls ? IMO in the wake of YARN-4061 (Fault tolerant writer for timeline v2), we need not worry abt it either, Thoughts ? I added a couple of comments to YARN-4061. I think it remains to be seen what we will choose as the behavior/implementation at the end. But at least I think it'd be fair to say that there will be a certain type of calls that will need to trigger flush (and a synchronous wait for the response). Whether we will do that on the sync side or not, I think we have some flexibility. {quote} Well i can understand if all the events are received and timestamp is filled at the client side we need not worry abt the order but in the case client goes down send some events out of order ? like containerFinished event gets Published but logAggregation does not succeed. And from Client App side not sure how important is the order to be maintained. {quote} I think we might be saying the same thing. What I'm saying is that it would not be practical to ensure the order of events for sync and async writes. As for the timestamps, I am also arguing that the timestamps should always be set explicitly for entities/metrics/events, and that the server should rely on the explicit timestamps, rather than on time of receipt. > Replace starting a separate thread for post entity with event loop in > TimelineClient > > > Key: YARN-3367 > URL: https://issues.apache.org/jira/browse/YARN-3367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Naganarasimha G R > Attachments: YARN-3367.YARN-2928.001.patch > > > Since YARN-3039, we add loop in TimelineClient to wait for > collectorServiceAddress ready before posting any entity. In consumer of > TimelineClient (like AM), we are starting a new thread for each call to get > rid of potential deadlock in main thread. This way has at least 3 major > defects: > 1. The consumer need some additional code to wrap a thread before calling > putEntities() in TimelineClient. > 2. It cost many thread resources which is unnecessary. > 3. The sequence of events could be out of order because each posting > operation thread get out of waiting loop randomly. > We should have something like event loop in TimelineClient side, > putEntities() only put related entities into a queue of entities and a > separated thread handle to deliver entities in queue to collector via REST > call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
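On the explicit-timestamp point, a small hedged sketch of what "the client stamps the event, the server never infers it from receipt time" looks like. The entity and event classes here are simplified stand-ins for the v2 timeline records, not the actual API.
{code}
// Simplified stand-ins for the v2 timeline records: the producer stamps the
// event when it happens, so ordering does not depend on delivery order.
class Event {
  final String id;
  final long timestamp;          // set by the producer, not the server
  Event(String id, long timestamp) { this.id = id; this.timestamp = timestamp; }
}

class ClientSketch {
  void emitContainerFinished(String containerId) {
    Event e = new Event("CONTAINER_FINISHED:" + containerId,
        System.currentTimeMillis());   // explicit client-side timestamp
    enqueue(e);                        // async delivery may reorder sends,
                                       // but the timestamp is already fixed
  }
  void enqueue(Event e) { /* hand off to the posting queue */ }
}
{code}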
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943918#comment-14943918 ] Varun Saxena commented on YARN-2902: The new patch does the following over the last patch: # Removed the additional param in the container executor to ignore a missing directory. Now this will be the default behaviour. # The patch no longer cancels the deletion task in the NM. # The localizer won't send the extra heartbeat which it was sending to the NM if the NM had indicated it to DIE. # The localizer does not wait for a cancelled task to complete on DIE. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943957#comment-14943957 ] Sangjin Lee commented on YARN-4178: --- I just committed YARN-3864. Could you please rebase this patch? Thanks. > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch, YARN-4178-YARN-2928.03.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_1" will be considered *earlier* than > "app_1234567890_". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943973#comment-14943973 ] Hadoop QA commented on YARN-4178: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 18m 29s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 9m 14s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 40s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 26s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 18s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 5s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 44s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 45s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 2s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 2m 53s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 46m 44s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12765047/YARN-4178-YARN-2928.03.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / a95b8f5 | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/9352/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9352/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9352/console | This message was automatically generated. > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch, YARN-4178-YARN-2928.03.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_1" will be considered *earlier* than > "app_1234567890_". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4175) Example of use YARN-1197
[ https://issues.apache.org/jira/browse/YARN-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944039#comment-14944039 ] MENG DING commented on YARN-4175: - I am using the example application to test the container increase/decrease function against a 4-node cluster. I will collect and report all problems when the tests are completed. Just a quick note in case someone also wants to do the test: * The application master IPC server now listens on a fixed port, 8686. If multiple app masters are started on the same host with the *-enable_ipc* option specified, there will be port conflicts, but YARN should be able to start new app attempts and try to launch the app master on a different host. * If there is an invalid container resource change request (e.g., the target resource is smaller than the original resource for an increase), the AMRMClient will throw an exception (i.e., InvalidResourceRequestException) at the allocate call, and the current implementation of the distributed shell appmaster will exit, causing the entire application to exit. > Example of use YARN-1197 > > > Key: YARN-4175 > URL: https://issues.apache.org/jira/browse/YARN-4175 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Wangda Tan >Assignee: MENG DING > Attachments: YARN-4175.1.patch > > > Like YARN-2609, we need a example program to demonstrate how to use YARN-1197 > from end-to-end. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
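A hedged sketch of how an application master might defend against the second point, validating the change request before it is sent instead of letting the exception unwind the whole AM. The resource and validator types here are simplified stand-ins; a real AM would build the actual change request and call AMRMClient#allocate.
{code}
// Simplified stand-ins: the point is just to validate (or catch) before an
// InvalidResourceRequestException can take the whole AM down.
class ResourceSketch {
  final int memoryMb, vcores;
  ResourceSketch(int memoryMb, int vcores) { this.memoryMb = memoryMb; this.vcores = vcores; }
  boolean fitsIn(ResourceSketch other) {
    return memoryMb <= other.memoryMb && vcores <= other.vcores;
  }
}

class IncreaseRequestValidator {
  /** Returns true only if the target is genuinely larger than the current size. */
  static boolean isValidIncrease(ResourceSketch current, ResourceSketch target) {
    return current.fitsIn(target) && !target.fitsIn(current);
  }
}
{code}
Alternatively, the distributed shell appmaster could wrap the allocate call in a try/catch for InvalidResourceRequestException and drop the bad request rather than exiting the whole application.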
[jira] [Commented] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942973#comment-14942973 ] Varun Saxena commented on YARN-4178: bq. To that effect, if you’d like, we can rename TimelineWriterUtils to TimelineStorageUtils so that both reader and writer can use functions from this. Also,let’s have the invert(long) and invert(int) functions in the same util class, instead of adding in a new util class. I think we can have a single class TimelineStorageUtils and remove TimelineWriterUtils and TimelineReaderUtils. > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_1" will be considered *earlier* than > "app_1234567890_". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
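For reference, the invert helpers under discussion are typically just a subtraction from the type's max value, so that byte-wise lexicographic comparison of the row key matches descending numeric ordering (latest first). A minimal sketch of what a consolidated TimelineStorageUtils could carry; this is illustrative, not the actual patch.
{code}
public final class TimelineStorageUtilsSketch {
  private TimelineStorageUtilsSketch() { }

  // Inverting a value before writing it into the row key makes byte-wise
  // lexicographic ordering match descending numeric ordering, which is how
  // "latest first" scans are usually achieved in HBase row keys.
  public static long invertLong(long key) {
    return Long.MAX_VALUE - key;
  }

  public static int invertInt(int key) {
    return Integer.MAX_VALUE - key;
  }
}
{code}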
[jira] [Commented] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942972#comment-14942972 ] Varun Saxena commented on YARN-4178: bq. Why do we need any util classes for this, can't an AppId class handle this by itself (convert from string to byte representation and back)? Should we be adding code specific to ATS, and specifically to the HBase implementation of ATS, into a class (ApplicationId) that is used all across YARN? > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_1" will be considered *earlier* than > "app_1234567890_". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
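A hedged sketch of the alternative being argued for here: keep ApplicationId untouched and let a storage-side helper encode application_&lt;clusterTimestamp&gt;_&lt;sequence&gt; into fixed-width bytes, so row-key comparison is numeric rather than string-based. The parsing and byte layout are illustrative assumptions, not the committed format.
{code}
import java.nio.ByteBuffer;

public final class AppIdKeyConverterSketch {
  private AppIdKeyConverterSketch() { }

  // Encode "application_<clusterTimestamp>_<sequence>" as 8 bytes of
  // timestamp followed by 4 bytes of sequence number (big-endian), so the
  // byte-wise row-key ordering follows numeric ordering and does not break
  // when the sequence number rolls over to an extra digit.
  public static byte[] toRowKeyPart(String appId) {
    String[] parts = appId.split("_");        // [application, ts, seq]
    long clusterTs = Long.parseLong(parts[1]);
    int seq = Integer.parseInt(parts[2]);
    return ByteBuffer.allocate(12).putLong(clusterTs).putInt(seq).array();
  }
}
{code}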
[jira] [Updated] (YARN-4009) CORS support for ResourceManager REST API
[ https://issues.apache.org/jira/browse/YARN-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-4009: Attachment: YARN-4009.006.patch Uploaded a new patch to address Jonathan's comments. bq.Configuration usage set(config setting, 'true") is better as setBoolean(config setting, true) Fixed. bq.timeline server now uses its own different way to enable. So if i turn on resource manager and timeline server both on but nothing else, I get a CORS disabled message in the timeline server log even though it is enabled. Could you file a jira to address this spurious log message? Fixed. > CORS support for ResourceManager REST API > - > > Key: YARN-4009 > URL: https://issues.apache.org/jira/browse/YARN-4009 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Prakash Ramachandran >Assignee: Varun Vasudev > Attachments: YARN-4009.001.patch, YARN-4009.002.patch, > YARN-4009.003.patch, YARN-4009.004.patch, YARN-4009.005.patch, > YARN-4009.006.patch > > > Currently the REST API's do not have CORS support. This means any UI (running > in browser) cannot consume the REST API's. For ex Tez UI would like to use > the REST API for getting application, application attempt information exposed > by the API's. > It would be very useful if CORS is enabled for the REST API's. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
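The setBoolean comment refers to the usual Hadoop Configuration idiom; a one-line illustration follows. The property name shown is assumed to be the RM cross-origin enable flag; the point is the typed setter rather than set(key, "true").
{code}
import org.apache.hadoop.conf.Configuration;

public class CorsConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Type-safe setter rather than conf.set(key, "true"); the property
    // name here is an assumption for illustration.
    conf.setBoolean("yarn.resourcemanager.webapp.cross-origin.enabled", true);
  }
}
{code}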
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943113#comment-14943113 ] Bibin A Chundatt commented on YARN-4216: {quote} That's why YARN-1362 was done, so we can explicitly tell the nodemanager whether or not the NM is under supervision and likely to restart. {quote} *yarn.nodemanager.recovery.supervised=false* in my current setup. In this case, as I understand from the above comment, I am supposed to set *yarn.nodemanager.recovery.supervised* to true to indicate that the restart is under supervision. [~jlowe] so should I close this jira? > Container logs not shown for newly assigned containers after NM recovery > -- > > Key: YARN-4216 > URL: https://issues.apache.org/jira/browse/YARN-4216 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml > > > Steps to reproduce > # Start 2 nodemanagers with NM recovery enabled > # Submit pi job with 20 maps > # Once 5 maps gets completed in NM 1 stop NM (yarn daemon stop nodemanager) > (Logs of all completed container gets aggregated to HDFS) > # Now start the NM1 again and wait for job completion > *The newly assigned container logs on NM1 are not shown* > *hdfs log dir state* > # When logs are aggregated to HDFS during stop its with NAME (localhost_38153) > # On log aggregation after starting NM the newly assigned container logs gets > uploaded with name (localhost_38153.tmp) > History server the logs are now shown for new task attempts -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943124#comment-14943124 ] Bibin A Chundatt commented on YARN-4216: The [Document|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html] doesn't mention *yarn.nodemanager.recovery.supervised*. Should I update the doc? > Container logs not shown for newly assigned containers after NM recovery > -- > > Key: YARN-4216 > URL: https://issues.apache.org/jira/browse/YARN-4216 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml > > > Steps to reproduce > # Start 2 nodemanagers with NM recovery enabled > # Submit pi job with 20 maps > # Once 5 maps gets completed in NM 1 stop NM (yarn daemon stop nodemanager) > (Logs of all completed container gets aggregated to HDFS) > # Now start the NM1 again and wait for job completion > *The newly assigned container logs on NM1 are not shown* > *hdfs log dir state* > # When logs are aggregated to HDFS during stop its with NAME (localhost_38153) > # On log aggregation after starting NM the newly assigned container logs gets > uploaded with name (localhost_38153.tmp) > History server the logs are now shown for new task attempts -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3216) Max-AM-Resource-Percentage should respect node labels
[ https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944363#comment-14944363 ] Hadoop QA commented on YARN-3216: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 19m 46s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 9m 6s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 44s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 18s | The applied patch generated 1 release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 10s | The applied patch generated 18 new checkstyle issues (total was 271, now 268). | | {color:red}-1{color} | whitespace | 0m 8s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 44s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 37s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 40s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 62m 27s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 108m 45s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerResizing | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerNodeLabelUpdate | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestWorkPreservingRMRestartForNodeLabel | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationLimits | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12765017/0003-YARN-3216.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 30ac69c | | Release Audit | https://builds.apache.org/job/PreCommit-YARN-Build/9355/artifact/patchprocess/patchReleaseAuditProblems.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9355/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/9355/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9355/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9355/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9355/console | This message was automatically generated. 
> Max-AM-Resource-Percentage should respect node labels > - > > Key: YARN-3216 > URL: https://issues.apache.org/jira/browse/YARN-3216 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3216.patch, 0002-YARN-3216.patch, > 0003-YARN-3216.patch > > > Currently, max-am-resource-percentage considers default_partition only. When > a queue can access multiple partitions, we should be able to compute > max-am-resource-percentage based on that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM periodically for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944381#comment-14944381 ] Hudson commented on YARN-4176: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2427 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2427/]) YARN-4176. Resync NM nodelabels with RM periodically for distributed (wangda: rev 30ac69c6bd3db363248d6c742561371576006dab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdaterForLabels.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java > Resync NM nodelabels with RM periodically for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Fix For: 2.8.0 > > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch, 0005-YARN-4176.patch > > > This JIRA is for handling the below set of issue > # Distributed nodelabels after NM registered with RM if cluster nodelabels > are removed and added then NM doesnt resend labels in heartbeat again untils > any change in labels > # NM registration failed with Nodelabels should resend labels again to RM > The above cases can be handled by resync nodeLabels with RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and will resend nodelabels to RM based on config no matter what the > registration fails or success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry
[ https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944384#comment-14944384 ] Neelesh Srinivas Salian commented on YARN-4185: --- [~adhoot], thanks for the clarification. So, the initial retries can be done with backoff times of 1, 2, 4, 8, which are still less than 10 and thus give the opportunity to retry quickly for a short-lived NM restart (under 10 seconds). We can then keep waiting 10 seconds per retry to accommodate a longer failure. Thus, the backoff times would be 1, 2, 4, 8, 10, 10 and so on until the number of retries is exhausted. My only concern is that if the failure lasts longer than the total wait time allowed by the number of retries, there won't be a chance to retry. I'll write up a patch to exhibit this. Thank you. > Retry interval delay for NM client can be improved from the fixed static > retry > --- > > Key: YARN-4185 > URL: https://issues.apache.org/jira/browse/YARN-4185 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Neelesh Srinivas Salian > > Instead of having a fixed retry interval that starts off very high and stays > there, we are better off using an exponential backoff that has the same fixed > max limit. Today the retry interval is fixed at 10 sec that can be > unnecessarily high especially when NMs could rolling restart within a sec. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
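A small sketch of the schedule being described: exponential growth capped at the current fixed 10-second interval, so short NM restarts are caught early while the total retry budget still covers longer outages. The numbers simply mirror the 1, 2, 4, 8, 10, 10, ... sequence from the comment; this is not the eventual patch.
{code}
import java.util.concurrent.TimeUnit;

public class CappedExponentialBackoff {
  private static final long MAX_DELAY_SEC = 10;   // today's fixed interval
  private static final long BASE_DELAY_SEC = 1;

  /** Delay before the given (0-based) retry attempt: 1, 2, 4, 8, 10, 10, ... */
  static long delaySeconds(int attempt) {
    long exp = BASE_DELAY_SEC << Math.min(attempt, 62);   // avoid shift overflow
    return Math.min(exp, MAX_DELAY_SEC);
  }

  public static void main(String[] args) throws InterruptedException {
    int maxRetries = 6;
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      System.out.println("retry " + attempt + " after "
          + delaySeconds(attempt) + "s");
      TimeUnit.SECONDS.sleep(delaySeconds(attempt));
      // ... attempt to reconnect to the NM here ...
    }
  }
}
{code}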
[jira] [Commented] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944130#comment-14944130 ] Varun Saxena commented on YARN-4178: New patch updated. > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch, YARN-4178-YARN-2928.03.patch, > YARN-4178-YARN-2928.04.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_1" will be considered *earlier* than > "app_1234567890_". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-4178: --- Attachment: YARN-4178-YARN-2928.04.patch [~sjlee0], rebased the patch > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch, YARN-4178-YARN-2928.03.patch, > YARN-4178-YARN-2928.04.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_1" will be considered *earlier* than > "app_1234567890_". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-4178: --- Attachment: (was: YARN-4178-YARN-2928.04.patch) > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch, YARN-4178-YARN-2928.03.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_1" will be considered *earlier* than > "app_1234567890_". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4162) Scheduler info in REST, is currently not displaying partition specific queue information similar to UI
[ https://issues.apache.org/jira/browse/YARN-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944052#comment-14944052 ] Hadoop QA commented on YARN-4162: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 21m 35s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | javac | 8m 29s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 11s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 16s | The applied patch generated 1 release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 8s | The applied patch generated 28 new checkstyle issues (total was 222, now 249). | | {color:green}+1{color} | whitespace | 0m 5s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 29s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 51m 36s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 97m 2s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | | | org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12765041/YARN-4162.v2.002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 3e1752f | | Release Audit | https://builds.apache.org/job/PreCommit-YARN-Build/9351/artifact/patchprocess/patchReleaseAuditProblems.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9351/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9351/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9351/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9351/console | This message was automatically generated. > Scheduler info in REST, is currently not displaying partition specific queue > information similar to UI > -- > > Key: YARN-4162 > URL: https://issues.apache.org/jira/browse/YARN-4162 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-4162.v1.001.patch, YARN-4162.v2.001.patch, > YARN-4162.v2.002.patch, restAndJsonOutput.zip > > > When Node Labels are enabled then REST Scheduler Information should also > provide partition specific queue information similar to the existing Web UI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4061) [Fault tolerance] Fault tolerant writer for timeline v2
[ https://issues.apache.org/jira/browse/YARN-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944067#comment-14944067 ] Li Lu commented on YARN-4061: - Thanks for the review [~sjlee0]! bq. Since the actual storage writer (HBase) always acts on this queue asynchronously, it seems that the client cannot have a synchronous write semantics. Is that a correct reading? If so, how would we implement such a synchronous write? This is definitely a valid concern. Yes, having pure synchronous semantics with this design is hard. To support synchronous semantics we generally have two options: - We not only need to enforce a flush, but on synchronous calls we also need to block until the data is actually persisted into HBase. The advantage of this design is simplicity, but if the HBase storage is not available we cannot perform any synchronous calls. This makes the "fault tolerant" feature less appealing. - Since we know (and trust) that data on HDFS will eventually be available in HBase, maybe we can have an FT reader check HDFS before (or along with) HBase? In this way we can always select the most up-to-date data, whether it is in HDFS or in HBase. The shortcoming of this approach is that local file storage will not work here, because the buffered data is not generally available to other nodes (and I wonder whether this strong consistency model is too ambitious given the amount of data). About throughput, I agree we need to be careful here. We may have traffic of a similar scale and flow to the MapReduce JobHistory server? If that is the case, I think we can definitely start with some ideas from the JHS. > [Fault tolerance] Fault tolerant writer for timeline v2 > --- > > Key: YARN-4061 > URL: https://issues.apache.org/jira/browse/YARN-4061 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Li Lu > Attachments: FaulttolerantwriterforTimelinev2.pdf > > > We need to build a timeline writer that can be resistant to backend storage > down time and timeline collector failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
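For the second option, a hedged sketch of the "select the most up-to-date copy" merge a fault-tolerant reader would perform between the HDFS staging data and what is already in HBase. The record type and the two input maps are hypothetical stand-ins for whatever the reader actually fetches.
{code}
import java.util.HashMap;
import java.util.Map;

public class FtReaderMergeSketch {
  // Hypothetical record: whatever the reader returns, tagged with the write
  // timestamp so the two sources can be reconciled.
  static class Record {
    final String value;
    final long writeTs;
    Record(String value, long writeTs) { this.value = value; this.writeTs = writeTs; }
  }

  /** Prefer whichever source has the newer copy of each entity. */
  static Map<String, Record> merge(Map<String, Record> fromHBase,
                                   Map<String, Record> fromHdfsBuffer) {
    Map<String, Record> result = new HashMap<>(fromHBase);
    for (Map.Entry<String, Record> e : fromHdfsBuffer.entrySet()) {
      Record existing = result.get(e.getKey());
      if (existing == null || e.getValue().writeTs > existing.writeTs) {
        result.put(e.getKey(), e.getValue());
      }
    }
    return result;
  }
}
{code}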
[jira] [Updated] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-4178: --- Attachment: YARN-4178-YARN-2928.04.patch > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch, YARN-4178-YARN-2928.03.patch, > YARN-4178-YARN-2928.04.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_1" will be considered *earlier* than > "app_1234567890_". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944123#comment-14944123 ] Hadoop QA commented on YARN-4178: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 18m 12s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:red}-1{color} | javac | 4m 31s | The patch appears to cause the build to fail. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12765067/YARN-4178-YARN-2928.04.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / 09c3576 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9353/console | This message was automatically generated. > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch, YARN-4178-YARN-2928.03.patch, > YARN-4178-YARN-2928.04.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_1" will be considered *earlier* than > "app_1234567890_". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM periodically for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944272#comment-14944272 ] Hudson commented on YARN-4176: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #492 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/492/]) YARN-4176. Resync NM nodelabels with RM periodically for distributed (wangda: rev 30ac69c6bd3db363248d6c742561371576006dab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdaterForLabels.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml > Resync NM nodelabels with RM periodically for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Fix For: 2.8.0 > > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch, 0005-YARN-4176.patch > > > This JIRA is for handling the below set of issue > # Distributed nodelabels after NM registered with RM if cluster nodelabels > are removed and added then NM doesnt resend labels in heartbeat again untils > any change in labels > # NM registration failed with Nodelabels should resend labels again to RM > The above cases can be handled by resync nodeLabels with RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and will resend nodelabels to RM based on config no matter what the > registration fails or success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3216) Max-AM-Resource-Percentage should respect node labels
[ https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944289#comment-14944289 ] Wangda Tan commented on YARN-3216: -- Thanks [~sunilg]. Went through the patch, some comments: 1. AbstractCSQueue: Instead of adding AM-used-resource to parentQueue, I think we may only need to calculate AM-used-resource on LeafQueue and user. Currently we don't have a limit on AM-used-resource at the parentQueue level, so the aggregated resource may not be very useful. We can add it along the hierarchy if we want to limit max-am-percent on parentQueue in the future. 2. CapacitySchedulerConfiguration: Instead of introducing a new configuration, MAXIMUM_AM_RESOURCE_PARTITION_SUFFIX, I suggest using the existing one: maximum-am-resource-percent. If {{queue.accessible-node-labels.<label>.maximum-am-resource-percent}} is not set, it falls back to queue.maximum-am-resource-percent. Please let me know if there's any specific reason to add a new maximum-am-resource-partition. 3. LeafQueue: I'm wondering if we need to maintain a map of {{PartitionInfo}}: PartitionInfo.getActiveApplications is only used to check if there are any activated apps under a partition, which is equivalent to {{queueUsage.getAMUsed(partitionName) > 0}}. 4. SchedulerApplicationAttempt: I think the return value of getAMUsed should be: - Before the AM container is allocated, it returns AM-Resource-Request.resource on partition=AM-Resource-Request.node-label-request. - After the AM container is allocated, it returns AM-Container.resource on partition=AM-Node.partition. - You don't have to update am-resource right when the AM container is allocated, because AM-container.resource and am-resource-request.node-label-request won't change, but you do need to update it if the partition of the AM container's NM is updated. I'm not sure if this is clear; please let me know if you need me to elaborate more on this comment. I noticed you removed some code from FiCaSchedulerApp's constructor; I think getAMUsed should still return the correct value before the AM container is allocated, otherwise the computation might be wrong. Let me know if I didn't understand your code correctly. > Max-AM-Resource-Percentage should respect node labels > - > > Key: YARN-3216 > URL: https://issues.apache.org/jira/browse/YARN-3216 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3216.patch, 0002-YARN-3216.patch, > 0003-YARN-3216.patch > > > Currently, max-am-resource-percentage considers default_partition only. When > a queue can access multiple partitions, we should be able to compute > max-am-resource-percentage based on that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
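The second comment above amounts to a two-level lookup; a hedged sketch of the fallback follows. The property names mirror the pattern in the comment and are assumptions, not the final configuration keys.
{code}
import org.apache.hadoop.conf.Configuration;

public class AmPercentLookupSketch {
  // Fall back from the per-label value to the queue-level value when the
  // label-specific key is not set. Key names here mirror the pattern in the
  // review comment and are illustrative only.
  static float getMaxAmPercent(Configuration conf, String queuePrefix,
      String label, float defaultPct) {
    float queueLevel =
        conf.getFloat(queuePrefix + "maximum-am-resource-percent", defaultPct);
    return conf.getFloat(
        queuePrefix + "accessible-node-labels." + label
            + ".maximum-am-resource-percent",
        queueLevel);
  }
}
{code}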
[jira] [Commented] (YARN-3216) Max-AM-Resource-Percentage should respect node labels
[ https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944292#comment-14944292 ] Wangda Tan commented on YARN-3216: -- And I forgot to mention: 5. About am-resource-percent per user per partition: currently you have only considered am-resource-percent per queue; I think you need to calculate (not configure) a per-user-per-partition am-resource-limit as well. Since the patch is already very complex, I'm fine with doing the math of am-resource-limit-per-user in a separate JIRA. > Max-AM-Resource-Percentage should respect node labels > - > > Key: YARN-3216 > URL: https://issues.apache.org/jira/browse/YARN-3216 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3216.patch, 0002-YARN-3216.patch, > 0003-YARN-3216.patch > > > Currently, max-am-resource-percentage considers default_partition only. When > a queue can access multiple partitions, we should be able to compute > max-am-resource-percentage based on that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4215) RMNodeLabels Manager Need to verify and replace node labels for the only modified Node Label Mappings in the request
[ https://issues.apache.org/jira/browse/YARN-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944313#comment-14944313 ] Wangda Tan commented on YARN-4215: -- Thanks [~Naganarasimha]! The patch generally looks good to me; only a few nits in the test: - I suggest adding one more test just to avoid a regression in the future: if a host has a label (node1:0 label=x) and someone updates the label of node1:0 to y, the scheduler should receive an event as well (a sketch follows this message). - Could you check whether the labels of the events are as expected? {code} 483 mgr.replaceLabelsOnNode(ImmutableMap.of(toNodeId("n1:1"), toSet("p1"), 484 toNodeId("n2:1"), toSet("p2"), toNodeId("n3"), toSet("p3"))); 485 assertTrue("Event should be sent when there is change in labels", 486 schedEventsHandler.receivedEvent); 487 assertEquals("3 node label mapping modified", 3, 488 schedEventsHandler.updatedNodeToLabels.size()); 489 schedEventsHandler.receivedEvent = false; {code} > RMNodeLabels Manager Need to verify and replace node labels for the only > modified Node Label Mappings in the request > > > Key: YARN-4215 > URL: https://issues.apache.org/jira/browse/YARN-4215 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: nodelabel, resourcemanager > Attachments: YARN-4215.v1.001.patch > > > Modified node Labels needs to be updated by the capacity scheduler holding a > lock hence its better to push events to scheduler only when there is actually > a change in the label mapping for a given node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
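The first nit is asking for roughly the following regression test, sketched here with the same helpers visible in the quoted snippet (mgr, toNodeId, toSet, schedEventsHandler); the setup, label registration, and surrounding test class are assumed.
{code}
// Sketch only: setup and label registration are assumed to be done by the
// surrounding test, as in the quoted snippet.
mgr.replaceLabelsOnNode(ImmutableMap.of(toNodeId("n1:0"), toSet("x")));
schedEventsHandler.receivedEvent = false;

// Updating the label of the same host:port from x to y must reach the
// scheduler as an update event, not be silently dropped as "no change".
mgr.replaceLabelsOnNode(ImmutableMap.of(toNodeId("n1:0"), toSet("y")));
assertTrue("Event should be sent when a node's label is replaced",
    schedEventsHandler.receivedEvent);
assertEquals("1 node label mapping modified", 1,
    schedEventsHandler.updatedNodeToLabels.size());
{code}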
[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM periodically for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944336#comment-14944336 ] Hudson commented on YARN-4176: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #1222 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1222/]) YARN-4176. Resync NM nodelabels with RM periodically for distributed (wangda: rev 30ac69c6bd3db363248d6c742561371576006dab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdaterForLabels.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/CHANGES.txt > Resync NM nodelabels with RM periodically for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Fix For: 2.8.0 > > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch, 0005-YARN-4176.patch > > > This JIRA is for handling the below set of issue > # Distributed nodelabels after NM registered with RM if cluster nodelabels > are removed and added then NM doesnt resend labels in heartbeat again untils > any change in labels > # NM registration failed with Nodelabels should resend labels again to RM > The above cases can be handled by resync nodeLabels with RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and will resend nodelabels to RM based on config no matter what the > registration fails or success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4176) Resync NM nodelabels with RM periodically for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4176: - Summary: Resync NM nodelabels with RM periodically for distributed nodelabels (was: Resync NM nodelabels with RM every x interval for distributed nodelabels) > Resync NM nodelabels with RM periodically for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch, 0005-YARN-4176.patch > > > This JIRA is for handling the below set of issue > # Distributed nodelabels after NM registered with RM if cluster nodelabels > are removed and added then NM doesnt resend labels in heartbeat again untils > any change in labels > # NM registration failed with Nodelabels should resend labels again to RM > The above cases can be handled by resync nodeLabels with RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and will resend nodelabels to RM based on config no matter what the > registration fails or success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944179#comment-14944179 ] Wangda Tan commented on YARN-4176: -- Latest patch LGTM, committing.. > Resync NM nodelabels with RM every x interval for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch, 0005-YARN-4176.patch > > > This JIRA is for handling the below set of issue > # Distributed nodelabels after NM registered with RM if cluster nodelabels > are removed and added then NM doesnt resend labels in heartbeat again untils > any change in labels > # NM registration failed with Nodelabels should resend labels again to RM > The above cases can be handled by resync nodeLabels with RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and will resend nodelabels to RM based on config no matter what the > registration fails or success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4178) [storage implementation] app id as string in row keys can cause incorrect ordering
[ https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944190#comment-14944190 ] Hadoop QA commented on YARN-4178: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 16m 2s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 8m 8s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 22s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 25s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 17s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 4s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 56s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 2m 49s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 41m 24s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12765073/YARN-4178-YARN-2928.04.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / 09c3576 | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/9354/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9354/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9354/console | This message was automatically generated. > [storage implementation] app id as string in row keys can cause incorrect > ordering > -- > > Key: YARN-4178 > URL: https://issues.apache.org/jira/browse/YARN-4178 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4178-YARN-2928.01.patch, > YARN-4178-YARN-2928.02.patch, YARN-4178-YARN-2928.03.patch, > YARN-4178-YARN-2928.04.patch > > > Currently the app id is used in various places as part of row keys. However, > currently they are treated as strings. This will cause a problem with > ordering when the id portion of the app id rolls over to the next digit. > For example, "app_1234567890_1" will be considered *earlier* than > "app_1234567890_". We should correct this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM periodically for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944195#comment-14944195 ] Hudson commented on YARN-4176: -- FAILURE: Integrated in Hadoop-trunk-Commit #8573 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8573/]) YARN-4176. Resync NM nodelabels with RM periodically for distributed (wangda: rev 30ac69c6bd3db363248d6c742561371576006dab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdaterForLabels.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java > Resync NM nodelabels with RM periodically for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Fix For: 2.8.0 > > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch, 0005-YARN-4176.patch > > > This JIRA is for handling the below set of issue > # Distributed nodelabels after NM registered with RM if cluster nodelabels > are removed and added then NM doesnt resend labels in heartbeat again untils > any change in labels > # NM registration failed with Nodelabels should resend labels again to RM > The above cases can be handled by resync nodeLabels with RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and will resend nodelabels to RM based on config no matter what the > registration fails or success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4162) Scheduler info in REST, is currently not displaying partition specific queue information similar to UI
[ https://issues.apache.org/jira/browse/YARN-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944144#comment-14944144 ] Wangda Tan commented on YARN-4162: -- Thanks [~Naganarasimha] for working on this! The sample REST response generally looks good; some comments: 1) I suggest renaming ResourceUsageInfo.partitionResourceUsages to resourceUsagesByPartition. Similarly, QueueCapacitiesInfo.partitionQueueCapacity -> queueCapacitiesByPartition. 2) I think pendingResource/amResource is also very important for user/leafQueue's resourceUsage. 3) I think it's better to move CapacitySchedulerInfo#getResourceUsageInfo/getQueueCapacitiesInfo to ResourceUsage#createResourceUsageInfo and QueueCapacities#createQueueCapacitiesInfo. With this, you can access the internal fields of ResourceUsage/QueueCapacities, and it feels more natural to me to create an -Info object from the class itself. I will include a more detailed code review in the next iteration. > Scheduler info in REST, is currently not displaying partition specific queue > information similar to UI > -- > > Key: YARN-4162 > URL: https://issues.apache.org/jira/browse/YARN-4162 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-4162.v1.001.patch, YARN-4162.v2.001.patch, > YARN-4162.v2.002.patch, restAndJsonOutput.zip > > > When Node Labels are enabled then REST Scheduler Information should also > provide partition specific queue information similar to the existing Web UI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
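To picture the direction of comments 1) and 3) above, here is a rough, simplified sketch. The class shapes are hypothetical (the real ResourceUsage/QueueCapacities track Resource objects per partition, not a single long), so treat it only as an illustration of renaming the per-partition map and letting the data holder build its own -Info view:
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins for the REST view and its source class.
class ResourceUsageInfo {
  // Suggested field name from the review: resourceUsagesByPartition
  Map<String, Long> resourceUsagesByPartition = new HashMap<>();
}

class ResourceUsage {
  private final Map<String, Long> usedByPartition = new HashMap<>();

  void setUsed(String partition, long memoryMb) {
    usedByPartition.put(partition, memoryMb);
  }

  // Factory method on the data holder itself, as comment 3) proposes,
  // instead of building the -Info inside CapacitySchedulerInfo.
  ResourceUsageInfo createResourceUsageInfo() {
    ResourceUsageInfo info = new ResourceUsageInfo();
    info.resourceUsagesByPartition.putAll(usedByPartition);
    return info;
  }
}
{code}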
[jira] [Commented] (YARN-3964) Support NodeLabelsProvider at Resource Manager side
[ https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944169#comment-14944169 ] Wangda Tan commented on YARN-3964: -- Thanks [~dian.fu] for working on this patch, and thanks to [~Naganarasimha]/[~sunilg] for the reviews. I think the approach/patch generally looks good and safe to me. [~devaraj.k], could you take care of the follow-up review work if you have bandwidth? > Support NodeLabelsProvider at Resource Manager side > --- > > Key: YARN-3964 > URL: https://issues.apache.org/jira/browse/YARN-3964 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, > YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, > YARN-3964.006.patch, YARN-3964.007.patch, YARN-3964.007.patch, > YARN-3964.008.patch, YARN-3964.009.patch, YARN-3964.010.patch, > YARN-3964.011.patch, YARN-3964.012.patch, YARN-3964.013.patch, > YARN-3964.014.patch, YARN-3964.015.patch, YARN-3964.1.patch > > > Currently, CLI/REST API is provided in Resource Manager to allow users to > specify labels for nodes. For labels which may change over time, users will > have to start a cron job to update the labels. This has the following > limitations: > - The cron job needs to be run in the YARN admin user. > - This makes it a little complicate to maintain as users will have to make > sure this service/daemon is alive. > Adding a Node Labels Provider in Resource Manager will provide user more > flexibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4061) [Fault tolerance] Fault tolerant writer for timeline v2
[ https://issues.apache.org/jira/browse/YARN-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944200#comment-14944200 ] Sangjin Lee commented on YARN-4061: --- I don't think the MR JHS is an apt comparison. First, we're dealing with a totally distributed writer situation (individual jobs) for the MR JHS whereas the RM timeline collector would be a single significant writer (again, it's the RM collector that I'm most worried about). Also, JHS writes only a few large files (job conf, job history files, etc.), whereas the timeline service will write a huge number of tiny writes. The volume of writes will be much larger than the JHS use case. Regarding the synchronous semantics, we really need to think it through. On the one hand, we might consider handling the synchronous calls separate from the rest and outside the log queue, but it's not clear how one can make it work alongside the asynchronous writes that are going on. > [Fault tolerance] Fault tolerant writer for timeline v2 > --- > > Key: YARN-4061 > URL: https://issues.apache.org/jira/browse/YARN-4061 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Li Lu > Attachments: FaulttolerantwriterforTimelinev2.pdf > > > We need to build a timeline writer that can be resistant to backend storage > down time and timeline collector failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
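The sync-versus-async tension raised above is easier to see with a toy model. The sketch below is purely illustrative (it is not from the attached design doc and ignores batching, spooling to disk, and failure handling): asynchronous writes are queued and drained by a background thread, while a synchronous write blocks until everything queued ahead of it has drained, which is exactly where latency becomes questionable once the RM collector is producing a large volume of tiny writes:
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Purely illustrative: an async append-log queue plus a "sync" call that
// blocks until everything queued so far has been drained.
class BufferedTimelineWriter {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

  BufferedTimelineWriter() {
    Thread drainer = new Thread(() -> {
      try {
        while (true) {
          queue.take().run();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "timeline-writer-drainer");
    drainer.setDaemon(true);
    drainer.start();
  }

  void writeAsync(Runnable write) {
    queue.add(write);
  }

  void writeSync(Runnable write) throws InterruptedException {
    CountDownLatch done = new CountDownLatch(1);
    queue.add(() -> {
      write.run();
      done.countDown();
    });
    done.await(); // synchronous caller waits behind everything already queued
  }
}
{code}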
[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM periodically for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944236#comment-14944236 ] Hudson commented on YARN-4176: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #483 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/483/]) YARN-4176. Resync NM nodelabels with RM periodically for distributed (wangda: rev 30ac69c6bd3db363248d6c742561371576006dab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdaterForLabels.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java > Resync NM nodelabels with RM periodically for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Fix For: 2.8.0 > > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch, 0005-YARN-4176.patch > > > This JIRA is for handling the below set of issue > # Distributed nodelabels after NM registered with RM if cluster nodelabels > are removed and added then NM doesnt resend labels in heartbeat again untils > any change in labels > # NM registration failed with Nodelabels should resend labels again to RM > The above cases can be handled by resync nodeLabels with RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and will resend nodelabels to RM based on config no matter what the > registration fails or success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3216) Max-AM-Resource-Percentage should respect node labels
[ https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944416#comment-14944416 ] Sunil G commented on YARN-3216: --- Thank you [~leftnoteasy] for sharing the comments. bq.Please let me know if there's any specific reason to add a new maximum-am-resource-partition. I agree with you. We could use the same configuration name under each label. bq.if there's any activated apps under a partition, it is equivalent to queueUsage.getAMUsed(partitionName) Yes, this will be enough. I kept a new map with the idea of maintaining some more information along similar lines to User, but as of now the suggested change is enough. I will remove the map. bq.you don't have to update am-resource when AM container just allocated, because AM-container.resource and am-resource-request.node-label-request won't be changed, but you need to update this if partition of AM-container's NM updated As I see it, we may need the changes below. - In FiCaSchedulerApp's ctor, update AM-Resource-Request.resource on partition (keep existing code), but use {{rmApp.getAMResourceRequest().getNodeLabelExpression()}} for setAMResource instead of setting it to NO_LABEL, because this information won't be changed later. - If the partition of the AM container's NM is updated, we need to change the AM resource, which I am handling in {{nodePartitionUpdated}} as below.
{code}
+if (rmContainer.isAMContainer()) {
+  setAppAMNodePartitionName(newPartition);
+  this.attemptResourceUsage.decAMUsed(oldPartition, containerResource);
+  this.attemptResourceUsage.incAMUsed(newPartition, containerResource);
+  getCSLeafQueue().decAMUsedResource(oldPartition, containerResource, this);
+  getCSLeafQueue().incAMUsedResource(newPartition, containerResource, this);
+}
{code}
Here AM-Resource-Request.resource is updated in FiCaSchedulerApp's ctor based on {{rmApp.getAMResourceRequest}}. Once the container is allocated, this resource becomes part of that partition with no change in the resource itself, so I feel I do not need to update the resource in the *allocate* call of FiCaSchedulerApp. Am I correct? - am-resource-percent per user per partition: Yes, I will raise a new ticket to handle this and will make the changes there instead of in this one. > Max-AM-Resource-Percentage should respect node labels > - > > Key: YARN-3216 > URL: https://issues.apache.org/jira/browse/YARN-3216 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3216.patch, 0002-YARN-3216.patch, > 0003-YARN-3216.patch > > > Currently, max-am-resource-percentage considers default_partition only. When > a queue can access multiple partitions, we should be able to compute > max-am-resource-percentage based on that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
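As a side illustration of the accounting discussed above, here is a tiny, self-contained model of "charge the AM resource to the AM container's partition, then move the charge when the node's partition changes". All names are generic stand-ins; this is not the FiCaSchedulerApp/LeafQueue code, only the bookkeeping idea:
{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative model only: AM resource is charged to the AM container's
// partition and re-charged when the node's partition changes.
class AmUsageByPartition {
  private final Map<String, Long> amUsedMb = new HashMap<>();
  private String amPartition;
  private final long amMemoryMb;

  // Constructor-time: take the partition from the AM resource request
  // (the comment's rmApp.getAMResourceRequest().getNodeLabelExpression()),
  // falling back to the default partition.
  AmUsageByPartition(String requestedPartition, long amMemoryMb) {
    this.amPartition = requestedPartition == null ? "" : requestedPartition;
    this.amMemoryMb = amMemoryMb;
    amUsedMb.merge(this.amPartition, amMemoryMb, Long::sum);
  }

  // nodePartitionUpdated: move the AM charge from the old partition to the new.
  void nodePartitionUpdated(String newPartition) {
    amUsedMb.merge(amPartition, -amMemoryMb, Long::sum);
    amUsedMb.merge(newPartition, amMemoryMb, Long::sum);
    amPartition = newPartition;
  }

  long getAmUsedMb(String partition) {
    return amUsedMb.getOrDefault(partition, 0L);
  }
}
{code}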
[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM periodically for distributed nodelabels
[ https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1490#comment-1490 ] Hudson commented on YARN-4176: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2397 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2397/]) YARN-4176. Resync NM nodelabels with RM periodically for distributed (wangda: rev 30ac69c6bd3db363248d6c742561371576006dab) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdaterForLabels.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java > Resync NM nodelabels with RM periodically for distributed nodelabels > > > Key: YARN-4176 > URL: https://issues.apache.org/jira/browse/YARN-4176 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Fix For: 2.8.0 > > Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, > 0003-YARN-4176.patch, 0004-YARN-4176.patch, 0005-YARN-4176.patch > > > This JIRA is for handling the below set of issue > # Distributed nodelabels after NM registered with RM if cluster nodelabels > are removed and added then NM doesnt resend labels in heartbeat again untils > any change in labels > # NM registration failed with Nodelabels should resend labels again to RM > The above cases can be handled by resync nodeLabels with RM every x interval > # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} > and will resend nodelabels to RM based on config no matter what the > registration fails or success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4162) Scheduler info in REST, is currently not displaying partition specific queue information similar to UI
[ https://issues.apache.org/jira/browse/YARN-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4162: Attachment: YARN-4162.v2.003.patch > Scheduler info in REST, is currently not displaying partition specific queue > information similar to UI > -- > > Key: YARN-4162 > URL: https://issues.apache.org/jira/browse/YARN-4162 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-4162.v1.001.patch, YARN-4162.v2.001.patch, > YARN-4162.v2.002.patch, YARN-4162.v2.003.patch, restAndJsonOutput.zip > > > When Node Labels are enabled then REST Scheduler Information should also > provide partition specific queue information similar to the existing Web UI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4162) Scheduler info in REST, is currently not displaying partition specific queue information similar to UI
[ https://issues.apache.org/jira/browse/YARN-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4162: Attachment: (was: YARN-4162.v2.003.patch) > Scheduler info in REST, is currently not displaying partition specific queue > information similar to UI > -- > > Key: YARN-4162 > URL: https://issues.apache.org/jira/browse/YARN-4162 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-4162.v1.001.patch, YARN-4162.v2.001.patch, > YARN-4162.v2.002.patch, restAndJsonOutput.zip > > > When Node Labels are enabled then REST Scheduler Information should also > provide partition specific queue information similar to the existing Web UI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4224) Change the REST interface to conform to current REST APIs' in YARN
[ https://issues.apache.org/jira/browse/YARN-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943298#comment-14943298 ] Varun Saxena commented on YARN-4224: [~sjlee0] raised this point on YARN-3864. Current REST API format does not conform to REST APIs' elsewhere in hadoop. As this is a user facing API, everyone can share their thoughts on this. > Change the REST interface to conform to current REST APIs' in YARN > -- > > Key: YARN-4224 > URL: https://issues.apache.org/jira/browse/YARN-4224 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4009) CORS support for ResourceManager REST API
[ https://issues.apache.org/jira/browse/YARN-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943315#comment-14943315 ] Varun Vasudev commented on YARN-4009: - The release audit and findbugs issues are unrelated to the patch. The tests pass on my local machine. > CORS support for ResourceManager REST API > - > > Key: YARN-4009 > URL: https://issues.apache.org/jira/browse/YARN-4009 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Prakash Ramachandran >Assignee: Varun Vasudev > Attachments: YARN-4009.001.patch, YARN-4009.002.patch, > YARN-4009.003.patch, YARN-4009.004.patch, YARN-4009.005.patch, > YARN-4009.006.patch > > > Currently the REST API's do not have CORS support. This means any UI (running > in browser) cannot consume the REST API's. For ex Tez UI would like to use > the REST API for getting application, application attempt information exposed > by the API's. > It would be very useful if CORS is enabled for the REST API's. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
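For readers unfamiliar with what "CORS support" means mechanically, the sketch below is a generic servlet filter that adds the response headers a browser needs before a UI hosted on another origin (such as Tez UI) may read the RM's REST responses. This is only a generic illustration; it is not the filter, property names, or configuration introduced by the attached patches:
{code}
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Generic CORS filter sketch: echo back the caller's Origin so a
// browser-hosted UI can consume the REST responses.
public class SimpleCorsFilter implements Filter {
  @Override
  public void init(FilterConfig filterConfig) {
  }

  @Override
  public void doFilter(ServletRequest req, ServletResponse res,
      FilterChain chain) throws IOException, ServletException {
    HttpServletRequest httpReq = (HttpServletRequest) req;
    HttpServletResponse httpRes = (HttpServletResponse) res;
    String origin = httpReq.getHeader("Origin");
    if (origin != null) {
      httpRes.setHeader("Access-Control-Allow-Origin", origin);
      httpRes.setHeader("Access-Control-Allow-Methods", "GET, OPTIONS, HEAD");
      httpRes.setHeader("Access-Control-Allow-Headers",
          "X-Requested-With, Content-Type, Accept, Origin");
    }
    chain.doFilter(req, res);
  }

  @Override
  public void destroy() {
  }
}
{code}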
[jira] [Commented] (YARN-4009) CORS support for ResourceManager REST API
[ https://issues.apache.org/jira/browse/YARN-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943157#comment-14943157 ] Hadoop QA commented on YARN-4009: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 27m 19s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 8m 4s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 30s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 19s | The applied patch generated 1 release audit warnings. | | {color:green}+1{color} | site | 3m 35s | Site still builds. | | {color:red}-1{color} | checkstyle | 3m 29s | The applied patch generated 6 new checkstyle issues (total was 0, now 6). | | {color:red}-1{color} | checkstyle | 4m 4s | The applied patch generated 2 new checkstyle issues (total was 211, now 212). | | {color:red}-1{color} | whitespace | 0m 2s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 55s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 11m 13s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 8m 1s | Tests passed in hadoop-common. | | {color:green}+1{color} | yarn tests | 0m 27s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 2m 8s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 4m 5s | Tests passed in hadoop-yarn-server-applicationhistoryservice. | | {color:green}+1{color} | yarn tests | 0m 29s | Tests passed in hadoop-yarn-server-common. | | {color:green}+1{color} | yarn tests | 9m 2s | Tests passed in hadoop-yarn-server-nodemanager. | | {color:red}-1{color} | yarn tests | 60m 23s | Tests failed in hadoop-yarn-server-resourcemanager. 
| | | | 154m 50s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens | | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.TestRM | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12764977/YARN-4009.006.patch | | Optional Tests | javadoc javac unit findbugs checkstyle site | | git revision | trunk / 30e2f83 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/9345/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-nodemanager.html | | Release Audit | https://builds.apache.org/job/PreCommit-YARN-Build/9345/artifact/patchprocess/patchReleaseAuditProblems.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9345/artifact/patchprocess/diffcheckstylehadoop-common.txt https://builds.apache.org/job/PreCommit-YARN-Build/9345/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/9345/artifact/patchprocess/whitespace.txt | | hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9345/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9345/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9345/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-applicationhistoryservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/9345/artifact/patchprocess/testrun_hadoop-yarn-server-applicationhistoryservice.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9345/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9345/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9345/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9345/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9345/console | This message was automatically generated. > CORS
[jira] [Created] (YARN-4224) Change the REST interface to conform to current REST APIs' in YARN
Varun Saxena created YARN-4224: -- Summary: Change the REST interface to conform to current REST APIs' in YARN Key: YARN-4224 URL: https://issues.apache.org/jira/browse/YARN-4224 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: YARN-2928 Reporter: Varun Saxena Assignee: Varun Saxena -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4224) Change the REST interface to conform to current REST APIs' in YARN
[ https://issues.apache.org/jira/browse/YARN-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943320#comment-14943320 ] Varun Saxena commented on YARN-4224: My proposal would be as under * *Query flows* Current REST API for querying flows is {{/ws/v2/timeline/flows/\{clusterid}/}} . This can be changed to : {panel} */ws/v2/timeline/\{clusterid}/flows* _Eg :_ /ws/v2/timeline/yarn_cluster/flows {panel} * *Query flowrun* Current REST API is {{/ws/v2/timeline/flowrun/\{clusterid}/\{flowid}/\{flowrunid}/}} . This can be changed to : {panel} */ws/v2/timeline/\{clusterid}/\{flowid}/run/\{flowrunid}* _Eg :_ /ws/v2/timeline/yarn_cluster/hive_flow/run/123 {panel} * *Query app* Current REST API in YARN-3864 is {{/ws/v2/timeline/app/\{clusterid}/\{appid}/}} . This can be changed to : {panel} */ws/v2/timeline/\{clusterid}/app/\{appid}* _Eg :_ /ws/v2/timeline/yarn_cluster/app/application_11_1345 {panel} * *Query apps for a flow* Current REST API in YARN-3864 is {{/ws/v2/timeline/flowapps/\{clusterid}/\{flowid}/}} . This can be changed to : {panel} */ws/v2/timeline/\{clusterid}/\{flowid}/apps* _Eg :_ /ws/v2/timeline/yarn_cluster/hive_flow/apps {panel} * *Query apps for a flowrun* Current REST API in YARN-3864 is {{/ws/v2/timeline/flowrunapps/\{clusterid}/\{flowid}/\{flowrunid}/}} . This can be changed to : {panel} */ws/v2/timeline/\{clusterid}/\{flowid}/\{flowrunid}/apps* _Eg :_ /ws/v2/timeline/yarn_cluster/hive_flow/123/apps {panel} * *Query entity* Current REST API is {{/ws/v2/timeline/entity/\{clusterid}/\{appid}/\{entitytype}/\{entityid}/}} . This can be changed to : {panel} */ws/v2/timeline/\{clusterid}/\{appid}/\{entitytype}/entity/\{entityid}* _Eg :_ /ws/v2/timeline/yarn_cluster/application_1444034548255_0001/YARN_CONTAINER/entity/container_1444034548255_0001_01_01 {panel} * *Query entities* Current REST API is {{/ws/v2/timeline/entities/\{clusterid}/\{appid}/\{entitytype}/}} . This can be changed to : {panel} */ws/v2/timeline/\{clusterid}/\{appid}/\{entitytype}/entities* _Eg :_ /ws/v2/timeline/yarn_cluster/application_1444034548255_0001/YARN_CONTAINER/entities {panel} > Change the REST interface to conform to current REST APIs' in YARN > -- > > Key: YARN-4224 > URL: https://issues.apache.org/jira/browse/YARN-4224 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
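If the proposal above is adopted, a client call would look roughly like the following. The host, port, cluster id, and application id are placeholders, and the path shape itself is still under discussion in this JIRA, so treat the URL as tentative:
{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Illustrative client call against the *proposed* entities path above.
public class TimelineQueryExample {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://timelinereader.example.com:8188"
        + "/ws/v2/timeline/yarn_cluster/application_1444034548255_0001"
        + "/YARN_CONTAINER/entities");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/json");
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
        conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    } finally {
      conn.disconnect();
    }
  }
}
{code}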
[jira] [Commented] (YARN-4223) Findbugs warnings in hadoop-yarn-server-nodemanager project
[ https://issues.apache.org/jira/browse/YARN-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943328#comment-14943328 ] Varun Saxena commented on YARN-4223: Release audit warning is unrelated. There is HDFS-9182 already for it. > Findbugs warnings in hadoop-yarn-server-nodemanager project > --- > > Key: YARN-4223 > URL: https://issues.apache.org/jira/browse/YARN-4223 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.1 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Minor > Attachments: FindBugs Report.html, YARN-4223.01.patch > > > {noformat} > classname='org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher'> >message='Unchecked/unconfirmed cast from > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncherEvent > to > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.SignalContainersLauncherEvent > in > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncherEvent)' > lineNumber='146'/> > > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers
[ https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-1509: Attachment: YARN-1509.4.patch Submit the new patch that fixes the whitespace issue > Make AMRMClient support send increase container request and get > increased/decreased containers > -- > > Key: YARN-1509 > URL: https://issues.apache.org/jira/browse/YARN-1509 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan (No longer used) >Assignee: MENG DING > Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, > YARN-1509.4.patch > > > As described in YARN-1197, we need add API in AMRMClient to support > 1) Add increase request > 2) Can get successfully increased/decreased containers from RM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943368#comment-14943368 ] Jason Lowe commented on YARN-4216: -- Yes, the document should be updated to cover that property. Did you try setting that property to true, and does it solve your issue? > Container logs not shown for newly assigned containers after NM recovery > -- > > Key: YARN-4216 > URL: https://issues.apache.org/jira/browse/YARN-4216 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml > > > Steps to reproduce > # Start 2 nodemanagers with NM recovery enabled > # Submit pi job with 20 maps > # Once 5 maps gets completed in NM 1 stop NM (yarn daemon stop nodemanager) > (Logs of all completed container gets aggregated to HDFS) > # Now start the NM1 again and wait for job completion > *The newly assigned container logs on NM1 are not shown* > *hdfs log dir state* > # When logs are aggregated to HDFS during stop its with NAME (localhost_38153) > # On log aggregation after starting NM the newly assigned container logs gets > uploaded with name (localhost_38153.tmp) > History server the logs are now shown for new task attempts -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1510) Make NMClient support change container resources
[ https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943406#comment-14943406 ] MENG DING commented on YARN-1510: - * The release audit is not related. * The failed test passed in my own environment after applying the patch, so it is not related. > Make NMClient support change container resources > > > Key: YARN-1510 > URL: https://issues.apache.org/jira/browse/YARN-1510 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Wangda Tan (No longer used) >Assignee: MENG DING > Attachments: YARN-1510-YARN-1197.1.patch, > YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch > > > As described in YARN-1197, YARN-1449, we need add API in NMClient to support > 1) sending request of increase/decrease container resource limits > 2) get succeeded/failed changed containers response from NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3216) Max-AM-Resource-Percentage should respect node labels
[ https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943425#comment-14943425 ] Sunil G commented on YARN-3216: --- Hi [~eepayne] Thank you for sharing the comments. As Naga mentioned, the am-resource-percent per-partition (per queue) configuration can also be fetched from REST. Also, as you mentioned, this information (such as "am resource usage per queue per partition") can be retrieved from the GUI, in the partition tab of the scheduler page. > Max-AM-Resource-Percentage should respect node labels > - > > Key: YARN-3216 > URL: https://issues.apache.org/jira/browse/YARN-3216 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3216.patch, 0002-YARN-3216.patch > > > Currently, max-am-resource-percentage considers default_partition only. When > a queue can access multiple partitions, we should be able to compute > max-am-resource-percentage based on that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4226) Make capacity scheduler queue's preemption status REST API consistent with GUI
Eric Payne created YARN-4226: Summary: Make capacity scheduler queue's preemption status REST API consistent with GUI Key: YARN-4226 URL: https://issues.apache.org/jira/browse/YARN-4226 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, yarn Affects Versions: 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Priority: Minor In the capacity scheduler GUI, the preemption status has the following form: {code} Preemption: disabled {code} However, the REST API shows the following for the same status: {code} "preemptionDisabled":true {code} The latter is confusing and should be consistent with the format in the GUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Component/s: capacity scheduler > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2902: --- Target Version/s: 2.7.2 (was: 2.8.0) > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: (was: YARN-3769-branch-2.002.patch) > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.7.002.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Summary: Add preemption status to yarn queue -status (was: Add preemption status to {{yarn queue -status}}) > Add preemption status to yarn queue -status > --- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers
[ https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943567#comment-14943567 ] Hadoop QA commented on YARN-1509: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 17s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 7m 59s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 18s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 15s | The applied patch generated 1 release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 30s | The applied patch generated 5 new checkstyle issues (total was 79, now 78). | | {color:green}+1{color} | whitespace | 0m 8s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 30s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 53s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 7m 31s | Tests passed in hadoop-yarn-client. | | | | 46m 1s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12765011/YARN-1509.4.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / b925cf1 | | Release Audit | https://builds.apache.org/job/PreCommit-YARN-Build/9346/artifact/patchprocess/patchReleaseAuditProblems.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9346/artifact/patchprocess/diffcheckstylehadoop-yarn-client.txt | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/9346/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9346/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9346/console | This message was automatically generated. > Make AMRMClient support send increase container request and get > increased/decreased containers > -- > > Key: YARN-1509 > URL: https://issues.apache.org/jira/browse/YARN-1509 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan (No longer used) >Assignee: MENG DING > Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, > YARN-1509.4.patch > > > As described in YARN-1197, we need add API in AMRMClient to support > 1) Add increase request > 2) Can get successfully increased/decreased containers from RM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
[ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943574#comment-14943574 ] Bibin A Chundatt commented on YARN-4216: When yarn.nodemanager.recovery.supervised=true and nodemanager stoppped abort aggregation is called 2015-10-05 20:17:20,634 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: *Aborting log aggregation for application_1444056058955_0002* {noformat} 2015-10-05 20:17:20,634 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder 2015-10-05 20:17:20,634 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit 2015-10-05 20:17:20,634 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Aborting log aggregation for application_1444056058955_0002 2015-10-05 20:17:20,634 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Aggregation did not complete for application application_1444056058955_0002 2015-10-05 20:17:20,639 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting. 2015-10-05 20:17:20,664 INFO org.apache.hadoop.ipc.Server: Stopping server on 8040 2015-10-05 20:17:20,665 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8040 2015-10-05 20:17:20,665 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder 2015-10-05 20:17:20,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting 2015-10-05 20:17:20,665 WARN org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is interrupted. Exiting. 2015-10-05 20:17:20,671 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system... 2015-10-05 20:17:20,674 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped. 
{noformat} Container logs are not cleaned up and uploaded to HDFS on stop. But decommission + NM restart while the application is running should cause the same missing-log scenario, as per {{LogAggregationService#stopAggregators}}:
{code}
boolean supervised = getConfig().getBoolean(
    YarnConfiguration.NM_RECOVERY_SUPERVISED,
    YarnConfiguration.DEFAULT_NM_RECOVERY_SUPERVISED);
// if recovery on restart is supported then leave outstanding aggregations
// to the next restart
boolean shouldAbort = context.getNMStateStore().canRecover()
    && !context.getDecommissioned() && supervised;
// politely ask to finish
for (AppLogAggregator aggregator : appLogAggregators.values()) {
  if (shouldAbort) {
    aggregator.abortLogAggregation();
  } else {
    aggregator.finishLogAggregation();
  }
}
{code}
> Container logs not shown for newly assigned containers after NM recovery > -- > > Key: YARN-4216 > URL: https://issues.apache.org/jira/browse/YARN-4216 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml > > > Steps to reproduce > # Start 2 nodemanagers with NM recovery enabled > # Submit pi job with 20 maps > # Once 5 maps gets completed in NM 1 stop NM (yarn daemon stop nodemanager) > (Logs of all completed container gets aggregated to HDFS) > # Now start the NM1 again and wait for job completion > *The newly assigned container logs on NM1 are not shown* > *hdfs log dir state* > # When logs are aggregated to HDFS during stop its with NAME (localhost_38153) > # On log aggregation after starting NM the newly assigned container logs gets > uploaded with name (localhost_38153.tmp) > History server the logs are now shown for new task attempts -- This message was sent by Atlassian JIRA (v6.3.4#6332)
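Restating the quoted decision as a tiny standalone helper makes the two scenarios in the comment explicit (supervised restart: abort and leave the logs for the recovered NM; decommission or unsupervised stop: finish aggregation now). Illustrative only, not the actual NodeManager code:
{code}
// Illustrative restatement of the shouldAbort decision quoted above.
public class LogAggregationStopDecision {
  static boolean shouldAbortAggregation(boolean canRecover,
      boolean decommissioned, boolean supervised) {
    // Leave outstanding aggregation to the next restart only when the NM
    // can recover its state, is not being decommissioned, and is supervised.
    return canRecover && !decommissioned && supervised;
  }

  public static void main(String[] args) {
    // Supervised stop with recovery enabled: abort, logs are picked up later.
    System.out.println(shouldAbortAggregation(true, false, true));  // true
    // Decommission: finish aggregation before the NM goes away.
    System.out.println(shouldAbortAggregation(true, true, true));   // false
  }
}
{code}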
[jira] [Commented] (YARN-3864) Implement support for querying single app and all apps for a flow run
[ https://issues.apache.org/jira/browse/YARN-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943645#comment-14943645 ] Sangjin Lee commented on YARN-3864: --- I kicked off another jenkins build. I have reviewed the latest patch (v.3), and it looks good to me for the most part. I have only a few minor comments. (TimelineReaderWebServices.java) - l.540: nit: let's use a normal Java style: {{req.getQueryString() == null}} - l.575: If we're calling this end point "flowrunapps", then shouldn't the method be called {{getFlowRunApps}}? The latter one seems to be named that. - Both for /flowrunapps and /flowapps, I understand it will return the most recent N apps if item is specified, correct? Then it should be stated in the javadoc. If you could address those, and with jenkins passing, I'd like to go ahead and commit the patch. Do let me know if you have other comments. Thanks! > Implement support for querying single app and all apps for a flow run > - > > Key: YARN-3864 > URL: https://issues.apache.org/jira/browse/YARN-3864 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Blocker > Attachments: YARN-3864-YARN-2928.01.patch, > YARN-3864-YARN-2928.02.patch, YARN-3864-YARN-2928.03.patch, > YARN-3864-addendum-appaggregation.patch > > > This JIRA will handle support for querying all apps for a flow run in HBase > reader implementation. > And also REST API implementation for single app and multiple apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4221) Store user in app to flow table
[ https://issues.apache.org/jira/browse/YARN-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943668#comment-14943668 ] Sangjin Lee commented on YARN-4221: --- Thanks for turning around and creating this patch quickly, [~varun_saxena]! I am in agreement with the approach taken in this patch. I'm going to take a closer look once we commit YARN-3864. One high level comment: I think we should document the behavior of the REST API with as much detail as possible. It should be very clear about what params are required and what params are optional, what type of contents would be returned, and in what order the entities will be, etc. The javadoc here is as important as the code itself. So for example, we should have plenty of documentation on where the user id is required and where it is optional. > Store user in app to flow table > --- > > Key: YARN-4221 > URL: https://issues.apache.org/jira/browse/YARN-4221 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Attachments: YARN-4221-YARN-2928.01.patch > > > We should store user as well in in app to flow table. > For queries where user is not supplied and flow context can be retrieved from > app to flow table, we should take the user from app to flow table instead of > considering UGI as default user. > This is as per discussion on YARN-3864 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers
[ https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943587#comment-14943587 ] MENG DING commented on YARN-1509: - * release audit is not related * will apply for exception for checkstyle: ** relaxed visibility is for testing purposes. ** function length exceeding limit is caused by long comments. > Make AMRMClient support send increase container request and get > increased/decreased containers > -- > > Key: YARN-1509 > URL: https://issues.apache.org/jira/browse/YARN-1509 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan (No longer used) >Assignee: MENG DING > Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, > YARN-1509.4.patch > > > As described in YARN-1197, we need add API in AMRMClient to support > 1) Add increase request > 2) Can get successfully increased/decreased containers from RM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3864) Implement support for querying single app and all apps for a flow run
[ https://issues.apache.org/jira/browse/YARN-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943613#comment-14943613 ] Hadoop QA commented on YARN-3864: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12764952/YARN-3864-addendum-appaggregation.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / b925cf1 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9348/console | This message was automatically generated. > Implement support for querying single app and all apps for a flow run > - > > Key: YARN-3864 > URL: https://issues.apache.org/jira/browse/YARN-3864 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Blocker > Attachments: YARN-3864-YARN-2928.01.patch, > YARN-3864-YARN-2928.02.patch, YARN-3864-YARN-2928.03.patch, > YARN-3864-addendum-appaggregation.patch > > > This JIRA will handle support for querying all apps for a flow run in HBase > reader implementation. > And also REST API implementation for single app and multiple apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4225) Add preemption status to {{yarn queue -status}}
Eric Payne created YARN-4225: Summary: Add preemption status to {{yarn queue -status}} Key: YARN-4225 URL: https://issues.apache.org/jira/browse/YARN-4225 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3769: - Attachment: YARN-3769-branch-2.002.patch > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3216) Max-AM-Resource-Percentage should respect node labels
[ https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3216: -- Attachment: 0003-YARN-3216.patch Hi [~leftnoteasy] Attaching v2 version of patch addressing the major comments. Kindly help to check the same. > Max-AM-Resource-Percentage should respect node labels > - > > Key: YARN-3216 > URL: https://issues.apache.org/jira/browse/YARN-3216 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3216.patch, 0002-YARN-3216.patch, > 0003-YARN-3216.patch > > > Currently, max-am-resource-percentage considers default_partition only. When > a queue can access multiple partitions, we should be able to compute > max-am-resource-percentage based on that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry
I don't think option 2, where you restart from 1, makes sense. It's also not a goal to minimize the total wait time. The goal should be to minimize the time to recover from short intermittent failures while also waiting long enough for long failures before giving up. On Oct 3, 2015 6:43 PM, "Neelesh Srinivas Salian (JIRA)" wrote: > > [ > https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942528#comment-14942528 > ] > > Neelesh Srinivas Salian commented on YARN-4185: > --- > > Thoughts: > 1) Using the exponentialBackoffRetry policy will have a progression of > wait time starting at 1sec per retry assuming it takes a second for the NM > to come up. > Hence exponentially, the backoff time increases 2,4,8,16...till 512 as we > approach 10 retries. > > 2) In the current strategy, the wait time is 10 seconds which causes an NM > that restarted in 1 second to wait for a retry. > > 3) In the event of the retries going forward, at the 3rd retry ( the wait > time is collectively 7 seconds (1+2+4) as per the exponential strategy) and > (30 (10+10+10) seconds as the current static retry) > > 4) If you keep retrying, collectively the waiting static retry has now > waited for 60 seconds versus 2^6 = 64 seconds in the exponential strategy > at the 6th retry attempt. > > Logic for the Design: > 1) In the event of retries being default to 10, >a. I propose after the 3rd attempt, we continue to keep the wait time > as 4 seconds and continue the same. >Thus the total time comes up to 1,2,4,4,4,4,4,4,4,4 = 35 seconds. >b. Versus collectively spending 100 seconds on waiting time in the > static retry strategy. > > 2) Alternatively, the logic could be: >a. Have the 1st 3 attempts of retry. If further needed, fall back to > the 1sec start of the same logic. > So, it looks like this.. (1,2,4) (1,2,4) (1,2,4) (1) for 10 > retries. >b. Thus we get the 10 retries done in collectively 22 seconds versus > 100 seconds. > > Requesting feedback. > Thank you. > > > Retry interval delay for NM client can be improved from the fixed static > retry > > > --- > > > > Key: YARN-4185 > > URL: https://issues.apache.org/jira/browse/YARN-4185 > > Project: Hadoop YARN > > Issue Type: Bug > >Reporter: Anubhav Dhoot > >Assignee: Neelesh Srinivas Salian > > > > Instead of having a fixed retry interval that starts off very high and > stays there, we are better off using an exponential backoff that has the > same fixed max limit. Today the retry interval is fixed at 10 sec that can > be unnecessarily high especially when NMs could rolling restart within a > sec. > > > > -- > This message was sent by Atlassian JIRA > (v6.3.4#6332) >
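To make the arithmetic in the quoted comment easy to check, here is a minimal standalone sketch of the two schedules being compared: exponential backoff capped at 4 seconds after the third attempt, versus a fully exponential one. It only illustrates the numbers above; it is not Hadoop's RetryPolicies implementation:
{code}
import java.util.ArrayList;
import java.util.List;

// Minimal illustration of the wait-time arithmetic discussed above.
public class BackoffSchedule {
  // schedule(10, 1000, 4000)           -> 1,2,4,4,4,4,4,4,4,4 seconds (35s total)
  // schedule(10, 1000, Long.MAX_VALUE) -> 1,2,4,...,512 seconds
  static List<Long> schedule(int retries, long baseMs, long capMs) {
    List<Long> waits = new ArrayList<>();
    long wait = baseMs;
    for (int i = 0; i < retries; i++) {
      waits.add(Math.min(wait, capMs));
      wait *= 2;
    }
    return waits;
  }

  public static void main(String[] args) {
    System.out.println("capped:   " + schedule(10, 1000L, 4000L));
    System.out.println("uncapped: " + schedule(10, 1000L, Long.MAX_VALUE));
  }
}
{code}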
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Summary: Add preemption status to yarn queue -status for capacity scheduler (was: Add preemption status to yarn queue -status) > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3864) Implement support for querying single app and all apps for a flow run
[ https://issues.apache.org/jira/browse/YARN-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943648#comment-14943648 ] Sangjin Lee commented on YARN-3864: --- Should have refreshed the page first. :) The jenkins is failing because it's testing the addendum patch. Built the patch locally, ran all the tests and findbugs. All seem fine. > Implement support for querying single app and all apps for a flow run > - > > Key: YARN-3864 > URL: https://issues.apache.org/jira/browse/YARN-3864 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Blocker > Attachments: YARN-3864-YARN-2928.01.patch, > YARN-3864-YARN-2928.02.patch, YARN-3864-YARN-2928.03.patch, > YARN-3864-addendum-appaggregation.patch > > > This JIRA will handle support for querying all apps for a flow run in HBase > reader implementation. > And also REST API implementation for single app and multiple apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943680#comment-14943680 ] Hadoop QA commented on YARN-3769: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 30s | Pre-patch branch-2 compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 5m 56s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 3s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 58s | The applied patch generated 6 new checkstyle issues (total was 145, now 150). | | {color:red}-1{color} | whitespace | 0m 6s | The patch has 26 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 15s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 28s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 56m 6s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 94m 23s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService | | | hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicyForNodePartitions | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12765015/YARN-3769-branch-2.002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | branch-2 / d843c50 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9347/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/9347/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9347/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9347/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9347/console | This message was automatically generated. > Preemption occurring unnecessarily because preemption doesn't consider user > limit > - > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.7.002.patch, YARN-3769.001.branch-2.7.patch, > YARN-3769.001.branch-2.8.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)