[jira] [Commented] (YARN-628) Fix YarnException unwrapping
[ https://issues.apache.org/jira/browse/YARN-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659251#comment-13659251 ] Vinod Kumar Vavilapalli commented on YARN-628: -- This looks perfect. Will run tests and check this in. > Fix YarnException unwrapping > > > Key: YARN-628 > URL: https://issues.apache.org/jira/browse/YARN-628 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.0.4-alpha >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: YARN-628.txt, YARN-628.txt, YARN-628.txt, YARN-628.txt.2 > > > Unwrapping of YarnRemoteExceptions (currently in YarnRemoteExceptionPBImpl, > RPCUtil post YARN-625) is broken, and often ends up throwin > UndeclaredThrowableException. This needs to be fixed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-628) Fix YarnException unwrapping
[ https://issues.apache.org/jira/browse/YARN-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659225#comment-13659225 ] Hadoop QA commented on YARN-628: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12583427/YARN-628.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/940//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/940//console This message is automatically generated. > Fix YarnException unwrapping > > > Key: YARN-628 > URL: https://issues.apache.org/jira/browse/YARN-628 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.0.4-alpha >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: YARN-628.txt, YARN-628.txt, YARN-628.txt, YARN-628.txt.2 > > > Unwrapping of YarnRemoteExceptions (currently in YarnRemoteExceptionPBImpl, > RPCUtil post YARN-625) is broken, and often ends up throwin > UndeclaredThrowableException. This needs to be fixed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-688) Containers not cleaned up when NM received SHUTDOWN event from NodeStatusUpdater
[ https://issues.apache.org/jira/browse/YARN-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659227#comment-13659227 ] Hadoop QA commented on YARN-688: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12583416/YARN-688.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/941//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/941//console This message is automatically generated. > Containers not cleaned up when NM received SHUTDOWN event from > NodeStatusUpdater > > > Key: YARN-688 > URL: https://issues.apache.org/jira/browse/YARN-688 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-688.1.patch > > > Currently, both SHUTDOWN event from nodeStatusUpdater and CleanupContainers > event happens to be on the same dispatcher thread, CleanupContainers Event > will not be processed until SHUTDOWN event is processed. see similar problem > on YARN-495. > On normal NM shutdown, this is not a problem since normal stop happens on > shutdownHook thread. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-628) Fix YarnException unwrapping
[ https://issues.apache.org/jira/browse/YARN-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated YARN-628: Attachment: YARN-628.txt Thanks! This patch did need an exhaustive review. bq. TestClientRMTokens: Can we explicitly test for InvalidToken? That'll be great if possible. Sure. Wasn't really trying to get all the exception verification in tests cleaned up in this patch. There's more in MR tests; I'll open a separate jira for this. bq. TestClientTokens: The exception thrown should always be RemoteException, so no need for the if condition, we should simply assert so. Done. bq. RPCUtil.instantiateException -> instantiateRemoteException? I've left this as is. It's not instantiating a remote exception. Maybe instantiateFromRemoteException, but I prefer the current name. Have renamed some of the other methods, some as suggested, others with slightly clearer names, and have added some comments to make the tests easier to understand. Also, added documentation to YarnRemoteException stating that derived classes must include a String only constructor. > Fix YarnException unwrapping > > > Key: YARN-628 > URL: https://issues.apache.org/jira/browse/YARN-628 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.0.4-alpha >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: YARN-628.txt, YARN-628.txt, YARN-628.txt, YARN-628.txt.2 > > > Unwrapping of YarnRemoteExceptions (currently in YarnRemoteExceptionPBImpl, > RPCUtil post YARN-625) is broken, and often ends up throwin > UndeclaredThrowableException. This needs to be fixed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
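As a rough illustration of the unwrapping idea discussed above — re-instantiating the remote exception on the client side through the String-only constructor the comment mentions — here is a minimal sketch. The class and method names are placeholders, not the actual RPCUtil code from the patch.

{code}
import java.lang.reflect.Constructor;

// Illustrative sketch only; the real logic lives in RPCUtil (post YARN-625).
public final class RemoteExceptionUnwrapper {

  public static Exception unwrap(String exceptionClassName, String message) {
    try {
      Class<?> clazz = Class.forName(exceptionClassName);
      // Derived YarnRemoteExceptions are expected to expose a String-only
      // constructor so they can be rebuilt locally from the remote class name.
      Constructor<?> ctor = clazz.getConstructor(String.class);
      return (Exception) ctor.newInstance(message);
    } catch (Exception e) {
      // Degrade gracefully instead of surfacing an UndeclaredThrowableException.
      return new RuntimeException(exceptionClassName + ": " + message);
    }
  }
}
{code}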
[jira] [Commented] (YARN-117) Enhance YARN service model
[ https://issues.apache.org/jira/browse/YARN-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659136#comment-13659136 ] Hadoop QA commented on YARN-117: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12583405/YARN-117-008.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 38 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 10 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy: org.apache.hadoop.mapreduce.v2.app.TestStagingCleanup {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/939//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/939//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/939//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/939//console This message is automatically generated. > Enhance YARN service model > -- > > Key: YARN-117 > URL: https://issues.apache.org/jira/browse/YARN-117 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.0.4-alpha >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-117-007.patch, YARN-117-008.patch, > YARN-117-2.patch, YARN-117-3.patch, YARN-117.4.patch, YARN-117.5.patch, > YARN-117.6.patch, YARN-117.patch > > > Having played the YARN service model, there are some issues > that I've identified based on past work and initial use. > This JIRA issue is an overall one to cover the issues, with solutions pushed > out to separate JIRAs. > h2. state model prevents stopped state being entered if you could not > successfully start the service. 
> In the current lifecycle you cannot stop a service unless it was successfully > started, but > * {{init()}} may acquire resources that need to be explicitly released > * if the {{start()}} operation fails partway through, the {{stop()}} > operation may be needed to release resources. > *Fix:* make {{stop()}} a valid state transition from all states and require > the implementations to be able to stop safely without requiring all fields to > be non null. > Before anyone points out that the {{stop()}} operations assume that all > fields are valid; and if called before a {{start()}} they will NPE; > MAPREDUCE-3431 shows that this problem arises today, MAPREDUCE-3502 is a fix > for this. It is independent of the rest of the issues in this doc but it will > aid making {{stop()}} execute from all states other than "stopped". > MAPREDUCE-3502 is too big a patch and needs to be broken down for easier > review and take up; this can be done with issues linked to this one. > h2. AbstractService doesn't prevent duplicate state change requests. > The {{ensureState()}} checks to verify whether or not a state transition is > allowed from the current state are performed in the base {{AbstractService}} > cla
[jira] [Updated] (YARN-688) Containers not cleaned up when NM received SHUTDOWN event from NodeStatusUpdater
[ https://issues.apache.org/jira/browse/YARN-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-688: - Attachment: YARN-688.1.patch This patch basically creates a new thread to handle the SHUTDOWN event from NodeStatusUpdater > Containers not cleaned up when NM received SHUTDOWN event from > NodeStatusUpdater > > > Key: YARN-688 > URL: https://issues.apache.org/jira/browse/YARN-688 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-688.1.patch > > > Currently, both SHUTDOWN event from nodeStatusUpdater and CleanupContainers > event happens to be on the same dispatcher thread, CleanupContainers Event > will not be processed until SHUTDOWN event is processed. see similar problem > on YARN-495. > On normal NM shutdown, this is not a problem since normal stop happens on > shutdownHook thread. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
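A minimal sketch of the approach the comment above describes, with hypothetical names rather than the actual NodeManager handler: the SHUTDOWN case hands the real stop off to a new thread, so the dispatcher thread stays free to deliver the CleanupContainers events.

{code}
// Sketch only; names are illustrative, not the actual NodeManager code.
public class ShutdownEventHandler {

  private final Runnable stopNodeManager; // e.g. a callback that invokes the NM stop sequence

  public ShutdownEventHandler(Runnable stopNodeManager) {
    this.stopNodeManager = stopNodeManager;
  }

  public void handleShutdown() {
    // Returning quickly keeps the dispatcher thread available for the
    // CleanupContainers events triggered during shutdown.
    new Thread(stopNodeManager, "nm-shutdown").start();
  }
}
{code}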
[jira] [Created] (YARN-688) Containers not cleaned up when NM received SHUTDOWN event from NodeStatusUpdater
Jian He created YARN-688: Summary: Containers not cleaned up when NM received SHUTDOWN event from NodeStatusUpdater Key: YARN-688 URL: https://issues.apache.org/jira/browse/YARN-688 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Currently, both the SHUTDOWN event from NodeStatusUpdater and the CleanupContainers event happen to be on the same dispatcher thread, so the CleanupContainers event will not be processed until the SHUTDOWN event is processed. See the similar problem in YARN-495. On normal NM shutdown this is not a problem, since the normal stop happens on the shutdownHook thread. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol
[ https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659091#comment-13659091 ] Carlo Curino commented on YARN-624: --- Alejandro, I completely agree gang scheduling is an important and missing use case. As I told you in person, I spoke with various machine-learning guys and they are very interested in gang scheduling (they are working on their own AM for ML computations). From the conversation I am convinced their asks represent a rather common requirement for many ML-type applications. In particular, they were interested in the "or" use-case you mentioned. Specifically they want to be able to express this: 1) 1 container with 128GB of RAM and 16 cores OR 2) 10 containers with 16GB of RAM and 2 cores OR 3) 100 containers with 2GB of RAM and 1 core In terms of locality I can see three main scenarios: 1) absolute locality, i.e., I need a gang of N containers on this rack, or on this set of nodes, 2) relative locality, i.e., I need a gang of N containers "close to each other" (this really captures more of a network property than anything else) 3) (no locality), i.e., I need a gang of N containers anywhere in the cluster > Support gang scheduling in the AM RM protocol > - > > Key: YARN-624 > URL: https://issues.apache.org/jira/browse/YARN-624 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, scheduler >Affects Versions: 2.0.4-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > Per discussion on YARN-392 and elsewhere, gang scheduling, in which a > scheduler runs a set of tasks when they can all be run at the same time, > would be a useful feature for YARN schedulers to support. > Currently, AMs can approximate this by holding on to containers until they > get all the ones they need. However, this lends itself to deadlocks when > different AMs are waiting on the same containers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
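To make the "or" use case concrete, one hypothetical way of modelling such alternative gang asks is sketched below. This is not an existing YARN API, just an illustration of the three alternatives listed in the comment above.

{code}
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of alternative gang asks; not an existing YARN API.
public class GangAlternatives {

  public static final class GangAsk {
    final int numContainers;
    final int memoryMB;
    final int vcores;

    GangAsk(int numContainers, int memoryMB, int vcores) {
      this.numContainers = numContainers;
      this.memoryMB = memoryMB;
      this.vcores = vcores;
    }
  }

  // The scheduler would be expected to satisfy exactly one alternative in full.
  public static List<GangAsk> mlExample() {
    return Arrays.asList(
        new GangAsk(1, 128 * 1024, 16),   // 1 container, 128GB, 16 cores
        new GangAsk(10, 16 * 1024, 2),    // 10 containers, 16GB, 2 cores
        new GangAsk(100, 2 * 1024, 1));   // 100 containers, 2GB, 1 core
  }
}
{code}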
[jira] [Updated] (YARN-117) Enhance YARN service model
[ https://issues.apache.org/jira/browse/YARN-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-117: Attachment: YARN-117-008.patch build diff from root of repository, so patch can apply it > Enhance YARN service model > -- > > Key: YARN-117 > URL: https://issues.apache.org/jira/browse/YARN-117 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.0.4-alpha >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-117-007.patch, YARN-117-008.patch, > YARN-117-2.patch, YARN-117-3.patch, YARN-117.4.patch, YARN-117.5.patch, > YARN-117.6.patch, YARN-117.patch > > > Having played the YARN service model, there are some issues > that I've identified based on past work and initial use. > This JIRA issue is an overall one to cover the issues, with solutions pushed > out to separate JIRAs. > h2. state model prevents stopped state being entered if you could not > successfully start the service. > In the current lifecycle you cannot stop a service unless it was successfully > started, but > * {{init()}} may acquire resources that need to be explicitly released > * if the {{start()}} operation fails partway through, the {{stop()}} > operation may be needed to release resources. > *Fix:* make {{stop()}} a valid state transition from all states and require > the implementations to be able to stop safely without requiring all fields to > be non null. > Before anyone points out that the {{stop()}} operations assume that all > fields are valid; and if called before a {{start()}} they will NPE; > MAPREDUCE-3431 shows that this problem arises today, MAPREDUCE-3502 is a fix > for this. It is independent of the rest of the issues in this doc but it will > aid making {{stop()}} execute from all states other than "stopped". > MAPREDUCE-3502 is too big a patch and needs to be broken down for easier > review and take up; this can be done with issues linked to this one. > h2. AbstractService doesn't prevent duplicate state change requests. > The {{ensureState()}} checks to verify whether or not a state transition is > allowed from the current state are performed in the base {{AbstractService}} > class -yet subclasses tend to call this *after* their own {{init()}}, > {{start()}} & {{stop()}} operations. This means that these operations can be > performed out of order, and even if the outcome of the call is an exception, > all actions performed by the subclasses will have taken place. MAPREDUCE-3877 > demonstrates this. > This is a tricky one to address. In HADOOP-3128 I used a base class instead > of an interface and made the {{init()}}, {{start()}} & {{stop()}} methods > {{final}}. These methods would do the checks, and then invoke protected inner > methods, {{innerStart()}}, {{innerStop()}}, etc. It should be possible to > retrofit the same behaviour to everything that extends {{AbstractService}} > -something that must be done before the class is considered stable (because > once the lifecycle methods are declared final, all subclasses that are out of > the source tree will need fixing by the respective developers. > h2. AbstractService state change doesn't defend against race conditions. > There's no concurrency locks on the state transitions. Whatever fix for wrong > state calls is added should correct this to prevent re-entrancy, such as > {{stop()}} being called from two threads. > h2. Static methods to choreograph of lifecycle operations > Helper methods to move things through lifecycles. 
init->start is common, > stop-if-service!=null another. Some static methods can execute these, and > even call {{stop()}} if {{init()}} raises an exception. These could go into a > class {{ServiceOps}} in the same package. These can be used by those services > that wrap other services, and help manage more robust shutdowns. > h2. state transition failures are something that registered service listeners > may wish to be informed of. > When a state transition fails a {{RuntimeException}} can be thrown -and the > service listeners are not informed as the notification point isn't reached. > They may wish to know this, especially for management and diagnostics. > *Fix:* extend {{ServiceStateChangeListener}} with a callback such as > {{stateChangeFailed(Service service,Service.State targeted-state, > RuntimeException e)}} that is invoked from the (final) state change methods > in the {{AbstractService}} class (once they delegate to their inner > {{innerStart()}}, {{innerStop()}} methods; make a no-op on the existing > implementations of the interface. > h2. Service listener failures not handled > Is this an error an error or not? Log and ignore may not be what is desired. > *Proposed:* during {{stop()}} any exception by a listener is caught and > discarde
[jira] [Commented] (YARN-666) [Umbrella] Support rolling upgrades in YARN
[ https://issues.apache.org/jira/browse/YARN-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659024#comment-13659024 ] Carlo Curino commented on YARN-666: --- Hi Vinod, I will give you some numbers but bear in mind that these results are very initial, based only on a handful of runs on a 9- or 10-machine cluster, and without serious tuning of terasort. The idea of the solution is for maps to write their output directly into HDFS (e.g., with replication turned down to 1). Reducers will be started only when maps complete and stream-merge straight out of HDFS (bypassing much of the partial merging logic). Key limitations of what we have for now: 1) if a map output is lost, all reducers will have to wait for it to be re-run; 2) we have lots of DFS clients open, which might become a problem for HDFS if you have too many maps per node. We initially tried this as a way to make checkpointing cheaper (no need to save any state other than last-processed key), and we were just hoping for it not to be too much worse than regular shuffle. The surprise I mentioned above was that we actually observe a surprisingly substantial speed-up on a simple sort job (on 9 nodes): 25% at 64GB scale and 31% at 1TB scale. This seems to indicate that the penalty of reading through HDFS is actually trumped by the benefits of doing a stream-merge (where data never touches disk on the reduce side, other than for reducer output). Probably this is reducing seeks and using the drives we read from and write to more efficiently. You can imagine getting similar benefits by adding restartability to the HTTP client (and the buffering done by the HDFS client, which was likely beneficial in our test). More sophisticated versions of these could also dynamically decide whether to stream-merge from a certain map or whether to copy the data (if, for example, they are small enough to fit in memory). Bottom line, I don't think we should read too much into these results (again, very initial), other than that using HDFS as an intermediate data layer is not completely infeasible. > [Umbrella] Support rolling upgrades in YARN > --- > > Key: YARN-666 > URL: https://issues.apache.org/jira/browse/YARN-666 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.0.4-alpha >Reporter: Siddharth Seth > Attachments: YARN_Rolling_Upgrades.pdf, YARN_Rolling_Upgrades_v2.pdf > > > Jira to track changes required in YARN to allow rolling upgrades, including > documentation and possible upgrade routes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
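A minimal sketch of the map-output-to-HDFS idea described in the comment above. The intermediate path layout is an assumption for illustration only; the real point is simply creating the file with replication turned down to 1.

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only; the intermediate-path naming below is made up.
public class MapOutputToHdfs {

  public static FSDataOutputStream openMapOutput(Configuration conf,
      String jobId, int mapId, int partition) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/tmp/intermediate/" + jobId
        + "/map_" + mapId + "/part_" + partition);
    // Replication 1 keeps the write cheap; losing the block means re-running
    // the map, which is the first limitation noted in the comment.
    return fs.create(out, (short) 1);
  }
}
{code}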
[jira] [Commented] (YARN-530) Define Service model strictly, implement AbstractService for robust subclassing, migrate yarn-common services
[ https://issues.apache.org/jira/browse/YARN-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658997#comment-13658997 ] Hadoop QA commented on YARN-530: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12583393/YARN-530-005.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/937//console This message is automatically generated. > Define Service model strictly, implement AbstractService for robust > subclassing, migrate yarn-common services > - > > Key: YARN-530 > URL: https://issues.apache.org/jira/browse/YARN-530 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-117changes.pdf, YARN-530-005.patch, > YARN-530-2.patch, YARN-530-3.patch, YARN-530.4.patch, YARN-530.patch > > > # Extend the YARN {{Service}} interface as discussed in YARN-117 > # Implement the changes in {{AbstractService}} and {{FilterService}}. > # Migrate all services in yarn-common to the more robust service model, test. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-117) Enhance YARN service model
[ https://issues.apache.org/jira/browse/YARN-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658993#comment-13658993 ] Hadoop QA commented on YARN-117: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12583395/YARN-117-007.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/938//console This message is automatically generated. > Enhance YARN service model > -- > > Key: YARN-117 > URL: https://issues.apache.org/jira/browse/YARN-117 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.0.4-alpha >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-117-007.patch, YARN-117-2.patch, YARN-117-3.patch, > YARN-117.4.patch, YARN-117.5.patch, YARN-117.6.patch, YARN-117.patch > > > Having played the YARN service model, there are some issues > that I've identified based on past work and initial use. > This JIRA issue is an overall one to cover the issues, with solutions pushed > out to separate JIRAs. > h2. state model prevents stopped state being entered if you could not > successfully start the service. > In the current lifecycle you cannot stop a service unless it was successfully > started, but > * {{init()}} may acquire resources that need to be explicitly released > * if the {{start()}} operation fails partway through, the {{stop()}} > operation may be needed to release resources. > *Fix:* make {{stop()}} a valid state transition from all states and require > the implementations to be able to stop safely without requiring all fields to > be non null. > Before anyone points out that the {{stop()}} operations assume that all > fields are valid; and if called before a {{start()}} they will NPE; > MAPREDUCE-3431 shows that this problem arises today, MAPREDUCE-3502 is a fix > for this. It is independent of the rest of the issues in this doc but it will > aid making {{stop()}} execute from all states other than "stopped". > MAPREDUCE-3502 is too big a patch and needs to be broken down for easier > review and take up; this can be done with issues linked to this one. > h2. AbstractService doesn't prevent duplicate state change requests. > The {{ensureState()}} checks to verify whether or not a state transition is > allowed from the current state are performed in the base {{AbstractService}} > class -yet subclasses tend to call this *after* their own {{init()}}, > {{start()}} & {{stop()}} operations. This means that these operations can be > performed out of order, and even if the outcome of the call is an exception, > all actions performed by the subclasses will have taken place. MAPREDUCE-3877 > demonstrates this. > This is a tricky one to address. In HADOOP-3128 I used a base class instead > of an interface and made the {{init()}}, {{start()}} & {{stop()}} methods > {{final}}. These methods would do the checks, and then invoke protected inner > methods, {{innerStart()}}, {{innerStop()}}, etc. It should be possible to > retrofit the same behaviour to everything that extends {{AbstractService}} > -something that must be done before the class is considered stable (because > once the lifecycle methods are declared final, all subclasses that are out of > the source tree will need fixing by the respective developers. > h2. AbstractService state change doesn't defend against race conditions. 
> There's no concurrency locks on the state transitions. Whatever fix for wrong > state calls is added should correct this to prevent re-entrancy, such as > {{stop()}} being called from two threads. > h2. Static methods to choreograph of lifecycle operations > Helper methods to move things through lifecycles. init->start is common, > stop-if-service!=null another. Some static methods can execute these, and > even call {{stop()}} if {{init()}} raises an exception. These could go into a > class {{ServiceOps}} in the same package. These can be used by those services > that wrap other services, and help manage more robust shutdowns. > h2. state transition failures are something that registered service listeners > may wish to be informed of. > When a state transition fails a {{RuntimeException}} can be thrown -and the > service listeners are not informed as the notification point isn't reached. > They may wish to know this, especially for management and diagnostics. > *Fix:* extend {{ServiceStateChangeListener}} with a callback such as > {{stateChangeFailed(Service service,Service.State targeted-state, > RuntimeException e)}} that is invoked from the (final) state change methods > in the {{AbstractService}} class (once they delegate to their in
[jira] [Commented] (YARN-638) Restore RMDelegationTokens after RM Restart
[ https://issues.apache.org/jira/browse/YARN-638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658989#comment-13658989 ] Hadoop QA commented on YARN-638: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12583374/YARN-638.11.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/934//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/934//console This message is automatically generated. > Restore RMDelegationTokens after RM Restart > --- > > Key: YARN-638 > URL: https://issues.apache.org/jira/browse/YARN-638 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-638.10.patch, YARN-638.11.patch, YARN-638.1.patch, > YARN-638.2.patch, YARN-638.3.patch, YARN-638.4.patch, YARN-638.5.patch, > YARN-638.6.patch, YARN-638.7.patch, YARN-638.8.patch, YARN-638.9.patch > > > This is missed in YARN-581. After RM restart, RMDelegationTokens need to be > added both in DelegationTokenRenewer (addressed in YARN-581), and > delegationTokenSecretManager -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-117) Enhance YARN service model
[ https://issues.apache.org/jira/browse/YARN-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658977#comment-13658977 ] Milind Bhandarkar commented on YARN-117: Hi, This is an automated message. Please do not reply to this email. If you are receiving this email, it must be because you sent an email to my old email address @EMC.com. Currently, all email sent to this address is being forwarded to my new email address: mbhandar...@gopivotal.com However, this forwarding will stop soon, and I will not be able to receive email sent to @EMC.com address. Please update your contacts DB with my new email address. Thank you. - milind > Enhance YARN service model > -- > > Key: YARN-117 > URL: https://issues.apache.org/jira/browse/YARN-117 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.0.4-alpha >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-117-007.patch, YARN-117-2.patch, YARN-117-3.patch, > YARN-117.4.patch, YARN-117.5.patch, YARN-117.6.patch, YARN-117.patch > > > Having played the YARN service model, there are some issues > that I've identified based on past work and initial use. > This JIRA issue is an overall one to cover the issues, with solutions pushed > out to separate JIRAs. > h2. state model prevents stopped state being entered if you could not > successfully start the service. > In the current lifecycle you cannot stop a service unless it was successfully > started, but > * {{init()}} may acquire resources that need to be explicitly released > * if the {{start()}} operation fails partway through, the {{stop()}} > operation may be needed to release resources. > *Fix:* make {{stop()}} a valid state transition from all states and require > the implementations to be able to stop safely without requiring all fields to > be non null. > Before anyone points out that the {{stop()}} operations assume that all > fields are valid; and if called before a {{start()}} they will NPE; > MAPREDUCE-3431 shows that this problem arises today, MAPREDUCE-3502 is a fix > for this. It is independent of the rest of the issues in this doc but it will > aid making {{stop()}} execute from all states other than "stopped". > MAPREDUCE-3502 is too big a patch and needs to be broken down for easier > review and take up; this can be done with issues linked to this one. > h2. AbstractService doesn't prevent duplicate state change requests. > The {{ensureState()}} checks to verify whether or not a state transition is > allowed from the current state are performed in the base {{AbstractService}} > class -yet subclasses tend to call this *after* their own {{init()}}, > {{start()}} & {{stop()}} operations. This means that these operations can be > performed out of order, and even if the outcome of the call is an exception, > all actions performed by the subclasses will have taken place. MAPREDUCE-3877 > demonstrates this. > This is a tricky one to address. In HADOOP-3128 I used a base class instead > of an interface and made the {{init()}}, {{start()}} & {{stop()}} methods > {{final}}. These methods would do the checks, and then invoke protected inner > methods, {{innerStart()}}, {{innerStop()}}, etc. It should be possible to > retrofit the same behaviour to everything that extends {{AbstractService}} > -something that must be done before the class is considered stable (because > once the lifecycle methods are declared final, all subclasses that are out of > the source tree will need fixing by the respective developers. > h2. 
AbstractService state change doesn't defend against race conditions. > There's no concurrency locks on the state transitions. Whatever fix for wrong > state calls is added should correct this to prevent re-entrancy, such as > {{stop()}} being called from two threads. > h2. Static methods to choreograph of lifecycle operations > Helper methods to move things through lifecycles. init->start is common, > stop-if-service!=null another. Some static methods can execute these, and > even call {{stop()}} if {{init()}} raises an exception. These could go into a > class {{ServiceOps}} in the same package. These can be used by those services > that wrap other services, and help manage more robust shutdowns. > h2. state transition failures are something that registered service listeners > may wish to be informed of. > When a state transition fails a {{RuntimeException}} can be thrown -and the > service listeners are not informed as the notification point isn't reached. > They may wish to know this, especially for management and diagnostics. > *Fix:* extend {{ServiceStateChangeListener}} with a callback such as > {{stateChangeFailed(Service service,Service.State targeted-state, > RuntimeException e)}} that is invoked fro
[jira] [Updated] (YARN-117) Enhance YARN service model
[ https://issues.apache.org/jira/browse/YARN-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-117: Attachment: YARN-117-007.patch Patch in sync with {{YARN-530-005.patch}}. # Adapts to the new {{serviceStart()}}, {{serviceStop()}}, {{serviceInit()}} names. # {{NodeManager}} shutdown is hardened to work from INITED # {{NodeStatusUpdater}} cross-thread stop flag marked as {{volatile}} # Various tests more rigorous about stopping services on failure/exit > Enhance YARN service model > -- > > Key: YARN-117 > URL: https://issues.apache.org/jira/browse/YARN-117 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.0.4-alpha >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-117-007.patch, YARN-117-2.patch, YARN-117-3.patch, > YARN-117.4.patch, YARN-117.5.patch, YARN-117.6.patch, YARN-117.patch > > > Having played the YARN service model, there are some issues > that I've identified based on past work and initial use. > This JIRA issue is an overall one to cover the issues, with solutions pushed > out to separate JIRAs. > h2. state model prevents stopped state being entered if you could not > successfully start the service. > In the current lifecycle you cannot stop a service unless it was successfully > started, but > * {{init()}} may acquire resources that need to be explicitly released > * if the {{start()}} operation fails partway through, the {{stop()}} > operation may be needed to release resources. > *Fix:* make {{stop()}} a valid state transition from all states and require > the implementations to be able to stop safely without requiring all fields to > be non null. > Before anyone points out that the {{stop()}} operations assume that all > fields are valid; and if called before a {{start()}} they will NPE; > MAPREDUCE-3431 shows that this problem arises today, MAPREDUCE-3502 is a fix > for this. It is independent of the rest of the issues in this doc but it will > aid making {{stop()}} execute from all states other than "stopped". > MAPREDUCE-3502 is too big a patch and needs to be broken down for easier > review and take up; this can be done with issues linked to this one. > h2. AbstractService doesn't prevent duplicate state change requests. > The {{ensureState()}} checks to verify whether or not a state transition is > allowed from the current state are performed in the base {{AbstractService}} > class -yet subclasses tend to call this *after* their own {{init()}}, > {{start()}} & {{stop()}} operations. This means that these operations can be > performed out of order, and even if the outcome of the call is an exception, > all actions performed by the subclasses will have taken place. MAPREDUCE-3877 > demonstrates this. > This is a tricky one to address. In HADOOP-3128 I used a base class instead > of an interface and made the {{init()}}, {{start()}} & {{stop()}} methods > {{final}}. These methods would do the checks, and then invoke protected inner > methods, {{innerStart()}}, {{innerStop()}}, etc. It should be possible to > retrofit the same behaviour to everything that extends {{AbstractService}} > -something that must be done before the class is considered stable (because > once the lifecycle methods are declared final, all subclasses that are out of > the source tree will need fixing by the respective developers. > h2. AbstractService state change doesn't defend against race conditions. > There's no concurrency locks on the state transitions. 
Whatever fix for wrong > state calls is added should correct this to prevent re-entrancy, such as > {{stop()}} being called from two threads. > h2. Static methods to choreograph of lifecycle operations > Helper methods to move things through lifecycles. init->start is common, > stop-if-service!=null another. Some static methods can execute these, and > even call {{stop()}} if {{init()}} raises an exception. These could go into a > class {{ServiceOps}} in the same package. These can be used by those services > that wrap other services, and help manage more robust shutdowns. > h2. state transition failures are something that registered service listeners > may wish to be informed of. > When a state transition fails a {{RuntimeException}} can be thrown -and the > service listeners are not informed as the notification point isn't reached. > They may wish to know this, especially for management and diagnostics. > *Fix:* extend {{ServiceStateChangeListener}} with a callback such as > {{stateChangeFailed(Service service,Service.State targeted-state, > RuntimeException e)}} that is invoked from the (final) state change methods > in the {{AbstractService}} class (once they delegate to their inner > {{innerStart()}}, {{innerStop()}} methods; make a no-op on the ex
[jira] [Commented] (YARN-530) Define Service model strictly, implement AbstractService for robust subclassing, migrate yarn-common services
[ https://issues.apache.org/jira/browse/YARN-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658961#comment-13658961 ] Steve Loughran commented on YARN-530: - h3. Service bq. {{start()}} doesn't use {{enterState()}} API, so we don't capture the life-cycle change events. -fixed bq. {{init()}}, {{start()}} and {{stop()}} aren't synchronized, so callers of {{getServiceState()}} will get incorrect information if, let's say, {{innerInit}} is still in progress. This is interesting. I've realised they need to be sync or you add an "stateChangeInProgress" marker to ensure you eliminate the race of someone trying to {{stop()}} a service half-way through {{start()}}. Some of the more complex state models make the 'starting', 'stopping' states explicit, which is another option. * I'm going to make the methods {{synchronized}} to stop the risk of any race conditions -while doing a best effort at keeping the notifications outside the {{synchronized}} block. The corner case here is that when an init or start fails and {{stop()}} is called automatically: it will notify its listeners inside the {{stop()}} call. * I'm going to keep the state queries unsynced (reading a volatile), so that doesn't stop things that only want to read and not manipulate service state from blocking. bq. Rename {{inState}} to {{isInState}}? done bq. {{waitForServiceToStop}} seems redundant because we also have listeners, right? Sure one is a blocking call while the other is async. I'd remove it unless there is some other strong reason. May be we can implement an async utility using{{ getServiceState()}} and implement a generic {{waitForServiceState}} instead of just for stop? -let me add something that isn't on the critical path for the next alpha, as it isn't an API change: an entry point to start a service by its name. I've just added YARN-679 to show the use case here: an entry point to start any service by its classname. That isn't ready to be committed (where are the tests!), but it shows the vision. I'll see if I can implement it with the notifications. bq. What is the use of 'blockers'? An attempt to make the reason a service blocks visible, at least when it is consciously blocking (e.g. spin/sleep waiting for manager node). Unconscious blocking, by way of blocking API calls, will still happen. If the service can declare that they are blocked by something, then other tooling can say "what is this service waiting for" bq. May be LifecycleEvent move to top-level? -done bq. Not part of your patch, but we may as well take this opportunity to fix this: Rename register -> registerServiceListener? Similarly unregister. -done h3. AbstractService bq. The inner* method-names don't look good when using the service stuff. Shall we rename it to to say, for e.g, innerInit to initService ? bq. Mark all the super init, start, stop methods as final? Gladly, though it does cause a mockito test to fail: {code} testResourceRelease(org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService) Time elapsed: 247 sec <<< FAILURE! 
java.lang.AssertionError: null state in null class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker$$EnhancerByMockitoWithCGLIB$$221b8e7 at org.apache.hadoop.yarn.service.AbstractService.enterState(AbstractService.java:431) at org.apache.hadoop.yarn.service.AbstractService.init(AbstractService.java:151) at org.apache.hadoop.yarn.service.CompositeService.serviceInit(CompositeService.java:67) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.serviceInit(ResourceLocalizationService.java:240) at org.apache.hadoop.yarn.service.AbstractService.init(AbstractService.java:154) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testResourceRelease(TestResourceLocalizationService.java:239) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassR
[jira] [Updated] (YARN-530) Define Service model strictly, implement AbstractService for robust subclassing, migrate yarn-common services
[ https://issues.apache.org/jira/browse/YARN-530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-530: Attachment: YARN-530-005.patch > Define Service model strictly, implement AbstractService for robust > subclassing, migrate yarn-common services > - > > Key: YARN-530 > URL: https://issues.apache.org/jira/browse/YARN-530 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-117changes.pdf, YARN-530-005.patch, > YARN-530-2.patch, YARN-530-3.patch, YARN-530.4.patch, YARN-530.patch > > > # Extend the YARN {{Service}} interface as discussed in YARN-117 > # Implement the changes in {{AbstractService}} and {{FilterService}}. > # Migrate all services in yarn-common to the more robust service model, test. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
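For readers following the YARN-530/YARN-117 discussion above, here is a compressed, self-contained sketch of the shape being converged on: final, synchronized lifecycle methods in the base class delegating to overridable service hooks, with the state query reading a volatile so it never blocks. The real AbstractService in the attached patch additionally handles listeners, enterState() checks and failure paths; the names below are simplified.

{code}
// Simplified sketch only; not the actual AbstractService from the patch.
public abstract class LifecycleSketch {

  public enum State { NOTINITED, INITED, STARTED, STOPPED }

  private volatile State state = State.NOTINITED;

  public final synchronized void init() throws Exception {
    serviceInit();
    state = State.INITED;
  }

  public final synchronized void start() throws Exception {
    serviceStart();
    state = State.STARTED;
  }

  public final synchronized void stop() throws Exception {
    serviceStop();          // expected to be safe from any state, even a failed start
    state = State.STOPPED;
  }

  /** Unsynchronized read of a volatile, so state queries never block on a transition. */
  public State getServiceState() {
    return state;
  }

  protected void serviceInit() throws Exception {}
  protected void serviceStart() throws Exception {}
  protected void serviceStop() throws Exception {}
}
{code}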
[jira] [Commented] (YARN-366) Add a tracing async dispatcher to simplify debugging
[ https://issues.apache.org/jira/browse/YARN-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658867#comment-13658867 ] Hadoop QA commented on YARN-366: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12583378/YARN-366-4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/936//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/936//console This message is automatically generated. > Add a tracing async dispatcher to simplify debugging > > > Key: YARN-366 > URL: https://issues.apache.org/jira/browse/YARN-366 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Affects Versions: 2.0.2-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-366-1.patch, YARN-366-2.patch, YARN-366-3.patch, > YARN-366-4.patch, YARN-366.patch > > > Exceptions thrown in YARN/MR code with asynchronous event handling do not > contain informative stack traces, as all handle() methods sit directly under > the dispatcher thread's loop. > This makes errors very difficult to debug for those who are not intimately > familiar with the code, as it is difficult to see which chain of events > caused a particular outcome. > I propose adding an AsyncDispatcher that instruments events with tracing > information. Whenever an event is dispatched during the handling of another > event, the dispatcher would annotate that event with a pointer to its parent. > When the dispatcher catches an exception, it could reconstruct a "stack" > trace of the chain of events that led to it, and be able to log something > informative. > This would be an experimental feature, off by default, unless extensive > testing showed that it did not have a significant performance impact. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
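To illustrate the tracing idea in the description above with hypothetical classes (this is not the attached patch): each dispatched event keeps a reference to the event being handled when it was created, and on an exception the chain can be walked to produce an event "stack".

{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of event-parent tracing; not the patch's AsyncDispatcher.
public class TracedEvent {

  private final String description;
  private final TracedEvent parent; // event being handled when this one was dispatched

  public TracedEvent(String description, TracedEvent parent) {
    this.description = description;
    this.parent = parent;
  }

  /** Reconstructs the chain of events that led to this one, newest first. */
  public List<String> eventTrace() {
    List<String> trace = new ArrayList<String>();
    for (TracedEvent e = this; e != null; e = e.parent) {
      trace.add(e.description);
    }
    return trace;
  }
}
{code}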
[jira] [Commented] (YARN-617) In unsercure mode, AM can fake resource requirements
[ https://issues.apache.org/jira/browse/YARN-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658858#comment-13658858 ] Hadoop QA commented on YARN-617: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12583375/YARN-617-20130515.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 22 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/935//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/935//console This message is automatically generated. > In unsercure mode, AM can fake resource requirements > - > > Key: YARN-617 > URL: https://issues.apache.org/jira/browse/YARN-617 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi >Priority: Minor > Attachments: YARN-617.20130501.1.patch, YARN-617.20130501.patch, > YARN-617.20130502.patch, YARN-617-20130507.patch, YARN-617.20130508.patch, > YARN-617-20130513.patch, YARN-617-20130515.patch > > > Without security, it is impossible to completely avoid AMs faking resources. > We can at the least make it as difficult as possible by using the same > container tokens and the RM-NM shared key mechanism over unauthenticated > RM-NM channel. > In the minimum, this will avoid accidental bugs in AMs in unsecure mode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
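As a generic illustration of the shared-key mechanism mentioned in the description — not the actual container token or secret-manager code — the RM could sign a container's resource claim with a key it shares with the NM, and the NM recomputes the signature before launching, so an AM cannot quietly inflate its resources.

{code}
import java.util.Arrays;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Generic HMAC illustration; the real mechanism uses different classes and wire formats.
public class SharedKeyTokenCheck {

  public static byte[] sign(byte[] sharedKey, String containerClaim) throws Exception {
    Mac mac = Mac.getInstance("HmacSHA1");
    mac.init(new SecretKeySpec(sharedKey, "HmacSHA1"));
    return mac.doFinal(containerClaim.getBytes("UTF-8"));
  }

  /** NM side: recompute the signature and compare it with what was presented. */
  public static boolean verify(byte[] sharedKey, String containerClaim,
      byte[] presentedSignature) throws Exception {
    return Arrays.equals(sign(sharedKey, containerClaim), presentedSignature);
  }
}
{code}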
[jira] [Commented] (YARN-687) TestNMAuditLogger hang
[ https://issues.apache.org/jira/browse/YARN-687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658843#comment-13658843 ] Steve Loughran commented on YARN-687: - thread dump {code} Running org.apache.hadoop.yarn.server.nodemanager.TestNMAuditLogger 2013-05-15 21:21:39 Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode): "IPC Server handler 4 on 32868" daemon prio=10 tid=0x7fc9ec4da000 nid=0x359e waiting on condition [0x7fc9e8af9000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0xebaf1720> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1817) "IPC Server handler 3 on 32868" daemon prio=10 tid=0x7fc9ec4b8000 nid=0x359d waiting on condition [0x7fc9e8bfa000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0xebaf1720> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1817) "IPC Server handler 2 on 32868" daemon prio=10 tid=0x7fc9ec4b7000 nid=0x359c waiting on condition [0x7fc9e8cfb000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0xebaf1720> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1817) "IPC Server handler 1 on 32868" daemon prio=10 tid=0x7fc9ec41b800 nid=0x359b waiting on condition [0x7fc9e8dfc000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0xebaf1720> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1817) "IPC Server handler 0 on 32868" daemon prio=10 tid=0x7fc9ec414000 nid=0x359a waiting on condition [0x7fc9e8efd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0xebaf1720> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1817) "IPC Server 
listener on 32868" daemon prio=10 tid=0x7fc9ec3eb000 nid=0x3599 runnable [0x7fc9e8ffe000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:228) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:83) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87) - locked <0xebaf2800> (a sun.nio.ch.Util$1) - locked <0xebaf27f0> (a java.util.Collections$UnmodifiableSet) - locked <0xebaf2380> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102) at org.apache.hadoop.ipc.Server$Listener.run(Server.java:678) "IPC Server Responder" daemon prio=10 tid=0x7fc9ec4bb000 nid=0x3598 runnable [0x7fc9f0172000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
[jira] [Created] (YARN-687) TestNMAuditLogger hang
Steve Loughran created YARN-687: --- Summary: TestNMAuditLogger hang Key: YARN-687 URL: https://issues.apache.org/jira/browse/YARN-687 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0 Environment: Linux stevel-dev 3.2.0-24-virtual #39-Ubuntu SMP Mon May 21 18:44:18 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux java version "1.6.0_27" OpenJDK Runtime Environment (IcedTea6 1.12.3) (6b27-1.12.3-0ubuntu1~12.04.1) OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode) Reporter: Steve Loughran Priority: Minor TestNMAuditLogger hanging repeatedly on a test VM -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-656) In scheduler UI, including reserved memory in "Memory Total" can make it exceed cluster capacity.
[ https://issues.apache.org/jira/browse/YARN-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-656: Summary: In scheduler UI, including reserved memory in "Memory Total" can make it exceed cluster capacity. (was: In scheduler UI, including reserved memory "Memory Total" can make it exceed cluster capacity.) > In scheduler UI, including reserved memory in "Memory Total" can make it > exceed cluster capacity. > - > > Key: YARN-656 > URL: https://issues.apache.org/jira/browse/YARN-656 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.0.4-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > "Memory Total" is currently a sum of availableMB, allocatedMB, and > reservedMB. Including reservedMB in this sum can make the total exceed the > capacity of the cluster. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
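A made-up numeric illustration of the issue described above: when reservedMB is added on top of availableMB and allocatedMB, the displayed "Memory Total" can come out larger than the cluster actually has. The figures below are purely hypothetical and are not taken from the JIRA.
{code}
public class MemoryTotalExample {
  public static void main(String[] args) {
    // Hypothetical 100 GB cluster.
    int clusterCapacityMB = 100 * 1024;
    int allocatedMB = 90 * 1024;   // memory of containers currently running
    int availableMB = 10 * 1024;   // capacity minus allocated
    int reservedMB  = 8 * 1024;    // memory reserved on busy nodes for a pending request

    // Computation described in the issue: available + allocated + reserved.
    int memoryTotal = availableMB + allocatedMB + reservedMB;
    System.out.println("Memory Total = " + memoryTotal + " MB, capacity = "
        + clusterCapacityMB + " MB");  // prints 110592 MB > 102400 MB
  }
}
{code}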
[jira] [Updated] (YARN-366) Add a tracing async dispatcher to simplify debugging
[ https://issues.apache.org/jira/browse/YARN-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-366: Attachment: YARN-366-4.patch > Add a tracing async dispatcher to simplify debugging > > > Key: YARN-366 > URL: https://issues.apache.org/jira/browse/YARN-366 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Affects Versions: 2.0.2-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-366-1.patch, YARN-366-2.patch, YARN-366-3.patch, > YARN-366-4.patch, YARN-366.patch > > > Exceptions thrown in YARN/MR code with asynchronous event handling do not > contain informative stack traces, as all handle() methods sit directly under > the dispatcher thread's loop. > This makes errors very difficult to debug for those who are not intimately > familiar with the code, as it is difficult to see which chain of events > caused a particular outcome. > I propose adding an AsyncDispatcher that instruments events with tracing > information. Whenever an event is dispatched during the handling of another > event, the dispatcher would annotate that event with a pointer to its parent. > When the dispatcher catches an exception, it could reconstruct a "stack" > trace of the chain of events that led to it, and be able to log something > informative. > This would be an experimental feature, off by default, unless extensive > testing showed that it did not have a significant performance impact. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
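For readers skimming the description, here is a minimal sketch of the idea: each event keeps a pointer to the event whose handling produced it, and the chain is walked back when an exception is caught. The classes below are illustrative only; they are not the implementation in the attached patches.
{code}
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of event tracing; not the YARN-366 patch itself. */
class TracedEvent {
  final String name;
  final TracedEvent parent;  // event being handled when this one was dispatched

  TracedEvent(String name, TracedEvent parent) {
    this.name = name;
    this.parent = parent;
  }

  /** Reconstruct a "stack" of event names from this event back to the root. */
  List<String> traceBack() {
    List<String> chain = new ArrayList<String>();
    for (TracedEvent e = this; e != null; e = e.parent) {
      chain.add(e.name);
    }
    return chain;
  }
}

public class TracingDispatcherSketch {
  public static void main(String[] args) {
    TracedEvent appSubmitted = new TracedEvent("APP_SUBMITTED", null);
    TracedEvent attemptStart = new TracedEvent("ATTEMPT_START", appSubmitted);
    TracedEvent containerAlloc = new TracedEvent("CONTAINER_ALLOCATED", attemptStart);
    // On an exception while handling containerAlloc, log the chain of causes:
    System.out.println("event trace: " + containerAlloc.traceBack());
  }
}
{code}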
[jira] [Commented] (YARN-366) Add a tracing async dispatcher to simplify debugging
[ https://issues.apache.org/jira/browse/YARN-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658836#comment-13658836 ] Sandy Ryza commented on YARN-366: - Uploading a patch to address the findbugs warnings. > Add a tracing async dispatcher to simplify debugging > > > Key: YARN-366 > URL: https://issues.apache.org/jira/browse/YARN-366 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Affects Versions: 2.0.2-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-366-1.patch, YARN-366-2.patch, YARN-366-3.patch, > YARN-366-4.patch, YARN-366.patch > > > Exceptions thrown in YARN/MR code with asynchronous event handling do not > contain informative stack traces, as all handle() methods sit directly under > the dispatcher thread's loop. > This makes errors very difficult to debug for those who are not intimately > familiar with the code, as it is difficult to see which chain of events > caused a particular outcome. > I propose adding an AsyncDispatcher that instruments events with tracing > information. Whenever an event is dispatched during the handling of another > event, the dispatcher would annotate that event with a pointer to its parent. > When the dispatcher catches an exception, it could reconstruct a "stack" > trace of the chain of events that led to it, and be able to log something > informative. > This would be an experimental feature, off by default, unless extensive > testing showed that it did not have a significant performance impact. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-613) Create NM proxy per NM instead of per container
[ https://issues.apache.org/jira/browse/YARN-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658816#comment-13658816 ] Bikas Saha commented on YARN-613: - To be clear, on the AM the behavior is always to take the tokens coming in the allocate response and set them in the UGI (overriding old values). They will be picked up from the UGI by NMClient during communication. The behavior on the NM will be to always authenticate based on the current master key. This is always the latest correct value and, in the majority of the use cases, this master key will be identical to the cached appId-MasterKey. If the master key matches the incoming token then the master key is used as the new value of the cached appId-master-key. If the master key fails to validate the token (long running apps), then the appId-master-key is used to validate the token. It would be great to take the solution and break the work into separate JIRAs, e.g. AMRMProtocol addition, NMRM protocol changes, RM server changes, NM server changes, AMRMClient changes, NMClient changes.
bq. If we don't need to preserve the work, (AM and container will be killed after RM restarts) then there will be no problem at all even with above implementation in which case as applications are already killed so we can just clear the cache on NM.
If this cache is per appId then it cannot be removed when the appAttempt completes. It will be removed when the application completes. During NM resync we should not invalidate the cache. The cache is required for work-preserving restart and will automatically be refreshed by the above logic for non-work-preserving restart. > Create NM proxy per NM instead of per container > --- > > Key: YARN-613 > URL: https://issues.apache.org/jira/browse/YARN-613 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Omkar Vinit Joshi > > Currently a new NM proxy has to be created per container since the secure > authentication is using a containertoken from the container. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
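To make the NM-side rule in the comment above concrete, here is a minimal, hedged sketch of the validation order being described: try the current master key first, fall back to the cached per-application key for long-running apps, and refresh the cache whenever the current key validates. All class and method names are hypothetical illustrations, not the actual YARN implementation.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of NM-side AMNMToken validation; not actual YARN code. */
public class NMTokenValidatorSketch {

  /** Cached master key per application, refreshed whenever the current key validates a token. */
  private final Map<String, byte[]> appIdToMasterKey = new ConcurrentHashMap<String, byte[]>();
  private volatile byte[] currentMasterKey;

  public NMTokenValidatorSketch(byte[] initialMasterKey) {
    this.currentMasterKey = initialMasterKey;
  }

  /** Returns true if the token password matches the current key or the app's cached key. */
  public boolean validate(String appId, byte[] tokenPassword) {
    if (matches(currentMasterKey, tokenPassword)) {
      // Current key validated the token: remember it as the app's key from now on.
      appIdToMasterKey.put(appId, currentMasterKey);
      return true;
    }
    // Long-running app whose token predates the latest rollover: fall back to the cached key.
    byte[] cachedKey = appIdToMasterKey.get(appId);
    return cachedKey != null && matches(cachedKey, tokenPassword);
  }

  /** Stand-in for a real HMAC check; a real implementation would recompute and compare signatures. */
  private static boolean matches(byte[] key, byte[] password) {
    return java.util.Arrays.equals(key, password);
  }
}
{code}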
[jira] [Updated] (YARN-617) In unsercure mode, AM can fake resource requirements
[ https://issues.apache.org/jira/browse/YARN-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-617: --- Attachment: YARN-617-20130515.patch > In unsercure mode, AM can fake resource requirements > - > > Key: YARN-617 > URL: https://issues.apache.org/jira/browse/YARN-617 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi >Priority: Minor > Attachments: YARN-617.20130501.1.patch, YARN-617.20130501.patch, > YARN-617.20130502.patch, YARN-617-20130507.patch, YARN-617.20130508.patch, > YARN-617-20130513.patch, YARN-617-20130515.patch > > > Without security, it is impossible to completely avoid AMs faking resources. > We can at the least make it as difficult as possible by using the same > container tokens and the RM-NM shared key mechanism over unauthenticated > RM-NM channel. > In the minimum, this will avoid accidental bugs in AMs in unsecure mode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-617) In unsercure mode, AM can fake resource requirements
[ https://issues.apache.org/jira/browse/YARN-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658813#comment-13658813 ] Omkar Vinit Joshi commented on YARN-617: Thanks Vinod.
bq. ContainerManager.getContainerTokenIdentifier should be changed to throw only YarnRemoteException, we only throw that at the YARN layer
Fixed, using RPCUtil.getRemoteException.
bq. I still don't understand the DEL changes in TestNodeManagerReboot. You haven't given any explanation. Don't think they are needed.
My bad. I had reverted the change, but there were formatting issues which showed up in the diff. Fixed.
> In unsercure mode, AM can fake resource requirements > - > > Key: YARN-617 > URL: https://issues.apache.org/jira/browse/YARN-617 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi >Priority: Minor > Attachments: YARN-617.20130501.1.patch, YARN-617.20130501.patch, > YARN-617.20130502.patch, YARN-617-20130507.patch, YARN-617.20130508.patch, > YARN-617-20130513.patch, YARN-617-20130515.patch > > > Without security, it is impossible to completely avoid AMs faking resources. > We can at the least make it as difficult as possible by using the same > container tokens and the RM-NM shared key mechanism over unauthenticated > RM-NM channel. > In the minimum, this will avoid accidental bugs in AMs in unsecure mode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-638) Restore RMDelegationTokens after RM Restart
[ https://issues.apache.org/jira/browse/YARN-638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-638: - Attachment: YARN-638.11.patch The newest patch reverts all hdfs changes except moving the addPersistedToken method to the common secret manager > Restore RMDelegationTokens after RM Restart > --- > > Key: YARN-638 > URL: https://issues.apache.org/jira/browse/YARN-638 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-638.10.patch, YARN-638.11.patch, YARN-638.1.patch, > YARN-638.2.patch, YARN-638.3.patch, YARN-638.4.patch, YARN-638.5.patch, > YARN-638.6.patch, YARN-638.7.patch, YARN-638.8.patch, YARN-638.9.patch > > > This is missed in YARN-581. After RM restart, RMDelegationTokens need to be > added both in DelegationTokenRenewer (addressed in YARN-581), and > delegationTokenSecretManager -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-662) Enforce required parameters for all the protocols
[ https://issues.apache.org/jira/browse/YARN-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658795#comment-13658795 ] Bikas Saha commented on YARN-662: - Does this include making null checks etc on all incoming fields in the API handlers? Currently in most places we simply access the fields assuming they will be valid. > Enforce required parameters for all the protocols > - > > Key: YARN-662 > URL: https://issues.apache.org/jira/browse/YARN-662 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Siddharth Seth >Assignee: Zhijie Shen > > All proto fields are marked as options. We need to mark some of them as > requried, or enforce these server side. Server side is likely better since > that's more flexible (Example deprecating a field type in favour of another - > either of the two must be present) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
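On the question of null checks in the API handlers: one minimal, hedged pattern for server-side enforcement is a small validation helper invoked at the top of each handler so requests fail fast with a clear message instead of a NullPointerException deep inside the code. The helper and exception type below are illustrative assumptions, not the actual YARN code.
{code}
/** Illustrative server-side required-field check; not actual YARN code. */
public final class RequestValidatorSketch {

  /** Thrown when a required request field is missing; stands in for a YARN-layer exception. */
  public static class MissingFieldException extends RuntimeException {
    public MissingFieldException(String field) {
      super("Required field is missing: " + field);
    }
  }

  private RequestValidatorSketch() {}

  /** Fail fast instead of hitting a NullPointerException deep inside the handler. */
  public static <T> T requireSet(T value, String fieldName) {
    if (value == null) {
      throw new MissingFieldException(fieldName);
    }
    return value;
  }
}

// Usage inside a hypothetical handler:
//   appId = RequestValidatorSketch.requireSet(request.getApplicationId(), "application_id");
{code}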
[jira] [Created] (YARN-686) Flatten NodeReport
Sandy Ryza created YARN-686: --- Summary: Flatten NodeReport Key: YARN-686 URL: https://issues.apache.org/jira/browse/YARN-686 Project: Hadoop YARN Issue Type: Sub-task Components: api Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza The NodeReport returned by getClusterNodes or given to AMs in heartbeat responses includes both a NodeState (enum) and a NodeHealthStatus (object). As UNHEALTHY is already NodeState, a separate NodeHealthStatus doesn't seem necessary. I propose eliminating NodeHealthStatus#getIsNodeHealthy and moving moving the its two other methods, getHealthReport and getLastHealthReportTime, into NodeReport -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
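To visualize the proposal, here is a rough sketch of what a flattened report could look like once the two health fields move up a level. Field names follow the description, but the class itself is only an illustration, not the committed YARN API.
{code}
/** Illustrative flattened node report; not the actual org.apache.hadoop.yarn.api.records.NodeReport. */
public class FlatNodeReportSketch {

  enum NodeState { NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, REBOOTED }

  private final String nodeId;
  private final NodeState nodeState;          // UNHEALTHY already covers "is the node healthy?"
  private final String healthReport;          // pulled up from NodeHealthStatus
  private final long lastHealthReportTime;    // pulled up from NodeHealthStatus

  public FlatNodeReportSketch(String nodeId, NodeState nodeState,
      String healthReport, long lastHealthReportTime) {
    this.nodeId = nodeId;
    this.nodeState = nodeState;
    this.healthReport = healthReport;
    this.lastHealthReportTime = lastHealthReportTime;
  }

  public String getNodeId() { return nodeId; }
  public NodeState getNodeState() { return nodeState; }
  public String getHealthReport() { return healthReport; }
  public long getLastHealthReportTime() { return lastHealthReportTime; }
}
{code}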
[jira] [Updated] (YARN-686) Flatten NodeReport
[ https://issues.apache.org/jira/browse/YARN-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-686: Description: The NodeReport returned by getClusterNodes or given to AMs in heartbeat responses includes both a NodeState (enum) and a NodeHealthStatus (object). As UNHEALTHY is already NodeState, a separate NodeHealthStatus doesn't seem necessary. I propose eliminating NodeHealthStatus#getIsNodeHealthy and moving its two other methods, getHealthReport and getLastHealthReportTime, into NodeReport. (was: The NodeReport returned by getClusterNodes or given to AMs in heartbeat responses includes both a NodeState (enum) and a NodeHealthStatus (object). As UNHEALTHY is already NodeState, a separate NodeHealthStatus doesn't seem necessary. I propose eliminating NodeHealthStatus#getIsNodeHealthy and moving moving the its two other methods, getHealthReport and getLastHealthReportTime, into NodeReport.) > Flatten NodeReport > -- > > Key: YARN-686 > URL: https://issues.apache.org/jira/browse/YARN-686 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api >Affects Versions: 2.0.4-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > The NodeReport returned by getClusterNodes or given to AMs in heartbeat > responses includes both a NodeState (enum) and a NodeHealthStatus (object). > As UNHEALTHY is already NodeState, a separate NodeHealthStatus doesn't seem > necessary. I propose eliminating NodeHealthStatus#getIsNodeHealthy and > moving its two other methods, getHealthReport and getLastHealthReportTime, > into NodeReport. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-686) Flatten NodeReport
[ https://issues.apache.org/jira/browse/YARN-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-686: Description: The NodeReport returned by getClusterNodes or given to AMs in heartbeat responses includes both a NodeState (enum) and a NodeHealthStatus (object). As UNHEALTHY is already NodeState, a separate NodeHealthStatus doesn't seem necessary. I propose eliminating NodeHealthStatus#getIsNodeHealthy and moving moving the its two other methods, getHealthReport and getLastHealthReportTime, into NodeReport. was: The NodeReport returned by getClusterNodes or given to AMs in heartbeat responses includes both a NodeState (enum) and a NodeHealthStatus (object). As UNHEALTHY is already NodeState, a separate NodeHealthStatus doesn't seem necessary. I propose eliminating NodeHealthStatus#getIsNodeHealthy and moving moving the its two other methods, getHealthReport and getLastHealthReportTime, into NodeReport > Flatten NodeReport > -- > > Key: YARN-686 > URL: https://issues.apache.org/jira/browse/YARN-686 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api >Affects Versions: 2.0.4-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > The NodeReport returned by getClusterNodes or given to AMs in heartbeat > responses includes both a NodeState (enum) and a NodeHealthStatus (object). > As UNHEALTHY is already NodeState, a separate NodeHealthStatus doesn't seem > necessary. I propose eliminating NodeHealthStatus#getIsNodeHealthy and moving > moving the its two other methods, getHealthReport and > getLastHealthReportTime, into NodeReport. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-613) Create NM proxy per NM instead of per container
[ https://issues.apache.org/jira/browse/YARN-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658763#comment-13658763 ] Omkar Vinit Joshi commented on YARN-613: One addition, following a good suggestion from [~bikassaha]. If the RM restarts then we have two scenarios:
* If we need to preserve the work (AM and containers will continue to run), the AM should be able to communicate with the NM using the older AMNMToken after the RM restarts. So if the AM gets a new container on that NM after the RM reboot (the RM will send a new AMNMToken to the AM, since it has no knowledge of the previous AMNMToken - that information is not persisted), then the AM should replace the existing token with the new one. Now if the NM gets a different token than the older/stored one, it should validate the incoming token's master key against its current/previous master key. If it is valid, then it replaces the older token (thereby we can even renew the token).
* If we don't need to preserve the work, (AM and container will be killed after RM restarts) then there will be no problem at all even with above implementation in which case as applications are already killed so we can just clear the cache on NM.
> Create NM proxy per NM instead of per container > --- > > Key: YARN-613 > URL: https://issues.apache.org/jira/browse/YARN-613 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Omkar Vinit Joshi > > Currently a new NM proxy has to be created per container since the secure > authentication is using a containertoken from the container. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
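A small hedged sketch of the AM-side bookkeeping described above: keep one token per NM and overwrite it whenever the RM hands out a new one (for example after an RM restart). Types and names are illustrative assumptions, not the actual YARN client code.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative AM-side cache of one AMNMToken per NM; not actual YARN code. */
public class AMNMTokenCacheSketch {

  /** Minimal stand-in for a token; a real token carries an identifier and a password. */
  public static class Token {
    final String service;   // NM address the token is for
    final byte[] password;
    public Token(String service, byte[] password) {
      this.service = service;
      this.password = password;
    }
  }

  private final Map<String, Token> tokenPerNM = new ConcurrentHashMap<String, Token>();

  /** Called for every token received in an allocate response: newer tokens win. */
  public void setToken(String nmAddress, Token token) {
    tokenPerNM.put(nmAddress, token);
  }

  /** Used by the client when opening a connection to the NM. */
  public Token getToken(String nmAddress) {
    return tokenPerNM.get(nmAddress);
  }
}
{code}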
[jira] [Commented] (YARN-638) Restore RMDelegationTokens after RM Restart
[ https://issues.apache.org/jira/browse/YARN-638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658705#comment-13658705 ] Hadoop QA commented on YARN-638: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12583340/YARN-638.10.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.hdfs.TestClientReportBadBlock {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/933//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/933//console This message is automatically generated. > Restore RMDelegationTokens after RM Restart > --- > > Key: YARN-638 > URL: https://issues.apache.org/jira/browse/YARN-638 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-638.10.patch, YARN-638.1.patch, YARN-638.2.patch, > YARN-638.3.patch, YARN-638.4.patch, YARN-638.5.patch, YARN-638.6.patch, > YARN-638.7.patch, YARN-638.8.patch, YARN-638.9.patch > > > This is missed in YARN-581. After RM restart, RMDelegationTokens need to be > added both in DelegationTokenRenewer (addressed in YARN-581), and > delegationTokenSecretManager -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol
[ https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658637#comment-13658637 ] Alejandro Abdelnur commented on YARN-624: - As pointed out, supporting gang scheduling at the RM/scheduler level will allow detection/avoidance of deadlocks. This would not be trivial (nor efficient) to do if gang scheduling is done at the AM level. Examples of gang request capabilities could be:
* express a set of containers on any nodes, e.g. 10 containers on any nodes of the cluster.
* express a set of containers on a specified set of nodes, e.g. 10 containers in rack1, or 10 containers, one on each of n1...n10.
* express different sets of possible gangs that would satisfy the request, e.g. 10 containers in rack1 or in rack2; 10 containers in n1...n10 or in n11..n20.
* indicate a timeout/fallback-to-normal for gang requests.
We should decide on which gang capabilities we want/need to address in the short term. > Support gang scheduling in the AM RM protocol > - > > Key: YARN-624 > URL: https://issues.apache.org/jira/browse/YARN-624 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, scheduler >Affects Versions: 2.0.4-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > Per discussion on YARN-392 and elsewhere, gang scheduling, in which a > scheduler runs a set of tasks when they can all be run at the same time, > would be a useful feature for YARN schedulers to support. > Currently, AMs can approximate this by holding on to containers until they > get all the ones they need. However, this lends itself to deadlocks when > different AMs are waiting on the same containers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
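A hedged illustration of how such a request might be represented, covering the capabilities listed in the comment above (alternative placements, a gang size, and a fallback timeout). This is purely a sketch of the shape of the data, with made-up class and field names; it is not the AM-RM protocol that was proposed or implemented.
{code}
import java.util.Arrays;
import java.util.List;

/** Illustrative gang-request shape; not the real AM-RM protocol records. */
public class GangRequestSketch {

  /** One acceptable placement for the whole gang, e.g. "any node", "rack1", or an explicit node list. */
  public static class Placement {
    final List<String> locations;   // "*" for any node, rack names, or host names
    public Placement(String... locations) {
      this.locations = Arrays.asList(locations);
    }
  }

  final int numContainers;            // size of the gang; all-or-nothing
  final List<Placement> alternatives; // any one of these placements satisfies the request
  final long fallbackTimeoutMs;       // after this, fall back to normal (non-gang) scheduling

  public GangRequestSketch(int numContainers, List<Placement> alternatives, long fallbackTimeoutMs) {
    this.numContainers = numContainers;
    this.alternatives = alternatives;
    this.fallbackTimeoutMs = fallbackTimeoutMs;
  }

  public static void main(String[] args) {
    // "10 containers in rack1 or in rack2", falling back to normal scheduling after 5 minutes.
    GangRequestSketch req = new GangRequestSketch(10,
        Arrays.asList(new Placement("rack1"), new Placement("rack2")), 5 * 60 * 1000L);
    System.out.println("gang of " + req.numContainers + " containers, "
        + req.alternatives.size() + " alternative placements");
  }
}
{code}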
[jira] [Commented] (YARN-613) Create NM proxy per NM instead of per container
[ https://issues.apache.org/jira/browse/YARN-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658604#comment-13658604 ] Omkar Vinit Joshi commented on YARN-613: This was discussed offline with [~vinodkv], [~bikassaha] and [~sseth]. There were 2 viable solutions to the problem of sending the AMNMToken to the AM for authenticating with the NM. The following problems need to be addressed:
* The token will be generated by the RM, but how long should the AMNMToken be kept alive? How long should the AM be able to talk to an NM on which it ever launched a container during the application life cycle?
* If the token doesn't have an expiry time, then who will renew the token? The NM or the RM?
* If the NM reboots, can the old AMNMToken be reused? (Ideally, when the NM goes down right now containers are also lost, so there is nothing specific to that application on the NM after a reboot.)
* The AM might hand over the AMNMToken to some other external service (other than the AM, maybe another container) which should be able to communicate with the NM. (Problem: if renewal is implemented, how will it take place?)
* We need to support long-running services.
* When the key rolls over, there should be no spike in communicating renewed tokens, if renewal is implemented.
Proposed solutions:
* No AMNMToken renewal
** Here the RM will generate the token and hand it over to the AM only if the AM is getting a container on the underlying NM for the first time; otherwise it will not send it. The AM can use this token to talk to the NM as long as the application is alive. So this is upper bounded by the number of applications in the cluster <= number of nodes * number of containers per node.
*** RM will have to remember the tokens given to the AM per NM
*** NM will have to remember tokens per AM
*** AM will have to remember a token per NM anyway
** Problems: if the NM reboots then the token is no longer valid, in which case the RM should reissue the AM a new token for the restarted NM.
** Advantages:
*** for every container the RM doesn't have to generate and send a token.
*** no need to renew the token. No added overhead. No need to remember past keys (other than the current and previous master key).
*** even if the AM hands the token over to some other service, that service can keep using the same token.
* AMNMToken renewal
** Here the RM will generate and issue the token to the AM during container start. The RM also remembers which AM has which tokens. So when the key rolls over, the RM will redistribute renewed tokens to the AM for all NMs on which it ever started a container. If the AM receives an updated token, it will have to replace the older one with the new token.
*** RM will have to remember all the NMs for which it handed the AM a token
*** NM doesn't have to remember tokens per application. It only has to remember the current and previous key.
*** AM will receive an AMNMToken per container request, or all tokens during key renewal. It will have to refresh its internal tokens with them.
** Advantages:
*** NM doesn't need to remember the token, so there will be no problem across an NM reboot. (Even though the token will be valid across an NM reboot, there will still be nothing on the NM for the AM before a new container starts.)
** Problems:
*** RM has to either remember or regenerate and send tokens to the AM for the container start call. This can be avoided by just sending them when the key rolls over.
*** AM has to refresh tokens that may have been given to some other service for monitoring container progress.
*** There will be a spike at key rollover.
> Create NM proxy per NM instead of per container > --- > > Key: YARN-613 > URL: https://issues.apache.org/jira/browse/YARN-613 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Omkar Vinit Joshi > > Currently a new NM proxy has to be created per container since the secure > authentication is using a containertoken from the container. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-638) Restore RMDelegationTokens after RM Restart
[ https://issues.apache.org/jira/browse/YARN-638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-638: - Attachment: YARN-638.10.patch The newest patch adds several APIs for the RM to use to persist delegation tokens, renames logUpdateMasterKey to storeNewMasterKey so it can be used by both the RM and HDFS, and moves addPersistedDelegationToken & addPersistedMasterKey to hadoop-common > Restore RMDelegationTokens after RM Restart > --- > > Key: YARN-638 > URL: https://issues.apache.org/jira/browse/YARN-638 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-638.10.patch, YARN-638.1.patch, YARN-638.2.patch, > YARN-638.3.patch, YARN-638.4.patch, YARN-638.5.patch, YARN-638.6.patch, > YARN-638.7.patch, YARN-638.8.patch, YARN-638.9.patch > > > This is missed in YARN-581. After RM restart, RMDelegationTokens need to be > added both in DelegationTokenRenewer (addressed in YARN-581), and > delegationTokenSecretManager -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
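A hedged sketch of the recovery flow being described: on restart, previously stored master keys and (token, renew date) pairs are read back from the state store and re-registered with the secret manager so existing RMDelegationTokens keep working. The types below are simplified stand-ins written for this illustration; they are not the actual Hadoop secret manager or RM state store APIs.
{code}
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-ins for the recovery flow; not the real Hadoop classes. */
public class TokenRecoverySketch {

  static class SecretManagerSketch {
    final Map<Integer, byte[]> masterKeys = new HashMap<Integer, byte[]>();
    final Map<String, Long> tokenToRenewDate = new HashMap<String, Long>();

    /** Analogous to re-adding a persisted master key after restart. */
    void addPersistedMasterKey(int keyId, byte[] key) {
      masterKeys.put(keyId, key);
    }

    /** Analogous to re-adding a persisted delegation token with its stored renew date. */
    void addPersistedDelegationToken(String tokenIdent, long renewDate) {
      tokenToRenewDate.put(tokenIdent, renewDate);
    }
  }

  static class StateStoreSketch {
    Map<Integer, byte[]> loadMasterKeys() { return new HashMap<Integer, byte[]>(); }
    Map<String, Long> loadTokens() { return new HashMap<String, Long>(); }
  }

  /** Recovery on RM restart: replay stored keys first, then stored tokens. */
  static void recover(StateStoreSketch store, SecretManagerSketch secretManager) {
    for (Map.Entry<Integer, byte[]> e : store.loadMasterKeys().entrySet()) {
      secretManager.addPersistedMasterKey(e.getKey(), e.getValue());
    }
    for (Map.Entry<String, Long> e : store.loadTokens().entrySet()) {
      secretManager.addPersistedDelegationToken(e.getKey(), e.getValue());
    }
  }
}
{code}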
[jira] [Commented] (YARN-684) ContainerManager.startContainer needs to only have ContainerTokenIdentifier instead of the whole Container
[ https://issues.apache.org/jira/browse/YARN-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658558#comment-13658558 ] Bikas Saha commented on YARN-684: - I would suggest continuing to pass Container to the startContainer API of NMClient added in YARN-422. Internally, NMClient can pick the right fields, and the user's responsibility continues to be simply taking the Container obtained from AMRMClient and passing it to the NMClient. This will also future-proof against other changes like these. > ContainerManager.startContainer needs to only have ContainerTokenIdentifier > instead of the whole Container > -- > > Key: YARN-684 > URL: https://issues.apache.org/jira/browse/YARN-684 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli > > The NM only needs the token, the whole Container is unnecessary. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
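A hedged sketch of the facade suggested above: the client-facing call keeps accepting the whole Container, while only the token-related fields are forwarded over the wire. All types here are small stand-ins invented for the illustration; they are not the real NMClient or ContainerManager interfaces.
{code}
/** Illustrative facade; not the real NMClient/ContainerManager API. */
public class NMClientFacadeSketch {

  /** Minimal stand-ins for the records involved. */
  static class ContainerToken { final String identifier; ContainerToken(String id) { this.identifier = id; } }
  static class Container {
    final String id;
    final String nodeAddress;
    final ContainerToken token;
    Container(String id, String nodeAddress, ContainerToken token) {
      this.id = id; this.nodeAddress = nodeAddress; this.token = token;
    }
  }
  interface ContainerManagerWire {
    /** Server-side call that only needs the token, per the YARN-684 proposal. */
    void startContainer(ContainerToken token, String launchContextJson);
  }

  private final ContainerManagerWire wire;
  NMClientFacadeSketch(ContainerManagerWire wire) { this.wire = wire; }

  /** Callers keep handing over the full Container; the facade picks the fields the wire call needs. */
  public void startContainer(Container container, String launchContextJson) {
    wire.startContainer(container.token, launchContextJson);
  }
}
{code}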
[jira] [Commented] (YARN-366) Add a tracing async dispatcher to simplify debugging
[ https://issues.apache.org/jira/browse/YARN-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658480#comment-13658480 ] Hadoop QA commented on YARN-366: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12583270/YARN-366-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/932//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/932//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/932//console This message is automatically generated. > Add a tracing async dispatcher to simplify debugging > > > Key: YARN-366 > URL: https://issues.apache.org/jira/browse/YARN-366 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Affects Versions: 2.0.2-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-366-1.patch, YARN-366-2.patch, YARN-366-3.patch, > YARN-366.patch > > > Exceptions thrown in YARN/MR code with asynchronous event handling do not > contain informative stack traces, as all handle() methods sit directly under > the dispatcher thread's loop. > This makes errors very difficult to debug for those who are not intimately > familiar with the code, as it is difficult to see which chain of events > caused a particular outcome. > I propose adding an AsyncDispatcher that instruments events with tracing > information. Whenever an event is dispatched during the handling of another > event, the dispatcher would annotate that event with a pointer to its parent. > When the dispatcher catches an exception, it could reconstruct a "stack" > trace of the chain of events that led to it, and be able to log something > informative. > This would be an experimental feature, off by default, unless extensive > testing showed that it did not have a significant performance impact. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-379) yarn [node,application] command print logger info messages
[ https://issues.apache.org/jira/browse/YARN-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658403#comment-13658403 ] Ravi Prakash commented on YARN-379: --- I recant my recant. So many options. We probably don't want to turn off ALL logging for YARN_CLIENT_OPTS either. We just want to set log level to WARN. > yarn [node,application] command print logger info messages > -- > > Key: YARN-379 > URL: https://issues.apache.org/jira/browse/YARN-379 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.0.3-alpha >Reporter: Thomas Graves >Assignee: Abhishek Kapoor > Labels: usability > Attachments: YARN-379.patch > > > Running the yarn node and yarn applications command results in annoying log > info messages being printed: > $ yarn node -list > 13/02/06 02:36:50 INFO service.AbstractService: > Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited. > 13/02/06 02:36:50 INFO service.AbstractService: > Service:org.apache.hadoop.yarn.client.YarnClientImpl is started. > Total Nodes:1 > Node-IdNode-State Node-Http-Address > Health-Status(isNodeHealthy)Running-Containers > foo:8041RUNNING foo:8042 true > 0 > 13/02/06 02:36:50 INFO service.AbstractService: > Service:org.apache.hadoop.yarn.client.YarnClientImpl is stopped. > $ yarn application > 13/02/06 02:38:47 INFO service.AbstractService: > Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited. > 13/02/06 02:38:47 INFO service.AbstractService: > Service:org.apache.hadoop.yarn.client.YarnClientImpl is started. > Invalid Command Usage : > usage: application > -kill Kills the application. > -list Lists all the Applications from RM. > -statusPrints the status of the application. > 13/02/06 02:38:47 INFO service.AbstractService: > Service:org.apache.hadoop.yarn.client.YarnClientImpl is stopped. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
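One way to get "WARN for the client only", as suggested in the comment above, is to raise the root logger threshold rather than turning logging off entirely (the same effect can be reached by passing a log4j property through the client opts). The snippet below is a hedged illustration of the idea using the log4j 1.x API that Hadoop shipped with at the time; it is not the fix that was ultimately committed for this issue.
{code}
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

/** Illustrative only: silence INFO chatter in a client-side tool without turning logging off. */
public class QuietClientLoggingSketch {
  public static void main(String[] args) {
    // Raise the threshold so service lifecycle INFO messages are suppressed,
    // while warnings and errors still reach the console.
    Logger.getRootLogger().setLevel(Level.WARN);
    Logger.getRootLogger().warn("warnings are still visible");
    Logger.getRootLogger().info("this INFO message is suppressed");
  }
}
{code}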