[jira] [Comment Edited] (YARN-8275) Create a JNI interface to interact with Windows
[ https://issues.apache.org/jira/browse/YARN-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473789#comment-16473789 ]

Allen Wittenauer edited comment on YARN-8275 at 5/14/18 6:01 AM:
-----------------------------------------------------------------

bq. I am planning to code everything in Commons to be used from YARN and HDFS.

The umbrella JIRA should really start out in HADOOP so that people aren't taken by surprise. I suspect any YARN- and HDFS-specific code will be relatively tiny, since winutils is used all over the place, including in the client code. That fact probably makes ...

bq. a long running native process communicating with YARN over pipe

... almost certainly a non-starter, never mind the security concerns, while greatly increasing the complexity for likely very little gain.

The other thing to keep in mind is that winutils pre-dates Java 7. Things like symlinks can now be done with Java APIs. No C required. I'd highly recommend starting by replacing the winutils calls with Java API calls first and then digging into something more complex later. [The Unix versions of those same calls will likely get a speed bump too.]

---

Before I forget: from a "what gets run on the maven command line" perspective, there is very little difference between libhadoop (JNI) and winutils. Windows *always* requires (and thus triggers) -Pnative. I suspect the direction was set because winutils was added when libhadoop was still being built by autoconf. But now that cmake is there and works properly on Windows (at least in 3.x), it'd be nice to place the core of winutils into libhadoop and just keep winutils as a wrapper to use for debugging. This might also move us away from using MSBuild, which would greatly simplify the build process.

was (Author: aw):
bq. I am planning to code everything in Commons to be used from YARN and HDFS.

The umbrella JIRA should really start out in HADOOP so that people aren't taken by surprise. I suspect any YARN- and HDFS-specific code will be relatively tiny, since winutils is used all over the place, including in the client code. That fact probably makes ...

bq. a long running native process communicating with YARN over pipe

... almost certainly a non-starter, never mind the security concerns, while greatly increasing the complexity for likely very little gain. The other thing to keep in mind is that winutils pre-dates Java 7. Things like symlinks can now be done with Java APIs. No C required. I'd highly recommend starting by replacing the winutils calls with Java API calls first and then digging into something more complex later. [The Unix versions of those same calls will likely get a speed bump too.]

> Create a JNI interface to interact with Windows
> ------------------------------------------------
>
> Key: YARN-8275
> URL: https://issues.apache.org/jira/browse/YARN-8275
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Giovanni Matteo Fumarola
> Assignee: Giovanni Matteo Fumarola
> Priority: Major
> Attachments: WinUtils-Functions.pdf, WinUtils.CSV
>
> I did a quick investigation of the performance of WinUtils in YARN. On average the NM calls it 4.76 times per second and 65.51 times per container.
>
> | |Requests|Requests/sec|Requests/min|Requests/container|
> |*Sum [WinUtils]*|*135354*|*4.761*|*286.160*|*65.51*|
> |[WinUtils] Execute -help|4148|0.145|8.769|2.007|
> |[WinUtils] Execute -ls|2842|0.0999|6.008|1.37|
> |[WinUtils] Execute -systeminfo|9153|0.321|19.35|4.43|
> |[WinUtils] Execute -symlink|115096|4.048|243.33|57.37|
> |[WinUtils] Execute -task isAlive|4115|0.144|8.699|2.05|
>
> Interval: 7 hours, 53 minutes and 48 seconds
> Each execution of WinUtils does around *140 IO ops*, of which 130 are DDL ops. This means *666.58* IO ops/second due to WinUtils.
> We should start considering removing WinUtils from Hadoop and creating a JNI interface.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
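Since {{-symlink}} dominates the call counts in the table above, the Java 7 replacement Allen mentions is worth seeing concretely. A minimal sketch using {{java.nio.file}} (the class and method names here are made up for illustration, not Hadoop code; note that on Windows, creating symlinks additionally requires the SeCreateSymbolicLinkPrivilege or Developer Mode):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SymlinkDemo {
    /**
     * What winutils -symlink does, in pure Java: create {@code link}
     * pointing at {@code target} and return the link's resolved target.
     * No helper process is spawned, so there is no exec() overhead per call.
     */
    static Path symlinkAndResolve(Path target, Path link) throws IOException {
        Files.createSymbolicLink(link, target);   // available since Java 7
        return Files.readSymbolicLink(link);
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("winutils-demo", ".txt");
        Path link = target.resolveSibling("demo-link-" + System.nanoTime());

        Path resolved = symlinkAndResolve(target, link);
        assert Files.isSymbolicLink(link);
        assert resolved.equals(target);

        // Clean up the temp files created by this demo.
        Files.delete(link);
        Files.delete(target);
    }
}
```

At ~115k symlink calls over the measured interval, replacing a process fork per call with an in-JVM API call is exactly where the "speed bump" would come from.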
[jira] [Commented] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch
[ https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473586#comment-16473586 ]

genericqa commented on YARN-8234:
---------------------------------

*-1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 32s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 1s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || branch-2.8.3 Compile Tests ||
| 0 | mvndep | 0m 10s | Maven dependency ordering for branch |
| +1 | mvninstall | 6m 49s | branch-2.8.3 passed |
| +1 | compile | 2m 5s | branch-2.8.3 passed |
| +1 | checkstyle | 0m 30s | branch-2.8.3 passed |
| +1 | mvnsite | 1m 32s | branch-2.8.3 passed |
| +1 | findbugs | 3m 10s | branch-2.8.3 passed |
| +1 | javadoc | 1m 12s | branch-2.8.3 passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 9s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 22s | the patch passed |
| +1 | compile | 2m 5s | the patch passed |
| +1 | javac | 2m 5s | the patch passed |
| -0 | checkstyle | 0m 29s | hadoop-yarn-project/hadoop-yarn: The patch generated 2 new + 218 unchanged - 3 fixed = 220 total (was 221) |
| +1 | mvnsite | 1m 30s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 0s | The patch has no ill-formed XML file. |
| +1 | findbugs | 4m 0s | the patch passed |
| +1 | javadoc | 1m 8s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 0m 25s | hadoop-yarn-api in the patch passed. |
| +1 | unit | 2m 47s | hadoop-yarn-common in the patch passed. |
| -1 | unit | 88m 6s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 21s | The patch does not generate ASF License warnings. |
| | | 119m 57s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| | hadoop.yarn.server.resourcemanager.scheduler.fair.TestSchedulingPolicy |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:c2d96dd |
| JIRA Issue | YARN-8234 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12923194/YARN-8234-branch-2.8.3.003.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml |
| uname | Linux 88cd1051fa00 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/pat
[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment
[ https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473553#comment-16473553 ]

Larry McCay commented on YARN-8108:
-----------------------------------

The fact that this one-line patch requires so much discussion supports my feeling that we need to revisit this whole filter chain mechanism. It always feels like there is some magic combination that we *think* will solve some issue that shouldn't have existed in the first place.

I suggest that we get a picture or set of pictures that represent what the filter chain/s are and how global vs. servlet-specific filters layer on to each other. Especially since we are talking about ignoring configuration that was added with explicit intent in mind, we need to be able to articulate how things work in a very clear way. This information will be valuable if/when we decide to replace this mechanism.

> RM metrics rest API throws GSSException in kerberized environment
> ------------------------------------------------------------------
>
> Key: YARN-8108
> URL: https://issues.apache.org/jira/browse/YARN-8108
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Kshitij Badani
> Assignee: Eric Yang
> Priority: Blocker
> Attachments: YARN-8108.001.patch
>
> The test is trying to pull up metrics data from SHS after kiniting as 'test_user'. It is throwing a GSSException as follows:
> {code:java}
> b2b460b80713|RUNNING: curl --silent -k -X GET -D /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json
> 2018-02-15 07:15:48,757|INFO|MainThread|machine.py:194 - run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0
> 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - getMetricsJsonData()|metrics:
> Error 403 GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
> HTTP ERROR 403
> Problem accessing /proxy/application_1518674952153_0070/metrics/json.
> Reason:
> GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
> {code}
> Root cause: the proxy server on the RM can't be supported on a Kerberos-enabled cluster because AuthenticationFilter is applied twice in the Hadoop code (once in HttpServer2 for the RM, and another instance from AmFilterInitializer for the proxy server). This will require code changes to the hadoop-yarn-server-web-proxy project.
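The double-application problem in the root cause above can be illustrated in miniature. The sketch below is NOT Hadoop's actual filter code: {{FakeRequest}}, {{IdempotentAuthFilter}}, and the attribute name are all hypothetical stand-ins. It shows one common mitigation pattern (which may or may not be the approach the patch takes): if each AuthenticationFilter instance redoes the SPNEGO negotiation, the second round trips Kerberos replay detection, whereas marking the request after the first pass makes the second instance a no-op.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a servlet request, kept self-contained;
// a real fix would live in Hadoop's servlet filter wiring.
class FakeRequest {
    final Map<String, Object> attributes = new HashMap<>();
}

public class IdempotentAuthFilter {
    static final String ALREADY_AUTHENTICATED = "auth.filter.applied";

    /**
     * Returns true if this invocation actually performed the (expensive,
     * replay-sensitive) authentication; false if an earlier filter
     * instance in the chain already did it.
     */
    static boolean doFilter(FakeRequest req) {
        if (req.attributes.containsKey(ALREADY_AUTHENTICATED)) {
            return false; // second instance in the chain: skip, don't re-negotiate
        }
        // ... the SPNEGO/Kerberos negotiation would happen here ...
        req.attributes.put(ALREADY_AUTHENTICATED, Boolean.TRUE);
        return true;
    }

    public static void main(String[] args) {
        FakeRequest req = new FakeRequest();
        // Both HttpServer2's instance and AmFilterInitializer's instance run:
        assert doFilter(req);   // first one authenticates
        assert !doFilter(req);  // second one becomes a no-op instead of replaying
    }
}
```

This kind of guard is exactly the "magic combination" Larry is warning about, which is why documenting how the global and per-servlet chains layer would help.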
[jira] [Comment Edited] (YARN-8248) Job hangs when a queue is specified and the maxResources of the queue cannot satisfy the AM resource request
[ https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473426#comment-16473426 ]

Szilard Nemeth edited comment on YARN-8248 at 5/13/18 10:22 AM:
----------------------------------------------------------------

Hi [~haibochen], thanks for your comments!

1. I'm fine with removing the first null check:
{code:java}
if (rmApp == null || rmApp.getAMResourceRequests() == null) {
  LOG.debug("rmApp or rmApp.AMResourceRequests was null!");
}
{code}
but as {{RMAppManager.validateAndCreateResourceRequest()}} can return a null value for the AM requests, I would leave the second null check that is just before the loop on amRequests:
{code:java}
if (rmApp != null && rmApp.getAMResourceRequests() != null) {
{code}
Maybe it could be just
{code:java}
if (rmApp.getAMResourceRequests() != null) {
{code}
since rmApp should be non-null at this point. What do you prefer?

2. It is true that {{Resources.fitsIn(amResourceRequest.getCapability(), queueMaxShare)}} would always return false when the {{queueMaxShare}} is 0 for any resource, but the problem with just using {{Resources.fitsIn}} is that it would also return false in cases where the requested resource exceeds a max resource that is not zero, e.g. requested vCores = 2, max vCores = 1. With this check, I only wanted to catch the cases where there is a resource request for some resource type but the queue has 0 of that resource in {{queueMaxShare}}. In this sense, this check alone would be enough in the if condition:
{code:java}
Resources.isAnyMajorResourceZero(DOMINANT_RESOURCE_CALCULATOR, queueMaxShare)
{code}
but it is not quite right either, since on its own it does not check whether the resource is actually requested. For example, if an application does not request any vCores (maybe this cannot happen in reality) and we have 0 vCores as the maximum, then it is a perfectly reasonable request and we don't need to reject the application. On the other hand, if an app requests 1 vCore and we have 0 vCores as the maximum, then rejection should happen. Does this explanation make it clearer? Do you think some comments need to be added to the code above the if condition? How would you update the diagnostic message?

3. My overall intention with my changes in {{FairScheduler}} was the following: in {{addApplication()}}, the AM resource requests are checked against the queue's max resources. In {{allocate()}}, I check whether any container allocation (e.g. map/reduce) resource request happens against a queue that has 0 of some resource configured as its max resource. So in my understanding, it can happen that the app was not rejected in {{addApplication()}} (for example, the AM does not request vCores and we have 0 vCores configured as max resources), but 1 vCore is then requested for a map container. Please tell me whether this is clear.

4. {{testAppRejectedToQueueZeroCapacityOfResource()}}: tests that an application is rejected when the AM resource request exceeds the queue's maximum resources (tests code added to {{FairScheduler.addApplication}}). {{testSchedulingRejectedToQueueZeroCapacityOfResource()}}: tests that an application is rejected when a map/reduce container request exceeds the queue's maximum resources (tests code added to {{FairScheduler.allocate}}). Please check my comment for 3., where I explained a case in which an application is not rejected immediately upon submission but only when a map/reduce container request happens.

About the uncovered unit test: good point. I was considering whether we should reject an application only if a resource is requested (greater than 0) while 0 is configured as the max, or simply in any case where the requested resource is greater than the max resource, regardless of whether the max is 0 or not. If the latter, then I agree that the unit tests and the if conditions in the production code need to be changed accordingly (just using {{Resources.fitsIn}} will work, I guess). I'm fine with either way, and as you have competence with FairScheduler, please advise which way I should go.

5. Removed the unused import. Renamed the methods as you suggested. Thanks for the log change suggestions; you were right about those, and it's much more understandable that way.

Please note that I haven't uploaded a new patch, as it does not make sense to do so until we have discussed all of the bullet points; for now I only made some minor fixes. Thanks!

was (Author: snemeth):
Hi [~haibochen], thanks for your comments!

1. I'm fine with removing the first null check:
{code:java}
if (rmApp == null || rmApp.getAMResourceRequests() == null) {
  LOG.debug("rmApp or rmApp.AMResourceRequests was null!");
}
{code}
but as {{RMAppManager.validateAndCreateResourceRequest()}} can return a null value for the AM requests, I would leave the second null check that is just
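The condition discussed in point 2 (reject only when a resource is actually requested while the queue's max for it is zero) can be sketched with plain arrays standing in for Hadoop's {{Resource}}/{{Resources}} API. The class name, method name, and two-resource layout below are illustrative only, not the patch's actual code:

```java
public class ZeroCapacityCheck {
    /**
     * Stand-in for the combined check: index 0 = memory MB, index 1 = vCores.
     * Returns true only when something is actually asked for (> 0) but the
     * queue is configured with 0 of that resource in maxResources.
     */
    static boolean requestsResourceWithZeroMax(long[] requested, long[] queueMax) {
        for (int i = 0; i < requested.length; i++) {
            if (requested[i] > 0 && queueMax[i] == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long[] am = {1024, 1};          // AM asks for 1 GB and 1 vCore
        long[] maxNoVcores = {8192, 0}; // queue has 0 vCores as maxResources
        long[] maxSmall = {8192, 1};    // queue has 1 vCore as maxResources

        assert requestsResourceWithZeroMax(am, maxNoVcores);  // must reject
        assert !requestsResourceWithZeroMax(am, maxSmall);    // 1 vCore <= 1, not this check's job
        // Nothing requested for vCores, so a zero max is fine:
        assert !requestsResourceWithZeroMax(new long[]{1024, 0}, maxNoVcores);
    }
}
```

The second assertion shows why this differs from a plain {{Resources.fitsIn}}: a request of 2 vCores against a max of 1 would fail {{fitsIn}} but pass this check, which is exactly the design question being discussed.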
[jira] [Commented] (YARN-8248) Job hangs when a queue is specified and the maxResources of the queue cannot satisfy the AM resource request
[ https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473426#comment-16473426 ]

Szilard Nemeth commented on YARN-8248:
--------------------------------------

> Job hangs when a queue is specified and the maxResources of the queue cannot satisfy the AM resource request
> --------------------------------------------------------------------------------------------------------------
>
> Key: YARN-8248
> URL: https://issues.apache.org/jira/browse/YARN-8248
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler, yarn
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Major
> Attachments: YARN-8248-001.patch, YARN-8248-002.patch, YARN-8248-003.patch, YARN-8248-
[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch
[ https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hu Ziqian updated YARN-8234:
----------------------------
Attachment: YARN-8234-branch-2.8.3.003.patch

> Improve RM system metrics publisher's performance by pushing events to timeline server in batch
> -------------------------------------------------------------------------------------------------
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager, timelineserver
> Affects Versions: 2.8.3
> Reporter: Hu Ziqian
> Assignee: Hu Ziqian
> Priority: Major
> Attachments: YARN-8234-branch-2.8.3.001.patch, YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch
>
> When the system metrics publisher is enabled, the RM pushes events to the timeline server via a REST API. If the cluster load is heavy, many events are sent to the timeline server and the timeline server's event handler thread gets locked. YARN-7266 discusses the details of this problem. Because of the lock, the timeline server can't receive events as fast as they are generated in the RM, and lots of timeline events stay in the RM's memory. Eventually, those events consume all of the RM's memory and the RM starts a full GC (which causes a JVM stop-the-world pause and a timeout from the RM to ZooKeeper) or even hits an OOM.
> The main problem here is that the timeline server can't receive events as fast as they are generated. Today, the RM system metrics publisher puts only one event in each request, and most of the time is spent handling HTTP headers and the network connection on the timeline side. Only a small fraction of the time is spent dealing with the timeline event itself, which is the truly valuable part.
> In this issue, we add a buffer to the system metrics publisher and let the publisher send events to the timeline server in batches via one request. With the batch size set to 1000, in our experiment the speed at which the timeline server receives events improved 100x. We have implemented this function in our production environment, which accepts 2 apps in one hour, and it works fine.
> We add the following configuration:
> * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of events the system metrics publisher sends in one request. The default value is 1000.
> * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the event buffer in the system metrics publisher.
> * yarn.resourcemanager.system-metrics-publisher.interval-seconds: when batch publishing is enabled, we must avoid the publisher waiting for a batch to fill up and holding events in the buffer for a long time. So we add another thread which sends the events in the buffer periodically. This config sets the interval of that periodic sending thread. The default value is 60s.
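The buffer + batch-size + periodic-flush scheme described in the issue can be sketched as follows. This is an illustration of the batching idea under stated assumptions, not the attached patch: the class and field names are made up, a bounded queue provides the buffer-size backpressure, and a list of batches stands in for the one-HTTP-request-per-batch REST call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BatchingPublisher {
    private final BlockingQueue<String> buffer; // bounded: plays the buffer-size role
    private final int batchSize;                // plays the batch-size role
    final List<List<String>> sentBatches = new ArrayList<>(); // stand-in for REST PUTs

    BatchingPublisher(int bufferSize, int batchSize) {
        this.buffer = new LinkedBlockingQueue<>(bufferSize);
        this.batchSize = batchSize;
    }

    void publish(String event) throws InterruptedException {
        buffer.put(event); // blocks when full instead of growing unboundedly (no RM OOM)
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // In the real design a timer thread would also call this every
    // interval-seconds, so partial batches are not held indefinitely.
    synchronized void flush() {
        List<String> batch = new ArrayList<>();
        buffer.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            sentBatches.add(batch); // one request per batch, not one per event
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BatchingPublisher p = new BatchingPublisher(100, 10);
        for (int i = 0; i < 25; i++) {
            p.publish("event-" + i);
        }
        p.flush(); // what the periodic thread would do for the trailing partial batch
        assert p.sentBatches.size() == 3;                     // 10 + 10 + 5 events
        assert p.sentBatches.get(2).size() == 5;
    }
}
```

The per-request fixed cost (HTTP headers, connection handling) is thus amortized over up to batch-size events, which is where the reported 100x receive-rate improvement would come from.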