[jira] [Comment Edited] (YARN-8275) Create a JNI interface to interact with Windows

2018-05-13 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473789#comment-16473789
 ] 

Allen Wittenauer edited comment on YARN-8275 at 5/14/18 6:01 AM:
-

bq.  I am planning to code everything in Commons to be used from YARN and HDFS.

The umbrella JIRA should really start out in HADOOP so that people aren't taken 
by surprise.  I suspect any YARN- and HDFS-specific code will be relatively tiny, 
since winutils is used all over the place, including in the client code.  

That fact probably makes ...

bq. a long running native process communicating with YARN over pipe

almost certainly a non-starter: never mind the security concerns, it would greatly 
increase the complexity for likely very little gain.

The other thing to keep in mind is that winutils pre-dates Java 7.  Things like 
symlinks can now be done with Java APIs.  No C required.  I'd highly recommend 
starting with replacing the winutils calls with Java API calls first and then 
digging into something more complex later.  [The Unix versions of those same 
calls will likely get a speed bump too.]
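
To make that concrete, here is a minimal sketch (illustration only, not from any 
patch; the class name is made up) of replacing a "winutils symlink" fork/exec 
with the Java 7+ NIO API:
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SymlinkSketch {
  public static void main(String[] args) throws IOException {
    Path link = Paths.get(args[0]);
    Path target = Paths.get(args[1]);
    // Portable across Unix and Windows; note that on Windows creating a
    // symlink requires SeCreateSymbolicLinkPrivilege (or Developer Mode).
    Files.createSymbolicLink(link, target);
  }
}
{code}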

---

Before I forget: from a "what gets run on the maven command line" standpoint, there is 
very little difference between libhadoop (JNI) and winutils.  Windows *always* 
requires (and thus triggers) -Pnative.  

I suspect the direction was set because winutils was added when libhadoop was 
still being built by autoconf.  But now that cmake is there and works properly 
on Windows (at least in 3.x), it'd be nice to place the core of winutils into 
libhadoop and just keep winutils as a wrapper to use for debugging.  This might 
also move us away from using MSBuild, which would greatly simplify the build 
process.



> Create a JNI interface to interact with Windows
> ---
>
> Key: YARN-8275
> URL: https://issues.apache.org/jira/browse/YARN-8275
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Major
> Attachments: WinUtils-Functions.pdf, WinUtils.CSV
>
>
> I did a quick investigation of the performance of WinUtils in YARN. On 
> average, the NM calls it 4.76 times per second and 65.51 times per container.
>  
> | |Requests|Requests/sec|Requests/min|Requests/container|
> |*Sum [WinUtils]*|*135354*|*4.761*|*286.160*|*65.51*|
> |[WinUtils] Execute -help|4148|0.145|8.769|2.007|
> |[WinUtils] Execute -ls|2842|0.0999|6.008|1.37|
> |[WinUtils] Execute -systeminfo|9153|0.321|19.35|4.43|
> |[WinUtils] Execute -symlink|115096|4.048|243.33|57.37|
> |[WinUtils] Execute -task isAlive|4115|0.144|8.699|2.05|
>  Interval: 7 hours, 53 minutes and 48 seconds
> Each execution of WinUtils does around *140 IO ops*, of which 130 are DDL ops.
> This means *666.58* IO ops/second due to WinUtils.
> We should start considering removing WinUtils from Hadoop and creating a JNI 
> interface.
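
For illustration, a minimal sketch of what such a JNI surface could look like 
(all names here are hypothetical, not a proposed API):
{code:java}
// Hypothetical sketch only: each native method would replace one
// fork/exec of winutils.exe with an in-process Win32 call.
public final class NativeWinUtils {
  static {
    // "hadoopwin" is an assumed library name for this sketch.
    System.loadLibrary("hadoopwin");
  }

  private NativeWinUtils() {}

  public static native void createSymlink(String link, String target);
  public static native boolean isTaskAlive(String taskId);
  public static native String getSystemInfo();
}
{code}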









[jira] [Commented] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2018-05-13 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473586#comment-16473586
 ] 

genericqa commented on YARN-8234:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 32s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 1s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} branch-2.8.3 Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 49s{color} | {color:green} branch-2.8.3 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s{color} | {color:green} branch-2.8.3 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 30s{color} | {color:green} branch-2.8.3 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 32s{color} | {color:green} branch-2.8.3 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 10s{color} | {color:green} branch-2.8.3 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 12s{color} | {color:green} branch-2.8.3 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 9s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 5s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 29s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 2 new + 218 unchanged - 3 fixed = 220 total (was 221) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 8s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 25s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 47s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 88m 6s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 21s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}119m 57s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.scheduler.fair.TestSchedulingPolicy |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:c2d96dd |
| JIRA Issue | YARN-8234 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12923194/YARN-8234-branch-2.8.3.003.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml |
| uname | Linux 88cd1051fa00 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/pat

[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment

2018-05-13 Thread Larry McCay (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473553#comment-16473553
 ] 

Larry McCay commented on YARN-8108:
---

I would say that the amount of discussion this one-line patch requires supports 
my feeling that we need to revisit this whole filter chain mechanism. 
It always feels like there is some magic combination that we *think* will solve 
some issue that shouldn't have existed in the first place.

I suggest that we get a picture or set of pictures that represent what the 
filter chain(s) are and how the global and servlet-specific filters layer onto 
each other.

Especially since we are talking about ignoring configuration that was added 
with explicit intent, we need to be able to articulate how things work 
in a very clear way. This information will be valuable if/when we decide to 
replace this mechanism.

> RM metrics rest API throws GSSException in kerberized environment
> -
>
> Key: YARN-8108
> URL: https://issues.apache.org/jira/browse/YARN-8108
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Kshitij Badani
>Assignee: Eric Yang
>Priority: Blocker
> Attachments: YARN-8108.001.patch
>
>
> The test is trying to pull up metrics data from SHS after kiniting as 'test_user'.
> It throws a GSSException as follows:
> {code:java}
> b2b460b80713|RUNNING: curl --silent -k -X GET -D 
> /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : 
> http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15
>  07:15:48,757|INFO|MainThread|machine.py:194 - 
> run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0
> 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - 
> getMetricsJsonData()|metrics:
> 
> 
> 
> Error 403 GSSException: Failure unspecified at GSS-API level 
> (Mechanism level: Request is a replay (34))
> 
> HTTP ERROR 403
> Problem accessing /proxy/application_1518674952153_0070/metrics/json. 
> Reason:
>  GSSException: Failure unspecified at GSS-API level (Mechanism level: 
> Request is a replay (34))
> 
> 
> {code}
> Root cause: the proxy server on RM can't be supported on a Kerberos-enabled 
> cluster because AuthenticationFilter is applied twice in Hadoop code (once in 
> HttpServer2 for RM, and another instance from AmFilterInitializer for the proxy 
> server). This will require code changes to the hadoop-yarn-server-web-proxy 
> project.






[jira] [Comment Edited] (YARN-8248) Job hangs when a queue is specified and the maxResources of the queue cannot satisfy the AM resource request

2018-05-13 Thread Szilard Nemeth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473426#comment-16473426
 ] 

Szilard Nemeth edited comment on YARN-8248 at 5/13/18 10:22 AM:


Hi [~haibochen]

Thanks for your comments!

1. I'm fine with removing the first null check:
{code:java}
if (rmApp == null || rmApp.getAMResourceRequests() == null) {
  LOG.debug("rmApp or rmApp.AMResourceRequests was null!");
}
{code}
but as {{RMAppManager.validateAndCreateResourceRequest()}} can return a null 
value for the AM requests, I would leave the second null check that is just 
before the loop on amRequests:
{code:java}
if (rmApp != null && rmApp.getAMResourceRequests() != null) {
{code}
Maybe it could be just if
{code:java}
(rmApp.getAMResourceRequests() != null) 
{code}
since rmApp should be non-null at this point.
 What do you prefer?
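
For reference, a quick fragment sketch of how the simplified guard could look 
(illustration only, assuming rmApp has already been null-checked earlier):
{code:java}
// Sketch only: iterate the AM requests behind a single null check.
List<ResourceRequest> amRequests = rmApp.getAMResourceRequests();
if (amRequests != null) {
  for (ResourceRequest amRequest : amRequests) {
    // validate each AM resource request against the queue's max share
  }
}
{code}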

 

2. It is true that 
{{Resources.fitsIn(amResourceRequest.getCapability(),queueMaxShare)}} would 
always return false when the {{queueMaxShare}} is 0 for any resource, but the 
problem with just using {{Resources.fitsIn}} is that it would also return false 
when the requested resource is larger than a max resource that is not zero, 
e.g. requested vCores = 2, max vCores = 1.
 With this check, I only wanted to catch those cases where there is a resource 
request of any resource type but the queue has 0 of that resource in 
{{queueMaxShare}}.
 In this sense, in the if condition this check would be enough:
{code:java}
Resources.isAnyMajorResourceZero(DOMINANT_RESOURCE_CALCULATOR, queueMaxShare)
{code}
but that alone is not quite right either, since it does not check whether a 
resource is really requested. For example, if an application does not request 
any vCores (maybe this cannot happen in reality) and we have 0 vCores as the 
maximum, then it is a perfectly reasonable request and we don't need to reject 
the application. On the other hand, if an app requests 1 vCore and we have 0 
vCores as the maximum, then rejection should happen.
 Does this explanation make it clearer?
 Do you think some comments need to be added to the code above the if condition?
 How would you update the diagnostic message?
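
For what it's worth, here is a minimal fragment sketch of the combined check I 
have in mind (method and variable names are illustrative only, not the actual 
patch):
{code:java}
// Sketch: reject only when some resource type is actually requested
// (value > 0) while the queue's maxShare has zero of that resource.
private boolean blockedByZeroMaxShare(Resource requested, Resource queueMaxShare) {
  for (ResourceInformation info : requested.getResources()) {
    if (info.getValue() > 0
        && queueMaxShare.getResourceValue(info.getName()) == 0) {
      return true; // e.g. 1 vCore requested, 0 vCores configured as max
    }
  }
  return false;
}
{code}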

 

3. The overall intention of my changes in {{FairScheduler}} was the following: 
 Essentially, in {{addApplication()}}, the AM resource requests are checked 
against the queue's max resources.
 In {{allocate()}}, I check whether any container allocation (e.g. map/reduce) 
resource request happens against a queue that has 0 of some resource 
configured as its max resource.
 So it can happen that the app was not rejected in {{addApplication()}}, for 
example because the AM does not request vCores and we have 0 vCores configured 
as the max resource, but 1 vCore is then requested for a map container. 
 Please tell me whether this is clear.

 

4. 
 {{testAppRejectedToQueueZeroCapacityOfResource()}}: Tests that an application 
is rejected when the AM resource request exceeds the queue's maximum 
resources (tests the code added to {{FairScheduler.addApplication}}).

{{testSchedulingRejectedToQueueZeroCapacityOfResource()}}: Tests that an 
application is rejected when a map/reduce container request exceeds the 
queue's maximum resources (tests the code added to {{FairScheduler.allocate}}).
 Please check my comment for 3., where I explained a case in which an application 
is not rejected immediately upon submission but only when a map/reduce 
container request happens.

About the uncovered unit test: good point. I was thinking about whether we 
should reject an application only when the AM request is greater than 0 and we 
have 0 configured as the max resource, or simply in any case where the requested 
resource is greater than the max resource, regardless of whether the max is 0 or not.

If the latter, then I agree: the unit tests and the if-conditions in the 
production code need to be changed accordingly (just using 
{{Resources.fitsIn}} should work, I guess).

I'm fine with either way; as you have more experience with FairScheduler, 
please advise which way I should go.

5.
 - Removed the unused import.
 - Renamed the methods as you suggested.
 - Thanks for the log change suggestions; you were right about those, it's much 
more understandable that way.

 

Please note that I haven't uploaded a new patch, as that doesn't make sense 
until we have discussed all of the bullet points; for now I have only made some minor fixes.

Thanks!




> Job hangs when a queue is specified and the maxResources of the queue cannot 
> satisfy the AM resource request
> 
>
> Key: YARN-8248
> URL: https://issues.apache.org/jira/browse/YARN-8248
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, yarn
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8248-001.patch, YARN-8248-002.patch, 
> YARN-8248-003.patch, YARN-8248-

[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2018-05-13 Thread Hu Ziqian (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hu Ziqian updated YARN-8234:

Attachment: YARN-8234-branch-2.8.3.003.patch

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Major
> Attachments: YARN-8234-branch-2.8.3.001.patch, 
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch
>
>
> When the system metrics publisher is enabled, RM pushes events to the timeline 
> server via a RESTful API. If the cluster load is heavy, many events are sent to 
> the timeline server and the timeline server's event handler thread gets locked. 
> YARN-7266 discussed the details of this problem. Because of the lock, the 
> timeline server can't receive events as fast as RM generates them, and lots of 
> timeline events stay in RM's memory. Eventually, those events consume all of 
> RM's memory and RM starts a full GC (which causes a JVM stop-the-world pause 
> and a timeout from RM to ZooKeeper) or even hits an OOM. 
> The main problem here is that the timeline server can't receive events as fast 
> as RM generates them. Currently, the RM system metrics publisher puts only one 
> event in each request, and most of the time is spent handling HTTP headers and 
> other connection overhead on the timeline side. Only a small fraction of the 
> time is spent dealing with the timeline event itself, which is the truly 
> valuable work.
> In this issue, we add a buffer to the system metrics publisher and let the 
> publisher send events to the timeline server in batches, one batch per request. 
> With the batch size set to 1000, in our experiments the rate at which the 
> timeline server receives events improved 100x. We have implemented this 
> function in our production environment, which accepts 2 apps in one hour, and 
> it works fine.
> We add the following configuration (see the sketch after this description):
>  * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of 
> events the system metrics publisher sends in one request. The default value is 
> 1000.
>  * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the 
> event buffer in the system metrics publisher.
>  * yarn.resourcemanager.system-metrics-publisher.interval-seconds: when batch 
> publishing is enabled, we must avoid the publisher waiting for a batch to fill 
> up and holding events in the buffer for a long time, so we add another thread 
> which sends the events in the buffer periodically. This config sets the 
> interval of that periodic sending thread. The default value is 60s.
>  
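
As an illustration of the buffering scheme described above, here is a minimal, 
self-contained sketch (class and method names are assumptions for illustration, 
not the actual patch):
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BatchingPublisher<E> {
  private final BlockingQueue<E> buffer;
  private final int batchSize;

  public BatchingPublisher(int bufferSize, int batchSize, long intervalSeconds) {
    this.buffer = new LinkedBlockingQueue<>(bufferSize);
    this.batchSize = batchSize;
    // Periodic flush so events never sit in the buffer longer than the interval.
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
        this::flush, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
  }

  public void publish(E event) {
    buffer.offer(event); // on overflow, drop or block depending on policy
    if (buffer.size() >= batchSize) {
      flush(); // size-triggered flush
    }
  }

  private synchronized void flush() {
    List<E> batch = new ArrayList<>(batchSize);
    buffer.drainTo(batch, batchSize);
    if (!batch.isEmpty()) {
      sendToTimelineServer(batch); // one REST request for the whole batch
    }
  }

  private void sendToTimelineServer(List<E> batch) {
    // Placeholder: PUT the batch of events to the timeline server in one request.
  }
}
{code}
The synchronized flush keeps the size-triggered and timer-triggered paths from 
interleaving partial batches.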


