[jira] [Commented] (YARN-8823) Monitor the healthy state of GPU

2021-03-11 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299405#comment-17299405
 ] 

Adam Antal commented on YARN-8823:
--

I think you can go ahead and work on this [~zhuqi].

> Monitor the healthy state of GPU
> 
>
> Key: YARN-8823
> URL: https://issues.apache.org/jira/browse/YARN-8823
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> GPU resources are discovered when the NM bootstraps, but they are not updated 
> through later heartbeats with the RM. There should be a monitoring mechanism 
> that checks GPU health status from time to time, along with the corresponding 
> handling.
> YARN-8851 will also handle device monitoring, so there could be some common 
> parts between the two.
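
The monitoring idea in the description could be sketched roughly as follows. This is a minimal illustration only, not NodeManager code: the class name GpuHealthMonitor, the integer device indices, and the Predicate-based health probe are all assumptions made for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/** Hypothetical sketch: re-check discovered GPUs and expose only the healthy ones. */
class GpuHealthMonitor {
    private final Set<Integer> discovered;   // GPU indices found at bootstrap
    private final Predicate<Integer> probe;  // hypothetical per-device health probe
    private final Map<Integer, Boolean> health = new ConcurrentHashMap<>();

    GpuHealthMonitor(Set<Integer> discovered, Predicate<Integer> probe) {
        this.discovered = new TreeSet<>(discovered);
        this.probe = probe;
    }

    /** One monitoring pass; a scheduler would call this at a fixed interval. */
    void checkOnce() {
        for (Integer gpu : discovered) {
            boolean ok;
            try {
                ok = probe.test(gpu);
            } catch (RuntimeException e) {
                ok = false; // a probe that throws is treated as an unhealthy device
            }
            health.put(gpu, ok);
        }
    }

    /** Devices that would be reported in the next heartbeat. */
    Set<Integer> healthyDevices() {
        Set<Integer> result = new TreeSet<>();
        for (Integer gpu : discovered) {
            // Before the first pass, devices keep their bootstrap state (assumed healthy).
            if (health.getOrDefault(gpu, Boolean.TRUE)) {
                result.add(gpu);
            }
        }
        return result;
    }
}
```

In a real daemon, checkOnce() would typically be driven by a ScheduledExecutorService, and the result of healthyDevices() would feed the resource update sent with the heartbeat.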



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10621) GPU management using OpenCL instead of vendor-specific solutions

2021-02-12 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283717#comment-17283717
 ] 

Adam Antal commented on YARN-10621:
---

Thanks for bringing this issue to the community!

From the JIRA details I assume that you're using version 3.1.1, which does not 
yet include the {{DevicePlugin}} interface. This interface has been added to 
the code for the very same purpose: discovering custom resources provided by 
these plugins - just like the Nvidia GPUs. For more information, look at the 
umbrella jira: YARN-8851.

I don't know if the changes you've been working on are based on this work, but 
it's the recommended way from 3.3.0 on. How does that align with your effort?


> GPU management using OpenCL instead of vendor-specific solutions
> 
>
> Key: YARN-10621
> URL: https://issues.apache.org/jira/browse/YARN-10621
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Reporter: Sotiris Niarchos
>Priority: Minor
>
> As part of the [E2Data research project|https://e2data.eu/], we at the 
> [Institute of Communication and Computer Systems 
> (ICCS)|https://www.iccs.gr/en/?noredirect=en_US] of the National Technical 
> University of Athens, Greece, have been working on a modified version of 
> Hadoop Yarn where the GPU devices that are available in the underlying 
> cluster are discovered via a Java wrapper of the OpenCL framework API (namely 
> [JOCL|https://github.com/gpu/JOCL]), instead of vendor-specific binaries.
> In other words, we have shifted towards *a more uniform and high-level 
> handling of GPUs as "OpenCL-enabled" devices*. This way, we manage to 
> *decouple GPU discovery/management from vendor-specific technicalities*; 
> every GPU, no matter the vendor, is the same for E2Data YARN (more 
> specifically, for the {{NodeManager}} component), provided that the OpenCL 
> runtime and drivers for the GPU(s) of interest are installed on the 
> respective node(s) of the cluster.
> This way, we *managed to use GPUs other than NVIDIA* (which are the only ones 
> officially supported via the {{nvidia-smi}} binary) with minimal additional 
> effort, after our initial changes.
> Ultimately, our goal is to *unify every processing unit* that YARN can 
> possibly utilize (CPU cores, GPUs, FPGAs) *behind a common, simple, 
> high-level interface; that of the OpenCL-enabled device*.
> The only drawback of our approach is that vendor-specific info regarding the 
> GPUs is lost (e.g. temperature). We believe, however, that the lost 
> information is not necessary for YARN; everything that Hadoop needs in order 
> to discover and handle GPU devices is provided by OpenCL.
> This is just a proposition/a prompt for discussion for the time being. This 
> modified version is a work in progress. We consider community feedback 
> regarding the core concept (and the fact that it may constitute a paradigm 
> shift for YARN) crucial before attaching any patch file and diving into more 
> (technical) details.
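
To make the "common, simple, high-level interface" idea concrete, here is a small sketch. It does not use JOCL or any YARN API; ComputeDevice, DeviceDiscoverer and NodeDeviceInventory are hypothetical names, and an OpenCL-backed implementation of DeviceDiscoverer is only assumed, not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical vendor-neutral description of a compute device. */
final class ComputeDevice {
    final String type; // e.g. "GPU" or "FPGA"
    final String name;

    ComputeDevice(String type, String name) {
        this.type = type;
        this.name = name;
    }
}

/** Hypothetical discovery abstraction; an OpenCL-backed version would enumerate platforms and devices. */
interface DeviceDiscoverer {
    List<ComputeDevice> discover();
}

/** The node-side view only depends on the abstraction, never on a vendor tool. */
class NodeDeviceInventory {
    private final DeviceDiscoverer discoverer;

    NodeDeviceInventory(DeviceDiscoverer discoverer) {
        this.discoverer = discoverer;
    }

    /** Returns every discovered device of the given type, regardless of vendor. */
    List<ComputeDevice> devicesOfType(String type) {
        List<ComputeDevice> matches = new ArrayList<>();
        for (ComputeDevice device : discoverer.discover()) {
            if (device.type.equals(type)) {
                matches.add(device);
            }
        }
        return matches;
    }
}
```

The design choice illustrated here is that vendor-specific details (such as temperature) simply have no place in ComputeDevice, which matches the trade-off the description mentions.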






[jira] [Commented] (YARN-10031) Create a general purpose log request with additional query parameters

2020-12-12 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248315#comment-17248315
 ] 

Adam Antal commented on YARN-10031:
---

Committed to trunk. Thanks for the contribution [~gandras]!

> Create a general purpose log request with additional query parameters
> -
>
> Key: YARN-10031
> URL: https://issues.apache.org/jira/browse/YARN-10031
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10031-WIP.001.patch, YARN-10031.001.patch, 
> YARN-10031.002.patch, YARN-10031.003.patch, YARN-10031.004.patch, 
> YARN-10031.005.patch, YARN-10031.005.patch, YARN-10031.006.patch
>
>
> The current endpoints are robust but not very flexible with regards to 
> filtering options. I suggest to add an endpoint which provides filtering 
> options.
> E.g.:
> In ATS we have multiple endpoints:
> /containers/{containerid}/logs/{filename}
> /containerlogs/{containerid}/{filename}
> We could add @QueryParams parameters to the REST endpoints like this:
> /containers/{containerid}/logs?fileName=stderr=FAILED=nm45
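
A rough sketch of how optional query parameters could narrow a log request is shown below. It is not the committed patch: LogFileInfo and LogRequestFilter are hypothetical, and apart from fileName (which appears in the description) the parameter name nodeId is a placeholder, since the other parameter names are not legible in the example URL above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical metadata for one container log file. */
final class LogFileInfo {
    final String containerId;
    final String fileName;
    final String nodeId;

    LogFileInfo(String containerId, String fileName, String nodeId) {
        this.containerId = containerId;
        this.fileName = fileName;
        this.nodeId = nodeId;
    }
}

class LogRequestFilter {
    /** Applies every query parameter that is present; an absent parameter means "no constraint". */
    static List<LogFileInfo> filter(List<LogFileInfo> candidates, Map<String, String> query) {
        String fileName = query.get("fileName");
        String nodeId = query.get("nodeId"); // placeholder parameter name
        List<LogFileInfo> matches = new ArrayList<>();
        for (LogFileInfo info : candidates) {
            boolean nameOk = fileName == null || fileName.equals(info.fileName);
            boolean nodeOk = nodeId == null || nodeId.equals(info.nodeId);
            if (nameOk && nodeOk) {
                matches.add(info);
            }
        }
        return matches;
    }
}
```

With this shape, a single endpoint can serve both broad and narrow requests, because each additional parameter only removes results.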






[jira] [Commented] (YARN-10031) Create a general purpose log request with additional query parameters

2020-12-11 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247906#comment-17247906
 ] 

Adam Antal commented on YARN-10031:
---

Thanks [~gandras]!

LGTM. If there are no other reviews, I will commit this tomorrow.

> Create a general purpose log request with additional query parameters
> -
>
> Key: YARN-10031
> URL: https://issues.apache.org/jira/browse/YARN-10031
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10031-WIP.001.patch, YARN-10031.001.patch, 
> YARN-10031.002.patch, YARN-10031.003.patch, YARN-10031.004.patch, 
> YARN-10031.005.patch, YARN-10031.005.patch, YARN-10031.006.patch
>
>
> The current endpoints are robust but not very flexible with regards to 
> filtering options. I suggest to add an endpoint which provides filtering 
> options.
> E.g.:
> In ATS we have multiple endpoints:
> /containers/{containerid}/logs/{filename}
> /containerlogs/{containerid}/{filename}
> We could add @QueryParams parameters to the REST endpoints like this:
> /containers/{containerid}/logs?fileName=stderr=FAILED=nm45






[jira] [Resolved] (YARN-10520) Deprecated the residual nested class for the LCEResourceHandler

2020-12-09 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal resolved YARN-10520.
---
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Deprecated the residual nested class for the LCEResourceHandler
> ---
>
> Key: YARN-10520
> URL: https://issues.apache.org/jira/browse/YARN-10520
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: nodemanager
>Reporter: Wanqiang Ji
>Assignee: Wanqiang Ji
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The old LCEResourceHandler interface hierarchy was deprecated, but some 
> nested classes were left behind, such as 
> CustomCgroupsLCEResourceHandler/MockLinuxContainerExecutor/TestResourceHandler
>  etc.






[jira] [Commented] (YARN-10520) Deprecated the residual nested class for the LCEResourceHandler

2020-12-09 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246474#comment-17246474
 ] 

Adam Antal commented on YARN-10520:
---

Thanks for the patch [~jiwq]. Committed to trunk.

> Deprecated the residual nested class for the LCEResourceHandler
> ---
>
> Key: YARN-10520
> URL: https://issues.apache.org/jira/browse/YARN-10520
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: nodemanager
>Reporter: Wanqiang Ji
>Assignee: Wanqiang Ji
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The old LCEResourceHandler interface hierarchy was deprecated, but some 
> nested classes were left behind, such as 
> CustomCgroupsLCEResourceHandler/MockLinuxContainerExecutor/TestResourceHandler
>  etc.






[jira] [Commented] (YARN-10031) Create a general purpose log request with additional query parameters

2020-12-03 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243045#comment-17243045
 ] 

Adam Antal commented on YARN-10031:
---

Reuploaded patch v5

> Create a general purpose log request with additional query parameters
> -
>
> Key: YARN-10031
> URL: https://issues.apache.org/jira/browse/YARN-10031
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10031-WIP.001.patch, YARN-10031.001.patch, 
> YARN-10031.002.patch, YARN-10031.003.patch, YARN-10031.004.patch, 
> YARN-10031.005.patch, YARN-10031.005.patch
>
>
> The current endpoints are robust but not very flexible with regards to 
> filtering options. I suggest to add an endpoint which provides filtering 
> options.
> E.g.:
> In ATS we have multiple endpoints:
> /containers/{containerid}/logs/{filename}
> /containerlogs/{containerid}/{filename}
> We could add @QueryParams parameters to the REST endpoints like this:
> /containers/{containerid}/logs?fileName=stderr=FAILED=nm45






[jira] [Updated] (YARN-10031) Create a general purpose log request with additional query parameters

2020-12-03 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10031:
--
Attachment: YARN-10031.005.patch

> Create a general purpose log request with additional query parameters
> -
>
> Key: YARN-10031
> URL: https://issues.apache.org/jira/browse/YARN-10031
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10031-WIP.001.patch, YARN-10031.001.patch, 
> YARN-10031.002.patch, YARN-10031.003.patch, YARN-10031.004.patch, 
> YARN-10031.005.patch, YARN-10031.005.patch
>
>
> The current endpoints are robust but not very flexible with regards to 
> filtering options. I suggest to add an endpoint which provides filtering 
> options.
> E.g.:
> In ATS we have multiple endpoints:
> /containers/{containerid}/logs/{filename}
> /containerlogs/{containerid}/{filename}
> We could add @QueryParams parameters to the REST endpoints like this:
> /containers/{containerid}/logs?fileName=stderr=FAILED=nm45






[jira] [Commented] (YARN-9883) Reshape SchedulerHealth class

2020-12-03 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243030#comment-17243030
 ] 

Adam Antal commented on YARN-9883:
--

Committed to trunk, thanks for the contribution [~dmmkr]

> Reshape SchedulerHealth class
> -
>
> Key: YARN-9883
> URL: https://issues.apache.org/jira/browse/YARN-9883
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: D M Murali Krishna Reddy
>Priority: Minor
> Attachments: YARN-9883.001.patch, YARN-9883.002.patch
>
>
> The {{SchedulerHealth}} class has some flaws, for example:
> - It has no javadoc at all
> - All its objects are package-private: they should be private
> - The internal maps should be (Concurrent) EnumMaps instead of HashMaps: they 
> are more efficient in storing Enums
> - schedulerHealthDetails only stores the last operation, its name should 
> reflect that (just like lastSchedulerRunDetails)
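
The EnumMap suggestion can be illustrated with a short sketch. The class and the Operation values below are illustrative assumptions, not the actual SchedulerHealth code. Note that the JDK has no concurrent EnumMap, so this sketch wraps a plain EnumMap with Collections.synchronizedMap as one possible way to get thread safety.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical tracker that keeps only the last timestamp per operation. */
class LastOperationTracker {
    /** Illustrative operation kinds; the real enum may differ. */
    enum Operation { ALLOCATION, RELEASE, PREEMPTION, RESERVATION }

    // EnumMap is backed by an array indexed by ordinal, so it avoids hashing entirely.
    private final Map<Operation, Long> lastTimestamps =
        Collections.synchronizedMap(new EnumMap<Operation, Long>(Operation.class));

    /** Overwrites the previous value: only the last operation is kept. */
    void record(Operation op, long timestampMillis) {
        lastTimestamps.put(op, timestampMillis);
    }

    /** Returns 0 when the operation has never been recorded. */
    long lastTimestamp(Operation op) {
        return lastTimestamps.getOrDefault(op, 0L);
    }
}
```

Keeping the field private and exposing it only through methods also addresses the visibility point from the list above.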






[jira] [Commented] (YARN-9883) Reshape SchedulerHealth class

2020-12-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241713#comment-17241713
 ] 

Adam Antal commented on YARN-9883:
--

If you don't have any objections, I will commit this tomorrow [~BilwaST]

> Reshape SchedulerHealth class
> -
>
> Key: YARN-9883
> URL: https://issues.apache.org/jira/browse/YARN-9883
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: D M Murali Krishna Reddy
>Priority: Minor
> Attachments: YARN-9883.001.patch, YARN-9883.002.patch
>
>
> The {{SchedulerHealth}} class has some flaws, for example:
> - It has no javadoc at all
> - All its objects are package-private: they should be private
> - The internal maps should be (Concurrent) EnumMaps instead of HashMaps: they 
> are more efficient in storing Enums
> - schedulerHealthDetails only stores the last operation, its name should 
> reflect that (just like lastSchedulerRunDetails)






[jira] [Commented] (YARN-9883) Reshape SchedulerHealth class

2020-12-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241687#comment-17241687
 ] 

Adam Antal commented on YARN-9883:
--

LGTM, {{TestDelegationTokenRenewer}} failure is unrelated.

> Reshape SchedulerHealth class
> -
>
> Key: YARN-9883
> URL: https://issues.apache.org/jira/browse/YARN-9883
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: D M Murali Krishna Reddy
>Priority: Minor
> Attachments: YARN-9883.001.patch, YARN-9883.002.patch
>
>
> The {{SchedulerHealth}} class has some flaws, for example:
> - It has no javadoc at all
> - All its objects are package-private: they should be private
> - The internal maps should be (Concurrent) EnumMaps instead of HashMaps: they 
> are more efficient in storing Enums
> - schedulerHealthDetails only stores the last operation, its name should 
> reflect that (just like lastSchedulerRunDetails)






[jira] [Commented] (YARN-9883) Reshape SchedulerHealth class

2020-11-27 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239570#comment-17239570
 ] 

Adam Antal commented on YARN-9883:
--

Hi [~dmmkr],
Thanks for the patch, LGTM. 
Jenkins complains about the javadoc warnings ("First sentence should end with a 
period."). 
If you can handle it, I can commit this patch.

> Reshape SchedulerHealth class
> -
>
> Key: YARN-9883
> URL: https://issues.apache.org/jira/browse/YARN-9883
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: D M Murali Krishna Reddy
>Priority: Minor
> Attachments: YARN-9883.001.patch
>
>
> The {{SchedulerHealth}} class has some flaws, for example:
> - It has no javadoc at all
> - All its objects are package-private: they should be private
> - The internal maps should be (Concurrent) EnumMaps instead of HashMaps: they 
> are more efficient in storing Enums
> - schedulerHealthDetails only stores the last operation, its name should 
> reflect that (just like lastSchedulerRunDetails)






[jira] [Commented] (YARN-10031) Create a general purpose log request with additional query parameters

2020-11-26 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239144#comment-17239144
 ] 

Adam Antal commented on YARN-10031:
---

Hi [~gandras],

Could you please rebase and reupload the patch to have a good jenkins result?
Thanks!

> Create a general purpose log request with additional query parameters
> -
>
> Key: YARN-10031
> URL: https://issues.apache.org/jira/browse/YARN-10031
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10031-WIP.001.patch, YARN-10031.001.patch, 
> YARN-10031.002.patch, YARN-10031.003.patch, YARN-10031.004.patch, 
> YARN-10031.005.patch
>
>
> The current endpoints are robust but not very flexible with regards to 
> filtering options. I suggest to add an endpoint which provides filtering 
> options.
> E.g.:
> In ATS we have multiple endpoints:
> /containers/{containerid}/logs/{filename}
> /containerlogs/{containerid}/{filename}
> We could add @QueryParams parameters to the REST endpoints like this:
> /containers/{containerid}/logs?fileName=stderr=FAILED=nm45






[jira] [Commented] (YARN-10031) Create a general purpose log request with additional query parameters

2020-11-18 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17235013#comment-17235013
 ] 

Adam Antal commented on YARN-10031:
---

Thanks for rebasing the patch. 

I plan to have another review round. In the meantime, could you please handle 
the checkstyle and javadoc issues?

> Create a general purpose log request with additional query parameters
> -
>
> Key: YARN-10031
> URL: https://issues.apache.org/jira/browse/YARN-10031
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10031-WIP.001.patch, YARN-10031.001.patch, 
> YARN-10031.002.patch, YARN-10031.003.patch, YARN-10031.004.patch
>
>
> The current endpoints are robust but not very flexible with regards to 
> filtering options. I suggest to add an endpoint which provides filtering 
> options.
> E.g.:
> In ATS we have multiple endpoints:
> /containers/{containerid}/logs/{filename}
> /containerlogs/{containerid}/{filename}
> We could add @QueryParams parameters to the REST endpoints like this:
> /containers/{containerid}/logs?fileName=stderr=FAILED=nm45






[jira] [Assigned] (YARN-10306) Create simple copy log aggregation file controller

2020-10-20 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal reassigned YARN-10306:
-

Assignee: Andras Gyori  (was: Adam Antal)

> Create simple copy log aggregation file controller
> --
>
> Key: YARN-10306
> URL: https://issues.apache.org/jira/browse/YARN-10306
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Major
>
> Log aggregation file controllers were created (YARN-6875) to effectively wrap 
> and move the logs of containers to a remote filesystem. As long as this 
> filesystem was HDFS, it was logical to create files as big as possible by 
> packing the logs into an aggregated (blob) file. As S3A is a valid target 
> since YARN-9525, it is much less painful from the end user's point of view to 
> browse these files as they are, and not in an aggregated blob format.
> I propose to implement a "dumb"/bare copy file controller that copies the 
> container log files to the remote file system without any 
> aggregation/wrapping. The only thing which makes sense to enable is the 
> compression of those files, so we should support that.
> [Docs|https://docs.google.com/document/d/1GstbQ5oOQ7EGm386dJ9RO7z_v798LDjgFIltSnzxcEM/edit#]
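
The proposed "bare copy with optional compression" behaviour can be sketched as below. This is only an illustration under simplifying assumptions: it uses java.nio on the local filesystem as a stand-in for the remote filesystem, gzip as the compression format, and the hypothetical class name SimpleCopyLogUploader; it is not a log aggregation file controller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: copy each log file as-is, optionally gzip-compressed. */
class SimpleCopyLogUploader {
    /** Copies one log file into targetDir and returns the path that was written. */
    static Path copy(Path logFile, Path targetDir, boolean compress) throws IOException {
        Files.createDirectories(targetDir);
        // Keep the original file name so users can browse the logs directly.
        String name = logFile.getFileName().toString() + (compress ? ".gz" : "");
        Path target = targetDir.resolve(name);
        try (InputStream in = Files.newInputStream(logFile);
             OutputStream raw = Files.newOutputStream(target);
             OutputStream out = compress ? new GZIPOutputStream(raw) : raw) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        return target;
    }
}
```

Because every container log stays a separate file, no index or wrapper format is needed to locate a given log on the target filesystem.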






[jira] [Commented] (YARN-10448) SLS should set default user to handle SYNTH format

2020-10-12 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212423#comment-17212423
 ] 

Adam Antal commented on YARN-10448:
---

Thanks for the patch [~zhuqi], looks good to me.

Could you please double check whether the unit test failures are related? Also, 
there's one checkstyle warning remaining. I can commit this if you take care of 
that.

> SLS should set default user to handle SYNTH format
> --
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch, YARN-10448.002.patch, 
> YARN-10448.003.patch, image-2020-10-11-22-01-37-227.png, 
> image-2020-10-11-22-02-17-166.png
>
>
> When using the synthetic generator json file example from the doc ( 
> https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format
>  ), it throws the following exception:
> {noformat}
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {noformat}
> So the solution is either:
> 1) to make {{user_name}} a mandatory field, or
> 2) to set default user in SLS code if the json file does not define it.
> IMO, solution 2 might be better, because in most cases (if not all) 
> {{user_name}} has no impact on scheduler performance, thus it is reasonable 
> to make it an optional field, which is also consistent with the {{job.user}} 
> field in SLS JSON file.
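
Solution 2 could look roughly like the following sketch. It is not the SLS patch: the class SynthJobUser, the map-based job definition, and the default value "default" are assumptions for illustration; only the {{user_name}} field name comes from the description.

```java
import java.util.Map;

/** Hypothetical helper: fall back to a default user when the job definition has none. */
class SynthJobUser {
    static final String DEFAULT_USER = "default"; // assumed default, not taken from SLS

    /** Returns the configured user, or DEFAULT_USER when user_name is missing or blank. */
    static String resolveUser(Map<String, Object> jobDefinition) {
        Object user = jobDefinition.get("user_name");
        if (user == null || user.toString().trim().isEmpty()) {
            return DEFAULT_USER;
        }
        return user.toString();
    }
}
```

Resolving the user once, before it reaches a call such as createRemoteUser, keeps the field optional without letting a null propagate into the simulator.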






[jira] [Commented] (YARN-10420) Update CS MappingRule documentation with the new format and features

2020-10-12 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212411#comment-17212411
 ] 

Adam Antal commented on YARN-10420:
---

Thanks for the patch [~pbacsko]. I'll attach my reply inline.

1. Ok, let's not touch it then.
2. Can we check what happens, and document it as well? I think users would 
also be interested in that.
3. Can we also add this to the document?
4,5,6. Ok, got it, thanks.

bq. "If the target queue doesn't exist or and it cannot be created..." - you 
propose "and" but that would mean that we always try to create a non-existing 
queue, which is not the case in CS. Under regular parents, queues cannot be 
created dynamically and CS doesn't even try. Therefore "or" is more appropriate 
here.
Thanks for the clarification. I suggest adding this to the doc, because I 
didn't know that "cannot be created" refers to the case you've illustrated. 
Something like "If the target queue doesn't exist or cannot be created (e.g. 
under regular parents) ..."

For all the other points, I'm fine.

> Update CS MappingRule documentation with the new format and features
> 
>
> Key: YARN-10420
> URL: https://issues.apache.org/jira/browse/YARN-10420
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10420-001.patch, YARN-10420-002.patch, 
> YARN-10420-003.patch, YARN-10420-004.patch, YARN-10420-005.patch
>
>
> Update the upstream documentation with the new changes.






[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-07 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209595#comment-17209595
 ] 

Adam Antal commented on YARN-10393:
---

Committed to branch-2.10. Thanks [~Jim_Brennan].

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.0.4, 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: YARN-10393-branch-2.10.001.patch, 
> YARN-10393-branch-2.10.001.patch, YARN-10393.001.patch, YARN-10393.002.patch, 
> YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job got stuck in a 
> live lock state and couldn't make any progress. The queue is full so the 
> pending mapper can’t get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was in fact handled by the resource 
> manager; however, the node manager failed to receive the response. Let’s 
> assume the heartBeatResponseId=$hid in node manager. According to our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at 

[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-07 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10393:
--
Fix Version/s: 2.10.2

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.0.4, 3.2.2, 3.4.0, 3.3.1, 3.1.5, 2.10.2
>
> Attachments: YARN-10393-branch-2.10.001.patch, 
> YARN-10393-branch-2.10.001.patch, YARN-10393.001.patch, YARN-10393.002.patch, 
> YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> We saw this bug multiple times on Hadoop 2.6.2. The following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. 
> We haven't seen it since 2.9 in our environment, but that is because of the RPC 
> retry policy change and other changes. Unless I missed something, it is still 
> possible with the current code.
> *High-level description:*
>  We saw a starving-mapper issue several times. The MR job was stuck in a 
> live lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn’t get any resources to continue, and the application 
> master failed to preempt the reducer, which left the job stuck. The 
> application master didn’t preempt the reducer because there was a leaked 
> container among the assigned mappers: the node manager had failed to report 
> the completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was in fact handled by the resource 
> manager, but the node manager failed to receive the response. Let’s 
> assume heartBeatResponseId=$hid in the node manager. With our current 
> configuration, the next heartbeat comes 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at 

[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-07 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10393:
--
Attachment: YARN-10393-branch-2.10.001.patch

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.0.4, 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: YARN-10393-branch-2.10.001.patch, 
> YARN-10393-branch-2.10.001.patch, YARN-10393.001.patch, YARN-10393.002.patch, 
> YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-07 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209483#comment-17209483
 ] 

Adam Antal commented on YARN-10393:
---

Re-uploaded the patch for branch-2.10; waiting on Jenkins.

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.0.4, 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: YARN-10393-branch-2.10.001.patch, 
> YARN-10393-branch-2.10.001.patch, YARN-10393.001.patch, YARN-10393.002.patch, 
> YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>

[jira] [Reopened] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-07 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal reopened YARN-10393:
---

Reopening the issue to trigger Jenkins.

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.0.4, 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: YARN-10393-branch-2.10.001.patch, YARN-10393.001.patch, 
> YARN-10393.002.patch, YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>

[jira] [Commented] (YARN-10420) Update CS MappingRule documentation with the new format and features

2020-10-06 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208690#comment-17208690
 ] 

Adam Antal commented on YARN-10420:
---

Thanks for the patch [~pbacsko], and for the effort on this!
If you don't mind, I took the initiative to read through the doc: since I 
wasn't much involved in the process, I checked whether I could comprehend it.

Some questions:
 - It's a bit strange that the two configuration options in the Queue lifetime 
for applications chapter are different:
 one is {{yarn.scheduler.capacity..maximum-application-lifetime}}
 and the other is 
{{yarn.scheduler.capacity.root..default-application-lifetime}}.
 Is there any specific reason why one starts with root and the other does 
not? I think it may be a mistake, as the doc says "This feature can be set 
at any level in the queue hierarchy."
 - Will the rule fail if the create flag is set for a non-managed queue? What 
is the expected behaviour? (Maybe we can add some details to the create flag 
part.)
 - Can you use multiple {{setDefaultQueue}} rules? (Maybe we can add some 
details to the doc.)
 - "In this table, you can see how to rewrite the old, colon-separated rules to 
the new format." <- Are these mapping rules FS or CS specific? Can we add that?

Nits:
 - You can use the {{```json}} syntax for the code blocks to get JSON syntax 
highlighting (it is already used for XML on this page).
 - The Example does not show a rule where the "create" flag is set (either to 
true or false). I think we should add one for the sake of completeness.

Other nits with inlined suggestions:
 - "The {{CapacityScheduler}} supports the following parameters to _configure_ 
lifetime of an application"
 - "When the evaluation stops and what happens if a rule doesn't match can be 
adjusted more flexibly compared to the legacy mapping rule evaluator." -> "It 
can be adjusted more flexibly compared to the legacy mapping rule evaluator 
what happens when the evaluation stops and a given rule doesn't match."
 - "In case of user, primaryGroup, primaryGroupUser, secondaryGroup, 
secondaryGroupUser -,- _policies_ this tells the engine where the matching 
queue should be looked for."
 - "If the target queue doesn't exist -or- _and_ it cannot be created, it 
defines a fallback action. Valid values are skip, reject and placeDefault."
 - "placeDefault: place the application to the default queue -"- _`_ 
root.default _`_ -"- (unless it's overridden to something else)." <- Use ` 
instead of "
 - "Places the application into the default queue root.default -but this can be 
changed.- _or its overwritten value._"
- "It can be a managed parent in order to have userName leaf created 
automatically, -but- _and_ ..." <- "can" should be "shouldn't"?
- "The custom placement policy can describe other policies with the appropriate 
variable placeholders _(see below)_ ."

Note:
- The asterisk is not displayed correctly in the markdown, but is generated 
correctly by {{mvn site}}. I wonder if we can do something about it. I know 
that it is common to link to the markdown in the GH repository instead of the 
generated documentation on the website, so it would be nice if the markdown 
could display this correctly.

If anything is not clear, please let me know and I'll try to clarify what I 
didn't understand.

> Update CS MappingRule documentation with the new format and features
> 
>
> Key: YARN-10420
> URL: https://issues.apache.org/jira/browse/YARN-10420
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10420-001.patch, YARN-10420-002.patch, 
> YARN-10420-003.patch
>
>
> Update the upstream documentation with the new changes.






[jira] [Commented] (YARN-10031) Create a general purpose log request with additional query parameters

2020-10-06 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208611#comment-17208611
 ] 

Adam Antal commented on YARN-10031:
---

Thanks for the patch [~gandras]. Sorry for the late review. Here's my 
first round; I will probably need another one (sorry).
 - {{ExtendedLogMetaRequest}} should have a {{Builder}}, no setters, and final 
fields
 - the name "compare" is unfortunately already taken in Java (the 
{{Comparable}} interface), and I strongly suggest choosing another name for 
these. For example, {{ExtendedLogMetaRequest.MatchExpression#compare()}} could 
be named {{match()}}
 - {{LogAggregationMetaCollector}} 's conf and request object can be final
 - {{LogAggregationMetaCollector.collect#L79}}: what if the node id is not 
provided ({{logsRequest.getNodeId()}} is null)? I think we can get an NPE 
there. Also, if we allow null values, it is worth simply continuing the loop 
after catching the exception in L108.
 - I think this is a bit inefficient:
{code:java}
metaFiles.keySet().stream().filter(containerId ->
!(logsRequest.getContainerId() == null ||
logsRequest.getContainerId().equals(containerId)))
.collect(Collectors.toList()).forEach(metaFiles::remove);
{code}
Here you filter a keyset, collect it to a list and call remove for these items 
on the original map. I think too many intermediate data structures are 
created. Can we do a simple loop instead?
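
To illustrate the simple-loop idea, here is a minimal sketch that removes the non-matching entries in place through the key set's iterator. It is not taken from the patch: the class name {{ContainerFilterSketch}}, the method {{retainContainer}} and the use of {{String}} keys are assumptions made only for this example, and the real map may well be keyed by a container id type.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-in for the filtering step; names and key type are assumptions.
public class ContainerFilterSketch {

  /**
   * Removes every entry whose key differs from the requested container id.
   * A null requestedId means "no filter", so the map is left untouched.
   */
  public static <V> void retainContainer(Map<String, V> metaFiles,
      String requestedId) {
    if (requestedId == null) {
      return;
    }
    Iterator<String> it = metaFiles.keySet().iterator();
    while (it.hasNext()) {
      if (!requestedId.equals(it.next())) {
        // Iterator.remove() deletes the entry from the backing map,
        // so no intermediate list of keys is needed.
        it.remove();
      }
    }
  }

  public static void main(String[] args) {
    Map<String, String> metaFiles = new HashMap<>();
    metaFiles.put("container_1", "a");
    metaFiles.put("container_2", "b");
    retainContainer(metaFiles, "container_1");
    System.out.println(metaFiles.keySet());
  }
}
```

The same effect should also be achievable with {{metaFiles.keySet().removeIf(...)}}, which likewise avoids collecting the keys into a separate list.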

 - In {{LogAggregationUtils.getNodeString#L301}} you can just return the 
{{combineIterators}} directly.
 - Please fix the ASF license warnings
 - {{LogAggregationIndexedFileController.parseChecksum()}} seems a bit odd to 
me:
 ** it should be private, since it is only called inside the class
 ** I think it will always return null, because the first condition is always 
true (if it were false, {{getLogMetaFilesOfNode()}} would already have 
returned null).
 ** I suggest revising the logic in the function.
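
To make the first points of this review more concrete (a {{Builder}}, no setters, final fields, a {{match}}-style name and null-safe handling of the node id), here is a rough sketch. It does not reflect the actual {{ExtendedLogMetaRequest}} in the patch: the class {{LogRequestSketch}}, its fields and its method names are all hypothetical.

```java
import java.util.Objects;

// Hypothetical immutable request object; not the class from the patch.
public final class LogRequestSketch {
  private final String appId;   // required
  private final String nodeId;  // optional: null means "any node"

  private LogRequestSketch(Builder builder) {
    this.appId = Objects.requireNonNull(builder.appId, "appId is required");
    this.nodeId = builder.nodeId;
  }

  public String getAppId() {
    return appId;
  }

  public String getNodeId() {
    return nodeId;
  }

  /** Null-safe match: a missing node filter accepts every candidate. */
  public boolean matchesNode(String candidate) {
    return nodeId == null || nodeId.equals(candidate);
  }

  /** The only mutable part; the built request exposes no setters. */
  public static final class Builder {
    private String appId;
    private String nodeId;

    public Builder appId(String value) {
      this.appId = value;
      return this;
    }

    public Builder nodeId(String value) {
      this.nodeId = value;
      return this;
    }

    public LogRequestSketch build() {
      return new LogRequestSketch(this);
    }
  }

  public static void main(String[] args) {
    LogRequestSketch request =
        new LogRequestSketch.Builder().appId("app_1").nodeId("nm45").build();
    System.out.println(request.matchesNode("nm45"));
  }
}
```

With this shape, callers cannot mutate the request after {{build()}}, and the null check lives in one place instead of at every call site.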

> Create a general purpose log request with additional query parameters
> -
>
> Key: YARN-10031
> URL: https://issues.apache.org/jira/browse/YARN-10031
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10031-WIP.001.patch, YARN-10031.001.patch, 
> YARN-10031.002.patch
>
>
> The current endpoints are robust but not very flexible with regard to 
> filtering options. I suggest adding an endpoint which provides filtering 
> options.
> E.g.:
> In ATS we have multiple endpoints:
> /containers/{containerid}/logs/{filename}
> /containerlogs/{containerid}/{filename}
> We could add @QueryParams parameters to the REST endpoints like this:
> /containers/{containerid}/logs?fileName=stderr=FAILED=nm45






[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-05 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17207941#comment-17207941
 ] 

Adam Antal commented on YARN-10393:
---

Cherry-picked to branch-3.3, branch-3.2, branch-3.1, branch-3.0. I had 
conflicts with the branch-2.10 patch, so if you'd like to commit to that 
branch, please upload a patch for it.

Thanks for the conversation and the solution [~Jim_Brennan] [~wzzdreamer] 
[~yuanbo].

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.0.4, 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: YARN-10393.001.patch, YARN-10393.002.patch, 
> YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>

[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-05 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10393:
--
Fix Version/s: 3.1.5
   3.3.1
   3.2.2
   3.0.4

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.0.4, 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: YARN-10393.001.patch, YARN-10393.002.patch, 
> YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job was stuck in a 
> live lock state and couldn't make any progress. The queue is full so the 
> pending mapper can’t get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was handled by the resource 
> manager; however, the node manager failed to receive the response. Let’s 
> assume the heartBeatResponseId=$hid in node manager. According to our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at 

[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-05 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10393:
--
Fix Version/s: 3.4.0

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10393.001.patch, YARN-10393.002.patch, 
> YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job was stuck in a 
> live lock state and couldn't make any progress. The queue is full so the 
> pending mapper can’t get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was handled by the resource 
> manager; however, the node manager failed to receive the response. Let’s 
> assume the heartBeatResponseId=$hid in node manager. According to our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-05 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17207912#comment-17207912
 ] 

Adam Antal commented on YARN-10393:
---

Committed to trunk, I will cherry-pick this to other branches now.

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10393.001.patch, YARN-10393.002.patch, 
> YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job was stuck in a 
> live lock state and couldn't make any progress. The queue is full so the 
> pending mapper can’t get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was handled by the resource 
> manager; however, the node manager failed to receive the response. Let’s 
> assume the heartBeatResponseId=$hid in node manager. According to our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>

[jira] [Commented] (YARN-10448) SLS should set default user to handle SYNTH format

2020-10-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17205449#comment-17205449
 ] 

Adam Antal commented on YARN-10448:
---

Thanks for the patch [~zhuqi].

Would you please include a sample test for the fix? Something like submitting 
with a null user and asserting that the user has been changed to {{default}} 
successfully.

Also, could you please move the "default" String to a private static final 
constant?
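
For illustration, the two requests above could look roughly like the sketch below. It is only an assumption about the shape of the change, not the code in the attached patches: the class {{SlsUserDefaults}} and the method {{resolveUser}} are hypothetical names, and treating an empty string like a missing user is an additional assumption.

```java
// Hypothetical helper, not part of the SLS code base.
final class SlsUserDefaults {
  // The literal moved into a private static final constant, as requested.
  private static final String DEFAULT_USER = "default";

  private SlsUserDefaults() {
  }

  /**
   * Returns the user from the SYNTH JSON input, or the default user when
   * the field is missing (null) or empty.
   */
  static String resolveUser(String userFromJson) {
    if (userFromJson == null || userFromJson.isEmpty()) {
      return DEFAULT_USER;
    }
    return userFromJson;
  }

  public static void main(String[] args) {
    System.out.println(resolveUser(null));
    System.out.println(resolveUser("alice"));
  }
}
```

A test along these lines would pass a null user and assert that {{default}} comes back, while an explicitly set user is returned unchanged.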

> SLS should set default user to handle SYNTH format
> --
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch, YARN-10448.002.patch
>
>
> When using the synthetic generator json file example from the doc ( 
> https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format
>  ), it throws the following exception:
> {noformat}
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {noformat}
> So the solution is either:
> 1) to make {{user_name}} a mandatory field, or
> 2) to set default user in SLS code if the json file does not define it.
> IMO, solution 2 might be better, because in most cases (if not all) 
> {{user_name}} has no impact on scheduler performance, thus it is reasonable 
> to make it an optional field, which is also consistent with the {{job.user}} 
> field in SLS JSON file.






[jira] [Commented] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing

2020-10-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17205397#comment-17205397
 ] 

Adam Antal commented on YARN-10447:
---

Committed to trunk, thanks for the contribution [~pbacsko]. 

> TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
> -
>
> Key: YARN-10447
> URL: https://issues.apache.org/jira/browse/YARN-10447
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10447-001.patch, YARN-10447-002.patch, 
> YARN-10447-003.patch
>
>
> YARN-9784 fixed some concurrency related issues in {{TestLeafQueue}}, but not 
> all of them. Occasionally it's still possible to receive an exception from 
> Mockito and the two following stack traces can be observed in the console:
> {noformat}
>   
>   org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
> {noformat}
> or
> {noformat}
> 2020-09-22 14:44:52,584 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.3 avail= vCores:1>
> 2020-09-22 14:44:52,585 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.4 avail= vCores:1>
> Exception in thread "ActivitiesManager thread." java.lang.ClassCastException: 
> java.lang.Integer cannot be cast to java.lang.Boolean
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$$EnhancerByMockitoWithCGLIB$$272c72c5.isMultiNodePlacementEnabled()
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded(ActivitiesManager.java:266)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.access$500(ActivitiesManager.java:63)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:347)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It's probably best to disable the ActivitiesManager thread entirely in this 
> test class; there is no need for it.






[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17205385#comment-17205385
 ] 

Adam Antal commented on YARN-10393:
---

Agreed, +1 to v2 patch.

Any comments [~yuanbo], [~wzzdreamer]?

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10393.001.patch, YARN-10393.002.patch, 
> YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job was stuck in a 
> live lock state and couldn't make any progress. The queue is full so the 
> pending mapper can’t get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was handled by the resource 
> manager; however, the node manager failed to receive the response. Let’s 
> assume the heartBeatResponseId=$hid in node manager. According to our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>   

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-30 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204736#comment-17204736
 ] 

Adam Antal commented on YARN-10393:
---

Hi [~Jim_Brennan],

Thanks for the patch, overall it looks good. As I wrote in my earlier review, 
I'd suggest adding some extra logging when we skip a heartbeat to make it 
explicit, and there's one last checkstyle issue to be taken care of. Otherwise 
we're fine.
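
To make the failure mode from the description concrete, here is a minimal, single-threaded sketch. It is not the patch under review and does not use the real NodeStatusUpdaterImpl API: {{HeartbeatSketch}}, {{ResourceTracker}} and all method names are hypothetical. It only illustrates the idea of keeping completed containers pending until a heartbeat response actually arrives, and of logging explicitly when a heartbeat is lost.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical model of a node manager status updater (illustration only).
final class HeartbeatSketch {
  private static final Logger LOG =
      Logger.getLogger(HeartbeatSketch.class.getName());

  /** Stand-in for the RM side; returns the id of the heartbeat response. */
  interface ResourceTracker {
    int nodeHeartbeat(int lastResponseId, List<String> completedContainers)
        throws IOException;
  }

  // Completed containers that the RM has not acknowledged yet.
  private final List<String> pendingCompleted = new ArrayList<>();
  private int lastResponseId;

  void containerCompleted(String containerId) {
    pendingCompleted.add(containerId);
  }

  List<String> pending() {
    return new ArrayList<>(pendingCompleted);
  }

  /** Sends one heartbeat; returns true only if a response was received. */
  boolean heartbeatOnce(ResourceTracker rm) {
    List<String> report = new ArrayList<>(pendingCompleted);
    try {
      lastResponseId = rm.nodeHeartbeat(lastResponseId, report);
      // Only now is it safe to forget what was reported.
      pendingCompleted.removeAll(report);
      return true;
    } catch (IOException e) {
      // Make the missed heartbeat explicit in the log and keep the
      // completed containers so the next heartbeat reports them again.
      LOG.warning("Heartbeat failed (" + e.getMessage() + "); keeping "
          + report.size() + " completed container(s) for the next attempt");
      return false;
    }
  }
}
```

With this shape, a "Connection reset by peer" on one heartbeat leaves the completed container in the pending list, so the following heartbeat reports it again instead of leaking it.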

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10393.001.patch, YARN-10393.draft.2.patch, 
> YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job was stuck in a 
> live lock state and couldn't make any progress. The queue is full so the 
> pending mapper can’t get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was handled by the resource 
> manager; however, the node manager failed to receive the response. Let’s 
> assume the heartBeatResponseId=$hid in node manager. According to our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: 

[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2020-09-30 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204724#comment-17204724
 ] 

Adam Antal commented on YARN-8737:
--

Thanks for the patch [~Tao Yang].

The patch looks straightforward, however I have some reservations about this 
fix, as it may not be enough in some other corner cases. AFAIU from the 
investigation in YARN-10058 by [~tuyu], we can still bump into this issue after 
locking if we update the queue's statistics without holding the lock of its 
parent queue. I don't have much insight into CS though, so I am a bit reluctant 
to give a confident +1 to this.

Also, as [~wangda] explained
{quote}I'm not sure if this ticket can solve the problem or not.
{quote}
so I would like to double check with someone who has more context on this part 
of the code. I would be more than happy to commit this fix if we can verify 
this.

[~sunilg]/[~prabhujoseph]/[~snemeth] do you fancy a review?

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0, 2.9.3, 3.2.2, 3.1.4
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> An administrator triggered a queue update through the REST API. In the RM, the 
> parent queue refreshes its child queues by calling ParentQueue#reinitialize 
> while, at the same time, async-schedule threads sort the child queues in 
> ParentQueue#sortAndGetChildrenAllocationIterator. A race condition may occur 
> and throw the following exception, because TimSort does not handle concurrent 
> modification of the objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can add a read lock to 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem; the 
> write lock will be held when updating child queues in 
> ParentQueue#reinitialize.
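
A minimal sketch of the locking pattern proposed above, assuming a heavily simplified parent queue. The names ParentQueueSketch, ChildQueue, usedCapacity and sortedChildren are hypothetical stand-ins for illustration, not the real CapacityScheduler classes. The sort runs on a snapshot taken under the read lock, so a concurrent reinitialize (which takes the write lock) cannot swap the children while TimSort is comparing them. This only helps if every update of the compared values also happens under the write lock.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a child queue; the sort key is immutable here.
class ChildQueue {
  final String name;
  final float usedCapacity;

  ChildQueue(String name, float usedCapacity) {
    this.name = name;
    this.usedCapacity = usedCapacity;
  }
}

// Hypothetical stand-in for a parent queue that owns a list of children.
class ParentQueueSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private List<ChildQueue> children = new ArrayList<>();

  // Plays the role of reinitialize: replaces the children under the write lock.
  void reinitialize(List<ChildQueue> newChildren) {
    lock.writeLock().lock();
    try {
      children = new ArrayList<>(newChildren);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Plays the role of sortAndGetChildrenAllocationIterator: copies and sorts
  // under the read lock, so writers are excluded for the duration of the sort.
  List<ChildQueue> sortedChildren() {
    lock.readLock().lock();
    try {
      List<ChildQueue> copy = new ArrayList<>(children);
      copy.sort(Comparator.comparingDouble(q -> q.usedCapacity));
      return copy;
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

Several scheduling threads can hold the read lock at the same time, so this should not serialize the sorting path; only reinitialization blocks it.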



[jira] [Commented] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing

2020-09-30 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204701#comment-17204701
 ] 

Adam Antal commented on YARN-10447:
---

Thanks [~pbacsko]. The patch looks good to me, but some unit tests have failed. I 
know that {{TestDelegationTokenRenewer}} is flaky, but could 
you please upload the v3 patch again to get rid of the failed CS test?

After a +1 from Jenkins I can commit it, if there are no objections.

> TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
> -
>
> Key: YARN-10447
> URL: https://issues.apache.org/jira/browse/YARN-10447
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10447-001.patch, YARN-10447-002.patch, 
> YARN-10447-003.patch
>
>
> YARN-9784 fixed some concurrency-related issues in {{TestLeafQueue}}, but not 
> all of them. Occasionally it's still possible to receive an exception from 
> Mockito, and one of the following two stack traces can be observed in the console:
> {noformat}
>   org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>    Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - 
>    - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method.
> {noformat}
> or
> {noformat}
> 2020-09-22 14:44:52,584 INFO  [main] capacity.TestUtils (TestUtils.java:getMockNode(227)) - node = 127.0.0.3 avail= vCores:1>
> 2020-09-22 14:44:52,585 INFO  [main] capacity.TestUtils (TestUtils.java:getMockNode(227)) - node = 127.0.0.4 avail= vCores:1>
> Exception in thread "ActivitiesManager thread." java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Boolean
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$$EnhancerByMockitoWithCGLIB$$272c72c5.isMultiNodePlacementEnabled()
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded(ActivitiesManager.java:266)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.access$500(ActivitiesManager.java:63)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:347)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It's probably best to disable the ActivitiesManager thread entirely in this test 
> class; there is no need for it.



[jira] [Comment Edited] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing

2020-09-29 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203975#comment-17203975
 ] 

Adam Antal edited comment on YARN-10447 at 9/29/20, 2:22 PM:
-

Thanks [~pbacsko] for the patch.

I think this patch does not work, because in {{CapacityScheduler.java}} you 
have the following piece of code:
{code:java}
public void setActivitiesManagerEnabled(boolean enabled) {
  this.activitiesManagerEnabled = true;
}
{code}
I think you meant {{this.activitiesManagerEnabled = enabled;}} there, is that 
right?

On a side note, I can see an {{activitiesManager.start();}} call guarded by the flag, 
but there's no {{stop()}} call. Couldn't that cause a memory leak?
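
Both points in the comment can be illustrated with a small, self-contained sketch. SchedulerSketch and isWorkerRunning are hypothetical names and the worker thread only stands in for the ActivitiesManager thread; none of this is the actual patch. The setter assigns its parameter rather than a constant, and every guarded start() is paired with a stop() that interrupts and joins the thread, so a test that enables the thread does not leave it running.

```java
// Hypothetical, simplified model of a scheduler that owns a background thread.
class SchedulerSketch {
  private boolean activitiesManagerEnabled = true;
  private Thread worker;

  void setActivitiesManagerEnabled(boolean enabled) {
    // Assign the parameter; a hard-coded "true" would make the setter a no-op.
    this.activitiesManagerEnabled = enabled;
  }

  // Starts the background thread only when the flag is set.
  void start() {
    if (!activitiesManagerEnabled) {
      return;
    }
    worker = new Thread(() -> {
      try {
        while (true) {
          Thread.sleep(10); // placeholder for periodic work
        }
      } catch (InterruptedException e) {
        // Interrupted by stop(): leave the loop and let the thread exit.
      }
    }, "ActivitiesManager thread (sketch)");
    worker.setDaemon(true);
    worker.start();
  }

  // Counterpart of start(): interrupts the thread and waits for it to finish.
  void stop() {
    if (worker == null) {
      return;
    }
    worker.interrupt();
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    worker = null;
  }

  boolean isWorkerRunning() {
    return worker != null && worker.isAlive();
  }
}
```

A test class would call setActivitiesManagerEnabled(false) before start(), in which case no thread is created and stop() has nothing to clean up.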


was (Author: adam.antal):
Thanks [~pbacsko] for the patch.

I think this patch does not work, because in {{CapacityScheduler.java}} you 
have the following piece of code:
{code:java}
public void setActivitiesManagerEnabled(boolean enabled) {
  this.activitiesManagerEnabled = true;
}
{code}
I think you meant {{this.activitiesManagerEnabled = enabled; }} there, is that 
right?

On a side note, I can see an {{activitiesManager.start();}} call guarded by the flag, 
but there's no {{stop()}} call. Couldn't that cause a memory leak?

> TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
> -
>
> Key: YARN-10447
> URL: https://issues.apache.org/jira/browse/YARN-10447
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10447-001.patch, YARN-10447-002.patch
>
>
> YARN-9784 fixed some concurrency-related issues in {{TestLeafQueue}}, but not 
> all of them. Occasionally it's still possible to receive an exception from 
> Mockito, and one of the following two stack traces can be observed in the console:
> {noformat}
>   org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>    Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - 
>    - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method.
> {noformat}
> or
> {noformat}
> 2020-09-22 14:44:52,584 INFO  [main] capacity.TestUtils (TestUtils.java:getMockNode(227)) - node = 127.0.0.3 avail= vCores:1>
> 2020-09-22 14:44:52,585 INFO  [main] capacity.TestUtils (TestUtils.java:getMockNode(227)) - node = 127.0.0.4 avail= vCores:1>
> Exception in thread "ActivitiesManager thread." java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Boolean
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$$EnhancerByMockitoWithCGLIB$$272c72c5.isMultiNodePlacementEnabled()
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded(ActivitiesManager.java:266)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.access$500(ActivitiesManager.java:63)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:347)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It's probably best to disable the ActivitiesManager thread entirely in this test 
> class; there is no need for it.



[jira] [Commented] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing

2020-09-29 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203975#comment-17203975
 ] 

Adam Antal commented on YARN-10447:
---

Thanks [~pbacsko] for the patch.

I think this patch does not work, because in {{CapacityScheduler.java}} you 
have the following piece of code:
{code:java}
public void setActivitiesManagerEnabled(boolean enabled) {
  this.activitiesManagerEnabled = true;
}
{code}
I think you meant {{  this.activitiesManagerEnabled = enabled;}} there, is that 
right?

On a side note, I can see an {{activitiesManager.start();}} call guarded by the flag, 
but there's no {{stop()}} call. Couldn't that cause a memory leak?

> TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
> -
>
> Key: YARN-10447
> URL: https://issues.apache.org/jira/browse/YARN-10447
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10447-001.patch, YARN-10447-002.patch
>
>
> YARN-9784 fixed some concurrency-related issues in {{TestLeafQueue}}, but not 
> all of them. Occasionally it's still possible to receive an exception from 
> Mockito, and one of the following two stack traces can be observed in the console:
> {noformat}
>   org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>    Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - 
>    - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method.
> {noformat}
> or
> {noformat}
> 2020-09-22 14:44:52,584 INFO  [main] capacity.TestUtils (TestUtils.java:getMockNode(227)) - node = 127.0.0.3 avail= vCores:1>
> 2020-09-22 14:44:52,585 INFO  [main] capacity.TestUtils (TestUtils.java:getMockNode(227)) - node = 127.0.0.4 avail= vCores:1>
> Exception in thread "ActivitiesManager thread." java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Boolean
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$$EnhancerByMockitoWithCGLIB$$272c72c5.isMultiNodePlacementEnabled()
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded(ActivitiesManager.java:266)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.access$500(ActivitiesManager.java:63)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:347)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It's probably best to disable the ActivitiesManager thread entirely in this test 
> class; there is no need for it.



[jira] [Updated] (YARN-10114) Tail -f style CLI tool for extracting logs of running containers

2020-09-28 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10114:
--
Summary: Tail -f style CLI tool for extracting logs of running containers  
(was: Tail -f styled CLI tool for extracting logs of running containers)

> Tail -f style CLI tool for extracting logs of running containers
> 
>
> Key: YARN-10114
> URL: https://issues.apache.org/jira/browse/YARN-10114
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Minor
>
> Let's check whether we can come up with a solution for this.
> The --follow option could be part of the LogsCLI tool, and would maintain a 
> connection with the NM to receive the local logs.
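
The follow behaviour described above can be sketched against a local file, as a rough analogue only. LogFollower and poll are hypothetical names; the real feature would read from a NodeManager connection rather than the local file system, and nothing here reflects the LogsCLI implementation. Each poll returns only the bytes appended since the previous call, which is the core of a tail -f style loop.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical follower that remembers how far it has read into a file.
class LogFollower {
  private final String path;
  private long offset = 0;

  LogFollower(String path) {
    this.path = path;
  }

  // Returns the content appended since the last call, or "" if nothing is new.
  // Assumes appended chunks end on UTF-8 character boundaries.
  String poll() {
    try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
      long length = file.length();
      if (length <= offset) {
        return "";
      }
      byte[] buffer = new byte[(int) (length - offset)];
      file.seek(offset);
      file.readFully(buffer);
      offset = length;
      return new String(buffer, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A caller would invoke poll() in a loop with a short sleep and print whatever it returns until the container finishes.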



[jira] [Assigned] (YARN-10443) Document options of logs CLI

2020-09-21 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal reassigned YARN-10443:
-

Assignee: Ankit Kumar

> Document options of logs CLI
> 
>
> Key: YARN-10443
> URL: https://issues.apache.org/jira/browse/YARN-10443
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Ankit Kumar
>Priority: Major
>
> It's bugging me a lot that the YARN logs CLI is poorly documented. I always 
> have to type {{yarn logs -help}} to see the full list of supported options. 
> It would be nice to have it properly documented on our website.
> The current 
> [documentation|https://hadoop.apache.org/docs/r3.3.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#logs]
>  on the website shows only 5 supported options.
> The output of the help command, however, shows more:
> {noformat}
> Retrieve logs for YARN applications.
> usage: yarn logs -applicationId  [OPTIONS]
> general options are:
>  -am                        Prints the AM Container logs
>                             for this application.
>                             Specify comma-separated
>                             value to get logs for
>                             related AM Container. For
>                             example, If we specify -am
>                             1,2, we will get the logs
>                             for the first AM Container
>                             as well as the second AM
>                             Container. To get logs for
>                             all AM Containers, use -am
>                             ALL. To get logs for the
>                             latest AM Container, use -am
>                             -1. By default, it will
>                             print all available logs.
>                             Work with -log_files to get
>                             only specific logs.
>  -appOwner                  AppOwner (assumed to be
>                             current user if not
>                             specified)
>  -client_max_retries        Set max retry number for a
>                             retry client to get the
>                             container logs for the
>                             running applications. Use a
>                             negative value to make retry
>                             forever. The default value
>                             is 30.
>  -client_retry_interval_ms  Work with
>                             --client_max_retries to
>                             create a retry client. The
>                             default value is 1000.
>  -clusterId                 ClusterId. By default, it
>                             will take default cluster id
>                             from the RM
>  -containerId               ContainerId. By default, it
>                             will print all available
>                             logs. Work with -log_files
>                             to get only specific logs.
>                             If specified, the
>                             applicationId can be omitted
>  -help                      Displays help for all
>                             commands.
>  -list_nodes                Show the list of nodes that
>                             successfully aggregated
>                             logs. This option can only
>                             be used with finished
>                             applications.
>  -log_files                 Specify comma-separated
>                             value to get exact matched
>                             log files. Use "ALL" or "*"
>                             to fetch all the log files
>                             for the container.
>  -log_files_pattern         Specify comma-separated
>                             value to get matched log
>                             files by using java 

[jira] [Commented] (YARN-10443) Document options of logs CLI

2020-09-21 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199289#comment-17199289
 ] 

Adam Antal commented on YARN-10443:
---

Hey [~akumar],

No, I'm not; I assigned it to you. I'll be happy to review your patch once you're 
ready.

> Document options of logs CLI
> 
>
> Key: YARN-10443
> URL: https://issues.apache.org/jira/browse/YARN-10443
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Ankit Kumar
>Priority: Major
>
> It's bugging me a lot that the YARN logs CLI is poorly documented. I always 
> have to type {{yarn logs -help}} to see the full list of supported options. 
> It would be nice to have it properly documented on our website.
> The current 
> [documentation|https://hadoop.apache.org/docs/r3.3.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#logs]
>  on the website shows only 5 supported options.
> The output of the help command, however, shows more:
> {noformat}
> Retrieve logs for YARN applications.
> usage: yarn logs -applicationId  [OPTIONS]
> general options are:
>  -am                        Prints the AM Container logs
>                             for this application.
>                             Specify comma-separated
>                             value to get logs for
>                             related AM Container. For
>                             example, If we specify -am
>                             1,2, we will get the logs
>                             for the first AM Container
>                             as well as the second AM
>                             Container. To get logs for
>                             all AM Containers, use -am
>                             ALL. To get logs for the
>                             latest AM Container, use -am
>                             -1. By default, it will
>                             print all available logs.
>                             Work with -log_files to get
>                             only specific logs.
>  -appOwner                  AppOwner (assumed to be
>                             current user if not
>                             specified)
>  -client_max_retries        Set max retry number for a
>                             retry client to get the
>                             container logs for the
>                             running applications. Use a
>                             negative value to make retry
>                             forever. The default value
>                             is 30.
>  -client_retry_interval_ms  Work with
>                             --client_max_retries to
>                             create a retry client. The
>                             default value is 1000.
>  -clusterId                 ClusterId. By default, it
>                             will take default cluster id
>                             from the RM
>  -containerId               ContainerId. By default, it
>                             will print all available
>                             logs. Work with -log_files
>                             to get only specific logs.
>                             If specified, the
>                             applicationId can be omitted
>  -help                      Displays help for all
>                             commands.
>  -list_nodes                Show the list of nodes that
>                             successfully aggregated
>                             logs. This option can only
>                             be used with finished
>                             applications.
>  -log_files                 Specify comma-separated
>                             value to get exact matched
>                             log files. Use "ALL" or "*"
>                             to fetch all the log files
>                             for the container.
>  -log_files_pattern         Specify comma-separated
>   

[jira] [Created] (YARN-10443) Document options of logs CLI

2020-09-18 Thread Adam Antal (Jira)
Adam Antal created YARN-10443:
-

 Summary: Document options of logs CLI
 Key: YARN-10443
 URL: https://issues.apache.org/jira/browse/YARN-10443
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.3.0
Reporter: Adam Antal


It's bugging me a lot that the YARN logs CLI is poorly documented. I always 
have to type {{yarn logs -help}} to see the full list of supported options. It 
would be nice to have it properly documented on our website.

The current 
[documentation|https://hadoop.apache.org/docs/r3.3.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#logs]
 on the website shows only 5 supported options.
The output of the help command, however, shows more:
{noformat}
Retrieve logs for YARN applications.
usage: yarn logs -applicationId  [OPTIONS]

general options are:
 -am                        Prints the AM Container logs
                            for this application.
                            Specify comma-separated
                            value to get logs for
                            related AM Container. For
                            example, If we specify -am
                            1,2, we will get the logs
                            for the first AM Container
                            as well as the second AM
                            Container. To get logs for
                            all AM Containers, use -am
                            ALL. To get logs for the
                            latest AM Container, use -am
                            -1. By default, it will
                            print all available logs.
                            Work with -log_files to get
                            only specific logs.
 -appOwner                  AppOwner (assumed to be
                            current user if not
                            specified)
 -client_max_retries        Set max retry number for a
                            retry client to get the
                            container logs for the
                            running applications. Use a
                            negative value to make retry
                            forever. The default value
                            is 30.
 -client_retry_interval_ms  Work with
                            --client_max_retries to
                            create a retry client. The
                            default value is 1000.
 -clusterId                 ClusterId. By default, it
                            will take default cluster id
                            from the RM
 -containerId               ContainerId. By default, it
                            will print all available
                            logs. Work with -log_files
                            to get only specific logs.
                            If specified, the
                            applicationId can be omitted
 -help                      Displays help for all
                            commands.
 -list_nodes                Show the list of nodes that
                            successfully aggregated
                            logs. This option can only
                            be used with finished
                            applications.
 -log_files                 Specify comma-separated
                            value to get exact matched
                            log files. Use "ALL" or "*"
                            to fetch all the log files
                            for the container.
 -log_files_pattern         Specify comma-separated
                            value to get matched log
                            files by using java regex.
                            Use ".*" to fetch all the
                            log files for the container.
 -nodeAddress               NodeAddress in the format
                            nodename:port
 -out                       Local directory for storing
  

[jira] [Commented] (YARN-950) Ability to limit or avoid aggregating logs beyond a certain size

2020-09-17 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197732#comment-17197732
 ] 

Adam Antal commented on YARN-950:
-

Hi [~epayne],

Sorry for the late answer. I unassigned this from myself, but I'd be happy to 
help review a patch if you have one.

For our internal use case it was enough to detect these large aggregated log 
files. That was implemented in YARN-8199, so I did not feel a strong push 
for this feature from our side.

> Ability to limit or avoid aggregating logs beyond a certain size
> 
>
> Key: YARN-950
> URL: https://issues.apache.org/jira/browse/YARN-950
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager
>Affects Versions: 0.23.9, 2.6.0
>Reporter: Jason Darrell Lowe
>Assignee: Adam Antal
>Priority: Major
>
> It would be nice if ops could configure a cluster such that any container log 
> beyond a configured size would either only have a portion of the log 
> aggregated or not aggregated at all.  This would help speed up the recovery 
> path for cases where a container creates an enormous log and fills a disk, as 
> currently it tries to aggregate the entire, enormous log rather than only 
> aggregating a small portion or simply deleting it.



[jira] [Assigned] (YARN-950) Ability to limit or avoid aggregating logs beyond a certain size

2020-09-17 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal reassigned YARN-950:
---

Assignee: (was: Adam Antal)

> Ability to limit or avoid aggregating logs beyond a certain size
> 
>
> Key: YARN-950
> URL: https://issues.apache.org/jira/browse/YARN-950
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager
>Affects Versions: 0.23.9, 2.6.0
>Reporter: Jason Darrell Lowe
>Priority: Major
>
> It would be nice if ops could configure a cluster such that any container log 
> beyond a configured size would either only have a portion of the log 
> aggregated or not aggregated at all.  This would help speed up the recovery 
> path for cases where a container creates an enormous log and fills a disk, as 
> currently it tries to aggregate the entire, enormous log rather than only 
> aggregating a small portion or simply deleting it.



[jira] [Commented] (YARN-10031) Create a general purpose log request with additional query parameters

2020-09-17 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197729#comment-17197729
 ] 

Adam Antal commented on YARN-10031:
---

Thanks for the response [~gandras].

I accidentally made an error in my previous comment, sorry for that. By 
simplification I meant this:
{code:java}
if (key.toString().equals(logRequest.getContainerId())) {
  ...
}
{code}
because this one statement covers the null check as well and is more concise.
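
To illustrate the null-safety point, here is a minimal, self-contained sketch. The class and method names are hypothetical and not part of the patch. It relies only on the fact that {{String#equals}} returns false for a null argument instead of throwing. Note that a null container ID then matches no key, whereas the earlier two-part condition matched every key, so the two forms are only interchangeable if a missing ID should not match anything.

```java
// Hypothetical helper, for illustration only (not from the YARN-10031 patch).
final class ContainerIdMatchSketch {

  // Returns true only when the key's string form equals the requested ID.
  // String.equals(null) returns false, so no explicit null check is needed.
  static boolean matches(Object key, String requestedContainerId) {
    return key.toString().equals(requestedContainerId);
  }

  public static void main(String[] args) {
    System.out.println(matches("container_1", "container_1")); // true
    System.out.println(matches("container_1", "container_2")); // false
    System.out.println(matches("container_1", null));          // false, no NPE
  }
}
```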

Regarding the parameter collections, I can understand the behaviour from the 
unit tests. IMO, since the approach is fine, we can go ahead with the next step.

> Create a general purpose log request with additional query parameters
> -
>
> Key: YARN-10031
> URL: https://issues.apache.org/jira/browse/YARN-10031
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10031-WIP.001.patch, YARN-10031.001.patch
>
>
> The current endpoints are robust but not very flexible with regards to 
> filtering options. I suggest to add an endpoint which provides filtering 
> options.
> E.g.:
> In ATS we have multiple endpoints:
> /containers/{containerid}/logs/{filename}
> /containerlogs/{containerid}/{filename}
> We could add @QueryParams parameters to the REST endpoints like this:
> /containers/{containerid}/logs?fileName=stderr=FAILED=nm45



[jira] [Commented] (YARN-9333) TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes fails intermittent

2020-09-17 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197718#comment-17197718
 ] 

Adam Antal commented on YARN-9333:
--

I think the potential benefits outweigh the cons, so I'm +1 on pushing the 
patch in if there are no strong objections.

It would be nice to double-check that it's not a product issue with the 
FairScheduler, but since the failure is not reproducible locally, we should 
assume it's infra-related, which implies that pushing the patch is a good 
solution.

> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittent
> --
>
> Key: YARN-9333
> URL: https://issues.apache.org/jira/browse/YARN-9333
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9333-001.patch, YARN-9333-002.patch, 
> YARN-9333-003.patch, YARN-9333-debug1.patch
>
>
> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittent - observed in YARN-9311.
> {code}
> [ERROR] 
> testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes[MinSharePreemptionWithDRF](org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption)
>   Time elapsed: 11.056 s  <<< FAILURE!
> java.lang.AssertionError: Incorrect # of containers on the greedy app 
> expected:<6> but was:<4>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyPreemption(TestFairSchedulerPreemption.java:296)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyRelaxLocalityPreemption(TestFairSchedulerPreemption.java:537)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes(TestFairSchedulerPreemption.java:473)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-17 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197715#comment-17197715
 ] 

Adam Antal commented on YARN-10393:
---

Thanks for the valuable comments [~Jim_Brennan] [~yuanbo] [~wzzdreamer].

I went through the reasoning and I agree with the solution. As Jim explained, 
{{lastHeartBeatID}} is indeed not the right approach, but the conditional clear 
based on missedHeartbeat looks good to me. To be honest, I don't feel I can 
give a confident +1 to this patch, but since we have a draft patch now, maybe 
it's time to involve some senior folks to give it a +1.

Anyway, the draft looks good. One suggestion is to add DEBUG or INFO level 
logging when {{missedHearbeat}} is true, something like this:
{code:java}
if (!missedHearbeat) {
  pendingCompletedContainers.clear();
} else {
  LOG.info("Skip clearing pending completed containers due to missed heartbeat.");
}
missedHearbeat = false;
{code}
Of course, you can figure this out from the logs anyway, but let's keep it 
explicit - it may also help in investigating such cases in the future.
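
For reference, the intended behaviour can be sketched in isolation as below. This is only an illustration under assumptions: {{HeartbeatSketch}} and its methods are hypothetical, it is not the real {{NodeStatusUpdaterImpl}}, and it prints to standard output instead of using the NodeManager logger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the status updater, for illustration only.
final class HeartbeatSketch {
  private final List<String> pendingCompletedContainers = new ArrayList<>();
  private boolean missedHearbeat = false; // spelling follows the draft patch

  void containerCompleted(String containerId) {
    pendingCompletedContainers.add(containerId);
  }

  // Called when a heartbeat response was not received.
  void heartbeatFailed() {
    missedHearbeat = true;
  }

  // Called after a successful heartbeat. Returns true if the pending list
  // was cleared, false if clearing was skipped because of a missed heartbeat.
  boolean heartbeatSucceeded() {
    boolean cleared = !missedHearbeat;
    if (cleared) {
      pendingCompletedContainers.clear();
    } else {
      System.out.println(
          "Skip clearing pending completed containers due to missed heartbeat.");
    }
    missedHearbeat = false;
    return cleared;
  }

  int pendingCount() {
    return pendingCompletedContainers.size();
  }

  public static void main(String[] args) {
    HeartbeatSketch updater = new HeartbeatSketch();
    updater.containerCompleted("container_1");
    updater.heartbeatFailed();
    // The first successful heartbeat after a miss keeps the pending entry.
    System.out.println(updater.heartbeatSucceeded() + " " + updater.pendingCount());
    // The next successful heartbeat clears it.
    System.out.println(updater.heartbeatSucceeded() + " " + updater.pendingCount());
  }
}
```

The point of the explicit log line is that the skipped clear becomes visible without having to correlate it with the earlier heartbeat exception.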

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job was stuck in a 
> live lock state and couldn't make any progress. The queue is full so the 
> pending mapper can’t get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was in fact handled by the resource 
> manager, but the node manager failed to receive the response. Let’s 
> assume the heartBeatResponseId=$hid in node manager. According to our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at 

[jira] [Commented] (YARN-9333) TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes fails intermittent

2020-09-07 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191759#comment-17191759
 ] 

Adam Antal commented on YARN-9333:
--

I've just seen this failure occurring at a much higher frequency than before. 
See YARN-10332 for example, but I see it occurring in other Jenkins results as well.

Did you have some time to work on this [~prabhujoseph]?

CC [~snemeth], [~shuzirra].

> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittent
> --
>
> Key: YARN-9333
> URL: https://issues.apache.org/jira/browse/YARN-9333
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittent - observed in YARN-9311.
> {code}
> [ERROR] 
> testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes[MinSharePreemptionWithDRF](org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption)
>   Time elapsed: 11.056 s  <<< FAILURE!
> java.lang.AssertionError: Incorrect # of containers on the greedy app 
> expected:<6> but was:<4>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyPreemption(TestFairSchedulerPreemption.java:296)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyRelaxLocalityPreemption(TestFairSchedulerPreemption.java:537)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes(TestFairSchedulerPreemption.java:473)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> 

[jira] [Resolved] (YARN-10329) Flaky test cases in Fair Scheduler

2020-09-07 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal resolved YARN-10329.
---
Resolution: Duplicate

> Flaky test cases in Fair Scheduler
> --
>
> Key: YARN-10329
> URL: https://issues.apache.org/jira/browse/YARN-10329
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Hudáky Márton Gyula
>Assignee: Adam Antal
>Priority: Minor
>
> * The following 2 test cases are failing on unrelated patches very often:
> hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
> hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption
> Here is an example of both failures
> {code:java}
> [ERROR] Tests run: 105, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 27.481 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
> [ERROR] 
> testNormalizationUsingQueueMaximumAllocation(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler)
>   Time elapsed: 0.178 s  <<< ERROR!
> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> PartitionQueueMetrics,partition= already exists!
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>   at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionMetrics(QueueMetrics.java:360)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.incrPendingResources(QueueMetrics.java:599)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updatePendingResources(AppSchedulingInfo.java:399)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:331)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:358)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:194)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:462)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:931)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.allocateAppAttempt(TestFairScheduler.java:435)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testNormalizationUsingQueueMaximumAllocation(TestFairScheduler.java:409)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> 

[jira] [Assigned] (YARN-10329) Flaky test cases in Fair Scheduler

2020-09-07 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal reassigned YARN-10329:
-

Assignee: Adam Antal

> Flaky test cases in Fair Scheduler
> --
>
> Key: YARN-10329
> URL: https://issues.apache.org/jira/browse/YARN-10329
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Hudáky Márton Gyula
>Assignee: Adam Antal
>Priority: Minor
>

[jira] [Commented] (YARN-10329) Flaky test cases in Fair Scheduler

2020-09-07 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191756#comment-17191756
 ] 

Adam Antal commented on YARN-10329:
---

I think the first failure is handled in YARN-10297 and the second one will be 
handled in YARN-9333.

I suggest closing this as a duplicate.

> Flaky test cases in Fair Scheduler
> --
>
> Key: YARN-10329
> URL: https://issues.apache.org/jira/browse/YARN-10329
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Hudáky Márton Gyula
>Priority: Minor
>

[jira] [Commented] (YARN-10031) Create a general purpose log request with additional query parameters

2020-09-07 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191754#comment-17191754
 ] 

Adam Antal commented on YARN-10031:
---

Thanks for the patch [~gandras]. 

Looks good overall: I agree with the direction. One thing I would add is that I 
think we should separate the new code paths from the existing ones - I suggest 
using a term like "generic", "extended" or "filtered" in the functions and 
objects (like {{UserLogsRequest}}) and using it consistently.

Generic comment: please don't forget to add unit tests to the patch.

I see a few minor nits besides these, but I'll take another look in the next 
patch:

In {{LogAggregationTFileController#getLogMetaFilesOfNode}}:
- use try-with-resources for {{LogReader}}
- this can be simplified:
{code:java}
if (logRequest.getContainerId() == null ||
logRequest.getContainerId().equals(key.toString())) {
...
{code}
to
{code:java}
if (key.toString().equals(logRequest.getContainerId())) {
   ...
{code}
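
As for the try-with-resources nit, a minimal sketch of the pattern is below. It assumes the reader type is {{AutoCloseable}}; {{StubLogReader}} is a hypothetical stand-in used only for illustration, not the real {{LogReader}}.

```java
// Hypothetical illustration of try-with-resources; not code from the patch.
final class TryWithResourcesSketch {

  // Stand-in for a reader that must be closed after use.
  static final class StubLogReader implements AutoCloseable {
    boolean closed = false;

    String nextKey() {
      return "container_1";
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  // The reader is closed automatically when the try block exits,
  // whether normally or through an exception.
  static String readFirstKey(StubLogReader reader) {
    try (StubLogReader r = reader) {
      return r.nextKey();
    }
  }

  public static void main(String[] args) {
    StubLogReader reader = new StubLogReader();
    System.out.println(readFirstKey(reader)); // container_1
    System.out.println(reader.closed);        // true
  }
}
```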

I see that in the {{/userlogs}} endpoint ({{getAggregatedLogsMeta()}}) we have 
this as a parameter:
{code:java}
@QueryParam(YarnWebServiceParams.FILESIZE) Set fileSize,

{code}
I don't see how an input like {{file_size=<1500}} translates to this set. Could 
you also add some comment or cover it with tests?

> Create a general purpose log request with additional query parameters
> -
>
> Key: YARN-10031
> URL: https://issues.apache.org/jira/browse/YARN-10031
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10031-WIP.001.patch, YARN-10031.001.patch
>
>
> The current endpoints are robust but not very flexible with regard to 
> filtering options. I suggest adding an endpoint which provides filtering 
> options.
> E.g.:
> In ATS we have multiple endpoints:
> /containers/{containerid}/logs/{filename}
> /containerlogs/{containerid}/{filename}
> We could add @QueryParams parameters to the REST endpoints like this:
> /containers/{containerid}/logs?fileName=stderr=FAILED=nm45






[jira] [Commented] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-09-07 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191621#comment-17191621
 ] 

Adam Antal commented on YARN-10332:
---

Thanks for the patch [~yehuanhuan]. Committed to trunk, branch-3.3 and 
branch-3.2. 

Thanks for the review [~bibinchundatt].

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Fix For: 3.2.2, 3.4.0, 3.3.1
>
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.






[jira] [Updated] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-09-07 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10332:
--
Fix Version/s: 3.3.1
   3.4.0
   3.2.2

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Fix For: 3.2.2, 3.4.0, 3.3.1
>
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.






[jira] [Commented] (YARN-9136) getNMResourceInfo NodeManager REST API method is not documented

2020-09-07 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191572#comment-17191572
 ] 

Adam Antal commented on YARN-9136:
--

Thanks for the patch [~mhudaky], committed to trunk.

Thanks for the reviews [~snemeth], [~pbacsko].

> getNMResourceInfo NodeManager REST API method is not documented
> ---
>
> Key: YARN-9136
> URL: https://issues.apache.org/jira/browse/YARN-9136
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Hudáky Márton Gyula
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9136.001.patch, YARN-9136.002.patch, 
> YARN-9136.003.patch, YARN-9136.004.patch
>
>
> I cannot find documentation for the resources endpoint in NMWebServices: 
> /ws/v1/node/resources/\{resourcename\}
> I looked in the file NodeManagerRest.md for documentation but haven't found 
> any.
> This is supposedly unintentionally not documented: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRest.md






[jira] [Updated] (YARN-9136) getNMResourceInfo NodeManager REST API method is not documented

2020-09-07 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-9136:
-
Fix Version/s: 3.4.0

> getNMResourceInfo NodeManager REST API method is not documented
> ---
>
> Key: YARN-9136
> URL: https://issues.apache.org/jira/browse/YARN-9136
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Hudáky Márton Gyula
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9136.001.patch, YARN-9136.002.patch, 
> YARN-9136.003.patch, YARN-9136.004.patch
>
>
> I cannot find documentation for the resources endpoint in NMWebServices: 
> /ws/v1/node/resources/\{resourcename\}
> I looked in the file NodeManagerRest.md for documentation but haven't found 
> any.
> This is supposedly unintentionally not documented: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRest.md






[jira] [Commented] (YARN-9136) getNMResourceInfo NodeManager REST API method is not documented

2020-09-03 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190110#comment-17190110
 ] 

Adam Antal commented on YARN-9136:
--

+1. Any additional comments [~snemeth], [~pbacsko]?

> getNMResourceInfo NodeManager REST API method is not documented
> ---
>
> Key: YARN-9136
> URL: https://issues.apache.org/jira/browse/YARN-9136
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Hudáky Márton Gyula
>Priority: Major
> Attachments: YARN-9136.001.patch, YARN-9136.002.patch, 
> YARN-9136.003.patch, YARN-9136.004.patch
>
>
> I cannot find documentation for the resources endpoint in NMWebServices: 
> /ws/v1/node/resources/\{resourcename\}
> I looked in the file NodeManagerRest.md for documentation but haven't found 
> any.
> This is supposedly unintentionally not documented: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRest.md






[jira] [Commented] (YARN-10419) Javadoc error in hadoop-yarn-server-common module

2020-09-03 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190042#comment-17190042
 ] 

Adam Antal commented on YARN-10419:
---

Thanks for filing this issue [~aajisaka]. I don't know why the precommit passed 
for YARN-10304 either; I will be more cautious in the future.

> Javadoc error in hadoop-yarn-server-common module
> -
>
> Key: YARN-10419
> URL: https://issues.apache.org/jira/browse/YARN-10419
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: build, documentation
>Reporter: Akira Ajisaka
>Assignee: Masatake Iwasaki
>Priority: Major
> Fix For: 3.4.0
>
>
> {noformat}
> $ mvn clean process-sources javadoc:javadoc-no-fork -pl 
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common
> (snip)
> [ERROR] 
> /Users/aajisaka/git/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/RemoteLogPathEntry.java:23:
>  error: unknown tag: ROOT_PATH
> [ERROR]  *   /%USER/
> [ERROR]  ^
> [ERROR] 
> /Users/aajisaka/git/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/RemoteLogPathEntry.java:23:
>  error: unknown tag: SUFFIX
> [ERROR]  *   /%USER/
> {noformat}
> Full log: https://gist.github.com/aajisaka/46fde3cbd9211fc09ee4040b85251e9c






[jira] [Commented] (YARN-10386) Create new JSON schema for Placement Rules

2020-09-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188421#comment-17188421
 ] 

Adam Antal commented on YARN-10386:
---

Thanks for the addendum patch [~pbacsko]. +1 from me, committed to trunk (let's 
not block other issues with this).

> Create new JSON schema for Placement Rules
> --
>
> Key: YARN-10386
> URL: https://issues.apache.org/jira/browse/YARN-10386
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, capacityscheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: MappingRulesDescription_v1.json, YARN-10386-001.patch, 
> YARN-10386-002.patch, YARN-10386-003.patch, YARN-10386-004.patch, 
> YARN-10386-005.patch, YARN-10386-006.patch, YARN-10386-007.patch, 
> YARN-10386-008.patch, YARN-10386-appendum.patch, YARN-10386-appendum2.patch
>
>
> Tasks in this JIRA:
>  # Create new JSON schema
>  # Add Maven plugin which generates Java POJOs based on the schema
>  # Add helper class which essentially does the same as #2 (for dev purposes)






[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188306#comment-17188306
 ] 

Adam Antal commented on YARN-10393:
---

Thanks for the discussion above, I think we've seen several approaches there.

In my opinion our best solution so far is this one:
bq. Another option would be to remove pendingCompletedContainers.clear() from 
the end of removeOrTrackCompletedContainersFromContext() - it doesn't really 
belong there anyway, and do it directly in the run() method. By doing that, 
you can make it conditional on the heartbeatId changing. Something like: [...]

Also that would generally resolve all heartbeat processing problems. The 
concrete bug report was about the completed containers, but this is a more 
general solution for this bug, and it's quite elegant.
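
A simplified model of that idea, not the actual {{NodeStatusUpdaterImpl}} code: the pending set is cleared only when a response with a new heartbeat ID arrives, so a lost response leaves the completed containers queued for the next attempt. The class {{HeartbeatSketch}} and its method names are hypothetical, and the sketch assumes no container completes while a heartbeat is in flight.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the NM side of the heartbeat; names are illustrative.
class HeartbeatSketch {
  private final Set<String> pendingCompletedContainers = new LinkedHashSet<>();
  private int lastHeartbeatId = 0;

  void containerCompleted(String containerId) {
    pendingCompletedContainers.add(containerId);
  }

  // Snapshot of the containers that the next heartbeat would report.
  Set<String> toReport() {
    return new LinkedHashSet<>(pendingCompletedContainers);
  }

  // Called after each heartbeat attempt; a null response models a lost reply.
  void onHeartbeatResult(Integer responseId) {
    if (responseId != null && responseId != lastHeartbeatId) {
      // The RM acknowledged a new heartbeat: safe to forget what was reported.
      lastHeartbeatId = responseId;
      pendingCompletedContainers.clear();
    }
    // Otherwise keep the set so the same containers are reported again.
  }
}
```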

Regarding the two ways you mentioned [~yuanbo], I'd prefer option 2:
bq. 2. Resend the container ID from recentlyStoppedContainers periodically 
(maybe every 1 min?); once it gets a response from the RM, it is removed from 
recentlyStoppedContainers and never retried again.

I believe option 1 does not guarantee a fix in case of e.g. an intermittent 
network disruption, where the same thing can happen: even if we retry 3 times, 
we may end up in the same situation. If there's a solution that does not assume 
the connection issue resolves before the retries are exhausted, I think we 
should go with it: it makes Hadoop more resilient, not less (because we no 
longer depend on that assumption). Obviously everything is configurable (you 
can set an arbitrarily large number of retries), and you cannot achieve 100% 
resilience, but I find option 2 the logically better solution.
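
For illustration only, option 2 could look roughly like the sketch below: stopped containers stay in a map until the RM acknowledges them and are re-sent whenever the resend interval has elapsed. {{ResendSketch}}, its methods and the interval handling are assumptions of mine, not code from any patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tracker for stopped containers awaiting an RM acknowledgement.
class ResendSketch {
  // containerId -> time (ms) of the last report attempt
  private final Map<String, Long> unacked = new HashMap<>();
  private final long resendIntervalMs;

  ResendSketch(long resendIntervalMs) {
    this.resendIntervalMs = resendIntervalMs;
  }

  // A newly stopped container is due for reporting immediately.
  void containerStopped(String id, long nowMs) {
    unacked.put(id, nowMs - resendIntervalMs);
  }

  // Returns the IDs to (re)send now and records the attempt time.
  List<String> dueForResend(long nowMs) {
    List<String> due = new ArrayList<>();
    for (Map.Entry<String, Long> e : unacked.entrySet()) {
      if (nowMs - e.getValue() >= resendIntervalMs) {
        due.add(e.getKey());
        e.setValue(nowMs);
      }
    }
    return due;
  }

  // Once the RM has acknowledged the container, it is never retried again.
  void acknowledged(String id) {
    unacked.remove(id);
  }
}
```

Unlike a fixed retry count, this keeps retrying for as long as the acknowledgement is missing, which is the property argued for above.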

However even option 2 provides a less general solution than the one mentioned 
by [~Jim_Brennan]. I would be the most comfortable with combining the two 
solutions, but that might be too much for this issue. Let's elaborate on this a 
bit. What do others think?

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job stuck in a 
> live lock state and couldn't make any progress. The queue is full so the 
> pending mapper can’t get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was handled by the resource 
> manager; however, the node manager failed to receive the response. Let’s 
> assume the heartBeatResponseId=$hid in node manager. According to our current 
> configuration, next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: 

[jira] [Commented] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-08-31 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187603#comment-17187603
 ] 

Adam Antal commented on YARN-10332:
---

Also, we probably need branch-3.3 and branch-3.2 patches. If you have some 
time, you can upload all of these (but please wait 5 minutes between the 
uploads; sometimes Jenkins only grabs the latest one).

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.






[jira] [Commented] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-08-31 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187602#comment-17187602
 ] 

Adam Antal commented on YARN-10332:
---

[~yehuanhuan]: sorry for the delay. Could you please reupload your patch to 
get a clean Jenkins result?

If there are no objections and Jenkins gives a +1, I can commit this soon.

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.






[jira] [Commented] (YARN-9136) getNMResourceInfo NodeManager REST API method is not documented

2020-08-27 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185765#comment-17185765
 ] 

Adam Antal commented on YARN-9136:
--

Thanks for working on this [~mhudaky].

AFAIU, let's skip the FPGA part then (could you verify if I understood 
correctly [~snemeth]?).

Regarding the patch, it looks good overall, but I realized that there are 
dependent objects that are not specified in the documentation. 
E.g.:
{noformat}
| Properties | Data Type | Description |
...
| totalGpuDevices | List of GpuDevice objects | Contains the representations of GPU devices |
{noformat}
Later we don't define the {{GpuDevice}} object, so it's still undocumented.

We should get to a point where each of these objects is broken down into 
primitive data types (int, string, etc.).

> getNMResourceInfo NodeManager REST API method is not documented
> ---
>
> Key: YARN-9136
> URL: https://issues.apache.org/jira/browse/YARN-9136
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Hudáky Márton Gyula
>Priority: Major
> Attachments: YARN-9136.001.patch, YARN-9136.002.patch, 
> YARN-9136.003.patch
>
>
> I cannot find documentation for the resources endpoint in NMWebServices: 
> /ws/v1/node/resources/\{resourcename\}
> I looked in the file NodeManagerRest.md for documentation but haven't found 
> any.
> This is supposedly unintentionally not documented: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRest.md






[jira] [Comment Edited] (YARN-10386) Create new JSON schema for Placement Rules

2020-08-27 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185764#comment-17185764
 ] 

Adam Antal edited comment on YARN-10386 at 8/27/20, 10:25 AM:
--

Thanks for working on this [~pbacsko].

I saw you had some trouble with Jenkins. I don't see what exactly causes the 
Jenkins failures, but we should make sure that after we merge this, they won't 
appear in other Jenkins results.
Regarding the patch I have two comments:

I am a little concerned about the checkstyle ignores:
{noformat}
@SuppressWarnings({"checkstyle:hideutilityclassconstructor", 
"checkstyle:linelength"})
{noformat}
- I don't see a problem with creating a private constructor for this class to 
prevent instantiation of this utility class.
- Also, this can be moved to a constant: 
{{"org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.placement.schema"}}
 and maybe we can add the checkstyle suppression annotation only to the 
constant, to be as restrictive as we can.
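
A rough sketch of what I mean, with a hypothetical class name ({{SchemaToolSketch}}) and the suppression annotation itself left out: the private constructor removes the need for the hideutilityclassconstructor ignore, and the long literal is confined to a single constant declaration.

```java
// Hypothetical utility class; only the package string comes from the patch.
final class SchemaToolSketch {
  // The long literal lives in one constant, so only this declaration would
  // need a line-length suppression.
  static final String SCHEMA_PACKAGE =
      "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.placement.schema";

  // Private constructor: utility classes are not meant to be instantiated.
  private SchemaToolSketch() {
    throw new AssertionError("no instances");
  }

  // Callers reference the constant instead of repeating the literal.
  static String targetPackage() {
    return SCHEMA_PACKAGE;
  }
}
```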

Other: 
- In {{hadoop-yarn-server-resourcemanager/pom.xml}} can we also use the 
${jsonschema2pojo.version} constant for the version if possible?

I'd also add that I checked the new Maven dependency and saw no CVEs 
associated with it, so I think it's fine to use it.


was (Author: adam.antal):
Thanks for working on this [~pbacsko].

I saw you had some trouble with Jenkins. I don't see what exactly causes the 
Jenkins failures, but we should make sure that after we merge this, they won't 
appear in other Jenkins results.

Regarding the patch I have two comments:

I am a little concerned about the checkstyle ignores:
{noformat}
@SuppressWarnings({"checkstyle:hideutilityclassconstructor", 
"checkstyle:linelength"})
{noformat}
- I don't see a problem with creating a private constructor for this class to 
prevent instantiation of this utility class.
- Also, this can be moved to a constant: {{ 
"org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.placement.schema",}}
 and maybe we can add the checkstyle suppression annotation only to the 
constant, to be as restrictive as we can.

Other: 
- In {{hadoop-yarn-server-resourcemanager/pom.xml}} can we also use the 
${jsonschema2pojo.version} constant for the version if possible?

I'd also add that I checked the new Maven dependency and saw no CVEs 
associated with it, so I think it's fine to use it.

> Create new JSON schema for Placement Rules
> --
>
> Key: YARN-10386
> URL: https://issues.apache.org/jira/browse/YARN-10386
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, capacityscheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: MappingRulesDescription_v1.json, YARN-10386-001.patch, 
> YARN-10386-002.patch, YARN-10386-003.patch, YARN-10386-004.patch, 
> YARN-10386-005.patch, YARN-10386-006.patch
>
>
> Tasks in this JIRA:
>  # Create new JSON schema
>  # Add Maven plugin which generates Java POJOs based on the schema
>  # Add helper class which essentially does the same as #2 (for dev purposes)






[jira] [Commented] (YARN-10386) Create new JSON schema for Placement Rules

2020-08-27 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185764#comment-17185764
 ] 

Adam Antal commented on YARN-10386:
---

Thanks for working on this [~pbacsko].

I saw you had some trouble with Jenkins. I don't see what exactly causes the 
Jenkins failures, but we should make sure that after we merge this, they won't 
appear in other Jenkins results.

Regarding the patch I have two comments:

I am a little concerned about the checkstyle ignores:
{noformat}
@SuppressWarnings({"checkstyle:hideutilityclassconstructor", 
"checkstyle:linelength"})
{noformat}
- I don't see a problem with creating a private constructor for this class to 
prevent instantiation of this utility class.
- Also, this can be moved to a constant: {{ 
"org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.placement.schema",}}
 and maybe we can add the checkstyle suppression annotation only to the 
constant, to be as restrictive as we can.

Other: 
- In {{hadoop-yarn-server-resourcemanager/pom.xml}} can we also use the 
${jsonschema2pojo.version} constant for the version if possible?

I'd also add that I checked the new Maven dependency and saw no CVEs 
associated with it, so I think it's fine to use it.

> Create new JSON schema for Placement Rules
> --
>
> Key: YARN-10386
> URL: https://issues.apache.org/jira/browse/YARN-10386
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, capacityscheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: MappingRulesDescription_v1.json, YARN-10386-001.patch, 
> YARN-10386-002.patch, YARN-10386-003.patch, YARN-10386-004.patch, 
> YARN-10386-005.patch, YARN-10386-006.patch
>
>
> Tasks in this JIRA:
>  # Create new JSON schema
>  # Add Maven plugin which generates Java POJOs based on the schema
>  # Add helper class which essentially does the same as #2 (for dev purposes)






[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state

2020-08-26 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185230#comment-17185230
 ] 

Adam Antal commented on YARN-4946:
--

This change was reverted by YARN-9848 in 3.3.0. Please revert this patch in 
your distribution or use a later release (3.3.0).

> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-4946.001.patch, YARN-4946.002.patch, 
> YARN-4946.003.patch, YARN-4946.004.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed 
> from its history) until the aggregation status has reached a terminal state 
> (e.g. SUCCEEDED, FAILED, TIME_OUT).






[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query

2020-08-25 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183947#comment-17183947
 ] 

Adam Antal commented on YARN-10304:
---

Thanks for the work [~gandras], committed to trunk. Thanks for the review 
[~mhudaky].

> Create an endpoint for remote application log directory path query
> --
>
> Key: YARN-10304
> URL: https://issues.apache.org/jira/browse/YARN-10304
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10304.001.patch, YARN-10304.002.patch, 
> YARN-10304.003.patch, YARN-10304.004.patch, YARN-10304.005.patch, 
> YARN-10304.006.patch, YARN-10304.007.patch, YARN-10304.008.patch, 
> YARN-10304.009.patch
>
>
> The logic of the aggregated log directory path determination (currently based 
> on configuration) is scattered around the codebase and duplicated multiple 
> times. By providing a separate class for creating the path for a specific 
> user, it allows for an abstraction over this logic. This could be used in 
> place of the previously duplicated logic, moreover, we could provide an 
> endpoint to query this path.






[jira] [Commented] (YARN-10106) Yarn logs CLI filtering by application attempt

2020-08-25 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183837#comment-17183837
 ] 

Adam Antal commented on YARN-10106:
---

Thanks for the patch [~mhudaky], committed to trunk. Appreciated the reviews 
[~bteke] and [~snemeth].

> Yarn logs CLI filtering by application attempt
> --
>
> Key: YARN-10106
> URL: https://issues.apache.org/jira/browse/YARN-10106
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Hudáky Márton Gyula
>Priority: Trivial
> Attachments: YARN-10106.001.patch, YARN-10106.002.patch, 
> YARN-10106.003.patch, YARN-10106.004.patch, YARN-10106.005.patch, 
> YARN-10106.006.patch, YARN-10106.007.patch, YARN-10106.008.patch, 
> YARN-10106.009.patch, YARN-10106.010.patch, YARN-10106.011.patch, 
> YARN-10106.012.patch, YARN-10106.013.patch
>
>
> {{ContainerLogsRequest}} got a new parameter in YARN-10101, which is the 
> {{applicationAttempt}} - we can use this new parameter in Yarn logs CLI as 
> well to filter by application attempt.






[jira] [Commented] (YARN-10106) Yarn logs CLI filtering by application attempt

2020-08-24 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183351#comment-17183351
 ] 

Adam Antal commented on YARN-10106:
---

Thanks for the latest patch [~mhudaky]!

If you fix the last 7 checkstyles, I can commit this tomorrow.

> Yarn logs CLI filtering by application attempt
> --
>
> Key: YARN-10106
> URL: https://issues.apache.org/jira/browse/YARN-10106
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Hudáky Márton Gyula
>Priority: Trivial
> Attachments: YARN-10106.001.patch, YARN-10106.002.patch, 
> YARN-10106.003.patch, YARN-10106.004.patch, YARN-10106.005.patch, 
> YARN-10106.006.patch, YARN-10106.007.patch, YARN-10106.008.patch, 
> YARN-10106.009.patch, YARN-10106.010.patch, YARN-10106.011.patch
>
>
> {{ContainerLogsRequest}} got a new parameter in YARN-10101, which is the 
> {{applicationAttempt}} - we can use this new parameter in Yarn logs CLI as 
> well to filter by application attempt.






[jira] [Updated] (YARN-10406) YARN log processor

2020-08-24 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10406:
--
Description: 
YARN currently does not have any utility that would enable cluster 
administrators to display previous actions in a Hadoop YARN cluster in an 
offline fashion. 

HDFS has the 
[OIV|https://hadoop.apache.org/docs/current3/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html]/
 
[OEV|https://hadoop.apache.org/docs/current3/hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html]
 tools, which do not require a running cluster to inspect and modify the 
filesystem. A corresponding tool would be very helpful in the context of YARN.

Since ATS is not widespread (it is not available for older clusters) and there 
isn't a single file or entity that would collect all the application/container 
etc. related information, we thought our best option was to parse and process 
the output of the YARN daemon log files and reconstruct the history of the 
cluster from that. We designed and implemented a CLI-based solution that, after 
parsing the log file, enables users to query app/container related information 
(listing, filtering by certain properties) and search for common errors like CE 
failures/error codes, AM preemption or stack traces. The tool can be integrated 
into the YARN project as a sub-project.


  was:
YARN currently does not have any utility that would enable cluster 
administrators to display previous actions in a Hadoop YARN cluster in an 
offline fashion. 

HDFS has the OIV/OEV which does not require a running cluster to look and 
modify the filesystem. A corresponding tool would be very helpful in the 
context of YARN.

Since ATS is not widespread (is not available for older clusters) and there 
isn't a single file or entity that would collect all the application/container 
etc. related information, we thought our best option to parse and process the 
output of the YARN daemon log files and reconstruct the history of the cluster 
from that. We designed and implemented a CLI based solution that after parsing 
the log file enables users to query app/container related information (listing, 
filtering by certain properties) and search for common errors like CE 
failures/error codes, AM preemption or stack traces. The tool can be integrated 
into the YARN project as a sub-project.



> YARN log processor
> --
>
> Key: YARN-10406
> URL: https://issues.apache.org/jira/browse/YARN-10406
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Hudáky Márton Gyula
>Priority: Critical
>
> YARN currently does not have any utility that would enable cluster 
> administrators to display previous actions in a Hadoop YARN cluster in an 
> offline fashion. 
> HDFS has the 
> [OIV|https://hadoop.apache.org/docs/current3/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html]/
>  
> [OEV|https://hadoop.apache.org/docs/current3/hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html]
>  tools, which do not require a running cluster to look at and modify the 
> filesystem. A corresponding tool would be very helpful in the context of YARN.
> Since ATS is not widespread (it is not available for older clusters) and there 
> is no single file or entity that collects all the information related to 
> applications, containers, etc., we thought our best option was to parse and 
> process the output of the YARN daemon log files and reconstruct the history of 
> the cluster from that. We designed and implemented a CLI-based solution that, 
> after parsing the log file, enables users to query app/container-related 
> information (listing, filtering by certain properties) and to search for 
> common errors such as CE failures/error codes, AM preemption, or stack traces. 
> The tool can be integrated into the YARN project as a sub-project.






[jira] [Updated] (YARN-10406) YARN log processor

2020-08-24 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10406:
--
Description: 
YARN currently does not have any utility that would enable cluster 
administrators to display previous actions in a Hadoop YARN cluster in an 
offline fashion. 

HDFS has the OIV/OEV tools, which do not require a running cluster to look at 
and modify the filesystem. A corresponding tool would be very helpful in the 
context of YARN.

Since ATS is not widespread (it is not available for older clusters) and there 
is no single file or entity that collects all the information related to 
applications, containers, etc., we thought our best option was to parse and 
process the output of the YARN daemon log files and reconstruct the history of 
the cluster from that. We designed and implemented a CLI-based solution that, 
after parsing the log file, enables users to query app/container-related 
information (listing, filtering by certain properties) and to search for common 
errors such as CE failures/error codes, AM preemption, or stack traces. The 
tool can be integrated into the YARN project as a sub-project.


  was:
YARN currently does not have any utility that would enable cluster 
administrators to re-play actions in a Hadoop YARN cluster in an offline 
fashion. 

HDFS has the OIV/OEV which does not require a running cluster to look and 
modify the filesystem. A corresponding tool would be very helpful in the 
context of YARN.

Since ATS is not widespread (is not available for older clusters) and there 
isn't a single file or entity that would collect all the application/container 
etc. related information, we thought our best option to parse and process the 
output of the YARN daemon log files and reconstruct the history of the cluster 
from that. We designed and implemented a CLI based solution that after parsing 
the log file enables users to query app/container related information (listing, 
filtering by certain properties) and search for common errors like CE 
failures/error codes, AM preemption or stack traces. The tool can be integrated 
into the YARN project as a sub-project.



> YARN log processor
> --
>
> Key: YARN-10406
> URL: https://issues.apache.org/jira/browse/YARN-10406
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Hudáky Márton Gyula
>Priority: Critical
>
> YARN currently does not have any utility that would enable cluster 
> administrators to display previous actions in a Hadoop YARN cluster in an 
> offline fashion. 
> HDFS has the OIV/OEV tools, which do not require a running cluster to look at 
> and modify the filesystem. A corresponding tool would be very helpful in the 
> context of YARN.
> Since ATS is not widespread (it is not available for older clusters) and there 
> is no single file or entity that collects all the information related to 
> applications, containers, etc., we thought our best option was to parse and 
> process the output of the YARN daemon log files and reconstruct the history of 
> the cluster from that. We designed and implemented a CLI-based solution that, 
> after parsing the log file, enables users to query app/container-related 
> information (listing, filtering by certain properties) and to search for 
> common errors such as CE failures/error codes, AM preemption, or stack traces. 
> The tool can be integrated into the YARN project as a sub-project.






[jira] [Created] (YARN-10406) YARN log processor

2020-08-24 Thread Adam Antal (Jira)
Adam Antal created YARN-10406:
-

 Summary: YARN log processor
 Key: YARN-10406
 URL: https://issues.apache.org/jira/browse/YARN-10406
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: yarn
Reporter: Adam Antal
Assignee: Hudáky Márton Gyula


YARN currently does not have any utility that would enable cluster 
administrators to re-play actions in a Hadoop YARN cluster in an offline 
fashion. 

HDFS has the OIV/OEV tools, which do not require a running cluster to look at 
and modify the filesystem. A corresponding tool would be very helpful in the 
context of YARN.

Since ATS is not widespread (it is not available for older clusters) and there 
is no single file or entity that collects all the information related to 
applications, containers, etc., we thought our best option was to parse and 
process the output of the YARN daemon log files and reconstruct the history of 
the cluster from that. We designed and implemented a CLI-based solution that, 
after parsing the log file, enables users to query app/container-related 
information (listing, filtering by certain properties) and to search for common 
errors such as CE failures/error codes, AM preemption, or stack traces. The 
tool can be integrated into the YARN project as a sub-project.
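
As a rough illustration of the parsing idea (not the proposed tool itself), the sketch below scans daemon log lines for container state transitions and groups them per container. The class name and the regular expression are assumptions; the pattern is modelled on NodeManager lines of the form "... Container container_... transitioned from RUNNING to EXITED_WITH_SUCCESS".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: reconstructs per-container state history from log lines.
final class ContainerHistorySketch {

  // Assumed line shape: "<timestamp> ... Container <id> transitioned from <A> to <B>".
  private static final Pattern TRANSITION = Pattern.compile(
      "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}) .*"
          + "Container (container_\\S+) transitioned from (\\w+) to (\\w+)");

  // Maps each container ID to its transitions in the order they were logged.
  static Map<String, List<String>> parse(List<String> lines) {
    Map<String, List<String>> history = new LinkedHashMap<>();
    for (String line : lines) {
      Matcher m = TRANSITION.matcher(line);
      if (m.find()) {
        history.computeIfAbsent(m.group(2), k -> new ArrayList<>())
            .add(m.group(1) + " " + m.group(3) + " -> " + m.group(4));
      }
    }
    return history;
  }
}
```

Queries such as listing or filtering containers could then run over the resulting map without a live cluster.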







[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-13 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176984#comment-17176984
 ] 

Adam Antal commented on YARN-10393:
---

Nice find, [~wzzdreamer], and a thorough explanation. Let me check the PR as 
well.

One question: what do you think is the reason that this bug did not appear 
after the upgrade to 2.9? I mean, the possibility of the stuck mapper is still 
in the protocol, so it could still have appeared.

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job got stuck in a 
> live lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn’t get any resources to continue, and the application 
> master failed to preempt the reducer, thus causing the job to be stuck. The 
> reason why the application master didn’t preempt the reducer was that there 
> was a leaked container in assigned mappers. The node manager failed to report 
> the completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was in fact handled by the resource 
> manager, but the node manager failed to receive the response. Let’s 
> assume the heartBeatResponseId=$hid in the node manager. According to our 
> current configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: ; destination host is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at 

[jira] [Commented] (YARN-4783) Log aggregation failure for application when Nodemanager is restarted

2020-08-10 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174123#comment-17174123
 ] 

Adam Antal commented on YARN-4783:
--

Thanks for the patch [~gandras].

I am not entirely convinced that this approach resolves the original problem. 
Since the RM cancels the token, renewing that token would fail. Can you test 
this patch on a cluster using the steps above?

> Log aggregation failure for application when Nodemanager is restarted 
> --
>
> Key: YARN-4783
> URL: https://issues.apache.org/jira/browse/YARN-4783
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Surendra Singh Lilhore
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-4783.001.patch, YARN-4783.002.patch, 
> YARN-4783.003.patch
>
>
> Scenario :
> =
> 1. Start NM with user dsperf:hadoop
> 2. Configure linux-execute user as dsperf
> 3. Submit application with yarn user
> 4. Wait until a few containers are allocated to NM 1
> 5. Stop NodeManager 1 (wait for expiry)
> 6. Start the NodeManager after the application has completed
> 7. Check whether log aggregation happens for the container logs in the NM 
> local directory
> Expected Output :
> ===
> Log aggregation should be successful
> Actual Output :
> ===
> Log aggregation is not successful






[jira] [Commented] (YARN-4783) Log aggregation failure for application when Nodemanager is restarted

2020-08-06 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172399#comment-17172399
 ] 

Adam Antal commented on YARN-4783:
--

Thanks for the patch [~gandras]. Somehow I could not open the compile report, 
but when building your patch locally I got the following errors:
{noformat}
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project hadoop-yarn-server-nodemanager: Compilation failure: Compilation failure:
[ERROR] /Users/adamantal/git/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java:[36,58] cannot find symbol
[ERROR]   symbol:   class NMDelegationTokenManager
[ERROR]   location: package org.apache.hadoop.yarn.server.nodemanager.security
[ERROR] /Users/adamantal/git/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java:[105,11] cannot find symbol
[ERROR]   symbol:   class NMDelegationTokenManager
[ERROR]   location: class org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl
[ERROR] /Users/adamantal/git/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java:[766,41] cannot find symbol
[ERROR]   symbol:   class NMDelegationTokenManager
[ERROR]   location: class org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl
[ERROR] /Users/adamantal/git/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java:[145,39] cannot find symbol
[ERROR]   symbol:   class NMDelegationTokenManager
[ERROR]   location: class org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl
{noformat}

Could you take care of it?

> Log aggregation failure for application when Nodemanager is restarted 
> --
>
> Key: YARN-4783
> URL: https://issues.apache.org/jira/browse/YARN-4783
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Surendra Singh Lilhore
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-4783.001.patch, YARN-4783.002.patch
>
>
> Scenario :
> =
> 1. Start NM with user dsperf:hadoop
> 2. Configure linux-execute user as dsperf
> 3. Submit application with yarn user
> 4. Wait until a few containers are allocated to NM 1
> 5. Stop NodeManager 1 (wait for expiry)
> 6. Start the NodeManager after the application has completed
> 7. Check whether log aggregation happens for the container logs in the NM 
> local directory
> Expected Output :
> ===
> Log aggregation should be successful
> Actual Output :
> ===
> Log aggregation is not successful






[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query

2020-08-03 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169945#comment-17169945
 ] 

Adam Antal commented on YARN-10304:
---

There's one last checkstyle issue. Could you please handle that [~gandras]?

> Create an endpoint for remote application log directory path query
> --
>
> Key: YARN-10304
> URL: https://issues.apache.org/jira/browse/YARN-10304
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10304.001.patch, YARN-10304.002.patch, 
> YARN-10304.003.patch, YARN-10304.004.patch, YARN-10304.005.patch, 
> YARN-10304.006.patch, YARN-10304.007.patch, YARN-10304.008.patch
>
>
> The logic of the aggregated log directory path determination (currently based 
> on configuration) is scattered around the codebase and duplicated multiple 
> times. By providing a separate class for creating the path for a specific 
> user, it allows for an abstraction over this logic. This could be used in 
> place of the previously duplicated logic, moreover, we could provide an 
> endpoint to query this path.
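
A minimal sketch of what such a separate class could look like follows. It is not taken from the patch: the class name, the constructor arguments and the <root>/<user>/<suffix>/<applicationId> layout are assumptions for illustration only. The two values would typically come from {{yarn.nodemanager.remote-app-log-dir}} and {{yarn.nodemanager.remote-app-log-dir-suffix}}.

```java
// Hypothetical sketch: one place that knows how the remote log path is built.
final class RemoteLogPathSketch {
  private final String remoteRootLogDir;
  private final String suffix;

  RemoteLogPathSketch(String remoteRootLogDir, String suffix) {
    // Strip trailing slashes so that joined paths never contain "//".
    this.remoteRootLogDir = remoteRootLogDir.replaceAll("/+$", "");
    this.suffix = suffix;
  }

  // Directory that holds the aggregated logs of one user (assumed layout).
  String userLogDir(String user) {
    return remoteRootLogDir + "/" + user + "/" + suffix;
  }

  // Directory that holds the aggregated logs of one application of that user.
  String appLogDir(String user, String applicationId) {
    return userLogDir(user) + "/" + applicationId;
  }
}
```

With the logic in one class, both the existing callers and a query endpoint could ask the same object for the path instead of rebuilding it from configuration.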






[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query

2020-07-31 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168815#comment-17168815
 ] 

Adam Antal commented on YARN-10304:
---

[~gandras] could you please rebase and re-upload the latest patch to get a 
green Jenkins run? I can commit after that if there are no objections.

> Create an endpoint for remote application log directory path query
> --
>
> Key: YARN-10304
> URL: https://issues.apache.org/jira/browse/YARN-10304
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10304.001.patch, YARN-10304.002.patch, 
> YARN-10304.003.patch, YARN-10304.004.patch, YARN-10304.005.patch, 
> YARN-10304.006.patch, YARN-10304.007.patch
>
>
> The logic of the aggregated log directory path determination (currently based 
> on configuration) is scattered around the codebase and duplicated multiple 
> times. By providing a separate class for creating the path for a specific 
> user, it allows for an abstraction over this logic. This could be used in 
> place of the previously duplicated logic, moreover, we could provide an 
> endpoint to query this path.






[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query

2020-07-23 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163600#comment-17163600
 ] 

Adam Antal commented on YARN-10304:
---

+1 from me. Thanks for working on this [~gandras].

> Create an endpoint for remote application log directory path query
> --
>
> Key: YARN-10304
> URL: https://issues.apache.org/jira/browse/YARN-10304
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10304.001.patch, YARN-10304.002.patch, 
> YARN-10304.003.patch, YARN-10304.004.patch, YARN-10304.005.patch, 
> YARN-10304.006.patch, YARN-10304.007.patch
>
>
> The logic of the aggregated log directory path determination (currently based 
> on configuration) is scattered around the codebase and duplicated multiple 
> times. By providing a separate class for creating the path for a specific 
> user, it allows for an abstraction over this logic. This could be used in 
> place of the previously duplicated logic, moreover, we could provide an 
> endpoint to query this path.






[jira] [Commented] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-07-23 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163446#comment-17163446
 ] 

Adam Antal commented on YARN-10332:
---

I'm sorry, I missed this too. Nice catch [~yehuanhuan]. +1 for this change.

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.






[jira] [Commented] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same

2020-07-23 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163423#comment-17163423
 ] 

Adam Antal commented on YARN-10315:
---

+1 from me on v2. Thanks for the patch [~Sushil-K-S].

> Avoid sending RMNodeResourceupdate event if resource is same
> 
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-10315.001.patch, YARN-10315.002.patch
>
>
> When the node is in DECOMMISSIONING state, the RMNodeResourceUpdateEvent is 
> sent for every heartbeat, which results in a scheduler resource update.
> Avoid sending the event when the resource is unchanged.
>  The scheduler node resource update iterates through all the queues, which 
> is costly.
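
The idea can be sketched as a simple equality guard in front of the dispatch. This is only an illustration under assumptions: the classes below are hypothetical stand-ins, not the real RMNodeImpl or Resource types, and the "dispatch" is just a list append.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: only dispatch an update when the resource changed.
final class ResourceUpdateSketch {

  // Stand-in for a node resource (memory and vcores only).
  static final class Resource {
    final long memoryMb;
    final int vcores;

    Resource(long memoryMb, int vcores) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Resource)) {
        return false;
      }
      Resource other = (Resource) o;
      return memoryMb == other.memoryMb && vcores == other.vcores;
    }

    @Override
    public int hashCode() {
      return 31 * Long.hashCode(memoryMb) + vcores;
    }
  }

  private Resource lastSent;
  private final List<Resource> dispatched = new ArrayList<>();

  // Called on every heartbeat; returns true only if an update was dispatched.
  boolean onHeartbeat(Resource current) {
    if (current.equals(lastSent)) {
      return false; // unchanged: skip the costly scheduler update
    }
    lastSent = current;
    dispatched.add(current); // stands in for sending the update event
    return true;
  }

  int dispatchedCount() {
    return dispatched.size();
  }
}
```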






[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-23 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163418#comment-17163418
 ] 

Adam Antal commented on YARN-10319:
---

Indeed, the test failure is not related. 
+1 from me, thanks for the effort [~prabhujoseph].

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch
>
>
> ActivitiesManager records a call flow for a given nodeId, or the last call 
> flow. This is useful when debugging an issue live, where the user queries 
> with the right nodeId. But capturing the last N scheduler activities during 
> the issue period can help to debug the issue offline.
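
One common way to keep "the last N" of anything is a bounded buffer that evicts the oldest entry. The sketch below shows that general technique only; the class name is hypothetical and it makes no claim about how the patch stores activities.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: keeps only the most recent N recorded items.
final class LastNActivitiesSketch<T> {
  private final int capacity;
  private final Deque<T> buffer = new ArrayDeque<>();

  LastNActivitiesSketch(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  // Adds an item, dropping the oldest one once the buffer is full.
  synchronized void record(T activity) {
    if (buffer.size() == capacity) {
      buffer.removeFirst();
    }
    buffer.addLast(activity);
  }

  // Returns a copy of the retained items, oldest first.
  synchronized List<T> snapshot() {
    return new ArrayList<>(buffer);
  }
}
```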






[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-21 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162016#comment-17162016
 ] 

Adam Antal commented on YARN-10319:
---

I checked the markdown and could not understand this sentence: "This may take 
time to return as it internally waits still it records specified 
activitiesCount."
I think "This may take time to return as it internally waits -still it- *until 
a certain amount of* records *are generated* specified *by* activitiesCount." 
would be clearer.

Besides that it's a +1 from me.

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch, YARN-10319-005.patch
>
>
> ActivitiesManager records a call flow for a given nodeId, or the last call 
> flow. This is useful when debugging an issue live, where the user queries 
> with the right nodeId. But capturing the last N scheduler activities during 
> the issue period can help to debug the issue offline.






[jira] [Commented] (YARN-10106) Yarn logs CLI filtering by application attempt

2020-07-20 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161306#comment-17161306
 ] 

Adam Antal commented on YARN-10106:
---

+1 from me.

> Yarn logs CLI filtering by application attempt
> --
>
> Key: YARN-10106
> URL: https://issues.apache.org/jira/browse/YARN-10106
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Hudáky Márton Gyula
>Priority: Trivial
> Attachments: YARN-10106.001.patch, YARN-10106.002.patch, 
> YARN-10106.003.patch, YARN-10106.004.patch, YARN-10106.005.patch, 
> YARN-10106.006.patch, YARN-10106.007.patch, YARN-10106.008.patch, 
> YARN-10106.009.patch, YARN-10106.010.patch
>
>
> {{ContainerLogsRequest}} got a new parameter in YARN-10101, which is the 
> {{applicationAttempt}} - we can use this new parameter in Yarn logs CLI as 
> well to filter by application attempt.






[jira] [Resolved] (YARN-10266) Setting debug delay to a too high number will cause NM fail to start

2020-07-01 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal resolved YARN-10266.
---
  Assignee: Adam Antal
Resolution: Won't Fix

> Setting debug delay to a too high number will cause NM fail to start
> 
>
> Key: YARN-10266
> URL: https://issues.apache.org/jira/browse/YARN-10266
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Trivial
>  Labels: newbie
>
> If I set some inappropriate number for 
> {{yarn.nodemanager.delete.debug-delay-sec}}, I'd rather have a functional NM 
> with some ERROR messages in the log stating that the debug delay has been 
> disabled due to an illegal argument than a failed NM.
> Stack trace:
> {noformat}
> java.lang.NumberFormatException: For input string: "999"
>   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Integer.parseInt(Integer.java:583)
>   at java.lang.Integer.parseInt(Integer.java:615)
>   at org.apache.hadoop.conf.Configuration.getInt(Configuration.java:1509)
>   at org.apache.hadoop.yarn.server.nodemanager.DeletionService.serviceInit(DeletionService.java:179)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:478)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:936)
>   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1016)
> {noformat}






[jira] [Commented] (YARN-10266) Setting debug delay to a too high number will cause NM fail to start

2020-07-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149311#comment-17149311
 ] 

Adam Antal commented on YARN-10266:
---

I agree [~BilwaST]. It makes no sense to handle this exception in this 
particular case. Since we're using Java-internal methods I don't think there's 
much to do. Closing this.

> Setting debug delay to a too high number will cause NM fail to start
> 
>
> Key: YARN-10266
> URL: https://issues.apache.org/jira/browse/YARN-10266
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Priority: Trivial
>  Labels: newbie
>
> If I set some inappropriate number for 
> {{yarn.nodemanager.delete.debug-delay-sec}}, I'd rather have a functional NM 
> with some ERROR messages in the log stating that it has been disabled due to 
> an illegal argument, than a failed NM.
> Stack trace:
> {noformat}
> java.lang.NumberFormatException: For input string: "999"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Integer.parseInt(Integer.java:583)
>   at java.lang.Integer.parseInt(Integer.java:615)
>   at org.apache.hadoop.conf.Configuration.getInt(Configuration.java:1509)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService.serviceInit(DeletionService.java:179)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:478)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:936)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1016)
> {noformat}
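
The behaviour requested in the description above (log an ERROR and fall back instead of failing the NM) could look roughly like the sketch below. This is only an illustration, not a patch and not how {{Configuration#getInt}} works: the class {{LenientIntParser}}, its method and the choice of fallback value are all hypothetical.

```java
import java.util.logging.Logger;

// Hypothetical helper, not part of Hadoop: parse an int setting, and on an
// illegal value log an error and fall back to a default instead of throwing.
final class LenientIntParser {
  private static final Logger LOG =
      Logger.getLogger(LenientIntParser.class.getName());

  private LenientIntParser() {
  }

  static int parseOrDefault(String key, String rawValue, int defaultValue) {
    if (rawValue == null) {
      return defaultValue;
    }
    try {
      return Integer.parseInt(rawValue.trim());
    } catch (NumberFormatException e) {
      // Values outside the int range also end up here.
      LOG.severe("Illegal value '" + rawValue + "' for " + key
          + "; falling back to " + defaultValue);
      return defaultValue;
    }
  }
}
```

With a fallback of 0 an out-of-range value would simply leave the debug delay disabled, at the cost of hiding a misconfiguration behind a log line.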






[jira] [Commented] (YARN-10106) Yarn logs CLI filtering by application attempt

2020-07-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149277#comment-17149277
 ] 

Adam Antal commented on YARN-10106:
---

Thanks for the patch [~mhudaky].

For backward compatibility reasons, when only the containerId is specified, 
let's not populate the appAttemptId - it was null before, and adding it to the 
{{ContainerLogsRequest}} makes no difference.

The tests look good to me in general; could you please fix them?

> Yarn logs CLI filtering by application attempt
> --
>
> Key: YARN-10106
> URL: https://issues.apache.org/jira/browse/YARN-10106
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Hudáky Márton Gyula
>Priority: Trivial
> Attachments: YARN-10106.001.patch, YARN-10106.002.patch, 
> YARN-10106.003.patch
>
>
> {{ContainerLogsRequest}} got a new parameter in YARN-10101, which is the 
> {{applicationAttempt}} - we can use this new parameter in Yarn logs CLI as 
> well to filter by application attempt.






[jira] [Commented] (YARN-10334) TestDistributedShell leaks resources on timeout/failure

2020-07-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149259#comment-17149259
 ] 

Adam Antal commented on YARN-10334:
---

Nice finding [~ahussein]. It could potentially cause lots of intermittent 
issues in Hadoop's unit test runs.

I think revisiting this test may not be that easy, but I hope someone can 
afford some time to look at it.
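
As a rough idea of the kind of cleanup the issue below asks for: every client or application started by a test could be registered with a tracker that closes everything in a try-with-resources (or an after-test hook), so cleanup also runs when the test body throws. The {{TestResourceTracker}} class is hypothetical and is not taken from {{TestDistributedShell}}; it is a sketch assuming the resources involved implement {{AutoCloseable}}.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: remembers every resource a test creates and closes
// all of them, even when the test body fails with an exception.
final class TestResourceTracker implements AutoCloseable {
  private final Deque<AutoCloseable> resources = new ArrayDeque<>();

  // Register a resource and hand it back, so creation stays a one-liner.
  <T extends AutoCloseable> T track(T resource) {
    resources.push(resource);
    return resource;
  }

  @Override
  public void close() {
    // Close in reverse creation order; one failing close() must not
    // prevent the remaining resources from being released.
    while (!resources.isEmpty()) {
      try {
        resources.pop().close();
      } catch (Exception e) {
        System.err.println("Failed to close resource: " + e);
      }
    }
  }
}
```

This does not help by itself when the test runner interrupts a timed-out test thread, so timeouts would still need separate attention.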

> TestDistributedShell leaks resources on timeout/failure
> ---
>
> Key: YARN-10334
> URL: https://issues.apache.org/jira/browse/YARN-10334
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell, test, yarn
>Reporter: Ahmed Hussein
>Priority: Major
>  Labels: newbie, test
>
> {{TestDistributedShell}} times out on trunk. I found that the application 
> and containers stay running in the background long after the unit test 
> has failed.
> This causes other test cases to fail and produces several false-positive 
> failures as a result of:
> * Ports stay busy, so other test cases fail to launch.
> * Unit tests fail because of memory restrictions.
> Although the unit test is already broken on trunk, we do not want its 
> failures to affect other unit tests.
> {{TestDistributedShell}} needs to be revisited to make sure that all 
> {{YarnClients}} and {{YarnApplications}} are closed properly at the end of 
> each unit test (including on exceptions and timeouts).
> Steps to reproduce:
> {code:bash}
> mvn test -Dtest=TestDistributedShell#testDSShellWithOpportunisticContainers
> ## this will timeout as
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 90.234 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell
> [ERROR] 
> testDSShellWithOpportunisticContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
>   Time elapsed: 90.018 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 9 
> milliseconds
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.Client.monitorApplication(Client.java:1117)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:1089)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers(TestDistributedShell.java:1438)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.lang.Thread.run(Thread.java:748)
> [INFO] 
> [INFO] Results:
> [INFO] 
> [ERROR] Errors: 
> [ERROR]   TestDistributedShell.testDSShellWithOpportunisticContainers:1438 » 
> TestTimedOut
> [INFO] 
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
> {code}
> Using the {{ps}} command, you can see that the yarn processes are still 
> running in the background
> {code:bash}
> /bin/bash -c $JRE_HOME/bin/java -Xmx512m 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster 
> --container_type OPPORTUNISTIC --container_memory 128 --container_vcores 1 
> --num_containers 2 --priority 0 --appname DistributedShell --homedir 
> file:/Users/ahussein 
> 1>$WORK_DIR8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/TestDistributedShell/TestDistributedShell-logDir-nm-0_0/application_1593554710896_0001/container_1593554710896_0001_01_01/AppMaster.stdout
>  
> 

[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149253#comment-17149253
 ] 

Adam Antal commented on YARN-10319:
---

Thanks for the patch [~prabhujoseph]. I have some minor nits if you don't mind.

- In {{RMWebServices}} I took a look at how the scheduler and pre-checks are 
performed for the existing {{#getActivities}} function, and it seems to me that 
there are some duplicates. I think the checks (choosing the scheduler, getting 
the {{ActivitiesManager}}) can be moved to a separate function that can be used 
by both endpoints. It would also be nice if we could provide the same error 
when CS is not used (I'm especially thinking about the 
[L711|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java#L711]
 part).
- Let's make {{RESTClient}} private in {{ActivitiesTestUtils}}.
- Could you please also add an example output for {{/bulk-activities}} to 
{{ResourceManager.md}}?
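
To illustrate the first point, the shared pre-check could be factored out roughly like this. All types and the error text are hypothetical stand-ins and not the real {{RMWebServices}} code (which may report the problem differently, e.g. in the response body); the sketch only shows both endpoints going through one helper and therefore failing in the same way.

```java
// Hypothetical stand-ins for the real scheduler types; only the shape matters.
interface SchedulerLike {
}

final class ActivitiesManagerLike {
}

final class CapacitySchedulerLike implements SchedulerLike {
  private final ActivitiesManagerLike activitiesManager =
      new ActivitiesManagerLike();

  ActivitiesManagerLike getActivitiesManager() {
    return activitiesManager;
  }
}

final class OtherSchedulerLike implements SchedulerLike {
}

final class ActivitiesEndpointsSketch {
  // Placeholder text, not the message used in RMWebServices.
  static final String UNSUPPORTED =
      "Activities are only available with CapacityScheduler";

  private final SchedulerLike scheduler;

  ActivitiesEndpointsSketch(SchedulerLike scheduler) {
    this.scheduler = scheduler;
  }

  // Single place for the scheduler check and the ActivitiesManager lookup.
  private ActivitiesManagerLike requireActivitiesManager() {
    if (!(scheduler instanceof CapacitySchedulerLike)) {
      throw new UnsupportedOperationException(UNSUPPORTED);
    }
    return ((CapacitySchedulerLike) scheduler).getActivitiesManager();
  }

  String getActivities() {
    requireActivitiesManager();
    return "activities";
  }

  String getBulkActivities() {
    requireActivitiesManager();
    return "bulk-activities";
  }
}
```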

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch
>
>
> ActivitiesManager records a call flow for a given nodeId or a last call flow. 
> This is useful when debugging the issue live where the user queries with 
> right nodeId. But capturing last N scheduler activities during the issue 
> period can help to debug the issue offline.






[jira] [Commented] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-07-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149233#comment-17149233
 ] 

Adam Antal commented on YARN-10332:
---

Moved this under YARN-914.

I agree with [~bibinchundatt]: removing that transition will cause an 
{{InvalidStateTransitionException}}, which should be avoided.

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.






[jira] [Commented] (YARN-10315) Avoid sending RMNodeResoureupdate event if resource is same

2020-07-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149232#comment-17149232
 ] 

Adam Antal commented on YARN-10315:
---

Moved this under YARN-914.

> Avoid sending RMNodeResoureupdate event if resource is same
> ---
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
>
> When the node is in DECOMMISSIONING state, the RMNodeResourceUpdateEvent is 
> sent for every heartbeat, which results in a scheduler resource update.
> Avoid sending it when the resource is the same.
>  The scheduler node resource update iterates through all the queues, 
> which is costly.
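
The idea in the description above amounts to remembering the last resource that was reported and skipping the event when nothing changed. A minimal sketch, assuming a value-type resource with proper {{equals}}; {{NodeResource}} and {{ResourceUpdateNotifier}} are hypothetical names and not the RM classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical value type standing in for a node's total resource.
final class NodeResource {
  final long memoryMb;
  final int vcores;

  NodeResource(long memoryMb, int vcores) {
    this.memoryMb = memoryMb;
    this.vcores = vcores;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof NodeResource)) {
      return false;
    }
    NodeResource other = (NodeResource) o;
    return memoryMb == other.memoryMb && vcores == other.vcores;
  }

  @Override
  public int hashCode() {
    return Objects.hash(memoryMb, vcores);
  }
}

// Hypothetical notifier: emits an update only when the resource changed
// since the last heartbeat.
final class ResourceUpdateNotifier {
  private NodeResource lastSent;
  private final List<NodeResource> sentEvents = new ArrayList<>();

  // Returns true if an update event was emitted for this heartbeat.
  boolean onHeartbeat(NodeResource current) {
    if (Objects.equals(current, lastSent)) {
      return false; // unchanged: skip the costly scheduler update
    }
    lastSent = current;
    sentEvents.add(current);
    return true;
  }

  int sentCount() {
    return sentEvents.size();
  }
}
```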






[jira] [Updated] (YARN-10315) Avoid sending RMNodeResoureupdate event if resource is same

2020-07-01 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10315:
--
Parent: YARN-914
Issue Type: Sub-task  (was: Improvement)

> Avoid sending RMNodeResoureupdate event if resource is same
> ---
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
>
> When the node is in DECOMMISSIONING state, the RMNodeResourceUpdateEvent is 
> sent for every heartbeat, which results in a scheduler resource update.
> Avoid sending it when the resource is the same.
>  The scheduler node resource update iterates through all the queues, 
> which is costly.






[jira] [Updated] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-07-01 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10332:
--
Parent: YARN-914
Issue Type: Sub-task  (was: Improvement)

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10279) Avoid unnecessary QueueMappingEntity creations

2020-06-23 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142690#comment-17142690
 ] 

Adam Antal commented on YARN-10279:
---

Thanks for the patch [~mhudaky]. The unit test failures seem unrelated, so I 
give a +1 to this patch.

> Avoid unnecessary QueueMappingEntity creations
> --
>
> Key: YARN-10279
> URL: https://issues.apache.org/jira/browse/YARN-10279
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Gergely Pollak
>Assignee: Hudáky Márton Gyula
>Priority: Minor
> Attachments: YARN-10279.001.patch, YARN-10279.003.patch, 
> YARN-10279.004.patch, YARN-10279.005.patch, YARN-10279.006.patch
>
>
> In CS UserGroupMappingPlacementRule and AppNameMappingPlacementRule classes 
> we create new instances of QueueMappingEntity class. In some cases we simply 
> copy the already received class, so we just duplicate it, which is 
> unnecessary since the class is immutable.
> This is just a minor improvement, probably doesn't have much impact, but 
> still puts some unnecessary load on GC.
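
A small sketch of the point made in the description above: when a class is immutable, handing out the instance that was received is as safe as copying it, and it saves an allocation. {{MappingEntitySketch}} is a hypothetical stand-in, not {{QueueMappingEntity}}.

```java
// Hypothetical immutable value class: all fields are final and there are
// no setters, so instances can be shared freely.
final class MappingEntitySketch {
  private final String source;
  private final String queue;

  MappingEntitySketch(String source, String queue) {
    this.source = source;
    this.queue = queue;
  }

  String getSource() {
    return source;
  }

  String getQueue() {
    return queue;
  }

  // Allocates a new object only when the queue actually differs;
  // otherwise returns this instance instead of a copy.
  MappingEntitySketch withQueue(String newQueue) {
    return queue.equals(newQueue)
        ? this
        : new MappingEntitySketch(source, newQueue);
  }
}
```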






[jira] [Commented] (YARN-9930) Support max running app logic for CapacityScheduler

2020-06-19 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140489#comment-17140489
 ] 

Adam Antal commented on YARN-9930:
--

Thanks for the effort on pushing this through [~pbacsko], +1

> Support max running app logic for CapacityScheduler
> ---
>
> Key: YARN-9930
> URL: https://issues.apache.org/jira/browse/YARN-9930
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, capacityscheduler
>Affects Versions: 3.1.0, 3.1.1
>Reporter: zhoukang
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9930-001.patch, YARN-9930-002.patch, 
> YARN-9930-003.patch, YARN-9930-004.patch, YARN-9930-005.patch, 
> YARN-9930-006.patch, YARN-9930-POC01.patch, YARN-9930-POC02.patch, 
> YARN-9930-POC03.patch, YARN-9930-POC04.patch, YARN-9930-POC05.patch, 
> screenshot-1.png
>
>
> In FairScheduler, there is a limit on the number of running applications, 
> which keeps further applications pending.
> But CapacityScheduler has no feature like max running apps. It only has 
> max apps, and jobs are rejected directly on the client.
> In this jira I want to implement this semantic for CapacityScheduler.






[jira] [Assigned] (YARN-10281) Redundant QueuePath usage in UserGroupMappingPlacementRule and AppNameMappingPlacementRule

2020-06-17 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal reassigned YARN-10281:
-

Assignee: Adam Antal  (was: Gergely Pollak)

> Redundant QueuePath usage in UserGroupMappingPlacementRule and 
> AppNameMappingPlacementRule
> --
>
> Key: YARN-10281
> URL: https://issues.apache.org/jira/browse/YARN-10281
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Gergely Pollak
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-10281.001.patch, YARN-10281.002.patch, 
> YARN-10281.003.patch, YARN-10281.004.patch, YARN-10281.branch-3.3.001.patch
>
>
> We use the QueuePath and QueueMapping (or QueueMappingEntity) objects in the 
> aforementioned classes. These technically store the same kind of 
> information, yet we keep converting between them. Let's examine whether we 
> can use only the QueueMapping(Entity) instead, since that holds more information.






[jira] [Assigned] (YARN-10281) Redundant QueuePath usage in UserGroupMappingPlacementRule and AppNameMappingPlacementRule

2020-06-17 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal reassigned YARN-10281:
-

Assignee: Gergely Pollak  (was: Adam Antal)

> Redundant QueuePath usage in UserGroupMappingPlacementRule and 
> AppNameMappingPlacementRule
> --
>
> Key: YARN-10281
> URL: https://issues.apache.org/jira/browse/YARN-10281
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-10281.001.patch, YARN-10281.002.patch, 
> YARN-10281.003.patch, YARN-10281.004.patch, YARN-10281.branch-3.3.001.patch
>
>
> We use the QueuePath and QueueMapping (or QueueMappingEntity) objects in the 
> aforementioned classes. These technically store the same kind of 
> information, yet we keep converting between them. Let's examine whether we 
> can use only the QueueMapping(Entity) instead, since that holds more information.






[jira] [Assigned] (YARN-9136) getNMResourceInfo NodeManager REST API method is not documented

2020-06-16 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal reassigned YARN-9136:


Assignee: Hudáky Márton Gyula  (was: Gergely Pollak)

> getNMResourceInfo NodeManager REST API method is not documented
> ---
>
> Key: YARN-9136
> URL: https://issues.apache.org/jira/browse/YARN-9136
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Hudáky Márton Gyula
>Priority: Major
>
> I cannot find documentation for the resources endpoint in NMWebServices: 
> /ws/v1/node/resources/\{resourcename\}
> I looked in the file NodeManagerRest.md for documentation but haven't found 
> any.
> This is supposedly unintentionally not documented: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRest.md






[jira] [Commented] (YARN-9136) getNMResourceInfo NodeManager REST API method is not documented

2020-06-16 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136622#comment-17136622
 ] 

Adam Antal commented on YARN-9136:
--

I hope you don't mind if I take this over [~shuzirra].

> getNMResourceInfo NodeManager REST API method is not documented
> ---
>
> Key: YARN-9136
> URL: https://issues.apache.org/jira/browse/YARN-9136
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Hudáky Márton Gyula
>Priority: Major
>
> I cannot find documentation for the resources endpoint in NMWebServices: 
> /ws/v1/node/resources/\{resourcename\}
> I looked in the file NodeManagerRest.md for documentation but haven't found 
> any.
> This is supposedly unintentionally not documented: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRest.md






[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query

2020-06-16 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136542#comment-17136542
 ] 

Adam Antal commented on YARN-10304:
---

Let me start with the bad news. I am very sorry to bring this up only 
in v5, but /remote-app-log-dir/user/suffix/ is not enough to find apps using 
the new bucketed path. For users this endpoint is valuable only if it tells 
them exactly where the aggregated logs for the application are. Therefore we 
need another query parameter that specifies the application id, so that the 
{{LogServlet}} can construct the whole path (creating the bucket id and 
appending the app id as well). If no app id is provided, the behaviour 
should be the same as now. This will probably also need another UT :( 

Regarding the existing patch:
- Regarding the unit tests:
  - This seems to be wrong:
{code:java}
  String path = String.format("%s/%s/bucket-%s-%s",
  YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR, remoteUser,
  testSuffix, entry.getFileController().toLowerCase());
{code}
  What is that "bucket-" prefix? That should not be there. Also I don't 
understand why the test is passing. The controller's suffix is initialized in 
{{LogAggregationFileController#extractRemoteRootLogDirSuffix}} and you can 
check that there is no "bucket-" involved in this. Could you investigate this?
  - Since the TFile is always added to the bottom for backward compatibility 
purposes, I recommend defining other controllers. Also we probably need an 
exact match for the controllers; {{.contains}} is not enough.
  - Could you also write another test case without the user queryparam, but 
setting the login user? You can use {{UserGroupInformation#setLoginUser}}. This 
would make sure that we find the right user when the request is processed.
- The new Configuration object in {{WebServletModule}} is a bit of an overkill. 
Maybe the easiest solution is to use {{YarnConfiguration}}'s 
{{YarnConfiguration(Configuration)}} constructor to clone the {{conf}} object 
in {{testRemoteLogDir}} - thus you don't need to bother restoring its state. 
That would be just enough for us.
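
To show why an exact match is stricter than {{.contains}}, here is a small stand-alone sketch. It assumes the current assertion is a substring check on a string; the controller names and the {{ControllerMatchSketch}} class are made up for the example and are not taken from the patch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical comparison of a substring check with an exact match.
final class ControllerMatchSketch {
  private ControllerMatchSketch() {
  }

  // Passes whenever the expected name appears anywhere in the text.
  static boolean substringMatch(String responseText, String expected) {
    return responseText.contains(expected);
  }

  // Passes only if one of the reported controller names equals the
  // expected name exactly.
  static boolean exactMatch(List<String> reported, String expected) {
    return reported.contains(expected);
  }

  static List<String> exampleControllers() {
    // Made-up controller names; "TFile" itself is not in the list.
    return Arrays.asList("CustomTFile", "IndexedFile");
  }
}
```

With the made-up list above, a substring check for "TFile" passes because of "CustomTFile", while the exact match correctly reports that no controller with that name is present.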

> Create an endpoint for remote application log directory path query
> --
>
> Key: YARN-10304
> URL: https://issues.apache.org/jira/browse/YARN-10304
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10304.001.patch, YARN-10304.002.patch, 
> YARN-10304.003.patch, YARN-10304.004.patch, YARN-10304.005.patch
>
>
> The logic of the aggregated log directory path determination (currently based 
> on configuration) is scattered around the codebase and duplicated multiple 
> times. By providing a separate class for creating the path for a specific 
> user, it allows for an abstraction over this logic. This could be used in 
> place of the previously duplicated logic, moreover, we could provide an 
> endpoint to query this path.






[jira] [Commented] (YARN-9930) Support max running app logic for CapacityScheduler

2020-06-16 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136503#comment-17136503
 ] 

Adam Antal commented on YARN-9930:
--

I was trying to write a meaningful review, but got stuck on a few questions. 
Apologies if they are silly.

I am a little nervous about this case:
bq. Limit max-parallel-apps to 4, submit 4 apps, then refresh it to 2. Result: 
running apps were still running, but new apps stayed in Accepted state. From 
that point on, only 2 apps were allowed to run at the same time.
So AFAIU it is absolutely normal that some queue is above its limit if the 
configuration has been changed. Doesn't this need some special attention in 
your algorithm when you recursively update the parents to search for queues 
where new apps could be submitted?

I compared your implementation with the max apps one, and it's a bit different. 
You use a separate {{CSMaxRunningAppsEnforcer}} instance in the scheduler, which 
is optimized for guessing which queues to check for limits that now allow more 
apps to run. The existing implementation for max apps (which considers both 
running and pending ones) calls 
{{OrderingPolicy#getNumSchedulableEntities()}} and compares it to the limit 
inside {{LeafQueue}}. From the algorithm you described above I assume that your 
solution is more efficient, but it seems to me that calling these methods of 
{{OrderingPolicy}} in {{LeafQueue#validateSubmitApplication}} already does 
similar things, but from the queue's perspective, while your solution is 
fundamentally implemented inside the scheduler. I'd prefer your solution as it's 
clearer, but since we already have the existing logic, the question arises: 
why do we need a separate enforcer object? Couldn't it be implemented 
similarly? Or am I missing something here?

Nit:
- {{abstract int getNumRunnableApps();}} would be better put into the 
{{CSQueue}} interface instead of {{AbstractCSQueue}} abstract class.
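
To make the refresh case concrete, here is a toy model of a queue hierarchy with a per-queue limit on running apps. It is not the algorithm of {{CSMaxRunningAppsEnforcer}}; {{QueueNodeSketch}} and its methods are hypothetical. The only point is that comparing with "running >= limit" keeps a queue closed while it is above a lowered limit, and reopens it once enough apps have finished.

```java
// Hypothetical toy model of a queue with a max-parallel-apps limit that is
// enforced along the whole path up to the root.
final class QueueNodeSketch {
  private final QueueNodeSketch parent;
  private int maxParallelApps;
  private int runningApps;

  QueueNodeSketch(QueueNodeSketch parent, int maxParallelApps) {
    this.parent = parent;
    this.maxParallelApps = maxParallelApps;
  }

  // Models a configuration refresh; running apps are left untouched.
  void setMaxParallelApps(int maxParallelApps) {
    this.maxParallelApps = maxParallelApps;
  }

  // A new app may run only if this queue and every ancestor are below
  // their limits; a queue above its limit simply counts as full.
  boolean canRunApp() {
    for (QueueNodeSketch q = this; q != null; q = q.parent) {
      if (q.runningApps >= q.maxParallelApps) {
        return false;
      }
    }
    return true;
  }

  void startApp() {
    for (QueueNodeSketch q = this; q != null; q = q.parent) {
      q.runningApps++;
    }
  }

  void finishApp() {
    for (QueueNodeSketch q = this; q != null; q = q.parent) {
      q.runningApps--;
    }
  }

  int getRunningApps() {
    return runningApps;
  }
}
```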

> Support max running app logic for CapacityScheduler
> ---
>
> Key: YARN-9930
> URL: https://issues.apache.org/jira/browse/YARN-9930
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, capacityscheduler
>Affects Versions: 3.1.0, 3.1.1
>Reporter: zhoukang
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9930-001.patch, YARN-9930-002.patch, 
> YARN-9930-003.patch, YARN-9930-004.patch, YARN-9930-POC01.patch, 
> YARN-9930-POC02.patch, YARN-9930-POC03.patch, YARN-9930-POC04.patch, 
> YARN-9930-POC05.patch, screenshot-1.png
>
>
> In FairScheduler, there has limitation for max running which will let 
> application pending.
> But in CapacityScheduler there has no feature like max running app.Only got 
> max app,and jobs will be rejected directly on client.
> This jira i want to implement this semantic for CapacityScheduler.





