[jira] [Assigned] (YARN-11567) Aggregate container launch debug artifacts automatically in case of error

2023-09-11 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik reassigned YARN-11567:
---

Assignee: Bence Kosztolnik

> Aggregate container launch debug artifacts automatically in case of error
> -
>
> Key: YARN-11567
> URL: https://issues.apache.org/jira/browse/YARN-11567
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Minor
>
> In cases where a container fails to launch without writing to a log file, we 
> would often want to see the artifacts captured by 
> {{yarn.nodemanager.log-container-debug-info.enabled}} in order to better 
> understand the cause of the exit code. Enabling this feature for every 
> container may be overkill, so we need a feature flag to capture these 
> artifacts only in case of errors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11567) Aggregate container launch debug artifacts automatically in case of error

2023-09-11 Thread Bence Kosztolnik (Jira)
Bence Kosztolnik created YARN-11567:
---

 Summary: Aggregate container launch debug artifacts automatically 
in case of error
 Key: YARN-11567
 URL: https://issues.apache.org/jira/browse/YARN-11567
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Bence Kosztolnik


In cases where a container fails to launch without writing to a log file, we 
would often want to see the artifacts captured by 
{{yarn.nodemanager.log-container-debug-info.enabled}} in order to better 
understand the cause of the exit code. Enabling this feature for every 
container may be overkill, so we need a feature flag to capture these artifacts 
only in case of errors.
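A minimal sketch of how such a flag could be consulted, assuming a hypothetical 
{{yarn.nodemanager.log-container-debug-info.on-error}} key next to the existing 
switch (only {{yarn.nodemanager.log-container-debug-info.enabled}} exists today; 
the class name and fallback defaults below are illustrative as well):
{code:java}
import org.apache.hadoop.conf.Configuration;

public final class DebugInfoPolicy {
  // Decide whether to keep the launch debug artifacts of a finished container.
  static boolean shouldCaptureDebugInfo(Configuration conf, int exitCode) {
    // Existing switch: capture the artifacts for every container.
    boolean always = conf.getBoolean(
        "yarn.nodemanager.log-container-debug-info.enabled", false);
    // Hypothetical new flag: capture the artifacts only for failed launches.
    boolean onErrorOnly = conf.getBoolean(
        "yarn.nodemanager.log-container-debug-info.on-error", false);
    return always || (onErrorOnly && exitCode != 0);
  }
}
{code}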






[jira] [Commented] (YARN-11063) Support auto queue creation template wildcards for arbitrary queue depths

2022-07-25 Thread Bence Kosztolnik (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570813#comment-17570813
 ] 

Bence Kosztolnik commented on YARN-11063:
-

I had a sync with [~gandras], and we discussed that I will fix this issue as a 
basic ramp-up task.

> Support auto queue creation template wildcards for arbitrary queue depths
> -
>
> Key: YARN-11063
> URL: https://issues.apache.org/jira/browse/YARN-11063
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With the introduction of YARN-10632, we need to support more than one 
> wildcard in queue templates.






[jira] [Comment Edited] (YARN-11063) Support auto queue creation template wildcards for arbitrary queue depths

2022-07-25 Thread Bence Kosztolnik (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570813#comment-17570813
 ] 

Bence Kosztolnik edited comment on YARN-11063 at 7/25/22 9:58 AM:
--

I had a sync with [~gandras], and we discussed that I will fix this issue as a 
basic ramp-up task.


was (Author: JIRAUSER292672):
I had a sync with [~gandras] , and we discussed I will fix this issue as basic 
ramp-up task

> Support auto queue creation template wildcards for arbitrary queue depths
> -
>
> Key: YARN-11063
> URL: https://issues.apache.org/jira/browse/YARN-11063
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With the introduction of YARN-10632, we need to support more than one 
> wildcard in queue templates.






[jira] [Assigned] (YARN-11063) Support auto queue creation template wildcards for arbitrary queue depths

2022-07-25 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik reassigned YARN-11063:
---

Assignee: Bence Kosztolnik  (was: Andras Gyori)

> Support auto queue creation template wildcards for arbitrary queue depths
> -
>
> Key: YARN-11063
> URL: https://issues.apache.org/jira/browse/YARN-11063
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Andras Gyori
>Assignee: Bence Kosztolnik
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With the introduction of YARN-10632, we need to support more than one 
> wildcard in queue templates.






[jira] [Created] (YARN-11356) Upgrade DataTables to 1.11.5 to fix CVEs

2022-10-20 Thread Bence Kosztolnik (Jira)
Bence Kosztolnik created YARN-11356:
---

 Summary: Upgrade DataTables to 1.11.5 to fix CVEs
 Key: YARN-11356
 URL: https://issues.apache.org/jira/browse/YARN-11356
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.3.4
Reporter: Bence Kosztolnik


This ticket is intended to fix the following CVEs in the *DataTables.net* lib.

*CVE-2020-28458 (HIGH severity)* - All versions of package datatables.net are 
vulnerable to Prototype Pollution due to an incomplete fix for 
[https://snyk.io/vuln/SNYK-JS-DATATABLESNET-598806].

https://nvd.nist.gov/vuln/detail/CVE-2020-28458

*CVE-2021-23445 (MEDIUM severity)* - This affects the package datatables.net 
before 1.11.3. If an array is passed to the HTML escape entities function it 
would not have its contents escaped.

https://nvd.nist.gov/vuln/detail/CVE-2021-23445






[jira] [Assigned] (YARN-11356) Upgrade DataTables to 1.11.5 to fix CVEs

2022-10-20 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik reassigned YARN-11356:
---

Assignee: Bence Kosztolnik

> Upgrade DataTables to 1.11.5 to fix CVEs
> 
>
> Key: YARN-11356
> URL: https://issues.apache.org/jira/browse/YARN-11356
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.4
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Major
>
> This ticket is intended to fix the following CVEs in the *DataTables.net* lib.
> *CVE-2020-28458 (HIGH severity)* - All versions of package datatables.net are 
> vulnerable to Prototype Pollution due to an incomplete fix for 
> [https://snyk.io/vuln/SNYK-JS-DATATABLESNET-598806].
> https://nvd.nist.gov/vuln/detail/CVE-2020-28458
> *CVE-2021-23445 (MEDIUM severity)* - This affects the package datatables.net 
> before 1.11.3. If an array is passed to the HTML escape entities function it 
> would not have its contents escaped.
> https://nvd.nist.gov/vuln/detail/CVE-2021-23445






[jira] [Updated] (YARN-11356) Upgrade DataTables to 1.11.5 to fix CVEs

2022-10-20 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11356:

Description: 
This ticket is intended to fix the following CVEs in the *DataTables.net* lib, 
by upgrading the lib to 1.11.5 

*CVE-2020-28458 (HIGH severity)* - All versions of package datatables.net are 
vulnerable to Prototype Pollution due to an incomplete fix for 
[https://snyk.io/vuln/SNYK-JS-DATATABLESNET-598806].

[https://nvd.nist.gov/vuln/detail/CVE-2020-28458]

*CVE-2021-23445 (MEDIUM severity)* - This affects the package datatables.net 
before 1.11.3. If an array is passed to the HTML escape entities function it 
would not have its contents escaped.

[https://nvd.nist.gov/vuln/detail/CVE-2021-23445]

  was:
This ticket is intended to fix the following CVEs in the *DataTables.net* lib.

*CVE-2020-28458 (HIGH severity)* - All versions of package datatables.net are 
vulnerable to Prototype Pollution due to an incomplete fix for 
[https://snyk.io/vuln/SNYK-JS-DATATABLESNET-598806].

https://nvd.nist.gov/vuln/detail/CVE-2020-28458

*CVE-2021-23445 (MEDIUM severity)* - This affects the package datatables.net 
before 1.11.3. If an array is passed to the HTML escape entities function it 
would not have its contents escaped.

https://nvd.nist.gov/vuln/detail/CVE-2021-23445


> Upgrade DataTables to 1.11.5 to fix CVEs
> 
>
> Key: YARN-11356
> URL: https://issues.apache.org/jira/browse/YARN-11356
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.4
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Major
>
> This ticket is intended to fix the following CVEs in the *DataTables.net* 
> lib, by upgrading the lib to 1.11.5 
> *CVE-2020-28458 (HIGH severity)* - All versions of package datatables.net are 
> vulnerable to Prototype Pollution due to an incomplete fix for 
> [https://snyk.io/vuln/SNYK-JS-DATATABLESNET-598806].
> [https://nvd.nist.gov/vuln/detail/CVE-2020-28458]
> *CVE-2021-23445 (MEDIUM severity)* - This affects the package datatables.net 
> before 1.11.3. If an array is passed to the HTML escape entities function it 
> would not have its contents escaped.
> [https://nvd.nist.gov/vuln/detail/CVE-2021-23445]






[jira] [Resolved] (YARN-11344) Double checked locking in Configuration

2022-10-12 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik resolved YARN-11344.
-
Resolution: Abandoned

> Double checked locking in Configuration
> ---
>
> Key: YARN-11344
> URL: https://issues.apache.org/jira/browse/YARN-11344
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Bence Kosztolnik
>Priority: Minor
>
> Currently the 
> {code:java}
> org.apache.hadoop.conf.Configuration{code}
> class uses synchronized methods in many cases where double-checked locking 
> would be enough, for example in the case of *getProps()* and {*}getOverlay(){*}.
> The class should be refactored to remove the unnecessary locking points. 






[jira] [Updated] (YARN-11344) Double checked locking in Configuration

2022-10-12 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11344:

Summary: Double checked locking in Configuration  (was: Double check 
locking in Configuration)

> Double checked locking in Configuration
> ---
>
> Key: YARN-11344
> URL: https://issues.apache.org/jira/browse/YARN-11344
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Bence Kosztolnik
>Priority: Minor
>
> Currently the 
> {code:java}
> org.apache.hadoop.conf.Configuration{code}
> class uses synchronized methods in many cases where double-checked locking 
> would be enough, for example in the case of *getProps()* and {*}getOverlay(){*}.
> The class should be refactored to remove the unnecessary locking points. 






[jira] [Created] (YARN-11344) Double check locking in Configuration

2022-10-12 Thread Bence Kosztolnik (Jira)
Bence Kosztolnik created YARN-11344:
---

 Summary: Double check locking in Configuration
 Key: YARN-11344
 URL: https://issues.apache.org/jira/browse/YARN-11344
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Bence Kosztolnik


Currently the 
{code:java}
org.apache.hadoop.conf.Configuration{code}
class uses synchronized methods in many cases where double-checked locking would 
be enough, for example in the case of *getProps()* and {*}getOverlay(){*}.
The class should be refactored to remove the unnecessary locking points. 
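A minimal sketch of the double-checked locking pattern being suggested, shown on 
a simplified stand-in class rather than the real Configuration internals (the 
class and field names below are illustrative):
{code:java}
import java.util.Properties;

public class LazyProps {
  // volatile makes the fully constructed object visible to all threads.
  private volatile Properties props;

  Properties getProps() {
    Properties result = props;
    if (result == null) {           // first check: no lock on the hot path
      synchronized (this) {
        result = props;
        if (result == null) {       // second check: now under the lock
          result = new Properties();
          props = result;
        }
      }
    }
    return result;
  }
}
{code}
Reads that find the field already initialized never enter the synchronized 
block, which is the saving compared to a fully synchronized getter.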






[jira] [Updated] (YARN-11348) Deprecated property can not be unset

2022-10-13 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11348:

Summary: Deprecated property can not be unset  (was: Depricated property 
can not be unset)

> Deprecated property can not be unset
> 
>
> Key: YARN-11348
> URL: https://issues.apache.org/jira/browse/YARN-11348
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bence Kosztolnik
>Priority: Major
>
> If you try to unset a deprecated property in a
> {code:java}
> CapacitySchedulerConfiguration{code}
> object, the value won't be removed.
> Example failing Test for the TestCapacitySchedulerConfiguration class
> {noformat}
> @Test
> public void testDeprecationFeatureWorks() {
>   final String value = "VALUE";
>   final String goodName = "koko";
>   final String depName = "dfs.nfs.exports.allowed.hosts";
>   final CapacitySchedulerConfiguration csConf = createDefaultCsConf();
>   csConf.set(goodName, value);
>   csConf.unset(goodName);
>   assertNull(csConf.get(goodName));
>   csConf.set(depName, value);
>   csConf.unset(depName);
>   assertNull(csConf.get(depName));  // fails here
> }{noformat}






[jira] [Updated] (YARN-11348) Deprecated property can not be unset

2022-10-13 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11348:

Description: 
If you try to unset a deprecated property in a
{code:java}
CapacitySchedulerConfiguration{code}
object, the value won't be removed.

Example failing Test for the *TestCapacitySchedulerConfiguration* class
{noformat}
@Test
public void testDeprecationFeatureWorks() {
  final String value = "VALUE";
  final String goodName = "koko";
  final String depName = "dfs.nfs.exports.allowed.hosts";
  final CapacitySchedulerConfiguration csConf = createDefaultCsConf();

  csConf.set(goodName, value);
  csConf.unset(goodName);
  assertNull(csConf.get(goodName));

  csConf.set(depName, value);
  csConf.unset(depName);
  assertNull(csConf.get(depName));  // fails here
}{noformat}

  was:
If you try to unset a deprecated property in an
{code:java}
CapacitySchedulerConfiguration{code}
object the value wont be removed.

Example failing Test for the TestCapacitySchedulerConfiguration class


{noformat}
@Test
public void testDeprecationFeatureWorks() {
  final String value = "VALUE";
  final String goodName = "koko";
  final String depName = "dfs.nfs.exports.allowed.hosts";
  final CapacitySchedulerConfiguration csConf = createDefaultCsConf();

  csConf.set(goodName, value);
  csConf.unset(goodName);
  assertNull(csConf.get(goodName));

  csConf.set(depName, value);
  csConf.unset(depName);
  assertNull(csConf.get(depName));  // fails here
}{noformat}


> Deprecated property can not be unset
> 
>
> Key: YARN-11348
> URL: https://issues.apache.org/jira/browse/YARN-11348
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bence Kosztolnik
>Priority: Major
>
> If you try to unset a deprecated property in a
> {code:java}
> CapacitySchedulerConfiguration{code}
> object, the value won't be removed.
> Example failing Test for the *TestCapacitySchedulerConfiguration* class
> {noformat}
> @Test
> public void testDeprecationFeatureWorks() {
>   final String value = "VALUE";
>   final String goodName = "koko";
>   final String depName = "dfs.nfs.exports.allowed.hosts";
>   final CapacitySchedulerConfiguration csConf = createDefaultCsConf();
>   csConf.set(goodName, value);
>   csConf.unset(goodName);
>   assertNull(csConf.get(goodName));
>   csConf.set(depName, value);
>   csConf.unset(depName);
>   assertNull(csConf.get(depName));  // fails here
> }{noformat}






[jira] [Updated] (YARN-11348) Deprecated property can not be unset

2022-10-13 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11348:

Description: 
If you try to unset a deprecated property in a 
*CapacitySchedulerConfiguration* object, the value won't be removed.

Example failing Test for the *TestCapacitySchedulerConfiguration* class
{noformat}
@Test
public void testDeprecationFeatureWorks() {
  final String value = "VALUE";
  final String goodName = "koko";
  final String depName = "dfs.nfs.exports.allowed.hosts";
  final CapacitySchedulerConfiguration csConf = createDefaultCsConf();

  csConf.set(goodName, value);
  csConf.unset(goodName);
  assertNull(csConf.get(goodName));

  csConf.set(depName, value);
  csConf.unset(depName);
  assertNull(csConf.get(depName));  // fails here
}{noformat}

  was:
If you try to unset a deprecated property in an
{code:java}
CapacitySchedulerConfiguration{code}
object the value wont be removed.

Example failing Test for the *TestCapacitySchedulerConfiguration* class
{noformat}
@Test
public void testDeprecationFeatureWorks() {
  final String value = "VALUE";
  final String goodName = "koko";
  final String depName = "dfs.nfs.exports.allowed.hosts";
  final CapacitySchedulerConfiguration csConf = createDefaultCsConf();

  csConf.set(goodName, value);
  csConf.unset(goodName);
  assertNull(csConf.get(goodName));

  csConf.set(depName, value);
  csConf.unset(depName);
  assertNull(csConf.get(depName));  // fails here
}{noformat}


> Deprecated property can not be unset
> 
>
> Key: YARN-11348
> URL: https://issues.apache.org/jira/browse/YARN-11348
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bence Kosztolnik
>Priority: Major
>  Labels: newbie
>
> If you try to unset a deprecated property in a 
> *CapacitySchedulerConfiguration* object, the value won't be removed.
> Example failing Test for the *TestCapacitySchedulerConfiguration* class
> {noformat}
> @Test
> public void testDeprecationFeatureWorks() {
>   final String value = "VALUE";
>   final String goodName = "koko";
>   final String depName = "dfs.nfs.exports.allowed.hosts";
>   final CapacitySchedulerConfiguration csConf = createDefaultCsConf();
>   csConf.set(goodName, value);
>   csConf.unset(goodName);
>   assertNull(csConf.get(goodName));
>   csConf.set(depName, value);
>   csConf.unset(depName);
>   assertNull(csConf.get(depName));  // fails here
> }{noformat}






[jira] [Created] (YARN-11348) Depricated property can not be unset

2022-10-13 Thread Bence Kosztolnik (Jira)
Bence Kosztolnik created YARN-11348:
---

 Summary: Depricated property can not be unset
 Key: YARN-11348
 URL: https://issues.apache.org/jira/browse/YARN-11348
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bence Kosztolnik


If you try to unset a deprecated property in a
{code:java}
CapacitySchedulerConfiguration{code}
object, the value won't be removed.

Example failing Test for the TestCapacitySchedulerConfiguration class


{noformat}
@Test
public void testDeprecationFeatureWorks() {
  final String value = "VALUE";
  final String goodName = "koko";
  final String depName = "dfs.nfs.exports.allowed.hosts";
  final CapacitySchedulerConfiguration csConf = createDefaultCsConf();

  csConf.set(goodName, value);
  csConf.unset(goodName);
  assertNull(csConf.get(goodName));

  csConf.set(depName, value);
  csConf.unset(depName);
  assertNull(csConf.get(depName));  // fails here
}{noformat}






[jira] [Updated] (YARN-11348) Deprecated property can not be unset

2022-10-13 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11348:

Labels: newbie  (was: )

> Deprecated property can not be unset
> 
>
> Key: YARN-11348
> URL: https://issues.apache.org/jira/browse/YARN-11348
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bence Kosztolnik
>Priority: Major
>  Labels: newbie
>
> If you try to unset a deprecated property in a
> {code:java}
> CapacitySchedulerConfiguration{code}
> object, the value won't be removed.
> Example failing Test for the *TestCapacitySchedulerConfiguration* class
> {noformat}
> @Test
> public void testDeprecationFeatureWorks() {
>   final String value = "VALUE";
>   final String goodName = "koko";
>   final String depName = "dfs.nfs.exports.allowed.hosts";
>   final CapacitySchedulerConfiguration csConf = createDefaultCsConf();
>   csConf.set(goodName, value);
>   csConf.unset(goodName);
>   assertNull(csConf.get(goodName));
>   csConf.set(depName, value);
>   csConf.unset(depName);
>   assertNull(csConf.get(depName));  // fails here
> }{noformat}






[jira] [Assigned] (YARN-11216) Avoid unnecessary reconstruction of ConfigurationProperties

2022-08-01 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik reassigned YARN-11216:
---

Assignee: Bence Kosztolnik

> Avoid unnecessary reconstruction of ConfigurationProperties
> ---
>
> Key: YARN-11216
> URL: https://issues.apache.org/jira/browse/YARN-11216
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: András Győri
>Assignee: Bence Kosztolnik
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> ConfigurationProperties is expensive to create, however, due to its immutable 
> nature it is possible to copy it/share it between configuration objects (eg. 
> create a copy constructor). 






[jira] [Created] (YARN-11410) Add default methods for StateMachine

2023-01-08 Thread Bence Kosztolnik (Jira)
Bence Kosztolnik created YARN-11410:
---

 Summary: Add default methods for StateMachine
 Key: YARN-11410
 URL: https://issues.apache.org/jira/browse/YARN-11410
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bence Kosztolnik
 Fix For: 3.4.0


YARN-11395 added a new method to the StateMachine interface, which can break 
compatibility with dependent software, so the method should be converted to a 
default method to prevent this break.
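A minimal sketch of the idea on a simplified interface; the shape of the real 
StateMachine interface and the name of the newly added method are not reproduced 
here, only the default-method mechanism:
{code:java}
public interface StateMachine<STATE extends Enum<STATE>> {
  // Pre-existing method: every implementation already provides this.
  STATE getCurrentState();

  // Newly added accessor: a default body keeps existing external
  // implementations source and binary compatible without any change.
  default STATE getPreviousState() {
    return null; // conservative fallback for implementations that do not track it
  }
}
{code}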






[jira] [Assigned] (YARN-11410) Add default methods for StateMachine

2023-01-08 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik reassigned YARN-11410:
---

Assignee: Bence Kosztolnik

> Add default methods for StateMachine
> 
>
> Key: YARN-11410
> URL: https://issues.apache.org/jira/browse/YARN-11410
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Major
> Fix For: 3.4.0
>
>
> YARN-11395 added a new method to the StateMachine interface, which can break 
> compatibility with dependent software, so the method should be converted to a 
> default method to prevent this break.






[jira] [Updated] (YARN-11390) TestResourceTrackerService.testNodeRemovalNormally: Shutdown nodes should be 0 now expected: <1> but was: <0>

2022-12-06 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11390:

Description: 
Sometimes the TestResourceTrackerService.{*}testNodeRemovalNormally{*} test 
fails with the following message:
{noformat}
java.lang.AssertionError: Shutdown nodes should be 0 now expected:<1> but 
was:<0>
at 
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtilDecomToUntracked(TestResourceTrackerService.java:1723)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtil(TestResourceTrackerService.java:1685)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalNormally(TestResourceTrackerService.java:1530){noformat}

This can happen if the hardcoded 1s sleep in the test is not enough for a 
proper shutdown.

To fix this issue we should poll the cluster status with a timeout, and verify 
that the cluster can reach the expected state.

  was:
Some times the TestResourceTrackerService.{*}testNodeRemovalNormally{*} fails 
with the following message


java.lang.AssertionError: Shutdown nodes should be 0 now expected:<1> but 
was:<0>
at 
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtilDecomToUntracked(TestResourceTrackerService.java:1723)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtil(TestResourceTrackerService.java:1685)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalNormally(TestResourceTrackerService.java:1530)
This can happen in case if the hardcoded 1s sleep in the test not enough for 
proper shut down.

To fix this issue we should poll the cluster status with a time out, and see 
the cluster can reach the expected state


> TestResourceTrackerService.testNodeRemovalNormally: Shutdown nodes should be 
> 0 now expected: <1> but was: <0>
> -
>
> Key: YARN-11390
> URL: https://issues.apache.org/jira/browse/YARN-11390
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Major
>
> Sometimes the TestResourceTrackerService.{*}testNodeRemovalNormally{*} test 
> fails with the following message:
> {noformat}
> java.lang.AssertionError: Shutdown nodes should be 0 now expected:<1> but 
> was:<0>
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtilDecomToUntracked(TestResourceTrackerService.java:1723)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtil(TestResourceTrackerService.java:1685)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalNormally(TestResourceTrackerService.java:1530){noformat}
> This can happen if the hardcoded 1s sleep in the test is not enough for a 
> proper shutdown.
> To fix this issue we should poll the cluster status with a timeout, and verify 
> that the cluster can reach the expected state.






[jira] [Created] (YARN-11390) TestResourceTrackerService.testNodeRemovalNormally: Shutdown nodes should be 0 now expected: <1> but was: <0>

2022-12-06 Thread Bence Kosztolnik (Jira)
Bence Kosztolnik created YARN-11390:
---

 Summary: TestResourceTrackerService.testNodeRemovalNormally: 
Shutdown nodes should be 0 now expected: <1> but was: <0>
 Key: YARN-11390
 URL: https://issues.apache.org/jira/browse/YARN-11390
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Bence Kosztolnik
Assignee: Bence Kosztolnik


Sometimes the TestResourceTrackerService.{*}testNodeRemovalNormally{*} test 
fails with the following message:


java.lang.AssertionError: Shutdown nodes should be 0 now expected:<1> but 
was:<0>
at 
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtilDecomToUntracked(TestResourceTrackerService.java:1723)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtil(TestResourceTrackerService.java:1685)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalNormally(TestResourceTrackerService.java:1530)
This can happen if the hardcoded 1s sleep in the test is not enough for a 
proper shutdown.

To fix this issue we should poll the cluster status with a timeout, and verify 
that the cluster can reach the expected state.
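A minimal sketch of the polling approach, assuming the shutdown-node count is 
read from ClusterMetrics as in the existing assertions; the helper name, the 
accessor and the timeout values below are illustrative:
{code:java}
import org.apache.hadoop.test.GenericTestUtils;
import org.apache.hadoop.yarn.server.resourcemanager.ClusterMetrics;

final class NodeRemovalTestHelper {
  // Wait until no node is reported as shut down, instead of a fixed 1 s sleep.
  static void waitForNoShutdownNodes() throws Exception {
    GenericTestUtils.waitFor(
        () -> ClusterMetrics.getMetrics().getNumShutdownNMs() == 0,
        100,      // re-check every 100 ms
        10_000);  // give up only after 10 s
  }
}
{code}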






[jira] [Created] (YARN-11395) Resource Manager UI, cluster/appattempt/*, can not present FINAL_SAVING state

2022-12-13 Thread Bence Kosztolnik (Jira)
Bence Kosztolnik created YARN-11395:
---

 Summary: Resource Manager UI, cluster/appattempt/*, can not 
present FINAL_SAVING state
 Key: YARN-11395
 URL: https://issues.apache.org/jira/browse/YARN-11395
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.4.0
Reporter: Bence Kosztolnik


If an attempt is in the *FINAL_SAVING* state, the 
*RMAppAttemptBlock#createAttemptHeadRoomTable* method fails with a conversion 
error, which results in a 
{code:java}
RFC6265 Cookie values may not contain character: [ ]{code}
error in the UI and in the logs as well.
RM log:
{code:java}
...
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: No enum constant 
org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.FINAL_SAVING
    at java.lang.Enum.valueOf(Enum.java:238)
    at 
org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27)
    at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppAttemptBlock.createAttemptHeadRoomTable(RMAppAttemptBlock.java:424)
    at 
org.apache.hadoop.yarn.server.webapp.AppAttemptBlock.render(AppAttemptBlock.java:151)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
    at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
    at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
    at 
org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
    at 
org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
    at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
    at 
org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
    at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
    at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.appattempt(RmController.java:62)
    ... 63 more
2022-12-05 04:15:33,029 WARN org.eclipse.jetty.server.HttpChannel: 
/cluster/appattempt/appattempt_1667297151262_0247_01
java.lang.IllegalArgumentException: RFC6265 Cookie values may not contain 
character: [ ]
    at 
org.eclipse.jetty.http.Syntax.requireValidRFC6265CookieValue(Syntax.java:136)
...{code}
This bug was introduced by the YARN-1345 ticket, which also caused a similar 
error reported as YARN-4411. In YARN-4411 the enum mapping logic from 
RMAppAttemptState to YarnApplicationAttemptState was modified like this:
- if the state is FINAL_SAVING, represent the previous state instead

This error can also occur for the ALLOCATED_SAVING and 
LAUNCHED_UNMANAGED_SAVING states.

So we should modify the *createAttemptHeadRoomTable* method to handle the 
previously mentioned 3 states just like in YARN-4411.
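A minimal sketch of that mapping as a private helper on the rendering side; the 
{{getPreviousState()}} accessor and the helper method itself are hypothetical, 
shown only to illustrate the YARN-4411-style fallback:
{code:java}
// Map internal "saving" states, which have no counterpart in the public
// YarnApplicationAttemptState enum, onto the state recorded before the save
// started, mirroring the YARN-4411 approach.
private YarnApplicationAttemptState toPublicAttemptState(RMAppAttempt attempt) {
  RMAppAttemptState state = attempt.getAppAttemptState();
  switch (state) {
    case FINAL_SAVING:
    case ALLOCATED_SAVING:
    case LAUNCHED_UNMANAGED_SAVING:
      // Hypothetical accessor for the state the attempt was in before saving.
      state = attempt.getPreviousState();
      break;
    default:
      break;
  }
  return YarnApplicationAttemptState.valueOf(state.name());
}
{code}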






[jira] [Assigned] (YARN-11395) Resource Manager UI, cluster/appattempt/*, can not present FINAL_SAVING state

2022-12-13 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik reassigned YARN-11395:
---

Assignee: Bence Kosztolnik

> Resource Manager UI, cluster/appattempt/*, can not present FINAL_SAVING state
> -
>
> Key: YARN-11395
> URL: https://issues.apache.org/jira/browse/YARN-11395
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Critical
>
> If an attempt is in the *FINAL_SAVING* state, the 
> *RMAppAttemptBlock#createAttemptHeadRoomTable* method fails with a conversion 
> error, which results in a 
> {code:java}
> RFC6265 Cookie values may not contain character: [ ]{code}
> error in the UI and in the logs as well.
> RM log:
> {code:java}
> ...
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.IllegalArgumentException: No enum constant 
> org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.FINAL_SAVING
>     at java.lang.Enum.valueOf(Enum.java:238)
>     at 
> org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppAttemptBlock.createAttemptHeadRoomTable(RMAppAttemptBlock.java:424)
>     at 
> org.apache.hadoop.yarn.server.webapp.AppAttemptBlock.render(AppAttemptBlock.java:151)
>     at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
>     at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
>     at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
>     at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
>     at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
>     at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
>     at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
>     at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
>     at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.appattempt(RmController.java:62)
>     ... 63 more
> 2022-12-05 04:15:33,029 WARN org.eclipse.jetty.server.HttpChannel: 
> /cluster/appattempt/appattempt_1667297151262_0247_01
> java.lang.IllegalArgumentException: RFC6265 Cookie values may not contain 
> character: [ ]
>     at 
> org.eclipse.jetty.http.Syntax.requireValidRFC6265CookieValue(Syntax.java:136)
> ...{code}
> This bug was introduced by the YARN-1345 ticket, which also caused a similar 
> error reported as YARN-4411. In YARN-4411 the enum mapping logic from 
> RMAppAttemptState to YarnApplicationAttemptState was modified like this:
> - if the state is FINAL_SAVING, represent the previous state instead
> This error can also occur for the ALLOCATED_SAVING and 
> LAUNCHED_UNMANAGED_SAVING states.
> So we should modify the *createAttemptHeadRoomTable* method to handle the 
> previously mentioned 3 states just like in YARN-4411.






[jira] [Created] (YARN-11420) Stabilize TestNMClient

2023-01-19 Thread Bence Kosztolnik (Jira)
Bence Kosztolnik created YARN-11420:
---

 Summary: Stabilize TestNMClient
 Key: YARN-11420
 URL: https://issues.apache.org/jira/browse/YARN-11420
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.4.0
Reporter: Bence Kosztolnik


The TestNMClient test methods can get stuck if the test container fails while 
the test is expecting it to be in the running state. This can happen, for 
example, if the container fails due to low memory. To fix this, the test should 
tolerate failures like this.






[jira] [Assigned] (YARN-11420) Stabilize TestNMClient

2023-03-01 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik reassigned YARN-11420:
---

Assignee: Bence Kosztolnik

> Stabilize TestNMClient
> --
>
> Key: YARN-11420
> URL: https://issues.apache.org/jira/browse/YARN-11420
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Major
>  Labels: pull-request-available
>
> The TestNMClient test methods can get stuck if the test container fails while 
> the test is expecting it to be in the running state. This can happen, for 
> example, if the container fails due to low memory. To fix this, the test 
> should tolerate failures like this.






[jira] [Resolved] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs

2023-08-01 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik resolved YARN-10345.
-
Resolution: Duplicate

> HsWebServices containerlogs does not honor ACLs for completed jobs
> --
>
> Key: YARN-10345
> URL: https://issues.apache.org/jira/browse/YARN-10345
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.3.0, 3.2.2, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Bence Kosztolnik
>Priority: Critical
> Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png
>
>
> HsWebServices containerlogs does not honor ACLs. A user who does not have 
> permission to view a job is allowed to view the job logs for completed jobs 
> from YARN UI2 through HsWebServices.
> *Repro:*
> Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + 
> HistoryServer runs as mapred
>  # Run a sample MR job using systest user
>  #  Once the job is complete, access the job logs using hue user from YARN 
> UI2.
> !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300!
>  
> YARN CLI works fine and does not allow hue user to view systest user job logs.
> {code:java}
> [hue@pjoseph-cm-2 /]$ 
> [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at 
> rmhostname:8032
> Permission denied: user=hue, access=EXECUTE, 
> inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
> {code}






[jira] [Assigned] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs

2023-07-25 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik reassigned YARN-10345:
---

Assignee: Bence Kosztolnik  (was: Prabhu Joseph)

> HsWebServices containerlogs does not honor ACLs for completed jobs
> --
>
> Key: YARN-10345
> URL: https://issues.apache.org/jira/browse/YARN-10345
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.3.0, 3.2.2, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Bence Kosztolnik
>Priority: Critical
> Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png
>
>
> HsWebServices containerlogs does not honor ACLs. A user who does not have 
> permission to view a job is allowed to view the job logs for completed jobs 
> from YARN UI2 through HsWebServices.
> *Repro:*
> Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + 
> HistoryServer runs as mapred
>  # Run a sample MR job using systest user
>  #  Once the job is complete, access the job logs using hue user from YARN 
> UI2.
> !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300!
>  
> YARN CLI works fine and does not allow hue user to view systest user job logs.
> {code:java}
> [hue@pjoseph-cm-2 /]$ 
> [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at 
> rmhostname:8032
> Permission denied: user=hue, access=EXECUTE, 
> inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
> {code}






[jira] [Comment Edited] (YARN-11010) YARN ui2 hangs on the Queues page when the scheduler response contains NaN values

2024-02-15 Thread Bence Kosztolnik (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817624#comment-17817624
 ] 

Bence Kosztolnik edited comment on YARN-11010 at 2/15/24 10:36 AM:
---

I just saw this; maybe we can use this solution here as well:
HADOOP-18954


was (Author: JIRAUSER292672):
I just so this, maybe we can use this solution here as well
HADOOP-18954

> YARN ui2 hangs on the Queues page when the scheduler response contains NaN 
> values
> -
>
> Key: YARN-11010
> URL: https://issues.apache.org/jira/browse/YARN-11010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.4.0
>Reporter: Tamas Domok
>Assignee: Tamas Domok
>Priority: Major
> Attachments: capacity-scheduler.xml, shresponse.json
>
>
> When the scheduler response contains NaN values for capacity and maxCapacity 
> the UI2 hangs on the Queues page. The console log shows the following error:
> {code:java}
> SyntaxError: Unexpected token N in JSON at position 666 {code}
> The scheduler response:
> {code:java}
> "maxCapacity": NaN,
> "absoluteMaxCapacity": NaN, {code}
> NaN, infinity, -infinity is not valid in JSON syntax: 
> https://www.json.org/json-en.html
> This might be related as well: YARN-10452
>  
> I managed to reproduce this with AQCv1, where I set the parent queue's 
> capacity in absolute mode, then I used percentage mode on the 
> leaf-queue-template. I'm not sure if this is a valid configuration, however 
> there is no error or warning in RM logs about any configuration error. To 
> trigger the issue the DominantResourceCalculator must be used. (When using 
> absolute mode on the leaf-queue-template this issue is not re-producible, 
> further details on: YARN-10922).
>  
> Reproduction steps:
>  # Start the cluster with the attached configuration
>  # Check the Queues page on UI2 (it should work at this point)
>  # Send an example job (yarn jar hadoop-mapreduce-examples-3.4.0-SNAPSHOT.jar 
> pi 1 10)
>  # Check the Queues page on UI2 (it should not be working at this point)






[jira] [Commented] (YARN-11010) YARN ui2 hangs on the Queues page when the scheduler response contains NaN values

2024-02-15 Thread Bence Kosztolnik (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817624#comment-17817624
 ] 

Bence Kosztolnik commented on YARN-11010:
-

I just saw this; maybe we can use this solution here as well.

> YARN ui2 hangs on the Queues page when the scheduler response contains NaN 
> values
> -
>
> Key: YARN-11010
> URL: https://issues.apache.org/jira/browse/YARN-11010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.4.0
>Reporter: Tamas Domok
>Assignee: Tamas Domok
>Priority: Major
> Attachments: capacity-scheduler.xml, shresponse.json
>
>
> When the scheduler response contains NaN values for capacity and maxCapacity 
> the UI2 hangs on the Queues page. The console log shows the following error:
> {code:java}
> SyntaxError: Unexpected token N in JSON at position 666 {code}
> The scheduler response:
> {code:java}
> "maxCapacity": NaN,
> "absoluteMaxCapacity": NaN, {code}
> NaN, infinity, -infinity is not valid in JSON syntax: 
> https://www.json.org/json-en.html
> This might be related as well: YARN-10452
>  
> I managed to reproduce this with AQCv1, where I set the parent queue's 
> capacity in absolute mode, then I used percentage mode on the 
> leaf-queue-template. I'm not sure if this is a valid configuration, however 
> there is no error or warning in RM logs about any configuration error. To 
> trigger the issue the DominantResourceCalculator must be used. (When using 
> absolute mode on the leaf-queue-template this issue is not re-producible, 
> further details on: YARN-10922).
>  
> Reproduction steps:
>  # Start the cluster with the attached configuration
>  # Check the Queues page on UI2 (it should work at this point)
>  # Send an example job (yarn jar hadoop-mapreduce-examples-3.4.0-SNAPSHOT.jar 
> pi 1 10)
>  # Check the Queues page on UI2 (it should not be working at this point)






[jira] [Comment Edited] (YARN-11010) YARN ui2 hangs on the Queues page when the scheduler response contains NaN values

2024-02-15 Thread Bence Kosztolnik (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817624#comment-17817624
 ] 

Bence Kosztolnik edited comment on YARN-11010 at 2/15/24 9:30 AM:
--

I just saw this; maybe we can use this solution here as well:
HADOOP-18954


was (Author: JIRAUSER292672):
I just so this, maybe we can use this solution here as well

> YARN ui2 hangs on the Queues page when the scheduler response contains NaN 
> values
> -
>
> Key: YARN-11010
> URL: https://issues.apache.org/jira/browse/YARN-11010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.4.0
>Reporter: Tamas Domok
>Assignee: Tamas Domok
>Priority: Major
> Attachments: capacity-scheduler.xml, shresponse.json
>
>
> When the scheduler response contains NaN values for capacity and maxCapacity 
> the UI2 hangs on the Queues page. The console log shows the following error:
> {code:java}
> SyntaxError: Unexpected token N in JSON at position 666 {code}
> The scheduler response:
> {code:java}
> "maxCapacity": NaN,
> "absoluteMaxCapacity": NaN, {code}
> NaN, infinity, -infinity is not valid in JSON syntax: 
> https://www.json.org/json-en.html
> This might be related as well: YARN-10452
>  
> I managed to reproduce this with AQCv1, where I set the parent queue's 
> capacity in absolute mode, then I used percentage mode on the 
> leaf-queue-template. I'm not sure if this is a valid configuration, however 
> there is no error or warning in RM logs about any configuration error. To 
> trigger the issue the DominantResourceCalculator must be used. (When using 
> absolute mode on the leaf-queue-template this issue is not re-producible, 
> further details on: YARN-10922).
>  
> Reproduction steps:
>  # Start the cluster with the attached configuration
>  # Check the Queues page on UI2 (it should work at this point)
>  # Send an example job (yarn jar hadoop-mapreduce-examples-3.4.0-SNAPSHOT.jar 
> pi 1 10)
>  # Check the Queues page on UI2 (it should not be working at this point)






[jira] [Created] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)
Bence Kosztolnik created YARN-11656:
---

 Summary: RMStateStore event queue blocked
 Key: YARN-11656
 URL: https://issues.apache.org/jira/browse/YARN-11656
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.4.1
Reporter: Bence Kosztolnik
 Attachments: issue.png

I observed that a YARN cluster had pending and available resources at the same 
time, but cluster utilization was usually around ~50%. The cluster was loaded 
with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each with 20 
map and 20 reduce containers configured, on a 50-node cluster where each node 
had 8 cores and a lot of memory (there was a CPU bottleneck).
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to 
persist an RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher where events can be persisted in parallel threads (a 
sketch follows the panel below)
- create metric data for the RMStateStore event queue so the problem can be 
easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}

{panel}
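A minimal sketch of the parallel-persist idea behind the first bullet, assuming 
a plain executor-backed worker pool; the class below is illustrative and is not 
the actual RMStateStore dispatcher:
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: hand state-store events to a small worker pool so one slow
// persist call does not block every other event, and track the backlog size so
// it can be published as a metric.
public class ParallelStateStorePersister {
  private final ExecutorService workers = Executors.newFixedThreadPool(4);
  private final AtomicInteger queuedEvents = new AtomicInteger();

  public void handle(Runnable persistAction) {
    queuedEvents.incrementAndGet();
    workers.execute(() -> {
      try {
        persistAction.run();          // e.g. write the app/attempt state
      } finally {
        queuedEvents.decrementAndGet();
      }
    });
  }

  // Value to publish as the RMStateStore event-queue metric.
  public int getQueuedEventCount() {
    return queuedEvents.get();
  }
}
{code}
In the real RM, per-application ordering of store events would still have to be 
preserved, so this only illustrates the queue draining and the metric, not the 
ordering guarantees.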







[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
I observed that a YARN cluster had pending and available resources at the same 
time, but cluster utilization was usually around ~50%. The cluster was loaded 
with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each with 20 
map and 20 reduce containers configured, on a 50-node cluster where each node 
had 8 cores and a lot of memory (there was a CPU bottleneck).
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to 
persist an RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher where events can be persisted in parallel threads
- create metric data for the RMStateStore event queue so the problem can be 
easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}


  was:
I observed Yarn cluster has pending and available resources as well, but the 
cluster utilization is usually around ~50%. The cluster had loaded with 200 
parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 
reduce containers configured, on a 50 nodes cluster, where each node had 8 
cores, and a lot of memory (there was cpu bottleneck).
Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to 
persist a RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher where events can persist in parallel threads
- create metric data for the RMStateStore event queue to be able easily to 
identify the problem if occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250,width=250!
{panel}



> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>Priority: Major
> Attachments: issue.png
>
>
> I observed that a YARN cluster had pending and available resources at the same 
> time, but cluster utilization was usually around ~50%. The cluster was loaded 
> with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each with 
> 20 map and 20 reduce containers configured, on a 50-node cluster where each 
> node had 8 cores and a lot of memory (there was a CPU bottleneck).
> Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to 
> persist an RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher where events can be persisted in parallel threads
> - create metric data for the RMStateStore event queue so the problem can be 
> easily identified if it occurs on a cluster
> {panel:title=Issue visible on UI2}
>  !issue.png|height=250!
> {panel}






[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
I observed that a YARN cluster had pending and available resources at the same 
time, but cluster utilization was usually around ~50%. The cluster was loaded 
with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each with 20 
map and 20 reduce containers configured, on a 50-node cluster where each node 
had 8 cores and a lot of memory (there was a CPU bottleneck).
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to 
persist an RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher where events can be persisted in parallel threads
- create metric data for the RMStateStore event queue so the problem can be 
easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250,width=250!
{panel}


  was:
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png! 
{panel}



> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>Priority: Major
> Attachments: issue.png
>
>
> I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
> Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher that can persist events on parallel threads
> - create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster
> {panel:title=Issue visible on UI2}
>  !issue.png|height=250,width=250!
> {panel}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png! 
{panel}


  was:
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}

{panel}



> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>Priority: Major
> Attachments: issue.png
>
>
> I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
> Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher that can persist events on parallel threads
> - create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster
> {panel:title=Issue visible on UI2}
>  !issue.png! 
> {panel}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs (the {} placeholder stands for the dispatcher name, for example rm-state-store):

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.max-pool-size.{}.default-pool-size|How many parallel threads should execute the parallel event execution|4|
|yarn.dispatcher.multi-thread.max-pool-size.{}.max-pool-size|If the event queue is full, the execution threads will scale up to this many|8|
|yarn.dispatcher.multi-thread.max-pool-size.{}.keep-alive-seconds|Execution threads will be destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.max-pool-size.{}.queue-size|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.max-pool-size.{}.monitor-seconds|The size of the event queue will be logged with this frequency (if not zero)|30|
|yarn.dispatcher.multi-thread.max-pool-size.{}.graceful-stop-seconds|After the stop signal the dispatcher will wait this many seconds to process the incoming events before terminating them|60|



h2. Testing

  was:
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs (the {} placeholder stands for the dispatcher name, for example rm-state-store):

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.max-pool-size.{}.default-pool-size|How many parallel threads should execute the parallel event execution|4|


h2. Testing


> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>Priority: Major
> Attachments: issue.png
>
>
> h2. Problem statement
>  
> I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
> Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the 

[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs (the {} placeholder stands for the dispatcher name, for example rm-state-store):

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.default-pool-size|How many parallel threads should execute the parallel event execution|4|
|yarn.dispatcher.multi-thread.{}.max-pool-size|If the event queue is full, the execution threads will scale up to this many|8|
|yarn.dispatcher.multi-thread.{}.keep-alive-seconds|Execution threads will be destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.queue-size|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.monitor-seconds|The size of the event queue will be logged with this frequency (if not zero)|30|
|yarn.dispatcher.multi-thread.{}.graceful-stop-seconds|After the stop signal the dispatcher will wait this many seconds to process the incoming events before terminating them|60|



h2. Testing

  was:
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs (the {} placeholder stands for the dispatcher name, for example rm-state-store):

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.max-pool-size.{}.default-pool-size|How many parallel threads should execute the parallel event execution|4|
|yarn.dispatcher.multi-thread.max-pool-size.{}.max-pool-size|If the event queue is full, the execution threads will scale up to this many|8|
|yarn.dispatcher.multi-thread.max-pool-size.{}.keep-alive-seconds|Execution threads will be destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.max-pool-size.{}.queue-size|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.max-pool-size.{}.monitor-seconds|The size of the event queue will be logged with this frequency (if not zero)|30|
|yarn.dispatcher.multi-thread.max-pool-size.{}.graceful-stop-seconds|After the stop signal the dispatcher will wait this many seconds to process the incoming events before terminating them|60|



h2. Testing


> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>   

[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs ( the placeholder is for the 
dispatcher name, for example, rm-state-store )

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads 
should execute the parallel event execution| 4|
|yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full the 
execution threads will scale up to this many|8|
|yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads will be 
destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue 
will be logged with this frequency (if not zero) |30|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal 
the dispatcher will wait this many seconds to be able to process the incoming 
events before terminating them|60|



h2. Testing

  was:
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs (the {} placeholder stands for the dispatcher name, for example rm-state-store):

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.default-pool-size|How many parallel threads should execute the parallel event execution|4|
|yarn.dispatcher.multi-thread.{}.max-pool-size|If the event queue is full, the execution threads will scale up to this many|8|
|yarn.dispatcher.multi-thread.{}.keep-alive-seconds|Execution threads will be destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.queue-size|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.monitor-seconds|The size of the event queue will be logged with this frequency (if not zero)|30|
|yarn.dispatcher.multi-thread.{}.graceful-stop-seconds|After the stop signal the dispatcher will wait this many seconds to process the incoming events before terminating them|60|



h2. Testing


> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>Priority: Major
> Attachments: issue.png
>
>
> 

[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs ( the placeholder is for the 
dispatcher name, for example rm-state-store )

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.max-pool-size.{}.default-pool-size|How many 
parallel threads should execute the parallel event execution| 4|


h2. Testing

  was:
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

h2. Testing


> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>Priority: Major
> Attachments: issue.png
>
>
> h2. Problem statement
>  
> I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
> Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher that can persist events on parallel threads
> - create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster
> {panel:title=Issue visible on UI2}
>  !issue.png|height=250!
> {panel}
> h2. Solution
> Created a *MultiDispatcher* class which implements the Dispatcher interface.
> The Dispatcher creates a separate metric object called _Event metrics for 
> "rm-state-store"_ where we can see 
> - how many unhandled events are currently present in the event queue for the 
> specific event type
> - how many events were handled for the specific event type
> - average execution time for the specific event
> The dispatcher has the following configs ( the placeholder is for the 
> dispatcher name, for example rm-state-store )
> ||Config name||Description||Default value||
> |yarn.dispatcher.multi-thread.max-pool-size.{}.default-pool-size|How many 
> parallel threads should execute the parallel event execution| 

[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

Solution:

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event
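
To illustrate the core idea, here is a minimal, hypothetical sketch of parallel event handling on a bounded thread pool (this is not the actual MultiDispatcher implementation; the class and method names are invented, and ordering concerns are ignored):

{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: events are handled on a bounded thread pool instead of a
// single dispatcher thread, so one slow state-store write does not block the rest.
public class ParallelEventDispatcherSketch {

  // Minimal stand-ins for the YARN Event/EventHandler abstractions.
  interface Event { String type(); }
  interface EventHandler<T extends Event> { void handle(T event); }

  private final ThreadPoolExecutor pool;

  ParallelEventDispatcherSketch(int defaultPoolSize, int maxPoolSize,
                                int keepAliveSeconds, int queueSize) {
    // Extra threads beyond the core size are only created once the queue is full
    // (standard ThreadPoolExecutor behaviour).
    this.pool = new ThreadPoolExecutor(defaultPoolSize, maxPoolSize,
        keepAliveSeconds, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>(queueSize));
  }

  // Submit the handler invocation to the pool; slow IO-bound events
  // (e.g. state-store writes) can now be persisted in parallel.
  <T extends Event> void dispatch(T event, EventHandler<T> handler) {
    pool.execute(() -> handler.handle(event));
  }

  // Graceful stop: wait a bounded amount of time for queued events to drain.
  void stop(int gracefulStopSeconds) throws InterruptedException {
    pool.shutdown();
    if (!pool.awaitTermination(gracefulStopSeconds, TimeUnit.SECONDS)) {
      pool.shutdownNow();
    }
  }
}
{code}

The real implementation also has to keep whatever ordering guarantees the state store relies on (for example, between events of the same application), which this sketch does not attempt.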

  was:
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}



> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>Priority: Major
> Attachments: issue.png
>
>
> I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
> Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher that can persist events on parallel threads
> - create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster
> {panel:title=Issue visible on UI2}
>  !issue.png|height=250!
> {panel}
> Solution:
> Created a *MultiDispatcher* class which implements the Dispatcher interface.
> The Dispatcher creates a separate metric object called _Event metrics for 
> "rm-state-store"_ where we can see 
> - how many unhandled events are currently present in the event queue for the 
> specific event type
> - how many events were handled for the specific event type
> - average execution time for the specific event



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event
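
A rough, hypothetical illustration of the bookkeeping behind such per-event-type metrics (this is not the actual implementation, which presumably uses the Hadoop metrics2 framework; all names here are invented):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of per-event-type metrics: events currently queued,
// events handled so far, and the average handling time, keyed by event type.
public class EventMetricsSketch {

  static final class PerType {
    final AtomicLong current = new AtomicLong();   // events waiting in the queue
    final LongAdder numOps = new LongAdder();      // events handled so far
    final LongAdder totalTimeMs = new LongAdder(); // summed handling time

    double avgTimeMs() {
      long ops = numOps.sum();
      return ops == 0 ? 0.0 : (double) totalTimeMs.sum() / ops;
    }
  }

  private final Map<String, PerType> byType = new ConcurrentHashMap<>();

  private PerType of(String eventType) {
    return byType.computeIfAbsent(eventType, t -> new PerType());
  }

  // Called when an event is put on the queue.
  void enqueued(String eventType) {
    of(eventType).current.incrementAndGet();
  }

  // Called when a handler finished processing an event.
  void handled(String eventType, long elapsedMs) {
    PerType m = of(eventType);
    m.current.decrementAndGet();
    m.numOps.increment();
    m.totalTimeMs.add(elapsedMs);
  }

  double averageTimeMs(String eventType) {
    return of(eventType).avgTimeMs();
  }
}
{code}

These three values correspond to the *_Current*, *_NumOps* and *_AvgTime* entries visible in the JMX output shown later in this thread.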

h2. Testing

  was:
h2. Problem statement h2.
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

h2. Solution h2.

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

h2. Testing h2.


> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>Priority: Major
> Attachments: issue.png
>
>
> h2. Problem statement
>  
> I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
> Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher that can persist events on parallel threads
> - create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster
> {panel:title=Issue visible on UI2}
>  !issue.png|height=250!
> {panel}
> h2. Solution
> Created a *MultiDispatcher* class which implements the Dispatcher interface.
> The Dispatcher creates a separate metric object called _Event metrics for 
> "rm-state-store"_ where we can see 
> - how many unhandled events are currently present in the event queue for the 
> specific event type
> - how many events were handled for the specific event type
> - average execution time for the specific event
> h2. Testing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
h2. Problem statement h2.
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

h2. Solution h2.

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

h2. Testing h2.

  was:
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

Solution:

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event


> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>Priority: Major
> Attachments: issue.png
>
>
> h2. Problem statement h2.
>  
> I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
> Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher that can persist events on parallel threads
> - create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster
> {panel:title=Issue visible on UI2}
>  !issue.png|height=250!
> {panel}
> h2. Solution h2.
> Created a *MultiDispatcher* class which implements the Dispatcher interface.
> The Dispatcher creates a separate metric object called _Event metrics for 
> "rm-state-store"_ where we can see 
> - how many unhandled events are currently present in the event queue for the 
> specific event type
> - how many events were handled for the specific event type
> - average execution time for the specific event
> h2. Testing h2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

Another way to identify the issue is to check whether storing the app info takes too long after the app reaches the NEW_SAVING state.

{panel:title=How the issue can look in the log}
 !log.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs ( the placeholder is for the 
dispatcher name, for example, rm-state-store )

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads 
should execute the parallel event execution| 4|
|yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full the 
execution threads will scale up to this many|8|
|yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads will be 
destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue 
will be logged with this frequency (if not zero) |30|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal 
the dispatcher will wait this many seconds to be able to process the incoming 
events before terminating them|60|
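
As an illustration, the properties above could be tuned programmatically like this (a minimal sketch assuming the key names listed in the table; in practice these would normally go into yarn-site.xml, and the values are only examples):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class MultiDispatcherConfigExample {
  public static void main(String[] args) {
    // Example tuning for the "rm-state-store" dispatcher, substituting the
    // dispatcher name into the {} placeholder of the keys from the table above.
    Configuration conf = new Configuration();
    String prefix = "yarn.dispatcher.multi-thread.rm-state-store.";
    conf.setInt(prefix + "default-pool-size", 4);      // core worker threads
    conf.setInt(prefix + "max-pool-size", 8);          // upper bound when the queue is full
    conf.setInt(prefix + "keep-alive-seconds", 10);    // lifetime of the extra threads
    conf.setInt(prefix + "queue-size", 1_000_000);     // event queue capacity
    conf.setInt(prefix + "monitor-seconds", 30);       // queue-size logging frequency
    conf.setInt(prefix + "graceful-stop-seconds", 60); // drain window on shutdown
    System.out.println("pool size = " + conf.getInt(prefix + "default-pool-size", 4));
  }
}
{code}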



h2. Testing

  was:
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

Another way to identify the issue is to check whether storing the app info takes too long after the app reaches the NEW_SAVING state.

{panel:title=How the issue can look in the log}
 !log.png! 
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs ( the placeholder is for the 
dispatcher name, for example, rm-state-store )

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads 
should execute the parallel event execution| 4|
|yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full the 
execution threads will scale up to this many|8|
|yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads will be 
destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue 
will be logged with this frequency (if not zero) |30|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal 
the dispatcher will wait this many seconds to be able to process the incoming 
events before terminating them|60|




[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

Another way to identify the issue is to check whether storing the app info takes too long after the app reaches the NEW_SAVING state.

{panel:title=How the issue can look in the log}
 !log.png! 
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs ( the placeholder is for the 
dispatcher name, for example, rm-state-store )

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads 
should execute the parallel event execution| 4|
|yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full the 
execution threads will scale up to this many|8|
|yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads will be 
destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue 
will be logged with this frequency (if not zero) |30|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal 
the dispatcher will wait this many seconds to be able to process the incoming 
events before terminating them|60|



h2. Testing

  was:
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

Another way to identify the issue is to check whether storing the app info takes too long after the app reaches the NEW_SAVING state.

{panel:title=How the issue can look in the log}

{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs ( the placeholder is for the 
dispatcher name, for example, rm-state-store )

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads 
should execute the parallel event execution| 4|
|yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full the 
execution threads will scale up to this many|8|
|yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads will be 
destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue 
will be logged with this frequency (if not zero) |30|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal 
the dispatcher will wait this many seconds to be able to process the incoming 
events before terminating them|60|



h2. Testing


> 

[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

Another way to identify the issue is to check whether storing the app info takes too long after the app reaches the NEW_SAVING state.

{panel:title=How the issue can look in the log}

{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs ( the placeholder is for the 
dispatcher name, for example, rm-state-store )

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads 
should execute the parallel event execution| 4|
|yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full the 
execution threads will scale up to this many|8|
|yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads will be 
destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue 
will be logged with this frequency (if not zero) |30|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal 
the dispatcher will wait this many seconds to be able to process the incoming 
events before terminating them|60|



h2. Testing

  was:
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs ( the placeholder is for the 
dispatcher name, for example, rm-state-store )

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads 
should execute the parallel event execution| 4|
|yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full the 
execution threads will scale up to this many|8|
|yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads will be 
destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue 
will be logged with this frequency (if not zero) |30|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal 
the dispatcher will wait this many seconds to be able to process the incoming 
events before terminating them|60|



h2. Testing


> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: 

[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Attachment: log.png

> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>Priority: Major
> Attachments: issue.png, log.png
>
>
> h2. Problem statement
>  
> I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
> Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher that can persist events on parallel threads
> - create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster
> {panel:title=Issue visible on UI2}
>  !issue.png|height=250!
> {panel}
> Another way to identify the issue is to check whether storing the app info takes too long after the app reaches the NEW_SAVING state.
> {panel:title=How the issue can look in the log}
> {panel}
> h2. Solution
> Created a *MultiDispatcher* class which implements the Dispatcher interface.
> The Dispatcher creates a separate metric object called _Event metrics for 
> "rm-state-store"_ where we can see 
> - how many unhandled events are currently present in the event queue for the 
> specific event type
> - how many events were handled for the specific event type
> - average execution time for the specific event
> The dispatcher has the following configs ( the placeholder is for the 
> dispatcher name, for example, rm-state-store )
> ||Config name||Description||Default value||
> |yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel 
> threads should execute the parallel event execution| 4|
> |yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full 
> the execution threads will scale up to this many|8|
> |yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads will 
> be destroyed after this many seconds|10|
> |yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
> |yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event 
> queue will be logged with this frequency (if not zero) |30|
> |yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop 
> signal the dispatcher will wait this many seconds to be able to process the 
> incoming events before terminating them|60|
> h2. Testing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
h2. Problem statement
 
I observed a YARN cluster that had both pending and available resources, yet its utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory, so CPU was the bottleneck resource.
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can be easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

Another way to identify the issue is to check whether too much time is required 
to store info for an app after it reaches the NEW_SAVING state.

{panel:title=How the issue can look in the log}
 !log.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs (the {} placeholder is the dispatcher 
name, for example rm-state-store):

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|Number of threads that 
execute events in parallel|4|
|yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full, the 
thread pool scales up to this many threads|8|
|yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Idle execution threads are 
destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue 
is logged with this frequency, if not zero|30|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal, 
the dispatcher waits this many seconds to process pending events before 
terminating them|60|
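
For illustration only (not part of the patch): the table above maps onto plain 
Hadoop configuration keys, so the example rm-state-store dispatcher could be 
tuned via yarn-site.xml or programmatically. A minimal sketch, with placeholder 
values:

{code:java}
// Sketch only: tuning the example "rm-state-store" MultiDispatcher through
// ordinary Hadoop configuration keys. The key names follow the table above;
// the values here are placeholders, not recommendations.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MultiDispatcherConfigExample {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.default-pool-size", 8);
    conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.max-pool-size", 16);
    conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.keep-alive-seconds", 10);
    conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.queue-size", 1_000_000);
    conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.monitor-seconds", 30);
    conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.graceful-stop-seconds", 60);

    // The same keys can also be placed in yarn-site.xml on the ResourceManager host.
    System.out.println(conf.getInt(
        "yarn.dispatcher.multi-thread.rm-state-store.default-pool-size", 4));
  }
}
{code}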


{panel:title=Example output from RM JMX api}
{noformat}
...
{
  "name": "Hadoop:service=ResourceManager,name=Event metrics for 
rm-state-store",
  "modelerType": "Event metrics for rm-state-store",
  "tag.Context": "yarn",
  "tag.Hostname": CENSORED
  "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
  "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
  "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
  "RMStateStoreEventType#STORE_APP_Current": 124,
  "RMStateStoreEventType#STORE_APP_NumOps": 46,
  "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
  "RMStateStoreEventType#UPDATE_APP_Current": 31,
  "RMStateStoreEventType#UPDATE_APP_NumOps": 16,
  "RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.5,
  "RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
  "RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
  "RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.5,
  "RMStateStoreEventType#REMOVE_APP_Current": 12,
  "RMStateStoreEventType#REMOVE_APP_NumOps": 3,
  "RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
  "RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
  "RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
  "RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
  "RMStateStoreEventType#FENCED_Current": 0,
  "RMStateStoreEventType#FENCED_NumOps": 0,
  "RMStateStoreEventType#FENCED_AvgTime": 0.0,
  "RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
  "RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
  "RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
  "RMStateStoreEventType#REMOVE_MASTERKEY_Current": 0,
  "RMStateStoreEventType#REMOVE_MASTERKEY_NumOps": 0,
  "RMStateStoreEventType#REMOVE_MASTERKEY_AvgTime": 0.0,
  "RMStateStoreEventType#STORE_DELEGATION_TOKEN_Current": 0,
  "RMStateStoreEventType#STORE_DELEGATION_TOKEN_NumOps": 0,
  "RMStateStoreEventType#STORE_DELEGATION_TOKEN_AvgTime": 0.0,
  "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_Current": 0,
  "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_NumOps": 0,
  "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_AvgTime": 0.0,
  "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_Current": 0,
  "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_NumOps": 0,
  

[jira] [Assigned] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik reassigned YARN-11656:
---

Assignee: Bence Kosztolnik

> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Major
> Attachments: issue.png, log.png
>
>
> h2. Problem statement
>  
> I observed that a YARN cluster had both pending and available resources, yet 
> cluster utilization hovered around ~50%. The cluster was loaded with 200 
> parallel PI example jobs (from hadoop-mapreduce-examples), each configured 
> with 20 map and 20 reduce containers, on a 50-node cluster where every node 
> had 8 cores and plenty of memory (CPU was the bottleneck).
> Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to 
> persist a RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher where events can be persisted in parallel threads
> - create metric data for the RMStateStore event queue so the problem can be 
> easily identified if it occurs on a cluster
> {panel:title=Issue visible on UI2}
>  !issue.png|height=250!
> {panel}
> Another way to identify the issue is to check whether too much time is 
> required to store info for an app after it reaches the NEW_SAVING state.
> {panel:title=How the issue can look in the log}
>  !log.png|height=250!
> {panel}
> h2. Solution
> Created a *MultiDispatcher* class which implements the Dispatcher interface.
> The Dispatcher creates a separate metric object called _Event metrics for 
> "rm-state-store"_ where we can see 
> - how many unhandled events are currently present in the event queue for the 
> specific event type
> - how many events were handled for the specific event type
> - average execution time for the specific event
> The dispatcher has the following configs (the {} placeholder is the 
> dispatcher name, for example rm-state-store):
> ||Config name||Description||Default value||
> |yarn.dispatcher.multi-thread.{}.*default-pool-size*|Number of threads that 
> execute events in parallel|4|
> |yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full, 
> the thread pool scales up to this many threads|8|
> |yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Idle execution threads 
> are destroyed after this many seconds|10|
> |yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
> |yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue 
> is logged with this frequency, if not zero|30|
> |yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop 
> signal, the dispatcher waits this many seconds to process pending events 
> before terminating them|60|
> {panel:title=Example output from RM JMX api}
> {noformat}
> ...
> {
>   "name": "Hadoop:service=ResourceManager,name=Event metrics for 
> rm-state-store",
>   "modelerType": "Event metrics for rm-state-store",
>   "tag.Context": "yarn",
>   "tag.Hostname": CENSORED
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_APP_Current": 124,
>   "RMStateStoreEventType#STORE_APP_NumOps": 46,
>   "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
>   "RMStateStoreEventType#UPDATE_APP_Current": 31,
>   "RMStateStoreEventType#UPDATE_APP_NumOps": 16,
>   "RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.5,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.5,
>   "RMStateStoreEventType#REMOVE_APP_Current": 12,
>   "RMStateStoreEventType#REMOVE_APP_NumOps": 3,
>   "RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
>   "RMStateStoreEventType#FENCED_Current": 0,
>   "RMStateStoreEventType#FENCED_NumOps": 0,
>   "RMStateStoreEventType#FENCED_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
>   "RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
>   "RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
>   "RMStateStoreEventType#REMOVE_MASTERKEY_Current": 0,
>   

[jira] [Updated] (YARN-11656) RMStateStore eventqueue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Summary: RMStateStore eventqueue blocked  (was: RMStateStore event queue 
blocked)

> RMStateStore eventqueue blocked
> ---
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Major
> Attachments: issue.png, log.png
>
>
> h2. Problem statement
>  
> I observed that a YARN cluster had both pending and available resources, yet 
> cluster utilization hovered around ~50%. The cluster was loaded with 200 
> parallel PI example jobs (from hadoop-mapreduce-examples), each configured 
> with 20 map and 20 reduce containers, on a 50-node cluster where every node 
> had 8 cores and plenty of memory (CPU was the bottleneck).
> Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to 
> persist a RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher where events can be persisted in parallel threads
> - create metric data for the RMStateStore event queue so the problem can be 
> easily identified if it occurs on a cluster
> {panel:title=Issue visible on UI2}
>  !issue.png|height=250!
> {panel}
> Another way to identify the issue is to check whether too much time is 
> required to store info for an app after it reaches the NEW_SAVING state.
> {panel:title=How the issue can look in the log}
>  !log.png|height=250!
> {panel}
> h2. Solution
> Created a *MultiDispatcher* class which implements the Dispatcher interface.
> The Dispatcher creates a separate metric object called _Event metrics for 
> "rm-state-store"_ where we can see 
> - how many unhandled events are currently present in the event queue for the 
> specific event type
> - how many events were handled for the specific event type
> - average execution time for the specific event
> The dispatcher has the following configs (the {} placeholder is the 
> dispatcher name, for example rm-state-store):
> ||Config name||Description||Default value||
> |yarn.dispatcher.multi-thread.{}.*default-pool-size*|Number of threads that 
> execute events in parallel|4|
> |yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full, 
> the thread pool scales up to this many threads|8|
> |yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Idle execution threads 
> are destroyed after this many seconds|10|
> |yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
> |yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue 
> is logged with this frequency, if not zero|30|
> |yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop 
> signal, the dispatcher waits this many seconds to process pending events 
> before terminating them|60|
> {panel:title=Example output from RM JMX api}
> {noformat}
> ...
> {
>   "name": "Hadoop:service=ResourceManager,name=Event metrics for 
> rm-state-store",
>   "modelerType": "Event metrics for rm-state-store",
>   "tag.Context": "yarn",
>   "tag.Hostname": CENSORED
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_APP_Current": 124,
>   "RMStateStoreEventType#STORE_APP_NumOps": 46,
>   "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
>   "RMStateStoreEventType#UPDATE_APP_Current": 31,
>   "RMStateStoreEventType#UPDATE_APP_NumOps": 16,
>   "RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.5,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.5,
>   "RMStateStoreEventType#REMOVE_APP_Current": 12,
>   "RMStateStoreEventType#REMOVE_APP_NumOps": 3,
>   "RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
>   "RMStateStoreEventType#FENCED_Current": 0,
>   "RMStateStoreEventType#FENCED_NumOps": 0,
>   "RMStateStoreEventType#FENCED_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
>   "RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
>   "RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
>   

[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
h2. Problem statement
 
I observed that a YARN cluster had both pending and available resources, yet 
cluster utilization hovered around ~50%. The cluster was loaded with 200 
parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 
20 map and 20 reduce containers, on a 50-node cluster where every node had 8 
cores and plenty of memory (CPU was the bottleneck).
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to 
persist a RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher where events can be persisted in parallel threads
- create metric data for the RMStateStore event queue so the problem can be 
easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

Another way to identify the issue is to check whether too much time is required 
to store info for an app after it reaches the NEW_SAVING state.

{panel:title=How the issue can look in the log}
 !log.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs (the {} placeholder is the dispatcher 
name, for example rm-state-store):

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|Number of threads that 
execute events in parallel|4|
|yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full, the 
thread pool scales up to this many threads|8|
|yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Idle execution threads are 
destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue 
is logged with this frequency, if not zero|30|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal, 
the dispatcher waits this many seconds to process pending events before 
terminating them|60|


{panel:title=Example output from RM JMX api}
{noformat}
...
{
  "name": "Hadoop:service=ResourceManager,name=Event metrics for 
rm-state-store",
  "modelerType": "Event metrics for rm-state-store",
  "tag.Context": "yarn",
  "tag.Hostname": CENSORED
  "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
  "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
  "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
  "RMStateStoreEventType#STORE_APP_Current": 124,
  "RMStateStoreEventType#STORE_APP_NumOps": 46,
  "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
  "RMStateStoreEventType#UPDATE_APP_Current": 31,
  "RMStateStoreEventType#UPDATE_APP_NumOps": 16,
  "RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.5,
  "RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
  "RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
  "RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.5,
  "RMStateStoreEventType#REMOVE_APP_Current": 12,
  "RMStateStoreEventType#REMOVE_APP_NumOps": 3,
  "RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
  "RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
  "RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
  "RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
  "RMStateStoreEventType#FENCED_Current": 0,
  "RMStateStoreEventType#FENCED_NumOps": 0,
  "RMStateStoreEventType#FENCED_AvgTime": 0.0,
  "RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
  "RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
  "RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
  "RMStateStoreEventType#REMOVE_MASTERKEY_Current": 0,
  "RMStateStoreEventType#REMOVE_MASTERKEY_NumOps": 0,
  "RMStateStoreEventType#REMOVE_MASTERKEY_AvgTime": 0.0,
  "RMStateStoreEventType#STORE_DELEGATION_TOKEN_Current": 0,
  "RMStateStoreEventType#STORE_DELEGATION_TOKEN_NumOps": 0,
  "RMStateStoreEventType#STORE_DELEGATION_TOKEN_AvgTime": 0.0,
  "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_Current": 0,
  "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_NumOps": 0,
  "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_AvgTime": 0.0,
  "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_Current": 0,
  "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_NumOps": 0,
  

[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Summary: RMStateStore event queue blocked  (was: RMStateStore eventqueue 
blocked)

> RMStateStore event queue blocked
> 
>
> Key: YARN-11656
> URL: https://issues.apache.org/jira/browse/YARN-11656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.1
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Major
> Attachments: issue.png, log.png
>
>
> h2. Problem statement
>  
> I observed that a YARN cluster had both pending and available resources, yet 
> cluster utilization hovered around ~50%. The cluster was loaded with 200 
> parallel PI example jobs (from hadoop-mapreduce-examples), each configured 
> with 20 map and 20 reduce containers, on a 50-node cluster where every node 
> had 8 cores and plenty of memory (CPU was the bottleneck).
> Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to 
> persist a RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher where events can be persisted in parallel threads
> - create metric data for the RMStateStore event queue so the problem can be 
> easily identified if it occurs on a cluster
> {panel:title=Issue visible on UI2}
>  !issue.png|height=250!
> {panel}
> Another way to identify the issue is to check whether too much time is 
> required to store info for an app after it reaches the NEW_SAVING state.
> {panel:title=How the issue can look in the log}
>  !log.png|height=250!
> {panel}
> h2. Solution
> Created a *MultiDispatcher* class which implements the Dispatcher interface.
> The Dispatcher creates a separate metric object called _Event metrics for 
> "rm-state-store"_ where we can see 
> - how many unhandled events are currently present in the event queue for the 
> specific event type
> - how many events were handled for the specific event type
> - average execution time for the specific event
> The dispatcher has the following configs (the {} placeholder is the 
> dispatcher name, for example rm-state-store):
> ||Config name||Description||Default value||
> |yarn.dispatcher.multi-thread.{}.*default-pool-size*|Number of threads that 
> execute events in parallel|4|
> |yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full, 
> the thread pool scales up to this many threads|8|
> |yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Idle execution threads 
> are destroyed after this many seconds|10|
> |yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
> |yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue 
> is logged with this frequency, if not zero|30|
> |yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop 
> signal, the dispatcher waits this many seconds to process pending events 
> before terminating them|60|
> {panel:title=Example output from RM JMX api}
> {noformat}
> ...
> {
>   "name": "Hadoop:service=ResourceManager,name=Event metrics for 
> rm-state-store",
>   "modelerType": "Event metrics for rm-state-store",
>   "tag.Context": "yarn",
>   "tag.Hostname": CENSORED
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_APP_Current": 124,
>   "RMStateStoreEventType#STORE_APP_NumOps": 46,
>   "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
>   "RMStateStoreEventType#UPDATE_APP_Current": 31,
>   "RMStateStoreEventType#UPDATE_APP_NumOps": 16,
>   "RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.5,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.5,
>   "RMStateStoreEventType#REMOVE_APP_Current": 12,
>   "RMStateStoreEventType#REMOVE_APP_NumOps": 3,
>   "RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
>   "RMStateStoreEventType#FENCED_Current": 0,
>   "RMStateStoreEventType#FENCED_NumOps": 0,
>   "RMStateStoreEventType#FENCED_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
>   "RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
>   "RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
>   

[jira] [Created] (YARN-11634) Speed-up TestTimelineClient

2023-12-19 Thread Bence Kosztolnik (Jira)
Bence Kosztolnik created YARN-11634:
---

 Summary: Speed-up TestTimelineClient
 Key: YARN-11634
 URL: https://issues.apache.org/jira/browse/YARN-11634
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Bence Kosztolnik
Assignee: Bence Kosztolnik


The TimelineConnector class has a hardcoded 1-minute connection timeout, which 
makes TestTimelineClient a long-running test (~15:30 min).
Decreasing the timeout to 10 ms speeds up the test run (~56 sec).
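
A rough sketch of the idea, with hypothetical names (not the real 
TimelineConnector fields): exposing the hardcoded constant lets the test shrink 
it instead of waiting a full minute for every failed connection attempt.

{code:java}
// Hypothetical sketch only - field and method names are illustrative,
// not the actual TimelineConnector API.
public class ConnectorTimeoutSketch {
  // Production default: 1 minute, as described above.
  static final int DEFAULT_CONNECT_TIMEOUT_MS = 60 * 1000;

  // Package-private so a test can shrink it to e.g. 10 ms.
  static int connectTimeoutMs = DEFAULT_CONNECT_TIMEOUT_MS;

  static java.net.HttpURLConnection open(java.net.URL url) throws java.io.IOException {
    java.net.HttpURLConnection conn = (java.net.HttpURLConnection) url.openConnection();
    conn.setConnectTimeout(connectTimeoutMs); // 60 s in production, 10 ms in the test
    conn.setReadTimeout(connectTimeoutMs);
    return conn;
  }
}
{code}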



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11634) Speed-up TestTimelineClient

2023-12-19 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11634:

Priority: Minor  (was: Major)

> Speed-up TestTimelineClient
> ---
>
> Key: YARN-11634
> URL: https://issues.apache.org/jira/browse/YARN-11634
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Minor
>
> The TimelineConnector class has a hardcoded 1-minute connection timeout, 
> which makes TestTimelineClient a long-running test (~15:30 min).
> Decreasing the timeout to 10 ms speeds up the test run (~56 sec).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-11687) Update CGroupsResourceCalculator to track usages using cgroupv2

2024-04-19 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik reassigned YARN-11687:
---

Assignee: Bence Kosztolnik

> Update CGroupsResourceCalculator to track usages using cgroupv2
> ---
>
> Key: YARN-11687
> URL: https://issues.apache.org/jira/browse/YARN-11687
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Bence Kosztolnik
>Priority: Major
>
> [CGroupsResourceCalculator|https://github.com/apache/hadoop/blob/f609460bda0c2bd87dd3580158e549e2f34f14d5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsResourceCalculator.java]
>  should also be updated to handle the cgroup v2 changes.
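
One possible direction, as a sketch only (not the actual patch): cgroup v2 
exposes a unified hierarchy, so the calculator would read e.g. memory.current 
and the usage_usec line of cpu.stat instead of the v1 memory.usage_in_bytes / 
cpuacct.usage files.

{code:java}
// Sketch only - not the actual CGroupsResourceCalculator change. Paths and
// parsing assume the unified cgroup v2 hierarchy mounted at /sys/fs/cgroup.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class CgroupV2UsageSketch {
  public static void main(String[] args) throws IOException {
    String relativeCgroup = args.length > 0 ? args[0] : "";
    Path cgroupDir = Paths.get("/sys/fs/cgroup").resolve(relativeCgroup);

    // cgroup v2: total memory usage in bytes is a single value in memory.current
    long memoryBytes = Long.parseLong(new String(
        Files.readAllBytes(cgroupDir.resolve("memory.current")),
        StandardCharsets.UTF_8).trim());

    // cgroup v2: cumulative CPU time is the usage_usec line of cpu.stat
    long cpuUsec = 0L;
    List<String> cpuStat = Files.readAllLines(cgroupDir.resolve("cpu.stat"));
    for (String line : cpuStat) {
      if (line.startsWith("usage_usec")) {
        cpuUsec = Long.parseLong(line.split("\\s+")[1]);
      }
    }

    System.out.println("memory.current = " + memoryBytes + " bytes");
    System.out.println("cpu usage      = " + cpuUsec + " usec");
  }
}
{code}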



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11656) RMStateStore event queue blocked

2024-03-13 Thread Bence Kosztolnik (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bence Kosztolnik updated YARN-11656:

Description: 
h2. Problem statement
 
I observed that a YARN cluster had both pending and available resources, yet 
cluster utilization hovered around ~50%. The cluster was loaded with 200 
parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 
20 map and 20 reduce containers, on a 50-node cluster where every node had 8 
cores and plenty of memory (CPU was the bottleneck).
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to 
persist a RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher where events can be persisted in parallel threads
- create metric data for the RMStateStore event queue so the problem can be 
easily identified if it occurs on a cluster


{panel:title=Issue visible on UI2}
 !issue.png|height=250!
{panel}

Another way to identify the issue is to check whether too much time is required 
to store info for an app after it reaches the NEW_SAVING state.

{panel:title=How the issue can look in the log}
 !log.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The Dispatcher creates a separate metric object called _Event metrics for 
"rm-state-store"_ where we can see 
- how many unhandled events are currently present in the event queue for the 
specific event type
- how many events were handled for the specific event type
- average execution time for the specific event

The dispatcher has the following configs (the {} placeholder is the dispatcher 
name, for example rm-state-store):

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|Number of threads that 
execute events in parallel|4|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue 
is logged with this frequency, if not zero|0|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal, 
the dispatcher waits this many seconds to process pending events before 
terminating them|60|
|yarn.dispatcher.multi-thread.{}.*metrics-enabled*|Whether the dispatcher 
publishes metrics data to the metrics system|false|
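
With *metrics-enabled* set to true, these counters can be read from the 
ResourceManager's JMX JSON servlet (/jmx with the qry filter). A minimal sketch; 
the host and port below are placeholders for the RM web address:

{code:java}
// Sketch: fetch the "Event metrics for rm-state-store" bean from the RM's
// JMX JSON servlet. The RM address is a placeholder passed as the first argument.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RmStateStoreMetricsFetch {
  public static void main(String[] args) throws Exception {
    String rm = args.length > 0 ? args[0] : "http://localhost:8088";
    String bean = "Hadoop:service=ResourceManager,name=Event metrics for rm-state-store";
    URL url = new URL(rm + "/jmx?qry=" + URLEncoder.encode(bean, "UTF-8"));

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // JSON, as shown in the example output below
      }
    }
  }
}
{code}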


{panel:title=Example output from RM JMX api}
{noformat}
...
{
  "name": "Hadoop:service=ResourceManager,name=Event metrics for 
rm-state-store",
  "modelerType": "Event metrics for rm-state-store",
  "tag.Context": "yarn",
  "tag.Hostname": CENSORED
  "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
  "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
  "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
  "RMStateStoreEventType#STORE_APP_Current": 124,
  "RMStateStoreEventType#STORE_APP_NumOps": 46,
  "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
  "RMStateStoreEventType#UPDATE_APP_Current": 31,
  "RMStateStoreEventType#UPDATE_APP_NumOps": 16,
  "RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.5,
  "RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
  "RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
  "RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.5,
  "RMStateStoreEventType#REMOVE_APP_Current": 12,
  "RMStateStoreEventType#REMOVE_APP_NumOps": 3,
  "RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
  "RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
  "RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
  "RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
  "RMStateStoreEventType#FENCED_Current": 0,
  "RMStateStoreEventType#FENCED_NumOps": 0,
  "RMStateStoreEventType#FENCED_AvgTime": 0.0,
  "RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
  "RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
  "RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
  "RMStateStoreEventType#REMOVE_MASTERKEY_Current": 0,
  "RMStateStoreEventType#REMOVE_MASTERKEY_NumOps": 0,
  "RMStateStoreEventType#REMOVE_MASTERKEY_AvgTime": 0.0,
  "RMStateStoreEventType#STORE_DELEGATION_TOKEN_Current": 0,
  "RMStateStoreEventType#STORE_DELEGATION_TOKEN_NumOps": 0,
  "RMStateStoreEventType#STORE_DELEGATION_TOKEN_AvgTime": 0.0,
  "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_Current": 0,
  "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_NumOps": 0,
  "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_AvgTime": 0.0,
  "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_Current": 0,
  "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_NumOps": 0,
  "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_AvgTime": 0.0,
  "RMStateStoreEventType#UPDATE_AMRM_TOKEN_Current": 0,