[jira] [Assigned] (YARN-11692) Support mixed cgroup v1/v2 controller structure

2024-05-06 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs reassigned YARN-11692:
--

Assignee: Peter Szucs

> Support mixed cgroup v1/v2 controller structure
> ---
>
> Key: YARN-11692
> URL: https://issues.apache.org/jira/browse/YARN-11692
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Peter Szucs
>Priority: Major
>
> The device controller changed heavily in cgroup v2. To keep supporting 
> FPGAs and GPUs in the short term, mixed structures, where some of the cgroup 
> controllers come from v1 while others come from v2, should be supported. More info: 
> https://dropbear.xyz/2023/05/23/devices-with-cgroup-v2/
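
As a hedged illustration of what such a mixed layout looks like on a node, a 
minimal sketch that classifies the mounted hierarchies by filesystem type 
(plain JDK code, not NodeManager code; YARN's actual detection may differ):
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CgroupMountScanner {
  public static void main(String[] args) throws IOException {
    // Each /proc/mounts line: device, mount point, fs type, options, ...
    for (String line : Files.readAllLines(Paths.get("/proc/mounts"))) {
      String[] f = line.split("\\s+");
      if ("cgroup".equals(f[2])) {
        // v1: one mount per controller; controllers are listed in the options
        System.out.println("v1 controller mount: " + f[1] + " (" + f[3] + ")");
      } else if ("cgroup2".equals(f[2])) {
        // v2: single unified hierarchy; enabled controllers are listed in
        // <mount point>/cgroup.controllers
        System.out.println("v2 unified mount: " + f[1]);
      }
    }
  }
}
{code}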



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11675) Update MemoryResourceHandler implementation for cgroup v2 support

2024-04-26 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs updated YARN-11675:
---
Description: 
cgroup v2 has some changes in various controllers (some changed their 
functionality, some were removed). This task is about updating 
MemoryResourceHandler's 
[implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsMemoryResourceHandlerImpl.java#L47-L46].
 Differences in the controls compared to cgroup v1:
h3. *Hard limit on memory*

The _memory.limit_in_bytes_ control is replaced with _memory.max_.
h3. *Soft limit on memory*

The _memory.soft_limit_in_bytes_ control is replaced with _memory.low_.

Detailed descriptions about the memory controls can be found in the official 
[cgroup v2 documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html].
h3. Swappiness

_memory.swappiness_ has been removed from the available cgroup v2 controls.

Quoting [redhat documentation|https://access.redhat.com/solutions/103833]:
{quote}Swappiness is a property for the Linux kernel that changes the balance 
between swapping out runtime memory, as opposed to dropping pages from the 
system page cache. Swappiness can be set to values between 0 and 100, 
inclusive. A low value means the kernel will try to avoid swapping as much as 
possible where a higher value instead will make the kernel aggressively try to 
use swap space.
{quote}
Referring to [this|https://github.com/opencontainers/runtime-spec/issues/1005] 
case study, we found that swappiness mostly didn't work as expected, since its 
effect largely depends on the I/O balance of the system, so it is no longer 
available in cgroup v2.
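
To make the renames above concrete, a minimal sketch of the v1-to-v2 mapping 
described in this task (a hypothetical helper for illustration, not the actual 
handler code):
{code:java}
public final class MemoryControlNames {
  public static String v2Equivalent(String v1Control) {
    switch (v1Control) {
      case "memory.limit_in_bytes":      return "memory.max"; // hard limit
      case "memory.soft_limit_in_bytes": return "memory.low"; // soft limit
      case "memory.swappiness":
        // Removed in cgroup v2; there is no per-cgroup replacement.
        throw new UnsupportedOperationException(
            "memory.swappiness has no cgroup v2 equivalent");
      default:
        throw new IllegalArgumentException("Unknown v1 control: " + v1Control);
    }
  }
}
{code}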

  was:cgroup v2 has some changes in various controllers (some changed their 
functionality, some were removed). This task is about checking if 
MemoryResourceHandler's 
[implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsMemoryResourceHandlerImpl.java#L47-L46]
 needs any updates.


> Update MemoryResourceHandler implementation for cgroup v2 support
> -
>
> Key: YARN-11675
> URL: https://issues.apache.org/jira/browse/YARN-11675
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
>
> cgroup v2 has some changes in various controllers (some changed their 
> functionality, some were removed). This task is about updating 
> MemoryResourceHandler's 
> [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsMemoryResourceHandlerImpl.java#L47-L46].
>  Differences in the controls compared to cgroup v1:
> h3. *Hard limit on memory*
> The _memory.limit_in_bytes_ control is replaced with _memory.max_.
> h3. *Soft limit on memory*
> The _memory.soft_limit_in_bytes_ control is replaced with _memory.low_.
> Detailed descriptions about the memory controls can be found in the official 
> [cgroup v2 documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html].
> h3. Swappiness
> _memory.swappiness_ has been removed from the available cgroup v2 controls.
> Quoting [redhat documentation|https://access.redhat.com/solutions/103833]:
> {quote}Swappiness is a property for the Linux kernel that changes the balance 
> between swapping out runtime memory, as opposed to dropping pages from the 
> system page cache. Swappiness can be set to values between 0 and 100, 
> inclusive. A low value means the kernel will try to avoid swapping as much as 
> possible where a higher value instead will make the kernel aggressively try 
> to use swap space.
> {quote}
> Referring to [this|https://github.com/opencontainers/runtime-spec/issues/1005] 
> case study, we found that swappiness mostly didn't work as expected, since its 
> effect largely depends on the I/O balance of the system, so it is no longer 
> available in cgroup v2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-11685) Create a config to enable/disable cgroup v2 functionality

2024-04-23 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs reassigned YARN-11685:
--

Assignee: Peter Szucs

> Create a config to enable/disable cgroup v2 functionality
> -
>
> Key: YARN-11685
> URL: https://issues.apache.org/jira/browse/YARN-11685
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Peter Szucs
>Priority: Major
>
> Various OSes mount cgroup v2 differently: some of them mount both the v1 
> and v2 structures, others mount a hybrid structure. To avoid initialization 
> issues, the cgroup v1/v2 functionality should be selected by a config property.
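
A minimal sketch of how such a switch could be read through Hadoop's 
Configuration API; the property name below is an assumption for illustration, 
not the final key chosen by this ticket:
{code:java}
import org.apache.hadoop.conf.Configuration;

public final class CgroupVersionConfig {
  // Assumed key name, for illustration only.
  public static final String CGROUP_V2_ENABLED =
      "yarn.nodemanager.linux-container-executor.cgroups.v2.enabled";

  public static boolean useCgroupV2(Configuration conf) {
    // Default to v1 so existing deployments keep their current behaviour.
    return conf.getBoolean(CGROUP_V2_ENABLED, false);
  }
}
{code}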



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-11675) Update MemoryResourceHandler implementation for cgroup v2 support

2024-04-16 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs reassigned YARN-11675:
--

Assignee: Peter Szucs

> Update MemoryResourceHandler implementation for cgroup v2 support
> -
>
> Key: YARN-11675
> URL: https://issues.apache.org/jira/browse/YARN-11675
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Peter Szucs
>Priority: Major
>
> cgroup v2 has some changes in various controllers (some changed their 
> functionality, some were removed). This task is about checking if 
> MemoryResourceHandler's 
> [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsMemoryResourceHandlerImpl.java#L47-L46]
>  needs any updates.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-5305) Yarn Application Log Aggregation fails because the NM cannot get the correct HDFS delegation token III

2024-03-12 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs reassigned YARN-5305:
-

Assignee: Peter Szucs

> Yarn Application Log Aggregation fails because the NM cannot get the correct 
> HDFS delegation token III
> --
>
> Key: YARN-5305
> URL: https://issues.apache.org/jira/browse/YARN-5305
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Xianyin Xin
>Assignee: Peter Szucs
>Priority: Major
>
> Different from YARN-5098 and YARN-5302, this problem happens when the AM submits 
> a startContainer request with a new HDFS token (say, tokenB) which is not 
> managed by YARN, so two tokens exist in the credentials of the user on the NM: 
> one is tokenB, the other is the one renewed on the RM (tokenA). If tokenB is 
> selected when connecting to HDFS and tokenB expires, an exception happens.
> Supplementary: this problem happens because the AM didn't use the service name 
> as the token alias in the credentials, so two tokens for the same service can 
> co-exist in one Credentials object. TokenSelector only selects the first 
> matching token; it doesn't care whether the token is valid or not.
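
A minimal sketch of the aliasing idea mentioned above, using Hadoop's 
Credentials API (illustration only, not the AM's actual code). Keying a token 
by its service name makes a newer token for the same service replace the older 
one instead of co-existing with it:
{code:java}
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public final class TokenAliasing {
  // With the service name as the alias, a second token for the same HDFS
  // service overwrites the first entry in the Credentials map, so the
  // TokenSelector can no longer pick up a stale duplicate.
  public static <T extends TokenIdentifier> void addByService(
      Credentials creds, Token<T> token) {
    Text serviceName = token.getService();
    creds.addToken(serviceName, token);
  }
}
{code}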



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11630) Passing admin Java options to container localizers

2023-12-12 Thread Peter Szucs (Jira)
Peter Szucs created YARN-11630:
--

 Summary: Passing admin Java options to container localizers
 Key: YARN-11630
 URL: https://issues.apache.org/jira/browse/YARN-11630
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Peter Szucs
Assignee: Peter Szucs


Currently we can specify Java options for container localizers in the 
_yarn.nodemanager.container-localizer.java.opts_ parameter.

The aim of this ticket is to create a parameter that we can use to pass admin 
options as well. It would work similarly to the admin Java options we can pass 
for MapReduce jobs: first we pass the admin options to the container 
executor, then the user-defined ones.
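
A minimal sketch of the intended ordering; the admin property key and the 
helper below are assumptions for illustration, not the final implementation:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class LocalizerJavaOpts {
  // Existing property from the description above.
  static final String USER_OPTS_KEY =
      "yarn.nodemanager.container-localizer.java.opts";
  // Assumed key name for the new admin-level property (illustration only).
  static final String ADMIN_OPTS_KEY =
      "yarn.nodemanager.admin-container-localizer.java.opts";

  // Admin options go first and user options second, so a user flag that
  // repeats an admin flag is the one the JVM ultimately honours, mirroring
  // the MapReduce admin-opts behaviour described above.
  static List<String> buildJvmArgs(String adminOpts, String userOpts) {
    List<String> args = new ArrayList<>();
    if (!adminOpts.trim().isEmpty()) {
      args.addAll(Arrays.asList(adminOpts.trim().split("\\s+")));
    }
    if (!userOpts.trim().isEmpty()) {
      args.addAll(Arrays.asList(userOpts.trim().split("\\s+")));
    }
    return args;
  }
}
{code}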



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11545) FS2CS does not convert ACLs when all users are allowed

2023-07-30 Thread Peter Szucs (Jira)
Peter Szucs created YARN-11545:
--

 Summary: FS2CS does not convert ACLs when all users are allowed
 Key: YARN-11545
 URL: https://issues.apache.org/jira/browse/YARN-11545
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Peter Szucs
Assignee: Peter Szucs


Currently we only convert ACLs if users or groups are set. This should be 
extended to check whether the "allAllowed" flag is set in the AccessControlList, 
so that * values are also preserved for the ACLs.
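
A minimal sketch of the intended check against Hadoop's AccessControlList 
(illustration only, not the actual FS2CS converter code):
{code:java}
import org.apache.hadoop.security.authorize.AccessControlList;

public final class AclConverter {
  // If everyone is allowed, preserve the wildcard instead of emitting an
  // empty user/group list; otherwise fall back to the "users groups" form.
  static String toSchedulerAclString(AccessControlList acl) {
    if (acl.isAllAllowed()) {
      return "*";
    }
    return String.join(",", acl.getUsers())
        + " " + String.join(",", acl.getGroups());
  }
}
{code}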



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11542) NegativeArraySizeException when running MR jobs with large data size

2023-07-27 Thread Peter Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17747879#comment-17747879
 ] 

Peter Szucs commented on YARN-11542:


Moving this to the MapReduce project.

> NegativeArraySizeException when running MR jobs with large data size
> 
>
> Key: YARN-11542
> URL: https://issues.apache.org/jira/browse/YARN-11542
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Peter Szucs
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
>
> We are using bit shifting to double the byte array in IFile's 
> [nextRawValue|https://github.infra.cloudera.com/CDH/hadoop/blob/bef14a39c7616e3b9f437a6fb24fc7a55a676b57/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/IFile.java#L437]
>  method to store the byte values in it. With large datasets it can easily 
> happen that we shift into the sign bit when calculating the size of the 
> array, which can make the array size negative, causing the 
> NegativeArraySizeException.
> It would be safer to expand the backing array by a 1.5x factor, with a check 
> not to exceed Integer's max value while doing so.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11542) NegativeArraySizeException when running MR jobs with large data size

2023-07-27 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs resolved YARN-11542.

Resolution: Abandoned

> NegativeArraySizeException when running MR jobs with large data size
> 
>
> Key: YARN-11542
> URL: https://issues.apache.org/jira/browse/YARN-11542
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Peter Szucs
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
>
> We are using bit shifting to double the byte array in IFile's 
> [nextRawValue|https://github.infra.cloudera.com/CDH/hadoop/blob/bef14a39c7616e3b9f437a6fb24fc7a55a676b57/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/IFile.java#L437]
>  method to store the byte values in it. With large datasets it can easily 
> happen that we shift into the sign bit when calculating the size of the 
> array, which can make the array size negative, causing the 
> NegativeArraySizeException.
> It would be safer to expand the backing array by a 1.5x factor, with a check 
> not to exceed Integer's max value while doing so.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11542) NegativeArraySizeException when running MR jobs with large data size

2023-07-27 Thread Peter Szucs (Jira)
Peter Szucs created YARN-11542:
--

 Summary: NegativeArraySizeException when running MR jobs with 
large data size
 Key: YARN-11542
 URL: https://issues.apache.org/jira/browse/YARN-11542
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Peter Szucs
Assignee: Peter Szucs


We are using bit shifting to double the byte array in IFile's 
[nextRawValue|https://github.infra.cloudera.com/CDH/hadoop/blob/bef14a39c7616e3b9f437a6fb24fc7a55a676b57/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/IFile.java#L437]
 method to store the byte values in it. With large datasets it can easily happen 
that we shift into the sign bit when calculating the size of the array, which 
can make the array size negative, causing the NegativeArraySizeException.

It would be safer to expand the backing array by a 1.5x factor, with a check 
not to exceed Integer's max value while doing so.
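
A minimal sketch of the proposed growth strategy (illustration only, not the 
actual IFile patch):
{code:java}
import java.util.Arrays;

public final class SafeGrow {
  // Largest array size that common JVMs can allocate reliably.
  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

  // Grow by ~1.5x instead of doubling with a left shift, and clamp the new
  // length so the int arithmetic can never wrap around to a negative value.
  static byte[] grow(byte[] buf, int minCapacity) {
    if (minCapacity <= buf.length) {
      return buf;
    }
    int newSize = buf.length + (buf.length >> 1); // 1.5x growth
    if (newSize < minCapacity) {
      newSize = minCapacity;
    }
    if (newSize < 0 || newSize > MAX_ARRAY_SIZE) {
      newSize = MAX_ARRAY_SIZE; // overflow guard
    }
    return Arrays.copyOf(buf, newSize);
  }
}
{code}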



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11534) Incorrect exception handling during container recovery

2023-07-21 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs updated YARN-11534:
---
Summary: Incorrect exception handling during container recovery  (was: 
Incorrect exception handling in RecoveredContainerLaunch)

> Incorrect exception handling during container recovery
> --
>
> Key: YARN-11534
> URL: https://issues.apache.org/jira/browse/YARN-11534
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Peter Szucs
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When NM is restarted during a container recovery, it can happen that it 
> interrupts the container reacquisition during the LinuxContainerExecutor's 
> signalContainer method. In this case we will get the following exception:
> {code:java}
> java.io.InterruptedIOException: java.lang.InterruptedException
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:1011)
>     at org.apache.hadoop.util.Shell.run(Shell.java:901)
>     at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:177)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.InterruptedException
>     at java.base/java.lang.Object.wait(Native Method)
>     at java.base/java.lang.Object.wait(Object.java:328)
>     at java.base/java.lang.ProcessImpl.waitFor(ProcessImpl.java:495)
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:1001)
>     ... 15 more{code}
> Later this InterruptedIOException gets caught and wrapped inside a 
> PrivilegedOperationException and a ContainerExecutionException. In 
> LinuxContainerExecutor's 
> [signalContainer|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java#L790]
>  method we catch this exception again and throw an IOException from it, 
> causing the following stack trace:
> {code:java}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
>  Signal container failed
>     at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:183)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291)
>     at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708)
>     at 
> 

[jira] [Updated] (YARN-11534) Incorrect exception handling in RecoveredContainerLaunch

2023-07-18 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs updated YARN-11534:
---
Description: 
When NM is restarted during a container recovery, it can happen that it 
interrupts the container reacquisition during the LinuxContainerExecutor's 
signalContainer method. In this case we will get the following exception:
{code:java}
java.io.InterruptedIOException: java.lang.InterruptedException
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:1011)
    at org.apache.hadoop.util.Shell.run(Shell.java:901)
    at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:177)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184)
    at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735)
    at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887)
    at 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291)
    at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.InterruptedException
    at java.base/java.lang.Object.wait(Native Method)
    at java.base/java.lang.Object.wait(Object.java:328)
    at java.base/java.lang.ProcessImpl.waitFor(ProcessImpl.java:495)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:1001)
    ... 15 more{code}
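
For comparison, a minimal sketch of handling that would preserve the interrupt 
status instead of burying it in a generic IOException (a hypothetical helper, 
not the actual LinuxContainerExecutor code):
{code:java}
import java.io.IOException;
import java.io.InterruptedIOException;

public final class InterruptAware {
  // Walk the cause chain: if the failure started as an interruption, restore
  // the thread's interrupt flag and rethrow it as such, rather than wrapping
  // it in a generic IOException.
  static void rethrowPreservingInterrupt(Exception e) throws IOException {
    for (Throwable t = e; t != null; t = t.getCause()) {
      if (t instanceof InterruptedIOException || t instanceof InterruptedException) {
        Thread.currentThread().interrupt();
        throw new InterruptedIOException("Interrupted while signalling container");
      }
    }
    throw new IOException("Signal container failed", e);
  }
}
{code}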
Later this InterruptedIOException gets caught and wrapped inside a 
PrivilegedOperationException and a ContainerExecutionException. In 
LinuxContainerExecutor's 
[signalContainer|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java#L790]
 method we catch this exception again and throw an IOException from it, 
causing the following stack trace:
{code:java}
org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
 Signal container failed
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:183)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184)
    at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735)
    at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887)
    at 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291)
    at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
2023-06-20 18:24:31,777 ERROR 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch:
 Unable to recover container container_e03_1687266197584_0033_01_01
java.io.IOException: Problem signalling 

[jira] [Created] (YARN-11534) Incorrect exception handling in RecoveredContainerLaunch

2023-07-18 Thread Peter Szucs (Jira)
Peter Szucs created YARN-11534:
--

 Summary: Incorrect exception handling in RecoveredContainerLaunch
 Key: YARN-11534
 URL: https://issues.apache.org/jira/browse/YARN-11534
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Peter Szucs
Assignee: Peter Szucs






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11372) Migrate legacy AQC to flexible AQC

2023-03-09 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs updated YARN-11372:
---
Description: Currently the codebase of Legacy AQC (the 
ManagedParentQueue/AutoCreatedLeafQueue classes) lives next to the basic queue 
classes that are used by the flexible AQC. The scope of this task is to 
eliminate the former while migrating the functionality of legacy AQC.  (was: 
Currently the codebase of Legacy AQC (the ManagedParentQueue/ManagedLeafQueue 
classes) lives next to the basic queue classes that are used by the flexible AQC. 
The scope of this task is to eliminate the former while migrating the 
functionality of legacy AQC.)

> Migrate legacy AQC to flexible AQC
> --
>
> Key: YARN-11372
> URL: https://issues.apache.org/jira/browse/YARN-11372
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Peter Szucs
>Priority: Major
>
> Currently the codebase of Legacy AQC (the 
> ManagedParentQueue/AutoCreatedLeafQueue classes) lives next to the basic queue 
> classes that are used by the flexible AQC. The scope of this task is to 
> eliminate the former while migrating the functionality of legacy AQC.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10921) AbstractCSQueue: Node Labels logic is scattered and iteration logic is repeated all over the place

2023-01-04 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs reassigned YARN-10921:
--

Assignee: Peter Szucs

> AbstractCSQueue: Node Labels logic is scattered and iteration logic is 
> repeated all over the place
> --
>
> Key: YARN-10921
> URL: https://issues.apache.org/jira/browse/YARN-10921
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> TODO items:
> - Check original Node labels epic / jiras?
> - Think about ways to improve repetitive iteration on configuredNodeLabels
> - Search for: "String label" in code
> Code blocks to handle Node labels:
> - AbstractCSQueue#setupQueueConfigs
> - AbstractCSQueue#getQueueConfigurations
> - AbstractCSQueue#accessibleToPartition
> - AbstractCSQueue#getNodeLabelsForQueue
> - AbstractCSQueue#updateAbsoluteCapacities
> - AbstractCSQueue#updateConfigurableResourceRequirement
> - CSQueueUtils#loadCapacitiesByLabelsFromConf
> - AutoCreatedLeafQueue



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10926) Test validation after YARN-10504 and YARN-10506: Check if modified test expectations are correct or not

2023-01-03 Thread Peter Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648132#comment-17648132
 ] 

Peter Szucs edited comment on YARN-10926 at 1/3/23 4:17 PM:


I checked the mentioned jiras and the tests in the related code changes and 
didn't see any invalid assertions in them.

For YARN-10504 there was a TODO to fix 
TestAbsoluteResourceWithAutoQueue#testAutoCreateLeafQueueCreation 
[here|https://github.com/apache/hadoop/commit/b0eec0909772cf92427957670da5630b1dd11da0#diff-1ed6d328b2546b3599468f169e823d2d411a3bdb85ea7871a8533cd205e2d311], 
and it has been fixed in a later PR: 
[https://github.com/apache/hadoop/pull/3868]

As I saw in the last comments of YARN-10504, test issues remained in 
TestAbsoluteResourceConfiguration.testSimpleMinMaxResourceConfigurartionPerQueue, 
but they were also fixed in a follow-up commit 
[here|https://github.com/apache/hadoop/commit/4f008153ef5fca9e1f71ebc7069c502e803ab1e8].


was (Author: JIRAUSER297340):
I checked the mentioned jiras and the tests in the related code changes and 
didn't see any invalid assertions in them.

For YARN-10504 there were a TODO to fix 
TestAbsoluteResourceWithAutoQueue#testAutoCreateLeafQueueCreation 
[here|[https://github.com/apache/hadoop/commit/b0eec0909772cf92427957670da5630b1dd11da0#diff-1ed6d328b2546b3599468f169e823d2d411a3bdb85ea7871a8533cd205e2d311],]
 and it has been fixed in a later PR: 
[https://github.com/apache/hadoop/pull/3868]

As I saw in the last comments of 
[YARN-10504|https://issues.apache.org/jira/browse/YARN-10504] test issues 
remained in 
TestAbsoluteResourceConfiguration.testSimpleMinMaxResourceConfigurartionPerQueue
 but it was also fixed in a follow-up commit 
[here|https://github.com/apache/hadoop/commit/4f008153ef5fca9e1f71ebc7069c502e803ab1e8
 ]

> Test validation after YARN-10504 and YARN-10506: Check if modified test 
> expectations are correct or not
> ---
>
> Key: YARN-10926
> URL: https://issues.apache.org/jira/browse/YARN-10926
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> YARN-10504 and YARN-10506 modified some test expectations.
> The task is to verify if those expectations are correct.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10905) Investigate if AbstractCSQueue#configuredNodeLabels vs. QueueCapacities#getExistingNodeLabels holds the same data

2023-01-03 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs resolved YARN-10905.

Resolution: Won't Fix

> Investigate if AbstractCSQueue#configuredNodeLabels vs. 
> QueueCapacities#getExistingNodeLabels holds the same data
> -
>
> Key: YARN-10905
> URL: https://issues.apache.org/jira/browse/YARN-10905
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> The task is to investigate whether the field 
> AbstractCSQueue#configuredNodeLabels holds the same data as 
> QueueCapacities#getExistingNodeLabels.
> Obviously, we don't want double-entry bookkeeping, so if the data is the 
> same, we can remove one or the other.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10926) Test validation after YARN-10504 and YARN-10506: Check if modified test expectations are correct or not

2023-01-03 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs resolved YARN-10926.

Resolution: Won't Fix

> Test validation after YARN-10504 and YARN-10506: Check if modified test 
> expectations are correct or not
> ---
>
> Key: YARN-10926
> URL: https://issues.apache.org/jira/browse/YARN-10926
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> YARN-10504 and YARN-10506 modified some test expectations.
> The task is to verify if those expectations are correct.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-11041) Replace all occurrences of queuePath with the new QueuePath class - followup

2023-01-03 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs reopened YARN-11041:


> Replace all occurrences of queuePath with the new QueuePath class - followup
> ---
>
> Key: YARN-11041
> URL: https://issues.apache.org/jira/browse/YARN-11041
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Tibor Kovács
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The QueuePath class was introduced in YARN-10897; however, it has so far 
> been adopted only in code changes made after that JIRA. We need to adopt it 
> retrospectively.
>  
> A lot of changes were introduced via ticket YARN-10982. The replacement 
> should be continued by addressing the following review comments:
>  
> [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937]
> I think this could be also refactored in a follow-up jira so the string magic 
> could probably be replaced with some more elegant solution. Though, I think 
> this would be too much in this patch, hence I do suggest the follow-up jira.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096]
> [~bteke] [~gandras] Thoughts?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750]
> +1, even the QueuePath object could have some kind of support for this.|
> |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244]
> Agreed, let's handle it in a followup!|
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717]
> There are many string operations in this class:
> E.g. * getQueuePrefix that works with the full queue path
>  * getNodeLabelPrefix that also works with the full queue path|
> I suggest to create a static class, called "QueuePrefixes" or something like 
> that and add some static methods there to convert the QueuePath object to 
> those various queue prefix strings that are ultimately keys in the 
> Configuration object.
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119]
> This seems hacky, just based on the constructor parameter names of QueuePath: 
> parent, leaf.
> The AQC Template prefix is not the leaf, obviously.
> Could we somehow circumvent this?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207]
> Maybe a factory method could be created, which returns a new QueuePath with 
> the parent set as the original queuePath. I.e 
> rootQueuePath.createChild(String childName) -> this could return a new 
> QueuePath object with root.childName path, and rootQueuePath as parent.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033]
> Looking at this getQueues method, I realized almost all the callers are using 
> some kind of string magic that should be addressed with this patch.
> For example, take a look at: 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue
> I think getQueues should also receive the QueuePath object instead of 
> Strings.|
>  
> 
>  
> [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b]
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967]
> Nit: Gets the queue path object.
> The object of the queue suggests a CSQueue object.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765922133]
> Will fix the nit upon commit if I'm fine with the whole patch. Thanks for 
> noticing.|
>  
>  
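For reference, a minimal sketch of the createChild factory method proposed in the review comments above (the real QueuePath class may differ in fields and constructors; this only illustrates the idea):

{code:java}
// Illustrative sketch only; not the actual Hadoop QueuePath implementation.
public final class QueuePath {
  private final String parent; // null for the root queue
  private final String leaf;

  public QueuePath(String parent, String leaf) {
    this.parent = parent;
    this.leaf = leaf;
  }

  public String getFullPath() {
    return parent == null ? leaf : parent + "." + leaf;
  }

  // The proposed factory method: returns a child path whose parent is this
  // path, replacing string concatenation ("string magic") at call sites.
  public QueuePath createChild(String childName) {
    return new QueuePath(getFullPath(), childName);
  }
}
{code}

Usage: new QueuePath(null, "root").createChild("a") yields the path "root.a" with "root" as its parent.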



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] (YARN-11041) Replace all occurrences of queuePath with the new QueuePath class - followup

2023-01-03 Thread Peter Szucs (Jira)


[ https://issues.apache.org/jira/browse/YARN-11041 ]


Peter Szucs deleted comment on YARN-11041:


was (Author: JIRAUSER297340):
The attached pull request has been merged to trunk; I think only the 
administration is left here, so I'll close this ticket.

> Replace all occurrences of queuePath with the new QueuePath class - followup
> ---
>
> Key: YARN-11041
> URL: https://issues.apache.org/jira/browse/YARN-11041
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Tibor Kovács
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The QueuePath class was introduced in YARN-10897; however, it has so far 
> been adopted only in code changes made after that JIRA. We need to adopt it 
> retrospectively.
>  
> A lot of changes were introduced via ticket YARN-10982. The replacement 
> should be continued by addressing the following review comments:
>  
> [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937]
> I think this could be also refactored in a follow-up jira so the string magic 
> could probably be replaced with some more elegant solution. Though, I think 
> this would be too much in this patch, hence I do suggest the follow-up jira.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096]
> [~bteke] [~gandras] Thoughts?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750]
> +1, even the QueuePath object could have some kind of support for this.|
> |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244]
> Agreed, let's handle it in a followup!|
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717]
> There are many string operations in this class:
> E.g. * getQueuePrefix that works with the full queue path
>  * getNodeLabelPrefix that also works with the full queue path|
> I suggest to create a static class, called "QueuePrefixes" or something like 
> that and add some static methods there to convert the QueuePath object to 
> those various queue prefix strings that are ultimately keys in the 
> Configuration object.
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119]
> This seems hacky, just based on the constructor parameter names of QueuePath: 
> parent, leaf.
> The AQC Template prefix is not the leaf, obviously.
> Could we somehow circumvent this?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207]
> Maybe a factory method could be created, which returns a new QueuePath with 
> the parent set as the original queuePath. I.e 
> rootQueuePath.createChild(String childName) -> this could return a new 
> QueuePath object with root.childName path, and rootQueuePath as parent.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033]
> Looking at this getQueues method, I realized almost all the callers are using 
> some kind of string magic that should be addressed with this patch.
> For example, take a look at: 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue
> I think getQueues should also receive the QueuePath object instead of 
> Strings.|
>  
> 
>  
> [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b]
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967]
> Nit: Gets the queue path object.
> The object of the queue suggests a CSQueue object.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765922133]
> Will fix the nit upon commit if I'm fine with the whole patch. Thanks for 
> noticing.|
>  
>  

[jira] [Resolved] (YARN-11041) Replace all occurrences of queuePath with the new QueuePath class - followup

2023-01-03 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs resolved YARN-11041.

Resolution: Fixed

> Replace all occurrences of queuePath with the new QueuePath class - followup
> ---
>
> Key: YARN-11041
> URL: https://issues.apache.org/jira/browse/YARN-11041
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Tibor Kovács
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The QueuePath class was introduced in YARN-10897; however, it has so far 
> been adopted only in code changes made after that JIRA. We need to adopt it 
> retrospectively.
>  
> A lot of changes were introduced via ticket YARN-10982. The replacement 
> should be continued by addressing the following review comments:
>  
> [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937]
> I think this could be also refactored in a follow-up jira so the string magic 
> could probably be replaced with some more elegant solution. Though, I think 
> this would be too much in this patch, hence I do suggest the follow-up jira.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096]
> [~bteke] [~gandras] Thoughts?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750]
> +1, even the QueuePath object could have some kind of support for this.|
> |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244]
> Agreed, let's handle it in a followup!|
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717]
> There are many string operations in this class:
> E.g. * getQueuePrefix that works with the full queue path
>  * getNodeLabelPrefix that also works with the full queue path|
> I suggest to create a static class, called "QueuePrefixes" or something like 
> that and add some static methods there to convert the QueuePath object to 
> those various queue prefix strings that are ultimately keys in the 
> Configuration object.
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119]
> This seems hacky, just based on the constructor parameter names of QueuePath: 
> parent, leaf.
> The AQC Template prefix is not the leaf, obviously.
> Could we somehow circumvent this?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207]
> Maybe a factory method could be created, which returns a new QueuePath with 
> the parent set as the original queuePath. I.e 
> rootQueuePath.createChild(String childName) -> this could return a new 
> QueuePath object with root.childName path, and rootQueuePath as parent.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033]
> Looking at this getQueues method, I realized almost all the callers are using 
> some kind of string magic that should be addressed with this patch.
> For example, take a look at: 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue
> I think getQueues should also receive the QueuePath object instead of 
> Strings.|
>  
> 
>  
> [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b]
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967]
> Nit: Gets the queue path object.
> The object of the queue suggests a CSQueue object.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765922133]
> Will fix the nit upon commit if I'm fine with the whole patch. Thanks for 
> noticing.|
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (YARN-11041) Replace all occurrences of queuePath with the new QueuePath class - followup

2023-01-03 Thread Peter Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653938#comment-17653938
 ] 

Peter Szucs commented on YARN-11041:


The attached pull request has been merged to trunk; I think only the 
administration is left here, so I'll close this ticket.

> Replace all occurrences of queuePath with the new QueuePath class - followup
> ---
>
> Key: YARN-11041
> URL: https://issues.apache.org/jira/browse/YARN-11041
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Tibor Kovács
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
>
> The QueuePath class was introduced in YARN-10897; however, it has so far 
> been adopted only in code changes made after that JIRA. We need to adopt it 
> retrospectively.
>  
> A lot of changes were introduced via ticket YARN-10982. The replacement 
> should be continued by addressing the following review comments:
>  
> [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937]
> I think this could be also refactored in a follow-up jira so the string magic 
> could probably be replaced with some more elegant solution. Though, I think 
> this would be too much in this patch, hence I do suggest the follow-up jira.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096]
> [~bteke] [~gandras] Thoughts?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750]
> +1, even the QueuePath object could have some kind of support for this.|
> |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244]
> Agreed, let's handle it in a followup!|
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717]
> There are many string operations in this class:
> E.g. * getQueuePrefix that works with the full queue path
>  * getNodeLabelPrefix that also works with the full queue path|
> I suggest to create a static class, called "QueuePrefixes" or something like 
> that and add some static methods there to convert the QueuePath object to 
> those various queue prefix strings that are ultimately keys in the 
> Configuration object.
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119]
> This seems hacky, just based on the constructor parameter names of QueuePath: 
> parent, leaf.
> The AQC Template prefix is not the leaf, obviously.
> Could we somehow circumvent this?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207]
> Maybe a factory method could be created, which returns a new QueuePath with 
> the parent set as the original queuePath. I.e 
> rootQueuePath.createChild(String childName) -> this could return a new 
> QueuePath object with root.childName path, and rootQueuePath as parent.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033]
> Looking at this getQueues method, I realized almost all the callers are using 
> some kind of string magic that should be addressed with this patch.
> For example, take a look at: 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue
> I think getQueues should also receive the QueuePath object instead of 
> Strings.|
>  
> 
>  
> [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b]
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967]
> Nit: Gets the queue path object.
> The object of the queue suggests a CSQueue object.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765922133]
> Will fix the nit upon commit if I'm fine with the whole patch. Thanks for 
> noticing.|

[jira] [Updated] (YARN-11041) Replace all occurrences of queuePath with the new QueuePath class - followup

2023-01-03 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs updated YARN-11041:
---
Fix Version/s: 3.4.0

> Replace all occurrences of queuePath with the new QueuePath class - followup
> ---
>
> Key: YARN-11041
> URL: https://issues.apache.org/jira/browse/YARN-11041
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Tibor Kovács
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The QueuePath class was introduced in YARN-10897; however, it has so far 
> been adopted only in code changes made after that JIRA. We need to adopt it 
> retrospectively.
>  
> A lot of changes were introduced via ticket YARN-10982. The replacement 
> should be continued by addressing the following review comments:
>  
> [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937]
> I think this could be also refactored in a follow-up jira so the string magic 
> could probably be replaced with some more elegant solution. Though, I think 
> this would be too much in this patch, hence I do suggest the follow-up jira.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096]
> [~bteke] [~gandras] Thoughts?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750]
> +1, even the QueuePath object could have some kind of support for this.|
> |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244]
> Agreed, let's handle it in a followup!|
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717]
> There are many string operations in this class:
> E.g. * getQueuePrefix that works with the full queue path
>  * getNodeLabelPrefix that also works with the full queue path|
> I suggest to create a static class, called "QueuePrefixes" or something like 
> that and add some static methods there to convert the QueuePath object to 
> those various queue prefix strings that are ultimately keys in the 
> Configuration object.
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119]
> This seems hacky, just based on the constructor parameter names of QueuePath: 
> parent, leaf.
> The AQC Template prefix is not the leaf, obviously.
> Could we somehow circumvent this?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207]
> Maybe a factory method could be created, which returns a new QueuePath with 
> the parent set as the original queuePath. I.e 
> rootQueuePath.createChild(String childName) -> this could return a new 
> QueuePath object with root.childName path, and rootQueuePath as parent.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033]
> Looking at this getQueues method, I realized almost all the callers are using 
> some kind of string magic that should be addressed with this patch.
> For example, take a look at: 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue
> I think getQueues should also receive the QueuePath object instead of 
> Strings.|
>  
> 
>  
> [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b]
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967]
> Nit: Gets the queue path object.
> The object of the queue suggests a CSQueue object.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765922133]
> Will fix the nit upon commit if I'm fine with the whole patch. Thanks for 
> noticing.|
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (YARN-11041) Replace all occurrences of queuePath with the new QueuePath class - followup

2023-01-03 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs reassigned YARN-11041:
--

Assignee: Peter Szucs

> Replace all occurrences of queuePath with the new QueuePath class - followup
> ---
>
> Key: YARN-11041
> URL: https://issues.apache.org/jira/browse/YARN-11041
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Tibor Kovács
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
>
> The QueuePath class was introduced in YARN-10897; however, it has so far 
> been adopted only in code changes made after that JIRA. We need to adopt it 
> retrospectively.
>  
> A lot of changes were introduced via ticket YARN-10982. The replacement 
> should be continued by addressing the following review comments:
>  
> [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937]
> I think this could be also refactored in a follow-up jira so the string magic 
> could probably be replaced with some more elegant solution. Though, I think 
> this would be too much in this patch, hence I do suggest the follow-up jira.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096]
> [~bteke] [~gandras] Thoughts?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750]
> +1, even the QueuePath object could have some kind of support for this.|
> |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244]
> Agreed, let's handle it in a followup!|
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717]
> There are many string operations in this class:
> E.g. * getQueuePrefix that works with the full queue path
>  * getNodeLabelPrefix that also works with the full queue path|
> I suggest to create a static class, called "QueuePrefixes" or something like 
> that and add some static methods there to convert the QueuePath object to 
> those various queue prefix strings that are ultimately keys in the 
> Configuration object.
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119]
> This seems hacky, just based on the constructor parameter names of QueuePath: 
> parent, leaf.
> The AQC Template prefix is not the leaf, obviously.
> Could we somehow circumvent this?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207]
> Maybe a factory method could be created, which returns a new QueuePath with 
> the parent set as the original queuePath. I.e 
> rootQueuePath.createChild(String childName) -> this could return a new 
> QueuePath object with root.childName path, and rootQueuePath as parent.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033]
> Looking at this getQueues method, I realized almost all the callers are using 
> some kind of string magic that should be addressed with this patch.
> For example, take a look at: 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue
> I think getQueues should also receive the QueuePath object instead of 
> Strings.|
>  
> 
>  
> [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b]
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967]
> Nit: Gets the queue path object.
> The object of the queue suggests a CSQueue object.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765922133]
> Will fix the nit upon commit if I'm fine with the whole patch. Thanks for 
> noticing.|
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (YARN-10926) Test validation after YARN-10504 and YARN-10506: Check if modified test expectations are correct or not

2022-12-15 Thread Peter Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648132#comment-17648132
 ] 

Peter Szucs commented on YARN-10926:


I checked the mentioned JIRAs and the tests in the related code changes and 
didn't see any invalid assertions in them.

For YARN-10504 there was a TODO to fix 
TestAbsoluteResourceWithAutoQueue#testAutoCreateLeafQueueCreation 
[here|https://github.com/apache/hadoop/commit/b0eec0909772cf92427957670da5630b1dd11da0#diff-1ed6d328b2546b3599468f169e823d2d411a3bdb85ea7871a8533cd205e2d311], 
and it has been fixed in a later PR: 
[https://github.com/apache/hadoop/pull/3868]

As I saw in the last comments of 
[YARN-10504|https://issues.apache.org/jira/browse/YARN-10504], test issues 
remained in 
TestAbsoluteResourceConfiguration.testSimpleMinMaxResourceConfigurartionPerQueue, 
but these were also fixed in a follow-up commit 
[here|https://github.com/apache/hadoop/commit/4f008153ef5fca9e1f71ebc7069c502e803ab1e8].

> Test validation after YARN-10504 and YARN-10506: Check if modified test 
> expectations are correct or not
> ---
>
> Key: YARN-10926
> URL: https://issues.apache.org/jira/browse/YARN-10926
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> YARN-10504 and YARN-10506 modified some test expectations.
> The task is to verify if those expectations are correct.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10905) Investigate if AbstractCSQueue#configuredNodeLabels vs. QueueCapacities#getExistingNodeLabels holds the same data

2022-12-09 Thread Peter Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645316#comment-17645316
 ] 

Peter Szucs commented on YARN-10905:


As I saw in my investigation, the _configuredNodeLabels_ field has been 
extracted to the _NodeLabelsSettings_ class since the ticket was filed. In 
_AbstractCSQueue_'s _setupQueueConfigs_ method we load the 
_queueNodeLabelsSettings_ and the _queueCapacities_ every time we refresh a 
queue. The process is the following:
 * we initialize the _queueNodeLabelsSettings_ and read all the node label 
information (accessible/configured node labels, defaultLabelExpression) from 
the config. We store _configuredNodeLabels_ as a set of strings here.
 * after this we initialize the _queueCapacities_ map by iterating through 
the _configuredNodeLabels_ and reading the capacity properties of each label 
for the given queue from the config.
 * _QueueCapacities#getExistingNodeLabels_ returns the keyset of this map.

Since we iterate through 
_queueNodeLabelsSettings#configuredNodeLabels_ and build the detailed 
capacities map from it, _configuredNodeLabels_ and 
_QueueCapacities#getExistingNodeLabels_ should return the same set of labels.

*Conclusion:*

_QueueCapacities_ needs the _configuredNodeLabels_ for its initialization, so 
the only thing that could be removed is the 
_QueueCapacities#getExistingNodeLabels_ method. However, it is reasonable to 
have a method in _QueueCapacities_ that retrieves the keyset of the 
capacities map for code parts that deal only with {_}QueueCapacities{_} (for 
example {_}QueueCapacitiesInfo{_}, or _mergeCapacities_ in 
{_}AutoCreatedLeafQueue{_}, where one capacity map is created from another), 
so I haven't found a nice way to clean this up yet.
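To make the conclusion above concrete, a toy model of the described flow (a minimal sketch; the class below is a stand-in, not the real AbstractCSQueue or QueueCapacities code):

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Models how QueueCapacities' label keyset is derived from
// configuredNodeLabels during a queue refresh; names are simplified.
public class NodeLabelBookkeepingDemo {
  public static void main(String[] args) {
    Set<String> configuredNodeLabels = Set.of("", "gpu", "ssd");

    // "QueueCapacities" modeled as a map keyed by node label.
    Map<String, Float> queueCapacities = new HashMap<>();
    for (String label : configuredNodeLabels) {
      // In reality the capacity is read from the configuration per label.
      queueCapacities.put(label, 0.5f);
    }

    // getExistingNodeLabels() is the keyset of this map, so it necessarily
    // mirrors configuredNodeLabels.
    System.out.println(
        queueCapacities.keySet().equals(configuredNodeLabels)); // true
  }
}
{code}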

> Investigate if AbstractCSQueue#configuredNodeLabels vs. 
> QueueCapacities#getExistingNodeLabels holds the same data
> -
>
> Key: YARN-10905
> URL: https://issues.apache.org/jira/browse/YARN-10905
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> The task is to investigate whether the field 
> AbstractCSQueue#configuredNodeLabels holds the same data as 
> QueueCapacities#getExistingNodeLabels.
> Obviously, we don't want double-entry bookkeeping, so if the data is the 
> same, we can remove one or the other.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10926) Test validation after YARN-10504 and YARN-10506: Check if modified test expectations are correct or not

2022-12-05 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs reassigned YARN-10926:
--

Assignee: Peter Szucs

> Test validation after YARN-10504 and YARN-10506: Check if modified test 
> expectations are correct or not
> ---
>
> Key: YARN-10926
> URL: https://issues.apache.org/jira/browse/YARN-10926
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> YARN-10504 and YARN-10506 modified some test expectations.
> The task is to verify if those expectations are correct.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10905) Investigate if AbstractCSQueue#configuredNodeLabels vs. QueueCapacities#getExistingNodeLabels holds the same data

2022-12-01 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs reassigned YARN-10905:
--

Assignee: Peter Szucs

> Investigate if AbstractCSQueue#configuredNodeLabels vs. 
> QueueCapacities#getExistingNodeLabels holds the same data
> -
>
> Key: YARN-10905
> URL: https://issues.apache.org/jira/browse/YARN-10905
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> The task is to investigate whether the field 
> AbstractCSQueue#configuredNodeLabels holds the same data as 
> QueueCapacities#getExistingNodeLabels.
> Obviously, we don't want double-entry bookkeeping, so if the data is the 
> same, we can remove one or the other.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10946) AbstractCSQueue: Create separate class for constructing Queue API objects

2022-11-22 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs reassigned YARN-10946:
--

Assignee: Peter Szucs

> AbstractCSQueue: Create separate class for constructing Queue API objects
> -
>
> Key: YARN-10946
> URL: https://issues.apache.org/jira/browse/YARN-10946
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> Relevant methods are: 
> - 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue#getQueueConfigurations
> - 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue#getQueueInfo
> - 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue#getQueueStatistics
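As a rough illustration of the intended extraction, a stateless provider could own the construction of these API objects. A minimal sketch follows; all names below are assumptions, and QueueInfoView merely stands in for the real org.apache.hadoop.yarn.api.records.QueueInfo:

{code:java}
// Illustrative sketch; not the actual refactoring.
final class CSQueueInfoProvider {
  private CSQueueInfoProvider() {
  }

  // Formerly AbstractCSQueue#getQueueInfo: now a pure function of queue
  // state, which keeps AbstractCSQueue focused on scheduling concerns.
  static QueueInfoView buildQueueInfo(String queuePath, float capacity,
      float maximumCapacity) {
    return new QueueInfoView(queuePath, capacity, maximumCapacity);
  }
}

// Minimal immutable stand-in for the Queue API record type.
final class QueueInfoView {
  final String queuePath;
  final float capacity;
  final float maximumCapacity;

  QueueInfoView(String queuePath, float capacity, float maximumCapacity) {
    this.queuePath = queuePath;
    this.capacity = capacity;
    this.maximumCapacity = maximumCapacity;
  }
}
{code}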



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10959) Extract common method of two that check if preemption disabled in CSQueuePreemption

2022-11-22 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs resolved YARN-10959.

Resolution: Resolved

> Extract common method of two that check if preemption disabled in 
> CSQueuePreemption
> ---
>
> Key: YARN-10959
> URL: https://issues.apache.org/jira/browse/YARN-10959
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> This is a follow-up of YARN-10913. 
> After YARN-10913, we have a class called CSQueuePreemption that has 2 methods 
> that are very similar to each other: 
> - isQueueHierarchyPreemptionDisabled
> - isIntraQueueHierarchyPreemptionDisabled
> The goal is to create one method and use it from those 2, merging the common 
> logic as much as we can.
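For context, the kind of merge the ticket envisioned, as a minimal sketch (the queue type and preemption flags are simplified stand-ins, not the real CSQueuePreemption code; the ticket was ultimately closed without this extraction, as the related comments explain):

{code:java}
import java.util.function.Predicate;

// Sketch of merging the two similar checks: both walk up the queue
// hierarchy and differ mainly in which preemption flag they consult.
final class PreemptionCheckSketch {
  private PreemptionCheckSketch() {
  }

  static boolean isPreemptionDisabledInHierarchy(Queue queue,
      Predicate<Queue> disabledAt) {
    for (Queue q = queue; q != null; q = q.parent) {
      if (disabledAt.test(q)) {
        return true;
      }
    }
    return false;
  }

  static final class Queue {
    final Queue parent;
    final boolean preemptionDisabled;
    final boolean intraQueuePreemptionDisabled;

    Queue(Queue parent, boolean preemptionDisabled,
        boolean intraQueuePreemptionDisabled) {
      this.parent = parent;
      this.preemptionDisabled = preemptionDisabled;
      this.intraQueuePreemptionDisabled = intraQueuePreemptionDisabled;
    }
  }
}

// The two original methods would become:
//   isPreemptionDisabledInHierarchy(q, x -> x.preemptionDisabled)
//   isPreemptionDisabledInHierarchy(q, x -> x.intraQueuePreemptionDisabled)
{code}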



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10959) Extract common method of two that check if preemption disabled in CSQueuePreemption

2022-11-22 Thread Peter Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637151#comment-17637151
 ] 

Peter Szucs edited comment on YARN-10959 at 11/22/22 10:32 AM:
---

We saw that the extraction wouldn't provide a real benefit here because of 
the size of the duplication and the differences in the logic, so as per my 
discussion with [~snemeth] I am closing this ticket.


was (Author: JIRAUSER297340):
We saw that the extraction wouldn't provide a real benefit here because of 
the size of the duplication and the differences in the logic, so as per my 
discussion with [~snemeth] I am closing this ticket.

> Extract common method of two that check if preemption disabled in 
> CSQueuePreemption
> ---
>
> Key: YARN-10959
> URL: https://issues.apache.org/jira/browse/YARN-10959
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> This is a follow-up of YARN-10913. 
> After YARN-10913, we have a class called CSQueuePreemption that has 2 methods 
> that are very similar to each other: 
> - isQueueHierarchyPreemptionDisabled
> - isIntraQueueHierarchyPreemptionDisabled
> The goal is to create one method and use it from those 2, merging the common 
> logic as much as we can.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10959) Extract common method of two that check if preemption disabled in CSQueuePreemption

2022-11-22 Thread Peter Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637151#comment-17637151
 ] 

Peter Szucs commented on YARN-10959:


We saw that the extraction wouldn't provide a real benefit here because of 
the size of the duplication and the differences in the logic, so as per my 
discussion with [~snemeth] I am closing this ticket.

> Extract common method of two that check if preemption disabled in 
> CSQueuePreemption
> ---
>
> Key: YARN-10959
> URL: https://issues.apache.org/jira/browse/YARN-10959
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> This is a follow-up of YARN-10913. 
> After YARN-10913, we have a class called CSQueuePreemption that has 2 methods 
> that are very similar to each other: 
> - isQueueHierarchyPreemptionDisabled
> - isIntraQueueHierarchyPreemptionDisabled
> The goal is to create one method and use it from those 2, merging the common 
> logic as much as we can.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10959) Extract common method of two that check if preemption disabled in CSQueuePreemption

2022-11-18 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs reassigned YARN-10959:
--

Assignee: Peter Szucs

> Extract common method of two that check if preemption disabled in 
> CSQueuePreemption
> ---
>
> Key: YARN-10959
> URL: https://issues.apache.org/jira/browse/YARN-10959
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> This is a follow-up of YARN-10913. 
> After YARN-10913, we have a class called CSQueuePreemption that has 2 methods 
> that are very similar to each other: 
> - isQueueHierarchyPreemptionDisabled
> - isIntraQueueHierarchyPreemptionDisabled
> The goal is to create one method and use it from those 2, merging the common 
> logic as much as we can.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10005) Code improvements in MutableCSConfigurationProvider

2022-11-03 Thread Peter Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Szucs reassigned YARN-10005:
--

Assignee: Peter Szucs

> Code improvements in MutableCSConfigurationProvider
> ---
>
> Key: YARN-10005
> URL: https://issues.apache.org/jira/browse/YARN-10005
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Peter Szucs
>Priority: Minor
>
> * Important: constructKeyValueConfUpdate and all related methods seem to be a 
> separate responsibility: how to convert an incoming SchedConfUpdateInfo into 
> Configuration changes (a Configuration object)
> * Duplicated code block (9 lines) in init / formatConfigurationInStore methods
> * Method "getConfStore" could be package-private
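A minimal sketch of splitting out that conversion responsibility (the converter class name and method shape are assumptions; only the general direction is taken from the ticket):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical converter extracted from MutableCSConfigurationProvider:
// turns an incoming update request into flat key/value Configuration
// changes. SchedConfUpdateInfo is modeled here only by the pieces the
// sketch needs.
final class SchedConfUpdateConverter {
  private SchedConfUpdateConverter() {
  }

  // A null value in the result would mean "remove this key" when the
  // changes are applied to the Configuration object.
  static Map<String, String> toConfigurationChanges(String queuePrefix,
      Map<String, String> updatedQueueParams) {
    Map<String, String> changes = new HashMap<>();
    for (Map.Entry<String, String> param : updatedQueueParams.entrySet()) {
      changes.put(queuePrefix + "." + param.getKey(), param.getValue());
    }
    return changes;
  }
}
{code}

constructKeyValueConfUpdate would then delegate to a class like this, and the duplicated init/formatConfigurationInStore block could be extracted into a shared private helper in the same pass.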



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org