[jira] [Commented] (YARN-9837) YARN Service fails to fetch status for Stopped apps with bigger spec files

2019-09-16 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931109#comment-16931109
 ] 

Tarun Parimi commented on YARN-9837:


Thanks for the review, [~eyang].

> YARN Service fails to fetch status for Stopped apps with bigger spec files
> --
>
> Key: YARN-9837
> URL: https://issues.apache.org/jira/browse/YARN-9837
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9837.001.patch
>
>
> Was unable to fetch status for a STOPPED app due to the below error in RM 
> logs.
> {code:java}
> ERROR webapp.ApiServer (ApiServer.java:getService(213)) - Get service failed: 
> {}
> java.io.EOFException: Read of 
> hdfs://my-cluster:8020/user/appuser/.yarn/services/my-service/my-service.json 
> finished prematurely
> at 
> org.apache.hadoop.yarn.service.utils.JsonSerDeser.load(JsonSerDeser.java:188)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.loadService(ServiceApiUtil.java:360)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getAppId(ServiceClient.java:1409)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1235)
> at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$3(ApiServer.java:749)
> {code}
> This seems to happen when the json file my-service.json is larger than 128KB 
> in my cluster.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5913) Consolidate "resource" and "amResourceRequest" in ApplicationSubmissionContext

2019-09-16 Thread Daniel Templeton (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-5913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931077#comment-16931077
 ] 

Daniel Templeton commented on YARN-5913:


The issue at hand is that in {{ApplicationSubmissionContext}} {{resource}} 
({{getResource()}}/{{setResource()}}) has the same meaning as 
{{AMContainerResourceRequests}} 
({{getAMContainerResourceRequests()}}/{{setAMContainerResourceRequests()}}).  
Having two different properties that do the same thing is confusing.  The task 
here is to pick one and replace all calls to the other with it.  I would 
recommend replacing all uses of {{resource}} with 
{{AMContainerResourceRequests}}.  Then mark {{getResource()}} and 
{{setResource()}} as deprecated.

There's one more issue.  In YARN-2493 the patch strips the {{@Stable}} tag off 
{{getResource()}} and {{setResource()}}.  That's not OK; the {{@Stable}} 
annotation should be restored.

Let me know if you need more details.
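
For illustration, a minimal sketch of that migration against the existing 
{{ApplicationSubmissionContext}} API; the memory/vcore values below are 
placeholders, not part of this issue:
{code:java}
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class AmResourceMigration {

  // Express the AM resource via amContainerResourceRequests instead of the
  // overlapping "resource" property.
  static void setAmResource(ApplicationSubmissionContext ctx) {
    // Old style (the property to be deprecated):
    // ctx.setResource(Resource.newInstance(1024, 1));

    // Equivalent request through AMContainerResourceRequests:
    ResourceRequest amRequest = ResourceRequest.newInstance(
        Priority.newInstance(0),        // AM priority
        ResourceRequest.ANY,            // no locality constraint for the AM
        Resource.newInstance(1024, 1),  // 1024 MB, 1 vcore -- placeholder values
        1);                             // one AM container
    ctx.setAMContainerResourceRequests(Collections.singletonList(amRequest));
  }
}
{code}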

> Consolidate "resource" and "amResourceRequest" in ApplicationSubmissionContext
> --
>
> Key: YARN-5913
> URL: https://issues.apache.org/jira/browse/YARN-5913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Yufei Gu
>Assignee: Yousef Abu-Salah
>Priority: Minor
>  Labels: newbie
>
> Usage of these two variables overlaps and causes confusion. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9834) Allow using a pool of local users to run Yarn Secure Container in secure mode

2019-09-16 Thread shanyu zhao (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930998#comment-16930998
 ] 

shanyu zhao commented on YARN-9834:
---

[~eyang], you are talking about the Docker container executor. What I meant is 
running the Yarn node manager itself inside a Docker container (managed by 
Kubernetes). The node manager container (which runs SSSD) needs to be domain 
joined to be able to sync domain users.

This change honors the fundamental security design of process owner and file 
owner; that is why it uses reference counting for all these local files. The 
local user is not released for reallocation until all FileDeletionTasks are 
done, so a Yarn container cannot access any previous container's local 
files/logs.

Currently LinuxContainerExecutor already supports non-secure mode and secure 
mode, and this change only modifies the getRunAsUser() behavior for secure 
mode. I could inherit from LinuxContainerExecutor and override this single 
method with a one-line change. But in the end, it is still 
LinuxContainerExecutor, not a drastically different container executor.

As for the ResourceLocalization modification, we are basically treating 
"PRIVATE" resources as "APPLICATION" resources without banning "PRIVATE" 
resources. The end result is that PRIVATE resources are kept within the 
application folder and are removed after the application is done.
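
As an illustration only (the helper and its names are hypothetical, not the 
actual patch), the remapping amounts to downgrading the visibility used for 
localization:
{code:java}
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;

// Illustrative helper only: PRIVATE resources are localized as APPLICATION
// resources, so they live under the application folder and are cleaned up
// when the application finishes.
public final class PoolUserVisibility {
  private PoolUserVisibility() {
  }

  static LocalResourceVisibility effectiveVisibility(LocalResource rsrc) {
    if (rsrc.getVisibility() == LocalResourceVisibility.PRIVATE) {
      return LocalResourceVisibility.APPLICATION;
    }
    return rsrc.getVisibility();
  }
}
{code}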

> Allow using a pool of local users to run Yarn Secure Container in secure mode
> -
>
> Key: YARN-9834
> URL: https://issues.apache.org/jira/browse/YARN-9834
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.2
>Reporter: shanyu zhao
>Assignee: shanyu zhao
>Priority: Major
>
> Yarn Secure Container in secure mode allows separation of different user's 
> local files and container processes running on the same node manager. This 
> depends on an out of band service such as SSSD/Winbind to sync all domain 
> users to local machine.
> Winbind user sync has lots of overhead, especially for large corporations. 
> Also if running Yarn inside Kubernetes cluster (meaning node managers running 
> inside Docker container), it doesn't make sense for each container to domain 
> join with Active Directory and sync a whole copy of domain users.
> We should allow a new configuration to Yarn, such that we can pre-create a 
> pool of users on each machine/Docker container. And at runtime, Yarn 
> allocates a local user to the domain user that submits the application. When 
> all containers of that user are finished and all files belonging to that user 
> are deleted, we can release the allocation and allow other users to use the 
> same local user to run their Yarn containers.
> h2. Design
> We propose to add these new configurations:
> {code:java}
> yarn.nodemanager.linux-container-executor.secure-mode.use-local-user, 
> defaults to false
> yarn.nodemanager.linux-container-executor.secure-mode.local-user-prefix, 
> defaults to "user"{code}
> By default this feature is turned off. If we enable it, with 
> local-user-prefix set to "user", then we expect there are pre-created local 
> users user0 - usern, where the total number of local users equals to:
> {code:java}
> yarn.nodemanager.resource.cpu-vcores {code}
> We can use an in-memory allocator to keep the domain user to local user 
> mapping. 
> Now when to add the mapping and when to remove it?
> In node manager, ApplicationImpl implements the state machine for a Yarn app 
> life cycle, only if the app has at least 1 container running on that node 
> manager. We can hook up the code to add the mapping during application 
> initialization.
> For removing the mapping, we need to wait for 3 things:
> 1) All applications of the same user is completed;
>  2) All log handling of the applications (log aggregation or non-aggregated 
> handling) is done;
>  3) All pending FileDeletionTask that use the user's identity is finished.
> Note that all operation to these reference counting should be synchronized 
> operation.
> If all of our local users in the pool are allocated, we'll return 
> "nonexistuser" as runas user, this will cause the container to fail to 
> execute and Yarn will relaunch it in other nodes.
> h2. Limitations
> 1) This feature does not support PRIVATE visibility type of resource 
> allocation. Because PRIVATE type of resources are potentially cached in the 
> node manager for a very long time, supporting it will be a security problem 
> that a user might be able to peek into previous user's PRIVATE resources. We 
> can modify code to treat all PRIVATE type of resource as APPLICATION type.
> 2) It is recommended to enable DominantResourceCalculator so that no more 
> than "cpu-vcores" number of 

[jira] [Commented] (YARN-9834) Allow using a pool of local users to run Yarn Secure Container in secure mode

2019-09-16 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930980#comment-16930980
 ] 

Eric Yang commented on YARN-9834:
-

[~shanyu] {quote}I forgot to mention that for Winbind/SSSD to work the 
container needs to be domain joined to Active Directory, which doesn't work and 
doesn't seem to be efficient.{quote}

I don't think this statement is true.  The container doesn't need to join AD.  
User credentials can be set up by pointing the container at the Linux host's 
sssd cache via the sssd socket; see 
https://hadoop.apache.org/docs/r3.1.2/hadoop-yarn/hadoop-yarn-site/DockerContainers.html#SSSD
 for setup instructions.

I can understand the idea of this feature: keep only a small set of reusable 
users to reduce the overhead of loading user information from a disjoined 
system.  However, this feature could be dangerous if a user has existing data 
that can be accidentally accessed by future users.  It strips away some of the 
fundamental security design, like process owner and file owner, in Linux-like 
systems.  The modifications to LinuxContainerExecutor and 
ResourceLocalizationService are not safe practices, and the conflicting model 
will make the Linux container executor hard to maintain.

I would suggest writing a separate container executor to accomplish your goal 
instead of hacking LinuxContainerExecutor for this purpose.

> Allow using a pool of local users to run Yarn Secure Container in secure mode
> -
>
> Key: YARN-9834
> URL: https://issues.apache.org/jira/browse/YARN-9834
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.2
>Reporter: shanyu zhao
>Assignee: shanyu zhao
>Priority: Major
>
> Yarn Secure Container in secure mode allows separation of different user's 
> local files and container processes running on the same node manager. This 
> depends on an out of band service such as SSSD/Winbind to sync all domain 
> users to local machine.
> Winbind user sync has lots of overhead, especially for large corporations. 
> Also if running Yarn inside Kubernetes cluster (meaning node managers running 
> inside Docker container), it doesn't make sense for each container to domain 
> join with Active Directory and sync a whole copy of domain users.
> We should allow a new configuration to Yarn, such that we can pre-create a 
> pool of users on each machine/Docker container. And at runtime, Yarn 
> allocates a local user to the domain user that submits the application. When 
> all containers of that user are finished and all files belonging to that user 
> are deleted, we can release the allocation and allow other users to use the 
> same local user to run their Yarn containers.
> h2. Design
> We propose to add these new configurations:
> {code:java}
> yarn.nodemanager.linux-container-executor.secure-mode.use-local-user, 
> defaults to false
> yarn.nodemanager.linux-container-executor.secure-mode.local-user-prefix, 
> defaults to "user"{code}
> By default this feature is turned off. If we enable it, with 
> local-user-prefix set to "user", then we expect there are pre-created local 
> users user0 - usern, where the total number of local users equals to:
> {code:java}
> yarn.nodemanager.resource.cpu-vcores {code}
> We can use an in-memory allocator to keep the domain user to local user 
> mapping. 
> Now when to add the mapping and when to remove it?
> In node manager, ApplicationImpl implements the state machine for a Yarn app 
> life cycle, only if the app has at least 1 container running on that node 
> manager. We can hook up the code to add the mapping during application 
> initialization.
> For removing the mapping, we need to wait for 3 things:
> 1) All applications of the same user is completed;
>  2) All log handling of the applications (log aggregation or non-aggregated 
> handling) is done;
>  3) All pending FileDeletionTask that use the user's identity is finished.
> Note that all operation to these reference counting should be synchronized 
> operation.
> If all of our local users in the pool are allocated, we'll return 
> "nonexistuser" as runas user, this will cause the container to fail to 
> execute and Yarn will relaunch it in other nodes.
> h2. Limitations
> 1) This feature does not support PRIVATE visibility type of resource 
> allocation. Because PRIVATE type of resources are potentially cached in the 
> node manager for a very long time, supporting it will be a security problem 
> that a user might be able to peek into previous user's PRIVATE resources. We 
> can modify code to treat all PRIVATE type of resource as APPLICATION type.
> 2) It is recommended to enable DominantResourceCalculator so that no more 
> than "cpu-vcores" number of concurrent containers running on a node manager:
> 

[jira] [Commented] (YARN-9834) Allow using a pool of local users to run Yarn Secure Container in secure mode

2019-09-16 Thread shanyu zhao (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930945#comment-16930945
 ] 

shanyu zhao commented on YARN-9834:
---

Thanks [~eyang]! I forgot to mention that for Winbind/SSSD to work the 
container needs to be domain joined to Active Directory, which doesn't work and 
doesn't seem to be efficient.

I pushed a change to rename "use-local-user" to "use-pool-user", and 
"local-user-prefix" to "pool-user-prefix".

> Allow using a pool of local users to run Yarn Secure Container in secure mode
> -
>
> Key: YARN-9834
> URL: https://issues.apache.org/jira/browse/YARN-9834
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.2
>Reporter: shanyu zhao
>Assignee: shanyu zhao
>Priority: Major
>
> Yarn Secure Container in secure mode allows separation of different user's 
> local files and container processes running on the same node manager. This 
> depends on an out of band service such as SSSD/Winbind to sync all domain 
> users to local machine.
> Winbind user sync has lots of overhead, especially for large corporations. 
> Also if running Yarn inside Kubernetes cluster (meaning node managers running 
> inside Docker container), it doesn't make sense for each container to domain 
> join with Active Directory and sync a whole copy of domain users.
> We should allow a new configuration to Yarn, such that we can pre-create a 
> pool of users on each machine/Docker container. And at runtime, Yarn 
> allocates a local user to the domain user that submits the application. When 
> all containers of that user are finished and all files belonging to that user 
> are deleted, we can release the allocation and allow other users to use the 
> same local user to run their Yarn containers.
> h2. Design
> We propose to add these new configurations:
> {code:java}
> yarn.nodemanager.linux-container-executor.secure-mode.use-local-user, 
> defaults to false
> yarn.nodemanager.linux-container-executor.secure-mode.local-user-prefix, 
> defaults to "user"{code}
> By default this feature is turned off. If we enable it, with 
> local-user-prefix set to "user", then we expect there are pre-created local 
> users user0 - usern, where the total number of local users equals to:
> {code:java}
> yarn.nodemanager.resource.cpu-vcores {code}
> We can use an in-memory allocator to keep the domain user to local user 
> mapping. 
> Now when to add the mapping and when to remove it?
> In node manager, ApplicationImpl implements the state machine for a Yarn app 
> life cycle, only if the app has at least 1 container running on that node 
> manager. We can hook up the code to add the mapping during application 
> initialization.
> For removing the mapping, we need to wait for 3 things:
> 1) All applications of the same user is completed;
>  2) All log handling of the applications (log aggregation or non-aggregated 
> handling) is done;
>  3) All pending FileDeletionTask that use the user's identity is finished.
> Note that all operation to these reference counting should be synchronized 
> operation.
> If all of our local users in the pool are allocated, we'll return 
> "nonexistuser" as runas user, this will cause the container to fail to 
> execute and Yarn will relaunch it in other nodes.
> h2. Limitations
> 1) This feature does not support PRIVATE visibility type of resource 
> allocation. Because PRIVATE type of resources are potentially cached in the 
> node manager for a very long time, supporting it will be a security problem 
> that a user might be able to peek into previous user's PRIVATE resources. We 
> can modify code to treat all PRIVATE type of resource as APPLICATION type.
> 2) It is recommended to enable DominantResourceCalculator so that no more 
> than "cpu-vcores" number of concurrent containers running on a node manager:
> {code:java}
> yarn.scheduler.capacity.resource-calculator
> = org.apache.hadoop.yarn.util.resource.DominantResourceCalculator {code}
> 3) Currently this feature does not work with Yarn Node Manager recovery. This 
> is because the mappings are kept in memory, it cannot be recovered after node 
> manager restart.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9834) Allow using a pool of local users to run Yarn Secure Container in secure mode

2019-09-16 Thread shanyu zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated YARN-9834:
--
Description: 
Yarn Secure Container in secure mode allows separation of different users' 
local files and container processes running on the same node manager. This 
depends on an out-of-band service such as SSSD/Winbind to sync all domain users 
to the local machine.

Winbind user sync has lots of overhead, especially for large corporations. 
Also, if running Yarn inside a Kubernetes cluster (meaning node managers run 
inside Docker containers), it doesn't make sense for each container to domain 
join with Active Directory and sync a whole copy of the domain users.

We should add a new configuration to Yarn so that we can pre-create a pool of 
users on each machine/Docker container. At runtime, Yarn allocates a local user 
to the domain user that submits the application. When all containers of that 
user are finished and all files belonging to that user are deleted, we can 
release the allocation and allow other users to use the same local user to run 
their Yarn containers.
h2. Design

We propose to add these new configurations:
{code:java}
yarn.nodemanager.linux-container-executor.secure-mode.use-local-user, defaults 
to false
yarn.nodemanager.linux-container-executor.secure-mode.local-user-prefix, 
defaults to "user"{code}
By default this feature is turned off. If we enable it, with local-user-prefix 
set to "user", then we expect pre-created local users user0 - usern, where the 
total number of local users equals:
{code:java}
yarn.nodemanager.resource.cpu-vcores {code}
We can use an in-memory allocator to keep the domain-user-to-local-user 
mapping.

Now, when should we add the mapping and when should we remove it?

In the node manager, ApplicationImpl implements the state machine for a Yarn 
app's life cycle, but only if the app has at least one container running on 
that node manager. We can hook in the code to add the mapping during 
application initialization.

For removing the mapping, we need to wait for 3 things:

1) All applications of the same user are completed;
 2) All log handling of those applications (log aggregation or non-aggregated 
handling) is done;
 3) All pending FileDeletionTasks that use the user's identity are finished.

Note that all operations on these reference counts should be synchronized.

If all of the local users in the pool are allocated, we'll return 
"nonexistuser" as the run-as user; this will cause the container to fail to 
execute, and Yarn will relaunch it on other nodes.
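
As a rough sketch of the design above (class and method names are hypothetical 
and simplified, not the proposed patch), the in-memory allocator could look 
like:
{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the in-memory pool-user allocator described above.
public class PoolUserAllocator {
  private static final String NO_USER = "nonexistuser";

  private final String prefix;   // e.g. "user" from local-user-prefix
  private final int poolSize;    // e.g. yarn.nodemanager.resource.cpu-vcores
  private final Map<String, Integer> domainUserToIndex = new HashMap<>();
  private final Map<String, Integer> refCounts = new HashMap<>();
  private final boolean[] inUse;

  public PoolUserAllocator(String prefix, int poolSize) {
    this.prefix = prefix;
    this.poolSize = poolSize;
    this.inUse = new boolean[poolSize];
  }

  // Map the domain user to a pooled local user and take one reference
  // (an application, its log handling, or a pending FileDeletionTask).
  public synchronized String acquire(String domainUser) {
    Integer index = domainUserToIndex.get(domainUser);
    if (index == null) {
      index = findFreeIndex();
      if (index < 0) {
        return NO_USER;  // pool exhausted: container fails, relaunched elsewhere
      }
      domainUserToIndex.put(domainUser, index);
      inUse[index] = true;
    }
    refCounts.merge(domainUser, 1, Integer::sum);
    return prefix + index;
  }

  // Drop one reference; when the count reaches zero the local user goes
  // back to the pool and can be reallocated.
  public synchronized void release(String domainUser) {
    Integer count = refCounts.get(domainUser);
    if (count == null) {
      return;
    }
    if (count <= 1) {
      refCounts.remove(domainUser);
      inUse[domainUserToIndex.remove(domainUser)] = false;
    } else {
      refCounts.put(domainUser, count - 1);
    }
  }

  private int findFreeIndex() {
    for (int i = 0; i < poolSize; i++) {
      if (!inUse[i]) {
        return i;
      }
    }
    return -1;
  }
}
{code}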
h2. Limitations

1) This feature does not support the PRIVATE resource visibility type. Because 
PRIVATE resources are potentially cached in the node manager for a very long 
time, supporting them would be a security problem: a user might be able to peek 
into a previous user's PRIVATE resources. We can modify the code to treat all 
PRIVATE resources as APPLICATION type.

2) It is recommended to enable DominantResourceCalculator so that no more than 
"cpu-vcores" concurrent containers run on a node manager:
{code:java}
yarn.scheduler.capacity.resource-calculator
= org.apache.hadoop.yarn.util.resource.DominantResourceCalculator {code}
3) Currently this feature does not work with Yarn Node Manager recovery. This 
is because the mappings are kept in memory and cannot be recovered after a node 
manager restart.

 

  was:
Yarn Secure Container in secure mode allows separation of different user's 
local files and container processes running on the same node manager. This 
depends on an out of band service such as SSSD/Winbind to sync all domain users 
to local machine.

SSSD/Winbind user sync has lots of overhead, especially for large corporations. 
Also if running Yarn inside Kubernetes cluster (meaning node managers running 
inside Docker container), it doesn't make sense for each container to sync a 
whole copy of domain users.

We should allow a new configuration to Yarn, such that we can pre-create a pool 
of users on each machine/Docker container. And at runtime, Yarn allocates a 
local user to the domain user that submits the application. When all containers 
of that user are finished and all files belonging to that user are deleted, we 
can release the allocation and allow other users to use the same local user to 
run their Yarn containers.
h2. Design

We propose to add these new configurations:
{code:java}
yarn.nodemanager.linux-container-executor.secure-mode.use-local-user, defaults 
to false
yarn.nodemanager.linux-container-executor.secure-mode.local-user-prefix, 
defaults to "user"{code}
By default this feature is turned off. If we enable it, with local-user-prefix 
set to "user", then we expect there are pre-created local users user0 - usern, 
where the total number of local users equals to:
{code:java}
yarn.nodemanager.resource.cpu-vcores {code}
We can use an in-memory 

[jira] [Updated] (YARN-9834) Allow using a pool of local users to run Yarn Secure Container in secure mode

2019-09-16 Thread shanyu zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated YARN-9834:
--
Description: 
Yarn Secure Container in secure mode allows separation of different users' 
local files and container processes running on the same node manager. This 
depends on an out-of-band service such as SSSD/Winbind to sync all domain users 
to the local machine.

SSSD/Winbind user sync has lots of overhead, especially for large corporations. 
Also, if running Yarn inside a Kubernetes cluster (meaning node managers run 
inside Docker containers), it doesn't make sense for each container to sync a 
whole copy of the domain users.

We should add a new configuration to Yarn so that we can pre-create a pool of 
users on each machine/Docker container. At runtime, Yarn allocates a local user 
to the domain user that submits the application. When all containers of that 
user are finished and all files belonging to that user are deleted, we can 
release the allocation and allow other users to use the same local user to run 
their Yarn containers.
h2. Design

We propose to add these new configurations:
{code:java}
yarn.nodemanager.linux-container-executor.secure-mode.use-local-user, defaults 
to false
yarn.nodemanager.linux-container-executor.secure-mode.local-user-prefix, 
defaults to "user"{code}
By default this feature is turned off. If we enable it, with local-user-prefix 
set to "user", then we expect pre-created local users user0 - usern, where the 
total number of local users equals:
{code:java}
yarn.nodemanager.resource.cpu-vcores {code}
We can use an in-memory allocator to keep the domain-user-to-local-user 
mapping.

Now, when should we add the mapping and when should we remove it?

In the node manager, ApplicationImpl implements the state machine for a Yarn 
app's life cycle, but only if the app has at least one container running on 
that node manager. We can hook in the code to add the mapping during 
application initialization.

For removing the mapping, we need to wait for 3 things:

1) All applications of the same user are completed;
 2) All log handling of those applications (log aggregation or non-aggregated 
handling) is done;
 3) All pending FileDeletionTasks that use the user's identity are finished.

If all of the local users in the pool are allocated, we'll return 
"nonexistuser" as the run-as user; this will cause the container to fail to 
execute, and Yarn will relaunch it on other nodes.
h2. Limitations

1) This feature does not support the PRIVATE resource visibility type. Because 
PRIVATE resources are potentially cached in the node manager for a very long 
time, supporting them would be a security problem: a user might be able to peek 
into a previous user's PRIVATE resources. We can modify the code to treat all 
PRIVATE resources as APPLICATION type.

2) It is recommended to enable DominantResourceCalculator so that no more than 
"cpu-vcores" concurrent containers run on a node manager:
{code:java}
yarn.scheduler.capacity.resource-calculator
= org.apache.hadoop.yarn.util.resource.DominantResourceCalculator {code}
3) Currently this feature does not work with Yarn Node Manager recovery. This 
is because the mappings are kept in memory and cannot be recovered after a node 
manager restart.

 

  was:
Yarn Secure Container in secure mode allows separation of different user's 
local files and container processes running on the same node manager. This 
depends on an out of band service such as SSSD/Winbind to sync all domain users 
to local machine.

SSSD/Winbind user sync has lots of overhead, especially for large corporations. 
Also if running Yarn inside Kubernetes cluster (meaning node managers running 
inside Docker container), it doesn't make sense for each container to sync a 
whole copy of domain users.

We should allow a new configuration to Yarn, such that we can pre-create a pool 
of users on each machine/Docker container. And at runtime, Yarn allocates a 
local user to the domain user that submits the application. When all containers 
of that user and all files belonging to that user are deleted, we can release 
the allocation and allow other users to use the same local user to run their 
Yarn containers.
h2. Design

We propose to add these new configurations:
{code:java}
yarn.nodemanager.linux-container-executor.secure-mode.use-local-user, defaults 
to false
yarn.nodemanager.linux-container-executor.secure-mode.local-user-prefix, 
defaults to "user"{code}
By default this feature is turned off. If we enable it, with local-user-prefix 
set to "user", then we expect there are pre-created local users user0 - usern, 
where n equals to:
{code:java}
yarn.nodemanager.resource.cpu-vcores {code}
We can use an in-memory allocator to keep the domain user to local user 
mapping. When to add the mapping and when to remove it?

In node manager, ApplicationImpl implements the state machine 

[jira] [Commented] (YARN-9837) YARN Service fails to fetch status for Stopped apps with bigger spec files

2019-09-16 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930775#comment-16930775
 ] 

Hadoop QA commented on YARN-9837:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
24s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 53s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
22s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
16s{color} | {color:green} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core:
 The patch generated 0 new + 16 unchanged - 1 fixed = 16 total (was 17) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 28s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 18m 
38s{color} | {color:green} hadoop-yarn-services-core in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
30s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 66m 45s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:39e82acc485 |
| JIRA Issue | YARN-9837 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12980427/YARN-9837.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 1265b68bcc90 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 56f042c |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/24800/testReport/ |
| Max. process+thread count | 742 (vs. ulimit of 5500) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core
 U: 

[jira] [Commented] (YARN-9011) Race condition during decommissioning

2019-09-16 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930774#comment-16930774
 ] 

Hadoop QA commented on YARN-9011:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
36s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 14 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
51s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 42s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
5s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m  
7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m  
7s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 14s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 1 new + 562 unchanged - 3 fixed = 563 total (was 565) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 19s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 85m 
13s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 25m 
38s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
38s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}185m 55s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.2 Server=19.03.2 Image:yetus/hadoop:39e82acc485 |
| JIRA Issue | YARN-9011 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12980422/YARN-9011-004.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 43a4d061ddbd 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 56f042c |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 

[jira] [Commented] (YARN-9837) YARN Service fails to fetch status for Stopped apps with bigger spec files

2019-09-16 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930770#comment-16930770
 ] 

Eric Yang commented on YARN-9837:
-

[~tarunparimi] Thank you for the patch.  Patch 001 looks good to me, pending 
Jenkins reports.

> YARN Service fails to fetch status for Stopped apps with bigger spec files
> --
>
> Key: YARN-9837
> URL: https://issues.apache.org/jira/browse/YARN-9837
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9837.001.patch
>
>
> Was unable to fetch status for a STOPPED app due to the below error in RM 
> logs.
> {code:java}
> ERROR webapp.ApiServer (ApiServer.java:getService(213)) - Get service failed: 
> {}
> java.io.EOFException: Read of 
> hdfs://my-cluster:8020/user/appuser/.yarn/services/my-service/my-service.json 
> finished prematurely
> at 
> org.apache.hadoop.yarn.service.utils.JsonSerDeser.load(JsonSerDeser.java:188)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.loadService(ServiceApiUtil.java:360)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getAppId(ServiceClient.java:1409)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1235)
> at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$3(ApiServer.java:749)
> {code}
> This seems to happen when the json file my-service.json is larger than 128KB 
> in my cluster.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9834) Allow using a pool of local users to run Yarn Secure Container in secure mode

2019-09-16 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930716#comment-16930716
 ] 

Eric Yang commented on YARN-9834:
-

[~shanyu] SSSD does not mirror all users; it only caches users on demand 
during per-user lookup.  Hence, it is very efficient.  Winbind does have the 
limitation that this issue tries to address.  

Can we change the config from:

{code}
yarn.nodemanager.linux-container-executor.secure-mode.use-local-user=false
{code}

to

{code}
yarn.nodemanager.linux-container-executor.secure-mode.use-pool-user=false
{code}

Use-local-user can be misleading for some users who would like to use 
/etc/passwd users but get confused by the meaning of the config name.  Thanks

> Allow using a pool of local users to run Yarn Secure Container in secure mode
> -
>
> Key: YARN-9834
> URL: https://issues.apache.org/jira/browse/YARN-9834
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.2
>Reporter: shanyu zhao
>Assignee: shanyu zhao
>Priority: Major
>
> Yarn Secure Container in secure mode allows separation of different user's 
> local files and container processes running on the same node manager. This 
> depends on an out of band service such as SSSD/Winbind to sync all domain 
> users to local machine.
> SSSD/Winbind user sync has lots of overhead, especially for large 
> corporations. Also if running Yarn inside Kubernetes cluster (meaning node 
> managers running inside Docker container), it doesn't make sense for each 
> container to sync a whole copy of domain users.
> We should allow a new configuration to Yarn, such that we can pre-create a 
> pool of users on each machine/Docker container. And at runtime, Yarn 
> allocates a local user to the domain user that submits the application. When 
> all containers of that user and all files belonging to that user are deleted, 
> we can release the allocation and allow other users to use the same local 
> user to run their Yarn containers.
> h2. Design
> We propose to add these new configurations:
> {code:java}
> yarn.nodemanager.linux-container-executor.secure-mode.use-local-user, 
> defaults to false
> yarn.nodemanager.linux-container-executor.secure-mode.local-user-prefix, 
> defaults to "user"{code}
> By default this feature is turned off. If we enable it, with 
> local-user-prefix set to "user", then we expect there are pre-created local 
> users user0 - usern, where n equals to:
> {code:java}
> yarn.nodemanager.resource.cpu-vcores {code}
> We can use an in-memory allocator to keep the domain user to local user 
> mapping. When to add the mapping and when to remove it?
> In node manager, ApplicationImpl implements the state machine for a Yarn app 
> life cycle, only if the app has at least 1 container running on that node 
> manager. We can hook up the code to add the mapping during application 
> initialization.
> For removing the mapping, we need to wait for 3 things:
> 1) All applications of the same user is completed;
>  2) All log handling of the applications (log aggregation or non-aggregated 
> handling) is done;
>  3) All pending FileDeletionTask that use the user's identity is finished.
> If all of our local users in the pool are allocated, we'll return 
> "nonexistuser" as runas user, this will cause the container to fail to 
> execute and Yarn will relaunch it in other nodes.
> h2. Limitations
> 1) This feature does not support PRIVATE visibility type of resource 
> allocation. Because PRIVATE type of resources are potentially cached in the 
> node manager for a very long time, supporting it will be a security problem 
> that a user might be able to peek into previous user's PRIVATE resources. We 
> can modify code to treat all PRIVATE type of resource as APPLICATION type.
> 2) It is recommended to enable DominantResourceCalculator so that no more 
> than "cpu-vcores" number of concurrent containers running on a node manager:
> {code:java}
> yarn.scheduler.capacity.resource-calculator
> = org.apache.hadoop.yarn.util.resource.DominantResourceCalculator {code}
> 3) Currently this feature does not work with Yarn Node Manager recovery. This 
> is because the mappings are kept in memory, it cannot be recovered after node 
> manager restart.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9837) YARN Service fails to fetch status for Stopped apps with bigger spec files

2019-09-16 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-9837:
---
Attachment: YARN-9837.001.patch

> YARN Service fails to fetch status for Stopped apps with bigger spec files
> --
>
> Key: YARN-9837
> URL: https://issues.apache.org/jira/browse/YARN-9837
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9837.001.patch
>
>
> Was unable to fetch status for a STOPPED app due to the below error in RM 
> logs.
> {code:java}
> ERROR webapp.ApiServer (ApiServer.java:getService(213)) - Get service failed: 
> {}
> java.io.EOFException: Read of 
> hdfs://my-cluster:8020/user/appuser/.yarn/services/my-service/my-service.json 
> finished prematurely
> at 
> org.apache.hadoop.yarn.service.utils.JsonSerDeser.load(JsonSerDeser.java:188)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.loadService(ServiceApiUtil.java:360)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getAppId(ServiceClient.java:1409)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1235)
> at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$3(ApiServer.java:749)
> {code}
> This seems to happen when the json file my-service.json is larger than 128KB 
> in my cluster.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9837) YARN Service fails to fetch status for Stopped apps with bigger spec files

2019-09-16 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-9837:
--

 Summary: YARN Service fails to fetch status for Stopped apps with 
bigger spec files
 Key: YARN-9837
 URL: https://issues.apache.org/jira/browse/YARN-9837
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Affects Versions: 3.1.0
Reporter: Tarun Parimi
Assignee: Tarun Parimi


I was unable to fetch the status for a STOPPED app due to the below error in the RM logs.
{code:java}
ERROR webapp.ApiServer (ApiServer.java:getService(213)) - Get service failed: {}
java.io.EOFException: Read of 
hdfs://my-cluster:8020/user/appuser/.yarn/services/my-service/my-service.json 
finished prematurely
at 
org.apache.hadoop.yarn.service.utils.JsonSerDeser.load(JsonSerDeser.java:188)
at 
org.apache.hadoop.yarn.service.utils.ServiceApiUtil.loadService(ServiceApiUtil.java:360)
at 
org.apache.hadoop.yarn.service.client.ServiceClient.getAppId(ServiceClient.java:1409)
at 
org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1235)
at 
org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$3(ApiServer.java:749)
{code}
This seems to happen when the JSON file my-service.json is larger than 128 KB in 
my cluster.
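
The "finished prematurely" EOFException is consistent with filling the spec 
buffer via a single InputStream.read() call, which is not guaranteed to return 
the full length for larger files. As an illustration only (not the attached 
patch), looping until the whole file is read, e.g. with Hadoop's 
IOUtils.readFully, avoids this:
{code:java}
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SpecFileReader {

  // Read the whole spec file; a bare in.read(buffer) may return fewer bytes
  // than the file length for large files, which looks like a premature EOF.
  static byte[] readSpec(FileSystem fs, Path specPath) throws IOException {
    int len = (int) fs.getFileStatus(specPath).getLen();
    byte[] buffer = new byte[len];
    try (InputStream in = fs.open(specPath)) {
      // Loops internally until 'len' bytes are read or a real EOF occurs.
      IOUtils.readFully(in, buffer, 0, len);
    }
    return buffer;
  }
}
{code}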



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9011) Race condition during decommissioning

2019-09-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9011:
---
Attachment: YARN-9011-004.patch

> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9836) General usability improvements in showSimulationTrace.html

2019-09-16 Thread Adam Antal (Jira)
Adam Antal created YARN-9836:


 Summary: General usability improvements in showSimulationTrace.html
 Key: YARN-9836
 URL: https://issues.apache.org/jira/browse/YARN-9836
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler-load-simulator
Affects Versions: 3.3.0
Reporter: Adam Antal
Assignee: Adam Antal


There are some small usability improvements that can be made for the offline 
analysis page (showSimulationTrace.html):
- empty divs can be hidden while no data is displayed
- the site can be refactored to be responsive, given that bootstrap is already 
available as a third-party library
- there's no proper error handling in the site (e.g. a JSON is malformed and 
similar cases), which is really a big problem
- there's no indentation in the raw HTML file, which makes supportability even 
worse



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9011) Race condition during decommissioning

2019-09-16 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930656#comment-16930656
 ] 

Hadoop QA commented on YARN-9011:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
56s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 14 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
18s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 11m 
21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 47s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
29s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
18s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  9m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  9m 
58s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 19s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 3 new + 562 unchanged - 3 fixed = 565 total (was 565) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
23s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 49s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
24s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 86m 
37s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 25m 
42s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
37s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}202m 53s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:39e82acc485 |
| JIRA Issue | YARN-9011 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12980402/YARN-9011-003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 0e055a82ebb4 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 
10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 85b1c72 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| 

[jira] [Assigned] (YARN-9832) YARN UI has decommissioned nodemanager links

2019-09-16 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-9832:
---

Assignee: Tarun Parimi  (was: Prabhu Joseph)

> YARN UI has decommissioned nodemanager links
> 
>
> Key: YARN-9832
> URL: https://issues.apache.org/jira/browse/YARN-9832
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Tarun Parimi
>Priority: Major
>
> Container logs from the YARN UI don't work if the container was on a node that 
> got deleted as part of a scale-down operation. The Appattempts and Containers 
> pages point to the NodeManager link of a decommissioned node. 
> The UI code can change the link to point to the Log Server (AHS) if the NM is 
> not in the YARN Node List (Running Nodes), or show an error page such as "Node 
> is decommissioned, check Logs Cli". 
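A minimal sketch of the fallback described above, assuming the standard 
{{yarn.log.server.url}} property is configured; the class, the method name and the 
log-server path layout are illustrative only, not the actual UI code:
{code:java}
import java.util.Set;
import org.apache.hadoop.conf.Configuration;

// Illustrative only: keep the NM link for running nodes, otherwise fall back
// to the log server (AHS/JHS) or signal that only the CLI can fetch the logs.
public class ContainerLogLinkHelper {
  static String logLink(Configuration conf, Set<String> activeNodeIds,
      String nodeId, String nodeHttpAddress, String containerId, String user) {
    if (activeNodeIds.contains(nodeId)) {
      // Node is still in the running node list: keep the normal NM link.
      return "http://" + nodeHttpAddress + "/node/containerlogs/"
          + containerId + "/" + user;
    }
    String logServer = conf.get("yarn.log.server.url", "");
    if (logServer.isEmpty()) {
      // No log server configured: caller can render the
      // "Node is decommissioned, check logs via CLI" page instead.
      return null;
    }
    // The path layout below is illustrative; the real layout depends on the log server.
    return logServer + "/" + nodeId + "/" + containerId + "/" + containerId
        + "/" + user;
  }
}
{code}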



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9772) CapacitySchedulerQueueManager has incorrect list of queues

2019-09-16 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930521#comment-16930521
 ] 

Tarun Parimi commented on YARN-9772:


bq. Should we extend the duplicates check (as of now, it does only for leaf 
queues) to parent queues as well? 
[~maniraj...@gmail.com], the only problem I see is that there will be existing 
users who might already have a queue config containing parent queues with 
duplicate names. They would hit an error when they upgrade and be forced to 
modify their current queue config.

> CapacitySchedulerQueueManager has incorrect list of queues
> --
>
> Key: YARN-9772
> URL: https://issues.apache.org/jira/browse/YARN-9772
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
>
> CapacitySchedulerQueueManager has an incorrect list of queues when there is 
> more than one parent queue (say, at a middle level) with the same name.
> For example,
>  * root
>  ** a
>  *** b
>  **** c
>  *** d
>  **** b
>  ***** e
> {{CapacitySchedulerQueueManager#getQueues}} maintains this list of queues. 
> While parsing "root.a.d.b", it overrides "root.a.b" with a new Queue object in 
> the map because the short names are the same. After parsing all the queues, 
> the map count should be 7, but it is 6. Any reference to queue "root.a.b" in 
> the code path actually resolves to the "root.a.d.b" object. Since 
> {{CapacitySchedulerQueueManager#getQueues}} is used in multiple places, we 
> will need to understand the implications in detail. For example, 
> {{CapacityScheduler#getQueue}} is used in many places and in turn 
> uses {{CapacitySchedulerQueueManager#getQueues}}. cc [~eepayne], [~sunilg]
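The collision can be reproduced with a map keyed by the short queue name alone; a 
minimal standalone illustration, not the actual CapacitySchedulerQueueManager code:
{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class QueueNameCollisionDemo {
  public static void main(String[] args) {
    // name -> full path; the second parent named "b" silently replaces the first.
    Map<String, String> queues = new HashMap<>();
    for (String path : Arrays.asList("root", "root.a", "root.a.b", "root.a.b.c",
        "root.a.d", "root.a.d.b", "root.a.d.b.e")) {
      String shortName = path.substring(path.lastIndexOf('.') + 1);
      queues.put(shortName, path);
    }
    System.out.println(queues.size());    // 6, although 7 queues were parsed
    System.out.println(queues.get("b"));  // root.a.d.b - root.a.b is gone
  }
}
{code}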



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9766) YARN CapacityScheduler QueueMetrics has missing metrics for parent queues having same name

2019-09-16 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930516#comment-16930516
 ] 

Prabhu Joseph commented on YARN-9766:
-

[~tarunparimi] The patch looks good. +1 (non-binding)

[~sunilg] [~eepayne] Can you review and commit this Jira when you get time. 
Thanks.

> YARN CapacityScheduler QueueMetrics has missing metrics for parent queues 
> having same name
> --
>
> Key: YARN-9766
> URL: https://issues.apache.org/jira/browse/YARN-9766
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9766.001.patch
>
>
> In Capacity Scheduler, we enforce leaf queues to have unique names, but that 
> is not the case for parent queues. For example, we can have the below queue 
> hierarchy, where "b" is the queue name for two different queue paths, root.a.b 
> and root.a.d.b. Since "b" is not a leaf queue, this configuration works and 
> apps run fine in the leaf queues 'c' and 'e'.
>  * root
>  ** a
>  *** b
>  **** c
>  *** d
>  **** b
>  ***** e
> But the JMX metrics do not show the metrics for the parent queue 
> "root.a.d.b"; we can see metrics only for the "root.a.b" queue.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9794) RM crashes due to runtime errors in TimelineServiceV2Publisher

2019-09-16 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930503#comment-16930503
 ] 

Tarun Parimi commented on YARN-9794:


Thanks [~abmodi],[~Prabhu Joseph] for the reviews and commit.

> RM crashes due to runtime errors in TimelineServiceV2Publisher
> --
>
> Key: YARN-9794
> URL: https://issues.apache.org/jira/browse/YARN-9794
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9794.001.patch, YARN-9794.002.patch
>
>
> Saw that RM crashes while startup due to errors while putting entity in 
> TimelineServiceV2Publisher.
> {code:java}
> 2019-08-28 09:35:45,273 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.RuntimeException: java.lang.IllegalArgumentException: 
> org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException:
>  CodedInputStream encountered an embedded string or message which claimed to 
> have negative size
> .
> at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:200)
> at 
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:269)
> at 
> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437)
> at 
> org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312)
> at 
> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597)
> at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:834)
> at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:281)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:236)
> at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:321)
> at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:285)
> at 
> org.apache.hadoop.yarn.server.timelineservice.storage.common.TypedBufferedMutator.flush(TypedBufferedMutator.java:66)
> at 
> org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.flush(HBaseTimelineWriterImpl.java:566)
> at 
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.flushBufferedTimelineEntities(TimelineCollector.java:173)
> at 
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.putEntities(TimelineCollector.java:150)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:459)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:73)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:494)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:483)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: 
> org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException:
>  CodedInputStream encountered an embedded string or message which claimed to 
> have negative size.
> at 
> org.apache.hbase.thirdparty.com.google.protobuf.CodedInputStream.newInstance(CodedInputStream.java:117)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2019-09-16 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930498#comment-16930498
 ] 

Tarun Parimi commented on YARN-9833:


Great find. I recently came across this issue in a production cluster, where it 
was happening sporadically.

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty 
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clear {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.
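A minimal standalone illustration of the view-vs-copy difference described above, 
assuming Guava is on the classpath (as it already is for the NodeManager); this is 
a sketch of the direction, not the attached patch:
{code:java}
import com.google.common.collect.ImmutableList;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DirListDemo {
  private final List<String> localDirs =
      new ArrayList<>(Collections.singletonList("/data/yarn/local"));

  List<String> goodDirsView() {
    return Collections.unmodifiableList(localDirs);  // still backed by localDirs
  }

  List<String> goodDirsCopy() {
    return ImmutableList.copyOf(localDirs);          // independent snapshot
  }

  public static void main(String[] args) {
    DirListDemo d = new DirListDemo();
    List<String> view = d.goodDirsView();
    List<String> copy = d.goodDirsCopy();
    d.localDirs.clear();                             // simulates checkDirs()
    System.out.println(view.size());                 // 0 - the caller loses its dirs
    System.out.println(copy.size());                 // 1 - unaffected by the clear
  }
}
{code}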



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2019-09-16 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930481#comment-16930481
 ] 

Adam Antal commented on YARN-9833:
--

+1 (non-binding).

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty 
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clear {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2019-09-16 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930476#comment-16930476
 ] 

Hadoop QA commented on YARN-9833:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
39s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 46s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m  7s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 22m 
51s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
45s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 81m  6s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:39e82acc485 |
| JIRA Issue | YARN-9833 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12980384/YARN-9833-001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux b992a2391a39 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 
10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 85b1c72 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/24797/testReport/ |
| Max. process+thread count | 306 (vs. ulimit of 5500) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/24797/console |
| Powered by | Apache Yetus 0.8.0   

[jira] [Updated] (YARN-9011) Race condition during decommissioning

2019-09-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9011:
---
Attachment: YARN-9011-003.patch

> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9011) Race condition during decommissioning

2019-09-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9011:
---
Attachment: (was: YARN-9011-003.patch)

> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9011) Race condition during decommissioning

2019-09-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9011:
---
Attachment: YARN-9011-003.patch

> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9814) JobHistoryServer can't delete aggregated files, if remote app root directory is created by NodeManager

2019-09-16 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930429#comment-16930429
 ] 

Hadoop QA commented on YARN-9814:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
24s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
43s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 21s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
37s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
17s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  6m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  6m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
13s{color} | {color:green} hadoop-yarn-project/hadoop-yarn: The patch generated 
0 new + 226 unchanged - 1 fixed = 226 total (was 227) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 42s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
32s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
58s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
56s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
46s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 79m 44s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:39e82acc485 |
| JIRA Issue | YARN-9814 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12980378/YARN-9814.005.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux 98b970169c2a 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 85b1c72 |
| maven | version: 

[jira] [Updated] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2019-09-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9833:
---
Attachment: YARN-9833-001.patch

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty 
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clear {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9733) Method getCpuUsagePercent in Class ProcfsBasedProcessTree return 0 when subprocess of container dead

2019-09-16 Thread qian han (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qian han updated YARN-9733:
---
Attachment: (was: YARN-9733.001.patch)

> Method getCpuUsagePercent in Class ProcfsBasedProcessTree return 0 when 
> subprocess of container dead
> 
>
> Key: YARN-9733
> URL: https://issues.apache.org/jira/browse/YARN-9733
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: qian han
>Assignee: qian han
>Priority: Major
> Attachments: YARN-9733.001.patch
>
>
> The method getTotalProcessJiffies only gets jiffies for running processes, not 
> dead processes.
> For example, take process pid100 and its children pid200 and pid300.
> When we call getCpuUsagePercent the first time, assume that pid100 has 1000 
> jiffies, pid200 has 2000 and pid300 has 3000, so totalProcessJiffies1 is 6000.
> Then we kill pid300. When we call getCpuUsagePercent the second time, assume 
> that pid100 has 1100 jiffies and pid200 has 2200, so totalProcessJiffies2 is 
> 3300.
> So we get a CPU usage percent of 0, because totalProcessJiffies2 is lower than 
> totalProcessJiffies1.
> I would like to fix this bug.
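One possible direction, purely illustrative and not the attached patch, is to 
remember the jiffies of descendants that exited between two snapshots so the 
second total never drops below the first:
{code:java}
import java.util.HashMap;
import java.util.Map;

public class JiffiesTracker {
  private final Map<String, Long> lastSeen = new HashMap<>(); // pid -> jiffies
  private long retiredJiffies = 0;                            // jiffies of exited pids

  /** current: pid -> cumulative jiffies of the processes alive right now. */
  public long totalProcessJiffies(Map<String, Long> current) {
    for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
      if (!current.containsKey(e.getKey())) {
        retiredJiffies += e.getValue();  // keep the work done by the dead child
      }
    }
    lastSeen.clear();
    lastSeen.putAll(current);
    long alive = 0;
    for (long j : current.values()) {
      alive += j;
    }
    return alive + retiredJiffies;
  }
}
{code}
With the numbers above, the first call returns 6000 and the second returns 
3300 + 3000 = 6300, so the delta stays positive and the CPU percent no longer 
collapses to 0.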



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9733) Method getCpuUsagePercent in Class ProcfsBasedProcessTree return 0 when subprocess of container dead

2019-09-16 Thread qian han (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qian han updated YARN-9733:
---
Attachment: YARN-9733.001.patch

> Method getCpuUsagePercent in Class ProcfsBasedProcessTree return 0 when 
> subprocess of container dead
> 
>
> Key: YARN-9733
> URL: https://issues.apache.org/jira/browse/YARN-9733
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: qian han
>Assignee: qian han
>Priority: Major
> Attachments: YARN-9733.001.patch
>
>
> The method getTotalProcessJiffies only gets jiffies for running processes, not 
> dead processes.
> For example, take process pid100 and its children pid200 and pid300.
> When we call getCpuUsagePercent the first time, assume that pid100 has 1000 
> jiffies, pid200 has 2000 and pid300 has 3000, so totalProcessJiffies1 is 6000.
> Then we kill pid300. When we call getCpuUsagePercent the second time, assume 
> that pid100 has 1100 jiffies and pid200 has 2200, so totalProcessJiffies2 is 
> 3300.
> So we get a CPU usage percent of 0, because totalProcessJiffies2 is lower than 
> totalProcessJiffies1.
> I would like to fix this bug.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9814) JobHistoryServer can't delete aggregated files, if remote app root directory is created by NodeManager

2019-09-16 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930361#comment-16930361
 ] 

Adam Antal commented on YARN-9814:
--

Thanks for the review [~sunilg].
- The extra debug logging seems a bit of an overkill, since there's no 
computation that we would save, but I added it anyway.
- There was no test for the existing default log directory creation - there is 
now. I also mocked the loginUser of {{UserGroupInformation}} in both tests to 
make the tests more precise (a sketch of that pinning is below).
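A minimal sketch of the pinning mentioned above; the test class name and the code 
under test are placeholders, and {{UserGroupInformation.createUserForTesting}} / 
{{setLoginUser}} are the test-only hooks assumed here:
{code:java}
import org.apache.hadoop.security.UserGroupInformation;
import org.junit.Test;

public class TestRemoteLogDirGroup {
  @Test
  public void testLogDirGroupFollowsPrimaryGroup() throws Exception {
    // Pin the login user so the code under test sees a deterministic
    // primary group ("yarn") regardless of the OS account running the test.
    UserGroupInformation ugi =
        UserGroupInformation.createUserForTesting("yarn", new String[] {"yarn"});
    UserGroupInformation.setLoginUser(ugi);
    // ... exercise the remote app-log root directory creation here ...
  }
}
{code}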

> JobHistoryServer can't delete aggregated files, if remote app root directory 
> is created by NodeManager
> --
>
> Key: YARN-9814
> URL: https://issues.apache.org/jira/browse/YARN-9814
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, yarn
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Minor
> Attachments: YARN-9814.001.patch, YARN-9814.002.patch, 
> YARN-9814.003.patch, YARN-9814.004.patch, YARN-9814.005.patch
>
>
> If remote-app-log-dir is not created before starting Yarn processes, the 
> NodeManager creates it during the init of AppLogAggregator service. In a 
> custom system the primary group of the yarn user (which starts the NM/RM 
> daemons) is not hadoop, but set to a more restricted group (say yarn). If 
> NodeManager creates the folder it derives the group of the folder from the 
> primary group of the login user (which is yarn:yarn in this case), thus 
> setting the root log folder and all its subfolders to the yarn group, 
> ultimately making it inaccessible to other processes - e.g. the 
> JobHistoryServer's AggregatedLogDeletionService.
> I suggest making this group configurable. If this new configuration is not 
> set, we can still stick to the existing behaviour. 
> Creating the root app-log-dir each time during the setup of this system is a 
> bit error prone, and an end user can easily forget it. I think the best place 
> for this step is the LogAggregationService, which was already responsible for 
> creating the folder.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9814) JobHistoryServer can't delete aggregated files, if remote app root directory is created by NodeManager

2019-09-16 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-9814:
-
Attachment: YARN-9814.005.patch

> JobHistoryServer can't delete aggregated files, if remote app root directory 
> is created by NodeManager
> --
>
> Key: YARN-9814
> URL: https://issues.apache.org/jira/browse/YARN-9814
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, yarn
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Minor
> Attachments: YARN-9814.001.patch, YARN-9814.002.patch, 
> YARN-9814.003.patch, YARN-9814.004.patch, YARN-9814.005.patch
>
>
> If remote-app-log-dir is not created before starting Yarn processes, the 
> NodeManager creates it during the init of AppLogAggregator service. In a 
> custom system the primary group of the yarn user (which starts the NM/RM 
> daemons) is not hadoop, but set to a more restricted group (say yarn). If 
> NodeManager creates the folder it derives the group of the folder from the 
> primary group of the login user (which is yarn:yarn in this case), thus 
> setting the root log folder and all its subfolders to the yarn group, 
> ultimately making it inaccessible to other processes - e.g. the 
> JobHistoryServer's AggregatedLogDeletionService.
> I suggest making this group configurable. If this new configuration is not 
> set, we can still stick to the existing behaviour. 
> Creating the root app-log-dir each time during the setup of this system is a 
> bit error prone, and an end user can easily forget it. I think the best place 
> for this step is the LogAggregationService, which was already responsible for 
> creating the folder.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9814) JobHistoryServer can't delete aggregated files, if remote app root directory is created by NodeManager

2019-09-16 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930335#comment-16930335
 ] 

Hadoop QA commented on YARN-9814:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
37s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
31s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 29s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
25s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 25s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
20s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
51s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
43s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
46s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 85m  8s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.2 Server=19.03.2 Image:yetus/hadoop:39e82acc485 |
| JIRA Issue | YARN-9814 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12980368/YARN-9814.004.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux a8ed60a4989b 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 85b1c72 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 

[jira] [Commented] (YARN-9814) JobHistoryServer can't delete aggregated files, if remote app root directory is created by NodeManager

2019-09-16 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930281#comment-16930281
 ] 

Sunil Govindan commented on YARN-9814:
--

Thanks [~adam.antal].

This approach looks fine to me.

A couple of minor comments:
 # Please rename remote-app-log-dir.group to remote-app-log-dir.groupname or 
group-name. I want it to be explicit what the group refers to, since "group" 
alone carries little information. 
 # Please put the newly added LOG.debug under an if (LOG.isDebugEnabled()) 
check (see the sketch below).
 # Is it possible to add a test that, when a custom group is not set, the 
default one is used? If such a test already exists, please point me to it.

Thanks
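A rough sketch of the debug-guard and default-group-fallback points above, in code 
form; the configuration key, class and method names here are hypothetical, not 
necessarily what the attached patch uses:
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogDirGroupResolver {
  private static final Logger LOG =
      LoggerFactory.getLogger(LogDirGroupResolver.class);
  // Hypothetical key name, following the rename suggested above.
  static final String GROUP_KEY =
      "yarn.nodemanager.remote-app-log-dir.group-name";

  static String resolveGroup(Configuration conf) throws IOException {
    String group = conf.getTrimmed(GROUP_KEY, "");
    if (group.isEmpty()) {
      // Default behaviour: fall back to the primary group of the login user.
      group = UserGroupInformation.getLoginUser().getPrimaryGroupName();
    }
    if (LOG.isDebugEnabled()) {
      LOG.debug("Remote app log root dir will use group {}", group);
    }
    return group;
  }
}
{code}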

> JobHistoryServer can't delete aggregated files, if remote app root directory 
> is created by NodeManager
> --
>
> Key: YARN-9814
> URL: https://issues.apache.org/jira/browse/YARN-9814
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, yarn
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Minor
> Attachments: YARN-9814.001.patch, YARN-9814.002.patch, 
> YARN-9814.003.patch, YARN-9814.004.patch
>
>
> If remote-app-log-dir is not created before starting Yarn processes, the 
> NodeManager creates it during the init of AppLogAggregator service. In a 
> custom system the primary group of the yarn user (which starts the NM/RM 
> daemons) is not hadoop, but set to a more restricted group (say yarn). If 
> NodeManager creates the folder it derives the group of the folder from the 
> primary group of the login user (which is yarn:yarn in this case), thus 
> setting the root log folder and all its subfolders to the yarn group, 
> ultimately making it inaccessible to other processes - e.g. the 
> JobHistoryServer's AggregatedLogDeletionService.
> I suggest making this group configurable. If this new configuration is not 
> set, we can still stick to the existing behaviour. 
> Creating the root app-log-dir each time during the setup of this system is a 
> bit error prone, and an end user can easily forget it. I think the best place 
> for this step is the LogAggregationService, which was already responsible for 
> creating the folder.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9814) JobHistoryServer can't delete aggregated files, if remote app root directory is created by NodeManager

2019-09-16 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-9814:
-
Attachment: YARN-9814.004.patch

> JobHistoryServer can't delete aggregated files, if remote app root directory 
> is created by NodeManager
> --
>
> Key: YARN-9814
> URL: https://issues.apache.org/jira/browse/YARN-9814
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, yarn
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Minor
> Attachments: YARN-9814.001.patch, YARN-9814.002.patch, 
> YARN-9814.003.patch, YARN-9814.004.patch
>
>
> If remote-app-log-dir is not created before starting Yarn processes, the 
> NodeManager creates it during the init of AppLogAggregator service. In a 
> custom system the primary group of the yarn user (which starts the NM/RM 
> daemons) is not hadoop, but set to a more restricted group (say yarn). If 
> NodeManager creates the folder it derives the group of the folder from the 
> primary group of the login user (which is yarn:yarn in this case), thus 
> setting the root log folder and all its subfolders to the yarn group, 
> ultimately making it inaccessible to other processes - e.g. the 
> JobHistoryServer's AggregatedLogDeletionService.
> I suggest making this group configurable. If this new configuration is not 
> set, we can still stick to the existing behaviour. 
> Creating the root app-log-dir each time during the setup of this system is a 
> bit error prone, and an end user can easily forget it. I think the best place 
> for this step is the LogAggregationService, which was already responsible for 
> creating the folder.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9814) JobHistoryServer can't delete aggregated files, if remote app root directory is created by NodeManager

2019-09-16 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930269#comment-16930269
 ] 

Adam Antal commented on YARN-9814:
--

Thanks for the review [~Prabhu Joseph]. Indeed, you're right about this. Added 
the {{primaryGroup.isEmpty()}} part to the condition.

[~sunilg], could you please take a look at this and commit if you agree?

> JobHistoryServer can't delete aggregated files, if remote app root directory 
> is created by NodeManager
> --
>
> Key: YARN-9814
> URL: https://issues.apache.org/jira/browse/YARN-9814
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, yarn
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Minor
> Attachments: YARN-9814.001.patch, YARN-9814.002.patch, 
> YARN-9814.003.patch
>
>
> If remote-app-log-dir is not created before starting Yarn processes, the 
> NodeManager creates it during the init of AppLogAggregator service. In a 
> custom system the primary group of the yarn user (which starts the NM/RM 
> daemons) is not hadoop, but set to a more restricted group (say yarn). If 
> NodeManager creates the folder it derives the group of the folder from the 
> primary group of the login user (which is yarn:yarn in this case), thus 
> setting the root log folder and all its subfolders to the yarn group, 
> ultimately making it inaccessible to other processes - e.g. the 
> JobHistoryServer's AggregatedLogDeletionService.
> I suggest making this group configurable. If this new configuration is not 
> set, we can still stick to the existing behaviour. 
> Creating the root app-log-dir each time during the setup of this system is a 
> bit error prone, and an end user can easily forget it. I think the best place 
> for this step is the LogAggregationService, which was already responsible for 
> creating the folder.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org