[jira] [Commented] (YARN-9552) FairScheduler: NODE_UPDATE can cause NoSuchElementException

2019-09-19 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934080#comment-16934080
 ] 

Peter Bacsko commented on YARN-9552:


[~Steven Rand] it shouldn't be a big deal to create patches that apply to 
those branches. I'll see if there are any conflicts and upload patches for 
branch-3.2 and branch-3.1.

> FairScheduler: NODE_UPDATE can cause NoSuchElementException
> ---
>
> Key: YARN-9552
> URL: https://issues.apache.org/jira/browse/YARN-9552
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9552-001.patch, YARN-9552-002.patch, 
> YARN-9552-003.patch, YARN-9552-004.patch
>
>
> We observed a race condition inside YARN with the following stack trace:
> {noformat}
> 18/11/07 06:45:09.559 SchedulerEventDispatcher:Event Processor ERROR 
> EventDispatcher: Error in handling event type NODE_UPDATE to the Event 
> Dispatcher
> java.util.NoSuchElementException
> at 
> java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
> at 
> java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1373)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:353)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1094)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:961)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1183)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> This is basically the same as the one described in YARN-7382, but the root 
> cause is different.
> When we create an application attempt, we create an {{FSAppAttempt}} object. 
> This contains an {{AppSchedulingInfo}} which contains a set of 
> {{SchedulerRequestKey}}. Initially, this set is empty and only initialized a 
> bit later on a separate thread during a state transition:
> {noformat}
> 2019-05-07 15:58:02,659 INFO  [RM StateStore dispatcher] 
> recovery.RMStateStore (RMStateStore.java:transition(239)) - Storing info for 
> app: application_1557237478804_0001
> 2019-05-07 15:58:02,684 INFO  [RM Event dispatcher] rmapp.RMAppImpl 
> (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change 
> from NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED
> 2019-05-07 15:58:02,690 INFO  [SchedulerEventDispatcher:Event Processor] 
> fair.FairScheduler (FairScheduler.java:addApplication(490)) - Accepted 
> application application_1557237478804_0001 from user: bacskop, in queue: 
> root.bacskop, currently num of applications: 1
> 2019-05-07 15:58:02,698 INFO  [RM Event dispatcher] rmapp.RMAppImpl 
> (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change 
> from SUBMITTED to ACCEPTED on event = APP_ACCEPTED
> 2019-05-07 15:58:02,731 INFO  [RM Event dispatcher] 
> resourcemanager.ApplicationMasterService 
> (ApplicationMasterService.java:registerAppAttempt(434)) - Registering app 
> attempt : appattempt_1557237478804_0001_01
> 2019-05-07 15:58:02,732 INFO  [RM Event dispatcher] attempt.RMAppAttemptImpl 
> (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_01 
> State change from NEW to SUBMITTED on event = START
> 2019-05-07 15:58:02,746 INFO  [SchedulerEventDispatcher:Event Processor] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:(207)) - *** In the constructor of 
> SchedulerApplicationAttempt
> 2019-05-07 15:58:02,747 INFO  [SchedulerEventDispatcher:Event Processor] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:(230)) - *** Contents of 
> appSchedulingInfo: []
> 2019-05-07 15:58:02,752 INFO  [SchedulerEventDispatcher:Event Processor] 
> 

[jira] [Commented] (YARN-9826) Blocked threads at EntityGroupFSTimelineStore#getCachedStore

2019-09-19 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934068#comment-16934068
 ] 

Akira Ajisaka commented on YARN-9826:
-

bq. I don't see any side effects with that change.

There are no side effects, but there may be duplicate log creations.
I think we can use another lock object to avoid duplicate operations as follows:

{code}
  private final Object fsOpLock = new Object();
(snip)
    // Note that the content in the cache log storage may be stale.
    cacheItem = this.cachedLogs.get(groupId);
    // If the cache already exists, we don't need to hold any locks.
    if (cacheItem == null) {
      // Use lock to serialize fs operations
      synchronized (fsOpLock) {
        // Recheck cache to avoid duplicate fs operations
        cacheItem = this.cachedLogs.get(groupId);
        if (cacheItem == null) {
          LOG.debug("Set up new cache item for id {}", groupId);
          cacheItem = new EntityCacheItem(groupId, getConfig());
          AppLogs appLogs = getAndSetAppLogs(groupId.getApplicationId());
          if (appLogs != null) {
            LOG.debug("Set applogs {} for group id {}", appLogs, groupId);
            cacheItem.setAppLogs(appLogs);
            this.cachedLogs.put(groupId, cacheItem);
          } else {
            LOG.warn("AppLogs for groupId {} is set to null!", groupId);
          }
        }
      }
    }
{code}
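
For reference, here is a minimal self-contained sketch of the same double-checked 
pattern outside of the Timeline Service classes (SlowStoreCache and buildItem are 
made-up names for illustration; the real change would go into 
EntityGroupFSTimelineStore#getCachedStore):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: lookups on an already-populated cache stay lock-free,
// while the expensive build step is serialized by a dedicated lock.
public class SlowStoreCache {
  private final Map<String, Object> cache = new ConcurrentHashMap<>();
  private final Object buildLock = new Object();

  public Object get(String id) {
    Object item = cache.get(id);        // fast path, no lock held
    if (item == null) {
      synchronized (buildLock) {        // serialize the slow build
        item = cache.get(id);           // recheck to avoid duplicate builds
        if (item == null) {
          item = buildItem(id);         // stand-in for the fs-heavy work
          cache.put(id, item);
        }
      }
    }
    return item;
  }

  private Object buildItem(String id) { // stand-in for getAndSetAppLogs()
    return new Object();
  }
}
{code}

The key property is that threads hitting an existing cache entry never wait on the 
fs-heavy path; only the first request for a given id pays that cost.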

> Blocked threads at EntityGroupFSTimelineStore#getCachedStore
> 
>
> Key: YARN-9826
> URL: https://issues.apache.org/jira/browse/YARN-9826
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 2.7.3
>Reporter: Harunobu Daikoku
>Priority: Minor
>
> We have observed this case several times on our production cluster where 100s 
> of TimelineServer threads are blocked at the following synchronized block in 
> EntityGroupFSTimelineStore#getCachedStore when our HDFS NameNode is under 
> high load.
> {code:java}
> synchronized (this.cachedLogs) {
>   // Note that the content in the cache log storage may be stale.
>   cacheItem = this.cachedLogs.get(groupId);
>   if (cacheItem == null) {
> LOG.debug("Set up new cache item for id {}", groupId);
> cacheItem = new EntityCacheItem(groupId, getConfig());
> AppLogs appLogs = getAndSetAppLogs(groupId.getApplicationId());
> if (appLogs != null) {
>   LOG.debug("Set applogs {} for group id {}", appLogs, groupId);
>   cacheItem.setAppLogs(appLogs);
>   this.cachedLogs.put(groupId, cacheItem);
> } else {
>   LOG.warn("AppLogs for groupId {} is set to null!", groupId);
> }
>   }
> }
> {code}
> One thread inside the synchronized block performs multiple fs operations 
> (fs.exists) inside getAndSetAppLogs, which could block other threads when, 
> for instance, the NameNode RPC queue is full.
> One possible solution is to move getAndSetAppLogs outside the synchronized 
> block.






[jira] [Created] (YARN-9848) revert YARN-4946

2019-09-19 Thread Steven Rand (Jira)
Steven Rand created YARN-9848:
-

 Summary: revert YARN-4946
 Key: YARN-9848
 URL: https://issues.apache.org/jira/browse/YARN-9848
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation, resourcemanager
Reporter: Steven Rand


In YARN-4946, we've been discussing a revert due to the potential for keeping 
more applications in the state store than desired, and the potential to greatly 
increase RM recovery times.

 

I'm in favor of reverting the patch, but other ideas along the lines of 
YARN-9571 would work as well.






[jira] [Commented] (YARN-9552) FairScheduler: NODE_UPDATE can cause NoSuchElementException

2019-09-19 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934050#comment-16934050
 ] 

Steven Rand commented on YARN-9552:
---

This seems like an important fix since it prevents the RM from crashing – any 
chance we can backport it to the 3.2 and 3.1 maintenance releases?

> FairScheduler: NODE_UPDATE can cause NoSuchElementException
> ---
>
> Key: YARN-9552
> URL: https://issues.apache.org/jira/browse/YARN-9552
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9552-001.patch, YARN-9552-002.patch, 
> YARN-9552-003.patch, YARN-9552-004.patch
>
>
> We observed a race condition inside YARN with the following stack trace:
> {noformat}
> 18/11/07 06:45:09.559 SchedulerEventDispatcher:Event Processor ERROR 
> EventDispatcher: Error in handling event type NODE_UPDATE to the Event 
> Dispatcher
> java.util.NoSuchElementException
> at 
> java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
> at 
> java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1373)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:353)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1094)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:961)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1183)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> This is basically the same as the one described in YARN-7382, but the root 
> cause is different.
> When we create an application attempt, we create an {{FSAppAttempt}} object. 
> This contains an {{AppSchedulingInfo}} which contains a set of 
> {{SchedulerRequestKey}}. Initially, this set is empty and only initialized a 
> bit later on a separate thread during a state transition:
> {noformat}
> 2019-05-07 15:58:02,659 INFO  [RM StateStore dispatcher] 
> recovery.RMStateStore (RMStateStore.java:transition(239)) - Storing info for 
> app: application_1557237478804_0001
> 2019-05-07 15:58:02,684 INFO  [RM Event dispatcher] rmapp.RMAppImpl 
> (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change 
> from NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED
> 2019-05-07 15:58:02,690 INFO  [SchedulerEventDispatcher:Event Processor] 
> fair.FairScheduler (FairScheduler.java:addApplication(490)) - Accepted 
> application application_1557237478804_0001 from user: bacskop, in queue: 
> root.bacskop, currently num of applications: 1
> 2019-05-07 15:58:02,698 INFO  [RM Event dispatcher] rmapp.RMAppImpl 
> (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change 
> from SUBMITTED to ACCEPTED on event = APP_ACCEPTED
> 2019-05-07 15:58:02,731 INFO  [RM Event dispatcher] 
> resourcemanager.ApplicationMasterService 
> (ApplicationMasterService.java:registerAppAttempt(434)) - Registering app 
> attempt : appattempt_1557237478804_0001_01
> 2019-05-07 15:58:02,732 INFO  [RM Event dispatcher] attempt.RMAppAttemptImpl 
> (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_01 
> State change from NEW to SUBMITTED on event = START
> 2019-05-07 15:58:02,746 INFO  [SchedulerEventDispatcher:Event Processor] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:(207)) - *** In the constructor of 
> SchedulerApplicationAttempt
> 2019-05-07 15:58:02,747 INFO  [SchedulerEventDispatcher:Event Processor] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:(230)) - *** Contents of 
> appSchedulingInfo: []
> 2019-05-07 15:58:02,752 INFO  [SchedulerEventDispatcher:Event Processor] 
> fair.FairScheduler 

[jira] [Created] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-19 Thread Wang, Xinglong (Jira)
Wang, Xinglong created YARN-9847:


 Summary: ZKRMStateStore will cause zk connection loss when writing 
huge data into znode
 Key: YARN-9847
 URL: https://issues.apache.org/jira/browse/YARN-9847
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong


Recently, we encountered an RM ZK connection issue caused by the RM trying to 
write huge data into a znode. This makes ZooKeeper report a Len error, which then 
causes a ZK session connection loss. Eventually the RM crashes due to the ZK 
connection issue.

*The fix*

To protect the ResourceManager from crashing in this situation, this fix limits 
the size of the data stored per attempt by trimming the diagnostic info when 
writing ApplicationAttemptStateData into the znode. The size limit is regulated 
by -Djute.maxbuffer set in yarn-env.sh; the same value is also used by the 
ZooKeeper server.
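
As a rough sketch of the idea (not the actual patch), the diagnostics string could 
be trimmed to a byte budget derived from the same jute.maxbuffer setting before the 
attempt state is serialized; the DiagnosticsLimiter class and the truncation marker 
below are hypothetical:

{code:java}
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: cap the diagnostic info so the serialized
// ApplicationAttemptStateData stays under ZooKeeper's jute.maxbuffer limit.
public final class DiagnosticsLimiter {

  // ZooKeeper's default jute.maxbuffer is roughly 1 MB.
  private static final int DEFAULT_LIMIT = 1024 * 1024;

  public static String truncate(String diagnostics) {
    int limit = Integer.getInteger("jute.maxbuffer", DEFAULT_LIMIT);
    byte[] bytes = diagnostics.getBytes(StandardCharsets.UTF_8);
    if (bytes.length <= limit) {
      return diagnostics;
    }
    // Keep the tail of the diagnostics, which usually carries the most recent
    // (and most relevant) errors; this ignores the possibility of cutting a
    // multi-byte character at the split point.
    String marker = "...[diagnostics truncated]...";
    int keep = limit - marker.length();
    return marker + new String(bytes, bytes.length - keep, keep,
        StandardCharsets.UTF_8);
  }
}
{code}

In the real patch the limit would presumably be applied where the diagnostics are 
set on the attempt state, so every store/update path benefits.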

*The story*

ResourceManager Log
{code:java}
2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, unexpected 
error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

2019-07-29 04:27:35,459 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)
{code}


The ResourceManager will retry connecting to ZooKeeper until it exhausts the 
retry count and then gives up.

{code:java}
2019-07-29 02:25:06,404 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying 
operation on ZK. Retry no. 999


2019-07-29 02:25:06,718 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: 
Client will use GSSAPI as SASL mechanism.
2019-07-29 02:25:06,718 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 2019-07-29 02:25:06,404 INFO 

[jira] [Updated] (YARN-9834) Allow using a pool of local users to run Yarn Secure Container in secure mode

2019-09-19 Thread shanyu zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated YARN-9834:
--
Description: 
Yarn Secure Container in secure mode allows separation of different users' 
local files and container processes running on the same node manager. This 
depends on an out-of-band service such as SSSD/Winbind to sync all domain users 
to the local machine.

Winbind user sync has a lot of overhead, especially for large corporations. Also, 
when running Yarn inside a Kubernetes cluster (meaning node managers run inside 
Docker containers), it doesn't make sense for each container (running a node 
manager inside) to domain-join with Active Directory and sync a whole copy of 
the domain users.

We need a lightweight configuration to enable Yarn Secure Container as an 
alternative to AD domain join and SSSD/Winbind.

We should add a new configuration to Yarn such that we can pre-create a pool of 
users on each machine/Docker container. At runtime, Yarn allocates a local user 
to the domain user that submits the application. When all containers of that 
user are finished and all files belonging to that user are deleted, we can 
release the allocation and allow other users to use the same local user to run 
their Yarn containers.
h2. Design

We propose to extend LinuxContainerExecutor to support pool users in secure 
mode. LinuxContainerExecutor is the main class that, together with its 
accompanying classes and the container-executor binary, implements the Yarn 
Secure Container feature.

There are existing configurations like 
"yarn.nodemanager.linux-container-executor.nonsecure-mode.xxx"; we propose to 
add these new configurations for secure mode:
{code:java}
yarn.nodemanager.linux-container-executor.secure-mode.use-pool-user, defaults 
to false
yarn.nodemanager.linux-container-executor.secure-mode.pool-user-prefix, 
defaults to "user"
yarn.nodemanager.linux-container-executor.secure-mode.pool-user-count, defaults 
to -1, meaning the value of yarn.nodemanager.resource.cpu-vcores are used.
{code}
By default this feature is turned off. If we enable it with pool-user-prefix 
set to "user", then we expect pre-created local users user0 - usern, where the 
total number of local users equals pool-user-count. The default pool-user-count 
equals cpu-vcores, because in theory that is the maximum number of concurrent 
containers running on a given Yarn node manager.

We use an in-memory allocator to keep the domain-user-to-local-user mapping.

Now, when do we add the mapping and when do we remove it?

In the node manager, ApplicationImpl implements the state machine for a Yarn 
app's life cycle, but only if the app has at least one container running on that 
node manager. We can hook in the code to add the mapping during application 
initialization.

To remove the mapping, we need to wait for three things:

1) All applications of the same user are completed;
 2) All log handling of those applications (log aggregation or non-aggregated 
handling) is done;
 3) All pending FileDeletionTasks that use the user's identity are finished.

Note that all operations on these reference counts should be synchronized.

If all of the local users in the pool are allocated, we return "nonexistuser" 
as the run-as user; this causes the container to fail to execute, and Yarn will 
relaunch it on other nodes.

What about node manager restarts? During ResourceLocalizationService init, the 
service renames the root folders used by the node manager and schedules 
FileDeletionTasks to delete the content of these folders as the owner (local 
pool users) of these local files. To prevent newly launched Yarn containers from 
peeking into the yet-to-be-deleted old application folders right after a node 
manager restart, we can allocate these local pool users to the requested user in 
FileDeletionTask, which results in a call to incrementFileOpCount(). Therefore 
we allow local pool user allocation during the call to 
incrementFileOpCount(appUser): if appUser matches a local pool user, we allocate 
that user to the same-named appUser, preventing new containers from reusing the 
same local pool user until all the FileDeletionTasks belonging to that user are 
done.
h2. Limitations

1) This feature does not support the PRIVATE resource visibility type. Because 
PRIVATE resources are potentially cached in the node manager for a very long 
time, supporting it would be a security problem: a user might be able to peek 
into a previous user's PRIVATE resources. We can modify the code to treat all 
PRIVATE resources as APPLICATION type.

2) It is recommended to enable DominantResourceCalculator so that no more than 
"cpu-vcores" concurrent containers run on a node manager:
{code:java}
yarn.scheduler.capacity.resource-calculator
= org.apache.hadoop.yarn.util.resource.DominantResourceCalculator {code}
3) Currently this feature does not work with 

[jira] [Updated] (YARN-9834) Allow using a pool of local users to run Yarn Secure Container in secure mode

2019-09-19 Thread shanyu zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated YARN-9834:
--
Description: 
Yarn Secure Container in secure mode allows separation of different users' 
local files and container processes running on the same node manager. This 
depends on an out-of-band service such as SSSD/Winbind to sync all domain users 
to the local machine.

Winbind user sync has a lot of overhead, especially for large corporations. Also, 
when running Yarn inside a Kubernetes cluster (meaning node managers run inside 
Docker containers), it doesn't make sense for each container to domain-join with 
Active Directory and sync a whole copy of the domain users.

We should add a new configuration to Yarn such that we can pre-create a pool of 
users on each machine/Docker container. At runtime, Yarn allocates a local user 
to the domain user that submits the application. When all containers of that 
user are finished and all files belonging to that user are deleted, we can 
release the allocation and allow other users to use the same local user to run 
their Yarn containers.
h2. Design

We propose to add these new configurations:
{code:java}
yarn.nodemanager.linux-container-executor.secure-mode.use-pool-user, defaults 
to false
yarn.nodemanager.linux-container-executor.secure-mode.pool-user-prefix, 
defaults to "user"
yarn.nodemanager.linux-container-executor.secure-mode.pool-user-count, defaults 
to -1, meaning the value of yarn.nodemanager.resource.cpu-vcores are used.
{code}
By default this feature is turned off. If we enable it with pool-user-prefix 
set to "user", then we expect pre-created local users user0 - usern, where the 
total number of local users equals pool-user-count. The default pool-user-count 
equals cpu-vcores, because in theory that is the maximum number of concurrent 
containers running on a given Yarn node manager.

We use an in-memory allocator to keep the domain-user-to-local-user mapping.

Now, when do we add the mapping and when do we remove it?

In the node manager, ApplicationImpl implements the state machine for a Yarn 
app's life cycle, but only if the app has at least one container running on that 
node manager. We can hook in the code to add the mapping during application 
initialization.

To remove the mapping, we need to wait for three things:

1) All applications of the same user are completed;
 2) All log handling of those applications (log aggregation or non-aggregated 
handling) is done;
 3) All pending FileDeletionTasks that use the user's identity are finished.

Note that all operations on these reference counts should be synchronized.

If all of the local users in the pool are allocated, we return "nonexistuser" 
as the run-as user; this causes the container to fail to execute, and Yarn will 
relaunch it on other nodes.

What about node manager restarts? During ResourceLocalizationService init, the 
service renames the root folders used by the node manager and schedules 
FileDeletionTasks to delete the content of these folders as the owner (local 
pool users) of these local files. To prevent newly launched Yarn containers from 
peeking into the yet-to-be-deleted old application folders right after a node 
manager restart, we can allocate these local pool users to the requested user in 
FileDeletionTask, which results in a call to incrementFileOpCount(). Therefore 
we allow local pool user allocation during the call to 
incrementFileOpCount(appUser): if appUser matches a local pool user, we allocate 
that user to the same-named appUser, preventing new containers from reusing the 
same local pool user until all the FileDeletionTasks belonging to that user are 
done.
h2. Limitations

1) This feature does not support the PRIVATE resource visibility type. Because 
PRIVATE resources are potentially cached in the node manager for a very long 
time, supporting it would be a security problem: a user might be able to peek 
into a previous user's PRIVATE resources. We can modify the code to treat all 
PRIVATE resources as APPLICATION type.

2) It is recommended to enable DominantResourceCalculator so that no more than 
"cpu-vcores" concurrent containers run on a node manager:
{code:java}
yarn.scheduler.capacity.resource-calculator
= org.apache.hadoop.yarn.util.resource.DominantResourceCalculator {code}
3) Currently this feature does not work with Yarn Node Manager recovery. Because 
the mappings are kept only in memory, they cannot be recovered after a node 
manager restart.

 

  was:
Yarn Secure Container in secure mode allows separation of different user's 
local files and container processes running on the same node manager. This 
depends on an out of band service such as SSSD/Winbind to sync all domain users 
to local machine.

Winbind user sync has lots of overhead, especially for large corporations. Also 
if running Yarn inside 

[jira] [Commented] (YARN-9846) Use Finer-Grain Synchronization in ResourceLocalizationService

2019-09-19 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933871#comment-16933871
 ] 

Hadoop QA commented on YARN-9846:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
22s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
 5s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 25s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
21s{color} | {color:green} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 0 new + 101 unchanged - 4 fixed = 101 total (was 105) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 43s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 20m 56s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 75m 20s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService
 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.2 Server=19.03.2 Image:yetus/hadoop:39e82acc485 |
| JIRA Issue | YARN-9846 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12980791/YARN-9846.2.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux eb9e38ef5616 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 126ef77 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/24811/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
 |
|  Test Results 

[jira] [Commented] (YARN-9737) Performance degradation, Distributed Opportunistic Scheduling

2019-09-19 Thread Konstantinos Karanasos (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933869#comment-16933869
 ] 

Konstantinos Karanasos commented on YARN-9737:
--

Hi [~Babbleshack], just saw this.

The performance of opportunistic containers depends on a lot of things -- here 
are a few to consider:
 * Note that distributed scheduling is most probably not what affects you here, 
but instead the use of opportunistic containers. So, I am pretty sure you would 
get the same results for centralized scheduling of opportunistic containers.
 * Opportunistic containers can be killed by guaranteed containers, so their 
execution is much more sensitive as cluster utilization increases. If your 
gridmix script ends up driving cluster utilization high, you might see excessive 
killing or queuing, hence the slowdown you observe.
 * Along the same lines, you might want to decrease the number of concurrent 
applications, especially if your utilization is high. Opportunistic containers 
do not count towards actually used resources for that matter, so if you are not 
careful you will end up launching too many jobs, therefore too many AMs, which 
use guaranteed containers and will be killing running opportunistic ones.
 * The fact that you allow at most two containers per node does not help either 
(and it is not a very common set-up in practice). It means that when a 
guaranteed container arrives at that node, at least 50% of the node's 
opportunistic containers will be killed (just by killing a single container).
 * You might also try to see what happens if you decrease the size of your 
queue to 0 or 1 (you will not avoid killing but you will avoid queuing).
 * How are you setting the percentage of opportunistic containers? I guess in 
the AM? Note that even when you set it to 100%, the AM will still be launched 
with a guaranteed container.

 

Hope this helps. Then there are a lot of other things to tune for the 
scheduling of opportunistic containers, but I would start from the above list.

cc: [~abmodi], who is currently working actively on opportunistic containers.

> Performance degradation, Distributed Opportunistic Scheduling
> -
>
> Key: YARN-9737
> URL: https://issues.apache.org/jira/browse/YARN-9737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-scheduling, yarn
>Affects Versions: 3.1.2
> Environment: OS: Ubuntu 18.04
>  JVM: 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03
>  1 * Resource Manager – Intel Core i7-4770 CPU @ 3.40GHz, 16GB Memory, 256GB 
> ssd.
>  37 * Node Managers - Intel Core i7-4770 CPU @ 3.40GHz, 8GB Memory, 256GB 
> ssd. 
>  2 * 3.5 Gb slots per Node Manager, 1x cpu per slot
> yarn-site: [^yarn-site.xml]
>  yarn-client-yarn-site: [^yarn-client.yarn-site.xml]
>  
>Reporter: Babble Shack
>Priority: Major
>  Labels: performance, scheduler, scheduling
> Attachments: jct_cdf_100j_100t_1500.svg, 
> jct_cdf_100j_50t_1500_with_outliers.svg, jet_boxplot_j100_50t_1500.svg, 
> jet_boxplot_j100_50t_1500_with_outliers.svg, 
> task_throughput_boxplot_100j_50t_1500.svg, yarn-client.yarn-site.xml, 
> yarn-site.xml
>
>
> Opportunistic scheduling is supposed to provide lower scheduling time, and 
> thus higher task throughput and lower job completion times for short 
> jobs/tasks.
> Through my experiments I have found distributed scheduling can degrade 
> performance.
> I ran a gridmix trace of 100 short jobs, each with 50 tasks. Average task run 
> time was 1523ms.
> Findings:
>  * Job completion time, the time take from submitting a job to job 
> completion, may degrade by over 200%
>  [^jct_cdf_100j_100t_1500.svg]
>  [^jct_cdf_100j_50t_1500_with_outliers.svg]
>  * Job execution time may increase by up to 300%
>  [^jet_boxplot_j100_50t_1500.svg]
>  [^jet_boxplot_j100_50t_1500_with_outliers.svg]
>  * Task throughput decreased by 100%
>  ^[^task_throughput_boxplot_100j_50t_1500.svg]^






[jira] [Resolved] (YARN-6684) TestAMRMClient tests fail on branch-2.7

2019-09-19 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung resolved YARN-6684.
-
Resolution: Won't Fix

branch-2.7 EOL, closing as won't fix

> TestAMRMClient tests fail on branch-2.7
> ---
>
> Key: YARN-6684
> URL: https://issues.apache.org/jira/browse/YARN-6684
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Priority: Major
>
> {noformat}2017-06-01 19:10:44,362 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:addNode(1335)) - Added node 
> jhung-ld2.linkedin.biz:58205 clusterResource: 
> 2017-06-01 19:10:44,370 INFO  server.MiniYARNCluster 
> (MiniYARNCluster.java:waitForNodeManagersToConnect(657)) - All Node Managers 
> connected in MiniYARNCluster
> 2017-06-01 19:10:44,376 INFO  client.RMProxy (RMProxy.java:createRMProxy(98)) 
> - Connecting to ResourceManager at jhung-ld2.linkedin.biz/ipaddr:36167
> 2017-06-01 19:10:45,501 INFO  ipc.Client 
> (Client.java:handleConnectionFailure(872)) - Retrying connect to server: 
> jhung-ld2.linkedin.biz/ipaddr:36167. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2017-06-01 19:10:46,502 INFO  ipc.Client 
> (Client.java:handleConnectionFailure(872)) - Retrying connect to server: 
> jhung-ld2.linkedin.biz/ipaddr:36167. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2017-06-01 19:10:47,503 INFO  ipc.Client 
> (Client.java:handleConnectionFailure(872)) - Retrying connect to server: 
> jhung-ld2.linkedin.biz/ipaddr:36167. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2017-06-01 19:10:48,504 INFO  ipc.Client 
> (Client.java:handleConnectionFailure(872)) - Retrying connect to server: 
> jhung-ld2.linkedin.biz/ipaddr:36167. Already tried 3 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS){noformat}
> After some investigation, seems it is the same issue as described here: 
> HDFS-11893






[jira] [Resolved] (YARN-8825) Print application tags in ApplicationSummary

2019-09-19 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung resolved YARN-8825.
-
Resolution: Duplicate

> Print application tags in ApplicationSummary
> 
>
> Key: YARN-8825
> URL: https://issues.apache.org/jira/browse/YARN-8825
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>
> Useful for tracking application tag metadata.






[jira] [Updated] (YARN-9846) Use Finer-Grain Synchronization in ResourceLocalizationService

2019-09-19 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated YARN-9846:
-
Attachment: YARN-9846.2.patch

> Use Finer-Grain Synchronization in ResourceLocalizationService
> --
>
> Key: YARN-9846
> URL: https://issues.apache.org/jira/browse/YARN-9846
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
> Attachments: YARN-9846.1.patch, YARN-9846.2.patch
>
>
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java#L788
> # Remove these synchronization blocks
> # Ensure {{recentlyCleanedLocalizers}} is thread safe






[jira] [Updated] (YARN-9846) Use Finer-Grain Synchronization in ResourceLocalizationService

2019-09-19 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated YARN-9846:
-
Attachment: (was: YARN-9846.2.patch)

> Use Finer-Grain Synchronization in ResourceLocalizationService
> --
>
> Key: YARN-9846
> URL: https://issues.apache.org/jira/browse/YARN-9846
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
> Attachments: YARN-9846.1.patch
>
>
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java#L788
> # Remove these synchronization blocks
> # Ensure {{recentlyCleanedLocalizers}} is thread safe






[jira] [Updated] (YARN-9846) Use Finer-Grain Synchronization in ResourceLocalizationService

2019-09-19 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated YARN-9846:
-
Attachment: YARN-9846.2.patch

> Use Finer-Grain Synchronization in ResourceLocalizationService
> --
>
> Key: YARN-9846
> URL: https://issues.apache.org/jira/browse/YARN-9846
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
> Attachments: YARN-9846.1.patch, YARN-9846.2.patch
>
>
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java#L788
> # Remove these synchronization blocks
> # Ensure {{recentlyCleanedLocalizers}} is thread safe






[jira] [Updated] (YARN-7410) Cleanup FixedValueResource to avoid dependency to ResourceUtils

2019-09-19 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-7410:

Fix Version/s: 2.10.0

> Cleanup FixedValueResource to avoid dependency to ResourceUtils
> ---
>
> Key: YARN-7410
> URL: https://issues.apache.org/jira/browse/YARN-7410
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.0
>Reporter: Sunil Govindan
>Assignee: Wangda Tan
>Priority: Major
> Fix For: 3.0.0, 2.10.0
>
> Attachments: YARN-7410.001.patch, YARN-7410.002.patch, 
> YARN-7410.003.patch, YARN-7410.branch-3.0.003.patch
>
>
> After YARN-7307, Client/AM don't need to keep a up-to-dated resource-type.xml 
> in the classpath. Instead, they can use YarnClient/ApplicationMasterProtocol 
> APIs to get the resource types from RM and refresh local types.
> One biggest issue of this approach is FixedValueResource: Since we initialize 
> FixedValueResource in static block, and they won't be updated if resource 
> types refreshed.
> So we need to properly update FixedValueResource to make it can get 
> up-to-date results






[jira] [Commented] (YARN-7410) Cleanup FixedValueResource to avoid dependency to ResourceUtils

2019-09-19 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933827#comment-16933827
 ] 

Jonathan Hung commented on YARN-7410:
-

Committed to branch-2.

> Cleanup FixedValueResource to avoid dependency to ResourceUtils
> ---
>
> Key: YARN-7410
> URL: https://issues.apache.org/jira/browse/YARN-7410
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.0
>Reporter: Sunil Govindan
>Assignee: Wangda Tan
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: YARN-7410.001.patch, YARN-7410.002.patch, 
> YARN-7410.003.patch, YARN-7410.branch-3.0.003.patch
>
>
> After YARN-7307, Client/AM don't need to keep a up-to-dated resource-type.xml 
> in the classpath. Instead, they can use YarnClient/ApplicationMasterProtocol 
> APIs to get the resource types from RM and refresh local types.
> One biggest issue of this approach is FixedValueResource: Since we initialize 
> FixedValueResource in static block, and they won't be updated if resource 
> types refreshed.
> So we need to properly update FixedValueResource to make it can get 
> up-to-date results






[jira] [Resolved] (YARN-9844) TestCapacitySchedulerPerf test errors in branch-2

2019-09-19 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung resolved YARN-9844.
-
Resolution: Fixed

> TestCapacitySchedulerPerf test errors in branch-2
> -
>
> Key: YARN-9844
> URL: https://issues.apache.org/jira/browse/YARN-9844
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, yarn
>Affects Versions: 2.10.0
>Reporter: Jim Brennan
>Assignee: Jonathan Hung
>Priority: Major
>
> These TestCapacitySchedulerPerf throughput tests are failing in branch-2:
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}






[jira] [Commented] (YARN-9846) Use Finer-Grain Synchronization in ResourceLocalizationService

2019-09-19 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933823#comment-16933823
 ] 

Hadoop QA commented on YARN-9846:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
37s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 43s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 24s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 3 new + 101 unchanged - 4 fixed = 104 total (was 105) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 1s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 50s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m 
15s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
35s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 79m  9s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:39e82acc485 |
| JIRA Issue | YARN-9846 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12980775/YARN-9846.1.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux e765fafcba7f 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 
16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 126ef77 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/24810/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/24810/testReport/ |
| Max. process+thread count | 448 (vs. ulimit 

[jira] [Commented] (YARN-9844) TestCapacitySchedulerPerf test errors in branch-2

2019-09-19 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933805#comment-16933805
 ] 

Jonathan Hung commented on YARN-9844:
-

Thanks for the report [~Jim_Brennan]. The LightWeightResource#resources array 
was not initialized properly, since it gets its length from 
ResourceUtils#getNumberOfKnownResourceTypes, which is not updated by 
ResourceUtils#initializeResourcesFromResourceInformationMap. This was fixed as 
part of YARN-7410; I'll port that patch to branch-2.
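
For context, the failure pattern is roughly the following (an illustrative toy 
example, not the actual LightWeightResource/ResourceUtils code): an array is sized 
from a resource-type count captured earlier and later indexed with a larger, 
refreshed count:

{code:java}
// Toy illustration of the stale-length pattern behind the
// ArrayIndexOutOfBoundsException; not the real Hadoop classes.
public class StaleLengthExample {

  static int knownResourceTypes = 2;                 // e.g. memory + vcores

  // Sized once, using the count known at construction time.
  static long[] resources = new long[knownResourceTypes];

  public static void main(String[] args) {
    knownResourceTypes = 4;                          // extra types registered later,
                                                     // but the array is never resized
    for (int i = 0; i < knownResourceTypes; i++) {
      resources[i] = i;                              // fails with AIOOBE at i == 2
    }
  }
}
{code}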

> TestCapacitySchedulerPerf test errors in branch-2
> -
>
> Key: YARN-9844
> URL: https://issues.apache.org/jira/browse/YARN-9844
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, yarn
>Affects Versions: 2.10.0
>Reporter: Jim Brennan
>Assignee: Jonathan Hung
>Priority: Major
>
> These TestCapacitySchedulerPerf throughput tests are failing in branch-2:
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}






[jira] [Assigned] (YARN-9844) TestCapacitySchedulerPerf test errors in branch-2

2019-09-19 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung reassigned YARN-9844:
---

Assignee: Jonathan Hung

> TestCapacitySchedulerPerf test errors in branch-2
> -
>
> Key: YARN-9844
> URL: https://issues.apache.org/jira/browse/YARN-9844
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, yarn
>Affects Versions: 2.10.0
>Reporter: Jim Brennan
>Assignee: Jonathan Hung
>Priority: Major
>
> These TestCapacitySchedulerPerf throughput tests are failing in branch-2:
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}






[jira] [Commented] (YARN-9845) Update LocalResourcesTrackerImpl to Use Java 8 Map Concurrent API

2019-09-19 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933769#comment-16933769
 ] 

Hadoop QA commented on YARN-9845:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 23m 
50s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
5s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 27s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
33s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 0 new + 14 unchanged - 1 fixed = 14 total (was 15) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 38s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 22m 
21s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}104m 58s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:39e82acc485 |
| JIRA Issue | YARN-9845 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12980764/YARN-9845.1.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux a8c1684170b1 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 126ef77 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/24809/testReport/ |
| Max. process+thread count | 413 (vs. ulimit of 5500) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |

[jira] [Updated] (YARN-7860) Fix UT failure TestRMWebServiceAppsNodelabel#testAppsRunning

2019-09-19 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7860:
-
Fix Version/s: 2.10.0

> Fix UT failure TestRMWebServiceAppsNodelabel#testAppsRunning
> 
>
> Key: YARN-7860
> URL: https://issues.apache.org/jira/browse/YARN-7860
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Sunil Govindan
>Priority: Major
> Fix For: 3.1.0, 2.10.0
>
> Attachments: YARN-7860.001.patch
>
>
> {{TestRMWebServiceAppsNodelabel#testAppsRunning}} is failing since YARN-7817.






[jira] [Updated] (YARN-7817) Add Resource reference to RM's NodeInfo object so REST API can get non memory/vcore resource usages.

2019-09-19 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7817:
-
Fix Version/s: 2.10.0

> Add Resource reference to RM's NodeInfo object so REST API can get non 
> memory/vcore resource usages.
> 
>
> Key: YARN-7817
> URL: https://issues.apache.org/jira/browse/YARN-7817
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sumana Sathish
>Assignee: Sunil Govindan
>Priority: Major
> Fix For: 3.1.0, 2.10.0
>
> Attachments: Screen Shot 2018-01-25 at 11.59.31 PM.png, 
> YARN-7817.001.patch, YARN-7817.002.patch, YARN-7817.003.patch, 
> YARN-7817.004.patch, YARN-7817.005.patch
>
>







[jira] [Commented] (YARN-9773) Add QueueMetrics for Custom Resources

2019-09-19 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933733#comment-16933733
 ] 

Eric Payne commented on YARN-9773:
--

Thanks [~maniraj...@gmail.com] for the updated patch. Here are a couple of 
other comments.
 - All of the private static strings at the beginning of QueueMetrics should be 
final.
 - QueueMetrics#registerCustomResources:
 -- Can the code be refactored to call {{this.registry.get(metricPrefix + 
resourceName)}} only once? (A sketch follows at the end of this comment.)
 - Please add unit tests.
 -- Perhaps something in {{TestQueueMetricsForCustomResources}}? There may be 
helper functions available in {{MetricsAsserts}}, or maybe you could sub-class 
{{QueueMetrics}} from within {{TestQueueMetricsForCustomResources}} to access 
{{QueueMetrics#registry}} directly.
 - One minor nit: I would find it more readable if you were to use the full 80 
characters, if possible, rather than splitting lines up when it's not 
necessary. For example:
{code:java|title=QueueMetrics#registerCustomResources}
+  String resourceName =
+entry.getKey();
{code}
Could be
{code:java}
+  String resourceName = entry.getKey();
{code}
Please review the rest of the code for other opportunities to do this.

Once these changes are made, I would like to figure out how to backport this to 
previous branches. I think that includes backporting YARN-8842 and YARN-8750. I 
will look into this.
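
In case it helps, a minimal sketch of the single-lookup idea from the second 
bullet above; a plain map and AtomicLong stand in for the metrics registry and 
gauge types, so this is illustrative only, not the actual QueueMetrics code.
{code:java|title=Single-lookup sketch (stand-in types)}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class SingleLookupSketch {
  // Stand-in for QueueMetrics#registry; AtomicLong stands in for a mutable gauge.
  private final Map<String, AtomicLong> registry = new HashMap<>();

  void setCustomResourceMetric(String metricPrefix, String resourceName, long value) {
    String metricName = metricPrefix + resourceName;
    // Look the metric up once and reuse the result, instead of calling
    // registry.get(metricPrefix + resourceName) again for each use.
    AtomicLong gauge = registry.get(metricName);
    if (gauge == null) {
      gauge = new AtomicLong();
      registry.put(metricName, gauge);
    }
    gauge.set(value);
  }
}
{code}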

> Add QueueMetrics for Custom Resources
> -
>
> Key: YARN-9773
> URL: https://issues.apache.org/jira/browse/YARN-9773
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-9773.001.patch, YARN-9773.002.patch
>
>
> Although the custom resource metrics are calculated and saved as a 
> QueueMetricsForCustomResources object within the QueueMetrics class, the JMX 
> and Simon QueueMetrics do not report that information for custom resources. 






[jira] [Updated] (YARN-9846) Use Finer-Grain Synchronization in ResourceLocalizationService

2019-09-19 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated YARN-9846:
-
Attachment: YARN-9846.1.patch

> Use Finer-Grain Synchronization in ResourceLocalizationService
> --
>
> Key: YARN-9846
> URL: https://issues.apache.org/jira/browse/YARN-9846
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
> Attachments: YARN-9846.1.patch
>
>
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java#L788
> # Remove these synchronization blocks
> # Ensure {{recentlyCleanedLocalizers}} is thread safe






[jira] [Updated] (YARN-9846) Use Finer-Grain Synchronization in ResourceLocalizationService

2019-09-19 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated YARN-9846:
-
Summary: Use Finer-Grain Synchronization in ResourceLocalizationService  
(was: User Finer-Grain Synchronization in ResourceLocalizationService)

> Use Finer-Grain Synchronization in ResourceLocalizationService
> --
>
> Key: YARN-9846
> URL: https://issues.apache.org/jira/browse/YARN-9846
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java#L788
> # Remove these synchronization blocks
> # Ensure {{recentlyCleanedLocalizers}} is thread safe






[jira] [Updated] (YARN-9846) User Finer-Grain Synchronization in ResourceLocalizationService

2019-09-19 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated YARN-9846:
-
Summary: User Finer-Grain Synchronization in ResourceLocalizationService  
(was: User Finer-Grain Synchronization in ResourceLocalizationService.java)

> User Finer-Grain Synchronization in ResourceLocalizationService
> ---
>
> Key: YARN-9846
> URL: https://issues.apache.org/jira/browse/YARN-9846
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java#L788
> # Remove these synchronization blocks
> # Ensure {{recentlyCleanedLocalizers}} is thread safe






[jira] [Updated] (YARN-9846) User Finer-Grain Synchronization in ResourceLocalizationService.java

2019-09-19 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated YARN-9846:
-
Summary: User Finer-Grain Synchronization in 
ResourceLocalizationService.java  (was: User Fineer-Grain Synchronization in 
ResourceLocalizationService.java)

> User Finer-Grain Synchronization in ResourceLocalizationService.java
> 
>
> Key: YARN-9846
> URL: https://issues.apache.org/jira/browse/YARN-9846
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java#L788
> # Remove these synchronization blocks
> # Ensure {{recentlyCleanedLocalizers}} is thread safe






[jira] [Updated] (YARN-9846) User Fineer-Grain Synchronization in ResourceLocalizationService.java

2019-09-19 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated YARN-9846:
-
Flags: Patch

> User Fineer-Grain Synchronization in ResourceLocalizationService.java
> -
>
> Key: YARN-9846
> URL: https://issues.apache.org/jira/browse/YARN-9846
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java#L788
> # Remove these synchronization blocks
> # Ensure {{recentlyCleanedLocalizers}} is thread safe






[jira] [Created] (YARN-9846) User Fineer-Grain Synchronization in ResourceLocalizationService.java

2019-09-19 Thread David Mollitor (Jira)
David Mollitor created YARN-9846:


 Summary: User Fineer-Grain Synchronization in 
ResourceLocalizationService.java
 Key: YARN-9846
 URL: https://issues.apache.org/jira/browse/YARN-9846
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 3.2.0
Reporter: David Mollitor
Assignee: David Mollitor


https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java#L788

# Remove these synchronization blocks
# Ensure {{recentlyCleanedLocalizers}} is thread safe
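
A rough sketch of item 2, assuming {{recentlyCleanedLocalizers}} is currently a 
plain collection guarded by those synchronized blocks; with a concurrent set, 
single add/contains/remove calls are thread safe without external locking 
(compound check-then-act sequences would still need care).
{code:java|title=Concurrent-set sketch (assumed field shape)}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class RecentlyCleanedSketch {
  // Thread-safe set; no synchronized block needed around single operations.
  private final Set<String> recentlyCleanedLocalizers = ConcurrentHashMap.newKeySet();

  public void markCleaned(String localizerId) {
    recentlyCleanedLocalizers.add(localizerId);
  }

  public boolean wasRecentlyCleaned(String localizerId) {
    return recentlyCleanedLocalizers.contains(localizerId);
  }
}
{code}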








[jira] [Commented] (YARN-9760) Support configuring application priorities on a workflow level

2019-09-19 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933707#comment-16933707
 ] 

Hadoop QA commented on YARN-9760:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 37m 
22s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m  
5s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  4m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
18m 56s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  6m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
32s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
21s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 16m 
18s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
2m 25s{color} | {color:orange} root: The patch generated 20 new + 534 unchanged 
- 3 fixed = 554 total (was 537) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  4m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
5s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 50s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  7m  
8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
44s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 10m 
21s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
53s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
52s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 89m 10s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}119m 28s{color} 
| {color:red} hadoop-mapreduce-client-jobclient in the patch failed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  1m 
12s{color} | {color:red} The patch generated 1 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}379m 53s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.2 Server=19.03.2 

[jira] [Commented] (YARN-7817) Add Resource reference to RM's NodeInfo object so REST API can get non memory/vcore resource usages.

2019-09-19 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933698#comment-16933698
 ] 

Eric Payne commented on YARN-7817:
--

bq. It seems this patch breaks YARN-7860, so please port that patch in addition 
to this one.
Yup. Thanks [~jhung] for the reminder.

> Add Resource reference to RM's NodeInfo object so REST API can get non 
> memory/vcore resource usages.
> 
>
> Key: YARN-7817
> URL: https://issues.apache.org/jira/browse/YARN-7817
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sumana Sathish
>Assignee: Sunil Govindan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: Screen Shot 2018-01-25 at 11.59.31 PM.png, 
> YARN-7817.001.patch, YARN-7817.002.patch, YARN-7817.003.patch, 
> YARN-7817.004.patch, YARN-7817.005.patch
>
>







[jira] [Commented] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-19 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933687#comment-16933687
 ] 

Chandni Singh commented on YARN-9839:
-

The root cause of this issue was an OS-level configuration that was not letting 
the OS overcommit virtual memory. The NM was not able to create more than 800 
threads because the kernel refused the vmem allocation.

However, the code in {{ResourceLocalizationService}} is quite old. For every 
container localization request, this service creates a new {{LocalizerRunner}} 
native thread, which is expensive.

It doesn't make use of an {{ExecutorService}} or thread pools, which can reuse 
previously constructed threads when they are available and only create new ones 
when needed.

This class needs refactoring and I would like to use this jira to do that.

cc. [~eyang] 
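
To make the thread-pool suggestion concrete, here is a minimal sketch, assuming 
the per-request {{LocalizerRunner}} work can be submitted as a task; the names 
and the choice of a cached pool are illustrative, not the actual refactoring.
{code:java|title=Thread-pool sketch (illustrative only)}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LocalizerPoolSketch {
  // A cached pool reuses idle threads and creates new ones only when needed,
  // instead of starting a fresh native thread per localization request.
  private final ExecutorService localizerPool = Executors.newCachedThreadPool();

  public void onLocalizationRequest(Runnable localizerRunner) {
    localizerPool.submit(localizerRunner);
  }

  public void stop() {
    localizerPool.shutdownNow();
  }
}
{code}
A bounded pool (e.g. Executors.newFixedThreadPool) would additionally cap the 
number of concurrent localizer threads, at the cost of queueing requests.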


> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -
>
> Key: YARN-9839
> URL: https://issues.apache.org/jira/browse/YARN-9839
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u ' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
> at org.apache.hadoop.util.Shell.run(Shell.java:482)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, there is a {{LocalizerRunner}} 
> thread created and each {{LocalizerRunner}} creates another thread to get 
> file permission info which is where we see this failure from. It is in 
> Shell.java -> {{runCommand()}}
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
> try {
>   String line = errReader.readLine();
>   while((line != null) && !isInterrupted()) {
> errMsg.append(line);
> errMsg.append(System.getProperty("line.separator"));
> line = errReader.readLine();
>   }
> } catch(IOException ioe) {
>   LOG.warn("Error reading the error stream", ioe);
> }
>   }
> };
> {code}






[jira] [Comment Edited] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-19 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933687#comment-16933687
 ] 

Chandni Singh edited comment on YARN-9839 at 9/19/19 7:03 PM:
--

The root cause of this issue was an OS-level configuration that was not letting 
the OS overcommit virtual memory. The NM was not able to create more than 800 
threads because the kernel refused the vmem allocation.

However, the code in {{ResourceLocalizationService}} is quite old. For every 
container localization request, this service creates a new {{LocalizerRunner}} 
native thread, which is expensive.

It doesn't make use of an {{ExecutorService}} or thread pools, which can reuse 
previously constructed threads when they are available and only create new ones 
when needed.

This class needs refactoring and I would like to use this jira to do that.

cc. [~eyang] 



was (Author: csingh):
The root cause of this issue was an OS level configuration  which was not 
letting OS to overcommit virtual memory. NM was not able to create more than 
800 threads because kernel refused vmem allocation.

However the code here in {{ResourceLocalizationService}} is quite old.  For 
every container localization request, this service creates a new 
{{LocalizerRunner}} native thread. This is expensive. 
 
It doesn't make use of {{ExcecutorService}} or {{Threadpools}} which can reuse 
previously constructed threads when they are available and only creates new 
when needed.

This class needs a refactoring and I would like to use this jira to do that.

cc. [~eyang] 


> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -
>
> Key: YARN-9839
> URL: https://issues.apache.org/jira/browse/YARN-9839
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u ' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
> at org.apache.hadoop.util.Shell.run(Shell.java:482)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, there is a {{LocalizerRunner}} 
> thread created and each {{LocalizerRunner}} creates another thread to get 
> file permission info which is where we see this failure from. It is in 
> Shell.java -> {{runCommand()}}
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
> try {
>   String line = errReader.readLine();
>   while((line != null) && !isInterrupted()) {
> errMsg.append(line);
> errMsg.append(System.getProperty("line.separator"));
> line = errReader.readLine();
>   }
> } catch(IOException ioe) {
>   LOG.warn("Error reading the error stream", ioe);
> }
>   }
> };
> {code}




[jira] [Updated] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-19 Thread Chandni Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9839:

Description: 
NM fails with the below error even though the ulimit for NM is large.

{code}
2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
java.lang.OutOfMemoryError: unable to create new native thread. One possible 
reason is that ulimit setting of 'max user processes' is too low. If so, do 
'ulimit -u ' and try again.
2019-09-12 10:27:46,348 FATAL 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[LocalizerRunner for container_e95_1568242982456_152026_01_000132,5,main] 
threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
{code}

For each container localization request, there is a {{LocalizerRunner}} thread 
created and each {{LocalizerRunner}} creates another thread to get file 
permission info which is where we see this failure from. It is in Shell.java -> 
{{runCommand()}}

{code}
Thread errThread = new Thread() {
  @Override
  public void run() {
try {
  String line = errReader.readLine();
  while((line != null) && !isInterrupted()) {
errMsg.append(line);
errMsg.append(System.getProperty("line.separator"));
line = errReader.readLine();
  }
} catch(IOException ioe) {
  LOG.warn("Error reading the error stream", ioe);
}
  }
};
{code}






  was:
NM fails with the below error even though the ulimit for NM is large.

{code}
2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
java.lang.OutOfMemoryError: unable to create new native thread. One possible 
reason is that ulimit setting of 'max user processes' is too low. If so, do 
'ulimit -u ' and try again.
2019-09-12 10:27:46,348 FATAL 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[LocalizerRunner for container_e95_1568242982456_152026_01_000132,5,main] 
threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
at 

[jira] [Commented] (YARN-7817) Add Resource reference to RM's NodeInfo object so REST API can get non memory/vcore resource usages.

2019-09-19 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933661#comment-16933661
 ] 

Jonathan Hung commented on YARN-7817:
-

Hi [~eepayne], no objections, please go ahead. It seems this patch breaks 
YARN-7860, so please port that patch in addition to this one. Thanks!

> Add Resource reference to RM's NodeInfo object so REST API can get non 
> memory/vcore resource usages.
> 
>
> Key: YARN-7817
> URL: https://issues.apache.org/jira/browse/YARN-7817
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sumana Sathish
>Assignee: Sunil Govindan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: Screen Shot 2018-01-25 at 11.59.31 PM.png, 
> YARN-7817.001.patch, YARN-7817.002.patch, YARN-7817.003.patch, 
> YARN-7817.004.patch, YARN-7817.005.patch
>
>







[jira] [Commented] (YARN-9697) Efficient allocation of Opportunistic containers.

2019-09-19 Thread Abhishek Modi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933658#comment-16933658
 ] 

Abhishek Modi commented on YARN-9697:
-

[~elgoiri] could you please review the approach in the wip2 patch 
(https://issues.apache.org/jira/secure/attachment/12980716/YARN-9697.wip2.patch)?
 Thanks.

> Efficient allocation of Opportunistic containers.
> -
>
> Key: YARN-9697
> URL: https://issues.apache.org/jira/browse/YARN-9697
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9697.ut.patch, YARN-9697.ut2.patch, 
> YARN-9697.wip1.patch, YARN-9697.wip2.patch
>
>
> In the current implementation, opportunistic containers are allocated based 
> on the number of queued opportunistic container information received in node 
> heartbeat. This information becomes stale as soon as more opportunistic 
> containers are allocated on that node.
> Allocation of opportunistic containers happens on the same heartbeat in which 
> AM asks for the containers. When multiple applications request for 
> Opportunistic containers, containers might get allocated on the same set of 
> nodes as already allocated containers on the node are not considered while 
> serving requests from different applications. This can lead to uneven 
> allocation of Opportunistic containers across the cluster leading to 
> increased queuing time 
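
As one illustration of correcting for that staleness between heartbeats (this is 
just a sketch of the general idea, not necessarily how the wip patch approaches 
it), the scheduler could track allocations made since a node's last report and 
add them to the reported queue length when ranking nodes:
{code:java|title=Stale queue-length adjustment sketch (illustrative only)}
import java.util.Comparator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueEstimateSketch {
  // Queue length last reported by each node's heartbeat.
  private final Map<String, Integer> reportedQueueLength = new ConcurrentHashMap<>();
  // Opportunistic containers allocated since that report (the part that goes stale).
  private final Map<String, AtomicInteger> allocatedSinceReport = new ConcurrentHashMap<>();

  public void onHeartbeat(String node, int queuedOpportunisticContainers) {
    reportedQueueLength.put(node, queuedOpportunisticContainers);
    allocatedSinceReport.put(node, new AtomicInteger());
  }

  public String pickLeastLoadedNode() {
    return reportedQueueLength.keySet().stream()
        .min(Comparator.comparingInt(n -> reportedQueueLength.get(n)
            + allocatedSinceReport.get(n).get()))
        .orElseThrow(IllegalStateException::new);
  }

  public void recordAllocation(String node) {
    allocatedSinceReport.get(node).incrementAndGet();
  }
}
{code}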






[jira] [Commented] (YARN-9762) Add submission context label to audit logs

2019-09-19 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933646#comment-16933646
 ] 

Jonathan Hung commented on YARN-9762:
-

Thanks [~mkumar1984] for the patch, a few issues:
 * Can we keep the whitespace changes to a minimum? E.g. for the logFailure in 
ClientRMService, we can just add an extra line which adds the submission context 
node label expression. A few of the LOG.warns in RMAuditLogger also have 
unnecessary newline changes. Minimizing these changes keeps the git blame 
cleaner.
 * There seems to be a lot of unnecessary whitespace in the added 
RMAuditLogger#logFailure javadoc (after the argument names, and before the Note 
at the bottom of the javadoc); can we remove it?
 * For TestRMAuditLogger#testFailureLogFormatHelper, let's add the queueName + 
partition arguments before the "args" argument. The case with "appId, ..., 
queueName, partition" and the "args" case are orthogonal, so it seems best to 
order the arguments as such.
 * In the same method, let's put the if (queueName != null) and if (partition 
!= null) checks before the if (args != null) check
 * The added test case
{noformat}
testFailureLogFormatHelper(checkIP, null, null, null, null, null, null,
QUEUE, PARTITION); {noformat}
doesn't seem right to me. We should be adding QUEUE and PARTITION test cases 
with testFailureLogFormatHelper calls which have non-null APPID, ATTEMPTID, 
etc. arguments, i.e. add a
{noformat}
​ testFailureLogFormatHelper(checkIP, APPID, ATTEMPTID, CONTAINERID,
new CallerContext.Builder(CALLER_CONTEXT).setSignature(CALLER_SIGNATURE)
.build(), RESOURCE, QUEUE);{noformat}
and
{noformat}
testFailureLogFormatHelper(checkIP, APPID, ATTEMPTID, CONTAINERID,
new CallerContext.Builder(CALLER_CONTEXT).setSignature(CALLER_SIGNATURE)
.build(), RESOURCE, QUEUE, PARTITION); {noformat}
test cases.

> Add submission context label to audit logs
> --
>
> Key: YARN-9762
> URL: https://issues.apache.org/jira/browse/YARN-9762
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Manoj Kumar
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9762.01.patch
>
>
> Currently we log NODELABEL in container allocation/release audit logs, we 
> should also log NODELABEL of application submission context on app submission.






[jira] [Updated] (YARN-9845) Update LocalResourcesTrackerImpl to Use Java 8 Map Concurrent API

2019-09-19 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated YARN-9845:
-
Attachment: YARN-9845.1.patch

> Update LocalResourcesTrackerImpl to Use Java 8 Map Concurrent API
> -
>
> Key: YARN-9845
> URL: https://issues.apache.org/jira/browse/YARN-9845
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
> Attachments: YARN-9845.1.patch
>
>
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java#L467
> Class is using a {{ConcurrentHashMap}} but is not taking advantage of it.






[jira] [Updated] (YARN-9845) Update LocalResourcesTrackerImpl to Use Java 8 Map Concurrent API

2019-09-19 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated YARN-9845:
-
Summary: Update LocalResourcesTrackerImpl to Use Java 8 Map Concurrent API  
(was: Update to Use Java 8 Map Concurrent API)

> Update LocalResourcesTrackerImpl to Use Java 8 Map Concurrent API
> -
>
> Key: YARN-9845
> URL: https://issues.apache.org/jira/browse/YARN-9845
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java#L467
> Class is using a {{ConcurrentHashMap}} but is not taking advantage of it.






[jira] [Created] (YARN-9845) Update to Use Java 8 Map Concurrent API

2019-09-19 Thread David Mollitor (Jira)
David Mollitor created YARN-9845:


 Summary: Update to Use Java 8 Map Concurrent API
 Key: YARN-9845
 URL: https://issues.apache.org/jira/browse/YARN-9845
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 3.2.0
Reporter: David Mollitor
Assignee: David Mollitor


https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java#L467

Class is using a {{ConcurrentHashMap}} but is not taking advantage of it.
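
For example (a sketch only, with placeholder types rather than the tracker's 
real key and value classes), the Java 8 concurrent Map API collapses the usual 
get/null-check/put sequence into a single atomic call on a ConcurrentHashMap:
{code:java|title=computeIfAbsent sketch (placeholder types)}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ConcurrentMapSketch {
  // Placeholder for the tracker's real value type.
  static class TrackedResource {
    final String key;
    TrackedResource(String key) { this.key = key; }
  }

  private final ConcurrentMap<String, TrackedResource> localrsrc =
      new ConcurrentHashMap<>();

  TrackedResource getOrCreate(String key) {
    // One atomic call replaces the separate get, null check and put, and avoids
    // the race between the check and the insert.
    return localrsrc.computeIfAbsent(key, TrackedResource::new);
  }
}
{code}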






[jira] [Commented] (YARN-9772) CapacitySchedulerQueueManager has incorrect list of queues

2019-09-19 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933593#comment-16933593
 ] 

Sunil Govindan commented on YARN-9772:
--

Thanks. I think the point made by [~tarunparimi] is valid. Customers who already 
have a queue set up will run into issues during this, so we need to come up with 
some way to smooth that part. YARN-9766 was removing some checks, hence I had my 
reservations. [~tarunparimi], let's fix this cleanly, and at the same time give 
customers a cleaner upgrade path as well, even if that needs some tooling or 
scripts.

 

> CapacitySchedulerQueueManager has incorrect list of queues
> --
>
> Key: YARN-9772
> URL: https://issues.apache.org/jira/browse/YARN-9772
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
>
> CapacitySchedulerQueueManager has incorrect list of queues when there is more 
> than one parent queue (say at middle level) with same name.
> For example,
>  * root
>  ** a
>  *** b
>   c
>  *** d
>   b
>  * e
> {{CapacitySchedulerQueueManager#getQueues}} maintains these list of queues. 
> While parsing "root.a.d.b", it overrides "root.a.b" with new Queue object in 
> the map because of similar name. After parsing all the queues, map count 
> should be 7, but it is 6. Any reference to queue "root.a.b" in code path is 
> nothing but "root.a.d.b" object. Since 
> {{CapacitySchedulerQueueManager#getQueues}} has been used in multiple places, 
> will need to understand the implications in detail. For example, 
> {{CapapcityScheduler#getQueue}} has been used in many places which in turn 
> uses {{CapacitySchedulerQueueManager#getQueues}}. cc [~eepayne], [~sunilg]
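
Purely to illustrate the collision described above (simplified stand-in code, 
not the actual CapacitySchedulerQueueManager), keying the map by the short queue 
name lets the later "root.a.d.b" entry overwrite "root.a.b":
{code:java|title=Queue-name collision sketch (stand-in code)}
import java.util.HashMap;
import java.util.Map;

public class QueueNameCollisionSketch {
  public static void main(String[] args) {
    Map<String, String> queuesByShortName = new HashMap<>();
    String[] paths = {"root", "root.a", "root.a.b", "root.a.b.c",
                      "root.a.d", "root.a.d.b", "root.e"};
    for (String path : paths) {
      String shortName = path.substring(path.lastIndexOf('.') + 1);
      queuesByShortName.put(shortName, path);    // "root.a.d.b" overwrites "root.a.b"
    }
    System.out.println(queuesByShortName.size());   // 6, not the expected 7
    System.out.println(queuesByShortName.get("b")); // root.a.d.b -- root.a.b is gone
  }
}
{code}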






[jira] [Commented] (YARN-7817) Add Resource reference to RM's NodeInfo object so REST API can get non memory/vcore resource usages.

2019-09-19 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933538#comment-16933538
 ] 

Eric Payne commented on YARN-7817:
--

[~jhung], is there a plan to backport this to 2.10? If not, do you have any 
objections if I do it now?

> Add Resource reference to RM's NodeInfo object so REST API can get non 
> memory/vcore resource usages.
> 
>
> Key: YARN-7817
> URL: https://issues.apache.org/jira/browse/YARN-7817
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sumana Sathish
>Assignee: Sunil Govindan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: Screen Shot 2018-01-25 at 11.59.31 PM.png, 
> YARN-7817.001.patch, YARN-7817.002.patch, YARN-7817.003.patch, 
> YARN-7817.004.patch, YARN-7817.005.patch
>
>







[jira] [Commented] (YARN-9844) TestCapacitySchedulerPerf test errors in branch-2

2019-09-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933536#comment-16933536
 ] 

Jim Brennan commented on YARN-9844:
---

It appears that I can run these tests individually without failures; they only 
fail when running the full test class. Is it possible these tests are being run 
in parallel?

 

> TestCapacitySchedulerPerf test errors in branch-2
> -
>
> Key: YARN-9844
> URL: https://issues.apache.org/jira/browse/YARN-9844
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, yarn
>Affects Versions: 2.10.0
>Reporter: Jim Brennan
>Priority: Major
>
> These TestCapacitySchedulerPerf throughput tests are failing in branch-2:
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}






[jira] [Commented] (YARN-9697) Efficient allocation of Opportunistic containers.

2019-09-19 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933530#comment-16933530
 ] 

Hadoop QA commented on YARN-9697:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 53m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
49s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m  0s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
16s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
16s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m  
7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m  
7s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 55s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server: The patch generated 16 new 
+ 21 unchanged - 4 fixed = 37 total (was 25) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 29s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
58s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
37s{color} | {color:red} 
hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager
 generated 1 new + 4 unchanged - 0 fixed = 5 total (was 4) {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
56s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 94m  3s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
34s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}230m  0s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService 
|
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:39e82acc485 |
| JIRA Issue | YARN-9697 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12980716/YARN-9697.wip2.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux a5ba741cf8d3 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 
10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool 

[jira] [Commented] (YARN-9844) TestCapacitySchedulerPerf test errors in branch-2

2019-09-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933510#comment-16933510
 ] 

Jim Brennan commented on YARN-9844:
---

Here is a output from running the tests:
{noformat}
mvn test -DRunCapacitySchedulerPerfTests=true -Dtest=TestCapacitySchedulerPerf
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hadoop:hadoop-yarn-server-resourcemanager:jar:2.10.0-SNAPSHOT
[WARNING] 
'dependencyManagement.dependencies.dependency.(groupId:artifactId:type:classifier)'
 must be unique: com.microsoft.azure:azure-storage:jar -> version 7.0.0 vs 
5.4.0 @ org.apache.hadoop:hadoop-project:2.10.0-SNAPSHOT, 
/Users/jbrennan02/git/apache-hadoop/hadoop-project/pom.xml, line 1175, column 19
[WARNING] 
[WARNING] It is highly recommended to fix these problems because they threaten 
the stability of your build.
[WARNING] 
[WARNING] For this reason, future Maven versions might no longer support 
building such malformed projects.
[WARNING] 
[INFO] 
[INFO] 
[INFO] Building Apache Hadoop YARN ResourceManager 2.10.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-antrun-plugin:1.7:run (create-testdirs) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Executing tasks


main:
[INFO] Executed tasks
[INFO] 
[INFO] --- hadoop-maven-plugins:2.10.0-SNAPSHOT:protoc (compile-protoc) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Wrote protoc checksums to file 
/Users/jbrennan02/git/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/hadoop-maven-plugins-protoc-checksums.json
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory 
/Users/jbrennan02/git/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/resources
[INFO] Copying 2 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Compiling 3 source files to 
/Users/jbrennan02/git/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/classes
[INFO] 
[INFO] --- hadoop-maven-plugins:2.10.0-SNAPSHOT:test-protoc 
(compile-test-protoc) @ hadoop-yarn-server-resourcemanager ---
[INFO] Wrote protoc checksums to file 
/Users/jbrennan02/git/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/hadoop-maven-plugins-protoc-checksums.json
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 11 resources
[INFO] Copying 1 resource
[INFO] Copying 2 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Compiling 1 source file to 
/Users/jbrennan02/git/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes
[INFO] 
[INFO] --- maven-jar-plugin:2.5:test-jar (default) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Building jar: 
/Users/jbrennan02/git/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/hadoop-yarn-server-resourcemanager-2.10.0-SNAPSHOT-tests.jar
[INFO] 
[INFO] --- maven-surefire-plugin:2.21.0:test (default-test) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] 
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] Running 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf
[ERROR] Tests run: 4, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 42.365 
s <<< FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf
[ERROR] 
testUserLimitThroughputForFiveResources(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf)
  Time elapsed: 0.038 s  <<< ERROR!
java.lang.ArrayIndexOutOfBoundsException: 2
at 
org.apache.hadoop.yarn.api.records.Resource.getResourceInformation(Resource.java:241)
at 
org.apache.hadoop.yarn.api.records.Resource.setResourceValue(Resource.java:351)
at 
org.apache.hadoop.yarn.util.resource.ResourceUtils.getResourceTypesMinimumAllocation(ResourceUtils.java:534)
at 

[jira] [Created] (YARN-9844) TestCapacitySchedulerPerf test errors in branch-2

2019-09-19 Thread Jim Brennan (Jira)
Jim Brennan created YARN-9844:
-

 Summary: TestCapacitySchedulerPerf test errors in branch-2
 Key: YARN-9844
 URL: https://issues.apache.org/jira/browse/YARN-9844
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test, yarn
Affects Versions: 2.10.0
Reporter: Jim Brennan


**These TestCapacitySchedulerPerf throughput tests are failing in branch-2:

{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}
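
For reference, a minimal sketch of the indexed access that the stack trace points at. This is an assumption-based illustration, not the test code itself: it presumes only the built-in memory and vcores types are registered, so any index >= 2 passed to the indexed Resource setters fails the same way as in ResourceUtils.getResourceTypesMinimumAllocation.

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;

public class ResourceIndexRepro {
  public static void main(String[] args) {
    // Only memory (index 0) and vcores (index 1) are registered by default.
    Resource res = Resource.newInstance(1024, 1);

    // Index 2 refers to a third resource type that was never registered, so
    // the lookup inside setResourceValue is expected to throw
    // ArrayIndexOutOfBoundsException: 2, matching the error above.
    res.setResourceValue(2, 10L);
  }
}
{code}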



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9844) TestCapacitySchedulerPerf test errors in branch-2

2019-09-19 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9844:
--
Description: 
These TestCapacitySchedulerPerf throughput tests are failing in branch-2:

{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}
{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}
{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}

  was:
**These TestCapacitySchedulerPerf throughput tests are failing in branch-2:

{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}


> TestCapacitySchedulerPerf test errors in branch-2
> -
>
> Key: YARN-9844
> URL: https://issues.apache.org/jira/browse/YARN-9844
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, yarn
>Affects Versions: 2.10.0
>Reporter: Jim Brennan
>Priority: Major
>
> These TestCapacitySchedulerPerf throughput tests are failing in branch-2:
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8786) LinuxContainerExecutor fails sporadically in create_local_dirs

2019-09-19 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933458#comment-16933458
 ] 

Eric Badger commented on YARN-8786:
---

bq. YARN-9833 could fix this issue
Given that we don't have more information on the nature of the error in this 
JIRA other than that the mkdirs failed, I'm inclined to agree that this is 
likely fixed by YARN-9833. I would be fine with closing this as a dup of 
YARN-9833 and re-opening if we continue to see the failure.

> LinuxContainerExecutor fails sporadically in create_local_dirs
> --
>
> Key: YARN-8786
> URL: https://issues.apache.org/jira/browse/YARN-8786
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Jon Bender
>Priority: Major
>
> We started using CGroups with LinuxContainerExecutor recently, running Apache 
> Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a yarn 
> container will fail with a message like the following:
> {code:java}
> [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: 
> Container container_1530684675517_516620_01_020846 transitioned from 
> SCHEDULED to RUNNING
> [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO 
> monitor.ContainersMonitorImpl: Starting resource-monitoring for 
> container_1530684675517_516620_01_020846
> [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN 
> privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 
> 35. Privileged Execution Operation Stderr:
> [2018-09-02 23:48:02.506159] Could not create container dirsCould not create 
> local files and directories
> [2018-09-02 23:48:02.506220]
> [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
> [2018-09-02 23:48:02.506258] main : run as user is nobody
> [2018-09-02 23:48:02.506282] main : requested yarn user is root
> [2018-09-02 23:48:02.506294] Getting exit code file...
> [2018-09-02 23:48:02.506307] Creating script paths...
> [2018-09-02 23:48:02.506330] Writing pid file...
> [2018-09-02 23:48:02.506366] Writing to tmp file 
> /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
> [2018-09-02 23:48:02.506389] Writing to cgroup task files...
> [2018-09-02 23:48:02.506402] Creating local dirs...
> [2018-09-02 23:48:02.506414] Getting exit code file...
> [2018-09-02 23:48:02.506435] Creating script paths...
> {code}
> Looking at the container executor source it's traceable to errors here: 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604]
>  And ultimately to 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672]
> The root failure seems to be in the underlying mkdir call, but that exit code 
> / errno is swallowed so we don't have more details. We tend to see this when 
> many containers start at the same time for the same application on a host, 
> and suspect it may be related to some race conditions around those shared 
> directories between containers for the same application.
> For example, this is a typical pattern in the audit logs:
> {code:java}
> [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012870
> [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN 
> nodemanager.NMAuditLogger: USER=root  OPERATION=Container Finished - 
> Failed   TARGET=ContainerImplRESULT=FAILURE  DESCRIPTION=Container failed 
> with state: EXITED_WITH_FAILUREAPPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> {code}
> Two containers for the same application starting in quick succession followed 
> by the EXITED_WITH_FAILURE step (exit code 35).
> We plan to upgrade to 3.1.x soon, but I don't expect that to fix this; the 
> only major JIRAs that affected the executor since 3.0.0 seem unrelated 
> ([https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8]
>  and 
> [https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56])
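
As an illustration of the suspected race (sketched in Java for readability; the real code is the container-executor's C implementation), a plain mkdir-style call can report failure when another container of the same application creates the shared directory first, while a race-tolerant variant treats "already exists as a directory" as success:

{code:java}
import java.io.File;
import java.io.IOException;

public class SharedDirRace {
  // Naive version: a concurrent creator can make mkdirs() return false even
  // though the directory now exists, which then looks like a failure.
  static void createNaive(File appDir) throws IOException {
    if (!appDir.mkdirs()) {
      throw new IOException("Could not create " + appDir);
    }
  }

  // Race-tolerant version: "already exists as a directory" counts as success,
  // which is what shared app-level dirs need when two containers of the same
  // application localize at the same time.
  static void createTolerant(File appDir) throws IOException {
    if (!appDir.mkdirs() && !appDir.isDirectory()) {
      throw new IOException("Could not create " + appDir);
    }
  }

  public static void main(String[] args) throws Exception {
    File dir = new File("/tmp/shared-app-dir-demo");
    Runnable task = () -> {
      try {
        createTolerant(dir);
      } catch (IOException e) {
        e.printStackTrace();
      }
    };
    Thread t1 = new Thread(task);
    Thread t2 = new Thread(task);
    t1.start(); t2.start();
    t1.join(); t2.join();
  }
}
{code}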



--
This message was 

[jira] [Commented] (YARN-9617) RM UI enables viewing pages using Timeline Reader for a user who can not access the YARN config endpoint

2019-09-19 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933375#comment-16933375
 ] 

Sunil Govindan commented on YARN-9617:
--

[~akhilpb] could you please add a validation for this case?

> RM UI enables viewing pages using Timeline Reader for a user who can not 
> access the YARN config endpoint
> 
>
> Key: YARN-9617
> URL: https://issues.apache.org/jira/browse/YARN-9617
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.1
>Reporter: Balázs Szabó
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> If a user cannot access the /conf endpoint, she/he will be unable to query 
> the address of the Timeline Service Reader 
> (yarn.timeline-service.reader.webapp.address). Such a user receives a "403 
> Unauthenticated users are not authorized to access this page" response when 
> trying to view pages that request data from the Timeline Reader (i.e. the 
> Flow Activity tab). The UI then falls back to the default address 
> (localhost:8188), which eventually yields the 401 response (see attached 
> screenshots).
>  
> !1.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9766) YARN CapacityScheduler QueueMetrics has missing metrics for parent queues having same name

2019-09-19 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933372#comment-16933372
 ] 

Sunil Govindan commented on YARN-9766:
--

[~tarunparimi] [~Prabhu Joseph]

I went through the fix. I think it's a short-term fix, and it may have side 
effects.

This should be fixed in a cleaner way. I would like to enforce this validation 
if such a configuration doesn't make sense; let's fix this at the 
queue-creation level itself.
 # As I see it, we are trying to overcome the duplicated-name issue by not 
looking at the old entry.
 # This happened because the value got overridden.

Hence we should either disallow such a configuration, or reimplement the map 
with a better data structure.

 

cc [~eepayne] [~leftnoteasy] [~cheersyang]

> YARN CapacityScheduler QueueMetrics has missing metrics for parent queues 
> having same name
> --
>
> Key: YARN-9766
> URL: https://issues.apache.org/jira/browse/YARN-9766
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9766.001.patch
>
>
> In Capacity Scheduler, we enforce Leaf Queues to have unique names. But it is 
> not the case for Parent Queues. For example, we can have the below queue 
> hierarchy, where "b" is the queue name for two different queue paths root.a.b 
> and root.a.d.b . Since it is not a leaf queue this configuration works and 
> apps run fine in the leaf queues 'c'  and 'e'.
>  * root
>  ** a
>  *** b
>   c
>  *** d
>   b
>  * e
> But the jmx metrics does not show the metrics for the parent queue 
> "root.a.d.b" . We can see metrics only for "root.a.b" queue.
>  
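
To make the overriding concrete, here is a small hypothetical sketch (illustrative names only, not the actual QueueMetrics code) of why a metrics map keyed by the short queue name keeps only one of the two parent queues named "b" from the hierarchy above, while keying by the full queue path keeps both:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class QueueMetricsKeyDemo {
  public static void main(String[] args) {
    String[] queuePaths = {"root.a.b", "root.a.d.b"};

    // Keyed by short name: the second registration overrides the first, so
    // metrics for one of the parent queues named "b" disappear.
    Map<String, Object> byShortName = new HashMap<>();
    for (String path : queuePaths) {
      String shortName = path.substring(path.lastIndexOf('.') + 1);
      byShortName.put(shortName, new Object());
    }
    System.out.println("by short name: " + byShortName.size() + " entries"); // 1

    // Keyed by full path: both parent queues keep their own metrics entry.
    Map<String, Object> byFullPath = new HashMap<>();
    for (String path : queuePaths) {
      byFullPath.put(path, new Object());
    }
    System.out.println("by full path: " + byFullPath.size() + " entries"); // 2
  }
}
{code}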



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9760) Support configuring application priorities on a workflow level

2019-09-19 Thread Varun Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-9760:
---
Attachment: YARN-9760.02.patch

> Support configuring application priorities on a workflow level
> --
>
> Key: YARN-9760
> URL: https://issues.apache.org/jira/browse/YARN-9760
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jonathan Hung
>Assignee: Varun Saxena
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9760.01.patch, YARN-9760.02.patch
>
>
> Currently priorities are submitted on an application level, but for end users 
> it's common to submit workloads to YARN at a workflow level. This jira 
> proposes a feature to store workflow id + priority mappings on RM (similar to 
> queue mappings). If app is submitted with a certain workflow id (as set in 
> application submission context) RM will override this app's priority with the 
> one defined in the mapping.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9697) Efficient allocation of Opportunistic containers.

2019-09-19 Thread Abhishek Modi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Modi updated YARN-9697:

Attachment: YARN-9697.wip2.patch

> Efficient allocation of Opportunistic containers.
> -
>
> Key: YARN-9697
> URL: https://issues.apache.org/jira/browse/YARN-9697
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9697.ut.patch, YARN-9697.ut2.patch, 
> YARN-9697.wip1.patch, YARN-9697.wip2.patch
>
>
> In the current implementation, opportunistic containers are allocated based 
> on the queued opportunistic container count received in the node heartbeat. 
> This information becomes stale as soon as more opportunistic containers are 
> allocated on that node.
> Allocation of opportunistic containers happens on the same heartbeat in which 
> the AM asks for the containers. When multiple applications request 
> Opportunistic containers, containers might get allocated on the same set of 
> nodes, because containers already allocated on a node are not considered 
> while serving requests from different applications. This can lead to uneven 
> allocation of Opportunistic containers across the cluster, leading to 
> increased queuing time 
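
Purely as a hedged sketch of one possible bookkeeping approach, and not necessarily what the attached WIP patch does, the allocator could adjust a per-node estimate of queued opportunistic containers locally between heartbeats, so several applications served in the same interval do not pile onto the same nodes:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: tracks an estimated opportunistic queue length per node.
public class QueueLengthEstimator {
  private final Map<String, Integer> estimatedQueueLength = new ConcurrentHashMap<>();

  // Called when a node heartbeat reports its real queued-container count.
  public void onHeartbeat(String nodeId, int reportedQueueLength) {
    estimatedQueueLength.put(nodeId, reportedQueueLength);
  }

  // Called when the allocator places an opportunistic container on a node, so
  // later requests in the same heartbeat interval see the updated estimate.
  public void onAllocation(String nodeId) {
    estimatedQueueLength.merge(nodeId, 1, Integer::sum);
  }

  // Pick the node with the smallest estimated backlog.
  public String leastLoadedNode() {
    return estimatedQueueLength.entrySet().stream()
        .min(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .orElse(null);
  }
}
{code}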



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8786) LinuxContainerExecutor fails sporadically in create_local_dirs

2019-09-19 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933274#comment-16933274
 ] 

Tarun Parimi commented on YARN-8786:


YARN-9833 could fix this issue

> LinuxContainerExecutor fails sporadically in create_local_dirs
> --
>
> Key: YARN-8786
> URL: https://issues.apache.org/jira/browse/YARN-8786
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Jon Bender
>Priority: Major
>
> We started using CGroups with LinuxContainerExecutor recently, running Apache 
> Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a yarn 
> container will fail with a message like the following:
> {code:java}
> [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: 
> Container container_1530684675517_516620_01_020846 transitioned from 
> SCHEDULED to RUNNING
> [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO 
> monitor.ContainersMonitorImpl: Starting resource-monitoring for 
> container_1530684675517_516620_01_020846
> [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN 
> privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 
> 35. Privileged Execution Operation Stderr:
> [2018-09-02 23:48:02.506159] Could not create container dirsCould not create 
> local files and directories
> [2018-09-02 23:48:02.506220]
> [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
> [2018-09-02 23:48:02.506258] main : run as user is nobody
> [2018-09-02 23:48:02.506282] main : requested yarn user is root
> [2018-09-02 23:48:02.506294] Getting exit code file...
> [2018-09-02 23:48:02.506307] Creating script paths...
> [2018-09-02 23:48:02.506330] Writing pid file...
> [2018-09-02 23:48:02.506366] Writing to tmp file 
> /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
> [2018-09-02 23:48:02.506389] Writing to cgroup task files...
> [2018-09-02 23:48:02.506402] Creating local dirs...
> [2018-09-02 23:48:02.506414] Getting exit code file...
> [2018-09-02 23:48:02.506435] Creating script paths...
> {code}
> Looking at the container executor source it's traceable to errors here: 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604]
>  And ultimately to 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672]
> The root failure seems to be in the underlying mkdir call, but that exit code 
> / errno is swallowed so we don't have more details. We tend to see this when 
> many containers start at the same time for the same application on a host, 
> and suspect it may be related to some race conditions around those shared 
> directories between containers for the same application.
> For example, this is a typical pattern in the audit logs:
> {code:java}
> [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012870
> [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN 
> nodemanager.NMAuditLogger: USER=root  OPERATION=Container Finished - 
> Failed   TARGET=ContainerImplRESULT=FAILURE  DESCRIPTION=Container failed 
> with state: EXITED_WITH_FAILUREAPPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> {code}
> Two containers for the same application starting in quick succession followed 
> by the EXITED_WITH_FAILURE step (exit code 35).
> We plan to upgrade to 3.1.x soon, but I don't expect that to fix this; the 
> only major JIRAs that affected the executor since 3.0.0 seem unrelated 
> ([https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8]
>  and 
> [https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9782) Avoid DNS resolution while running SLS.

2019-09-19 Thread Abhishek Modi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933270#comment-16933270
 ] 

Abhishek Modi commented on YARN-9782:
-

Thanks [~elgoiri] for review. Filed YARN-9843 for making 
TestAMSimulator.testAMSimulator more resilient.

> Avoid DNS resolution while running SLS.
> ---
>
> Key: YARN-9782
> URL: https://issues.apache.org/jira/browse/YARN-9782
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9782.001.patch, YARN-9782.002.patch, 
> YARN-9782.003.patch
>
>
> In SLS, we add nodes with random names and rack. DNS resolution of these 
> nodes takes around 2 seconds because it will timeout after that. This makes 
> the result of SLS unreliable and adds spikes. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9843) Test TestAMSimulator.testAMSimulator fails intermittently.

2019-09-19 Thread Abhishek Modi (Jira)
Abhishek Modi created YARN-9843:
---

 Summary: Test TestAMSimulator.testAMSimulator fails intermittently.
 Key: YARN-9843
 URL: https://issues.apache.org/jira/browse/YARN-9843
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Abhishek Modi
Assignee: Abhishek Modi


Stack trace for failure:

java.lang.AssertionError: java.io.IOException: Unable to delete directory 
/testptch/hadoop/hadoop-tools/hadoop-sls/target/test-dir/output4038286622450859971/metrics.
 at org.junit.Assert.fail(Assert.java:88)
 at 
org.apache.hadoop.yarn.sls.appmaster.TestAMSimulator.deleteMetricOutputDir(TestAMSimulator.java:141)
 at 
org.apache.hadoop.yarn.sls.appmaster.TestAMSimulator.tearDown(TestAMSimulator.java:298)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
 at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
 at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
 at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
 at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
 at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
 at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
 at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
 at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
 at org.junit.runners.Suite.runChild(Suite.java:128)
 at org.junit.runners.Suite.runChild(Suite.java:27)
 at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
 at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
 at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
 at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
 at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
 at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
 at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
 at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
 at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
 at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
 at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
 at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
 at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
 at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
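
One common way to make such a teardown more resilient, sketched here under the assumption that commons-io is on the classpath (the "Unable to delete directory" message suggests FileUtils is already in use), is to delete quietly and retry briefly instead of failing the test on the first unsuccessful delete:

{code:java}
import java.io.File;
import org.apache.commons.io.FileUtils;

public class ResilientCleanup {
  // Try a few times before giving up; deleteQuietly never throws, it simply
  // leaves the directory in place if something still holds it open.
  static boolean deleteWithRetries(File dir, int attempts) throws InterruptedException {
    for (int i = 0; i < attempts; i++) {
      FileUtils.deleteQuietly(dir);
      if (!dir.exists()) {
        return true;
      }
      // Give any lagging writer (e.g. a metrics logger) time to finish.
      Thread.sleep(100L);
    }
    return !dir.exists();
  }
}
{code}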



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-19 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933262#comment-16933262
 ] 

Steve Loughran commented on YARN-9839:
--

FYI, I'm adding some tests in HADOOP-16570 which verify that one of the FS 
clients doesn't leak threads: cache the thread set at the start, compare it at 
the end, after filtering out some daemon threads which never go away. The same 
trick might work here.
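
A minimal sketch of that trick using only JDK calls (no test framework assumed): snapshot the live non-daemon threads before the code under test runs, then compare afterwards.

{code:java}
import java.util.HashSet;
import java.util.Set;

public class ThreadLeakCheck {
  // Snapshot of the non-daemon threads alive before the code under test runs.
  static Set<Thread> snapshotNonDaemonThreads() {
    Set<Thread> snapshot = new HashSet<>();
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      if (!t.isDaemon()) {
        snapshot.add(t);
      }
    }
    return snapshot;
  }

  // Returns the non-daemon threads that appeared since the snapshot was taken.
  static Set<Thread> leakedSince(Set<Thread> before) {
    Set<Thread> after = snapshotNonDaemonThreads();
    after.removeAll(before);
    return after;
  }

  public static void main(String[] args) {
    Set<Thread> before = snapshotNonDaemonThreads();
    // ... run the code under test here ...
    Set<Thread> leaked = leakedSince(before);
    if (!leaked.isEmpty()) {
      throw new AssertionError("Leaked threads: " + leaked);
    }
  }
}
{code}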

> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -
>
> Key: YARN-9839
> URL: https://issues.apache.org/jira/browse/YARN-9839
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u ' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
> at org.apache.hadoop.util.Shell.run(Shell.java:482)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, there is a {{LocalizerRunner}} 
> thread created and each {{LocalizerRunner}} creates another thread to get 
> file permission info which is where we see this failure from. It is in 
> Shell.java -> {{runCommand()}}
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
> try {
>   String line = errReader.readLine();
>   while((line != null) && !isInterrupted()) {
> errMsg.append(line);
> errMsg.append(System.getProperty("line.separator"));
> line = errReader.readLine();
>   }
> } catch(IOException ioe) {
>   LOG.warn("Error reading the error stream", ioe);
> }
>   }
> };
> {code}
> {{LocalizerRunner}} are Threads which are cached in 
> {{ResourceLocalizationService}}. Looking into a possibility if they are not 
> getting removed from the cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9842) Port YARN-9608 DecommissioningNodesWatcher should get lists of running applications on node from RMNode to branch-3.0/branch-2

2019-09-19 Thread Abhishek Modi (Jira)
Abhishek Modi created YARN-9842:
---

 Summary: Port YARN-9608 DecommissioningNodesWatcher should get 
lists of running applications on node from RMNode to branch-3.0/branch-2
 Key: YARN-9842
 URL: https://issues.apache.org/jira/browse/YARN-9842
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Abhishek Modi
Assignee: Abhishek Modi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-5040) CPU Isolation with CGroups triggers kernel panics on Centos 7.1/7.2 when yarn.nodemanager.resource.percentage-physical-cpu-limit < 100

2019-09-19 Thread yinghua_zh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-5040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933167#comment-16933167
 ] 

yinghua_zh edited comment on YARN-5040 at 9/19/19 8:54 AM:
---

I also encountered the same problem. When CGroups is enabled and a big task is 
running, the operating system kernel crashes after a period of time, but not 
every time. My version information is as follows:

Hadoop:2.7.2

OS: CentOS Linux release 7.3.1611

OS kernel: 3.10.0-514.el7.x86_64

How did you solve the problem? [~vvasudev] [~sidharta-s] [~Tao Jie] [~ecwpp] 
[~cheersyang]


was (Author: yinghua_zh):
I also encountered the same problem. When CGroup is enabled and a big task is 
running, after a period of time, the operating system kernel crash, My version 
information is as follows:

Hadoop:2.7.2

OS: CentOS Linux release 7.3.1611

OS kernel: 3.10.0-514.el7.x86_64

How did you solve the problem?[~vvasudev] [~sidharta-s] [~Tao Jie]  [~ecwpp] @

> CPU Isolation with CGroups triggers kernel panics on Centos 7.1/7.2 when 
> yarn.nodemanager.resource.percentage-physical-cpu-limit < 100
> --
>
> Key: YARN-5040
> URL: https://issues.apache.org/jira/browse/YARN-5040
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Sidharta Seethana
>Assignee: Varun Vasudev
>Priority: Major
>
> /cc [~vvasudev]
> We have been running some benchmarks internally with resource isolation 
> enabled. We have consistently run into kernel panics when running a large job 
> ( a large pi job, terasort ). These kernel panics wen't away when we set 
> yarn.nodemanager.resource.percentage-physical-cpu-limit=100 . Anything less 
> than 100 triggers different behavior in YARN's CPU resource handler which 
> seems to cause these issues. Looking at the kernel crash dumps, the 
> backtraces were different - sometimes pointing to java processes, sometimes 
> not. 
> Kernel versions used : 3.10.0-229.14.1.el7.x86_64 and 
> 3.10.0-327.13.1.el7.x86_64 . 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5040) CPU Isolation with CGroups triggers kernel panics on Centos 7.1/7.2 when yarn.nodemanager.resource.percentage-physical-cpu-limit < 100

2019-09-19 Thread yinghua_zh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-5040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933167#comment-16933167
 ] 

yinghua_zh commented on YARN-5040:
--

I also encountered the same problem. When CGroups is enabled and a big task is 
running, the operating system kernel crashes after a period of time. My version 
information is as follows:

Hadoop:2.7.2

OS: CentOS Linux release 7.3.1611

OS kernel: 3.10.0-514.el7.x86_64

How did you solve the problem? [~vvasudev] [~sidharta-s] [~Tao Jie] [~ecwpp]

> CPU Isolation with CGroups triggers kernel panics on Centos 7.1/7.2 when 
> yarn.nodemanager.resource.percentage-physical-cpu-limit < 100
> --
>
> Key: YARN-5040
> URL: https://issues.apache.org/jira/browse/YARN-5040
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Sidharta Seethana
>Assignee: Varun Vasudev
>Priority: Major
>
> /cc [~vvasudev]
> We have been running some benchmarks internally with resource isolation 
> enabled. We have consistently run into kernel panics when running a large job 
> (a large pi job, terasort). These kernel panics went away when we set 
> yarn.nodemanager.resource.percentage-physical-cpu-limit=100 . Anything less 
> than 100 triggers different behavior in YARN's CPU resource handler which 
> seems to cause these issues. Looking at the kernel crash dumps, the 
> backtraces were different - sometimes pointing to java processes, sometimes 
> not. 
> Kernel versions used : 3.10.0-229.14.1.el7.x86_64 and 
> 3.10.0-327.13.1.el7.x86_64 . 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9840) Capacity scheduler: add support for Secondary Group rule mapping

2019-09-19 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9840:
---
Summary: Capacity scheduler: add support for Secondary Group rule mapping  
(was: Capacity scheduler: add support for Secondary Group user mapping)

> Capacity scheduler: add support for Secondary Group rule mapping
> 
>
> Key: YARN-9840
> URL: https://issues.apache.org/jira/browse/YARN-9840
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> Currently, Capacity Scheduler only supports primary group rule mapping like 
> this:
> {{u:%user:%primary_group}}
> Fair scheduler already supports secondary group placement rule. Let's add 
> this to CS to reduce the feature gap.
> Class of interest: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/placement/UserGroupMappingPlacementRule.java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9841) Capacity scheduler: add support for combined %user + %primary_group mapping

2019-09-19 Thread Peter Bacsko (Jira)
Peter Bacsko created YARN-9841:
--

 Summary: Capacity scheduler: add support for combined %user + 
%primary_group mapping
 Key: YARN-9841
 URL: https://issues.apache.org/jira/browse/YARN-9841
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko


Right now in CS, using {{%primary_group}} with a parent queue is only possible 
this way:

{{u:%user:parentqueue.%primary_group}}

Looking at 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/placement/UserGroupMappingPlacementRule.java,
 we cannot do something like:

{{u:%user:%primary_group.%user}}

Fair Scheduler supports a nested rule where such a placement/mapping rule is 
possible. This improvement would reduce this feature gap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9840) Capacity scheduler: add support for Secondary Group user mapping

2019-09-19 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9840:
---
Component/s: capacity scheduler

> Capacity scheduler: add support for Secondary Group user mapping
> 
>
> Key: YARN-9840
> URL: https://issues.apache.org/jira/browse/YARN-9840
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> Currently, Capacity Scheduler only supports primary group rule mapping like 
> this:
> {{u:%user:%primary_group}}
> Fair scheduler already supports secondary group placement rule. Let's add 
> this to CS to reduce the feature gap.
> Class of interest: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/placement/UserGroupMappingPlacementRule.java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9840) Capacity scheduler: add support for Secondary Group user mapping

2019-09-19 Thread Peter Bacsko (Jira)
Peter Bacsko created YARN-9840:
--

 Summary: Capacity scheduler: add support for Secondary Group user 
mapping
 Key: YARN-9840
 URL: https://issues.apache.org/jira/browse/YARN-9840
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Peter Bacsko
Assignee: Peter Bacsko


Currently, Capacity Scheduler only supports primary group rule mapping like 
this:

{{u:%user:%primary_group}}

Fair scheduler already supports secondary group placement rule. Let's add this 
to CS to reduce the feature gap.

Class of interest: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/placement/UserGroupMappingPlacementRule.java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metri

2019-09-19 Thread jiulongzhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933137#comment-16933137
 ] 

jiulongzhu commented on YARN-9838:
--

The test case failure is unrelated to this patch.

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Fix For: 2.7.3
>
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch
>
>
>       In some of our clusters, we are seeing that "Used Resource", "Used 
> Capacity", "Absolute Used Capacity" and "Num Container" are positive or 
> negative when the queue is absolutely idle (no RUNNING, no NEW apps...). In 
> extreme cases, apps couldn't be submitted to a queue that is actually idle 
> but whose "Used Resource" is far more than zero, just like a "Container Leak".
>       Firstly, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue. AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the state of "numContainer" and 
> "Used". Secondly, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics are changed (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers did not modify the "numContainer" value in AbstractCSQueue 
> or the "used" value in ResourceUsage when the application was moved from one 
> queue to another.
>         The table below shows how the metric values change when a 
> reservedContainer is allocated, moved from the $FROM queue to the $TO queue, 
> and released. The increases and decreases are not conservative: the Resource 
> is allocated from the $FROM queue but released to the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the 
> same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM 
> queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in 
> $TO queue|decrease in $TO queue|
>       In contrast, the metric update logic for allocatedContainers 
> (allocated, acquired, running) across allocate, movetoqueue and release is 
> fully conservative.
>    
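
As a hedged illustration only (the classes below are hypothetical, not the actual CapacityScheduler code), conservative accounting for a moved reservation would subtract from the $FROM queue and add to the $TO queue in the same step, so a later release against $TO brings every tracker back to zero:

{code:java}
// Hypothetical accounting sketch: both queues are updated together, so a moved
// reservation is never double-counted or leaked.
public class ReservedMoveDemo {

  static class QueueAccounting {
    final String name;
    int numContainers;
    long usedMemoryMb;

    QueueAccounting(String name) { this.name = name; }
  }

  // Conservative move: whatever is subtracted from $FROM is added to $TO.
  static void moveReservation(QueueAccounting from, QueueAccounting to, long memoryMb) {
    from.numContainers -= 1;
    from.usedMemoryMb -= memoryMb;
    to.numContainers += 1;
    to.usedMemoryMb += memoryMb;
  }

  public static void main(String[] args) {
    QueueAccounting fromQueue = new QueueAccounting("root.from");
    QueueAccounting toQueue = new QueueAccounting("root.to");

    // Reservation allocated in $FROM ...
    fromQueue.numContainers = 1;
    fromQueue.usedMemoryMb = 2048;

    // ... then moved; releasing it later against $TO leaves both queues at zero.
    moveReservation(fromQueue, toQueue, 2048);
    toQueue.numContainers -= 1;
    toQueue.usedMemoryMb -= 2048;

    System.out.println(fromQueue.numContainers + " " + fromQueue.usedMemoryMb); // 0 0
    System.out.println(toQueue.numContainers + " " + toQueue.usedMemoryMb);     // 0 0
  }
}
{code}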



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9760) Support configuring application priorities on a workflow level

2019-09-19 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933124#comment-16933124
 ] 

Hadoop QA commented on YARN-9760:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m 10s{color} 
| {color:red} YARN-9760 does not apply to trunk. Rebase required? Wrong Branch? 
See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-9760 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12980334/YARN-9760.01.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/24806/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Support configuring application priorities on a workflow level
> --
>
> Key: YARN-9760
> URL: https://issues.apache.org/jira/browse/YARN-9760
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jonathan Hung
>Assignee: Varun Saxena
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9760.01.patch
>
>
> Currently priorities are submitted on an application level, but for end users 
> it's common to submit workloads to YARN at a workflow level. This jira 
> proposes a feature to store workflow id + priority mappings on RM (similar to 
> queue mappings). If app is submitted with a certain workflow id (as set in 
> application submission context) RM will override this app's priority with the 
> one defined in the mapping.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org