[jira] [Comment Edited] (YARN-7672) hadoop-sls can not simulate huge scale of YARN

2017-12-20 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298192#comment-16298192
 ] 

zhangshilong edited comment on YARN-7672 at 12/20/17 9:55 AM:
--

[~cxcw] I use two daemons deployed on two different hosts.
  I start 1000~5000 threads to simulate NMs/AMs, because I need to simulate 1 
apps running with 1 NM nodes.
One task uses 1 vcore and 2304 MB, and one NM has 50 vcores and 50*2304 MB of 
resources.
The NM and AM simulators are all CPU-bound tasks, so cpu.load goes up to 100+ 
(with only 32 cores). And, as we know, the Scheduler also uses one process 
for allocating resources.



was (Author: zsl2007):
[~cxcw] I use two daemons deployed on two different hosts.
  I start 1000~5000 threads to simulate NMs/AMs, because I need to simulate 1 
apps running with 1 NM nodes.
One task uses 1 vcore and 2304 MB, and one NM has 50 vcores and 50*2304 MB of 
resources.
The NM and AM simulators are all CPU-bound tasks, so cpu.load goes up to 100+ 
(with only 32 cores). And, as we know, the Scheduler also uses one process 
for allocating resources.


> hadoop-sls can not simulate huge scale of YARN
> --
>
> Key: YARN-7672
> URL: https://issues.apache.org/jira/browse/YARN-7672
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhangshilong
>Assignee: zhangshilong
> Attachments: YARN-7672.patch
>
>
> Our YARN cluster scales to nearly 10 thousand nodes. We need to do scheduler 
> pressure tests.
> Using SLS, we start 2000+ threads to simulate NMs and AMs, but cpu.load goes 
> as high as 100+. I think that will affect the performance evaluation of the 
> scheduler.
> So I propose to separate the scheduler from the simulator:
> I start a real RM. Then SLS registers nodes with the RM and submits apps to 
> the RM using RM RPC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7672) hadoop-sls can not simulate huge scale of YARN

2017-12-20 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298192#comment-16298192
 ] 

zhangshilong commented on YARN-7672:


[~cxcw] I use two daemons deployed on two different hosts.
  I start 1000~5000 threads to simulate NMs/AMs, because I need to simulate 1 
apps running with 1 NM nodes.
One task uses 1 vcore and 2304 MB, and one NM has 50 vcores and 50*2304 MB of 
resources.
The NM and AM simulators are all CPU-bound tasks, so cpu.load goes up to 100+ 
(with only 32 cores). And, as we know, the Scheduler also uses one process 
for allocating resources.








[jira] [Updated] (YARN-7672) hadoop-sls can not simulate huge scale of YARN

2017-12-18 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-7672:
---
Attachment: YARN-7672.patch







[jira] [Updated] (YARN-7672) hadoop-sls can not simulate huge scale of YARN

2017-12-18 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-7672:
---
Description: 
Our YARN cluster scales to nearly 10 thousand nodes. We need to do scheduler 
pressure tests.
Using SLS, we start 2000+ threads to simulate NMs and AMs, but cpu.load goes 
as high as 100+. I think that will affect the performance evaluation of the 
scheduler.
So I propose to separate the scheduler from the simulator:
I start a real RM. Then SLS registers nodes with the RM and submits apps to 
the RM using RM RPC.

  was:
Our YARN cluster scales to nearly 10 thousand nodes.
We need to do scheduler pressure tests.
We start 2000+ threads to simulate NMs and AMs, so cpu.load goes as high as 
100+. I think that will affect the performance evaluation of the scheduler.
So I propose to separate the scheduler from the simulator:
I start a real RM. Then SLS registers nodes with the RM and submits apps to 
the RM using RM RPC.








[jira] [Created] (YARN-7672) hadoop-sls can not simulate huge scale of YARN

2017-12-18 Thread zhangshilong (JIRA)
zhangshilong created YARN-7672:
--

 Summary: hadoop-sls can not simulate huge scale of YARN
 Key: YARN-7672
 URL: https://issues.apache.org/jira/browse/YARN-7672
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhangshilong
Assignee: zhangshilong


Our YARN cluster scales to nearly 10 thousand nodes.
We need to do scheduler pressure tests.
We start 2000+ threads to simulate NMs and AMs, so cpu.load goes as high as 
100+. I think that will affect the performance evaluation of the scheduler.
So I propose to separate the scheduler from the simulator:
I start a real RM. Then SLS registers nodes with the RM and submits apps to 
the RM using RM RPC.






[jira] [Comment Edited] (YARN-7214) duplicated container completed To AM

2017-09-19 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171396#comment-16171396
 ] 

zhangshilong edited comment on YARN-7214 at 9/19/17 9:38 AM:
-

!screenshot-1.png!
Generally:
1. The NM completes a container (c) and reports it to the RM.
2. The RM sends c to the AM, telling the AM that c is completed.
3. The RM sends c to the NM, telling the NM that c can be removed from the NM.
If the RM restarts before step 3, c will be reported to the AM as a 
duplicated completed container.
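The three-step flow above can be sketched as a toy model (plain Java collections; all class and method names here are hypothetical, not the real RM/NM classes). It shows why losing the step-3 acknowledgement across an RM restart makes the NM re-report the completion, so the AM sees it twice:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of the NM -> RM -> AM completion flow.
class CompletionFlowSketch {
  // Containers the NM tracks as "completed but not yet acked by the RM".
  final Set<String> nmPendingCompleted = new HashSet<>();
  // Completion events the AM has received (duplicates included).
  final List<String> amEvents = new ArrayList<>();

  // Step 1: the NM reports a completed container to the RM.
  // Step 2: the RM forwards the completion to the AM.
  void nmReportsCompletion(String containerId) {
    nmPendingCompleted.add(containerId);
    amEvents.add(containerId); // RM tells the AM
  }

  // Step 3: the RM tells the NM the container can be dropped from NM context.
  void rmAcksCompletion(String containerId) {
    nmPendingCompleted.remove(containerId);
  }

  // On RM restart the NM re-registers and re-reports anything still pending,
  // so the RM forwards those completions to the AM a second time.
  void rmRestartsAndRecovers() {
    amEvents.addAll(nmPendingCompleted);
  }
}
```

With this model, a completion that was never acked (RM restarted before step 3) reaches the AM twice, while an acked one is reported exactly once.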


was (Author: zsl2007):
!screenshot-1.png!
Generally:
1. The NM completes a container (c) and reports it to the RM.
2. The RM sends c to the AM, telling the AM that c is completed.
3. The RM sends c to the NM, telling the NM that c can be removed from the NM.
If the RM restarts before step 3, c stays in the context of the NM forever.
If the RM restarts again, c will be reported to the AM as a duplicated 
completed container.

> duplicated container completed To AM
> 
>
> Key: YARN-7214
> URL: https://issues.apache.org/jira/browse/YARN-7214
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1, 3.0.0-alpha3
> Environment: hadoop 2.7.1  rm recovery and nm recovery enabled
>Reporter: zhangshilong
> Attachments: screenshot-1.png
>
>
> env: hadoop 2.7.1 with RM recovery and NM recovery enabled
> case:
> A Spark app (app1) is running at least one container (named c1) on NM1.
> 1. NM1 crashed, and the RM found NM1 expired after 10 minutes.
> 2. The RM removes all containers on NM1 (RMNodeImpl), and app1 receives a 
> c1-completed message. But the RM cannot send c1 (to be removed) to NM1 
> because NM1 is lost.
> 3. NM1 restarts and registers with the RM (c1 is in the register request), 
> but the RM finds NM1 is lost and will not handle containers from NM1.
> 4. NM1 will not include c1 in its heartbeats (c1 is not in the heartbeat 
> request), so c1 will not be removed from the context of NM1.
> 5. The RM restarts and NM1 re-registers with the RM. Now c1 is handled and 
> recovered, and the RM sends a c1-completed message to the AM of app1. So 
> app1 receives a duplicated c1.
> Once the Spark AM receives a container-completed event from the RM, it 
> allocates one new container.






[jira] [Commented] (YARN-7214) duplicated container completed To AM

2017-09-19 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171396#comment-16171396
 ] 

zhangshilong commented on YARN-7214:


!screenshot-1.png!
Generally:
1. The NM completes a container (c) and reports it to the RM.
2. The RM sends c to the AM, telling the AM that c is completed.
3. The RM sends c to the NM, telling the NM that c can be removed from the NM.
If the RM restarts before step 3, c stays in the context of the NM forever.
If the RM restarts again, c will be reported to the AM as a duplicated 
completed container.







[jira] [Updated] (YARN-7214) duplicated container completed To AM

2017-09-19 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-7214:
---
Attachment: screenshot-1.png







[jira] [Commented] (YARN-7214) duplicated container completed To AM

2017-09-19 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171322#comment-16171322
 ] 

zhangshilong commented on YARN-7214:


In my opinion, containers in recentlyStoppedContainers can be removed from 
the NMContext as soon as the NM heartbeats normally with the RM.
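A minimal sketch of that idea (hypothetical class and method names, not the actual NodeStatusUpdaterImpl code): keep the existing time-based expiry, but additionally drop entries the RM has acknowledged through a normal heartbeat response, so they do not linger in the NM context for the whole tracking window:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed cleanup for recently stopped containers.
class StoppedContainerCache {
  private final Map<String, Long> recentlyStopped = new HashMap<>();
  private final long trackMillis;

  StoppedContainerCache(long trackMillis) {
    this.trackMillis = trackMillis;
  }

  void addCompletedContainer(String containerId, long nowMillis) {
    recentlyStopped.putIfAbsent(containerId, nowMillis + trackMillis);
  }

  // Existing behavior: entries fall out only after the tracking window.
  void removeVeryOldStoppedContainers(long nowMillis) {
    recentlyStopped.values().removeIf(expiry -> expiry < nowMillis);
  }

  // Proposed addition: a successful heartbeat response acks container ids,
  // so they can be dropped immediately instead of waiting for expiry.
  void onHeartbeatAck(Iterable<String> ackedIds) {
    for (String id : ackedIds) {
      recentlyStopped.remove(id);
    }
  }

  boolean isRecentlyStopped(String containerId) {
    return recentlyStopped.containsKey(containerId);
  }
}
```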







[jira] [Commented] (YARN-7214) duplicated container completed To AM

2017-09-19 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171270#comment-16171270
 ] 

zhangshilong commented on YARN-7214:


3. 
{code:java}
 public static class AddNodeTransition implements
  SingleArcTransition<RMNodeImpl, RMNodeEvent> {

@Override
public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
  // Inform the scheduler
  RMNodeStartedEvent startEvent = (RMNodeStartedEvent) event;
  List<NMContainerStatus> containers = null;

  NodeId nodeId = rmNode.nodeId;
  RMNode previousRMNode =
  rmNode.context.getInactiveRMNodes().remove(nodeId);
  if (previousRMNode != null) {
rmNode.updateMetricsForRejoinedNode(previousRMNode.getState());
  } else {
NodeId unknownNodeId =
NodesListManager.createUnknownNodeId(nodeId.getHost());
previousRMNode =
rmNode.context.getInactiveRMNodes().remove(unknownNodeId);
if (previousRMNode != null) {
  ClusterMetrics.getMetrics().decrDecommisionedNMs();
}
// Increment activeNodes explicitly because this is a new node.
ClusterMetrics.getMetrics().incrNumActiveNodes();
containers = startEvent.getNMContainerStatuses();
if (containers != null && !containers.isEmpty()) {
  for (NMContainerStatus container : containers) {
if (container.getContainerState() == ContainerState.RUNNING ||
container.getContainerState() == ContainerState.SCHEDULED) {
  rmNode.launchedContainers.add(container.getContainerId());
}
  }
}
  }

  if (null != startEvent.getRunningApplications()) {
for (ApplicationId appId : startEvent.getRunningApplications()) {
  handleRunningAppOnNode(rmNode, rmNode.context, appId, rmNode.nodeId);
}
  }

  rmNode.context.getDispatcher().getEventHandler()
.handle(new NodeAddedSchedulerEvent(rmNode, containers));
  rmNode.context.getDispatcher().getEventHandler().handle(
new NodesListManagerEvent(
NodesListManagerEventType.NODE_USABLE, rmNode));
}
  }
{code}

4. In NodeStatusUpdaterImpl.java:
  Before registering, getNMContainerStatuses is called, so each completed 
container is put into recentlyStoppedContainers.
  In the register request, the completed containers are sent to the RM.
{code:java}
  public void addCompletedContainer(ContainerId containerId) {
synchronized (recentlyStoppedContainers) {
  removeVeryOldStoppedContainersFromCache();
  if (!recentlyStoppedContainers.containsKey(containerId)) {
recentlyStoppedContainers.put(containerId,
System.currentTimeMillis() + durationToTrackStoppedContainers);
  }
}
  }
{code}
On a normal heartbeat, getContainerStatuses is called.
A completed container will not be put into containerStatuses because it is in 
recentlyStoppedContainers, so the completed container will not be sent to the 
RM.
{code:java}
protected List<ContainerStatus> getContainerStatuses() throws IOException {
    List<ContainerStatus> containerStatuses = new ArrayList<>();
for (Container container : this.context.getContainers().values()) {
  ContainerId containerId = container.getContainerId();
  ApplicationId applicationId = containerId.getApplicationAttemptId()
  .getApplicationId();
  org.apache.hadoop.yarn.api.records.ContainerStatus containerStatus =
  container.cloneAndGetContainerStatus();
  if (containerStatus.getState() == ContainerState.COMPLETE) {
if (isApplicationStopped(applicationId)) {
  if (LOG.isDebugEnabled()) {
LOG.debug(applicationId + " is completing, " + " remove "
+ containerId + " from NM context.");
  }
  context.getContainers().remove(containerId);
  pendingCompletedContainers.put(containerId, containerStatus);
} else {
  if (!isContainerRecentlyStopped(containerId)) {
pendingCompletedContainers.put(containerId, containerStatus);
  }
}
// Adding to finished containers cache. Cache will keep it around at
// least for #durationToTrackStoppedContainers duration. In the
// subsequent call to stop container it will get removed from cache.
addCompletedContainer(containerId);
  } else {
containerStatuses.add(containerStatus);
  }
}

containerStatuses.addAll(pendingCompletedContainers.values());

if (LOG.isDebugEnabled()) {
  LOG.debug("Sending out " + containerStatuses.size()
  + " container statuses: " + containerStatuses);
}
return containerStatuses;
  }
{code}





[jira] [Created] (YARN-7214) duplicated container completed To AM

2017-09-19 Thread zhangshilong (JIRA)
zhangshilong created YARN-7214:
--

 Summary: duplicated container completed To AM
 Key: YARN-7214
 URL: https://issues.apache.org/jira/browse/YARN-7214
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-alpha3, 2.7.1
 Environment: hadoop 2.7.1  rm recovery and nm recovery enabled
Reporter: zhangshilong


env: hadoop 2.7.1 with RM recovery and NM recovery enabled
case:
A Spark app (app1) is running at least one container (named c1) on NM1.
1. NM1 crashed, and the RM found NM1 expired after 10 minutes.
2. The RM removes all containers on NM1 (RMNodeImpl), and app1 receives a 
c1-completed message. But the RM cannot send c1 (to be removed) to NM1 
because NM1 is lost.
3. NM1 restarts and registers with the RM (c1 is in the register request), 
but the RM finds NM1 is lost and will not handle containers from NM1.
4. NM1 will not include c1 in its heartbeats (c1 is not in the heartbeat 
request), so c1 will not be removed from the context of NM1.
5. The RM restarts and NM1 re-registers with the RM. Now c1 is handled and 
recovered, and the RM sends a c1-completed message to the AM of app1. So 
app1 receives a duplicated c1.
Once the Spark AM receives a container-completed event from the RM, it 
allocates one new container.







[jira] [Comment Edited] (YARN-4090) Make Collections.sort() more efficient by caching resource usage

2017-08-23 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137944#comment-16137944
 ] 

zhangshilong edited comment on YARN-4090 at 8/23/17 6:17 AM:
-

[~dan...@cloudera.com] [~yufeigu]  Never mind. It is my fault. 




was (Author: zsl2007):
[~dan...@cloudera.com] [~yufeigu]  Never mind. It is my fault. 
In patch v7:
In FSAppAttempt.java, the function "containerCompleted" can be called by RM 
preemption, AM release, and NM release.
RM preemption is considered in patch v7, but the AM and NM may also release 
the same container.
So, in my opinion:
{code:java}
// Remove from the list of containers
RMContainer removedContainer =
    liveContainers.remove(rmContainer.getContainerId());
if (removedContainer != null) {
  this.fsQueue.decResourceUsage(removedContainer.getAllocatedResource());
}
{code}


> Make Collections.sort() more efficient by caching resource usage
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: zhangshilong
> Attachments: sampling1.jpg, sampling2.jpg, YARN-4090.001.patch, 
> YARN-4090.002.patch, YARN-4090.003.patch, YARN-4090.004.patch, 
> YARN-4090.005.patch, YARN-4090.006.patch, YARN-4090.007.patch, 
> YARN-4090-preview.patch, YARN-4090-TestResult.pdf
>
>
> Collections.sort() consumes too much time in a scheduling round.






[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient by caching resource usage

2017-08-23 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137944#comment-16137944
 ] 

zhangshilong commented on YARN-4090:


[~dan...@cloudera.com] [~yufeigu]  Never mind. It is my fault. 
In patch v7:
In FSAppAttempt.java, the function "containerCompleted" can be called by RM 
preemption, AM release, and NM release.
RM preemption is considered in patch v7, but the AM and NM may also release 
the same container.
So, in my opinion:
{code:java}
// Remove from the list of containers
RMContainer removedContainer =
    liveContainers.remove(rmContainer.getContainerId());
if (removedContainer != null) {
  this.fsQueue.decResourceUsage(removedContainer.getAllocatedResource());
}
{code}








[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient by caching resource usage

2017-08-17 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131643#comment-16131643
 ] 

zhangshilong commented on YARN-4090:


I am very sorry. I am always working on the YARN project, but my job keeps me 
busy, so I have had no time to finish the patch. I will try my best to finish 
this before 2017.10.1. 







[jira] [Commented] (YARN-4752) FairScheduler should preempt for a ResourceRequest and all preempted containers should be on the same node

2017-02-26 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885081#comment-15885081
 ] 

zhangshilong commented on YARN-4752:


[~kasha]  I found one problem.
In FSLeafQueue, I think the resourceUsage of an app should not be changed in 
assignContainer, because FairShareComparator uses resourceUsage to sort apps.
{code:java}
private TreeSet<FSAppAttempt> fetchAppsWithDemand() {
    TreeSet<FSAppAttempt> pendingForResourceApps =
        new TreeSet<>(policy.getComparator());
readLock.lock();
try {
  for (FSAppAttempt app : runnableApps) {
Resource pending = app.getAppAttemptResourceUsage().getPending();
if (!pending.equals(none())) {
  pendingForResourceApps.add(app);
}
  }
} finally {
  readLock.unlock();
}
return pendingForResourceApps;
  }
{code}
But in FSPreemptionThread, via 
run -> preemptContainers -> app.trackContainerForPreemption, the 
preemptedResources of the app is changed without the FairScheduler lock.
So getResourceUsage of the app can change during assignContainer in 
FSLeafQueue.
{code:java}
@Override
  public Resource getResourceUsage() {
/*
 * getResourcesToPreempt() returns zero, except when there are containers
 * to preempt. Avoid creating an object in the common case.
 */
return getPreemptedResources().equals(Resources.none())
? getCurrentConsumption()
: Resources.subtract(getCurrentConsumption(), getPreemptedResources());
  }
{code}
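The hazard can be reproduced in isolation (toy classes below, not the real FSAppAttempt or FairShareComparator): if a field that a TreeSet's comparator sorts on is mutated while the element sits inside the set, comparator-driven lookups walk the wrong branch and can no longer find the element, even though it is still stored:

```java
import java.util.Comparator;
import java.util.TreeSet;

// Toy stand-in for an app whose resourceUsage drives the fair-share ordering.
class UsageRaceDemo {
  static class App {
    final String name;
    long usage; // analogous to getResourceUsage(); mutated without a lock
    App(String name, long usage) { this.name = name; this.usage = usage; }
  }

  // Returns true when the mutated element is still in the set but can no
  // longer be found through the comparator, i.e. the ordering is corrupted.
  static boolean lostAfterMutation() {
    TreeSet<App> byUsage = new TreeSet<>(
        Comparator.<App>comparingLong(a -> a.usage)
            .thenComparing(a -> a.name)); // tie-breaker, like an app id
    App b = new App("b", 10);
    App a = new App("a", 5);
    byUsage.add(b); // becomes the root
    byUsage.add(a); // smaller usage, stored left of b
    // Another thread (e.g. preemption tracking) changes the sort key
    // while 'a' is inside the set:
    a.usage = 20;
    // contains() now compares 20 vs 10 at the root and searches to the
    // right, never reaching the node where 'a' actually lives.
    return !byUsage.contains(a) && byUsage.size() == 2;
  }
}
```

This is the same failure mode as mutating preemptedResources (and therefore getResourceUsage) while FSLeafQueue is sorting or searching apps by that value.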

> FairScheduler should preempt for a ResourceRequest and all preempted 
> containers should be on the same node
> --
>
> Key: YARN-4752
> URL: https://issues.apache.org/jira/browse/YARN-4752
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: yarn-4752-1.patch, yarn-4752.2.patch, yarn-4752.3.patch, 
> yarn-4752.4.patch, yarn-4752.4.patch, 
> YARN-4752.FairSchedulerPreemptionOverhaul.pdf, yarn-6076-branch-2.1.patch
>
>
> A number of issues have been reported with respect to preemption in 
> FairScheduler along the lines of:
> # FairScheduler preempts resources from nodes even if the resultant free 
> resources cannot fit the incoming request.
> # Preemption doesn't preempt from sibling queues
> # Preemption doesn't preempt from sibling apps under the same queue that is 
> over its fairshare
> # ...
> Filing this umbrella JIRA to group all the issues together and think of a 
> comprehensive solution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2017-02-26 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885072#comment-15885072
 ] 

zhangshilong commented on YARN-4090:


[~yufeigu]  I found one problem when working on this issue.
In FSLeafQueue, I think the resourceUsage of an app should not be changed in 
assignContainer, because FairShareComparator uses resourceUsage to sort apps.
{code:java}
private TreeSet<FSAppAttempt> fetchAppsWithDemand() {
    TreeSet<FSAppAttempt> pendingForResourceApps =
        new TreeSet<>(policy.getComparator());
readLock.lock();
try {
  for (FSAppAttempt app : runnableApps) {
Resource pending = app.getAppAttemptResourceUsage().getPending();
if (!pending.equals(none())) {
  pendingForResourceApps.add(app);
}
  }
} finally {
  readLock.unlock();
}
return pendingForResourceApps;
  }
{code}
But in FSPreemptionThread, via 
run -> preemptContainers -> app.trackContainerForPreemption, the 
preemptedResources of the app is changed without the FairScheduler lock.
So getResourceUsage of the app can change during assignContainer in 
FSLeafQueue.
{code:java}
@Override
  public Resource getResourceUsage() {
/*
 * getResourcesToPreempt() returns zero, except when there are containers
 * to preempt. Avoid creating an object in the common case.
 */
return getPreemptedResources().equals(Resources.none())
? getCurrentConsumption()
: Resources.subtract(getCurrentConsumption(), getPreemptedResources());
  }
{code}

> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: zhangshilong
> Attachments: sampling1.jpg, sampling2.jpg, YARN-4090.001.patch, 
> YARN-4090.002.patch, YARN-4090.003.patch, YARN-4090.004.patch, 
> YARN-4090.005.patch, YARN-4090.006.patch, YARN-4090-preview.patch, 
> YARN-4090-TestResult.pdf
>
>
> Collections.sort() consumes too much time in a scheduling round.






[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2017-02-13 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863277#comment-15863277
 ] 

zhangshilong commented on YARN-4090:


Thanks [~yufeigu] for the reminder and for the extra information.
YARN-4691 is about the same thing for ResourceUsage; this JIRA will solve the 
problem mentioned in YARN-4691.







[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2017-02-08 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859019#comment-15859019
 ] 

zhangshilong commented on YARN-4090:


Thanks [~yufeigu].
When an application finishes, or its tasks finish, FSParentQueue and FSLeafQueue 
should update their resourceUsage; even on preemption, resourceUsage should be 
updated. [~xinxianyin]'s YARN-4090.003.patch already considers preemption and 
task completion; when creating the patch file, one of my commits was left out 
by mistake.
In my view, resourceUsage in FSParentQueue and FSLeafQueue should be updated on 
allocation, task completion, and preemption.
From the QA messages I found that unit tests are needed, so I will add unit 
tests for the resourceUsage calculation.
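The update rule above can be sketched as an incrementally maintained usage counter per queue, with deltas bubbling up the hierarchy (hypothetical classes, not the real FSQueue/Resource API):

```java
// Hypothetical sketch of the update rule in this comment (not the real
// FSQueue API): each queue caches its aggregate usage and adjusts it by a
// delta on allocation, task completion, and preemption, propagating the
// delta up to the parent so reads never have to walk the subtree.
public class QueueUsageSketch {
  static class Queue {
    final Queue parent;
    private long usedMb;   // cached aggregate usage for this subtree
    Queue(Queue parent) { this.parent = parent; }

    // Positive delta on allocation; negative on completion or preemption.
    void incUsage(long deltaMb) {
      for (Queue q = this; q != null; q = q.parent) {
        q.usedMb += deltaMb;
      }
    }

    long getUsageMb() { return usedMb; }   // O(1), no recomputation
  }

  // root -> parent -> leaf; allocate, release, allocate again.
  static long demoUsage() {
    Queue root = new Queue(null);
    Queue parent = new Queue(root);
    Queue leaf = new Queue(parent);
    leaf.incUsage(2048);    // container allocated
    leaf.incUsage(-2048);   // container completed or preempted
    leaf.incUsage(1024);    // another allocation
    return root.getUsageMb();
  }

  public static void main(String[] args) {
    System.out.println(demoUsage());   // prints 1024
  }
}
```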


> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: zhangshilong
> Attachments: sampling1.jpg, sampling2.jpg, YARN-4090.001.patch, 
> YARN-4090.002.patch, YARN-4090.003.patch, YARN-4090.004.patch, 
> YARN-4090.005.patch, YARN-4090.006.patch, YARN-4090-preview.patch, 
> YARN-4090-TestResult.pdf
>
>
> Collections.sort() consumes too much time in a scheduling round.






[jira] [Updated] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2017-02-06 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-4090:
---
Attachment: YARN-4090.006.patch

> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: zhangshilong
> Attachments: sampling1.jpg, sampling2.jpg, YARN-4090.001.patch, 
> YARN-4090.002.patch, YARN-4090.003.patch, YARN-4090.004.patch, 
> YARN-4090.005.patch, YARN-4090.006.patch, YARN-4090-preview.patch, 
> YARN-4090-TestResult.pdf
>
>
> Collections.sort() consumes too much time in a scheduling round.






[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2017-02-06 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855267#comment-15855267
 ] 

zhangshilong commented on YARN-4090:


So sorry for the whitespace; a new patch will be submitted.
I think there is no need for more unit tests. What do you think? [~yufeigu]

> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: zhangshilong
> Attachments: sampling1.jpg, sampling2.jpg, YARN-4090.001.patch, 
> YARN-4090.002.patch, YARN-4090.003.patch, YARN-4090.004.patch, 
> YARN-4090.005.patch, YARN-4090-preview.patch, YARN-4090-TestResult.pdf
>
>
> Collections.sort() consumes too much time in a scheduling round.






[jira] [Updated] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2017-02-03 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-4090:
---
Attachment: YARN-4090.005.patch

> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: zhangshilong
> Attachments: sampling1.jpg, sampling2.jpg, YARN-4090.001.patch, 
> YARN-4090.002.patch, YARN-4090.003.patch, YARN-4090.004.patch, 
> YARN-4090.005.patch, YARN-4090-preview.patch, YARN-4090-TestResult.pdf
>
>
> Collections.sort() consumes too much time in a scheduling round.






[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2017-02-03 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851288#comment-15851288
 ] 

zhangshilong commented on YARN-4090:


Thanks [~yufeigu]. I will submit the new patch as soon as possible.

> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: zhangshilong
> Attachments: sampling1.jpg, sampling2.jpg, YARN-4090.001.patch, 
> YARN-4090.002.patch, YARN-4090.003.patch, YARN-4090.004.patch, 
> YARN-4090-preview.patch, YARN-4090-TestResult.pdf
>
>
> Collections.sort() consumes too much time in a scheduling round.






[jira] [Commented] (YARN-5188) FairScheduler performance bug

2017-02-03 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851286#comment-15851286
 ] 

zhangshilong commented on YARN-5188:


[~chenfolin]  Good idea! And how does this patch perform?

> FairScheduler performance bug
> -
>
> Key: YARN-5188
> URL: https://issues.apache.org/jira/browse/YARN-5188
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.5.0
>Reporter: ChenFolin
> Attachments: YARN-5188-1.patch
>
>
>  My Hadoop cluster has recently encountered a performance problem. Details 
> follow.
> There are two points which cause this performance issue.
> 1: applications are sorted before each container assignment in FSLeafQueue. 
> TreeSet is not the best choice. Why not keep the list ordered, and use binary 
> search to restore the order when an application's resource usage changes?
> 2: queue sort and assignContainerPreCheck recompute every leaf queue's 
> resource usage. Why not store each leaf queue's usage in memory and update it 
> when a container is assigned or released?
>
>The efficiency of container assignment in the ResourceManager may fall 
> when the number of running and pending applications grows. In fact the 
> cluster has too much PendingMB or PendingVcore while current cluster 
> utilization may be below 20%.
>I checked the ResourceManager logs and found that each container 
> assignment may cost 5 ~ 10 ms, versus just 0 ~ 1 ms at usual times.
>  
>I used TestFairScheduler to reproduce the scene:
>  
>Just one queue: root.default
>  10240 apps.
>  
>assign container avg time:  6753.9 us ( 6.7539 ms )
>  apps sort time ( FSLeafQueue: Collections.sort(runnableApps, 
> comparator); ): 4657.01 us ( 4.657 ms )
>  compute LeafQueue resource usage: 905.171 us ( 0.905171 ms )
>  
>  With just root.default, one assign-container op contains: ( one apps 
> sort op ) + 2 * ( compute leafqueue usage op )
>According to the above, I think the assign-container op has a 
> performance problem.
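Point 1 above can be sketched as follows, assuming a hypothetical {appId, usage} entry rather than the real FSAppAttempt:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of point 1 (not the real FSLeafQueue code): keep the
// app list permanently sorted by usage and, when one app's usage changes,
// remove it and re-insert it at the position found by binary search, which
// costs O(log n) comparisons instead of a full O(n log n) re-sort.
public class OrderedAppsSketch {
  static final Comparator<int[]> BY_USAGE =
      Comparator.comparingInt(a -> a[1]);   // entry = {appId, usage}

  static void updateUsage(List<int[]> apps, int[] app, int newUsage) {
    apps.remove(app);                       // list stays sorted without it
    app[1] = newUsage;
    int pos = Collections.binarySearch(apps, app, BY_USAGE);
    if (pos < 0) {
      pos = -pos - 1;                       // insertion point for missing key
    }
    apps.add(pos, app);
  }

  // Three apps sorted by usage; app 1 grows past app 2 and is re-inserted.
  static String demo() {
    List<int[]> apps = new ArrayList<>();
    int[] a = {1, 10};
    int[] b = {2, 20};
    int[] c = {3, 30};
    apps.add(a); apps.add(b); apps.add(c);
    updateUsage(apps, a, 25);
    return apps.get(0)[0] + "," + apps.get(1)[0] + "," + apps.get(2)[0];
  }

  public static void main(String[] args) {
    System.out.println(demo());   // prints "2,1,3"
  }
}
```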






[jira] [Assigned] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2016-12-30 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong reassigned YARN-4090:
--

Assignee: zhangshilong

> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: zhangshilong
> Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, 
> YARN-4090.001.patch, YARN-4090.002.patch, YARN-4090.003.patch, 
> YARN-4090.004.patch, sampling1.jpg, sampling2.jpg
>
>
> Collections.sort() consumes too much time in a scheduling round.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2016-12-30 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15787353#comment-15787353
 ] 

zhangshilong commented on YARN-4090:


No problem. Thank you very much for your patch; it was a great help to me.

> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
> Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, 
> YARN-4090.001.patch, YARN-4090.002.patch, YARN-4090.003.patch, 
> YARN-4090.004.patch, sampling1.jpg, sampling2.jpg
>
>
> Collections.sort() consumes too much time in a scheduling round.






[jira] [Updated] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2016-12-30 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-4090:
---
Attachment: YARN-4090.004.patch

Fix the branch-2.6 deadlock in FSParentQueue.java.

> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
> Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, 
> YARN-4090.001.patch, YARN-4090.002.patch, YARN-4090.003.patch, 
> YARN-4090.004.patch, sampling1.jpg, sampling2.jpg
>
>
> Collections.sort() consumes too much time in a scheduling round.






[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2016-12-30 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15787258#comment-15787258
 ] 

zhangshilong commented on YARN-4090:


I see. In branch-2.6, FSParentQueue.getQueueUserAclInfo() is synchronized, so it 
locks the FSQueue. But the code changed in version 2.7.1:
{code:java}
  @Override
  public synchronized List<QueueUserACLInfo> getQueueUserAclInfo(
      UserGroupInformation user) {
    List<QueueUserACLInfo> userAcls = new ArrayList<QueueUserACLInfo>();

    // Add queue acls
    userAcls.add(getUserAclInfo(user));

    // Add children queue acls
    for (FSQueue child : childQueues) {
      userAcls.addAll(child.getQueueUserAclInfo(user));
    }

    return userAcls;
  }
{code}

> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
> Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, 
> YARN-4090.001.patch, YARN-4090.002.patch, YARN-4090.003.patch, sampling1.jpg, 
> sampling2.jpg
>
>
> Collections.sort() consumes too much time in a scheduling round.






[jira] [Updated] (YARN-6045) apps/queues that have no pending containers will still affect the efficiency of scheduling

2016-12-29 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-6045:
---
Environment: 
yarn: 2.7.1 release
jdk 1.7
kernel:2.6.32-431.20.3.el6

  was:
jdk 1.7
kernel:2.6.32-431.20.3.el6


> apps/queues that have no pending containers will still affect the efficiency 
> of scheduling
> --
>
> Key: YARN-6045
> URL: https://issues.apache.org/jira/browse/YARN-6045
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
> Environment: yarn: 2.7.1 release
> jdk 1.7
> kernel:2.6.32-431.20.3.el6
>Reporter: zhangshilong
>Assignee: zhangshilong
>
> Sorting queues/apps consumes a significant amount of time during a single 
> container allocation.
> Each time a container is assigned, all queues/apps are sorted along the 
> hierarchy.
> In practice, many queues/apps without pending containers do not need to 
> participate in the sort.
> If apps/queues that need no resources are excluded from the sort, scheduling 
> performance will increase a lot.






[jira] [Created] (YARN-6045) apps/queues that have no pending containers will still affect the efficiency of scheduling

2016-12-29 Thread zhangshilong (JIRA)
zhangshilong created YARN-6045:
--

 Summary: apps/queues that have no pending containers will still 
affect the efficiency of scheduling
 Key: YARN-6045
 URL: https://issues.apache.org/jira/browse/YARN-6045
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 2.7.1
 Environment: jdk 1.7
kernel:2.6.32-431.20.3.el6
Reporter: zhangshilong
Assignee: zhangshilong


Sorting queues/apps consumes a significant amount of time during a single 
container allocation.
Each time a container is assigned, all queues/apps are sorted along the 
hierarchy.
In practice, many queues/apps without pending containers do not need to 
participate in the sort.
If apps/queues that need no resources are excluded from the sort, scheduling 
performance will increase a lot.
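The proposal can be sketched as filtering out apps with no pending demand before the per-allocation sort (hypothetical classes, not the real scheduler code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the proposal (not the real scheduler code): only
// apps that still have pending demand take part in the per-allocation sort,
// so idle apps and queues add no cost to a scheduling round.
public class PendingOnlySortSketch {
  static class App {
    final String name;
    final int pending;   // outstanding resource requests
    final int usage;
    App(String name, int pending, int usage) {
      this.name = name;
      this.pending = pending;
      this.usage = usage;
    }
  }

  static List<App> sortSchedulable(List<App> apps) {
    List<App> schedulable = new ArrayList<>();
    for (App a : apps) {
      if (a.pending > 0) {            // idle apps are skipped entirely
        schedulable.add(a);
      }
    }
    schedulable.sort(Comparator.comparingInt((App a) -> a.usage));
    return schedulable;
  }

  // One idle app and two busy ones: only the busy apps get sorted.
  static String demo() {
    List<App> apps = new ArrayList<>();
    apps.add(new App("idle", 0, 50));
    apps.add(new App("busy2", 3, 20));
    apps.add(new App("busy1", 1, 10));
    List<App> order = sortSchedulable(apps);
    return order.size() + ":" + order.get(0).name;
  }

  public static void main(String[] args) {
    System.out.println(demo());   // prints "2:busy1"
  }
}
```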







[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2016-12-28 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784417#comment-15784417
 ] 

zhangshilong commented on YARN-4090:


Would you please tell me which YARN version you used?
In trunk:
FairScheduler.getQueueUserAclInfo() will not lock the FSQueue object; the 
FSQueue object is locked only when decResourceUsage or incrResourceUsage is called.
FairScheduler:
{code:java}
  @Override
  public List<QueueUserACLInfo> getQueueUserAclInfo() {
    UserGroupInformation user;
    try {
      user = UserGroupInformation.getCurrentUser();
    } catch (IOException ioe) {
      return new ArrayList<QueueUserACLInfo>();
    }

    return queueMgr.getRootQueue().getQueueUserAclInfo(user);
  }
{code}
FSParentQueue.java
{code:java}
  @Override
  public List<QueueUserACLInfo> getQueueUserAclInfo(UserGroupInformation user) {
    List<QueueUserACLInfo> userAcls = new ArrayList<>();

    // Add queue acls
    userAcls.add(getUserAclInfo(user));

    // Add children queue acls
    readLock.lock();
    try {
      for (FSQueue child : childQueues) {
        userAcls.addAll(child.getQueueUserAclInfo(user));
      }
    } finally {
      readLock.unlock();
    }

    return userAcls;
  }
{code}
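A minimal sketch of the read/write-lock pattern the trunk code above uses, and why it avoids the exclusive-monitor behavior of branch-2.6's synchronized methods (hypothetical class and queue names, not YARN code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the locking pattern used in trunk above: readers
// traverse children under a shared read lock, so concurrent readers never
// block each other, while writers take the exclusive write lock. With the
// old per-queue synchronized monitors, two threads recursing the hierarchy
// while holding different monitors could deadlock.
public class ReadLockTraversalSketch {
  private final ReadWriteLock rwLock = new ReentrantReadWriteLock();
  private final List<String> children = new ArrayList<>();

  void addChild(String name) {
    rwLock.writeLock().lock();        // exclusive: structure is changing
    try {
      children.add(name);
    } finally {
      rwLock.writeLock().unlock();
    }
  }

  List<String> snapshotChildren() {
    rwLock.readLock().lock();         // shared: many readers at once
    try {
      return new ArrayList<>(children);
    } finally {
      rwLock.readLock().unlock();
    }
  }

  static int demo() {
    ReadLockTraversalSketch q = new ReadLockTraversalSketch();
    q.addChild("root.default");
    q.addChild("root.adhoc");         // "root.adhoc" is an assumed queue name
    return q.snapshotChildren().size();
  }

  public static void main(String[] args) {
    System.out.println(demo());   // prints 2
  }
}
```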

> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
> Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, 
> YARN-4090.001.patch, YARN-4090.002.patch, YARN-4090.003.patch, sampling1.jpg, 
> sampling2.jpg
>
>
> Collections.sort() consumes too much time in a scheduling round.






[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2016-12-28 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784392#comment-15784392
 ] 

zhangshilong commented on YARN-4090:


[~xinxianyin] [~yufeigu]   This optimization works very well in our 
environment; I hope to continue this issue.

> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
> Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, 
> YARN-4090.001.patch, YARN-4090.002.patch, YARN-4090.003.patch, sampling1.jpg, 
> sampling2.jpg
>
>
> Collections.sort() consumes too much time in a scheduling round.






[jira] [Commented] (YARN-5969) FairShareComparator: Cache value of getResourceUsage for better performance

2016-12-27 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782248#comment-15782248
 ] 

zhangshilong commented on YARN-5969:


Thanks [~yufeigu] for the advice and review, and [~kasha] for the commit.
Our YARN cluster has reached nearly 4000 nodes, and FairScheduler performance has 
run into many problems; I hope to submit more optimizations to the community.

> FairShareComparator: Cache value of getResourceUsage for better performance
> ---
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
>Assignee: zhangshilong
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: 20161206.patch, 20161222.patch, YARN-5969.patch, 
> apprunning_after.png, apprunning_before.png, 
> containerAllocatedDelta_before.png, containerAllocated_after.png, 
> pending_after.png, pending_before.png
>
>
> In the FairShareComparator class, the performance of getResourceUsage() is 
> very poor: it can be executed more than 100,000,000 times per second.
> In our scenario it takes 20 seconds per minute.
> A simple solution is to reduce the call count of the function.






[jira] [Updated] (YARN-5969) FairShareComparator getResourceUsage poor performance

2016-12-26 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-5969:
---
Attachment: YARN-5969.patch

> FairShareComparator getResourceUsage poor performance
> -
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
>Assignee: zhangshilong
> Attachments: 20161206.patch, 20161222.patch, YARN-5969.patch, 
> apprunning_after.png, apprunning_before.png, 
> containerAllocatedDelta_before.png, containerAllocated_after.png, 
> pending_after.png, pending_before.png
>
>
> In the FairShareComparator class, the performance of getResourceUsage() is 
> very poor: it can be executed more than 100,000,000 times per second.
> In our scenario it takes 20 seconds per minute.
> A simple solution is to reduce the call count of the function.






[jira] [Updated] (YARN-5969) FairShareComparator getResourceUsage poor performance

2016-12-21 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-5969:
---
Attachment: 20161222.patch

> FairShareComparator getResourceUsage poor performance
> -
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
>Assignee: zhangshilong
> Attachments: 20161206.patch, 20161222.patch, apprunning_after.png, 
> apprunning_before.png, containerAllocatedDelta_before.png, 
> containerAllocated_after.png, pending_after.png, pending_before.png
>
>
> In the FairShareComparator class, the performance of getResourceUsage() is 
> very poor: it can be executed more than 100,000,000 times per second.
> In our scenario it takes 20 seconds per minute.
> A simple solution is to reduce the call count of the function.






[jira] [Commented] (YARN-5969) FairShareComparator getResourceUsage poor performance

2016-12-21 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769268#comment-15769268
 ] 

zhangshilong commented on YARN-5969:



Thanks Yufei Gu for the reminder; I will improve my patch soon.

> FairShareComparator getResourceUsage poor performance
> -
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
>Assignee: zhangshilong
> Attachments: 20161206.patch, apprunning_after.png, 
> apprunning_before.png, containerAllocatedDelta_before.png, 
> containerAllocated_after.png, pending_after.png, pending_before.png
>
>
> In the FairShareComparator class, the performance of getResourceUsage() is 
> very poor: it can be executed more than 100,000,000 times per second.
> In our scenario it takes 20 seconds per minute.
> A simple solution is to reduce the call count of the function.






[jira] [Comment Edited] (YARN-5969) FairShareComparator getResourceUsage poor performance

2016-12-21 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766471#comment-15766471
 ] 

zhangshilong edited comment on YARN-5969 at 12/21/16 8:34 AM:
--

The containerAllocated pictures show container allocations per minute.
After the patch, container allocations per minute improve by about 50%.
Obviously, the 500 apps finish faster after the patch.


was (Author: zsl2007):
ContainerAllocated picture means container  allocation per minute. 
After patch, Container  allocation per minute improves about 50%.

> FairShareComparator getResourceUsage poor performance
> -
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
>Assignee: zhangshilong
> Attachments: 20161206.patch, apprunning_after.png, 
> apprunning_before.png, containerAllocatedDelta_before.png, 
> containerAllocated_after.png, pending_after.png, pending_before.png
>
>
> In the FairShareComparator class, the performance of getResourceUsage() is 
> very poor: it can be executed more than 100,000,000 times per second.
> In our scenario it takes 20 seconds per minute.
> A simple solution is to reduce the call count of the function.






[jira] [Commented] (YARN-5969) FairShareComparator getResourceUsage poor performance

2016-12-21 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766471#comment-15766471
 ] 

zhangshilong commented on YARN-5969:


The containerAllocated pictures show container allocations per minute.
After the patch, container allocations per minute improve by about 50%.

> FairShareComparator getResourceUsage poor performance
> -
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
>Assignee: zhangshilong
> Attachments: 20161206.patch, apprunning_after.png, 
> apprunning_before.png, containerAllocatedDelta_before.png, 
> containerAllocated_after.png, pending_after.png, pending_before.png
>
>
> In the FairShareComparator class, the performance of getResourceUsage() is 
> very poor: it can be executed more than 100,000,000 times per second.
> In our scenario it takes 20 seconds per minute.
> A simple solution is to reduce the call count of the function.






[jira] [Updated] (YARN-5969) FairShareComparator getResourceUsage poor performance

2016-12-21 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-5969:
---
Attachment: containerAllocated_after.png
apprunning_after.png

> FairShareComparator getResourceUsage poor performance
> -
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
>Assignee: zhangshilong
> Attachments: 20161206.patch, apprunning_after.png, 
> apprunning_before.png, containerAllocatedDelta_before.png, 
> containerAllocated_after.png, pending_after.png, pending_before.png
>
>
> In the FairShareComparator class, the performance of getResourceUsage() is 
> very poor: it can be executed more than 100,000,000 times per second.
> In our scenario it takes 20 seconds per minute.
> A simple solution is to reduce the call count of the function.






[jira] [Updated] (YARN-5969) FairShareComparator getResourceUsage poor performance

2016-12-21 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-5969:
---
Attachment: pending_before.png
pending_after.png
containerAllocatedDelta_before.png
apprunning_before.png

> FairShareComparator getResourceUsage poor performance
> -
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
>Assignee: zhangshilong
> Attachments: 20161206.patch, apprunning_before.png, 
> containerAllocatedDelta_before.png, pending_after.png, pending_before.png
>
>
> In the FairShareComparator class, the performance of getResourceUsage() is 
> very poor: it can be executed more than 100,000,000 times per second.
> In our scenario it takes 20 seconds per minute.
> A simple solution is to reduce the call count of the function.






[jira] [Commented] (YARN-5969) FairShareComparator getResourceUsage poor performance

2016-12-21 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766437#comment-15766437
 ] 

zhangshilong commented on YARN-5969:


Test case: 500 apps, 3000 NM nodes.
Queues:
parent queue number: 100
leaf queues per parent queue: 5
The 500 apps were submitted to 155 leaf queues, so on average each queue 
contains about 4 apps.
All apps are MapReduce jobs. Each job contains 325 mappers and 44 reducers, and 
every mapper/reducer just sleeps for 20 seconds.




> FairShareComparator getResourceUsage poor performance
> -
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
>Assignee: zhangshilong
> Attachments: 20161206.patch
>
>
> In the FairShareComparator class, the performance of getResourceUsage() is 
> very poor: it can be executed more than 100,000,000 times per second.
> In our scenario it takes 20 seconds per minute.
> A simple solution is to reduce the call count of the function.






[jira] [Updated] (YARN-5969) FairShareComparator getResourceUsage poor performance

2016-12-05 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-5969:
---
Description: 
in the FairShareComparator class, the performance of getResourceUsage() is
very poor; it is executed more than 100,000,000 times per second.
In our scenario, it consumes about 20 seconds per minute.
A simple solution is to reduce the number of calls to the function.

  was:
in FairShareComparator.java, the performance of function getResourceUsage()  is 
very poor. It will be executed above 100,000,000 times per second.
In our scene, It  takes 20 seconds per minute.  
A simple solution is to reduce call counts  of the function.


> FairShareComparator getResourceUsage poor performance
> -
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
> Attachments: 20161206.patch
>
>
> in the FairShareComparator class, the performance of getResourceUsage()
> is very poor; it is executed more than 100,000,000 times per second.
> In our scenario, it consumes about 20 seconds per minute.
> A simple solution is to reduce the number of calls to the function.






[jira] [Updated] (YARN-5969) FairShareComparator getResourceUsage poor performance

2016-12-05 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-5969:
---
Description: 
in FairShareComparator.java, the performance of function getResourceUsage()  is 
very poor. It will be executed above 100,000,000 times per second.
In our scene, It  takes 20 seconds per minute.  
A simple solution is to reduce call counts  of the function.

  was:
in FairShareComparator.java, the performance of function getResourceUsage()  is 
very pool. It will be executed above 100,000,000 times per second.
In our scene, It  takes 20 seconds per minute.  
A simple solution is to reduce call counts  of the function.


> FairShareComparator getResourceUsage poor performance
> -
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
> Attachments: 20161206.patch
>
>
> in FairShareComparator.java, the performance of function getResourceUsage()  
> is very poor. It will be executed above 100,000,000 times per second.
> In our scene, It  takes 20 seconds per minute.  
> A simple solution is to reduce call counts  of the function.






[jira] [Updated] (YARN-5969) FairShareComparator getResourceUsage poor performance

2016-12-05 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-5969:
---
Summary: FairShareComparator getResourceUsage poor performance  (was: 
FairShareComparator getResourceUsage pool performance)

> FairShareComparator getResourceUsage poor performance
> -
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
> Attachments: 20161206.patch
>
>
> in FairShareComparator.java, the performance of function getResourceUsage()  
> is very pool. It will be executed above 100,000,000 times per second.
> In our scene, It  takes 20 seconds per minute.  
> A simple solution is to reduce call counts  of the function.






[jira] [Updated] (YARN-5969) FairShareComparator getResourceUsage pool performance

2016-12-05 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-5969:
---
Attachment: 20161206.patch

> FairShareComparator getResourceUsage pool performance
> -
>
> Key: YARN-5969
> URL: https://issues.apache.org/jira/browse/YARN-5969
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhangshilong
> Attachments: 20161206.patch
>
>
> in FairShareComparator.java, the performance of function getResourceUsage()  
> is very pool. It will be executed above 100,000,000 times per second.
> In our scene, It  takes 20 seconds per minute.  
> A simple solution is to reduce call counts  of the function.






[jira] [Created] (YARN-5969) FairShareComparator getResourceUsage pool performance

2016-12-05 Thread zhangshilong (JIRA)
zhangshilong created YARN-5969:
--

 Summary: FairShareComparator getResourceUsage pool performance
 Key: YARN-5969
 URL: https://issues.apache.org/jira/browse/YARN-5969
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 2.7.1
Reporter: zhangshilong


in FairShareComparator.java, the performance of function getResourceUsage()  is 
very pool. It will be executed above 100,000,000 times per second.
In our scene, It  takes 20 seconds per minute.  
A simple solution is to reduce call counts  of the function.






[jira] [Commented] (YARN-4327) RM can not renew TIMELINE_DELEGATION_TOKEN in secure clusters

2015-12-22 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069121#comment-15069121
 ] 

zhangshilong commented on YARN-4327:


Yes, I tried yarn.timeline-service.http-authentication.type=kerberos. Jobs
could then be submitted, but users could not access application history through the web app.

> RM can not renew  TIMELINE_DELEGATION_TOKEN in secure clusters
> --
>
> Key: YARN-4327
> URL: https://issues.apache.org/jira/browse/YARN-4327
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.1
> Environment: hadoop 2.7.1; hdfs, yarn, mrhistoryserver, and ATS all use
> kerberos security.
> conf like this:
> <property>
>   <name>hadoop.security.authorization</name>
>   <value>true</value>
>   <description>Is service-level authorization enabled?</description>
> </property>
> <property>
>   <name>hadoop.security.authentication</name>
>   <value>kerberos</value>
>   <description>Possible values are simple (no authentication), and kerberos</description>
> </property>
>Reporter: zhangshilong
>
> in hadoop 2.7.1
> ATS conf like this:
> <property>
>   <name>yarn.timeline-service.http-authentication.type</name>
>   <value>simple</value>
> </property>
> <property>
>   <name>yarn.timeline-service.http-authentication.kerberos.principal</name>
>   <value>HTTP/_h...@xxx.com</value>
> </property>
> <property>
>   <name>yarn.timeline-service.http-authentication.kerberos.keytab</name>
>   <value>/etc/hadoop/keytabs/xxx.keytab</value>
> </property>
> <property>
>   <name>yarn.timeline-service.principal</name>
>   <value>xxx/_h...@xxx.com</value>
> </property>
> <property>
>   <name>yarn.timeline-service.keytab</name>
>   <value>/etc/hadoop/keytabs/xxx.keytab</value>
> </property>
> <property>
>   <name>yarn.timeline-service.best-effort</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.timeline-service.enabled</name>
>   <value>true</value>
> </property>
> I'd like to allow everyone to access the ATS over HTTP, as with the RM and HDFS web UIs.
> The client can submit a job to the RM and add a TIMELINE_DELEGATION_TOKEN to the AM
> context, but the RM cannot renew the TIMELINE_DELEGATION_TOKEN, which causes the
> application to fail.
> RM logs:
> 2015-11-03 11:58:38,191 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer.
> java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, 
> Service: 10.12.38.4:8188, Ident: (owner=yarn-test, renewer=yarn-test, 
> realUser=, issueDate=1446523118046, maxDate=1447127918046, sequenceNumber=9, 
> masterKeyId=2)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:439)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:78)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:847)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:828)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: HTTP status [500], message [Null user]
> at 
> org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:169)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:287)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:212)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:414)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:396)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:378)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$5.run(TimelineClientImpl.java:451)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:183)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:466)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:400)
> at 
> 

[jira] [Commented] (YARN-4325) purge app state from NM state-store should be independent of log aggregation

2015-11-09 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997932#comment-14997932
 ] 

zhangshilong commented on YARN-4325:


If the HDFS permissions are correct, is there any other problem?
If yarn.log-aggregation-enable is set to false, does NM recovery still work well?

> purge app state from NM state-store should be independent of log aggregation
> 
>
> Key: YARN-4325
> URL: https://issues.apache.org/jira/browse/YARN-4325
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
>
> From a long running cluster, we found tens of thousands of stale apps still 
> be recovered in NM restart recovery. The reason is some wrong configuration 
> setting to log aggregation so the end of log aggregation events are not 
> received so stale apps are not purged properly. We should make sure the 
> removal of app state to be independent of log aggregation life cycle. 





[jira] [Updated] (YARN-4327) RM can not renew TIMELINE_DELEGATION_TOKEN in securt clusters

2015-11-03 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-4327:
---
Description: 
in hadoop 2.7.1
ATS conf like this:
<property>
  <name>yarn.timeline-service.http-authentication.type</name>
  <value>simple</value>
</property>
<property>
  <name>yarn.timeline-service.http-authentication.kerberos.principal</name>
  <value>HTTP/_h...@xxx.com</value>
</property>
<property>
  <name>yarn.timeline-service.http-authentication.kerberos.keytab</name>
  <value>/etc/hadoop/keytabs/xxx.keytab</value>
</property>
<property>
  <name>yarn.timeline-service.principal</name>
  <value>xxx/_h...@xxx.com</value>
</property>
<property>
  <name>yarn.timeline-service.keytab</name>
  <value>/etc/hadoop/keytabs/xxx.keytab</value>
</property>
<property>
  <name>yarn.timeline-service.best-effort</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>

I'd like to allow everyone to access the ATS over HTTP, as with the RM and HDFS web UIs.
The client can submit a job to the RM and add a TIMELINE_DELEGATION_TOKEN to the AM context,
but the RM cannot renew the TIMELINE_DELEGATION_TOKEN, which causes the application to fail.
RM logs:
2015-11-03 11:58:38,191 WARN 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: 
Unable to add the application to the delegation token renewer.
java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, 
Service: 10.12.38.4:8188, Ident: (owner=yarn-test, renewer=yarn-test, 
realUser=, issueDate=1446523118046, maxDate=1447127918046, sequenceNumber=9, 
masterKeyId=2)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:439)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:78)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:847)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:828)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: HTTP status [500], message [Null user]
at 
org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:169)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:287)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:212)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:414)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:396)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:378)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$5.run(TimelineClientImpl.java:451)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:183)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:466)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:400)
at 
org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81)
at org.apache.hadoop.security.token.Token.renew(Token.java:377)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:543)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:540)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.renewToken(DelegationTokenRenewer.java:538)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:437)
... 6 more

ATS logs:
2015-11-03 14:47:45,407 INFO 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Creating password for identifier: owner=yarn-test, renewer=yarn-test, 
realUser=, issueDate=1446533265407, 

[jira] [Created] (YARN-4327) RM can not renew TIMELINE_DELEGATION_TOKEN in securt clusters

2015-11-03 Thread zhangshilong (JIRA)
zhangshilong created YARN-4327:
--

 Summary: RM can not renew  TIMELINE_DELEGATION_TOKEN in securt 
clusters
 Key: YARN-4327
 URL: https://issues.apache.org/jira/browse/YARN-4327
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.7.1
 Environment: hadoop 2.7.1; hdfs, yarn, mrhistoryserver, and ATS all use
kerberos security.
conf like this:
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
  <description>Is service-level authorization enabled?</description>
</property>
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
  <description>Possible values are simple (no authentication), and kerberos</description>
</property>

Reporter: zhangshilong


in hadoop 2.7.1
ATS conf like this:
<property>
  <name>yarn.timeline-service.http-authentication.type</name>
  <value>simple</value>
</property>
<property>
  <name>yarn.timeline-service.http-authentication.kerberos.principal</name>
  <value>HTTP/_h...@xxx.com</value>
</property>
<property>
  <name>yarn.timeline-service.http-authentication.kerberos.keytab</name>
  <value>/etc/hadoop/keytabs/xxx.keytab</value>
</property>
<property>
  <name>yarn.timeline-service.principal</name>
  <value>xxx/_h...@xxx.com</value>
</property>
<property>
  <name>yarn.timeline-service.keytab</name>
  <value>/etc/hadoop/keytabs/xxx.keytab</value>
</property>
<property>
  <name>yarn.timeline-service.best-effort</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>

I'd like to allow everyone to access the ATS over HTTP, as with the RM and HDFS web UIs.
The client can submit a job to the RM and add a TIMELINE_DELEGATION_TOKEN to the AM context,
but the RM cannot renew the TIMELINE_DELEGATION_TOKEN, which causes the application to fail.
RM logs:
2015-11-03 11:58:38,191 WARN 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: 
Unable to add the application to the delegation token renewer.
java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, 
Service: 10.12.38.4:8188, Ident: (owner=yarn-test, renewer=yarn-test, 
realUser=, issueDate=1446523118046, maxDate=1447127918046, sequenceNumber=9, 
masterKeyId=2)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:439)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:78)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:847)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:828)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: HTTP status [500], message [Null user]
at 
org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:169)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:287)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:212)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:414)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:396)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:378)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$5.run(TimelineClientImpl.java:451)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:183)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:466)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:400)
at 
org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81)
at org.apache.hadoop.security.token.Token.renew(Token.java:377)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:543)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:540)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
  

[jira] [Updated] (YARN-4327) RM can not renew TIMELINE_DELEGATION_TOKEN in secure clusters

2015-11-03 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-4327:
---
Summary: RM can not renew  TIMELINE_DELEGATION_TOKEN in secure clusters  
(was: RM can not renew  TIMELINE_DELEGATION_TOKEN in securt clusters)

> RM can not renew  TIMELINE_DELEGATION_TOKEN in secure clusters
> --
>
> Key: YARN-4327
> URL: https://issues.apache.org/jira/browse/YARN-4327
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.1
> Environment: hadoop 2.7.1; hdfs, yarn, mrhistoryserver, and ATS all use
> kerberos security.
> conf like this:
> <property>
>   <name>hadoop.security.authorization</name>
>   <value>true</value>
>   <description>Is service-level authorization enabled?</description>
> </property>
> <property>
>   <name>hadoop.security.authentication</name>
>   <value>kerberos</value>
>   <description>Possible values are simple (no authentication), and kerberos</description>
> </property>
>Reporter: zhangshilong
>
> in hadoop 2.7.1
> ATS conf like this:
> <property>
>   <name>yarn.timeline-service.http-authentication.type</name>
>   <value>simple</value>
> </property>
> <property>
>   <name>yarn.timeline-service.http-authentication.kerberos.principal</name>
>   <value>HTTP/_h...@xxx.com</value>
> </property>
> <property>
>   <name>yarn.timeline-service.http-authentication.kerberos.keytab</name>
>   <value>/etc/hadoop/keytabs/xxx.keytab</value>
> </property>
> <property>
>   <name>yarn.timeline-service.principal</name>
>   <value>xxx/_h...@xxx.com</value>
> </property>
> <property>
>   <name>yarn.timeline-service.keytab</name>
>   <value>/etc/hadoop/keytabs/xxx.keytab</value>
> </property>
> <property>
>   <name>yarn.timeline-service.best-effort</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.timeline-service.enabled</name>
>   <value>true</value>
> </property>
> I'd like to allow everyone to access the ATS over HTTP, as with the RM and HDFS web UIs.
> The client can submit a job to the RM and add a TIMELINE_DELEGATION_TOKEN to the AM
> context, but the RM cannot renew the TIMELINE_DELEGATION_TOKEN, which causes the
> application to fail.
> RM logs:
> 2015-11-03 11:58:38,191 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer.
> java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, 
> Service: 10.12.38.4:8188, Ident: (owner=yarn-test, renewer=yarn-test, 
> realUser=, issueDate=1446523118046, maxDate=1447127918046, sequenceNumber=9, 
> masterKeyId=2)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:439)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:78)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:847)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:828)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: HTTP status [500], message [Null user]
> at 
> org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:169)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:287)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:212)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:414)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:396)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:378)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$5.run(TimelineClientImpl.java:451)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:183)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:466)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:400)
> at 
> org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81)
> at 

[jira] [Commented] (YARN-4327) RM can not renew TIMELINE_DELEGATION_TOKEN in securt clusters

2015-11-03 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987054#comment-14987054
 ] 

zhangshilong commented on YARN-4327:


With yarn.timeline-service.http-authentication.type set to simple, the ATS uses
PseudoAuthenticationHandler and generates a token like u=null=null, which makes
the ATS throw a java.lang.IllegalArgumentException.
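The failure mode can be illustrated with a deliberately simplified sketch. These classes are hypothetical stand-ins, not the real Hadoop PseudoAuthenticationHandler or DelegationTokenRenewer: the point is only that an anonymous pseudo-authenticated request carries no user identity, and a renewal path that requires one rejects it, which is consistent with the "HTTP status [500], message [Null user]" line in the RM log above.

```java
// Illustrative-only stand-ins for the pseudo-auth / token-renewal interaction.
public class NullUserSketch {

    // With "simple" auth and no user.name parameter supplied by the caller,
    // the pseudo handler effectively authenticates an anonymous (null) user.
    static String pseudoAuthenticate(String userNameParam) {
        return userNameParam; // may be null when the caller sends no identity
    }

    // The token-renewal path needs a concrete user and rejects null, which the
    // server then surfaces to the RM as an HTTP 500 with message "Null user".
    static String renewDelegationToken(String user) {
        if (user == null) {
            throw new IllegalArgumentException("Null user");
        }
        return "renewed-for-" + user;
    }

    public static void main(String[] args) {
        try {
            renewDelegationToken(pseudoAuthenticate(null));
            throw new AssertionError("expected renewal to fail");
        } catch (IllegalArgumentException expected) {
            System.out.println("renewal rejected: " + expected.getMessage());
        }
    }
}
```

This also matches the earlier comment in the thread: switching the HTTP authentication type to kerberos supplies a real user to the renewal path, so job submission works, at the cost of anonymous web access.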

> RM can not renew  TIMELINE_DELEGATION_TOKEN in securt clusters
> --
>
> Key: YARN-4327
> URL: https://issues.apache.org/jira/browse/YARN-4327
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.1
> Environment: hadoop 2.7.1; hdfs, yarn, mrhistoryserver, and ATS all use
> kerberos security.
> conf like this:
> <property>
>   <name>hadoop.security.authorization</name>
>   <value>true</value>
>   <description>Is service-level authorization enabled?</description>
> </property>
> <property>
>   <name>hadoop.security.authentication</name>
>   <value>kerberos</value>
>   <description>Possible values are simple (no authentication), and kerberos</description>
> </property>
>Reporter: zhangshilong
>
> in hadoop 2.7.1
> ATS conf like this:
> <property>
>   <name>yarn.timeline-service.http-authentication.type</name>
>   <value>simple</value>
> </property>
> <property>
>   <name>yarn.timeline-service.http-authentication.kerberos.principal</name>
>   <value>HTTP/_h...@xxx.com</value>
> </property>
> <property>
>   <name>yarn.timeline-service.http-authentication.kerberos.keytab</name>
>   <value>/etc/hadoop/keytabs/xxx.keytab</value>
> </property>
> <property>
>   <name>yarn.timeline-service.principal</name>
>   <value>xxx/_h...@xxx.com</value>
> </property>
> <property>
>   <name>yarn.timeline-service.keytab</name>
>   <value>/etc/hadoop/keytabs/xxx.keytab</value>
> </property>
> <property>
>   <name>yarn.timeline-service.best-effort</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.timeline-service.enabled</name>
>   <value>true</value>
> </property>
>  
> I'd like to allow everyone to access the ATS over HTTP, as with the RM and HDFS web UIs.
> The client can submit a job to the RM and add a TIMELINE_DELEGATION_TOKEN to the AM
> context, but the RM cannot renew the TIMELINE_DELEGATION_TOKEN, which causes the
> application to fail.
> RM logs:
> 2015-11-03 11:58:38,191 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer.
> java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, 
> Service: 10.12.38.4:8188, Ident: (owner=yarn-test, renewer=yarn-test, 
> realUser=, issueDate=1446523118046, maxDate=1447127918046, sequenceNumber=9, 
> masterKeyId=2)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:439)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:78)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:847)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:828)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: HTTP status [500], message [Null user]
> at 
> org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:169)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:287)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:212)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:414)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:396)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:378)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$5.run(TimelineClientImpl.java:451)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:183)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:466)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:400)
> at 
>