[jira] [Comment Edited] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node

2017-04-14 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969197#comment-15969197
 ] 

zhihai xu edited comment on YARN-6396 at 4/14/17 4:02 PM:
--

Thanks for the review [~jianhe] and [~rkanter]! If someone deletes the remote 
log dir, all the old logs will disappear. That would be a more serious issue, 
and recreating the remote log dir won't save the old log data. This looks like 
a monitoring problem; I think it would be better to handle it in a tool outside 
the NM. It is more efficient to do it in one place instead of on each NM, of 
which there could be many thousands in a large cluster. Yes, it's a trade-off 
between validation and efficiency. Also, restarting the NM will recreate the 
remote log dir.


was (Author: zxu):
Thanks for the review [~jianhe] and [~rkanter]! if some one deletes the remote 
log dir, all the old log will disappear. That will be a more serious issue, 
recreating the remote log dir won't save the old log data. This looks like a 
monitor problem, I think it will be better to do it in some tool outside the 
NM. It will be more efficient to do it at one place instead of on each NM, 
which could be many thousands in a large cluster. Yes, it's a trade off between 
validation and efficiency.

> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node
> ---
>
> Key: YARN-6396
> URL: https://issues.apache.org/jira/browse/YARN-6396
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.0.0-alpha2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-6396.000.patch
>
>
> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node.
> Currently, verifyAndCreateRemoteLogDir is called for every application on 
> each node before log aggregation starts. This is a non-trivial overhead for 
> the name node in a large cluster, since verifyAndCreateRemoteLogDir calls 
> getFileStatus. Once the remote log directory has been created successfully, 
> it is not necessary to call it again. It would be better to call 
> verifyAndCreateRemoteLogDir at LogAggregationService service initialization.






[jira] [Commented] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node

2017-04-14 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969197#comment-15969197
 ] 

zhihai xu commented on YARN-6396:
-

Thanks for the review [~jianhe] and [~rkanter]! If someone deletes the remote 
log dir, all the old logs will disappear. That would be a more serious issue, 
and recreating the remote log dir won't save the old log data. This looks like 
a monitoring problem; I think it would be better to handle it in a tool outside 
the NM. It is more efficient to do it in one place instead of on each NM, of 
which there could be many thousands in a large cluster. Yes, it's a trade-off 
between validation and efficiency.

> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node
> ---
>
> Key: YARN-6396
> URL: https://issues.apache.org/jira/browse/YARN-6396
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.0.0-alpha2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-6396.000.patch
>
>
> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node.
> Currently, verifyAndCreateRemoteLogDir is called for every application on 
> each node before log aggregation starts. This is a non-trivial overhead for 
> the name node in a large cluster, since verifyAndCreateRemoteLogDir calls 
> getFileStatus. Once the remote log directory has been created successfully, 
> it is not necessary to call it again. It would be better to call 
> verifyAndCreateRemoteLogDir at LogAggregationService service initialization.






[jira] [Comment Edited] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node

2017-04-13 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961891#comment-15961891
 ] 

zhihai xu edited comment on YARN-6396 at 4/13/17 7:47 PM:
--

Thanks for the review, [~haibochen]! [~jianhe], [~rkanter], [~xgong], could you 
also help review the patch? Thanks.


was (Author: zxu):
Thanks for the review [~haibochen], [~rkanter], [~xgong] Could you also help 
review the patch? thanks

> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node
> ---
>
> Key: YARN-6396
> URL: https://issues.apache.org/jira/browse/YARN-6396
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.0.0-alpha2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-6396.000.patch
>
>
> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node.
> Currently, verifyAndCreateRemoteLogDir is called for every application on 
> each node before log aggregation starts. This is a non-trivial overhead for 
> the name node in a large cluster, since verifyAndCreateRemoteLogDir calls 
> getFileStatus. Once the remote log directory has been created successfully, 
> it is not necessary to call it again. It would be better to call 
> verifyAndCreateRemoteLogDir at LogAggregationService service initialization.






[jira] [Comment Edited] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node

2017-04-12 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961891#comment-15961891
 ] 

zhihai xu edited comment on YARN-6396 at 4/12/17 11:48 PM:
---

Thanks for the review, [~haibochen]! [~rkanter], [~xgong], could you also help 
review the patch? Thanks.


was (Author: zxu):
Thanks for the review [~haibochen], [~rkanter][~xgong] Could you also help 
review the patch? thanks

> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node
> ---
>
> Key: YARN-6396
> URL: https://issues.apache.org/jira/browse/YARN-6396
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.0.0-alpha2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-6396.000.patch
>
>
> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node.
> Currently, verifyAndCreateRemoteLogDir is called for every application on 
> each node before log aggregation starts. This is a non-trivial overhead for 
> the name node in a large cluster, since verifyAndCreateRemoteLogDir calls 
> getFileStatus. Once the remote log directory has been created successfully, 
> it is not necessary to call it again. It would be better to call 
> verifyAndCreateRemoteLogDir at LogAggregationService service initialization.






[jira] [Comment Edited] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node

2017-04-12 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961891#comment-15961891
 ] 

zhihai xu edited comment on YARN-6396 at 4/12/17 11:48 PM:
---

Thanks for the review, [~haibochen]! [~rkanter], [~xgong], could you also help 
review the patch? Thanks.


was (Author: zxu):
Thanks for the review [~haibochen], [~xgong] Could you also help review the 
patch? thanks

> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node
> ---
>
> Key: YARN-6396
> URL: https://issues.apache.org/jira/browse/YARN-6396
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.0.0-alpha2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-6396.000.patch
>
>
> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node.
> Currently, verifyAndCreateRemoteLogDir is called for every application on 
> each node before log aggregation starts. This is a non-trivial overhead for 
> the name node in a large cluster, since verifyAndCreateRemoteLogDir calls 
> getFileStatus. Once the remote log directory has been created successfully, 
> it is not necessary to call it again. It would be better to call 
> verifyAndCreateRemoteLogDir at LogAggregationService service initialization.






[jira] [Comment Edited] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node

2017-04-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961891#comment-15961891
 ] 

zhihai xu edited comment on YARN-6396 at 4/8/17 8:29 PM:
-

Thanks for the review, [~haibochen]! [~xgong], could you also help review the 
patch? Thanks.


was (Author: zxu):
[~xgong] Could you help review the patch? thanks

> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node
> ---
>
> Key: YARN-6396
> URL: https://issues.apache.org/jira/browse/YARN-6396
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.0.0-alpha2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-6396.000.patch
>
>
> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node.
> Currently, verifyAndCreateRemoteLogDir is called for every application on 
> each node before log aggregation starts. This is a non-trivial overhead for 
> the name node in a large cluster, since verifyAndCreateRemoteLogDir calls 
> getFileStatus. Once the remote log directory has been created successfully, 
> it is not necessary to call it again. It would be better to call 
> verifyAndCreateRemoteLogDir at LogAggregationService service initialization.






[jira] [Updated] (YARN-3001) RM dies because of divide by zero

2017-04-08 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3001:

Attachment: YARN-3001.barnch-2.7.patch

> RM dies because of divide by zero
> -
>
> Key: YARN-3001
> URL: https://issues.apache.org/jira/browse/YARN-3001
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.5.1
>Reporter: hoelog
>Assignee: Rohith Sharma K S
> Attachments: YARN-3001.barnch-2.7.patch
>
>
> RM dies because of divide by zero exception.
> {code}
> 2014-12-31 21:27:05,022 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.ArithmeticException: / by zero
> at 
> org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator.computeAvailableContainers(DefaultResourceCalculator.java:37)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1332)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1218)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:877)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:656)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:570)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:851)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:900)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:599)
> at java.lang.Thread.run(Thread.java:745)
> 2014-12-31 21:27:05,023 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}






[jira] [Commented] (YARN-3001) RM dies because of divide by zero

2017-04-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961908#comment-15961908
 ] 

zhihai xu commented on YARN-3001:
-

We also see this issue on CDH 5.7.2, which is based on the Hadoop 2.6 release 
plus patches from the Hadoop 2.7 release. I studied most of the code paths and 
found two potential corner cases which may cause this issue:
1. The maximum allocation can change when nodes are added or removed or when 
the total resource on a node changes. If the maximum allocation transiently 
becomes 0, this issue may happen, since the following code in 
CapacityScheduler.allocate changes the ResourceRequests in the ask to 0 when 
getMaximumResourceCapability returns 0.
{code}
SchedulerUtils.normalizeRequests(
    ask, getResourceCalculator(), getClusterResource(),
    getMinimumResourceCapability(), getMaximumResourceCapability());
{code}
2. The capability from a resource request is returned without cloning in 
LeafQueue.assignContainer, AppSchedulingInfo.cloneResourceRequest, and 
AppSchedulingInfo.getResource, so the capability in the returned resource 
request can potentially be modified from outside.
I implemented a patch which fixes the first potential corner case, based on 
branch-2.7. We have had this patch deployed for more than one month, and so far 
we have not seen this issue happen with the attached patch.
The stack trace for the exception is:
{code}
2017-02-09 15:36:43,062 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
handling event type NODE_UPDATE to the scheduler
java.lang.ArithmeticException: / by zero
at 
org.apache.hadoop.yarn.util.resource.DominantResourceCalculator.computeAvailableContainers(DominantResourceCalculator.java:115)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1536)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1392)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1271)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersInternal(LeafQueue.java:830)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:734)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:586)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:447)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:586)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:447)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1027)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1069)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:114)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:691)
at java.lang.Thread.run(Thread.java:745)
{code}
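
To make the first corner case concrete, here is a minimal, self-contained 
sketch of the kind of guard it suggests (plain Java with simplified names; this 
is an illustration only, not the attached YARN-3001.barnch-2.7.patch):
{code}
// Illustration only: simplified stand-in for handling a transiently-zero
// maximum allocation.
public class MaxAllocationGuard {

  private int lastNonZeroMaxMemoryMb = 8192; // hypothetical default

  /** Returns a maximum allocation that is never zero. */
  public int safeMaximumMemoryMb(int currentMaxMemoryMb) {
    if (currentMaxMemoryMb <= 0) {
      // A node add/remove left the computed maximum at 0 for a moment; fall
      // back to the last known good value instead of clamping the ask to 0.
      return lastNonZeroMaxMemoryMb;
    }
    lastNonZeroMaxMemoryMb = currentMaxMemoryMb;
    return currentMaxMemoryMb;
  }

  /** The failing division: containers = available / required. */
  public static int computeAvailableContainers(int availableMb, int requiredMb) {
    // The stack trace above shows this division has no zero check in the real
    // calculator, which is where the ArithmeticException comes from.
    return requiredMb == 0 ? 0 : availableMb / requiredMb;
  }
}
{code}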

> RM dies because of divide by zero
> -
>
> Key: YARN-3001
> URL: https://issues.apache.org/jira/browse/YARN-3001
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.5.1
>Reporter: hoelog
>Assignee: Rohith Sharma K S
>
> RM dies because of divide by zero exception.
> {code}
> 2014-12-31 21:27:05,022 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.ArithmeticException: / by zero
> at 
> org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator.computeAvailableContainers(DefaultResourceCalculator.java:37)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1332)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1218)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:877)
> at 
> 

[jira] [Commented] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node

2017-04-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961891#comment-15961891
 ] 

zhihai xu commented on YARN-6396:
-

[~xgong], could you help review the patch? Thanks.

> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node
> ---
>
> Key: YARN-6396
> URL: https://issues.apache.org/jira/browse/YARN-6396
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.0.0-alpha2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-6396.000.patch
>
>
> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node.
> Currently, verifyAndCreateRemoteLogDir is called for every application on 
> each node before log aggregation starts. This is a non-trivial overhead for 
> the name node in a large cluster, since verifyAndCreateRemoteLogDir calls 
> getFileStatus. Once the remote log directory has been created successfully, 
> it is not necessary to call it again. It would be better to call 
> verifyAndCreateRemoteLogDir at LogAggregationService service initialization.






[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.

2017-04-06 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960233#comment-15960233
 ] 

zhihai xu commented on YARN-4095:
-

[~Feng Yuan], I think that for ShuffleHandler we always want access to all 
local directories, including the full ones, since output data from mappers may 
be on the full local directories. Otherwise the shuffle may fail because the 
data or index file cannot be found in the good local directories.
{code}
  // ShuffleHandler: look up the map output index and data files across all
  // configured local directories.
  Path indexFileName = lDirAlloc.getLocalPathToRead(
      attemptBase + "/" + INDEX_FILE_NAME, conf);
  Path mapOutputFileName = lDirAlloc.getLocalPathToRead(
      attemptBase + "/" + DATA_FILE_NAME, conf);

  // LocalDirAllocator: getLocalPathToRead searches every configured local
  // directory and fails only if the file exists in none of them.
  public Path getLocalPathToRead(String pathStr,
      Configuration conf) throws IOException {
    Context ctx = confChanged(conf);
    int numDirs = ctx.localDirs.length;
    int numDirsSearched = 0;
    // remove the leading slash from the path (to make sure that the uri
    // resolution results in a valid path on the dir being checked)
    if (pathStr.startsWith("/")) {
      pathStr = pathStr.substring(1);
    }
    while (numDirsSearched < numDirs) {
      Path file = new Path(ctx.localDirs[numDirsSearched], pathStr);
      if (ctx.localFS.exists(file)) {
        return file;
      }
      numDirsSearched++;
    }

    // no path found
    throw new DiskErrorException("Could not find " + pathStr + " in any of" +
        " the configured local directories");
  }
{code}
I think this may also be the reason why we did not want to share the same 
configuration between ShuffleHandler and LocalDirsHandlerService.

> Avoid sharing AllocatorPerContext object in LocalDirAllocator between 
> ShuffleHandler and LocalDirsHandlerService.
> -
>
> Key: YARN-4095
> URL: https://issues.apache.org/jira/browse/YARN-4095
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: YARN-4095.000.patch, YARN-4095.001.patch
>
>
> Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share the 
> {{AllocatorPerContext}} object in {{LocalDirAllocator}} for the configuration 
> {{NM_LOCAL_DIRS}}, because {{AllocatorPerContext}} objects are stored in a 
> static TreeMap keyed by the configuration name:
> {code}
>   private static Map<String, AllocatorPerContext> contexts =
>       new TreeMap<String, AllocatorPerContext>();
> {code}
> {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a 
> {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even though they don't use the 
> same {{Configuration}} object, they will use the same {{AllocatorPerContext}} 
> object. Also, {{LocalDirsHandlerService}} may change the {{NM_LOCAL_DIRS}} 
> value in its {{Configuration}} object to exclude full and bad local dirs, 
> while {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in 
> its {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} 
> is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, the 
> {{AllocatorPerContext}} needs to be reinitialized because the {{NM_LOCAL_DIRS}} 
> value has changed. This causes some overhead.
> {code}
>   String newLocalDirs = conf.get(contextCfgItemName);
>   if (!newLocalDirs.equals(savedLocalDirs)) {
> {code}
> So it would be a good improvement to not share the same 
> {{AllocatorPerContext}} instance between {{ShuffleHandler}} and 
> {{LocalDirsHandlerService}}.
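
To make the sharing described above concrete, here is a stripped-down, 
self-contained model (plain Java; it is not the real {{LocalDirAllocator}}, 
just an illustration of a context map keyed by configuration name):
{code}
import java.util.Map;
import java.util.TreeMap;

/** Simplified stand-in for LocalDirAllocator's static context map. */
public class SharedContextDemo {
  private static final Map<String, String> CONTEXT_BY_CONF_NAME = new TreeMap<>();

  /** Re-initializes the shared context whenever the value changes. */
  static boolean confChanged(String confName, String localDirs) {
    String saved = CONTEXT_BY_CONF_NAME.get(confName);
    if (localDirs.equals(saved)) {
      return false;                                 // nothing to do
    }
    CONTEXT_BY_CONF_NAME.put(confName, localDirs);  // expensive re-init in the real class
    return true;
  }

  public static void main(String[] args) {
    String key = "yarn.nodemanager.local-dirs";
    // LocalDirsHandlerService excludes a full disk, ShuffleHandler still uses
    // the original list: every alternating call re-initializes the context.
    System.out.println(confChanged(key, "/data/1,/data/2"));  // true
    System.out.println(confChanged(key, "/data/1"));          // true
    System.out.println(confChanged(key, "/data/1,/data/2"));  // true
  }
}
{code}
Every alternating caller sees a changed value for the same key and pays the 
re-initialization cost, which is the overhead the description refers to.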






[jira] [Commented] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node

2017-03-26 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942651#comment-15942651
 ] 

zhihai xu commented on YARN-6396:
-

I attached a patch which calls verifyAndCreateRemoteLogDir in serviceStart and 
only calls verifyAndCreateRemoteLogDir in initApp when the remote log directory 
failed to be created.
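
For illustration, a minimal sketch of that approach (simplified class and 
method names; this is not the actual YARN-6396.000.patch):
{code}
// Sketch only: a simplified stand-in for LogAggregationService.
public class LogAggregationServiceSketch {

  private volatile boolean remoteLogDirCreated = false;

  protected void serviceStart() {
    // One name-node round trip per NM start instead of one per application.
    remoteLogDirCreated = tryVerifyAndCreateRemoteLogDir();
  }

  void initApp(String appId) {
    if (!remoteLogDirCreated) {
      // Retry only when the attempt at service start failed, so a transient
      // failure does not disable log aggregation for the lifetime of the NM.
      remoteLogDirCreated = tryVerifyAndCreateRemoteLogDir();
    }
    // ... continue with the per-application log aggregation setup ...
  }

  private boolean tryVerifyAndCreateRemoteLogDir() {
    // The real code checks the remote log root with getFileStatus and creates
    // it (with the configured permissions) when it is missing, returning
    // false on failure so initApp can retry later.
    return true; // placeholder
  }
}
{code}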

> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node
> ---
>
> Key: YARN-6396
> URL: https://issues.apache.org/jira/browse/YARN-6396
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.0.0-alpha2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-6396.000.patch
>
>
> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node.
> Currently, verifyAndCreateRemoteLogDir is called for every application on 
> each node before log aggregation starts. This is a non-trivial overhead for 
> the name node in a large cluster, since verifyAndCreateRemoteLogDir calls 
> getFileStatus. Once the remote log directory has been created successfully, 
> it is not necessary to call it again. It would be better to call 
> verifyAndCreateRemoteLogDir at LogAggregationService service initialization.






[jira] [Updated] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node

2017-03-26 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-6396:

Attachment: YARN-6396.000.patch

> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node
> ---
>
> Key: YARN-6396
> URL: https://issues.apache.org/jira/browse/YARN-6396
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.0.0-alpha2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-6396.000.patch
>
>
> Call verifyAndCreateRemoteLogDir at service initialization instead of 
> application initialization to decrease load for name node.
> Currently, verifyAndCreateRemoteLogDir is called for every application on 
> each node before log aggregation starts. This is a non-trivial overhead for 
> the name node in a large cluster, since verifyAndCreateRemoteLogDir calls 
> getFileStatus. Once the remote log directory has been created successfully, 
> it is not necessary to call it again. It would be better to call 
> verifyAndCreateRemoteLogDir at LogAggregationService service initialization.






[jira] [Created] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node

2017-03-26 Thread zhihai xu (JIRA)
zhihai xu created YARN-6396:
---

 Summary: Call verifyAndCreateRemoteLogDir at service 
initialization instead of application initialization to decrease load for name 
node
 Key: YARN-6396
 URL: https://issues.apache.org/jira/browse/YARN-6396
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: log-aggregation
Affects Versions: 3.0.0-alpha2
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


Call verifyAndCreateRemoteLogDir at service initialization instead of 
application initialization to decrease load for name node.
Currently, verifyAndCreateRemoteLogDir is called for every application on each 
node before log aggregation starts. This is a non-trivial overhead for the name 
node in a large cluster, since verifyAndCreateRemoteLogDir calls getFileStatus. 
Once the remote log directory has been created successfully, it is not 
necessary to call it again. It would be better to call 
verifyAndCreateRemoteLogDir at LogAggregationService service initialization.






[jira] [Comment Edited] (YARN-6392) add submit time to Application Summary log

2017-03-26 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942636#comment-15942636
 ] 

zhihai xu edited comment on YARN-6392 at 3/27/17 4:45 AM:
--

The test failures are not related to my change.


was (Author: zxu):
The test failures are related to my change.

> add submit time to Application Summary log
> --
>
> Key: YARN-6392
> URL: https://issues.apache.org/jira/browse/YARN-6392
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-6392.000.patch
>
>
> Add the submit time to the Application Summary log. The application submit 
> time is passed to the Application Master in the env variable 
> "APP_SUBMIT_TIME_ENV". It is a very important parameter, so it will be useful 
> to log it in the Application Summary.






[jira] [Commented] (YARN-6392) add submit time to Application Summary log

2017-03-26 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942636#comment-15942636
 ] 

zhihai xu commented on YARN-6392:
-

The test failures are related to my change.

> add submit time to Application Summary log
> --
>
> Key: YARN-6392
> URL: https://issues.apache.org/jira/browse/YARN-6392
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-6392.000.patch
>
>
> Add the submit time to the Application Summary log. The application submit 
> time is passed to the Application Master in the env variable 
> "APP_SUBMIT_TIME_ENV". It is a very important parameter, so it will be useful 
> to log it in the Application Summary.






[jira] [Commented] (YARN-6392) add submit time to Application Summary log

2017-03-26 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942589#comment-15942589
 ] 

zhihai xu commented on YARN-6392:
-

I attached a patch, YARN-6392.000.patch, which logs submitTime in the 
Application Summary.
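
For illustration, a minimal sketch of what the change amounts to (plain Java 
with assumed field names; the key names here are illustrative, not the exact 
ones used by the RM's application summary logging):
{code}
// Sketch only: append the submit time as one more key=value pair in the
// application summary line.
public class AppSummarySketch {

  static String buildSummary(String appId, String user, String queue,
                             long submitTime, long startTime, long finishTime) {
    StringBuilder sb = new StringBuilder();
    sb.append("appId=").append(appId)
      .append(",user=").append(user)
      .append(",queue=").append(queue)
      .append(",submitTime=").append(submitTime)   // the new field
      .append(",startTime=").append(startTime)
      .append(",finishTime=").append(finishTime);
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(buildSummary("application_1490000000000_0001",
        "zhihai", "default", 1490000000000L, 1490000001000L, 1490000100000L));
  }
}
{code}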

> add submit time to Application Summary log
> --
>
> Key: YARN-6392
> URL: https://issues.apache.org/jira/browse/YARN-6392
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-6392.000.patch
>
>
> Add the submit time to the Application Summary log. The application submit 
> time is passed to the Application Master in the env variable 
> "APP_SUBMIT_TIME_ENV". It is a very important parameter, so it will be useful 
> to log it in the Application Summary.






[jira] [Updated] (YARN-6392) add submit time to Application Summary log

2017-03-26 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-6392:

Attachment: YARN-6392.000.patch

> add submit time to Application Summary log
> --
>
> Key: YARN-6392
> URL: https://issues.apache.org/jira/browse/YARN-6392
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-6392.000.patch
>
>
> Add the submit time to the Application Summary log. The application submit 
> time is passed to the Application Master in the env variable 
> "APP_SUBMIT_TIME_ENV". It is a very important parameter, so it will be useful 
> to log it in the Application Summary.






[jira] [Created] (YARN-6392) add submit time to Application Summary log

2017-03-26 Thread zhihai xu (JIRA)
zhihai xu created YARN-6392:
---

 Summary: add submit time to Application Summary log
 Key: YARN-6392
 URL: https://issues.apache.org/jira/browse/YARN-6392
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 3.0.0-alpha2
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


Add the submit time to the Application Summary log. The application submit time 
is passed to the Application Master in the env variable "APP_SUBMIT_TIME_ENV". 
It is a very important parameter, so it will be useful to log it in the 
Application Summary.






[jira] [Commented] (YARN-5288) Resource Localization fails due to leftover files

2016-06-22 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15344910#comment-15344910
 ] 

zhihai xu commented on YARN-5288:
-

Thanks for reporting this issue, [~yufeigu]! Can YARN-3727 fix your issue? 
YARN-3727 deletes the leftover files and moves on to the next directory if 
leftover files are there.
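
For context, a hedged sketch of that behaviour (plain java.nio code with 
simplified names; it is not the actual YARN-3727 change): remove a leftover 
destination before the final rename so the rename cannot fail on a non-empty 
directory.
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch only, using java.nio instead of Hadoop's FileContext/FSDownload.
public class RenameWithCleanup {

  static void moveIntoCache(Path downloaded, Path destination) throws IOException {
    if (Files.exists(destination)) {
      // A leftover directory from a previous NM run would make the rename
      // fail with "Rename cannot overwrite non empty destination directory",
      // so remove it first (an alternative is to fall back to another
      // configured local directory).
      deleteRecursively(destination);
    }
    Files.move(downloaded, destination, StandardCopyOption.ATOMIC_MOVE);
  }

  static void deleteRecursively(Path p) throws IOException {
    if (Files.isDirectory(p)) {
      try (Stream<Path> children = Files.list(p)) {
        List<Path> entries = children.collect(Collectors.toList());
        for (Path child : entries) {
          deleteRecursively(child);
        }
      }
    }
    Files.delete(p);
  }
}
{code}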

> Resource Localization fails due to leftover files
> -
>
> Key: YARN-5288
> URL: https://issues.apache.org/jira/browse/YARN-5288
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Yufei Gu
>Assignee: Yufei Gu
>
> An NM restart didn't clean up all of the user cache. The leftover files can 
> cause resource localization failures.
> {code}
> 2016-06-14 23:09:12,717 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  
> java.io.IOException: Rename cannot overwrite non empty destination directory 
> /data/5/yarn/nm/usercache/xxx/filecache/4567
> at 
> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
> at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:236)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:912)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (YARN-4979) FSAppAttempt demand calculation considers demands at multiple locality levels different

2016-05-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297200#comment-15297200
 ] 

zhihai xu commented on YARN-4979:
-

Thanks, [~kasha], for reviewing and committing the patch!

> FSAppAttempt demand calculation considers demands at multiple locality levels 
> different
> ---
>
> Key: YARN-4979
> URL: https://issues.apache.org/jira/browse/YARN-4979
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.0, 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.9.0
>
> Attachments: YARN-4979.001.patch
>
>
> FSAppAttempt adds duplicate ResourceRequests to the demand in updateDemand. 
> We should only count the ResourceRequest for ResourceRequest.ANY when 
> calculating demand, because {{hasContainerForNode}} returns false if there is 
> no container request for ResourceRequest.ANY, and both {{allocateNodeLocal}} 
> and {{allocateRackLocal}} also decrease the number of containers for 
> ResourceRequest.ANY.
> This issue may cause the current memory demand to overflow (integer), because 
> duplicate requests can exist on multiple nodes.






[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock

2016-04-21 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252087#comment-15252087
 ] 

zhihai xu commented on YARN-1458:
-

OK, no problem, you can try it at your convenience. Thanks for finding this 
issue!

> FairScheduler: Zero weight can lead to livelock
> ---
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>Reporter: qingwu.fu
>Assignee: zhihai xu
> Fix For: 2.6.0
>
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
> YARN-1458.addendum.patch, YARN-1458.alternative0.patch, 
> YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, 
> yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
> clients submitted lots of jobs; it is not easy to reproduce. We ran the test 
> cluster for days to reproduce it. The output of the jstack command on the 
> ResourceManager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x43aa9000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
> - waiting to lock <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
> runnable [0x433a2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
> at java.lang.Thread.run(Thread.java:744)
> {code}





[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock

2016-04-21 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251356#comment-15251356
 ] 

zhihai xu commented on YARN-1458:
-

I think FSAppAttempt may add duplicate ResourceRequests to the demand, which 
may cause the current memory demand to overflow (integer). I created YARN-4979 
to fix the wrong demand calculation in FSAppAttempt; the root cause may be 
YARN-4979.

> FairScheduler: Zero weight can lead to livelock
> ---
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>Reporter: qingwu.fu
>Assignee: zhihai xu
> Fix For: 2.6.0
>
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
> YARN-1458.addendum.patch, YARN-1458.alternative0.patch, 
> YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, 
> yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
> clients submitted lots of jobs; it is not easy to reproduce. We ran the test 
> cluster for days to reproduce it. The output of the jstack command on the 
> ResourceManager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x43aa9000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
> - waiting to lock <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
> runnable [0x433a2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
> at java.lang.Thread.run(Thread.java:744)
> {code}





[jira] [Updated] (YARN-4979) FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.

2016-04-21 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4979:

Attachment: YARN-4979.001.patch

> FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.
> --
>
> Key: YARN-4979
> URL: https://issues.apache.org/jira/browse/YARN-4979
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.0, 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4979.001.patch
>
>
> FSAppAttempt adds duplicate ResourceRequests to the demand in updateDemand. 
> We should only count the ResourceRequest for ResourceRequest.ANY when 
> calculating demand, because {{hasContainerForNode}} returns false if there is 
> no container request for ResourceRequest.ANY, and both {{allocateNodeLocal}} 
> and {{allocateRackLocal}} also decrease the number of containers for 
> ResourceRequest.ANY.
> This issue may cause the current memory demand to overflow (integer), because 
> duplicate requests can exist on multiple nodes.





[jira] [Created] (YARN-4979) FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.

2016-04-21 Thread zhihai xu (JIRA)
zhihai xu created YARN-4979:
---

 Summary: FSAppAttempt adds duplicate ResourceRequest to demand in 
updateDemand.
 Key: YARN-4979
 URL: https://issues.apache.org/jira/browse/YARN-4979
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.2, 2.8.0
Reporter: zhihai xu
Assignee: zhihai xu


FSAppAttempt adds duplicate ResourceRequests to the demand in updateDemand. We 
should only count the ResourceRequest for ResourceRequest.ANY when calculating 
demand, because {{hasContainerForNode}} returns false if there is no container 
request for ResourceRequest.ANY, and both {{allocateNodeLocal}} and 
{{allocateRackLocal}} also decrease the number of containers for 
ResourceRequest.ANY.
This issue may cause the current memory demand to overflow (integer), because 
duplicate requests can exist on multiple nodes.
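
A minimal, self-contained sketch of the intended demand calculation (simplified 
types, not the actual FSAppAttempt code): only the ResourceRequest.ANY entries 
are counted, since node-local and rack-local entries describe the same 
containers.
{code}
import java.util.List;

// Sketch only: simplified resource request, not the YARN ResourceRequest class.
class Req {
  final String resourceName;   // "*" (ANY), a rack, or a host
  final int numContainers;
  final long memoryMb;
  Req(String name, int num, long mem) {
    this.resourceName = name; this.numContainers = num; this.memoryMb = mem;
  }
}

public class DemandSketch {
  static final String ANY = "*";

  /** Only ANY requests contribute to demand; host/rack entries duplicate them. */
  static long computeDemandMb(List<Req> requests) {
    long demand = 0;
    for (Req r : requests) {
      if (ANY.equals(r.resourceName)) {
        demand += (long) r.numContainers * r.memoryMb;
      }
    }
    return demand;
  }
}
{code}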





[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock

2016-04-20 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251016#comment-15251016
 ] 

zhihai xu commented on YARN-1458:
-

Hi [~dwatzke], thanks for reporting this issue. I double-checked the code and 
found one corner case which can cause this issue; hopefully it is the only case 
that isn't handled.
The corner case is when the current memory demand for the app overflows 
(integer). If that happens, the weight becomes 
[NaN|https://docs.oracle.com/javase/7/docs/api/java/lang/Math.html#log1p(double)]
 because the current memory demand is a negative value:
{code}
weight = Math.log1p(app.getDemand().getMemory()) / Math.log(2);
{code}
{{getFairShareIfFixed}} treats a NaN weight the same as a positive weight, and 
{{computeShare}} always returns 0 if the weight is NaN, because {{share}} is 
NaN and {{(int)NaN}} is 0.
I attached an addendum patch, YARN-1458.addendum.patch. Could you verify 
whether this patch fixes your issue? Thanks.
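
The arithmetic is easy to check in isolation. The snippet below (plain Java, 
illustration only) shows the NaN weight and the resulting zero share, plus one 
possible defensive clamp (an assumption, not necessarily what the addendum 
patch does):
{code}
public class NaNWeightDemo {
  public static void main(String[] args) {
    int overflowedDemandMb = Integer.MIN_VALUE;          // overflowed (negative) demand
    double weight = Math.log1p(overflowedDemandMb) / Math.log(2);
    System.out.println(weight);                          // NaN
    System.out.println((int) (42 * weight));             // 0 -- share silently becomes zero
  }

  /** Sketch of a guard: never feed a negative demand into the weight formula. */
  static double safeWeight(long demandMb) {
    long clamped = Math.max(0L, demandMb);
    return Math.log1p(clamped) / Math.log(2);
  }
}
{code}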



> FairScheduler: Zero weight can lead to livelock
> ---
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>Reporter: qingwu.fu
>Assignee: zhihai xu
> Fix For: 2.6.0
>
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
> YARN-1458.addendum.patch, YARN-1458.alternative0.patch, 
> YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, 
> yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
> clients submitted lots of jobs; it is not easy to reproduce. We ran the test 
> cluster for days to reproduce it. The output of the jstack command on the 
> ResourceManager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x43aa9000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
> - waiting to lock <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
> runnable [0x433a2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
> at java.lang.Thread.run(Thread.java:744)
> {code}





[jira] [Updated] (YARN-1458) FairScheduler: Zero weight can lead to livelock

2016-04-20 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-1458:

Attachment: YARN-1458.addendum.patch

> FairScheduler: Zero weight can lead to livelock
> ---
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>Reporter: qingwu.fu
>Assignee: zhihai xu
> Fix For: 2.6.0
>
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
> YARN-1458.addendum.patch, YARN-1458.alternative0.patch, 
> YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, 
> yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
> clients submitted lots of jobs; it is not easy to reproduce. We ran the test 
> cluster for days to reproduce it. The output of the jstack command on the 
> ResourceManager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x43aa9000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
> - waiting to lock <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
> runnable [0x433a2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
> at java.lang.Thread.run(Thread.java:744)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2016-04-19 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248637#comment-15248637
 ] 

zhihai xu commented on YARN-2910:
-

Linked YARN-2975 to this issue. It looks like we need both YARN-2910 and 
YARN-2975 to fix this issue completely.

> FSLeafQueue can throw ConcurrentModificationException
> -
>
> Key: YARN-2910
> URL: https://issues.apache.org/jira/browse/YARN-2910
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>  Labels: 2.6.1-candidate
> Fix For: 2.7.0, 2.6.1
>
> Attachments: FSLeafQueue_concurrent_exception.txt, 
> YARN-2910.004.patch, YARN-2910.1.patch, YARN-2910.2.patch, YARN-2910.3.patch, 
> YARN-2910.4.patch, YARN-2910.5.patch, YARN-2910.6.patch, YARN-2910.7.patch, 
> YARN-2910.8.patch, YARN-2910.patch
>
>
> The lists that maintain the runnable and the non-runnable apps are standard 
> ArrayLists, but there is no guarantee that they will only be manipulated by one 
> thread in the system. This can lead to the following exception:
> {noformat}
> 2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
> CONTACTING RM.
> java.util.ConcurrentModificationException: 
> java.util.ConcurrentModificationException
> at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859)
> at java.util.ArrayList$Itr.next(ArrayList.java:831)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516)
> {noformat}
> Full stack trace in the attached file.
> We should guard against that by using a thread-safe alternative such as 
> java.util.concurrent.CopyOnWriteArrayList.
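
A minimal, self-contained sketch (simplified names, not the actual FSLeafQueue code) of what that guard could look like, assuming the list is iterated by one thread while another adds and removes apps:

{code}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class LeafQueueSketch {
  // CopyOnWriteArrayList iterators work on an immutable snapshot, so a concurrent
  // add/remove no longer throws ConcurrentModificationException during iteration.
  private final List<String> runnableApps = new CopyOnWriteArrayList<>();

  void addApp(String appId) {
    runnableApps.add(appId);
  }

  void removeApp(String appId) {
    runnableApps.remove(appId);
  }

  // Stands in for an iteration such as FSLeafQueue#getResourceUsage: it can walk
  // the list safely even while other threads are mutating it.
  int countApps() {
    int count = 0;
    for (String ignored : runnableApps) {
      count++;
    }
    return count;
  }
}
{code}

The trade-off is that every mutation copies the backing array, which is acceptable for a small, read-heavy list like this one.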



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler

2016-03-06 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182647#comment-15182647
 ] 

zhihai xu commented on YARN-4761:
-

I just committed it to trunk, branch-2, branch-2.8, branch-2.7 and branch-2.6. 
Thanks [~sjlee0] for the contribution and thanks [~rohithsharma] for the review!

> NMs reconnecting with changed capabilities can lead to wrong cluster resource 
> calculations on fair scheduler
> 
>
> Key: YARN-4761
> URL: https://issues.apache.org/jira/browse/YARN-4761
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.4
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Fix For: 2.8.0, 2.7.3, 2.6.5
>
> Attachments: YARN-4761.01.patch, YARN-4761.02.patch
>
>
> YARN-3802 uncovered an issue with the scheduler where the resource 
> calculation can be incorrect due to async event handling. It was subsequently 
> fixed by YARN-4344, but it was never fixed for the fair scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler

2016-03-06 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182514#comment-15182514
 ] 

zhihai xu commented on YARN-4761:
-

+1 for the latest patch. The test failures are not related to the patch, and one 
test failure is the same as YARN-4306. Will commit the patch shortly.

> NMs reconnecting with changed capabilities can lead to wrong cluster resource 
> calculations on fair scheduler
> 
>
> Key: YARN-4761
> URL: https://issues.apache.org/jira/browse/YARN-4761
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.4
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-4761.01.patch, YARN-4761.02.patch
>
>
> YARN-3802 uncovered an issue with the scheduler where the resource 
> calculation can be incorrect due to async event handling. It was subsequently 
> fixed by YARN-4344, but it was never fixed for the fair scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler

2016-03-03 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179066#comment-15179066
 ] 

zhihai xu commented on YARN-4761:
-

Good finding [~sjlee0]! The same issue could also happen for the fair scheduler. We 
should decouple RMNode status from the fair scheduler as well.

> NMs reconnecting with changed capabilities can lead to wrong cluster resource 
> calculations on fair scheduler
> 
>
> Key: YARN-4761
> URL: https://issues.apache.org/jira/browse/YARN-4761
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.4
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
>
> YARN-3802 uncovered an issue with the scheduler where the resource 
> calculation can be incorrect due to async event handling. It was subsequently 
> fixed by YARN-4344, but it was never fixed for the fair scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.

2016-02-27 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170481#comment-15170481
 ] 

zhihai xu commented on YARN-4728:
-

Yes, MAPREDUCE-6513 is possible, but YARN-1680 is more likely, because 
blacklisted nodes occur more easily in your environment than MAPREDUCE-6513, 
especially with mapreduce.job.reduce.slowstart.completedmaps=1. To tell whether 
it is MAPREDUCE-6513 or YARN-1680, you need to check the log to see whether the reduce 
task is preempted. If the reduce task is preempted and the map task still can't get 
resources, it is MAPREDUCE-6513/MAPREDUCE-6514. Otherwise, it is YARN-1680. Even 
if YARN-1680, which triggers the preemption, is fixed, MAPREDUCE-6513 can still 
happen.

> MapReduce job doesn't make any progress for a very very long time after one 
> Node become unusable.
> -
>
> Key: YARN-4728
> URL: https://issues.apache.org/jira/browse/YARN-4728
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, nodemanager, resourcemanager
>Affects Versions: 2.6.0
> Environment: hadoop 2.6.0
> yarn
>Reporter: Silnov
>Priority: Critical
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have some nodes running hadoop 2.6.0.
> The cluster's configuration largely remains at the defaults.
> I run some jobs on the cluster (especially jobs processing a lot of data) 
> every day.
> Sometimes my job stays at the same progress for a very, very long time, so I have 
> to kill the job manually and re-submit it to the cluster. This worked before 
> (the re-submitted job ran to the end), but something went wrong today.
> After I re-submitted the same job 3 times, each run deadlocked (the 
> progress doesn't change for a long time, and each run stalls at a different 
> progress value, e.g. 33.01%, 45.8%, 73.21%).
> I checked the Hadoop web UI and found 98 map tasks suspended while all the 
> running reduce tasks had consumed all the available 
> memory. I stopped YARN, added the configuration below to yarn-site.xml, and 
> then restarted YARN.
> yarn.app.mapreduce.am.job.reduce.rampup.limit
> 0.1
> yarn.app.mapreduce.am.job.reduce.preemption.limit
> 1.0
> (wanting YARN to preempt the reduce tasks' resources to run the suspended map 
> tasks)
> After restarting YARN, I submitted the job with the property 
> mapreduce.job.reduce.slowstart.completedmaps=1,
> but the same result happened again (my job stayed at the same progress value for 
> a very, very long time).
> I checked the Hadoop web UI again and found that the suspended map tasks are 
> re-created with the note: "TaskAttempt killed because it ran on 
> unusable node node02:21349".
> Then I checked the resourcemanager's log and found some useful messages below:
> **Deactivating Node node02:21349 as it is now LOST.
> **node02:21349 Node Transitioned from RUNNING to LOST.
> I think this may happen because the network across my cluster is not good, 
> which causes the RM to miss the NM's heartbeat in time.
> But I wonder why the YARN framework can't preempt the running reduce 
> tasks' resources to run the suspended map tasks (this causes the job to stay at the 
> same progress value for a very, very long time).
> Can anyone help?
> Thank you very much!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.

2016-02-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15160305#comment-15160305
 ] 

zhihai xu commented on YARN-4728:
-

Thanks for reporting this issue [~Silnov]! 
It looks like this issue is caused by long timeouts at two levels. This issue 
is similar to YARN-3944, YARN-4414, YARN-3238 and YARN-3554. You may work 
around it by changing the configuration values: 
"ipc.client.connect.max.retries.on.timeouts" (default is 45), 
"ipc.client.connect.timeout" (default is 2ms) and 
"yarn.client.nodemanager-connect.max-wait-ms" (default is 900,000ms).

> MapReduce job doesn't make any progress for a very very long time after one 
> Node become unusable.
> -
>
> Key: YARN-4728
> URL: https://issues.apache.org/jira/browse/YARN-4728
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, nodemanager, resourcemanager
>Affects Versions: 2.6.0
> Environment: hadoop 2.6.0
> yarn
>Reporter: Silnov
>Priority: Critical
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have some nodes running hadoop 2.6.0.
> The cluster's configuration largely remains at the defaults.
> I run some jobs on the cluster (especially jobs processing a lot of data) 
> every day.
> Sometimes my job stays at the same progress for a very, very long time, so I have 
> to kill the job manually and re-submit it to the cluster. This worked before 
> (the re-submitted job ran to the end), but something went wrong today.
> After I re-submitted the same job 3 times, each run deadlocked (the 
> progress doesn't change for a long time, and each run stalls at a different 
> progress value, e.g. 33.01%, 45.8%, 73.21%).
> I checked the Hadoop web UI and found 98 map tasks suspended while all the 
> running reduce tasks had consumed all the available 
> memory. I stopped YARN, added the configuration below to yarn-site.xml, and 
> then restarted YARN.
> yarn.app.mapreduce.am.job.reduce.rampup.limit
> 0.1
> yarn.app.mapreduce.am.job.reduce.preemption.limit
> 1.0
> (wanting YARN to preempt the reduce tasks' resources to run the suspended map 
> tasks)
> After restarting YARN, I submitted the job with the property 
> mapreduce.job.reduce.slowstart.completedmaps=1,
> but the same result happened again (my job stayed at the same progress value for 
> a very, very long time).
> I checked the Hadoop web UI again and found that the suspended map tasks are 
> re-created with the note: "TaskAttempt killed because it ran on 
> unusable node node02:21349".
> Then I checked the resourcemanager's log and found some useful messages below:
> **Deactivating Node node02:21349 as it is now LOST.
> **node02:21349 Node Transitioned from RUNNING to LOST.
> I think this may happen because the network across my cluster is not good, 
> which causes the RM to miss the NM's heartbeat in time.
> But I wonder why the YARN framework can't preempt the running reduce 
> tasks' resources to run the suspended map tasks (this causes the job to stay at the 
> same progress value for a very, very long time).
> Can anyone help?
> Thank you very much!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4502) Fix two AM containers get allocated when AM restart

2016-02-02 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129324#comment-15129324
 ] 

zhihai xu commented on YARN-4502:
-

+1 also. This patch also covers the case where a container receives an 
RMContainerEventType.EXPIRE event in the RMContainerState.ALLOCATED state, which 
was not covered by YARN-3535.
Based on the original suggestion by [~leftnoteasy], it looks like the 
implementation of 
{{AbstractYarnScheduler#getApplicationAttempt(ApplicationAttemptId 
applicationAttemptId)}} is also confusing. It always returns the current 
application attempt, even when the current application attempt doesn't match the 
given {{applicationAttemptId}}.
In contrast, {{RMAppImpl#getRMAppAttempt(ApplicationAttemptId appAttemptId)}} 
always returns the matching {{RMAppAttempt}}.
Should we fix it in a follow-up JIRA?
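
A minimal, self-contained sketch (generic types and hypothetical names, not the actual AbstractYarnScheduler code) of a lookup that returns an attempt only when its id matches, instead of silently returning the current attempt:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class AttemptLookupSketch {
  // Stand-in for the scheduler's per-application record of the current attempt.
  static final class Attempt {
    final String appId;
    final String attemptId;
    Attempt(String appId, String attemptId) {
      this.appId = appId;
      this.attemptId = attemptId;
    }
  }

  private final Map<String, Attempt> currentAttemptByApp = new ConcurrentHashMap<>();

  void setCurrentAttempt(Attempt attempt) {
    currentAttemptByApp.put(attempt.appId, attempt);
  }

  /** Returns the attempt only if the given id still names the current attempt. */
  Attempt getMatchingAttempt(String appId, String attemptId) {
    Attempt current = currentAttemptByApp.get(appId);
    if (current == null || !current.attemptId.equals(attemptId)) {
      return null; // stale or unknown attempt id
    }
    return current;
  }
}
{code}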

> Fix two AM containers get allocated when AM restart
> ---
>
> Key: YARN-4502
> URL: https://issues.apache.org/jira/browse/YARN-4502
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Vinod Kumar Vavilapalli
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-4502-20160114.txt, YARN-4502-20160212.txt
>
>
> Scenario : 
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
>  yarn  org.apache.hadoop.yarn.applications.distributedshell.Client -jar 
> hadoop-yarn-applications-distributedshell-*.jar 
> -attempt_failures_validity_interval 6 -shell_command "sleep 150" 
> -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_02
> INFO impl.TimelineClientImpl: Timeline service address: 
> http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:
> Total number of containers :2
> Container-Id Start Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa
> container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa
> {code}
> * look for new AM pid 
> Here, the 2nd AM was supposed to be started in 
> container_e12_1450825622869_0001_02_01, but the AM was not launched in 
> container_e12_1450825622869_0001_02_01; it was in the ACQUIRED state. 
> On the other hand, container_e12_1450825622869_0001_02_02 got the AM running. 
> Expected behavior: the RM should not start two containers for the AM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4502) gjfbndbfcjenrgccriejuvcnktllcc

2016-02-02 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4502:

Summary: gjfbndbfcjenrgccriejuvcnktllcc  (was: cfjgdgcejkrbvgluuehgnkj)

> gjfbndbfcjenrgccriejuvcnktllcc
> --
>
> Key: YARN-4502
> URL: https://issues.apache.org/jira/browse/YARN-4502
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Vinod Kumar Vavilapalli
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-4502-20160114.txt, YARN-4502-20160212.txt
>
>
> Scenario : 
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
>  yarn  org.apache.hadoop.yarn.applications.distributedshell.Client -jar 
> hadoop-yarn-applications-distributedshell-*.jar 
> -attempt_failures_validity_interval 6 -shell_command "sleep 150" 
> -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_02
> INFO impl.TimelineClientImpl: Timeline service address: 
> http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:
> Total number of containers :2
> Container-Id Start Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa
> container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa
> {code}
> * look for new AM pid 
> Here, the 2nd AM was supposed to be started in 
> container_e12_1450825622869_0001_02_01, but the AM was not launched in 
> container_e12_1450825622869_0001_02_01; it was in the ACQUIRED state. 
> On the other hand, container_e12_1450825622869_0001_02_02 got the AM running. 
> Expected behavior: the RM should not start two containers for the AM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4502) cfjgdgcejkrbvgluuehgnkj

2016-02-02 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4502:

Summary: cfjgdgcejkrbvgluuehgnkj  (was: Fix two AM containers get allocated 
when AM restart)

> cfjgdgcejkrbvgluuehgnkj
> ---
>
> Key: YARN-4502
> URL: https://issues.apache.org/jira/browse/YARN-4502
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Vinod Kumar Vavilapalli
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-4502-20160114.txt, YARN-4502-20160212.txt
>
>
> Scenario : 
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
>  yarn  org.apache.hadoop.yarn.applications.distributedshell.Client -jar 
> hadoop-yarn-applications-distributedshell-*.jar 
> -attempt_failures_validity_interval 6 -shell_command "sleep 150" 
> -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_02
> INFO impl.TimelineClientImpl: Timeline service address: 
> http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:
> Total number of containers :2
> Container-Id Start Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa
> container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa
> {code}
> * look for new AM pid 
> Here, the 2nd AM was supposed to be started in 
> container_e12_1450825622869_0001_02_01, but the AM was not launched in 
> container_e12_1450825622869_0001_02_01; it was in the ACQUIRED state. 
> On the other hand, container_e12_1450825622869_0001_02_02 got the AM running. 
> Expected behavior: the RM should not start two containers for the AM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4502) Fix two AM containers get allocated when AM restart

2016-02-02 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4502:

Summary: Fix two AM containers get allocated when AM restart  (was: 
gjfbndbfcjenrgccriejuvcnktllcc)

> Fix two AM containers get allocated when AM restart
> ---
>
> Key: YARN-4502
> URL: https://issues.apache.org/jira/browse/YARN-4502
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Vinod Kumar Vavilapalli
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-4502-20160114.txt, YARN-4502-20160212.txt
>
>
> Scenario : 
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
>  yarn  org.apache.hadoop.yarn.applications.distributedshell.Client -jar 
> hadoop-yarn-applications-distributedshell-*.jar 
> -attempt_failures_validity_interval 6 -shell_command "sleep 150" 
> -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_02
> INFO impl.TimelineClientImpl: Timeline service address: 
> http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:
> Total number of containers :2
> Container-Id Start Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa
> container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa
> {code}
> * look for new AM pid 
> Here, the 2nd AM was supposed to be started in 
> container_e12_1450825622869_0001_02_01, but the AM was not launched in 
> container_e12_1450825622869_0001_02_01; it was in the ACQUIRED state. 
> On the other hand, container_e12_1450825622869_0001_02_02 got the AM running. 
> Expected behavior: the RM should not start two containers for the AM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4646) AMRMClient crashed when RM transition from active to standby

2016-01-26 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118816#comment-15118816
 ] 

zhihai xu commented on YARN-4646:
-

Is this issue fixed by MAPREDUCE-6439? They have the same stack trace.

> AMRMClient crashed when RM transition from active to standby
> 
>
> Key: YARN-4646
> URL: https://issues.apache.org/jira/browse/YARN-4646
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>
> When the RM transitions to standby, ApplicationMasterService#allocate() is 
> interrupted and the exception is passed to the AM.
> The following is the exception message:
> {quote}
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:266)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:448)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> Caused by: java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
> at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:258)
> ... 11 more
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:107)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
> at com.sun.proxy.$Proxy35.allocate(Unknown Source)
> at 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:274)
> at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:237)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnRuntimeException):
>  java.lang.InterruptedException
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:266)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:448)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at 

[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2016-01-14 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098347#comment-15098347
 ] 

zhihai xu commented on YARN-3446:
-

The test failures for TestClientRMTokens and TestAMAuthorization are not related 
to the patch. Both tests pass in my local build.

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch, 
> YARN-3446.005.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because, for the reducer preemption 
> calculation, the headroom includes blacklisted nodes. This makes jobs 
> hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource the AM gets from the RM includes the 
> blacklisted nodes' available resources).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.
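
A minimal, self-contained sketch (plain longs and hypothetical names, not the actual FairScheduler code) of subtracting blacklisted nodes' free capacity from the headroom reported to the AM:

{code}
import java.util.Map;
import java.util.Set;

final class HeadroomSketch {
  /**
   * @param freeMemoryMbByNode free memory in MB per node id
   * @param blacklistedNodes   node ids this application has blacklisted
   * @return headroom counting only nodes the application may actually run on
   */
  static long headroomExcludingBlacklist(Map<String, Long> freeMemoryMbByNode,
                                         Set<String> blacklistedNodes) {
    long headroom = 0L;
    for (Map.Entry<String, Long> entry : freeMemoryMbByNode.entrySet()) {
      if (!blacklistedNodes.contains(entry.getKey())) {
        headroom += entry.getValue();
      }
    }
    return Math.max(0L, headroom);
  }
}
{code}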



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3446) FairScheduler headroom calculation should exclude nodes in the blacklist

2016-01-14 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098617#comment-15098617
 ] 

zhihai xu commented on YARN-3446:
-

[~kasha], thanks for the review and committing the patch!

> FairScheduler headroom calculation should exclude nodes in the blacklist
> 
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.9.0
>
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch, 
> YARN-3446.005.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because, for the reducer preemption 
> calculation, the headroom includes blacklisted nodes. This makes jobs 
> hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource the AM gets from the RM includes the 
> blacklisted nodes' available resources).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2016-01-13 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097605#comment-15097605
 ] 

zhihai xu commented on YARN-3446:
-

Thanks for the review [~kasha]! That is a good suggestion. I attached a new 
patch YARN-3446.005.patch, which addressed your comments. Please review it.

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because, for the reducer preemption 
> calculation, the headroom includes blacklisted nodes. This makes jobs 
> hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource the AM gets from the RM includes the 
> blacklisted nodes' available resources).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2016-01-13 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: YARN-3446.005.patch

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch, 
> YARN-3446.005.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because, for the reducer preemption 
> calculation, the headroom includes blacklisted nodes. This makes jobs 
> hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource the AM gets from the RM includes the 
> blacklisted nodes' available resources).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3697) FairScheduler: ContinuousSchedulingThread can fail to shutdown

2016-01-05 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082641#comment-15082641
 ] 

zhihai xu commented on YARN-3697:
-

[~djp], yes, I just committed it to branch-2.6. Thanks!

> FairScheduler: ContinuousSchedulingThread can fail to shutdown
> --
>
> Key: YARN-3697
> URL: https://issues.apache.org/jira/browse/YARN-3697
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Fix For: 2.7.2, 2.6.4
>
> Attachments: YARN-3697.000.patch, YARN-3697.001.patch
>
>
> FairScheduler: the ContinuousSchedulingThread sometimes can't be shut down 
> after stop. 
> The reason is that the InterruptedException is swallowed inside 
> continuousSchedulingAttempt:
> {code}
>   try {
> if (node != null && Resources.fitsIn(minimumAllocation,
> node.getAvailableResource())) {
>   attemptScheduling(node);
> }
>   } catch (Throwable ex) {
> LOG.error("Error while attempting scheduling for node " + node +
> ": " + ex.toString(), ex);
>   }
> {code}
> I saw the following exception after stop:
> {code}
> 2015-05-17 23:30:43,065 WARN  [FairSchedulerContinuousScheduling] 
> event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
> thread interrupted
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>   at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>   at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285)
> 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] 
> fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - 
> Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 
> available= used=: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>   at 
> 
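
A general, self-contained sketch (simplified structure, not the actual YARN-3697 patch) of how the scheduling loop can surface an interrupt instead of letting a broad catch swallow it:

{code}
class ContinuousSchedulingSketch implements Runnable {
  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        attemptScheduling();
      } catch (Throwable ex) {
        // If the failure was really an interrupt, restore the flag so the loop exits.
        if (ex instanceof InterruptedException
            || ex.getCause() instanceof InterruptedException) {
          Thread.currentThread().interrupt();
        } else {
          System.err.println("Error while attempting scheduling: " + ex);
        }
      }
    }
  }

  private void attemptScheduling() throws InterruptedException {
    Thread.sleep(5); // placeholder for the per-node scheduling work
  }
}
{code}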

[jira] [Updated] (YARN-3697) FairScheduler: ContinuousSchedulingThread can fail to shutdown

2016-01-05 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3697:

Fix Version/s: 2.6.4

> FairScheduler: ContinuousSchedulingThread can fail to shutdown
> --
>
> Key: YARN-3697
> URL: https://issues.apache.org/jira/browse/YARN-3697
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Fix For: 2.7.2, 2.6.4
>
> Attachments: YARN-3697.000.patch, YARN-3697.001.patch
>
>
> FairScheduler: the ContinuousSchedulingThread sometimes can't be shut down 
> after stop. 
> The reason is that the InterruptedException is swallowed inside 
> continuousSchedulingAttempt:
> {code}
>   try {
> if (node != null && Resources.fitsIn(minimumAllocation,
> node.getAvailableResource())) {
>   attemptScheduling(node);
> }
>   } catch (Throwable ex) {
> LOG.error("Error while attempting scheduling for node " + node +
> ": " + ex.toString(), ex);
>   }
> {code}
> I saw the following exception after stop:
> {code}
> 2015-05-17 23:30:43,065 WARN  [FairSchedulerContinuousScheduling] 
> event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
> thread interrupted
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>   at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>   at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285)
> 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] 
> fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - 
> Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 
> available= used=: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>   at 
> 

[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2016-01-04 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: YARN-3446.004.patch

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because, for the reducer preemption 
> calculation, the headroom includes blacklisted nodes. This makes jobs 
> hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource the AM gets from the RM includes the 
> blacklisted nodes' available resources).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2016-01-04 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082601#comment-15082601
 ] 

zhihai xu commented on YARN-3446:
-

Thanks for the review! Just updated the patch as YARN-3446.004.patch.

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because, for the reducer preemption 
> calculation, the headroom includes blacklisted nodes. This makes jobs 
> hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource the AM gets from the RM includes the 
> blacklisted nodes' available resources).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4440) FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time

2015-12-18 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065066#comment-15065066
 ] 

zhihai xu commented on YARN-4440:
-

Yes, thanks [~leftnoteasy] for committing it to branch-2.8!

> FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time
> -
>
> Key: YARN-4440
> URL: https://issues.apache.org/jira/browse/YARN-4440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Lin Yiqun
>Assignee: Lin Yiqun
> Fix For: 2.8.0
>
> Attachments: YARN-4440.001.patch, YARN-4440.002.patch, 
> YARN-4440.003.patch
>
>
> It seems there is a bug in the {{FSAppAttempt#getAllowedLocalityLevelByTime}} 
> method:
> {code}
> // default level is NODE_LOCAL
> if (! allowedLocalityLevel.containsKey(priority)) {
>   allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
>   return NodeType.NODE_LOCAL;
> }
> {code}
> When this method is invoked for the first time, it doesn't initialize the time in 
> lastScheduledContainer, which leads to executing the following code on the next 
> invocation:
> {code}
> // check waiting time
> long waitTime = currentTimeMs;
> if (lastScheduledContainer.containsKey(priority)) {
>   waitTime -= lastScheduledContainer.get(priority);
> } else {
>   waitTime -= getStartTime();
> }
> {code}
> the waitTime is then computed against the FsApp start time, which is easily larger 
> than the delay time, so the allowed locality degrades, because the FsApp start time 
> is much earlier than currentTimeMs. So we should record the initial time for the 
> priority to avoid comparing against the FsApp start time and degrading 
> allowedLocalityLevel. This problem has a bigger negative impact on small jobs. 
> YARN-4399 also discusses some locality-related problems.
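
A minimal, self-contained sketch (plain maps and hypothetical names, not the actual FSAppAttempt code) of recording an initial time for a priority so the wait is measured from the first scheduling attempt rather than from the application start time:

{code}
import java.util.HashMap;
import java.util.Map;

class LocalityDelaySketch {
  enum NodeType { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

  private final Map<Integer, NodeType> allowedLocalityLevel = new HashMap<>();
  private final Map<Integer, Long> lastScheduledContainer = new HashMap<>();

  NodeType getAllowedLocalityLevelByTime(int priority, long localityDelayMs,
                                         long currentTimeMs) {
    // First call for this priority: also record the current time, so the next
    // call measures the wait from now instead of from the application start time.
    if (!allowedLocalityLevel.containsKey(priority)) {
      allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
      lastScheduledContainer.put(priority, currentTimeMs);
      return NodeType.NODE_LOCAL;
    }
    long waitTime = currentTimeMs - lastScheduledContainer.get(priority);
    if (waitTime > localityDelayMs
        && allowedLocalityLevel.get(priority) == NodeType.NODE_LOCAL) {
      allowedLocalityLevel.put(priority, NodeType.RACK_LOCAL); // degrade one level
      lastScheduledContainer.put(priority, currentTimeMs);
    }
    return allowedLocalityLevel.get(priority);
  }
}
{code}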



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4439) Clarify NMContainerStatus#toString method.

2015-12-15 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058263#comment-15058263
 ] 

zhihai xu commented on YARN-4439:
-

Hi [~jianhe], could you revert the old patch and create a new patch for 
branch-2.7 to fix the compilation error?

> Clarify NMContainerStatus#toString method.
> --
>
> Key: YARN-4439
> URL: https://issues.apache.org/jira/browse/YARN-4439
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Fix For: 2.7.3
>
> Attachments: YARN-4439.1.patch, YARN-4439.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058256#comment-15058256
 ] 

zhihai xu commented on YARN-4458:
-

Thanks [~jlowe]! Yes, it makes sense, and it will make cherry-picking easier.

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.branch-2.7.patch
>
>
> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl. This issue only happens for branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3857:

Affects Version/s: 2.6.2

> Memory leak in ResourceManager with SIMPLE mode
> ---
>
> Key: YARN-3857
> URL: https://issues.apache.org/jira/browse/YARN-3857
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0, 2.6.2
>Reporter: mujunchao
>Assignee: mujunchao
>Priority: Critical
>  Labels: patch
> Fix For: 2.7.2, 2.6.4
>
> Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, 
> YARN-3857-4.patch, hadoop-yarn-server-resourcemanager.patch
>
>
>  We register the ClientTokenMasterKey to avoid the client holding an invalid 
> ClientToken after the RM restarts. In SIMPLE mode, we register the 
> Pair , but we never remove it from the HashMap, because 
> unregister only runs in secure mode, so a memory leak results.
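
A minimal, self-contained sketch (generic types and hypothetical names, not the actual ResourceManager code) of making the unregister path remove the entry regardless of the authentication mode, so the map cannot grow without bound in SIMPLE mode:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ClientTokenRegistrySketch {
  private final Map<String, byte[]> masterKeyByAttempt = new ConcurrentHashMap<>();
  private final boolean securityEnabled;

  ClientTokenRegistrySketch(boolean securityEnabled) {
    this.securityEnabled = securityEnabled;
  }

  void registerApplication(String attemptId, byte[] masterKey) {
    // Registration happens in both SIMPLE and secure mode.
    masterKeyByAttempt.put(attemptId, masterKey);
  }

  void unregisterApplication(String attemptId) {
    // Leaky variant: clean up only in secure mode.
    //   if (securityEnabled) { masterKeyByAttempt.remove(attemptId); }
    // Fixed variant: always remove the entry so SIMPLE mode does not leak.
    masterKeyByAttempt.remove(attemptId);
  }
}
{code}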



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4439) Clarify NMContainerStatus#toString method.

2015-12-15 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058460#comment-15058460
 ] 

zhihai xu commented on YARN-4439:
-

Good catch [~jlowe]! Will clean it up. Thanks.

> Clarify NMContainerStatus#toString method.
> --
>
> Key: YARN-4439
> URL: https://issues.apache.org/jira/browse/YARN-4439
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Fix For: 2.7.3
>
> Attachments: YARN-4439.1.patch, YARN-4439.2.patch, 
> YARN-4439.appendum-2.7.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Attachment: YARN-4458.000.patch

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4440) FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time

2015-12-15 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057613#comment-15057613
 ] 

zhihai xu commented on YARN-4440:
-

Committed it to trunk and branch-2. Thanks [~linyiqun] for the contribution!

> FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time
> -
>
> Key: YARN-4440
> URL: https://issues.apache.org/jira/browse/YARN-4440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Lin Yiqun
>Assignee: Lin Yiqun
> Attachments: YARN-4440.001.patch, YARN-4440.002.patch, 
> YARN-4440.003.patch
>
>
> It seems there is a bug in the {{FSAppAttempt#getAllowedLocalityLevelByTime}} 
> method:
> {code}
> // default level is NODE_LOCAL
> if (! allowedLocalityLevel.containsKey(priority)) {
>   allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
>   return NodeType.NODE_LOCAL;
> }
> {code}
> When this method is invoked for the first time, it doesn't initialize the time in 
> lastScheduledContainer, which leads to executing the following code on the next 
> invocation:
> {code}
> // check waiting time
> long waitTime = currentTimeMs;
> if (lastScheduledContainer.containsKey(priority)) {
>   waitTime -= lastScheduledContainer.get(priority);
> } else {
>   waitTime -= getStartTime();
> }
> {code}
> the waitTime is then computed against the FsApp start time, which is easily larger 
> than the delay time, so the allowed locality degrades, because the FsApp start time 
> is much earlier than currentTimeMs. So we should record the initial time for the 
> priority to avoid comparing against the FsApp start time and degrading 
> allowedLocalityLevel. This problem has a bigger negative impact on small jobs. 
> YARN-4399 also discusses some locality-related problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Release Note: Compilation error at branch-2.7 due to getNodeLabelExpression 
not defined in NMContainerStatusPBImpl.  (was: Compilation error at branch-2.7 
due to {{getNodeLabelExpression}} not defined in NMContainerStatusPBImpl.)

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED

2015-12-15 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057608#comment-15057608
 ] 

zhihai xu commented on YARN-3535:
-

You are welcome! I think this will be a very critical fix for the 2.6.4 release.

> Scheduler must re-request container resources when RMContainer transitions 
> from ALLOCATED to KILLED
> ---
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
> Fix For: 2.7.2, 2.6.4
>
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 
> 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, 
> YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM's container start on the NM failed, 
> and the job then hung there.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)
zhihai xu created YARN-4458:
---

 Summary: Compilation error at branch-2.7 due to 
getNodeLabelExpression not defined in NMContainerStatusPBImpl.
 Key: YARN-4458
 URL: https://issues.apache.org/jira/browse/YARN-4458
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Description: Compilation error at branch-2.7 due to getNodeLabelExpression 
not defined in NMContainerStatusPBImpl.

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.branch-2.7.patch
>
>
> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Description: Compilation error at branch-2.7 due to getNodeLabelExpression 
not defined in NMContainerStatusPBImpl. This issue only happens for branch-2.7. 
 (was: Compilation error at branch-2.7 due to getNodeLabelExpression not 
defined in NMContainerStatusPBImpl.)

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.branch-2.7.patch
>
>
> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl. This issue only happens for branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Attachment: YARN-4458.branch-2.7.patch

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.branch-2.7.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Attachment: (was: YARN-4458.000.patch)

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.branch-2.7.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Release Note:   (was: Compilation error at branch-2.7 due to 
getNodeLabelExpression not defined in NMContainerStatusPBImpl.)

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.branch-2.7.patch
>
>
> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl. This issue only happens for branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode

2015-12-14 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3857:

Fix Version/s: 2.6.4

> Memory leak in ResourceManager with SIMPLE mode
> ---
>
> Key: YARN-3857
> URL: https://issues.apache.org/jira/browse/YARN-3857
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: mujunchao
>Assignee: mujunchao
>Priority: Critical
>  Labels: patch
> Fix For: 2.7.2, 2.6.4
>
> Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, 
> YARN-3857-4.patch, hadoop-yarn-server-resourcemanager.patch
>
>
>  We register the ClientTokenMasterKey to avoid the client holding an invalid 
> ClientToken after the RM restarts. In SIMPLE mode, we register the 
> Pair, but we never remove it from the HashMap, because 
> unregister only runs in Security mode, so the entries leak memory. 
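To illustrate the leak pattern described above, here is a minimal toy sketch (the class and method names are made up for illustration, not the actual ResourceManager code): registration always adds an entry to the map, while removal is guarded by a security check that never passes in SIMPLE mode.

{code}
import java.util.HashMap;
import java.util.Map;

class ClientTokenRegistry {
  private final Map<String, byte[]> masterKeys = new HashMap<>();
  private final boolean securityEnabled;

  ClientTokenRegistry(boolean securityEnabled) {
    this.securityEnabled = securityEnabled;
  }

  void registerApplication(String attemptId, byte[] masterKey) {
    masterKeys.put(attemptId, masterKey);   // always added
  }

  void unregisterApplication(String attemptId) {
    if (securityEnabled) {                  // never true in SIMPLE mode,
      masterKeys.remove(attemptId);         // so entries are never removed
    }
  }

  public static void main(String[] args) {
    ClientTokenRegistry registry = new ClientTokenRegistry(false);  // SIMPLE mode
    for (int i = 0; i < 1000; i++) {
      registry.registerApplication("appattempt_" + i, new byte[64]);
      registry.unregisterApplication("appattempt_" + i);
    }
    System.out.println(registry.masterKeys.size());   // 1000 leaked entries
  }
}
{code}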



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode

2015-12-14 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057421#comment-15057421
 ] 

zhihai xu commented on YARN-3857:
-

Yes, this issue exists in 2.6.x. I just committed this patch to branch-2.6.

> Memory leak in ResourceManager with SIMPLE mode
> ---
>
> Key: YARN-3857
> URL: https://issues.apache.org/jira/browse/YARN-3857
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: mujunchao
>Assignee: mujunchao
>Priority: Critical
>  Labels: patch
> Fix For: 2.7.2, 2.6.4
>
> Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, 
> YARN-3857-4.patch, hadoop-yarn-server-resourcemanager.patch
>
>
>  We register the ClientTokenMasterKey to avoid the client holding an invalid 
> ClientToken after the RM restarts. In SIMPLE mode, we register the 
> Pair, but we never remove it from the HashMap, because 
> unregister only runs in Security mode, so the entries leak memory. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED

2015-12-14 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3535:

Fix Version/s: 2.6.4

> Scheduler must re-request container resources when RMContainer transitions 
> from ALLOCATED to KILLED
> ---
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
> Fix For: 2.7.2, 2.6.4
>
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 
> 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, 
> YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM's container start on the NM failed, 
> and the job then hung there.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED

2015-12-14 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057536#comment-15057536
 ] 

zhihai xu commented on YARN-3535:
-

Yes, this issue exists in 2.6.x. I just committed this patch to branch-2.6.

> Scheduler must re-request container resources when RMContainer transitions 
> from ALLOCATED to KILLED
> ---
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
> Fix For: 2.7.2, 2.6.4
>
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 
> 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, 
> YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM's container start on the NM failed, 
> and the job then hung there.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-12-14 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056745#comment-15056745
 ] 

zhihai xu commented on YARN-4209:
-

This issue won't affect the 2.6.x branch, since the RMStateStoreState.FENCED state 
was only added in the 2.7.x branch.

> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: YARN-4209.000.patch, YARN-4209.001.patch, 
> YARN-4209.002.patch, YARN-4209.branch-2.7.patch
>
>
> RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
> {{stateMachine.doTransition}}. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even after {{notifyStoreOperationFailed}} is called. The only working 
> case for FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.
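The nesting can be reproduced with a toy state holder (illustrative code only, not the real RMStateStore or the Hadoop StateMachine API): the outer transition decides its post-state before the nested transition runs, so it overwrites the FENCED state the inner call just set.

{code}
enum StoreState { ACTIVE, FENCED }

class ToyStateStore {
  private StoreState state = StoreState.ACTIVE;

  // models doTransition: the post-state is fixed before the transition body runs
  void doTransition(StoreState postState, Runnable transitionBody) {
    transitionBody.run();   // may recurse into doTransition (the nested call)
    state = postState;      // the outer call overwrites whatever the inner call set
  }

  void removeToken() {
    doTransition(StoreState.ACTIVE, () -> {
      // the store operation fails -> notifyStoreOperationFailed -> updateFencedState,
      // which goes through the state machine again
      doTransition(StoreState.FENCED, () -> { /* persist fencing marker */ });
    });
  }

  public static void main(String[] args) {
    ToyStateStore store = new ToyStateStore();
    store.removeToken();
    System.out.println(store.state);   // prints ACTIVE, although FENCED was set inside
  }
}
{code}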



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4440) FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time

2015-12-14 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056895#comment-15056895
 ] 

zhihai xu commented on YARN-4440:
-

Good catch! Thanks for working on this issue [~linyiqun]!
+1 for the latest patch. The test failures are not related to the patch; they 
were already reported in YARN-4318 and YARN-4306.
Will commit it tomorrow if no one objects.


> FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time
> -
>
> Key: YARN-4440
> URL: https://issues.apache.org/jira/browse/YARN-4440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Lin Yiqun
>Assignee: Lin Yiqun
> Attachments: YARN-4440.001.patch, YARN-4440.002.patch, 
> YARN-4440.003.patch
>
>
> It seems there is a bug on {{FSAppAttempt#getAllowedLocalityLevelByTime}} 
> method
> {code}
> // default level is NODE_LOCAL
> if (! allowedLocalityLevel.containsKey(priority)) {
>   allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
>   return NodeType.NODE_LOCAL;
> }
> {code}
> On the first invocation of this method, the time for the priority is not 
> initialized in lastScheduledContainer, which causes the following code to run 
> on the next invocation:
> {code}
> // check waiting time
> long waitTime = currentTimeMs;
> if (lastScheduledContainer.containsKey(priority)) {
>   waitTime -= lastScheduledContainer.get(priority);
> } else {
>   waitTime -= getStartTime();
> }
> {code}
> the waitTime falls back to subtracting the FsApp start time, which is easily 
> larger than the delay time because the FsApp start time is earlier than 
> currentTimeMs, so allowedLocality degrades. We should record the initial time 
> for the priority to avoid comparing against the FsApp start time and 
> degrading allowedLocalityLevel. This problem has a bigger negative impact on 
> small jobs. YARN-4399 also discusses some locality-related problems.
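One possible shape of the fix, sketched from the description above (an assumption about the approach, not a quote of the attached patches): seed lastScheduledContainer with the current time when a priority is first seen, so the next wait-time check measures from that point rather than from the application start time.

{code}
// first time we see this priority: default to NODE_LOCAL and record the time
if (!allowedLocalityLevel.containsKey(priority)) {
  // seed the last-scheduled time so later calls do not fall back to getStartTime()
  lastScheduledContainer.put(priority, currentTimeMs);
  allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
  return NodeType.NODE_LOCAL;
}
{code}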



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4344) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations

2015-11-12 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003181#comment-15003181
 ] 

zhihai xu commented on YARN-4344:
-

+1 for Jason Lowe's suggestion to fix the issue on the scheduler side. Using 
{{SchedulerNode.getTotalResource()}} instead of {{RMNode.getTotalCapability()}} 
inside the Scheduler can better decouple the Scheduler from the RMNodeImpl 
state machine. It may also fix some other potential issues. For example, 
{{CapacityScheduler#addNode}} uses {{nodeManager.getTotalCapability()}} after 
creating the {{FiCaSchedulerNode}}; if {{nodeManager.totalCapability}} is 
changed by the RMNodeImpl state machine right after the {{FiCaSchedulerNode}} 
was created, a similar issue may happen.
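As a rough illustration of the decoupling (the types below are trimmed stand-ins, not the real YARN classes): the scheduler captures the node's total resource in its own node object when the node is added, and later bookkeeping reads that snapshot instead of going back to RMNode, so a concurrent RMNodeImpl transition cannot change the value mid-update.

{code}
interface RMNode {
  int getTotalCapabilityMb();   // may change when the NM reconnects
}

class SchedulerNodeSnapshot {
  private final int totalMb;    // captured once when the node is added

  SchedulerNodeSnapshot(RMNode rmNode) {
    this.totalMb = rmNode.getTotalCapabilityMb();
  }

  int getTotalResourceMb() {
    return totalMb;
  }
}

class ClusterResourceTracker {
  private int clusterMb = 0;

  void addNode(SchedulerNodeSnapshot node) {
    // read the scheduler-side snapshot, not rmNode.getTotalCapability(),
    // so the sum matches what the scheduler actually tracks for this node
    clusterMb += node.getTotalResourceMb();
  }

  int getClusterMb() {
    return clusterMb;
  }
}
{code}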

> NMs reconnecting with changed capabilities can lead to wrong cluster resource 
> calculations
> --
>
> Key: YARN-4344
> URL: https://issues.apache.org/jira/browse/YARN-4344
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Critical
> Attachments: YARN-4344.001.patch
>
>
> After YARN-3802, if an NM re-connects to the RM with changed capabilities, 
> there can arise situations where the overall cluster resource calculation for 
> the cluster will be incorrect leading to inconsistencies in scheduling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4344) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations

2015-11-11 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001800#comment-15001800
 ] 

zhihai xu commented on YARN-4344:
-

Thanks for reporting this issue [~vvasudev]! Thanks for the review [~Jason 
Lowe]! 
[~rohithsharma] tried to clean up the code at YARN-3286. Based on the following 
comment from [~jianhe] at YARN-3286,
{code}
I think this has changed the behavior that without any RM/NM restart features 
enabled, earlier restarting a node will trigger RM to kill all the containers 
on this node, but now it won't ?
{code}
The patch may cause a compatibility issue. Maybe we can merge the case 
{{rmNode.getHttpPort() == newNode.getHttpPort()}} with {{rmNode.getHttpPort() 
!= newNode.getHttpPort()}} for noRunningApps.
Thoughts?

> NMs reconnecting with changed capabilities can lead to wrong cluster resource 
> calculations
> --
>
> Key: YARN-4344
> URL: https://issues.apache.org/jira/browse/YARN-4344
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Critical
> Attachments: YARN-4344.001.patch
>
>
> After YARN-3802, if an NM re-connects to the RM with changed capabilities, 
> there can arise situations where the overall cluster resource calculation for 
> the cluster will be incorrect leading to inconsistencies in scheduling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4256) YARN fair scheduler vcores with decimal values

2015-10-22 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4256:

Hadoop Flags: Reviewed

> YARN fair scheduler vcores with decimal values
> --
>
> Key: YARN-4256
> URL: https://issues.apache.org/jira/browse/YARN-4256
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Prabhu Joseph
>Assignee: Jun Gong
>Priority: Minor
> Fix For: 2.7.2
>
> Attachments: YARN-4256.001.patch, YARN-4256.002.patch
>
>
> When a queue's vcores is given as a decimal value, FairScheduler takes the 
> value after the decimal point as the vcores.
> For the below queue,
> 2 mb,20 vcores,20.25 disks
> 3 mb,40.2 vcores,30.25 disks
> When many applications were submitted in parallel to the queue, all stayed in 
> the PENDING state because the vcores was taken as 2, skipping the value 40.
> The vcores pattern matching in FairSchedulerConfiguration.java has to be 
> improved to either throw 
> AllocationConfigurationException("Missing resource") or consider the value 
> before the decimal point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4256) YARN fair scheduler vcores with decimal values

2015-10-22 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4256:

Target Version/s: 2.8.0  (was: 2.7.2)

> YARN fair scheduler vcores with decimal values
> --
>
> Key: YARN-4256
> URL: https://issues.apache.org/jira/browse/YARN-4256
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Prabhu Joseph
>Assignee: Jun Gong
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-4256.001.patch, YARN-4256.002.patch
>
>
> When a queue's vcores is given as a decimal value, FairScheduler takes the 
> value after the decimal point as the vcores.
> For the below queue,
> 2 mb,20 vcores,20.25 disks
> 3 mb,40.2 vcores,30.25 disks
> When many applications were submitted in parallel to the queue, all stayed in 
> the PENDING state because the vcores was taken as 2, skipping the value 40.
> The vcores pattern matching in FairSchedulerConfiguration.java has to be 
> improved to either throw 
> AllocationConfigurationException("Missing resource") or consider the value 
> before the decimal point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4256) YARN fair scheduler vcores with decimal values

2015-10-22 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969748#comment-14969748
 ] 

zhihai xu commented on YARN-4256:
-

Committed it to trunk and branch-2. Thanks [~Prabhu Joseph] for reporting this 
issue, thanks [~hex108] for the patch, and thanks [~brahmareddy] for the 
additional review!

> YARN fair scheduler vcores with decimal values
> --
>
> Key: YARN-4256
> URL: https://issues.apache.org/jira/browse/YARN-4256
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Prabhu Joseph
>Assignee: Jun Gong
>Priority: Minor
> Fix For: 2.7.2
>
> Attachments: YARN-4256.001.patch, YARN-4256.002.patch
>
>
> When a queue's vcores is given as a decimal value, FairScheduler takes the 
> value after the decimal point as the vcores.
> For the below queue,
> 2 mb,20 vcores,20.25 disks
> 3 mb,40.2 vcores,30.25 disks
> When many applications were submitted in parallel to the queue, all stayed in 
> the PENDING state because the vcores was taken as 2, skipping the value 40.
> The vcores pattern matching in FairSchedulerConfiguration.java has to be 
> improved to either throw 
> AllocationConfigurationException("Missing resource") or consider the value 
> before the decimal point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4256) YARN fair scheduler vcores with decimal values

2015-10-21 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14967355#comment-14967355
 ] 

zhihai xu commented on YARN-4256:
-

+1 LGTM. Will commit tomorrow if no one objects.

> YARN fair scheduler vcores with decimal values
> --
>
> Key: YARN-4256
> URL: https://issues.apache.org/jira/browse/YARN-4256
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Prabhu Joseph
>Assignee: Jun Gong
>Priority: Minor
> Fix For: 2.7.2
>
> Attachments: YARN-4256.001.patch, YARN-4256.002.patch
>
>
> When a queue's vcores is given as a decimal value, FairScheduler takes the 
> value after the decimal point as the vcores.
> For the below queue,
> 2 mb,20 vcores,20.25 disks
> 3 mb,40.2 vcores,30.25 disks
> When many applications were submitted in parallel to the queue, all stayed in 
> the PENDING state because the vcores was taken as 2, skipping the value 40.
> The vcores pattern matching in FairSchedulerConfiguration.java has to be 
> improved to either throw 
> AllocationConfigurationException("Missing resource") or consider the value 
> before the decimal point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4256) YARN fair scheduler vcores with decimal values

2015-10-20 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965488#comment-14965488
 ] 

zhihai xu commented on YARN-4256:
-

Thanks for reporting this issue [~Prabhu Joseph]! Thanks for the patch 
[~hex108]! The patch looks mostly good. Can we change '+' to '*', i.e. 
(\\.\\d+)? => (\\.\\d*)?, so we can relax the condition to support "1024. mb"?
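A small self-contained check of that relaxation (the pattern below is a simplified stand-in for the one in FairSchedulerConfiguration.java, kept only to show the effect of the change): with {{(\\.\\d*)?}} the fractional part may be empty, so "40.2 vcores" and "1024. mb"-style values both match and the integer part is used.

{code}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VcoresPatternCheck {
  // simplified stand-in: integer part, optional "." plus optional digits, unit
  private static final Pattern RESOURCE =
      Pattern.compile("(\\d+)(\\.\\d*)?\\s*(vcores|mb)", Pattern.CASE_INSENSITIVE);

  public static void main(String[] args) {
    for (String value : new String[] {"40.2 vcores", "20 vcores", "1024. mb"}) {
      Matcher m = RESOURCE.matcher(value);
      if (m.find()) {
        // only the part before the decimal point is used
        System.out.println(value + " -> " + Integer.parseInt(m.group(1)) + " " + m.group(3));
      }
    }
  }
}
{code}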


> YARN fair scheduler vcores with decimal values
> --
>
> Key: YARN-4256
> URL: https://issues.apache.org/jira/browse/YARN-4256
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Prabhu Joseph
>Assignee: Jun Gong
>Priority: Minor
> Fix For: 2.7.2
>
> Attachments: YARN-4256.001.patch
>
>
> When a queue's vcores is given as a decimal value, FairScheduler takes the 
> value after the decimal point as the vcores.
> For the below queue,
> 2 mb,20 vcores,20.25 disks
> 3 mb,40.2 vcores,30.25 disks
> When many applications were submitted in parallel to the queue, all stayed in 
> the PENDING state because the vcores was taken as 2, skipping the value 40.
> The vcores pattern matching in FairSchedulerConfiguration.java has to be 
> improved to either throw 
> AllocationConfigurationException("Missing resource") or consider the value 
> before the decimal point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node

2015-10-15 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958402#comment-14958402
 ] 

zhihai xu commented on YARN-4227:
-

Is it possible that the root cause of this issue is YARN-3675? I think YARN-3675 
may cause this issue. If we can get the complete logs for 
container_1436927988321_1307950_01_12, we may be able to confirm it. Once the 
node is removed, all the containers allocated on the node are supposed to be 
killed. The race condition in YARN-3675 may cause a container to be allocated 
on a just-removed node.


> FairScheduler: RM quits processing expired container from a removed node
> 
>
> Key: YARN-4227
> URL: https://issues.apache.org/jira/browse/YARN-4227
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.3.0, 2.5.0, 2.7.1
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, 
> YARN-4227.patch
>
>
> Under some circumstances the node is removed before an expired container 
> event is processed causing the RM to exit:
> {code}
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
> Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1436927988321_1307950_01_12 Container Transitioned from 
> ACQUIRED to EXPIRED
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
> Completed container: container_1436927988321_1307950_01_12 in state: 
> EXPIRED event:EXPIRE
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op   
>OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS  
> APPID=application_1436927988321_1307950 
> CONTAINERID=container_1436927988321_1307950_01_12
> 2015-10-04 21:14:01,063 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type CONTAINER_EXPIRED to the scheduler
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 
> and 2.6.0 by different customers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4201) AMBlacklist does not work for minicluster

2015-10-12 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952738#comment-14952738
 ] 

zhihai xu commented on YARN-4201:
-

Committed it to branch-2 and trunk. Thanks [~hex108] for the contribution!

> AMBlacklist does not work for minicluster
> -
>
> Key: YARN-4201
> URL: https://issues.apache.org/jira/browse/YARN-4201
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Fix For: 2.8.0
>
> Attachments: YARN-4021.001.patch, YARN-4201.002.patch, 
> YARN-4201.003.patch
>
>
> For a minicluster (scheduler.include-port-in-node-name is set to TRUE), 
> AMBlacklist does not work. This is because the RM just puts the host into the 
> AMBlacklist whether scheduler.include-port-in-node-name is set or not. In 
> fact, the RM should put "host + port" into the AMBlacklist when it is set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4201) AMBlacklist does not work for minicluster

2015-10-12 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4201:

Hadoop Flags: Reviewed

> AMBlacklist does not work for minicluster
> -
>
> Key: YARN-4201
> URL: https://issues.apache.org/jira/browse/YARN-4201
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4021.001.patch, YARN-4201.002.patch, 
> YARN-4201.003.patch
>
>
> For a minicluster (scheduler.include-port-in-node-name is set to TRUE), 
> AMBlacklist does not work. This is because the RM just puts the host into the 
> AMBlacklist whether scheduler.include-port-in-node-name is set or not. In 
> fact, the RM should put "host + port" into the AMBlacklist when it is set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events

2015-10-11 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952328#comment-14952328
 ] 

zhihai xu commented on YARN-4247:
-

[~adhoot], thanks for working on this issue. Is this issue fixed by YARN-3361? 
YARN-3361 removed the {{readLock}} from {{RMAppAttemptImpl#getMasterContainer}}.

> Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing 
> events
> -
>
> Key: YARN-4247
> URL: https://issues.apache.org/jira/browse/YARN-4247
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Blocker
> Attachments: YARN-4247.001.patch, YARN-4247.001.patch
>
>
> We see this deadlock in our testing where events do not get processed and we 
> see this in the logs before the RM dies of OOM {noformat} 2015-10-08 
> 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of 
> event-queue is 1488000 2015-10-08 04:48:01,918 INFO 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-10-11 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: YARN-3446.003.patch

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because the headroom used for the 
> reducer preemption calculation includes blacklisted nodes. This makes jobs 
> hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource the AM gets from the RM includes 
> the available resources of the blacklisted nodes).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.
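A toy version of the adjusted calculation (illustrative only, not the actual FSAppAttempt code): the free space sitting on blacklisted nodes is subtracted before the headroom is reported to the AM.

{code}
import java.util.List;

class HeadroomSketch {
  static long headroomMb(long queueFairShareMb, long queueUsageMb,
                         long clusterAvailableMb, List<Long> blacklistedNodeFreeMb) {
    long blacklistedMb = blacklistedNodeFreeMb.stream().mapToLong(Long::longValue).sum();
    long usableAvailableMb = Math.max(0, clusterAvailableMb - blacklistedMb);
    // headroom is bounded by the queue's remaining share and by the space the
    // scheduler can actually use for this app (i.e. outside the blacklist)
    return Math.max(0, Math.min(queueFairShareMb - queueUsageMb, usableAvailableMb));
  }

  public static void main(String[] args) {
    // 8 GB fair share, 2 GB used, 10 GB free cluster-wide, 7 GB of it on
    // blacklisted nodes -> only 3 GB of real headroom for this application
    System.out.println(headroomMb(8192, 2048, 10240, List.of(7168L)));   // 3072
  }
}
{code}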



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-10-11 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: (was: YARN-3446.003.patch)

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because the headroom used for the 
> reducer preemption calculation includes blacklisted nodes. This makes jobs 
> hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource the AM gets from the RM includes 
> the available resources of the blacklisted nodes).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4201) AMBlacklist does not work for minicluster

2015-10-09 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951393#comment-14951393
 ] 

zhihai xu commented on YARN-4201:
-

+1 for the latest patch. I will wait one or two days before committing so that 
others can look at the patch.

> AMBlacklist does not work for minicluster
> -
>
> Key: YARN-4201
> URL: https://issues.apache.org/jira/browse/YARN-4201
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4021.001.patch, YARN-4201.002.patch, 
> YARN-4201.003.patch
>
>
> For a minicluster (scheduler.include-port-in-node-name is set to TRUE), 
> AMBlacklist does not work. This is because the RM just puts the host into the 
> AMBlacklist whether scheduler.include-port-in-node-name is set or not. In 
> fact, the RM should put "host + port" into the AMBlacklist when it is set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4201) AMBlacklist does not work for minicluster

2015-10-09 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949991#comment-14949991
 ] 

zhihai xu commented on YARN-4201:
-

Thanks for the new patch [~hex108]! I think it will be better to check that 
{{scheduler.getSchedulerNode(nodeId)}} is not null, to avoid an NPE.
If {{scheduler.getSchedulerNode(nodeId)}} returns null, it means the blacklisted 
node was just removed from the scheduler, and I think it is fine not to add a 
removed node to the blacklist.
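A minimal sketch of that check (the interfaces below are trimmed stand-ins so the snippet compiles on its own; they are not the real YARN types): resolve the blacklist entry through the scheduler's node so the name respects include-port-in-node-name, and skip nodes that have already been removed.

{code}
interface SchedulerNode {
  String getNodeName();                    // "host" or "host:port" per configuration
}

interface Scheduler {
  SchedulerNode getSchedulerNode(String nodeId);
}

interface AMBlacklist {
  void addNode(String nodeName);
}

class BlacklistHelper {
  static void blacklistNode(Scheduler scheduler, AMBlacklist amBlacklist, String nodeId) {
    SchedulerNode node = scheduler.getSchedulerNode(nodeId);
    if (node == null) {
      // the node was just removed from the scheduler; nothing left to blacklist
      return;
    }
    amBlacklist.addNode(node.getNodeName());
  }
}
{code}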

> AMBlacklist does not work for minicluster
> -
>
> Key: YARN-4201
> URL: https://issues.apache.org/jira/browse/YARN-4201
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4021.001.patch, YARN-4201.002.patch
>
>
> For a minicluster (scheduler.include-port-in-node-name is set to TRUE), 
> AMBlacklist does not work. This is because the RM just puts the host into the 
> AMBlacklist whether scheduler.include-port-in-node-name is set or not. In 
> fact, the RM should put "host + port" into the AMBlacklist when it is set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-10-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948870#comment-14948870
 ] 

zhihai xu commented on YARN-3943:
-

The checkstyle issues and release audit warnings for the new patch 
YARN-3943.002.patch were pre-existing.

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch, 
> YARN-3943.002.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.
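The hysteresis this enables can be sketched as follows (a toy check with illustrative threshold names, not the NodeManager's actual configuration keys or disk-checker code): a disk is marked full above the higher threshold and only marked good again once usage drops below the lower one.

{code}
class DiskFullnessCheck {
  private final float fullThresholdPercent;      // e.g. 90.0f, marks the disk full
  private final float notFullThresholdPercent;   // e.g. 85.0f, marks it good again
  private boolean diskFull = false;

  DiskFullnessCheck(float fullThresholdPercent, float notFullThresholdPercent) {
    this.fullThresholdPercent = fullThresholdPercent;
    this.notFullThresholdPercent = notFullThresholdPercent;
  }

  boolean isDiskFull(float usedPercent) {
    if (diskFull) {
      // only flip back to good below the lower threshold, so small fluctuations
      // around a single threshold no longer cause the disk to oscillate
      if (usedPercent < notFullThresholdPercent) {
        diskFull = false;
      }
    } else if (usedPercent > fullThresholdPercent) {
      diskFull = true;
    }
    return diskFull;
  }

  public static void main(String[] args) {
    DiskFullnessCheck check = new DiskFullnessCheck(90.0f, 85.0f);
    System.out.println(check.isDiskFull(91.0f));  // true  (crossed the full threshold)
    System.out.println(check.isDiskFull(88.0f));  // true  (still above the lower threshold)
    System.out.println(check.isDiskFull(84.0f));  // false (dropped below the lower threshold)
  }
}
{code}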



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4201) AMBlacklist does not work for minicluster

2015-10-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948931#comment-14948931
 ] 

zhihai xu commented on YARN-4201:
-

Currently {{getSchedulerNode}} is defined in {{AbstractYarnScheduler}}. 
{{SchedulerAppUtils.isBlacklisted}} uses {{node.getNodeName()}} to check for a 
blacklisted node, so it would be good to use the same way to get the 
blacklisted node name. All the configuration and formatting related to the node 
name would then live only in SchedulerNode.java.

> AMBlacklist does not work for minicluster
> -
>
> Key: YARN-4201
> URL: https://issues.apache.org/jira/browse/YARN-4201
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4021.001.patch
>
>
> For a minicluster (scheduler.include-port-in-node-name is set to TRUE), 
> AMBlacklist does not work. This is because the RM just puts the host into the 
> AMBlacklist whether scheduler.include-port-in-node-name is set or not. In 
> fact, the RM should put "host + port" into the AMBlacklist when it is set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4201) AMBlacklist does not work for minicluster

2015-10-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948914#comment-14948914
 ] 

zhihai xu commented on YARN-4201:
-

Thanks for the patch [~hex108]! It is a good catch.
Should we use {{SchedulerNode#getNodeName}} to get the blacklisted node name?
We can add {{getSchedulerNode}} to {{YarnScheduler}}, so we can call 
{{getSchedulerNode}} to look up the SchedulerNode using the NodeId in 
{{RMAppAttemptImpl}}.


> AMBlacklist does not work for minicluster
> -
>
> Key: YARN-4201
> URL: https://issues.apache.org/jira/browse/YARN-4201
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4021.001.patch
>
>
> For a minicluster (scheduler.include-port-in-node-name is set to TRUE), 
> AMBlacklist does not work. This is because the RM just puts the host into the 
> AMBlacklist whether scheduler.include-port-in-node-name is set or not. In 
> fact, the RM should put "host + port" into the AMBlacklist when it is set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-10-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949536#comment-14949536
 ] 

zhihai xu commented on YARN-3943:
-

Thanks [~jlowe] for the review and committing the patch, greatly appreciated!

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch, 
> YARN-3943.002.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-10-07 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3943:

Attachment: YARN-3943.002.patch

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch, 
> YARN-3943.002.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-10-07 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947996#comment-14947996
 ] 

zhihai xu commented on YARN-3943:
-

Thanks [~jlowe]! Yes, the comments are great. Nice catch on the backwards 
compatibility problem! I uploaded a new patch, YARN-3943.002.patch, which 
addresses all your comments. Please review it.

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch, 
> YARN-3943.002.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-10-07 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947014#comment-14947014
 ] 

zhihai xu commented on YARN-3943:
-

Hi [~jlowe], could you help review the patch? Thanks!

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-10-06 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: (was: YARN-3446.003.patch)

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because the headroom used for the 
> reducer preemption calculation includes blacklisted nodes. This makes jobs 
> hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource the AM gets from the RM includes 
> the available resources of the blacklisted nodes).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-10-06 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: YARN-3446.003.patch

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because the headroom used for the 
> reducer preemption calculation includes blacklisted nodes. This makes jobs 
> hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource the AM gets from the RM includes 
> the available resources of the blacklisted nodes).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-10-06 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946207#comment-14946207
 ] 

zhihai xu commented on YARN-4209:
-

Thanks [~rohithsharma] for reviewing and committing the patch!

> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: YARN-4209.000.patch, YARN-4209.001.patch, 
> YARN-4209.002.patch, YARN-4209.branch-2.7.patch
>
>
> RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
> {{stateMachine.doTransition}}. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even after {{notifyStoreOperationFailed}} is called. The only working 
> case for FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-10-06 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946210#comment-14946210
 ] 

zhihai xu commented on YARN-3943:
-

The checkstyle issues and release audit warnings were pre-existing.

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-10-06 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4209:

Attachment: YARN-4209.branch-2.7.patch

> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch, YARN-4209.001.patch, 
> YARN-4209.002.patch, YARN-4209.branch-2.7.patch
>
>
> RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
> {{stateMachine.doTransition}}. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even after {{notifyStoreOperationFailed}} is called. The only working 
> case for FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-10-06 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3943:

Attachment: (was: YARN-3943.001.patch)

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-10-06 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3943:

Attachment: YARN-3943.001.patch

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-10-06 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3943:

Attachment: (was: YARN-3943.001.patch)

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-10-06 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3943:

Attachment: YARN-3943.001.patch

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good again. Currently the configurations 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
> and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used both to check when disks become full and to check when they become good 
> again. It would be better to use two configurations: one used when a disk 
> transitions from not-full to full and the other used when it transitions from 
> full to not-full, so we can avoid frequent oscillation.
> For example, we can set the threshold for disk-full detection higher than the 
> one for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-10-06 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: YARN-3446.003.patch

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because the headroom used in the 
> reducer preemption calculation includes blacklisted nodes. This makes jobs 
> hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the available resource the AM gets from the RM still 
> includes the available resources of blacklisted nodes).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.
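
As a rough illustration of the fix direction (not the actual FairScheduler 
code; the class and method names below are made up), the headroom reported to 
the AM can simply skip the available resources of nodes that application has 
blacklisted:

{code:java}
import java.util.Map;
import java.util.Set;

/**
 * Rough sketch (not the actual FairScheduler code) of a headroom calculation
 * that excludes blacklisted nodes: resources available only on nodes the
 * application has blacklisted are not usable by that application, so they
 * should not be reported as headroom to the AM.
 */
public class BlacklistAwareHeadroom {

  /** availableMbByNode maps node id to available memory in MB. */
  public static long computeHeadroomMb(Map<String, Long> availableMbByNode,
                                       Set<String> blacklistedNodes) {
    long headroomMb = 0;
    for (Map.Entry<String, Long> entry : availableMbByNode.entrySet()) {
      if (!blacklistedNodes.contains(entry.getKey())) {
        headroomMb += entry.getValue();  // count only nodes the app can use
      }
    }
    return headroomMb;
  }
}
{code}

If the blacklisted nodes hold most of the cluster's free memory, the headroom 
drops accordingly and the AM can decide to preempt reducers instead of waiting 
for containers that will never be assigned.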



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-10-06 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: (was: YARN-3446.003.patch)

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because the headroom used in the 
> reducer preemption calculation includes blacklisted nodes. This makes jobs 
> hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the available resource the AM gets from the RM still 
> includes the available resources of blacklisted nodes).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-10-06 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945553#comment-14945553
 ] 

zhihai xu commented on YARN-4209:
-

Thanks [~rohithsharma]! Yes, I attached the patch YARN-4209.branch-2.7.patch 
for branch-2.7.

> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch, YARN-4209.001.patch, 
> YARN-4209.002.patch, YARN-4209.branch-2.7.patch
>
>
> RMStateStore's FENCED state doesn't work because {{updateFencedState}} is 
> called from within {{stateMachine.doTransition}}. The reason is that the 
> {{stateMachine.doTransition}} call made from {{updateFencedState}} is nested 
> inside the {{stateMachine.doTransition}} call made from a public 
> API (removeRMDelegationToken...) or from {{ForwardingEventHandler#handle}}. So 
> right after the internal transition from {{updateFencedState}} changes the 
> state to FENCED, the external transition changes the state back to ACTIVE. 
> The end result is that RMStateStore is still in the ACTIVE state even after 
> {{notifyStoreOperationFailed}} is called. The only case where the FENCED 
> state works is when {{notifyStoreOperationFailed}} is called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} => {{updateFencedState}} => 
> {{handleStoreEvent}} => enter internal {{stateMachine.doTransition}} => exit 
> internal {{stateMachine.doTransition}}, state changes to FENCED => exit 
> external {{stateMachine.doTransition}}, state changes back to ACTIVE.
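
A toy sketch (not the Hadoop StateMachine implementation) of why the nested 
transition is lost: the outer call writes its own target state back after its 
handler returns, clobbering whatever state the inner call set:

{code:java}
/**
 * Toy sketch (not the Hadoop StateMachine implementation) showing why a
 * nested transition is overwritten: the outer doTransition writes its own
 * target state back after its handler returns, clobbering the FENCED state
 * that the nested doTransition set.
 */
public class NestedTransitionDemo {

  enum State { ACTIVE, FENCED }

  private State currentState = State.ACTIVE;

  /** The handler may itself trigger another (nested) transition. */
  public void doTransition(State targetState, Runnable handler) {
    handler.run();               // may call doTransition(FENCED, ...) internally
    currentState = targetState;  // outer call overwrites the inner result
  }

  public static void main(String[] args) {
    NestedTransitionDemo store = new NestedTransitionDemo();
    // External transition (e.g. removeRMDelegationToken) whose handler fails
    // and tries to fence the store via a nested transition.
    store.doTransition(State.ACTIVE,
        () -> store.doTransition(State.FENCED, () -> { }));
    System.out.println(store.currentState);  // prints ACTIVE, not FENCED
  }
}
{code}

This mirrors the call chain in the example above: the nested transition does 
reach FENCED, but the external transition's exit path puts the store back into 
ACTIVE.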



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

