[jira] [Commented] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528371#comment-14528371 ] Junping Du commented on YARN-3396: -- This is a simple fix which doesn't need a unit test. I will go ahead and commit this later today. Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Labels: newbie Attachments: YARN-3396-002.patch, YARN-3396.patch There are two occurrences of the following code snippet:
{code}
//TODO fail? Already translated several times...
{code}
It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
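A minimal sketch of the direction implied above (hypothetical names, not the committed YARN-3396 patch): treat a malformed resource URI as a localization failure for that resource instead of leaving the TODO in place.
{code}
import java.net.URI;
import java.net.URISyntaxException;

public class ResourceUriCheck {
  // Converts a resource string to a URI, failing loudly on bad input rather
  // than falling through a "//TODO fail?" branch.
  static URI toResourceUri(String resource) {
    try {
      return new URI(resource);
    } catch (URISyntaxException e) {
      // Surface the malformed URI as a failure the localizer can report.
      throw new IllegalArgumentException("Invalid resource URI: " + resource, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(toResourceUri("hdfs://nn:8020/user/app/lib.jar"));
  }
}
{code}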
[jira] [Created] (YARN-3577) Misspelling of threshold in log4j.properties for tests
Brahma Reddy Battula created YARN-3577: -- Summary: Misspelling of threshold in log4j.properties for tests Key: YARN-3577 URL: https://issues.apache.org/jira/browse/YARN-3577 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Priority: Minor The log4j.properties file for tests contains the misspelled property log4j.threshhold. We should use the correct spelling, log4j.threshold. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
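For reference, the fix amounts to a one-character rename in the test log4j.properties; the value shown below is illustrative, and only the key's spelling is the point of the JIRA.
{code}
# Misspelled key, silently ignored by log4j:
# log4j.threshhold=ALL
# Correct spelling:
log4j.threshold=ALL
{code}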
[jira] [Commented] (YARN-3538) TimelineServer doesn't catch/translate all exceptions raised
[ https://issues.apache.org/jira/browse/YARN-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528442#comment-14528442 ] Junping Du commented on YARN-3538: -- Maybe it would sound slightly better if we handled RuntimeException the same way as IOException here? At least, we'd add "Error putting domain info". :) TimelineServer doesn't catch/translate all exceptions raised Key: YARN-3538 URL: https://issues.apache.org/jira/browse/YARN-3538 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Attachments: YARN-3538-001.patch Not all exceptions in TimelineServer are uprated to web exceptions; only IOEs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
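A minimal sketch of that suggestion (hypothetical helper, assuming the JAX-RS WebApplicationException already used for IOException in the timeline web services): give RuntimeException the same treatment, so the caller sees an HTTP 500 with the original cause attached.
{code}
import javax.ws.rs.WebApplicationException;
import javax.ws.rs.core.Response;

public final class PutDomainSketch {
  // Logs "Error putting domain info" and uprates a RuntimeException to a web
  // exception, mirroring the existing IOException branch.
  static void runTranslated(Runnable storeOp) {
    try {
      storeOp.run();
    } catch (RuntimeException e) {
      System.err.println("Error putting domain info: " + e);
      throw new WebApplicationException(e, Response.Status.INTERNAL_SERVER_ERROR);
    }
  }

  public static void main(String[] args) {
    // Demo: exits with the translated WebApplicationException.
    runTranslated(() -> { throw new IllegalStateException("boom"); });
  }
}
{code}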
[jira] [Commented] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528282#comment-14528282 ] Junping Du commented on YARN-3396: -- Thanks [~brahmareddy] for updating the patch! The v2 patch LGTM. +1 based on Jenkins' result. Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Labels: newbie Attachments: YARN-3396-002.patch, YARN-3396.patch There are two occurrences of the following code snippet:
{code}
//TODO fail? Already translated several times...
{code}
It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3574) RM hangs on stopping MetricsSinkAdapter when transitioning to standby
[ https://issues.apache.org/jira/browse/YARN-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528350#comment-14528350 ] Rohith commented on YARN-3574: -- Very interesting bug!! Going back to Java basics, Thread.interrupt() does not guarantee the interruption of a running thread unless the thread is waiting/sleeping on something. In this issue I think {{queue.consumeAll(this);}} is processing something that never gives the interrupt a chance. Just to reproduce this, I wrote the small program below. If we run it as-is (with Thread.sleep commented out), the thread never gets interrupted; uncommenting the small sleep results in the thread getting interrupted.
{code}
package com.test.basic;

public class Test1 {
  Thread sinkThread;
  private volatile boolean stopping = false;

  public void start() {
    sinkThread = new Thread() {
      public void run() {
        while (!stopping) {
          try {
            while (true) {
              // Thread.sleep(1);
            }
          } catch (Exception e) {
            System.out.println("Interrupted..");
          }
        }
      }
    };
    sinkThread.setDaemon(true);
    sinkThread.start();
  }

  public void stop() {
    stopping = true;
    System.out.println("Interrupting..");
    sinkThread.interrupt();
    try {
      System.out.println("Joining..");
      sinkThread.join();
    } catch (InterruptedException e) {
      System.out.println("Stop interrupted " + e);
    }
    System.out.println("Stopped successfully");
  }

  public static void main(String[] args) throws InterruptedException {
    Test1 t1 = new Test1();
    t1.start();
    Thread.sleep(2000);
    t1.stop();
  }
}
{code}
RM hangs on stopping MetricsSinkAdapter when transitioning to standby - Key: YARN-3574 URL: https://issues.apache.org/jira/browse/YARN-3574 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Brahma Reddy Battula We've seen a situation where one RM hangs on stopping the MetricsSinkAdapter:
{code}
"main-EventThread" daemon prio=10 tid=0x7f9b24031000 nid=0x2d18 in Object.wait() [0x7f9afe7eb000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0xc058dcf8> (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1)
    at java.lang.Thread.join(Thread.java:1281)
    - locked <0xc058dcf8> (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.stop(MetricsSinkAdapter.java:202)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stopSinks(MetricsSystemImpl.java:472)
    - locked <0xc04cc1a0> (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stop(MetricsSystemImpl.java:213)
    - locked <0xc04cc1a0> (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.shutdown(MetricsSystemImpl.java:592)
    - locked <0xc04cc1a0> (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl)
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdownInstance(DefaultMetricsSystem.java:72)
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdown(DefaultMetricsSystem.java:68)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:605)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    - locked <0xc0503568> (a java.lang.Object)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:1024)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1076)
    - locked <0xc03fe3b8> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:322)
    - locked <0xc0502b10> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
    at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:135)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:911)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:428)
    - locked <0xc0718940> (a org.apache.hadoop.ha.ActiveStandbyElector)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605)
    at
{code}
[jira] [Updated] (YARN-3552) RM Web UI shows -1 running containers for completed apps
[ https://issues.apache.org/jira/browse/YARN-3552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-3552: - Affects Version/s: 2.8.0 RM Web UI shows -1 running containers for completed apps Key: YARN-3552 URL: https://issues.apache.org/jira/browse/YARN-3552 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.8.0 Reporter: Rohith Assignee: Rohith Priority: Trivial Labels: newbie Fix For: 2.8.0 Attachments: 0001-YARN-3552.patch, 0001-YARN-3552.patch, 0002-YARN-3552.patch, yarn-3352.PNG In RMServerUtils, the default values are negative numbers, which results in the RM web UI also displaying negative numbers.
{code}
public static final ApplicationResourceUsageReport
    DUMMY_APPLICATION_RESOURCE_USAGE_REPORT =
        BuilderUtils.newApplicationResourceUsageReport(-1, -1,
            Resources.createResource(-1, -1), Resources.createResource(-1, -1),
            Resources.createResource(-1, -1), 0, 0);
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3576) In Log - Container getting killed by AM even when work preserving is enabled
Anushri created YARN-3576: - Summary: In Log - Container getting killed by AM even when work preserving is enabled Key: YARN-3576 URL: https://issues.apache.org/jira/browse/YARN-3576 Project: Hadoop YARN Issue Type: Bug Environment: SUSE11 SP3, 3-node cluster Reporter: Anushri Priority: Minor RM in HA mode, one NM running, work-preserving recovery enabled. An application is submitted and an RM switchover happens. In the NM log it is found that the AM kills some of the containers, and those containers have exit code 143; yet logs for those same containers are present in the container logs. Problem: if work preserving is enabled, why are the containers being killed and cleaned up? And if a container is getting killed, why are its logs present in the container logs? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528322#comment-14528322 ] Hudson commented on YARN-3097: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #918 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/918/]) YARN-3097. Logging of resource recovery on NM restart has redundancies. Contributed by Eric Payne (jlowe: rev 8f65c793f2930bfd16885a2ab188a9970b754974)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* hadoop-yarn-project/CHANGES.txt
Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Fix For: 2.8.0 Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528328#comment-14528328 ] Hudson commented on YARN-2725: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #918 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/918/]) YARN-2725. Added test cases of retrying creating znode in ZKRMStateStore. Contributed by Tsuyoshi Ozawa (jianhe: rev d701acc9c67adc578ba18035edde1166eedaae98)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
Adding test cases of retrying requests about ZKRMStateStore --- Key: YARN-2725 URL: https://issues.apache.org/jira/browse/YARN-2725 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi Ozawa Assignee: Tsuyoshi Ozawa Fix For: 2.8.0 Attachments: YARN-2725.1.patch, YARN-2725.1.patch YARN-2721 found a race condition in ZK-specific retry semantics. We should add tests covering the retrying of requests to ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3375) NodeHealthScriptRunner.shouldRun() check is performing 3 times for starting NodeHealthScriptRunner
[ https://issues.apache.org/jira/browse/YARN-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528326#comment-14528326 ] Hudson commented on YARN-3375: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #918 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/918/]) YARN-3375. NodeHealthScriptRunner.shouldRun() check is performing 3 times for starting NodeHealthScriptRunner (Devaraj K via wangda) (wangda: rev 71f4de220c74bf2c90630bd0442979d92380d304)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/NodeHealthScriptRunner.java
* hadoop-yarn-project/CHANGES.txt
NodeHealthScriptRunner.shouldRun() check is performing 3 times for starting NodeHealthScriptRunner -- Key: YARN-3375 URL: https://issues.apache.org/jira/browse/YARN-3375 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Devaraj K Assignee: Devaraj K Priority: Minor Fix For: 2.8.0 Attachments: YARN-3375.patch 1. The NodeHealthScriptRunner.shouldRun() check happens 3 times when starting the NodeHealthScriptRunner.
{code:title=NodeManager.java|borderStyle=solid}
if(!NodeHealthScriptRunner.shouldRun(nodeHealthScript)) {
  LOG.info("Abey khali");
  return null;
}
{code}
{code:title=NodeHealthCheckerService.java|borderStyle=solid}
if (NodeHealthScriptRunner.shouldRun(
    conf.get(YarnConfiguration.NM_HEALTH_CHECK_SCRIPT_PATH))) {
  addService(nodeHealthScriptRunner);
}
{code}
{code:title=NodeHealthScriptRunner.java|borderStyle=solid}
if (!shouldRun(nodeHealthScript)) {
  LOG.info("Not starting node health monitor");
  return;
}
{code}
2. If we don't configure a node health script, or the configured health script doesn't have execute permission, the NM logs the message below.
{code:xml}
2015-03-19 19:55:45,713 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: Abey khali
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528288#comment-14528288 ] Hudson commented on YARN-3097: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #184 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/184/]) YARN-3097. Logging of resource recovery on NM restart has redundancies. Contributed by Eric Payne (jlowe: rev 8f65c793f2930bfd16885a2ab188a9970b754974)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* hadoop-yarn-project/CHANGES.txt
Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Fix For: 2.8.0 Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528294#comment-14528294 ] Hudson commented on YARN-2725: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #184 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/184/]) YARN-2725. Added test cases of retrying creating znode in ZKRMStateStore. Contributed by Tsuyoshi Ozawa (jianhe: rev d701acc9c67adc578ba18035edde1166eedaae98)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
Adding test cases of retrying requests about ZKRMStateStore --- Key: YARN-2725 URL: https://issues.apache.org/jira/browse/YARN-2725 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi Ozawa Assignee: Tsuyoshi Ozawa Fix For: 2.8.0 Attachments: YARN-2725.1.patch, YARN-2725.1.patch YARN-2721 found a race condition in ZK-specific retry semantics. We should add tests covering the retrying of requests to ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3484) Fix up yarn top shell code
[ https://issues.apache.org/jira/browse/YARN-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528397#comment-14528397 ] Junping Du commented on YARN-3484: -- Latest patch LGTM. [~aw], do you have any further comments? If not, I will go ahead and commit the v2 patch soon. Thx! Fix up yarn top shell code -- Key: YARN-3484 URL: https://issues.apache.org/jira/browse/YARN-3484 Project: Hadoop YARN Issue Type: Bug Components: scripts Affects Versions: 3.0.0 Reporter: Allen Wittenauer Assignee: Varun Vasudev Labels: newbie Attachments: YARN-3484.001.patch, YARN-3484.002.patch We need to do some work on yarn top's shell code.
a) Just checking for TERM isn't good enough. We really need to check the return of tput, especially since the output will not be a number but an error string, which will likely blow up the java code in horrible ways.
b) All the single-bracket tests should be double brackets to force the bash built-in.
c) I think I'd rather see the shell portion in a function since it's rather large. This will allow args, etc., to get local'ized and clean up the case statement.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3477) TimelineClientImpl swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528420#comment-14528420 ] Junping Du commented on YARN-3477: -- Adding a little background (for timeline service v2) on why we might prefer DEBUG over INFO in the retry logic here: in timeline service version 2, the timeline service address (per application, per agent - we call it AppTimelineCollector) is automatically discovered. The current flow is:
1. For a new application, when the AM gets launched on an NM, the container-launch auxiliary service triggers initialization of the AppTimelineCollector, which reports its bind address to the NM (we add a new RPC there);
2. The NM notifies the RM about this new AppTimelineCollector address in the next heartbeat;
3. Other NMs (with containers running for this app) get the address from the RM.
Both the AM and NMs leverage TimelineClient to publish events/metrics to the timeline service, and this auto-discovery process does need some time (several heartbeat intervals) to resolve, unlike a static pre-configured address. So we would always see some disturbing messages if we logged at INFO level there. Thoughts? TimelineClientImpl swallows exceptions -- Key: YARN-3477 URL: https://issues.apache.org/jira/browse/YARN-3477 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0, 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-3477-001.patch, YARN-3477-002.patch If the timeline client fails more than the retry count, the original exception is not thrown. Instead some runtime exception is raised saying retries ran out.
# The failing exception should be rethrown, ideally via NetUtils.wrapException to include the URL of the failing endpoint.
# Otherwise, the raised RTE should (a) state that URL and (b) set the original fault as the inner cause.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
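A minimal retry sketch of points 1 and 2 from the description (hypothetical names, not TimelineClientImpl itself): remember the last failure, log each retry at low verbosity, and once retries run out rethrow with the endpoint named and the original fault as the cause.
{code}
import java.io.IOException;

public class RetrySketch {
  interface Op<T> { T run() throws IOException; }

  static <T> T withRetries(String url, int maxRetries, long intervalMs, Op<T> op)
      throws IOException {
    IOException last = null;
    for (int i = 0; i <= maxRetries; i++) {
      try {
        return op.run();
      } catch (IOException e) {
        last = e; // keep the original fault for the caller
        System.out.println("DEBUG: attempt " + (i + 1) + " against " + url
            + " failed, retrying: " + e);
        try { Thread.sleep(intervalMs); } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    // State the failing URL and set the original exception as the inner cause.
    throw new IOException("Failed to connect to " + url + " after "
        + (maxRetries + 1) + " attempts", last);
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical endpoint; this demo op succeeds immediately.
    System.out.println(withRetries("http://ats.example:8188/ws/v1/timeline", 3, 100, () -> "ok"));
  }
}
{code}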
[jira] [Updated] (YARN-2123) Progress bars in Web UI always at 100% due to non-US locale
[ https://issues.apache.org/jira/browse/YARN-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-2123: Attachment: YARN-2123-branch-2.7.001.patch Thanks [~xgong] for review. Attaching a patch for branch-2.7. Progress bars in Web UI always at 100% due to non-US locale --- Key: YARN-2123 URL: https://issues.apache.org/jira/browse/YARN-2123 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.3.0 Reporter: Johannes Simon Assignee: Akira AJISAKA Attachments: NaN_after_launching_RM.png, YARN-2123-001.patch, YARN-2123-002.patch, YARN-2123-003.patch, YARN-2123-004.patch, YARN-2123-branch-2.7.001.patch, fair-scheduler-ajisaka.xml, screenshot-noPatch.png, screenshot-patch.png, screenshot.png, yarn-site-ajisaka.xml In our cluster setup, the YARN web UI always shows progress bars at 100% (see screenshot, progress of the reduce step is roughly at 32.82%). I opened the HTML source code to check (also see screenshot), and it seems the problem is that it uses a comma as decimal mark, where most browsers expect a dot for floating-point numbers. This could possibly be due to localized number formatting being used in the wrong place, which would also explain why this bug is not always visible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
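The suspected mechanism is easy to demonstrate: Java's locale-sensitive formatters emit a comma as the decimal mark under many European locales, which browsers reject when parsing CSS widths such as width:32.82%. A minimal, self-contained illustration (not the actual webapp code):
{code}
import java.util.Locale;

public class LocaleDemo {
  public static void main(String[] args) {
    float progress = 32.82f;
    // Comma decimal mark - breaks when used inside a CSS width value:
    System.out.println(String.format(Locale.GERMANY, "%.2f%%", progress)); // 32,82%
    // Explicit US locale produces the dot that HTML/CSS expects:
    System.out.println(String.format(Locale.US, "%.2f%%", progress));      // 32.82%
  }
}
{code}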
[jira] [Commented] (YARN-3549) use JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from RawLocalFileSystem in checkLocalDir.
[ https://issues.apache.org/jira/browse/YARN-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528392#comment-14528392 ] Junping Du commented on YARN-3549: -- Hi [~zxu] and [~cnauroth], this sounds like a change that needs to happen in the hadoop-common project instead of YARN. Shall we move this JIRA from YARN to HADOOP? use JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from RawLocalFileSystem in checkLocalDir. Key: YARN-3549 URL: https://issues.apache.org/jira/browse/YARN-3549 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Use the JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of the shell-based implementation from RawLocalFileSystem in checkLocalDir. As discussed in YARN-3491, the shell-based implementation of getPermission runs the shell command ls -ld to get the permission, which takes 4 or 5 ms (very slow). We should switch to io.nativeio.NativeIO.POSIX#getFstat as the implementation in RawLocalFileSystem to get rid of the shell-based FileStatus implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
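A minimal sketch of the proposed direction (treat the exact API shape as an assumption; it requires the Hadoop native library to be loaded): obtain the file status via the JNI-based fstat call rather than forking ls -ld per file.
{code}
import java.io.FileDescriptor;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.io.nativeio.NativeIO;

public class FstatDemo {
  public static void main(String[] args) throws IOException {
    try (FileInputStream in = new FileInputStream(args[0])) {
      FileDescriptor fd = in.getFD();
      // One JNI call instead of a ~4-5 ms shell fork per file:
      NativeIO.POSIX.Stat stat = NativeIO.POSIX.getFstat(fd);
      System.out.println("owner=" + stat.getOwner() + " mode=" + stat.getMode());
    }
  }
}
{code}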
[jira] [Commented] (YARN-3552) RM Web UI shows -1 running containers for completed apps
[ https://issues.apache.org/jira/browse/YARN-3552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528422#comment-14528422 ] Jason Lowe commented on YARN-3552: -- +1 lgtm. Committing this. RM Web UI shows -1 running containers for completed apps Key: YARN-3552 URL: https://issues.apache.org/jira/browse/YARN-3552 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Rohith Assignee: Rohith Priority: Trivial Labels: newbie Attachments: 0001-YARN-3552.patch, 0001-YARN-3552.patch, 0002-YARN-3552.patch, yarn-3352.PNG In RMServerUtils, the default values are negative numbers, which results in the RM web UI also displaying negative numbers.
{code}
public static final ApplicationResourceUsageReport
    DUMMY_APPLICATION_RESOURCE_USAGE_REPORT =
        BuilderUtils.newApplicationResourceUsageReport(-1, -1,
            Resources.createResource(-1, -1), Resources.createResource(-1, -1),
            Resources.createResource(-1, -1), 0, 0);
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1912) ResourceLocalizer started without any jvm memory control
[ https://issues.apache.org/jira/browse/YARN-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529728#comment-14529728 ] Hadoop QA commented on YARN-1912: -
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 15m 38s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 59s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 47s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 1m 32s | There were no new checkstyle issues. |
| {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 26s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 5m 53s | Tests passed in hadoop-yarn-server-nodemanager. |
| | | 46m 12s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12730662/YARN-1912.003.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 90b3845 |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7717/artifact/patchprocess/whitespace.txt |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7717/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7717/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7717/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7717/console |
This message was automatically generated. ResourceLocalizer started without any jvm memory control Key: YARN-1912 URL: https://issues.apache.org/jira/browse/YARN-1912 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: stanley shi Assignee: Masatake Iwasaki Attachments: YARN-1912-0.patch, YARN-1912-1.patch, YARN-1912.003.patch In LinuxContainerExecutor.java#startLocalizer, it does not specify any -Xmx configuration in the command; this causes the ResourceLocalizer to be started with the default memory setting. On server-level hardware, it will use 25% of the system memory as the max heap size, which will cause memory issues in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2172: --- Labels: BB2015-05-TBR hadoop jobs resume suspend (was: hadoop jobs resume suspend) Suspend/Resume Hadoop Jobs -- Key: YARN-2172 URL: https://issues.apache.org/jira/browse/YARN-2172 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Affects Versions: 2.2.0 Environment: CentOS 6.5, Hadoop 2.2.0 Reporter: Richard Chen Labels: BB2015-05-TBR, hadoop, jobs, resume, suspend Attachments: Hadoop Job Suspend Resume Design.docx, hadoop_job_suspend_resume.patch Original Estimate: 336h Remaining Estimate: 336h In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower priority than jobs running outside Hadoop YARN, such as HBase. To give way to the higher-priority jobs, a user or a cluster-level resource scheduling service should be able to suspend and/or resume particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, their already allocated and running task containers continue to run until completion or active preemption by other means, but no new containers are allocated to them. In contrast, when suspended jobs are resumed, they continue from their previous progress and have new task containers allocated to complete the rest of the job. My team has completed an implementation, and our tests showed it works in a rather solid and convenient way. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3259) FairScheduler: Update to fairShare could be triggered early on node events instead of waiting for update interval
[ https://issues.apache.org/jira/browse/YARN-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3259: --- Labels: BB2015-05-TBR (was: ) FairScheduler: Update to fairShare could be triggered early on node events instead of waiting for update interval -- Key: YARN-3259 URL: https://issues.apache.org/jira/browse/YARN-3259 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Labels: BB2015-05-TBR Attachments: YARN-3259.001.patch Instead of waiting for the update interval unconditionally, we can trigger early updates on important events, e.g. node join and leave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2069: --- Labels: BB2015-05-TBR (was: ) CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Labels: BB2015-05-TBR Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-10.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch, YARN-2069-trunk-8.patch, YARN-2069-trunk-9.patch This is different from (even if related to, and likely sharing code with) YARN-2113. YARN-2113 focuses on making sure that even if a queue has its guaranteed capacity, its individual users are treated in line with their limits, irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2981) DockerContainerExecutor must support a Cluster-wide default Docker image
[ https://issues.apache.org/jira/browse/YARN-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2981: --- Labels: BB2015-05-TBR (was: ) DockerContainerExecutor must support a Cluster-wide default Docker image Key: YARN-2981 URL: https://issues.apache.org/jira/browse/YARN-2981 Project: Hadoop YARN Issue Type: Bug Reporter: Abin Shahab Assignee: Abin Shahab Labels: BB2015-05-TBR Attachments: YARN-2981.patch, YARN-2981.patch, YARN-2981.patch, YARN-2981.patch This allows the YARN administrator to add a cluster-wide default Docker image that will be used when there is no per-job override of the Docker image. With this feature, it would be convenient for newer applications like Slider to launch inside a cluster-default Docker container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3282) DockerContainerExecutor should support environment variables setting
[ https://issues.apache.org/jira/browse/YARN-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3282: --- Labels: BB2015-05-TBR (was: ) DockerContainerExecutor should support environment variables setting Key: YARN-3282 URL: https://issues.apache.org/jira/browse/YARN-3282 Project: Hadoop YARN Issue Type: Improvement Components: applications, nodemanager Affects Versions: 2.6.0 Reporter: Leitao Guo Labels: BB2015-05-TBR Attachments: YARN-3282.01.patch Currently, DockerContainerExecutor mounts yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs into containers automatically. However, applications may need to set more environment variables before launching containers. In our applications, as in the following command, we need to attach several directories and set some environment variables for docker containers.
{code}
docker run -i -t -v /data/transcode:/data/tmp -v /etc/qcs:/etc/qcs -v /mnt:/mnt \
  -e VTC_MQTYPE=rabbitmq -e VTC_APP=ugc -e VTC_LOCATION=sh -e VTC_RUNTIME=vtc \
  sequenceiq/hadoop-docker:2.6.0 /bin/bash
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3176) In Fair Scheduler, child queue should inherit maxApp from its parent
[ https://issues.apache.org/jira/browse/YARN-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3176: --- Labels: BB2015-05-TBR (was: ) In Fair Scheduler, child queue should inherit maxApp from its parent Key: YARN-3176 URL: https://issues.apache.org/jira/browse/YARN-3176 Project: Hadoop YARN Issue Type: Bug Reporter: Siqi Li Assignee: Siqi Li Labels: BB2015-05-TBR Attachments: YARN-3176.v1.patch If a child queue does not have a maxRunningApps limit, it will use queueMaxAppsDefault. This behavior is not quite right, since queueMaxAppsDefault is normally a small number, whereas some parent queues do have maxRunningApps set to more than the default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
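A hypothetical fair-scheduler allocation file illustrating the reported behavior (queue names invented for the example): the child sets no maxRunningApps, so today it falls back to the small queueMaxAppsDefault instead of inheriting its parent's higher limit.
{code:xml}
<allocations>
  <queueMaxAppsDefault>5</queueMaxAppsDefault>
  <queue name="parent">
    <maxRunningApps>100</maxRunningApps>
    <!-- Effective limit today: 5 (the default), not the parent's 100 -->
    <queue name="child"/>
  </queue>
</allocations>
{code}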
[jira] [Updated] (YARN-1912) ResourceLocalizer started without any jvm memory control
[ https://issues.apache.org/jira/browse/YARN-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1912: --- Labels: BB2015-05-TBR (was: ) ResourceLocalizer started without any jvm memory control Key: YARN-1912 URL: https://issues.apache.org/jira/browse/YARN-1912 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: stanley shi Assignee: Masatake Iwasaki Labels: BB2015-05-TBR Attachments: YARN-1912-0.patch, YARN-1912-1.patch, YARN-1912.003.patch In LinuxContainerExecutor.java#startLocalizer, it does not specify any -Xmx configuration in the command; this causes the ResourceLocalizer to be started with the default memory setting. On server-level hardware, it will use 25% of the system memory as the max heap size, which will cause memory issues in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3554: --- Labels: BB2015-05-TBR newbie (was: newbie) Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: BB2015-05-TBR, newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec, or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
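A possible yarn-site.xml override while the default is under discussion (the 3-minute value is illustrative, not a value proposed in this JIRA):
{code:xml}
<property>
  <name>yarn.client.nodemanager-connect.max-wait-ms</name>
  <value>180000</value> <!-- 3 minutes instead of the 15-minute default -->
</property>
{code}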
[jira] [Updated] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2618: --- Labels: BB2015-05-TBR (was: ) Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Labels: BB2015-05-TBR Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch Subtask of YARN-2139. This should include:
- Add API support for introducing disk I/O as the 3rd type of resource.
- NM should report this information to the RM.
- RM should consider this to avoid over-allocation.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3562) unit tests failures and issues found from findbug from earlier ATS checkins
[ https://issues.apache.org/jira/browse/YARN-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529837#comment-14529837 ] Hadoop QA commented on YARN-3562: -
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 48s | Pre-patch YARN-2928 compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. |
| {color:green}+1{color} | javac | 7m 38s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 43s | There were no new javadoc warning messages. |
| {color:red}-1{color} | release audit | 0m 58s | The applied patch generated 16 release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 34s | The applied patch generated 1 new checkstyle issues (total was 9, now 10). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 24s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 52m 40s | Tests passed in hadoop-yarn-server-resourcemanager. |
| {color:green}+1{color} | yarn tests | 2m 33s | Tests passed in hadoop-yarn-server-tests. |
| {color:green}+1{color} | yarn tests | 0m 21s | Tests passed in hadoop-yarn-server-timelineservice. |
| | | 94m 50s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12730677/YARN-3562-YARN-2928.002.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / 557a395 |
| Release Audit | https://builds.apache.org/job/PreCommit-YARN-Build/7720/artifact/patchprocess/patchReleaseAuditProblems.txt |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7720/artifact/patchprocess/diffcheckstylehadoop-yarn-server-timelineservice.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7720/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| hadoop-yarn-server-tests test log | https://builds.apache.org/job/PreCommit-YARN-Build/7720/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt |
| hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/7720/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7720/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7720/console |
This message was automatically generated. unit tests failures and issues found from findbug from earlier ATS checkins --- Key: YARN-3562 URL: https://issues.apache.org/jira/browse/YARN-3562 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Naganarasimha G R Priority: Minor Labels: BB2015-05-TBR Attachments: YARN-3562-YARN-2928.001.patch, YARN-3562-YARN-2928.002.patch *Issues reported from MAPREDUCE-6337*: A bunch of MR unit tests are failing on our branch whenever the mini YARN cluster needs to bring up multiple node managers. For example, see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5472/testReport/org.apache.hadoop.mapred/TestClusterMapReduceTestCase/testMapReduceRestarting/ It is because the NMCollectorService is using a fixed port for the RPC (8048). *Issues reported from YARN-3044*: Test case failures and tooling (FindBugs, checkstyle) issues found:
# findbugs issue: Comparison of String objects using == or != in ResourceTrackerService.updateAppCollectorsMap
# findbugs issue: Boxing/unboxing to parse a primitive in RMTimelineCollectorManager.postPut. Called method Long.longValue(); should call Long.parseLong(String) instead.
# findbugs issue: DM_DEFAULT_ENCODING - called method new java.io.FileWriter(String, boolean) at FileSystemTimelineWriterImpl.java:\[line 86\]
# test failures: hadoop.yarn.server.resourcemanager.TestAppManager, hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions, hadoop.yarn.server.resourcemanager.TestClientRMService, hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus; refer https://builds.apache.org/job/PreCommit-YARN-Build/7534/testReport/
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3381) A typographical error in InvalidStateTransitonException
[ https://issues.apache.org/jira/browse/YARN-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3381: --- Labels: BB2015-05-TBR (was: ) A typographical error in InvalidStateTransitonException - Key: YARN-3381 URL: https://issues.apache.org/jira/browse/YARN-3381 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Xiaoshuang LU Assignee: Brahma Reddy Battula Labels: BB2015-05-TBR Attachments: YARN-3381-002.patch, YARN-3381-003.patch, YARN-3381.patch It appears that InvalidStateTransitonException should be InvalidStateTransitionException; "transition" was misspelled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3385: --- Labels: BB2015-05-TBR (was: ) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Labels: BB2015-05-TBR Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch, YARN-3385.003.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023: since the race condition exists for ZK node creation, it should also exist for ZK node deletion. We saw this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
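A minimal sketch of the usual fix for this class of race (the general approach, not necessarily the committed patch): when a delete races with an earlier retried operation that actually succeeded on the server, tolerate NoNodeException, since the node being absent is the desired end state.
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class SafeDelete {
  // Deletes the znode at path, treating "already gone" as success, analogous
  // to how the create-side race was handled in YARN-3023.
  static void deleteIfExists(ZooKeeper zk, String path) throws Exception {
    try {
      zk.delete(path, -1); // -1 matches any version
    } catch (KeeperException.NoNodeException e) {
      // Node already deleted, possibly by our own earlier timed-out attempt.
    }
  }
}
{code}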
[jira] [Updated] (YARN-3580) [JDK 8] TestClientRMService.testGetLabelsToNodes fails
[ https://issues.apache.org/jira/browse/YARN-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3580: --- Labels: BB2015-05-TBR (was: ) [JDK 8] TestClientRMService.testGetLabelsToNodes fails -- Key: YARN-3580 URL: https://issues.apache.org/jira/browse/YARN-3580 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.8.0 Environment: JDK 8 Reporter: Robert Kanter Assignee: Robert Kanter Labels: BB2015-05-TBR Attachments: YARN-3580.001.patch When using JDK 8, {{TestClientRMService.testGetLabelsToNodes}} fails:
{noformat}
java.lang.AssertionError: null
    at org.junit.Assert.fail(Assert.java:86)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.junit.Assert.assertTrue(Assert.java:52)
    at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService.testGetLabelsToNodes(TestClientRMService.java:1499)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3582) NPE in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3582: --- Labels: BB2015-05-TBR (was: ) NPE in WebAppProxyServlet - Key: YARN-3582 URL: https://issues.apache.org/jira/browse/YARN-3582 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Labels: BB2015-05-TBR Attachments: YARN-3582.1.patch
{code}
HTTP ERROR 500
Problem accessing /proxy. Reason: INTERNAL_SERVER_ERROR
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:245)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3513) Remove unused variables in ContainersMonitorImpl
[ https://issues.apache.org/jira/browse/YARN-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3513: --- Labels: BB2015-05-TBR newbie (was: newbie) Remove unused variables in ContainersMonitorImpl Key: YARN-3513 URL: https://issues.apache.org/jira/browse/YARN-3513 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Priority: Trivial Labels: BB2015-05-TBR, newbie Attachments: YARN-3513.20150421-1.patch, YARN-3513.20150503-1.patch, YARN-3513.20150506-1.patch The class member {{private final Context context;}} and some local variables in MonitoringThread.run() ({{vmemStillInUsage}} and {{pmemStillInUsage}}) are never read, only updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3134: --- Labels: BB2015-05-TBR (was: ) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Labels: BB2015-05-TBR Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134-YARN-2928.001.patch, YARN-3134-YARN-2928.002.patch, YARN-3134-YARN-2928.003.patch, YARN-3134-YARN-2928.004.patch, YARN-3134DataSchema.pdf Quote the introduction on the Phoenix web page:
{code}
Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
{code}
It may simplify how our implementation reads/writes data from/to HBase, and makes it easy to build indexes and compose complex queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
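To make the described access pattern concrete, a minimal sketch of querying HBase through Phoenix's JDBC driver (the connection-string format is Phoenix's; the table and columns are invented for illustration):
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQuery {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
         Statement st = conn.createStatement();
         // Phoenix compiles this SQL into a series of HBase scans.
         ResultSet rs = st.executeQuery(
             "SELECT entity_id, created_time FROM timeline_entity LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
{code}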
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529840#comment-14529840 ] Jian He commented on YARN-1680: --- My thinking is that even if we do the headroom calculation on the client side, the scheduler still requires some corresponding per-app logic for the headroom calculation. And that scheduler piece of logic may end up duplicating a subset of the client-side logic, plus corresponding protocol changes. In that sense, I think it's simpler to do this inside the scheduler. Doing the calculation in one place also gives a more accurate snapshot than doing the calculations in multiple places. Also, changing MapReduce to use AMRMClient is non-trivial work. availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Craig Welch Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each, so total cluster capacity is 32GB. Cluster slow start is set to 1. A running job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 maps got killed), so the MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are running in the cluster now. The MRAppMaster does not preempt the reducers because the headroom used for the reducer-preemption calculation includes the blacklisted node's memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, but the availableResource it returns considers cluster free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3523) Cleanup ResourceManagerAdministrationProtocol interface audience
[ https://issues.apache.org/jira/browse/YARN-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3523: --- Labels: BB2015-05-TBR newbie (was: newbie) Cleanup ResourceManagerAdministrationProtocol interface audience Key: YARN-3523 URL: https://issues.apache.org/jira/browse/YARN-3523 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Labels: BB2015-05-TBR, newbie Attachments: YARN-3523.20150422-1.patch, YARN-3523.20150504-1.patch, YARN-3523.20150505-1.patch I noticed that ResourceManagerAdministrationProtocol has @Private audience for the class and @Public audience for its methods. That doesn't make sense to me. We should make the class audience and the method audience consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3562) unit tests failures and issues found from findbug from earlier ATS checkins
[ https://issues.apache.org/jira/browse/YARN-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3562: --- Labels: BB2015-05-TBR (was: ) unit tests failures and issues found from findbug from earlier ATS checkins --- Key: YARN-3562 URL: https://issues.apache.org/jira/browse/YARN-3562 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Naganarasimha G R Priority: Minor Labels: BB2015-05-TBR Attachments: YARN-3562-YARN-2928.001.patch, YARN-3562-YARN-2928.002.patch *Issues reported from MAPREDUCE-6337*: A bunch of MR unit tests are failing on our branch whenever the mini YARN cluster needs to bring up multiple node managers. For example, see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5472/testReport/org.apache.hadoop.mapred/TestClusterMapReduceTestCase/testMapReduceRestarting/ It is because the NMCollectorService is using a fixed port for the RPC (8048). *Issues reported from YARN-3044*: Test case failures and tool (findbugs, checkstyle) issues found: # findbugs issue: Comparison of String objects using == or != in ResourceTrackerService.updateAppCollectorsMap # findbugs issue: Boxing/unboxing to parse a primitive in RMTimelineCollectorManager.postPut: called method Long.longValue(); should call Long.parseLong(String) instead. # findbugs issue: DM_DEFAULT_ENCODING, called method new java.io.FileWriter(String, boolean) at FileSystemTimelineWriterImpl.java:\[line 86\] # Test failures in hadoop.yarn.server.resourcemanager.TestAppManager, hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions, hadoop.yarn.server.resourcemanager.TestClientRMService and hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus; refer to https://builds.apache.org/job/PreCommit-YARN-Build/7534/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
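For reference, the canonical fixes for the three findbugs patterns listed above look roughly like this (a hedged sketch; the variable names are illustrative, not the patch):
{code}
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class FindbugsFixSketch {
  static void examples(String a, String b, String s, String path) throws Exception {
    // String comparison with == / !=: compare contents, not references.
    boolean same = a.equals(b);          // instead of a == b

    // Boxing/unboxing: parse straight to a primitive.
    long v = Long.parseLong(s);          // instead of Long.valueOf(s).longValue()

    // DM_DEFAULT_ENCODING: name the charset instead of new FileWriter(path, true).
    try (Writer w = new OutputStreamWriter(
        new FileOutputStream(path, true), StandardCharsets.UTF_8)) {
      w.write(same + " " + v);
    }
  }
}
{code}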
[jira] [Updated] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3301: --- Labels: BB2015-05-TBR (was: ) Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Labels: BB2015-05-TBR Attachments: Screen Shot 2015-04-21 at 5.09.25 PM.png, Screen Shot 2015-04-21 at 5.38.39 PM.png, YARN-3301.1.patch, YARN-3301.2.patch, YARN-3301.3.patch, YARN-3301.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3301: Attachment: YARN-3301.4.patch Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: Screen Shot 2015-04-21 at 5.09.25 PM.png, Screen Shot 2015-04-21 at 5.38.39 PM.png, YARN-3301.1.patch, YARN-3301.2.patch, YARN-3301.3.patch, YARN-3301.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529591#comment-14529591 ] Naganarasimha G R commented on YARN-3554: - Hi [~vinodkv] [~jlowe], so would configuring yarn.client.nodemanager-connect.max-wait-ms as 1 minute be better? Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec, i.e. 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
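If the value were lowered, a client-side override to the proposed 1 minute is a one-liner; a minimal sketch, using the property key quoted above (only the key is from the issue, the rest is illustrative):
{code}
import org.apache.hadoop.conf.Configuration;

public class NmConnectWaitSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // 60 000 ms = the 1 minute proposed in the comment above.
    conf.setLong("yarn.client.nodemanager-connect.max-wait-ms", 60 * 1000L);
  }
}
{code}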
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529644#comment-14529644 ] Craig Welch commented on YARN-1680: --- bq. Please leave out the head-room concerns w.r.t node-labels. IIRC, we had tickets at YARN-796 tracking that. It is very likely a completely different solution, so. I'm not sure that's so - there is already a process of calculating headroom for labels associated with an application, and the above is an extension of that to blacklisted nodes to handle label cases. If we leave it out, then the solution won't work for node-labels; it can be made to do so, so that would be a loss. bq. When I said node-labels above, I meant partitions. Clearly the problem and the corresponding solution will likely be very similar for node-constraints (one type of node-labels). After all, blacklisting is a type of (anti) node-constraint. It could be modeled that way, but then it would be qualitatively different from the solution for non-label cases, which is not a good thing... bq. There is no notion of a cluster-level blacklisting in YARN. We have notions of unhealthy/lost/decommissioned nodes in a cluster. This is what I am referring to when I say: bq. addition/removal at the cluster level I'm not suggesting/referring to anything other than nodes entering/leaving the cluster. bq. Coming to the app-level blacklisting, clearly, the solution proposed is better than dead-locks. But blindly reducing the resources corresponding to blacklisted nodes will result in under-utilization (sometimes massively) and over-conservative scheduling requests by apps. So, that's the point of the recommended approach. The idea is to detect when it is necessary to recalculate the impact of the blacklisting on app headroom, which is when either the blacklisting from the app has changed or the node composition of the cluster has changed (each of which should be relatively infrequent, certainly in relation to headroom calculation), and at that time to accurately calculate the impact by adding into the deduction only the resource value of blacklisted nodes which actually exist. It isn't blindly reducing resources, it's doing it accurately, and should prevent both deadlocks and under-utilization. bq. One way to resolve this is to get the apps (or optionally in the AMRMClient library) to deduct the resource unusable on blacklisted nodes It could be moved into the AMs or the client library, but then they would have to do the same sort of thing, and the logic would need to be duplicated among the AMs or would only be available to those which use the library (do they all?). It's worth considering whether it can be made to cover them all via the library, but I'm not sure this isn't something which should be handled as part of the headroom calculation in the RM, as the RM is meant to provide this accurately and is otherwise aware of the blacklist. Which suggests that we already have the blacklist for the application in the RM, available to the scheduler (I'm not sure why that wasn't obvious to me before...); that does appear to be the case, and it therefore drops the concerns about adding it - it's already there... availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
-- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Craig Welch Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1. A running job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 map tasks got killed), so MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running in the cluster. MRAppMaster does not preempt the reducers because the headroom used for the reducer preemption calculation includes blacklisted nodes' memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes but returns an availableResources value that reflects the whole cluster's free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3580) [JDK 8] TestClientRMService.testGetLabelsToNodes fails
[ https://issues.apache.org/jira/browse/YARN-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-3580: Attachment: YARN-3580.001.patch When setting node labels, port 0 is considered a wildcard port, and the {{CommonNodeLabelsManager}} applies the given label to all NMs that previously had a label on that host. Due to the iteration ordering being different between JDK 7 and JDK 8, this was changing the labeling from:
{noformat:title=JDK7}
z  host1:1 host3:1
y  host2:0 host3:0
x  host1:0
{noformat}
to
{noformat:title=JDK8}
x  host1:1 host1:0
y  host3:0 host2:0
z  host3:1
{noformat}
The patch fixes the problem by using different port numbers. It also cleans the test up a little. [JDK 8] TestClientRMService.testGetLabelsToNodes fails -- Key: YARN-3580 URL: https://issues.apache.org/jira/browse/YARN-3580 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.8.0 Environment: JDK 8 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: YARN-3580.001.patch When using JDK 8, {{TestClientRMService.testGetLabelsToNodes}} fails:
{noformat}
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService.testGetLabelsToNodes(TestClientRMService.java:1499)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
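The fragility behind this failure is easy to reproduce: any test that depends on the iteration order of a hash-based collection can behave differently across JDK releases. A minimal illustration, not the test's actual data structures:
{code}
import java.util.HashMap;
import java.util.Map;

public class IterationOrderSketch {
  public static void main(String[] args) {
    Map<String, String> labelsToNode = new HashMap<>();
    labelsToNode.put("x", "host1:0");
    labelsToNode.put("y", "host2:0");
    labelsToNode.put("z", "host3:1");
    // HashMap makes no ordering guarantee; JDK 7 and JDK 8 changed the internal
    // hashing, so this loop may visit entries in a different order on each JDK.
    for (Map.Entry<String, String> e : labelsToNode.entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}
{code}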
[jira] [Updated] (YARN-3582) NPE in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3582: -- Attachment: YARN-3582.1.patch upload a patch to fix the npe NPE in WebAppProxyServlet - Key: YARN-3582 URL: https://issues.apache.org/jira/browse/YARN-3582 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3582.1.patch {code} HTTP ERROR 500 Problem accessing /proxy. Reason: INTERNAL_SERVER_ERROR Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:245) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3557) Support Intel Trusted Execution Technology(TXT) in YARN scheduler
[ https://issues.apache.org/jira/browse/YARN-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529827#comment-14529827 ] Naganarasimha G R commented on YARN-3557: - That's an interesting point. {{Labels GPU, FPGA, LINUX, WINDOWS}} are more like constraints of a node, for which a new JIRA is coming up, so maybe once it comes in we need to support it in such a way that constraints should be supported to be added from both RM and NM and partitions (existing labels) should be allowed either by RM or NM... thoughts? Support Intel Trusted Execution Technology(TXT) in YARN scheduler - Key: YARN-3557 URL: https://issues.apache.org/jira/browse/YARN-3557 Project: Hadoop YARN Issue Type: New Feature Reporter: Dian Fu Attachments: Support TXT in YARN high level design doc.pdf Intel TXT defines platform-level enhancements that provide the building blocks for creating trusted platforms. A TXT aware YARN scheduler can schedule security sensitive jobs on TXT enabled nodes only. YARN-2492 provides the capacity to restrict YARN applications to run only on cluster nodes that have a specified node label. This is a good mechanism that can be utilized for a TXT aware YARN scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3212) RMNode State Transition Update with DECOMMISSIONING state
[ https://issues.apache.org/jira/browse/YARN-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3212: --- Labels: BB2015-05-TBR (was: ) RMNode State Transition Update with DECOMMISSIONING state - Key: YARN-3212 URL: https://issues.apache.org/jira/browse/YARN-3212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Junping Du Assignee: Junping Du Labels: BB2015-05-TBR Attachments: RMNodeImpl - new.png, YARN-3212-v1.patch, YARN-3212-v2.patch, YARN-3212-v3.patch As proposed in YARN-914, a new state, “DECOMMISSIONING”, will be added; a node can transition into it from the “running” state, triggered by a new event - “decommissioning”. This new state can transition to “decommissioned” on Resource_Update if there are no running apps on this NM, when the NM reconnects after restart, or when it receives a DECOMMISSIONED event (after timeout from the CLI). In addition, it can go back to “running” if the user decides to cancel the previous decommission by calling recommission on the same node. The reaction to other events is similar to the RUNNING state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1572) Low chance to hit NPE issue in AppSchedulingInfo#allocateNodeLocal
[ https://issues.apache.org/jira/browse/YARN-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1572: --- Labels: BB2015-05-TBR (was: ) Low chance to hit NPE issue in AppSchedulingInfo#allocateNodeLocal -- Key: YARN-1572 URL: https://issues.apache.org/jira/browse/YARN-1572 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Labels: BB2015-05-TBR Attachments: YARN-1572-branch-2.3.0.001.patch, YARN-1572-log.tar.gz, conf.tar.gz, log.tar.gz We have a low chance of hitting an NPE in allocateNodeLocal when running benchmarks (hit 4 out of 20 times).
{code}
2014-07-31 04:18:19,653 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1406794589275_0001_01_21 of capacity memory:1024, vCores:1 on host datanode10:57281, which has 6 containers, memory:6144, vCores:6 used and memory:2048, vCores:2 available after allocation
2014-07-31 04:18:19,654 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:311)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:268)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.allocate(FiCaSchedulerApp.java:136)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainer(FifoScheduler.java:683)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignNodeLocalContainers(FifoScheduler.java:602)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainersOnNode(FifoScheduler.java:560)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:488)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:729)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:774)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:101)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:599)
at java.lang.Thread.run(Thread.java:662)
2014-07-31 04:18:19,655 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-160: -- Labels: BB2015-05-TBR (was: ) nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Labels: BB2015-05-TBR Attachments: apache-yarn-160.0.patch, apache-yarn-160.1.patch, apache-yarn-160.2.patch, apache-yarn-160.3.patch As mentioned in YARN-2 *NM memory and CPU configs*: currently these values come from the NM's config, but we should be able to obtain them from the OS (i.e., in the case of Linux, from /proc/meminfo and /proc/cpuinfo). As this is highly OS dependent, we should have an interface that obtains this information. In addition, implementations of this interface should be able to specify a mem/cpu offset (an amount of mem/cpu not to be made available as YARN resources); this would allow reserving mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
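For the Linux case, the probing such an interface would wrap is straightforward; a minimal sketch of reading MemTotal (parsing details are assumptions, and real code would do the same for /proc/cpuinfo):
{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ProcMemInfoSketch {
  /** Returns MemTotal from /proc/meminfo in kB, or -1 if not found. Linux only. */
  static long totalMemoryKb() throws IOException {
    try (BufferedReader r = new BufferedReader(new FileReader("/proc/meminfo"))) {
      String line;
      while ((line = r.readLine()) != null) {
        if (line.startsWith("MemTotal:")) {
          // The line looks like "MemTotal:       16326428 kB".
          String[] parts = line.split("\\s+");
          return Long.parseLong(parts[1]);
        }
      }
    }
    return -1;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("MemTotal = " + totalMemoryKb() + " kB");
  }
}
{code}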
[jira] [Updated] (YARN-868) YarnClient should set the service address in tokens returned by getRMDelegationToken()
[ https://issues.apache.org/jira/browse/YARN-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-868: -- Labels: BB2015-05-TBR (was: ) YarnClient should set the service address in tokens returned by getRMDelegationToken() -- Key: YARN-868 URL: https://issues.apache.org/jira/browse/YARN-868 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Varun Saxena Labels: BB2015-05-TBR Attachments: YARN-868.patch Either the client should set this information into the token or the client layer should expose an api that returns the service address. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
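On the first option, Hadoop's SecurityUtil already has a helper for stamping a service address onto a token, so the client-side change could be as small as the sketch below (the method wrapping it and the rmAddress parameter are illustrative):
{code}
import java.net.InetSocketAddress;

import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.token.Token;

public class TokenServiceSketch {
  /** Stamp the RM's address into the token's service field before returning it. */
  static void stampService(Token<?> rmDelegationToken, InetSocketAddress rmAddress) {
    // Encodes host:port into the token's service field so later code can
    // find the right RM without extra context.
    SecurityUtil.setTokenService(rmDelegationToken, rmAddress);
  }
}
{code}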
[jira] [Updated] (YARN-2380) The normalizeRequests method in SchedulerUtils always resets the vCore to 1
[ https://issues.apache.org/jira/browse/YARN-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2380: --- Labels: BB2015-05-TBR (was: ) The normalizeRequests method in SchedulerUtils always resets the vCore to 1 --- Key: YARN-2380 URL: https://issues.apache.org/jira/browse/YARN-2380 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Jian Fang Assignee: Jian Fang Priority: Critical Labels: BB2015-05-TBR Attachments: YARN-2380.patch I added some log info to the method normalizeRequest() as follows:
{code}
public static void normalizeRequest(
    ResourceRequest ask, ResourceCalculator resourceCalculator,
    Resource clusterResource, Resource minimumResource,
    Resource maximumResource, Resource incrementResource) {
  LOG.info("Before request normalization, the ask capacity: " + ask.getCapability());
  Resource normalized = Resources.normalize(
      resourceCalculator, ask.getCapability(), minimumResource,
      maximumResource, incrementResource);
  LOG.info("After request normalization, the ask capacity: " + normalized);
  ask.setCapability(normalized);
}
{code}
The resulting log showed that the vCores in the ask were changed from 2 to 1.
{noformat}
2014-08-01 20:54:15,537 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils (IPC Server handler 4 on 9024): Before request normalization, the ask capacity: memory:1536, vCores:2
2014-08-01 20:54:15,537 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils (IPC Server handler 4 on 9024): After request normalization, the ask capacity: memory:1536, vCores:1
{noformat}
The root cause is that DefaultResourceCalculator calls Resources.createResource(normalizedMemory) to regenerate a new resource with vCores = 1. This bug is critical: it leads to a mismatch between the requested resource and the container resource, and to many other potential issues, if the user requests containers with more than 1 vCore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
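One possible shape of a fix (a sketch only, not the attached patch) is to carry the request's vCores through memory normalization instead of letting createResource default them to 1:
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class NormalizeSketch {
  /** Keep the asked vCores instead of calling Resources.createResource(memory). */
  static Resource normalizeMemoryOnly(Resource ask, int normalizedMemory) {
    return Resources.createResource(normalizedMemory, ask.getVirtualCores());
  }
}
{code}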
[jira] [Updated] (YARN-867) Isolation of failures in aux services
[ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-867: -- Labels: BB2015-05-TBR (was: ) Isolation of failures in aux services -- Key: YARN-867 URL: https://issues.apache.org/jira/browse/YARN-867 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Critical Labels: BB2015-05-TBR Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, YARN-867.4.patch, YARN-867.5.patch, YARN-867.6.patch, YARN-867.sampleCode.2.patch Today, a malicious application can bring down the NM by sending bad data to a service. For example, sending data to the ShuffleService such that it results in any non-IOException will cause the NM's async dispatcher to exit, as the service's INIT APP event is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
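The isolation being asked for boils down to catching everything an aux service handler throws, so one misbehaving service cannot kill the NM dispatcher. A sketch with stand-in types (AuxService here is not the real NM interface):
{code}
public class AuxServiceIsolationSketch {
  interface AuxService { void handle(Object event); }

  // Dispatch an event to an aux service without letting its failure
  // propagate into the NM's async dispatcher.
  static void dispatchSafely(AuxService service, Object event) {
    try {
      service.handle(event);
    } catch (Throwable t) { // not just IOException: any RuntimeException or Error too
      System.err.println("Aux service failed handling " + event + ": " + t);
      // Real code would log this and possibly disable the offending service.
    }
  }
}
{code}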
[jira] [Updated] (YARN-1050) Document the Fair Scheduler REST API
[ https://issues.apache.org/jira/browse/YARN-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1050: --- Labels: BB2015-05-TBR (was: ) Document the Fair Scheduler REST API Key: YARN-1050 URL: https://issues.apache.org/jira/browse/YARN-1050 Project: Hadoop YARN Issue Type: Improvement Components: documentation, fairscheduler Reporter: Sandy Ryza Assignee: Kenji Kikushima Labels: BB2015-05-TBR Attachments: YARN-1050-2.patch, YARN-1050-3.patch, YARN-1050.patch The documentation should be placed here along with the Capacity Scheduler documentation: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3261) rewrite resourcemanager restart doc to remove roadmap bits
[ https://issues.apache.org/jira/browse/YARN-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3261: --- Labels: BB2015-05-TBR (was: ) rewrite resourcemanager restart doc to remove roadmap bits --- Key: YARN-3261 URL: https://issues.apache.org/jira/browse/YARN-3261 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Allen Wittenauer Assignee: Gururaj Shetty Labels: BB2015-05-TBR Attachments: YARN-3261.01.patch Another mixture of roadmap and instruction manual that seems to be ever present in a lot of the recently written documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3126) FairScheduler: queue's usedResource is always more than the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-3126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3126: --- Labels: BB2015-05-TBR assignContainer fairscheduler resources (was: assignContainer fairscheduler resources) FairScheduler: queue's usedResource is always more than the maxResource limit - Key: YARN-3126 URL: https://issues.apache.org/jira/browse/YARN-3126 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.3.0 Environment: hadoop2.3.0. fair scheduler. spark 1.1.0. Reporter: Xia Hu Labels: BB2015-05-TBR, assignContainer, fairscheduler, resources Fix For: trunk-win Attachments: resourcelimit-02.patch, resourcelimit.patch When submitting Spark applications (both spark-on-yarn-cluster and spark-on-yarn-client modes), the queue's usedResources as assigned by the fair scheduler can exceed the queue's maxResources limit. From reading the fair scheduler code, I suppose this issue happens because the requested resources are not checked when assigning a container. Here is the detail: 1. Choose a queue. In this process, it checks whether the queue's usedResource is bigger than its max, via assignContainerPreCheck. 2. Then choose an app in that queue. 3. Then choose a container. And here is the problem: there is no check on whether this container would put the queue's resources over its max limit. If a queue's usedResource is 13G and the maxResource limit is 16G, then a container asking for 4G of resources may still be assigned successfully. This problem happens readily with Spark applications, because we can ask for different container resources in different applications. By the way, I have already applied the patch from YARN-2083. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
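For illustration, the guard missing in step 3 could be phrased with Hadoop's Resources utility roughly as below (a hedged sketch; the queue accessors feeding it are assumptions, not the attached patches):
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class MaxResourceGuardSketch {
  /** True if assigning a container of size `ask` keeps the queue within its max. */
  static boolean fitsUnderMax(Resource queueUsage, Resource queueMax, Resource ask) {
    // e.g. usage 13G + ask 4G = 17G does NOT fit under a 16G max, so skip assignment.
    return Resources.fitsIn(Resources.add(queueUsage, ask), queueMax);
  }
}
{code}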
[jira] [Updated] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2345: --- Labels: BB2015-05-TBR newbie (was: newbie) yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: BB2015-05-TBR, newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3360) Add JMX metrics to TimelineDataManager
[ https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3360: --- Labels: BB2015-05-TBR (was: ) Add JMX metrics to TimelineDataManager -- Key: YARN-3360 URL: https://issues.apache.org/jira/browse/YARN-3360 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Labels: BB2015-05-TBR Attachments: YARN-3360.001.patch The TimelineDataManager currently has no metrics, outside of the standard JVM metrics. It would be very useful to at least log basic counts of method calls, time spent in those calls, and number of entities/events involved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
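With Hadoop's metrics2 library, the counts and timings the issue asks for are mostly annotations plus a registration call; a sketch with illustrative metric names (not the eventual patch):
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;
import org.apache.hadoop.metrics2.lib.MutableRate;

@Metrics(about = "Metrics for TimelineDataManager", context = "yarn")
public class TimelineDataManagerMetricsSketch {
  @Metric("getEntities calls") MutableCounterLong getEntitiesOps;
  @Metric("getEntities call time") MutableRate getEntitiesTime;

  static TimelineDataManagerMetricsSketch register() {
    // Registering the annotated source exposes the metrics over JMX.
    return DefaultMetricsSystem.instance().register(
        "TimelineDataManagerMetrics", "Metrics for TimelineDataManager",
        new TimelineDataManagerMetricsSketch());
  }
}
{code}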
[jira] [Updated] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1462: --- Labels: BB2015-05-TBR (was: ) AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Labels: BB2015-05-TBR Attachments: YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3535: --- Labels: BB2015-05-TBR (was: ) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Labels: BB2015-05-TBR Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-20) More information for yarn.resourcemanager.webapp.address in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-20: - Labels: BB2015-05-TBR (was: ) More information for yarn.resourcemanager.webapp.address in yarn-default.xml -- Key: YARN-20 URL: https://issues.apache.org/jira/browse/YARN-20 Project: Hadoop YARN Issue Type: Improvement Components: documentation, resourcemanager Affects Versions: 2.0.0-alpha Reporter: Nemon Lou Priority: Trivial Labels: BB2015-05-TBR Attachments: YARN-20.1.patch, YARN-20.patch Original Estimate: 1h Remaining Estimate: 1h The parameter yarn.resourcemanager.webapp.address in yarn-default.xml is in host:port format, which is noted in the cluster setup guide (http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html). When I read through the code, I found that the host-only format is also supported. In the host-only format, the port will be random. So we could add more documentation in yarn-default.xml to make this easier to understand. I will submit a patch if it's helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
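In other words, both forms below are accepted today; only the first pins the web UI port (a sketch; the example host is made up):
{code}
import org.apache.hadoop.conf.Configuration;

public class WebappAddressSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // host:port form: the web UI binds to the given port.
    conf.set("yarn.resourcemanager.webapp.address", "rm.example.com:8088");
    // Host-only form is also accepted; the port is then chosen at random:
    // conf.set("yarn.resourcemanager.webapp.address", "rm.example.com");
  }
}
{code}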
[jira] [Updated] (YARN-1515) Provide ContainerManagementProtocol#signalContainer processing a batch of signals
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1515: --- Labels: BB2015-05-TBR (was: ) Provide ContainerManagementProtocol#signalContainer processing a batch of signals -- Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: Sub-task Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Labels: BB2015-05-TBR Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch, YARN-1515.v08.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2325) need check whether node is null in nodeUpdate for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2325: --- Labels: BB2015-05-TBR (was: ) need check whether node is null in nodeUpdate for FairScheduler Key: YARN-2325 URL: https://issues.apache.org/jira/browse/YARN-2325 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Labels: BB2015-05-TBR Attachments: YARN-2325.000.patch We need to check whether the node is null in nodeUpdate for FairScheduler. If nodeUpdate is called after removeNode, getFSSchedulerNode will return null. If the node is null, we should return with an error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
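The guard the report describes is small; a sketch with stand-in types that mirrors the description rather than the attached patch:
{code}
public class NodeUpdateGuardSketch {
  static Object getFSSchedulerNode(String nodeId) { return null; } // stand-in lookup

  static void nodeUpdate(String nodeId) {
    Object node = getFSSchedulerNode(nodeId);
    if (node == null) {
      // Node was already removed (nodeUpdate raced with removeNode): log and bail out.
      System.err.println("Node " + nodeId + " not found; ignoring update");
      return;
    }
    // ... normal update path ...
  }
}
{code}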
[jira] [Updated] (YARN-2151) FairScheduler option for global preemption within hierarchical queues
[ https://issues.apache.org/jira/browse/YARN-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2151: --- Labels: BB2015-05-TBR (was: ) FairScheduler option for global preemption within hierarchical queues - Key: YARN-2151 URL: https://issues.apache.org/jira/browse/YARN-2151 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Andrey Stepachev Labels: BB2015-05-TBR Attachments: YARN-2151.patch FairScheduler has hierarchical queues, but fair share calculation and preemption still work within a limited range and are effectively still nonhierarchical. This patch addresses this incompleteness in two aspects: 1. Currently MinShare is not propagated to the parent queue, which leads the fair share calculation to ignore all min shares in deeper queues. Let's take an example (implemented as test case TestFairScheduler#testMinShareInHierarchicalQueues):
{code}
<?xml version="1.0"?>
<allocations>
  <queue name="queue1">
    <maxResources>10240mb, 10vcores</maxResources>
    <queue name="big"/>
    <queue name="sub1">
      <schedulingPolicy>fair</schedulingPolicy>
      <queue name="sub11">
        <minResources>6192mb, 6vcores</minResources>
      </queue>
    </queue>
    <queue name="sub2">
    </queue>
  </queue>
</allocations>
{code}
Then bigApp is started within queue1.big with 10x1GB containers. That effectively eats all of the maximum allowed resources for queue1. Subsequent requests for app1 (queue1.sub1.sub11) and app2 (queue1.sub2) (5x1GB each) will wait for free resources. Note that sub11 has a min share requirement of 6x1GB. Without the patch, fair share is calculated with no knowledge of min share requirements, and app1 and app2 get an equal number of containers. With the patch, resources are split according to min share (in the test it is 5 for app1 and 1 for app2). That behaviour is controlled by the same ‘globalPreemption’ parameter, but that can be changed easily. The implementation is a bit awkward, but it seems the method for min share recalculation could be exposed as a public or protected API and the constructor in FSQueue could call it before using the minShare getter. But right now the current implementation with nulls should work too. 2. Preemption doesn't work between queues on different levels of the queue hierarchy. Moreover, it is not possible to override various parameters for child queues. This patch adds a ‘globalPreemption’ parameter, which enables the global preemption algorithm modifications. In a nutshell, the patch adds a function shouldAttemptPreemption(queue), which can calculate usage for nested queues; if a queue with usage above the specified threshold is found, preemption can be triggered. Aggregated minShare does the rest of the work, and preemption works as expected within a hierarchy of queues with different MinShare/MaxShare specifications on different levels. Test case TestFairScheduler#testGlobalPreemption depicts how it works. One big app gets resources above its fair share, and app1 has a declared min share. On submission, the code detects that starvation and preempts enough containers to give enough room for app1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1782) CLI should let users to query cluster metrics
[ https://issues.apache.org/jira/browse/YARN-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1782: --- Labels: BB2015-05-TBR (was: ) CLI should let users to query cluster metrics - Key: YARN-1782 URL: https://issues.apache.org/jira/browse/YARN-1782 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Zhijie Shen Assignee: Kenji Kikushima Labels: BB2015-05-TBR Attachments: YARN-1782.patch Like RM webUI and RESTful services, YARN CLI should also enable users to query the cluster metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-126) yarn rmadmin help message contains reference to hadoop cli and JT
[ https://issues.apache.org/jira/browse/YARN-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-126: -- Labels: BB2015-05-TBR usability (was: usability) yarn rmadmin help message contains reference to hadoop cli and JT - Key: YARN-126 URL: https://issues.apache.org/jira/browse/YARN-126 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.0.3-alpha Reporter: Thomas Graves Assignee: Rémy SAISSY Labels: BB2015-05-TBR, usability Attachments: YARN-126.patch The help message has an option to specify a job tracker, and its last line for general command line syntax reads bin/hadoop command [genericOptions] [commandOptions]. Ran yarn rmadmin to get the usage:
{noformat}
RMAdmin Usage: java RMAdmin
   [-refreshQueues]
   [-refreshNodes]
   [-refreshUserToGroupsMappings]
   [-refreshSuperUserGroupsConfiguration]
   [-refreshAdminAcls]
   [-refreshServiceAcl]
   [-help [cmd]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3395) [Fair Scheduler] Handle the user name correctly when user name is used as default queue name.
[ https://issues.apache.org/jira/browse/YARN-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3395: --- Labels: BB2015-05-TBR (was: ) [Fair Scheduler] Handle the user name correctly when user name is used as default queue name. - Key: YARN-3395 URL: https://issues.apache.org/jira/browse/YARN-3395 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: zhihai xu Assignee: zhihai xu Labels: BB2015-05-TBR Attachments: YARN-3395.000.patch Handle the user name correctly when user name is used as default queue name in fair scheduler. It will be better to remove the trailing and leading whitespace of the user name when we use user name as default queue name, otherwise it will be rejected by InvalidQueueNameException from QueueManager. I think it is reasonable to make this change, because we already did special handling for '.' in user name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
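The proposed handling is essentially a trim before the name reaches QueueManager; a minimal sketch (the helper name is made up, and the existing '.' special-casing lives elsewhere):
{code}
public class DefaultQueueNameSketch {
  /** Derive a fair-scheduler default queue name from a user name. */
  static String defaultQueueName(String user) {
    // Trim leading/trailing whitespace so QueueManager does not reject the
    // derived queue name with InvalidQueueNameException.
    return user.trim();
  }
}
{code}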
[jira] [Updated] (YARN-1287) Consolidate MockClocks
[ https://issues.apache.org/jira/browse/YARN-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1287: --- Labels: BB2015-05-TBR newbie (was: newbie) Consolidate MockClocks -- Key: YARN-1287 URL: https://issues.apache.org/jira/browse/YARN-1287 Project: Hadoop YARN Issue Type: Improvement Reporter: Sandy Ryza Assignee: Sebastian Wong Labels: BB2015-05-TBR, newbie Attachments: YARN-1287-3.patch A bunch of different tests have near-identical implementations of MockClock. TestFairScheduler, TestFSSchedulerApp, and TestCgroupsLCEResourcesHandler for example. They should be consolidated into a single MockClock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-641) Make AMLauncher in RM Use NMClient
[ https://issues.apache.org/jira/browse/YARN-641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-641: -- Labels: BB2015-05-TBR (was: ) Make AMLauncher in RM Use NMClient -- Key: YARN-641 URL: https://issues.apache.org/jira/browse/YARN-641 Project: Hadoop YARN Issue Type: Improvement Reporter: Zhijie Shen Assignee: Zhijie Shen Labels: BB2015-05-TBR Attachments: YARN-641.1.patch, YARN-641.2.patch, YARN-641.3.patch YARN-422 adds NMClient. RM's AMLauncher is responsible for the interactions with an application's AM container. AMLauncher should also replace the raw ContainerManager proxy with NMClient. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1391) Lost node list should be identify by NodeId
[ https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1391: --- Labels: BB2015-05-TBR (was: ) Lost node list should be identify by NodeId --- Key: YARN-1391 URL: https://issues.apache.org/jira/browse/YARN-1391 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha Reporter: Siqi Li Assignee: Siqi Li Labels: BB2015-05-TBR Attachments: YARN-1391.v1.patch, YARN-1391.v2.patch In the case of multiple node managers on a single machine, each of them should be identified by NodeId, which is more unique than just the host name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3557) Support Intel Trusted Execution Technology(TXT) in YARN scheduler
[ https://issues.apache.org/jira/browse/YARN-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529870#comment-14529870 ] Dian Fu commented on YARN-3557: --- Hi [~Naganarasimha], {quote}constraints should be supported to be added from both RM and NM and partitions (existing labels) should be allowed either by RM or NM{quote} Agree with you that constraints should be supported to be added from both the RM and the NM. TRUSTED/UNTRUSTED are also more like constraints of a node. BTW, it seems that a JIRA for constraints support has already been created (YARN-3409). Support Intel Trusted Execution Technology(TXT) in YARN scheduler - Key: YARN-3557 URL: https://issues.apache.org/jira/browse/YARN-3557 Project: Hadoop YARN Issue Type: New Feature Reporter: Dian Fu Attachments: Support TXT in YARN high level design doc.pdf Intel TXT defines platform-level enhancements that provide the building blocks for creating trusted platforms. A TXT aware YARN scheduler can schedule security sensitive jobs on TXT enabled nodes only. YARN-2492 provides the capacity to restrict YARN applications to run only on cluster nodes that have a specified node label. This is a good mechanism that can be utilized for a TXT aware YARN scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1813) Better error message for yarn logs when permission denied
[ https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1813: -- Target Version/s: 2.8.0 (was: 2.6.0) Better error message for yarn logs when permission denied --- Key: YARN-1813 URL: https://issues.apache.org/jira/browse/YARN-1813 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0, 2.4.1, 2.5.1 Reporter: Andrew Wang Assignee: Tsuyoshi Ozawa Priority: Minor Attachments: YARN-1813.1.patch, YARN-1813.2.patch, YARN-1813.2.patch, YARN-1813.3.patch, YARN-1813.4.patch, YARN-1813.5.patch, YARN-1813.6.patch I ran some MR jobs as the hdfs user, and then forgot to sudo -u when grabbing the logs. yarn logs prints an error message like the following: {noformat} [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at a2402.halxg.cloudera.com/10.20.212.10:8032 Logs not available at /tmp/logs/andrew.wang/logs/application_1394482121761_0010 Log aggregation has not completed or is not enabled. {noformat} It'd be nicer if it said Permission denied or AccessControlException or something like that instead, since that's the real issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1813) Better error message for yarn logs when permission denied
[ https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529507#comment-14529507 ] Jian He commented on YARN-1813: --- [~ozawa], the patch unfortunately doesn't apply any more. Mind updating it, please? I'll review and get this in. Better error message for yarn logs when permission denied --- Key: YARN-1813 URL: https://issues.apache.org/jira/browse/YARN-1813 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0, 2.4.1, 2.5.1 Reporter: Andrew Wang Assignee: Tsuyoshi Ozawa Priority: Minor Attachments: YARN-1813.1.patch, YARN-1813.2.patch, YARN-1813.2.patch, YARN-1813.3.patch, YARN-1813.4.patch, YARN-1813.5.patch, YARN-1813.6.patch I ran some MR jobs as the hdfs user, and then forgot to sudo -u when grabbing the logs. yarn logs prints an error message like the following: {noformat} [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at a2402.halxg.cloudera.com/10.20.212.10:8032 Logs not available at /tmp/logs/andrew.wang/logs/application_1394482121761_0010 Log aggregation has not completed or is not enabled. {noformat} It'd be nicer if it said Permission denied or AccessControlException or something like that instead, since that's the real issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3581) Deprecate -directlyAccessNodeLabelStore in RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3581: - Summary: Deprecate -directlyAccessNodeLabelStore in RMAdminCLI (was: Drop -directlyAccessNodeLabelStore in RMAdminCLI) Deprecate -directlyAccessNodeLabelStore in RMAdminCLI - Key: YARN-3581 URL: https://issues.apache.org/jira/browse/YARN-3581 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan In 2.6.0, we added an option called -directlyAccessNodeLabelStore to make RM can start with label-configured queue settings. After YARN-2918, we don't need this option any more, admin can configure queue setting, start RM and configure node label via RMAdminCLI without any error. In addition, this option is very restrictive, first it needs to run on the same node where RM is running if admin configured to store labels in local disk. Second, when admin run the option when RM is running, multiple process write to a same file can happen, this could make node label store becomes invalid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3127) Apphistory url crashes when RM switches with ATS enabled
[ https://issues.apache.org/jira/browse/YARN-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529606#comment-14529606 ] Naganarasimha G R commented on YARN-3127: - Thanks for reviewing, [~gtCarrera9]. The main cause of the issue mentioned here is already addressed in another JIRA by [~xgong], but when we test in this way we still get to see null in the web UI. More importantly, addressing this JIRA is still required because events are published for every app (start and finished) on RM failover. So if 1 apps are maintained, then that many additional unrequired events get triggered; this we need to address. And for the issue pointed out by [~xgong], I had asked for suggestions on the approach being taken and hence am waiting for that. AFAIK we need to ensure ATS events are sent first and then store the final application state to the RM state store in the FINAL_SAVING transition (and also handle other possible cases where an app is created and killed before an attempt is created, in which case FINAL_SAVING is not called). If this approach is fine, then I will update the patch and the description. Apphistory url crashes when RM switches with ATS enabled Key: YARN-3127 URL: https://issues.apache.org/jira/browse/YARN-3127 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: RM HA with ATS Reporter: Bibin A Chundatt Assignee: Naganarasimha G R Attachments: YARN-3127.20150213-1.patch, YARN-3127.20150329-1.patch 1. Start RM with HA and ATS configured and run some YARN applications 2. Once the applications have finished successfully, start the timeline server 3. Now fail over HA from active to standby 4. Access the timeline server URL IP:PORT/applicationhistory Result: The application history URL fails with the below info {quote} 2015-02-03 20:28:09,511 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the applications. java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1643) at org.apache.hadoop.yarn.server.webapp.AppsBlock.render(AppsBlock.java:80) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) ... Caused by: org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: The entity for application attempt appattempt_1422972608379_0001_01 doesn't exist in the timeline store at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getApplicationAttempt(ApplicationHistoryManagerOnTimelineStore.java:151) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.generateApplicationReport(ApplicationHistoryManagerOnTimelineStore.java:499) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAllApplications(ApplicationHistoryManagerOnTimelineStore.java:108) at org.apache.hadoop.yarn.server.webapp.AppsBlock$1.run(AppsBlock.java:84) at org.apache.hadoop.yarn.server.webapp.AppsBlock$1.run(AppsBlock.java:81) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) ...
51 more 2015-02-03 20:28:09,512 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /applicationhistory org.apache.hadoop.yarn.webapp.WebAppException: Error rendering block: nestLevel=6 expected 5 at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) {quote} Behaviour of AHS with the file-based history store: - The application history URL works - No attempt entries are shown for each application. Based on initial analysis, when the RM switches, application attempts from the state store are not replayed, only applications are. So when the /applicationhistory URL is accessed, it tries all attempt ids and fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3134: Attachment: YARN-3134-YARN-2928.004.patch I've addressed all comments from [~zjshen] except for two points: I think we need to discuss possible cache settings for performance tuning further, and I leave the work of accurately verifying the content in TestPhoenixTimelineWriterImpl as future work. For now my main focus is on pushing the current version forward into a benchmark-ready state. [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134-YARN-2928.001.patch, YARN-3134-YARN-2928.002.patch, YARN-3134-YARN-2928.003.patch, YARN-3134-YARN-2928.004.patch, YARN-3134DataSchema.pdf Quote the introduction on the Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simplify our implementation of reading/writing data from/to HBase, and make it easy to build indexes and compose complex queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529603#comment-14529603 ] Vinod Kumar Vavilapalli commented on YARN-1680: --- When I said node-labels above, I meant partitions. Clearly the problem and the corresponding solution will likely be very similar for node-constraints (one type of node-labels). After all, blacklisting is a type of (anti) node-constraint. /cc [~leftnoteasy] availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Craig Welch Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1. A running job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 map tasks got killed), so MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running in the cluster. MRAppMaster does not preempt the reducers because the headroom used for the reducer preemption calculation includes blacklisted nodes' memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes but returns an availableResources value that reflects the whole cluster's free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529658#comment-14529658 ] Hadoop QA commented on YARN-3301: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 42s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 44s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 44s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 9s | The applied patch generated 2 new checkstyle issues (total was 58, now 43). | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 8s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-server-common. | | {color:red}-1{color} | yarn tests | 54m 42s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 93m 9s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestRMRestart | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730644/YARN-3301.4.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 4da8490 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7716/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7716/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7716/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7716/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7716/console | This message was automatically generated. Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: Screen Shot 2015-04-21 at 5.09.25 PM.png, Screen Shot 2015-04-21 at 5.38.39 PM.png, YARN-3301.1.patch, YARN-3301.2.patch, YARN-3301.3.patch, YARN-3301.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3513) Remove unused variables in ContainersMonitorImpl
[ https://issues.apache.org/jira/browse/YARN-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529759#comment-14529759 ] Hadoop QA commented on YARN-3513: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 0s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 38s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 52s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 38s | The applied patch generated 1 new checkstyle issues (total was 27, now 27). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 3s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 6m 2s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 42m 46s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730656/YARN-3513.20150506-1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 90b3845 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7718/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7718/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7718/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7718/console | This message was automatically generated. Remove unused variables in ContainersMonitorImpl Key: YARN-3513 URL: https://issues.apache.org/jira/browse/YARN-3513 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Priority: Trivial Labels: newbie Attachments: YARN-3513.20150421-1.patch, YARN-3513.20150503-1.patch, YARN-3513.20150506-1.patch The class member {{private final Context context;}} and some local variables in MonitoringThread.run() ({{vmemStillInUsage}} and {{pmemStillInUsage}}) are never read; they are only updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3169) drop the useless yarn overview document
[ https://issues.apache.org/jira/browse/YARN-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3169: --- Labels: BB2015-05-TBR (was: ) drop the useless yarn overview document --- Key: YARN-3169 URL: https://issues.apache.org/jira/browse/YARN-3169 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Labels: BB2015-05-TBR Attachments: YARN-3169-002.patch, YARN-3169.patch It's pretty superfluous given there is a site index on the left. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2306: --- Labels: BB2015-05-TBR (was: ) leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Labels: BB2015-05-TBR Attachments: YARN-2306-2.patch, YARN-2306.patch This only applies to the fair scheduler; the capacity scheduler is OK. When an appAttempt or node is removed, the reservation metrics (reservedContainers, reservedMB, reservedVCores) are not reduced back. These are important metrics for administrators, and the wrong values may confuse them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
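A minimal sketch of the cleanup shape this describes: rolling outstanding reservations back out of the queue metrics when the attempt or node goes away. The hook and its call site are illustrative assumptions, not the attached patches.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;

// Hypothetical cleanup hook, not the YARN-2306 patch.
public class ReservationMetricsSketch {
  /**
   * On removal of an app attempt (or node) that still holds reservations,
   * unreserve each one so reservedContainers/reservedMB/reservedVCores do
   * not leak. Assumes the caller tracks each reservation's resource.
   */
  static void releaseReservations(QueueMetrics metrics, String user,
      Iterable<Resource> outstandingReservations) {
    for (Resource reserved : outstandingReservations) {
      metrics.unreserveResource(user, reserved);
    }
  }
}
{code}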
[jira] [Updated] (YARN-1126) Add validation of users input nodes-states options to nodes CLI
[ https://issues.apache.org/jira/browse/YARN-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1126: --- Labels: BB2015-05-TBR (was: ) Add validation of users input nodes-states options to nodes CLI --- Key: YARN-1126 URL: https://issues.apache.org/jira/browse/YARN-1126 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Labels: BB2015-05-TBR Attachments: YARN-905-addendum.patch Following the discussion in YARN-905: (1) case-insensitive checks for all states; (2) validation of user input, exiting with a non-zero code and printing all valid states when the user gives an invalid state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
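A minimal sketch of the case-insensitive parsing and fail-fast behavior described above, written against the real org.apache.hadoop.yarn.api.records.NodeState enum; the helper itself is an illustrative assumption, not the attached addendum patch.
{code}
import java.util.Arrays;
import org.apache.hadoop.yarn.api.records.NodeState;

// Hypothetical CLI helper, not the YARN-1126 patch.
public class NodeStateArgSketch {
  /** Parses a node-state argument case-insensitively, failing fast with help text. */
  static NodeState parseState(String arg) {
    try {
      return NodeState.valueOf(arg.trim().toUpperCase());
    } catch (IllegalArgumentException e) {
      System.err.println("Invalid node state '" + arg + "'. Valid states: "
          + Arrays.toString(NodeState.values()));
      System.exit(1);
      return null; // unreachable; exit(1) above terminates the CLI
    }
  }
}
{code}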
[jira] [Updated] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users
[ https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2423: --- Labels: BB2015-05-TBR (was: ) TimelineClient should wrap all GET APIs to facilitate Java users Key: YARN-2423 URL: https://issues.apache.org/jira/browse/YARN-2423 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Robert Kanter Labels: BB2015-05-TBR Attachments: YARN-2423.004.patch, YARN-2423.005.patch, YARN-2423.006.patch, YARN-2423.007.patch, YARN-2423.patch, YARN-2423.patch, YARN-2423.patch TimelineClient provides the Java methods to put timeline entities. It would also be good to wrap all the GET APIs (both entity and domain) and deserialize the JSON responses into Java POJO objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
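A minimal sketch of what such a GET wrapper could look like with the Jersey 1.x client; the class, the base-URL handling, and the assumption that a JSON provider is registered on the client are all illustrative, not the actual patch series (which extends TimelineClient itself).
{code}
import com.sun.jersey.api.client.Client;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;

// Hypothetical wrapper, not the YARN-2423 patches.
public class TimelineGetSketch {
  private final Client client = Client.create(); // assumes a JSON provider is configured
  private final String base; // e.g. "http://ats-host:8188/ws/v1/timeline" (assumption)

  TimelineGetSketch(String base) {
    this.base = base;
  }

  /** GETs a single entity and lets the client bind the JSON response to the POJO. */
  TimelineEntity getEntity(String entityType, String entityId) {
    return client.resource(base + "/" + entityType + "/" + entityId)
        .accept("application/json")
        .get(TimelineEntity.class);
  }
}
{code}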
[jira] [Updated] (YARN-535) TestUnmanagedAMLauncher can corrupt target/test-classes/yarn-site.xml during write phase, breaks later test runs
[ https://issues.apache.org/jira/browse/YARN-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-535: -- Labels: BB2015-05-TBR (was: ) TestUnmanagedAMLauncher can corrupt target/test-classes/yarn-site.xml during write phase, breaks later test runs Key: YARN-535 URL: https://issues.apache.org/jira/browse/YARN-535 Project: Hadoop YARN Issue Type: Bug Components: applications Affects Versions: 2.6.0 Environment: OS/X laptop, HFS+ filesystem Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Labels: BB2015-05-TBR Attachments: YARN-535-02.patch, YARN-535.patch The setup phase of {{TestUnmanagedAMLauncher}} overwrites {{yarn-site.xml}}. As {{Configuration.writeXml()}} re-reads all resources, this will break if the (open-for-writing) resource is already visible as an empty file. This leaves a corrupted {{target/test-classes/yarn-site.xml}}, which breaks later test runs because it is not overwritten by later incremental builds, due to timestamps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
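A minimal sketch of one way to avoid the re-read-while-writing hazard: serialize to a temp file, then atomically rename it over the target, so {{Configuration.writeXml()}} never sees a half-written resource. This is an illustrative mitigation, not the attached YARN-535 patches.
{code}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;

// Hypothetical mitigation sketch, not the YARN-535 patch.
public class SafeConfWriteSketch {
  static void writeConfSafely(Configuration conf, File target) throws IOException {
    File tmp = File.createTempFile("yarn-site", ".xml", target.getParentFile());
    try (OutputStream out = new FileOutputStream(tmp)) {
      // writeXml() may re-read all resources, but the real target stays
      // untouched until the write has fully completed.
      conf.writeXml(out);
    }
    if (!tmp.renameTo(target)) { // rename is atomic on POSIX filesystems
      throw new IOException("Could not replace " + target + " with " + tmp);
    }
  }
}
{code}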
[jira] [Updated] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2003: --- Labels: BB2015-05-TBR (was: ) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side] -- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Labels: BB2015-05-TBR Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch, 0009-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from Submission Context and store. Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3405: --- Labels: BB2015-05-TBR (was: ) FairScheduler's preemption cannot happen between sibling in some case - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Labels: BB2015-05-TBR Attachments: YARN-3405.01.patch, YARN-3405.02.patch Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume cluster resource is 100. # queue-1-1 and queue-2 each have an app; each gets 50 usage and 50 fairshare. # When queue-1-2 becomes active, it causes a new preemption request for its fairshare of 25. # When preempting from root, the chosen preemption candidate may be queue-2. If so, preemptContainerPreCheck for queue-2 returns false because its usage equals its fairshare. # Finally, queue-1-2 ends up waiting for a resource release from queue-1-1 itself. What I expect here is that queue-1-2 preempts from queue-1-1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
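A minimal sketch of the selection idea the report points toward: when walking down from the root, descend into the child that is most over its fair share, so preemption lands in the over-allocated subtree (queue-1-1 in the example) rather than on a sibling sitting at its fair share. The queue model interface is an illustrative assumption, not the FairScheduler API.
{code}
import java.util.Comparator;
import java.util.List;

// Hypothetical queue model, not the FairScheduler API.
interface Q {
  List<Q> children(); // empty for leaf queues
  long usage();
  long fairShare();
}

public class PreemptCandidateSketch {
  /** Walks from the root toward the leaf that is most over its fair share. */
  static Q findCandidateLeaf(Q root) {
    Q q = root;
    while (!q.children().isEmpty()) {
      q = q.children().stream()
          .max(Comparator.comparingLong(c -> c.usage() - c.fairShare()))
          .get(); // safe: children() is non-empty here
    }
    return q;
  }
}
{code}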
[jira] [Updated] (YARN-2444) Primary filters added after first submission not indexed, cause exceptions in logs.
[ https://issues.apache.org/jira/browse/YARN-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2444: --- Labels: BB2015-05-TBR (was: ) Primary filters added after first submission not indexed, cause exceptions in logs. --- Key: YARN-2444 URL: https://issues.apache.org/jira/browse/YARN-2444 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.5.0 Reporter: Marcelo Vanzin Assignee: Steve Loughran Labels: BB2015-05-TBR Attachments: YARN-2444-001.patch, ats.java, org.apache.hadoop.yarn.server.timeline.TestTimelineClientPut-output.txt See attached code for an example. The code creates an entity with a primary filter, submits it to the ATS. After that, a new primary filter value is added and the entity is resubmitted. At that point two things can be seen: - Searching for the new primary filter value does not return the entity - The following exception shows up in the logs: {noformat} 14/08/22 11:33:42 ERROR webapp.TimelineWebServices: Error when verifying access for user dr.who (auth:SIMPLE) on the events of the timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test } org.apache.hadoop.yarn.exceptions.YarnException: Owner information of the timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test } is corrupted. at org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:67) at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getEntities(TimelineWebServices.java:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2268: --- Labels: BB2015-05-TBR (was: ) Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith Labels: BB2015-05-TBR Attachments: 0001-YARN-2268.patch YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
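A minimal sketch of the guard being proposed: before formatting, check for a marker that an RM holds the store, modeled here as an ephemeral lock znode. The path and the ephemeral-lock convention are illustrative assumptions, not the attached patch.
{code}
import org.apache.zookeeper.ZooKeeper;

// Hypothetical format guard, not the YARN-2268 patch.
public class FormatGuardSketch {
  /**
   * Refuses to format if an active-RM marker znode exists. Assumes the RM
   * keeps an ephemeral znode (lockPath) alive while it uses the store.
   */
  static void failIfRMActive(ZooKeeper zk, String lockPath) throws Exception {
    if (zk.exists(lockPath, false) != null) {
      throw new IllegalStateException("An RM appears to be using this state "
          + "store (found " + lockPath + "); refusing to format");
    }
  }
}
{code}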
[jira] [Updated] (YARN-3541) Add version info on timeline service / generic history web UI and REST API
[ https://issues.apache.org/jira/browse/YARN-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3541: --- Labels: BB2015-05-TBR (was: ) Add version info on timeline service / generic history web UI and REST API - Key: YARN-3541 URL: https://issues.apache.org/jira/browse/YARN-3541 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Labels: BB2015-05-TBR Attachments: YARN-3541.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2556: --- Labels: BB2015-05-TBR (was: ) Tool to measure the performance of the timeline server -- Key: YARN-2556 URL: https://issues.apache.org/jira/browse/YARN-2556 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Chang Li Labels: BB2015-05-TBR Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, YARN-2556.1.patch, YARN-2556.2.patch, YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch We need to be able to understand the capacity model for the timeline server to give users the tools they need to deploy a timeline server with the correct capacity. I propose we create a mapreduce job that can measure timeline server write and read performance. Transactions per second, I/O for both read and write would be a good start. This could be done as an example or test job that could be tied into gridmix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3170: --- Labels: BB2015-05-TBR (was: ) YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Labels: BB2015-05-TBR Attachments: YARN-3170.patch The marketing paragraph at the top, NextGen MapReduce, etc. are all marketing rather than actual descriptions. It also needs some general updates, especially given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3458) CPU resource monitoring in Windows
[ https://issues.apache.org/jira/browse/YARN-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3458: --- Labels: BB2015-05-TBR containers metrics windows (was: containers metrics windows) CPU resource monitoring in Windows -- Key: YARN-3458 URL: https://issues.apache.org/jira/browse/YARN-3458 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Affects Versions: 2.7.0 Environment: Windows Reporter: Inigo Goiri Assignee: Inigo Goiri Priority: Minor Labels: BB2015-05-TBR, containers, metrics, windows Attachments: YARN-3458-1.patch, YARN-3458-2.patch, YARN-3458-3.patch, YARN-3458-4.patch, YARN-3458-5.patch, YARN-3458-6.patch Original Estimate: 168h Remaining Estimate: 168h The current implementation of getCpuUsagePercent() for WindowsBasedProcessTree is left as unavailable. Attached is a proposal for how to do it. I reused the CpuTimeTracker, using 1 jiffy = 1ms. This was left open by YARN-3122. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
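A minimal sketch of the jiffy-based CPU percentage arithmetic this relies on: with 1 jiffy = 1 ms, the delta in cumulative CPU time over the wall-clock delta gives the usage. The class below is an illustrative stand-in for what CpuTimeTracker computes, not the Windows patch itself.
{code}
// Hypothetical tracker sketch, not the YARN-3458 patch.
public class CpuPercentSketch {
  private long lastCpuMs = -1;
  private long lastSampleMs = -1;

  /** Returns CPU usage in percent (can exceed 100 on multi-core hosts). */
  float cpuUsagePercent(long cumulativeCpuMs, long nowMs) {
    float percent = -1f; // "unavailable" until we have two samples
    if (lastCpuMs >= 0 && nowMs > lastSampleMs) {
      percent = 100f * (cumulativeCpuMs - lastCpuMs) / (nowMs - lastSampleMs);
    }
    lastCpuMs = cumulativeCpuMs;
    lastSampleMs = nowMs;
    return percent;
  }
}
{code}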
[jira] [Updated] (YARN-644) Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer
[ https://issues.apache.org/jira/browse/YARN-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-644: -- Labels: BB2015-05-TBR (was: ) Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer - Key: YARN-644 URL: https://issues.apache.org/jira/browse/YARN-644 Project: Hadoop YARN Issue Type: Sub-task Reporter: Omkar Vinit Joshi Assignee: Varun Saxena Priority: Minor Labels: BB2015-05-TBR Attachments: YARN-644.001.patch, YARN-644.002.patch I see that validation/null checks are not performed on passed-in parameters, e.g. tokenId.getContainerID().getApplicationAttemptId() inside ContainerManagerImpl.authorizeRequest(). I guess we should add these checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
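A minimal sketch of the guard-clause style such checks could take; the helper and the commented usage are illustrative assumptions, not the attached patches.
{code}
import org.apache.hadoop.yarn.exceptions.YarnException;

// Hypothetical guard helper, not the YARN-644 patch.
public class NullCheckSketch {
  /** Fails the request with a clear message instead of a downstream NPE. */
  static <T> T checkNotNull(T value, String what) throws YarnException {
    if (value == null) {
      throw new YarnException("Invalid request: " + what + " is null");
    }
    return value;
  }

  // Usage against the chained call quoted above (names from the report):
  //   ContainerId cid = checkNotNull(tokenId.getContainerID(), "container ID");
  //   checkNotNull(cid.getApplicationAttemptId(), "application attempt ID");
}
{code}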
[jira] [Updated] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-41: - Labels: BB2015-05-TBR (was: ) The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Ravi Teja Ch N V Assignee: Devaraj K Labels: BB2015-05-TBR Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41-4.patch, YARN-41.patch Instead of waiting for the NM expiry, the RM should remove and handle an NM that has been shut down gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2571) RM to support YARN registry
[ https://issues.apache.org/jira/browse/YARN-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2571: --- Labels: BB2015-05-TBR (was: ) RM to support YARN registry Key: YARN-2571 URL: https://issues.apache.org/jira/browse/YARN-2571 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Labels: BB2015-05-TBR Attachments: YARN-2571-001.patch, YARN-2571-002.patch, YARN-2571-003.patch, YARN-2571-005.patch, YARN-2571-007.patch, YARN-2571-008.patch, YARN-2571-009.patch, YARN-2571-010.patch The RM needs to (optionally) integrate with the YARN registry: # startup: create the /services and /users paths with system ACLs (yarn, hdfs principals) # app-launch: create the user directory /users/$username with the relevant permissions (CRD) for them to create subnodes. # attempt, container, app completion: remove service records with the matching persistence and ID -- This message was sent by Atlassian JIRA (v6.3.4#6332)
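A minimal sketch of the startup step only (item 1 above: creating the root registry paths), using Apache Curator; the connection string, path names, and the open ACL are illustrative assumptions, and the real integration would set system ACLs for the yarn/hdfs principals instead.
{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.ZooDefs;

// Hypothetical bootstrap sketch, not the YARN-2571 patches.
public class RegistryBootstrapSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework curator = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    curator.start();
    // Create the root registry paths idempotently at RM startup.
    for (String path : new String[] {"/registry/services", "/registry/users"}) {
      if (curator.checkExists().forPath(path) == null) {
        curator.create().creatingParentsIfNeeded()
            .withACL(ZooDefs.Ids.OPEN_ACL_UNSAFE) // real code uses system ACLs
            .forPath(path);
      }
    }
    curator.close();
  }
}
{code}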
[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529842#comment-14529842 ] Hadoop QA commented on YARN-3385: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 56s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 37s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 41s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 46s | The applied patch generated 1 new checkstyle issues (total was 42, now 43). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 15s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 62m 58s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 99m 52s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730672/YARN-3385.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 90b3845 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7721/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7721/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7721/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7721/console | This message was automatically generated. Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Labels: BB2015-05-TBR Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch, YARN-3385.003.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023; since it exists for ZK node creation, it should also exist for ZK node deletion. We see this issue with the following stack trace: {code} 2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647) at
[jira] [Updated] (YARN-1426) YARN Components need to unregister their beans upon shutdown
[ https://issues.apache.org/jira/browse/YARN-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1426: --- Labels: BB2015-05-TBR (was: ) YARN Components need to unregister their beans upon shutdown Key: YARN-1426 URL: https://issues.apache.org/jira/browse/YARN-1426 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.0.0, 2.3.0 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Labels: BB2015-05-TBR Attachments: YARN-1426.patch, YARN-1426.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2421) CapacityScheduler still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529488#comment-14529488 ] Craig Welch commented on YARN-2421: --- Hi [~lichangleo], thanks for working on this fix. Can you resolve the javac warning and run the TestRMContainerImpl test locally with the patch to verify the patch is not the cause? It seems to be persistently failing. CapacityScheduler still allocates containers to an app in the FINISHING state - Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.1 Reporter: Thomas Graves Assignee: Chang Li Attachments: yarn2421.patch, yarn2421.patch, yarn2421.patch I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3581) Drop -directlyAccessNodeLabelStore in RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529552#comment-14529552 ] Allen Wittenauer commented on YARN-3581: Removing a command line option is an incompatible change. Drop -directlyAccessNodeLabelStore in RMAdminCLI Key: YARN-3581 URL: https://issues.apache.org/jira/browse/YARN-3581 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan In 2.6.0, we added an option called -directlyAccessNodeLabelStore so that the RM can start with label-configured queue settings. After YARN-2918, we don't need this option any more: an admin can configure queue settings, start the RM, and configure node labels via RMAdminCLI without any error. In addition, this option is very restrictive. First, it needs to run on the same node where the RM is running if the admin configured labels to be stored on local disk. Second, when an admin runs the option while the RM is running, multiple processes can write to the same file, which could make the node label store invalid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3381) A typographical error in InvalidStateTransitonException
[ https://issues.apache.org/jira/browse/YARN-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529621#comment-14529621 ] Sidharta Seethana commented on YARN-3381: - [~brahmareddy], the latest version of the patch appears to be generated differently from the earlier versions (--no-prefix missing?). Could you please fix it? This would make it easier to compare patch versions. A typographical error in InvalidStateTransitonException - Key: YARN-3381 URL: https://issues.apache.org/jira/browse/YARN-3381 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Xiaoshuang LU Assignee: Brahma Reddy Battula Attachments: YARN-3381-002.patch, YARN-3381-003.patch, YARN-3381.patch It appears that InvalidStateTransitonException should be InvalidStateTransitionException; transition was misspelled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529622#comment-14529622 ] Hadoop QA commented on YARN-3385: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 38s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 35s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 39s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 51s | The applied patch generated 1 new checkstyle issues (total was 42, now 43). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 16s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 52m 28s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 89m 1s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestRMRestart | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730639/YARN-3385.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 9809a16 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7715/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7715/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7715/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7715/console | This message was automatically generated. Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch, YARN-3385.003.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023; since it exists for ZK node creation, it should also exist for ZK node deletion. We see this issue with the following stack trace: {code} 2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691) at
[jira] [Created] (YARN-3582) NPE in WebAppProxyServlet
Jian He created YARN-3582: - Summary: NPE in WebAppProxyServlet Key: YARN-3582 URL: https://issues.apache.org/jira/browse/YARN-3582 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He {code} HTTP ERROR 500 Problem accessing /proxy. Reason: INTERNAL_SERVER_ERROR Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:245) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3529) Add miniHBase cluster and Phoenix support to ATS v2 unit tests
[ https://issues.apache.org/jira/browse/YARN-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529689#comment-14529689 ] Li Lu commented on YARN-3529: - Thanks for the help from [~swagle] and [~vrushalic]! Now I have a patch that works for Phoenix locally, but it cannot be applied on the YARN-2928 branch because it's based on my YARN-3134 patch. I'm currently blocked on that patch, but once it is in, this JIRA will be patch available. Add miniHBase cluster and Phoenix support to ATS v2 unit tests -- Key: YARN-3529 URL: https://issues.apache.org/jira/browse/YARN-3529 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Li Lu Assignee: Li Lu Attachments: AbstractMiniHBaseClusterTest.java, output_minicluster2.txt After we have our HBase and Phoenix writer implementations, we may want to find a way to set up HBase and Phoenix in our unit tests. We need to do this integration before the branch gets merged back to trunk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
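A minimal sketch of the mini-cluster setup direction this JIRA describes, using the public HBaseTestingUtility API; the Phoenix-side wiring is only hinted at in a comment, and the port lookup is an assumption about how a test would build its jdbc:phoenix URL, not the attached AbstractMiniHBaseClusterTest.java.
{code}
import org.apache.hadoop.hbase.HBaseTestingUtility;

// Hypothetical setup sketch, not the attached test harness.
public class MiniClusterSketch {
  public static void main(String[] args) throws Exception {
    HBaseTestingUtility util = new HBaseTestingUtility();
    util.startMiniCluster(); // in-process ZK, HMaster, and one RegionServer
    try {
      int zkPort = util.getZkCluster().getClientPort();
      System.out.println("Mini cluster ZK on port " + zkPort);
      // A test would connect Phoenix here: "jdbc:phoenix:localhost:" + zkPort
    } finally {
      util.shutdownMiniCluster();
    }
  }
}
{code}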