[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068248#comment-14068248 ] Wangda Tan commented on YARN-796: - Allen, I think what we were just discussing is how to support the hard-partition use case in YARN, isn't it? I'm surprised to get a -1 here; nobody has ever said dynamic labeling from the NM will not be supported. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture, etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we also need to support admin operations for adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2247: Attachment: apache-yarn-2247.3.patch {quote} bq. The current implementation uses the standard http authentication for hadoop. Users can set it to simple if they choose. I was trying to make the point that when Kerberos authentication is not used, simple authentication is not implicitly set, is it? In this case, without the authentication filter, we cannot identify the user via the HTTP interface, so we cannot behave correctly for operations that require knowledge of the user, such as submitting/killing an application. Taking a step back, let's look at the analogous RPC interfaces. By default, the authentication is SIMPLE, and on the server side we can still identify who the user is, so features such as ACLs still work in the SIMPLE case. {quote} Got it. I've added support for simple auth in the default case. I also spoke with [~vinodkv] offline and we felt that in secure mode the default static user should not be allowed to submit jobs. I made that change as well. {quote} bq. For now I'd like to use the same configs as the standard hadoop http auth. I'm open to changing them if we feel strongly about it in the future. It's okay to keep the configuration the same. Just thinking out loud: if so, you may not want to add RM_WEBAPP_USE_YARN_AUTH_FILTER at all, nor load YarnAuthenticationFilterInitializer programmatically. The rationales behind them are similar. Previously, I tried to add TimelineAuthenticationFilterInitializer programmatically because the same HTTP auth config applies to different daemons, and it's annoying that on a single-node cluster, configuring something only for the timeline server affects the others. Afterwards, I made the timeline server use a set of configs with the timeline-service prefix. This is what we did for the RPC interface configurations. {quote} I see your point, but I don't think forcing users to replicate existing configs makes sense at this point. The RM web interfaces are already controlled by the common HTTP auth configs and I'd like to preserve that behaviour. {quote} bq. I didn't understand - can you explain further? Let's take RMWebServices#getApp as an example. Previously we didn't have (or at least didn't know about) the auth filter, so we couldn't detect the user. Therefore, we didn't check the ACLs and simply got the application from RMContext and returned it. Now we have the auth filter and can identify the user. Hence, it's possible for us to fix this API to only return the application information to a user that has access. This is also another reason why I suggest always having the authentication filter on, whether it is simple or kerberos. {quote} Agreed with your tickets. {quote} bq. Am I looking at the wrong file? This is the right file, but I'm afraid it is not the correct logic. AuthenticationFilter accepts a null secret file. However, if we use AuthenticationFilterInitializer to construct AuthenticationFilter, the null case is denied. I previously opened a ticket for this issue (HADOOP-10600). {quote} Thanks for pointing that out. Fixed.
Allow RM web services users to authenticate using delegation tokens --- Key: YARN-2247 URL: https://issues.apache.org/jira/browse/YARN-2247 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Attachments: apache-yarn-2247.0.patch, apache-yarn-2247.1.patch, apache-yarn-2247.2.patch, apache-yarn-2247.3.patch The RM webapp should allow users to authenticate using delegation tokens to maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
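To make the YARN-2247 filter discussion concrete: a minimal sketch, using only the stock Hadoop HTTP-auth configuration keys, of how an authentication filter can stay on even in the non-Kerberos case. This is illustrative only - it is not what apache-yarn-2247.3.patch does, and the class and keys below are the common Hadoop ones, not RM-specific additions.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.AuthenticationFilterInitializer;

public class RmHttpAuthSketch {
  public static Configuration simpleAuthConf() {
    Configuration conf = new Configuration();
    // Register the common auth filter so every HTTP request carries an identity,
    // even when the auth type is "simple" rather than "kerberos".
    conf.set("hadoop.http.filter.initializers",
        AuthenticationFilterInitializer.class.getName());
    conf.set("hadoop.http.authentication.type", "simple");
    // Disallow anonymous access so a static default user cannot submit/kill apps.
    conf.set("hadoop.http.authentication.simple.anonymous.allowed", "false");
    return conf;
  }
}
{code}
With a filter always installed, web-service methods such as RMWebServices#getApp can resolve the caller and apply the same ACL checks as the RPC path.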
[jira] [Commented] (YARN-2325) need check whether node is null in nodeUpdate for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068258#comment-14068258 ] Tsuyoshi OZAWA commented on YARN-2325: -- Hi [~zxu], could you explain the case where nodeUpdate is called after removeNode? IIUC, the RMNodeImpl state machine assures that nodeUpdate is not called after removeNode without the node being added again. Concretely, NodeRemovedSchedulerEvent arises only in ReconnectNodeTransition/DeactivateNodeTransition. * ReconnectNodeTransition assures that a new node with the same node id is added after the transition. * DeactivateNodeTransition assures that the node will be DECOMMISSIONED, and the node won't be updated. If the transition you mentioned occurs, it might be a bug in the state-machine transitions. Please correct me if I'm wrong. need check whether node is null in nodeUpdate for FairScheduler Key: YARN-2325 URL: https://issues.apache.org/jira/browse/YARN-2325 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2325.000.patch We need to check whether the node is null in nodeUpdate for FairScheduler. If nodeUpdate is called after removeNode, getFSSchedulerNode will return null. If the node is null, we should return with an error message. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068289#comment-14068289 ] Hadoop QA commented on YARN-2247: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656829/apache-yarn-2247.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.util.TestFSDownload org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4380//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4380//console This message is automatically generated. Allow RM web services users to authenticate using delegation tokens --- Key: YARN-2247 URL: https://issues.apache.org/jira/browse/YARN-2247 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Attachments: apache-yarn-2247.0.patch, apache-yarn-2247.1.patch, apache-yarn-2247.2.patch, apache-yarn-2247.3.patch The RM webapp should allow users to authenticate using delegation tokens to maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2325) need check whether node is null in nodeUpdate for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068319#comment-14068319 ] zhihai xu commented on YARN-2325: - Hi Tsuyoshi OZAWA, thanks for your quick response to my patch. I agree with your points above. If this transition occurs, it might be a bug in the code. My patch just makes sure we return early to avoid a NullPointerException in case of some unexpected code error that causes the node to be removed. I also found that the current removeNode function does the same thing: it checks for null and returns early: if (node == null) { return; } If you think my patch is not needed for NPE prevention, I am OK to close this JIRA. need check whether node is null in nodeUpdate for FairScheduler Key: YARN-2325 URL: https://issues.apache.org/jira/browse/YARN-2325 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2325.000.patch We need to check whether the node is null in nodeUpdate for FairScheduler. If nodeUpdate is called after removeNode, getFSSchedulerNode will return null. If the node is null, we should return with an error message. -- This message was sent by Atlassian JIRA (v6.2#6252)
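For concreteness, roughly what such a defensive check in FairScheduler#nodeUpdate would look like. This is a sketch in the spirit of YARN-2325.000.patch, not the patch itself; the log message is illustrative.
{code}
// Hypothetical guard at the top of FairScheduler#nodeUpdate (sketch only):
private synchronized void nodeUpdate(RMNode nm) {
  FSSchedulerNode node = getFSSchedulerNode(nm.getNodeID());
  if (node == null) {
    // The node was already removed (e.g. by a racing NODE_REMOVED event),
    // so log and return early instead of hitting a NullPointerException below.
    LOG.error("Node not found while processing heartbeat of " + nm.getNodeID());
    return;
  }
  // ... existing heartbeat/container handling continues here ...
}
{code}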
[jira] [Updated] (YARN-2325) need check whether node is null in nodeUpdate for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2325: Priority: Minor (was: Major) need check whether node is null in nodeUpdate for FairScheduler Key: YARN-2325 URL: https://issues.apache.org/jira/browse/YARN-2325 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2325.000.patch need check whether node is null in nodeUpdate for FairScheduler. If nodeUpdate is called after removeNode, the getFSSchedulerNode will be null. If the node is null, we should return with error message. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068332#comment-14068332 ] Naganarasimha G R commented on YARN-2301: - Hi [~jianhe] and [~zjshen], I have fixed the first 3 items, but while working on the 4th one from [~zjshen]'s comments bq. 4) May have an option to run as yarn container -list appId I had to pick between a few alternatives, so I wanted your opinion before I go forward. # Issue 1: As I understand it, this option should be in addition to ??yarn container -list <Application Attempt ID>??. As it's a CLI we can't have polymorphic parameters, so I thought of providing a command like ??yarn container -list-forAppId <Application ID>??. Is this option fine? Please also note that the GetContainersRequestProto API will be changed. # Issue 2: As [~zjshen] pointed out, we also need to take care of containers coming from the Timeline server flow. For this I would suggest a few options: #* Option 1: For the given AppID, when the application is still running (RM flow), only show containers for the currently running application attempt, irrespective of earlier failed attempts. When the application is finished (Timeline server flow), only show containers for the last attempt of the application (whether it failed or succeeded). #* Option 2: Introduce additional parameters like ??-all?? and ??-last??. Let ??-last?? behave like Option 1 and be the default, and for ??-all??, in both flows (RM and Timeline server), display containers for each attempt. Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability While running the yarn container -list <Application Attempt ID> command, some observations: 1) the scheme (e.g. http/https) before LOG-URL is missing 2) the start-time is printed as milliseconds (e.g. 1405540544844). Better to print it in a time format. 3) finish-time is 0 if the container is not yet finished. Maybe show N/A instead. 4) May have an option to run as yarn container -list appId OR yarn application -list-containers appId as well. As the attempt id is not shown on the console, this makes it easier for the user to just copy the appId and run it; it may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033_ALL.3.patch Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033.3.patch Rebase against the latest trunk Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-2262: --- Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart State:null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes
[ https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068363#comment-14068363 ] Junping Du commented on YARN-2013: -- Agree that the test failure is not related, as it randomly shows up in many other tests. +1. Will commit it shortly. The diagnostics is always the ExitCodeException stack when the container crashes Key: YARN-2013 URL: https://issues.apache.org/jira/browse/YARN-2013 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-2013.1.patch, YARN-2013.2.patch, YARN-2013.3-2.patch, YARN-2013.3.patch, YARN-2013.4.patch, YARN-2013.5.patch When a container crashes, an ExitCodeException will be thrown from Shell. Default/LinuxContainerExecutor captures the exception and puts the exception stack into the diagnostics. Therefore, the exception stack is always the same. {code} String diagnostics = "Exception from container-launch: \n" + StringUtils.stringifyException(e) + "\n" + shExec.getOutput(); container.handle(new ContainerDiagnosticsUpdateEvent(containerId, diagnostics)); {code} In addition, it seems that the exception always has an empty message as there's no message from stderr. Hence the diagnostics are not of much use for users to analyze the reason for a container crash. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2323) FairShareComparator creates too many Resource objects
[ https://issues.apache.org/jira/browse/YARN-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068389#comment-14068389 ] Hudson commented on YARN-2323: -- FAILURE: Integrated in Hadoop-Yarn-trunk #619 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/619/]) YARN-2323. FairShareComparator creates too many Resource objects (Hong Zhiguo via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612187) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java FairShareComparator creates too many Resource objects - Key: YARN-2323 URL: https://issues.apache.org/jira/browse/YARN-2323 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.6.0 Attachments: YARN-2323-2.patch, YARN-2323.patch Each call of {{FairShareComparator}} creates a new Resource object, {{one}}: {code} Resource one = Resources.createResource(1); {code} At the scale of 1000 nodes and 1000 apps, the comparator will be called more than 10 million times per second, thus creating more than 10 million {{one}} objects, which is unnecessary. Since the object {{one}} is read-only and is never referenced outside of the comparator, we could make it static. -- This message was sent by Atlassian JIRA (v6.2#6252)
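In other words, the committed change amounts to hoisting that constant out of the hot path. A rough sketch of the pattern (class and field names here are illustrative, not the literal FairSharePolicy diff):
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Sketch of the idea behind the fix: the constant consulted on every
// comparison becomes a single static, read-only instance instead of being
// re-created per call.
class FairShareComparatorSketch {
  private static final Resource ONE = Resources.createResource(1);

  static Resource one() {
    // Previously: return Resources.createResource(1);  // new object per call
    return ONE;                                          // shared constant
  }
}
{code}
This is safe because the object is never mutated and never escapes the comparator, so at 10+ million comparisons per second the allocation and GC pressure simply disappear.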
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068397#comment-14068397 ] Hadoop QA commented on YARN-2033: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656841/YARN-2033_ALL.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 20 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA org.apache.hadoop.yarn.util.TestFSDownload org.apache.hadoop.yarn.server.timeline.webapp.TestTimelineWebServices org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.TestAHSWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4381//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4381//console This message is automatically generated. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068411#comment-14068411 ] Zhijie Shen commented on YARN-1954: --- [~ozawa], I'm generally fine with the patch. Just some minor comments. 1. Throw IllegalArgumentException instead? {code} +if (checkEveryMillis <= 0) { + checkEveryMillis = 1; +} {code} 2. What if checkEveryMillis > 6? Maybe we can simply hard-code a fixed number of rounds after which to output a warning log. And don't output a warning log in each round; output an info log at regular intervals instead. What do you think? {code} +final int loggingCounterInitValue = 6 / checkEveryMillis; +int loggingCounter = loggingCounterInitValue; {code} 3. There are unnecessary changes in AMRMClientAsyncImpl. 4. In TestAMRMClient#testWaitFor, can you justify countDownChecker#counter == 3 after waitFor? 5. Not necessary to be in a synchronized block if you don't use wait here. {code} +synchronized (callbackHandler.notifier) { + asyncClient.registerApplicationMaster(localhost, 1234, null); + asyncClient.waitFor(checker); +} {code} Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to prevent the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) to block the AM until unregistration or a user-supplied check point, so that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
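To show the usage pattern being reviewed here: a hedged sketch of how an AM's main thread could block on the proposed waitFor instead of a hand-written sleep loop. It assumes the Guava Supplier-based check discussed in the patch; the exact signature is whatever YARN-1954.4.patch finally defines, so the commented calls are illustrative.
{code}
import com.google.common.base.Supplier;
import java.util.concurrent.atomic.AtomicBoolean;

public class WaitForSketch {
  // 'done' would be flipped by the AMRMClientAsync callback handler once all
  // containers have finished and the AM has unregistered.
  static void awaitFinish(final AtomicBoolean done /*, AMRMClientAsync<?> client */)
      throws InterruptedException {
    Supplier<Boolean> check = new Supplier<Boolean>() {
      @Override
      public Boolean get() {
        return done.get();
      }
    };
    // Instead of the dummy loop in the non-daemon main thread:
    // client.waitFor(check);          // poll with the default interval
    // client.waitFor(check, 1000);    // or poll every checkEveryMillis = 1000 ms
  }
}
{code}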
[jira] [Commented] (YARN-2304) TestRMWebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068431#comment-14068431 ] Zhijie Shen commented on YARN-2304: --- [~ozawa] and [~kj-ki], the test failures seem to go beyond the TestRMWebServices* set: https://builds.apache.org/job/PreCommit-YARN-Build/4381//testReport/ We didn't change the Jersey dependency recently, but the test cases didn't fail before. Is it likely that some other test which uses JerseyTest has hung, such that the following tests cannot bind to the same port? TestRMWebServices* fails intermittently --- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of a bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2313: - Attachment: YARN-2313.4.patch Thanks for the review, [~sandyr]. Updated the patch: * Moved the new configuration definition to FairSchedulerConfiguration. * Excluded the warning triggered by updateInterval. * Replaced "use" with "using" in the warning message. * Removed trailing spaces. Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, YARN-2313.4.patch, rm-stack-trace.txt Observed a livelock on FairScheduler when there are lots of entries in the queue. After investigating the code, the following case can occur: 1. {{update()}} called by UpdateThread takes longer than UPDATE_INTERVAL (500ms) if there are lots of queues. 2. UpdateThread goes into a busy loop. 3. Other threads (AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
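For context, a rough sketch of the UpdateThread shape implied by the patch notes above (configurable interval, warning when update() overruns it). Field names and the warning text are placeholders, not the literal contents of YARN-2313.4.patch.
{code}
// Sketch only; not the actual FairScheduler code.
private class UpdateThread extends Thread {
  @Override
  public void run() {
    while (true) {
      try {
        Thread.sleep(updateInterval);   // interval read from FairSchedulerConfiguration
        long start = System.currentTimeMillis();
        update();
        long duration = System.currentTimeMillis() - start;
        if (duration > updateInterval) {
          // update() took longer than its interval; warn instead of silently
          // starving AllocationFileReloader and the scheduler event dispatcher.
          LOG.warn("Update call took " + duration
              + " ms, longer than the configured update interval of "
              + updateInterval + " ms");
        }
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }
}
{code}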
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068444#comment-14068444 ] Zhijie Shen commented on YARN-2319: --- [~gujilangzi], it seems that setupKDC does not need to be invoked on every parameterized round, does it? It can be done once before the class. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch MiniKdc only invokes the start method, and never stop, in TestRMWebServicesDelegationTokens.java: {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
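A minimal sketch of the suggested lifecycle: start MiniKdc once before the class and stop it in a matching teardown so it is not leaked across parameterized rounds. The JUnit hook names and work directory here are assumptions for illustration; the actual test may structure its setup differently.
{code}
import java.io.File;
import org.apache.hadoop.minikdc.MiniKdc;
import org.junit.AfterClass;
import org.junit.BeforeClass;

public class MiniKdcLifecycleSketch {
  private static MiniKdc testMiniKDC;

  @BeforeClass
  public static void setupKDC() throws Exception {
    // One KDC for the whole test class instead of one per parameterized round.
    testMiniKDC = new MiniKdc(MiniKdc.createConf(), new File("target", "test-kdc"));
    testMiniKDC.start();
  }

  @AfterClass
  public static void shutdownKDC() {
    if (testMiniKDC != null) {
      testMiniKDC.stop();   // the missing counterpart to testMiniKDC.start()
    }
  }
}
{code}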
[jira] [Commented] (YARN-1342) Recover container tokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068449#comment-14068449 ] Devaraj K commented on YARN-1342: - Thanks for updating the patch. Good work [~jlowe]. Here are some comments on the patch. 1. NodeManager.java * NMContainerTokenSecretManager gets the nmStore through its constructor, and recover() also gets the state as an argument, which comes from nmStore. I think we can get the state from nmStore inside recover() instead of passing it as an argument. {code:xml}+ containerTokenSecretManager.recover(nmStore.loadContainerTokenState());{code} {code:xml}+new NMContainerTokenSecretManager(conf, nmStore);{code} The same applies to NMTokenSecretManagerInNM as well. 2. NMLeveldbStateStoreService.java * Here e.getMessage() may not be required as the message since we are wrapping the same exception. If we had some custom message we could pass it; otherwise we can simply use new IOException(e). {code:xml}throw new IOException(e.getMessage(), e);{code} * Can we move the CONTAINER_TOKENS_KEY_PREFIX.length() call outside of the while loop? {code:xml} String key = fullKey.substring(CONTAINER_TOKENS_KEY_PREFIX.length()); {code} * Can we make the string *container_* a constant? {code:xml} } else if (key.startsWith("container_")) { {code} 3. What do you think of using the names RecoveredContainerTokensState and loadContainerTokensState instead of RecoveredContainerTokenState and loadContainerTokenState, since these handle more than one ContainerToken? Recover container tokens upon nodemanager restart - Key: YARN-1342 URL: https://issues.apache.org/jira/browse/YARN-1342 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1342.patch, YARN-1342v2.patch, YARN-1342v3-and-YARN-1987.patch, YARN-1342v4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
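To illustrate comment 1: a sketch of the suggested shape, where recover() reads the state from the store injected via the constructor instead of taking it as an argument. The class, method, and state-type names follow the patch/review discussion and are assumptions, not the final committed API.
{code}
import java.io.IOException;

// Sketch of the reviewer's suggestion for NMContainerTokenSecretManager.
public class ContainerTokenRecoverySketch {
  private final NMStateStoreService stateStore;   // assumed store interface

  public ContainerTokenRecoverySketch(NMStateStoreService stateStore) {
    this.stateStore = stateStore;
  }

  // Instead of recover(RecoveredContainerTokensState state) being fed by the
  // caller (NodeManager), pull the state from the already-injected store.
  public void recover() throws IOException {
    NMStateStoreService.RecoveredContainerTokensState state =
        stateStore.loadContainerTokensState();
    // ... replay master keys and container-token expiration times from 'state' ...
  }
}
{code}
The same shape would then apply to NMTokenSecretManagerInNM, keeping NodeManager.java free of store-loading details.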
[jira] [Commented] (YARN-2304) TestRMWebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068457#comment-14068457 ] Tsuyoshi OZAWA commented on YARN-2304: -- Hi Zhijie, thank you for the comment. {quote} Is it likely that some other test which uses JerseyTest has hung, such that the following tests cannot bind to the same port? {quote} Yes, I suspect another test that uses JerseyTest has hung and failed to clean up. I checked recent commits since the end of June to confirm it, but cannot find the cause for now. I wrote my thoughts on YARN-2304: {quote} I think there are two kinds of possibilities: 1. recently added Web-Services tests didn't do proper cleanup, as you mentioned; 2. an infra change for the Jenkins CI had an influence and the problem appeared. As I mentioned on YARN-2304, we don't support parallel test runs now because JerseyTest 1.9 doesn't support them. If the tests run in parallel on a single machine, they fail. {quote} I don't know how the Jenkins infra changed, so this is just one possibility and only my speculation. TestRMWebServices* fails intermittently --- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of a bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2326) WritingYarnApplications.html includes deprecated information
Tsuyoshi OZAWA created YARN-2326: Summary: WritingYarnApplications.html includes deprecated information Key: YARN-2326 URL: https://issues.apache.org/jira/browse/YARN-2326 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Tsuyoshi OZAWA For example, YarnConfiguration.YARN_SECURITY_INFO was removed in MAPREDUCE-3013, but the document still mentions it: {code} appsManagerServerConf.setClass( YarnConfiguration.YARN_SECURITY_INFO, ClientRMSecurityInfo.class, SecurityInfo.class); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068474#comment-14068474 ] Karthik Kambatla commented on YARN-2273: Yep. Thanks for pointing it out, Tsuyoshi. That makes sense. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068478#comment-14068478 ] Hadoop QA commented on YARN-2313: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656855/YARN-2313.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4382//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4382//console This message is automatically generated. Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, YARN-2313.4.patch, rm-stack-trace.txt Observed livelock on FairScheduler when there are lots entry in queue. After my investigating code, following case can occur: 1. {{update()}} called by UpdateThread takes longer times than UPDATE_INTERVAL(500ms) if there are lots queue. 2. UpdateThread goes busy loop. 3. Other threads(AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068500#comment-14068500 ] Karthik Kambatla commented on YARN-2273: +1. Checking this in. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2323) FairShareComparator creates too many Resource objects
[ https://issues.apache.org/jira/browse/YARN-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068506#comment-14068506 ] Hudson commented on YARN-2323: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1838 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1838/]) YARN-2323. FairShareComparator creates too many Resource objects (Hong Zhiguo via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612187) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java FairShareComparator creates too many Resource objects - Key: YARN-2323 URL: https://issues.apache.org/jira/browse/YARN-2323 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.6.0 Attachments: YARN-2323-2.patch, YARN-2323.patch Each call of {{FairShareComparator}} creates a new Resource object, {{one}}: {code} Resource one = Resources.createResource(1); {code} At the scale of 1000 nodes and 1000 apps, the comparator will be called more than 10 million times per second, thus creating more than 10 million {{one}} objects, which is unnecessary. Since the object {{one}} is read-only and is never referenced outside of the comparator, we could make it static. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068508#comment-14068508 ] Karthik Kambatla commented on YARN-2273: Actually, let me retract that +1 temporarily. Can we add a test case here? We can move while(true) into the run method and rename continuousScheduling to continuousSchedulingAttempt. The test from replayException can be used for the test, if we move the fail() to catch-block. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
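A sketch of that suggestion for concreteness: hoist the while(true) into run() and keep one pass per call of continuousSchedulingAttempt(), so a test can drive a single attempt directly and an unexpected exception (like the NPE above) cannot kill the thread. Names follow the comment; the committed patch may differ.
{code}
// Sketch inside FairScheduler (not the literal patch):
private class ContinuousSchedulingThread extends Thread {
  @Override
  public void run() {
    while (true) {
      try {
        continuousSchedulingAttempt();   // one sort-and-assign pass over the nodes
        Thread.sleep(continuousSchedulingSleepMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      } catch (Exception e) {
        // A node removed mid-sort (the NPE in the description) should not
        // cripple the RM; log it and retry on the next iteration.
        LOG.error("Exception in continuous scheduling attempt", e);
      }
    }
  }
}
{code}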
[jira] [Commented] (YARN-2323) FairShareComparator creates too many Resource objects
[ https://issues.apache.org/jira/browse/YARN-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068537#comment-14068537 ] Hudson commented on YARN-2323: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1811 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1811/]) YARN-2323. FairShareComparator creates too many Resource objects (Hong Zhiguo via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612187) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java FairShareComparator creates too many Resource objects - Key: YARN-2323 URL: https://issues.apache.org/jira/browse/YARN-2323 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.6.0 Attachments: YARN-2323-2.patch, YARN-2323.patch Each call of {{FairShareComparator}} creates a new Resource object, {{one}}: {code} Resource one = Resources.createResource(1); {code} At the scale of 1000 nodes and 1000 apps, the comparator will be called more than 10 million times per second, thus creating more than 10 million {{one}} objects, which is unnecessary. Since the object {{one}} is read-only and is never referenced outside of the comparator, we could make it static. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068545#comment-14068545 ] Jason Lowe commented on YARN-1198: -- We need to think of headroom as a report of resources that can be allocated if requested. If the AM cannot currently allocate any containers then the headroom should be reported as zero. I think guaranteed headroom is a separate JIRA and not necessary to solve the deadlock issues surrounding the current headroom reporting. bq. Because in a dynamic cluster, the number can change rapidly, it is possible that the cluster is filled up by another application just one second after the AM got the available headroom. Sure, this can happen. However, on the next heartbeat the headroom will be reported as less than it was before, and the AM can take appropriate action. I don't see this as a major issue, at least in the short term. Telling an AM repeatedly that it can allocate resources that will never be allocated in practice is definitely wrong and needs to be fixed. bq. And also, this field cannot solve the deadlock problem either: a malicious application can ask for much more resource than this, or a careless developer may totally ignore this field. A malicious application cannot cause another application to deadlock as long as the YARN scheduler properly enforces user limits and properly reports the headroom to applications. It seems to me the worst case is an application hurting itself, but since the entire application can be custom user code there's not much YARN can do to prevent that. bq. The only valid solution in my head is putting such logic into the scheduler side, and enforcing resource usage by a preemption policy. The problem is that the scheduler does not, and IMHO should not, know the details of the particular application. For example, let's say an application's headroom goes to zero but it has outstanding allocation requests. Should the YARN scheduler automatically preempt something when this occurs? If so, which container does it preempt? These are questions an AM can answer optimally, including an answer of preempting nothing (e.g. a task is completing imminently), while I don't see how the YARN scheduler can make good decisions without either putting application-specific logic in the YARN scheduler or having the YARN scheduler defer to the AM to make the decision. Reporting the headroom to the AM enables the AM to make an application-optimal decision about what to do, if anything, when the resources available to the application change. Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1198.1.patch Today headroom calculation (for the app) takes place only when * A new node is added/removed from the cluster * A new container is getting assigned to the application. However there are potentially a lot of situations which are not considered for this calculation * If a container finishes then the headroom for that application will change and the AM should be notified accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to either application app1/app2 then both AMs should be notified about their headroom. 
** To simplify the whole communication process it is ideal to keep headroom per user per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted to the same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also, today headroom is an absolute number (I think it should be normalized, but then this is not going to be backward compatible..) * Also, when an admin user refreshes the queue, the headroom has to be updated. These are all potential bugs in the headroom calculation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068546#comment-14068546 ] Zhijie Shen commented on YARN-2033: --- The test failures are not related. See YARN-2304. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2045) Data persisted in NM should be versioned
[ https://issues.apache.org/jira/browse/YARN-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068587#comment-14068587 ] Hudson commented on YARN-2045: -- FAILURE: Integrated in Hadoop-trunk-Commit #5922 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5922/]) YARN-2045. Data persisted in NM should be versioned. Contributed by Junping Du (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612285) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/NMDBSchemaVersion.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/impl/pb * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/impl/pb/NMDBSchemaVersionPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java Data persisted in NM should be versioned Key: YARN-2045 URL: https://issues.apache.org/jira/browse/YARN-2045 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2045-v2.patch, YARN-2045-v3.patch, YARN-2045-v4.patch, YARN-2045-v5.patch, YARN-2045-v6.patch, YARN-2045-v7.patch, YARN-2045.patch As a split task from YARN-667, we want to add version info to NM related data, include: - NodeManager local LevelDB state - NodeManager directory structure -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068594#comment-14068594 ] Chen He commented on YARN-1198: --- {quote} With preemption, resource beyond the guaranteed resource will likely be preempted. It should be considered a temporary resource. {quote} One thing needs to be clarified about preemption: I think we can resolve YARN-2008 without introducing preemption, because if we allow preemption before defining priority, it wastes time and resources to let thousands of AMs repeatedly compete for those temporary resources. Priority is the most important factor in preemptive scheduling. I think, in this JIRA, we are talking about how to efficiently and relatively accurately compute headroom in the Capacity Scheduler. Preemption is another story. Here is how preemption is defined in scheduling: In computing, preemption is the act of temporarily interrupting a task being carried out by a computer system, without requiring its cooperation, and with the intention of resuming the task at a later time. Such a change is known as a context switch. It is normally carried out by a privileged task or part of the system known as a preemptive scheduler, which has the power to preempt, or interrupt, and later resume, other tasks in the system. Refer to Preemption on Wikipedia [http://en.wikipedia.org/wiki/Preemption_%28computing%29] Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1198.1.patch Today headroom calculation (for the app) takes place only when * A new node is added/removed from the cluster * A new container is getting assigned to the application. However there are potentially a lot of situations which are not considered for this calculation * If a container finishes then the headroom for that application will change and the AM should be notified accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to either application app1/app2 then both AMs should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per user per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted to the same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also, today headroom is an absolute number (I think it should be normalized, but then this is not going to be backward compatible..) * Also, when an admin user refreshes the queue, the headroom has to be updated. These are all potential bugs in the headroom calculation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2262: -- Issue Type: Sub-task (was: Bug) Parent: YARN-1530 Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart State:null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068597#comment-14068597 ] Zhijie Shen commented on YARN-2262: --- [~nishan], do you have more details about the issue? Does this application already finish before RM restarting? Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart State:null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2262: -- Issue Type: Sub-task (was: Bug) Parent: YARN-321 Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart State:null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2262: -- Issue Type: Bug (was: Sub-task) Parent: (was: YARN-1530) Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart State:null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2304) TestRMWebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068607#comment-14068607 ] Zhijie Shen commented on YARN-2304: --- I saw we've opened a number of Test*WebServices* tickets, but I believe the problem was the same. Shall we consolidate all of them? TestRMWebServices* fails intermittently --- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2304) Test*WebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2304: - Summary: Test*WebServices* fails intermittently (was: TestRMWebServices* fails intermittently) Test*WebServices* fails intermittently -- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2316) TestNMWebServices* get failed on trunk
[ https://issues.apache.org/jira/browse/YARN-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA resolved YARN-2316. -- Resolution: Duplicate TestNMWebServices* get failed on trunk -- Key: YARN-2316 URL: https://issues.apache.org/jira/browse/YARN-2316 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du From Jenkins test in YARN-2045 and YARN-1341, these tests get failed with address already get bind. The similar issue happens at RMWebService (YARN-2304) and AMWebService (MAPREDUCE-5973) as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2304) Test*WebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068624#comment-14068624 ] Tsuyoshi OZAWA commented on YARN-2304: -- OK, let's deal with the problem on this JIRA. Test*WebServices* fails intermittently -- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2316) TestNMWebServices* get failed on trunk
[ https://issues.apache.org/jira/browse/YARN-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068628#comment-14068628 ] Tsuyoshi OZAWA commented on YARN-2316: -- Closing this as a duplicate of YARN-2304, because the causes of YARN-2304, YARN-2316, and MAPREDUCE-5973 look the same. Let's consolidate them and deal with them on YARN-2304. TestNMWebServices* get failed on trunk -- Key: YARN-2316 URL: https://issues.apache.org/jira/browse/YARN-2316 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du From Jenkins test in YARN-2045 and YARN-1341, these tests get failed with address already get bind. The similar issue happens at RMWebService (YARN-2304) and AMWebService (MAPREDUCE-5973) as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2322) Provide Cli to refresh Admin Acls for Timeline server
[ https://issues.apache.org/jira/browse/YARN-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068633#comment-14068633 ] Zhijie Shen commented on YARN-2322: --- In fact, we need an admin service for the timeline server. Provide Cli to refresh Admin Acls for Timeline server Key: YARN-2322 URL: https://issues.apache.org/jira/browse/YARN-2322 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Karam Singh Provide Cli to refresh Admin Acls for Timelineserver. Currently rmadmin -refreshAdminAcls provides a facility to refresh Admin Acls for the ResourceManager. But if we want to modify adminAcls for the Timelineserver, we need to restart it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2322) Provide Cli to refresh Admin Acls for Timeline server
[ https://issues.apache.org/jira/browse/YARN-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2322: -- Issue Type: Sub-task (was: Improvement) Parent: YARN-1530 Provide Cli to refresh Admin Acls for Timeline server Key: YARN-2322 URL: https://issues.apache.org/jira/browse/YARN-2322 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Karam Singh Provide Cli to refresh Admin Acls for Timelineserver. Currently rmadmin -refreshAdminAcls provides a facility to refresh Admin Acls for the ResourceManager. But if we want to modify adminAcls for the Timelineserver, we need to restart it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2304) Test*WebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2304: - Description: TestNMWebService, TestRMWebService, and TestAMWebService fail intermittently with bind exceptions (address already in use). was:The test fails intermittently because of bind exception. Test*WebServices* fails intermittently -- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt TestNMWebService, TestRMWebService, and TestAMWebService fail intermittently with bind exceptions (address already in use). -- This message was sent by Atlassian JIRA (v6.2#6252)
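Editor's note: a minimal sketch (not the fix applied in these JIRAs) of the usual way such "address already in use" test failures are avoided: bind a throwaway socket to port 0 so the OS picks a free ephemeral port, then hand that port to the server under test. There is still a small race between releasing the probe socket and the server binding, which is why this is only a common mitigation rather than a guarantee.
{code}
import java.io.IOException;
import java.net.ServerSocket;

public final class FreePortExample {

  /** Asks the OS for a currently unused ephemeral port. */
  static int findFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    }
  }

  public static void main(String[] args) throws IOException {
    int port = findFreePort();
    System.out.println("Test web server could bind to 127.0.0.1:" + port);
  }
}
{code}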
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068640#comment-14068640 ] Tsuyoshi OZAWA commented on YARN-2273: -- Sounds reasonable. [~ywskycn], could you update to address Karthik's comment? NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068684#comment-14068684 ] Wei Yan commented on YARN-2273: --- [~kasha], [~ozawa], will update a patch soon. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068704#comment-14068704 ] Alejandro Abdelnur commented on YARN-796: - Wangda, I had previously missed the new doc explaining label predicates. Thanks for pointing it out. How about first shooting for the following? * RM has list of valid labels. (hot reloadable) * NMs have list of labels. (hot reloadable) * NMs report labels at registration time and on heartbeats when they change * label-expressions support (AND) only * app able to specify a label-expression when making a resource request * queues to AND augment the label expression with the queue label-expression And later we can add (in a backwards compatible way) * add support for OR and NOT to label-expressions * add label ACLs * centralized per NM configuration, REST API for it, etc, etc Thoughts? Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
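Editor's note: a rough sketch of the AND-only semantics proposed in the comment above (purely illustrative, not YARN code, and the "&&" syntax is an assumption): a node satisfies a label expression such as "GPU && LINUX" exactly when the node's label set contains every label mentioned in the expression. OR and NOT, as noted, could be layered on later in a backwards-compatible way.
{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public final class LabelExpressionSketch {

  /** True if the node carries every label named in the AND-only expression. */
  static boolean satisfies(Set<String> nodeLabels, String andExpression) {
    for (String label : andExpression.split("&&")) {
      if (!nodeLabels.contains(label.trim())) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Set<String> node = new HashSet<>(Arrays.asList("GPU", "LINUX", "X86"));
    System.out.println(satisfies(node, "GPU && LINUX"));   // true
    System.out.println(satisfies(node, "GPU && WINDOWS")); // false
  }
}
{code}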
[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068735#comment-14068735 ] Jian He commented on YARN-2295: --- The patch looks good overall; one comment: we can have newInstance cover the following setResource call as well. {code} LocalResource shellRsrc = LocalResource.newInstance(null, LocalResourceType.FILE, LocalResourceVisibility.APPLICATION, shellScriptPathLen, shellScriptPathTimestamp); try { shellRsrc.setResource(ConverterUtils.getYarnUrlFromURI(new URI(renamedScriptPath.toString()))); } {code} Refactor YARN distributed shell with existing public stable API --- Key: YARN-2295 URL: https://issues.apache.org/jira/browse/YARN-2295 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, YARN-2295-071514.patch Some API calls in YARN distributed shell have been marked as unstable and private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
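Editor's note: one possible shape for Jian He's suggestion, shown as a small helper. This helper is hypothetical (it is not part of the patch), but the newInstance overload that takes the resource URL up front and ConverterUtils.getYarnUrlFromURI are existing YARN APIs, so the two-step newInstance(null, ...) + setResource(...) pattern quoted above can collapse into one call along these lines.
{code}
import java.net.URI;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

public final class LocalResourceHelper {

  /** Builds the shell-script LocalResource with its URL set in a single call. */
  static LocalResource scriptResource(URI scriptUri, long length, long timestamp) {
    return LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromURI(scriptUri), // resource URL set up front
        LocalResourceType.FILE,
        LocalResourceVisibility.APPLICATION,
        length,
        timestamp);
  }
}
{code}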
[jira] [Updated] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2273: -- Attachment: YARN-2273.patch NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068791#comment-14068791 ] Craig Welch commented on YARN-1198: --- [~wangda] I concur with [~jlowe] and [~airbots] that these headroom fixes (incl. [YARN-2008]) should happen. I don't think this is a redefinition of headroom; headroom remains the maximum resources an application can get - the application can't get resources that are unavailable because they are in use, which is what the change addresses. I think of this change as really only a fix for a missed case - it will in fact return the same value as it does today except under some specific cases of higher cluster utilization, in which case the value it returns will actually be better than its current behavior in terms of helping the AM work accurately and preventing some known deadlock conditions. This kind of behavior is a necessary consequence of allowing oversubscription of cluster resources vis-à-vis the maximum allocation, which is greater than the baseline (and which in aggregate can be 100%), and this oversubscription is a reasonable design choice to allow applications to burst above their guaranteed level when other queues are less utilized. As I mentioned on [YARN-2008], since the aggregate maximum can be 100% it's not possible to solve this solely with preemption - AMs will still be getting higher values than are available without this correction - so, retaining the max behavior for the reasons above, this kind of approach is going to be the way to go. Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1198.1.patch Today headroom calculation (for the app) takes place only when * New node is added/removed from the cluster * New container is getting assigned to the application. However there are potentially a lot of situations which are not considered for this calculation * If a container finishes then headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to any applications app1/app2 then both AMs should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to same user and submitted in same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also today headroom is an absolute number ( I think it should be normalized but then this is going to be not backward compatible..) * Also when admin user refreshes queue headroom has to be updated. These are all potential bugs in headroom calculations -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068810#comment-14068810 ] Zhijie Shen commented on YARN-2301: --- bq. Issue1 : As i understand this option should be in addition to yarn container -list Application Attempt ID, As its CLI we can't have polymorphic parameters, How about having -list only, and then parsing whether the given id is an app id or an app attempt id? bq. Issue2 : As Zhijie Shen pointed out we need to take care for containers from Timeline server flow also. For this i would suggest few options Option 2 sounds more attractive to me if it doesn't introduce too much complexity. [~jianhe], I haven't been tracking the recent scheduler changes. Is it able to show the containers of a previous app attempt, or the finished containers of the current app attempt? Previously, a container was removed from the scheduler once it finished. If that is still the case, -all can only work for applications in the timeline server but not for a running app in the RM. Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability While running yarn container -list Application Attempt ID command, some observations: 1) the scheme (e.g. http/https ) before LOG-URL is missing 2) the start-time is printed as milli seconds (e.g. 1405540544844). Better to print as time format. 3) finish-time is 0 if container is not yet finished. May be N/A 4) May have an option to run as yarn container -list appId OR yarn application -list-containers appId also. As attempt Id is not shown on console, this is easier for user to just copy the appId and run it, may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
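Editor's note: a sketch of the "-list only" idea from the comment above. This is an assumption about how the dispatch could work, not the eventual patch: YARN IDs are self-describing strings, so the CLI can tell an application ID from an application-attempt ID by its prefix rather than requiring separate options.
{code}
public final class ContainerListArg {

  enum Kind { APPLICATION, APP_ATTEMPT, UNKNOWN }

  /** Classifies a CLI argument by the standard YARN ID prefixes. */
  static Kind classify(String id) {
    if (id.startsWith("appattempt_")) {
      return Kind.APP_ATTEMPT;           // e.g. appattempt_1404858438119_4352_000001
    }
    if (id.startsWith("application_")) {
      return Kind.APPLICATION;           // e.g. application_1404858438119_4352
    }
    return Kind.UNKNOWN;
  }

  public static void main(String[] args) {
    System.out.println(classify("application_1404858438119_4352"));       // APPLICATION
    System.out.println(classify("appattempt_1404858438119_4352_000001")); // APP_ATTEMPT
  }
}
{code}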
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068834#comment-14068834 ] Karthik Kambatla commented on YARN-2273: Thanks Wei. A few comments on the latest patch, some not specific to changes in this patch. # Can continuousSchedulingAttempt be package-private? # We should log the following at ERROR level: {code} + } catch (Throwable ex) { + LOG.warn("Error while attempting scheduling for node " + node + ": " + ex.toString(), ex); + } {code} # When the scheduling thread is interrupted, shouldn't we actually stop the thread? What are the cases where we want to ignore an interruption? # Update the log message in the catch-block of InterruptedException to "Continuous scheduling thread interrupted." Maybe add "Exiting." if we do decide to shut the thread down. # In the test, do we need to call FS#reinitialize()? # In the test, should we catch all exceptions instead of just NPE? NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
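Editor's note: a minimal sketch of the loop structure implied by the review comments above. The class and method names are assumptions, not the FairScheduler source: a failure while scheduling on one node is logged at ERROR and the loop moves on, while an interruption is treated as a request to shut the thread down rather than something to ignore.
{code}
import java.util.ArrayList;
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class ContinuousSchedulingLoop implements Runnable {
  private static final Logger LOG =
      LoggerFactory.getLogger(ContinuousSchedulingLoop.class);

  private final List<String> nodes = new ArrayList<>(); // placeholder for tracked nodes

  void attemptSchedulingOnNode(String node) {
    // placeholder for the real per-node scheduling work
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      for (String node : nodes) {
        try {
          attemptSchedulingOnNode(node);
        } catch (Throwable t) {
          // ERROR, as requested in the review: one bad node must not kill the thread.
          LOG.error("Error while attempting scheduling for node " + node, t);
        }
      }
      try {
        Thread.sleep(100); // pause between scheduling passes
      } catch (InterruptedException e) {
        LOG.error("Continuous scheduling thread interrupted. Exiting.", e);
        return; // stop the thread rather than ignoring the interruption
      }
    }
  }
}
{code}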
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068872#comment-14068872 ] Jian He commented on YARN-2211: --- - AMRMTokenSecretManagerState class is not needed. we may just use AMRMTokenSecretManagerStateData and rename AMRMTokenSecretManagerStateData to AMRMTokenSecretManagerState. we should probably remove other ApplicationState etc. in a different jira too. - Fix the naming, similarly AMRMTokenSecretManagerStateData#get/setCurrentTokenMasterKey {code} *|- currentTokenMasterKey *|- nextTokenMasterKey {code} - existsWithRetries(amrmTokenSecretManagerRoot, true) != null; change to use isUpdate flag RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
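Editor's note: an illustrative sketch of the state shape being discussed in this review (names and types are assumptions, not the patch): the store persists at most two AMRMToken master keys, the one currently in use and, during a roll-over window, the next one, so a restarted RM can keep validating AM tokens issued before the failover.
{code}
public final class AMRMTokenSecretManagerStateSketch {

  private final byte[] currentMasterKey; // key material currently used to sign AMRMTokens
  private final byte[] nextMasterKey;    // non-null only while a key roll-over is pending

  public AMRMTokenSecretManagerStateSketch(byte[] currentMasterKey, byte[] nextMasterKey) {
    this.currentMasterKey = currentMasterKey;
    this.nextMasterKey = nextMasterKey;
  }

  public byte[] getCurrentMasterKey() { return currentMasterKey; }
  public byte[] getNextMasterKey() { return nextMasterKey; }
}
{code}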
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068879#comment-14068879 ] Anubhav Dhoot commented on YARN-1372: - Yes. Working on this now. Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.2#6252)
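Editor's note: a sketch of the protocol described in the YARN-1372 summary above, with assumed names rather than the actual NM code: the NM keeps every completed container it has reported until the RM confirms that the AM has pulled it, and on re-registration after an RM restart the whole pending set is resent.
{code}
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public final class CompletedContainerTracker {

  // container IDs reported to the RM but not yet acknowledged as pulled by the AM
  private final Set<String> pendingAck = ConcurrentHashMap.newKeySet();

  /** Called when a container finishes; it stays tracked until acknowledged. */
  void containerCompleted(String containerId) {
    pendingAck.add(containerId);
  }

  /** Sent in heartbeats and, after an RM restart, in the re-registration request. */
  Collection<String> completedContainersToReport() {
    return pendingAck;
  }

  /** Called when the RM tells the NM that the AM has pulled these containers. */
  void acknowledged(Collection<String> containerIds) {
    pendingAck.removeAll(containerIds);
  }
}
{code}
This matches the note in the description that some completions may be reported more than once: a report is only dropped once the RM's acknowledgement arrives, so an RM crash between the AM pull and the acknowledgement simply causes a harmless duplicate.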
[jira] [Updated] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2273: -- Attachment: YARN-2273.patch Thanks, [~kasha], update a new patch to address the comments. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068910#comment-14068910 ] Hadoop QA commented on YARN-2273: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656888/YARN-2273.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4383//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4383//console This message is automatically generated. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. 
java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068947#comment-14068947 ] Craig Welch commented on YARN-1198: --- Also - this, combined with preemption, will be desirable behavior - as preemption rebalances, this logic will properly (accurately) raise the headroom value for an application - since the AM understands the particulars of its own task ordering, it needs to know what resources it actually has to work with as preemption frees them, in order to make optimal use of them. Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1198.1.patch Today headroom calculation (for the app) takes place only when * New node is added/removed from the cluster * New container is getting assigned to the application. However there are potentially a lot of situations which are not considered for this calculation * If a container finishes then headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to any applications app1/app2 then both AMs should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to same user and submitted in same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also today headroom is an absolute number ( I think it should be normalized but then this is going to be not backward compatible..) * Also when admin user refreshes queue headroom has to be updated. These are all potential bugs in headroom calculations -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2327) YARN should warn about nodes with poor clock synchronization
[ https://issues.apache.org/jira/browse/YARN-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-2327: --- Summary: YARN should warn about nodes with poor clock synchronization (was: A hadoop cluster needs clock synchronization) YARN should warn about nodes with poor clock synchronization Key: YARN-2327 URL: https://issues.apache.org/jira/browse/YARN-2327 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen As a distributed system, a hadoop cluster wants the clock on all the participating hosts synchronized. Otherwise, some problems might happen. For example, in YARN-2251, due to the clock on the host for the task container falls behind that on the host of the AM container, the computed elapsed time (the diff between the timestamps produced on two hosts) becomes negative. In YARN-2251, we tried to mask the negative elapsed time. However, we should seek for a decent long term solution, such as providing mechanism to do and check clock synchronization. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (YARN-2327) A hadoop cluster needs clock synchronization
[ https://issues.apache.org/jira/browse/YARN-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe moved HADOOP-10794 to YARN-2327: - Key: YARN-2327 (was: HADOOP-10794) Project: Hadoop YARN (was: Hadoop Common) A hadoop cluster needs clock synchronization Key: YARN-2327 URL: https://issues.apache.org/jira/browse/YARN-2327 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen As a distributed system, a hadoop cluster wants the clock on all the participating hosts synchronized. Otherwise, some problems might happen. For example, in YARN-2251, due to the clock on the host for the task container falls behind that on the host of the AM container, the computed elapsed time (the diff between the timestamps produced on two hosts) becomes negative. In YARN-2251, we tried to mask the negative elapsed time. However, we should seek for a decent long term solution, such as providing mechanism to do and check clock synchronization. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2327) YARN should warn about nodes with poor clock synchronization
[ https://issues.apache.org/jira/browse/YARN-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-2327: --- Description: YARN should warn about nodes with poor clock synchronization. YARN relies on approximate clock synchronization to report certain elapsed time statistics (see YARN-2251), but we currently don't warn if this assumption is violated. was: YARN should warn about nodes with poor clock synchronization. YARN statistics rely on approximate clock synchronization in a few cases (see YARN-2251), but we currently don't warn if this assumption is violated. YARN should warn about nodes with poor clock synchronization Key: YARN-2327 URL: https://issues.apache.org/jira/browse/YARN-2327 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen YARN should warn about nodes with poor clock synchronization. YARN relies on approximate clock synchronization to report certain elapsed time statistics (see YARN-2251), but we currently don't warn if this assumption is violated. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2327) YARN should warn about nodes with poor clock synchronization
[ https://issues.apache.org/jira/browse/YARN-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-2327: --- Description: YARN should warn about nodes with poor clock synchronization. YARN statistics rely on approximate clock synchronization in a few cases (see YARN-2251), but we currently don't warn if this assumption is violated. was: As a distributed system, a hadoop cluster wants the clock on all the participating hosts synchronized. Otherwise, some problems might happen. For example, in YARN-2251, due to the clock on the host for the task container falls behind that on the host of the AM container, the computed elapsed time (the diff between the timestamps produced on two hosts) becomes negative. In YARN-2251, we tried to mask the negative elapsed time. However, we should seek for a decent long term solution, such as providing mechanism to do and check clock synchronization. YARN should warn about nodes with poor clock synchronization Key: YARN-2327 URL: https://issues.apache.org/jira/browse/YARN-2327 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen YARN should warn about nodes with poor clock synchronization. YARN statistics rely on approximate clock synchronization in a few cases (see YARN-2251), but we currently don't warn if this assumption is violated. -- This message was sent by Atlassian JIRA (v6.2#6252)
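Editor's note: a rough sketch of the kind of warning YARN-2327 proposes; the threshold, names, and the idea of carrying a node-reported timestamp in the heartbeat are assumptions for illustration. If a node reports its local time, the RM can compare it against its own clock and warn when the apparent skew exceeds some tolerance; network latency makes this only an approximation, which is why it is a warning rather than an error.
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class ClockSkewCheck {
  private static final Logger LOG = LoggerFactory.getLogger(ClockSkewCheck.class);

  private static final long MAX_TOLERATED_SKEW_MS = 30_000; // illustrative threshold

  /** Compares a node-reported timestamp against the local clock and warns on large skew. */
  static void checkSkew(String nodeId, long nodeReportedTimeMs) {
    long skew = Math.abs(System.currentTimeMillis() - nodeReportedTimeMs);
    if (skew > MAX_TOLERATED_SKEW_MS) {
      LOG.warn("Node " + nodeId + " clock appears to be off by ~" + skew
          + " ms; elapsed-time statistics involving this node may be misleading.");
    }
  }
}
{code}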
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069040#comment-14069040 ] Hadoop QA commented on YARN-2273: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656909/YARN-2273.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4384//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4384//console This message is automatically generated. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. 
java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069077#comment-14069077 ] Karthik Kambatla commented on YARN-2273: Still missing a return in this block: {code} } catch (InterruptedException e) { LOG.error("Continuous scheduling thread interrupted. Exiting.", e); } {code} Unrelated, I think ContinuousSchedulingThread should be a separate class like UpdateThread. Both should extend Thread and be singleton classes. We can address this in another JIRA. In that JIRA, we should also add a test to make sure FairScheduler#stop stops both the threads. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2328) FairScheduler: Update and Continuous-Scheduling threads should be singletons
Karthik Kambatla created YARN-2328: -- Summary: FairScheduler: Update and Continuous-Scheduling threads should be singletons Key: YARN-2328 URL: https://issues.apache.org/jira/browse/YARN-2328 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor FairScheduler threads can use a little cleanup and tests. To begin with, the update and continuous-scheduling threads should extend Thread and be singletons. We should have tests for starting and stopping them as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069085#comment-14069085 ] Karthik Kambatla commented on YARN-2273: Filed YARN-2328 for the latter comment. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069092#comment-14069092 ] Karthik Kambatla commented on YARN-2131: +1 to the addendum. I'll commit this later today if no one objects. Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-671) Add an interface on the RM to move NMs into a maintenance state
[ https://issues.apache.org/jira/browse/YARN-671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069124#comment-14069124 ] Jian Fang commented on YARN-671: YARN-796 is adding label support for NMs. Could label-based scheduling be used to ask the RM not to assign any new tasks to NMs that are about to be decommissioned, giving them some grace time to finish decommissioning? Add an interface on the RM to move NMs into a maintenance state --- Key: YARN-671 URL: https://issues.apache.org/jira/browse/YARN-671 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.4-alpha Reporter: Siddharth Seth Assignee: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069175#comment-14069175 ] Jian Fang commented on YARN-796: * RM has list of valid labels. (hot reloadable) This requires the RM to have a global picture of the cluster before it starts, which is unlikely to be true in our use case, where we provide Hadoop as a cloud platform and the RM does not have any information about the slave nodes until they join the cluster. Why not just treat all labels registered by NMs as valid ones? Label validation could apply only to resource requests. * label-expressions support (AND) only At least in our use case, OR is often used, not AND. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
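To illustrate the AND vs. OR point, here is a hypothetical evaluation of a flat label expression against a node's label set; this is not from the YARN-796 design documents, just a small worked example of the semantics being discussed:
{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only: evaluating a flat label expression such as
// "GPU&&LARGE_MEM" or "GPU||FPGA" against the labels reported for a node.
public class LabelExpressionExample {
  static boolean matches(String expression, Set<String> nodeLabels) {
    if (expression == null || expression.isEmpty()) {
      return true;                                    // no constraint
    }
    if (expression.contains("||")) {                  // OR: any listed label is enough
      for (String label : expression.split("\\|\\|")) {
        if (nodeLabels.contains(label.trim())) {
          return true;
        }
      }
      return false;
    }
    for (String label : expression.split("&&")) {     // AND: all listed labels required
      if (!nodeLabels.contains(label.trim())) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Set<String> labels = new HashSet<>(Arrays.asList("GPU", "LARGE_MEM"));
    System.out.println(matches("GPU&&LARGE_MEM", labels)); // true
    System.out.println(matches("GPU||FPGA", labels));      // true
    System.out.println(matches("FPGA&&GPU", labels));      // false
  }
}
{code}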
[jira] [Updated] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2211: Attachment: YARN-2211.6.patch RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, YARN-2211.6.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2321) NodeManager web UI can incorrectly report Pmem enforcement
[ https://issues.apache.org/jira/browse/YARN-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2321: - Summary: NodeManager web UI can incorrectly report Pmem enforcement (was: NodeManager WebUI get wrong configuration of isPmemCheckEnabled()) +1 lgtm. Committing this. NodeManager web UI can incorrectly report Pmem enforcement -- Key: YARN-2321 URL: https://issues.apache.org/jira/browse/YARN-2321 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Attachments: YARN-2321.patch The NodeManager web UI reports the wrong value for whether Pmem enforcement is enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (YARN-2329) Machine List generated by machines.jsp should be sorted
[ https://issues.apache.org/jira/browse/YARN-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer moved MAPREDUCE-498 to YARN-2329: -- Key: YARN-2329 (was: MAPREDUCE-498) Project: Hadoop YARN (was: Hadoop Map/Reduce) Machine List generated by machines.jsp should be sorted --- Key: YARN-2329 URL: https://issues.apache.org/jira/browse/YARN-2329 Project: Hadoop YARN Issue Type: Improvement Reporter: Tim Williamson Priority: Minor Attachments: HADOOP-5586.patch The listing of machines shown by machines.jsp is arbitrarily ordered. It would be more useful to sort them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2321) NodeManager web UI can incorrectly report Pmem enforcement
[ https://issues.apache.org/jira/browse/YARN-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069426#comment-14069426 ] Hudson commented on YARN-2321: -- FAILURE: Integrated in Hadoop-trunk-Commit #5927 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5927/]) YARN-2321. NodeManager web UI can incorrectly report Pmem enforcement. Contributed by Leitao Guo (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612411) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/NodePage.java NodeManager web UI can incorrectly report Pmem enforcement -- Key: YARN-2321 URL: https://issues.apache.org/jira/browse/YARN-2321 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Assignee: Leitao Guo Fix For: 3.0.0, 2.6.0 Attachments: YARN-2321.patch The NodeManager web UI reports the wrong value for whether Pmem enforcement is enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
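For context, the kind of configuration lookup the NodePage fix concerns looks roughly like the snippet below; the YarnConfiguration keys are the standard ones, but this is not the literal patched code:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hedged illustration: the web UI should report the physical-memory check flag
// read with its own key and default, not some other setting.
public class PmemCheckExample {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    boolean pmemCheckEnabled = conf.getBoolean(
        YarnConfiguration.NM_PMEM_CHECK_ENABLED,
        YarnConfiguration.DEFAULT_NM_PMEM_CHECK_ENABLED);
    System.out.println("Pmem enforcement enabled: " + pmemCheckEnabled);
  }
}
{code}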
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069453#comment-14069453 ] Hadoop QA commented on YARN-2211: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656964/YARN-2211.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4385//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4385//console This message is automatically generated. RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, YARN-2211.6.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1726: --- Priority: Blocker (was: Minor) Target Version/s: 2.5.0 Affects Version/s: 2.4.1 Given the scheduler-load-simulator is broken without this, I am making it a blocker for 2.5. ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041 Key: YARN-1726 URL: https://issues.apache.org/jira/browse/YARN-1726 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.1 Reporter: Wei Yan Assignee: Wei Yan Priority: Blocker Attachments: YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch The YARN scheduler simulator failed when running Fair Scheduler, due to AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit AbstractYarnScheduler, instead of implementing ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
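The change the issue description asks for is essentially a supertype swap; a directional sketch follows (not the actual patch, and omitting any generic type parameters AbstractYarnScheduler may carry on a given branch):
{code}
// Before: the simulator wrapper implements the scheduler interface directly.
//   public class ResourceSchedulerWrapper implements ResourceScheduler { ... }
//
// After: it extends the common base class introduced by YARN-1041, inheriting the
// shared bookkeeping the built-in schedulers now rely on.
//   public class ResourceSchedulerWrapper extends AbstractYarnScheduler { ... }
{code}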
[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069522#comment-14069522 ] Hadoop QA commented on YARN-1726: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12638118/YARN-1726.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4386//console This message is automatically generated. ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041 Key: YARN-1726 URL: https://issues.apache.org/jira/browse/YARN-1726 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.1 Reporter: Wei Yan Assignee: Wei Yan Priority: Blocker Attachments: YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch The YARN scheduler simulator failed when running Fair Scheduler, due to AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit AbstractYarnScheduler, instead of implementing ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2211: Attachment: YARN-2211.6.1.patch fix the testcase failures RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, YARN-2211.6.1.patch, YARN-2211.6.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2295: Attachment: YARN-2295-072114.patch Thanks [~jianhe]! I've addressed the problem in the latest patch. If you've got time please take a look at it. Thanks! Refactor YARN distributed shell with existing public stable API --- Key: YARN-2295 URL: https://issues.apache.org/jira/browse/YARN-2295 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, YARN-2295-071514.patch, YARN-2295-072114.patch Some API calls in YARN distributed shell have been marked as unstable and private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
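As a generic example of the "public stable API" direction described in this issue (not a fragment of the attached patch), client code is expected to go through YarnClient rather than private or unstable records and protocols:
{code}
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Generic illustration of using the public, stable YarnClient API.
public class YarnClientExample {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();
    try {
      System.out.println("Applications visible: " + yarnClient.getApplications().size());
    } finally {
      yarnClient.stop();
    }
  }
}
{code}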
[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069588#comment-14069588 ] Hadoop QA commented on YARN-2295: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657002/YARN-2295-072114.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4387//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4387//console This message is automatically generated. Refactor YARN distributed shell with existing public stable API --- Key: YARN-2295 URL: https://issues.apache.org/jira/browse/YARN-2295 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, YARN-2295-071514.patch, YARN-2295-072114.patch Some API calls in YARN distributed shell have been marked as unstable and private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2316) TestNMWebServices* get failed on trunk
[ https://issues.apache.org/jira/browse/YARN-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069593#comment-14069593 ] Junping Du commented on YARN-2316: -- +1 on resolving it as a duplicate. It looks like these failures are all for the same reason. TestNMWebServices* get failed on trunk -- Key: YARN-2316 URL: https://issues.apache.org/jira/browse/YARN-2316 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du From the Jenkins tests in YARN-2045 and YARN-1341, these tests fail because the address is already bound. A similar issue happens with RMWebService (YARN-2304) and AMWebService (MAPREDUCE-5973) as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
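The bind failures mentioned above are a common symptom of tests reusing a fixed port; as background (this is not the YARN-2316 resolution, which was closed as a duplicate), one general remedy is to bind to port 0 so the OS assigns a free port:
{code}
import java.io.IOException;
import java.net.ServerSocket;

// Background illustration: binding to port 0 asks the OS for any free port,
// which avoids "address already in use" collisions between tests in one JVM.
public class EphemeralPortExample {
  public static void main(String[] args) throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      System.out.println("Bound test socket on free port " + socket.getLocalPort());
    }
  }
}
{code}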
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069615#comment-14069615 ] Wangda Tan commented on YARN-796: - Hi Tucu, Thanks for sharing your thoughts on how to stage the development work. That's reasonable, and we're trying to scope the work for a first iteration as well. Will keep you posted. Thanks, Wangda Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069619#comment-14069619 ] Wangda Tan commented on YARN-796: - Jian Fang, I think it makes sense for the RM to have a global picture, because it lets us prevent typos introduced by admins manually filling in labels in the NM config, etc. On the other hand, your use case is also reasonable; we should support both of them, as well as OR label expressions. Will keep you posted when we have made a plan. Thanks, Wangda Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069644#comment-14069644 ] Hadoop QA commented on YARN-2211: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657000/YARN-2211.6.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4388//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4388//console This message is automatically generated. RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, YARN-2211.6.1.patch, YARN-2211.6.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069651#comment-14069651 ] Wangda Tan commented on YARN-1198: -- I agree with [~jlowe], [~airbots] and [~cwelch]: used resources should be taken into account in the headroom calculation (which is YARN-2008). And of course, an application master can still ask for more than that number and possibly get more resources. I completely agree with what Jason mentioned; ignoring headroom will not cause problems for anything except the application itself. What I originally wanted to say is that putting headroom and gang scheduling together can cause deadlocks, which should be solved on the scheduler side. But that seems off-topic, so let's set it aside here. Also, as Chen mentioned, we don't need to consider preemption when computing headroom. Besides, when resources are about to be preempted from an app, the AM will receive preemption request messages and should handle them itself. Thanks, Wangda Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1198.1.patch Today headroom calculation (for the app) takes place only when * New node is added/removed from the cluster * New container is getting assigned to the application. However there are potentially a lot of situations which are not considered in this calculation * If a container finishes then headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to any applications app1/app2 then both AMs should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to same user and submitted in same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also today headroom is an absolute number ( I think it should be normalized but then this is going to be not backward compatible..) * Also when admin user refreshes queue headroom has to be updated. These are all potential bugs in headroom calculation -- This message was sent by Atlassian JIRA (v6.2#6252)
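To make the "consider used resources" point concrete, a deliberately simplified headroom computation follows; it is not the CapacityScheduler's actual formula, just the shape of the idea under discussion:
{code}
// Simplified illustration of the headroom idea discussed above (not the exact
// CapacityScheduler formula): an app's headroom is bounded by its user limit and
// by what is actually still free in the queue, minus what the user already holds.
public class HeadroomExample {
  static int headroom(int userLimit, int queueMax, int queueUsed, int userUsed) {
    int queueAvailable = queueMax - queueUsed;          // what the queue can still hand out
    int remaining = Math.min(userLimit - userUsed,      // user-limit share not yet consumed
                             queueAvailable);
    return Math.max(remaining, 0);                      // never report negative headroom
  }

  public static void main(String[] args) {
    // user limit 40 GB, queue max 100 GB, queue already using 70 GB, this user using 30 GB
    System.out.println(headroom(40, 100, 70, 30) + " GB");   // prints 10 GB
  }
}
{code}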
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069662#comment-14069662 ] Xuan Gong commented on YARN-2211: - The test case failures are unrelated. All the test cases pass on my local machine. RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, YARN-2211.6.1.patch, YARN-2211.6.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2321) NodeManager web UI can incorrectly report Pmem enforcement
[ https://issues.apache.org/jira/browse/YARN-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069661#comment-14069661 ] Leitao Guo commented on YARN-2321: -- Thanks Jason Lowe! NodeManager web UI can incorrectly report Pmem enforcement -- Key: YARN-2321 URL: https://issues.apache.org/jira/browse/YARN-2321 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Assignee: Leitao Guo Fix For: 3.0.0, 2.6.0 Attachments: YARN-2321.patch The NodeManager web UI reports the wrong value for whether Pmem enforcement is enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2131. Resolution: Fixed Thanks Robert. Committed addendum to trunk and branch-2. Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069673#comment-14069673 ] Hudson commented on YARN-2131: -- FAILURE: Integrated in Hadoop-trunk-Commit #5930 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5930/]) YARN-2131. Addendum. Add a way to format the RMStateStore. (Robert Kanter via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612443) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1726: -- Attachment: YARN-1726-5.patch Rebase the patch. ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041 Key: YARN-1726 URL: https://issues.apache.org/jira/browse/YARN-1726 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.1 Reporter: Wei Yan Assignee: Wei Yan Priority: Blocker Attachments: YARN-1726-5.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch The YARN scheduler simulator failed when running Fair Scheduler, due to AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit AbstractYarnScheduler, instead of implementing ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes
[ https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069765#comment-14069765 ] Tsuyoshi OZAWA commented on YARN-2013: -- Thanks for your review, [~djp]! The diagnostics is always the ExitCodeException stack when the container crashes Key: YARN-2013 URL: https://issues.apache.org/jira/browse/YARN-2013 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-2013.1.patch, YARN-2013.2.patch, YARN-2013.3-2.patch, YARN-2013.3.patch, YARN-2013.4.patch, YARN-2013.5.patch When a container crashes, ExitCodeException will be thrown from Shell. Default/LinuxContainerExecutor captures the exception and puts the exception stack into the diagnostics. Therefore, the exception stack is always the same. {code} String diagnostics = "Exception from container-launch: \n" + StringUtils.stringifyException(e) + "\n" + shExec.getOutput(); container.handle(new ContainerDiagnosticsUpdateEvent(containerId, diagnostics)); {code} In addition, it seems that the exception always has an empty message as there's no message from stderr. Hence the diagnostics are not of much use for users trying to analyze the reason for a container crash. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes
[ https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069768#comment-14069768 ] Hudson commented on YARN-2013: -- FAILURE: Integrated in Hadoop-trunk-Commit #5931 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5931/]) YARN-2013. The diagnostics is always the ExitCodeException stack when the container crashes. (Contributed by Tsuyoshi OZAWA) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612449) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java The diagnostics is always the ExitCodeException stack when the container crashes Key: YARN-2013 URL: https://issues.apache.org/jira/browse/YARN-2013 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-2013.1.patch, YARN-2013.2.patch, YARN-2013.3-2.patch, YARN-2013.3.patch, YARN-2013.4.patch, YARN-2013.5.patch When a container crashes, ExitCodeException will be thrown from Shell. Default/LinuxContainerExecutor captures the exception and puts the exception stack into the diagnostics. Therefore, the exception stack is always the same. {code} String diagnostics = "Exception from container-launch: \n" + StringUtils.stringifyException(e) + "\n" + shExec.getOutput(); container.handle(new ContainerDiagnosticsUpdateEvent(containerId, diagnostics)); {code} In addition, it seems that the exception always has an empty message as there's no message from stderr. Hence the diagnostics are not of much use for users trying to analyze the reason for a container crash. -- This message was sent by Atlassian JIRA (v6.2#6252)
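For illustration only (this is not the committed YARN-2013 change), a diagnostics message that leads with the exit code and the container's own shell output is far more useful than the bare ExitCodeException stack; all names below are hypothetical:
{code}
// Illustrative only: build a diagnostics string around the exit code and the
// container's shell output rather than the exception stack alone.
public class DiagnosticsExample {
  static String buildDiagnostics(int exitCode, String shellOutput, Throwable cause) {
    StringBuilder sb = new StringBuilder();
    sb.append("Exception from container-launch.\n");
    sb.append("Container exited with a non-zero exit code ").append(exitCode).append("\n");
    if (shellOutput != null && !shellOutput.isEmpty()) {
      sb.append("Shell output: ").append(shellOutput).append("\n");
    }
    if (cause != null) {
      sb.append("Stack trace: ").append(cause).append("\n");
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(buildDiagnostics(1, "No such file or directory", null));
  }
}
{code}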
[jira] [Updated] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenwu Peng updated YARN-2319: - Attachment: YARN-2319.1.patch Thanks [~zjshen] for the great comments. The new patch addresses [~zjshen]'s comment by refactoring setupKDC into a @BeforeClass method. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch, YARN-2319.1.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
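A sketch of the lifecycle the patch moves towards is below: start the KDC once per test class and always stop it afterwards so the process and its port are released. Field and directory names are illustrative, not copied from the test:
{code}
import java.io.File;
import org.apache.hadoop.minikdc.MiniKdc;
import org.junit.AfterClass;
import org.junit.BeforeClass;

// Illustrative lifecycle sketch: pair every MiniKdc.start() with a stop().
public class MiniKdcLifecycleSketch {
  private static MiniKdc testMiniKDC;

  @BeforeClass
  public static void setupKDC() throws Exception {
    testMiniKDC = new MiniKdc(MiniKdc.createConf(), new File("target/test-kdc"));
    testMiniKDC.start();
  }

  @AfterClass
  public static void shutdownKDC() {
    if (testMiniKDC != null) {
      testMiniKDC.stop();   // the missing cleanup the JIRA is about
    }
  }
}
{code}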
[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069782#comment-14069782 ] Hadoop QA commented on YARN-1726: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657041/YARN-1726-5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-sls. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4390//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4390//console This message is automatically generated. ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041 Key: YARN-1726 URL: https://issues.apache.org/jira/browse/YARN-1726 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.1 Reporter: Wei Yan Assignee: Wei Yan Priority: Blocker Attachments: YARN-1726-5.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch The YARN scheduler simulator failed when running Fair Scheduler, due to AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit AbstractYarnScheduler, instead of implementing ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069783#comment-14069783 ] Hadoop QA commented on YARN-2242: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655592/YARN-2242-071414.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4389//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4389//console This message is automatically generated. Improve exception information on AM launch crashes -- Key: YARN-2242 URL: https://issues.apache.org/jira/browse/YARN-2242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Fix For: 2.6.0 Attachments: YARN-2242-070115-2.patch, YARN-2242-070814-1.patch, YARN-2242-070814.patch, YARN-2242-071114.patch, YARN-2242-071214.patch, YARN-2242-071414.patch Now on each time AM Container crashes during launch, both the console and the webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, but sometimes confusing. With the help of log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a pointer to the aggregated logs to the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2270) TestFSDownload#testDownloadPublicWithStatCache fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069814#comment-14069814 ] Zhijie Shen commented on YARN-2270: --- +1, LGTM as well. The test failure was gone locally with this patch. TestFSDownload#testDownloadPublicWithStatCache fails in trunk - Key: YARN-2270 URL: https://issues.apache.org/jira/browse/YARN-2270 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.1 Reporter: Ted Yu Assignee: Akira AJISAKA Priority: Minor Attachments: YARN-2270.2.patch, YARN-2270.patch From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console : {code} Running org.apache.hadoop.yarn.util.TestFSDownload Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec FAILURE! - in org.apache.hadoop.yarn.util.TestFSDownload testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 0.137 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363) {code} Similar error can be seen here: https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/ Looks like future.get() returned null. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2330) Jobs are not displaying in timeline server after RM restart
Nishan Shetty created YARN-2330: --- Summary: Jobs are not displaying in timeline server after RM restart Key: YARN-2330 URL: https://issues.apache.org/jira/browse/YARN-2330 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.1 Environment: Nodemanagers 3 (3*8GB) Queues A = 70% Queues B = 30% Reporter: Nishan Shetty Submit jobs to queue a While job is running Restart RM Observe that those jobs are not displayed in timelineserver {code} 2014-07-22 10:11:32,084 ERROR org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: History information of application application_1406002968974_0003 is not included into the result due to the exception java.io.IOException: Cannot seek to negative offset at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1381) at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63) at org.apache.hadoop.io.file.tfile.BCFile$Reader.init(BCFile.java:624) at org.apache.hadoop.io.file.tfile.TFile$Reader.init(TFile.java:804) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore$HistoryFileReader.init(FileSystemApplicationHistoryStore.java:683) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getHistoryFileReader(FileSystemApplicationHistoryStore.java:661) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getApplication(FileSystemApplicationHistoryStore.java:146) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getAllApplications(FileSystemApplicationHistoryStore.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getAllApplications(ApplicationHistoryManagerImpl.java:103) at org.apache.hadoop.yarn.server.webapp.AppsBlock.render(AppsBlock.java:75) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56) at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) at org.apache.hadoop.yarn.webapp.Dispatcher.render(Dispatcher.java:197) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:156) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
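One plausible reading of the stack trace, offered purely as a hypothesis rather than a proposed fix, is that the negative-seek error comes from opening an empty or truncated history file left behind by the restart; a defensive check along these lines would skip such files before handing them to the TFile reader:
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothesis sketch only, not a patch: an empty file cannot be a valid TFile,
// so it could be skipped instead of triggering the negative-offset seek.
public class HistoryFileGuardSketch {
  static boolean looksReadable(FileSystem fs, Path historyFile) throws IOException {
    FileStatus status = fs.getFileStatus(historyFile);
    return status.getLen() > 0;
  }
}
{code}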
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069823#comment-14069823 ] Hadoop QA commented on YARN-2319: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657046/YARN-2319.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4391//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4391//console This message is automatically generated. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch, YARN-2319.1.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2270) TestFSDownload#testDownloadPublicWithStatCache fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069831#comment-14069831 ] Hudson commented on YARN-2270: -- FAILURE: Integrated in Hadoop-trunk-Commit #5932 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5932/]) YARN-2270. Made TestFSDownload#testDownloadPublicWithStatCache be skipped when there’s no ancestor permissions. Contributed by Akira Ajisaka. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612460) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/FSDownload.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestFSDownload.java TestFSDownload#testDownloadPublicWithStatCache fails in trunk - Key: YARN-2270 URL: https://issues.apache.org/jira/browse/YARN-2270 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.1 Reporter: Ted Yu Assignee: Akira AJISAKA Priority: Minor Attachments: YARN-2270.2.patch, YARN-2270.patch From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console : {code} Running org.apache.hadoop.yarn.util.TestFSDownload Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec FAILURE! - in org.apache.hadoop.yarn.util.TestFSDownload testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 0.137 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363) {code} Similar error can be seen here: https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/ Looks like future.get() returned null. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2270) TestFSDownload#testDownloadPublicWithStatCache fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069832#comment-14069832 ] Zhijie Shen commented on YARN-2270: --- Committed to trunk, branch-2, and branch-2.5. Thanks [~ajisakaa] for the patch, and [~vvasudev] for the review! TestFSDownload#testDownloadPublicWithStatCache fails in trunk - Key: YARN-2270 URL: https://issues.apache.org/jira/browse/YARN-2270 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.1 Reporter: Ted Yu Assignee: Akira AJISAKA Priority: Minor Fix For: 2.5.0 Attachments: YARN-2270.2.patch, YARN-2270.patch From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console : {code} Running org.apache.hadoop.yarn.util.TestFSDownload Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec FAILURE! - in org.apache.hadoop.yarn.util.TestFSDownload testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 0.137 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363) {code} Similar error can be seen here: https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/ Looks like future.get() returned null. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069858#comment-14069858 ] Hadoop QA commented on YARN-2319: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657046/YARN-2319.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4392//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4392//console This message is automatically generated. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch, YARN-2319.1.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)