[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005667#comment-14005667 ] Tsuyoshi OZAWA commented on YARN-2017: -- Good job! Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Fix For: 2.5.0 Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch, YARN-2017.6.patch, YARN-2017.6.patch, YARN-2017.7.patch A bunch of the same code is repeated among schedulers, e.g. between FiCaSchedulerNode and FSSchedulerNode. It would be good to merge and share it in a common base class. -- This message was sent by Atlassian JIRA (v6.2#6252)
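A minimal sketch of the common-base idea, for illustration; the SchedulerNode name and members here are assumptions for this example, not necessarily what the attached patches do:
{code}
// Hypothetical shared base class; FiCaSchedulerNode and FSSchedulerNode
// would extend it and keep only their scheduler-specific logic.
public abstract class SchedulerNode {
  private Resource availableResource = Resource.newInstance(0, 0);
  private Resource usedResource = Resource.newInstance(0, 0);

  public Resource getAvailableResource() {
    return availableResource;
  }

  public Resource getUsedResource() {
    return usedResource;
  }
}
{code}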
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005674#comment-14005674 ] Tsuyoshi OZAWA commented on YARN-1474: -- I'm rebasing a patch on YARN-2017. Please wait a moment. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize method but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
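A rough sketch of what fitting a scheduler into the service model could look like, assuming the scheduler extends AbstractService; the helper methods are placeholders, not code from the attached patches:
{code}
public class FairScheduler extends AbstractService implements ResourceScheduler {
  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    initScheduler(conf);     // hypothetical: one-time setup formerly in reinitialize()
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    startSchedulerThreads(); // hypothetical: e.g. background update threads
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    stopSchedulerThreads();  // hypothetical
    super.serviceStop();
  }
}
{code}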
[jira] [Assigned] (YARN-1801) NPE in public localizer
[ https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo reassigned YARN-1801: - Assignee: Hong Zhiguo NPE in public localizer --- Key: YARN-1801 URL: https://issues.apache.org/jira/browse/YARN-1801 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.2.0 Reporter: Jason Lowe Assignee: Hong Zhiguo Priority: Critical While investigating YARN-1800, I found this in the NM logs, which caused the public localizer to shut down: {noformat} 2014-01-23 01:26:38,655 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440382009, FILE, null } 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(726)) - Error: Shutting down java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712) 2014-01-23 01:26:38,656 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(728)) - Public cache exiting {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1801) NPE in public localizer
[ https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-1801: -- Attachment: YARN-1801.patch {code}Path local = completed.get();{code} may throw ExecutionException, and assoc may be null. When both happen, we get an NPE in {code}LOG.info("Failed to download rsrc " + assoc.getResource(), e.getCause());{code} And this is exactly line ResourceLocalizationService.java:712 of commit dd9c059 (2013-10-05, YARN-1254). NPE in public localizer --- Key: YARN-1801 URL: https://issues.apache.org/jira/browse/YARN-1801 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.2.0 Reporter: Jason Lowe Assignee: Hong Zhiguo Priority: Critical Attachments: YARN-1801.patch While investigating YARN-1800, I found this in the NM logs, which caused the public localizer to shut down: {noformat} 2014-01-23 01:26:38,655 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440382009, FILE, null } 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(726)) - Error: Shutting down java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712) 2014-01-23 01:26:38,656 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(728)) - Public cache exiting {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
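To make the failure mode concrete, here is a sketch of the pattern (variable names follow the discussion above; the guarded log call is one possible shape of a fix, not necessarily the attached patch):
{code}
Future<Path> completed = queue.take();
LocalizerResourceRequestEvent assoc = pending.remove(completed);
try {
  Path local = completed.get();   // may throw ExecutionException
  // ... publish the localized path ...
} catch (ExecutionException e) {
  // assoc can be null here; guard the dereference that used to NPE:
  LOG.info("Failed to download rsrc "
      + (assoc != null ? assoc.getResource() : "(unknown)"), e.getCause());
}
{code}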
[jira] [Commented] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005700#comment-14005700 ] Hadoop QA commented on YARN-2049: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646158/YARN-2049.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3788//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3788//console This message is automatically generated. Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, YARN-2049.4.patch, YARN-2049.5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2092) Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/YARN-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005718#comment-14005718 ] Steve Loughran commented on YARN-2092: -- This seems to stem from the HADOOP-10104 patch, which went in because the 2.2+ version of jackson was so out of date it was breaking other things. I'm not sure it's so much incompatible as that TEZ is trying to push in its own version of jackson, which is then leading to classpath mixing problems. Even if you try to push in one set of the JARs ahead of the other, things are going to break. I know, I've tried. jackson 1.x should be compatible at run time with code built for previous versions. If there's a link problem there, then it's something we can take up with the Jackson team. Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT Key: YARN-2092 URL: https://issues.apache.org/jira/browse/YARN-2092 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Came across this when trying to integrate with the timeline server. Using a 1.8.8 dependency of jackson works fine against 2.4.0 but fails against 2.5.0-SNAPSHOT which needs 1.9.13. This is in the scenario where the user jars are first in the classpath. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005722#comment-14005722 ] Binglin Chang commented on YARN-2088: - Hi Zhiguo, thanks for the comments; nice catch. Those two lines are used in every record class... so deleting them in a single place actually breaks the code convention, and it's not related to this JIRA. We may discuss whether to delete them all in another JIRA. Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder Key: YARN-2088 URL: https://issues.apache.org/jira/browse/YARN-2088 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Attachments: YARN-2088.v1.patch Some fields (set, list) are added to proto builders multiple times; we need to clear those fields before adding, otherwise the resulting proto contains extra contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
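For reference, the clear-before-add pattern the description calls for looks roughly like this; the applicationTypes field name is illustrative, not necessarily the field fixed by the attached patch:
{code}
private void mergeLocalToBuilder() {
  if (applicationTypes != null && !applicationTypes.isEmpty()) {
    builder.clearApplicationTypes();   // avoid appending duplicates on re-merge
    builder.addAllApplicationTypes(applicationTypes);
  }
}
{code}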
[jira] [Commented] (YARN-2030) Use StateMachine to simplify handleStoreEvent() in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005753#comment-14005753 ] Binglin Chang commented on YARN-2030: - Hi Jian He, thanks for the comments. Looks like the PBImpl already has ProtoBase as its superclass, so we can't change the interface to an abstract class: {code}public class ApplicationAttemptStateDataPBImpl extends ProtoBase<ApplicationAttemptStateDataProto> implements ApplicationAttemptStateData {{code} Use StateMachine to simplify handleStoreEvent() in RMStateStore --- Key: YARN-2030 URL: https://issues.apache.org/jira/browse/YARN-2030 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Assignee: Binglin Chang Attachments: YARN-2030.v1.patch, YARN-2030.v2.patch Now the logic to handle different store events in handleStoreEvent() is as follows:
{code}
if (event.getType().equals(RMStateStoreEventType.STORE_APP)
    || event.getType().equals(RMStateStoreEventType.UPDATE_APP)) {
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
    ...
  } else {
    ...
  }
  ...
  try {
    if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
      ...
    } else {
      ...
    }
  }
  ...
} else if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)
    || event.getType().equals(RMStateStoreEventType.UPDATE_APP_ATTEMPT)) {
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
    ...
  } else {
    ...
  }
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
    ...
  } else {
    ...
  }
} else if (event.getType().equals(RMStateStoreEventType.REMOVE_APP)) {
  ...
} else {
  ...
}
{code}
This not only confuses people but also leads to mistakes easily. We may leverage a state machine to simplify this, even if there are no state transitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
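A minimal sketch of the state-machine direction, assuming Hadoop's StateMachineFactory; the single ACTIVE state and the transition class names are illustrative, not the attached patches:
{code}
private static final StateMachineFactory<RMStateStore, RMStateStoreState,
    RMStateStoreEventType, RMStateStoreEvent> stateMachineFactory =
  new StateMachineFactory<RMStateStore, RMStateStoreState,
      RMStateStoreEventType, RMStateStoreEvent>(RMStateStoreState.ACTIVE)
    .addTransition(RMStateStoreState.ACTIVE, RMStateStoreState.ACTIVE,
        RMStateStoreEventType.STORE_APP, new StoreAppTransition())
    .addTransition(RMStateStoreState.ACTIVE, RMStateStoreState.ACTIVE,
        RMStateStoreEventType.UPDATE_APP, new UpdateAppTransition())
    .installTopology();

private static class StoreAppTransition
    implements SingleArcTransition<RMStateStore, RMStateStoreEvent> {
  @Override
  public void transition(RMStateStore store, RMStateStoreEvent event) {
    // logic that used to live in one branch of handleStoreEvent()
  }
}
{code}
Each branch of the old if/else chain becomes its own transition class, so adding a new event type no longer touches a shared dispatch method.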
[jira] [Commented] (YARN-2092) Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/YARN-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005804#comment-14005804 ] Steve Loughran commented on YARN-2092: -- I should add that the underlying issue is that the AM gets the entire CP from the {{yarn.lib.classpath}}. That's mandatory to pick up a version of the hadoop binaries (and -site.xml files) compatible with the rest of the cluster. But it brings in all the other dependencies which hadoop itself relies on. As hadoop evolves, this problem will only continue. The only viable long-term solution is to somehow support OSGi-launched AMs, so the AM only gets the org.apache.hadoop classes from the hadoop JARs, and has to explicitly add everything itself. See HADOOP-7977 for this - maybe it's something we could target for Hadoop 3.0, driven by the needs of AMs. Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT Key: YARN-2092 URL: https://issues.apache.org/jira/browse/YARN-2092 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Came across this when trying to integrate with the timeline server. Using a 1.8.8 dependency of jackson works fine against 2.4.0 but fails against 2.5.0-SNAPSHOT which needs 1.9.13. This is in the scenario where the user jars are first in the classpath. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005863#comment-14005863 ] Rohith commented on YARN-1366: -- bq. I mean, what will go wrong if we allow unregister without register? Is it fundamentally wrong? Allowing unregister without register moves the application to the FINISHED state (after handling the unregistered event at LAUNCHED), which is supposed to be the FAILED state. If that is acceptable, then it's fine to go ahead. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
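For context, the AM-side resync described here would look roughly like the following sketch; buildFullOutstandingRequest is a hypothetical helper, not an API in the patches:
{code}
AllocateResponse response = resourceManager.allocate(allocateRequest);
if (response.getAMCommand() == AMCommand.AM_RESYNC) {
  lastResponseId = 0;                              // reset the allocate RPC sequence number
  allocateRequest = buildFullOutstandingRequest(); // re-send all outstanding asks/releases
}
{code}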
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006066#comment-14006066 ] Eric Payne commented on YARN-415: - The Generic Application History Server stores all of the information about containers that is needed to calculate memory-seconds and vcore-seconds. Right now, since the Generic Application History Server is tied closely to the Timeline Server, this does not work on a secured cluster. Also, the information is only available via the REST API right now, and there would need to be some scripting and parsing of the REST APIs to roll up metrics for each app. So, I think this JIRA would still be very helpful and useful. FYI, on an unsecured cluster with the Generic Application History Server and the Timeline Server configured and running, the following REST APIs will give enough information about an app to calculate memory-seconds and vcore-seconds: {panel:title=Get list of app attempts for a specified appID|titleBGColor=#F7D6C1} curl --compressed -H "Accept: application/json" -X GET http://hostname:port/ws/v1/applicationhistory/apps/appID/appattempts {panel} {panel:title=For each app attempt, get all container info|titleBGColor=#F7D6C1} curl --compressed -H "Accept: application/json" -X GET http://hostname:port/ws/v1/applicationhistory/apps/appID/appattempts/appAttemptID/containers {panel} Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
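The roll-up itself is simple once the container records are in hand; a sketch over ContainerReport records such as a history-server client can return, assuming memory is reported in MB and times in milliseconds:
{code}
long memorySeconds = 0;
long vcoreSeconds = 0;
for (ContainerReport c : containers) {
  long lifetimeSec = (c.getFinishTime() - c.getCreationTime()) / 1000;
  memorySeconds += (long) c.getAllocatedResource().getMemory() * lifetimeSec;
  vcoreSeconds += (long) c.getAllocatedResource().getVirtualCores() * lifetimeSec;
}
{code}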
[jira] [Updated] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1474: - Attachment: YARN-1474.16.patch Rebased on trunk. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006222#comment-14006222 ] Abin Shahab commented on YARN-1964: --- Do others have comments on it, [~acmurthy]? Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In the context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with the requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006248#comment-14006248 ] Wei Yan commented on YARN-596: -- Hey, [~sandyr], sorry for the late reply. Still confused here. So as you said, a queue is safe and doesn't allow preemption only if it satisfies the condition (usage.memory <= fairshare.memory) && (usage.vcores <= fairshare.vcores). This condition works fine for DRF. But for FairSharePolicy, since fairshare.vcores is always 0 (except for root), this condition cannot be satisfied and all queues always allow preemption. In fair scheduler, intra-application container priorities affect inter-application preemption decisions --- Key: YARN-596 URL: https://issues.apache.org/jira/browse/YARN-596 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch In the fair scheduler, containers are chosen for preemption in the following way: All containers for all apps that are in queues that are over their fair share are put in a list. The list is sorted in order of the priority that the container was requested in. This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense. Also, an application that is not over its fair share, but that is in a queue that is over its fair share is just as likely to have containers preempted as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006453#comment-14006453 ] Vinod Kumar Vavilapalli commented on YARN-2049: --- Thanks for working on this, Zhijie! Some comments on the patch: TimelineKerberosAuthenticator - Not clear what TimelineDelegationTokenResponse validateAndParseResponse() is doing with class loading, construction etc. Can you explain and maybe also add code comments? TimelineAuthenticationFilter - Explain what getConfiguration() overrides and add a code comment? TimelineKerberosAuthenticationHandler - This borrows a lot of code from HttpFSKerberosAuthenticationHandler.java. We should refactor either here or in a separate JIRA. Nits - TestDistributedShell change is unnecessary - TimelineDelegationTokenSelector: Wrap the debug logging in debugEnabled checks. - ApplicationHistoryServer.java -- Forced config setting of the filter: What happens if the cluster has another authentication filter? Is the guideline to override it (which is what the patch is doing)? h4. Source code refactor TimelineKerberosAuthenticationHandler - Rename to TimelineClientAuthenticationService? TimelineKerberosAuthenticator - It seems like TimelineKerberosAuthenticator is completely client side code and so should be moved to the client module - To do that we will extract some of the constants and the DelegationTokenOperation enum as top level entities into the common module. TimelineAuthenticationFilterInitializer - This is almost the same as the common AuthenticationFilterInitializer.java. Let's just refactor AuthenticationFilterInitializer.java and extend it to only change class names. Similar to how TimelineAuthenticationFilter extends AuthenticationFilter. TimelineDelegationTokenSecretManagerService: - We are sharing the configs for update/renewal etc with the ResourceManager. That seems fine for now - logically you want both the tokens to follow similar expiry and life-cycle - This also shares a bunch of code with org/apache/hadoop/lib/service/security/DelegationTokenManagerService. We may or may not want to reuse some code - just throwing it out there. Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, YARN-2049.4.patch, YARN-2049.5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006459#comment-14006459 ] Mayank Bansal commented on YARN-2074: - Thanks [~jianhe] for the patch. Overall looks good. Some nits: {code}maxAppAttempts = attempts.size(){code} Can we use this instead? {code}maxAppAttempts == getAttemptFailureCount(){code} {code}public boolean isPreempted() { return getDiagnostics().contains(SchedulerUtils.PREEMPTED_CONTAINER); }{code} I think we need to compare the exit status (-102) instead of relying on the string message. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
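The exit-status variant of the check would look roughly like this; getFinishedStatus is a hypothetical accessor, while ContainerExitStatus.PREEMPTED is the -102 constant in the YARN API:
{code}
public boolean isPreempted() {
  ContainerStatus status = getFinishedStatus();  // hypothetical accessor
  return status != null
      && status.getExitStatus() == ContainerExitStatus.PREEMPTED;
}
{code}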
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006465#comment-14006465 ] Jian He commented on YARN-1408: --- Hi [~sunilg], agree that we should remove the container from newlyAllocatedContainers when preemption happens. As per the race condition you mentioned, we may also preempt an ACQUIRED container? In fact, I think the best containers to be preempted are the ALLOCATED containers, as these containers are not yet alive from the user's perspective. As per the race condition that [RM lost the resource request], today the resource request is decremented when the container is allocated. We may change it to decrement the resource request only when the container is pulled by the AM? We can do this separately if it makes sense. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable=true, * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity A jobA task which uses queue b capacity has been preempted and killed. This caused the below problem: 1. A new container got allocated for jobA in Queue A as per a node update from an NM. 2. This container was preempted immediately as per preemption. Here the ACQUIRED at KILLED invalid state exception came when the next AM heartbeat reached the RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30 minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
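A sketch of the first idea above, pulling a still-ALLOCATED container back out of the newly-allocated list on kill; removeNewlyAllocatedContainer is a hypothetical helper, not an existing API:
{code}
if (rmContainer.getState() == RMContainerState.ALLOCATED) {
  // Never hand a dead container to the AM, avoiding ACQUIRED at KILLED.
  application.removeNewlyAllocatedContainer(rmContainer);
}
{code}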
[jira] [Updated] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2074: -- Attachment: YARN-2074.3.patch Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006525#comment-14006525 ] Jian He commented on YARN-2074: --- Thanks Xuan and Mayank for the review! bq. maxAppAttempts == getAttemptFailureCount() Good point. Fixed the check to compare against the exit status to determine preempted or not. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2049: -- Attachment: YARN-2049.6.patch Updated the patch accordingly. Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, YARN-2049.4.patch, YARN-2049.5.patch, YARN-2049.6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006542#comment-14006542 ] Zhijie Shen commented on YARN-2049: --- Thanks for the review, Vinod and Varun! Please see the responses below: bq. 1. In the function managementOperation, should there be a null check for token? There's the following code before processing each dtOp:
{code}
if (dtOp.requiresKerberosCredentials() && token == null) {
  response.sendError(HttpServletResponse.SC_UNAUTHORIZED,
      MessageFormat.format(
          "Operation [{0}] requires SPNEGO authentication established", dtOp));
  requestContinues = false;
}
{code}
Get and renew both require kerberos credentials, such that if token == null, the code will fall into this part. Cancel didn't require credentials before, referring to HttpFS's code. However, I think we should enforce kerberos credentials for cancel as well. After that, the NPE risk is gone. bq. In the function managementOperation, you call secretManager.cancelToken(dt, UserGroupInformation.getCurrentUser().getUserName()) - should you use getCurrentUser().getUserName? or ownerUGI.getUserName()? Good catch, we should use token.getUserName here as well. bq. TimelineKerberosAuthenticator Some errors may cause TimelineAuthenticator to not get the correct response. If the status code is not 200, the json content may contain the exception information from the server; we can use that information to recover the exception object. This is inspired by HttpFSUtils.validateResponse, but I changed to use Jackson to parse the json content here. bq. TimelineAuthenticationFilter In the configuration we can simply set the authentication type to kerberos, but in the timeline server, we want to replace it with the class name of the customized authentication service. Otherwise, the standard authentication handler will be used instead. I added the code comments there. bq. TimelineKerberosAuthenticationHandler bq. TimelineDelegationTokenSecretManagerService. Yeah, we need to look into how to reuse the existing code, but how about postponing it? I'm going to file a separate JIRA for code refactoring. bq. TestDistributedShell change is unnecessary Removed. bq. TimelineDelegationTokenSelector: Wrap the debug logging in debugEnabled checks. Added the debugEnabled checks. bq. ApplicationHistoryServer.java Actually it will not override the other initializers. Instead, I just append a TimelineAuthenticationFilterInitializer. Anyway, I enhanced the condition here: not only should security be enabled, but kerberos authentication should also be desired. bq. TimelineKerberosAuthenticationHandler Done. bq. TimelineKerberosAuthenticator. Good suggestion. I split the code accordingly. bq. TimelineAuthenticationFilterInitializer AuthenticationFilterInitializer has a single method to do everything, and the prefix is a static variable, which makes it a bit difficult for me to override part of the code without changing AuthenticationFilterInitializer. Another issue is that AuthenticationFilterInitializer requires the user to supply a secret file, which is not actually required by AuthenticationFilter (HADOOP-10600). Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, YARN-2049.4.patch, YARN-2049.5.patch, YARN-2049.6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006548#comment-14006548 ] Mayank Bansal commented on YARN-1408: - I agree with [~jianhe] and [~devaraj.k]. We should be able to preempt the container in the ALLOCATED state. bq. today the resource request is decremented when the container is allocated. we may change it to decrement the resource request only when the container is pulled by the AM? I am not sure if that's the right thing, as you don't want to run into other race conditions where the container has been allocated but the capacity is given to some other AMs. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable=true, * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity A jobA task which uses queue b capacity has been preempted and killed. This caused the below problem: 1. A new container got allocated for jobA in Queue A as per a node update from an NM. 2. This container was preempted immediately as per preemption. Here the ACQUIRED at KILLED invalid state exception came when the next AM heartbeat reached the RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30 minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1913: -- Attachment: YARN-1913.patch With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
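One plausible shape for the proposed cap, checked before launching an AM container; maxAMShare and amResourceUsage are assumed names for this sketch, not necessarily those in the attached patch:
{code}
public boolean canRunAppAM(Resource amResource) {
  // Cap the queue's aggregate AM resource at a fraction of its fair share.
  Resource maxAMResource = Resources.multiply(getFairShare(), maxAMShare);
  Resource ifRunAMResource = Resources.add(amResourceUsage, amResource);
  return Resources.fitsIn(ifRunAMResource, maxAMResource);
}
{code}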
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006609#comment-14006609 ] Sandy Ryza commented on YARN-596: - Ah, I see what you're saying. Good point. In that case we'll probably need to push that check into the SchedulingPolicy and call it inside the loop in preemptContainer(). In fair scheduler, intra-application container priorities affect inter-application preemption decisions --- Key: YARN-596 URL: https://issues.apache.org/jira/browse/YARN-596 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch In the fair scheduler, containers are chosen for preemption in the following way: All containers for all apps that are in queues that are over their fair share are put in a list. The list is sorted in order of the priority that the container was requested in. This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense. Also, an application that is not over its fair share, but that is in a queue that is over its fair share is just as likely to have containers preempted as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006613#comment-14006613 ] Jian He commented on YARN-1408: --- There seem to be more problems with the approach I mentioned: if the request is not updated at the time the container is allocated, and the AM doesn't do the following allocate, more containers will be allocated for the same request when NMs heartbeat. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable=true, * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity A jobA task which uses queue b capacity has been preempted and killed. This caused the below problem: 1. A new container got allocated for jobA in Queue A as per a node update from an NM. 2. This container was preempted immediately as per preemption. Here the ACQUIRED at KILLED invalid state exception came when the next AM heartbeat reached the RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30 minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006615#comment-14006615 ] Wei Yan commented on YARN-596: -- Yes, we can check the queue's policy in the preCheck function. If DRF, we use Resources.fitsIn(); if Fair, we use DEFAULT_CALCULATOR. Sounds good? In fair scheduler, intra-application container priorities affect inter-application preemption decisions --- Key: YARN-596 URL: https://issues.apache.org/jira/browse/YARN-596 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch In the fair scheduler, containers are chosen for preemption in the following way: All containers for all apps that are in queues that are over their fair share are put in a list. The list is sorted in order of the priority that the container was requested in. This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense. Also, an application that is not over its fair share, but that is in a queue that is over its fair share is just as likely to have containers preempted as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2012) Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2012: - Description: Currently the 'default' rule in queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue. This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). was: Currently the 'default' rule in queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue. This queue should be an existing queue; if not, we fall back to the root.default queue, hence keeping this rule as terminal. This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute - Key: YARN-2012 URL: https://issues.apache.org/jira/browse/YARN-2012 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt Currently the 'default' rule in queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue. This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006641#comment-14006641 ] Sandy Ryza commented on YARN-596: - Sounds good In fair scheduler, intra-application container priorities affect inter-application preemption decisions --- Key: YARN-596 URL: https://issues.apache.org/jira/browse/YARN-596 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch In the fair scheduler, containers are chosen for preemption in the following way: All containers for all apps that are in queues that are over their fair share are put in a list. The list is sorted in order of the priority that the container was requested in. This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense. Also, an application that is not over its fair share, but that is in a queue that is over its fair share is just as likely to have containers preempted as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006645#comment-14006645 ] Hadoop QA commented on YARN-2049: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646401/YARN-2049.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/3789//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.client.TestRMAdminCLI {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3789//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3789//console This message is automatically generated. Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, YARN-2049.4.patch, YARN-2049.5.patch, YARN-2049.6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006676#comment-14006676 ] Sandy Ryza commented on YARN-596: - The current patch uses the queue's own policy in preemptContainerPreCheck. We should use the parent's policy. (Consider the case of a leaf queue with FIFO under a parent queue with DRF - we should use DRF to decide whether we should skip the leaf queue.) Also, we should add a new method to SchedulingPolicy instead of checking with instanceof.
{code}
+ if (Resources.fitsIn(getResourceUsage(), getFairShare())) {
+   return false;
+ } else {
+   return true;
+ }
{code}
Can just use return !Resources.fitsIn(getResourceUsage(), getFairShare()). In fair scheduler, intra-application container priorities affect inter-application preemption decisions --- Key: YARN-596 URL: https://issues.apache.org/jira/browse/YARN-596 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch In the fair scheduler, containers are chosen for preemption in the following way: All containers for all apps that are in queues that are over their fair share are put in a list. The list is sorted in order of the priority that the container was requested in. This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense. Also, an application that is not over its fair share, but that is in a queue that is over its fair share is just as likely to have containers preempted as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
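The SchedulingPolicy hook could look like the sketch below; the method name is illustrative. DRF would compare every resource dimension, while the plain fair policy compares memory only:
{code}
public abstract class SchedulingPolicy {
  /** @return whether usage is over this policy's notion of fair share */
  public abstract boolean checkIfUsageOverFairShare(
      Resource usage, Resource fairShare);
}

// DominantResourceFairnessPolicy: any dimension over fair share counts.
//   return !Resources.fitsIn(usage, fairShare);
// FairSharePolicy: memory is the only dimension with a meaningful fair share.
//   return usage.getMemory() > fairShare.getMemory();
{code}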
[jira] [Commented] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
[ https://issues.apache.org/jira/browse/YARN-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006682#comment-14006682 ] Sandy Ryza commented on YARN-2073: -- {code}+ /** Preemption related variables */{code} Nit: use // like the other comments. Can you add the new property to the Fair Scheduler doc? {code}+ updateRootQueueMetrics();{code} My understanding is that this shouldn't be needed in shouldAttemptPreemption. Have you observed otherwise? Would it be possible to move the TestFairScheduler refactoring to a separate JIRA? If it's too difficult to disentangle at this point, I'm ok with it. FairScheduler starts preempting resources even with free resources on the cluster - Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-2073-0.patch, yarn-2073-1.patch, yarn-2073-2.patch, yarn-2073-3.patch Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
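For reference, a sketch of the utilization-threshold check this JIRA discusses; preemptionUtilizationThreshold and the surrounding field names are assumptions for this example:
{code}
private boolean shouldAttemptPreemption() {
  if (preemptionEnabled) {
    // Preempt only when either memory or vcore utilization crosses the threshold.
    return preemptionUtilizationThreshold < Math.max(
        (float) rootMetrics.getAllocatedMB() / clusterResource.getMemory(),
        (float) rootMetrics.getAllocatedVirtualCores()
            / clusterResource.getVirtualCores());
  }
  return false;
}
{code}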
[jira] [Resolved] (YARN-2095) Large MapReduce Job stops responding
[ https://issues.apache.org/jira/browse/YARN-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-2095. --- Resolution: Invalid [~sunliners81], we have run much bigger jobs (100K maps) and jobs that run for a long time without any issues. There is only one limitation that I know of - in secure clusters, tokens expire after 7 days. In any case, please pursue this on the user mailing lists and create a bug when you are sure there is one. Closing this as invalid for now; please reopen if you disagree. Large MapReduce Job stops responding Key: YARN-2095 URL: https://issues.apache.org/jira/browse/YARN-2095 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.3 (x86_64) on vmware 10 running HDP-2.0.6 Reporter: Clay McDonald Priority: Blocker Very large jobs (7,455 Mappers and 999 Reducers) hang. Jobs run well, but logging to container logs stops after running for 33 hours. The job appears to be hung. The status of the job is RUNNING. No error messages found in logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
[ https://issues.apache.org/jira/browse/YARN-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006705#comment-14006705 ] Hadoop QA commented on YARN-2073: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646425/yarn-2073-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3790//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3790//console This message is automatically generated. FairScheduler starts preempting resources even with free resources on the cluster - Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-2073-0.patch, yarn-2073-1.patch, yarn-2073-2.patch, yarn-2073-3.patch Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006735#comment-14006735 ] Mayank Bansal commented on YARN-2074: - +1 LGTM Thanks, Mayank Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2024) IOException in AppLogAggregatorImpl does not give stacktrace and leaves aggregated TFile in a bad state.
[ https://issues.apache.org/jira/browse/YARN-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2024: -- Issue Type: Sub-task (was: Bug) Parent: YARN-431 IOException in AppLogAggregatorImpl does not give stacktrace and leaves aggregated TFile in a bad state. Key: YARN-2024 URL: https://issues.apache.org/jira/browse/YARN-2024 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Affects Versions: 0.23.10, 2.4.0 Reporter: Eric Payne Priority: Critical Multiple issues were encountered when AppLogAggregatorImpl hit an IOException in AppLogAggregatorImpl#uploadLogsForContainer while aggregating yarn-logs for an application that had very large (150G each) error logs. - An IOException was encountered during the LogWriter#append call, and a message was printed, but no stacktrace was provided. Message: ERROR: Couldn't upload logs for container_n_nnn_nn_nn. Skipping this container. - After the IOException, the TFile is in a bad state, so subsequent calls to LogWriter#append fail with the following stacktrace: 2014-04-16 13:29:09,772 [LogAggregationService #17907] ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[LogAggregationService #17907,5,main] threw an Exception. java.lang.IllegalStateException: Incorrect state to start a new key: IN_VALUE at org.apache.hadoop.io.file.tfile.TFile$Writer.prepareAppendKey(TFile.java:528) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogWriter.append(AggregatedLogFormat.java:262) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainer(AppLogAggregatorImpl.java:128) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:164) ... - At this point, the yarn-logs cleaner still thinks the thread is aggregating, so the huge yarn-logs never get cleaned up for that application. -- This message was sent by Atlassian JIRA (v6.2#6252)
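The first problem above is simply that the exception object is not handed to the logger. A minimal sketch of the logging side of a fix, under the assumption that the aggregator can abandon the writer once the TFile has thrown (the Appender interface and all names here are illustrative, not the actual AppLogAggregatorImpl code):
{code}
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public final class LogUploadSketch {
  private static final Log LOG = LogFactory.getLog(LogUploadSketch.class);

  /** Stand-in for LogWriter#append; illustrative only. */
  interface Appender {
    void append(String containerId) throws IOException;
  }

  /** Returns false once the writer has failed, so callers stop reusing it. */
  static boolean tryUpload(Appender writer, String containerId) {
    try {
      writer.append(containerId);
      return true;
    } catch (IOException e) {
      // Pass the Throwable to the logger so the full stacktrace is kept,
      // rather than printing only the one-line message reported above.
      LOG.error("Couldn't upload logs for " + containerId
          + ". Skipping this container.", e);
      // After an IOException the underlying TFile is in a bad state
      // (IllegalStateException on the next append), so the writer must
      // not be reused for later containers.
      return false;
    }
  }
}
{code}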
[jira] [Commented] (YARN-2082) Support for alternative log aggregation mechanism
[ https://issues.apache.org/jira/browse/YARN-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006753#comment-14006753 ] Vinod Kumar Vavilapalli commented on YARN-2082: --- We should also consider some scalable solutions on HDFS itself - post-processing the logs automatically to reduce the file count, and maybe NMs forming a tree of aggregation (with network copy of logs) before hitting HDFS. IAC, the pluggability is sort of a dup of the proposal at YARN-1440 (albeit for a different reason)? Support for alternative log aggregation mechanism - Key: YARN-2082 URL: https://issues.apache.org/jira/browse/YARN-2082 Project: Hadoop YARN Issue Type: New Feature Reporter: Ming Ma I will post a more detailed design later. Here is a brief summary; I would like to get early feedback. Problem Statement: The current implementation of log aggregation creates one HDFS file for each {application, nodemanager} pair. These files are relatively small, in the range of 1-2 MB. In a large cluster with lots of applications and many nodemanagers, this ends up creating lots of small files in HDFS, which puts pressure on the HDFS NN in the following ways. 1. It increases NN memory usage. This is mitigated by having the history server delete old log files in HDFS. 2. Runtime RPC load on HDFS. Each log aggregation file introduces several NN RPCs such as create, getAdditionalBlock, complete, and rename. When the cluster is busy, this RPC load has an impact on NN performance. In addition, to support non-MR applications on YARN, we might need to support aggregation for long-running applications. Design choices: 1. Don't aggregate all the logs, as in YARN-221. 2. Create a dedicated HDFS namespace used only for log aggregation. 3. Write logs to some key-value store like HBase. HBase's RPC load on the NN will be much lower. 4. Decentralize the application-level log aggregation to NMs. All logs for a given application are aggregated first by a dedicated NM before they are pushed to HDFS. 5. Have NMs aggregate logs on a regular basis; each of these log files will have data from different applications, and there needs to be some index for quick lookup. Proposal: 1. Make YARN log aggregation pluggable for both the read and write paths. Note that Hadoop FileSystem provides an abstraction, and we could ask alternative log aggregators to implement a compatible FileSystem, but that seems to be overkill. 2. Provide a log aggregation plugin that writes to HBase. The schema design needs to support efficient reads on a per-application as well as per-application+container basis; in addition, it shouldn't create hotspots in a cluster where certain users might create more jobs than others. For example, we can use hash($user + $applicationId) + containerId as the row key. -- This message was sent by Atlassian JIRA (v6.2#6252)
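For the row-key proposal in point 2, a minimal sketch of what hash($user + $applicationId) + containerId could look like (MD5 is an arbitrary choice here; any uniformly distributed hash would do):
{code}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class LogRowKeySketch {
  /**
   * Illustrative row key: a fixed-length hash of user + applicationId,
   * followed by the containerId. The hash prefix spreads heavy users
   * across HBase regions (avoiding hotspots) while keeping all rows of
   * one application under a common prefix for per-application scans;
   * the containerId suffix enables per-application+container reads.
   */
  static byte[] rowKey(String user, String applicationId, String containerId)
      throws NoSuchAlgorithmException {
    byte[] prefix = MessageDigest.getInstance("MD5")
        .digest((user + applicationId).getBytes(StandardCharsets.UTF_8));
    byte[] suffix = containerId.getBytes(StandardCharsets.UTF_8);
    byte[] key = new byte[prefix.length + suffix.length];
    System.arraycopy(prefix, 0, key, 0, prefix.length);
    System.arraycopy(suffix, 0, key, prefix.length, suffix.length);
    return key;
  }
}
{code}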
[jira] [Commented] (YARN-1545) [Umbrella] Prevent DoS of YARN components by putting in limits
[ https://issues.apache.org/jira/browse/YARN-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006754#comment-14006754 ] Hong Zhiguo commented on YARN-1545: --- You mean we should define upper bounds on the number or length of fields inside the messages? Should these bounds be configurable, or pre-defined as constants? And what about the rate of messages? For example, a bad client could issue getApplications queries at its full speed. [Umbrella] Prevent DoS of YARN components by putting in limits -- Key: YARN-1545 URL: https://issues.apache.org/jira/browse/YARN-1545 Project: Hadoop YARN Issue Type: Improvement Reporter: Vinod Kumar Vavilapalli I did a pass and found many places that can cause DoS on various YARN services. Need to fix them. -- This message was sent by Atlassian JIRA (v6.2#6252)
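To make the configurable-vs-constant question concrete, here is a sketch of one bound done both ways: a constant default that a Configuration key can override (both the key name and the default are made up for illustration; nothing in YARN-1545 defines them yet):
{code}
import org.apache.hadoop.conf.Configuration;

public final class RequestLimitSketch {
  // Hypothetical names; YARN has not settled on any such key or default.
  static final String MAX_FIELD_LENGTH_KEY =
      "yarn.resourcemanager.max-request-field-length";
  static final int MAX_FIELD_LENGTH_DEFAULT = 1024;

  /** Reject over-long request fields before the RM does any work on them. */
  static void checkFieldLength(Configuration conf, String field) {
    int max = conf.getInt(MAX_FIELD_LENGTH_KEY, MAX_FIELD_LENGTH_DEFAULT);
    if (field != null && field.length() > max) {
      throw new IllegalArgumentException(
          "Request field length " + field.length()
          + " exceeds the configured limit of " + max);
    }
  }
}
{code}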
[jira] [Updated] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
[ https://issues.apache.org/jira/browse/YARN-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2073: --- Attachment: yarn-2073-4.patch Thanks for the review, Sandy. Updated the patch to reflect your suggestions, except for the test refactoring. For the tests, it was easier to split them, and I think that is the right direction going forward. If you don't mind, I would like to leave the patch as is. FairScheduler starts preempting resources even with free resources on the cluster - Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-2073-0.patch, yarn-2073-1.patch, yarn-2073-2.patch, yarn-2073-3.patch, yarn-2073-4.patch Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2049) Delegation token stuff for the timeline sever
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006774#comment-14006774 ] Hadoop QA commented on YARN-2049: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646431/YARN-2049.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.client.TestRMAdminCLI {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3791//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3791//console This message is automatically generated. Delegation token stuff for the timeline sever - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, YARN-2049.4.patch, YARN-2049.5.patch, YARN-2049.6.patch, YARN-2049.7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006776#comment-14006776 ] Zhijie Shen commented on YARN-1936: --- Vinod, thanks for the review. See my responses below: bq. Make the event-put as one of the options -put Good point. I made use of CommandLine to build a simple CLI. bq. Add delegation token only if timeline-service is enabled. Added the check. bq. Also move this main to TimelineClientImpl Moved. bq. selectToken() can use a TimelineDelegationTokenSelector to find the token? Used the selector instead, and did the required refactoring. bq. Can we add a simple test to validate the addition of the Delegation Token to the client credentials? Added a test case. Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1936.1.patch, YARN-1936.2.patch, YARN-1936.3.patch TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
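On the selectToken() point, the standard Hadoop TokenSelector contract is to scan the credentials for a token whose kind and service match the target. A minimal stand-in sketch (the KIND text is an assumption; the real TimelineDelegationTokenSelector defines its own token kind):
{code}
import java.util.Collection;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public final class TimelineTokenSelectSketch {
  // Illustrative token kind; the actual selector knows the real one.
  static final Text KIND = new Text("TIMELINE_DELEGATION_TOKEN");

  /** Pick the token matching the timeline server's kind and service. */
  static Token<? extends TokenIdentifier> selectToken(
      Text service, Collection<Token<? extends TokenIdentifier>> tokens) {
    for (Token<? extends TokenIdentifier> token : tokens) {
      if (KIND.equals(token.getKind()) && service.equals(token.getService())) {
        return token;
      }
    }
    return null;
  }
}
{code}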
[jira] [Updated] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1936: -- Attachment: YARN-1936.3.patch Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1936.1.patch, YARN-1936.2.patch, YARN-1936.3.patch TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1545) [Umbrella] Prevent DoS of YARN components by putting in limits
[ https://issues.apache.org/jira/browse/YARN-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006777#comment-14006777 ] Vinod Kumar Vavilapalli commented on YARN-1545: --- I covered the details on the individual tickets - it's mostly about bounding buffers, lists, etc. When I filed this, I was only focusing on application-level stuff. A bad client firing off RPCs in rapid succession can and should be addressed in the RPC layer itself, IMO. [Umbrella] Prevent DoS of YARN components by putting in limits -- Key: YARN-1545 URL: https://issues.apache.org/jira/browse/YARN-1545 Project: Hadoop YARN Issue Type: Improvement Reporter: Vinod Kumar Vavilapalli I did a pass and found many places that can cause DoS on various YARN services. Need to fix them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006778#comment-14006778 ] Tsuyoshi OZAWA commented on YARN-1474: -- [~kkambatl], could you kick the Jenkins and check the latest patch? Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1938) Kerberos authentication for the timeline server
[ https://issues.apache.org/jira/browse/YARN-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006801#comment-14006801 ] Hudson commented on YARN-1938: -- FAILURE: Integrated in Hadoop-Yarn-trunk #563 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/563/]) YARN-1938. Added kerberos login for the Timeline Server. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1596710) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java Kerberos authentication for the timeline server --- Key: YARN-1938 URL: https://issues.apache.org/jira/browse/YARN-1938 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.5.0 Attachments: YARN-1938.1.patch, YARN-1938.2.patch, YARN-1938.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2089) FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations
[ https://issues.apache.org/jira/browse/YARN-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006796#comment-14006796 ] Hudson commented on YARN-2089: -- FAILURE: Integrated in Hadoop-Yarn-trunk #563 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/563/]) YARN-2089. FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations. (Zhihai Xu via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1596765) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations --- Key: YARN-2089 URL: https://issues.apache.org/jira/browse/YARN-2089 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.0 Reporter: Anubhav Dhoot Assignee: zhihai xu Labels: newbie Fix For: 2.5.0 Attachments: yarn-2089.patch We should mark QueuePlacementPolicy and QueuePlacementRule with audience annotations @Private @Unstable -- This message was sent by Atlassian JIRA (v6.2#6252)
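The change itself is small: both classes get Hadoop's standard audience and stability annotations. A sketch of what the annotated class looks like (class body elided):
{code}
import org.apache.hadoop.classification.InterfaceAudience.Private;
import org.apache.hadoop.classification.InterfaceStability.Unstable;

// @Private/@Unstable marks the class as internal to YARN: downstream
// code should not depend on it, and its API may change without notice.
@Private
@Unstable
public abstract class QueuePlacementRule {
  // ... existing rule implementation ...
}
{code}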
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006797#comment-14006797 ] Hudson commented on YARN-2017: -- FAILURE: Integrated in Hadoop-Yarn-trunk #563 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/563/]) YARN-2017. Merged some of the common scheduler code. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1596753) * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/ResourceSchedulerWrapper.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplication.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerNode.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSSchedulerNode.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java *
[jira] [Commented] (YARN-1962) Timeline server is enabled by default
[ https://issues.apache.org/jira/browse/YARN-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006798#comment-14006798 ] Hudson commented on YARN-1962: -- FAILURE: Integrated in Hadoop-Yarn-trunk #563 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/563/]) YARN-2081. Fixed TestDistributedShell failure after YARN-1962. Contributed by Zhiguo Hong. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1596724) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java Timeline server is enabled by default - Key: YARN-1962 URL: https://issues.apache.org/jira/browse/YARN-1962 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Mohammad Kamrul Islam Assignee: Mohammad Kamrul Islam Fix For: 2.4.1 Attachments: YARN-1962.1.patch, YARN-1962.2.patch Since the Timeline server is not yet mature and secure, enabling it by default might create some confusion. We were playing with 2.4.0 and found a lot of exceptions for the distributed shell example related to connection-refused errors. Btw, we didn't run the TS because it is not secured yet, although it is possible to explicitly turn it off through the yarn-site config. In my opinion, this extra change for this new service is not worthwhile at this point. This JIRA is to turn it off by default. If there is agreement, I can put up a simple patch for this. {noformat} 14/04/17 23:24:33 ERROR impl.TimelineClientImpl: Failed to get the response from the timeline server. com.sun.jersey.api.client.ClientHandlerException: java.net.ConnectException: Connection refused at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingEntities(TimelineClientImpl.java:131) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:104) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1072) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:515) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:281) Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at sun.net.NetworkClient.doConnect(NetworkClient.java:180) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) ... (the same ClientHandlerException is then logged again, verbatim) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
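Until the default flips, the service can be disabled explicitly. A sketch of the programmatic equivalent of setting yarn.timeline-service.enabled=false in yarn-site.xml (assuming the YarnConfiguration constant maps to that key):
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public final class DisableTimelineSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Same effect as yarn.timeline-service.enabled=false in yarn-site.xml:
    // clients such as the distributed shell AM then skip publishing
    // timeline events instead of retrying a refused connection.
    conf.setBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED, false);
  }
}
{code}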
[jira] [Commented] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext
[ https://issues.apache.org/jira/browse/YARN-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006794#comment-14006794 ] Hudson commented on YARN-2050: -- FAILURE: Integrated in Hadoop-Yarn-trunk #563 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/563/]) YARN-2050. Fix LogCLIHelpers to create the correct FileContext. Contributed by Ming Ma (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1596310) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogCLIHelpers.java Fix LogCLIHelpers to create the correct FileContext --- Key: YARN-2050 URL: https://issues.apache.org/jira/browse/YARN-2050 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Assignee: Ming Ma Fix For: 3.0.0, 2.5.0 Attachments: YARN-2050-2.patch, YARN-2050.patch LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus the FileContext created isn't necessarily the FileContext for remote log. -- This message was sent by Atlassian JIRA (v6.2#6252)
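The direction of the fix is visible from the description: derive the FileContext from the remote log directory's own URI rather than the process default. A sketch (the method and parameter names are illustrative, not the committed patch):
{code}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.UnsupportedFileSystemException;

public final class RemoteLogFileContextSketch {
  /**
   * FileContext.getFileContext() with no arguments binds to the default
   * filesystem, which need not be the one holding the aggregated logs.
   * Qualifying by the remote log dir's URI picks the right filesystem.
   */
  static FileContext remoteLogContext(Path remoteAppLogDir, Configuration conf)
      throws UnsupportedFileSystemException {
    URI remoteUri = remoteAppLogDir.toUri();
    return FileContext.getFileContext(remoteUri, conf);
  }
}
{code}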