[jira] [Updated] (YARN-3816) [Aggregation] App-level Aggregation for YARN system metrics

2015-09-03 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-3816:
-
Attachment: YARN-3816-YARN-2928-v2.3.patch

Updated the patch to v2.3 to fix the findbugs issues.

> [Aggregation] App-level Aggregation for YARN system metrics
> ---
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application-level aggregation of Timeline data:
> - To present end users with aggregated state for each application, including: 
> resource (CPU, memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps both while 
> they are running and after they are done.
> - Framework-specific metrics, e.g. HDFS_BYTES_READ, should also be 
> aggregated to show framework-level state in detail.
> - Aggregation at other levels (Flow/User/Queue) can be done more efficiently 
> on top of application-level aggregations than on raw entity-level data, since 
> far fewer rows need to be scanned (after filtering out non-aggregated 
> entities such as events, configurations, etc.).
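
As a rough, standalone illustration of the roll-up described above (this is not the timeline service API; the class and field names below are invented for the example), summing each metric across all containers of the same application:

{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch of app-level aggregation: roll container metrics up per application.
public class AppLevelAggregationSketch {

  static class ContainerMetric {
    final String appId;
    final String metricName;  // e.g. "MEMORY_MB", "CPU_VCORES", "HDFS_BYTES_READ"
    final long value;

    ContainerMetric(String appId, String metricName, long value) {
      this.appId = appId;
      this.metricName = metricName;
      this.value = value;
    }
  }

  // Sum each metric across all containers of the same application.
  static Map<String, Map<String, Long>> aggregateByApp(List<ContainerMetric> metrics) {
    Map<String, Map<String, Long>> byApp = new HashMap<>();
    for (ContainerMetric m : metrics) {
      byApp.computeIfAbsent(m.appId, k -> new HashMap<>())
           .merge(m.metricName, m.value, Long::sum);
    }
    return byApp;
  }

  public static void main(String[] args) {
    List<ContainerMetric> metrics = Arrays.asList(
        new ContainerMetric("application_1_0001", "MEMORY_MB", 2048),
        new ContainerMetric("application_1_0001", "MEMORY_MB", 4096),
        new ContainerMetric("application_1_0001", "HDFS_BYTES_READ", 1000));
    // {application_1_0001={MEMORY_MB=6144, HDFS_BYTES_READ=1000}}
    System.out.println(aggregateByApp(metrics));
  }
}
{code}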



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3970) REST api support for Application Priority

2015-09-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729204#comment-14729204
 ] 

Hudson commented on YARN-3970:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2267 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2267/])
YARN-3970. Add REST api support for Application Priority. Contributed by 
Naganarasimha G R. (vvasudev: rev b469ac531af1bdda01a04ae0b8d39218ca292163)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppPriority.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/JAXBContextResolver.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/ApplicationCLI.java


> REST api support for Application Priority
> -
>
> Key: YARN-3970
> URL: https://issues.apache.org/jira/browse/YARN-3970
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Naganarasimha G R
> Fix For: 2.8.0
>
> Attachments: YARN-3970.20150828-1.patch, YARN-3970.20150829-1.patch, 
> YARN-3970.20150831-1.patch, YARN-3970.20150901-1.patch, 
> YARN-3970.20150901-2.patch
>
>
> REST api support for application priority.
> - get/set priority of an application
> - get default priority of a queue
> - get cluster max priority
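
For illustration, a minimal Java client sketch for this API. The endpoint path /ws/v1/cluster/apps/{appid}/priority, the RM address, and the JSON body shape used below are assumptions; check ResourceManagerRest.md in the commit for the exact API.

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class AppPriorityRestClient {

  // Placeholder RM web address; adjust for your cluster.
  private static final String RM = "http://rm-host:8088";

  // GET the current priority of an application (assumed endpoint shape).
  static String getPriority(String appId) throws IOException {
    URL url = new URL(RM + "/ws/v1/cluster/apps/" + appId + "/priority");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/json");
    try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8").useDelimiter("\\A")) {
      return s.hasNext() ? s.next() : "";
    }
  }

  // PUT a new priority with a JSON body containing a single "priority" field.
  static int setPriority(String appId, int priority) throws IOException {
    URL url = new URL(RM + "/ws/v1/cluster/apps/" + appId + "/priority");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/json");
    byte[] body = ("{\"priority\": " + priority + "}").getBytes(StandardCharsets.UTF_8);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body);
    }
    return conn.getResponseCode();  // 200 on success
  }

  public static void main(String[] args) throws IOException {
    String appId = "application_1441234567890_0001";  // placeholder application id
    System.out.println(getPriority(appId));
    System.out.println("PUT status: " + setPriority(appId, 8));
  }
}
{code}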



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-09-03 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729244#comment-14729244
 ] 

Sunil G commented on YARN-4108:
---

Adding to the locality preemption improvement suggestion from [~jlowe]:

While analyzing and preparing a POC for application priority preemption (lower 
priority apps are preempted to give space to higher priority apps within a 
queue), we ran into a similar problem. There were a few lower priority apps, 
and the demand from a higher priority app may be best served by selecting the 
to-be-preempted containers from the nodes which are locally best for that 
higher priority app.
One possible solution is to select the apps/containers to preempt so that they 
match the specific need of the requesting application. In ProportionalCPP, we 
consider the demand as the sum of the resource demand from all needy apps in a 
queue which is under-served. If we resolve the demand at the per-application 
level, then we can consider other characteristics, such as node locality and 
user-limit, of the potential to-be-preempted apps from the target queue. 

But this needs a change in approach from the existing one.

Coming to the approach from Wangda, overall it looks fine. But I feel we may 
end up killing or preempting containers from all applications in the queue 
(worst case), since we currently select the application to be preempted by low 
priority, newly submitted, etc. 
To be precise with an example: if an app needs to launch a container on node1, 
it may need to free up 4GB, and that 4GB will now be reclaimed within a single 
node. So a high priority app can also lose a container (while some low priority 
containers are still running on other nodes). So Jason's point is very 
important: we may need to look across nodes when coming to a scheduling 
decision, and select the best candidate node to preempt from (a node where a 
newly submitted app or a low priority app is running more containers, etc.).

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> This is a sibling JIRA of YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*
> 1) Can handle the case of user-limit preemption
> 2) Can handle resource placement requirements, such as: hard locality 
> (I only want to use rack-1) / node constraints (YARN-3409) / blacklist (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross-user preemption (YARN-2113), 
> cross-application preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4112) YARN HA secondary-RM/NM redirect not parseable by Ambari/curl in a Kerberized deployment

2015-09-03 Thread Andrew Robertson (JIRA)
Andrew Robertson created YARN-4112:
--

 Summary: YARN HA secondary-RM/NM redirect not parseable by 
Ambari/curl in a Kerberized deployment
 Key: YARN-4112
 URL: https://issues.apache.org/jira/browse/YARN-4112
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager
Affects Versions: 2.6.0
Reporter: Andrew Robertson
Priority: Minor


The secondary-RM-to-primary-RM (and NM) redirect issued by YARN in a 
Kerberized cluster is not a proper (HTTP 301 Location-style) redirect, so 
Ambari - which uses curl - does not find the "right" node.  This, in turn, 
triggers alerts in Ambari.

A network dump of the ambari poll against the secondary RM looks like:

Request:
"""
GET /jmx?qry=Hadoop:service=ResourceManager,name=RMNMInfo HTTP/1.1
...
"""

Response:
"""
HTTP/1.1 200 OK
...
Refresh: 3; url=http://{my-primary-rm}:8088/jmx
Content-Length: 106
Server: Jetty(6.1.26.hwx)

This is standby RM. Redirecting to the current active RM:
http://{my-primary-rm}:8088/jmx
"""

Comment from Jonathan Hurley jhur...@hortonworks.com -
---
This is caused by how YARN does HA mode. With two YARN RMs, the standby RM 
returns a 200 response with a JavaScript redirect instead of a 3xx 
redirect. When not using Kerberos, Ambari should be able to parse the 
headers and follow the JS-based redirect. However, on a Kerberized cluster, we 
use curl, which cannot do this. Therefore, requests against the secondary RM 
will return an UNKNOWN response since they did get a 200. I think a few things 
can be improved here:

1) A ticket should be filed for YARN to have its HA mode use a proper 
redirect.
2) Ambari might not want to produce an UNKNOWN response here since it gives a 
false impression that something went wrong.
---

I've also filed AMBARI-12995 with the ambari team.
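
Until YARN returns a real 3xx, one client-side workaround is to read the Refresh header shown in the dump above and follow it manually. A minimal sketch (plain HttpURLConnection, no Kerberos/SPNEGO handling; host names are placeholders):

{code:java}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class StandbyRmRedirectFollower {

  // Follow the Refresh-style redirect that a standby RM returns with a 200.
  static String resolveActiveRmUrl(String requestedUrl) throws IOException {
    HttpURLConnection conn =
        (HttpURLConnection) new URL(requestedUrl).openConnection();
    // e.g. "3; url=http://active-rm:8088/jmx"
    String refresh = conn.getHeaderField("Refresh");
    if (conn.getResponseCode() == 200 && refresh != null && refresh.contains("url=")) {
      return refresh.substring(refresh.indexOf("url=") + 4).trim();
    }
    return requestedUrl;  // already the active RM (or no redirect hint)
  }

  public static void main(String[] args) throws IOException {
    // Placeholder host; substitute the standby RM address.
    System.out.println(resolveActiveRmUrl(
        "http://standby-rm:8088/jmx?qry=Hadoop:service=ResourceManager,name=RMNMInfo"));
  }
}
{code}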



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4110) RMappImpl and RmAppAttemptImpl should override hashcode() & equals()

2015-09-03 Thread nijel (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729078#comment-14729078
 ] 

nijel commented on YARN-4110:
-

Sorry, I attached the wrong patch, so I am deleting it.

> RMappImpl and RmAppAttemptImpl should override hashcode() & equals()
> 
>
> Key: YARN-4110
> URL: https://issues.apache.org/jira/browse/YARN-4110
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: nijel
>
> It is observed that RMAppImpl and RMAppAttemptImpl do not have hashcode() 
> and equals() implementations. These state objects should override these 
> methods.
> # For RMAppImpl, we can make use of ApplicationId#hashcode and 
> ApplicationId#equals.
> # Similarly, for RMAppAttemptImpl, ApplicationAttemptId#hashcode and 
> ApplicationAttemptId#equals.
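
A minimal sketch of what the proposed overrides could look like, delegating to ApplicationId as the description suggests (a simplified standalone class, not the actual RMAppImpl; RMAppAttemptImpl would do the same with ApplicationAttemptId):

{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationId;

// Simplified illustration; the real RMAppImpl has many more fields and methods.
public class RMAppImplSketch {

  private final ApplicationId applicationId;

  public RMAppImplSketch(ApplicationId applicationId) {
    this.applicationId = applicationId;
  }

  // Delegate equality to the application id, as proposed in the description.
  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof RMAppImplSketch)) {
      return false;
    }
    return applicationId.equals(((RMAppImplSketch) obj).applicationId);
  }

  @Override
  public int hashCode() {
    return applicationId.hashCode();
  }
}
{code}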



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3970) REST api support for Application Priority

2015-09-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729169#comment-14729169
 ] 

Hudson commented on YARN-3970:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #329 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/329/])
YARN-3970. Add REST api support for Application Priority. Contributed by 
Naganarasimha G R. (vvasudev: rev b469ac531af1bdda01a04ae0b8d39218ca292163)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/ApplicationCLI.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/JAXBContextResolver.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppPriority.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java


> REST api support for Application Priority
> -
>
> Key: YARN-3970
> URL: https://issues.apache.org/jira/browse/YARN-3970
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Naganarasimha G R
> Fix For: 2.8.0
>
> Attachments: YARN-3970.20150828-1.patch, YARN-3970.20150829-1.patch, 
> YARN-3970.20150831-1.patch, YARN-3970.20150901-1.patch, 
> YARN-3970.20150901-2.patch
>
>
> REST api support for application priority.
> - get/set priority of an application
> - get default priority of a queue
> - get cluster max priority



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3816) [Aggregation] App-level Aggregation for YARN system metrics

2015-09-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729279#comment-14729279
 ] 

Hadoop QA commented on YARN-3816:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  17m 57s | Findbugs (version ) appears to 
be broken on YARN-2928. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 5 new or modified test files. |
| {color:green}+1{color} | javac |   8m  6s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 14s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 36s | The applied patch generated  2 
new checkstyle issues (total was 252, now 253). |
| {color:green}+1{color} | whitespace |   0m 35s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 41s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   5m 13s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 23s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   1m 59s | Tests passed in 
hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |   7m 34s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| {color:green}+1{color} | yarn tests |   1m 36s | Tests passed in 
hadoop-yarn-server-timelineservice. |
| | |  58m 26s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12753997/YARN-3816-YARN-2928-v2.3.patch
 |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / e6afe26 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8998/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8998/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8998/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8998/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| hadoop-yarn-server-timelineservice test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8998/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8998/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8998/console |


This message was automatically generated.

> [Aggregation] App-level Aggregation for YARN system metrics
> ---
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include: 
> resource (CPU, Memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps while they 
> are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient to be based 
> on Application-level aggregations rather than raw entity-level data as much 
> less raws need to scan (with filter out non-aggregated entities, like: 
> events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4081) Add support for multiple resource types in the Resource class

2015-09-03 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-4081:

Attachment: YARN-4081-YARN-3926.003.patch

{quote}
1) Use String instead of URI for the resource key? I think it may be more 
efficient to use String: it will be easier to construct when using it, and it 
uses fewer resources (not tested, but I think it will hold given the number of 
fields in String vs. URI). I can understand the motivation of avoiding 
conflicts in the resource namespace, but I think namespace conflicts are not 
the major use case AND a String can encode a namespace as well.
{quote}

Fixed.

{quote}
2) Relationship between ResourceInformation and ResourceMapEntry: currently 
it's a 1-1 mapping; a ResourceInformation takes its value/unit from a 
ResourceMapEntry, so they overlap and are confusing. I think it's better to 
have one ResourceInformation per resource type: ResourceMapEntry contains 
runtime information, and ResourceInformation contains configured information. 
This also avoids creating a ResourceInformation instance when invoking 
Resource.getResourceInformation().

3) Resource unit: I like the design, which can easily convert an internal value 
to a human-readable value. But I think we may not need to support defining a 
unit in ResourceMapEntry. There are some cons to it:

When we compare resources, we have to convert units, which is extra overhead.
It doesn't make a lot of sense to me to keep an internal unit per resource: we 
should handle it when constructing the Resource (something like 
Resource.newInstance("memory", 12, "GB")) and use the standard unit for 
internal computations.
We can define the standard unit in each "ResourceInformation" if you agree 
with #2.
{quote}

I spoke with Wangda offline and we agree that it makes more sense to do 
performance testing once we have the DRC changes in. Since this patch is going 
into a branch, there's no issue with committing it now and running a full suite 
of performance tests once the DRC changes land.

{quote}
4) Do you think it's better to have a global ResourceInformation map instead of 
storing it in each Resource instance?
{quote}

For now, I'd like to keep it per-resource instance but if it becomes an 
overhead, we can make it a global instance.

{quote}
5) Resource#compareTo/hashCode has debug logging.
{quote}

Fixed.

{quote}
6) It seems unnecessary to instantiate an ArrayList in Resource#compareTo. 
Just traversing the set avoids creating the temporary ArrayList.
{quote}
Good point. I decided to use the size of the set itself as the sort order and 
avoid the issue altogether.
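
To make point 3 concrete, a standalone sketch of normalizing a value to a standard unit at construction time, in the Resource.newInstance("memory", 12, "GB") shape mentioned in the quote. The class, unit table, and method names below are illustrative only, not the patch's ResourceInformation/ResourceMapEntry API:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustration only: normalize to a standard unit (MB for memory) when the
// resource is constructed, so internal comparisons never need unit conversion.
public class SimpleResource {

  private static final Map<String, Long> MEMORY_UNIT_TO_MB = new HashMap<>();
  static {
    MEMORY_UNIT_TO_MB.put("MB", 1L);
    MEMORY_UNIT_TO_MB.put("GB", 1024L);
    MEMORY_UNIT_TO_MB.put("TB", 1024L * 1024L);
  }

  // resource name -> value already converted to the standard unit
  private final Map<String, Long> values = new HashMap<>();

  public static SimpleResource newInstance(String name, long value, String unit) {
    SimpleResource r = new SimpleResource();
    r.setResourceValue(name, value, unit);
    return r;
  }

  public void setResourceValue(String name, long value, String unit) {
    long factor = MEMORY_UNIT_TO_MB.getOrDefault(unit, 1L);
    values.put(name, value * factor);
  }

  public long getResourceValue(String name) {
    return values.getOrDefault(name, 0L);
  }

  public static void main(String[] args) {
    SimpleResource r = SimpleResource.newInstance("memory", 12, "GB");
    System.out.println(r.getResourceValue("memory"));  // 12288 (MB)
  }
}
{code}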

> Add support for multiple resource types in the Resource class
> -
>
> Key: YARN-4081
> URL: https://issues.apache.org/jira/browse/YARN-4081
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: YARN-4081-YARN-3926.001.patch, 
> YARN-4081-YARN-3926.002.patch, YARN-4081-YARN-3926.003.patch
>
>
> For adding support for multiple resource types, we need to add support for 
> this in the Resource class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs

2015-09-03 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729119#comment-14729119
 ] 

Sunil G commented on YARN-2005:
---

Hi [~adhoot],
Thank you for updating the patch. I have a comment here.

{{isWaitingForAMContainer}} is now used in 2 cases: to set the 
{{ContainerType}} and also in the blacklist case. This check is now hit on 
every heartbeat from the AM.

I think it's better to keep a flag called {{amIsStarted}} in 
{{SchedulerApplicationAttempt}}, which can be set from 2 places:
1. {{RMAppAttemptImpl#AMContainerAllocatedTransition}} can call a new scheduler 
API to set the {{amIsStarted}} flag when the AM container is launched and 
registered. We need to pass the ContainerId to this new API to look up the 
attempt object and set the flag.
2. {{AbstractYarnScheduler#recoverContainersOnNode}} can also invoke this API 
to set the flag.

Then we can read the flag directly from {{SchedulerApplicationAttempt}} every 
time a heartbeat call comes from the AM. If we are not doing this in this 
ticket, I can open another ticket for this optimization. Please suggest your 
thoughts.
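
A minimal sketch of the flag being proposed above (field and method names are illustrative, not existing scheduler API):

{code:java}
// Illustrative sketch of the proposal: a flag on the scheduler-side attempt that
// is set once the AM container is allocated (or recovered), so the per-heartbeat
// check becomes a cheap volatile read instead of recomputing
// isWaitingForAMContainer().
public class SchedulerApplicationAttemptSketch {

  private volatile boolean amIsStarted = false;

  // In the proposal this would be driven from
  // RMAppAttemptImpl#AMContainerAllocatedTransition and from
  // AbstractYarnScheduler#recoverContainersOnNode.
  public void markAmStarted() {
    this.amIsStarted = true;
  }

  // Cheap check on every AM heartbeat.
  public boolean isAmStarted() {
    return amIsStarted;
  }
}
{code}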

> Blacklisting support for scheduling AMs
> ---
>
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 0.23.10, 2.4.0
>Reporter: Jason Lowe
>Assignee: Anubhav Dhoot
> Attachments: YARN-2005.001.patch, YARN-2005.002.patch, 
> YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, 
> YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, 
> YARN-2005.008.patch
>
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4110) RMappImpl and RmAppAttemptImpl should override hashcode() & equals()

2015-09-03 Thread nijel (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nijel updated YARN-4110:

Attachment: (was: 01-YARN-4110.patch)

> RMappImpl and RmAppAttemptImpl should override hashcode() & equals()
> 
>
> Key: YARN-4110
> URL: https://issues.apache.org/jira/browse/YARN-4110
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: nijel
>
> It is observed that RMAppImpl and RMAppAttemptImpl do not have hashcode() 
> and equals() implementations. These state objects should override these 
> methods.
> # For RMAppImpl, we can make use of ApplicationId#hashcode and 
> ApplicationId#equals.
> # Similarly, for RMAppAttemptImpl, ApplicationAttemptId#hashcode and 
> ApplicationAttemptId#equals.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3970) REST api support for Application Priority

2015-09-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729104#comment-14729104
 ] 

Hudson commented on YARN-3970:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2288 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2288/])
YARN-3970. Add REST api support for Application Priority. Contributed by 
Naganarasimha G R. (vvasudev: rev b469ac531af1bdda01a04ae0b8d39218ca292163)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/JAXBContextResolver.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/ApplicationCLI.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppPriority.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java


> REST api support for Application Priority
> -
>
> Key: YARN-3970
> URL: https://issues.apache.org/jira/browse/YARN-3970
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Naganarasimha G R
> Fix For: 2.8.0
>
> Attachments: YARN-3970.20150828-1.patch, YARN-3970.20150829-1.patch, 
> YARN-3970.20150831-1.patch, YARN-3970.20150901-1.patch, 
> YARN-3970.20150901-2.patch
>
>
> REST api support for application priority.
> - get/set priority of an application
> - get default priority of a queue
> - get cluster max priority



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-09-03 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729100#comment-14729100
 ] 

Jason Lowe commented on YARN-4108:
--

For the case where we are preempting containers but the other job/queue cannot 
take them because of user limit: that's clearly a bug in the preemption logic 
and not related to the problem of matching up preemption events to pending asks 
so we satisfy the preemption trigger.  We should never be preempting containers 
if a user limit would foil our ability to reassign those resources.  If we are 
then the preemption logic is miscalculating the amount of pending demand that 
triggered the preemption in the first place.

For the case where the pending ask that is triggering the preemption has very 
strict and narrow locality requirements: yeah, that's a tough one.  If the 
locality requirement can be relaxed then it's not too difficult -- by the time 
we preempt we'll have given up on looking for locality.  However if the 
locality requirement cannot be relaxed then preemption could easily thrash 
wildly if the resources that can be preempted do not satisfy the pending ask.  
We would need to be very conscious of the request we're trying to satisfy -- 
preemption may not be able to satisfy the request at all in some cases.

I was thinking along the reservation lines as well.  When we are trying to 
satisfy a request on a busy cluster we already make a reservation on a node.  
When we decide to preempt we can move the request's reservation to the node 
where we decided to preempt containers.  The problem is that we are now 
changing the algorithm for deciding what gets shot -- it used to be least 
amount of work lost, but now with locality introduced into the equation there 
needs to be a weighting of container duration and locality in the mix.

This would be a lot more straightforward if the scheduler wasn't trying to 
peephole optimize by only looking at one node at a time when it schedules.  If 
the scheduler could look across nodes and figure out which node "wins" in terms 
of sufficiently preemptable resources with the lowest cost of preemption then 
it could then send the preemption requests/kills to the containers on that node 
and move the reservation to that node.  Looking at only one node at a time 
means we may have to do "scheduling opportunity" hacks to let it see enough 
nodes to make a good decision.
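
A toy sketch of the cross-node view described above: among nodes with enough preemptable resources to satisfy the pending ask, pick the one with the lowest preemption cost. The class, fields, and the work-lost cost metric are invented for illustration; this is not scheduler code.

{code:java}
import java.util.Arrays;
import java.util.List;

// Toy model: pick the node whose preemptable containers can satisfy the pending
// ask with the least amount of work lost.
public class PreemptionCandidateSelector {

  static class NodeView {
    final String nodeId;
    final long preemptableMemoryMb;  // memory we could reclaim by preempting here
    final long workLostSeconds;      // rough cost: running time of containers we'd kill

    NodeView(String nodeId, long preemptableMemoryMb, long workLostSeconds) {
      this.nodeId = nodeId;
      this.preemptableMemoryMb = preemptableMemoryMb;
      this.workLostSeconds = workLostSeconds;
    }
  }

  /** Returns the node that can satisfy the ask at the lowest cost, or null if none can. */
  static NodeView selectNode(List<NodeView> candidates, long askedMemoryMb) {
    NodeView best = null;
    for (NodeView n : candidates) {
      if (n.preemptableMemoryMb < askedMemoryMb) {
        continue;  // preempting here would not satisfy the pending request at all
      }
      if (best == null || n.workLostSeconds < best.workLostSeconds) {
        best = n;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<NodeView> nodes = Arrays.asList(
        new NodeView("node1", 2048, 300),
        new NodeView("node2", 6144, 120),
        new NodeView("node3", 8192, 900));
    NodeView best = selectNode(nodes, 4096);
    System.out.println(best == null ? "no candidate" : best.nodeId);  // node2
  }
}
{code}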

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> This is a sibling JIRA of YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*
> 1) Can handle the case of user-limit preemption
> 2) Can handle resource placement requirements, such as: hard locality 
> (I only want to use rack-1) / node constraints (YARN-3409) / blacklist (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross-user preemption (YARN-2113), 
> cross-application preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4081) Add support for multiple resource types in the Resource class

2015-09-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729287#comment-14729287
 ] 

Hadoop QA commented on YARN-4081:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  17m 19s | Findbugs (version ) appears to 
be broken on YARN-3926. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 5 new or modified test files. |
| {color:green}+1{color} | javac |   7m 48s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 44s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 33s | The applied patch generated  
91 new checkstyle issues (total was 10, now 101). |
| {color:red}-1{color} | whitespace |   0m 15s | The patch has 2  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 27s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   4m 35s | The patch appears to introduce 3 
new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 24s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   1m 56s | Tests passed in 
hadoop-yarn-common. |
| {color:red}-1{color} | yarn tests |  50m 31s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  96m 57s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-api |
| Failed unit tests | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps |
|   | 
hadoop.yarn.server.resourcemanager.reservation.TestCapacitySchedulerPlanFollower
 |
|   | hadoop.yarn.server.resourcemanager.TestClientRMService |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication |
|   | hadoop.yarn.server.resourcemanager.TestRM |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler |
|   | hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler |
|   | 
hadoop.yarn.server.resourcemanager.reservation.TestFairSchedulerPlanFollower |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched |
|   | hadoop.yarn.server.resourcemanager.security.TestAMRMTokens |
|   | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification |
|   | hadoop.yarn.server.resourcemanager.webapp.dao.TestFairSchedulerQueueInfo |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebAppFairScheduler |
|   | hadoop.yarn.server.resourcemanager.TestAppManager |
|   | hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA |
|   | hadoop.yarn.server.resourcemanager.TestApplicationACLs |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler |
|   | hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebApp |
|   | hadoop.yarn.server.resourcemanager.webapp.TestNodesPage |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12753993/YARN-4081-YARN-3926.003.patch
 |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-3926 / c95993c |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8997/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/8997/artifact/patchprocess/whitespace.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8997/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-api.html
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8997/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8997/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8997/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8997/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8997/console |


This message was automatically generated.

> Add support for multiple resource types in the Resource 

[jira] [Commented] (YARN-4110) RMappImpl and RmAppAttemptImpl should override hashcode() & equals()

2015-09-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729146#comment-14729146
 ] 

Hadoop QA commented on YARN-4110:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  17m  2s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 55s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  9s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 53s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 29s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   1m 31s | The patch appears to introduce 2 
new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  57m 48s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  97m 47s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-server-resourcemanager |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12753981/01-YARN-4110.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / b469ac5 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8996/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8996/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8996/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8996/console |


This message was automatically generated.

> RMappImpl and RmAppAttemptImpl should override hashcode() & equals()
> 
>
> Key: YARN-4110
> URL: https://issues.apache.org/jira/browse/YARN-4110
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: nijel
>
> It is observed that RMAppImpl and RMAppAttemptImpl do not have hashcode() 
> and equals() implementations. These state objects should override these 
> methods.
> # For RMAppImpl, we can make use of ApplicationId#hashcode and 
> ApplicationId#equals.
> # Similarly, for RMAppAttemptImpl, ApplicationAttemptId#hashcode and 
> ApplicationAttemptId#equals.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3970) REST api support for Application Priority

2015-09-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729059#comment-14729059
 ] 

Hudson commented on YARN-3970:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1076 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1076/])
YARN-3970. Add REST api support for Application Priority. Contributed by 
Naganarasimha G R. (vvasudev: rev b469ac531af1bdda01a04ae0b8d39218ca292163)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/JAXBContextResolver.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppPriority.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/ApplicationCLI.java


> REST api support for Application Priority
> -
>
> Key: YARN-3970
> URL: https://issues.apache.org/jira/browse/YARN-3970
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Naganarasimha G R
> Fix For: 2.8.0
>
> Attachments: YARN-3970.20150828-1.patch, YARN-3970.20150829-1.patch, 
> YARN-3970.20150831-1.patch, YARN-3970.20150901-1.patch, 
> YARN-3970.20150901-2.patch
>
>
> REST api support for application priority.
> - get/set priority of an application
> - get default priority of a queue
> - get cluster max priority



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3816) [Aggregation] App-level Aggregation for YARN system metrics

2015-09-03 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729127#comment-14729127
 ] 

Varun Saxena commented on YARN-3816:


[~djp],
bq. Which issue you are raising? I can take a look as well.
I raised HADOOP-12312 around a month back because I faced this issue in 
another JIRA I was handling.
It hasn't been fixed yet.

Until this is fixed, we have to live with the -1 reported for findbugs and run 
findbugs manually to find the warnings.

> [Aggregation] App-level Aggregation for YARN system metrics
> ---
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.patch, 
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application-level aggregation of Timeline data:
> - To present end users with aggregated state for each application, including: 
> resource (CPU, memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps both while 
> they are running and after they are done.
> - Framework-specific metrics, e.g. HDFS_BYTES_READ, should also be 
> aggregated to show framework-level state in detail.
> - Aggregation at other levels (Flow/User/Queue) can be done more efficiently 
> on top of application-level aggregations than on raw entity-level data, since 
> far fewer rows need to be scanned (after filtering out non-aggregated 
> entities such as events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3816) [Aggregation] App-level Aggregation for YARN system metrics

2015-09-03 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729155#comment-14729155
 ] 

Junping Du commented on YARN-3816:
--

I see. I will keep watching this JIRA to make sure it stays active and gets 
resolved ASAP.

> [Aggregation] App-level Aggregation for YARN system metrics
> ---
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application-level aggregation of Timeline data:
> - To present end users with aggregated state for each application, including: 
> resource (CPU, memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps both while 
> they are running and after they are done.
> - Framework-specific metrics, e.g. HDFS_BYTES_READ, should also be 
> aggregated to show framework-level state in detail.
> - Aggregation at other levels (Flow/User/Queue) can be done more efficiently 
> on top of application-level aggregations than on raw entity-level data, since 
> far fewer rows need to be scanned (after filtering out non-aggregated 
> entities such as events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4087) Set YARN_FAIL_FAST to be false by default

2015-09-03 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729422#comment-14729422
 ] 

Anubhav Dhoot commented on YARN-4087:
-

In general, if we are not failing the daemon because the fail-fast flag is 
false, we still need to ensure we are not leaving inconsistent state in the RM, 
e.g. as in YARN-4032. YARN-2019 is the other case, where we did not need to do 
anything. This means every patch from now on that relies on fail-fast not 
crashing the daemon should consider taking corrective action to ensure 
correctness. Does that make sense?

> Set YARN_FAIL_FAST to be false by default
> -
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch
>
>
> Increasingly, I feel setting this property to false makes more sense, 
> especially in production environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4107) Both RM becomes Active if all zookeepers can not connect to active RM

2015-09-03 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729435#comment-14729435
 ] 

Xuan Gong commented on YARN-4107:
-

bq. Is this when using ZKRMStateStore? If yes, the store should take care of 
transitioning the RM to standby via its FencingThread.

The main problem here is that the old active RM lost its connections to all 
the ZKs. Even the FencingThread still needs to reconnect to ZK, doesn't it?

> Both RM becomes Active if all zookeepers can not connect to active RM
> -
>
> Key: YARN-4107
> URL: https://issues.apache.org/jira/browse/YARN-4107
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-4107.1.patch
>
>
> Steps to reproduce:
> 1) Run small randomwriter applications in the background
> 2) rm1 is active and rm2 is standby 
> 3) Disconnect all ZKs from the active RM
> 4) Check the status of both RMs. Both of them are in the active state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4107) Both RM becomes Active if all zookeepers can not connect to active RM

2015-09-03 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729436#comment-14729436
 ] 

Xuan Gong commented on YARN-4107:
-

bq. Is it caused by network split brain ?

We did it on purpose, using iptables to block all the ZK connections to the 
old active RM.

> Both RM becomes Active if all zookeepers can not connect to active RM
> -
>
> Key: YARN-4107
> URL: https://issues.apache.org/jira/browse/YARN-4107
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-4107.1.patch
>
>
> Steps to reproduce:
> 1) Run small randomwriter applications in the background
> 2) rm1 is active and rm2 is standby 
> 3) Disconnect all ZKs from the active RM
> 4) Check the status of both RMs. Both of them are in the active state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4081) Add support for multiple resource types in the Resource class

2015-09-03 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-4081:

Attachment: YARN-4081-YARN-3926.004.patch

Uploaded a new patch to address the findbugs and checkstyle issues. The test 
failures are unrelated to the patch. They're due to some issues with unzipping 
some files.

> Add support for multiple resource types in the Resource class
> -
>
> Key: YARN-4081
> URL: https://issues.apache.org/jira/browse/YARN-4081
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: YARN-4081-YARN-3926.001.patch, 
> YARN-4081-YARN-3926.002.patch, YARN-4081-YARN-3926.003.patch, 
> YARN-4081-YARN-3926.004.patch
>
>
> For adding support for multiple resource types, we need to add support for 
> this in the Resource class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4103) RM WebServices missing scheme for appattempts logLinks

2015-09-03 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729503#comment-14729503
 ] 

Jonathan Eagles commented on YARN-4103:
---

You can do it if you like. Please ensure it makes it into trunk, branch-2, and 
branch-2.7.

> RM WebServices missing scheme for appattempts logLinks
> --
>
> Key: YARN-4103
> URL: https://issues.apache.org/jira/browse/YARN-4103
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Attachments: YARN-4103.1.patch, YARN-4103.2.patch, YARN-4103.3.patch
>
>
> All App Attempt Info logLinks begin with "//" instead of "http://" or 
> "https://".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4081) Add support for multiple resource types in the Resource class

2015-09-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729660#comment-14729660
 ] 

Hadoop QA commented on YARN-4081:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  17m  6s | Findbugs (version ) appears to 
be broken on YARN-3926. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 5 new or modified test files. |
| {color:green}+1{color} | javac |   8m  1s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 56s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 34s | The applied patch generated  3 
new checkstyle issues (total was 10, now 5). |
| {color:red}-1{color} | whitespace |   0m 16s | The patch has 3  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 28s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 39s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 25s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   1m 58s | Tests passed in 
hadoop-yarn-common. |
| {color:red}-1{color} | yarn tests |  50m 28s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  97m 19s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens |
|   | hadoop.yarn.server.resourcemanager.TestRMRestart |
|   | hadoop.yarn.server.resourcemanager.TestAppManager |
|   | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart |
|   | hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched |
|   | 
hadoop.yarn.server.resourcemanager.reservation.TestFairSchedulerPlanFollower |
|   | hadoop.yarn.server.resourcemanager.TestRMHA |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerQueueACLs |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebApp |
|   | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs
 |
|   | 
hadoop.yarn.server.resourcemanager.reservation.TestCapacitySchedulerPlanFollower
 |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebAppFairScheduler |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerDynamicBehavior
 |
|   | hadoop.yarn.server.resourcemanager.TestApplicationACLs |
|   | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer |
|   | hadoop.yarn.server.resourcemanager.webapp.dao.TestFairSchedulerQueueInfo |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerNodeLabelUpdate
 |
|   | hadoop.yarn.server.resourcemanager.TestRM |
|   | hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler |
|   | hadoop.yarn.server.resourcemanager.webapp.TestNodesPage |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps |
|   | hadoop.yarn.server.resourcemanager.TestClientRMService |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12754042/YARN-4081-YARN-3926.004.patch
 |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-3926 / c95993c |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/9001/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/9001/artifact/patchprocess/whitespace.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9001/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9001/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9001/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9001/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux 

[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs

2015-09-03 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729556#comment-14729556
 ] 

Sangjin Lee commented on YARN-4074:
---

Just to be clear, the current POC patch already handles the null case, and I'm 
going to update it to check for negative values. Is that reasonable?

> [timeline reader] implement support for querying for flows and flow runs
> 
>
> Key: YARN-4074
> URL: https://issues.apache.org/jira/browse/YARN-4074
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-4074-YARN-2928.POC.001.patch, 
> YARN-4074-YARN-2928.POC.002.patch
>
>
> Implement support for querying for flows and flow runs.
> We should be able to query for the most recent N flows, etc.
> This includes changes to the {{TimelineReader}} API if necessary, as well as 
> implementation of the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables

2015-09-03 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729511#comment-14729511
 ] 

Vrushali C commented on YARN-3901:
--


Thanks [~gtCarrera9] for the review. Let me try to address your questions 
above. 

bq. Name of Attribute seems to be quite general. Maybe we want something more 
specific? From my understanding, Attribute acts as the "command" (as the 
meaning in design pattern) of the aggregation?

Yes, the attribute indicates what action needs to be taken in the 
aggregation/reading step. What name do you recommend? 

bq. TimelineSchemaCreator may conflict with YARN-4102. I'm fine with either 
order to put them in.
Yes, I thought about that, but the patch in YARN-4102 is not committed yet, so I 
could not rebase. I am good with rebasing the YARN-3901 patch if YARN-4102 gets 
in.

bq. Are we assuming there will be at most two attributes for each column 
prefix? In FlowScanner we're only dealing with two attributes, one from 
compaction one from operations. But in FlowActivityColumnPrefix we're assuming 
there's a list of attributes?
No, there can be any number of attributes for a column prefix. Currently MIN, 
MAX and SUM happen to be exclusive in the sense that if you want a min for the 
start time, it's unlikely that you also want to be SUMing up the start times. 
FlowScanner looks for the application id from the aggregation compaction 
dimensions and for MIN/MAX/SUM from the aggregation operations. The FlowScanner 
class is very different from FlowActivityColumnPrefix. FlowActivityColumnPrefix, 
or any class that generates a Put, will deal with a list of attributes. 

bq. What is our plan on FlowActivityColumnPrefix#IN_PROGRESS_TIME? 
Yes, this is the timestamp that needs to be put into the flow activity table for 
all (long-)running applications. If an application in a flow starts on, say, day 
1 and runs through day 2 and day 3 and ends on day 4, then the flow activity 
table needs to have an entry for this flow for day 2 and day 3. This is the 
in-progress time of that application; it is the TBD part being thought over in 
YARN-4069. We need to think about whether we want the RM to write it, or the App 
Master, or something else offline. 

bq. In FlowScanner, after aggregation (in nextInternal) we're simply adding 
aggregated data as a Cell. However, I haven't found where we're guaranteeing the 
new node is not aggregated again (and we create another new cell for the 
aggregation result). Are we doing this deliberately, or am I missing anything 
here?
Hmm, not sure I got the question, but let me try to explain what the FlowScanner 
should be doing. It reads each cell one by one. Say, for the start time column, 
it reads the cells; for a flow, we want the lowest value as the start time of 
the flow, hence these cells carry a MIN tag. So nextInternal will return one 
cell with the min value for the start time column. Similarly for MAX, and for 
SUM it sums up the cell values. Hope this helps.
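
To make that concrete, here is a minimal plain-Java sketch (not the actual HBase 
coprocessor code from the patch; the enum and cell class below are simplified 
stand-ins) of how cells tagged MIN/MAX/SUM for one column collapse into the 
single value the scanner returns:

{code}
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for the tag and cell types used in the patch.
enum AggregationOperation { MIN, MAX, SUM }

class TaggedCell {
  final long value;   // e.g. a start time or a metric value
  TaggedCell(long value) { this.value = value; }
}

class CellCollapseSketch {
  // Collapse all cells of one column, tagged with the given operation,
  // into the single value that nextInternal would return.
  static long collapse(AggregationOperation op, List<TaggedCell> cells) {
    long result = (op == AggregationOperation.SUM) ? 0
        : (op == AggregationOperation.MIN) ? Long.MAX_VALUE : Long.MIN_VALUE;
    for (TaggedCell cell : cells) {
      switch (op) {
        case MIN: result = Math.min(result, cell.value); break;
        case MAX: result = Math.max(result, cell.value); break;
        case SUM: result += cell.value; break;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<TaggedCell> startTimes = Arrays.asList(
        new TaggedCell(20L), new TaggedCell(10L), new TaggedCell(30L));
    // A MIN-tagged column such as start time collapses to the smallest value.
    System.out.println(collapse(AggregationOperation.MIN, startTimes)); // 10
  }
}
{code}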

Also, I will double check the formatting related comments and update as 
necessary. Appreciate the review! 



> Populate flow run data in the flow_run & flow activity tables
> -
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.

[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-09-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3641:
--
Fix Version/s: 2.6.1

Pulled this into 2.6.1. Ran compilation before the push. Patch applied cleanly.


> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
> in stopping NM's sub-services.
> ---
>
> Key: YARN-3641
> URL: https://issues.apache.org/jira/browse/YARN-3641
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, rolling upgrade
>Affects Versions: 2.6.0
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1, 2.7.1
>
> Attachments: YARN-3641.patch
>
>
> If the NM's services do not get stopped properly, we cannot start the NM with 
> work-preserving NM restart enabled. The exception is as follows:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
> /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
> temporarily unavailable
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
> Resource temporarily unavailable
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   ... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
> (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NodeManager at 
> c6403.ambari.apache.org/192.168.64.103
> /
> {noformat}
> The related code is as below in NodeManager.java:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
> if (isStopping.getAndSet(true)) {
>   return;
> }
> super.serviceStop();
> stopRecoveryStore();
> DefaultMetricsSystem.shutdown();
>   }
> {code}
> We can see that we stop all of the NM's registered services (NodeStatusUpdater, 
> LogAggregationService, ResourceLocalizationService, etc.) first. Any of these 
> services getting stopped with an exception could cause stopRecoveryStore() to 
> be skipped, which means the levelDB store does not get closed. So the next time 
> the NM starts, it will fail with the exception above. 
> We should put stopRecoveryStore(); in a finally block.
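
As a rough sketch of the fix described in the last line above (the exact ordering 
in the committed patch may differ), serviceStop() would wrap the remaining 
shutdown steps so that stopRecoveryStore() always runs:

{code}
  @Override
  protected void serviceStop() throws Exception {
    if (isStopping.getAndSet(true)) {
      return;
    }
    try {
      super.serviceStop();
      DefaultMetricsSystem.shutdown();
    } finally {
      // Always close the leveldb recovery store, even if stopping a
      // sub-service threw, so the next NM start can grab the LOCK file.
      stopRecoveryStore();
    }
  }
{code}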



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3526) ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster

2015-09-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3526:
--
Fix Version/s: 2.6.1

Pulled this into 2.6.1. Ran compilation and TestRMFailover before the push. 
Patch applied cleanly.

> ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster
> -
>
> Key: YARN-3526
> URL: https://issues.apache.org/jira/browse/YARN-3526
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, webapp
>Affects Versions: 2.6.0
> Environment: Red Hat Enterprise Linux Server 6.4 
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>  Labels: 2.6.1-candidate, BB2015-05-TBR
> Fix For: 2.6.1, 2.7.1
>
> Attachments: YARN-3526.001.patch, YARN-3526.002.patch
>
>
> On a QJM HA cluster, when viewing the RM web UI to track job status, it shows
> This is standby RM. Redirecting to the current active RM: 
> http://:8088/proxy/application_1427338037905_0008/mapreduce
> It refreshes every 3 sec but never goes to the correct tracking page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs

2015-09-03 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729515#comment-14729515
 ] 

Varun Saxena commented on YARN-4074:


I guess I can even handle the 2nd point in YARN-4075. I can verify the limit 
and, if it is 0 or negative, forward null to the storage layer. If it is null, 
DEFAULT_LIMIT will be applied.
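
A tiny illustrative sketch of that check (the method name is hypothetical, not 
the actual YARN-4075 code): a non-positive limit becomes null so the storage 
layer falls back to DEFAULT_LIMIT.

{code}
class LimitSketch {
  // Hypothetical helper: normalize a user-supplied limit before passing it
  // to the storage layer; null downstream means "apply DEFAULT_LIMIT".
  static Long normalizeLimit(Long limit) {
    return (limit == null || limit <= 0) ? null : limit;
  }

  public static void main(String[] args) {
    System.out.println(normalizeLimit(-5L));  // null -> default limit applies
    System.out.println(normalizeLimit(20L));  // 20
  }
}
{code}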

> [timeline reader] implement support for querying for flows and flow runs
> 
>
> Key: YARN-4074
> URL: https://issues.apache.org/jira/browse/YARN-4074
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-4074-YARN-2928.POC.001.patch, 
> YARN-4074-YARN-2928.POC.002.patch
>
>
> Implement support for querying for flows and flow runs.
> We should be able to query for the most recent N flows, etc.
> This includes changes to the {{TimelineReader}} API if necessary, as well as 
> implementation of the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs

2015-09-03 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729558#comment-14729558
 ] 

Varun Saxena commented on YARN-4074:


[~sjlee0], yeah, it handles the null case. I meant we can handle negative values 
even at the REST layer (as part of YARN-4075), i.e. if the limit is negative I 
can forward null to the storage layer, which would mean the default limit being 
applied. 

> [timeline reader] implement support for querying for flows and flow runs
> 
>
> Key: YARN-4074
> URL: https://issues.apache.org/jira/browse/YARN-4074
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-4074-YARN-2928.POC.001.patch, 
> YARN-4074-YARN-2928.POC.002.patch
>
>
> Implement support for querying for flows and flow runs.
> We should be able to query for the most recent N flows, etc.
> This includes changes to the {{TimelineReader}} API if necessary, as well as 
> implementation of the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs

2015-09-03 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729560#comment-14729560
 ] 

Varun Saxena commented on YARN-4074:


It's fine, though, if you are handling negatives as part of YARN-4074.

> [timeline reader] implement support for querying for flows and flow runs
> 
>
> Key: YARN-4074
> URL: https://issues.apache.org/jira/browse/YARN-4074
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-4074-YARN-2928.POC.001.patch, 
> YARN-4074-YARN-2928.POC.002.patch
>
>
> Implement support for querying for flows and flow runs.
> We should be able to query for the most recent N flows, etc.
> This includes changes to the {{TimelineReader}} API if necessary, as well as 
> implementation of the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4103) RM WebServices missing scheme for appattempts logLinks

2015-09-03 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729561#comment-14729561
 ] 

Varun Vasudev commented on YARN-4103:
-

Sounds good. I'll commit it tomorrow morning IST.

> RM WebServices missing scheme for appattempts logLinks
> --
>
> Key: YARN-4103
> URL: https://issues.apache.org/jira/browse/YARN-4103
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Attachments: YARN-4103.1.patch, YARN-4103.2.patch, YARN-4103.3.patch
>
>
> all App Attempt Info logLinks begin with "//" instead of "http://" or 
> "https://"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4044) Running applications information changes such as movequeue is not published to TimeLine server

2015-09-03 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729444#comment-14729444
 ] 

Naganarasimha G R commented on YARN-4044:
-

[~sunilg], as per the offline discussion we had, currently you are populating 
the entityInfo object with the modified priority or queue, which is why ATS is 
able to provide the updated info. So yes, both of my queries get answered with 
it. But one point I can think of here: would it be better to capture the 
modified priority and queue (if any) as part of eventInfo, so that the initial 
submission information as well as the modification information will be 
captured? That might be helpful later in analysis. 

> Running applications information changes such as movequeue is not published 
> to TimeLine server
> --
>
> Key: YARN-4044
> URL: https://issues.apache.org/jira/browse/YARN-4044
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-4044.patch, 0002-YARN-4044.patch
>
>
> SystemMetricsPublisher need to expose an appUpdated api to update any change 
> for a running application.
> Events can be 
>   - change of queue for a running application.
> - change of application priority for a running application.
> This ticket intends to handle both RM and timeline side changes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4103) RM WebServices missing scheme for appattempts logLinks

2015-09-03 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729389#comment-14729389
 ] 

Varun Vasudev commented on YARN-4103:
-

+1 for the latest patch. Will you go ahead and commit it or would you like me 
to do it?

> RM WebServices missing scheme for appattempts logLinks
> --
>
> Key: YARN-4103
> URL: https://issues.apache.org/jira/browse/YARN-4103
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Attachments: YARN-4103.1.patch, YARN-4103.2.patch, YARN-4103.3.patch
>
>
> all App Attempt Info logLinks begin with "//" instead of "http://" or 
> "https://"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4102) Add a "skip existing table" mode for timeline schema creator

2015-09-03 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729390#comment-14729390
 ] 

Joep Rottinghuis commented on YARN-4102:


Yup, that looks good. I noticed one more thing (you decide whether it should be 
changed or not): the createTable method throws an IOException, but you catch the 
more generic Exception. Would it make sense to catch only IOExceptions?

> Add a "skip existing table" mode for timeline schema creator
> 
>
> Key: YARN-4102
> URL: https://issues.apache.org/jira/browse/YARN-4102
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-4102-YARN-2928.001.patch, 
> YARN-4102-YARN-2928.002.patch
>
>
> When debugging timeline POCs, we may need to create hbase tables that are 
> added in some ongoing patches. Right now, our schema creator will exit when 
> it hits one existing table. While this is a correct behavior with end users, 
> this introduces much trouble in debugging POCs: every time we have to disable 
> all existing tables, drop them, run the schema creator to generate all 
> tables, and regenerate all test data. 
> Maybe we'd like to add an "incremental" mode so that the creator will only 
> create non-existing tables? This is pretty handy in deploying our POCs. Of 
> course, consistency has to be kept in mind across tables. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4059) Preemption should delay assignments back to the preempted queue

2015-09-03 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729308#comment-14729308
 ] 

Jason Lowe commented on YARN-4059:
--

bq. do you think if it acceptable to you if adding an option to CS choose to 
use heartbeat-based counting or time-based counting?

Time-based counting would be a bit better since at least it would help resolve 
the current bug where we are not counting full nodes as scheduling 
opportunities.  However we could approach this problem from a different angle 
as well.  Even if we do a time-based allocation, there's this problematic 
scenario:

- Highest priority app is asking for a lot of containers on a busy cluster
- We free up a bunch of resources via preemption
- One of the early resources we allocate to that app is very likely to have 
locality (since it's asking for so many).
- Now we'll reset the scheduling opportunities or timer and the app will wait 
for quite a bit before it will take anything that isn't perfect locality.
- In the meantime all those free resources end up going to other, lower 
priority apps, possibly the ones we originally preempted

Wondering if we should scale down the delay for allocation based on how full 
the cluster appears to be.  If the cluster is nearly full then we probably 
don't want to be particularly picky about what containers we're getting.  If we 
do delay then it's likely the scarce resource will be snarfed up by another app 
who isn't very picky and now we're back to waiting for anything again.
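
As a rough illustration of that idea (the formula, names, and the resource model 
below are assumptions, not anything from an actual patch), the configured 
locality delay could be scaled by the free fraction of the cluster, so a nearly 
full cluster waits almost not at all:

{code}
class LocalityDelaySketch {
  // Scale the configured delay (heartbeats or ms) by the cluster's free
  // fraction: an empty cluster keeps the full delay, a full cluster ~0.
  static long scaledDelay(long configuredDelay, long used, long total) {
    if (total <= 0) {
      return 0;
    }
    double freeFraction = Math.max(0.0, 1.0 - (double) used / total);
    return Math.round(configuredDelay * freeFraction);
  }

  public static void main(String[] args) {
    System.out.println(scaledDelay(40, 90, 100));  // 90% full -> delay of 4
    System.out.println(scaledDelay(40, 10, 100));  // mostly free -> delay of 36
  }
}
{code}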

> Preemption should delay assignments back to the preempted queue
> ---
>
> Key: YARN-4059
> URL: https://issues.apache.org/jira/browse/YARN-4059
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
>
>
> When preempting containers from a queue it can take a while for the other 
> queues to fully consume the resources that were freed up, due to delays 
> waiting for better locality, etc. Those delays can cause the resources to be 
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or 
> time, to avoid granting containers to a queue that was recently preempted. 
> The delay should be sufficient to cover the cycles of the preemption monitor, 
> so we won't try to assign containers in-between preemption events for a queue.
> Worst-case scenario for assigning freed resources to other queues is when all 
> the other queues want no locality. No locality means only one container is 
> assigned per heartbeat, so we need to wait for the entire cluster 
> heartbeating in times the number of containers that could run on a single 
> node.
> So the "penalty time" for a queue should be the max of either the preemption 
> monitor cycle time or the amount of time it takes to allocate the cluster 
> with one container per heartbeat. Guessing this will be somewhere around 2 
> minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3874) Optimize and synchronize FS Reader and Writer Implementations

2015-09-03 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729331#comment-14729331
 ] 

Li Lu commented on YARN-3874:
-

bq. For fallback from HBase when the HBase cluster is temporarily unavailable. 
YARN-4061 is proposed to resolve this problem. I believe a single HDFS storage 
won't be enough, because we also need to keep the data consistent after the 
HBase cluster is recovered. It is not quite desirable to put part of the data 
into HBase and the other part onto HDFS. I've started to work on that JIRA and 
hope to post a patch after the web UI milestone. If you have any suggestions 
there, please feel free to let me know. 

> Optimize and synchronize FS Reader and Writer Implementations
> -
>
> Key: YARN-3874
> URL: https://issues.apache.org/jira/browse/YARN-3874
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3874-YARN-2928.01.patch, 
> YARN-3874-YARN-2928.02.patch, YARN-3874-YARN-2928.03.patch
>
>
> Combine FS Reader and Writer Implementations and make them consistent with 
> each other.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4103) RM WebServices missing scheme for appattempts logLinks

2015-09-03 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729378#comment-14729378
 ] 

Jonathan Eagles commented on YARN-4103:
---

Looks like YARN-1553 introduced this error. You can see that it is present both 
in the RM web services and in RMAppBlock by looking at the page source of a 
specific app page, RM:port/cluster/app/app_id. The browser is very accommodating 
and takes you to the correct address, which is why this has been hidden for a 
long time. If you are ok with this, I would like to check this in.
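
For reference, a minimal sketch of the kind of fix involved (the helper name and 
the https flag are illustrative, not the actual patch): prepend the web app's 
scheme to the scheme-relative link before returning it.

{code}
class LogLinkSketch {
  // Hypothetical helper: turn a scheme-relative log link ("//host:port/...")
  // into an absolute one using the configured web app scheme.
  static String withScheme(String logLink, boolean httpsEnabled) {
    if (logLink != null && logLink.startsWith("//")) {
      return (httpsEnabled ? "https:" : "http:") + logLink;
    }
    return logLink;
  }

  public static void main(String[] args) {
    System.out.println(withScheme("//nm-host:8042/node/containerlogs", false));
    // prints http://nm-host:8042/node/containerlogs
  }
}
{code}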

> RM WebServices missing scheme for appattempts logLinks
> --
>
> Key: YARN-4103
> URL: https://issues.apache.org/jira/browse/YARN-4103
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Attachments: YARN-4103.1.patch, YARN-4103.2.patch, YARN-4103.3.patch
>
>
> all App Attempt Info logLinks begin with "//" instead of "http://" or 
> "https://"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4112) YARN HA secondary-RM/NM redirect not parseable by Ambari/curl in a Kerberoized deployment

2015-09-03 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729404#comment-14729404
 ] 

Xuan Gong commented on YARN-4112:
-

[~arobertson]
Looks like a duplicate of https://issues.apache.org/jira/browse/YARN-2605.

> YARN HA secondary-RM/NM redirect not parseable by Ambari/curl in a 
> Kerberoized deployment
> -
>
> Key: YARN-4112
> URL: https://issues.apache.org/jira/browse/YARN-4112
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Andrew Robertson
>Priority: Minor
>
> The secondary-RM-to-primary-RM (and NM) redirect issued by YARN in a 
> kerberoized cluster is not a proper (http-location-301-style) redirect, thus 
> Ambari - which uses curl - does not find the "right" node.  This, in turn, is 
> triggering alerts in Ambari.
> A network dump of the ambari poll against the secondary RM looks like:
> Request:
> """
> GET /jmx?qry=Hadoop:service=ResourceManager,name=RMNMInfo HTTP/1.1
> ...
> """
> Response:
> """
> HTTP/1.1 200 OK
> ...
> Refresh: 3; url=http://{my-primary-rm}:8088/jmx
> Content-Length: 106
> Server: Jetty(6.1.26.hwx)
> This is standby RM. Redirecting to the current active RM:
> http://{my-primary-rm}:8088/jmx
> """
> Comment from Jonathan Hurley jhur...@hortonworks.com -
> ---
> This is caused by how YARN does HA mode. With two YARN RMs, the standby RM 
> returns a 200 response with a JavaScript redirect instead of an 3xx 
> redirection. When not using Kerberos, Ambari should be able to parse the 
> headers and follow the JS-based redirect. However, on a Kerberized cluster, 
> we use curl which cannot do this. Therefore, requests against the secondary 
> RM will return an UNKNOWN response since it did get a 200. I think a few 
> things can be improved here:
> 1) There should be a ticket filed for YARN to have their HA mode use a proper 
> redirect
> 2) Ambari might not want to produce an UNKNOWN response here since it gives a 
> false feeling that something went wrong.
> ---
> I've also filed AMBARI-12995 with the ambari team.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

2015-09-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729674#comment-14729674
 ] 

Hadoop QA commented on YARN-3591:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 36s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  1s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 2 new or modified test files. |
| {color:green}+1{color} | javac |   7m 52s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  0s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 37s | The applied patch generated  1 
new checkstyle issues (total was 171, now 169). |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 28s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 15s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |   7m 30s | Tests failed in 
hadoop-yarn-server-nodemanager. |
| | |  46m 20s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.nodemanager.TestNodeStatusUpdaterForLabels |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12753942/YARN-3591.9.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 53c38cc |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/9002/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9002/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9002/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9002/console |


This message was automatically generated.

> Resource Localisation on a bad disk causes subsequent containers failure 
> -
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Lavkesh Lahngir
>Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch, 
> YARN-3591.6.patch, YARN-3591.7.patch, YARN-3591.8.patch, YARN-3591.9.patch
>
>
> It happens when a resource is localised on a disk and, after localisation, 
> that disk goes bad. The NM keeps paths for localised resources in memory.  At 
> the time of a resource request, isResourcePresent(rsrc) will be called, which 
> calls file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and 
> file.exists() returns true. But at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because 
> it was able to find the inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good, it should return an array of 
> paths with length at least 1.
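
A minimal sketch of the proposed check using plain java.io.File (illustrative 
only; the real NM code path is the isResourcePresent(rsrc) call mentioned 
above): list the parent directory, which forces an open() and therefore fails 
on a bad disk even when exists() would still return true.

{code}
import java.io.File;

class DiskCheckSketch {
  // Instead of trusting file.exists() (which may succeed off cached inodes
  // on a bad disk), list the parent directory; list() needs an open() and
  // returns null when the disk has actually gone bad.
  static boolean isResourceUsable(File localizedPath) {
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return localizedPath.exists();
    }
    String[] children = parent.list();
    return children != null && children.length >= 1 && localizedPath.exists();
  }

  public static void main(String[] args) {
    System.out.println(isResourceUsable(new File("/tmp")));
  }
}
{code}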



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4102) Add a "skip existing table" mode for timeline schema creator

2015-09-03 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-4102:

Attachment: YARN-4102-YARN-2928.003.patch

Nice catch [~jrottinghuis]! Fixed in the v3 patch. Thanks for your help! 

> Add a "skip existing table" mode for timeline schema creator
> 
>
> Key: YARN-4102
> URL: https://issues.apache.org/jira/browse/YARN-4102
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-4102-YARN-2928.001.patch, 
> YARN-4102-YARN-2928.002.patch, YARN-4102-YARN-2928.003.patch
>
>
> When debugging timeline POCs, we may need to create hbase tables that are 
> added in some ongoing patches. Right now, our schema creator will exit when 
> it hits one existing table. While this is a correct behavior with end users, 
> this introduces much trouble in debugging POCs: every time we have to disable 
> all existing tables, drop them, run the schema creator to generate all 
> tables, and regenerate all test data. 
> Maybe we'd like to add an "incremental" mode so that the creator will only 
> create non-existing tables? This is pretty handy in deploying our POCs. Of 
> course, consistency has to be kept in mind across tables. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2766) ApplicationHistoryManager is expected to return a sorted list of apps/attempts/containers

2015-09-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2766:
--
   Labels: 2.6.1-candidate  (was: )
Fix Version/s: 2.6.1

Pulled this into 2.6.1 as a dependency for YARN-3700. Patch applied cleanly.

Ran compilation before the push. 

>  ApplicationHistoryManager is expected to return a sorted list of 
> apps/attempts/containers
> --
>
> Key: YARN-2766
> URL: https://issues.apache.org/jira/browse/YARN-2766
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>  Labels: 2.6.1-candidate
> Fix For: 2.7.0, 2.6.1
>
> Attachments: YARN-2766.patch, YARN-2766.patch, YARN-2766.patch, 
> YARN-2766.patch
>
>
> {{TestApplicationHistoryClientService.testContainers}} and 
> {{TestApplicationHistoryClientService.testApplicationAttempts}} both fail 
> because the test assertions are assuming a returned Collection is in a 
> certain order.  The collection comes from a HashMap, so the order is not 
> guaranteed, plus, according to [this 
> page|http://docs.oracle.com/javase/8/docs/technotes/guides/collections/changes8.html],
>  there are situations where the iteration order of a HashMap will be 
> different between Java 7 and 8.
> We should fix the test code to not assume a specific ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4059) Preemption should delay assignments back to the preempted queue

2015-09-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729749#comment-14729749
 ] 

Wangda Tan commented on YARN-4059:
--

[~jlowe], thanks for your comments:
bq. Wondering if we should scale down the delay for allocation based on how 
full the cluster appears to be
I think this cannot handle the case where an app wants only a small proportion 
of the cluster. High cluster utilization doesn't always mean the asked-for 
proportion is also highly utilized.

Do you think the following is an acceptable plan to you?

Instead of resetting missed-opportunity (or missed-time) every time we get a new 
container at the expected locality, we will deduct a fixed amount from the total 
missed-time. An example:
Assume we set node-local-delay to 5 sec.
If the AM waits 20 sec to get a node-local container, we will set missed-time to 
20 - 5 = 15 sec. Until the missed-time drops below 5 sec, the app accepts 
whatever the RM allocates instead of being picky. This approach considers the 
accumulated waiting time of a given app (maybe of a given priority as well). If 
an app has already waited for a long time, it can get containers allocated 
quickly as soon as any resources become available.
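
A small sketch of that bookkeeping (class and method names are illustrative, not 
from any patch): deduct the configured delay from the accumulated missed time on 
a local allocation instead of resetting it, and stop being picky while the 
remaining missed time is still at or above the threshold.

{code}
class MissedTimeSketch {
  private final long nodeLocalDelayMs;
  private long missedTimeMs;             // accumulated waiting time

  MissedTimeSketch(long nodeLocalDelayMs) {
    this.nodeLocalDelayMs = nodeLocalDelayMs;
  }

  // Called while the app keeps waiting without a node-local allocation.
  void addWait(long waitedMs) {
    missedTimeMs += waitedMs;
  }

  // Called when a node-local container is allocated: deduct, don't reset.
  void onLocalAllocation() {
    missedTimeMs = Math.max(0, missedTimeMs - nodeLocalDelayMs);
  }

  // While the app has already waited long enough, accept any locality.
  boolean acceptAnyLocality() {
    return missedTimeMs >= nodeLocalDelayMs;
  }

  public static void main(String[] args) {
    MissedTimeSketch s = new MissedTimeSketch(5_000);
    s.addWait(20_000);        // waited 20 sec for a node-local container
    s.onLocalAllocation();    // missed time becomes 15 sec, not 0
    System.out.println(s.acceptAnyLocality());  // true: still not picky
  }
}
{code}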

> Preemption should delay assignments back to the preempted queue
> ---
>
> Key: YARN-4059
> URL: https://issues.apache.org/jira/browse/YARN-4059
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
>
>
> When preempting containers from a queue it can take a while for the other 
> queues to fully consume the resources that were freed up, due to delays 
> waiting for better locality, etc. Those delays can cause the resources to be 
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or 
> time, to avoid granting containers to a queue that was recently preempted. 
> The delay should be sufficient to cover the cycles of the preemption monitor, 
> so we won't try to assign containers in-between preemption events for a queue.
> Worst-case scenario for assigning freed resources to other queues is when all 
> the other queues want no locality. No locality means only one container is 
> assigned per heartbeat, so we need to wait for the entire cluster 
> heartbeating in times the number of containers that could run on a single 
> node.
> So the "penalty time" for a queue should be the max of either the preemption 
> monitor cycle time or the amount of time it takes to allocate the cluster 
> with one container per heartbeat. Guessing this will be somewhere around 2 
> minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3942) Timeline store to read events from HDFS

2015-09-03 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729815#comment-14729815
 ] 

Hitesh Shah commented on YARN-3942:
---

Some ideas from an offline discussion with [~bikassaha] and [~vinodkv]:

- option 1) Could we just use leveldb as an LRU cache instead of a memory-based 
cache to handle the OOM issue? (A rough sketch of this idea follows below.)
- option 2) Could we just take the data from HDFS, write it out to 
leveldb, and use that leveldb to serve data out? This would address the OOM 
issue too. 

\cc [~jlowe] [~jeagles]
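
To illustrate option 1 at a very high level (everything below is a hypothetical 
sketch, not the entity-group plugin's actual classes), an access-ordered map of 
open per-group leveldb handles that closes and evicts the eldest handle once the 
cache is full keeps memory bounded without holding whole timelines on the heap:

{code}
import java.util.LinkedHashMap;
import java.util.Map;

class LruDbCacheSketch<K, DB extends AutoCloseable> {
  private final Map<K, DB> cache;

  LruDbCacheSketch(final int maxOpenDbs) {
    // access-order LinkedHashMap gives LRU eviction for free.
    this.cache = new LinkedHashMap<K, DB>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, DB> eldest) {
        if (size() > maxOpenDbs) {
          try {
            eldest.getValue().close();  // close the evicted leveldb handle
          } catch (Exception e) {
            // a real implementation would log and move on
          }
          return true;
        }
        return false;
      }
    };
  }

  DB get(K key) { return cache.get(key); }

  void put(K key, DB db) { cache.put(key, db); }
}
{code}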


> Timeline store to read events from HDFS
> ---
>
> Key: YARN-3942
> URL: https://issues.apache.org/jira/browse/YARN-3942
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-3942.001.patch
>
>
> This adds a new timeline store plugin that is intended as a stop-gap measure 
> to mitigate some of the issues we've seen with ATS v1 while waiting for ATS 
> v2.  The intent of this plugin is to provide a workable solution for running 
> the Tez UI against the timeline server on a large-scale clusters running many 
> thousands of jobs per day.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3700) ATS Web Performance issue at load time when large number of jobs

2015-09-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3700:
--
Fix Version/s: 2.6.1

Pulled this into 2.6.1. Had to fix merge conflicts in WebServices.java, 
AppsBlock.java. Had to rewrite the documentation in apt.

Ran compilation and TestApplicationHistoryClientService, 
TestApplicationHistoryManagerOnTimelineStore before the push.


> ATS Web Performance issue at load time when large number of jobs
> 
>
> Key: YARN-3700
> URL: https://issues.apache.org/jira/browse/YARN-3700
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager, webapp, yarn
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>  Labels: 2.6.1-candidate, 2.7.2-candidate
> Fix For: 2.6.1, 2.8.0
>
> Attachments: YARN-3700.1.patch, YARN-3700.2.1.patch, 
> YARN-3700.2.2.patch, YARN-3700.2.patch, YARN-3700.3.patch, YARN-3700.4.patch
>
>
> Currently, we will load all the apps when we try to load the yarn 
> timelineservice web page. If we have large number of jobs, it will be very 
> slow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER

2015-09-03 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-4113:


 Summary: RM should respect retry-interval when uses 
RetryPolicies.RETRY_FOREVER
 Key: YARN-4113
 URL: https://issues.apache.org/jira/browse/YARN-4113
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan
Priority: Critical


Found an issue in how RMProxy initializes the RetryPolicy, in 
RMProxy#createRetryPolicy: when rmConnectWaitMS is set to -1 (wait forever), it 
uses RetryPolicies.RETRY_FOREVER, which doesn't respect the 
{{yarn.resourcemanager.connect.retry-interval.ms}} setting.

RetryPolicies.RETRY_FOREVER uses 0 as the interval. When I ran the test 
{{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without a 
properly set up localhost name, it wrote 14G of DEBUG exception messages to the 
system before it died. This will be very bad if we do the same thing in a 
production cluster.

We should fix two places:
- Make RETRY_FOREVER able to take the retry-interval as a constructor parameter.
- Respect the retry-interval when we use the RETRY_FOREVER policy.
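
A hedged sketch of the second point in plain Java (not the actual Hadoop 
RetryPolicies API): a forever-retrying helper that takes the retry interval as a 
constructor parameter and sleeps between attempts instead of spinning with a 
0 ms interval.

{code}
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

class RetryForeverWithIntervalSketch {
  private final long retryIntervalMs;

  RetryForeverWithIntervalSketch(long retryIntervalMs) {
    this.retryIntervalMs = retryIntervalMs;
  }

  // Retry the action until it succeeds, sleeping the configured interval
  // between attempts.
  <T> T retry(Callable<T> action) throws InterruptedException {
    while (true) {
      try {
        return action.call();
      } catch (Exception e) {
        TimeUnit.MILLISECONDS.sleep(retryIntervalMs);
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    RetryForeverWithIntervalSketch r = new RetryForeverWithIntervalSketch(30_000);
    // Example: keep trying to "connect" until it succeeds.
    System.out.println(r.retry(() -> "connected"));
  }
}
{code}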



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4087) Set YARN_FAIL_FAST to be false by default

2015-09-03 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-4087:
--
Attachment: YARN-4087.3.patch

> Set YARN_FAIL_FAST to be false by default
> -
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch
>
>
> Increasingly, I feel setting this property to be false makes more sense, 
> especially in production environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables

2015-09-03 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729886#comment-14729886
 ] 

Vrushali C commented on YARN-3901:
--


bq. From the code I can see we add the newly created cell to the list of cells 
(cells.add(someNewCell)). Is this operation only modifying the returned value 
of the next call, or does it permanently write the aggregated values back to 
HBase? This may sound silly but I'm not very familiar with the observer 
coprocessors.

Yes, in this code, we are only "reading" or returning back cells to the client 
(hbase client). But when we add in the compaction/flush coprocessors, they will 
write back to hbase as well. 

bq. Maybe a name that is more specific will help? How about something like 
AggregationPutAttribute?
Hmm. It's not really an attribute for the Put, it's more like a 
characteristic of the cell value itself. We could use these attributes outside 
of Aggregation as well, so we don't want the agg prefix here. Still thinking 
about what to call it. Perhaps FlowValueAttribute? 

> Populate flow run data in the flow_run & flow activity tables
> -
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-09-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3725:
--
Attachment: YARN-3725-branch-2.6.1.txt

Attaching the 2.6.1 patch that I committed.

> App submission via REST API is broken in secure mode due to Timeline DT 
> service address is empty
> 
>
> Key: YARN-3725
> URL: https://issues.apache.org/jira/browse/YARN-3725
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>Priority: Blocker
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1, 2.7.1
>
> Attachments: YARN-3725-branch-2.6.1.txt, YARN-3725.1.patch
>
>
> YARN-2971 changes TimelineClient to use the service address from Timeline DT 
> to renew the DT instead of the configured address. This breaks the procedure 
> of submitting a YARN app via the REST API in secure mode.
> The problem is that the service address is set by the client instead of the 
> server in Java code. The REST API response is an encoded token String, such 
> that it's inconvenient to deserialize it, set the service address, and 
> serialize it again. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-09-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3725:
--
Fix Version/s: 2.6.1

Pulled this into 2.6.1 after fixing a few merge issues.

Ran compilation and TestTimelineAuthenticationFilter before the push.

> App submission via REST API is broken in secure mode due to Timeline DT 
> service address is empty
> 
>
> Key: YARN-3725
> URL: https://issues.apache.org/jira/browse/YARN-3725
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>Priority: Blocker
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1, 2.7.1
>
> Attachments: YARN-3725.1.patch
>
>
> YARN-2971 changes TimelineClient to use the service address from Timeline DT 
> to renew the DT instead of the configured address. This breaks the procedure 
> of submitting a YARN app via the REST API in secure mode.
> The problem is that the service address is set by the client instead of the 
> server in Java code. The REST API response is an encoded token String, such 
> that it's inconvenient to deserialize it, set the service address, and 
> serialize it again. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3942) Timeline store to read events from HDFS

2015-09-03 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729843#comment-14729843
 ] 

Jason Lowe commented on YARN-3942:
--

Option 1 will add some latency (not clear how much yet) to initializing the 
cache, and it could take quite a bit of time to build it depending upon how 
many dags were run in the same session and the amount of data from each dag.

If I understand option 2 properly, it proposes to have the scanner read all the 
data, not just the summary data, out of HDFS and store it in the main leveldb.  
The problem we run into with that approach is that for our production scale and 
desired retention periods it would generate a very, very large set of leveldb 
databases that must be stored locally, and query performance starts to degrade 
as the leveldb databases get really large.

Option 1 is more viable for us, assuming we won't have horrendous latency 
issues trying to build a substantial database from a monster session.  Option 2 
is not as attractive, although I could see it being appealing to those that 
don't need to worry about huge leveldb size problems.

> Timeline store to read events from HDFS
> ---
>
> Key: YARN-3942
> URL: https://issues.apache.org/jira/browse/YARN-3942
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-3942.001.patch
>
>
> This adds a new timeline store plugin that is intended as a stop-gap measure 
> to mitigate some of the issues we've seen with ATS v1 while waiting for ATS 
> v2.  The intent of this plugin is to provide a workable solution for running 
> the Tez UI against the timeline server on a large-scale clusters running many 
> thousands of jobs per day.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1651) CapacityScheduler side changes to support increase/decrease container resource.

2015-09-03 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729706#comment-14729706
 ] 

MENG DING commented on YARN-1651:
-

Hi, [~leftnoteasy]

I think it is fine to reuse Expire for now for container increase expiration. 
We probably need to address this properly in the JIRA that tracks container 
resource increase roll back. (I think Container resource increase expiration 
should be tracked as a Scheduler Event, e.g., 
SchedulerEventType.CONTAINER_INCREASE_EXPIRE)

I have a few more comments or questions regarding the patch:

* Regarding sanity checks:
** The following functions can be removed? 
{{ApplicationMaster.checkDuplicatedIncreaseDecreaseRequest()}}
** About {{RMServerUtils.checkDuplicatedIncreaseDecreaseRequest()}}. It seems 
that this function throws an exception whenever there is a duplicated id. Shall 
we handle the case where there are both increase and decrease requests for the 
same id by ignoring the increase but keeping the decrease request?
** Would it be better to combine all sanity checks into one function, e.g., 
{{validateIncreaseDecreaseRequest(List 
incRequests, List decRequests)}}, such that it 
will check both duplicated IDs and the resource validity for increase and 
decrease requests? 
** For {{validateIncreaseDecreaseRequest}}, we don't check the minimum allocation 
now; is that intended? I see that later on you normalize the request so that it 
will be at least the minimum allocation. Just want to confirm.

* For {{SchedulerApplicationAttempt.pullNewlyUpdatedContainers}}. 
** This function is used by both pullNewlyIncreasedContainers() and 
pullNewlyDecreasedContainers(). Why do we need to call 
{{updateContainerAndNMToken}} for decreased containers?  It also unnecessarily 
sends an ACQUIRE_UPDATED_CONTAINER event for every decreased container.
** We should probably check null before adding updatedContainer?
{code:title=pullNewlyUpdatedContainers}
  Container updatedContainer = updateContainerAndNMToken(rmContainer, 
false);
  returnContainerList.add(updatedContainer);
{code}

* It seems {{RMNodeImpl.pullNewlyIncreasedContainers()}} is empty?

* The following function doesn't seem to be used?
{code:title=AppSchedulingInfo}
  public synchronized void notifyContainerStopped(RMContainer rmContainer) {
// remove from pending increase request map if it exists
removeIncreaseRequest(rmContainer.getAllocatedNode(),
rmContainer.getAllocatedPriority(), rmContainer.getContainerId());
  }
{code}

* In {{IncreaseContainerAllocator.assignContainers}}:
** I think the following is a typo, should be {{if (cannotAllocateAnything)}}, 
right?
{code}
  if (shouldUnreserve) {
LOG.debug("We cannot allocate anything because of low headroom, "
+ "headroom=" + resourceLimits.getHeadroom());
  }
{code}
** Not sure if I understand the logic. Why only break when 
node.getReservedContainer() == null? Shouldn't we break out of the loop here no 
matter what?
{code}
   while (iter.hasNext()) {
  ...
  ...
  // Try to allocate the increase request
  assigned = allocateIncreaseRequest(node, increaseRequest);
  if (node.getReservedContainer() == null) {
// if it's not a reserved increase request, we will record
// priority/containerId so that we can remove the request later
increasedContainerPriority = priority;
increasedContainerId = rmContainer.getContainerId();
break;
  }
   }  
{code}
** Is the following needed? 
 {code}
  if (increasedContainerId != null) {
// If we increased (not reserved) a new increase request, we should
// remove it from request map.
application.removeIncreaseRequest(nodeId, increasedContainerPriority,
increasedContainerId);
  }
{code}
I think earlier in the {{allocateIncreaseRequest()}} function, if a new 
increase is successfully allocated, 
{{application.increaseContainer(increaseRequest)}} will have removed the 
increase request already?
* In {{RMContainerImpl.java}}
IIUC, {{containerIncreased}} indicates that an increase is done in the scheduler, 
and {{containerIncreasedAndAcquired}} indicates that an increase has been 
acquired by the AM. 
If so, then in {{NMReportedContainerChangeIsDoneTransition}}
{code}
public void transition(RMContainerImpl container, RMContainerEvent event) {
  if (container.containerIncreased) {
// If container is increased but not acquired by AM, we will start
// containerAllocationExpirer for this container in this transition.
container.containerAllocationExpirer.unregister(event.getContainerId());
container.containerIncreasedAndAcquired = false;
  }
}
{code}
Shouldn't it be changed to:
{code}
public void transition(RMContainerImpl container, RMContainerEvent event) {
  if 

[jira] [Commented] (YARN-4059) Preemption should delay assignments back to the preempted queue

2015-09-03 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729798#comment-14729798
 ] 

Jason Lowe commented on YARN-4059:
--

bq. I think this cannot handle the case if an app wants only a small proportion 
of cluster.
If the app only wants a small portion of the cluster then it already scales 
down the amount of time it will wait in getLocalityWaitFactor, so there needs 
to be a substantial request to get a substantial wait.

The problem I think we're going to run into with a time-based approach is that 
we don't know what time an individual request arrived since we only store the 
aggregation of requests for a particular priority.  I think it might be tricky 
to also track when a request becomes "eligible" for allocation.  For example, 
if the app has been sitting behind other applications in the queue and user 
limits are why it isn't getting containers then we do _not_ want to think that 
the app has already waited a long time for a local container.  It hasn't really 
waited any time from an opportunity perspective because user limits prevented 
it from getting what it wanted.  The cluster could be almost completely empty 
and then when the limits finally allow it to allocate it will be so far behind 
time-wise that we'll schedule it very poorly.  Similarly we could have 
satisfied a portion of the request at a certain priority, then user limits kick 
in, and many minutes later when the containers exit it may look like we have 
been trying to find locality for all that time which is incorrect.

If we can find a way to get the time bookkeeping right I think it could sort of 
work.  However as the cluster usage approaches capacity we get into priority 
inversion problems when apps at the front of the queue pass up containers due 
to locality and the apps behind them readily take them.  That can severely 
prolong the time it takes the apps to get what they are asking for, hence the 
thought that we may want to consider total cluster load when weighing how long 
we should be trying.

> Preemption should delay assignments back to the preempted queue
> ---
>
> Key: YARN-4059
> URL: https://issues.apache.org/jira/browse/YARN-4059
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
>
>
> When preempting containers from a queue it can take a while for the other 
> queues to fully consume the resources that were freed up, due to delays 
> waiting for better locality, etc. Those delays can cause the resources to be 
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or 
> time, to avoid granting containers to a queue that was recently preempted. 
> The delay should be sufficient to cover the cycles of the preemption monitor, 
> so we won't try to assign containers in-between preemption events for a queue.
> Worst-case scenario for assigning freed resources to other queues is when all 
> the other queues want no locality. No locality means only one container is 
> assigned per node heartbeat, so we need to wait for the time it takes the 
> entire cluster to heartbeat in, multiplied by the number of containers that 
> could run on a single node.
> So the "penalty time" for a queue should be the max of either the preemption 
> monitor cycle time or the amount of time it takes to allocate the cluster 
> with one container per heartbeat. Guessing this will be somewhere around 2 
> minutes.
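As a rough sanity check on the ~2 minute guess above, with assumed values for
the NM heartbeat interval and per-node container count (neither number comes
from this jira):

{code}
// Back-of-the-envelope estimate of the proposed "penalty time".
public class PenaltyTimeEstimate {
  public static void main(String[] args) {
    long preemptionMonitorIntervalMs = 15000L; // assumed monitor cycle: 15s
    long nodeHeartbeatIntervalMs = 1000L;      // assumed NM heartbeat: 1s
    int containersPerNode = 120;               // assumed container density

    // Worst case: one container assigned per node heartbeat, so filling the
    // freed space can take containersPerNode heartbeat rounds.
    long fillTimeMs = nodeHeartbeatIntervalMs * containersPerNode; // 120s

    long penaltyMs = Math.max(preemptionMonitorIntervalMs, fillTimeMs);
    System.out.println("Penalty time ~= " + (penaltyMs / 1000) + "s"); // ~2 min
  }
}
{code}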



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4109) Exception on RM scheduler page loading with labels

2015-09-03 Thread Mohammad Shahid Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohammad Shahid Khan reassigned YARN-4109:
--

Assignee: Mohammad Shahid Khan

> Exception on RM scheduler page loading with labels
> --
>
> Key: YARN-4109
> URL: https://issues.apache.org/jira/browse/YARN-4109
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Mohammad Shahid Khan
>
> Configure node label and load scheduler Page
> {code}
> 2015-09-03 11:27:08,544 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error 
> handling URI: /cluster/scheduler
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at 
> com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
>   at 
> com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
>   at 
> com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:139)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:663)
>   at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291)
>   at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:615)
>   at 
> org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1211)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>   at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>   at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>   at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>   at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>   at org.mortbay.jetty.Server.handle(Server.java:326)
>   at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>   at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>   at 

[jira] [Commented] (YARN-4107) Both RM becomes Active if all zookeepers can not connect to active RM

2015-09-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728549#comment-14728549
 ] 

Hadoop QA commented on YARN-4107:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 52s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 55s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  8s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 50s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 28s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 31s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  53m 45s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  93m 27s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12753865/YARN-4107.1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 09c64ba |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8993/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8993/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8993/console |


This message was automatically generated.

> Both RM becomes Active if all zookeepers can not connect to active RM
> -
>
> Key: YARN-4107
> URL: https://issues.apache.org/jira/browse/YARN-4107
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-4107.1.patch
>
>
> Steps to reproduce:
> 1) Run small randomwriter applications in background
> 2) rm1 is active and rm2 is standby 
> 3) Disconnect all Zks and Active RM
> 4) Check status of both RMs. Both of them are in active state



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2170) Fix components' version information in the web page 'About the Cluster'

2015-09-03 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728555#comment-14728555
 ] 

zhihai xu commented on YARN-2170:
-

Hi [~hex108], it is a good catch. Java doesn't allow overriding of static 
methods; that is why getVersion always gets the version of Hadoop Common.
 

> Fix components' version information in the web page 'About the Cluster'
> ---
>
> Key: YARN-2170
> URL: https://issues.apache.org/jira/browse/YARN-2170
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Minor
> Attachments: YARN-2170.patch
>
>
> In the web page 'About the Cluster', YARN's component's build version(e.g. 
> ResourceManager) is the same as Hadoop version now. It is caused by   calling 
> getVersion() instead of _getVersion() in VersionInfo.java by mistake.
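For reference, a small standalone example of the method-hiding behaviour
described above; the class names are illustrative and do not reflect the real
VersionInfo layout:

{code}
// Static methods are hidden, not overridden: the call inside describe() is
// bound at compile time to CommonVersionInfo.getVersion().
class CommonVersionInfo {
  static String getVersion() { return "common-2.7.0"; }
  static String describe() { return getVersion(); }
}

class YarnVersionInfo extends CommonVersionInfo {
  static String getVersion() { return "yarn-2.7.0"; } // hides, does not override
}

public class StaticHidingDemo {
  public static void main(String[] args) {
    System.out.println(YarnVersionInfo.describe()); // prints "common-2.7.0"
  }
}
{code}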



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs

2015-09-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728559#comment-14728559
 ] 

Hadoop QA commented on YARN-2005:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  18m 38s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 8 new or modified test files. |
| {color:green}+1{color} | javac |   7m 56s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  4s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 45s | The applied patch generated  1 
new checkstyle issues (total was 211, now 211). |
| {color:green}+1{color} | whitespace |   0m 21s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 44s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 24s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   2m  1s | Tests passed in 
hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |  54m 38s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | | 103m 48s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12753897/YARN-2005.008.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 09c64ba |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8992/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8992/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8992/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8992/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8992/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8992/console |


This message was automatically generated.

> Blacklisting support for scheduling AMs
> ---
>
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 0.23.10, 2.4.0
>Reporter: Jason Lowe
>Assignee: Anubhav Dhoot
> Attachments: YARN-2005.001.patch, YARN-2005.002.patch, 
> YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, 
> YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, 
> YARN-2005.008.patch
>
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.
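A minimal sketch of the counting idea in the description, using a hypothetical
helper rather than the real RM data structures:

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical per-application AM blacklist: a node is skipped for AM
// placement once it has failed a configurable number of AM attempts.
public class AmLaunchBlacklist {
  private final int maxAmFailuresPerNode;
  private final Map<String, Integer> amFailuresByNode = new HashMap<String, Integer>();

  public AmLaunchBlacklist(int maxAmFailuresPerNode) {
    this.maxAmFailuresPerNode = maxAmFailuresPerNode;
  }

  public void onAmAttemptFailed(String nodeId) {
    Integer previous = amFailuresByNode.get(nodeId);
    amFailuresByNode.put(nodeId, previous == null ? 1 : previous + 1);
  }

  public Set<String> getBlacklistedNodes() {
    Set<String> blacklisted = new HashSet<String>();
    for (Map.Entry<String, Integer> e : amFailuresByNode.entrySet()) {
      if (e.getValue() >= maxAmFailuresPerNode) {
        blacklisted.add(e.getKey());
      }
    }
    return blacklisted;
  }
}
{code}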



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4109) Exception on RM scheduler page loading with labels

2015-09-03 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4109:
--

 Summary: Exception on RM scheduler page loading with labels
 Key: YARN-4109
 URL: https://issues.apache.org/jira/browse/YARN-4109
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt


Configure node label and load scheduler Page

{code}
2015-09-03 11:27:08,544 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error 
handling URI: /cluster/scheduler
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at 
com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
at 
com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
at 
com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:139)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
at 
com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:663)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:615)
at 
org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1211)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: 

[jira] [Updated] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

2015-09-03 Thread Lavkesh Lahngir (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavkesh Lahngir updated YARN-3591:
--
Attachment: YARN-3591.9.patch

Thanks [~vvasudev] for comments.
Updated the patch.

> Resource Localisation on a bad disk causes subsequent containers failure 
> -
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Lavkesh Lahngir
>Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch, 
> YARN-3591.6.patch, YARN-3591.7.patch, YARN-3591.8.patch, YARN-3591.9.patch
>
>
> It happens when a resource has been localised on a disk and, after 
> localisation, that disk goes bad. The NM keeps the paths of localised 
> resources in memory. At the time of a resource request, isResourcePresent(rsrc) 
> will be called, which calls file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and 
> file.exists() returns true, but at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because 
> it was able to find the inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good it should return an array of 
> paths with length at least 1 (see the sketch below).
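A minimal sketch of the proposed check using plain java.io.File; the real NM
wiring around isResourcePresent is omitted:

{code}
import java.io.File;

// Sketch: instead of trusting file.exists() (which may succeed from cached
// inode metadata on a bad disk), list the parent directory, which forces a
// native open() of the directory.
public class LocalResourcePresence {
  public static boolean isResourcePresent(File localizedPath) {
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return false;
    }
    String[] children = parent.list(); // null if the directory cannot be read
    if (children == null || children.length < 1) {
      return false;
    }
    for (String name : children) {
      if (name.equals(localizedPath.getName())) {
        return true;
      }
    }
    return false;
  }
}
{code}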



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3216) Max-AM-Resource-Percentage should respect node labels

2015-09-03 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3216:
---
Priority: Critical  (was: Major)

> Max-AM-Resource-Percentage should respect node labels
> -
>
> Key: YARN-3216
> URL: https://issues.apache.org/jira/browse/YARN-3216
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-3216.patch
>
>
> Currently, max-am-resource-percentage considers default_partition only. When 
> a queue can access multiple partitions, we should be able to compute 
> max-am-resource-percentage based on that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4044) Running applications information changes such as movequeue is not published to TimeLine server

2015-09-03 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728639#comment-14728639
 ] 

Naganarasimha G R commented on YARN-4044:
-

Hi [~sunilg], IIUC the intention of this jira is to show the modified queue or 
priority (if any) in the ATS UI, right? If so, 
{{ApplicationHistoryManagerOnTimelineStore.convertToApplicationReport}} needs 
to be modified to capture the 
{{ApplicationMetricsConstants.UPDATED_EVENT_TYPE}} event and update the 
ApplicationReport object. At the same time we need to ensure that if multiple 
application updates are done, the UI shows the latest one. As per the logic I 
see in {{LeveldbTimelineStore.getEntity(String, String, Long, EnumSet, 
LeveldbIterator, byte[], int)}}, all the events seem to be captured and 
returned, in which case we need to return the report with the information 
present in the latest {{ApplicationMetricsConstants.UPDATED_EVENT_TYPE}} event 
(a rough sketch follows below).
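A rough sketch of the "pick the latest update event" idea against the timeline
records API; the event-type string mirrors the constant proposed in this jira
and should be treated as an assumption:

{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEvent;

// Sketch: walk the entity's events and keep only the most recent update
// event, so the report reflects the latest queue/priority change.
public class LatestUpdateEvent {
  // Assumed to match ApplicationMetricsConstants.UPDATED_EVENT_TYPE from the patch.
  private static final String UPDATED_EVENT_TYPE = "YARN_APPLICATION_UPDATED";

  public static TimelineEvent findLatestUpdate(TimelineEntity entity) {
    TimelineEvent latest = null;
    List<TimelineEvent> events = entity.getEvents();
    if (events == null) {
      return null;
    }
    for (TimelineEvent event : events) {
      if (UPDATED_EVENT_TYPE.equals(event.getEventType())
          && (latest == null || event.getTimestamp() > latest.getTimestamp())) {
        latest = event;
      }
    }
    return latest;
  }
}
{code}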

> Running applications information changes such as movequeue is not published 
> to TimeLine server
> --
>
> Key: YARN-4044
> URL: https://issues.apache.org/jira/browse/YARN-4044
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-4044.patch, 0002-YARN-4044.patch
>
>
> SystemMetricsPublisher need to expose an appUpdated api to update any change 
> for a running application.
> Events can be 
>   - change of queue for a running application.
> - change of application priority for a running application.
> This ticket intends to handle both RM and timeline side changes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4103) RM WebServices missing scheme for appattempts logLinks

2015-09-03 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728743#comment-14728743
 ] 

Varun Vasudev commented on YARN-4103:
-

Thanks for the patch [~jeagles]. Can you clarify something - according to 
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API
 - the logsLink has the scheme. Is the existing documentation wrong?

+1 for the latest patch.

> RM WebServices missing scheme for appattempts logLinks
> --
>
> Key: YARN-4103
> URL: https://issues.apache.org/jira/browse/YARN-4103
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Attachments: YARN-4103.1.patch, YARN-4103.2.patch, YARN-4103.3.patch
>
>
> all App Attempt Info logLinks begin with "//" instead of "http://" or 
> "https://"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3216) Max-AM-Resource-Percentage should respect node labels

2015-09-03 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728656#comment-14728656
 ] 

Naganarasimha G R commented on YARN-3216:
-

Hi [~sunilg] & [~leftnoteasy],
This issue is very critical for our use case, where the cluster has 2 
partitions (neither of them the DEFAULT_PARTITION). If the 
Max-AM-Resource-Percentage is calculated based on the DEFAULT_PARTITION size, 
it practically limits us to only one app being submitted.
Also, even though it is harder to debug, I feel it is better to opt for 
approach 2, as we can clearly specify the AMResourceLimit for each partition, 
and we also have jiras such as YARN-3946 which try to indicate the reasons for 
an application not being launched. Thoughts?


> Max-AM-Resource-Percentage should respect node labels
> -
>
> Key: YARN-3216
> URL: https://issues.apache.org/jira/browse/YARN-3216
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-3216.patch
>
>
> Currently, max-am-resource-percentage considers default_partition only. When 
> a queue can access multiple partitions, we should be able to compute 
> max-am-resource-percentage based on that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3216) Max-AM-Resource-Percentage should respect node labels

2015-09-03 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728663#comment-14728663
 ] 

Sunil G commented on YARN-3216:
---

hi [~Naganarasimha Garla]
Option 2 may come with more complexity, and users may not understand that 
calculation very well. One of the reasons is that if a queue has multiple 
partitions, the resource size per partition also shrinks. It is then possible 
that, even to launch one application, we may need to exceed this limit 
entirely within one partition, or even stop that application from launching. 
Yes, it is more apt to do it, but we also have to see how much complexity it 
adds. We can discuss more and decide on an approach by weighing simplicity 
against correctness.

> Max-AM-Resource-Percentage should respect node labels
> -
>
> Key: YARN-3216
> URL: https://issues.apache.org/jira/browse/YARN-3216
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-3216.patch
>
>
> Currently, max-am-resource-percentage considers default_partition only. When 
> a queue can access multiple partitions, we should be able to compute 
> max-am-resource-percentage based on that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4110) RMAppImpl and RMAppAttemptImpl should override hashCode() & equals()

2015-09-03 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-4110:
---

 Summary: RMAppImpl and RMAppAttemptImpl should override hashCode() 
& equals()
 Key: YARN-4110
 URL: https://issues.apache.org/jira/browse/YARN-4110
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Rohith Sharma K S
Assignee: Rohith Sharma K S


It is observed that RMAppImpl and RMAppAttemptImpl do not have hashCode() and 
equals() implementations. These state objects should override them.

# For RMAppImpl, we can make use of ApplicationId#hashCode and ApplicationId#equals.
# Similarly, for RMAppAttemptImpl, ApplicationAttemptId#hashCode and 
ApplicationAttemptId#equals (see the sketch below).
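A trimmed-down, illustrative stand-in (not the real RMAppImpl) showing the
proposed delegation:

{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;

// Illustrative only: equals()/hashCode() delegate to the ApplicationId,
// as proposed above for RMAppImpl (RMAppAttemptImpl would use
// ApplicationAttemptId in the same way).
public class RMAppIdentitySketch {
  private final ApplicationId applicationId;

  public RMAppIdentitySketch(ApplicationId applicationId) {
    this.applicationId = applicationId;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof RMAppIdentitySketch)) {
      return false;
    }
    return applicationId.equals(((RMAppIdentitySketch) obj).applicationId);
  }

  @Override
  public int hashCode() {
    return applicationId.hashCode();
  }
}
{code}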



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3813) Support Application timeout feature in YARN.

2015-09-03 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728858#comment-14728858
 ] 

Naganarasimha G R commented on YARN-3813:
-

Hi All \[ [~nijel],[~rohithsharma],[~devaraj.k],[~vinodkv] & [~sunilg] \],
Would it be good to also support the YARN-2487 scenario here, where an 
application that stays only in the SUBMITTED or ACCEPTED state for a 
particular period can be killed? Basically, along with the application timeout 
period we could also accept the set of application states based on which we 
need to kill it. Thoughts?


> Support Application timeout feature in YARN. 
> -
>
> Key: YARN-3813
> URL: https://issues.apache.org/jira/browse/YARN-3813
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Reporter: nijel
>Assignee: nijel
> Attachments: YARN Application Timeout .pdf
>
>
> It will be useful to support Application Timeout in YARN. Some use cases are 
> not worried about the output of the applications if the application is not 
> completed in a specific time. 
> *Background:*
> The requirement is to show the CDR statistics of last few  minutes, say for 
> every 5 minutes. The same Job will run continuously with different dataset.
> So one job will be started in every 5 minutes. The estimate time for this 
> task is 2 minutes or lesser time. 
> If the application is not completing in the given time the output is not 
> useful.
> *Proposal*
> So idea is to support application timeout, with which timeout parameter is 
> given while submitting the job. 
> Here, user is expecting to finish (complete or kill) the application in the 
> given time.
> One option for us is to move this logic to Application client (who submit the 
> job). 
> But it will be nice if it can be generic logic and can make more robust.
> Kindly provide your suggestions/opinion on this feature. If it sounds good, i 
> will update the design doc and prototype patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4029) Update LogAggregationStatus to store on finish

2015-09-03 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4029:
---
Attachment: 0003-YARN-4029.patch

Updated the test case for the store update.

> Update LogAggregationStatus to store on finish
> --
>
> Key: YARN-4029
> URL: https://issues.apache.org/jira/browse/YARN-4029
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4029.patch, 0002-YARN-4029.patch, 
> 0003-YARN-4029.patch, Image.jpg
>
>
> Currently the log aggregation status is not getting updated to Store. When RM 
> is restarted will show NOT_START. 
> Steps to reproduce
> 
> 1.Submit mapreduce application
> 2.Wait for completion
> 3.Once application is completed switch RM
> *Log Aggregation Status* are changing
> *Log Aggregation Status* from SUCCESS to NOT_START



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3216) Max-AM-Resource-Percentage should respect node labels

2015-09-03 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728772#comment-14728772
 ] 

Naganarasimha G R commented on YARN-3216:
-

Hi [~sunilg], ideally Max-AM-Resource-Percentage could be configured per queue 
per partition, which would have been the clearest, but at minimum I feel it 
should be the 2nd option. In terms of debuggability, for the case 
[~leftnoteasy] mentioned ??but also can lead to too many AMs launched under a 
single partition??, a user viewing the web UI would need to look at each 
partition accessible to the queue, find out which partition is using more AM 
resource, and then try to resolve it. With the 2nd option it would be clear 
what the Max-AM-Resource-Percentage per partition is, with UI changes to show 
AM resource usage per partition per queue.

> Max-AM-Resource-Percentage should respect node labels
> -
>
> Key: YARN-3216
> URL: https://issues.apache.org/jira/browse/YARN-3216
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-3216.patch
>
>
> Currently, max-am-resource-percentage considers default_partition only. When 
> a queue can access multiple partitions, we should be able to compute 
> max-am-resource-percentage based on that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3216) Max-AM-Resource-Percentage should respect node labels

2015-09-03 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728790#comment-14728790
 ] 

Sunil G commented on YARN-3216:
---

Synced up with [~Naganarasimha Garla] offline.
One of the concerns with option 2 is the calculation of max-am-resource at the 
partition level, which will be a subset of the queue itself. Assume a 
configuration (a corner-case scenario) where a queue has 10GB of resource with 
0.1 as max-am-resource-percent. With 2 partitions sharing this queue at 50% 
each, the max AM resource will be about 500MB per partition. Earlier, a single 
application could have got 1GB (as Naga said, over-utilizing from the other 
partition). Such cases will pop up if we go with option 2 (worked numbers 
below).
I feel this can also be considered as a point when deciding between these 2 
options.
Also I agree that having an unused DEFAULT_PARTITION sharing AM resources may 
lead to issues in the future too, so the earlier we get rid of that, the 
better. :) So my suggestion is to implement option 2, with one partition able 
to borrow AM resources from the other.

[~leftnoteasy], could you also share your thoughts here? We can sync up 
offline as needed to discuss this.
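Worked numbers for the corner case above, using the figures from the comment:

{code}
// Worked numbers for the 10GB queue / 0.1 max-am-resource-percent example.
public class PerPartitionAmLimit {
  public static void main(String[] args) {
    double queueGb = 10.0;              // queue resource from the example
    double maxAmResourcePercent = 0.1;  // max-am-resource-percent from the example
    double partitionShare = 0.5;        // 2 partitions sharing the queue equally

    double queueWideAmLimitGb = queueGb * maxAmResourcePercent;         // 1 GB
    double perPartitionAmLimitGb = queueWideAmLimitGb * partitionShare; // 0.5 GB (~500 MB)

    System.out.println("Queue-wide AM limit: " + queueWideAmLimitGb + " GB");
    System.out.println("Per-partition AM limit (option 2): " + perPartitionAmLimitGb + " GB");
  }
}
{code}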

> Max-AM-Resource-Percentage should respect node labels
> -
>
> Key: YARN-3216
> URL: https://issues.apache.org/jira/browse/YARN-3216
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-3216.patch
>
>
> Currently, max-am-resource-percentage considers default_partition only. When 
> a queue can access multiple partitions, we should be able to compute 
> max-am-resource-percentage based on that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-261) Ability to kill AM attempts

2015-09-03 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728827#comment-14728827
 ] 

Rohith Sharma K S commented on YARN-261:


We have a requirement for killing app attempts. It would be very useful if 
this went in.
[~aklochkov], would you mind rebasing the patch, please? If you are busy, 
shall I dig more into the patch and rebase it myself? Would that be fine?

> Ability to kill AM attempts
> ---
>
> Key: YARN-261
> URL: https://issues.apache.org/jira/browse/YARN-261
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api
>Affects Versions: 2.0.3-alpha
>Reporter: Jason Lowe
>Assignee: Andrey Klochkov
> Attachments: YARN-261--n2.patch, YARN-261--n3.patch, 
> YARN-261--n4.patch, YARN-261--n5.patch, YARN-261--n6.patch, 
> YARN-261--n7.patch, YARN-261.patch
>
>
> It would be nice if clients could ask for an AM attempt to be killed.  This 
> is analogous to the task attempt kill support provided by MapReduce.
> This feature would be useful in a scenario where AM retries are enabled, the 
> AM supports recovery, and a particular AM attempt is stuck.  Currently if 
> this occurs the user's only recourse is to kill the entire application, 
> requiring them to resubmit a new application and potentially breaking 
> downstream dependent jobs if it's part of a bigger workflow.  Killing the 
> attempt would allow a new attempt to be started by the RM without killing the 
> entire application, and if the AM supports recovery it could potentially save 
> a lot of work.  It could also be useful in workflow scenarios where the 
> failure of the entire application kills the workflow, but the ability to kill 
> an attempt can keep the workflow going if the subsequent attempt succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3813) Support Application timeout feature in YARN.

2015-09-03 Thread nijel (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nijel reassigned YARN-3813:
---

Assignee: nijel

> Support Application timeout feature in YARN. 
> -
>
> Key: YARN-3813
> URL: https://issues.apache.org/jira/browse/YARN-3813
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Reporter: nijel
>Assignee: nijel
> Attachments: YARN Application Timeout .pdf
>
>
> It will be useful to support Application Timeout in YARN. Some use cases are 
> not worried about the output of the applications if the application is not 
> completed in a specific time. 
> *Background:*
> The requirement is to show the CDR statistics of last few  minutes, say for 
> every 5 minutes. The same Job will run continuously with different dataset.
> So one job will be started in every 5 minutes. The estimate time for this 
> task is 2 minutes or lesser time. 
> If the application is not completing in the given time the output is not 
> useful.
> *Proposal*
> So idea is to support application timeout, with which timeout parameter is 
> given while submitting the job. 
> Here, user is expecting to finish (complete or kill) the application in the 
> given time.
> One option for us is to move this logic to Application client (who submit the 
> job). 
> But it will be nice if it can be generic logic and can make more robust.
> Kindly provide your suggestions/opinion on this feature. If it sounds good, i 
> will update the design doc and prototype patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4029) Update LogAggregationStatus to store on finish

2015-09-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728872#comment-14728872
 ] 

Hadoop QA commented on YARN-4029:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 24s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 42s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 51s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 48s | The applied patch generated  1 
new checkstyle issues (total was 132, now 133). |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 26s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 29s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |  51m 45s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  90m 24s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.TestApplicationACLs |
|   | hadoop.yarn.server.resourcemanager.TestRMRestart |
|   | hadoop.yarn.server.resourcemanager.TestResourceManager |
|   | hadoop.yarn.server.resourcemanager.rmcontainer.TestRMContainerImpl |
| Timed out tests | 
org.apache.hadoop.yarn.server.resourcemanager.TestRMAuditLogger |
|   | org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA 
|
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12753951/0003-YARN-4029.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 09c64ba |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8995/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8995/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8995/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8995/console |


This message was automatically generated.

> Update LogAggregationStatus to store on finish
> --
>
> Key: YARN-4029
> URL: https://issues.apache.org/jira/browse/YARN-4029
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4029.patch, 0002-YARN-4029.patch, 
> 0003-YARN-4029.patch, Image.jpg
>
>
> Currently the log aggregation status is not getting updated to Store. When RM 
> is restarted will show NOT_START. 
> Steps to reproduce
> 
> 1.Submit mapreduce application
> 2.Wait for completion
> 3.Once application is completed switch RM
> *Log Aggregation Status* are changing
> *Log Aggregation Status* from SUCCESS to NOT_START



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4036) Findbugs warnings in hadoop-yarn-server-common

2015-09-03 Thread Akira AJISAKA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira AJISAKA updated YARN-4036:

Attachment: findbugsHtml.html

Attaching the findbugs report generated locally.

> Findbugs warnings in hadoop-yarn-server-common
> --
>
> Key: YARN-4036
> URL: https://issues.apache.org/jira/browse/YARN-4036
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.0
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: findbugsHtml.html
>
>
> Refer to 
> https://issues.apache.org/jira/browse/YARN-3232?focusedCommentId=14679146=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14679146



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3970) REST api support for Application Priority

2015-09-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728876#comment-14728876
 ] 

Hudson commented on YARN-3970:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8394 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8394/])
YARN-3970. Add REST api support for Application Priority. Contributed by 
Naganarasimha G R. (vvasudev: rev b469ac531af1bdda01a04ae0b8d39218ca292163)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/ApplicationCLI.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/JAXBContextResolver.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppPriority.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java


> REST api support for Application Priority
> -
>
> Key: YARN-3970
> URL: https://issues.apache.org/jira/browse/YARN-3970
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Naganarasimha G R
> Fix For: 2.8.0
>
> Attachments: YARN-3970.20150828-1.patch, YARN-3970.20150829-1.patch, 
> YARN-3970.20150831-1.patch, YARN-3970.20150901-1.patch, 
> YARN-3970.20150901-2.patch
>
>
> REST api support for application priority.
> - get/set priority of an application
> - get default priority of a queue
> - get cluster max priority



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3970) REST api support for Application Priority

2015-09-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728885#comment-14728885
 ] 

Hudson commented on YARN-3970:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #339 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/339/])
YARN-3970. Add REST api support for Application Priority. Contributed by 
Naganarasimha G R. (vvasudev: rev b469ac531af1bdda01a04ae0b8d39218ca292163)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppPriority.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/JAXBContextResolver.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/ApplicationCLI.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java


> REST api support for Application Priority
> -
>
> Key: YARN-3970
> URL: https://issues.apache.org/jira/browse/YARN-3970
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Naganarasimha G R
> Fix For: 2.8.0
>
> Attachments: YARN-3970.20150828-1.patch, YARN-3970.20150829-1.patch, 
> YARN-3970.20150831-1.patch, YARN-3970.20150901-1.patch, 
> YARN-3970.20150901-2.patch
>
>
> REST api support for application priority.
> - get/set priority of an application
> - get default priority of a queue
> - get cluster max priority



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3813) Support Application timeout feature in YARN.

2015-09-03 Thread nijel (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nijel updated YARN-3813:

Attachment: 0001-YARN-3813.patch

Sorry for the long delay.

Adding an initial patch.
The action taken on timeout is KILL.
Please have a look. I will update the patch with more test cases after the 
initial review.

Thanks

> Support Application timeout feature in YARN. 
> -
>
> Key: YARN-3813
> URL: https://issues.apache.org/jira/browse/YARN-3813
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Reporter: nijel
>Assignee: nijel
> Attachments: 0001-YARN-3813.patch, YARN Application Timeout .pdf
>
>
> It will be useful to support Application Timeout in YARN. Some use cases are 
> not worried about the output of the applications if the application is not 
> completed in a specific time. 
> *Background:*
> The requirement is to show the CDR statistics of last few  minutes, say for 
> every 5 minutes. The same Job will run continuously with different dataset.
> So one job will be started in every 5 minutes. The estimate time for this 
> task is 2 minutes or lesser time. 
> If the application is not completing in the given time the output is not 
> useful.
> *Proposal*
> So idea is to support application timeout, with which timeout parameter is 
> given while submitting the job. 
> Here, user is expecting to finish (complete or kill) the application in the 
> given time.
> One option for us is to move this logic to Application client (who submit the 
> job). 
> But it will be nice if it can be generic logic and can make more robust.
> Kindly provide your suggestions/opinion on this feature. If it sounds good, i 
> will update the design doc and prototype patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3813) Support Application timeout feature in YARN.

2015-09-03 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728911#comment-14728911
 ] 

Sunil G commented on YARN-3813:
---

As I see it, the trigger point for identifying an application to be timed out 
(or killed) is the elapsed time, calculated from its submission time. If an 
application can be registered with the RMAppTimeOutMonitor after submission, I 
feel we may not have to worry about its internal state at all (rough sketch 
below).
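A very rough, hypothetical sketch of the monitor idea being discussed;
RMAppTimeOutMonitor is only a name from this thread, and everything below is
illustrative rather than the patch:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: applications are registered with a timeout at
// submission time and killed once the elapsed time exceeds that timeout.
public class AppTimeoutMonitorSketch extends Thread {
  private final Map<String, Long> deadlineByAppId = new ConcurrentHashMap<String, Long>();

  public void register(String appId, long submitTimeMs, long timeoutMs) {
    deadlineByAppId.put(appId, submitTimeMs + timeoutMs);
  }

  public void unregister(String appId) { // call when the app finishes normally
    deadlineByAppId.remove(appId);
  }

  @Override
  public void run() {
    while (!isInterrupted()) {
      long now = System.currentTimeMillis();
      for (Map.Entry<String, Long> e : deadlineByAppId.entrySet()) {
        if (now > e.getValue()) {
          killApplication(e.getKey()); // the KILL action from the initial patch
          deadlineByAppId.remove(e.getKey());
        }
      }
      try {
        Thread.sleep(1000L); // assumed monitor interval
      } catch (InterruptedException ie) {
        return;
      }
    }
  }

  private void killApplication(String appId) {
    System.out.println("Timeout expired, killing " + appId); // placeholder
  }
}
{code}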

> Support Application timeout feature in YARN. 
> -
>
> Key: YARN-3813
> URL: https://issues.apache.org/jira/browse/YARN-3813
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Reporter: nijel
>Assignee: nijel
> Attachments: 0001-YARN-3813.patch, YARN Application Timeout .pdf
>
>
> It will be useful to support Application Timeout in YARN. Some use cases are 
> not worried about the output of the applications if the application is not 
> completed in a specific time. 
> *Background:*
> The requirement is to show the CDR statistics of last few  minutes, say for 
> every 5 minutes. The same Job will run continuously with different dataset.
> So one job will be started in every 5 minutes. The estimate time for this 
> task is 2 minutes or lesser time. 
> If the application is not completing in the given time the output is not 
> useful.
> *Proposal*
> So idea is to support application timeout, with which timeout parameter is 
> given while submitting the job. 
> Here, user is expecting to finish (complete or kill) the application in the 
> given time.
> One option for us is to move this logic to Application client (who submit the 
> job). 
> But it will be nice if it can be generic logic and can make more robust.
> Kindly provide your suggestions/opinion on this feature. If it sounds good, i 
> will update the design doc and prototype patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3813) Support Application timeout feature in YARN.

2015-09-03 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728917#comment-14728917
 ] 

Naganarasimha G R commented on YARN-3813:
-

True, your point is correct. So basically it should fail if it is not yet in 
the RUNNING state; even a boolean parameter for this should be sufficient, I 
think!

> Support Application timeout feature in YARN. 
> -
>
> Key: YARN-3813
> URL: https://issues.apache.org/jira/browse/YARN-3813
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Reporter: nijel
>Assignee: nijel
> Attachments: 0001-YARN-3813.patch, YARN Application Timeout .pdf
>
>
> It will be useful to support Application Timeout in YARN. Some use cases are 
> not worried about the output of the applications if the application is not 
> completed in a specific time. 
> *Background:*
> The requirement is to show the CDR statistics of last few  minutes, say for 
> every 5 minutes. The same Job will run continuously with different dataset.
> So one job will be started in every 5 minutes. The estimate time for this 
> task is 2 minutes or lesser time. 
> If the application is not completing in the given time the output is not 
> useful.
> *Proposal*
> So idea is to support application timeout, with which timeout parameter is 
> given while submitting the job. 
> Here, user is expecting to finish (complete or kill) the application in the 
> given time.
> One option for us is to move this logic to Application client (who submit the 
> job). 
> But it will be nice if it can be generic logic and can make more robust.
> Kindly provide your suggestions/opinion on this feature. If it sounds good, i 
> will update the design doc and prototype patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4044) Running applications information changes such as movequeue is not published to TimeLine server

2015-09-03 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728938#comment-14728938
 ] 

Sunil G commented on YARN-4044:
---

Thank you [~Naganarasimha] for sharing the comments.

bq.ApplicationHistoryManagerOnTimelineStore.convertToApplicationReport needs to 
capture the event ApplicationMetricsConstants.UPDATED_EVENT_TYPE 
I feel this is not needed. We are already not handling 
{{ACLS_UPDATED_EVENT_TYPE}} either. I think we only use special handling in 
{{convertToApplicationReport}} for CREATED_EVENT_TYPE and FINISHED_EVENT_TYPE, 
to record the appStarted and appFinished times (plus a few other finished 
details).
I used {{entityInfo}} to store the queue and priority details, so the first 
part of the code in {{convertToApplicationReport}} covers reading those 
details, which are then fed to the {{ApplicationReportExt}} object. Please 
share your opinion.

bq.we need to check return the report with the information present in latest 
ApplicationMetricsConstants.UPDATED_EVENT_TYPE
One doubt here: we are already using {{EnumSet.allOf(Field.class)}} in 
{{ApplicationHistoryManagerOnTimelineStore.getApplication}}, and hence it 
includes the {{LAST_EVENT_ONLY}} field. So we will read the last saved entity. 
Please help correct me if I am wrong.

> Running applications information changes such as movequeue is not published 
> to TimeLine server
> --
>
> Key: YARN-4044
> URL: https://issues.apache.org/jira/browse/YARN-4044
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-4044.patch, 0002-YARN-4044.patch
>
>
> SystemMetricsPublisher needs to expose an appUpdated API to update any change 
> for a running application.
> Events can be:
> - change of queue for a running application.
> - change of application priority for a running application.
> This ticket intends to handle both RM and timeline side changes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3813) Support Application timeout feature in YARN.

2015-09-03 Thread nijel (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728898#comment-14728898
 ] 

nijel commented on YARN-3813:
-

This patch will address the initial issue, but it will kill the application 
even if it is in RUNNING state.

As I understand it, the idea is to configure the states in which the monitor 
should consider killing the application. Correct?

One doubt I have is whether the user will be aware of all the intermediate 
states of an app.
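
To make the suggestion concrete, a minimal sketch assuming a configurable set of 
killable states (class and method names are hypothetical, not from the patch):
{code}
import java.util.EnumSet;

import org.apache.hadoop.yarn.api.records.YarnApplicationState;

// Minimal sketch: the timeout monitor only kills an application whose current
// state is in a configurable "killable" set, so e.g. RUNNING apps can be
// excluded if the user wants that.
public class AppTimeoutPolicy {
  private final EnumSet<YarnApplicationState> killableStates;

  public AppTimeoutPolicy(EnumSet<YarnApplicationState> killableStates) {
    this.killableStates = killableStates;
  }

  /** True only if the app exceeded its timeout AND is in a killable state. */
  public boolean shouldKill(YarnApplicationState state, long submitTimeMs,
      long timeoutMs, long nowMs) {
    return (nowMs - submitTimeMs) > timeoutMs && killableStates.contains(state);
  }
}
{code}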

> Support Application timeout feature in YARN. 
> -
>
> Key: YARN-3813
> URL: https://issues.apache.org/jira/browse/YARN-3813
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Reporter: nijel
>Assignee: nijel
> Attachments: 0001-YARN-3813.patch, YARN Application Timeout .pdf
>
>
> It will be useful to support an Application Timeout in YARN. Some use cases do 
> not care about the output of an application if it does not complete within a 
> specific time. 
> *Background:*
> The requirement is to show the CDR statistics of the last few minutes, say every 
> 5 minutes. The same job runs continuously with different datasets,
> so one job is started every 5 minutes. The estimated time for this 
> task is 2 minutes or less. 
> If the application does not complete in the given time, its output is not 
> useful.
> *Proposal*
> The idea is to support an application timeout, where a timeout parameter is 
> given while submitting the job. 
> Here, the user expects the application to be finished (completed or killed) in 
> the given time.
> One option is to move this logic to the application client (which submits the 
> job), 
> but it would be nicer to make this generic and more robust.
> Kindly provide your suggestions/opinions on this feature. If it sounds good, I 
> will update the design doc and prototype patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4111) Killed application diagnostics message should be set rather than having a static message

2015-09-03 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-4111:
---

 Summary: Killed application diagnostics message should be set 
rather than having a static message
 Key: YARN-4111
 URL: https://issues.apache.org/jira/browse/YARN-4111
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Rohith Sharma K S
Assignee: Rohith Sharma K S


An application can be killed either by the *user via ClientRMService* OR *by the 
scheduler*. Currently the diagnostic message is set statically, i.e. {{Application 
killed by user.}}, even when the application was killed by the scheduler. This 
confuses the user after the application is killed: he did not kill the 
application at all, yet the diagnostic message states that the 'application is 
killed by user'.

It would be useful if the diagnostic message were different for each cause of 
KILL.
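
For example, an illustrative sketch (not actual RM code) of deriving the message 
from the cause of the kill:
{code}
// Illustrative sketch only: pick the diagnostic message from the cause of
// the kill instead of one static string.
public final class KillDiagnostics {
  public enum KillCause { USER, SCHEDULER, ADMIN }

  public static String messageFor(KillCause cause) {
    switch (cause) {
      case SCHEDULER:
        return "Application killed by scheduler.";
      case ADMIN:
        return "Application killed by cluster administrator.";
      case USER:
      default:
        return "Application killed by user.";
    }
  }
}
{code}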



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4111) Killed application diagnostics message should be set rather than having a static message

2015-09-03 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S reassigned YARN-4111:
---

Assignee: nijel  (was: Rohith Sharma K S)

> Killed application diagnostics message should be set rather than having a 
> static message
> 
>
> Key: YARN-4111
> URL: https://issues.apache.org/jira/browse/YARN-4111
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: nijel
>
> An application can be killed either by the *user via ClientRMService* OR *by the 
> scheduler*. Currently the diagnostic message is set statically, i.e. {{Application 
> killed by user.}}, even when the application was killed by the scheduler. This 
> confuses the user after the application is killed: he did not kill the 
> application at all, yet the diagnostic message states that the 'application is 
> killed by user'.
> It would be useful if the diagnostic message were different for each cause of 
> KILL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4110) RMAppImpl and RMAppAttemptImpl should override hashCode() & equals()

2015-09-03 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S reassigned YARN-4110:
---

Assignee: nijel  (was: Rohith Sharma K S)

> RMAppImpl and RMAppAttemptImpl should override hashCode() & equals()
> 
>
> Key: YARN-4110
> URL: https://issues.apache.org/jira/browse/YARN-4110
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: nijel
>
> It is observed that RMAppImpl and RMAppAttemptImpl do not have hashCode() 
> and equals() implementations. These state objects should override them.
> # For RMAppImpl, we can make use of ApplicationId#hashCode and 
> ApplicationId#equals.
> # Similarly, for RMAppAttemptImpl, ApplicationAttemptId#hashCode and 
> ApplicationAttemptId#equals.
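
For illustration, a simplified sketch of the proposed override (the real RMAppImpl 
obviously has many more fields; identity is simply delegated to the immutable 
ApplicationId):
{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;

// Simplified sketch of the proposed hashCode()/equals() delegation.
public class RMAppImplSketch {
  private final ApplicationId applicationId;

  public RMAppImplSketch(ApplicationId applicationId) {
    this.applicationId = applicationId;
  }

  @Override
  public int hashCode() {
    return applicationId.hashCode();
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof RMAppImplSketch)) {
      return false;
    }
    return applicationId.equals(((RMAppImplSketch) obj).applicationId);
  }
}
{code}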



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3368) Improve YARN web UI

2015-09-03 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728906#comment-14728906
 ] 

Naganarasimha G R commented on YARN-3368:
-

Hi,
I came across an issue while installing as per the README.md on Ubuntu 14.04 
(64-bit) and took a few alternate steps to get it running (on another Ubuntu 
setup the steps in the README.md were sufficient):
* sudo apt-get update
* sudo apt-get install nodejs
* sudo apt-get install npm

Test the following commands:
* node --help
* npm --help

If they produce no output, follow these steps instead:
* sudo apt-get remove npm
* sudo apt-get remove nodejs
* wget http://nodejs.org/dist/v0.12.0/node-v0.12.7-linux-x64.tar.gz
* sudo tar -C /usr/local --strip-components 1 -xzf node-v0.12.7-linux-x64.tar.gz

Verify:
* ls -l /usr/local/bin/node
* ls -l /usr/local/bin/npm

Then:
* sudo npm install bower -g
* sudo npm install -g ember-cli
* cd /hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui
* npm install && bower install

No changes are needed in yarn-app.js; the default configuration setting of 
"http://localhost:1337/localhost:8088" should work.
* sudo npm install -g corsproxy
* nohup corsproxy > corsproxy.log 2>&1 &
* ember serve
* visit "http://localhost:4200"

> Improve YARN web UI
> ---
>
> Key: YARN-3368
> URL: https://issues.apache.org/jira/browse/YARN-3368
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
> Attachments: Applications-table-Screenshot.png, 
> Queue-Hierarchy-Screenshot.png, YARN-3368.poc.1.patch
>
>
> The goal is to improve YARN UI for better usability.
> We may take advantage of some existing front-end frameworks to build a 
> fancier, easier-to-use UI. 
> The old UI continue to exist until  we feel it's ready to flip to the new UI.
> This serves as an umbrella jira to track the tasks. we can do this in a 
> branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3970) REST api support for Application Priority

2015-09-03 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728866#comment-14728866
 ] 

Varun Vasudev commented on YARN-3970:
-

Committed to trunk and branch-2. Thanks [~Naganarasimha]!
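
For anyone trying this out, a rough usage sketch of the new endpoint (path as 
documented in ResourceManagerRest.md; the RM address and application id below are 
placeholders):
{code}
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Rough usage sketch (not from the patch): set an application's priority via
// the REST API. RM address and application id are placeholders.
public class SetAppPrioritySketch {
  public static void main(String[] args) throws Exception {
    String rm = "http://rm-host:8088";                 // placeholder RM web address
    String appId = "application_1441270494979_0001";   // placeholder app id
    URL url = new URL(rm + "/ws/v1/cluster/apps/" + appId + "/priority");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write("{\"priority\": 8}".getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP " + conn.getResponseCode());
  }
}
{code}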

> REST api support for Application Priority
> -
>
> Key: YARN-3970
> URL: https://issues.apache.org/jira/browse/YARN-3970
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Naganarasimha G R
> Fix For: 2.8.0
>
> Attachments: YARN-3970.20150828-1.patch, YARN-3970.20150829-1.patch, 
> YARN-3970.20150831-1.patch, YARN-3970.20150901-1.patch, 
> YARN-3970.20150901-2.patch
>
>
> REST api support for application priority.
> - get/set priority of an application
> - get default priority of a queue
> - get cluster max priority



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3970) REST api support for Application Priority

2015-09-03 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-3970:

Fix Version/s: 2.8.0

> REST api support for Application Priority
> -
>
> Key: YARN-3970
> URL: https://issues.apache.org/jira/browse/YARN-3970
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Naganarasimha G R
> Fix For: 2.8.0
>
> Attachments: YARN-3970.20150828-1.patch, YARN-3970.20150829-1.patch, 
> YARN-3970.20150831-1.patch, YARN-3970.20150901-1.patch, 
> YARN-3970.20150901-2.patch
>
>
> REST api support for application priority.
> - get/set priority of an application
> - get default priority of a queue
> - get cluster max priority



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3970) REST api support for Application Priority

2015-09-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728900#comment-14728900
 ] 

Hudson commented on YARN-3970:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #346 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/346/])
YARN-3970. Add REST api support for Application Priority. Contributed by 
Naganarasimha G R. (vvasudev: rev b469ac531af1bdda01a04ae0b8d39218ca292163)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/JAXBContextResolver.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/ApplicationCLI.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppPriority.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md


> REST api support for Application Priority
> -
>
> Key: YARN-3970
> URL: https://issues.apache.org/jira/browse/YARN-3970
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Naganarasimha G R
> Fix For: 2.8.0
>
> Attachments: YARN-3970.20150828-1.patch, YARN-3970.20150829-1.patch, 
> YARN-3970.20150831-1.patch, YARN-3970.20150901-1.patch, 
> YARN-3970.20150901-2.patch
>
>
> REST api support for application priority.
> - get/set priority of an application
> - get default priority of a queue
> - get cluster max priority



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3970) REST api support for Application Priority

2015-09-03 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728909#comment-14728909
 ] 

Naganarasimha G R commented on YARN-3970:
-

Thanks for review and commit [~vvasudev], [~sunilg] & [~rohithsharma]

> REST api support for Application Priority
> -
>
> Key: YARN-3970
> URL: https://issues.apache.org/jira/browse/YARN-3970
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Naganarasimha G R
> Fix For: 2.8.0
>
> Attachments: YARN-3970.20150828-1.patch, YARN-3970.20150829-1.patch, 
> YARN-3970.20150831-1.patch, YARN-3970.20150901-1.patch, 
> YARN-3970.20150901-2.patch
>
>
> REST api support for application priority.
> - get/set priority of an application
> - get default priority of a queue
> - get cluster max priority



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730126#comment-14730126
 ] 

Hadoop QA commented on YARN-4087:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  21m  9s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear to include any new or modified tests.  Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac |   7m 58s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc |   9m 53s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   2m 57s | The applied patch generated  4 new checkstyle issues (total was 58, now 58). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install |   1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 34s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 24s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   1m 57s | Tests passed in hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |  54m 14s | Tests passed in hadoop-yarn-server-resourcemanager. |
| | | 105m 38s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12754094/YARN-4087.3.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / ed78b14 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9003/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9003/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9003/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9003/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9003/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9003/console |


This message was automatically generated.

> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
> production environment.
> 2. Fixe state-store to also notify app/attempt if state-store error is 
> ignored so that app/attempt is not stuck at *_SAVING state
> 3. If HA is enabled and if there's any state-store error, after the retry 
> operation failed, we always transition RM to standby state.  Otherwise, we 
> may see two active RMs running. YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables

2015-09-03 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729890#comment-14729890
 ] 

Li Lu commented on YARN-3901:
-

bq. Yes, in this code, we are only "reading" or returning back cells to the 
client (hbase client). But when we add in the compaction/flush coprocessors, 
they will write back to hbase as well.
Oh I see. This is the missing piece that confused me. Thanks for the 
clarification! 

bq. We could use these attributes outside of Aggregation as well, so don't want 
the agg prefix here. 
OK, if this is the plan, let's leave it as is to unblock the critical path of 
the whole JIRA; we can clean this up later. 

> Populate flow run data in the flow_run & flow activity tables
> -
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per-flow-run information aggregated across applications, flow version
> - RM’s collector writes to it on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-03 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729905#comment-14729905
 ] 

Jian He commented on YARN-4087:
---

Right, we should make sure  the state is correct if the error is ignored.

Uploaded a new patch which addressed Vinod's comment about the inconsistent 
state.
Also updated the description about the fixes included in this jira.

> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
> production environment.
> 2. Fixes state-store to also notify app/attempt if state-store error is 
> ignored so that app/attempt is not stuck at *_SAVING state
> 3. If HA is enabled and if there's any state-store error, after the retry 
> operation failed, we always transition RM to standby state.  Otherwise, we 
> may see two active RMs running. YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-03 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-4087:
--
Description: 
Several fixes:
1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
production environment.
2. Fixe state-store to also notify app/attempt if state-store error is ignored 
so that app/attempt is not stuck at *_SAVING state
3. If HA is enabled and if there's any state-store error, after the retry 
operation failed, we always transition RM to standby state.  Otherwise, we may 
see two active RMs running. YARN-4107 is one example.

  was:
Several fixes:
1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
production environment.
2. Fixes state-store to also notify app/attempt if state-store error is ignored 
so that app/attempt is not stuck at *_SAVING state
3. If HA is enabled and if there's any state-store error, after the retry 
operation failed, we always transition RM to standby state.  Otherwise, we may 
see two active RMs running. YARN-4107 is one example.


> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
> production environment.
> 2. Fixe state-store to also notify app/attempt if state-store error is 
> ignored so that app/attempt is not stuck at *_SAVING state
> 3. If HA is enabled and if there's any state-store error, after the retry 
> operation failed, we always transition RM to standby state.  Otherwise, we 
> may see two active RMs running. YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables

2015-09-03 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729910#comment-14729910
 ] 

Joep Rottinghuis commented on YARN-3901:


Not sure if [~gtCarrera9] is asking what happens if we have consolidated a 
"finished" app into the flow aggregate and then we get an older batched put 
from the same app (from an older AM that wasn't properly killed or something). 
That is currently an open problem that we can address by storing the list of x 
most recently completed apps in a tag on the total sum. We can't quite store 
all app IDs that have been collapsed into the flow sum, because that list could 
potentially be really long. We could keep a list of the last x apps, to 
radically reduce the likelihood that stale batched writes end up messing up the 
aggregation if there were to be a race condition (however rare that might be). 
I think we can add that guarding behavior in a separate jira and not further 
complicate the first cut (that might suffer from this rare race condition).

As [~sjlee0] pointed out, and as we were discussing offline this morning, 
scan.setMaxResultSize(limit) doesn't limit the # of rows that are returned, but 
limits the size in bytes. Not sure we want to address that here, or if we'll 
let him adjust that in his patch.
We should limit the number of rows we retrieve from the scan and, if needed (as 
[~sjlee0] pointed out), add a PageFilter with a row limit in addition to the size 
limit to further restrict what is prefetched into a buffer (this gets more 
complicated when the scan spans region servers).
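
To make the distinction concrete, a small sketch (note that PageFilter limits rows 
per region server, so the client still has to stop reading once it reaches its own 
limit):
{code}
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;

// Sketch of the distinction above: setMaxResultSize caps the bytes buffered
// per fetch, not the number of rows; PageFilter is what limits rows.
public class ScanLimitSketch {
  public static Scan limitedScan(byte[] startRow, byte[] stopRow, long rowLimit) {
    Scan scan = new Scan(startRow, stopRow);
    scan.setMaxResultSize(2 * 1024 * 1024);   // bytes per fetch, not a row limit
    scan.setFilter(new PageFilter(rowLimit)); // row limit, applied per region server
    return scan;
  }
}
{code}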

It is a little confusing to read the patch and see where ColumnPrefix has just 
an additional argument in the store method for Attribute... attributes and when 
the method with a String qualifier is changed to one with a byte[]. There seems 
to be a discrepancy in how EntityTablePrefix and ApplicationTablePrefix are 
handled. I'm not sure if it is needed to have a getColumnQualifier with a long in 
ColumnHelper, but we may have to review this together interactively behind two 
laptops.

In HBaseTimelineWriterImpl.onApplicationFinished you have an old comment:
281  // clear out the hashmap in case the downstream store/coprocessor
282  // added any more attributes during the put for END_TIME
that no longer makes sense. Also, I'd simply do:
{code}
storeFlowMetrics(rowKey, metrics, attribute1,
    AggregationOperations.SUM_FINAL.getAttribute());
{code}

I'd update the javadoc in AggregationOperations before the enum definitions. 
They are the old style and no longer make sense in the more generic case.
SUM indicates that the values need to be summed up, MIN means that only the 
minimum needs to be kept etc.

Initially I found the method name getIncomingAttributes and the corresponding 
member names somewhat confusing (from the method's perspective these aren't the 
incoming values, they are the outgoing values). Perhaps combinedAttribute and 
combineAttributes(...) would make more sense, but the provided logic seems correct.

FlowActivityColumnPrefix.IN_PROGRESS_TIME needs a better javadoc description to 
describe its use and meaning.

The coprocessor methods need a little more javadoc to explain what is going on; 
to the casual reader this is total voodoo.
preGetOp creates a new scanner (ok), then does a single next on it (why?) 
and then bypasses the environment (huh?).
Similarly, if in preScannerOpen we already set scan.setMaxVersions(), then why 
is the same still needed in preGetOp, while in postScannerOpen we don't do it 
anymore (presumably already done in the pre-open)?

I like the more generic FlowRunCoprocessor (although it could have a name that is 
not associated with a table, because the behavior is generic, and arg names such 
as frpa are probably artifacts from a previous version).
In getTagFromAttribute, is it possible to recognize an operation from an 
AggregationCompactionDimension without relying on catching an exception? 
For example, can you do AggregationOperation.isA(Attribute a) or something like 
that?

The other thing I realize with the coprocessor is this. It nicely maps 
attributes to tags, but we unnecessarily bloat every single put with the 
operation.
We could get creative and use a different column prefix for min and max 
columns. Then the coprocessor can pick that up during read/flush/compaction. 
That makes queries (and filters) much harder. So for now we're probably stuck 
with tagging each value. Perhaps not so bad for min and max given that after 
flush and compact we store only one value.

For SUM we will always have an Aggregation dimension, so adding a SUM tag then 
isn't needed. We assume an aggregation dimension w/o agg operation would 
default to SUM. We do certainly need to tag values with SUM_FINAL.

Aside from that, in FlowRunCoprocessor.prePut do we have to keep doing 
Tag.fromList(tags) for each cell, or can we create the serialized tags once and 
re-use them?

When reading through the 

[jira] [Commented] (YARN-3942) Timeline store to read events from HDFS

2015-09-03 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729917#comment-14729917
 ] 

Bikas Saha commented on YARN-3942:
--

In Option 1 the latency would be the time to read the entire session file for a 
session that has run many DAGs, right?
Since we know the file size to be read, could we return a message saying 
something like "scanning file size FOO. Expect BAR latency"?

> Timeline store to read events from HDFS
> ---
>
> Key: YARN-3942
> URL: https://issues.apache.org/jira/browse/YARN-3942
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-3942.001.patch
>
>
> This adds a new timeline store plugin that is intended as a stop-gap measure 
> to mitigate some of the issues we've seen with ATS v1 while waiting for ATS 
> v2.  The intent of this plugin is to provide a workable solution for running 
> the Tez UI against the timeline server on large-scale clusters running many 
> thousands of jobs per day.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4102) Add a "skip existing table" mode for timeline schema creator

2015-09-03 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730086#comment-14730086
 ] 

Joep Rottinghuis commented on YARN-4102:


s/in/it

> Add a "skip existing table" mode for timeline schema creator
> 
>
> Key: YARN-4102
> URL: https://issues.apache.org/jira/browse/YARN-4102
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-4102-YARN-2928.001.patch, 
> YARN-4102-YARN-2928.002.patch, YARN-4102-YARN-2928.003.patch
>
>
> When debugging timeline POCs, we may need to create hbase tables that are 
> added in some ongoing patches. Right now, our schema creator will exit when 
> it hits one existing table. While this is a correct behavior with end users, 
> this introduces much trouble in debugging POCs: every time we have to disable 
> all existing tables, drop them, run the schema creator to generate all 
> tables, and regenerate all test data. 
> Maybe we'd like to add an "incremental" mode so that the creator will only 
> create non-existing tables? This is pretty handy in deploying our POCs. Of 
> course, consistency has to be kept in mind across tables. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-03 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-4087:
--
Description: 
Several fixes:
1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
production environment.
2. Fixes state-store to also notify app/attempt if state-store error is ignored 
so that app/attempt is not stuck at *_SAVING state
3. If HA is enabled and if there's any state-store error, after the retry 
operation failed, we always transition RM to standby state.  Otherwise, we may 
see two active RMs running. YARN-4107 is one example.

  was:Increasingly, I feel setting this property to be false makes more sense 
especially in production environment, 

Summary: Followup fixes after YARN-2019 regarding RM behavior when 
state-store error occurs  (was: Set YARN_FAIL_FAST to be false by default)

> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
> production environment.
> 2. Fixes state-store to also notify app/attempt if state-store error is 
> ignored so that app/attempt is not stuck at *_SAVING state
> 3. If HA is enabled and if there's any state-store error, after the retry 
> operation failed, we always transition RM to standby state.  Otherwise, we 
> may see two active RMs running. YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3850) NM fails to read files from full disks which can lead to container logs being lost and other issues

2015-09-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3850:
--
Fix Version/s: 2.6.1

Pulled this into 2.6.1 after fixing a minor conflict in TestLogAggregation.java.

Ran compilation and TestLogAggregationService, TestContainerLogsPage before the 
push. Patch applied cleanly.

> NM fails to read files from full disks which can lead to container logs being 
> lost and other issues
> ---
>
> Key: YARN-3850
> URL: https://issues.apache.org/jira/browse/YARN-3850
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, nodemanager
>Affects Versions: 2.7.0
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>Priority: Blocker
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1, 2.7.1
>
> Attachments: YARN-3850.01.patch, YARN-3850.02.patch
>
>
> *Container logs* can be lost if the disk has become full (~90% full).
> When an application finishes, we upload logs after aggregation by calling 
> {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn 
> checks the eligible directories via 
> {{LocalDirsHandlerService#getLogDirs}}, which in the disk-full case returns 
> nothing. So none of the container logs are aggregated and uploaded.
> But on application finish, we also call 
> {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the 
> application directory which contains container logs. This is because it calls 
> {{LocalDirsHandlerService#getLogDirsForCleanup}} which returns the full disks 
> as well.
> So we are left with neither aggregated logs for the app nor the individual 
> container logs for the app.
> In addition to this, there are 2 more issues :
> # {{ContainerLogsUtil#getContainerLogDirs}} does not consider full disks so 
> NM will fail to serve up logs from full disks from its web interfaces.
> # {{RecoveredContainerLaunch#locatePidFile}} also does not consider full 
> disks so it is possible that on container recovery, PID file is not found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-09-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730060#comment-14730060
 ] 

Wangda Tan commented on YARN-4108:
--

Thanks [~jlowe]/[~sunilg] for sharing your thoughts!

I think the ideal case is what [~jlowe] mentioned: {{When we decide to 
preempt we can move the request's reservation to the node where we decided to 
preempt containers.}} If ProportionalCPP has this capability, most of the 
problems will be resolved. But one major issue is that it will be hard to handle 
requests and preemption together -- we have a very complex ordering algorithm 
to determine which request/application will be served, and we would have to do 
some simulation to get the ordering, which is very expensive.

Along with Jason's suggestion, we can achieve this by implementing the algorithm 
within the scheduler's allocation cycle, which can simplify the preemption logic 
AND keep it synchronized with the allocation logic.
- ProportionalCPP will decide how to balance resources between queues, which is 
as simple as: how much resource of each queue/application/user can be 
preempted. Please note that PCPP won't mark any to-be-preempted containers 
in this stage.
- The scheduler's allocation cycle will proactively trigger preemption; we don't 
need to do this on every node heartbeat. To make this efficient, *we can do the 
preemption check once every X (configurable) node heartbeats.*

Logic may look like:
{code}
node-do-heartbeat(node, application) {
    if (do-preemption-check) {
        // Preemptable resource is decided by ProportionalCPP AND the application
        Resource preemptable = node.getPreemptable(application)

        if (node.available + preemptable > application.next_request) {
            // mark to-be-preempted containers
        } else {
            // reserve application.next_request if it could be reserved.
            // preemptable containers will keep running if the reserved
            // container can be allocated
        }
    }
}
{code}

This has the same order of magnitude of time complexity, and it should be able 
to solve the progressive preemption problem.

cc: [~curino].

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> This is sibling JIRA for YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*:
> 1) Can handle case of user-limit preemption
> 2) Can handle case of resource placement requirements, such as: hard-locality 
> (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross user preemption (YARN-2113), 
> cross applicaiton preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3216) Max-AM-Resource-Percentage should respect node labels

2015-09-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730073#comment-14730073
 ] 

Wangda Tan commented on YARN-3216:
--

Thanks for sharing your thoughts, [~sunilg], [~Naganarasimha].
Having reconsidered this problem, I prefer to go with option-2: a partition should 
be considered a sub-cluster, so it should have its own max-am-resource-percentage. 
I think what we can do is:
- Have a per-queue-per-partition "max AM percentage" configuration; if it is not 
specified, we assume it inherits the queue's configured one.
- Guarantee to launch at least one AM container in every partition, no matter 
what the AM-resource-percentage setting is.
- If we change a node's label, we should update the per-queue-per-partition 
AM-used resource as well. (We have YARN-4082 committed.)

Sounds like a plan? :)

> Max-AM-Resource-Percentage should respect node labels
> -
>
> Key: YARN-3216
> URL: https://issues.apache.org/jira/browse/YARN-3216
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-3216.patch
>
>
> Currently, max-am-resource-percentage considers default_partition only. When 
> a queue can access multiple partitions, we should be able to compute 
> max-am-resource-percentage based on that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4059) Preemption should delay assignments back to the preempted queue

2015-09-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729983#comment-14729983
 ] 

Wangda Tan commented on YARN-4059:
--

Hi [~jlowe],

Thanks for the explanation; now I can better understand why YARN chose a 
count-based wait instead of a time-based wait at the beginning.

bq. If the app only wants a small portion of the cluster then it already scales 
down the amount of time it will wait in getLocalityWaitFactor.
I think this is another thing we need to fix. Currently it uses 
#request-container * localityWaitFactor as the minimum wait threshold for 
off-switch. If an app asks for #containers >> (#hosts-per-rack) (say, 10k 
containers in a cluster with 4 racks of 5 nodes each), and the expected racks 
are not available, the app may need to wait 10+ minutes to get one off-switch 
container. I would like to make this more deterministic: it would be a fixed 
number, such as waiting 5 secs at rack-local before going to off-switch.

bq. The problem I think we're going to run into with a time-based approach is 
that we don't know what time an individual request arrived since we only store 
the aggregation of requests for a particular priority.
I totally agree with this. Here is one solution we have in mind that may solve the 
problem; I discussed this offline with [~vinodkv] and it seems to work end-to-end:

- When an app is able to allocate one container on a node but prefers to wait, it 
will reserve on that node. (The current behavior is that reservation happens only 
once the app has accumulated enough missed opportunities.)
- The benefits of doing this before the missed-opportunity threshold are: 1) the 
application officially declares "this is my node", so the rest of the applications 
will be skipped; 2) we already have a mechanism to avoid excessive reservations, 
so one high-priority app cannot block a whole cluster if it only asks for a few 
containers.
- Redefine locality-delay to be: the amount of time an app is willing to wait to 
*allocate a single container for a given app/priority*. This is very deterministic 
to me (much more deterministic than the existing count-based delay).
- We will start the waiting timer once we have reserved a container on a node; the 
waiting timer is a property of the reserved RMContainer, so if we choose to move 
the reservation, the timer will be kept.
- This solution supports per-app/per-priority locality-delay, and it is not 
affected by how many nodes/racks are in the cluster.

Thoughts?
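
To make the time-based idea concrete, a toy sketch (all names hypothetical): the 
waiting timer starts when the container is reserved, and the reservation is only 
relaxed to a lower locality level once the configured delay has elapsed.
{code}
import java.util.concurrent.TimeUnit;

// Toy sketch, all names hypothetical: the waiting timer starts at reservation
// time; each locality level is relaxed only after its configured delay.
public class LocalityDelayPolicy {
  private final long rackLocalDelayMs;
  private final long offSwitchDelayMs;

  public LocalityDelayPolicy(long rackLocalDelayMs, long offSwitchDelayMs) {
    this.rackLocalDelayMs = rackLocalDelayMs;
    this.offSwitchDelayMs = offSwitchDelayMs;
  }

  /** May we give up on node-locality and accept a rack-local node? */
  public boolean canAllocateRackLocal(long reservedAtMs, long nowMs) {
    return nowMs - reservedAtMs >= rackLocalDelayMs;
  }

  /** May we give up on rack-locality and accept any node? */
  public boolean canAllocateOffSwitch(long reservedAtMs, long nowMs) {
    return nowMs - reservedAtMs >= offSwitchDelayMs;
  }

  public static void main(String[] args) {
    LocalityDelayPolicy p = new LocalityDelayPolicy(
        TimeUnit.SECONDS.toMillis(5), TimeUnit.SECONDS.toMillis(10));
    long reservedAt = System.currentTimeMillis() - 6000; // reserved 6s ago
    long now = System.currentTimeMillis();
    System.out.println(p.canAllocateRackLocal(reservedAt, now)); // true
    System.out.println(p.canAllocateOffSwitch(reservedAt, now)); // false
  }
}
{code}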

> Preemption should delay assignments back to the preempted queue
> ---
>
> Key: YARN-4059
> URL: https://issues.apache.org/jira/browse/YARN-4059
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
>
>
> When preempting containers from a queue it can take a while for the other 
> queues to fully consume the resources that were freed up, due to delays 
> waiting for better locality, etc. Those delays can cause the resources to be 
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or 
> time, to avoid granting containers to a queue that was recently preempted. 
> The delay should be sufficient to cover the cycles of the preemption monitor, 
> so we won't try to assign containers in-between preemption events for a queue.
> Worst-case scenario for assigning freed resources to other queues is when all 
> the other queues want no locality. No locality means only one container is 
> assigned per heartbeat, so we need to wait for the entire cluster 
> heartbeating in times the number of containers that could run on a single 
> node.
> So the "penalty time" for a queue should be the max of either the preemption 
> monitor cycle time or the amount of time it takes to allocate the cluster 
> with one container per heartbeat. Guessing this will be somewhere around 2 
> minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

