[jira] [Created] (YARN-2520) Scalable and High Available Timeline Server

2014-09-08 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2520:
-

 Summary: Scalable and High Available Timeline Server
 Key: YARN-2520
 URL: https://issues.apache.org/jira/browse/YARN-2520
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.5.0, 3.0.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen


YARN-2032 will provide a scalable and reliable timeline store based on HBase. 
However, a single timeline server instance is not scalable enough to handle a 
large volume of user requests and becomes the single bottleneck.

As the timeline server is stateless, it's not difficult to start multiple 
timeline server instances that write into the same HBase timeline store. We can 
make use of ZooKeeper to register all the timeline servers, as HA RMs do, and 
clients can randomly pick one server to publish their timeline entities to, for 
load balancing.

Moreover, since multiple timeline servers are started together, they effectively 
back each other up, solving the high-availability problem as well.
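
Purely as an illustration of the ZooKeeper-based discovery and random pick 
described above (this is not code from this JIRA; the Curator usage and the 
/yarn/timeline-servers path are assumptions), a client-side sketch could look 
like the following:

{code}
import java.util.List;
import java.util.Random;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class TimelineServerPicker {
  // Hypothetical znode under which each timeline server registers itself,
  // e.g. /yarn/timeline-servers/host1:8188
  private static final String REGISTRY_PATH = "/yarn/timeline-servers";

  public static String pickServer(String zkQuorum) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        zkQuorum, new ExponentialBackoffRetry(1000, 3));
    zk.start();
    try {
      // Each child znode name is assumed to be a "host:port" web address.
      List<String> servers = zk.getChildren().forPath(REGISTRY_PATH);
      if (servers.isEmpty()) {
        throw new IllegalStateException("No timeline server registered");
      }
      // A random pick spreads the write load across the federated instances.
      return servers.get(new Random().nextInt(servers.size()));
    } finally {
      zk.close();
    }
  }
}
{code}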



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2520) Scalable and High Available Timeline Server

2014-09-08 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2520:
--
Attachment: Federal Timeline Servers.jpg

Attached a simple architecture figure of the federal timeline servers.

 Scalable and High Available Timeline Server
 ---

 Key: YARN-2520
 URL: https://issues.apache.org/jira/browse/YARN-2520
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 3.0.0, 2.5.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: Federal Timeline Servers.jpg


 YARN-2032 will provide a scalable and reliable timeline store based on HBase. 
 However, a single timeline server instance is not scalable enough to handle a 
 large volume of user requests and becomes the single bottleneck.
 As the timeline server is stateless, it's not difficult to start multiple 
 timeline server instances that write into the same HBase timeline store. We can 
 make use of ZooKeeper to register all the timeline servers, as HA RMs do, and 
 clients can randomly pick one server to publish their timeline entities to, for 
 load balancing.
 Moreover, since multiple timeline servers are started together, they effectively 
 back each other up, solving the high-availability problem as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2521) Reliable TimelineClient

2014-09-08 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2521:
-

 Summary: Reliable TimelineClient
 Key: YARN-2521
 URL: https://issues.apache.org/jira/browse/YARN-2521
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.5.0, 3.0.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen


The timeline server may suffer an outage. It would be beneficial if the 
timeline client could cache a timeline entity locally after the application 
passes it to the client, and before the client successfully hands it over to the 
server.

To prevent the entity from being lost, we may want to persist it into 
secondary storage, such as HDFS or LevelDB.
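
A rough sketch of that flow, purely for illustration: the class, the staging 
directory layout and the caller-supplied JSON serialization below are 
assumptions, not an existing API; only TimelineClient#putEntities and the HDFS 
FileSystem calls are real.

{code}
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;

public class CachingTimelinePublisher {
  private final TimelineClient client;   // existing YARN timeline client
  private final FileSystem fs;
  private final Path stagingDir;         // e.g. /tmp/timeline-staging/<appId> (illustrative)

  public CachingTimelinePublisher(Configuration conf, Path stagingDir) throws Exception {
    this.client = TimelineClient.createTimelineClient();
    this.client.init(conf);
    this.client.start();
    this.fs = stagingDir.getFileSystem(conf);
    this.stagingDir = stagingDir;
  }

  public void publish(TimelineEntity entity, String json) throws Exception {
    // 1. Persist first, so the entity survives a timeline server outage.
    Path staged = new Path(stagingDir, entity.getEntityId() + ".json");
    try (FSDataOutputStream out = fs.create(staged, true)) {
      out.write(json.getBytes(StandardCharsets.UTF_8));
    }
    // 2. Hand over to the server; only drop the cached copy on success.
    client.putEntities(entity);
    fs.delete(staged, false);
    // On failure the staged file remains and can be replayed later.
  }
}
{code}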



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2512) Allow for origin pattern matching in cross origin filter

2014-09-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125420#comment-14125420
 ] 

Hudson commented on YARN-2512:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #674 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/674/])
YARN-2512. Allowed pattern matching for origins in CrossOriginFilter. 
Contributed by Jonathan Eagles. (zjshen: rev 
a092cdf32de4d752456286a9f4dda533d8a62bca)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestCrossOriginFilter.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp/CrossOriginFilter.java
* hadoop-yarn-project/CHANGES.txt


 Allow for origin pattern matching in cross origin filter
 

 Key: YARN-2512
 URL: https://issues.apache.org/jira/browse/YARN-2512
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Fix For: 2.6.0

 Attachments: YARN-2512-v1.patch


 This extends the feature set of allowed origins: a * in an entry indicates 
 that the allowed origin is a pattern and will be matched against multiple 
 sub-domains.
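
 For illustration only (this is not the CrossOriginFilter code from the attached 
 patch), a * entry can be translated into a regular expression so that a single 
 allowed-origin entry covers multiple sub-domains:
 {code}
import java.util.regex.Pattern;

public class OriginMatcher {
  /** Turn an allowed-origin entry such as "https://*.example.com" into a regex. */
  static Pattern toPattern(String allowedOrigin) {
    // Quote everything literally, then let "*" match any sub-domain prefix.
    String regex = Pattern.quote(allowedOrigin).replace("*", "\\E.*\\Q");
    return Pattern.compile(regex);
  }

  static boolean isAllowed(String origin, String allowedOrigin) {
    return allowedOrigin.contains("*")
        ? toPattern(allowedOrigin).matcher(origin).matches()
        : allowedOrigin.equals(origin);
  }

  public static void main(String[] args) {
    // Both sub-domains are accepted by the single wildcard entry.
    System.out.println(isAllowed("https://a.b.example.com", "https://*.example.com")); // true
    System.out.println(isAllowed("https://evil.com", "https://*.example.com"));        // false
  }
}
 {code}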



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2515) Update ConverterUtils#toContainerId to parse epoch

2014-09-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125418#comment-14125418
 ] 

Hudson commented on YARN-2515:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #674 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/674/])
YARN-2515. Updated ConverterUtils#toContainerId to parse epoch. Contributed by 
Tsuyoshi OZAWA (jianhe: rev 0974f434c47ffbf4b77a8478937fd99106c8ddbd)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ContainerId.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestConverterUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestContainerId.java


 Update ConverterUtils#toContainerId to parse epoch
 --

 Key: YARN-2515
 URL: https://issues.apache.org/jira/browse/YARN-2515
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Fix For: 2.6.0

 Attachments: YARN-2515.1.patch, YARN-2515.2.patch


 ContainerId#toString was updated in YARN-2182. We should also update 
 ConverterUtils#toContainerId to parse the epoch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2507) Document Cross Origin Filter Configuration for ATS

2014-09-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125419#comment-14125419
 ] 

Hudson commented on YARN-2507:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #674 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/674/])
YARN-2507. Documented CrossOriginFilter configurations for the timeline server. 
Contributed by Jonathan Eagles. (zjshen: rev 
56dc496a1031621d2b701801de4ec29179d75f2e)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/TimelineServer.apt.vm
* hadoop-yarn-project/CHANGES.txt


 Document Cross Origin Filter Configuration for ATS
 --

 Key: YARN-2507
 URL: https://issues.apache.org/jira/browse/YARN-2507
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: documentation, timelineserver
Affects Versions: 2.6.0
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Fix For: 2.6.0

 Attachments: YARN-2507-v1.patch


 CORS support was added for ATS as part of YARN-2277. This jira is to document 
 configuration for ATS CORS support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2515) Update ConverterUtils#toContainerId to parse epoch

2014-09-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125509#comment-14125509
 ] 

Hudson commented on YARN-2515:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1865 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1865/])
YARN-2515. Updated ConverterUtils#toContainerId to parse epoch. Contributed by 
Tsuyoshi OZAWA (jianhe: rev 0974f434c47ffbf4b77a8478937fd99106c8ddbd)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestConverterUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestContainerId.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ContainerId.java


 Update ConverterUtils#toContainerId to parse epoch
 --

 Key: YARN-2515
 URL: https://issues.apache.org/jira/browse/YARN-2515
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Fix For: 2.6.0

 Attachments: YARN-2515.1.patch, YARN-2515.2.patch


 ContainerId#toString was updated in YARN-2182. We should also update 
 ConverterUtils#toContainerId to parse the epoch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2507) Document Cross Origin Filter Configuration for ATS

2014-09-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125510#comment-14125510
 ] 

Hudson commented on YARN-2507:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1865 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1865/])
YARN-2507. Documented CrossOriginFilter configurations for the timeline server. 
Contributed by Jonathan Eagles. (zjshen: rev 
56dc496a1031621d2b701801de4ec29179d75f2e)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/TimelineServer.apt.vm
* hadoop-yarn-project/CHANGES.txt


 Document Cross Origin Filter Configuration for ATS
 --

 Key: YARN-2507
 URL: https://issues.apache.org/jira/browse/YARN-2507
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: documentation, timelineserver
Affects Versions: 2.6.0
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Fix For: 2.6.0

 Attachments: YARN-2507-v1.patch


 CORS support was added for ATS as part of YARN-2277. This jira is to document 
 configuration for ATS CORS support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2512) Allow for origin pattern matching in cross origin filter

2014-09-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125511#comment-14125511
 ] 

Hudson commented on YARN-2512:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1865 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1865/])
YARN-2512. Allowed pattern matching for origins in CrossOriginFilter. 
Contributed by Jonathan Eagles. (zjshen: rev 
a092cdf32de4d752456286a9f4dda533d8a62bca)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp/CrossOriginFilter.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestCrossOriginFilter.java


 Allow for origin pattern matching in cross origin filter
 

 Key: YARN-2512
 URL: https://issues.apache.org/jira/browse/YARN-2512
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Fix For: 2.6.0

 Attachments: YARN-2512-v1.patch


 This extends the feature set of allowed origins: a * in an entry indicates 
 that the allowed origin is a pattern and will be matched against multiple 
 sub-domains.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2517) Implement TimelineClientAsync

2014-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125664#comment-14125664
 ] 

Hadoop QA commented on YARN-2517:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667168/YARN-2517.1.patch
  against trunk revision 0974f43.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4842//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4842//console

This message is automatically generated.

 Implement TimelineClientAsync
 -

 Key: YARN-2517
 URL: https://issues.apache.org/jira/browse/YARN-2517
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2517.1.patch


 In some scenarios, we'd like to put timeline entities in another thread so as 
 not to block the current one.
 It's good to have a TimelineClientAsync like AMRMClientAsync and 
 NMClientAsync. It can buffer entities, put them in a separate thread, and 
 have callbacks to handle the responses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125673#comment-14125673
 ] 

Hadoop QA commented on YARN-2494:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667170/YARN-2494.patch
  against trunk revision 0974f43.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 6 
warning messages.
See 
https://builds.apache.org/job/PreCommit-YARN-Build/4843//artifact/trunk/patchprocess/diffJavadocWarnings.txt
 for details.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 5 new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common:

  org.apache.hadoop.yarn.label.TestFileSystemNodeLabelManager
  org.apache.hadoop.yarn.label.TestNodeLabelManager

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4843//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/4843//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/4843//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4843//console

This message is automatically generated.

 [YARN-796] Node label manager API and storage implementations
 -

 Key: YARN-2494
 URL: https://issues.apache.org/jira/browse/YARN-2494
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2494.patch


 This JIRA includes the APIs and storage implementations of the node label 
 manager. NodeLabelManager is an abstract class used to manage labels of nodes 
 in the cluster; it has APIs (see the rough sketch below) to query/modify
 - Nodes according to a given label
 - Labels according to a given hostname
 - Add/remove labels
 - Set labels of nodes in the cluster
 - Persist/recover changes of labels/labels-on-nodes to/from storage
 And it has two implementations to store modifications
 - Memory-based storage: it will not persist changes, so all labels will be 
 lost when the RM restarts
 - FileSystem-based storage: it will persist/recover to/from a FileSystem (like 
 HDFS), and all labels and labels-on-nodes will be recovered upon RM restart
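
 As a rough sketch of the API surface described above (illustrative only; the 
 actual classes are in the attached patch and may differ):
 {code}
import java.util.Map;
import java.util.Set;

/** Illustrative sketch of the described API surface; not the patch's actual class. */
public abstract class NodeLabelManagerSketch {
  // Query
  public abstract Set<String> getNodesForLabel(String label);
  public abstract Set<String> getLabelsForNode(String hostname);

  // Modify
  public abstract void addLabels(Set<String> labels) throws Exception;
  public abstract void removeLabels(Set<String> labels) throws Exception;
  public abstract void setLabelsOnNodes(Map<String, Set<String>> hostToLabels) throws Exception;

  // Persistence: a memory-backed impl makes persist() a no-op, while a
  // FileSystem-backed impl writes changes out and replays them on RM restart.
  protected abstract void persist() throws Exception;
  public abstract void recover() throws Exception;
}
 {code}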



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-09-08 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-913:

Attachment: YARN-913-002.patch

Patch -002

# adds persistence policy
# {{RegistryOperationsService}} implements callbacks for various RM events, and 
implements the setup/purge behaviour underneath.
# adds a new class in the resource manager, {{RegistryService}}. This bridges 
from YARN to the registry by subscribing to application and container events, 
translating and forwarding them to the {{RegistryOperationsService}}, where they 
may trigger setup/purge operations.
# Hooks this up to the RM
# Extends the DistributedShell by enabling it to register service records with 
the different persistence options.
# Adds a test to verify the distributed shell does register the entries, and 
that the purgeable ones are purged after the application completes.

This means the {{TestDistributedShell}} test is now capable of verifying that 
YARN applications can register themselves, that they can then be discovered, 
and that the RM cleans up after they terminate.
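
Purely to illustrate the register/purge flow described above (this is not the 
API in the patch; the ZooKeeper paths and the JSON record are invented), a 
service record could be published under an application-scoped path and removed 
when the application completes:

{code}
import java.nio.charset.StandardCharsets;

import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;

public class RegistrySketch {
  private final CuratorFramework zk;

  public RegistrySketch(CuratorFramework zk) {
    this.zk = zk;
  }

  /** Publish a service record scoped to an application (hypothetical path layout). */
  public void register(String appId, String serviceName, String jsonRecord) throws Exception {
    String path = "/registry/apps/" + appId + "/" + serviceName;
    zk.create()
      .creatingParentsIfNeeded()
      .withMode(CreateMode.PERSISTENT)   // "application" persistence: lives until purged
      .forPath(path, jsonRecord.getBytes(StandardCharsets.UTF_8));
  }

  /** Called from an application-completed event to purge all records for the app. */
  public void purgeApplication(String appId) throws Exception {
    zk.delete()
      .deletingChildrenIfNeeded()
      .forPath("/registry/apps/" + appId);
  }
}
{code}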

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, yarnregistry.pdf, yarnregistry.tla


 In a YARN cluster you can't predict where services will come up, or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to, 
 and not any others in the cluster.
 Some kind of service registry, in the RM or in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125706#comment-14125706
 ] 

Hadoop QA commented on YARN-913:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667181/YARN-913-002.patch
  against trunk revision 0974f43.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 25 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4844//console

This message is automatically generated.

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, yarnregistry.pdf, yarnregistry.tla


 In a YARN cluster you can't predict where services will come up, or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to, 
 and not any others in the cluster.
 Some kind of service registry, in the RM or in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2517) Implement TimelineClientAsync

2014-09-08 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125714#comment-14125714
 ] 

Vinod Kumar Vavilapalli commented on YARN-2517:
---

I am not entirely sure we need a parallel client for this. The other clients 
needed async clients because
 - they had loads of functionality that made sense in the blocking and 
non-blocking modes
 - the client code really needed call-back hooks to act on the results.

Timeline Client's only responsibility is to post events. There are only two 
use-cases: clients need either a synchronous write-through, or an asynchronous 
write whose outcome they don't care about. I think we should simply have a mode 
in the existing client to post events asynchronously without any further need 
for call-back handlers.

What do others think?

 Implement TimelineClientAsync
 -

 Key: YARN-2517
 URL: https://issues.apache.org/jira/browse/YARN-2517
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2517.1.patch


 In some scenarios, we'd like to put timeline entities in another thread so as 
 not to block the current one.
 It's good to have a TimelineClientAsync like AMRMClientAsync and 
 NMClientAsync. It can buffer entities, put them in a separate thread, and 
 have callbacks to handle the responses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2515) Update ConverterUtils#toContainerId to parse epoch

2014-09-08 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125744#comment-14125744
 ] 

Tsuyoshi OZAWA commented on YARN-2515:
--

Thanks for your review, Jian!

 Update ConverterUtils#toContainerId to parse epoch
 --

 Key: YARN-2515
 URL: https://issues.apache.org/jira/browse/YARN-2515
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Fix For: 2.6.0

 Attachments: YARN-2515.1.patch, YARN-2515.2.patch


 ContainerId#toString was updated in YARN-2182. We should also update 
 ConverterUtils#toContainerId to parse the epoch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2517) Implement TimelineClientAsync

2014-09-08 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125760#comment-14125760
 ] 

Tsuyoshi OZAWA commented on YARN-2517:
--

Thanks for your comment, Vinod.

{quote}
an asynchronous write, the end of which they don't care about. I think we 
should simply have a mode in the existing client to post events asynchronously 
without any further need for call-back handlers.
{quote}

Makes sense. We can assure at-most-once semantics without any callbacks. How 
about adding a {{flush()}} API to TimelineClient for the asynchronous mode? It 
helps users know whether the contents of the current buffer have been written to 
the Timeline Server or not.
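
To make that concrete, here is a rough sketch, with invented names 
({{putEntitiesAsync}} and {{flush()}}), of how such a callback-free asynchronous 
mode could sit on top of the existing client; it is not the API under review:

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;

public class AsyncTimelinePublisher {
  private final TimelineClient client;
  // A single worker keeps puts ordered and off the caller's thread.
  private final ExecutorService worker = Executors.newSingleThreadExecutor();

  public AsyncTimelinePublisher(TimelineClient client) {
    this.client = client;
  }

  /** Fire-and-forget: the caller does not wait for, or learn about, the result. */
  public void putEntitiesAsync(final TimelineEntity... entities) {
    worker.submit(new Runnable() {
      @Override
      public void run() {
        try {
          client.putEntities(entities);
        } catch (Exception e) {
          // Swallowed by design in this callback-free mode.
        }
      }
    });
  }

  /** Block until everything queued so far has been attempted. */
  public void flush() throws Exception {
    // Waiting on a marker task guarantees all earlier puts have run.
    worker.submit(new Runnable() {
      @Override
      public void run() { }
    }).get();
  }
}
{code}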

 Implement TimelineClientAsync
 -

 Key: YARN-2517
 URL: https://issues.apache.org/jira/browse/YARN-2517
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2517.1.patch


 In some scenarios, we'd like to put timeline entities in another thread so as 
 not to block the current one.
 It's good to have a TimelineClientAsync like AMRMClientAsync and 
 NMClientAsync. It can buffer entities, put them in a separate thread, and 
 have callbacks to handle the responses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2517) Implement TimelineClientAsync

2014-09-08 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125777#comment-14125777
 ] 

Zhijie Shen commented on YARN-2517:
---

[~vinodkv], thanks for your feedback. The reason an async client (or an async 
HTTP REST call) is going to be good is to unblock the current thread when it is 
doing important management logic. For example, in YARN-2033, we have a 
bunch of logic to dispatch the entity-putting action onto a separate thread, so 
that the application life-cycle management can move on. Given an async client, 
it could be far simpler. I think from the user's point of view it may be a 
useful feature as well.

I'm fine with either two classes, one sync and one async, or 
one class for both modes, though the former option complies with the previous 
client design. I think the callback is necessary, at least onError. 
TimelinePutResponse will give the user a summary of why an uploaded entity was 
not accepted by the timeline server. Based on the response, the user can 
determine whether the app should ignore the problem and move on, or stop 
immediately.
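
For comparison, a minimal sketch of the callback-style design argued for above; 
the handler interface and its method names are invented for illustration, though 
{{TimelinePutResponse#getErrors()}} is the existing way the server reports 
rejected entities:

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timeline.TimelinePutResponse;
import org.apache.hadoop.yarn.client.api.TimelineClient;

public class CallbackTimelinePublisher {
  /** Hypothetical callback, loosely modeled on AMRMClientAsync's handler style. */
  public interface PutHandler {
    void onResponse(TimelinePutResponse response); // carries per-entity errors
    void onError(Throwable t);                     // transport-level failure
  }

  private final TimelineClient client;
  private final ExecutorService worker = Executors.newSingleThreadExecutor();

  public CallbackTimelinePublisher(TimelineClient client) {
    this.client = client;
  }

  public void putEntities(final PutHandler handler, final TimelineEntity... entities) {
    worker.submit(new Runnable() {
      @Override
      public void run() {
        try {
          TimelinePutResponse response = client.putEntities(entities);
          // The app can inspect response.getErrors() and decide to move on or stop.
          handler.onResponse(response);
        } catch (Exception e) {
          handler.onError(e);
        }
      }
    });
  }
}
{code}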

 Implement TimelineClientAsync
 -

 Key: YARN-2517
 URL: https://issues.apache.org/jira/browse/YARN-2517
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2517.1.patch


 In some scenarios, we'd like to put timeline entities in another thread so as 
 not to block the current one.
 It's good to have a TimelineClientAsync like AMRMClientAsync and 
 NMClientAsync. It can buffer entities, put them in a separate thread, and 
 have callbacks to handle the responses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2518) Support in-process container executor

2014-09-08 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125797#comment-14125797
 ] 

Allen Wittenauer commented on YARN-2518:


This is pretty much incompatible with security. So it should probably fail the 
nodemanager process under that condition.

 Support in-process container executor
 -

 Key: YARN-2518
 URL: https://issues.apache.org/jira/browse/YARN-2518
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Affects Versions: 2.5.0
 Environment: Linux, Windows
Reporter: BoYang
Priority: Minor
  Labels: container, dispatch, in-process, job, node

 Node Manager always creates a new process for a new application. We have hit a 
 scenario where we want the node manager to execute the application inside its 
 own process, so we get fast response times. It would be nice if the Node Manager 
 or YARN could provide native support for that.
 In general, the scenario is that we have a long-running process which can 
 accept requests and process them inside its own process. Since YARN 
 is good at scheduling jobs, we want to use YARN to dispatch jobs (e.g. 
 requests in JSON) to the long-running process. In that case, we do not want 
 the YARN container to spin up a new process for each request. Instead, we want 
 the YARN container to send the request to the long-running process for further 
 processing.
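
 To make the request-relay idea concrete, here is a rough sketch (the port, the 
 protocol and the class below are all hypothetical) of a container launch that 
 forwards a JSON job description to an already-running worker over a local 
 socket instead of forking a new process:
 {code}
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class RequestRelay {
  // Hypothetical: the long-running worker listens on a well-known local port.
  private static final String WORKER_HOST = "localhost";
  private static final int WORKER_PORT = 18080;

  /** Instead of spawning a process per container, hand the JSON request over. */
  public static void dispatch(String jsonRequest) throws Exception {
    try (Socket socket = new Socket(WORKER_HOST, WORKER_PORT);
         OutputStream out = socket.getOutputStream()) {
      out.write(jsonRequest.getBytes(StandardCharsets.UTF_8));
      out.flush();
    }
  }
}
 {code}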



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2461) Fix PROCFS_USE_SMAPS_BASED_RSS_ENABLED property in YarnConfiguration

2014-09-08 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125809#comment-14125809
 ] 

Ray Chiang commented on YARN-2461:
--

Same observation as before.  No need for a new unit test for a fixed property 
value.

 Fix PROCFS_USE_SMAPS_BASED_RSS_ENABLED property in YarnConfiguration
 

 Key: YARN-2461
 URL: https://issues.apache.org/jira/browse/YARN-2461
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Ray Chiang
Assignee: Ray Chiang
Priority: Minor
  Labels: newbie
 Attachments: YARN-2461-01.patch


 The property PROCFS_USE_SMAPS_BASED_RSS_ENABLED has an extra period.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2256) Too many nodemanager audit logs are generated

2014-09-08 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125808#comment-14125808
 ] 

Allen Wittenauer commented on YARN-2256:


Someone correct me if I'm wrong, but I'm fairly certain that the intent of this 
information is to be the equivalent of the HDFS audit log.  In other words, 
setting these to debug completely defeats the purpose.  Instead, I suspect the 
real culprit is that the log4j settings are wrong for the node manager process.
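
For example, the separation could look roughly like this in log4j.properties 
(the appender name and file path are illustrative, not taken from any shipped 
configuration):

{code}
# Route NM audit events to a dedicated rolling file, the way the HDFS audit log
# is usually handled, instead of lowering the audit level.
log4j.logger.org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger=INFO,NMAUDIT
log4j.additivity.org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger=false
log4j.appender.NMAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.NMAUDIT.File=${yarn.log.dir}/nm-audit.log
log4j.appender.NMAUDIT.MaxFileSize=256MB
log4j.appender.NMAUDIT.MaxBackupIndex=20
log4j.appender.NMAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.NMAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
{code}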

 Too many nodemanager audit logs are generated
 -

 Key: YARN-2256
 URL: https://issues.apache.org/jira/browse/YARN-2256
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager
Affects Versions: 2.4.0
Reporter: Varun Saxena
 Attachments: YARN-2256.patch


 The following audit logs are generated too many times (due to the possibly 
 large number of containers):
 1. In the NM - audit logs corresponding to the starting, stopping and finishing 
 of a container
 2. In the RM - audit logs corresponding to the AM allocating a container and 
 the AM releasing a container
 We can have different log levels even for NM and RM audit logs and move these 
 successful-container-related logs to DEBUG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2256) Too many nodemanager audit logs are generated

2014-09-08 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-2256:
---
Assignee: Varun Saxena

 Too many nodemanager audit logs are generated
 -

 Key: YARN-2256
 URL: https://issues.apache.org/jira/browse/YARN-2256
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager
Affects Versions: 2.4.0
Reporter: Varun Saxena
Assignee: Varun Saxena
 Attachments: YARN-2256.patch


 The following audit logs are generated too many times (due to the possibly 
 large number of containers):
 1. In the NM - audit logs corresponding to the starting, stopping and finishing 
 of a container
 2. In the RM - audit logs corresponding to the AM allocating a container and 
 the AM releasing a container
 We can have different log levels even for NM and RM audit logs and move these 
 successful-container-related logs to DEBUG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2348) ResourceManager web UI should display server-side time instead of UTC time

2014-09-08 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-2348:
---
Assignee: Leitao Guo

 ResourceManager web UI should display server-side time instead of UTC time
 --

 Key: YARN-2348
 URL: https://issues.apache.org/jira/browse/YARN-2348
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.1
Reporter: Leitao Guo
Assignee: Leitao Guo
 Attachments: 3.before-patch.JPG, 4.after-patch.JPG, YARN-2348.2.patch


 The ResourceManager web UI, including the application list and scheduler, 
 displays UTC time by default; this will confuse users who do not use UTC time. 
 The web UI should display server-side time by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2422) yarn.scheduler.maximum-allocation-mb should not be hard-coded in yarn-default.xml

2014-09-08 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-2422:
---
Assignee: Gopal V

 yarn.scheduler.maximum-allocation-mb should not be hard-coded in 
 yarn-default.xml
 -

 Key: YARN-2422
 URL: https://issues.apache.org/jira/browse/YARN-2422
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.6.0
Reporter: Gopal V
Assignee: Gopal V
Priority: Minor
 Attachments: YARN-2422.1.patch


 A cluster with a 40Gb NM refuses to run containers larger than 8Gb.
 It was finally tracked down to yarn-default.xml hard-coding it to 8Gb.
 In the absence of a better override, it should default to 
 ${yarn.nodemanager.resource.memory-mb} instead of a hard-coded 8Gb.
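
 In the meantime the ceiling can of course be raised explicitly in 
 yarn-site.xml; the value below is just an example sized for the 40Gb node 
 described here:
 {code}
<!-- yarn-site.xml (example value only) -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <!-- raise the per-container ceiling to match the largest NM, e.g. a 40Gb node -->
  <value>40960</value>
</property>
 {code}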



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2097) Documentation: health check return status

2014-09-08 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-2097:
---
Assignee: Rekha Joshi

 Documentation: health check return status
 -

 Key: YARN-2097
 URL: https://issues.apache.org/jira/browse/YARN-2097
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Allen Wittenauer
Assignee: Rekha Joshi
  Labels: newbie
 Attachments: YARN-2097.1.patch


 We need to document that the output of the health check script is ignored on 
 non-0 exit status.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2518) Support in-process container executor

2014-09-08 Thread BoYang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125834#comment-14125834
 ] 

BoYang commented on YARN-2518:
--

In my rough testing, it did not fail the node manager process. In my Container 
Executor implementation (launchContainer method), I register a new application 
master, send a message to another long-running process, and unregister the 
application master. I can see the application finish successfully.

Of course, that was only very rough initial testing. We could fine-tune the code 
to make it work better. But technically it seems doable now. Thus I am curious 
whether the YARN community could take this feature and provide official support.

 Support in-process container executor
 -

 Key: YARN-2518
 URL: https://issues.apache.org/jira/browse/YARN-2518
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Affects Versions: 2.5.0
 Environment: Linux, Windows
Reporter: BoYang
Priority: Minor
  Labels: container, dispatch, in-process, job, node

 Node Manager always creates a new process for a new application. We have hit a 
 scenario where we want the node manager to execute the application inside its 
 own process, so we get fast response times. It would be nice if the Node Manager 
 or YARN could provide native support for that.
 In general, the scenario is that we have a long-running process which can 
 accept requests and process them inside its own process. Since YARN 
 is good at scheduling jobs, we want to use YARN to dispatch jobs (e.g. 
 requests in JSON) to the long-running process. In that case, we do not want 
 the YARN container to spin up a new process for each request. Instead, we want 
 the YARN container to send the request to the long-running process for further 
 processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2518) Support in-process container executor

2014-09-08 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125838#comment-14125838
 ] 

Allen Wittenauer commented on YARN-2518:


Sorry, I wasn't clear: if this feature goes in, it must fail the nodemanager 
process when security is enabled, because running tasks as the yarn user is 
extremely insecure.

 Support in-process container executor
 -

 Key: YARN-2518
 URL: https://issues.apache.org/jira/browse/YARN-2518
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Affects Versions: 2.5.0
 Environment: Linux, Windows
Reporter: BoYang
Priority: Minor
  Labels: container, dispatch, in-process, job, node

 Node Manager always creates a new process for a new application. We have hit a 
 scenario where we want the node manager to execute the application inside its 
 own process, so we get fast response times. It would be nice if the Node Manager 
 or YARN could provide native support for that.
 In general, the scenario is that we have a long-running process which can 
 accept requests and process them inside its own process. Since YARN 
 is good at scheduling jobs, we want to use YARN to dispatch jobs (e.g. 
 requests in JSON) to the long-running process. In that case, we do not want 
 the YARN container to spin up a new process for each request. Instead, we want 
 the YARN container to send the request to the long-running process for further 
 processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info

2014-09-08 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125846#comment-14125846
 ] 

Gera Shegalov commented on YARN-2377:
-

[~kasha], do you agree with the points above?

 Localization exception stack traces are not passed as diagnostic info
 -

 Key: YARN-2377
 URL: https://issues.apache.org/jira/browse/YARN-2377
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-2377.v01.patch


 In the Localizer log one can only see this kind of message
 {code}
 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { 
 hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar,
  1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0
 {code}
 And then only the {{java.net.UnknownHostException: ha-nn-uri-0}} message is 
 propagated as diagnostics.
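
 As a minimal illustration of the requested improvement, the full stack trace 
 can be turned into a diagnostics string with the existing Hadoop helper 
 {{StringUtils#stringifyException}} (where exactly it gets wired in is up to the 
 patch):
 {code}
import org.apache.hadoop.util.StringUtils;

public class DiagnosticsSketch {
  /** Turn a localization failure into a diagnostics string that keeps the stack trace. */
  public static String toDiagnostics(Throwable t) {
    // stringifyException renders the full stack trace, so the user sees the
    // UnknownHostException plus every frame, not just the message.
    return StringUtils.stringifyException(t);
  }
}
 {code}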



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2097) Documentation: health check return status

2014-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125852#comment-14125852
 ] 

Hadoop QA commented on YARN-2097:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12646615/YARN-2097.1.patch
  against trunk revision 302d9a0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4845//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4845//console

This message is automatically generated.

 Documentation: health check return status
 -

 Key: YARN-2097
 URL: https://issues.apache.org/jira/browse/YARN-2097
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Allen Wittenauer
Assignee: Rekha Joshi
  Labels: newbie
 Attachments: YARN-2097.1.patch


 We need to document that the output of the health check script is ignored on 
 non-0 exit status.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2512) Allow for origin pattern matching in cross origin filter

2014-09-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125886#comment-14125886
 ] 

Hudson commented on YARN-2512:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1890 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1890/])
YARN-2512. Allowed pattern matching for origins in CrossOriginFilter. 
Contributed by Jonathan Eagles. (zjshen: rev 
a092cdf32de4d752456286a9f4dda533d8a62bca)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestCrossOriginFilter.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp/CrossOriginFilter.java


 Allow for origin pattern matching in cross origin filter
 

 Key: YARN-2512
 URL: https://issues.apache.org/jira/browse/YARN-2512
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Fix For: 2.6.0

 Attachments: YARN-2512-v1.patch


 This extends the feature set of allowed origins: a * in an entry indicates 
 that the allowed origin is a pattern and will be matched against multiple 
 sub-domains.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2507) Document Cross Origin Filter Configuration for ATS

2014-09-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125885#comment-14125885
 ] 

Hudson commented on YARN-2507:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1890 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1890/])
YARN-2507. Documented CrossOriginFilter configurations for the timeline server. 
Contributed by Jonathan Eagles. (zjshen: rev 
56dc496a1031621d2b701801de4ec29179d75f2e)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/TimelineServer.apt.vm
* hadoop-yarn-project/CHANGES.txt


 Document Cross Origin Filter Configuration for ATS
 --

 Key: YARN-2507
 URL: https://issues.apache.org/jira/browse/YARN-2507
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: documentation, timelineserver
Affects Versions: 2.6.0
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Fix For: 2.6.0

 Attachments: YARN-2507-v1.patch


 CORS support was added for ATS as part of YARN-2277. This jira is to document 
 configuration for ATS CORS support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2515) Update ConverterUtils#toContainerId to parse epoch

2014-09-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125884#comment-14125884
 ] 

Hudson commented on YARN-2515:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1890 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1890/])
YARN-2515. Updated ConverterUtils#toContainerId to parse epoch. Contributed by 
Tsuyoshi OZAWA (jianhe: rev 0974f434c47ffbf4b77a8478937fd99106c8ddbd)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ContainerId.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestContainerId.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestConverterUtils.java
* hadoop-yarn-project/CHANGES.txt


 Update ConverterUtils#toContainerId to parse epoch
 --

 Key: YARN-2515
 URL: https://issues.apache.org/jira/browse/YARN-2515
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Fix For: 2.6.0

 Attachments: YARN-2515.1.patch, YARN-2515.2.patch


 ContainerId#toString was updated in YARN-2182. We should also update 
 ConverterUtils#toContainerId to parse the epoch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data

2014-09-08 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125908#comment-14125908
 ] 

Sangjin Lee commented on YARN-1530:
---

{quote}
The bottleneck is still there. Essentially I don’t see any difference between 
publishing entities via HTTP REST interface and via HDFS in terms of 
scalability.
{quote}

IMO, option (1) necessarily entails less frequent imports into the store by 
ATS. Obviously, if ATS still imports the HDFS files at the same speed as the 
timeline entries are generated, there would be no difference in scalability. 
This option would make sense only if the imports are less frequent. It also 
would mean that as a trade-off reads would be more stale. I believe Robert's 
document points out all those points.

Regarding option (2), I think your point is valid that it would be a transition 
from a thin client to a fat client. And along with that would be some 
complications as you point out.

However, I'm not too sure if it would make changing the data store much more 
complicated than other scenarios. I think the main problem of switching the 
data store is when not all writers are updated to point to the new data store. 
If writes are in progress, and the clients are being upgraded, there would be 
some inconsistencies between clients that were already upgraded and started 
writing to the new store and those that are not upgraded yet and still writing 
to the old store. If you have a single writer (such as the current ATS design), 
then it would be simpler. But then again, if we consider a scenario such as a 
cluster of ATS instances, the same problem exists there. I think that specific 
problem could be solved by holding the writes in some sort of a backup area 
(e.g. hdfs) before the switch starts, and recovering/re-enabling once all the 
writers are upgraded.

The idea of a cluster of ATS instances (multiple write/read instances) sounds 
interesting. It might be able to address the scalability/reliability problem at 
hand. We'd need to think through and poke holes to see if the idea holds up 
well, however. It would need to address how load balancing would be done and 
whether it would be left up to the user, for example.

 [Umbrella] Store, manage and serve per-framework application-timeline data
 --

 Key: YARN-1530
 URL: https://issues.apache.org/jira/browse/YARN-1530
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
 Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, 
 ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, 
 application timeline design-20140116.pdf, application timeline 
 design-20140130.pdf, application timeline design-20140210.pdf


 This is a sibling JIRA for YARN-321.
 Today, each application/framework has to store and serve per-framework 
 data all by itself, as YARN doesn't have a common solution. This JIRA attempts 
 to solve the storage, management and serving of per-framework data from 
 various applications, both running and finished. The aim is to change YARN to 
 collect and store data in a generic manner, with plugin points for frameworks 
 to do their own thing w.r.t. interpretation and serving.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2518) Support in-process container executor

2014-09-08 Thread BoYang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125907#comment-14125907
 ] 

BoYang commented on YARN-2518:
--

Yeah, there might be some issues with this, which need to be figured out. 
Thanks, Allen, for bringing it up. I have only come to YARN recently and cannot 
clearly identify all the potential issues now.

My point is that this in-process container executor seems to be a generic need 
from different people. I have seen several discussions about this in my 
search. Some use a dummy process (for example, Impala?) as a proxy to relay the 
task to the long-running process for further processing.

So if the YARN community can recognize the need for this common scenario, bring 
it up for further discussion, and explore the possibilities of supporting it 
natively, that will be really appreciated. It will probably benefit a lot 
of other people and projects as well, and make YARN an even more generic 
framework that can be adopted more broadly.



 Support in-process container executor
 -

 Key: YARN-2518
 URL: https://issues.apache.org/jira/browse/YARN-2518
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Affects Versions: 2.5.0
 Environment: Linux, Windows
Reporter: BoYang
Priority: Minor
  Labels: container, dispatch, in-process, job, node

 Node Manager always creates a new process for a new application. We have hit a 
 scenario where we want the node manager to execute the application inside its 
 own process, so we get fast response times. It would be nice if the Node Manager 
 or YARN could provide native support for that.
 In general, the scenario is that we have a long-running process which can 
 accept requests and process them inside its own process. Since YARN 
 is good at scheduling jobs, we want to use YARN to dispatch jobs (e.g. 
 requests in JSON) to the long-running process. In that case, we do not want 
 the YARN container to spin up a new process for each request. Instead, we want 
 the YARN container to send the request to the long-running process for further 
 processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2154) FairScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2014-09-08 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125950#comment-14125950
 ] 

Karthik Kambatla commented on YARN-2154:


Just discussed this with [~ashwinshankar77] offline. He rightly pointed out that 
the sort order should take usage into account. I'll post what the order should 
be as soon as I get to consult my notes.

 FairScheduler: Improve preemption to preempt only those containers that would 
 satisfy the incoming request
 --

 Key: YARN-2154
 URL: https://issues.apache.org/jira/browse/YARN-2154
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical

 Today, FairScheduler uses a spray-gun approach to preemption. Instead, it 
 should only preempt resources that would satisfy the incoming request. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled

2014-09-08 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2459:
--
Attachment: YARN-2459.4.patch

 RM crashes if App gets rejected for any reason and HA is enabled
 

 Key: YARN-2459
 URL: https://issues.apache.org/jira/browse/YARN-2459
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.1
Reporter: Mayank Bansal
Assignee: Mayank Bansal
 Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, 
 YARN-2459.4.patch


 If RM HA is enabled and ZooKeeper is used for the RM state store, and an app 
 gets rejected for any reason and goes directly from NEW to FAILED, then the 
 final transition adds it to the RMApps and completed-apps memory 
 structures, but it doesn't make it to the state store.
 Now, when the RMApps default limit is reached, the RM starts deleting apps from 
 memory and the store. In that case it tries to delete this app from the store 
 and fails, which causes the RM to crash.
 Thanks,
 Mayank



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled

2014-09-08 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125959#comment-14125959
 ] 

Jian He commented on YARN-2459:
---

bq. Add one in TestRMRestart to get an app rejected and make sure that the 
final-status gets recorded
Added.
bq. Another one in RMStateStoreTestBase to ensure it is okay to have an 
updateApp call without a storeApp call like in this case.
Turns out RMStateStoreTestBase already has this test.
{code}
// test updating the state of an app/attempt whose initial state was not
// saved.
{code}

 RM crashes if App gets rejected for any reason and HA is enabled
 

 Key: YARN-2459
 URL: https://issues.apache.org/jira/browse/YARN-2459
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.1
Reporter: Mayank Bansal
Assignee: Mayank Bansal
 Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, 
 YARN-2459.4.patch


 If RM HA is enabled and ZooKeeper is used for the RM state store, and an app 
 gets rejected for any reason and goes directly from NEW to FAILED, then the 
 final transition adds it to the RMApps and completed-apps memory 
 structures, but it doesn't make it to the state store.
 Now, when the RMApps default limit is reached, the RM starts deleting apps from 
 memory and the store. In that case it tries to delete this app from the store 
 and fails, which causes the RM to crash.
 Thanks,
 Mayank



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data

2014-09-08 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125990#comment-14125990
 ] 

Zhijie Shen commented on YARN-1530:
---

[~sjlee0], thanks for your feedback. Here're some additional thoughts and 
clarifications upon your comments.

bq. This option would make sense only if the imports are less frequent.

To be more specific, I mean that sending the same number of entities via HTTP REST 
or via HDFS should perform similarly (as long as the payload isn't too big; if it 
is, the HTTP REST request has to be chunked into several consecutive requests of 
reasonable size). HTTP REST may even be better because it involves less 
secondary-storage I/O (the network should be faster than disk). HTTP REST doesn't 
prevent the user from batching entities and putting them in a single call, and the 
current API supports that. It's up to the user to put each entity immediately for 
realtime/near-realtime queries, or to batch entities if some delay can be tolerated.
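As an illustration of the batching point, here is a minimal sketch using the existing TimelineClient API (the configuration and the entity ids/types below are placeholders):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;

// Batch several entities into one putEntities() call instead of issuing one
// HTTP request per entity. Entity ids/types below are placeholders.
TimelineClient client = TimelineClient.createTimelineClient();
client.init(new Configuration());
client.start();
try {
  TimelineEntity e1 = new TimelineEntity();
  e1.setEntityType("MY_APP_EVENT");
  e1.setEntityId("event_1");
  TimelineEntity e2 = new TimelineEntity();
  e2.setEntityType("MY_APP_EVENT");
  e2.setEntityId("event_2");
  client.putEntities(e1, e2);   // one REST call publishes both entities
} finally {
  client.stop();
}
{code}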

However, I agree that HDFS or some other single-node storage technique is an 
interesting option for avoiding the loss of entities that haven't been published 
to the timeline server yet, in particular when we batch them.

bq. Regarding option (2), I think your point is valid that it would be a 
transition from a thin client to a fat client.
bq. However, I'm not too sure if it would make changing the data store much 
more complicated than other scenarios.

I'm also not very sure about the necessary changes. As I mentioned before, the 
timeline server doesn't simply put the entities into the data store. One immediate 
problem I can think of is authorization: I'm not sure it's logically correct to 
check the user's access in a client running on the user's side. If we move 
authorization to the data store, HBase supports access control, but LevelDB seems 
not to. And I'm not sure HBase access control is enough for the timeline server's 
specific logic. I still need to think more about it.

As the client grows fatter, it becomes difficult to maintain different versions of 
clients. For example, if we make an incompatible optimization to the storage 
schema, only the new client can write to it, while the old client will no longer 
work. Moreover, since most of the write logic would run in user land, which is not 
predictable, it is more likely to hit unexpected failures than on a well-set-up 
server. In general, I prefer to keep the client simple, so that future client 
distribution and maintenance require less effort.

bq. But then again, if we consider a scenario such as a cluster of ATS 
instances, the same problem exists there.

Right, the same problem will exist on the server side, but the web front end 
isolates it from the users. Compared to clients embedded in applications, the ATS 
instances are a relatively small, controllable set that we can pause and upgrade 
through a proper process. What do you think?

 [Umbrella] Store, manage and serve per-framework application-timeline data
 --

 Key: YARN-1530
 URL: https://issues.apache.org/jira/browse/YARN-1530
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
 Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, 
 ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, 
 application timeline design-20140116.pdf, application timeline 
 design-20140130.pdf, application timeline design-20140210.pdf


 This is a sibling JIRA for YARN-321.
 Today, each application/framework has to do store, and serve per-framework 
 data all by itself as YARN doesn't have a common solution. This JIRA attempts 
 to solve the storage, management and serving of per-framework data from 
 various applications, both running and finished. The aim is to change YARN to 
 collect and store data in a generic manner with plugin points for frameworks 
 to do their own thing w.r.t interpretation and serving.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager

2014-09-08 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125994#comment-14125994
 ] 

Vinod Kumar Vavilapalli commented on YARN-2080:
---

Some comments on the patch:
 - Configuration
-- admission.enable -> Rename to reservations.enable?
-- RM_SCHEDULER_ENABLE_RESERVATIONS -> RM_RESERVATIONS_ENABLE, 
DEFAULT_RM_SCHEDULER_ENABLE_RESERVATIONS -> DEFAULT_RM_RESERVATIONS_ENABLE
-- reservation.planfollower.time-step -> 
reservation-system.plan-follower.time-step
-- RM_PLANFOLLOWER_TIME_STEP, DEFAULT_RM_PLANFOLLOWER_TIME_STEP -> 
RM_RESERVATION_SYSTEM_PLAN_FOLLOWER_TIME_STEP, 
DEFAULT_RM_RESERVATION_SYSTEM_PLAN_FOLLOWER_TIME_STEP
- A meta question about configuration: it seems like if I pick a scheduler 
and enable reservations, the system class and the plan-follower should be picked 
up automatically instead of being standalone configs. Can we do that? 
Otherwise, rename the following:
-- reservation.class -> reservation-system.class? 
-- RM_RESERVATION, DEFAULT_RM_RESERVATION -> RM_RESERVATION_SYSTEM_CLASS, 
DEFAULT_RM_RESERVATION_SYSTEM_CLASS 
-- reservation.plan.follower -> reservation-system.plan-follower
-- RM_RESERVATION_PLAN_FOLLOWER, DEFAULT_RM_RESERVATION_PLAN_FOLLOWER -> 
RM_RESERVATION_SYSTEM_PLAN_FOLLOWER, DEFAULT_RM_RESERVATION_SYSTEM_PLAN_FOLLOWER
 - YarnClient.submitReservation(): We don't return a queue-name anymore after 
the latest YARN-1708? There are javadoc refs to the queue-name being returned.
 - ClientRMService
-- If reservations are not enabled, we get a host of "Reservation is not 
enabled. Please enable & try again" messages every time, which is not desirable. 
See checkReservationSystem(). This log and a bunch of similar logs in 
ReservationInputValidator may either be (1) deleted or (2) moved to the audit 
log (RMAuditLogger), where they actually belong - we don't need to double-log.
-- checkReservationACLs: Today, anyone who can submit applications can also 
submit reservations. We may want to separate them; if you agree, I'll file a 
ticket for the future separation of these ACLs.
 - AbstractReservationSystem
-- getPlanFollower() -> createPlanFollower()
-- create and init of the plan-follower should be in serviceInit()?
-- getNewReservationId(): Use ReservationId.newInstance()
 - ReservationInputValidator: Deleting a request shouldn't need 
validateReservationUpdateRequest -> validateReservationDefinition. We only need 
the ID validation.
 - CapacitySchedulerConfiguration: I don't yet understand the semantics of the 
configs (average-capacity, reservable.queue, reservation-window, 
reservation-enforcement-window, instantaneous-max-capacity, ...), as they are 
not used in this patch. Can we drop them (and their setters/getters) here and 
move them to the JIRA that actually uses them?

Tests
 - TestYarnClient: You can use the newInstance methods and avoid using the pb 
implementations and their setters directly (e.g. {{new 
ReservationDeleteRequestPBImpl()}}); see the sketch below.
 - TestClientRMService:
-- ReservationRequest.setLeaseDuration() was renamed to simply 
setDuration() in YARN-1708. There seem to be other such occurrences in the 
patch.
-- Similarly to TestYarnClient, use the record.newInstance methods instead of 
directly invoking the PBImpls.
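A minimal sketch of the suggested pattern, assuming the reservation records expose the usual {{newInstance}} factories introduced with the YARN-1708 API (exact signatures may differ):
{code}
// Build requests through the record factories rather than PBImpl classes.
// ReservationId.newInstance(clusterTimestamp, id) and
// ReservationDeleteRequest.newInstance(reservationId) are assumed here.
ReservationId reservationId =
    ReservationId.newInstance(System.currentTimeMillis(), 1);
ReservationDeleteRequest deleteRequest =
    ReservationDeleteRequest.newInstance(reservationId);

// Instead of:
// ReservationDeleteRequestPBImpl deleteRequest = new ReservationDeleteRequestPBImpl();
// deleteRequest.setReservationId(reservationId);
{code}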

Can't understand CapacityReservationSystem yet as I have to dig into the 
details of YARN-1709.

 Admission Control: Integrate Reservation subsystem with ResourceManager
 ---

 Key: YARN-2080
 URL: https://issues.apache.org/jira/browse/YARN-2080
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Subramaniam Krishnan
Assignee: Subramaniam Krishnan
 Attachments: YARN-2080.patch, YARN-2080.patch, YARN-2080.patch


 This JIRA tracks the integration of Reservation subsystem data structures 
 introduced in YARN-1709 with the YARN RM. This is essentially end2end wiring 
 of YARN-1051.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2154) FairScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2014-09-08 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126005#comment-14126005
 ] 

Sandy Ryza commented on YARN-2154:
--

I'd like to add another constraint that I've been thinking about into the mix.  
We don't necessarily need to implement it in this JIRA, but I think it's worth 
considering how it would affect the approach.

A queue should only be able to preempt a container from another queue if every 
queue between the starved queue and their least common ancestor is starved.  
This essentially means that we consider preemption and fairness hierarchically. 
 If the marketing and engineering queues are square in terms of resources, 
starved teams in engineering shouldn't be able to take resources from queues in 
marketing - they should only be able to preempt from queues within engineering.
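To make the constraint concrete, here is a rough sketch of the check; the helpers {{leastCommonAncestor}} and {{isStarved}} are hypothetical, not FairScheduler's actual API:
{code}
// A queue may preempt from another queue only if every queue on the path
// from the starved queue up to (but excluding) their least common ancestor
// is itself starved.
boolean canPreemptFrom(FSQueue starved, FSQueue victim) {
  FSQueue lca = leastCommonAncestor(starved, victim);   // hypothetical helper
  for (FSQueue q = starved; q != null && q != lca; q = q.getParent()) {
    if (!isStarved(q)) {          // hypothetical starvation check
      return false;               // a satisfied ancestor blocks preemption
    }
  }
  return true;
}
{code}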



 FairScheduler: Improve preemption to preempt only those containers that would 
 satisfy the incoming request
 --

 Key: YARN-2154
 URL: https://issues.apache.org/jira/browse/YARN-2154
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical

 Today, FairScheduler uses a spray-gun approach to preemption. Instead, it 
 should only preempt resources that would satisfy the incoming request. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2256) Too many nodemanager and resourcemanager audit logs are generated

2014-09-08 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2256:
---
Summary: Too many nodemanager and resourcemanager audit logs are generated  
(was: Too many nodemanager audit logs are generated)

 Too many nodemanager and resourcemanager audit logs are generated
 -

 Key: YARN-2256
 URL: https://issues.apache.org/jira/browse/YARN-2256
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager
Affects Versions: 2.4.0
Reporter: Varun Saxena
Assignee: Varun Saxena
 Attachments: YARN-2256.patch


 The following audit logs are generated too many times (due to the possibility of 
 a large number of containers):
 1. In the NM - audit logs corresponding to starting, stopping and finishing a 
 container
 2. In the RM - audit logs corresponding to the AM allocating a container and the 
 AM releasing a container
 We can have different log levels even for NM and RM audit logs and move these 
 successful-container logs to DEBUG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled

2014-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126036#comment-14126036
 ] 

Hadoop QA commented on YARN-2459:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667215/YARN-2459.4.patch
  against trunk revision df8c84c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4846//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4846//console

This message is automatically generated.

 RM crashes if App gets rejected for any reason and HA is enabled
 

 Key: YARN-2459
 URL: https://issues.apache.org/jira/browse/YARN-2459
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.1
Reporter: Mayank Bansal
Assignee: Mayank Bansal
 Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, 
 YARN-2459.4.patch, YARN-2459.5.patch


 RM HA is enabled and the ZooKeeper store is used as the RM state store.
 If an app gets rejected for any reason and goes directly from NEW to FAILED, the 
 final transition adds it to the RMApps and completed-apps in-memory structures, 
 but it never reaches the state store.
 When the RMApps default limit is reached, the RM starts deleting apps from both 
 memory and the store. It then tries to delete this app from the store, fails, and 
 the RM crashes.
 Thanks,
 Mayank



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled

2014-09-08 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126037#comment-14126037
 ] 

Jian He commented on YARN-2459:
---

The new patch adds some comments in the test case.

 RM crashes if App gets rejected for any reason and HA is enabled
 

 Key: YARN-2459
 URL: https://issues.apache.org/jira/browse/YARN-2459
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.1
Reporter: Mayank Bansal
Assignee: Mayank Bansal
 Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, 
 YARN-2459.4.patch, YARN-2459.5.patch


 RM HA is enabled and the ZooKeeper store is used as the RM state store.
 If an app gets rejected for any reason and goes directly from NEW to FAILED, the 
 final transition adds it to the RMApps and completed-apps in-memory structures, 
 but it never reaches the state store.
 When the RMApps default limit is reached, the RM starts deleting apps from both 
 memory and the store. It then tries to delete this app from the store, fails, and 
 the RM crashes.
 Thanks,
 Mayank



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled

2014-09-08 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2459:
--
Attachment: YARN-2459.5.patch

 RM crashes if App gets rejected for any reason and HA is enabled
 

 Key: YARN-2459
 URL: https://issues.apache.org/jira/browse/YARN-2459
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.1
Reporter: Mayank Bansal
Assignee: Mayank Bansal
 Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, 
 YARN-2459.4.patch, YARN-2459.5.patch


 RM HA is enabled and the ZooKeeper store is used as the RM state store.
 If an app gets rejected for any reason and goes directly from NEW to FAILED, the 
 final transition adds it to the RMApps and completed-apps in-memory structures, 
 but it never reaches the state store.
 When the RMApps default limit is reached, the RM starts deleting apps from both 
 memory and the store. It then tries to delete this app from the store, fails, and 
 the RM crashes.
 Thanks,
 Mayank



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-08 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126073#comment-14126073
 ] 

Vinod Kumar Vavilapalli commented on YARN-2033:
---

Mostly looks fine; this is a rapidly changing part of the code-base! I get the 
feeling we need some umbrella cleanup effort to make usage consistent w.r.t 
history-service/timeline-service. Anyway, some comments:
 - RMApplicationHistoryWriter is not really needed anymore. We did document it 
to be unstable/alpha too. We can remove it directly instead of deprecating it; 
it's a burden to support two interface hierarchies. I'm okay doing it 
separately though.
 - YarnClientImpl: Calls using AHSClient shouldn't rely on the timeline-publisher 
yet; we should continue to use APPLICATION_HISTORY_ENABLED for that till we get 
rid of AHSClient altogether. We should file a ticket for this too.
 - You removed the unstable annotations from the ApplicationContext APIs. We 
should retain them; this stuff isn't stable yet.
 - Rename YarnMetricsPublisher -> {Platform|System}MetricsPublisher to avoid 
confusing it with host/daemon metrics that exist outside today?
 

 Investigate merging generic-history into the Timeline Store
 ---

 Key: YARN-2033
 URL: https://issues.apache.org/jira/browse/YARN-2033
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Zhijie Shen
 Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
 YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, 
 YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, 
 YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, 
 YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch


 Having two different stores isn't conducive to generic insights on what's 
 happening with applications. This is to investigate porting generic-history 
 into the Timeline Store.
 One goal is to try and retain most of the client side interfaces as close to 
 what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2256) Too many nodemanager and resourcemanager audit logs are generated

2014-09-08 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126119#comment-14126119
 ] 

Varun Saxena commented on YARN-2256:


bq. Someone correct me if I'm wrong, but I'm fairly certain that the intent of 
this information is to be the equivalent of the HDFS audit log. In other words, 
setting these to debug completely defeats the purpose. Instead, I suspect the 
real culprit is that the log4j settings are wrong for the node manager process.

[~aw], the issue raised was basically for both the NM and the RM. I have updated 
the description to reflect that. The issue here is that some of the 
container-related operations' audit logs in both the NM and the RM are too 
frequent and too numerous. This may impact performance as well.

Now, there are two possible solutions: either remove these logs or change their 
log level, so that they do not appear in a live environment and can be enabled 
only when required. 
As I wasn't sure whether these audit logs have to be removed or not, I changed the 
log level for some of these logs in the RM and all of them in the NM. To support 
this, I added printing of audit logs at different levels, as is done in HBase (as 
far as I know). This is handled as part of YARN-2287.

Now, for the NM, you are correct: the log level can be changed in the log4j 
properties to suppress these logs if required. But for the RM, as not all logs 
have to be suppressed, this can't be done. So, to be consistent, I added log 
levels for both NM and RM audit logs; see the sketch below.

If it's agreeable to remove these audit logs, that can be a possible solution as 
well. Please suggest. 
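For reference, a minimal log4j.properties sketch of the suppression approach (the logger names follow the NMAuditLogger/RMAuditLogger classes; treat the exact property lines as illustrative for your own log4j setup):
{code}
# Raise the audit loggers above INFO so per-container start/stop/finish and
# AM allocate/release entries are not emitted in a live environment.
log4j.logger.org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger=WARN
log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger=WARN
{code}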

 Too many nodemanager and resourcemanager audit logs are generated
 -

 Key: YARN-2256
 URL: https://issues.apache.org/jira/browse/YARN-2256
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager
Affects Versions: 2.4.0
Reporter: Varun Saxena
Assignee: Varun Saxena
 Attachments: YARN-2256.patch


 The following audit logs are generated too many times (due to the possibility of 
 a large number of containers):
 1. In the NM - audit logs corresponding to starting, stopping and finishing a 
 container
 2. In the RM - audit logs corresponding to the AM allocating a container and the 
 AM releasing a container
 We can have different log levels even for NM and RM audit logs and move these 
 successful-container logs to DEBUG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled

2014-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126135#comment-14126135
 ] 

Hadoop QA commented on YARN-2459:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667234/YARN-2459.5.patch
  against trunk revision d989ac0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4847//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4847//console

This message is automatically generated.

 RM crashes if App gets rejected for any reason and HA is enabled
 

 Key: YARN-2459
 URL: https://issues.apache.org/jira/browse/YARN-2459
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.1
Reporter: Mayank Bansal
Assignee: Mayank Bansal
 Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, 
 YARN-2459.4.patch, YARN-2459.5.patch


 RM HA is enabled and the ZooKeeper store is used as the RM state store.
 If an app gets rejected for any reason and goes directly from NEW to FAILED, the 
 final transition adds it to the RMApps and completed-apps in-memory structures, 
 but it never reaches the state store.
 When the RMApps default limit is reached, the RM starts deleting apps from both 
 memory and the store. It then tries to delete this app from the store, fails, and 
 the RM crashes.
 Thanks,
 Mayank



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores

2014-09-08 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126140#comment-14126140
 ] 

Vinod Kumar Vavilapalli commented on YARN-2440:
---

Just caught up with the discussion. I can get behind an absolute limit too, 
specifically in the context of heterogeneous clusters, where uniform percentage 
configurations can go really bad and the only resort would then be per-node 
configuration - not ideal. Would that be a valid use-case for putting in the 
absolute limit, [~jlowe]? Even if it were, I am okay punting that off to a 
separate JIRA.

Comments on the patch:
 - containers-limit-cpu-percentage -> 
{{yarn.nodemanager.resource.percentage-cpu-limit}} to be consistent? Similarly, 
NM_CONTAINERS_CPU_PERC? I don't like the tag 'resource' - it should have been 
'resources' - but it is what it is.
 - You still have refs to YarnConfiguration.NM_CONTAINERS_CPU_ABSOLUTE in the 
patch. Similarly, the javadoc in NodeManagerHardwareUtils needs to be updated if 
we are not adding the absolute-cpu config; it should no longer refer to the 
number of cores that should be used for YARN containers.
 - TestCgroupsLCEResourcesHandler: You can use mockito if you only want to 
override num-processors in TestResourceCalculatorPlugin. Similarly in 
TestNodeManagerHardwareUtils.
 - The tests may fail on a machine that doesn't have 4 cores? :)
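As context for the percentage-limit config, a rough sketch of how a percentage could translate into the cores available to containers (illustrative only; this is not the patch's NodeManagerHardwareUtils code):
{code}
// Hypothetical helper: derive the number of cores YARN containers may use
// from the node's core count and a percentage limit.
static int containerCores(int nodeCores, float cpuLimitPercentage) {
  // e.g. 8 cores with a 75% limit -> 6 cores for containers
  int cores = (int) (nodeCores * cpuLimitPercentage / 100f);
  return Math.max(1, cores);  // always leave containers at least one core
}
{code}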

 Cgroups should allow YARN containers to be limited to allocated cores
 -

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, 
 apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1709) Admission Control: Reservation subsystem

2014-09-08 Thread Subramaniam Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramaniam Krishnan updated YARN-1709:
---
Attachment: YARN-1709.patch

Thanks [~chris.douglas] for your exhaustive review. I am uploading a patch that 
has the following fixes:
  * Cloned _ZERO_RESOURCE_, _minimumAllocation_ and _maximumAllocation_ to 
prevent leaking mutable data (see the sketch below)
  * Removed MessageFormat. Had to concatenate strings in a few cases where they 
are both logged and included as part of the exception message
  * Fixed the code readability and lock scope in _addReservation()_
  * Added assertions for _isWriteLockedByCurrentThread()_ in private methods 
that assume locks are held
  * Removed redundant _this_ in get methods
  * toString now uses StringBuilder instead of StringBuffer
  * Fixed Javadoc - content (_getEarliestStartTime()_) and whitespace
  * Made _ReservationInterval_ immutable, good catch

The ReservationSystem uses UTCClock (added as part of YARN-1708) to enforce UTC 
times.
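On the first item, a minimal sketch of the defensive-copy idea using the existing {{Resources.clone()}} utility (the getter and field names here are illustrative, not the patch's actual code):
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Return a copy so callers cannot mutate the reservation system's own
// minimum-allocation instance through the returned reference.
public Resource getMinimumAllocation() {
  return Resources.clone(minimumAllocation);
}
{code}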

 Admission Control: Reservation subsystem
 

 Key: YARN-1709
 URL: https://issues.apache.org/jira/browse/YARN-1709
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Carlo Curino
Assignee: Subramaniam Krishnan
 Attachments: YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, 
 YARN-1709.patch


 This JIRA is about the key data structure used to track resources over time 
 to enable YARN-1051. The Reservation subsystem is conceptually a plan of 
 how the scheduler will allocate resources over time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed

2014-09-08 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126156#comment-14126156
 ] 

Jian He commented on YARN-2308:
---

Looked at this again; I think the solution mentioned by [~sunilg] is reasonable:
bq. During RMAppRecoveredTransition in RMAppImpl, may be we can check recovered 
app queue (can get this from submission context) is still a valid queue? If 
this queue not present, recovery for that app can be made failed, and may be 
need to do some more RMApp clean up. Sounds doable?
We can check whether the queue still exists on recovery. If not, directly return 
the FAILED state, with no need to add the attempts anymore. Thoughts? A rough 
sketch follows.
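A rough sketch of that recovery-time check, with hypothetical helper names (the real logic would live in RMAppImpl's recovery path):
{code}
// Hypothetical sketch: if the app's queue no longer exists in the scheduler,
// fail the recovered app instead of adding its attempts and later hitting an
// NPE in the scheduler.
RMAppState transitionOnRecovery(RMApp app) {
  String queue = app.getApplicationSubmissionContext().getQueue();
  if (!queueExists(queue)) {       // hypothetical scheduler lookup
    return RMAppState.FAILED;      // skip attempt recovery for this app
  }
  recoverAppAttempts(app);         // hypothetical: continue normal recovery
  return RMAppState.ACCEPTED;
}
{code}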


 NPE happened when RM restart after CapacityScheduler queue configuration 
 changed 
 -

 Key: YARN-2308
 URL: https://issues.apache.org/jira/browse/YARN-2308
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.6.0
Reporter: Wangda Tan
Assignee: chang li
Priority: Critical
 Attachments: jira2308.patch, jira2308.patch, jira2308.patch


 I encountered an NPE when the RM restarted:
 {code}
 2014-07-16 07:22:46,957 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
 handling event type APP_ATTEMPT_ADDED to the scheduler
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 And the RM will then fail to restart.
 This is caused by a queue configuration change: I removed some queues and 
 added new queues. So when the RM restarts, it tries to recover historical 
 applications, and when any of those applications' queues has been removed, an 
 NPE is raised.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-09-08 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126158#comment-14126158
 ] 

Karthik Kambatla commented on YARN-1458:


Thanks Zhihai for working on this. I like the first approach; uploading a patch 
with minor nit fixes. Let me know if this looks good to you. 

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, 
 YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor gets blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the ResourceManager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-09-08 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1458:
---
Attachment: yarn-1458-5.patch

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, 
 YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor gets blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the ResourceManager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled

2014-09-08 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2459:
--
Attachment: YARN-2459.6.patch

 RM crashes if App gets rejected for any reason and HA is enabled
 

 Key: YARN-2459
 URL: https://issues.apache.org/jira/browse/YARN-2459
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.1
Reporter: Mayank Bansal
Assignee: Mayank Bansal
 Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, 
 YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch


 RM HA is enabled and the ZooKeeper store is used as the RM state store.
 If an app gets rejected for any reason and goes directly from NEW to FAILED, the 
 final transition adds it to the RMApps and completed-apps in-memory structures, 
 but it never reaches the state store.
 When the RMApps default limit is reached, the RM starts deleting apps from both 
 memory and the store. It then tries to delete this app from the store, fails, and 
 the RM crashes.
 Thanks,
 Mayank



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2448) RM should expose the resource types considered during scheduling when AMs register

2014-09-08 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126162#comment-14126162
 ] 

Vinod Kumar Vavilapalli commented on YARN-2448:
---

+1, this looks good. Checking this in..

 RM should expose the resource types considered during scheduling when AMs 
 register
 --

 Key: YARN-2448
 URL: https://issues.apache.org/jira/browse/YARN-2448
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2448.0.patch, apache-yarn-2448.1.patch, 
 apache-yarn-2448.2.patch


 The RM should expose the name of the ResourceCalculator being used when AMs 
 register, as part of the RegisterApplicationMasterResponse.
 This will allow applications to make better decisions when scheduling. 
 MapReduce, for example, only looks at memory when making its scheduling decisions, 
 even though the RM could potentially be using the DominantResourceCalculator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-09-08 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126172#comment-14126172
 ] 

Karthik Kambatla commented on YARN-1458:


By the way, I like the first approach mainly because of its simplicity and 
readability. 

In the while loop that was running forever, we could optionally keep track of 
the resource-usage from the previous iteration and see if we are making 
progress. 
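A minimal sketch of that progress check inside the doubling loop of ComputeFairShares (illustrative only, not the actual code; {{schedulables}}, {{type}} and {{totalResource}} are assumed to be in scope):
{code}
// Illustrative sketch of the progress check discussed above: bail out of the
// doubling loop if the computed usage stops growing between iterations,
// instead of spinning forever while holding the scheduler lock.
int previousUsed = -1;
double rMax = 1.0;
while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
    < totalResource) {
  int used = resourceUsedWithWeightToResourceRatio(rMax, schedulables, type);
  if (used <= previousUsed) {
    break;          // no progress between iterations - give up on this pass
  }
  previousUsed = used;
  rMax *= 2.0;      // illustrative update step
}
{code}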

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, 
 YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor gets blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the ResourceManager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1709) Admission Control: Reservation subsystem

2014-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126193#comment-14126193
 ] 

Hadoop QA commented on YARN-1709:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667251/YARN-1709.patch
  against trunk revision d989ac0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4848//console

This message is automatically generated.

 Admission Control: Reservation subsystem
 

 Key: YARN-1709
 URL: https://issues.apache.org/jira/browse/YARN-1709
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Carlo Curino
Assignee: Subramaniam Krishnan
 Attachments: YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, 
 YARN-1709.patch


 This JIRA is about the key data structure used to track resources over time 
 to enable YARN-1051. The Reservation subsystem is conceptually a plan of 
 how the scheduler will allocate resources over time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback

2014-09-08 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126202#comment-14126202
 ] 

Jian He commented on YARN-415:
--

Looks good to me. Just one more question: I've kind of lost the context of why we 
need this check; it seems we don't need it, because the returned 
ApplicationResourceUsageReport for a non-active attempt is null anyway.
{code}
// Only add in the running containers if this is the active attempt.
RMAppAttempt currentAttempt = rmContext.getRMApps()
    .get(attemptId.getApplicationId()).getCurrentAppAttempt();
if (currentAttempt != null
    && currentAttempt.getAppAttemptId().equals(attemptId)) {
{code}

 Capture aggregate memory allocation at the app-level for chargeback
 ---

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
 YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
 YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
 YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
 YARN-415.201407242148.txt, YARN-415.201407281816.txt, 
 YARN-415.201408062232.txt, YARN-415.201408080204.txt, 
 YARN-415.201408092006.txt, YARN-415.201408132109.txt, 
 YARN-415.201408150030.txt, YARN-415.201408181938.txt, 
 YARN-415.201408181938.txt, YARN-415.201408212033.txt, 
 YARN-415.201409040036.txt, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-09-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126204#comment-14126204
 ] 

zhihai xu commented on YARN-1458:
-

Hi [~kasha], thanks for the review. The first approach has the simplicity and 
readability advantage, but it can't cover all the corner cases:
the alternative approach can fix zero weight with non-zero minShare, but the 
first approach can't. 
Both approaches can fix zero weight with zero minShare. We may also have a 
limitation in keeping track of the resource usage from the previous iteration to 
see whether we are making progress: for example, for a very small weight, 
resourceUsedWithWeightToResourceRatio may return 0 even after multiple iterations.
thanks
zhihai

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, 
 YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor gets blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the ResourceManager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-09-08 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126210#comment-14126210
 ] 

Karthik Kambatla commented on YARN-1458:


bq. the alternative approach can fix zero weight with non-zero minShare but the 
first approach can't
I see. Good point. I was wondering if there were cases where we might want to 
check for {{if (currentRU - previousRU < epsilon || currentRU > totalResource)}}. 
The zero-weight and non-zero-minShare case should be handled by such a check, no? 

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, 
 YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor gets blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the ResourceManager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-08 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126236#comment-14126236
 ] 

Zhijie Shen commented on YARN-2033:
---

[~vinodkv], thanks for your comments. I've updated the patch accordingly.

bq. RMApplicationHistoryWriter is not really needed anymore. We did document it 
to be unstable/alpha too. We can remove it directly instead of deprecating it. 
It's a burden to support two interface hierarchies. I'm okay doing it 
separately though.

That seems to make sense. I previously created a ticket for deprecating the old 
history store stack. Let me update that jira.

bq. YarnClientImpl: Calls using AHSClient shouldn't rely on timeline-publisher 
yet, we should continue to use APPLICATION_HISTORY_ENABLED for that till we get 
rid of AHSClient altogether. We should file a ticket for this too.

In the newer patch, I reverted the change in YarnClientImpl, making it use 
APPLICATION_HISTORY_ENABLED, and ApplicationHistoryServer checks 
APPLICATION_HISTORY_STORE for backward compatibility. This can be simplified 
once the old history store stack is removed. I also simplified the configuration 
check in SystemMetricsPublisher. I'll create a jira for getting rid of 
AHSClient.
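
For illustration, the kind of configuration gate meant here looks roughly like the sketch below. This is illustrative only, not the patch code; only the APPLICATION_HISTORY_ENABLED flag is taken from the discussion above, the rest is assumed.
{code}
// Illustrative sketch only; not taken from the YARN-2033 patch.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class HistoryGateExample {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // Until AHSClient is removed, clients keep keying off the generic-history flag.
    boolean genericHistoryEnabled = conf.getBoolean(
        YarnConfiguration.APPLICATION_HISTORY_ENABLED, false);
    System.out.println(genericHistoryEnabled
        ? "use the old AHSClient / history-store path"
        : "use the timeline-based publisher path");
  }
}
{code}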

bq. You removed the unstable annotations from ApplicationContext APIs. We 
should retain them, this stuff isn't stable yet.

ApplicationContext is for internal usage only, not a user-facing interface, so I 
think the annotations should be removed to avoid confusing people.

bq. Rename YarnMetricsPublisher -> {Platform|System}MetricsPublisher to avoid 
confusing it with host/daemon metrics that exist outside today?

Renamed all yarnmetrics -> systemmetrics.

 Investigate merging generic-history into the Timeline Store
 ---

 Key: YARN-2033
 URL: https://issues.apache.org/jira/browse/YARN-2033
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Zhijie Shen
 Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
 YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, 
 YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, 
 YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, 
 YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch


 Having two different stores isn't amenable to generic insights on what's 
 happening with applications. This is to investigate porting generic-history 
 into the Timeline Store.
 One goal is to try and retain most of the client side interfaces as close to 
 what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-08 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2033:
--
Attachment: YARN-2033.8.patch

 Investigate merging generic-history into the Timeline Store
 ---

 Key: YARN-2033
 URL: https://issues.apache.org/jira/browse/YARN-2033
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Zhijie Shen
 Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
 YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, 
 YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, 
 YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, 
 YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch


 Having two different stores isn't amenable to generic insights on what's 
 happening with applications. This is to investigate porting generic-history 
 into the Timeline Store.
 One goal is to try and retain most of the client side interfaces as close to 
 what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126240#comment-14126240
 ] 

Hadoop QA commented on YARN-1458:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667252/yarn-1458-5.patch
  against trunk revision d989ac0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4849//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4849//console

This message is automatically generated.

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, 
 YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 

[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-09-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126249#comment-14126249
 ] 

zhihai xu commented on YARN-1458:
-

Yes, it works; it can fix the zero-weight with non-zero-minShare case if we compare 
with the previous result.
But the alternative approach will be a little faster compared to the first 
approach (less computation and fewer schedulables in the calculation after 
filtering out the fixed-share schedulables). Either approach is OK with me.
I will submit a patch for the first approach that compares with the previous result.

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, 
 YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled

2014-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126259#comment-14126259
 ] 

Hadoop QA commented on YARN-2459:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667254/YARN-2459.6.patch
  against trunk revision d989ac0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4850//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4850//console

This message is automatically generated.

 RM crashes if App gets rejected for any reason and HA is enabled
 

 Key: YARN-2459
 URL: https://issues.apache.org/jira/browse/YARN-2459
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.1
Reporter: Mayank Bansal
Assignee: Mayank Bansal
 Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, 
 YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch


 If RM HA is enabled and the Zookeeper store is used as the RM state store,
 and if for any reason an app gets rejected and goes directly from NEW to FAILED,
 then the final transition adds it to the RMApps and completed-apps in-memory
 structures, but it never makes it to the state store.
 Now when the RMApps default limit is reached, the RM starts deleting apps from memory and 
 the store. In that case it tries to delete this app from the store and fails, which 
 causes the RM to crash.
 Thanks,
 Mayank



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2320) Removing old application history store after we store the history data to timeline store

2014-09-08 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2320:
--
Summary: Removing old application history store after we store the history 
data to timeline store  (was: Deprecate existing application history store 
after we store the history data to timeline store)

 Removing old application history store after we store the history data to 
 timeline store
 

 Key: YARN-2320
 URL: https://issues.apache.org/jira/browse/YARN-2320
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 After YARN-2033, we should deprecate the application history store set. There's 
 no need to maintain two sets of store interfaces. In addition, we should 
 conclude the outstanding jiras under YARN-321 about the application history 
 store.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store

2014-09-08 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126272#comment-14126272
 ] 

Zhijie Shen commented on YARN-2320:
---

According to Vinod's comments: 
https://issues.apache.org/jira/browse/YARN-2033?focusedCommentId=14126073page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14126073

We may think of removing the old store stack directly.

 Removing old application history store after we store the history data to 
 timeline store
 

 Key: YARN-2320
 URL: https://issues.apache.org/jira/browse/YARN-2320
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 After YARN-2033, we should deprecate the application history store set. There's 
 no need to maintain two sets of store interfaces. In addition, we should 
 conclude the outstanding jiras under YARN-321 about the application history 
 store.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed

2014-09-08 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126284#comment-14126284
 ] 

Xuan Gong commented on YARN-2308:
-

bq. We can check if the queue exists on recovery. If not, directly return 
FAILED state and no need to add the attempts anymore. Thoughts ?

If we do this, the RMAppAttempt will show the *incorrect* state in 
ApplicationHistoryStore.

 NPE happened when RM restart after CapacityScheduler queue configuration 
 changed 
 -

 Key: YARN-2308
 URL: https://issues.apache.org/jira/browse/YARN-2308
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.6.0
Reporter: Wangda Tan
Assignee: chang li
Priority: Critical
 Attachments: jira2308.patch, jira2308.patch, jira2308.patch


 I encountered an NPE when RM restarted
 {code}
 2014-07-16 07:22:46,957 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
 handling event type APP_ATTEMPT_ADDED to the scheduler
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 And the RM will then fail to restart.
 This is caused by a queue configuration change: I removed some queues and 
 added new ones. So when the RM restarts, it tries to recover the history 
 applications, and when any queue of these applications has been removed, an NPE 
 is raised.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126340#comment-14126340
 ] 

Hadoop QA commented on YARN-2033:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667269/YARN-2033.8.patch
  against trunk revision d989ac0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 17 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4851//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4851//console

This message is automatically generated.

 Investigate merging generic-history into the Timeline Store
 ---

 Key: YARN-2033
 URL: https://issues.apache.org/jira/browse/YARN-2033
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Zhijie Shen
 Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
 YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, 
 YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, 
 YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, 
 YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch


 Having two different stores isn't amenable to generic insights on what's 
 happening with applications. This is to investigate porting generic-history 
 into the Timeline Store.
 One goal is to try and retain most of the client side interfaces as close to 
 what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-08 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2033:
--
Attachment: YARN-2033.9.patch

Fix one typo in the class name

 Investigate merging generic-history into the Timeline Store
 ---

 Key: YARN-2033
 URL: https://issues.apache.org/jira/browse/YARN-2033
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Zhijie Shen
 Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
 YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, 
 YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, 
 YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, 
 YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch


 Having two different stores isn't amenable to generic insights on what's 
 happening with applications. This is to investigate porting generic-history 
 into the Timeline Store.
 One goal is to try and retain most of the client side interfaces as close to 
 what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1712) Admission Control: plan follower

2014-09-08 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126397#comment-14126397
 ] 

Jian He commented on YARN-1712:
---

Thanks Subra and Carlo for working on the patch. Some comments and questions on 
the patch:

- I think the default queue can be initialized upfront when PlanQueue is 
initialized in CapacityScheduler
{code}
  // Add default queue if it doesnt exist
  if (scheduler.getQueue(defPlanQName) == null) {
{code}
- Consolidate the comments into 2 lines
{code}
  // identify the reservations that have expired and new reservations that
  // have to
  // be activated
{code}
- Exceptions like the following are ignored. Is this intentional ?
{code}
 } catch (YarnException e) {
  LOG.warn(
      "Exception while trying to release default queue capacity for plan: {}",
      planQueueName, e);
}
{code}
- maybe create a common method to calculate lhsRes and rhsRes
{code}
  CSQueue lhsQueue = scheduler.getQueue(lhs.getReservationId().toString());
  if (lhsQueue != null) {
lhsRes =
Resources.subtract(
lhs.getResourcesAtTime(now),
Resources.multiply(clusterResource,
lhsQueue.getAbsoluteCapacity()));
  } else {
lhsRes = lhs.getResourcesAtTime(now);
  }
{code}
- allocatedCapacity, may rename to reservedResources
{code}
  Resource allocatedCapacity = Resource.newInstance(0, 0);
{code}
- Instead of doing the following:  
{code}
  for (CSQueue resQueue : resQueues) {
previousReservations.add(resQueue.getQueueName());
  }
  Set<String> expired =
      Sets.difference(previousReservations, curReservationNames);
  Set<String> toAdd =
      Sets.difference(curReservationNames, previousReservations);
{code}
we can do something like the following to save some time: 
{code}
for (String queue : previousReservations) {
  if (!curReservationNames.contains(queue)) {
    expired.add(queue);
  } else {
    // curReservationNames ends up containing only the queues to add
    curReservationNames.remove(queue);
  }
}
{code}
- Not sure if this method is only used by PlanFollower. If it is, we can change 
the return value to be a set of reservation names so that we don’t need to loop 
later to get all the reservation names..
{code}
  Set<ReservationAllocation> currentReservations =
      plan.getReservationsAtTime(now);
{code}
- rename defPlanQName to defReservationQueue
{code}
  String defPlanQName = planQueueName + PlanQueue.DEFAULT_QUEUE_SUFFIX;
{code}
- The apps are already in the current planQueue; IIUC, this should be the 
defaultReservationQueue? If so, I think we should change the queueName 
parameter to the proper defaultReservationQueue name. Also, 
AbstractYarnScheduler#moveAllApps actually expects the queue to be a 
leafQueue (ReservationQueue), not a planQueue (parent queue).
{code}
// Move all the apps in these queues to the PlanQueue
moveAppsInQueues(toMove, planQueueName);
{code}
- I'm wondering whether we can make the PlanFollower move apps to the 
defaultQueue synchronously, for the following reasons:
{code}
1. IIUC, the logic for moveAll and killAll is: the first time synchronizePlan 
is called, it will try to move all expired apps; the next time synchronizePlan 
is called, it will kill all the previously not-yet-moved apps. If the 
synchronizePlan interval is very small, it's likely to kill most apps that are 
still being moved.
2. Exceptions from CapacityScheduler#moveApplication are currently just 
ignored if this is done asynchronously.
3. PlanFollower anyway locks the whole scheduler in the synchronizePlan 
method (though I'm still thinking about whether we need to lock the whole 
scheduler, as this is kind of costly).
4. In AbstractYarnScheduler#moveAllApps, we can do the moveApp synchronously 
and still send events to RMApp to update its bookkeeping if needed. (But I 
don't think we need to send the event now.)
5. The PlanFollower move logic should be much simpler if done synchronously.
{code}

 Admission Control: plan follower
 

 Key: YARN-1712
 URL: https://issues.apache.org/jira/browse/YARN-1712
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler, resourcemanager
Reporter: Carlo Curino
Assignee: Carlo Curino
  Labels: reservations, scheduler
 Attachments: YARN-1712.1.patch, YARN-1712.2.patch, YARN-1712.patch


 This JIRA tracks a thread that continuously propagates the current state of 
 an inventory subsystem to the scheduler. As the inventory subsystem store the 
 plan of how the resources should be subdivided, the work we propose in this 
 JIRA realizes such plan by dynamically instructing the CapacityScheduler to 
 add/remove/resize queues to follow the plan.



--
This message was sent by Atlassian JIRA

[jira] [Updated] (YARN-1250) Generic history service should support application-acls

2014-09-08 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1250:
--
Attachment: YARN-1250.2.patch

Update the patch according to the latest patch of YARN-2033.

 Generic history service should support application-acls
 ---

 Key: YARN-1250
 URL: https://issues.apache.org/jira/browse/YARN-1250
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Zhijie Shen
 Attachments: GenericHistoryACLs.pdf, YARN-1250.1.patch, 
 YARN-1250.2.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2522) AHSClient may be not necessary

2014-09-08 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2522:
--
Issue Type: Sub-task  (was: Bug)
Parent: YARN-321

 AHSClient may be not necessary
 --

 Key: YARN-2522
 URL: https://issues.apache.org/jira/browse/YARN-2522
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 Per discussion in 
 [YARN-2033|https://issues.apache.org/jira/browse/YARN-2033?focusedCommentId=14126073page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14126073],
  it may not be necessary to have a separate AHSClient. The methods can be 
 incorporated into TimelineClient. APPLICATION_HISTORY_ENABLED is also useless 
 then.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-08 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2494:
-
Attachment: YARN-2494.patch

 [YARN-796] Node label manager API and storage implementations
 -

 Key: YARN-2494
 URL: https://issues.apache.org/jira/browse/YARN-2494
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2494.patch, YARN-2494.patch


 This JIRA includes the APIs and storage implementations of the node label manager.
 NodeLabelManager is an abstract class used to manage labels of nodes in the 
 cluster; it has APIs to query/modify (see the illustrative sketch after this list)
 - Nodes according to a given label
 - Labels according to a given hostname
 - Add/remove labels
 - Set labels of nodes in the cluster
 - Persist/recover changes of labels/labels-on-nodes to/from storage
 And it has two implementations to store modifications
 - Memory based storage: it will not persist changes, so all labels will be 
 lost when the RM restarts
 - FileSystem based storage: it will persist/recover to/from a FileSystem (like 
 HDFS), and all labels and labels-on-nodes will be recovered upon RM restart
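
 For illustration only, a hypothetical sketch of such an abstract manager; the class and method names below are assumptions made up for this sketch, not the actual YARN-2494 API:
{code}
// Hypothetical sketch; names are illustrative, not the actual YARN-2494 classes.
import java.util.Map;
import java.util.Set;

public abstract class NodeLabelManagerSketch {
  // Query side
  public abstract Set<String> getNodesWithLabel(String label);
  public abstract Set<String> getLabelsOnNode(String hostName);

  // Modification side
  public abstract void addLabels(Set<String> labels);
  public abstract void removeLabels(Set<String> labels);
  public abstract void setLabelsOnNodes(Map<String, Set<String>> nodeToLabels);

  // Persistence hooks implemented by the memory- and FileSystem-based stores
  protected abstract void persistModification();
  protected abstract void recoverFromStore();
}
{code}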



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126481#comment-14126481
 ] 

Hadoop QA commented on YARN-2033:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667301/YARN-2033.9.patch
  against trunk revision 7498dd7.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 17 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4852//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4852//console

This message is automatically generated.

 Investigate merging generic-history into the Timeline Store
 ---

 Key: YARN-2033
 URL: https://issues.apache.org/jira/browse/YARN-2033
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Zhijie Shen
 Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
 YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, 
 YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, 
 YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, 
 YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch


 Having two different stores isn't amenable to generic insights on what's 
 happening with applications. This is to investigate porting generic-history 
 into the Timeline Store.
 One goal is to try and retain most of the client side interfaces as close to 
 what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126493#comment-14126493
 ] 

Hadoop QA commented on YARN-2494:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667321/YARN-2494.patch
  against trunk revision 7498dd7.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4853//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4853//console

This message is automatically generated.

 [YARN-796] Node label manager API and storage implementations
 -

 Key: YARN-2494
 URL: https://issues.apache.org/jira/browse/YARN-2494
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2494.patch, YARN-2494.patch


 This JIRA includes the APIs and storage implementations of the node label manager.
 NodeLabelManager is an abstract class used to manage labels of nodes in the 
 cluster; it has APIs to query/modify
 - Nodes according to a given label
 - Labels according to a given hostname
 - Add/remove labels
 - Set labels of nodes in the cluster
 - Persist/recover changes of labels/labels-on-nodes to/from storage
 And it has two implementations to store modifications
 - Memory based storage: it will not persist changes, so all labels will be 
 lost when the RM restarts
 - FileSystem based storage: it will persist/recover to/from a FileSystem (like 
 HDFS), and all labels and labels-on-nodes will be recovered upon RM restart



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-09-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126498#comment-14126498
 ] 

zhihai xu commented on YARN-1458:
-

Hi [~kasha], I just found an example to prove the first approach doesn't work 
when minShare is not zero (all queues have a non-zero minShare).
Here is the example:
We have 4 queues A, B, C and D; each has weight 0.25 and minShare 1024, and 
the cluster has resource 6144 (6*1024).
Using the first approach of comparing with the previous result, we will exit the 
loop early with each queue's fair share at 1024.
The reason is that computeShare will return the minShare value 1024 when rMax 
<= 2048 in the following code:
{code}
private static int computeShare(Schedulable sched, double w2rRatio,
    ResourceType type) {
  double share = sched.getWeights().getWeight(type) * w2rRatio;
  share = Math.max(share, getResourceValue(sched.getMinShare(), type));
  share = Math.min(share, getResourceValue(sched.getMaxShare(), type));
  return (int) share;
}
{code}
So for the first 12 iterations the currentRU does not change; it stays at the sum of 
all queues' minShare (4096).
If we use the second approach, we will get the correct result: each queue's fair 
share is 1536.
In this case, the second approach is definitely better than the first approach; 
the first approach can't handle the case where all queues have a non-zero minShare.

I will create a new test case in the second-approach patch.
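
To make the plateau concrete, the following is a small standalone demo of the arithmetic above (hypothetical illustration code, not part of any YARN-1458 patch; it only mimics the clamping done in computeShare):
{code}
// Hypothetical demo of the 4-queue example above; not YARN code.
public class MinShareClampDemo {
  // Mimics ComputeFairShares.computeShare: weight * ratio, floored at minShare.
  static int computeShare(double weight, double w2rRatio, int minShare) {
    return (int) Math.max(weight * w2rRatio, minShare);
  }

  public static void main(String[] args) {
    int queues = 4, minShare = 1024, clusterResource = 6144;
    double weight = 0.25;
    // While weight * ratio <= minShare, every queue is clamped to 1024, so the
    // total resource used stays at 4096 and never changes between iterations;
    // an exit check that only compares with the previous total fires too early.
    for (double ratio : new double[] {512, 1024, 2048, 4096, 6144, 8192}) {
      int totalUsed = queues * computeShare(weight, ratio, minShare);
      System.out.println("w2rRatio=" + ratio + " -> total used=" + totalUsed
          + (totalUsed == clusterResource ? "  (1536 per queue, the correct answer)" : ""));
    }
  }
}
{code}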

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, 
 YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - 

[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-09-08 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-1458:

Attachment: YARN-1458.alternative2.patch

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, 
 YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, 
 yarn-1458-5.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-09-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126533#comment-14126533
 ] 

zhihai xu commented on YARN-1458:
-

I uploaded a new patch, YARN-1458.alternative2.patch, which adds a new test 
case where all queues have a non-zero minShare:
queueA and queueB each have weight 0.5 and minShare 1024, and
the cluster has resource 8192, so each queue should have a fair share of 4096.

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, 
 YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, 
 yarn-1458-5.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2517) Implement TimelineClientAsync

2014-09-08 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126557#comment-14126557
 ] 

Tsuyoshi OZAWA commented on YARN-2517:
--

As Zhijie mentioned, we should have a callback if we need to check errors. 
IMHO, if we have a thread for the onError callback, we should also have 
onEntitiesPut, since the complexity doesn't increase much and it's useful 
to distinguish connection-level exceptions from entity-level errors.
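
For illustration, a minimal sketch of what such a callback interface could look like; this is hypothetical, and the interface name and method signatures below are assumptions, not an existing YARN API:
{code}
// Hypothetical sketch only; not an existing YARN interface.
import org.apache.hadoop.yarn.api.records.timeline.TimelinePutResponse;

public interface TimelineClientAsyncCallback {
  // Transport-level success; entity-level errors, if any, are carried in the
  // TimelinePutResponse and can be inspected by the caller.
  void onEntitiesPut(TimelinePutResponse response);

  // The put failed entirely, e.g. the timeline server was unreachable.
  void onError(Throwable cause);
}
{code}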

 Implement TimelineClientAsync
 -

 Key: YARN-2517
 URL: https://issues.apache.org/jira/browse/YARN-2517
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2517.1.patch


 In some scenarios, we'd like to put timeline entities in another thread so as not to 
 block the current one.
 It would be good to have a TimelineClientAsync like AMRMClientAsync and 
 NMClientAsync. It can buffer entities, put them in a separate thread, and 
 have callbacks to handle the responses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2463) Add total cluster capacity to AllocateResponse

2014-09-08 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev resolved YARN-2463.
-
Resolution: Invalid

 Add total cluster capacity to AllocateResponse
 --

 Key: YARN-2463
 URL: https://issues.apache.org/jira/browse/YARN-2463
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Varun Vasudev
Assignee: Varun Vasudev

 YARN-2448 exposes the ResourceCalculator being used by the scheduler so that 
 AMs can make better decisions when scheduling tasks. The 
 DominantResourceCalculator needs the total cluster capacity to function 
 correctly. We should add this information to the AllocateResponse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2463) Add total cluster capacity to AllocateResponse

2014-09-08 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126558#comment-14126558
 ] 

Varun Vasudev commented on YARN-2463:
-

Not required anymore since we don't expose the resource calculator.

 Add total cluster capacity to AllocateResponse
 --

 Key: YARN-2463
 URL: https://issues.apache.org/jira/browse/YARN-2463
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Varun Vasudev
Assignee: Varun Vasudev

 YARN-2448 exposes the ResourceCalculator being used by the scheduler so that 
 AMs can make better decisions when scheduling tasks. The 
 DominantResourceCalculator needs the total cluster capacity to function 
 correctly. We should add this information to the AllocateResponse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2523) ResourceManager UI showing negative value for Decommissioned Nodes field

2014-09-08 Thread Nishan Shetty (JIRA)
Nishan Shetty created YARN-2523:
---

 Summary: ResourceManager UI showing negative value for 
Decommissioned Nodes field
 Key: YARN-2523
 URL: https://issues.apache.org/jira/browse/YARN-2523
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, webapp
Affects Versions: 2.4.1
Reporter: Nishan Shetty
Priority: Minor


1. Decommission one NodeManager by configuring ip in excludehost file
2. Remove ip from excludehost file
3. Execute -refreshNodes command and restart Decommissioned NodeManager

Observe that in RM UI negative value for Decommissioned Nodes field is shown



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2524) ResourceManager UI shows negative value for Decommissioned Nodes field

2014-09-08 Thread Nishan Shetty (JIRA)
Nishan Shetty created YARN-2524:
---

 Summary: ResourceManager UI shows negative value for 
Decommissioned Nodes field
 Key: YARN-2524
 URL: https://issues.apache.org/jira/browse/YARN-2524
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Nishan Shetty


1. Decommission one NodeManager by configuring ip in excludehost file
2. Remove ip from excludehost file
3. Execute -refreshNodes command and restart Decommissioned NodeManager

Observe that in RM UI negative value for Decommissioned Nodes field is shown



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2524) ResourceManager UI shows negative value for Decommissioned Nodes field

2014-09-08 Thread Nishan Shetty (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty resolved YARN-2524.
-
Resolution: Invalid

2 issues got created by mistake.

 ResourceManager UI shows negative value for Decommissioned Nodes field
 

 Key: YARN-2524
 URL: https://issues.apache.org/jira/browse/YARN-2524
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Nishan Shetty

 1. Decommission one NodeManager by configuring ip in excludehost file
 2. Remove ip from excludehost file
 3. Execute -refreshNodes command and restart Decommissioned NodeManager
 Observe that in RM UI negative value for Decommissioned Nodes field is shown



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2523) ResourceManager UI showing negative value for Decommissioned Nodes field

2014-09-08 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith reassigned YARN-2523:


Assignee: Rohith

 ResourceManager UI showing negative value for Decommissioned Nodes field
 --

 Key: YARN-2523
 URL: https://issues.apache.org/jira/browse/YARN-2523
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, webapp
Affects Versions: 2.4.1
Reporter: Nishan Shetty
Assignee: Rohith
Priority: Minor

 1. Decommission one NodeManager by adding its IP to the exclude-hosts file
 2. Remove the IP from the exclude-hosts file
 3. Execute the -refreshNodes command and restart the decommissioned NodeManager
 Observe that the RM UI shows a negative value for the Decommissioned Nodes field



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2523) ResourceManager UI showing negative value for Decommissioned Nodes field

2014-09-08 Thread Nishan Shetty (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty updated YARN-2523:

Priority: Major  (was: Minor)

 ResourceManager UI showing negative value for Decommissioned Nodes field
 --

 Key: YARN-2523
 URL: https://issues.apache.org/jira/browse/YARN-2523
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, webapp
Affects Versions: 2.4.1
Reporter: Nishan Shetty
Assignee: Rohith

 1. Decommission one NodeManager by adding its IP to the exclude-hosts file
 2. Remove the IP from the exclude-hosts file
 3. Execute the -refreshNodes command and restart the decommissioned NodeManager
 Observe that the RM UI shows a negative value for the Decommissioned Nodes field



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for Decommissioned Nodes field

2014-09-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126580#comment-14126580
 ] 

Rohith commented on YARN-2523:
--

The Decommissioned Nodes metric is set by NodesListManager. If a decommissioned node rejoins, RMNodeImpl#updateMetricsForRejoinedNode() decrements the metric by 1 again, which causes the negative value.

There should be a check in RMNodeImpl#updateMetricsForRejoinedNode() for the decommission state, for example:
{code}
// Decrement only when the rejoined host is no longer present in the exclude list.
if (!excludedHosts.contains(hostName)
    && !excludedHosts.contains(NetUtils.normalizeHostName(hostName))) {
  metrics.decrDecommisionedNMs();
}
{code}
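
To make the double decrement concrete, here is a minimal, self-contained sketch. It is a toy counter only, not the real ClusterMetrics or RMNodeImpl code, and it assumes the -refreshNodes pass has already accounted for the host being removed from the exclude list, as described above.

{code}
// Toy model only -- plain counter, not the real ClusterMetrics/RMNodeImpl code.
// Assumes the -refreshNodes pass already adjusts the counter when the host is
// removed from the exclude list, as the comment above suggests.
import java.util.concurrent.atomic.AtomicInteger;

public class DecommissionedCounterDemo {
  public static void main(String[] args) {
    AtomicInteger decommissionedNMs = new AtomicInteger();

    decommissionedNMs.incrementAndGet(); // host excluded, node decommissioned: counter = 1
    decommissionedNMs.decrementAndGet(); // host removed from exclude list, -refreshNodes run: counter = 0
    decommissionedNMs.decrementAndGet(); // node rejoins, rejoin path decrements again (the bug): counter = -1

    System.out.println("Decommissioned Nodes = " + decommissionedNMs.get()); // prints -1
  }
}
{code}

The guard proposed above is intended to prevent that final unconditional decrement by checking the host's decommission/exclude state first.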

 ResourceManager UI showing negative value for Decommissioned Nodes field
 --

 Key: YARN-2523
 URL: https://issues.apache.org/jira/browse/YARN-2523
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, webapp
Affects Versions: 3.0.0
Reporter: Nishan Shetty
Assignee: Rohith

 1. Decommission one NodeManager by adding its IP to the exclude-hosts file
 2. Remove the IP from the exclude-hosts file
 3. Execute the -refreshNodes command and restart the decommissioned NodeManager
 Observe that the RM UI shows a negative value for the Decommissioned Nodes field



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126591#comment-14126591
 ] 

Hadoop QA commented on YARN-1458:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12667336/YARN-1458.alternative2.patch
  against trunk revision 7498dd7.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4854//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4854//console

This message is automatically generated.

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, 
 YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, 
 yarn-1458-5.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor thread gets blocked when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the ResourceManager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
    java.lang.Thread.State: BLOCKED (on object monitor)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
  - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
  at java.lang.Thread.run(Thread.java:744)
 ……
  FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
    java.lang.Thread.State: RUNNABLE
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
  - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
  at