[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2016-07-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393360#comment-15393360
 ] 

Wangda Tan commented on YARN-4091:
--

Thanks [~ChenGe],

Would like to request more reviews from [~sunilg] / [~eepayne] / [~jlowe].

> Improvement: Introduce more debug/diagnostics information to detail out 
> scheduler activity
> --
>
> Key: YARN-4091
> URL: https://issues.apache.org/jira/browse/YARN-4091
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Chen Ge
> Attachments: Improvement on debugdiagnostic information - YARN.pdf, 
> YARN-4091-design-doc-v1.pdf, YARN-4091.1.patch, YARN-4091.2.patch, 
> YARN-4091.3.patch, YARN-4091.preliminary.1.patch, app_activities.json, 
> node_activities.json
>
>
> As schedulers gain new capabilities, more and more configuration options 
> cause the scheduler to take actions such as limiting container assignment 
> to an application or delaying container allocation. 
> No clear information is passed from the scheduler to the outside world in 
> these scenarios, which makes debugging much harder.
> This ticket is an effort to introduce well-defined states at the points in 
> the scheduler where it skips or rejects a container assignment, activates an 
> application, etc. Such information will help users understand what is 
> happening in the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve 
> on this as we discuss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2016-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393119#comment-15393119
 ] 

Hadoop QA commented on YARN-4091:
-

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 16s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
| 0 | mvndep | 1m 8s | Maven dependency ordering for branch |
| +1 | mvninstall | 7m 36s | trunk passed |
| +1 | compile | 2m 39s | trunk passed |
| +1 | checkstyle | 0m 47s | trunk passed |
| +1 | mvnsite | 4m 3s | trunk passed |
| +1 | mvneclipse | 0m 50s | trunk passed |
| 0 | findbugs | 0m 0s | Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn |
| +1 | findbugs | 1m 7s | trunk passed |
| +1 | javadoc | 1m 55s | trunk passed |
| 0 | mvndep | 0m 11s | Maven dependency ordering for patch |
| +1 | mvninstall | 3m 24s | the patch passed |
| +1 | compile | 2m 38s | the patch passed |
| +1 | javac | 2m 38s | the patch passed |
| -1 | checkstyle | 0m 42s | hadoop-yarn-project/hadoop-yarn: The patch generated 72 new + 378 unchanged - 1 fixed = 450 total (was 379) |
| +1 | mvnsite | 4m 13s | the patch passed |
| +1 | mvneclipse | 0m 54s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 1s | The patch has no ill-formed XML file. |
| 0 | findbugs | 0m 0s | Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn |
| +1 | findbugs | 1m 14s | the patch passed |
| +1 | javadoc | 2m 7s | the patch passed |
| -1 | unit | 87m 41s | hadoop-yarn in the patch failed. |
| -1 | unit | 55m 49s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 18s | The patch does not generate ASF License warnings. |
| | | 180m 22s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMRestart |
| | hadoop.yarn.server.nodemanager.TestDirectoryCollection |
| | hadoop.yarn.server.resourcemanager.TestRMRestart |
| Timed out junit tests | org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesReservation |
| | org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesReservation |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:9560f25 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12820053/YARN-4091.3.patch |
| JIRA Issue | 

[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2016-07-25 Thread Chen Ge (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392897#comment-15392897
 ] 

Chen Ge commented on YARN-4091:
---

We also ran the scheduler load simulator (SLS) using fake data. There are 2000 
nodes in total, and 2000 node heartbeats occur per second.

Two APIs are provided as activity views. The first records the activities of 
one node heartbeat. The second records an application's activities within a 
period of time, given an applicationId and a time.

In the baseline (the previous patch without these changes), one node heartbeat 
costs approximately 0.2 ms. If we record only application activities, the 
difference in running time is unnoticeable (less than 0.01 ms). If we record 
the complete activities of a node heartbeat, that heartbeat takes 0.6 ms, 
about 3x the baseline. In practice, however, only a few nodes' activities will 
be recorded at the same time. For example, suppose the activities of 30 nodes 
are being recorded at once (which already seems like a huge number to me). 
Each recorded heartbeat adds about 0.4 ms, so the extra cost is about 12 ms 
against roughly 400 ms for 2000 node heartbeats. That is around 3% more 
overhead, which is negligible and acceptable.
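
For reference, the 3% estimate can be reproduced from the measurements quoted above. This is only a small sketch of the arithmetic; the class and method names are made up for illustration and are not part of the patch.

```java
/** Back-of-the-envelope check of the recording overhead (illustrative only). */
final class RecordingOverhead {
  private RecordingOverhead() { }

  /**
   * Extra time spent on recording, as a fraction of the baseline time
   * needed to process all node heartbeats in the same interval.
   *
   * @param totalNodes    nodes that heartbeat in the interval
   * @param recordedNodes nodes whose heartbeat is fully recorded
   * @param baselineMs    cost of one heartbeat without recording
   * @param recordedMs    cost of one heartbeat with full recording
   */
  static double overheadFraction(int totalNodes, int recordedNodes,
      double baselineMs, double recordedMs) {
    double baselineTotalMs = totalNodes * baselineMs;
    double extraMs = recordedNodes * (recordedMs - baselineMs);
    return extraMs / baselineTotalMs;
  }

  public static void main(String[] args) {
    // Inputs from the SLS run: 2000 nodes, 30 recorded, 0.2 ms vs 0.6 ms.
    double fraction = overheadFraction(2000, 30, 0.2, 0.6);
    System.out.println("overhead fraction = " + fraction);
  }
}
```

The estimate scales linearly with the number of recorded nodes, so recording all 2000 nodes at once would roughly triple the heartbeat processing time.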


[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2016-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392595#comment-15392595
 ] 

Hadoop QA commented on YARN-4091:
-

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 12m 18s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
| 0 | mvndep | 0m 8s | Maven dependency ordering for branch |
| +1 | mvninstall | 6m 37s | trunk passed |
| +1 | compile | 2m 15s | trunk passed |
| +1 | checkstyle | 0m 44s | trunk passed |
| +1 | mvnsite | 3m 29s | trunk passed |
| +1 | mvneclipse | 0m 46s | trunk passed |
| 0 | findbugs | 0m 0s | Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn |
| +1 | findbugs | 0m 54s | trunk passed |
| +1 | javadoc | 1m 42s | trunk passed |
| 0 | mvndep | 0m 9s | Maven dependency ordering for patch |
| +1 | mvninstall | 2m 45s | the patch passed |
| +1 | compile | 2m 13s | the patch passed |
| +1 | javac | 2m 13s | the patch passed |
| -1 | checkstyle | 0m 41s | hadoop-yarn-project/hadoop-yarn: The patch generated 74 new + 377 unchanged - 1 fixed = 451 total (was 378) |
| +1 | mvnsite | 3m 26s | the patch passed |
| +1 | mvneclipse | 0m 42s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 2s | The patch has no ill-formed XML file. |
| 0 | findbugs | 0m 0s | Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn |
| +1 | findbugs | 1m 0s | the patch passed |
| -1 | javadoc | 1m 19s | hadoop-yarn-project_hadoop-yarn generated 4 new + 6597 unchanged - 0 fixed = 6601 total (was 6597) |
| -1 | javadoc | 0m 18s | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager generated 4 new + 989 unchanged - 0 fixed = 993 total (was 989) |
| -1 | unit | 73m 46s | hadoop-yarn in the patch failed. |
| -1 | unit | 50m 19s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 19s | The patch does not generate ASF License warnings. |
| | | 166m 39s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.nodemanager.TestDirectoryCollection |
| Timed out junit tests | org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesReservation |
|   | 

[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2016-07-22 Thread Chen Ge (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390139#comment-15390139
 ] 

Chen Ge commented on YARN-4091:
---

Thanks [~sunilg] for the comments and suggestions. Here are the corresponding 
changes and some further remarks.

For comment 1, I think multiple node heartbeats are never processed at the 
same time. They are handled sequentially, so {{startNodeUpdateRecording}} will 
not be entered by two node heartbeats concurrently. There is no need to 
synchronize it.

For comment 2, {{activeRecordedNodes}} and {{recordingNodesAllocation}} 
together ensure that we record a complete node update after a request. A node 
is added to {{activeRecordedNodes}} as soon as a user requests it. In 
{{startNodeUpdateRecording}}, if {{activeRecordedNodes}} contains the node, 
the node is put into {{recordingNodesAllocation}}. Without 
{{activeRecordedNodes}}, we might begin recording in the middle of a node 
heartbeat, so it is needed to defer recording until the next node heartbeat 
starts.
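
To make the two-step hand-off concrete, here is a simplified model of it. The field names and {{startNodeUpdateRecording}} come from the discussion above; everything else (the class name, the other method names, string-typed activities, removing the node from {{activeRecordedNodes}} once recording starts) is an assumption made for illustration and may differ from the patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified, single-threaded model of deferred node recording. */
final class NodeRecordingSketch {
  // Nodes a user asked to record; recording has not started for them yet.
  private final Set<String> activeRecordedNodes = new HashSet<>();
  // Nodes whose current heartbeat is being recorded, with their activities.
  private final Map<String, List<String>> recordingNodesAllocation =
      new HashMap<>();

  /** Hypothetical entry point for a user request. */
  void requestRecording(String nodeId) {
    activeRecordedNodes.add(nodeId);
  }

  /** Called at the beginning of a node heartbeat. */
  void startNodeUpdateRecording(String nodeId) {
    // Assumption: one request records exactly one heartbeat, so the
    // request is consumed here.
    if (activeRecordedNodes.remove(nodeId)) {
      recordingNodesAllocation.put(nodeId, new ArrayList<>());
    }
  }

  /** Called during a heartbeat; ignored unless recording has started. */
  void addActivity(String nodeId, String activity) {
    List<String> log = recordingNodesAllocation.get(nodeId);
    if (log != null) {
      log.add(activity);
    }
  }

  /** Called at the end of a heartbeat; returns what was recorded. */
  List<String> finishNodeUpdateRecording(String nodeId) {
    List<String> log = recordingNodesAllocation.remove(nodeId);
    return log == null ? new ArrayList<>() : log;
  }
}
```

A request that arrives in the middle of a heartbeat only lands in {{activeRecordedNodes}}, so the activities of that partial heartbeat are dropped and recording begins with the next call to {{startNodeUpdateRecording}}.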

We have addressed comments 3, 4, 5 and 7 as suggested.

For comment 6, we have added a new intermediate utility class called 
{{ActivitiesLogger}}. Its operations are grouped into three classes: APP, 
QUEUE and NODE. They handle "start", "add" and "finish" operations from the 
APP, QUEUE and NODE perspectives. CapacityScheduler, the queues and 
ContainerAllocator simply call the helper functions in {{ActivitiesLogger}}, 
and {{ActivitiesLogger}} invokes the specific operations in 
{{ActivitiesManager}}.
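
The shape of such a helper might look roughly like the sketch below. Only the idea of three nested groups (APP, QUEUE, NODE) delegating to the manager is taken from the description above; the method names, parameters and the string-based stand-in for the manager are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the manager: it only collects event strings. */
final class ActivitiesManagerSketch {
  final List<String> events = new ArrayList<>();

  void record(String event) {
    events.add(event);
  }
}

/**
 * Stateless helper with one nested class per perspective. Scheduler code
 * calls these helpers; only the helper talks to the manager.
 */
final class ActivitiesLoggerSketch {
  private ActivitiesLoggerSketch() { }

  // Tolerates a null manager so call sites need no extra checks.
  private static void emit(ActivitiesManagerSketch manager, String event) {
    if (manager != null) {
      manager.record(event);
    }
  }

  static final class NODE {
    static void startRecording(ActivitiesManagerSketch m, String nodeId) {
      emit(m, "NODE start " + nodeId);
    }

    static void finishRecording(ActivitiesManagerSketch m, String nodeId) {
      emit(m, "NODE finish " + nodeId);
    }
  }

  static final class QUEUE {
    static void addActivity(ActivitiesManagerSketch m, String queue,
        String state) {
      emit(m, "QUEUE add " + queue + " " + state);
    }
  }

  static final class APP {
    static void addActivity(ActivitiesManagerSketch m, String appId,
        String state) {
      emit(m, "APP add " + appId + " " + state);
    }
  }
}
```

With this layout, a call site in the scheduler reduces to a single line such as {{ActivitiesLoggerSketch.APP.addActivity(manager, appId, "REJECTED")}}.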

For comment 8, we have simplified the activities API. We removed the 
updateState operation and kept only startRecording, addActivity, 
finishNodeAllocation and finishRecording. We combined similar calls and 
trimmed the passed parameters to keep them as clean as possible.

As for the minor nits, we changed the function names as suggested.

Thanks again for the valuable comments.


[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2016-07-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390041#comment-15390041
 ] 

Hadoop QA commented on YARN-4091:
-

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 27s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
| +1 | mvninstall | 7m 9s | trunk passed |
| +1 | compile | 0m 35s | trunk passed |
| +1 | checkstyle | 0m 24s | trunk passed |
| +1 | mvnsite | 0m 40s | trunk passed |
| +1 | mvneclipse | 0m 17s | trunk passed |
| +1 | findbugs | 1m 0s | trunk passed |
| +1 | javadoc | 0m 22s | trunk passed |
| +1 | mvninstall | 0m 32s | the patch passed |
| +1 | compile | 0m 29s | the patch passed |
| -1 | javac | 0m 29s | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager generated 2 new + 3 unchanged - 0 fixed = 5 total (was 3) |
| -1 | checkstyle | 0m 23s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 109 new + 377 unchanged - 1 fixed = 486 total (was 378) |
| +1 | mvnsite | 0m 37s | the patch passed |
| +1 | mvneclipse | 0m 14s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| -1 | findbugs | 1m 5s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 5 new + 0 unchanged - 0 fixed = 5 total (was 0) |
| -1 | javadoc | 0m 20s | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager generated 48 new + 989 unchanged - 0 fixed = 1037 total (was 989) |
| -1 | unit | 53m 45s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 15s | The patch does not generate ASF License warnings. |
| | | 69m 18s | |

|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| | Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.activitiesManager; locked 63% of time. Unsynchronized access at AbstractYarnScheduler.java:[line 797] |
| | Unread public/protected field:At ActivityNodeInfo.java:[line 49] |
| | Unread public/protected field:At ActivityNodeInfo.java:[line 50] |
| | Unread public/protected field:At ActivityNodeInfo.java:[line 47] |
| | Unread public/protected field:At ActivityNodeInfo.java:[line 48] |
| Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestChildQueueOrder |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationLimits |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationLimitsByPartition |
|   | 

[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2016-07-18 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15382513#comment-15382513
 ] 

Sunil G commented on YARN-4091:
---

Thanks [~ChenGe] and [~leftnoteasy]. As discussed offline, we will now track 
only one heartbeat activity per request, so my points 1 and 2 are fine. I have 
also gone through the code a little.

*A few additional comments:*
1. I think it is better to synchronize {{startNodeUpdateRecording}}, as it can 
be invoked from multiple node heartbeats at the same time. Also, some 
variables, such as {{recordNextAvailableNode}}, are not atomic.
2. {{recordingNodesAllocation}} by itself might be enough to track the active 
nodes that are doing scheduling activity. Since it is a concurrent hash map, 
we can work with its key set. If so, we can avoid {{activeRecordedNodes}}.
3. If a node dies while we are recording its activity, the recorded data is 
never flushed out. A timer mechanism may be needed for that case.
4. I think all the new activity monitor classes can be placed in a new package, 
{{org/apache/hadoop/yarn/server/resourcemanager/scheduler/activitymonitor}}.
5. Could we place activityManager in {{CapacitySchedulerContext}}? Then we can 
use a getter method rather than changing every method signature in 
{{AbstractCSQueue}} etc.
6. I think we should not have activity start/stop/update code in various 
places in the scheduler code. In {{LeafQueue}} we now have separate methods 
such as recordActivity, finishAppAllocationRecording etc., and similarly in the 
allocate code. I think all such methods should live inside {{activityManager}}, 
and those APIs need to be public (with proper Javadoc). It would be better to 
have a clearer interface to {{activityManager}}. A new intermediate util/helper 
class that works as a wrapper around {{activityManager}} would also be fine. 
That way we can pull all this extra code out of the scheduler.
7. Instead of using {{Date}} in various places in the code, I think we can use 
{{SystemClock}} or {{MonotonicClock}}.
8. In a few places, I can see the code below.
{code}
@@ -92,9 +973,24 @@ public synchronized CSAssignment assignContainers(Resource 
clusterResource,
   application, node.getPartition(), currentResourceLimits)) {
 application.updateAMContainerDiagnostics(AMState.ACTIVATED,
 "User capacity has reached its maximum limit.");
+recordActivity(node, getQueueName(),
+application.getApplicationId().toString(),
+application.getPriority().toString(), ActivityState.REJECTED,
+ActivityDiagnosticConstant.USER_CAPACITY_MAXIMUM_LIMIT,
+AllocationActivityType.app);
+updateActivityState(node, ActivityState.SKIPPED,
+ActivityDiagnosticConstant.EMPTY);
+finishAppAllocationRecording(application.getApplicationId(),
+ActivityState.REJECTED);
 continue;
   }
{code}
This is essentially an error-handling path, yet it takes three steps to record 
the activity. I think we should optimize this into a single API call to 
activityManager. If multiple apps/nodes are interested, {{activityManager}} 
should fan the call out itself. From the scheduler's side, it is better to keep 
this simple and clean.
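
As a rough illustration of what "one API call" could mean for the snippet above, the three steps can be folded behind a single method. This is only a sketch: the class name, the method name, the State enum and the string-based logs are assumptions, not the API in the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical recorder exposing one public call in place of three. */
final class SingleCallRecorderSketch {
  enum State { ACCEPTED, REJECTED, SKIPPED }

  // Stand-ins for the per-node and per-application records.
  final List<String> nodeLog = new ArrayList<>();
  final List<String> appLog = new ArrayList<>();

  /** The only method scheduler code calls on the "app rejected" path. */
  void recordRejectedApp(String nodeId, String queue, String appId,
      String diagnostic) {
    // Step 1: note the rejected application under the node being scheduled.
    nodeLog.add(nodeId + " " + queue + " " + appId + " " + State.REJECTED
        + " " + diagnostic);
    // Step 2: mark the node-level attempt as skipped.
    nodeLog.add(nodeId + " " + State.SKIPPED);
    // Step 3: close the per-application record.
    appLog.add(appId + " " + State.REJECTED);
  }
}
```

The call site in {{assignContainers}} would then shrink to a single call before {{continue}}, and any later change to how node and app records are kept stays inside the recorder.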


*Minor nits:*

turnOffAppUpdate -> turnOffActivityMonitoringForApp
addActivity -> addSchedulingActivityForNode
addAppActivity -> addSchedulingActivityForApp

I will try to look further into the scheduler code where we record activities 
and will share comments, if any. Thank you.


[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2016-07-13 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375131#comment-15375131
 ] 

Sunil G commented on YARN-4091:
---

Thanks [~ChenGe] for the patch and the detailed doc.

A few initial comments; I will share more feedback soon.

*REST API comments:*
1. For a REST query ending with {{activities?nodeId=node-87}}, I think it may 
scan all nodes on that host if multiple NMs are running on the same host. 
Correct?
2. If we support the above option, could we pass node names in comma-separated 
form to {{nodeId}}, like {{activities?nodeId=node-87,node-88}}? Maybe we can 
define a limit on the number of node managers per query, since the response 
output also needs to be simple to understand.
3. For {{app-activities?appId=application_1468198570845_0022}}, I think the 
output is different from the node case? Could you also please attach the REST 
output for the app and node scenarios?
4. Sometimes we may look for relaxed scheduling by considering missed 
opportunities. So one round of nodes has to go through heartbeats to get an 
allocation in a few cases (rack-local, default partition from a shared label, 
etc.). It would be better to add an option such as: collect scheduler activity 
for an app until the missed-opportunity count is 0. Thoughts?
5. 

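For anyone trying the two queries mentioned above, a small sketch of building the request URIs follows. Only the {{activities?nodeId=...}} and {{app-activities?appId=...}} suffixes come from this thread; the base address is a caller-supplied placeholder, the class and method names are made up, and {{URLEncoder.encode(String, Charset)}} assumes Java 10 or later.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds URIs for the two activity queries (illustrative only). */
final class ActivityQuerySketch {
  private ActivityQuerySketch() { }

  /** URI for the activities of one node heartbeat. */
  static URI nodeActivities(String base, String nodeId) {
    return URI.create(base + "/activities?nodeId=" + encode(nodeId));
  }

  /** URI for the activities of one application. */
  static URI appActivities(String base, String appId) {
    return URI.create(base + "/app-activities?appId=" + encode(appId));
  }

  // Percent-encodes a query value so unusual characters stay valid.
  private static String encode(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // The base below is a placeholder; use the real RM web service prefix.
    String base = "http://rm.example.com:8088/prefix";
    System.out.println(nodeActivities(base, "node-87"));
    System.out.println(appActivities(base, "application_1468198570845_0022"));
  }
}
```
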

*General comments:*
1. ActivityManager is a class that holds all the information about tracked 
scheduling activities. Over time, I think we may need to handle cases like 
cleanup of outstanding requests and internal aggregation to compact and 
re-order collected data across heartbeats. For all these cases, I think it is 
better to make ActivityManager an extended service of the scheduler, so that it 
can start a thread associated with the service to do all monitoring and 
cleanup. This is just a thought; please feel free to share your opinion, as it 
is a good-to-have option.
2. I am in favor of the current direct, simple calls to start/update/stop 
scheduling activity. But would it be better to define a read-write interface 
and clearly specify who reads the data and who can write to the activity 
manager? On second thought, could we raise events to ActivityManager from the 
scheduler and make writes asynchronous? It may become clearer and simpler. 
Thoughts?
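
To sketch the first point, a manager that runs as a service could own a background task that drops outstanding recording requests after a timeout. Everything below is hypothetical (names, the timeout policy, plain {{java.util.concurrent}} instead of the YARN service classes); it only illustrates the cleanup idea.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Tracks outstanding recording requests and expires stale ones. */
final class ActivityCleanupSketch {
  // nodeId -> time in ms (monotonic) at which recording was requested.
  private final Map<String, Long> requestedAtMs = new ConcurrentHashMap<>();
  private final long timeoutMs;
  private ScheduledExecutorService cleaner;

  ActivityCleanupSketch(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  /** Monotonic time source, unaffected by wall-clock adjustments. */
  static long monotonicNowMs() {
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
  }

  void request(String nodeId, long nowMs) {
    requestedAtMs.put(nodeId, nowMs);
  }

  boolean isPending(String nodeId) {
    return requestedAtMs.containsKey(nodeId);
  }

  /** Removes requests older than the timeout; returns how many expired. */
  int purge(long nowMs) {
    int removed = 0;
    Iterator<Map.Entry<String, Long>> it =
        requestedAtMs.entrySet().iterator();
    while (it.hasNext()) {
      if (nowMs - it.next().getValue() > timeoutMs) {
        it.remove();
        removed++;
      }
    }
    return removed;
  }

  /** Starts the periodic cleanup on a daemon thread. */
  synchronized void start(long periodMs) {
    cleaner = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "activity-cleanup");
      t.setDaemon(true);
      return t;
    });
    cleaner.scheduleAtFixedRate(() -> purge(monotonicNowMs()),
        periodMs, periodMs, TimeUnit.MILLISECONDS);
  }

  synchronized void stop() {
    if (cleaner != null) {
      cleaner.shutdownNow();
      cleaner = null;
    }
  }
}
```

Passing the current time into {{request}} and {{purge}} keeps the expiry logic easy to test without waiting on the background thread; callers that use {{start}} should pass {{monotonicNowMs()}} to {{request}} so both sides use the same clock.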



[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2016-07-12 Thread Chen Ge (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374343#comment-15374343
 ] 

Chen Ge commented on YARN-4091:
---

Hi all,

For the "YARN-4091.preliminary.1.patch" I uploaded above, here are brief 
descriptions of the newly added classes and the REST API for testing.

Newly Added Classes:
ActivityManager:
A class to store node or application allocations. It mainly contains 
operations for allocation start, add, update and finish.

NodeAllocation:
It contains allocation information for one allocation in a node 
heartbeat. Detailed allocation activities are first stored in 
"AllocationActivity" as operations, then transformed to a tree structure. Tree 
structure starts from root queue and ends in leaf queue, application or 
container allocation.

AllocationActivity:
It records an activity operation in an allocation, which can be classified 
as a queue, application or container activity. Other information includes state, 
diagnostic and priority.

ActivityNode:
It represents a tree node in the "NodeAllocation" tree structure. Each node 
may represent a queue, application or container in an allocation activity. A node 
may have child nodes if it was successfully allocated to the next level.

ActivityDiagnosticConstant:
Collection of diagnostics.

ActivityState:
Collection of activity operation states.

AllocationState:
Collection of allocation final states.

AllocationActivityType:
Collection of types for activity operation.

AppAllocation:
It contains allocation information for one application within a period 
of time. Each application allocation may have several allocation attempts.

ActivitiesInfo:
DAO object to display node allocation activity.

NodeAllocationInfo:
DAO object to display each node allocation in node heartbeat.

ActivityNodeInfo:
DAO object to display node information in allocation tree. It 
corresponds to "ActivityNode" class.

AppActivitiesInfo:
DAO object to display application activity.

AppAllocationInfo:
DAO object to display application allocation detailed information.
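
As an illustration of the "operations first, tree afterwards" step described for NodeAllocation, here is a minimal sketch. `RecordedActivity`, `TreeNode` and `ActivityTreeBuilder` are hypothetical names, and their fields and state strings are assumptions made for the example rather than what the patch contains.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// One recorded operation: "name was visited under parent, with this state".
// Field names are illustrative, not the patch's actual fields.
final class RecordedActivity {
  final String parent;  // null for the root queue
  final String name;
  final String state;

  RecordedActivity(String parent, String name, String state) {
    this.parent = parent;
    this.name = name;
    this.state = state;
  }
}

// Tree node for a queue, application or container.
final class TreeNode {
  final String name;
  final String state;
  final List<TreeNode> children = new ArrayList<>();

  TreeNode(String name, String state) {
    this.name = name;
    this.state = state;
  }
}

final class ActivityTreeBuilder {
  // Builds the tree in one pass. Assumes a parent is recorded before its
  // children and that names are unique within one allocation.
  static TreeNode build(List<RecordedActivity> activities) {
    Map<String, TreeNode> byName = new LinkedHashMap<>();
    TreeNode root = null;
    for (RecordedActivity a : activities) {
      TreeNode node = new TreeNode(a.name, a.state);
      byName.put(a.name, node);
      if (a.parent == null) {
        root = node;
      } else {
        byName.get(a.parent).children.add(node);
      }
    }
    return root;
  }
}
```

Recording flat operations keeps the work done inside the scheduling path small; the tree is only assembled when somebody asks for the result.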


Test REST API:
Look at the next node's activities (default):
http://localhost:18088/ws/v1/cluster/scheduler/activities

Look at a specific node only:

http://localhost:18088/ws/v1/cluster/scheduler/activities?nodeId=node-87:75
or, without the port number:
http://localhost:18088/ws/v1/cluster/scheduler/activities?nodeId=node-87

Look at activities for a specific application within a period of time (3s 
by default):

http://localhost:18088/ws/v1/cluster/scheduler/app-activities?appId=application_1468198570845_0022

http://localhost:18088/ws/v1/cluster/scheduler/app-activities?appId=application_1468198570845_0022=5.2
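
For scripting these test calls, a small helper that only assembles the URLs listed above may be handy. This is a sketch: `ActivitiesUrls` is a hypothetical helper name, the host and port are the test values from this comment, and the time-window parameter is left out on purpose.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper that assembles the test URLs from this comment.
final class ActivitiesUrls {
  private static final String BASE_PATH = "/ws/v1/cluster/scheduler";

  // nodeId may be null, which means "the next node's activities".
  static URI nodeActivities(String host, int port, String nodeId) {
    String query = (nodeId == null) ? null : "nodeId=" + nodeId;
    return build(host, port, BASE_PATH + "/activities", query);
  }

  static URI appActivities(String host, int port, String appId) {
    return build(host, port, BASE_PATH + "/app-activities", "appId=" + appId);
  }

  private static URI build(String host, int port, String path, String query) {
    try {
      // The multi-argument constructor quotes characters that are illegal in a URI.
      return new URI("http", null, host, port, path, query, null);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
  }
}
```

For example, `ActivitiesUrls.nodeActivities("localhost", 18088, "node-87:75")` yields the second URL in the list above.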


Test class:
TestRMWebServicesCapacitySched.java

org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched#testActivityJSON

org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched#testAppActivityJSON

Thanks for reviewing. Please feel free to suggest any 
improvements.




[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2016-06-13 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328459#comment-15328459
 ] 

Wangda Tan commented on YARN-4091:
--

Folks,

Thanks for the earlier design doc from [~sunilg], [~rohithsharma] and [~nijel], 
and also for the POC code from Sunil. We plan to push this forward; if everybody 
agrees, [~ChenGe] will take over and go ahead to finish this feature.

[~ChenGe] and I have been working on the design doc recently; it is attached for review.

Please let me know your thoughts.

Thanks,

> Improvement: Introduce more debug/diagnostics information to detail out 
> scheduler activity
> --
>
> Key: YARN-4091
> URL: https://issues.apache.org/jira/browse/YARN-4091
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: Improvement on debugdiagnostic information - YARN.pdf
>
>
> As schedulers are improved with various new capabilities, more configurations 
> which tunes the schedulers starts to take actions such as limit assigning 
> containers to an application, or introduce delay to allocate container etc. 
> There are no clear information passed down from scheduler to outerworld under 
> these various scenarios. This makes debugging very tougher.
> This ticket is an effort to introduce more defined states on various parts in 
> scheduler where it skips/rejects container assignment, activate application 
> etc. Such information will help user to know whats happening in scheduler.
> Attaching a short proposal for initial discussion. We would like to improve 
> on this as we discuss.






[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-11-03 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988740#comment-14988740
 ] 

Naganarasimha G R commented on YARN-4091:
-

Hi [~sunilg], I think we need to update the document, as there has been a 
significant change in approach. Also, we can create 2 sub-JIRAs for the 
REST-based tracing of a single node update call: one for whatever you are 
working on for CS, and one for a similar implementation on the FairScheduler 
side, which I would like to start working on. OK?



[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-11-03 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988765#comment-14988765
 ] 

Sunil G commented on YARN-4091:
---

Hi [~Naganarasimha Garla]
Thank you for the comment. I will update the doc with the changed approach. I 
am also almost ready with a CS patch with a basic REST implementation and will 
put up a patch here soon. I think the same REST sub-JIRA will hold good for both 
schedulers, but we do indeed need an implementation of this approach in Fair too. 



[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-09-25 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908935#comment-14908935
 ] 

Wangda Tan commented on YARN-4091:
--

[~sunilg], sure :), cannot wait to see it!



[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-09-25 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908065#comment-14908065
 ] 

Sunil G commented on YARN-4091:
---

No problem. Thank you [~leftnoteasy] for sharing your thoughts.
bq. I think we can store only the next allocation data once a request is received,
Perfect. This will also help put less load on memory.

As we mostly agree on the approach, I feel I could come up with a 
prototype here. Is that OK, and good enough to start with?



[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-09-24 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907113#comment-14907113
 ] 

Wangda Tan commented on YARN-4091:
--

[~sunilg], 
For some reason, my reply to this JIRA did not show up here; sorry for the 
delay :(.

bq. Or we can dump this information as logs.
I would prefer to keep the structured message.

bq. I feel getting the information back as REST output is better, and we can 
use this framework in the new UI.
Totally agree.

bq. Hence the timing of the second REST query is important, as the intended 
node heartbeat has to happen first (or, by the time the query comes, more 
heartbeats from the same node may have come)
I think we can store only *the next* allocation data once a request is received, 
and if another request comes in before the data has been fetched, YARN will 
discard the old one.
I don't think we have to keep allocations up to date; storing history data in 
memory does not seem like a good idea to me.
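
The "store only the next allocation, discard the old one" behaviour can be sketched as a single slot. This is one reading of the idea under stated assumptions, not the eventual implementation: `NextAllocationSlot` is a hypothetical class, and the allocation record is simplified to a string.

```java
// Hypothetical single-slot store: at most one allocation record is kept in memory.
final class NextAllocationSlot {
  private boolean armed;   // true after a request, until the next allocation is recorded
  private String record;   // the one unfetched record, or null

  // A new request discards any unfetched record and arms the slot again.
  synchronized void request() {
    record = null;
    armed = true;
  }

  // Called by the scheduler for every allocation; only the first one
  // after a request is kept, everything else is dropped immediately.
  synchronized void onAllocation(String allocation) {
    if (armed) {
      record = allocation;
      armed = false;
    }
  }

  // Returns the stored record once, or null if nothing is available yet.
  synchronized String fetch() {
    String result = record;
    record = null;
    return result;
  }
}
```

Memory use stays bounded by one record no matter how many heartbeats arrive between the request and the fetch.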



[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-09-10 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739121#comment-14739121
 ] 

Sunil G commented on YARN-4091:
---

Thank you [~leftnoteasy] for sharing your thoughts.

Yes, the REST framework looks fine. But after the first response comes back as 
"pending fetching", a second REST query has to be made to see the real result. 
Or we can dump this information as logs. I feel getting the information back as 
REST output is better, and we can use this framework in the new UI. Hence the 
timing of the second REST query is important, as the intended node heartbeat 
has to happen first (or, by the time the query comes, more heartbeats from the 
same node may have come). Showing aggregate debug information until the second 
query is good, but I worry about the load on the RM and the amount of data 
produced. A time limit (or a minimum count of heartbeats to debug) can help in 
this case. Thoughts?



[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-09-09 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737291#comment-14737291
 ] 

Sunil G commented on YARN-4091:
---

Thank you [~leftnoteasy]
I understand the scenarios you mentioned. Yes, such cases are not handled in 
the earlier design. As you suggested, if we keep hierarchically structured debug 
information starting from a heartbeat, and also keep the assignment order 
per application, we can get this information too.

However, my concern is that we cannot do this for each heartbeat. If we want 
to capture a specific heartbeat for a specific node, we need input from 
outside, such as a command or a REST query.

So I feel we can have a generalized REST query that takes an application, 
queue or node as input. For that moment, the scheduler can collect debug 
information in a human-readable format, which will satisfy all cases. Thoughts?



[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-09-09 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737337#comment-14737337
 ] 

Wangda Tan commented on YARN-4091:
--

[~sunilg],

bq. However, my concern is that we cannot do this for each heartbeat. If we want 
to capture a specific heartbeat for a specific node, we need input from 
outside, such as a command or a REST query.

That is what I meant! We will do such debug logging entirely on demand. In my 
mind, the REST API looks like:
- Request: contains nodeId as a parameter.
- Response: "pending fetching" when the request is accepted. After the requested 
nodeId has finished its heartbeat, it contains all the debug information.

I feel we may not need queue/application as input: we can be sure a node sends 
a heartbeat every few seconds, but we don't know whether a queue/app will 
be accessed. We can highlight the specified queue/application in the web UI.
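
The request/response flow above could be sketched as follows. This is a rough illustration only: `OnDemandNodeTrace` and its method names are hypothetical, and the debug information is simplified to a string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical on-demand tracer keyed by nodeId.
final class OnDemandNodeTrace {
  static final String PENDING = "pending fetching";

  // nodeId -> PENDING while waiting for a heartbeat, or the captured debug info.
  private final Map<String, String> traces = new HashMap<>();

  // REST side: the first query for a node registers interest; a later query
  // returns the captured info once and then forgets it.
  synchronized String query(String nodeId) {
    String current = traces.get(nodeId);
    if (current == null) {
      traces.put(nodeId, PENDING);
      return PENDING;
    }
    if (!PENDING.equals(current)) {
      traces.remove(nodeId);
    }
    return current;
  }

  // Scheduler side: called after a node heartbeat has been processed.
  // Heartbeats of nodes nobody asked about are ignored, so nothing is recorded for them.
  synchronized void onHeartbeatFinished(String nodeId, String debugInfo) {
    if (PENDING.equals(traces.get(nodeId))) {
      traces.put(nodeId, debugInfo);
    }
  }
}
```

Because recording only happens for nodes with a pending request, the cost for every other heartbeat is a single map lookup.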



[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-09-08 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735266#comment-14735266
 ] 

Wangda Tan commented on YARN-4091:
--

Thanks [~sunilg].

I can understand why you have this proposal, but I'm not sure your approach 
works in the following scenarios. I feel that getting an overall state of an app 
and a last-container-assignment state may not work well for them:

- App wants only a small proportion of a cluster (such as hard locality)
- Similar to the above, app wants to run on a specific partition only
- App's leaf queue or parent queue is beyond its limit
- App asks for mappers in one partition (A) and reducers in another partition 
(B), when A has little available resource and B has more. The user wants to see 
why mapper allocation is slow.

Also, we cannot get the order of allocation with your approach, which is an 
important thing to look at when we enable fairness/priority scheduling for apps.



[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-09-05 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731875#comment-14731875
 ] 

Sunil G commented on YARN-4091:
---

Thank you [~leftnoteasy] for the detailed information shared. Based on your 
input, and after syncing with [~rohithsharma] and [~nijel] offline, I am trying 
to summarize a viewpoint for this. The REST response in the example contains 
only very raw information for now; we'll add detailed information later.

*Adding more diagnostics and debug information to the scheduler will give the 
user two levels of knowledge. If we fetch this information with 2 REST API 
calls, the specific reason for a potential problem in the scheduler can be 
identified and action can be taken*

*1*. What happened to an application recently in the scheduler (like status 
from node heartbeats)

*Example*:
- The application might not have got the containers it asked for
  Reason: The user limit for the application has been reached
- The application might still be in pending state, yet to become active.
  Reason: The AM resource limit is exhausted, hence the app can't be made active

*Benefit for user with this info*:  
   The user will get to know the clear problem area to look at, along with the 
potential reason for it.
*How User can get this info*:
  Via the REST API, debug/diagnostic information can be fetched for a 
queue/application.
*Expected O/P*:
{noformat}
 queue - a:
  application : app1
  appState : RUNNING
  reasonPhrase : NA
  lastContainerAssignmentState : SKIPPED_ASSIGNMENT
  reasonPhrase : Userlimit quota is reached
  application : app2
  appState : ACCEPTED
  reasonPhrase : AM resource limit exhausted
{noformat}
   
 *2*. Data/metrics information from the scheduler that is particular to the 
problem identified in 1.

*Example*:
- The user can fetch metrics via REST such as the current queue 
capacity, the user limit configured, the user limit calculated within the 
scheduler etc.
- The user can fetch metrics via REST such as queue capacity, AM 
resource % configured, AM resource % calculated within the RM, current demand etc.

This two-level information will help the user take the correct measure in the 
cluster to fix the problem, such as increasing the priority of the app, changing 
the queue of an application, killing some containers on a node manually, or 
some auto-tuning from the AM.



[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-09-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731679#comment-14731679
 ] 

Wangda Tan commented on YARN-4091:
--

Thanks to the folks working on this design: 
[~sunilg]/[~zhiguohong]/[~rohithsharma]/[~nijel]/[~Naganarasimha]!

I took a look at the design doc, and I have also been thinking about this recently:

*Some general issues we need to think about before going too far:*
1) Since we can have thousands of node heartbeats per sec, and there can be 
thousands of applications running concurrently in a cluster, we must consider 
the overhead of recording all of this.
2) Do we really need to record this per container?
3) How can YARN show this to the customer (especially the admin)?

*From my experience, the questions when troubleshooting a resource allocation issue are:*
1) Why do I have available resources in NMs that my application cannot 
leverage? 
2) Why are resources allocated to another app (queue/user) instead of me?

My typical approach to looking at these issues is:
1) Enable debug logging of the scheduler
2) Grep for a host_name (one that the customer says has available resources) and 
see what happened within one node heartbeat.

So for this feature to be useful to me:
1) It should be able to capture the information of one node heartbeat
2) The captured information should have a hierarchy
3) It may look like
{code}
heartbeat
    goto queue - a
        goto queue - a.a1
            goto app_1
                goto app_1.priority
                    goto app_1.priority.resource_request
                        check - queue capacity (passed)
                        check - user limit (passed)
                        check - node locality failed
            goto app_1 ..
    goto queue - b
{code}

In other words, it's a human-readable version of the DEBUG log for a single node heartbeat.
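
A hierarchical trace like the one above can be produced from a simple tree of steps. The following is only a sketch of the rendering part, with a hypothetical `TraceStep` class; how the scheduler would populate it is left open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical trace step: a label plus nested steps.
final class TraceStep {
  final String label;
  final List<TraceStep> children = new ArrayList<>();

  TraceStep(String label) {
    this.label = label;
  }

  // Adds a nested step and returns it so that deeper levels can be chained.
  TraceStep add(String childLabel) {
    TraceStep child = new TraceStep(childLabel);
    children.add(child);
    return child;
  }

  // Renders this step and its descendants, two spaces of indentation per level.
  String render() {
    StringBuilder out = new StringBuilder();
    render(0, out);
    return out.toString();
  }

  private void render(int depth, StringBuilder out) {
    for (int i = 0; i < depth; i++) {
      out.append("  ");
    }
    out.append(label).append('\n');
    for (TraceStep child : children) {
      child.render(depth + 1, out);
    }
  }
}
```

The same tree could equally be serialized as JSON for a REST response; the indentation-based text form is just the closest match to reading a debug log.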

And I think admins can benefit from this as well.

Another point is that we don't need to do this for every node heartbeat; doing 
it on demand for one single node heartbeat should be enough for most cases. 
The admin should know which node to look at.

*Some rough ideas about what the REST API could look like:*
REST Response:
- "What happened" (such as skip-because-of-locality / 
node-partition-not-matched, etc., AND status such as usedCapacity, etc.) and 
"Who" (queue/user/app)
- Parent event (we may need a hierarchy of these events)

REST Request:
- It seems that sending a nodeId to look at should be enough for now.

This could be an async API: the client requests the next allocation report of a 
given NodeId, and the scheduler responds with the report when it becomes ready.
The internal API could take HTrace as a reference; I'm not sure if we can 
directly leverage HTrace to do such logging. I like the basic API design of 
HTrace, but we may not need complexity like Sampler/Storage, etc.

Thoughts?


[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-09-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731689#comment-14731689
 ] 

Wangda Tan commented on YARN-4091:
--

I found that my suggestion above is quite similar to YARN-4104; we're thinking 
along very similar lines, [~zhiguohong]! :) 
Instead of a dry run, I'd like to get real data on demand. And we need a 
hierarchy for this data as well.


[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-08-28 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720243#comment-14720243
 ] 

Sunil G commented on YARN-4091:
---

Thank you [~Naganarasimha] for linking this issue. Yes, that one will be a 
subset of this issue.


[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-08-27 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717880#comment-14717880
 ] 

Naganarasimha G R commented on YARN-4091:
-

It seems the goal of YARN-3946 is a subset of this JIRA.
