[jira] [Updated] (YARN-4236) Metric for aggregated resources allocation per queue

2017-03-07 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4236:
---
Attachment: YARN-4236-3.patch

updated patch :)

> Metric for aggregated resources allocation per queue
> 
>
> Key: YARN-4236
> URL: https://issues.apache.org/jira/browse/YARN-4236
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, scheduler
>Reporter: Chang Li
>Assignee: Chang Li
>  Labels: oct16-medium
> Attachments: YARN-4236.2.patch, YARN-4236-3.patch, YARN-4236.patch
>
>
> We currently track allocated memory and allocated vcores per queue, but we 
> don't have a good rate metric for how fast we're allocating those resources. In 
> other words, a flat line in allocatedMB could equally mean one extreme where no 
> new containers are being allocated, or the other where many containers are 
> allocated but we free exactly as much as we allocate each time. Adding a 
> resources-allocated-per-second metric per queue would give us better insight 
> into the rate of resource churn on a queue. Based on this aggregated resource 
> allocation per queue we can easily build tools to measure the rate of resource 
> allocation per queue.
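
The idea can be sketched as a monotonically increasing per-queue counter that external tooling can sample and difference to obtain a rate. A minimal illustration, assuming hypothetical names and an update path that may differ from the actual patch:

{code}
// Hypothetical sketch: accumulate resource-seconds per queue so an external
// tool can compute an allocation rate by differencing two samples.
public class AggregatedAllocationSketch {
  private long aggregateMemoryMBSeconds = 0;
  private long aggregateVcoreSeconds = 0;

  // Called periodically (e.g. once per second) with the queue's currently
  // allocated resources. A flat allocatedMB no longer hides churn, because
  // the aggregate keeps growing as long as resources stay allocated.
  public synchronized void update(long allocatedMB, long allocatedVcores,
      long intervalSeconds) {
    aggregateMemoryMBSeconds += allocatedMB * intervalSeconds;
    aggregateVcoreSeconds += allocatedVcores * intervalSeconds;
  }

  public synchronized long getAggregateMemoryMBSeconds() {
    return aggregateMemoryMBSeconds;
  }

  public synchronized long getAggregateVcoreSeconds() {
    return aggregateVcoreSeconds;
  }
}
{code}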






[jira] [Commented] (YARN-4236) Metric for aggregated resources allocation per queue

2017-03-07 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900663#comment-15900663
 ] 

Chang Li commented on YARN-4236:


Hey [~ebadger], I am interested in updating this patch, but I will probably need 
to wait until the weekend to work on it. Hope that's OK.

> Metric for aggregated resources allocation per queue
> 
>
> Key: YARN-4236
> URL: https://issues.apache.org/jira/browse/YARN-4236
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, scheduler
>Reporter: Chang Li
>Assignee: Chang Li
>  Labels: oct16-medium
> Attachments: YARN-4236.2.patch, YARN-4236.patch
>
>
> We currently track allocated memory and allocated vcores per queue, but we 
> don't have a good rate metric for how fast we're allocating those resources. In 
> other words, a flat line in allocatedMB could equally mean one extreme where no 
> new containers are being allocated, or the other where many containers are 
> allocated but we free exactly as much as we allocate each time. Adding a 
> resources-allocated-per-second metric per queue would give us better insight 
> into the rate of resource churn on a queue. Based on this aggregated resource 
> allocation per queue we can easily build tools to measure the rate of resource 
> allocation per queue.






[jira] [Commented] (YARN-5834) TestNodeStatusUpdater.testNMRMConnectionConf compares nodemanager wait time to the incorrect value

2016-11-10 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653391#comment-15653391
 ] 

Chang Li commented on YARN-5834:


Thanks for reporting. Yes, it was meant to be nmRmConnectionWaitMs. Providing a 
branch-2 patch, since this test does not exist in trunk.

> TestNodeStatusUpdater.testNMRMConnectionConf compares nodemanager wait time 
> to the incorrect value
> --
>
> Key: YARN-5834
> URL: https://issues.apache.org/jira/browse/YARN-5834
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Miklos Szegedi
>Assignee: Chang Li
>Priority: Minor
> Attachments: YARN-5834-branch-2.001.patch
>
>
> The function is TestNodeStatusUpdater#testNMRMConnectionConf()
> I believe the connectionWaitMs references below were meant to be 
> nmRmConnectionWaitMs.
> {code}
> conf.setLong(YarnConfiguration.NM_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS,
> nmRmConnectionWaitMs);
> conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS,
> connectionWaitMs);
> ...
>   long t = System.currentTimeMillis();
>   long duration = t - waitStartTime;
>   boolean waitTimeValid = (duration >= nmRmConnectionWaitMs) &&
>   (duration < (*connectionWaitMs* + delta));
>   if(!waitTimeValid) {
> // throw exception if NM doesn't retry long enough
> throw new Exception("NM should have tried re-connecting to RM during 
> " +
>   "period of at least " + *connectionWaitMs* + " ms, but " +
>   "stopped retrying within " + (*connectionWaitMs* + delta) +
>   " ms: " + e, e);
>   }
> {code}
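
For reference, a sketch of what the corrected check would presumably look like inside the same test method, using the same local variables as the snippet above (the actual patch contents may differ):

{code}
// Both bounds should be derived from nmRmConnectionWaitMs, the value the NM
// is actually configured with in this test.
long duration = System.currentTimeMillis() - waitStartTime;
boolean waitTimeValid = (duration >= nmRmConnectionWaitMs) &&
    (duration < (nmRmConnectionWaitMs + delta));
if (!waitTimeValid) {
  // throw exception if NM doesn't retry long enough
  throw new Exception("NM should have tried re-connecting to RM during " +
      "period of at least " + nmRmConnectionWaitMs + " ms, but " +
      "stopped retrying within " + (nmRmConnectionWaitMs + delta) +
      " ms: " + e, e);
}
{code}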






[jira] [Updated] (YARN-5834) TestNodeStatusUpdater.testNMRMConnectionConf compares nodemanager wait time to the incorrect value

2016-11-10 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-5834:
---
Attachment: YARN-5834-branch-2.001.patch

> TestNodeStatusUpdater.testNMRMConnectionConf compares nodemanager wait time 
> to the incorrect value
> --
>
> Key: YARN-5834
> URL: https://issues.apache.org/jira/browse/YARN-5834
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Miklos Szegedi
>Assignee: Chang Li
>Priority: Minor
> Attachments: YARN-5834-branch-2.001.patch
>
>
> The function is TestNodeStatusUpdater#testNMRMConnectionConf()
> I believe the connectionWaitMs references below were meant to be 
> nmRmConnectionWaitMs.
> {code}
> conf.setLong(YarnConfiguration.NM_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS,
> nmRmConnectionWaitMs);
> conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS,
> connectionWaitMs);
> ...
>   long t = System.currentTimeMillis();
>   long duration = t - waitStartTime;
>   boolean waitTimeValid = (duration >= nmRmConnectionWaitMs) &&
>   (duration < (*connectionWaitMs* + delta));
>   if(!waitTimeValid) {
> // throw exception if NM doesn't retry long enough
> throw new Exception("NM should have tried re-connecting to RM during 
> " +
>   "period of at least " + *connectionWaitMs* + " ms, but " +
>   "stopped retrying within " + (*connectionWaitMs* + delta) +
>   " ms: " + e, e);
>   }
> {code}






[jira] [Assigned] (YARN-5834) TestNodeStatusUpdater.testNMRMConnectionConf compares nodemanager wait time to the incorrect value

2016-11-10 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li reassigned YARN-5834:
--

Assignee: Chang Li

> TestNodeStatusUpdater.testNMRMConnectionConf compares nodemanager wait time 
> to the incorrect value
> --
>
> Key: YARN-5834
> URL: https://issues.apache.org/jira/browse/YARN-5834
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Miklos Szegedi
>Assignee: Chang Li
>Priority: Minor
> Attachments: YARN-5834-branch-2.001.patch
>
>
> The function is TestNodeStatusUpdater#testNMRMConnectionConf()
> I believe the connectionWaitMs references below were meant to be 
> nmRmConnectionWaitMs.
> {code}
> conf.setLong(YarnConfiguration.NM_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS,
> nmRmConnectionWaitMs);
> conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS,
> connectionWaitMs);
> ...
>   long t = System.currentTimeMillis();
>   long duration = t - waitStartTime;
>   boolean waitTimeValid = (duration >= nmRmConnectionWaitMs) &&
>   (duration < (*connectionWaitMs* + delta));
>   if(!waitTimeValid) {
> // throw exception if NM doesn't retry long enough
> throw new Exception("NM should have tried re-connecting to RM during 
> " +
>   "period of at least " + *connectionWaitMs* + " ms, but " +
>   "stopped retrying within " + (*connectionWaitMs* + delta) +
>   " ms: " + e, e);
>   }
> {code}






[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2016-11-09 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218-branch-2.003.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218-branch-2.003.patch, YARN-4218.006.patch, 
> YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.3.patch, YARN-4218.4.patch, YARN-4218.5.patch, 
> YARN-4218.branch-2.2.patch, YARN-4218.branch-2.patch, YARN-4218.patch, 
> YARN-4218.trunk.2.patch, YARN-4218.trunk.3.patch, YARN-4218.trunk.patch, 
> YARN-4218.wip.patch, screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.
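
A toy calculation shows why the container count alone is misleading; it assumes the metric is accumulated as memory-MB-seconds of preempted containers, with purely illustrative numbers:

{code}
// Job A: 100 small containers (1024 MB each) preempted after running 60 s each.
long jobAMemoryMBSeconds = 100L * 1024 * 60;        // 6,144,000 MB-seconds
// Job B: one large container (16384 MB) preempted after running for 3 days.
long jobBMemoryMBSeconds = 16384L * 3 * 24 * 3600;  // 4,246,732,800 MB-seconds
// Job B lost roughly 700x more resource*time despite losing only one container.
{code}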






[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2016-11-09 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218.006.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.006.patch, YARN-4218.2.patch, 
> YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.3.patch, 
> YARN-4218.4.patch, YARN-4218.5.patch, YARN-4218.branch-2.2.patch, 
> YARN-4218.branch-2.patch, YARN-4218.patch, YARN-4218.trunk.2.patch, 
> YARN-4218.trunk.3.patch, YARN-4218.trunk.patch, YARN-4218.wip.patch, 
> screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.






[jira] [Commented] (YARN-4218) Metric for resource*time that was preempted

2016-10-31 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624458#comment-15624458
 ] 

Chang Li commented on YARN-4218:


[~eepayne] Hmm, there are many javadoc errors on the protocol interfaces about 
missing descriptions of why YarnException and IOException are thrown. My changes 
never touch those protocols, so I am not sure why those errors are reported 
against my patch. Filling in those thousands of missing javadoc descriptions for 
exceptions and parameters is probably worth its own jira. Also, since I did not 
implement those protocols, it is hard for me to write correct descriptions for 
them.

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.2.patch, YARN-4218.3.patch, YARN-4218.4.patch, YARN-4218.5.patch, 
> YARN-4218.branch-2.2.patch, YARN-4218.branch-2.patch, YARN-4218.patch, 
> YARN-4218.trunk.2.patch, YARN-4218.trunk.3.patch, YARN-4218.trunk.patch, 
> YARN-4218.wip.patch, screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.






[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2016-10-31 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218.trunk.3.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.2.patch, YARN-4218.3.patch, YARN-4218.4.patch, YARN-4218.5.patch, 
> YARN-4218.branch-2.2.patch, YARN-4218.branch-2.patch, YARN-4218.patch, 
> YARN-4218.trunk.2.patch, YARN-4218.trunk.3.patch, YARN-4218.trunk.patch, 
> YARN-4218.wip.patch, screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.






[jira] [Commented] (YARN-4218) Metric for resource*time that was preempted

2016-10-31 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621566#comment-15621566
 ] 

Chang Li commented on YARN-4218:


I was having some trouble running jobs on trunk, but just figured that out. Will 
submit a patch for trunk soon.

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.2.patch, YARN-4218.3.patch, YARN-4218.4.patch, YARN-4218.5.patch, 
> YARN-4218.branch-2.2.patch, YARN-4218.branch-2.patch, YARN-4218.patch, 
> YARN-4218.trunk.2.patch, YARN-4218.trunk.patch, YARN-4218.wip.patch, 
> screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.






[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2016-10-30 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218.branch-2.2.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.2.patch, YARN-4218.3.patch, YARN-4218.4.patch, YARN-4218.5.patch, 
> YARN-4218.branch-2.2.patch, YARN-4218.branch-2.patch, YARN-4218.patch, 
> YARN-4218.trunk.2.patch, YARN-4218.trunk.patch, YARN-4218.wip.patch, 
> screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.






[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2016-10-30 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218.5.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.2.patch, YARN-4218.3.patch, YARN-4218.4.patch, YARN-4218.5.patch, 
> YARN-4218.branch-2.patch, YARN-4218.patch, YARN-4218.trunk.2.patch, 
> YARN-4218.trunk.patch, YARN-4218.wip.patch, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.






[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2016-10-30 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218.branch-2.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.2.patch, YARN-4218.3.patch, YARN-4218.4.patch, 
> YARN-4218.branch-2.patch, YARN-4218.patch, YARN-4218.trunk.2.patch, 
> YARN-4218.trunk.patch, YARN-4218.wip.patch, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.






[jira] [Commented] (YARN-4269) Log aggregation should not swallow the exception during close()

2016-10-30 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15619473#comment-15619473
 ] 

Chang Li commented on YARN-4269:


Thanks for taking a look at this jira, [~shaneku...@gmail.com]. I think it's not 
straightforward to pass in the log priority, since we are using the generic Log 
interface rather than a specific logging implementation such as log4j.

> Log aggregation should not swallow the exception during close()
> ---
>
> Key: YARN-4269
> URL: https://issues.apache.org/jira/browse/YARN-4269
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Reporter: Chang Li
>Assignee: Chang Li
>  Labels: oct16-easy
> Attachments: YARN-4269.2.patch, YARN-4269.3.patch, YARN-4269.patch
>
>
> the log aggregation thread ignores exception thrown by close(). It shouldn't 
> be ignored, since the file content may be missing or partial.
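
A minimal sketch of the intended behavior, not necessarily the code the patch ends up using (logWriter, LOG, and containerId are placeholders for the surrounding aggregation context): record and propagate a close() failure instead of discarding it.

{code}
try {
  logWriter.close();
} catch (IOException e) {
  // A failed close can leave the aggregated log file truncated or missing,
  // so surface the error instead of swallowing it.
  LOG.error("Failed to close aggregated log file for " + containerId, e);
  throw new YarnRuntimeException("Error closing aggregated log file", e);
}
{code}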






[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2016-10-30 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218.trunk.2.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.2.patch, YARN-4218.3.patch, YARN-4218.4.patch, YARN-4218.patch, 
> YARN-4218.trunk.2.patch, YARN-4218.trunk.patch, YARN-4218.wip.patch, 
> screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.






[jira] [Commented] (YARN-4218) Metric for resource*time that was preempted

2016-10-30 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15619394#comment-15619394
 ] 

Chang Li commented on YARN-4218:


It seems a patch generated by {code} git diff HEAD --no-prefix {code} is not 
accepted by Hadoop QA? Posting the .trunk patch again, this time generated with 
{code} git diff trunk {code}.

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.2.patch, YARN-4218.3.patch, YARN-4218.4.patch, YARN-4218.patch, 
> YARN-4218.trunk.patch, YARN-4218.wip.patch, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.






[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2016-10-29 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218.trunk.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.2.patch, YARN-4218.3.patch, YARN-4218.4.patch, YARN-4218.patch, 
> YARN-4218.trunk.patch, YARN-4218.wip.patch, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.






[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2016-10-29 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218.4.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.2.patch, YARN-4218.3.patch, YARN-4218.4.patch, YARN-4218.patch, 
> YARN-4218.wip.patch, screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.






[jira] [Commented] (YARN-4935) TestYarnClient#testSubmitIncorrectQueue fails with FairScheduler

2016-04-09 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233692#comment-15233692
 ] 

Chang Li commented on YARN-4935:


Agreed that the change in YARN-3131 is CS-specific, and that change was 
originally driven by a problem encountered in Tez.

> TestYarnClient#testSubmitIncorrectQueue fails with FairScheduler
> 
>
> Key: YARN-4935
> URL: https://issues.apache.org/jira/browse/YARN-4935
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.8.0
>Reporter: Yufei Gu
>Assignee: Yufei Gu
>
> This test case introduced by YARN-3131 works well on CapacityScheduler but 
> not on FairScheduler, since CS doesn't allow dynamically creating a queue while 
> FS supports it. So if you give a random queue name, CS will reject it, but FS 
> will create a new queue for it by default. 
> One simple solution is to specify CS in this test case. /cc [~lichangleo]. I 
> was thinking about creating another test case for FS, but for the code 
> introduced by YARN-3131 it may not be necessary.
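
A sketch of the suggested fix, pinning the scheduler in the test configuration so an unknown queue is rejected rather than auto-created; the exact placement inside testSubmitIncorrectQueue may differ:

{code}
Configuration conf = new YarnConfiguration();
// Force the CapacityScheduler so a nonexistent queue is rejected instead of
// being created on the fly, as the FairScheduler would do.
conf.setClass(YarnConfiguration.RM_SCHEDULER,
    CapacityScheduler.class, ResourceScheduler.class);
{code}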





[jira] [Commented] (YARN-4642) Commonize URL parsing code in RMWebAppFilter

2016-02-02 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128823#comment-15128823
 ] 

Chang Li commented on YARN-4642:


[~jlowe] please help review, thanks!

> Commonize URL parsing code in RMWebAppFilter
> 
>
> Key: YARN-4642
> URL: https://issues.apache.org/jira/browse/YARN-4642
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4642.2.patch, YARN-4642.patch
>
>
> A follow up jira for YARN-4428 as suggested by [~jlowe] to commonize url 
> parsing code and to unblock the progress for YARN-4428





[jira] [Updated] (YARN-4642) Commonize URL parsing code in RMWebAppFilter

2016-02-02 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4642:
---
Attachment: YARN-4642.2.patch

> Commonize URL parsing code in RMWebAppFilter
> 
>
> Key: YARN-4642
> URL: https://issues.apache.org/jira/browse/YARN-4642
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4642.2.patch, YARN-4642.patch
>
>
> A follow up jira for YARN-4428 as suggested by [~jlowe] to commonize url 
> parsing code and to unblock the progress for YARN-4428





[jira] [Updated] (YARN-4642) Commonize URL parsing code in RMWebAppFilter

2016-02-01 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4642:
---
Attachment: YARN-4642.patch

> Commonize URL parsing code in RMWebAppFilter
> 
>
> Key: YARN-4642
> URL: https://issues.apache.org/jira/browse/YARN-4642
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4642.patch
>
>
> A follow up jira for YARN-4428 as suggested by [~jlowe] to commonize url 
> parsing code and to unblock the progress for YARN-4428





[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2016-01-29 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.branch-2.7.patch

[~jlowe], uploaded the 2.7 patch. I also realized that my previous .9 patch used 
log.info instead of log.debug, so I updated the .10 patch to address that as well.

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.10.patch, YARN-4428.2.2.patch, YARN-4428.2.patch, 
> YARN-4428.3.patch, YARN-4428.3.patch, YARN-4428.4.patch, YARN-4428.5.patch, 
> YARN-4428.6.patch, YARN-4428.7.patch, YARN-4428.8.patch, 
> YARN-4428.9.test.patch, YARN-4428.branch-2.7.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.
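
To illustrate the proposed behavior, a rough sketch of the redirect check; ahsEnabled, ahsBaseUrl, appId, and rmContext are assumed to be available from the surrounding filter, and this is not the actual RMWebAppFilter code:

{code}
String uri = request.getRequestURI();   // e.g. /cluster/app/application_1...
if (ahsEnabled && uri.startsWith("/cluster/app/")
    && rmContext.getRMApps().get(appId) == null) {
  // The RM no longer remembers the app; try the application history server UI.
  response.sendRedirect(ahsBaseUrl + "/applicationhistory/app/" + appId);
  return;
}
{code}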





[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2016-01-29 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.10.patch

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.10.patch, YARN-4428.2.2.patch, YARN-4428.2.patch, 
> YARN-4428.3.patch, YARN-4428.3.patch, YARN-4428.4.patch, YARN-4428.5.patch, 
> YARN-4428.6.patch, YARN-4428.7.patch, YARN-4428.8.patch, 
> YARN-4428.9.test.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.





[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2016-01-28 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.9.test.patch

[~jlowe], sorry I missed that. Updated the .9 patch accordingly.

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.2.2.patch, YARN-4428.2.patch, YARN-4428.3.patch, YARN-4428.3.patch, 
> YARN-4428.4.patch, YARN-4428.5.patch, YARN-4428.6.patch, YARN-4428.7.patch, 
> YARN-4428.8.patch, YARN-4428.9.test.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.





[jira] [Commented] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2016-01-28 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15121570#comment-15121570
 ] 

Chang Li commented on YARN-4428:


[~jlowe] please help review the latest patch, thanks!

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.2.2.patch, YARN-4428.2.patch, YARN-4428.3.patch, YARN-4428.3.patch, 
> YARN-4428.4.patch, YARN-4428.5.patch, YARN-4428.6.patch, YARN-4428.7.patch, 
> YARN-4428.8.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.





[jira] [Commented] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2016-01-27 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120028#comment-15120028
 ] 

Chang Li commented on YARN-4428:


[~jlowe] please help review the latest patch, thanks!

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.2.2.patch, YARN-4428.2.patch, YARN-4428.3.patch, YARN-4428.3.patch, 
> YARN-4428.4.patch, YARN-4428.5.patch, YARN-4428.6.patch, YARN-4428.7.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.





[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2016-01-27 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.8.patch

[~jlowe], thanks a lot for the patient and careful review! Updated the .8 patch 
accordingly.

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.2.2.patch, YARN-4428.2.patch, YARN-4428.3.patch, YARN-4428.3.patch, 
> YARN-4428.4.patch, YARN-4428.5.patch, YARN-4428.6.patch, YARN-4428.7.patch, 
> YARN-4428.8.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.





[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2016-01-27 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.7.patch

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.2.2.patch, YARN-4428.2.patch, YARN-4428.3.patch, YARN-4428.3.patch, 
> YARN-4428.4.patch, YARN-4428.5.patch, YARN-4428.6.patch, YARN-4428.7.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.





[jira] [Created] (YARN-4642) Commonize URL parsing code in RMWebAppFilter

2016-01-26 Thread Chang Li (JIRA)
Chang Li created YARN-4642:
--

 Summary: Commonize URL parsing code in RMWebAppFilter
 Key: YARN-4642
 URL: https://issues.apache.org/jira/browse/YARN-4642
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Chang Li
Assignee: Chang Li


A follow up jira for YARN-4428 as suggested by [~jlowe] to commonize url 
parsing code and to unblock the progress for YARN-4428





[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2016-01-26 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.6.patch

Thanks [~jlowe] for the review! Updated the .6 patch to address your concerns. 
Also opened YARN-4642 to commonize the URL parsing.

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.2.2.patch, YARN-4428.2.patch, YARN-4428.3.patch, YARN-4428.3.patch, 
> YARN-4428.4.patch, YARN-4428.5.patch, YARN-4428.6.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.





[jira] [Commented] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2016-01-25 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115222#comment-15115222
 ] 

Chang Li commented on YARN-4428:


[~jlowe] could you help review the latest patch? Thx!

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.2.2.patch, YARN-4428.2.patch, YARN-4428.3.patch, YARN-4428.3.patch, 
> YARN-4428.4.patch, YARN-4428.5.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.





[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2016-01-25 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218.3.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.2.patch, YARN-4218.3.patch, YARN-4218.patch, YARN-4218.wip.patch, 
> screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.





[jira] [Commented] (YARN-4570) Nodemanager leaking RawLocalFilesystem instances for user "testing"

2016-01-22 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112677#comment-15112677
 ] 

Chang Li commented on YARN-4570:


Profiled heap dumps from several nodemanagers in our clusters and have not been 
able to find the "testing" RawLocalFilesystem leak so far. Close this as 
cannot-reproduce?

> Nodemanager leaking RawLocalFilesystem instances for user "testing"
> ---
>
> Key: YARN-4570
> URL: https://issues.apache.org/jira/browse/YARN-4570
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Chang Li
>
> I recently ran across a NodeManager that was running slowly due to excessive 
> GC.  Digging into the heap I saw that most of the issue was leaked filesystem 
> statistics data objects which has been fixed in HADOOP-12107.  However I also 
> noticed there were many thousands of RawLocalFilesystem objects on the heap, 
> far more than any other FileSystem type.  Sampling a number of them showed 
> that they were for the "testing" user.





[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2016-01-19 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107096#comment-15107096
 ] 

Chang Li commented on YARN-4589:


[~jlowe] please help review the latest patch.
The latest implementation adds a new container external state, LOCALIZING; on 
each node heartbeat to the RM, the RMNode maintains and updates the states of 
its containers. When the RMAppAttempt times out, it queries the RMNode for its 
container's state. The implementation also considers backward compatibility.
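
Roughly, the flow described above could look like the following; getContainerState and the LOCALIZING state are the proposed additions, so the names here are illustrative rather than existing API:

{code}
// On AM launch timeout, ask the RMNode hosting the AM container whether it
// was still localizing, and include that in the diagnostics.
ContainerState state = rmNode.getContainerState(masterContainerId);
String diagnostics = "ApplicationMaster launch timed out";
if (state == ContainerState.LOCALIZING) {
  diagnostics += " while the container was still localizing resources";
}
{code}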

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.





[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2016-01-19 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.5.patch

.5 patch addresses the checkstyle issues

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.2.2.patch, YARN-4428.2.patch, YARN-4428.3.patch, YARN-4428.3.patch, 
> YARN-4428.4.patch, YARN-4428.5.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2016-01-19 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.4.patch

Thanks [~jlowe] for the review and the good suggestions! Updated the .4 patch to 
also support redirects for app attempts and containers. I have successfully tested 
them manually.

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.2.2.patch, YARN-4428.2.patch, YARN-4428.3.patch, YARN-4428.3.patch, 
> YARN-4428.4.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4589) Diagnostics for localization timeouts is lacking

2016-01-15 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4589:
---
Attachment: YARN-4589.3.patch

[~jlowe] thanks for the good suggestion to separate those two. The updated .3 patch 
keeps only the YARN-related change. Also fixed the broken unit tests.

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2016-01-14 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.3.patch

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.2.2.patch, YARN-4428.2.patch, YARN-4428.3.patch, YARN-4428.3.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4589) Diagnostics for localization timeouts is lacking

2016-01-14 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4589:
---
Attachment: YARN-4589.2.patch

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4589.2.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4592) Remove unsed GetContainerStatus proto

2016-01-14 Thread Chang Li (JIRA)
Chang Li created YARN-4592:
--

 Summary: Remove unsed GetContainerStatus proto
 Key: YARN-4592
 URL: https://issues.apache.org/jira/browse/YARN-4592
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chang Li
Assignee: Chang Li
Priority: Minor
 Attachments: YARN-4592.patch

GetContainerStatus protos have been left unused since YARN-926



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4592) Remove unsed GetContainerStatus proto

2016-01-14 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4592:
---
Attachment: YARN-4592.patch

> Remove unsed GetContainerStatus proto
> -
>
> Key: YARN-4592
> URL: https://issues.apache.org/jira/browse/YARN-4592
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Minor
> Attachments: YARN-4592.patch
>
>
> GetContainerStatus protos have been left unused since YARN-926



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2016-01-14 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.3.patch

Thanks [~jlowe] for the review and for pointing me to the related issue! Updated the 
.3 patch, which computes the redirect inside RMWebAppFilter.
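As a rough illustration of the redirect (a sketch under assumptions, not the patch 
itself; the AHS web address and path layout below are assumed for the example):
{code}
// Sketch only: if the RM web app no longer remembers the application, send the
// browser to the corresponding Application History Server page instead of
// rendering the "Failed to read the application ..." error page.
import java.io.IOException;
import javax.servlet.http.HttpServletResponse;

public class AhsRedirectSketch {
  static void maybeRedirectToAhs(String appId, boolean rmKnowsApp,
      String ahsWebAddress, HttpServletResponse resp) throws IOException {
    if (!rmKnowsApp) {
      // e.g. http://ahs-host:8188/applicationhistory/app/application_1...
      resp.sendRedirect(ahsWebAddress + "/applicationhistory/app/" + appId);
    }
  }
}
{code}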

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.2.2.patch, YARN-4428.2.patch, YARN-4428.3.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4589) Diagnostics for localization timeouts is lacking

2016-01-13 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4589:
---
Attachment: YARN-4589.patch

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4589) Diagnostics for localization timeouts is lacking

2016-01-13 Thread Chang Li (JIRA)
Chang Li created YARN-4589:
--

 Summary: Diagnostics for localization timeouts is lacking
 Key: YARN-4589
 URL: https://issues.apache.org/jira/browse/YARN-4589
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Chang Li
Assignee: Chang Li


When a container takes too long to localize it manifests as a timeout, and 
there's no indication that localization was the issue. We need diagnostics for 
timeouts to indicate the container was still localizing when the timeout 
occurred.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels

2016-01-11 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4414:
---
Attachment: YARN-4414.3.patch

Thanks [~jlowe] for spotting that! Updated .3 to remove the comment.

> Nodemanager connection errors are retried at multiple levels
> 
>
> Key: YARN-4414
> URL: https://issues.apache.org/jira/browse/YARN-4414
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, 
> YARN-4414.1.3.patch, YARN-4414.1.patch, YARN-4414.2.patch, YARN-4414.3.patch
>
>
> This is related to YARN-3238.  Ran into more scenarios where connection 
> errors are being retried at multiple levels, like NoRouteToHostException.  
> The fix for YARN-3238 was too specific, and I think we need a more general 
> solution to catch a wider array of connection errors that can occur to avoid 
> retrying them both at the RPC layer and at the NM proxy layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4414) Nodemanager connection errors are retried at multiple levels

2016-01-11 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092408#comment-15092408
 ] 

Chang Li commented on YARN-4414:


[~jlowe], could you help review the latest patch? Thx

> Nodemanager connection errors are retried at multiple levels
> 
>
> Key: YARN-4414
> URL: https://issues.apache.org/jira/browse/YARN-4414
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, 
> YARN-4414.1.3.patch, YARN-4414.1.patch, YARN-4414.2.patch
>
>
> This is related to YARN-3238.  Ran into more scenarios where connection 
> errors are being retried at multiple levels, like NoRouteToHostException.  
> The fix for YARN-3238 was too specific, and I think we need a more general 
> solution to catch a wider array of connection errors that can occur to avoid 
> retrying them both at the RPC layer and at the NM proxy layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4570) Nodemanager leaking RawLocalFilesystem instances for user "testing"

2016-01-08 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li reassigned YARN-4570:
--

Assignee: Chang Li

> Nodemanager leaking RawLocalFilesystem instances for user "testing"
> ---
>
> Key: YARN-4570
> URL: https://issues.apache.org/jira/browse/YARN-4570
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Chang Li
>
> I recently ran across a NodeManager that was running slowly due to excessive 
> GC.  Digging into the heap I saw that most of the issue was leaked filesystem 
> statistics data objects which has been fixed in HADOOP-12107.  However I also 
> noticed there were many thousands of RawLocalFilesystem objects on the heap, 
> far more than any other FileSystem type.  Sampling a number of them showed 
> that they were for the "testing" user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4414) Nodemanager connection errors are retried at multiple levels

2016-01-08 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089339#comment-15089339
 ] 

Chang Li commented on YARN-4414:


Hi [~xinxianyin], RM HA already disables IPC retries. Also, a client should try very 
hard to connect to the RM, because failing to reach the RM is a catastrophic failure 
while failing to connect to an NM is not. I think we should only change NMProxy in 
this jira.
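To illustrate the layering concern, here is a small self-contained sketch (not the 
patch) of classifying the broader family of connection errors so that only one layer 
retries them:
{code}
// Sketch only: if both the RPC client and the NM proxy wrapper retry connection
// failures, the effective wait multiplies. Classifying connect-level errors in
// one place lets a single layer own the retries.
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketException;
import java.net.UnknownHostException;

public class ConnectionErrorClassifierSketch {
  static boolean isConnectionError(Throwable t) {
    return t instanceof ConnectException
        || t instanceof NoRouteToHostException
        || t instanceof UnknownHostException
        || t instanceof SocketException;
  }

  public static void main(String[] args) {
    System.out.println(isConnectionError(new NoRouteToHostException("no route to host")));
  }
}
{code}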

> Nodemanager connection errors are retried at multiple levels
> 
>
> Key: YARN-4414
> URL: https://issues.apache.org/jira/browse/YARN-4414
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, 
> YARN-4414.1.3.patch, YARN-4414.1.patch, YARN-4414.2.patch
>
>
> This is related to YARN-3238.  Ran into more scenarios where connection 
> errors are being retried at multiple levels, like NoRouteToHostException.  
> The fix for YARN-3238 was too specific, and I think we need a more general 
> solution to catch a wider array of connection errors that can occur to avoid 
> retrying them both at the RPC layer and at the NM proxy layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels

2016-01-07 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4414:
---
Attachment: YARN-4414.2.patch

Thanks [~jlowe] for the review!
Updated the .2 patch to remove getNMProxy2 and implement getProxy() in terms of 
getProxy(Configuration).
I set the NM address to a dummy port (1234) so that it triggers a connection error 
and RPC-level retries.
{{BaseContainerManagerTest}} sets it to {code}"0.0.0.0:" + 
ServerSocketUtil.getPort(49162, 10); {code} a normal, reachable address, so the RPC 
retry could not be triggered there.
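For reference, the test setup boils down to something like the following sketch (the 
dummy port is arbitrary; only the standard yarn.nodemanager.address key is assumed):
{code}
// Sketch only: pointing the NM address at a port nothing listens on makes the
// proxy's connect attempts fail, so the retry behaviour can be observed in a test.
import org.apache.hadoop.conf.Configuration;

public class DummyNmAddressSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // a port that nothing listens on in the test environment, so every
    // connection attempt fails and the retry paths are exercised
    conf.set("yarn.nodemanager.address", "0.0.0.0:1234");
    System.out.println("NM address used by the test: "
        + conf.get("yarn.nodemanager.address"));
  }
}
{code}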

> Nodemanager connection errors are retried at multiple levels
> 
>
> Key: YARN-4414
> URL: https://issues.apache.org/jira/browse/YARN-4414
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, 
> YARN-4414.1.3.patch, YARN-4414.1.patch, YARN-4414.2.patch
>
>
> This is related to YARN-3238.  Ran into more scenarios where connection 
> errors are being retried at multiple levels, like NoRouteToHostException.  
> The fix for YARN-3238 was too specific, and I think we need a more general 
> solution to catch a wider array of connection errors that can occur to avoid 
> retrying them both at the RPC layer and at the NM proxy layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2015-12-18 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.2.2.patch

.2.2 addressed the whitespace issue

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, 
> YARN-4428.2.2.patch, YARN-4428.2.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2015-12-17 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.2.patch

The .2 patch adds a unit test for the change in RMAppAttemptImpl. The web redirect 
itself is hard to cover with a unit test, so I tested it manually: first I ran a 
sleep job with invalid options, such as sleep -Dyarn.app.mapreduce.am.command-opts="-abc" -m 
1, so that the job crashes during launch. Then I shut down the RM, cleared the 
state store, and brought the RM back up. Finally I visited the crashed app in the RM UI 
and verified that I was redirected to the AHS page for that app.

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, YARN-4428.2.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2015-12-16 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.1.2.patch

.1.2 patch addressed checkstyle issues

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels

2015-12-16 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4414:
---
Attachment: YARN-4414.1.3.patch

Oops, my bad, I intended to name the latest patch .1.3. 
Removed the .2.2 patch and re-uploaded the latest as .1.3.

> Nodemanager connection errors are retried at multiple levels
> 
>
> Key: YARN-4414
> URL: https://issues.apache.org/jira/browse/YARN-4414
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, 
> YARN-4414.1.3.patch, YARN-4414.1.patch
>
>
> This is related to YARN-3238.  Ran into more scenarios where connection 
> errors are being retried at multiple levels, like NoRouteToHostException.  
> The fix for YARN-3238 was too specific, and I think we need a more general 
> solution to catch a wider array of connection errors that can occur to avoid 
> retrying them both at the RPC layer and at the NM proxy layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels

2015-12-16 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4414:
---
Attachment: (was: YARN-4414.2.2.patch)

> Nodemanager connection errors are retried at multiple levels
> 
>
> Key: YARN-4414
> URL: https://issues.apache.org/jira/browse/YARN-4414
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, 
> YARN-4414.1.patch
>
>
> This is related to YARN-3238.  Ran into more scenarios where connection 
> errors are being retried at multiple levels, like NoRouteToHostException.  
> The fix for YARN-3238 was too specific, and I think we need a more general 
> solution to catch a wider array of connection errors that can occur to avoid 
> retrying them both at the RPC layer and at the NM proxy layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels

2015-12-16 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4414:
---
Attachment: YARN-4414.2.2.patch

.2.2 fixes the whitespace issue

> Nodemanager connection errors are retried at multiple levels
> 
>
> Key: YARN-4414
> URL: https://issues.apache.org/jira/browse/YARN-4414
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, 
> YARN-4414.1.patch, YARN-4414.2.2.patch
>
>
> This is related to YARN-3238.  Ran into more scenarios where connection 
> errors are being retried at multiple levels, like NoRouteToHostException.  
> The fix for YARN-3238 was too specific, and I think we need a more general 
> solution to catch a wider array of connection errors that can occur to avoid 
> retrying them both at the RPC layer and at the NM proxy layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels

2015-12-14 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4414:
---
Attachment: YARN-4414.1.2.patch

> Nodemanager connection errors are retried at multiple levels
> 
>
> Key: YARN-4414
> URL: https://issues.apache.org/jira/browse/YARN-4414
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, 
> YARN-4414.1.patch
>
>
> This is related to YARN-3238.  Ran into more scenarios where connection 
> errors are being retried at multiple levels, like NoRouteToHostException.  
> The fix for YARN-3238 was too specific, and I think we need a more general 
> solution to catch a wider array of connection errors that can occur to avoid 
> retrying them both at the RPC layer and at the NM proxy layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2015-12-14 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218.2.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.2.patch, YARN-4218.patch, YARN-4218.wip.patch, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4341) add doc about timeline performance tool usage

2015-12-11 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052983#comment-15052983
 ] 

Chang Li commented on YARN-4341:


[~sjlee0], could you please take a look and see whether the latest patch is good? Thanks!

> add doc about timeline performance tool usage
> -
>
> Key: YARN-4341
> URL: https://issues.apache.org/jira/browse/YARN-4341
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4341.2.patch, YARN-4341.3.patch, YARN-4341.4.patch, 
> YARN-4341.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels

2015-12-10 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4414:
---
Attachment: YARN-4414.1.patch

> Nodemanager connection errors are retried at multiple levels
> 
>
> Key: YARN-4414
> URL: https://issues.apache.org/jira/browse/YARN-4414
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4414.1.patch
>
>
> This is related to YARN-3238.  Ran into more scenarios where connection 
> errors are being retried at multiple levels, like NoRouteToHostException.  
> The fix for YARN-3238 was too specific, and I think we need a more general 
> solution to catch a wider array of connection errors that can occur to avoid 
> retrying them both at the RPC layer and at the NM proxy layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels

2015-12-10 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4414:
---
Attachment: YARN-4414.1.2.patch

> Nodemanager connection errors are retried at multiple levels
> 
>
> Key: YARN-4414
> URL: https://issues.apache.org/jira/browse/YARN-4414
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4414.1.2.patch, YARN-4414.1.patch
>
>
> This is related to YARN-3238.  Ran into more scenarios where connection 
> errors are being retried at multiple levels, like NoRouteToHostException.  
> The fix for YARN-3238 was too specific, and I think we need a more general 
> solution to catch a wider array of connection errors that can occur to avoid 
> retrying them both at the RPC layer and at the NM proxy layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4341) add doc about timeline performance tool usage

2015-12-09 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4341:
---
Attachment: YARN-4341.4.patch

Thanks for catching the issue, [~sjlee0]. Updated the .4 patch to fix it.

> add doc about timeline performance tool usage
> -
>
> Key: YARN-4341
> URL: https://issues.apache.org/jira/browse/YARN-4341
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4341.2.patch, YARN-4341.3.patch, YARN-4341.4.patch, 
> YARN-4341.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4341) add doc about timeline performance tool usage

2015-12-09 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4341:
---
Attachment: YARN-4341.3.patch

Thanks a lot [~sjlee0] for the review! Updated the .3 patch and addressed your 
suggestions there.

> add doc about timeline performance tool usage
> -
>
> Key: YARN-4341
> URL: https://issues.apache.org/jira/browse/YARN-4341
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4341.2.patch, YARN-4341.3.patch, YARN-4341.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2015-12-07 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Attachment: YARN-4428.1.patch

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. Also there is a corner case when application 
> failed during launch, it's original tracking url won't be set correctly, and 
> WebAppProxyServlet won't redirect us to AHS page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2015-12-07 Thread Chang Li (JIRA)
Chang Li created YARN-4428:
--

 Summary: Redirect RM page to AHS page when AHS turned on and RM 
page is not avaialable
 Key: YARN-4428
 URL: https://issues.apache.org/jira/browse/YARN-4428
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chang Li
Assignee: Chang Li


When AHS is turned on, if we can't view an application on the RM page, the RM page should 
redirect us to the AHS page. There is also a corner case: when an application fails 
during launch, its original tracking URL won't be set correctly, and 
WebAppProxyServlet won't redirect us to the AHS page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2015-12-07 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Description: When AHS is turned on, if we can't view application in RM 
page, RM page should redirect us to AHS page. Also there is a corner case when 
application failed during launch, it's original tracking url won't be set 
correctly, and WebAppProxyServlet won't redirect us to AHS page as YARN-3975 
designed.  (was: When AHS is turned on, if we can't view application in RM 
page, RM page should redirect us to AHS page. Also there is a corner case when 
application failed during launch, it's original tracking url won't be set 
correctly, and WebAppProxyServlet won't redirect us to AHS page.)

> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. Also there is a corner case when application 
> failed during launch, it's original tracking url won't be set correctly, and 
> WebAppProxyServlet won't redirect us to AHS page as YARN-3975 designed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4414) Nodemanager connection errors are retried at multiple levels

2015-12-07 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li reassigned YARN-4414:
--

Assignee: Chang Li

> Nodemanager connection errors are retried at multiple levels
> 
>
> Key: YARN-4414
> URL: https://issues.apache.org/jira/browse/YARN-4414
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Jason Lowe
>Assignee: Chang Li
>
> This is related to YARN-3238.  Ran into more scenarios where connection 
> errors are being retried at multiple levels, like NoRouteToHostException.  
> The fix for YARN-3238 was too specific, and I think we need a more general 
> solution to catch a wider array of connection errors that can occur to avoid 
> retrying them both at the RPC layer and at the NM proxy layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not avaialable

2015-12-07 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4428:
---
Description: 
When AHS is turned on, if we can't view application in RM page, RM page should 
redirect us to AHS page. For example, when you go to cluster/app/application_1, 
if RM no longer remember the application, we will simply get "Failed to read 
the application application_1", but it will be good for RM ui to smartly try to 
redirect to AHS ui /applicationhistory/app/application_1 to see if it's there. 
The redirect usage already exist for logs in nodemanager UI.
Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
fall back of RM not remembering the app. YARN-3975 tried to do this only when 
original tracking url is not set. But there are many cases, such as when app 
failed at launch, original tracking url will be set to point to RM page, so 
redirect to AHS page won't work.

  was:
When AHS is turned on, if we can't view application in RM page, RM page should 
redirect us to AHS page. For example, when you go to cluster/app/application_1, 
if RM no longer remember the application, we will simply get "Failed to read 
the application application_1", but it will be good for RM ui to smartly try to 
redirect to AHS ui /applicationhistory/app/application_1 to see if it's there. 
The redirect usage already exist for logs in nodemanager UI.
Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page. 
YARN-3975 tried to do this only when original tracking url is not set. But 
there are many cases, such as when app failed at launch, original tracking url 
will be set to point to RM page, so redirect to AHS page won't work.


> Redirect RM page to AHS page when AHS turned on and RM page is not avaialable
> -
>
> Key: YARN-4428
> URL: https://issues.apache.org/jira/browse/YARN-4428
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4428.1.patch
>
>
> When AHS is turned on, if we can't view application in RM page, RM page 
> should redirect us to AHS page. For example, when you go to 
> cluster/app/application_1, if RM no longer remember the application, we will 
> simply get "Failed to read the application application_1", but it will be 
> good for RM ui to smartly try to redirect to AHS ui 
> /applicationhistory/app/application_1 to see if it's there. The redirect 
> usage already exist for logs in nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to AHS page on 
> fall back of RM not remembering the app. YARN-3975 tried to do this only when 
> original tracking url is not set. But there are many cases, such as when app 
> failed at launch, original tracking url will be set to point to RM page, so 
> redirect to AHS page won't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4375) CapacityScheduler needs more debug logging for why queues don't get containers

2015-11-24 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4375:
---
Attachment: YARN-4375.patch

[~sunilg], thanks for pointing me to those ongoing efforts. What I want to 
accomplish in this jira is simply to add more debug logging about what might go 
wrong when allocating containers to a queue. 
I have uploaded a patch which adds more debug logging in the regular container 
allocator to indicate why container allocation is not happening.
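The logging follows the usual guarded-debug pattern; roughly (illustrative message 
text, not the actual patch):
{code}
// Sketch only: guard the debug statement so string concatenation stays off the
// hot allocation path when debug logging is disabled.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class AllocatorDebugLoggingSketch {
  private static final Log LOG = LogFactory.getLog(AllocatorDebugLoggingSketch.class);

  void logSkippedAllocation(String queueName, String reason) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Skipping container allocation for queue " + queueName + ": " + reason);
    }
  }
}
{code}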

> CapacityScheduler needs more debug logging for why queues don't get containers
> --
>
> Key: YARN-4375
> URL: https://issues.apache.org/jira/browse/YARN-4375
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4375.patch
>
>
> CapacityScheduler needs more debug logging for why queues don't get containers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-24 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4334:
---
Attachment: YARN-4334.4.2.patch

> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.2.patch, YARN-4334.3.patch, 
> YARN-4334.4.2.patch, YARN-4334.4.patch, YARN-4334.patch, 
> YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, YARN-4334.wip.4.patch, 
> YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-23 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4334:
---
Attachment: YARN-4334.4.patch

The .4 patch fixes some checkstyle and whitespace issues. The TestWebApp failure is 
tracked by YARN-4379 and is not related to my change; TestAMAuthorization and 
TestClientRMTokens are not caused by my patch either. 
[~jlowe], please help review the latest patch, thanks!

> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.2.patch, YARN-4334.3.patch, YARN-4334.4.patch, 
> YARN-4334.patch, YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, 
> YARN-4334.wip.4.patch, YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4132) Nodemanagers should try harder to connect to the RM

2015-11-23 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022161#comment-15022161
 ] 

Chang Li commented on YARN-4132:


The TestWebApp failure is tracked by YARN-4379 and is not related to my change. 
[~jlowe], please help review the updated patch, thanks!

> Nodemanagers should try harder to connect to the RM
> ---
>
> Key: YARN-4132
> URL: https://issues.apache.org/jira/browse/YARN-4132
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4132.2.patch, YARN-4132.3.patch, YARN-4132.4.patch, 
> YARN-4132.5.patch, YARN-4132.6.2.patch, YARN-4132.6.patch, YARN-4132.7.patch, 
> YARN-4132.patch
>
>
> Being part of the cluster, nodemanagers should try very hard (and possibly 
> never give up) to connect to a resourcemanager. Minimally we should have a 
> separate config to set how aggressively a nodemanager will connect to the RM 
> separate from what clients will do.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-20 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4334:
---
Attachment: YARN-4334.2.patch

Fixed the broken TestLeveldbRMStateStore and TestYarnConfigurationFields tests; the 
other broken tests are not related.
Also added additional tests for the state store.

> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.2.patch, YARN-4334.patch, 
> YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, YARN-4334.wip.4.patch, 
> YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4374) RM scheduler UI rounds user limit factor

2015-11-20 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018709#comment-15018709
 ] 

Chang Li commented on YARN-4374:


Thanks [~jlowe] for the further review! I am fine with going with the first patch.

> RM scheduler UI rounds user limit factor
> 
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch
>
>
> RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-20 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4334:
---
Attachment: YARN-4334.3.patch

> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.2.patch, YARN-4334.3.patch, YARN-4334.patch, 
> YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, YARN-4334.wip.4.patch, 
> YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4132) Nodemanagers should try harder to connect to the RM

2015-11-20 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4132:
---
Attachment: YARN-4132.7.patch

Thanks [~jlowe] for the further review! I have updated the .7 patch accordingly.

> Nodemanagers should try harder to connect to the RM
> ---
>
> Key: YARN-4132
> URL: https://issues.apache.org/jira/browse/YARN-4132
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4132.2.patch, YARN-4132.3.patch, YARN-4132.4.patch, 
> YARN-4132.5.patch, YARN-4132.6.2.patch, YARN-4132.6.patch, YARN-4132.7.patch, 
> YARN-4132.patch
>
>
> Being part of the cluster, nodemanagers should try very hard (and possibly 
> never give up) to connect to a resourcemanager. Minimally we should have a 
> separate config to set how aggressively a nodemanager will connect to the RM 
> separate from what clients will do.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-19 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4334:
---
Attachment: YARN-4334.patch

Added a unit test.

> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.patch, YARN-4334.wip.2.patch, 
> YARN-4334.wip.3.patch, YARN-4334.wip.4.patch, YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4374) RM scheduler UI rounds user limit factor

2015-11-19 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15014573#comment-15014573
 ] 

Chang Li commented on YARN-4374:


[~jlowe], please help review the latest patch

> RM scheduler UI rounds user limit factor
> 
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch
>
>
> RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4374) RM scheduler UI rounds user limit factor

2015-11-19 Thread Chang Li (JIRA)
Chang Li created YARN-4374:
--

 Summary: RM scheduler UI rounds user limit factor
 Key: YARN-4374
 URL: https://issues.apache.org/jira/browse/YARN-4374
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chang Li
Assignee: Chang Li


RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4374) RM scheduler UI rounds user limit factor

2015-11-19 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4374:
---
Attachment: YARN-4374.patch

> RM scheduler UI rounds user limit factor
> 
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4374.patch
>
>
> RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4375) CapacityScheduler needs more debug logging for why queues don't get containers

2015-11-19 Thread Chang Li (JIRA)
Chang Li created YARN-4375:
--

 Summary: CapacityScheduler needs more debug logging for why queues 
don't get containers
 Key: YARN-4375
 URL: https://issues.apache.org/jira/browse/YARN-4375
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chang Li
Assignee: Chang Li


CapacityScheduler needs more debug logging for why queues don't get containers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4374) RM scheduler UI rounds user limit factor

2015-11-19 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4374:
---
Attachment: Screenshot1.png

Attached a screenshot to show that the patch fixes the problem. 
[~jlowe] please help review.

> RM scheduler UI rounds user limit factor
> 
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: Screenshot1.png, YARN-4374.patch
>
>
> RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4374) RM scheduler UI rounds user limit factor

2015-11-19 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4374:
---
Attachment: YARN-4374.2.patch

Thanks [~jlowe] for the review. Agreed; updated the .2 patch to use %.3f precision. In 
addition, the .2 patch trims trailing zeros.
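The formatting amounts to something like this small sketch (not the exact patch code):
{code}
// Sketch only: render the user limit factor with up to three decimal places and
// trim trailing zeros, so 0.25 is shown as 0.25 instead of being rounded to 0.3.
public class UserLimitFactorFormatSketch {
  static String format(float userLimitFactor) {
    String s = String.format("%.3f", userLimitFactor);      // e.g. "0.250"
    return s.replaceAll("0+$", "").replaceAll("\\.$", "");  // "0.250" -> "0.25", "1.000" -> "1"
  }

  public static void main(String[] args) {
    System.out.println(format(0.25f)); // 0.25
    System.out.println(format(1.0f));  // 1
  }
}
{code}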

> RM scheduler UI rounds user limit factor
> 
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch
>
>
> RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-18 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4334:
---
Attachment: YARN-4334.wip.4.patch

Updated the wip.4 patch to add an implementation for ZKRMStateStore. It also makes the RM heartbeat service to the state store configurable, as well as the check for state store staleness on recovery.
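For context, a minimal sketch of what the configurable staleness check boils down to; the class and method names here are made up for illustration and are not taken from the patch:
{code}
// Sketch only, with hypothetical names; not the code in the patch. The
// staleness check on recovery reduces to a timestamp comparison:
public final class StateStoreStalenessCheck {

  // lastHeartbeatMs: epoch millis last written by the RM heartbeat service
  // maxStalenessMs:  configured limit; a negative value disables the check
  static boolean isTooOld(long lastHeartbeatMs, long maxStalenessMs, long nowMs) {
    return maxStalenessMs >= 0 && nowMs - lastHeartbeatMs > maxStalenessMs;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    // Store last touched two hours ago with a one-hour limit: skip recovery.
    System.out.println(isTooOld(now - 2 * 3_600_000L, 3_600_000L, now)); // true
    // Check disabled: always recover.
    System.out.println(isTooOld(now - 2 * 3_600_000L, -1L, now));        // false
  }
}
{code}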

> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, 
> YARN-4334.wip.4.patch, YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-18 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4334:
---
Attachment: YARN-4334.wip.3.patch

The wip.3 patch adds an implementation for FileSystemRMStateStore.

> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, 
> YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-16 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4334:
---
Attachment: YARN-4334.wip.2.patch

Thanks [~jlowe] for the review! I have updated the .2 prototype patch, please try it out. When the RM state store has expired, RM recovery will recover the previously running apps and app attempts into the killed state. The .2 prototype patch also addresses your other concerns. I will work on the implementations for the other state stores soon.

> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.wip.2.patch, YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4132) Nodemanagers should try harder to connect to the RM

2015-11-12 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4132:
---
Attachment: YARN-4132.6.2.patch

The .6.2 patch fixes whitespace.

> Nodemanagers should try harder to connect to the RM
> ---
>
> Key: YARN-4132
> URL: https://issues.apache.org/jira/browse/YARN-4132
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4132.2.patch, YARN-4132.3.patch, YARN-4132.4.patch, 
> YARN-4132.5.patch, YARN-4132.6.2.patch, YARN-4132.6.patch, YARN-4132.patch
>
>
> Being part of the cluster, nodemanagers should try very hard (and possibly 
> never give up) to connect to a resourcemanager. Minimally we should have a 
> separate config to set how aggressively a nodemanager will connect to the RM 
> separate from what clients will do.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4341) add doc about timeline performance tool usage

2015-11-12 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4341:
---
Attachment: YARN-4341.patch

> add doc about timeline performance tool usage
> -
>
> Key: YARN-4341
> URL: https://issues.apache.org/jira/browse/YARN-4341
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4341.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4341) add doc about timeline performance tool usage

2015-11-12 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4341:
---
Attachment: YARN-4341.2.patch

The .2 patch fixes whitespace.

> add doc about timeline performance tool usage
> -
>
> Key: YARN-4341
> URL: https://issues.apache.org/jira/browse/YARN-4341
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4341.2.patch, YARN-4341.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4132) Nodemanagers should try harder to connect to the RM

2015-11-12 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4132:
---
Attachment: YARN-4132.6.patch

Thanks [~jlowe] for the review and the good suggestion! I have updated the patch accordingly.

> Nodemanagers should try harder to connect to the RM
> ---
>
> Key: YARN-4132
> URL: https://issues.apache.org/jira/browse/YARN-4132
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4132.2.patch, YARN-4132.3.patch, YARN-4132.4.patch, 
> YARN-4132.5.patch, YARN-4132.6.patch, YARN-4132.patch
>
>
> Being part of the cluster, nodemanagers should try very hard (and possibly 
> never give up) to connect to a resourcemanager. Minimally we should have a 
> separate config to set how aggressively a nodemanager will connect to the RM 
> separate from what clients will do.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4132) Nodemanagers should try harder to connect to the RM

2015-11-11 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4132:
---
Attachment: YARN-4132.5.patch

Thanks [~jlowe] for the review! The .5 patch has only one createRMProxy, which takes two additional inputs: the retry time and the retry interval. ServerRMProxy and ClientRMProxy pass those two inputs based on their respective values in the conf. The conf naming is fixed, and the test has also been tuned down to around 4 seconds.
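To illustrate the shape of that refactor, here is a simplified sketch; the names and retry logic below are assumptions for illustration, not the code in the patch:
{code}
import java.io.IOException;

// Sketch only: one shared factory takes the retry budget as parameters, so
// ServerRMProxy (NM settings, possibly "retry forever") and ClientRMProxy
// (client settings) can each pass values read from their own conf keys.
final class RMProxySketch {

  static <T> T createRMProxy(Class<T> protocol, long maxWaitMs, long retryIntervalMs)
      throws IOException, InterruptedException {
    long deadline = maxWaitMs < 0 ? Long.MAX_VALUE          // negative => never give up
        : System.currentTimeMillis() + maxWaitMs;
    while (true) {
      try {
        return connect(protocol);                           // placeholder connection step
      } catch (IOException e) {
        if (System.currentTimeMillis() >= deadline) {
          throw e;                                          // retry budget exhausted
        }
        Thread.sleep(retryIntervalMs);                      // wait before the next attempt
      }
    }
  }

  private static <T> T connect(Class<T> protocol) throws IOException {
    throw new IOException("placeholder: real code would create the RPC proxy here");
  }
}
{code}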

> Nodemanagers should try harder to connect to the RM
> ---
>
> Key: YARN-4132
> URL: https://issues.apache.org/jira/browse/YARN-4132
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4132.2.patch, YARN-4132.3.patch, YARN-4132.4.patch, 
> YARN-4132.5.patch, YARN-4132.patch
>
>
> Being part of the cluster, nodemanagers should try very hard (and possibly 
> never give up) to connect to a resourcemanager. Minimally we should have a 
> separate config to set how aggressively a nodemanager will connect to the RM 
> separate from what clients will do.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4341) add doc about timeline performance tool usage

2015-11-09 Thread Chang Li (JIRA)
Chang Li created YARN-4341:
--

 Summary: add doc about timeline performance tool usage
 Key: YARN-4341
 URL: https://issues.apache.org/jira/browse/YARN-4341
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Chang Li
Assignee: Chang Li






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4339) optimize timeline server performance tool

2015-11-09 Thread Chang Li (JIRA)
Chang Li created YARN-4339:
--

 Summary: optimize timeline server performance tool
 Key: YARN-4339
 URL: https://issues.apache.org/jira/browse/YARN-4339
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Chang Li
Assignee: Chang Li


As [~Naganarasimha] suggested in YARN-2556, the test could be optimized by having some initial LevelDB data in place before measuring performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2015-11-09 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218.2.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.patch, YARN-4218.wip.patch, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server

2015-11-09 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997139#comment-14997139
 ] 

Chang Li commented on YARN-2556:


Created YARN-4341 to track the work of adding a doc about timeline performance tool usage.

> Tool to measure the performance of the timeline server
> --
>
> Key: YARN-2556
> URL: https://issues.apache.org/jira/browse/YARN-2556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Chang Li
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, 
> YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, 
> YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, 
> YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.15.patch, 
> YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, 
> YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, 
> YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch
>
>
> We need to be able to understand the capacity model for the timeline server 
> to give users the tools they need to deploy a timeline server with the 
> correct capacity.
> I propose we create a mapreduce job that can measure timeline server write 
> and read performance. Transactions per second, I/O for both read and write 
> would be a good start.
> This could be done as an example or test job that could be tied into gridmix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server

2015-11-09 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996875#comment-14996875
 ] 

Chang Li commented on YARN-2556:


Thanks [~Naganarasimha] for suggesting the optimization! +1 on the idea of creating some initial LevelDB data before testing the performance. Created YARN-4339 to work on this idea.

> Tool to measure the performance of the timeline server
> --
>
> Key: YARN-2556
> URL: https://issues.apache.org/jira/browse/YARN-2556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Chang Li
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, 
> YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, 
> YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, 
> YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.15.patch, 
> YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, 
> YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, 
> YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch
>
>
> We need to be able to understand the capacity model for the timeline server 
> to give users the tools they need to deploy a timeline server with the 
> correct capacity.
> I propose we create a mapreduce job that can measure timeline server write 
> and read performance. Transactions per second, I/O for both read and write 
> would be a good start.
> This could be done as an example or test job that could be tied into gridmix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server

2015-11-09 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996832#comment-14996832
 ] 

Chang Li commented on YARN-2556:


Hi [~xgong], here is the usage printed out by the tool: {code}
Usage: [-m ] number of mappers (default: 1)
 [-v] timeline service version
 [-mtype ]
  1. simple entity write mapper
  2. jobhistory files replay mapper
 [-s <(KBs)test>] number of KB per put (mtype=1, default: 1 KB)
 [-t] package sending iterations per mapper (mtype=1, default: 100)
 [-d ] root path of job history files (mtype=2)
 [-r ] (mtype=2)
  1. write all entities for a job in one put (default)
  2. write one entity at a time{code}
There are two different modes to test. One is the simple entity writer, where each mapper creates entities of the size you specify and puts them to the timeline server. The other mode replays job history files, which offers a more realistic test. For the job history file replay test, you put the test job history files (both the job history file and the job conf file) under a directory and then point the tool at that directory with the -d option. You select the test mode with the -mtype option.
Right now the usage is not printed when you pass no options; it is only printed when you pass the wrong options. When you give no parameters, the test runs in simple entity write mode with default settings. So maybe we should print this usage when no parameters are passed?
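For illustration, roughly how the two modes would be invoked using only the options from the usage above; the jar name and the timelineperformance entry point are assumptions here (they depend on the build), not something stated in this thread:
{code}
# Assumed invocation (jar name / entry point may differ in your build):

# simple entity writer mode: 4 mappers, 2 KB per put, 200 puts per mapper
hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar timelineperformance \
  -mtype 1 -m 4 -s 2 -t 200

# job history replay mode: replay history and conf files from a directory
hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar timelineperformance \
  -mtype 2 -m 4 -d /path/to/jobhistory/files
{code}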

> Tool to measure the performance of the timeline server
> --
>
> Key: YARN-2556
> URL: https://issues.apache.org/jira/browse/YARN-2556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Chang Li
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, 
> YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, 
> YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, 
> YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.15.patch, 
> YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, 
> YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, 
> YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch
>
>
> We need to be able to understand the capacity model for the timeline server 
> to give users the tools they need to deploy a timeline server with the 
> correct capacity.
> I propose we create a mapreduce job that can measure timeline server write 
> and read performance. Transactions per second, I/O for both read and write 
> would be a good start.
> This could be done as an example or test job that could be tied into gridmix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-05 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4334:
---
Attachment: YARN-4334.wip.patch

Uploaded a prototype patch, which heartbeats to the LeveldbRMStateStore and, on RM recovery, checks whether the state store has expired.
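As a rough sketch of the heartbeat side of the prototype (hypothetical names; the real patch persists the timestamp into the state store rather than printing it):
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: periodically record "the store was alive at time T" so that a
// later recovery can compare the last heartbeat against a staleness limit.
final class StateStoreHeartbeatSketch {
  public static void main(String[] args) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    long intervalMs = 60_000L; // hypothetical interval; made configurable in later patch revisions
    scheduler.scheduleAtFixedRate(
        () -> storeHeartbeat(System.currentTimeMillis()),  // write epoch millis each tick
        0, intervalMs, TimeUnit.MILLISECONDS);
  }

  private static void storeHeartbeat(long nowMs) {
    // Real code would persist nowMs into the Leveldb state store; print for the sketch.
    System.out.println("heartbeat " + nowMs);
  }
}
{code}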

> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-05 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li reassigned YARN-4334:
--

Assignee: Chang Li

> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4236) Metric for aggregated resources allocation per queue

2015-11-03 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4236:
---
Attachment: (was: YARN-4236.patch)

> Metric for aggregated resources allocation per queue
> 
>
> Key: YARN-4236
> URL: https://issues.apache.org/jira/browse/YARN-4236
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4236.patch
>
>
> We currently track allocated memory and allocated vcores per queue but we 
> don't have a good rate metric on how fast we're allocating these things. In 
> other words, a straight line in allocatedmb could equally be one extreme of 
> no new containers are being allocated or allocating a bunch of containers 
> where we free exactly what we allocate each time. Adding a resources 
> allocated per second per queue would give us a better insight into the rate 
> of resource churn on a queue. Based on this aggregated resource allocation 
> per queue we can easily have some tools to measure the rate of resource 
> allocation per queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

