[jira] [Updated] (YARN-4236) Metric for aggregated resources allocation per queue
[ https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4236:
---------------------------
    Attachment: YARN-4236-3.patch

Updated patch. :)

> Metric for aggregated resources allocation per queue
> -----------------------------------------------------
>
>                 Key: YARN-4236
>                 URL: https://issues.apache.org/jira/browse/YARN-4236
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: metrics, scheduler
>            Reporter: Chang Li
>            Assignee: Chang Li
>              Labels: oct16-medium
>         Attachments: YARN-4236.2.patch, YARN-4236-3.patch, YARN-4236.patch
>
> We currently track allocated memory and allocated vcores per queue, but we
> don't have a good rate metric for how fast we're allocating them. In other
> words, a flat line in allocatedMB could equally mean that no new containers
> are being allocated or that plenty of containers are being allocated while
> we free exactly as much as we allocate each time. Adding a
> resources-allocated-per-second metric per queue would give us better
> insight into the rate of resource churn on a queue, and from an aggregated
> per-queue allocation counter it is easy to build tools that measure each
> queue's allocation rate.
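For context while the patch is under review, here is a minimal, hedged sketch of the idea in Hadoop metrics2 style; the class, field, and method names below are illustrative assumptions, not the contents of the attached patches:

{code}
// Sketch only: a monotonically increasing per-queue counter lets an
// external metrics system derive an allocation rate by sampling over time.
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;

class QueueMetricsSketch {
  @Metric("Aggregate MB allocated over the queue's lifetime")
  MutableCounterLong aggregateMemoryMBAllocated;   // hypothetical name

  @Metric("Aggregate vcores allocated over the queue's lifetime")
  MutableCounterLong aggregateVcoresAllocated;     // hypothetical name

  // Called on every container allocation. Because the counter never resets,
  // rate(aggregateMemoryMBAllocated) distinguishes "no churn" from
  // "allocate and free at equal rates", which a gauge of allocatedMB cannot.
  void allocateResources(long memoryMB, long vcores) {
    aggregateMemoryMBAllocated.incr(memoryMB);
    aggregateVcoresAllocated.incr(vcores);
  }
}
{code}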
[jira] [Commented] (YARN-4236) Metric for aggregated resources allocation per queue
[ https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900663#comment-15900663 ]

Chang Li commented on YARN-4236:
--------------------------------

Hey [~ebadger], I am interested in updating this patch, but I will probably need to wait until the weekend to work on it. Hope that's OK.
[jira] [Commented] (YARN-5834) TestNodeStatusUpdater.testNMRMConnectionConf compares nodemanager wait time to the incorrect value
[ https://issues.apache.org/jira/browse/YARN-5834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15653391#comment-15653391 ]

Chang Li commented on YARN-5834:
--------------------------------

Thanks for reporting. Yes, it's meant to be nmRmConnectionWaitMs. Providing a branch-2 patch, since this test does not exist in trunk.

> TestNodeStatusUpdater.testNMRMConnectionConf compares nodemanager wait time
> to the incorrect value
> ---------------------------------------------------------------------------
>
>                 Key: YARN-5834
>                 URL: https://issues.apache.org/jira/browse/YARN-5834
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Miklos Szegedi
>            Assignee: Chang Li
>            Priority: Minor
>         Attachments: YARN-5834-branch-2.001.patch
>
> The function is TestNodeStatusUpdater#testNMRMConnectionConf().
> I believe the connectionWaitMs references below (marked with asterisks)
> were meant to be nmRmConnectionWaitMs.
> {code}
> conf.setLong(YarnConfiguration.NM_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS,
>     nmRmConnectionWaitMs);
> conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS,
>     connectionWaitMs);
> ...
> long t = System.currentTimeMillis();
> long duration = t - waitStartTime;
> boolean waitTimeValid = (duration >= nmRmConnectionWaitMs) &&
>     (duration < (*connectionWaitMs* + delta));
> if (!waitTimeValid) {
>   // throw exception if NM doesn't retry long enough
>   throw new Exception("NM should have tried re-connecting to RM during " +
>       "period of at least " + *connectionWaitMs* + " ms, but " +
>       "stopped retrying within " + (*connectionWaitMs* + delta) +
>       " ms: " + e, e);
> }
> {code}
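For clarity, after the fix the reporter describes, both bounds of the comparison would reference the NM-to-RM wait time. A hedged sketch of the intended shape (the authoritative change is the attached branch-2 patch; waitStartTime, delta, and e come from the surrounding test):

{code}
// Both bounds should use nmRmConnectionWaitMs, the NM->RM connection wait
// time actually configured above, not the generic connectionWaitMs.
long duration = System.currentTimeMillis() - waitStartTime;
boolean waitTimeValid = (duration >= nmRmConnectionWaitMs) &&
    (duration < (nmRmConnectionWaitMs + delta));
if (!waitTimeValid) {
  // throw exception if NM doesn't retry long enough
  throw new Exception("NM should have tried re-connecting to RM during " +
      "period of at least " + nmRmConnectionWaitMs + " ms, but " +
      "stopped retrying within " + (nmRmConnectionWaitMs + delta) +
      " ms: " + e, e);
}
{code}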
[jira] [Updated] (YARN-5834) TestNodeStatusUpdater.testNMRMConnectionConf compares nodemanager wait time to the incorrect value
[ https://issues.apache.org/jira/browse/YARN-5834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-5834:
---------------------------
    Attachment: YARN-5834-branch-2.001.patch
[jira] [Assigned] (YARN-5834) TestNodeStatusUpdater.testNMRMConnectionConf compares nodemanager wait time to the incorrect value
[ https://issues.apache.org/jira/browse/YARN-5834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li reassigned YARN-5834:
------------------------------
    Assignee: Chang Li
[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4218:
---------------------------
    Attachment: YARN-4218-branch-2.003.patch

> Metric for resource*time that was preempted
> --------------------------------------------
>
>                 Key: YARN-4218
>                 URL: https://issues.apache.org/jira/browse/YARN-4218
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Chang Li
>            Assignee: Chang Li
>         Attachments: YARN-4218-branch-2.003.patch, YARN-4218.006.patch,
>             YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch,
>             YARN-4218.2.patch, YARN-4218.3.patch, YARN-4218.4.patch,
>             YARN-4218.5.patch, YARN-4218.branch-2.2.patch,
>             YARN-4218.branch-2.patch, YARN-4218.patch,
>             YARN-4218.trunk.2.patch, YARN-4218.trunk.3.patch,
>             YARN-4218.trunk.patch, YARN-4218.wip.patch, screenshot-1.png,
>             screenshot-2.png, screenshot-3.png
>
> After YARN-415 we have the ability to track the resource*time footprint of
> a job, and the preemption metrics show how many containers were preempted
> on a job. However, we don't have a metric showing the resource*time
> footprint lost to preemption. In other words, we know how many containers
> were preempted, but we don't have a good measure of how much work was lost
> as a result. We should add this metric so we can analyze how much work
> preemption is costing on a grid and better track which jobs were heavily
> impacted by it. A job that had 100 containers preempted, each very small
> and lasting only a minute, is less impacted than a job that lost a single
> container that was huge and had been running for 3 days.
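As background for reviewers, a rough sketch of how a resource*time preemption cost could be computed per preempted container, in the spirit of the memorySeconds/vcoreSeconds usage metrics from YARN-415; the class and field names here are illustrative assumptions, not the attached patches:

{code}
// Sketch only: charge each preempted container for resource * wall-clock
// time, so 100 tiny one-minute containers and one huge 3-day container
// score very differently even though container counts alone look similar.
import org.apache.hadoop.yarn.api.records.Resource;

class PreemptionCostSketch {
  long memorySecondsLost;  // hypothetical per-application accumulators
  long vcoreSecondsLost;

  void onContainerPreempted(Resource resource, long startTimeMs,
      long preemptTimeMs) {
    long elapsedSec = (preemptTimeMs - startTimeMs) / 1000;
    memorySecondsLost += resource.getMemory() * elapsedSec;
    vcoreSecondsLost += resource.getVirtualCores() * elapsedSec;
  }
}
{code}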
[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4218:
---------------------------
    Attachment: YARN-4218.006.patch
[jira] [Commented] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15624458#comment-15624458 ]

Chang Li commented on YARN-4218:
--------------------------------

[~eepayne] Hmm, there are tons of javadoc errors on the protocol classes for missing descriptions of why YarnException and IOException are thrown. My changes never touch those protocols, so I'm not sure why it generates those errors for me. Filling in those thousands of missing javadoc tags for exceptions and parameters is probably worth a feature of its own. Also, since I didn't implement those protocols, it's hard for me to write correct descriptions.
[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4218:
---------------------------
    Attachment: YARN-4218.trunk.3.patch
[jira] [Commented] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15621566#comment-15621566 ]

Chang Li commented on YARN-4218:
--------------------------------

We were having some trouble running jobs on trunk; just figured that out. Will submit a patch for trunk soon.
[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4218:
---------------------------
    Attachment: YARN-4218.branch-2.2.patch
[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4218:
---------------------------
    Attachment: YARN-4218.5.patch
[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4218:
---------------------------
    Attachment: YARN-4218.branch-2.patch
[jira] [Commented] (YARN-4269) Log aggregation should not swallow the exception during close()
[ https://issues.apache.org/jira/browse/YARN-4269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15619473#comment-15619473 ]

Chang Li commented on YARN-4269:
--------------------------------

Thanks for taking a look at this jira, [~shaneku...@gmail.com]. I think it's not straightforward to pass in the log priority, since we are using the generic Log interface rather than a specific logger such as log4j.

> Log aggregation should not swallow the exception during close()
> ----------------------------------------------------------------
>
>                 Key: YARN-4269
>                 URL: https://issues.apache.org/jira/browse/YARN-4269
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation
>            Reporter: Chang Li
>            Assignee: Chang Li
>              Labels: oct16-easy
>         Attachments: YARN-4269.2.patch, YARN-4269.3.patch, YARN-4269.patch
>
> The log aggregation thread ignores exceptions thrown by close(). They
> shouldn't be ignored, since the file content may be missing or partial.
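For readers of the thread, a hedged sketch of the general fix shape being discussed (the authoritative change is in the attached patches; the writer and the remote file name are assumed stand-ins):

{code}
// Sketch only: propagate a close() failure instead of swallowing it, since
// the aggregated log file may be missing or partial when close() fails.
import java.io.Closeable;
import java.io.IOException;

class CloseSketch {
  void finishLogAggregation(Closeable writer, String remoteLogFile)
      throws IOException {
    try {
      writer.close();
    } catch (IOException e) {
      // Previously a failure here was dropped and aggregation was reported
      // as successful; rethrowing lets the caller mark the upload as bad.
      throw new IOException("Failed to close aggregated log file "
          + remoteLogFile, e);
    }
  }
}
{code}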
[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4218:
---------------------------
    Attachment: YARN-4218.trunk.2.patch
[jira] [Commented] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15619394#comment-15619394 ]

Chang Li commented on YARN-4218:
--------------------------------

It seems a patch generated by
{code}
git diff HEAD --no-prefix
{code}
is not accepted by Hadoop QA? Posting the .trunk patch again, generated by
{code}
git diff trunk
{code}
[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4218:
---------------------------
    Attachment: YARN-4218.trunk.patch
[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4218:
---------------------------
    Attachment: YARN-4218.4.patch
[jira] [Commented] (YARN-4935) TestYarnClient#testSubmitIncorrectQueue fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233692#comment-15233692 ]

Chang Li commented on YARN-4935:
--------------------------------

Agreed that the change in YARN-3131 is CS-specific; that change was originally driven by a problem encountered in Tez.

> TestYarnClient#testSubmitIncorrectQueue fails with FairScheduler
> -----------------------------------------------------------------
>
>                 Key: YARN-4935
>                 URL: https://issues.apache.org/jira/browse/YARN-4935
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.8.0
>            Reporter: Yufei Gu
>            Assignee: Yufei Gu
>
> This test case, introduced by YARN-3131, works well on CapacityScheduler
> but not on FairScheduler, since CS doesn't allow dynamically creating a
> queue but FS supports it. So if you give a random queue name, CS will
> reject it, but FS will create a new queue for it by default.
> One simple solution is to specify CS in this test case. /cc [~lichangleo].
> I was thinking about creating another test case for FS, but for the code
> introduced by YARN-3131 it may not be necessary.
[jira] [Commented] (YARN-4642) Commonize URL parsing code in RMWebAppFilter
[ https://issues.apache.org/jira/browse/YARN-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128823#comment-15128823 ]

Chang Li commented on YARN-4642:
--------------------------------

[~jlowe] please help review, thanks!

> Commonize URL parsing code in RMWebAppFilter
> ---------------------------------------------
>
>                 Key: YARN-4642
>                 URL: https://issues.apache.org/jira/browse/YARN-4642
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Chang Li
>            Assignee: Chang Li
>         Attachments: YARN-4642.2.patch, YARN-4642.patch
>
> A follow-up jira for YARN-4428, as suggested by [~jlowe], to commonize the
> URL parsing code and unblock progress on YARN-4428.
[jira] [Updated] (YARN-4642) Commonize URL parsing code in RMWebAppFilter
[ https://issues.apache.org/jira/browse/YARN-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4642:
---------------------------
    Attachment: YARN-4642.2.patch
[jira] [Updated] (YARN-4642) Commonize URL parsing code in RMWebAppFilter
[ https://issues.apache.org/jira/browse/YARN-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4642:
---------------------------
    Attachment: YARN-4642.patch
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4428:
---------------------------
    Attachment: YARN-4428.branch-2.7.patch

[~jlowe], uploaded the 2.7 patch. Also realized that my previous .9 patch used log.info instead of log.debug, so I updated the .10 patch to address that as well.

> Redirect RM page to AHS page when AHS turned on and RM page is not available
> -----------------------------------------------------------------------------
>
>                 Key: YARN-4428
>                 URL: https://issues.apache.org/jira/browse/YARN-4428
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Chang Li
>            Assignee: Chang Li
>         Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch,
>             YARN-4428.10.patch, YARN-4428.2.2.patch, YARN-4428.2.patch,
>             YARN-4428.3.patch, YARN-4428.3.patch, YARN-4428.4.patch,
>             YARN-4428.5.patch, YARN-4428.6.patch, YARN-4428.7.patch,
>             YARN-4428.8.patch, YARN-4428.9.test.patch,
>             YARN-4428.branch-2.7.patch
>
> When AHS is turned on, if we can't view an application on the RM page, the
> RM page should redirect us to the AHS page. For example, when you go to
> cluster/app/application_1 and the RM no longer remembers the application,
> we simply get "Failed to read the application application_1"; it would be
> good for the RM UI to try to redirect to the AHS UI at
> /applicationhistory/app/application_1 to see if it's there. This redirect
> pattern already exists for logs in the nodemanager UI.
> Also, when AHS is enabled, WebAppProxyServlet should redirect to the AHS
> page as a fallback when the RM does not remember the app. YARN-3975 tried
> to do this, but only when the original tracking URL is not set. There are
> many cases, such as when an app fails at launch, where the original
> tracking URL is set to point to the RM page, so the redirect to the AHS
> page won't work.
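To make the thread self-contained, a hedged sketch of the redirect idea (the class name and configuration source are assumptions, not the patch):

{code}
// Sketch only: when the RM no longer knows an application and AHS is
// enabled, send the browser to the application-history URL instead of an
// error page. The same mapping extends to appattempt and container pages.
import java.io.IOException;
import javax.servlet.http.HttpServletResponse;

class AhsRedirectSketch {
  private final String ahsBaseUrl;  // e.g. "http://ahs-host:8188", assumed config

  AhsRedirectSketch(String ahsBaseUrl) {
    this.ahsBaseUrl = ahsBaseUrl;
  }

  // Maps /cluster/app/<appId> to /applicationhistory/app/<appId>.
  void redirectUnknownApp(HttpServletResponse resp, String appId)
      throws IOException {
    resp.sendRedirect(ahsBaseUrl + "/applicationhistory/app/" + appId);
  }
}
{code}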
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4428:
---------------------------
    Attachment: YARN-4428.10.patch
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4428:
---------------------------
    Attachment: YARN-4428.9.test.patch

[~jlowe] sorry, I missed that. Updated the .9 patch accordingly.
[jira] [Commented] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121570#comment-15121570 ]

Chang Li commented on YARN-4428:
--------------------------------

[~jlowe] please help review the latest patch, thanks!
[jira] [Commented] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15120028#comment-15120028 ]

Chang Li commented on YARN-4428:
--------------------------------

[~jlowe] please help review the latest patch, thanks!
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4428:
---------------------------
    Attachment: YARN-4428.8.patch

[~jlowe] thanks a lot for the patient and careful review! Updated the .8 patch accordingly.
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4428:
---------------------------
    Attachment: YARN-4428.7.patch
[jira] [Created] (YARN-4642) Commonize URL parsing code in RMWebAppFilter
Chang Li created YARN-4642:
-------------------------------

             Summary: Commonize URL parsing code in RMWebAppFilter
                 Key: YARN-4642
                 URL: https://issues.apache.org/jira/browse/YARN-4642
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Chang Li
            Assignee: Chang Li

A follow-up jira for YARN-4428, as suggested by [~jlowe], to commonize the URL parsing code and unblock progress on YARN-4428.
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4428:
---------------------------
    Attachment: YARN-4428.6.patch

Thanks [~jlowe] for the review! Updated the .6 patch to address your concerns. Also opened YARN-4642 to work on commonizing the URL parsing.
[jira] [Commented] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115222#comment-15115222 ]

Chang Li commented on YARN-4428:
--------------------------------

[~jlowe] could you help review the latest patch? Thanks!
[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4218:
---------------------------
    Attachment: YARN-4218.3.patch
[jira] [Commented] (YARN-4570) Nodemanager leaking RawLocalFilesystem instances for user "testing"
[ https://issues.apache.org/jira/browse/YARN-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15112677#comment-15112677 ]

Chang Li commented on YARN-4570:
--------------------------------

Profiled several nodemanagers' heap dumps in our clusters; not able to find the "testing" RawLocalFilesystem leak so far. Close this as cannot-reproduce?

> Nodemanager leaking RawLocalFilesystem instances for user "testing"
> --------------------------------------------------------------------
>
>                 Key: YARN-4570
>                 URL: https://issues.apache.org/jira/browse/YARN-4570
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Chang Li
>
> I recently ran across a NodeManager that was running slowly due to
> excessive GC. Digging into the heap I saw that most of the issue was
> leaked filesystem statistics data objects, which has been fixed in
> HADOOP-12107. However, I also noticed there were many thousands of
> RawLocalFilesystem objects on the heap, far more than any other FileSystem
> type. Sampling a number of them showed that they were for the "testing"
> user.
[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking
[ https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107096#comment-15107096 ]

Chang Li commented on YARN-4589:
--------------------------------

[~jlowe] please help review the latest patch. The latest implementation adds a new external container state, LOCALIZING; on each node heartbeat to the RM, RMNode maintains and updates the states of its containers. When an RMAppAttempt times out, it queries RMNode for the state of its container. The implementation also takes backward compatibility into account.

> Diagnostics for localization timeouts is lacking
> -------------------------------------------------
>
>                 Key: YARN-4589
>                 URL: https://issues.apache.org/jira/browse/YARN-4589
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Chang Li
>            Assignee: Chang Li
>         Attachments: YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
> When a container takes too long to localize, it manifests as a timeout, and
> there's no indication that localization was the issue. We need diagnostics
> for timeouts to indicate that the container was still localizing when the
> timeout occurred.
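A hedged sketch of the diagnostic idea described in the comment above; the class and method names are illustrative assumptions, not the attached patch:

{code}
// Sketch only: when an AM-launch timeout fires, consult the node-reported
// external container state; if the container was still localizing, say so
// in the diagnostics instead of reporting a bare timeout.
class LocalizingDiagnosticsSketch {
  // stillLocalizing would come from the RMNode-tracked external container
  // state (LOCALIZING) reported via node heartbeats, per the comment above.
  String buildTimeoutDiagnostics(boolean stillLocalizing) {
    String diagnostics = "AM container launch timed out";
    if (stillLocalizing) {
      diagnostics += " while the container was still localizing resources;"
          + " check for slow downloads or very large local resources";
    }
    return diagnostics;
  }
}
{code}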
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4428:
---------------------------
    Attachment: YARN-4428.5.patch

The .5 patch addresses the checkstyle issues.
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4428:
---------------------------
    Attachment: YARN-4428.4.patch

Thanks [~jlowe] for the review and the good suggestions! Updated the .4 patch to also support redirects for appattempt and container pages. I have manually tested them successfully.
[jira] [Updated] (YARN-4589) Diagnostics for localization timeouts is lacking
[ https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4589: --- Attachment: YARN-4589.3.patch [~jlowe] thanks for the good suggestion to separate those two. Updated the .3 patch to keep only the YARN-related change. Also fixed the broken unit tests. > Diagnostics for localization timeouts is lacking > > > Key: YARN-4589 > URL: https://issues.apache.org/jira/browse/YARN-4589 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch > > > When a container takes too long to localize it manifests as a timeout, and > there's no indication that localization was the issue. We need diagnostics > for timeouts to indicate the container was still localizing when the timeout > occurred. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4428: --- Attachment: YARN-4428.3.patch > Redirect RM page to AHS page when AHS turned on and RM page is not available > - > > Key: YARN-4428 > URL: https://issues.apache.org/jira/browse/YARN-4428 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, > YARN-4428.2.2.patch, YARN-4428.2.patch, YARN-4428.3.patch, YARN-4428.3.patch > > > When AHS is turned on, if we can't view an application on the RM page, the RM page > should redirect us to the AHS page. For example, when you go to > cluster/app/application_1 and the RM no longer remembers the application, we will > simply get "Failed to read the application application_1", but it would be > better for the RM UI to try redirecting to the AHS UI at > /applicationhistory/app/application_1 to see if it's there. This redirect > pattern already exists for logs in the NodeManager UI. > Also, when AHS is enabled, WebAppProxyServlet should redirect to the AHS page as a > fallback when the RM does not remember the app. YARN-3975 tried to do this only when > the original tracking URL is not set. But in many cases, such as when an app > fails at launch, the original tracking URL will be set to point to the RM page, so > the redirect to the AHS page won't work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4589) Diagnostics for localization timeouts is lacking
[ https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4589: --- Attachment: YARN-4589.2.patch > Diagnostics for localization timeouts is lacking > > > Key: YARN-4589 > URL: https://issues.apache.org/jira/browse/YARN-4589 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4589.2.patch, YARN-4589.patch > > > When a container takes too long to localize it manifests as a timeout, and > there's no indication that localization was the issue. We need diagnostics > for timeouts to indicate the container was still localizing when the timeout > occurred. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4592) Remove unused GetContainerStatus proto
Chang Li created YARN-4592: -- Summary: Remove unused GetContainerStatus proto Key: YARN-4592 URL: https://issues.apache.org/jira/browse/YARN-4592 Project: Hadoop YARN Issue Type: Bug Reporter: Chang Li Assignee: Chang Li Priority: Minor Attachments: YARN-4592.patch GetContainerStatus protos have been left unused since YARN-926 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4592) Remove unused GetContainerStatus proto
[ https://issues.apache.org/jira/browse/YARN-4592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4592: --- Attachment: YARN-4592.patch > Remove unused GetContainerStatus proto > - > > Key: YARN-4592 > URL: https://issues.apache.org/jira/browse/YARN-4592 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li >Priority: Minor > Attachments: YARN-4592.patch > > > GetContainerStatus protos have been left unused since YARN-926 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4428: --- Attachment: YARN-4428.3.patch thanks [~jlowe] for the review and for pointing me to the related issue! Updated the .3 patch, which computes the redirect inside RMWebAppFilter > Redirect RM page to AHS page when AHS turned on and RM page is not available > - > > Key: YARN-4428 > URL: https://issues.apache.org/jira/browse/YARN-4428 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, > YARN-4428.2.2.patch, YARN-4428.2.patch, YARN-4428.3.patch > > > When AHS is turned on, if we can't view an application on the RM page, the RM page > should redirect us to the AHS page. For example, when you go to > cluster/app/application_1 and the RM no longer remembers the application, we will > simply get "Failed to read the application application_1", but it would be > better for the RM UI to try redirecting to the AHS UI at > /applicationhistory/app/application_1 to see if it's there. This redirect > pattern already exists for logs in the NodeManager UI. > Also, when AHS is enabled, WebAppProxyServlet should redirect to the AHS page as a > fallback when the RM does not remember the app. YARN-3975 tried to do this only when > the original tracking URL is not set. But in many cases, such as when an app > fails at launch, the original tracking URL will be set to point to the RM page, so > the redirect to the AHS page won't work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
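To make the redirect mechanics concrete, here is a hedged sketch of the idea: a servlet filter that sends unknown-application requests to the AHS UI. The class name, the AHS address, and the rmKnowsApp check are illustrative assumptions, not the actual RMWebAppFilter code.

{code}
// Sketch of the redirect idea only; not the committed RMWebAppFilter.
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class AhsRedirectFilter implements Filter {
  private String ahsBase = "http://ahs-host:8188"; // assumed AHS web address

  @Override public void init(FilterConfig cfg) {}
  @Override public void destroy() {}

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest httpReq = (HttpServletRequest) req;
    HttpServletResponse httpResp = (HttpServletResponse) resp;
    String uri = httpReq.getRequestURI();
    // If the RM no longer remembers the app, send the browser to the AHS page.
    if (uri.matches(".*/cluster/app/application_\\d+_\\d+.*") && !rmKnowsApp(uri)) {
      String appId = uri.substring(uri.indexOf("application_"));
      httpResp.sendRedirect(ahsBase + "/applicationhistory/app/" + appId);
      return;
    }
    chain.doFilter(req, resp);
  }

  private boolean rmKnowsApp(String uri) {
    return false; // placeholder: the real filter consults the RM context
  }
}
{code}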
[jira] [Updated] (YARN-4589) Diagnostics for localization timeouts is lacking
[ https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4589: --- Attachment: YARN-4589.patch > Diagnostics for localization timeouts is lacking > > > Key: YARN-4589 > URL: https://issues.apache.org/jira/browse/YARN-4589 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4589.patch > > > When a container takes too long to localize it manifests as a timeout, and > there's no indication that localization was the issue. We need diagnostics > for timeouts to indicate the container was still localizing when the timeout > occurred. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4589) Diagnostics for localization timeouts is lacking
Chang Li created YARN-4589: -- Summary: Diagnostics for localization timeouts is lacking Key: YARN-4589 URL: https://issues.apache.org/jira/browse/YARN-4589 Project: Hadoop YARN Issue Type: Improvement Reporter: Chang Li Assignee: Chang Li When a container takes too long to localize it manifests as a timeout, and there's no indication that localization was the issue. We need diagnostics for timeouts to indicate the container was still localizing when the timeout occurred. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4414: --- Attachment: YARN-4414.3.patch Thanks [~jlowe] for spotting that! updated .3 to remove the comment > Nodemanager connection errors are retried at multiple levels > > > Key: YARN-4414 > URL: https://issues.apache.org/jira/browse/YARN-4414 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.1, 2.6.2 >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, > YARN-4414.1.3.patch, YARN-4414.1.patch, YARN-4414.2.patch, YARN-4414.3.patch > > > This is related to YARN-3238. Ran into more scenarios where connection > errors are being retried at multiple levels, like NoRouteToHostException. > The fix for YARN-3238 was too specific, and I think we need a more general > solution to catch a wider array of connection errors that can occur to avoid > retrying them both at the RPC layer and at the NM proxy layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4414) Nodemanager connection errors are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092408#comment-15092408 ] Chang Li commented on YARN-4414: [~jlowe], could you help review the latest patch? Thx > Nodemanager connection errors are retried at multiple levels > > > Key: YARN-4414 > URL: https://issues.apache.org/jira/browse/YARN-4414 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.1, 2.6.2 >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, > YARN-4414.1.3.patch, YARN-4414.1.patch, YARN-4414.2.patch > > > This is related to YARN-3238. Ran into more scenarios where connection > errors are being retried at multiple levels, like NoRouteToHostException. > The fix for YARN-3238 was too specific, and I think we need a more general > solution to catch a wider array of connection errors that can occur to avoid > retrying them both at the RPC layer and at the NM proxy layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-4570) Nodemanager leaking RawLocalFilesystem instances for user "testing"
[ https://issues.apache.org/jira/browse/YARN-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li reassigned YARN-4570: -- Assignee: Chang Li > Nodemanager leaking RawLocalFilesystem instances for user "testing" > --- > > Key: YARN-4570 > URL: https://issues.apache.org/jira/browse/YARN-4570 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Chang Li > > I recently ran across a NodeManager that was running slowly due to excessive > GC. Digging into the heap I saw that most of the issue was leaked filesystem > statistics data objects which has been fixed in HADOOP-12107. However I also > noticed there were many thousands of RawLocalFilesystem objects on the heap, > far more than any other FileSystem type. Sampling a number of them showed > that they were for the "testing" user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4414) Nodemanager connection errors are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089339#comment-15089339 ] Chang Li commented on YARN-4414: Hi [~xinxianyin], RM HA already disables IPC retries. Also, the client should try very hard to connect to the RM, because it is a catastrophic failure if it can't; failure to connect to an NM is not. I think we should just make the change for NMProxy in this JIRA. > Nodemanager connection errors are retried at multiple levels > > > Key: YARN-4414 > URL: https://issues.apache.org/jira/browse/YARN-4414 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.1, 2.6.2 >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, > YARN-4414.1.3.patch, YARN-4414.1.patch, YARN-4414.2.patch > > > This is related to YARN-3238. Ran into more scenarios where connection > errors are being retried at multiple levels, like NoRouteToHostException. > The fix for YARN-3238 was too specific, and I think we need a more general > solution to catch a wider array of connection errors that can occur to avoid > retrying them both at the RPC layer and at the NM proxy layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4414: --- Attachment: YARN-4414.2.patch Thanks [~jlowe] for the review! Updated the .2 patch to remove getNMProxy2 and implemented getProxy() in terms of getProxy(Configuration). I set the NM address to a dummy value, 1234, so that it triggers a connection error and RPC-level retries. {{BaseContainerManagerTest}} sets it to {code}"0.0.0.0:" + ServerSocketUtil.getPort(49162, 10); {code} a normal address, so the RPC retry could not be triggered > Nodemanager connection errors are retried at multiple levels > > > Key: YARN-4414 > URL: https://issues.apache.org/jira/browse/YARN-4414 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.1, 2.6.2 >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, > YARN-4414.1.3.patch, YARN-4414.1.patch, YARN-4414.2.patch > > > This is related to YARN-3238. Ran into more scenarios where connection > errors are being retried at multiple levels, like NoRouteToHostException. > The fix for YARN-3238 was too specific, and I think we need a more general > solution to catch a wider array of connection errors that can occur to avoid > retrying them both at the RPC layer and at the NM proxy layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
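For readers following the retry discussion, the sketch below illustrates the general technique of mapping connection-type exceptions to a fail-fast policy at the RPC layer so that only the NM proxy layer retries them. It uses the existing RetryPolicies.retryByException API; the exact exception set and policies in the committed patch may differ.

{code}
// Illustrative sketch: fail fast at the RPC layer for connection-type errors
// so they are retried only once, at the NMProxy level.
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class NmConnectionRetryPolicySketch {
  static RetryPolicy create(RetryPolicy rpcDefault) {
    Map<Class<? extends Exception>, RetryPolicy> map = new HashMap<>();
    // Connection-level failures should not also be retried at the RPC layer.
    map.put(ConnectException.class, RetryPolicies.TRY_ONCE_THEN_FAIL);
    map.put(NoRouteToHostException.class, RetryPolicies.TRY_ONCE_THEN_FAIL);
    map.put(UnknownHostException.class, RetryPolicies.TRY_ONCE_THEN_FAIL);
    return RetryPolicies.retryByException(rpcDefault, map);
  }
}
{code}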
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4428: --- Attachment: YARN-4428.2.2.patch .2.2 addressed the whitespace issue > Redirect RM page to AHS page when AHS turned on and RM page is not available > - > > Key: YARN-4428 > URL: https://issues.apache.org/jira/browse/YARN-4428 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, > YARN-4428.2.2.patch, YARN-4428.2.patch > > > When AHS is turned on, if we can't view an application on the RM page, the RM page > should redirect us to the AHS page. For example, when you go to > cluster/app/application_1 and the RM no longer remembers the application, we will > simply get "Failed to read the application application_1", but it would be > better for the RM UI to try redirecting to the AHS UI at > /applicationhistory/app/application_1 to see if it's there. This redirect > pattern already exists for logs in the NodeManager UI. > Also, when AHS is enabled, WebAppProxyServlet should redirect to the AHS page as a > fallback when the RM does not remember the app. YARN-3975 tried to do this only when > the original tracking URL is not set. But in many cases, such as when an app > fails at launch, the original tracking URL will be set to point to the RM page, so > the redirect to the AHS page won't work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4428: --- Attachment: YARN-4428.2.patch .2 patch added a unit test for the change in RMAppAttemptImpl. The web redirect is hard to cover with a unit test, so I tested it manually: first I ran a sleep job with invalid options, such as sleep -Dyarn.app.mapreduce.am.command-opts="-abc" -m 1, so the job crashes during launch. Then I shut down the RM, cleared the state store, and bounced the RM back up. Then I visited the crashed app in the RM UI and verified that I was redirected to the AHS page for that app. > Redirect RM page to AHS page when AHS turned on and RM page is not available > - > > Key: YARN-4428 > URL: https://issues.apache.org/jira/browse/YARN-4428 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch, YARN-4428.2.patch > > > When AHS is turned on, if we can't view an application on the RM page, the RM page > should redirect us to the AHS page. For example, when you go to > cluster/app/application_1 and the RM no longer remembers the application, we will > simply get "Failed to read the application application_1", but it would be > better for the RM UI to try redirecting to the AHS UI at > /applicationhistory/app/application_1 to see if it's there. This redirect > pattern already exists for logs in the NodeManager UI. > Also, when AHS is enabled, WebAppProxyServlet should redirect to the AHS page as a > fallback when the RM does not remember the app. YARN-3975 tried to do this only when > the original tracking URL is not set. But in many cases, such as when an app > fails at launch, the original tracking URL will be set to point to the RM page, so > the redirect to the AHS page won't work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4428: --- Attachment: YARN-4428.1.2.patch .1.2 patch addressed checkstyle issues > Redirect RM page to AHS page when AHS turned on and RM page is not available > - > > Key: YARN-4428 > URL: https://issues.apache.org/jira/browse/YARN-4428 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4428.1.2.patch, YARN-4428.1.patch > > > When AHS is turned on, if we can't view an application on the RM page, the RM page > should redirect us to the AHS page. For example, when you go to > cluster/app/application_1 and the RM no longer remembers the application, we will > simply get "Failed to read the application application_1", but it would be > better for the RM UI to try redirecting to the AHS UI at > /applicationhistory/app/application_1 to see if it's there. This redirect > pattern already exists for logs in the NodeManager UI. > Also, when AHS is enabled, WebAppProxyServlet should redirect to the AHS page as a > fallback when the RM does not remember the app. YARN-3975 tried to do this only when > the original tracking URL is not set. But in many cases, such as when an app > fails at launch, the original tracking URL will be set to point to the RM page, so > the redirect to the AHS page won't work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4414: --- Attachment: YARN-4414.1.3.patch Oops, my bad, I intended to name the latest patch .1.3. Removed the .2.2 patch and re-uploaded the latest as .1.3 > Nodemanager connection errors are retried at multiple levels > > > Key: YARN-4414 > URL: https://issues.apache.org/jira/browse/YARN-4414 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.1, 2.6.2 >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, > YARN-4414.1.3.patch, YARN-4414.1.patch > > > This is related to YARN-3238. Ran into more scenarios where connection > errors are being retried at multiple levels, like NoRouteToHostException. > The fix for YARN-3238 was too specific, and I think we need a more general > solution to catch a wider array of connection errors that can occur to avoid > retrying them both at the RPC layer and at the NM proxy layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4414: --- Attachment: (was: YARN-4414.2.2.patch) > Nodemanager connection errors are retried at multiple levels > > > Key: YARN-4414 > URL: https://issues.apache.org/jira/browse/YARN-4414 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.1, 2.6.2 >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, > YARN-4414.1.patch > > > This is related to YARN-3238. Ran into more scenarios where connection > errors are being retried at multiple levels, like NoRouteToHostException. > The fix for YARN-3238 was too specific, and I think we need a more general > solution to catch a wider array of connection errors that can occur to avoid > retrying them both at the RPC layer and at the NM proxy layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4414: --- Attachment: YARN-4414.2.2.patch .2.2 fixes the whitespace issue > Nodemanager connection errors are retried at multiple levels > > > Key: YARN-4414 > URL: https://issues.apache.org/jira/browse/YARN-4414 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.1, 2.6.2 >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, > YARN-4414.1.patch, YARN-4414.2.2.patch > > > This is related to YARN-3238. Ran into more scenarios where connection > errors are being retried at multiple levels, like NoRouteToHostException. > The fix for YARN-3238 was too specific, and I think we need a more general > solution to catch a wider array of connection errors that can occur to avoid > retrying them both at the RPC layer and at the NM proxy layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4414: --- Attachment: YARN-4414.1.2.patch > Nodemanager connection errors are retried at multiple levels > > > Key: YARN-4414 > URL: https://issues.apache.org/jira/browse/YARN-4414 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.1, 2.6.2 >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, > YARN-4414.1.patch > > > This is related to YARN-3238. Ran into more scenarios where connection > errors are being retried at multiple levels, like NoRouteToHostException. > The fix for YARN-3238 was too specific, and I think we need a more general > solution to catch a wider array of connection errors that can occur to avoid > retrying them both at the RPC layer and at the NM proxy layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4218: --- Attachment: YARN-4218.2.patch > Metric for resource*time that was preempted > --- > > Key: YARN-4218 > URL: https://issues.apache.org/jira/browse/YARN-4218 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, > YARN-4218.2.patch, YARN-4218.patch, YARN-4218.wip.patch, screenshot-1.png, > screenshot-2.png, screenshot-3.png > > > After YARN-415 we have the ability to track the resource*time footprint of a > job and preemption metrics shows how many containers were preempted on a job. > However we don't have a metric showing the resource*time footprint cost of > preemption. In other words, we know how many containers were preempted but we > don't have a good measure of how much work was lost as a result of preemption. > We should add this metric so we can analyze how much work preemption is > costing on a grid and better track which jobs were heavily impacted by it. A > job that has 100 containers preempted that only lasted a minute each and were > very small is going to be less impacted than a job that only lost a single > container but that container was huge and had been running for 3 days. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
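As a rough illustration of the metric's arithmetic (how much resource*time is lost when a container is preempted), here is a hedged sketch; the field and method names are made up for the example and only mirror the memorySeconds/vcoreSeconds style of YARN-415. A container holding 8192 MB for three days contributes vastly more preempted memory-seconds than a hundred small one-minute containers, which is exactly the distinction the metric is meant to capture.

{code}
// Hedged sketch of the resource*time-preempted arithmetic; illustrative names.
public class PreemptedResourceSeconds {
  long preemptedMemorySeconds;
  long preemptedVcoreSeconds;

  // Called when a container is preempted: charge its full lifetime footprint.
  void onPreemption(long memoryMb, int vcores, long startMs, long preemptMs) {
    long seconds = Math.max(0, (preemptMs - startMs) / 1000);
    preemptedMemorySeconds += memoryMb * seconds;
    preemptedVcoreSeconds += (long) vcores * seconds;
  }
}
{code}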
[jira] [Commented] (YARN-4341) add doc about timeline performance tool usage
[ https://issues.apache.org/jira/browse/YARN-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052983#comment-15052983 ] Chang Li commented on YARN-4341: [~sjlee0], could you please take a look and see if the latest patch is good? Thanks! > add doc about timeline performance tool usage > - > > Key: YARN-4341 > URL: https://issues.apache.org/jira/browse/YARN-4341 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4341.2.patch, YARN-4341.3.patch, YARN-4341.4.patch, > YARN-4341.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4414: --- Attachment: YARN-4414.1.patch > Nodemanager connection errors are retried at multiple levels > > > Key: YARN-4414 > URL: https://issues.apache.org/jira/browse/YARN-4414 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.1, 2.6.2 >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4414.1.patch > > > This is related to YARN-3238. Ran into more scenarios where connection > errors are being retried at multiple levels, like NoRouteToHostException. > The fix for YARN-3238 was too specific, and I think we need a more general > solution to catch a wider array of connection errors that can occur to avoid > retrying them both at the RPC layer and at the NM proxy layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4414) Nodemanager connection errors are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4414: --- Attachment: YARN-4414.1.2.patch > Nodemanager connection errors are retried at multiple levels > > > Key: YARN-4414 > URL: https://issues.apache.org/jira/browse/YARN-4414 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.1, 2.6.2 >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4414.1.2.patch, YARN-4414.1.patch > > > This is related to YARN-3238. Ran into more scenarios where connection > errors are being retried at multiple levels, like NoRouteToHostException. > The fix for YARN-3238 was too specific, and I think we need a more general > solution to catch a wider array of connection errors that can occur to avoid > retrying them both at the RPC layer and at the NM proxy layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4341) add doc about timeline performance tool usage
[ https://issues.apache.org/jira/browse/YARN-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4341: --- Attachment: YARN-4341.4.patch Thanks for detecting the issue, [~sjlee0]. Updated the .4 patch to fix it > add doc about timeline performance tool usage > - > > Key: YARN-4341 > URL: https://issues.apache.org/jira/browse/YARN-4341 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4341.2.patch, YARN-4341.3.patch, YARN-4341.4.patch, > YARN-4341.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4341) add doc about timeline performance tool usage
[ https://issues.apache.org/jira/browse/YARN-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4341: --- Attachment: YARN-4341.3.patch Thanks a lot, [~sjlee0], for the review! Updated the .3 patch and addressed your suggestions there. > add doc about timeline performance tool usage > - > > Key: YARN-4341 > URL: https://issues.apache.org/jira/browse/YARN-4341 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4341.2.patch, YARN-4341.3.patch, YARN-4341.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4428: --- Attachment: YARN-4428.1.patch > Redirect RM page to AHS page when AHS turned on and RM page is not available > - > > Key: YARN-4428 > URL: https://issues.apache.org/jira/browse/YARN-4428 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4428.1.patch > > > When AHS is turned on, if we can't view an application on the RM page, the RM page > should redirect us to the AHS page. Also, there is a corner case: when an application > fails during launch, its original tracking URL won't be set correctly, and > WebAppProxyServlet won't redirect us to the AHS page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
Chang Li created YARN-4428: -- Summary: Redirect RM page to AHS page when AHS turned on and RM page is not available Key: YARN-4428 URL: https://issues.apache.org/jira/browse/YARN-4428 Project: Hadoop YARN Issue Type: Bug Reporter: Chang Li Assignee: Chang Li When AHS is turned on, if we can't view an application on the RM page, the RM page should redirect us to the AHS page. Also, there is a corner case: when an application fails during launch, its original tracking URL won't be set correctly, and WebAppProxyServlet won't redirect us to the AHS page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4428: --- Description: When AHS is turned on, if we can't view an application on the RM page, the RM page should redirect us to the AHS page. Also, there is a corner case: when an application fails during launch, its original tracking URL won't be set correctly, and WebAppProxyServlet won't redirect us to the AHS page as YARN-3975 designed. (was: When AHS is turned on, if we can't view an application on the RM page, the RM page should redirect us to the AHS page. Also, there is a corner case: when an application fails during launch, its original tracking URL won't be set correctly, and WebAppProxyServlet won't redirect us to the AHS page.) > Redirect RM page to AHS page when AHS turned on and RM page is not available > - > > Key: YARN-4428 > URL: https://issues.apache.org/jira/browse/YARN-4428 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4428.1.patch > > > When AHS is turned on, if we can't view an application on the RM page, the RM page > should redirect us to the AHS page. Also, there is a corner case: when an application > fails during launch, its original tracking URL won't be set correctly, and > WebAppProxyServlet won't redirect us to the AHS page as YARN-3975 designed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-4414) Nodemanager connection errors are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li reassigned YARN-4414: -- Assignee: Chang Li > Nodemanager connection errors are retried at multiple levels > > > Key: YARN-4414 > URL: https://issues.apache.org/jira/browse/YARN-4414 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.1, 2.6.2 >Reporter: Jason Lowe >Assignee: Chang Li > > This is related to YARN-3238. Ran into more scenarios where connection > errors are being retried at multiple levels, like NoRouteToHostException. > The fix for YARN-3238 was too specific, and I think we need a more general > solution to catch a wider array of connection errors that can occur to avoid > retrying them both at the RPC layer and at the NM proxy layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4428) Redirect RM page to AHS page when AHS turned on and RM page is not available
[ https://issues.apache.org/jira/browse/YARN-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4428: --- Description: When AHS is turned on, if we can't view an application on the RM page, the RM page should redirect us to the AHS page. For example, when you go to cluster/app/application_1 and the RM no longer remembers the application, we will simply get "Failed to read the application application_1", but it would be better for the RM UI to try redirecting to the AHS UI at /applicationhistory/app/application_1 to see if it's there. This redirect pattern already exists for logs in the NodeManager UI. Also, when AHS is enabled, WebAppProxyServlet should redirect to the AHS page as a fallback when the RM does not remember the app. YARN-3975 tried to do this only when the original tracking URL is not set. But in many cases, such as when an app fails at launch, the original tracking URL will be set to point to the RM page, so the redirect to the AHS page won't work. was: When AHS is turned on, if we can't view an application on the RM page, the RM page should redirect us to the AHS page. For example, when you go to cluster/app/application_1 and the RM no longer remembers the application, we will simply get "Failed to read the application application_1", but it would be better for the RM UI to try redirecting to the AHS UI at /applicationhistory/app/application_1 to see if it's there. This redirect pattern already exists for logs in the NodeManager UI. Also, when AHS is enabled, WebAppProxyServlet should redirect to the AHS page. YARN-3975 tried to do this only when the original tracking URL is not set. But in many cases, such as when an app fails at launch, the original tracking URL will be set to point to the RM page, so the redirect to the AHS page won't work. > Redirect RM page to AHS page when AHS turned on and RM page is not available > - > > Key: YARN-4428 > URL: https://issues.apache.org/jira/browse/YARN-4428 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4428.1.patch > > > When AHS is turned on, if we can't view an application on the RM page, the RM page > should redirect us to the AHS page. For example, when you go to > cluster/app/application_1 and the RM no longer remembers the application, we will > simply get "Failed to read the application application_1", but it would be > better for the RM UI to try redirecting to the AHS UI at > /applicationhistory/app/application_1 to see if it's there. This redirect > pattern already exists for logs in the NodeManager UI. > Also, when AHS is enabled, WebAppProxyServlet should redirect to the AHS page as a > fallback when the RM does not remember the app. YARN-3975 tried to do this only when > the original tracking URL is not set. But in many cases, such as when an app > fails at launch, the original tracking URL will be set to point to the RM page, so > the redirect to the AHS page won't work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4375) CapacityScheduler needs more debug logging for why queues don't get containers
[ https://issues.apache.org/jira/browse/YARN-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4375: --- Attachment: YARN-4375.patch [~sunilg], thanks for pointing me to those ongoing efforts. What I want to accomplish in this JIRA is simply to add more debug logging about what might go wrong when allocating containers to a queue. I have uploaded a patch which adds more debug logging in the regular container allocator to indicate problems with allocating containers. > CapacityScheduler needs more debug logging for why queues don't get containers > -- > > Key: YARN-4375 > URL: https://issues.apache.org/jira/browse/YARN-4375 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4375.patch > > > CapacityScheduler needs more debug logging for why queues don't get containers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
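A minimal sketch of the kind of guarded debug logging described above; the message text and the skip reason shown here are illustrative, not the patch's actual log lines.

{code}
// Illustrative sketch of guarded debug logging in an allocator code path.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class AllocatorDebugLogging {
  private static final Log LOG = LogFactory.getLog(AllocatorDebugLogging.class);

  void logSkippedQueue(String queue, String reason) {
    // Guard the call so the string concatenation is skipped when debug is off.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Queue " + queue + " was not assigned a container: " + reason);
    }
  }
}
{code}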
[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
[ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4334: --- Attachment: YARN-4334.4.2.patch > Ability to avoid ResourceManager recovery if state store is "too old" > - > > Key: YARN-4334 > URL: https://issues.apache.org/jira/browse/YARN-4334 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4334.2.patch, YARN-4334.3.patch, > YARN-4334.4.2.patch, YARN-4334.4.patch, YARN-4334.patch, > YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, YARN-4334.wip.4.patch, > YARN-4334.wip.patch > > > There are times when a ResourceManager has been down long enough that > ApplicationMasters and potentially external client-side monitoring mechanisms > have given up completely. If the ResourceManager starts back up and tries to > recover we can get into situations where the RM launches new application > attempts for the AMs that gave up, but then the client _also_ launches > another instance of the app because it assumed everything was dead. > It would be nice if the RM could be optionally configured to avoid trying to > recover if the state store was "too old." The RM would come up without any > applications recovered, but we would avoid a double-submission situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
[ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4334: --- Attachment: YARN-4334.4.patch .4 patch fixes some checkstyle and whitespace issues. The TestWebApp failure is tracked by YARN-4379 and is not related to my change. The TestAMAuthorization and TestClientRMTokens failures are not caused by my patch either. [~jlowe], please help review the latest patch, thanks! > Ability to avoid ResourceManager recovery if state store is "too old" > - > > Key: YARN-4334 > URL: https://issues.apache.org/jira/browse/YARN-4334 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4334.2.patch, YARN-4334.3.patch, YARN-4334.4.patch, > YARN-4334.patch, YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, > YARN-4334.wip.4.patch, YARN-4334.wip.patch > > > There are times when a ResourceManager has been down long enough that > ApplicationMasters and potentially external client-side monitoring mechanisms > have given up completely. If the ResourceManager starts back up and tries to > recover we can get into situations where the RM launches new application > attempts for the AMs that gave up, but then the client _also_ launches > another instance of the app because it assumed everything was dead. > It would be nice if the RM could be optionally configured to avoid trying to > recover if the state store was "too old." The RM would come up without any > applications recovered, but we would avoid a double-submission situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
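For illustration, a hedged sketch of the staleness check on recovery: comparing a persisted heartbeat timestamp against a configured maximum age, per the discussion above. The method name and threshold handling are assumptions, not the patch's code.

{code}
// Hedged sketch of the "state store too old" check; names are illustrative.
public class StateStoreStalenessCheck {
  static boolean tooOldToRecover(long lastHeartbeatMs, long maxAgeMs) {
    // If the store's last heartbeat is older than the configured threshold,
    // skip recovery to avoid double-submitting apps whose clients gave up.
    return maxAgeMs > 0 && (System.currentTimeMillis() - lastHeartbeatMs) > maxAgeMs;
  }

  public static void main(String[] args) {
    long lastBeat = System.currentTimeMillis() - 2 * 60 * 60 * 1000; // 2h ago
    System.out.println(tooOldToRecover(lastBeat, 60 * 60 * 1000)); // true: skip recovery
  }
}
{code}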
[jira] [Commented] (YARN-4132) Nodemanagers should try harder to connect to the RM
[ https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022161#comment-15022161 ] Chang Li commented on YARN-4132: The TestWebApp failure is tracked by YARN-4379 and is not related to my change. [~jlowe], please help review the updated patch, thanks! > Nodemanagers should try harder to connect to the RM > --- > > Key: YARN-4132 > URL: https://issues.apache.org/jira/browse/YARN-4132 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4132.2.patch, YARN-4132.3.patch, YARN-4132.4.patch, > YARN-4132.5.patch, YARN-4132.6.2.patch, YARN-4132.6.patch, YARN-4132.7.patch, > YARN-4132.patch > > > Being part of the cluster, nodemanagers should try very hard (and possibly > never give up) to connect to a resourcemanager. Minimally we should have a > separate config to set how aggressively a nodemanager will connect to the RM > separate from what clients will do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
[ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4334: --- Attachment: YARN-4334.2.patch Fixed the broken TestLeveldbRMStateStore and TestYarnConfigurationFields tests. The other broken tests are not related. Also added additional tests for the state store. > Ability to avoid ResourceManager recovery if state store is "too old" > - > > Key: YARN-4334 > URL: https://issues.apache.org/jira/browse/YARN-4334 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4334.2.patch, YARN-4334.patch, > YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, YARN-4334.wip.4.patch, > YARN-4334.wip.patch > > > There are times when a ResourceManager has been down long enough that > ApplicationMasters and potentially external client-side monitoring mechanisms > have given up completely. If the ResourceManager starts back up and tries to > recover we can get into situations where the RM launches new application > attempts for the AMs that gave up, but then the client _also_ launches > another instance of the app because it assumed everything was dead. > It would be nice if the RM could be optionally configured to avoid trying to > recover if the state store was "too old." The RM would come up without any > applications recovered, but we would avoid a double-submission situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4374) RM scheduler UI rounds user limit factor
[ https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018709#comment-15018709 ] Chang Li commented on YARN-4374: Thanks [~jlowe] for the further review! I am good with going with the first patch > RM scheduler UI rounds user limit factor > > > Key: YARN-4374 > URL: https://issues.apache.org/jira/browse/YARN-4374 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch > > > RM scheduler UI rounds user limit factor, such as from 0.25 up to 0.3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
[ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4334: --- Attachment: YARN-4334.3.patch > Ability to avoid ResourceManager recovery if state store is "too old" > - > > Key: YARN-4334 > URL: https://issues.apache.org/jira/browse/YARN-4334 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4334.2.patch, YARN-4334.3.patch, YARN-4334.patch, > YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, YARN-4334.wip.4.patch, > YARN-4334.wip.patch > > > There are times when a ResourceManager has been down long enough that > ApplicationMasters and potentially external client-side monitoring mechanisms > have given up completely. If the ResourceManager starts back up and tries to > recover we can get into situations where the RM launches new application > attempts for the AMs that gave up, but then the client _also_ launches > another instance of the app because it assumed everything was dead. > It would be nice if the RM could be optionally configured to avoid trying to > recover if the state store was "too old." The RM would come up without any > applications recovered, but we would avoid a double-submission situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4132) Nodemanagers should try harder to connect to the RM
[ https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4132: --- Attachment: YARN-4132.7.patch Thanks [~jlowe] for the further review! Have updated the .7 patch accordingly > Nodemanagers should try harder to connect to the RM > --- > > Key: YARN-4132 > URL: https://issues.apache.org/jira/browse/YARN-4132 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4132.2.patch, YARN-4132.3.patch, YARN-4132.4.patch, > YARN-4132.5.patch, YARN-4132.6.2.patch, YARN-4132.6.patch, YARN-4132.7.patch, > YARN-4132.patch > > > Being part of the cluster, nodemanagers should try very hard (and possibly > never give up) to connect to a resourcemanager. Minimally we should have a > separate config to set how aggressively a nodemanager will connect to the RM > separate from what clients will do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
[ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4334: --- Attachment: YARN-4334.patch Added a unit test > Ability to avoid ResourceManager recovery if state store is "too old" > - > > Key: YARN-4334 > URL: https://issues.apache.org/jira/browse/YARN-4334 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4334.patch, YARN-4334.wip.2.patch, > YARN-4334.wip.3.patch, YARN-4334.wip.4.patch, YARN-4334.wip.patch > > > There are times when a ResourceManager has been down long enough that > ApplicationMasters and potentially external client-side monitoring mechanisms > have given up completely. If the ResourceManager starts back up and tries to > recover we can get into situations where the RM launches new application > attempts for the AMs that gave up, but then the client _also_ launches > another instance of the app because it assumed everything was dead. > It would be nice if the RM could be optionally configured to avoid trying to > recover if the state store was "too old." The RM would come up without any > applications recovered, but we would avoid a double-submission situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4374) RM scheduler UI rounds user limit factor
[ https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15014573#comment-15014573 ] Chang Li commented on YARN-4374: [~jlowe], please help review the latest patch > RM scheduler UI rounds user limit factor > > > Key: YARN-4374 > URL: https://issues.apache.org/jira/browse/YARN-4374 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch > > > RM scheduler UI rounds user limit factor, such as from 0.25 up to 0.3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4374) RM scheduler UI rounds user limit factor
Chang Li created YARN-4374: -- Summary: RM scheduler UI rounds user limit factor Key: YARN-4374 URL: https://issues.apache.org/jira/browse/YARN-4374 Project: Hadoop YARN Issue Type: Bug Reporter: Chang Li Assignee: Chang Li RM scheduler UI rounds user limit factor, such as from 0.25 up to 0.3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4374) RM scheduler UI rounds user limit factor
[ https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4374: --- Attachment: YARN-4374.patch > RM scheduler UI rounds user limit factor > > > Key: YARN-4374 > URL: https://issues.apache.org/jira/browse/YARN-4374 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4374.patch > > > RM scheduler UI rounds user limit factor, such as from 0.25 up to 0.3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4375) CapacityScheduler needs more debug logging for why queues don't get containers
Chang Li created YARN-4375: -- Summary: CapacityScheduler needs more debug logging for why queues don't get containers Key: YARN-4375 URL: https://issues.apache.org/jira/browse/YARN-4375 Project: Hadoop YARN Issue Type: Bug Reporter: Chang Li Assignee: Chang Li CapacityScheduler needs more debug logging for why queues don't get containers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4374) RM scheduler UI rounds user limit factor
[ https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4374: --- Attachment: Screenshot1.png Attached a screenshot to show that the patch fixes the problem. [~jlowe] please help review > RM scheduler UI rounds user limit factor > > > Key: YARN-4374 > URL: https://issues.apache.org/jira/browse/YARN-4374 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: Screenshot1.png, YARN-4374.patch > > > RM scheduler UI rounds user limit factor, such as from 0.25 up to 0.3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4374) RM scheduler UI rounds user limit factor
[ https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4374: --- Attachment: YARN-4374.2.patch Thanks [~jlowe] for the review. Agreed. Updated the .2 patch with %.3f precision. In addition, the .2 patch handles trailing zeros. > RM scheduler UI rounds user limit factor > > > Key: YARN-4374 > URL: https://issues.apache.org/jira/browse/YARN-4374 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch > > > RM scheduler UI rounds user limit factor, such as from 0.25 up to 0.3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
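A small sketch of the formatting fix this comment describes: three-decimal precision with trailing zeros trimmed, so 0.25 is no longer rounded up to 0.3. The helper name is illustrative, not the patch's.

{code}
// Sketch: format with %.3f, then trim trailing zeros (and a trailing dot).
public class UserLimitFactorFormat {
  static String format(float userLimitFactor) {
    String s = String.format("%.3f", userLimitFactor);
    return s.contains(".") ? s.replaceAll("0+$", "").replaceAll("\\.$", "") : s;
  }

  public static void main(String[] args) {
    System.out.println(format(0.25f));  // 0.25 (not 0.3)
    System.out.println(format(0.1f));   // 0.1
    System.out.println(format(1.0f));   // 1
  }
}
{code}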
[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
[ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4334: --- Attachment: YARN-4334.wip.4.patch Updated the wip.4 patch to add an implementation for ZKRMStateStore. Also made the RM heartbeat service to the state store configurable, as well as the check for state store staleness on recovery > Ability to avoid ResourceManager recovery if state store is "too old" > - > > Key: YARN-4334 > URL: https://issues.apache.org/jira/browse/YARN-4334 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, > YARN-4334.wip.4.patch, YARN-4334.wip.patch > > > There are times when a ResourceManager has been down long enough that > ApplicationMasters and potentially external client-side monitoring mechanisms > have given up completely. If the ResourceManager starts back up and tries to > recover we can get into situations where the RM launches new application > attempts for the AMs that gave up, but then the client _also_ launches > another instance of the app because it assumed everything was dead. > It would be nice if the RM could be optionally configured to avoid trying to > recover if the state store was "too old." The RM would come up without any > applications recovered, but we would avoid a double-submission situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
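To make the heartbeat idea concrete, here is a hedged sketch of a daemon thread that periodically persists a timestamp so a later restart can judge the store's age; how each store (leveldb, filesystem, ZK) persists the value is left as a placeholder.

{code}
// Hedged sketch of a state-store heartbeat writer; persistence is stubbed out.
public class StateStoreHeartbeat extends Thread {
  private final long intervalMs;
  private volatile boolean running = true;

  StateStoreHeartbeat(long intervalMs) {
    this.intervalMs = intervalMs;
    setDaemon(true);
  }

  @Override public void run() {
    while (running) {
      writeTimestamp(System.currentTimeMillis()); // persisted to the store
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        running = false;
      }
    }
  }

  void writeTimestamp(long now) {
    // Placeholder: each store type persists this value differently.
    System.out.println("heartbeat: " + now);
  }
}
{code}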
[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
[ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4334: --- Attachment: YARN-4334.wip.3.patch wip.3 patch adds an implementation for FileSystemRMStateStore > Ability to avoid ResourceManager recovery if state store is "too old" > - > > Key: YARN-4334 > URL: https://issues.apache.org/jira/browse/YARN-4334 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, > YARN-4334.wip.patch > > > There are times when a ResourceManager has been down long enough that > ApplicationMasters and potentially external client-side monitoring mechanisms > have given up completely. If the ResourceManager starts back up and tries to > recover we can get into situations where the RM launches new application > attempts for the AMs that gave up, but then the client _also_ launches > another instance of the app because it assumed everything was dead. > It would be nice if the RM could be optionally configured to avoid trying to > recover if the state store was "too old." The RM would come up without any > applications recovered, but we would avoid a double-submission situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
[ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4334: --- Attachment: YARN-4334.wip.2.patch Thanks [~jlowe] for the review! I have updated the .2 prototype patch, please try it out. When the RM state store has expired, RM recovery will transition previously running apps and app attempts to KILLED. The .2 prototype patch also addresses your other concerns. Will work on the implementations for the other state stores soon. > Ability to avoid ResourceManager recovery if state store is "too old" > - > > Key: YARN-4334 > URL: https://issues.apache.org/jira/browse/YARN-4334 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4334.wip.2.patch, YARN-4334.wip.patch > > > There are times when a ResourceManager has been down long enough that > ApplicationMasters and potentially external client-side monitoring mechanisms > have given up completely. If the ResourceManager starts back up and tries to > recover we can get into situations where the RM launches new application > attempts for the AMs that gave up, but then the client _also_ launches > another instance of the app because it assumed everything was dead. > It would be nice if the RM could be optionally configured to avoid trying to > recover if the state store was "too old." The RM would come up without any > applications recovered, but we would avoid a double-submission situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4132) Nodemanagers should try harder to connect to the RM
[ https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4132: --- Attachment: YARN-4132.6.2.patch .6.2 patch fixes whitespace > Nodemanagers should try harder to connect to the RM > --- > > Key: YARN-4132 > URL: https://issues.apache.org/jira/browse/YARN-4132 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4132.2.patch, YARN-4132.3.patch, YARN-4132.4.patch, > YARN-4132.5.patch, YARN-4132.6.2.patch, YARN-4132.6.patch, YARN-4132.patch > > > Being part of the cluster, nodemanagers should try very hard (and possibly > never give up) to connect to a resourcemanager. Minimally we should have a > separate config to set how aggressively a nodemanager will connect to the RM > separate from what clients will do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4341) add doc about timeline performance tool usage
[ https://issues.apache.org/jira/browse/YARN-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4341: --- Attachment: YARN-4341.patch > add doc about timeline performance tool usage > - > > Key: YARN-4341 > URL: https://issues.apache.org/jira/browse/YARN-4341 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4341.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4341) add doc about timeline performance tool usage
[ https://issues.apache.org/jira/browse/YARN-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4341: --- Attachment: YARN-4341.2.patch The .2 patch fixes whitespace. > add doc about timeline performance tool usage > - > > Key: YARN-4341 > URL: https://issues.apache.org/jira/browse/YARN-4341 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4341.2.patch, YARN-4341.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4132) Nodemanagers should try harder to connect to the RM
[ https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4132: --- Attachment: YARN-4132.6.patch Thanks [~jlowe] for the review and the good suggestion! I have updated the patch accordingly. > Nodemanagers should try harder to connect to the RM > --- > > Key: YARN-4132 > URL: https://issues.apache.org/jira/browse/YARN-4132 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4132.2.patch, YARN-4132.3.patch, YARN-4132.4.patch, > YARN-4132.5.patch, YARN-4132.6.patch, YARN-4132.patch > > > Being part of the cluster, nodemanagers should try very hard (and possibly > never give up) to connect to a resourcemanager. Minimally we should have a > separate config to set how aggressively a nodemanager will connect to the RM > separate from what clients will do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4132) Nodemanagers should try harder to connect to the RM
[ https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4132: --- Attachment: YARN-4132.5.patch Thanks [~jlowe] for the review! The .5 patch has only one createRMProxy, which takes two additional inputs: the retry time and the retry interval. ServerRMProxy and ClientRMProxy pass those two inputs based on their different values in the conf. The conf naming is fixed. The test is also tuned down to around 4 seconds. > Nodemanagers should try harder to connect to the RM > --- > > Key: YARN-4132 > URL: https://issues.apache.org/jira/browse/YARN-4132 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4132.2.patch, YARN-4132.3.patch, YARN-4132.4.patch, > YARN-4132.5.patch, YARN-4132.patch > > > Being part of the cluster, nodemanagers should try very hard (and possibly > never give up) to connect to a resourcemanager. Minimally we should have a > separate config to set how aggressively a nodemanager will connect to the RM > separate from what clients will do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
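For context, a minimal sketch of how such a retry policy can be built from the two inputs using hadoop-common's RetryPolicies; the helper name and the single-method shape are assumptions for illustration, not the actual patch.
{code}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

final class RmProxyRetrySketch {

  private RmProxyRetrySketch() {}

  /** NM and client callers pass values read from their own config keys. */
  static RetryPolicy buildRetryPolicy(long maxWaitMs, long retryIntervalMs) {
    // Retry with a fixed sleep until the total configured wait time elapses.
    return RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        maxWaitMs, retryIntervalMs, TimeUnit.MILLISECONDS);
  }
}
{code}
With this shape, the NM can be configured with a much longer (or effectively unbounded) wait than clients, which is the separation the issue asks for.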
[jira] [Created] (YARN-4341) add doc about timeline performance tool usage
Chang Li created YARN-4341: -- Summary: add doc about timeline performance tool usage Key: YARN-4341 URL: https://issues.apache.org/jira/browse/YARN-4341 Project: Hadoop YARN Issue Type: Improvement Reporter: Chang Li Assignee: Chang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4339) optimize timeline server performance tool
Chang Li created YARN-4339: -- Summary: optimize timeline server performance tool Key: YARN-4339 URL: https://issues.apache.org/jira/browse/YARN-4339 Project: Hadoop YARN Issue Type: Improvement Reporter: Chang Li Assignee: Chang Li As [~Naganarasimha] suggested in YARN-2556, the test could be optimized by having some initial LevelDB data in place before testing the performance -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted
[ https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4218: --- Attachment: YARN-4218.2.patch > Metric for resource*time that was preempted > --- > > Key: YARN-4218 > URL: https://issues.apache.org/jira/browse/YARN-4218 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, > YARN-4218.patch, YARN-4218.wip.patch, screenshot-1.png, screenshot-2.png, > screenshot-3.png > > > After YARN-415 we have the ability to track the resource*time footprint of a > job and preemption metrics shows how many containers were preempted on a job. > However we don't have a metric showing the resource*time footprint cost of > preemption. In other words, we know how many containers were preempted but we > don't have a good measure of how much work was lost as a result of preemption. > We should add this metric so we can analyze how much work preemption is > costing on a grid and better track which jobs were heavily impacted by it. A > job that has 100 containers preempted that only lasted a minute each and were > very small is going to be less impacted than a job that only lost a single > container but that container was huge and had been running for 3 days. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
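To make the resource*time cost in the description above concrete, here is an illustrative sketch that accumulates memory-seconds and vcore-seconds at the moment a container is preempted; the field and method names are assumptions, not the actual patch.
{code}
class PreemptionCostSketch {

  long preemptedMemoryMBSeconds;
  long preemptedVcoreSeconds;

  /** Called when a container is preempted, with its size and lifetime. */
  void onContainerPreempted(long memoryMB, int vcores, long runtimeSeconds) {
    preemptedMemoryMBSeconds += memoryMB * runtimeSeconds;
    preemptedVcoreSeconds += (long) vcores * runtimeSeconds;
  }
}
{code}
Working the description's example through this metric: 100 small containers of 1024 MB preempted after 60 seconds each cost 100 x 1024 x 60 = 6,144,000 MB-seconds, while a single 8192 MB container preempted after 3 days costs 8192 x 259,200 = 2,123,366,400 MB-seconds, roughly 345 times more lost work despite being only one container.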
[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997139#comment-14997139 ] Chang Li commented on YARN-2556: Created YARN-4341 to track the work of adding a doc about timeline performance tool usage. > Tool to measure the performance of the timeline server > -- > > Key: YARN-2556 > URL: https://issues.apache.org/jira/browse/YARN-2556 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Jonathan Eagles >Assignee: Chang Li > Labels: BB2015-05-TBR > Fix For: 2.8.0 > > Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, > YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, > YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, > YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.15.patch, > YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, > YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, > YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch > > > We need to be able to understand the capacity model for the timeline server > to give users the tools they need to deploy a timeline server with the > correct capacity. > I propose we create a mapreduce job that can measure timeline server write > and read performance. Transactions per second, I/O for both read and write > would be a good start. > This could be done as an example or test job that could be tied into gridmix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996875#comment-14996875 ] Chang Li commented on YARN-2556: Thanks [~Naganarasimha] for suggesting the optimization! +1 on the idea of creating some initial LevelDB data before testing the performance. Created YARN-4339 to work on this idea. > Tool to measure the performance of the timeline server > -- > > Key: YARN-2556 > URL: https://issues.apache.org/jira/browse/YARN-2556 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Jonathan Eagles >Assignee: Chang Li > Labels: BB2015-05-TBR > Fix For: 2.8.0 > > Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, > YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, > YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, > YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.15.patch, > YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, > YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, > YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch > > > We need to be able to understand the capacity model for the timeline server > to give users the tools they need to deploy a timeline server with the > correct capacity. > I propose we create a mapreduce job that can measure timeline server write > and read performance. Transactions per second, I/O for both read and write > would be a good start. > This could be done as an example or test job that could be tied into gridmix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
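A sketch of the seeding idea above: write a batch of entities into the store before the timed run starts, so reads and writes are measured against a non-empty LevelDB. The store interface here is a stand-in for illustration, not the real timeline store API.
{code}
class PrepopulateSketch {

  interface EntityStore {
    void put(String entityId, byte[] payload);
  }

  /** Load numEntities synthetic entities before starting the benchmark clock. */
  static void seed(EntityStore store, int numEntities, int payloadBytes) {
    byte[] payload = new byte[payloadBytes];  // synthetic entity body
    for (int i = 0; i < numEntities; i++) {
      store.put("seed_entity_" + i, payload);
    }
  }
}
{code}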
[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996832#comment-14996832 ] Chang Li commented on YARN-2556: Hi [~xgong], here is the usage printed out by the tool:
{code}
Usage: [-m <maps>] number of mappers (default: 1)
 [-v] timeline service version
 [-mtype <mapper type>]
   1. simple entity write mapper
   2. jobhistory files replay mapper
 [-s <(KBs)test>] number of KB per put (mtype=1, default: 1 KB)
 [-t] package sending iterations per mapper (mtype=1, default: 100)
 [-d <path>] root path of job history files (mtype=2)
 [-r <replay mode>] (mtype=2)
   1. write all entities for a job in one put (default)
   2. write one entity at a time
{code}
There are two different test modes. One is the simple entity writer, where each mapper creates entities of the size you specify and puts them to the timeline server. The other mode replays job history files, which offers a more realistic test. For the job history file replay test, you put the test job history files (both the job history file and the job conf file) under a directory, then point the tool at that directory with the -d option. You select the test mode with the -mtype option. Right now the usage is printed only when you pass wrong options, not when you pass no options; with no parameters the test runs in simple entity write mode with default settings. So maybe we want to print this usage when no parameters are passed? > Tool to measure the performance of the timeline server > -- > > Key: YARN-2556 > URL: https://issues.apache.org/jira/browse/YARN-2556 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Jonathan Eagles >Assignee: Chang Li > Labels: BB2015-05-TBR > Fix For: 2.8.0 > > Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, > YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, > YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, > YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.15.patch, > YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, > YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, > YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch > > > We need to be able to understand the capacity model for the timeline server > to give users the tools they need to deploy a timeline server with the > correct capacity. > I propose we create a mapreduce job that can measure timeline server write > and read performance. Transactions per second, I/O for both read and write > would be a good start. > This could be done as an example or test job that could be tied into gridmix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
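A sketch of the suggestion in the comment above: print the usage text when the tool is run with no arguments, instead of silently running the default simple entity write mode. printUsage() stands in for the tool's real help output; the class name is hypothetical.
{code}
final class UsageGuardSketch {

  private static void printUsage() {
    System.err.println(
        "Usage: [-m <maps>] [-mtype <mapper type>] [-s <KBs per put>] ...");
  }

  public static void main(String[] args) {
    if (args.length == 0) {
      printUsage();        // show help rather than running defaults
      System.exit(2);
    }
    // ... parse options and dispatch to the selected mapper type ...
  }
}
{code}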
[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
[ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4334: --- Attachment: YARN-4334.wip.patch Uploaded a prototype patch, which heartbeats into LeveldbRMStateStore and, on RM recovery, checks whether the state store has expired. > Ability to avoid ResourceManager recovery if state store is "too old" > - > > Key: YARN-4334 > URL: https://issues.apache.org/jira/browse/YARN-4334 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jason Lowe >Assignee: Chang Li > Attachments: YARN-4334.wip.patch > > > There are times when a ResourceManager has been down long enough that > ApplicationMasters and potentially external client-side monitoring mechanisms > have given up completely. If the ResourceManager starts back up and tries to > recover we can get into situations where the RM launches new application > attempts for the AMs that gave up, but then the client _also_ launches > another instance of the app because it assumed everything was dead. > It would be nice if the RM could be optionally configured to avoid trying to > recover if the state store was "too old." The RM would come up without any > applications recovered, but we would avoid a double-submission situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
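A sketch of the heartbeat described above: a background task periodically persists the current time into the state store so that a later restart can tell how stale the store is. The sink callback is an illustrative stand-in for writing a timestamp key into LevelDB, not the prototype's actual code.
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class StateStoreHeartbeatSketch {

  interface HeartbeatSink {
    void storeHeartbeat(long timestampMillis);  // e.g. one timestamp key
  }

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start(HeartbeatSink sink, long intervalSeconds) {
    scheduler.scheduleAtFixedRate(
        () -> sink.storeHeartbeat(System.currentTimeMillis()),
        0, intervalSeconds, TimeUnit.SECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
{code}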
[jira] [Assigned] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
[ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li reassigned YARN-4334: -- Assignee: Chang Li > Ability to avoid ResourceManager recovery if state store is "too old" > - > > Key: YARN-4334 > URL: https://issues.apache.org/jira/browse/YARN-4334 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jason Lowe >Assignee: Chang Li > > There are times when a ResourceManager has been down long enough that > ApplicationMasters and potentially external client-side monitoring mechanisms > have given up completely. If the ResourceManager starts back up and tries to > recover we can get into situations where the RM launches new application > attempts for the AMs that gave up, but then the client _also_ launches > another instance of the app because it assumed everything was dead. > It would be nice if the RM could be optionally configured to avoid trying to > recover if the state store was "too old." The RM would come up without any > applications recovered, but we would avoid a double-submission situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4236) Metric for aggregated resources allocation per queue
[ https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4236: --- Attachment: (was: YARN-4236.patch) > Metric for aggregated resources allocation per queue > > > Key: YARN-4236 > URL: https://issues.apache.org/jira/browse/YARN-4236 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4236.patch > > > We currently track allocated memory and allocated vcores per queue but we > don't have a good rate metric on how fast we're allocating these things. In > other words, a straight line in allocatedmb could equally be one extreme of > no new containers are being allocated or allocating a bunch of containers > where we free exactly what we allocate each time. Adding a resources > allocated per second per queue would give us a better insight into the rate > of resource churn on a queue. Based on this aggregated resource allocation > per queue we can easily have some tools to measure the rate of resource > allocation per queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
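A sketch of the per-queue counter described above: a monotonically increasing total of allocated resources that an external metrics tool can differentiate over time to get an allocation rate, which distinguishes a quiet queue from one that churns containers. The names are illustrative assumptions, not the actual QueueMetrics fields added by the patch.
{code}
class QueueAllocationCounterSketch {

  long aggregateMemoryMBAllocated;   // only ever increases
  long aggregateVcoresAllocated;

  /** Called for every container allocated in this queue. */
  void onContainerAllocated(long memoryMB, int vcores) {
    aggregateMemoryMBAllocated += memoryMB;
    aggregateVcoresAllocated += vcores;
  }
}
{code}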