[jira] [Assigned] (YARN-7443) Add native FPGA module support to do isolation with cgroups
[ https://issues.apache.org/jira/browse/YARN-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa reassigned YARN-7443:
------------------------------------

    Assignee: Zhankun Tang

> Add native FPGA module support to do isolation with cgroups
> ------------------------------------------------------------
>
> Key: YARN-7443
> URL: https://issues.apache.org/jira/browse/YARN-7443
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Attachments: YARN-7443-trunk.001.patch
>
> Only devices with one configured major number in c-e.cfg are supported for
> now, so this is almost the same as the GPU native module.
[jira] [Commented] (YARN-3774) ZKRMStateStore should use Curator 3.0 and avail CuratorOp
[ https://issues.apache.org/jira/browse/YARN-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15813915#comment-15813915 ]

Tsuyoshi Ozawa commented on YARN-3774:
--------------------------------------

Thanks Jordan for the notification! I think we should use 3.3.0, 2.12.0 or later.

> ZKRMStateStore should use Curator 3.0 and avail CuratorOp
> -----------------------------------------------------------
>
> Key: YARN-3774
> URL: https://issues.apache.org/jira/browse/YARN-3774
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 2.8.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
> Priority: Critical
>
> YARN-2716 changes ZKRMStateStore to use Curator. Transactions added there are
> somewhat involved, and could be improved using CuratorOp introduced in
> Curator 3.0. Hadoop 3.0.0 would be a good time to upgrade the Curator version
> and make this change.
> Curator is considering shading guava through CURATOR-200. In Hadoop 3, we
> should upgrade to the next Curator version.
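As background, a minimal sketch of the CuratorOp style mentioned above, written against the Curator 3.x transaction API. The znode paths and payloads are invented for illustration; this is not code from a patch on this issue.

{code}
import java.util.List;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.api.transaction.CuratorOp;
import org.apache.curator.framework.api.transaction.CuratorTransactionResult;

public class CuratorOpSketch {
  // Commits two ZK operations atomically using the Curator 3.x API.
  static List<CuratorTransactionResult> storeApp(CuratorFramework client,
      byte[] appState) throws Exception {
    // Hypothetical znode paths, for illustration only.
    CuratorOp create = client.transactionOp().create()
        .forPath("/rmstore/apps/app_0001", appState);
    CuratorOp bumpVersion = client.transactionOp().setData()
        .forPath("/rmstore/version", new byte[] {1});
    // Either both operations are applied or neither is.
    return client.transaction().forOperations(create, bumpVersion);
  }
}
{code}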
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803169#comment-15803169 ]

Tsuyoshi Ozawa commented on YARN-4348:
--------------------------------------

Yes, Jian is correct.

> ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding
> blocking ZK's event thread
> -----------------------------------------------------------------------------
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2, 2.6.2
> Reporter: Tsuyoshi Ozawa
> Assignee: Tsuyoshi Ozawa
> Priority: Blocker
> Fix For: 2.7.2, 2.6.3
>
> Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348-branch-2.7.003.patch,
> YARN-4348-branch-2.7.004.patch, YARN-4348.001.patch, YARN-4348.001.patch,
> log.txt
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore
> can cause the following situation:
> 1. syncInternal times out,
> 2. but the sync succeeds later on.
> We should use zkResyncWaitTime as the timeout value.
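To make the idea concrete, a minimal sketch of a bounded wait around ZooKeeper's asynchronous sync, assuming a zkResyncWaitTime value in milliseconds. This illustrates the approach; it is not the committed patch.

{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.AsyncCallback;
import org.apache.zookeeper.ZooKeeper;

// Issues an async sync and waits at most zkResyncWaitTimeMs instead of
// blocking indefinitely; returns false if the sync did not complete in time.
static boolean syncWithTimeout(ZooKeeper zk, String path,
    long zkResyncWaitTimeMs) throws InterruptedException {
  final CountDownLatch latch = new CountDownLatch(1);
  zk.sync(path, new AsyncCallback.VoidCallback() {
    @Override
    public void processResult(int rc, String p, Object ctx) {
      latch.countDown(); // sync finished, successfully or not
    }
  }, null);
  return latch.await(zkResyncWaitTimeMs, TimeUnit.MILLISECONDS);
}
{code}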
[jira] [Created] (YARN-5801) Adding isRoot method to CSQueue
Tsuyoshi Ozawa created YARN-5801:
------------------------------------

Summary: Adding isRoot method to CSQueue
Key: YARN-5801
URL: https://issues.apache.org/jira/browse/YARN-5801
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Tsuyoshi Ozawa

Currently, we check whether CSQueue is root or not by using null check against getParent. It's more straightforward to introduce isRoot to a method in CSQueue instead of going to current way.
[jira] [Updated] (YARN-5801) Adding isRoot method to CSQueue
[ https://issues.apache.org/jira/browse/YARN-5801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-5801:
---------------------------------

    Description:
Currently, we check whether CSQueue is root or not by using a null check against the return value of getParent. It's more straightforward to introduce an isRoot method to CSQueue instead of the current way.

  was:
Currently, we check whether CSQueue is root or not by using null check against getParent. It's more straightforward to introduce isRoot to a method in CSQueue instead of going to current way.

> Adding isRoot method to CSQueue
> ---------------------------------
>
> Key: YARN-5801
> URL: https://issues.apache.org/jira/browse/YARN-5801
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Tsuyoshi Ozawa
>
> Currently, we check whether CSQueue is root or not by using a null check
> against the return value of getParent. It's more straightforward to introduce
> an isRoot method to CSQueue instead of the current way.
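The proposal amounts to naming the existing null check; a sketch of how the accessor could look, assuming it lives in the common CSQueue implementation (illustrative only):

{code}
// Gives the inline check "queue.getParent() == null" a descriptive name
// on the queue itself.
public boolean isRoot() {
  return getParent() == null;
}
{code}

Call sites would then read {{if (queue.isRoot())}} instead of {{if (queue.getParent() == null)}}.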
[jira] [Updated] (YARN-3538) TimelineServer doesn't catch/translate all exceptions raised
[ https://issues.apache.org/jira/browse/YARN-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-3538:
---------------------------------

    Attachment: YARN-3538.002.patch

Updating the patch based on the discussion. [~djp] could you check the patch?

> TimelineServer doesn't catch/translate all exceptions raised
> --------------------------------------------------------------
>
> Key: YARN-3538
> URL: https://issues.apache.org/jira/browse/YARN-3538
> Project: Hadoop YARN
> Issue Type: Bug
> Components: timelineserver
> Affects Versions: 2.6.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Minor
> Labels: oct16-easy
> Attachments: YARN-3538-001.patch, YARN-3538.002.patch
>
> Not all exceptions in TimelineServer are uprated to web exceptions; only IOEs
[jira] [Commented] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens
[ https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617307#comment-15617307 ]

Tsuyoshi Ozawa commented on YARN-2674:
--------------------------------------

[~chenchun] The patch seems to be stale now. Could you update it?

> Distributed shell AM may re-launch containers if RM work preserving restart
> happens
> -----------------------------------------------------------------------------
>
> Key: YARN-2674
> URL: https://issues.apache.org/jira/browse/YARN-2674
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: applications, resourcemanager
> Reporter: Chun Chen
> Assignee: Chun Chen
> Labels: oct16-easy
> Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch,
> YARN-2674.4.patch, YARN-2674.5.patch
>
> Currently, if an RM work preserving restart happens while distributed shell is
> running, the distributed shell AM may re-launch all the containers, including
> new/running/complete ones. We must make sure it won't re-launch the
> running/complete containers.
> We need to remove allocated containers from
> AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM.
[jira] [Commented] (YARN-2467) Add SpanReceiverHost to ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617301#comment-15617301 ]

Tsuyoshi Ozawa commented on YARN-2467:
--------------------------------------

[~iwasakims] could you rebase it on trunk code? It cannot be applied to trunk.

> Add SpanReceiverHost to ResourceManager
> -----------------------------------------
>
> Key: YARN-2467
> URL: https://issues.apache.org/jira/browse/YARN-2467
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: api, resourcemanager
> Reporter: Masatake Iwasaki
> Assignee: Masatake Iwasaki
> Labels: oct16-easy
> Attachments: YARN-2467.001.patch, YARN-2467.002.patch
>
> Per process SpanReceiverHost should be initialized in ResourceManager in the
> same way as NameNode and DataNode do in order to support tracing.
[jira] [Commented] (YARN-5746) The state of the parentQueue and its childQueues should be synchronized.
[ https://issues.apache.org/jira/browse/YARN-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617279#comment-15617279 ]

Tsuyoshi Ozawa commented on YARN-5746:
--------------------------------------

[~xgong] thanks for taking this issue.

{code}
public QueueState getConfiguredState(String queue) {
  String state = get(getQueuePrefix(queue) + STATE);
  if (state == null) {
    return null;
  } else {
    return QueueState.valueOf(StringUtils.toUpperCase(state));
  }
{code}

It's a bit difficult to understand what a state of "null" means. I would like to suggest that we create a new state, QueueState.NOT_FOUND, and return it instead of returning null. What do you think?

{quote}
Let's collapse these nested conditionals into an else if:
{quote}

+1

In addition to Daniel's comments, how about adding a new private method to wrap up the following routine?

{code}
if (parent != null) {
  QueueState configuredState = csContext.getConfiguration()
      .getConfiguredState(getQueuePath());
  QueueState parentState = parent.getState();
  if (configuredState == null) {
    this.state = parentState;
  } else {
    if (configuredState == QueueState.RUNNING
        && parentState == QueueState.STOPPED) {
      throw new IllegalArgumentException(
          "Illegal" + " State of " + configuredState
          + " for children of queue: " + queueName
          + ". The state of its parent queue: "
          + parent.getQueueName() + " is " + parentState);
    } else {
      this.state = configuredState;
    }
  }
} else {
  // if this is the root queue, get the state from the configuration.
  // if the state is not set, use RUNNING as default state.
  this.state = csContext.getConfiguration().getState(getQueuePath());
}
{code}

> The state of the parentQueue and its childQueues should be synchronized.
> ---------------------------------------------------------------------------
>
> Key: YARN-5746
> URL: https://issues.apache.org/jira/browse/YARN-5746
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacity scheduler, resourcemanager
> Reporter: Xuan Gong
> Assignee: Xuan Gong
> Labels: oct16-easy
> Attachments: YARN-5746.1.patch, YARN-5746.2.patch
>
> The state of the parentQueue and its childQueues need to be synchronized.
> * If the state of the parentQueue becomes STOPPED, the state of its
> childQueues needs to become STOPPED as well.
> * If we change the state of a queue to RUNNING, we should make sure the
> state of all its ancestors is RUNNING. Otherwise, we need to fail this
> operation.
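Combining Daniel's else-if suggestion with the private-method suggestion above, the quoted routine could be factored roughly like this. The method name is invented and this is a sketch, not the committed patch:

{code}
// Derives this queue's state from its configured state and its parent's
// state; uses only names that appear in the routine quoted above.
private void initializeQueueState(CSQueue parent) {
  QueueState configuredState = csContext.getConfiguration()
      .getConfiguredState(getQueuePath());
  if (parent == null) {
    // Root queue: take the state from configuration (RUNNING by default).
    this.state = csContext.getConfiguration().getState(getQueuePath());
  } else if (configuredState == null) {
    // No explicit state configured: inherit the parent's state.
    this.state = parent.getState();
  } else if (configuredState == QueueState.RUNNING
      && parent.getState() == QueueState.STOPPED) {
    throw new IllegalArgumentException("Illegal state of " + configuredState
        + " for children of queue: " + queueName
        + ". The state of its parent queue: " + parent.getQueueName()
        + " is " + parent.getState());
  } else {
    this.state = configuredState;
  }
}
{code}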
[jira] [Updated] (YARN-5259) Add two metrics at FSOpDurations for doing container assign and completed Performance statistical analysis
[ https://issues.apache.org/jira/browse/YARN-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-5259:
---------------------------------

    Assignee: Inigo Goiri

> Add two metrics at FSOpDurations for doing container assign and completed
> Performance statistical analysis
> ----------------------------------------------------------------------------
>
> Key: YARN-5259
> URL: https://issues.apache.org/jira/browse/YARN-5259
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: fairscheduler
> Reporter: ChenFolin
> Assignee: Inigo Goiri
> Labels: oct16-easy
> Attachments: YARN-5259-001.patch, YARN-5259-002.patch, YARN-5259-003.patch,
> YARN-5259-004.patch
>
> If the cluster is slow, we cannot tell whether the cause is container
> assignment or container completion performance.
[jira] [Commented] (YARN-3139) Improve locks in AbstractYarnScheduler/CapacityScheduler/FairScheduler
[ https://issues.apache.org/jira/browse/YARN-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513703#comment-15513703 ]

Tsuyoshi Ozawa commented on YARN-3139:
--------------------------------------

[~leftnoteasy] [~jianhe] thanks for taking this issue.

{quote}
Summary: No regression in performance, didn't see deadlock happens. No significant performance improvement either, because existing scheduler allocation is still in single thread.
{quote}

If the performance doesn't change, could you clarify the reason for this change? Do you plan to make the scheduler allocation multi-threaded?

> Improve locks in AbstractYarnScheduler/CapacityScheduler/FairScheduler
> -------------------------------------------------------------------------
>
> Key: YARN-3139
> URL: https://issues.apache.org/jira/browse/YARN-3139
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager, scheduler
> Reporter: Wangda Tan
> Assignee: Wangda Tan
> Attachments: YARN-3139.0.patch, YARN-3139.1.patch, YARN-3139.2.patch
>
> Enhance locks in AbstractYarnScheduler/CapacityScheduler/FairScheduler; as
> mentioned in YARN-3091, a possible solution is using a read/write lock. Other
> fine-grained locks for specific purposes / bugs should be addressed in
> separate tickets.
[jira] [Commented] (YARN-4714) [Java 8] Over usage of virtual memory
[ https://issues.apache.org/jira/browse/YARN-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15393276#comment-15393276 ]

Tsuyoshi Ozawa commented on YARN-4714:
--------------------------------------

Hi Krishna, have you changed the configurations on all NodeManagers and restarted all of them?

> [Java 8] Over usage of virtual memory
> ---------------------------------------
>
> Key: YARN-4714
> URL: https://issues.apache.org/jira/browse/YARN-4714
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Mohammad Kamrul Islam
> Assignee: Mohammad Kamrul Islam
> Priority: Blocker
> Attachments: HADOOP-11364.01.patch
>
> In our Hadoop 2 + Java 8 effort, we found a few jobs being killed by Hadoop
> due to excessive virtual memory allocation, although the physical memory
> usage is low.
> The most common error message is "Container [pid=??,containerID=container_??]
> is running beyond virtual memory limits. Current usage: 365.1 MB of 1 GB
> physical memory used; 3.2 GB of 2.1 GB virtual memory used. Killing
> container."
> We see this problem for MR jobs as well as in Spark drivers/executors.
[jira] [Updated] (YARN-4048) Linux kernel panic under strict CPU limits(on CentOS/RHEL 6.x)
[ https://issues.apache.org/jira/browse/YARN-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-4048:
---------------------------------

    Summary: Linux kernel panic under strict CPU limits(on CentOS/RHEL 6.x)  (was: Linux kernel panic under strict CPU limits)

> Linux kernel panic under strict CPU limits(on CentOS/RHEL 6.x)
> ----------------------------------------------------------------
>
> Key: YARN-4048
> URL: https://issues.apache.org/jira/browse/YARN-4048
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.7.1
> Reporter: Chengbing Liu
> Priority: Critical
> Attachments: panic.png
>
> With YARN-2440 and YARN-2531, we have seen some kernel panics happening under
> heavy pressure. Even with YARN-2809, it still panics.
> We are using CentOS 6.5, hadoop 2.5.0-cdh5.2.0 with the above patches. I
> guess the latest version also has the same issue.
[jira] [Commented] (YARN-5332) Jersey ClientResponse#getStatusInfo api is not available with jersey 1.9
[ https://issues.apache.org/jira/browse/YARN-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366191#comment-15366191 ]

Tsuyoshi Ozawa commented on YARN-5332:
--------------------------------------

[~sunilg] How about executing {{mvn clean install test -Dtest=TestRMWebServices}}? It works on my local machine without cleaning jersey-client-1.9.jar. If it doesn't work, it might be useful to clean M2_REPO as an ad-hoc workaround.

> Jersey ClientResponse#getStatusInfo api is not available with jersey 1.9
> ---------------------------------------------------------------------------
>
> Key: YARN-5332
> URL: https://issues.apache.org/jira/browse/YARN-5332
> Project: Hadoop YARN
> Issue Type: Bug
> Components: test
> Reporter: Sunil G
> Assignee: Sunil G
>
> Few test classes like TestRMWebServices, were using
> ClientResponse#getStatusInfo and this api is not available as part of jersey
> 1.9.
> Pls refer:
> https://jersey.java.net/apidocs/1.9/jersey/com/sun/jersey/api/client/ClientResponse.html
> {{getStatusInfo}} is not present here.
> We may need to change such invocations from these test classes.
> In HADOOP-9613, [~ozawa] mentioned in this
> [comment|https://issues.apache.org/jira/browse/HADOOP-9613?focusedCommentId=14980024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14980024]
> that we can use {{getStatusInfo}}.
> [~ozawa], could you please help to confirm this point Or am I missing some
> thing here.
[jira] [Comment Edited] (YARN-5332) Jersey ClientResponse#getStatusInfo api is not available with jersey 1.9
[ https://issues.apache.org/jira/browse/YARN-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366130#comment-15366130 ]

Tsuyoshi Ozawa edited comment on YARN-5332 at 7/7/16 1:51 PM:
---------------------------------------------------------------

[~sunilg] Thanks for reporting the issue. Please see [the doc of Jersey 1.19|https://jersey.java.net/apidocs/1.19/jersey/com/sun/jersey/api/client/ClientResponse.html#getClientResponseStatus()], not one of 1.9 because we upgraded Jersey to 1.19. Feel free to ask me about the update of dependency.

was (Author: ozawa):
[~sunilg] Thanks for reporting the issue. Please see [the doc of Jersey 1.19|https://jersey.java.net/apidocs/1.19/jersey/com/sun/jersey/api/client/ClientResponse.html#getClientResponseStatus()], not 1.9 because we upgraded Jersey to 1.19. Feel free to ask me about the update of dependency.

> Jersey ClientResponse#getStatusInfo api is not available with jersey 1.9
> ---------------------------------------------------------------------------
>
> Key: YARN-5332
> URL: https://issues.apache.org/jira/browse/YARN-5332
> Project: Hadoop YARN
> Issue Type: Bug
> Components: test
> Reporter: Sunil G
> Assignee: Sunil G
>
> Few test classes like TestRMWebServices, were using
> ClientResponse#getStatusInfo and this api is not available as part of jersey
> 1.9.
> Pls refer:
> https://jersey.java.net/apidocs/1.9/jersey/com/sun/jersey/api/client/ClientResponse.html
> {{getStatusInfo}} is not present here.
> We may need to change such invocations from these test classes.
> In HADOOP-9613, [~ozawa] mentioned in this
> [comment|https://issues.apache.org/jira/browse/HADOOP-9613?focusedCommentId=14980024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14980024]
> that we can use {{getStatusInfo}}.
> [~ozawa], could you please help to confirm this point Or am I missing some
> thing here.
[jira] [Comment Edited] (YARN-5332) Jersey ClientResponse#getStatusInfo api is not available with jersey 1.9
[ https://issues.apache.org/jira/browse/YARN-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366130#comment-15366130 ]

Tsuyoshi Ozawa edited comment on YARN-5332 at 7/7/16 1:50 PM:
---------------------------------------------------------------

[~sunilg] Thanks for reporting the issue. Please see [the doc of Jersey 1.19|https://jersey.java.net/apidocs/1.19/jersey/com/sun/jersey/api/client/ClientResponse.html#getClientResponseStatus()], not 1.9 because we upgraded Jersey to 1.19. Feel free to ask me about the update of dependency.

was (Author: ozawa):
[~sunilg] Thanks for reporting the issue. Please see the doc of Jersey 1.19, not 1.9 because we upgraded Jersey to 1.19. Feel free to ask me about the update of dependency.
https://jersey.java.net/apidocs/1.19/jersey/com/sun/jersey/api/client/ClientResponse.html#getClientResponseStatus()

> Jersey ClientResponse#getStatusInfo api is not available with jersey 1.9
> ---------------------------------------------------------------------------
>
> Key: YARN-5332
> URL: https://issues.apache.org/jira/browse/YARN-5332
> Project: Hadoop YARN
> Issue Type: Bug
> Components: test
> Reporter: Sunil G
> Assignee: Sunil G
>
> Few test classes like TestRMWebServices, were using
> ClientResponse#getStatusInfo and this api is not available as part of jersey
> 1.9.
> Pls refer:
> https://jersey.java.net/apidocs/1.9/jersey/com/sun/jersey/api/client/ClientResponse.html
> {{getStatusInfo}} is not present here.
> We may need to change such invocations from these test classes.
> In HADOOP-9613, [~ozawa] mentioned in this
> [comment|https://issues.apache.org/jira/browse/HADOOP-9613?focusedCommentId=14980024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14980024]
> that we can use {{getStatusInfo}}.
> [~ozawa], could you please help to confirm this point Or am I missing some
> thing here.
[jira] [Commented] (YARN-5332) Jersey ClientResponse#getStatusInfo api is not available with jersey 1.9
[ https://issues.apache.org/jira/browse/YARN-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366130#comment-15366130 ]

Tsuyoshi Ozawa commented on YARN-5332:
--------------------------------------

[~sunilg] Thanks for reporting the issue. Please see the doc of Jersey 1.19, not 1.9 because we upgraded Jersey to 1.19. Feel free to ask me about the update of dependency.
https://jersey.java.net/apidocs/1.19/jersey/com/sun/jersey/api/client/ClientResponse.html#getClientResponseStatus()

> Jersey ClientResponse#getStatusInfo api is not available with jersey 1.9
> ---------------------------------------------------------------------------
>
> Key: YARN-5332
> URL: https://issues.apache.org/jira/browse/YARN-5332
> Project: Hadoop YARN
> Issue Type: Bug
> Components: test
> Reporter: Sunil G
> Assignee: Sunil G
>
> Few test classes like TestRMWebServices, were using
> ClientResponse#getStatusInfo and this api is not available as part of jersey
> 1.9.
> Pls refer:
> https://jersey.java.net/apidocs/1.9/jersey/com/sun/jersey/api/client/ClientResponse.html
> {{getStatusInfo}} is not present here.
> We may need to change such invocations from these test classes.
> In HADOOP-9613, [~ozawa] mentioned in this
> [comment|https://issues.apache.org/jira/browse/HADOOP-9613?focusedCommentId=14980024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14980024]
> that we can use {{getStatusInfo}}.
> [~ozawa], could you please help to confirm this point Or am I missing some
> thing here.
[jira] [Commented] (YARN-5224) Logs for a completed container are not available in the yarn logs output for a live application
[ https://issues.apache.org/jira/browse/YARN-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340841#comment-15340841 ]

Tsuyoshi Ozawa commented on YARN-5224:
--------------------------------------

Marking this as incompatible since the patch includes an endpoint change to the RESTful API.

> Logs for a completed container are not available in the yarn logs output for
> a live application
> ------------------------------------------------------------------------------
>
> Key: YARN-5224
> URL: https://issues.apache.org/jira/browse/YARN-5224
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 2.9.0
> Reporter: Siddharth Seth
> Assignee: Xuan Gong
> Labels: incompatible
> Attachments: YARN-5224.1.patch, YARN-5224.2.patch, YARN-5224.3.patch,
> YARN-5224.4.patch, YARN-5224.5.patch
>
> This affects 'short' jobs like MapReduce and Tez more than long running apps.
> Related: YARN-5193 (but that only covers long running apps)
[jira] [Updated] (YARN-5224) Logs for a completed container are not available in the yarn logs output for a live application
[ https://issues.apache.org/jira/browse/YARN-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-5224:
---------------------------------

    Labels: incompatible  (was: )

> Logs for a completed container are not available in the yarn logs output for
> a live application
> ------------------------------------------------------------------------------
>
> Key: YARN-5224
> URL: https://issues.apache.org/jira/browse/YARN-5224
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 2.9.0
> Reporter: Siddharth Seth
> Assignee: Xuan Gong
> Labels: incompatible
> Attachments: YARN-5224.1.patch, YARN-5224.2.patch, YARN-5224.3.patch,
> YARN-5224.4.patch, YARN-5224.5.patch
>
> This affects 'short' jobs like MapReduce and Tez more than long running apps.
> Related: YARN-5193 (but that only covers long running apps)
[jira] [Created] (YARN-5275) Timeline application page cannot be loaded when no application submitted/running on the cluster after HADOOP-9613
Tsuyoshi Ozawa created YARN-5275:
------------------------------------

Summary: Timeline application page cannot be loaded when no application submitted/running on the cluster after HADOOP-9613
Key: YARN-5275
URL: https://issues.apache.org/jira/browse/YARN-5275
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.0.0-alpha1
Reporter: Tsuyoshi Ozawa
Priority: Critical

After HADOOP-9613, the Timeline Web UI has a problem reported by [~leftnoteasy] and [~sunilg]:

{quote}
when no application submitted/running on the cluster, applications page cannot be loaded.
{quote}

We should investigate the reason and fix it.
[jira] [Commented] (YARN-5006) ResourceManager quit due to ApplicationStateData exceed the limit size of znode in zk
[ https://issues.apache.org/jira/browse/YARN-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283261#comment-15283261 ]

Tsuyoshi Ozawa commented on YARN-5006:
--------------------------------------

{quote}
Our opinion about fixing this bug is that we want to add a limit on the ApplicationStateData size when RMStateStore does StoreAppTransition.
{quote}

{quote}
You should also see if YARN-4958 would help resolve the issue. We're misusing ZK a bit as a data store, and YARN-4958 attempts to reduce the level of abuse.
{quote}

Both suggestions can be pursued in parallel and are worth fixing. Another workaround is to use compression.

> ResourceManager quit due to ApplicationStateData exceed the limit size of
> znode in zk
> ----------------------------------------------------------------------------
>
> Key: YARN-5006
> URL: https://issues.apache.org/jira/browse/YARN-5006
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.6.0, 2.7.2
> Reporter: dongtingting
> Priority: Critical
>
> A client submits a job that adds one file into the DistributedCache. When the
> job is submitted, the ResourceManager stores ApplicationStateData into ZK.
> The ApplicationStateData exceeds the znode size limit, and the RM exits with
> code 1.
> The related code in RMStateStore.java:
> {code}
> private static class StoreAppTransition
>     implements SingleArcTransition<RMStateStore, RMStateStoreEvent> {
>   @Override
>   public void transition(RMStateStore store, RMStateStoreEvent event) {
>     if (!(event instanceof RMStateStoreAppEvent)) {
>       // should never happen
>       LOG.error("Illegal event type: " + event.getClass());
>       return;
>     }
>     ApplicationState appState = ((RMStateStoreAppEvent) event).getAppState();
>     ApplicationId appId = appState.getAppId();
>     ApplicationStateData appStateData = ApplicationStateData
>         .newInstance(appState);
>     LOG.info("Storing info for app: " + appId);
>     try {
>       store.storeApplicationStateInternal(appId, appStateData); // store the appStateData
>       store.notifyApplication(new RMAppEvent(appId,
>           RMAppEventType.APP_NEW_SAVED));
>     } catch (Exception e) {
>       LOG.error("Error storing app: " + appId, e);
>       store.notifyStoreOperationFailed(e); // handle fail event, system exit
>     }
>   };
> }
> {code}
> The Exception log:
> {code}
> ...
> 2016-04-20 11:26:35,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore AsyncDispatcher event handler: Maxed out ZK retries. Giving up!
> 2016-04-20 11:26:35,732 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore AsyncDispatcher event handler: Error storing app: application_1461061795989_17671
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
>     at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:936)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:933)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1075)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1096)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:933)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:947)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:956)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:626)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:138)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:123)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at ...
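On the compression workaround mentioned in the comment above, a minimal sketch: deflate the serialized ApplicationStateData before writing it to the znode. The helper is hypothetical and illustrative, not from a patch on this issue.

{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;

// Hypothetical helper: shrinks the serialized state so large entries are
// less likely to hit ZooKeeper's default ~1 MB znode limit (jute.maxbuffer).
static byte[] compress(byte[] serializedState) throws IOException {
  ByteArrayOutputStream bos = new ByteArrayOutputStream();
  try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
    dos.write(serializedState);
  }
  return bos.toByteArray();
}
{code}

A matching InflaterInputStream pass would be needed on the recovery path.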
[jira] [Commented] (YARN-4994) Use MiniYARNCluster with try-with-resources in tests
[ https://issues.apache.org/jira/browse/YARN-4994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278960#comment-15278960 ]

Tsuyoshi Ozawa commented on YARN-4994:
--------------------------------------

[~boky01] I'll check the patch.

> Use MiniYARNCluster with try-with-resources in tests
> ------------------------------------------------------
>
> Key: YARN-4994
> URL: https://issues.apache.org/jira/browse/YARN-4994
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: test
> Affects Versions: 2.7.0
> Reporter: Andras Bokor
> Assignee: Andras Bokor
> Priority: Trivial
> Fix For: 2.7.0
>
> Attachments: HDFS-10287.01.patch, HDFS-10287.02.patch, HDFS-10287.03.patch,
> YARN-4994.04.patch, YARN-4994.05.patch, YARN-4994.06.patch,
> YARN-4994.07.patch
>
> In tests, MiniYARNCluster is used with the following pattern:
> in a try-catch block, create a MiniYARNCluster instance and close it in the
> finally block.
> [Try-with-resources|https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html]
> is preferred since Java 7 instead of the pattern above.
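The pattern change in miniature, assuming MiniYARNCluster is usable as an AutoCloseable through Hadoop's Service interface (which extends Closeable); a sketch, not code from the patch:

{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.MiniYARNCluster;

// try-with-resources: close() stops the cluster even if the test body throws.
try (MiniYARNCluster cluster = new MiniYARNCluster("testCluster", 1, 1, 1)) {
  cluster.init(new YarnConfiguration());
  cluster.start();
  // ... exercise the cluster ...
}
{code}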
[jira] [Commented] (YARN-5071) address HBase compatibility issues with trunk
[ https://issues.apache.org/jira/browse/YARN-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278663#comment-15278663 ]

Tsuyoshi Ozawa commented on YARN-5071:
--------------------------------------

{quote}
In principle, I don't think this is really a HBase problem at the moment as 3.0.0 has not been released yet.
{quote}

I agree with you. I meant that we should focus on how the Hadoop ecosystem, including HBase, can migrate branch-2 based code to trunk easily and smoothly. I think we should get feedback from users of Hadoop, including HBase developers, to avoid critical problems. In other words, this is a good time to recheck the Hadoop Compatibility guide.
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html

> address HBase compatibility issues with trunk
> -----------------------------------------------
>
> Key: YARN-5071
> URL: https://issues.apache.org/jira/browse/YARN-5071
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-2928
> Reporter: Sangjin Lee
> Assignee: Sangjin Lee
> Priority: Critical
>
> The trunk is now adding or planning to add more and more
> backward-incompatible changes. Some examples include
> - remove v.1 metrics classes (HADOOP-12504)
> - update jersey version (HADOOP-9613)
> - target java 8 by default (HADOOP-11858)
> This poses big challenges for the timeline service v.2 as we have a
> dependency on hbase which depends on an older version of hadoop.
> We need to find a way to solve/contain/manage these risks before it is too
> late.
[jira] [Commented] (YARN-5071) address HBase compatibility issues with trunk
[ https://issues.apache.org/jira/browse/YARN-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278626#comment-15278626 ]

Tsuyoshi Ozawa commented on YARN-5071:
--------------------------------------

I think we should have a discussion with the HBase folks. [~stack] [~iwasakims] what do you think about HBase supporting the trunk code? Is there any help or work we can provide on the Hadoop side? We'd like to know the barriers and problems of running the HBase client in a trunk environment.

> address HBase compatibility issues with trunk
> -----------------------------------------------
>
> Key: YARN-5071
> URL: https://issues.apache.org/jira/browse/YARN-5071
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-2928
> Reporter: Sangjin Lee
> Assignee: Sangjin Lee
> Priority: Critical
>
> The trunk is now adding or planning to add more and more
> backward-incompatible changes. Some examples include
> - remove v.1 metrics classes (HADOOP-12504)
> - update jersey version (HADOOP-9613)
> - target java 8 by default (HADOOP-11858)
> This poses big challenges for the timeline service v.2 as we have a
> dependency on hbase which depends on an older version of hadoop.
> We need to find a way to solve/contain/manage these risks before it is too
> late.
[jira] [Commented] (YARN-4844) Add getMemoryLong/getVirtualCoreLong to o.a.h.y.api.records.Resource
[ https://issues.apache.org/jira/browse/YARN-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278514#comment-15278514 ]

Tsuyoshi Ozawa commented on YARN-4844:
--------------------------------------

[~wangda], it's reasonable to make these values long. Should we deprecate getVirtualCores, which returns an int value? What do you think?

> Add getMemoryLong/getVirtualCoreLong to o.a.h.y.api.records.Resource
> -----------------------------------------------------------------------
>
> Key: YARN-4844
> URL: https://issues.apache.org/jira/browse/YARN-4844
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: api
> Reporter: Wangda Tan
> Assignee: Wangda Tan
> Priority: Blocker
> Attachments: YARN-4844.1.patch, YARN-4844.2.patch, YARN-4844.3.patch,
> YARN-4844.4.patch, YARN-4844.5.patch, YARN-4844.6.patch, YARN-4844.7.patch
>
> We use int32 for memory now; if a cluster has 10k nodes, each with 210G
> memory, we will get a negative total cluster memory.
> Another case that overflows int32 even more easily: we add all pending
> resources of running apps to the cluster's total pending resources. If a
> problematic app requires too many resources (let's say 1M+ containers, each
> of them 3G), int32 will not be enough.
> Even if we can cap each app's pending request, we cannot handle the case
> where there are many running apps, each with capped but still significant
> pending resources.
> So we may possibly need to add getMemoryLong/getVirtualCoreLong to
> o.a.h.y.api.records.Resource.
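The overflow in the description is easy to verify with the numbers given there (10k nodes, 210G each, tracked in megabytes):

{code}
public class MemoryOverflowDemo {
  public static void main(String[] args) {
    int nodes = 10000;
    int memoryPerNodeMB = 210 * 1024;                // 215,040 MB per node
    int totalInt = nodes * memoryPerNodeMB;          // 2,150,400,000 > Integer.MAX_VALUE
    long totalLong = (long) nodes * memoryPerNodeMB; // widened before multiplying
    System.out.println(totalInt);                    // negative: the int wrapped around
    System.out.println(totalLong);                   // 2150400000
  }
}
{code}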
[jira] [Commented] (YARN-4743) ResourceManager crash because TimSort
[ https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171471#comment-15171471 ]

Tsuyoshi Ozawa commented on YARN-4743:
--------------------------------------

[~gzh1992n] thank you for the report. IIUC, the comparator must ensure that the relation is transitive:
https://docs.oracle.com/javase/7/docs/api/java/lang/Comparable.html
My intuition is that the DRF comparator is not transitive. [~kasha], what do you think? Can we design the comparator to be transitive?

> ResourceManager crash because TimSort
> ---------------------------------------
>
> Key: YARN-4743
> URL: https://issues.apache.org/jira/browse/YARN-4743
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.6.4
> Reporter: Zephyr Guo
>
> {code}
> 2016-02-26 14:08:50,821 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeCollapse(TimSort.java:410)
>     at java.util.TimSort.sort(TimSort.java:214)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>     at java.lang.Thread.run(Thread.java:745)
> 2016-02-26 14:08:50,822 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> Actually, this issue was found in 2.6.0-cdh5.4.7.
> I think the cause is that we modify {{Resource}} while we are sorting
> {{runnableApps}}.
> {code:title=FSLeafQueue.java}
> Comparator<Schedulable> comparator = policy.getComparator();
> writeLock.lock();
> try {
>   Collections.sort(runnableApps, comparator);
> } finally {
>   writeLock.unlock();
> }
> readLock.lock();
> {code}
> {code:title=FairShareComparator}
> public int compare(Schedulable s1, Schedulable s2) {
>   ..
>       s1.getResourceUsage(), minShare1);
>   boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
>       s2.getResourceUsage(), minShare2);
>   minShareRatio1 = (double) s1.getResourceUsage().getMemory()
>       / Resources.max(RESOURCE_CALCULATOR, null, minShare1, ONE).getMemory();
>   minShareRatio2 = (double) s2.getResourceUsage().getMemory()
>       / Resources.max(RESOURCE_CALCULATOR, null, minShare2, ONE).getMemory();
>   ..
> {code}
> {{getResourceUsage}} will return the current Resource. The current Resource
> is unstable.
> {code:title=FSAppAttempt.java}
> @Override
> public Resource getResourceUsage() {
>   // Here the getPreemptedResources() always return zero, except in
>   // a preemption round
>   return Resources.subtract(getCurrentConsumption(), getPreemptedResources());
> }
> {code}
> {code:title=SchedulerApplicationAttempt}
> public Resource getCurrentConsumption() {
>   return currentConsumption;
> }
> // This method may modify current Resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
>   ..
>   Resources.addTo(currentConsumption, rmContainer.getContainer().getResource());
>   ..
> }
> {code}
> I suggest that we use a stable Resource in the comparator.
> Is there something I think wrong?
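One possible shape of the "stable Resource" suggestion, sketched inside FSLeafQueue with java.util and the YARN Resources/Schedulable types in scope: freeze each schedulable's usage once before sorting and compare the frozen copies, so concurrent updates to currentConsumption cannot reorder elements mid-sort. Hypothetical code, not the committed fix, and it compares only memory for brevity:

{code}
final Map<Schedulable, Resource> frozen =
    new IdentityHashMap<Schedulable, Resource>();
for (Schedulable s : runnableApps) {
  frozen.put(s, Resources.clone(s.getResourceUsage())); // immutable snapshot
}
Collections.sort(runnableApps, new Comparator<Schedulable>() {
  @Override
  public int compare(Schedulable s1, Schedulable s2) {
    // The real FairShareComparator reads several fields; each would need
    // the same frozen treatment to keep the ordering consistent.
    return Integer.compare(frozen.get(s1).getMemory(),
        frozen.get(s2).getMemory());
  }
});
{code}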
[jira] [Commented] (YARN-4673) race condition in ResourceTrackerService#nodeHeartBeat while processing deduplicated msg
[ https://issues.apache.org/jira/browse/YARN-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166867#comment-15166867 ]

Tsuyoshi Ozawa commented on YARN-4673:
--------------------------------------

Hi [~sandflee], thank you for the contribution. Could you explain the cause of the deadlock? It would help us review your patch faster and more accurately.

> race condition in ResourceTrackerService#nodeHeartBeat while processing
> deduplicated msg
> --------------------------------------------------------------------------
>
> Key: YARN-4673
> URL: https://issues.apache.org/jira/browse/YARN-4673
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: sandflee
> Assignee: sandflee
> Attachments: YARN-4673.01.patch
>
> we could add a lock like ApplicationMasterService#allocate
[jira] [Commented] (YARN-4630) Remove useless boxing/unboxing code (Hadoop YARN)
[ https://issues.apache.org/jira/browse/YARN-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163046#comment-15163046 ]

Tsuyoshi Ozawa commented on YARN-4630:
--------------------------------------

Hey Akira, can I check this since it seems to include the changes againstContainerId? It has an impact against RM-HA.

> Remove useless boxing/unboxing code (Hadoop YARN)
> ---------------------------------------------------
>
> Key: YARN-4630
> URL: https://issues.apache.org/jira/browse/YARN-4630
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
> Affects Versions: 3.0.0
> Reporter: Kousuke Saruta
> Priority: Minor
> Attachments: YARN-4630.0.patch
>
> There are lots of places where useless boxing/unboxing occur.
> To avoid performance issue, let's remove them.
[jira] [Comment Edited] (YARN-4630) Remove useless boxing/unboxing code (Hadoop YARN)
[ https://issues.apache.org/jira/browse/YARN-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163046#comment-15163046 ]

Tsuyoshi Ozawa edited comment on YARN-4630 at 2/24/16 2:15 PM:
----------------------------------------------------------------

Hey Akira, can I check this since it seems to include the changes against ContainerId? It has an impact against RM-HA.

was (Author: ozawa):
Hey Akira, can I check this since it seems to include the changes againstContainerId? It has an impact against RM-HA.

> Remove useless boxing/unboxing code (Hadoop YARN)
> ---------------------------------------------------
>
> Key: YARN-4630
> URL: https://issues.apache.org/jira/browse/YARN-4630
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
> Affects Versions: 3.0.0
> Reporter: Kousuke Saruta
> Priority: Minor
> Attachments: YARN-4630.0.patch
>
> There are lots of places where useless boxing/unboxing occur.
> To avoid performance issue, let's remove them.
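For readers unfamiliar with the pattern the patch removes, a representative example (a generic illustration, not taken from the patch):

{code}
String s = "8080";
int boxed = Integer.valueOf(s).intValue(); // allocates an Integer, then unboxes it
int direct = Integer.parseInt(s);          // same result, no boxing at all
{code}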
[jira] [Commented] (YARN-4648) Move preemption related tests from TestFairScheduler to TestFairSchedulerPreemption
[ https://issues.apache.org/jira/browse/YARN-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158717#comment-15158717 ]

Tsuyoshi Ozawa commented on YARN-4648:
--------------------------------------

Note: The failures of TestClientRMTokens and TestAMAuthorization are tracked on HADOOP-12687. They're not related to the patch uploaded here.

> Move preemption related tests from TestFairScheduler to
> TestFairSchedulerPreemption
> ----------------------------------------------------------
>
> Key: YARN-4648
> URL: https://issues.apache.org/jira/browse/YARN-4648
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.8.0
> Reporter: Karthik Kambatla
> Assignee: Kai Sasaki
> Labels: newbie++
> Attachments: YARN-4648.01.patch, YARN-4648.02.patch, YARN-4648.03.patch
[jira] [Commented] (YARN-4648) Move preemption related tests from TestFairScheduler to TestFairSchedulerPreemption
[ https://issues.apache.org/jira/browse/YARN-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158713#comment-15158713 ]

Tsuyoshi Ozawa commented on YARN-4648:
--------------------------------------

+1, checking this in.

> Move preemption related tests from TestFairScheduler to
> TestFairSchedulerPreemption
> ----------------------------------------------------------
>
> Key: YARN-4648
> URL: https://issues.apache.org/jira/browse/YARN-4648
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.8.0
> Reporter: Karthik Kambatla
> Assignee: Kai Sasaki
> Labels: newbie++
> Attachments: YARN-4648.01.patch, YARN-4648.02.patch, YARN-4648.03.patch
[jira] [Created] (YARN-4713) Warning by unchecked conversion in TestTimelineWebServices
Tsuyoshi Ozawa created YARN-4713:
------------------------------------

Summary: Warning by unchecked conversion in TestTimelineWebServices
Key: YARN-4713
URL: https://issues.apache.org/jira/browse/YARN-4713
Project: Hadoop YARN
Issue Type: Test
Components: test
Reporter: Tsuyoshi Ozawa

[WARNING] /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java:[123,38] [unchecked] unchecked conversion

{code}
Enumeration<String> names = mock(Enumeration.class);
{code}
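A typical way to silence this category of warning, for reference; whether the eventual patch takes this route is not decided here:

{code}
// Acknowledges the unavoidable raw-to-parameterized conversion:
// Mockito's mock(Enumeration.class) can only return a raw Enumeration.
@SuppressWarnings("unchecked")
Enumeration<String> names = mock(Enumeration.class);
{code}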
[jira] [Commented] (YARN-4708) Missing default mapper type in TimelineServer performance test tool usage
[ https://issues.apache.org/jira/browse/YARN-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156267#comment-15156267 ]

Tsuyoshi Ozawa commented on YARN-4708:
--------------------------------------

+1, checking this in.

> Missing default mapper type in TimelineServer performance test tool usage
> ----------------------------------------------------------------------------
>
> Key: YARN-4708
> URL: https://issues.apache.org/jira/browse/YARN-4708
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: timelineserver
> Reporter: Kai Sasaki
> Assignee: Kai Sasaki
> Priority: Minor
> Attachments: YARN-4708.01.patch
>
> The TimelineServer performance test tool uses SimpleEntityWriter as the
> default mapper. It can be indicated explicitly in the usage of the tool.
[jira] [Commented] (YARN-4648) Move preemption related tests from TestFairScheduler to TestFairSchedulerPreemption
[ https://issues.apache.org/jira/browse/YARN-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155844#comment-15155844 ]

Tsuyoshi Ozawa commented on YARN-4648:
--------------------------------------

[~lewuathe] Thank you for updating. Unfortunately, startResourceManagerForPreemptionTest is still confusing because the name of the class is TestFairSchedulerPreemption. My suggestion is to rename startResourceManager to startResourceManagerWithStubbedFairScheduler, and startResourceManagerForPreemptionTest to startResourceManagerWithRealFairScheduler. Do you have any better idea?

> Move preemption related tests from TestFairScheduler to
> TestFairSchedulerPreemption
> ----------------------------------------------------------
>
> Key: YARN-4648
> URL: https://issues.apache.org/jira/browse/YARN-4648
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.8.0
> Reporter: Karthik Kambatla
> Assignee: Kai Sasaki
> Labels: newbie++
> Attachments: YARN-4648.01.patch, YARN-4648.02.patch
[jira] [Commented] (YARN-2225) Turn the virtual memory check to be off by default
[ https://issues.apache.org/jira/browse/YARN-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155388#comment-15155388 ]

Tsuyoshi Ozawa commented on YARN-2225:
--------------------------------------

It's a bit too aggressive to disable the vmem check by default, as discussed on this issue, since some users enable it. IMHO, I prefer making the default value of the vmem ratio larger. How about closing this issue and doing that in another JIRA (or moving HADOOP-11364 to a YARN issue), since the problem being addressed there is different from this issue?

> Turn the virtual memory check to be off by default
> ----------------------------------------------------
>
> Key: YARN-2225
> URL: https://issues.apache.org/jira/browse/YARN-2225
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Anubhav Dhoot
> Assignee: Anubhav Dhoot
> Attachments: YARN-2225.patch
>
> The virtual memory check may not be the best way to isolate applications.
> Virtual memory is not the constrained resource. It would be better if we
> limit the swapping of the task using swapiness instead. This patch will turn
> this DEFAULT_NM_VMEM_CHECK_ENABLED off by default and let users turn it on if
> they need to.
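The two knobs being debated, set programmatically for illustration (the same keys can be set in yarn-site.xml); the constants are from YarnConfiguration, and 4.0f is only an example of "larger":

{code}
YarnConfiguration conf = new YarnConfiguration();
// What this JIRA proposes: turn the virtual memory check off by default.
conf.setBoolean(YarnConfiguration.NM_VMEM_CHECK_ENABLED, false);
// The alternative preferred above: keep the check but raise the ratio
// (DEFAULT_NM_VMEM_PMEM_RATIO is 2.1).
conf.setFloat(YarnConfiguration.NM_VMEM_PMEM_RATIO, 4.0f);
{code}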
[jira] [Commented] (YARN-2225) Turn the virtual memory check to be off by default
[ https://issues.apache.org/jira/browse/YARN-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148329#comment-15148329 ]

Tsuyoshi Ozawa commented on YARN-2225:
--------------------------------------

Why not make the vmem-pmem ratio larger to address the problem?

> Turn the virtual memory check to be off by default
> ----------------------------------------------------
>
> Key: YARN-2225
> URL: https://issues.apache.org/jira/browse/YARN-2225
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Anubhav Dhoot
> Assignee: Anubhav Dhoot
> Attachments: YARN-2225.patch
>
> The virtual memory check may not be the best way to isolate applications.
> Virtual memory is not the constrained resource. It would be better if we
> limit the swapping of the task using swapiness instead. This patch will turn
> this DEFAULT_NM_VMEM_CHECK_ENABLED off by default and let users turn it on if
> they need to.
[jira] [Commented] (YARN-4648) Move preemption related tests from TestFairScheduler to TestFairSchedulerPreemption
[ https://issues.apache.org/jira/browse/YARN-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147930#comment-15147930 ]

Tsuyoshi Ozawa commented on YARN-4648:
--------------------------------------

[~kaisasak] Instead of changing the sequence of initialization, how about changing the name of {{startResourceManagerWithoutThreshold}}? I think the name of {{startResourceManagerWithoutThreshold}} looks confusing since the behaviour of the method appears to be equal to startResourceManager(1.1f). What do you think?

> Move preemption related tests from TestFairScheduler to
> TestFairSchedulerPreemption
> ----------------------------------------------------------
>
> Key: YARN-4648
> URL: https://issues.apache.org/jira/browse/YARN-4648
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.8.0
> Reporter: Karthik Kambatla
> Assignee: Kai Sasaki
> Labels: newbie++
> Attachments: YARN-4648.01.patch
[jira] [Commented] (YARN-4648) Move preemption related tests from TestFairScheduler to TestFairSchedulerPreemption
[ https://issues.apache.org/jira/browse/YARN-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146932#comment-15146932 ]

Tsuyoshi Ozawa commented on YARN-4648:
--------------------------------------

[~lewuathe] thank you for your contribution. I looked over your patch. I have some comments; could you address them?

{code}
private void startResourceManagerWithoutThreshold() {
{code}

Why not reuse startResourceManager(threshold) with a threshold larger than 1.0f?

{code}
+import org.apache.hadoop.yarn.api.records.*;
{code}

Please don't use * import.

> Move preemption related tests from TestFairScheduler to
> TestFairSchedulerPreemption
> ----------------------------------------------------------
>
> Key: YARN-4648
> URL: https://issues.apache.org/jira/browse/YARN-4648
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.8.0
> Reporter: Karthik Kambatla
> Assignee: Kai Sasaki
> Labels: newbie++
> Attachments: YARN-4648.01.patch
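Concretely, the reuse suggested above could look like this (illustrative; see also the naming discussion elsewhere in this thread):

{code}
// A threshold above 1.0f can never be exceeded, so preemption stays
// disabled; this reuses startResourceManager(float) instead of keeping a
// near-duplicate body.
private void startResourceManagerWithoutThreshold() {
  startResourceManager(1.1f);
}
{code}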
[jira] [Commented] (YARN-4648) Move preemption related tests from TestFairScheduler to TestFairSchedulerPreemption
[ https://issues.apache.org/jira/browse/YARN-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143676#comment-15143676 ]

Tsuyoshi Ozawa commented on YARN-4648:
--------------------------------------

[~lewuathe], sure, I'll check this this weekend.

> Move preemption related tests from TestFairScheduler to
> TestFairSchedulerPreemption
> ----------------------------------------------------------
>
> Key: YARN-4648
> URL: https://issues.apache.org/jira/browse/YARN-4648
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.8.0
> Reporter: Karthik Kambatla
> Assignee: Kai Sasaki
> Labels: newbie++
> Attachments: YARN-4648.01.patch
[jira] [Commented] (YARN-4234) New put APIs in TimelineClient for ats v1.5
[ https://issues.apache.org/jira/browse/YARN-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070805#comment-15070805 ]

Tsuyoshi Ozawa commented on YARN-4234:
--------------------------------------

[~djp] [~iwasakims] committed the addendum patch by Masatake to trunk and branch-2 (just removing a file "q" in the root directory). Thanks for your contribution!

> New put APIs in TimelineClient for ats v1.5
> ---------------------------------------------
>
> Key: YARN-4234
> URL: https://issues.apache.org/jira/browse/YARN-4234
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Xuan Gong
> Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-4234-2015-11-13.1.patch, YARN-4234-2015-11-16.1.patch,
> YARN-4234-2015-11-16.2.patch, YARN-4234-2015.2.patch, YARN-4234.1.patch,
> YARN-4234.2.patch, YARN-4234.2015-11-12.1.patch, YARN-4234.2015-11-12.1.patch,
> YARN-4234.2015-11-18.1.patch, YARN-4234.2015-11-18.2.patch,
> YARN-4234.2015-11-18.patch, YARN-4234.2015-12-09.patch,
> YARN-4234.2015-12-09.patch, YARN-4234.2015-12-17.1.patch,
> YARN-4234.2015-12-18.1.patch, YARN-4234.2015-12-18.patch,
> YARN-4234.2015-12-21.1.patch, YARN-4234.20151109.patch,
> YARN-4234.20151110.1.patch, YARN-4234.2015.1.patch, YARN-4234.3.patch,
> YARN-4234.addendum.patch
>
> In this ticket, we will add new put APIs in timelineClient to let
> clients/applications have the option to use ATS v1.5
[jira] [Commented] (YARN-4234) New put APIs in TimelineClient for ats v1.5
[ https://issues.apache.org/jira/browse/YARN-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070799#comment-15070799 ]

Tsuyoshi Ozawa commented on YARN-4234:
--------------------------------------

[~iwasakims] thanks for following up. +1, checking this in.

> New put APIs in TimelineClient for ats v1.5
> ---------------------------------------------
>
> Key: YARN-4234
> URL: https://issues.apache.org/jira/browse/YARN-4234
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Xuan Gong
> Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-4234-2015-11-13.1.patch, YARN-4234-2015-11-16.1.patch,
> YARN-4234-2015-11-16.2.patch, YARN-4234-2015.2.patch, YARN-4234.1.patch,
> YARN-4234.2.patch, YARN-4234.2015-11-12.1.patch, YARN-4234.2015-11-12.1.patch,
> YARN-4234.2015-11-18.1.patch, YARN-4234.2015-11-18.2.patch,
> YARN-4234.2015-11-18.patch, YARN-4234.2015-12-09.patch,
> YARN-4234.2015-12-09.patch, YARN-4234.2015-12-17.1.patch,
> YARN-4234.2015-12-18.1.patch, YARN-4234.2015-12-18.patch,
> YARN-4234.2015-12-21.1.patch, YARN-4234.20151109.patch,
> YARN-4234.20151110.1.patch, YARN-4234.2015.1.patch, YARN-4234.3.patch,
> YARN-4234.addendum.patch
>
> In this ticket, we will add new put APIs in timelineClient to let
> clients/applications have the option to use ATS v1.5
[jira] [Commented] (YARN-4234) New put APIs in TimelineClient for ats v1.5
[ https://issues.apache.org/jira/browse/YARN-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070915#comment-15070915 ] Tsuyoshi Ozawa commented on YARN-4234: -- Sorry, I just forgot to push. Thanks for following up! > New put APIs in TimelineClient for ats v1.5 > --- > > Key: YARN-4234 > URL: https://issues.apache.org/jira/browse/YARN-4234 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Xuan Gong >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-4234-2015-11-13.1.patch, > YARN-4234-2015-11-16.1.patch, YARN-4234-2015-11-16.2.patch, > YARN-4234-2015.2.patch, YARN-4234.1.patch, YARN-4234.2.patch, > YARN-4234.2015-11-12.1.patch, YARN-4234.2015-11-12.1.patch, > YARN-4234.2015-11-18.1.patch, YARN-4234.2015-11-18.2.patch, > YARN-4234.2015-11-18.patch, YARN-4234.2015-12-09.patch, > YARN-4234.2015-12-09.patch, YARN-4234.2015-12-17.1.patch, > YARN-4234.2015-12-18.1.patch, YARN-4234.2015-12-18.patch, > YARN-4234.2015-12-21.1.patch, YARN-4234.20151109.patch, > YARN-4234.20151110.1.patch, YARN-4234.2015.1.patch, YARN-4234.3.patch, > YARN-4234.addendum.patch > > > In this ticket, we will add new put APIs in timelineClient to let > clients/applications have the option to use ATS v1.5 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4505) TestJobHistoryEventHandler.testTimelineEventHandling fails on trunk because of NPE
Tsuyoshi Ozawa created YARN-4505: Summary: TestJobHistoryEventHandler.testTimelineEventHandling fails on trunk because of NPE Key: YARN-4505 URL: https://issues.apache.org/jira/browse/YARN-4505 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi Ozawa https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/824/ {code} Tests run: 13, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 21.163 sec <<< FAILURE! - in org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler testTimelineEventHandling(org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler) Time elapsed: 5.115 sec <<< ERROR! java.lang.NullPointerException: null at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:331) at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForTimelineServer(JobHistoryEventHandler.java:1015) at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:586) at org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.handleEvent(TestJobHistoryEventHandler.java:722) at org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.testTimelineEventHandling(TestJobHistoryEventHandler.java:510) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048245#comment-15048245 ] Tsuyoshi Ozawa commented on YARN-4301: -- {quote} it maybe change the behaviour of NM_MIN_HEALTHY_DISKS_FRACTION, could we add a timeout to mkdir? if mkdir timeout, the disk is treated as a failed disk. {quote} +1 for the suggestion by [~sandflee]. > NM disk health checker should have a timeout > > > Key: YARN-4301 > URL: https://issues.apache.org/jira/browse/YARN-4301 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Akihiro Suda >Assignee: Akihiro Suda > Attachments: YARN-4301-1.patch, YARN-4301-2.patch, > concept-async-diskchecker.txt > > > The disk health checker [verifies a > disk|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L371-L385] > by executing {{mkdir}} and {{rmdir}} periodically. > If these operations does not return in a moderate timeout, the disk should be > marked bad, and thus {{nodeInfo.nodeHealthy}} should flip to {{false}}. > I confirmed that current YARN does not have an implicit timeout (on JDK7, > Linux 4.2, ext4) using [Earthquake|https://github.com/osrg/earthquake], our > fault injector for distributed systems. > (I'll introduce the reproduction script in a while) > I consider we can fix this issue by making > [{{NodeHealthCheckerServer.isHealthy()}}|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java#L69-L73] > return {{false}} if the value of {{this.getLastHealthReportTime()}} is too > old. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
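A minimal sketch of the timeout-on-mkdir idea endorsed above: run each probe on a worker thread so the caller can bound the wait with {{Future.get}}, and treat a timeout as a failed disk. The class and method names here are made up for illustration; this is not the actual YARN patch.
{code:title=TimedDiskProbe.java}
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedDiskProbe {
  // A cached pool so a probe thread stuck on a bad disk does not
  // block later probes of other disks.
  private static final ExecutorService POOL = Executors.newCachedThreadPool();

  /** Returns false if the mkdir/rmdir probe fails or does not finish in time. */
  static boolean probeDir(File dir, long timeoutMillis) {
    Future<Boolean> result = POOL.submit(() -> {
      File probe = new File(dir, "diskcheck.probe");
      // mkdir() can hang indefinitely on a faulty disk or filesystem
      return probe.mkdir() && probe.delete();
    });
    try {
      return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      result.cancel(true); // abandon the stuck probe; treat the disk as failed
      return false;
    } catch (Exception e) {
      return false;
    }
  }
}
{code}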
[jira] [Updated] (YARN-4438) Implement RM leader election with curator
[ https://issues.apache.org/jira/browse/YARN-4438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4438: - Issue Type: Improvement (was: Bug) > Implement RM leader election with curator > - > > Key: YARN-4438 > URL: https://issues.apache.org/jira/browse/YARN-4438 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-4438.1.patch > > > This is to implement the leader election with curator instead of the > ActiveStandbyElector from common package, this also avoids adding more > configs in common to suit RM's own needs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4439) Clarify NMContainerStatus#toString method.
[ https://issues.apache.org/jira/browse/YARN-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049809#comment-15049809 ] Tsuyoshi Ozawa commented on YARN-4439: -- [~jianhe] should we also add Priority to the printed string? > Clarify NMContainerStatus#toString method. > -- > > Key: YARN-4439 > URL: https://issues.apache.org/jira/browse/YARN-4439 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-4439.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046951#comment-15046951 ] Tsuyoshi Ozawa commented on YARN-4348: -- Now I committed this to branch-2.6.3 too. Thanks! > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Fix For: 2.6.3, 2.7.3 > > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046950#comment-15046950 ] Tsuyoshi Ozawa commented on YARN-4348: -- [~djp] I committed this to branch-2.6, which is targeting 2.6.3. Can I push this to branch-2.6.3? > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Fix For: 2.6.3, 2.7.3 > > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046944#comment-15046944 ] Tsuyoshi Ozawa commented on YARN-4348: -- Ran the tests locally and they pass on branch-2.6. Committing this to branch-2.6. > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Fix For: 2.6.3, 2.7.3 > > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4301: - Assignee: Akihiro Suda > NM disk health checker should have a timeout > > > Key: YARN-4301 > URL: https://issues.apache.org/jira/browse/YARN-4301 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Akihiro Suda >Assignee: Akihiro Suda > Attachments: YARN-4301-1.patch, YARN-4301-2.patch, > concept-async-diskchecker.txt > > > The disk health checker [verifies a > disk|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L371-L385] > by executing {{mkdir}} and {{rmdir}} periodically. > If these operations does not return in a moderate timeout, the disk should be > marked bad, and thus {{nodeInfo.nodeHealthy}} should flip to {{false}}. > I confirmed that current YARN does not have an implicit timeout (on JDK7, > Linux 4.2, ext4) using [Earthquake|https://github.com/osrg/earthquake], our > fault injector for distributed systems. > (I'll introduce the reproduction script in a while) > I consider we can fix this issue by making > [{{NodeHealthCheckerServer.isHealthy()}}|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java#L69-L73] > return {{false}} if the value of {{this.getLastHealthReportTime()}} is too > old. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048221#comment-15048221 ] Tsuyoshi Ozawa commented on YARN-4301: -- [~suda] thank you for the explanation. I have some comments on the v2 patch - could you address them? 1. About the synchronization of DirectoryCollection, I see the point you mentioned. The change, however, causes race conditions among the state in the class (localDirs, fullDirs, errorDirs, and numFailures) - e.g. {{DirectoryCollection.concat(errorDirs, fullDirs)}}, {{createNonExistentDirs}} and other functions cannot work correctly without synchronization. I think the root cause of the problem is calling {{DC.testDirs}} while holding the lock in {{DC.checkDirs}}. How about releasing the lock before calling {{testDirs}} and re-acquiring it afterwards? {quote} synchronized DC.getFailedDirs() can be blocked by synchronized DC.checkDirs(), when File.mkdir() (called from DC.checkDirs(), via DC.testDirs()) does not return in a moderate timeout. Hence NodeHealthCheckerServer.isHealthy() gets also blocked. So I would like to make DC.getXXXs unsynchronized. {quote} 2. If the thread is preempted by the OS and moved to another CPU in a multicore environment, {{gap}} can be a negative value. Hence I prefer not to abort the NodeManager here. {code:title=NodeHealthCheckerService.java} +long diskCheckTime = dirsHandler.getLastDisksCheckTime(); +long now = System.currentTimeMillis(); +long gap = now - diskCheckTime; +if (gap < 0) { + throw new AssertionError("implementation error - now=" + now + + ", diskCheckTime=" + diskCheckTime); +} {code} 3. Please move the configuration validation to {{serviceInit}} to avoid aborting at runtime. {code:title=NodeHealthCheckerService.java} +long allowedGap = this.diskHealthCheckInterval + this.diskHealthCheckTimeout; +if (allowedGap <= 0) { + throw new AssertionError("implementation error - interval=" + this.diskHealthCheckInterval + + ", timeout=" + this.diskHealthCheckTimeout); +} {code} > NM disk health checker should have a timeout > > > Key: YARN-4301 > URL: https://issues.apache.org/jira/browse/YARN-4301 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Akihiro Suda > Attachments: YARN-4301-1.patch, YARN-4301-2.patch, > concept-async-diskchecker.txt > > > The disk health checker [verifies a > disk|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L371-L385] > by executing {{mkdir}} and {{rmdir}} periodically. > If these operations does not return in a moderate timeout, the disk should be > marked bad, and thus {{nodeInfo.nodeHealthy}} should flip to {{false}}. > I confirmed that current YARN does not have an implicit timeout (on JDK7, > Linux 4.2, ext4) using [Earthquake|https://github.com/osrg/earthquake], our > fault injector for distributed systems. > (I'll introduce the reproduction script in a while) > I consider we can fix this issue by making > [{{NodeHealthCheckerServer.isHealthy()}}|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java#L69-L73] > return {{false}} if the value of {{this.getLastHealthReportTime()}} is too > old. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
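For illustration, the release-the-lock-around-{{testDirs}} pattern suggested in point 1 above could look like the following self-contained sketch. The names are placeholders rather than DirectoryCollection's real code: state is snapshotted under the lock, the slow disk probes run without it, and the lock is re-acquired only to publish the results, so synchronized getters are never blocked behind {{File.mkdir()}}.
{code:title=CheckDirsSketch.java}
import java.util.ArrayList;
import java.util.List;

class CheckDirsSketch {
  private final List<String> dirs = new ArrayList<>();
  private final List<String> failedDirs = new ArrayList<>();

  void checkDirs() {
    List<String> snapshot;
    synchronized (this) {
      // 1. copy the shared state while holding the lock
      snapshot = new ArrayList<>(dirs);
    }
    // 2. run the slow mkdir/rmdir probes without holding the lock
    List<String> newFailures = testDirs(snapshot);
    synchronized (this) {
      // 3. re-acquire the lock only to publish the results
      failedDirs.clear();
      failedDirs.addAll(newFailures);
    }
  }

  // stand-in for the real per-directory probe, which may block on bad disks
  private List<String> testDirs(List<String> candidates) {
    return new ArrayList<>();
  }

  // stays synchronized, but is no longer blocked behind testDirs()
  synchronized List<String> getFailedDirs() {
    return new ArrayList<>(failedDirs);
  }
}
{code}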
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047902#comment-15047902 ] Tsuyoshi Ozawa commented on YARN-4301: -- [~suda] thank you for the update. The findbugs warning looks related to this change. Could you fix it? > NM disk health checker should have a timeout > > > Key: YARN-4301 > URL: https://issues.apache.org/jira/browse/YARN-4301 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Akihiro Suda > Attachments: YARN-4301-1.patch, YARN-4301-2.patch, > concept-async-diskchecker.txt > > > The disk health checker [verifies a > disk|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L371-L385] > by executing {{mkdir}} and {{rmdir}} periodically. > If these operations does not return in a moderate timeout, the disk should be > marked bad, and thus {{nodeInfo.nodeHealthy}} should flip to {{false}}. > I confirmed that current YARN does not have an implicit timeout (on JDK7, > Linux 4.2, ext4) using [Earthquake|https://github.com/osrg/earthquake], our > fault injector for distributed systems. > (I'll introduce the reproduction script in a while) > I consider we can fix this issue by making > [{{NodeHealthCheckerServer.isHealthy()}}|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java#L69-L73] > return {{false}} if the value of {{this.getLastHealthReportTime()}} is too > old. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046332#comment-15046332 ] Tsuyoshi Ozawa commented on YARN-4348: -- Committed this to branch-2.7. Thanks [~jianhe] for reviewing and reporting! I will cherry-pick this to branch-2.6 after running tests. > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044258#comment-15044258 ] Tsuyoshi Ozawa commented on YARN-4348: -- [~jianhe] could you take a look? > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding ZK's callback work correctly
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4348: - Summary: ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding ZK's callback work correctly (was: ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout) > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > ZK's callback work correctly > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034071#comment-15034071 ] Tsuyoshi Ozawa commented on YARN-4348: -- [~zxu] [~jianhe] I'm rethinking [this comment|https://issues.apache.org/jira/browse/YARN-3798?focusedCommentId=14609769=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14609769] about using the sync callback to wait for sync completion: it can cause [the lock problem described here|https://issues.apache.org/jira/browse/YARN-4348?focusedCommentId=15018159=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15018159]. The simplest way to deal with the problem is to remove the barrier imposed by the sync callback. This works because the ZK client's requests are sent to the ZK server in order, unless the ZK master server fails while the ZK connection is being recreated. Quorum sync (ZOOKEEPER-2136) is a good helper for that corner case. What do you think? > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
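A minimal sketch of the barrier-free approach described above, written against the plain ZooKeeper client API (the class and field names are illustrative, not ZKRMStateStore's actual code): {{sync()}} is issued asynchronously and the callback only logs the outcome, so no thread blocks waiting for it and ZK's event thread is never held up.
{code:title=NonBlockingSync.java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

class NonBlockingSync {
  private final ZooKeeper zk;

  NonBlockingSync(ZooKeeper zk) {
    this.zk = zk;
  }

  void syncInternal(String path) {
    // No latch and no wait: later requests on the same session are
    // ordered after this sync anyway, so we only log the outcome.
    zk.sync(path, (rc, p, ctx) -> {
      if (rc != KeeperException.Code.OK.intValue()) {
        System.err.println("sync(" + p + ") failed with rc=" + rc);
      }
    }, null);
  }
}
{code}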
[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034045#comment-15034045 ] Tsuyoshi Ozawa commented on YARN-3798: -- [~jianhe] {quote} If curator sync up the data it would be fine. Otherwise there could be a chance of lag like we discussed earlier. Truly I haven't tried Curator yet, probably some one can cross check this part. {quote} FYI, when Curator detects the same situation, it calls sync automatically in the {{doSyncForSuspendedConnection}} method of the Curator framework. Therefore, we don't need to call the sync operation in the trunk and branch-2.8 code. > ZKRMStateStore shouldn't create new session without occurrance of > SESSIONEXPIED > --- > > Key: YARN-3798 > URL: https://issues.apache.org/jira/browse/YARN-3798 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Assignee: Varun Saxena >Priority: Blocker > Fix For: 2.7.2, 2.6.2 > > Attachments: RM.log, YARN-3798-2.7.002.patch, > YARN-3798-branch-2.6.01.patch, YARN-3798-branch-2.6.02.patch, > YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.003.patch, > YARN-3798-branch-2.7.004.patch, YARN-3798-branch-2.7.005.patch, > YARN-3798-branch-2.7.006.patch, YARN-3798-branch-2.7.patch > > > RM going down with NoNode exception during create of znode for appattempt > *Please find the exception logs* > {code} > 2015-06-09 10:09:44,732 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session connected > 2015-06-09 10:09:44,732 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session restored > 2015-06-09 10:09:44,886 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) > at java.lang.Thread.run(Thread.java:745) > 2015-06-09 10:09:44,887 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed > out ZK retries. Giving up! >
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4348: - Summary: ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread (was: ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding ZK's callback work correctly) > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033915#comment-15033915 ] Tsuyoshi Ozawa commented on YARN-4348: -- Ran the tests since the last Jenkins run failed to launch. The javadoc warnings seem to be false positives since this patch doesn't include any javadoc changes. {quote} -1 overall. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated 2079 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version ) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. {quote} [~jianhe] could you take a look at the latest patch? > ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of > zkSessionTimeout > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033956#comment-15033956 ] Tsuyoshi Ozawa commented on YARN-4348: -- [~djp] IMHO, this is a blocker for 2.6.3 and 2.7.3 since the problem is more serious than I thought. Please check [this comment|https://issues.apache.org/jira/browse/YARN-4348?focusedCommentId=15018159=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15018159]. This is unexpected behaviour when the RM fails over, and it prevents the RM from failing over correctly. > ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of > zkSessionTimeout > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4348: - Attachment: YARN-4348-branch-2.7.004.patch Adding missing {{continue}} statement after calling {{syncInternal}} in the following block: {code} if (shouldRetryWithNewConnection(ke.code()) && retry < numRetries) { LOG.info("Retrying operation on ZK with new Connection. " + "Retry no. " + retry); Thread.sleep(zkRetryInterval); createConnection(); syncInternal(ke.getPath()); continue; } {code} > ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of > zkSessionTimeout > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032845#comment-15032845 ] Tsuyoshi Ozawa edited comment on YARN-4348 at 12/1/15 3:18 AM: --- [~jianhe] good catch. Adding missing {{continue}} statement after calling {{syncInternal}} in the following block in v4 patch. was (Author: ozawa): Adding missing {{continue}} statement after calling {{syncInternal}} in the following block: {code} if (shouldRetryWithNewConnection(ke.code()) && retry < numRetries) { LOG.info("Retrying operation on ZK with new Connection. " + "Retry no. " + retry); Thread.sleep(zkRetryInterval); createConnection(); syncInternal(ke.getPath()); continue; } {code} > ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of > zkSessionTimeout > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033002#comment-15033002 ] Tsuyoshi Ozawa commented on YARN-4348: -- Jenkins still fails. Opened YETUS-217 to track the problem. Kicking Jenkins locally. > ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of > zkSessionTimeout > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033002#comment-15033002 ] Tsuyoshi Ozawa edited comment on YARN-4348 at 12/1/15 3:22 AM: --- Jenkins still fails. Opened YETUS-217 to track the problem. Kicking test-patch.sh locally. was (Author: ozawa): Jenkins still fail. Opened YETUS-217 to track the problem. Kicking Jenkins on local. > ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of > zkSessionTimeout > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032845#comment-15032845 ] Tsuyoshi Ozawa edited comment on YARN-4348 at 12/1/15 5:44 AM: --- [~jianhe] good catch. Adding missing {{continue}} statement after calling {{syncInternal}} in v4 patch. was (Author: ozawa): [~jianhe] good catch. Adding missing {{continue}} statement after calling {{syncInternal}} in the following block in v4 patch. > ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of > zkSessionTimeout > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15028989#comment-15028989 ] Tsuyoshi Ozawa commented on YARN-4348: -- Ran the tests since the last Jenkins run failed to launch. {quote} -1 overall. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated 2079 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version ) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. {quote} > ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of > zkSessionTimeout > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348.001.patch, YARN-4348.001.patch, > log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4393) TestResourceLocalizationService#testFailedDirsResourceRelease fails intermittently
[ https://issues.apache.org/jira/browse/YARN-4393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15027096#comment-15027096 ] Tsuyoshi Ozawa commented on YARN-4393: -- +1, checking this in. > TestResourceLocalizationService#testFailedDirsResourceRelease fails > intermittently > -- > > Key: YARN-4393 > URL: https://issues.apache.org/jira/browse/YARN-4393 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Varun Saxena >Assignee: Varun Saxena > Fix For: 2.7.3 > > Attachments: YARN-4393.01.patch > > > [~ozawa] pointed out this failure on YARN-4380. > Check > https://issues.apache.org/jira/browse/YARN-4380?focusedCommentId=15023773=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15023773 > {noformat} > sts run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 5.518 sec <<< > FAILURE! - in > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService > testFailedDirsResourceRelease(org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService) > Time elapsed: 0.093 sec <<< FAILURE! > org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: > Argument(s) are different! Wanted: > eventHandler.handle( > > ); > -> at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testFailedDirsResourceRelease(TestResourceLocalizationService.java:2632) > Actual invocation has different arguments: > eventHandler.handle( > EventType: APPLICATION_INITED > ); > -> at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testFailedDirsResourceRelease(TestResourceLocalizationService.java:2632) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4318) Test failure: TestAMAuthorization
[ https://issues.apache.org/jira/browse/YARN-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026936#comment-15026936 ] Tsuyoshi Ozawa commented on YARN-4318: -- [~kshukla] please go ahead :-) > Test failure: TestAMAuthorization > - > > Key: YARN-4318 > URL: https://issues.apache.org/jira/browse/YARN-4318 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Tsuyoshi Ozawa >Assignee: Kuhu Shukla > > {quote} > Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 14.891 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization > testUnauthorizedAccess[0](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization) > Time elapsed: 3.208 sec <<< ERROR! > java.net.UnknownHostException: Invalid host name: local host is: (unknown); > destination host is: "b5a5dd9ec835":8030; java.net.UnknownHostException; For > more details see: http://wiki.apache.org/hadoop/UnknownHost > at org.apache.hadoop.ipc.Client$Connection.(Client.java:403) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1512) > at org.apache.hadoop.ipc.Client.call(Client.java:1439) > at org.apache.hadoop.ipc.Client.call(Client.java:1400) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy15.registerApplicationMaster(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:106) > at > org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:273) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (YARN-4393) TestResourceLocalizationService#testFailedDirsResourceRelease fails intermittently
[ https://issues.apache.org/jira/browse/YARN-4393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa reopened YARN-4393: -- Oops, commented on the wrong JIRA. Reopening. > TestResourceLocalizationService#testFailedDirsResourceRelease fails > intermittently > -- > > Key: YARN-4393 > URL: https://issues.apache.org/jira/browse/YARN-4393 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Varun Saxena >Assignee: Varun Saxena > Fix For: 2.7.3 > > Attachments: YARN-4393.01.patch > > > [~ozawa] pointed out this failure on YARN-4380. > Check > https://issues.apache.org/jira/browse/YARN-4380?focusedCommentId=15023773=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15023773 > {noformat} > sts run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 5.518 sec <<< > FAILURE! - in > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService > testFailedDirsResourceRelease(org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService) > Time elapsed: 0.093 sec <<< FAILURE! > org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: > Argument(s) are different! Wanted: > eventHandler.handle( > > ); > -> at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testFailedDirsResourceRelease(TestResourceLocalizationService.java:2632) > Actual invocation has different arguments: > eventHandler.handle( > EventType: APPLICATION_INITED > ); > -> at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testFailedDirsResourceRelease(TestResourceLocalizationService.java:2632) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4380) TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails intermittently
[ https://issues.apache.org/jira/browse/YARN-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4380: - Hadoop Flags: Reviewed > TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails > intermittently > > > Key: YARN-4380 > URL: https://issues.apache.org/jira/browse/YARN-4380 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0, 2.7.1 >Reporter: Tsuyoshi Ozawa >Assignee: Varun Saxena > Fix For: 2.7.3 > > Attachments: YARN-4380.01.patch, > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell-output.2.txt, > > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService-output.txt > > > {quote} > Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 5.361 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService > testDownloadingResourcesOnContainerKill(org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService) > Time elapsed: 0.109 sec <<< FAILURE! > org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: > Argument(s) are different! Wanted: > deletionService.delete( > "user0", > null, > > ); > -> at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1322) > Actual invocation has different arguments: > deletionService.delete( > "user0", > > /home/ubuntu/hadoop-dev/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService/0/usercache/user0/appcache/application_314159265358979_0003/container_314159265358979_0003_01_42 > ); > -> at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1296) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1322) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4393) TestResourceLocalizationService#testFailedDirsResourceRelease fails intermittently
[ https://issues.apache.org/jira/browse/YARN-4393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15027113#comment-15027113 ] Tsuyoshi Ozawa commented on YARN-4393: -- [~varun_saxena], before committing, I found some missing {{dispatcher.await()}} calls: testResourceRelease: {code} //Send Cleanup Event spyService.handle(new ContainerLocalizationCleanupEvent(c, req)); // <-- here! verify(mockLocallilzerTracker) .cleanupPrivLocalizers("container_314159265358979_0003_01_42"); req2.remove(LocalResourceVisibility.PRIVATE); spyService.handle(new ContainerLocalizationCleanupEvent(c, req2)); dispatcher.await(); {code} testFailedDirsResourceRelease: {code} // Send Cleanup Event spyService.handle(new ContainerLocalizationCleanupEvent(c, req)); // <-- here! verify(mockLocallilzerTracker).cleanupPrivLocalizers( "container_314159265358979_0003_01_42"); {code} testRecovery: {code} assertNotNull("Localization not started", privLr1.getLocalPath()); privTracker1.handle(new ResourceLocalizedEvent(privReq1, privLr1.getLocalPath(), privLr1.getSize() + 5)); assertNotNull("Localization not started", privLr2.getLocalPath()); privTracker1.handle(new ResourceLocalizedEvent(privReq2, privLr2.getLocalPath(), privLr2.getSize() + 10)); assertNotNull("Localization not started", appLr1.getLocalPath()); appTracker1.handle(new ResourceLocalizedEvent(appReq1, appLr1.getLocalPath(), appLr1.getSize())); assertNotNull("Localization not started", appLr3.getLocalPath()); appTracker2.handle(new ResourceLocalizedEvent(appReq3, appLr3.getLocalPath(), appLr3.getSize() + 7)); assertNotNull("Localization not started", pubLr1.getLocalPath()); pubTracker.handle(new ResourceLocalizedEvent(pubReq1, pubLr1.getLocalPath(), pubLr1.getSize() + 1000)); assertNotNull("Localization not started", pubLr2.getLocalPath()); pubTracker.handle(new ResourceLocalizedEvent(pubReq2, pubLr2.getLocalPath(), pubLr2.getSize() + 9)); {code} Could you update them? (A sketch of the fix is shown after this message.) > TestResourceLocalizationService#testFailedDirsResourceRelease fails > intermittently > -- > > Key: YARN-4393 > URL: https://issues.apache.org/jira/browse/YARN-4393 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Varun Saxena >Assignee: Varun Saxena > Fix For: 2.7.3 > > Attachments: YARN-4393.01.patch > > > [~ozawa] pointed out this failure on YARN-4380. > Check > https://issues.apache.org/jira/browse/YARN-4380?focusedCommentId=15023773=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15023773 > {noformat} > sts run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 5.518 sec <<< > FAILURE! - in > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService > testFailedDirsResourceRelease(org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService) > Time elapsed: 0.093 sec <<< FAILURE! > org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: > Argument(s) are different!
Wanted: > eventHandler.handle( > > ); > -> at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testFailedDirsResourceRelease(TestResourceLocalizationService.java:2632) > Actual invocation has different arguments: > eventHandler.handle( > EventType: APPLICATION_INITED > ); > -> at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testFailedDirsResourceRelease(TestResourceLocalizationService.java:2632) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
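For each of the spots flagged in the comment above, the fix follows the pattern already used elsewhere in these tests: drain the async dispatcher before verifying, so the event has actually been delivered when the mock is checked. A sketch using the identifiers from the first snippet (illustrative, not the final patch):
{code}
// Send Cleanup Event
spyService.handle(new ContainerLocalizationCleanupEvent(c, req));
dispatcher.await(); // drain queued events before verifying
verify(mockLocallilzerTracker)
    .cleanupPrivLocalizers("container_314159265358979_0003_01_42");
{code}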
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15027145#comment-15027145 ] Tsuyoshi Ozawa commented on YARN-4348: -- Kicking Jenkins again. > ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of > zkSessionTimeout > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348.001.patch, YARN-4348.001.patch, > log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4318) Test failure: TestAMAuthorization
[ https://issues.apache.org/jira/browse/YARN-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4318: - Assignee: Kuhu Shukla > Test failure: TestAMAuthorization > - > > Key: YARN-4318 > URL: https://issues.apache.org/jira/browse/YARN-4318 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Tsuyoshi Ozawa >Assignee: Kuhu Shukla > > {quote} > Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 14.891 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization > testUnauthorizedAccess[0](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization) > Time elapsed: 3.208 sec <<< ERROR! > java.net.UnknownHostException: Invalid host name: local host is: (unknown); > destination host is: "b5a5dd9ec835":8030; java.net.UnknownHostException; For > more details see: http://wiki.apache.org/hadoop/UnknownHost > at org.apache.hadoop.ipc.Client$Connection.(Client.java:403) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1512) > at org.apache.hadoop.ipc.Client.call(Client.java:1439) > at org.apache.hadoop.ipc.Client.call(Client.java:1400) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy15.registerApplicationMaster(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:106) > at > org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:273) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4380) TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails intermittently
[ https://issues.apache.org/jira/browse/YARN-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15027004#comment-15027004 ] Tsuyoshi Ozawa commented on YARN-4380: -- +1, checking this in. > TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails > intermittently > > > Key: YARN-4380 > URL: https://issues.apache.org/jira/browse/YARN-4380 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0, 2.7.1 >Reporter: Tsuyoshi Ozawa >Assignee: Varun Saxena > Attachments: YARN-4380.01.patch, > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell-output.2.txt, > > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService-output.txt > > > {quote} > Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 5.361 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService > testDownloadingResourcesOnContainerKill(org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService) > Time elapsed: 0.109 sec <<< FAILURE! > org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: > Argument(s) are different! Wanted: > deletionService.delete( > "user0", > null, > > ); > -> at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1322) > Actual invocation has different arguments: > deletionService.delete( > "user0", > > /home/ubuntu/hadoop-dev/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService/0/usercache/user0/appcache/application_314159265358979_0003/container_314159265358979_0003_01_42 > ); > -> at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1296) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1322) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4387) Fix FairScheduler log message
[ https://issues.apache.org/jira/browse/YARN-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023984#comment-15023984 ] Tsuyoshi Ozawa commented on YARN-4387: -- +1, checking this in. > Fix FairScheduler log message > - > > Key: YARN-4387 > URL: https://issues.apache.org/jira/browse/YARN-4387 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Xin Wang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4387) Fix typo in FairScheduler log message
[ https://issues.apache.org/jira/browse/YARN-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4387: - Summary: Fix typo in FairScheduler log message (was: Fix FairScheduler log message) > Fix typo in FairScheduler log message > - > > Key: YARN-4387 > URL: https://issues.apache.org/jira/browse/YARN-4387 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Xin Wang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4387) Fix typo in FairScheduler log message
[ https://issues.apache.org/jira/browse/YARN-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4387: - Attachment: YARN-4387.001.patch > Fix typo in FairScheduler log message > - > > Key: YARN-4387 > URL: https://issues.apache.org/jira/browse/YARN-4387 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Xin Wang >Priority: Minor > Attachments: YARN-4387.001.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4387) Fix typo in FairScheduler log message
[ https://issues.apache.org/jira/browse/YARN-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4387: - Hadoop Flags: Reviewed > Fix typo in FairScheduler log message > - > > Key: YARN-4387 > URL: https://issues.apache.org/jira/browse/YARN-4387 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Xin Wang >Priority: Minor > Attachments: YARN-4387.001.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4371) "yarn application -kill" should take multiple application ids
[ https://issues.apache.org/jira/browse/YARN-4371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024028#comment-15024028 ] Tsuyoshi Ozawa commented on YARN-4371: -- [~sunilg] thank you for the initial patch. I looked over the patch and have a comment about the design. In the patch, a new RPC, {{killApplication(List applicationIds)}}, is added. IMHO, it's better to call the existing {{killApplication(ApplicationId applicationId)}} multiple times, since that is simpler and killApplication is unlikely to be called frequently enough for the extra calls to matter. Could you update the patch accordingly? > "yarn application -kill" should take multiple application ids > - > > Key: YARN-4371 > URL: https://issues.apache.org/jira/browse/YARN-4371 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Tsuyoshi Ozawa >Assignee: Sunil G > Attachments: 0001-YARN-4371.patch > > > Currently we cannot pass multiple applications to "yarn application -kill" > command. The command should take multiple application ids at the same time. > Each entries should be separated with whitespace like: > {code} > yarn application -kill application_1234_0001 application_1234_0007 > application_1234_0012 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
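A minimal sketch of the per-application approach suggested in the comment above, assuming the caller already holds parsed {{ApplicationId}} objects; {{MultiKillSketch}} and {{killAll}} are illustrative names only, not part of any patch:
{code}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Sketch: issue one existing killApplication call per application id
// rather than adding a new batched RPC to the protocol.
public class MultiKillSketch {
  public static void killAll(YarnClient client, List<ApplicationId> appIds)
      throws IOException, YarnException {
    for (ApplicationId appId : appIds) {
      // One round trip per application; kill is infrequent enough
      // that the extra RPCs should not matter.
      client.killApplication(appId);
    }
  }
}
{code}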
[jira] [Updated] (YARN-4387) Fix typo in FairScheduler log message
[ https://issues.apache.org/jira/browse/YARN-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4387: - Target Version/s: 2.8.0 > Fix typo in FairScheduler log message > - > > Key: YARN-4387 > URL: https://issues.apache.org/jira/browse/YARN-4387 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Xin Wang >Priority: Minor > Attachments: YARN-4387.001.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4387) Fix typo in FairScheduler log message
[ https://issues.apache.org/jira/browse/YARN-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4387: - Assignee: Xin Wang > Fix typo in FairScheduler log message > - > > Key: YARN-4387 > URL: https://issues.apache.org/jira/browse/YARN-4387 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Xin Wang >Assignee: Xin Wang >Priority: Minor > Attachments: YARN-4387.001.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4306) Test failure: TestClientRMTokens
[ https://issues.apache.org/jira/browse/YARN-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024298#comment-15024298 ] Tsuyoshi Ozawa commented on YARN-4306: -- This problem still occurs on trunk. [~sunilg], could you take a look? > Test failure: TestClientRMTokens > > > Key: YARN-4306 > URL: https://issues.apache.org/jira/browse/YARN-4306 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Sunil G >Assignee: Sunil G > > Tests are getting failed in local also. As part of HADOOP-12321 jenkins run, > I see same error.: > {noformat}testShortCircuitRenewCancelDifferentHostSamePort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) > Time elapsed: 0.638 sec <<< FAILURE! > java.lang.AssertionError: expected: but was: > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:144) > at > org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:363) > at > org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelDifferentHostSamePort(TestClientRMTokens.java:316) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-4385) TestDistributedShell times out
[ https://issues.apache.org/jira/browse/YARN-4385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022660#comment-15022660 ] Tsuyoshi Ozawa edited comment on YARN-4385 at 11/23/15 6:25 PM: >From https://builds.apache.org/job/Hadoop-Yarn-trunk/1380/ {quote} TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShellWithNodeLabels.setup:47 » YarnRuntime java.io.IOException:... Tests run: 14, Failures: 0, Errors: 12, Skipped: 0 [INFO] [INFO] Reactor Summary: [INFO] [INFO] Apache Hadoop YARN SUCCESS [ 4.803 s] [INFO] Apache Hadoop YARN API SUCCESS [04:44 min] [INFO] Apache Hadoop YARN Common . SUCCESS [03:31 min] [INFO] Apache Hadoop YARN Server . SUCCESS [ 0.109 s] [INFO] Apache Hadoop YARN Server Common .. SUCCESS [ 57.348 s] [INFO] Apache Hadoop YARN NodeManager SUCCESS [10:05 min] [INFO] Apache Hadoop YARN Web Proxy .. SUCCESS [ 29.458 s] [INFO] Apache Hadoop YARN ApplicationHistoryService .. SUCCESS [03:46 min] [INFO] Apache Hadoop YARN ResourceManager SUCCESS [ 01:03 h] [INFO] Apache Hadoop YARN Server Tests ... SUCCESS [01:52 min] [INFO] Apache Hadoop YARN Client . SUCCESS [07:21 min] [INFO] Apache Hadoop YARN SharedCacheManager . SUCCESS [ 32.136 s] [INFO] Apache Hadoop YARN Applications ... SUCCESS [ 0.053 s] [INFO] Apache Hadoop YARN DistributedShell ... FAILURE [ 29.403 s] [INFO] Apache Hadoop YARN Unmanaged Am Launcher .. SKIPPED [INFO] Apache Hadoop YARN Site ... SKIPPED [INFO] Apache Hadoop YARN Registry ... SKIPPED [INFO] Apache Hadoop YARN Project SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 01:37 h [INFO] Finished at: 2015-11-09T20:36:25+00:00 [INFO] Final Memory: 81M/690M [INFO] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.17:test (default-test) on project hadoop-yarn-applications-distributedshell: There are test failures. [ERROR] [ERROR] Please refer to /home/jenkins/jenkins-slave/workspace/Hadoop-Yarn-trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/surefire-reports for the individual test results. [ERROR] -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn -rf :hadoop-yarn-applications-distributedshell Build step 'Execute shell' marked build as failure Archiving artifacts Recording test results Updating HDFS-9234 Sending e-mails to: yarn-...@hadoop.apache.org Email was triggered for: Failure - Any Sending email for trigger: Failure - Any ### ## FAILED TESTS (if any) ## 12 tests failed. 
FAILED: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithInvalidArgs Error Message: java.io.IOException: ResourceManager failed to start. Final state is STOPPED Stack Trace: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: ResourceManager failed to start. Final state is STOPPED at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:331) at org.apache.hadoop.yarn.server.MiniYARNCluster.access$500(MiniYARNCluster.java:99) at
[jira] [Comment Edited] (YARN-4385) TestDistributedShell times out
[ https://issues.apache.org/jira/browse/YARN-4385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022660#comment-15022660 ] Tsuyoshi Ozawa edited comment on YARN-4385 at 11/23/15 6:26 PM: On my local log: {quote} Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 437.156 sec <<< FAILURE! - in org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell testDSShellWithCustomLogPropertyFile(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 115.558 sec <<< ERROR! java.lang.Exception: test timed out after 9 milliseconds at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.yarn.applications.distributedshell.Client.monitorApplication(Client.java:734) at org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:715) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithCustomLogPropertyFile(TestDistributedShell.java:502) {quote} was (Author: ozawa): >From https://builds.apache.org/job/Hadoop-Yarn-trunk/1380/ {quote} TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShellWithNodeLabels.setup:47 » YarnRuntime java.io.IOException:... Tests run: 14, Failures: 0, Errors: 12, Skipped: 0 [INFO] [INFO] Reactor Summary: [INFO] [INFO] Apache Hadoop YARN SUCCESS [ 4.803 s] [INFO] Apache Hadoop YARN API SUCCESS [04:44 min] [INFO] Apache Hadoop YARN Common . SUCCESS [03:31 min] [INFO] Apache Hadoop YARN Server . SUCCESS [ 0.109 s] [INFO] Apache Hadoop YARN Server Common .. SUCCESS [ 57.348 s] [INFO] Apache Hadoop YARN NodeManager SUCCESS [10:05 min] [INFO] Apache Hadoop YARN Web Proxy .. SUCCESS [ 29.458 s] [INFO] Apache Hadoop YARN ApplicationHistoryService .. SUCCESS [03:46 min] [INFO] Apache Hadoop YARN ResourceManager SUCCESS [ 01:03 h] [INFO] Apache Hadoop YARN Server Tests ... SUCCESS [01:52 min] [INFO] Apache Hadoop YARN Client . SUCCESS [07:21 min] [INFO] Apache Hadoop YARN SharedCacheManager . SUCCESS [ 32.136 s] [INFO] Apache Hadoop YARN Applications ... SUCCESS [ 0.053 s] [INFO] Apache Hadoop YARN DistributedShell ... FAILURE [ 29.403 s] [INFO] Apache Hadoop YARN Unmanaged Am Launcher .. SKIPPED [INFO] Apache Hadoop YARN Site ... SKIPPED [INFO] Apache Hadoop YARN Registry ... SKIPPED [INFO] Apache Hadoop YARN Project SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 01:37 h [INFO] Finished at: 2015-11-09T20:36:25+00:00 [INFO] Final Memory: 81M/690M [INFO] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.17:test (default-test) on project hadoop-yarn-applications-distributedshell: There are test failures. [ERROR] [ERROR] Please refer to /home/jenkins/jenkins-slave/workspace/Hadoop-Yarn-trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/surefire-reports for the individual test results. [ERROR] -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. 
[ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn -rf :hadoop-yarn-applications-distributedshell Build step 'Execute shell' marked build as failure Archiving artifacts Recording test results Updating HDFS-9234 Sending e-mails to: yarn-...@hadoop.apache.org Email was triggered for: Failure - Any Sending email for trigger: Failure - Any
[jira] [Updated] (YARN-4380) TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails intermittently on branch-2.8
[ https://issues.apache.org/jira/browse/YARN-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4380: - Attachment: org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService-output.txt [~varun_saxena] attaching a log when the test fails. I use this simple script to reproduce some intermittent failures https://github.com/oza/failchecker > TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails > intermittently on branch-2.8 > -- > > Key: YARN-4380 > URL: https://issues.apache.org/jira/browse/YARN-4380 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 >Reporter: Tsuyoshi Ozawa >Assignee: Varun Saxena > Attachments: > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService-output.txt > > > {quote} > Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 5.361 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService > testDownloadingResourcesOnContainerKill(org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService) > Time elapsed: 0.109 sec <<< FAILURE! > org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: > Argument(s) are different! Wanted: > deletionService.delete( > "user0", > null, > > ); > -> at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1322) > Actual invocation has different arguments: > deletionService.delete( > "user0", > > /home/ubuntu/hadoop-dev/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService/0/usercache/user0/appcache/application_314159265358979_0003/container_314159265358979_0003_01_42 > ); > -> at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1296) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1322) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
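A rough sketch of the re-run-until-failure idea behind a checker like the one linked above, assuming a Maven/surefire build; the class name and the exact mvn invocation are illustrative assumptions, not taken from the linked script:
{code}
import java.io.IOException;

// Sketch of a "re-run until failure" checker for flaky tests; the
// actual failchecker linked above may differ from this rendering.
public class FailChecker {
  public static void main(String[] args)
      throws IOException, InterruptedException {
    String test = args.length > 0 ? args[0]
        : "TestResourceLocalizationService";
    for (int run = 1; ; run++) {
      // Run the single test; the surefire reports of the failing run
      // stay on disk for inspection.
      Process mvn = new ProcessBuilder("mvn", "-Dtest=" + test, "test")
          .inheritIO().start();
      if (mvn.waitFor() != 0) {
        System.err.println(test + " failed on run " + run);
        break;
      }
    }
  }
}
{code}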
[jira] [Commented] (YARN-4385) TestDistributedShell times out
[ https://issues.apache.org/jira/browse/YARN-4385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022660#comment-15022660 ] Tsuyoshi Ozawa commented on YARN-4385: -- >From https://builds.apache.org/job/Hadoop-Yarn-trunk/1380/ {quote} ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 11262 lines...] TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShell.setup:72->setupInternal:94 » YarnRuntime java.io.IOExcept... TestDistributedShellWithNodeLabels.setup:47 » YarnRuntime java.io.IOException:... Tests run: 14, Failures: 0, Errors: 12, Skipped: 0 [INFO] [INFO] Reactor Summary: [INFO] [INFO] Apache Hadoop YARN SUCCESS [ 4.803 s] [INFO] Apache Hadoop YARN API SUCCESS [04:44 min] [INFO] Apache Hadoop YARN Common . SUCCESS [03:31 min] [INFO] Apache Hadoop YARN Server . SUCCESS [ 0.109 s] [INFO] Apache Hadoop YARN Server Common .. SUCCESS [ 57.348 s] [INFO] Apache Hadoop YARN NodeManager SUCCESS [10:05 min] [INFO] Apache Hadoop YARN Web Proxy .. SUCCESS [ 29.458 s] [INFO] Apache Hadoop YARN ApplicationHistoryService .. SUCCESS [03:46 min] [INFO] Apache Hadoop YARN ResourceManager SUCCESS [ 01:03 h] [INFO] Apache Hadoop YARN Server Tests ... SUCCESS [01:52 min] [INFO] Apache Hadoop YARN Client . SUCCESS [07:21 min] [INFO] Apache Hadoop YARN SharedCacheManager . SUCCESS [ 32.136 s] [INFO] Apache Hadoop YARN Applications ... SUCCESS [ 0.053 s] [INFO] Apache Hadoop YARN DistributedShell ... FAILURE [ 29.403 s] [INFO] Apache Hadoop YARN Unmanaged Am Launcher .. SKIPPED [INFO] Apache Hadoop YARN Site ... SKIPPED [INFO] Apache Hadoop YARN Registry ... SKIPPED [INFO] Apache Hadoop YARN Project SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 01:37 h [INFO] Finished at: 2015-11-09T20:36:25+00:00 [INFO] Final Memory: 81M/690M [INFO] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.17:test (default-test) on project hadoop-yarn-applications-distributedshell: There are test failures. [ERROR] [ERROR] Please refer to /home/jenkins/jenkins-slave/workspace/Hadoop-Yarn-trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/surefire-reports for the individual test results. [ERROR] -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn -rf :hadoop-yarn-applications-distributedshell Build step 'Execute shell' marked build as failure Archiving artifacts Recording test results Updating HDFS-9234 Sending e-mails to: yarn-...@hadoop.apache.org Email was triggered for: Failure - Any Sending email for trigger: Failure - Any ### ## FAILED TESTS (if any) ## 12 tests failed. 
FAILED: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithInvalidArgs Error Message: java.io.IOException: ResourceManager failed to start. Final state is STOPPED Stack Trace: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: ResourceManager failed to start. Final state is STOPPED at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:331) at
[jira] [Moved] (YARN-4385) TestDistributedShell times out
[ https://issues.apache.org/jira/browse/YARN-4385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa moved HADOOP-12591 to YARN-4385: --- Key: YARN-4385 (was: HADOOP-12591) Project: Hadoop YARN (was: Hadoop Common) > TestDistributedShell times out > -- > > Key: YARN-4385 > URL: https://issues.apache.org/jira/browse/YARN-4385 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Tsuyoshi Ozawa > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4385) TestDistributedShell times out
[ https://issues.apache.org/jira/browse/YARN-4385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4385: - Component/s: test > TestDistributedShell times out > -- > > Key: YARN-4385 > URL: https://issues.apache.org/jira/browse/YARN-4385 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Tsuyoshi Ozawa > Attachments: > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell-output.txt > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4385) TestDistributedShell times out
[ https://issues.apache.org/jira/browse/YARN-4385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4385: - Attachment: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell-output.txt Attaching a log when it fails. > TestDistributedShell times out > -- > > Key: YARN-4385 > URL: https://issues.apache.org/jira/browse/YARN-4385 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Tsuyoshi Ozawa > Attachments: > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell-output.txt > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4380) TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails intermittently on branch-2.8
[ https://issues.apache.org/jira/browse/YARN-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4380: - Attachment: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell-output.2.txt [~varun_saxena], thank you for the fix. The fix itself looks good to me. I got another error, though it happens only rarely: {quote} Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 5.518 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService testFailedDirsResourceRelease(org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService) Time elapsed: 0.093 sec <<< FAILURE! org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: Argument(s) are different! Wanted: eventHandler.handle( ); -> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testFailedDirsResourceRelease(TestResourceLocalizationService.java:2632) Actual invocation has different arguments: eventHandler.handle( EventType: APPLICATION_INITED ); -> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testFailedDirsResourceRelease(TestResourceLocalizationService.java:2632) {quote} Attaching a log for the failure. Could you take a look? > TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails > intermittently on branch-2.8 > -- > > Key: YARN-4380 > URL: https://issues.apache.org/jira/browse/YARN-4380 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0, 2.7.1 >Reporter: Tsuyoshi Ozawa >Assignee: Varun Saxena > Attachments: YARN-4380.01.patch, > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell-output.2.txt, > > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService-output.txt > > > {quote} > Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 5.361 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService > testDownloadingResourcesOnContainerKill(org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService) > Time elapsed: 0.109 sec <<< FAILURE! > org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: > Argument(s) are different!
Wanted: > deletionService.delete( > "user0", > null, > > ); > -> at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1322) > Actual invocation has different arguments: > deletionService.delete( > "user0", > > /home/ubuntu/hadoop-dev/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService/0/usercache/user0/appcache/application_314159265358979_0003/container_314159265358979_0003_01_42 > ); > -> at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1296) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1322) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023870#comment-15023870 ] Tsuyoshi Ozawa commented on YARN-4348: -- {quote} Archiving artifacts [description-setter] Description set: YARN-4348 Recording test results ERROR: Publisher 'Publish JUnit test result report' failed: No test report files were found. Configuration error? Email was triggered for: Failure - Any Sending email for trigger: Failure - Any An attempt to send an e-mail to empty list of recipients, ignored. Finished: FAILURE {quote} Hmm, Jenkins looks to be unhealthy. > ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of > zkSessionTimeout > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348.001.patch, YARN-4348.001.patch, > log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4348: - Attachment: YARN-4348-branch-2.7.003.patch > ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of > zkSessionTimeout > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348.001.patch, YARN-4348.001.patch, > log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4380) TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails intermittently on branch-2.8
[ https://issues.apache.org/jira/browse/YARN-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4380: - Summary: TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails intermittently on branch-2.8 (was: TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails on branch-2.8) > TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails > intermittently on branch-2.8 > -- > > Key: YARN-4380 > URL: https://issues.apache.org/jira/browse/YARN-4380 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 >Reporter: Tsuyoshi Ozawa > > {quote} > Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 5.361 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService > testDownloadingResourcesOnContainerKill(org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService) > Time elapsed: 0.109 sec <<< FAILURE! > org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: > Argument(s) are different! Wanted: > deletionService.delete( > "user0", > null, > > ); > -> at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1322) > Actual invocation has different arguments: > deletionService.delete( > "user0", > > /home/ubuntu/hadoop-dev/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService/0/usercache/user0/appcache/application_314159265358979_0003/container_314159265358979_0003_01_42 > ); > -> at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1296) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1322) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4380) TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails on branch-2.8
Tsuyoshi Ozawa created YARN-4380: Summary: TestResourceLocalizationService.testDownloadingResourcesOnContainerKill fails on branch-2.8 Key: YARN-4380 URL: https://issues.apache.org/jira/browse/YARN-4380 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.8.0 Reporter: Tsuyoshi Ozawa {quote} Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 5.361 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService testDownloadingResourcesOnContainerKill(org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService) Time elapsed: 0.109 sec <<< FAILURE! org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: Argument(s) are different! Wanted: deletionService.delete( "user0", null, ); -> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1322) Actual invocation has different arguments: deletionService.delete( "user0", /home/ubuntu/hadoop-dev/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService/0/usercache/user0/appcache/application_314159265358979_0003/container_314159265358979_0003_01_42 ); -> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1296) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testDownloadingResourcesOnContainerKill(TestResourceLocalizationService.java:1322) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4348: - Priority: Blocker (was: Major) > ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of > zkSessionTimeout > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348.001.patch, > YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018159#comment-15018159 ] Tsuyoshi Ozawa commented on YARN-4348: -- Found that this is caused by lock ordering: 1. (In the RM's main thread) ZKRMStateStore is locked in startInternal -> the thread waits on lock.await(). 2. (In ZK's event thread) a SyncConnected event arrives from ZK -> ForwardingWatcher#process is called -> processWatchEvent is invoked, but it blocks because ZKRMStateStore has been locked since step 1. 3. (In the RM's main thread) the wait times out with an IOException -> ZKRMStateStore is unlocked -> only then is the sync callback, processEvent, fired. I will attach a patch to address this problem. > ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of > zkSessionTimeout > > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348.001.patch, > YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
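A hypothetical reduction of the three-step sequence above. The real ZKRMStateStore code is more involved, but the core hazard is waiting on a latch inside a synchronized method: {{CountDownLatch.await()}} does not release the object's monitor, so the ZK event thread cannot enter {{processWatchEvent}} until the waiter times out. Names below mirror the description, not the actual source:
{code}
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical reduction of the blocked-until-timeout pattern described
// in the comment above; not the actual ZKRMStateStore implementation.
public class LockOrderingSketch {
  private final CountDownLatch connected = new CountDownLatch(1);

  // Step 1: the RM's main thread holds this object's monitor while it
  // waits; CountDownLatch.await() does not release the monitor.
  public synchronized void startInternal(long timeoutMs)
      throws InterruptedException, IOException {
    if (!connected.await(timeoutMs, TimeUnit.MILLISECONDS)) {
      // Step 3: the wait times out and throws; only on return is the
      // monitor released, letting the watcher thread finally run.
      throw new IOException("timed out waiting for SyncConnected");
    }
  }

  // Step 2: ZK's event thread blocks here because the monitor is held
  // by startInternal() above.
  public synchronized void processWatchEvent() {
    connected.countDown();
  }
}
{code}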
[jira] [Created] (YARN-4371) "yarn application -kill" should take multiple application ids
Tsuyoshi Ozawa created YARN-4371: Summary: "yarn application -kill" should take multiple application ids Key: YARN-4371 URL: https://issues.apache.org/jira/browse/YARN-4371 Project: Hadoop YARN Issue Type: Improvement Reporter: Tsuyoshi Ozawa Currently we cannot pass multiple applications to "yarn application -kill" command. The command should take multiple application ids at the same time. I think it's straightforward to pass comma-separated ids if we can guarantee application ids don't contain any commas. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4371) "yarn application -kill" should take multiple application ids
[ https://issues.apache.org/jira/browse/YARN-4371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4371: - Description: Currently we cannot pass multiple applications to "yarn application -kill" command. The command should take multiple application ids at the same time. Each entries should be separated with white-space like: {code} yarn application -kill application_1234_0001 application_1234_0007 application_1234_0012 {code} was: Currently we cannot pass multiple applications to "yarn application -kill" command. The command should take multiple application ids at the same time. Each entries should be separated with white-space like . > "yarn application -kill" should take multiple application ids > - > > Key: YARN-4371 > URL: https://issues.apache.org/jira/browse/YARN-4371 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Tsuyoshi Ozawa >Assignee: Sunil G > > Currently we cannot pass multiple applications to "yarn application -kill" > command. The command should take multiple application ids at the same time. > Each entries should be separated with white-space like: > {code} > yarn application -kill application_1234_0001 application_1234_0007 > application_1234_0012 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4371) "yarn application -kill" should take multiple application ids
[ https://issues.apache.org/jira/browse/YARN-4371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4371: - Description: Currently we cannot pass multiple applications to "yarn application -kill" command. The command should take multiple application ids at the same time. Each entries should be separated with white-space like . was: Currently we cannot pass multiple applications to "yarn application -kill" command. The command should take multiple application ids at the same time. I think it's straightforward to pass comma-separated ids if we can guarantee application ids don't contain any commas. > "yarn application -kill" should take multiple application ids > - > > Key: YARN-4371 > URL: https://issues.apache.org/jira/browse/YARN-4371 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Tsuyoshi Ozawa >Assignee: Sunil G > > Currently we cannot pass multiple applications to "yarn application -kill" > command. The command should take multiple application ids at the same time. > Each entries should be separated with white-space like . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4371) "yarn application -kill" should take multiple application ids
[ https://issues.apache.org/jira/browse/YARN-4371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4371: - Description: Currently we cannot pass multiple applications to "yarn application -kill" command. The command should take multiple application ids at the same time. Each entries should be separated with whitespace like: {code} yarn application -kill application_1234_0001 application_1234_0007 application_1234_0012 {code} was: Currently we cannot pass multiple applications to "yarn application -kill" command. The command should take multiple application ids at the same time. Each entries should be separated with white-space like: {code} yarn application -kill application_1234_0001 application_1234_0007 application_1234_0012 {code} > "yarn application -kill" should take multiple application ids > - > > Key: YARN-4371 > URL: https://issues.apache.org/jira/browse/YARN-4371 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Tsuyoshi Ozawa >Assignee: Sunil G > > Currently we cannot pass multiple applications to "yarn application -kill" > command. The command should take multiple application ids at the same time. > Each entries should be separated with whitespace like: > {code} > yarn application -kill application_1234_0001 application_1234_0007 > application_1234_0012 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)