[jira] [Commented] (YARN-10854) Support marking inactive node as untracked without configured include path
[ https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384386#comment-17384386 ] Kuhu Shukla commented on YARN-10854: Proposal seems good, but since I have been away from the land of YARN for a while, could [~brahma], [~templedf], or others chime in on the idea as well? I would love to review the code for this change. > Support marking inactive node as untracked without configured include path > -- > > Key: YARN-10854 > URL: https://issues.apache.org/jira/browse/YARN-10854 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-10854.001.patch > > > Currently, inactive nodes which have been decommissioned/shutdown/lost for a > while (an expiration time defined via > {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by > default) and which exist in neither the include nor the exclude file can be > marked as untracked nodes and removed from RM state (YARN-4311). This is very > useful when auto-scaling is enabled in an elastic cloud environment, since it > avoids an unlimited increase of inactive nodes (mostly decommissioned nodes). > But this only works when the include path is configured, which does not match > most of our cloud environments: they run without a configured white list of > nodes so that node auto-scaling can be controlled easily without further > security requirements. > So I propose to support marking inactive nodes as untracked without a > configured include path; to stay compatible with former versions, we can add > a switch config for this. > Any thoughts/suggestions/feedback are welcome! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
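[Editor's sketch] The untracked-node decision described in the YARN-10854 proposal can be illustrated as follows. The config key `yarn.resourcemanager.node-removal-untracked.timeout-ms` is real, but the class, method, and parameter names below are hypothetical, not the actual ResourceManager code; this is a minimal sketch of the proposed semantics, assuming the switch simply gates the no-include-path case.

```java
import java.util.Set;

public class UntrackedNodeCheck {
    // Mirrors yarn.resourcemanager.node-removal-untracked.timeout-ms (60s default);
    // a node must also have been inactive this long before it is removed.
    static final long UNTRACKED_TIMEOUT_MS = 60_000L;

    /**
     * Existing YARN-4311 behavior: a node is untracked only when an include
     * list is configured and the node appears in neither the include nor the
     * exclude list. The YARN-10854 proposal adds a switch so that clusters
     * with no configured include path can also mark such nodes untracked.
     */
    static boolean isUntracked(String host, Set<String> includeList,
            Set<String> excludeList, boolean includePathConfigured,
            boolean allowUntrackedWithoutIncludePath) {
        if (includeList.contains(host) || excludeList.contains(host)) {
            return false;               // still tracked by a host list
        }
        if (includePathConfigured) {
            return true;                // existing behavior
        }
        return allowUntrackedWithoutIncludePath;  // proposed switch
    }
}
```

With the switch off, behavior is unchanged from earlier versions, which is the backward-compatibility point the proposal makes.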
[jira] [Commented] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901428#comment-16901428 ] Kuhu Shukla commented on YARN-6315: --- Thanks for the ping Eric and sorry about the delay on this. This is not a trivial change when it comes to archives and directories and I would have difficulty making time for this patch rework. I apologize and please feel free to reassign and use the existing patch if it is any good. :( > Improve LocalResourcesTrackerImpl#isResourcePresent to return false for > corrupted files > --- > > Key: YARN-6315 > URL: https://issues.apache.org/jira/browse/YARN-6315 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.3, 2.8.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-6315.001.patch, YARN-6315.002.patch, > YARN-6315.003.patch, YARN-6315.004.patch, YARN-6315.005.patch, > YARN-6315.006.patch > > > We currently check if a resource is present by making sure that the file > exists locally. There can be a case where the LocalizationTracker thinks that > it has the resource if the file exists but with size 0 or less than the > "expected" size of the LocalResource. This JIRA tracks the change to harden > the isResourcePresent call to address that case. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
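[Editor's sketch] The corruption case YARN-6315 describes, a localized file that exists but is zero-length or smaller than the expected LocalResource size, reduces to a size comparison. The sketch below models only that pure check (in the real tracker the inputs would come from `File#exists` and `File#length`, and, as the comment notes, archives and directories need more than this); the class and method names are illustrative, not the actual Hadoop API.

```java
public class ResourcePresenceCheck {
    /**
     * Hardened presence check: a localized resource counts as present only
     * if the file exists AND its on-disk length matches the expected
     * LocalResource size, so a 0-byte or truncated file is treated as
     * absent and can be re-localized instead of being served corrupted.
     */
    static boolean isResourcePresent(boolean exists, long actualLength,
            long expectedLength) {
        return exists && actualLength == expectedLength;
    }
}
```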
[jira] [Commented] (YARN-9202) RM does not track nodes that are in the include list and never register
[ https://issues.apache.org/jira/browse/YARN-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16869632#comment-16869632 ] Kuhu Shukla commented on YARN-9202: --- This attempts to add the logic to refresh nodes, but there are test failures around it that beg further investigation of scenarios where we get node heartbeats after a refresh that has taken those nodes out of the include list. > RM does not track nodes that are in the include list and never register > --- > > Key: YARN-9202 > URL: https://issues.apache.org/jira/browse/YARN-9202 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.2, 3.0.3, 2.8.5 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-9202.001.patch, YARN-9202.002.patch, > YARN-9202.003.patch > > > The RM state machine decides to put new or running nodes in inactive state > only past the point of either registration or being in the exclude list. This > does not cover the case where a node is in the include list but never > registers, and since all state changes are based on these NodeState > transitions, having NEW nodes be listed as inactive first may help. This > would change the semantics of how inactiveNodes are looked at today. Another > state addition might help this case too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9202) RM does not track nodes that are in the include list and never register
[ https://issues.apache.org/jira/browse/YARN-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-9202: -- Attachment: YARN-9202.003.patch > RM does not track nodes that are in the include list and never register > --- > > Key: YARN-9202 > URL: https://issues.apache.org/jira/browse/YARN-9202 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.2, 3.0.3, 2.8.5 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-9202.001.patch, YARN-9202.002.patch, > YARN-9202.003.patch > > > The RM state machine decides to put new or running nodes in inactive state > only past the point of either registration or being in the exclude list. This > does not cover the case where a node is in the include list but never > registers, and since all state changes are based on these NodeState > transitions, having NEW nodes be listed as inactive first may help. This > would change the semantics of how inactiveNodes are looked at today. Another > state addition might help this case too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9202) RM does not track nodes that are in the include list and never register
[ https://issues.apache.org/jira/browse/YARN-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861171#comment-16861171 ] Kuhu Shukla commented on YARN-9202: --- I am unable to reproduce this case locally but am investigating some more. AFAICT, so far, it seems unrelated. > RM does not track nodes that are in the include list and never register > --- > > Key: YARN-9202 > URL: https://issues.apache.org/jira/browse/YARN-9202 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.2, 3.0.3, 2.8.5 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-9202.001.patch, YARN-9202.002.patch > > > The RM state machine decides to put new or running nodes in inactive state > only past the point of either registration or being in the exclude list. This > does not cover the case where a node is in the include list but never > registers, and since all state changes are based on these NodeState > transitions, having NEW nodes be listed as inactive first may help. This > would change the semantics of how inactiveNodes are looked at today. Another > state addition might help this case too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9202) RM does not track nodes that are in the include list and never register
[ https://issues.apache.org/jira/browse/YARN-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861170#comment-16861170 ] Kuhu Shukla commented on YARN-9202: --- [~Jim_Brennan], the nodes from the inactive list (with port = -1) are thrown away once the actual NM registration comes through and creates the new RMNode object. Since that is the case for any new node trying to register, we do not need the shutdown-to-running transition, since the RMNode object that is in the shutdown state is never really used, so to speak. > RM does not track nodes that are in the include list and never register > --- > > Key: YARN-9202 > URL: https://issues.apache.org/jira/browse/YARN-9202 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.2, 3.0.3, 2.8.5 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-9202.001.patch, YARN-9202.002.patch > > > The RM state machine decides to put new or running nodes in inactive state > only past the point of either registration or being in the exclude list. This > does not cover the case where a node is in the include list but never > registers, and since all state changes are based on these NodeState > transitions, having NEW nodes be listed as inactive first may help. This > would change the semantics of how inactiveNodes are looked at today. Another > state addition might help this case too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9202) RM does not track nodes that are in the include list and never register
[ https://issues.apache.org/jira/browse/YARN-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-9202: -- Attachment: YARN-9202.002.patch > RM does not track nodes that are in the include list and never register > --- > > Key: YARN-9202 > URL: https://issues.apache.org/jira/browse/YARN-9202 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.2, 3.0.3, 2.8.5 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-9202.001.patch, YARN-9202.002.patch > > > The RM state machine decides to put new or running nodes in inactive state > only past the point of either registration or being in the exclude list. This > does not cover the case where a node is in the include list but never > registers, and since all state changes are based on these NodeState > transitions, having NEW nodes be listed as inactive first may help. This > would change the semantics of how inactiveNodes are looked at today. Another > state addition might help this case too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9202) RM does not track nodes that are in the include list and never register
[ https://issues.apache.org/jira/browse/YARN-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16819467#comment-16819467 ] Kuhu Shukla commented on YARN-9202: --- I do not think we can get away with creating new RMNodeImpl objects, since anything that has not registered may not have valid values for cmPort, NmVersion, and other fields that are populated through the constructor only upon registration. Even for the case where we could just have the REST APIs return nodes in NEW state, the issue is that none of the lists the webservice has access to contain nodes in NEW state. [~eepayne], I would appreciate thoughts on how to move forward on this given this inherent design of RMNodeImpl. I could expose some fields and add setters to get over this issue, but I am not sure that is the right way to proceed. > RM does not track nodes that are in the include list and never register > --- > > Key: YARN-9202 > URL: https://issues.apache.org/jira/browse/YARN-9202 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.2, 3.0.3, 2.8.5 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-9202.001.patch > > > The RM state machine decides to put new or running nodes in inactive state > only past the point of either registration or being in the exclude list. This > does not cover the case where a node is in the include list but never > registers, and since all state changes are based on these NodeState > transitions, having NEW nodes be listed as inactive first may help. This > would change the semantics of how inactiveNodes are looked at today. Another > state addition might help this case too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9202) RM does not track nodes that are in the include list and never register
[ https://issues.apache.org/jira/browse/YARN-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802025#comment-16802025 ] Kuhu Shukla commented on YARN-9202: --- On second thought: {quote} bq. can they just be reused and moved to the NEW or RUNNING state when the host registers? in the next patch. {quote} This is trickier, as the old node will have its nodeId set to unknownNodeID, and updating that needs changes to RMNodeImpl fields, some of which are private and final. Also, when I tried making this change there were several typecasts to RMNodeImpl, which makes this less than ideal. I will explore the two other possibilities to see which one makes more sense. > RM does not track nodes that are in the include list and never register > --- > > Key: YARN-9202 > URL: https://issues.apache.org/jira/browse/YARN-9202 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.2, 3.0.3, 2.8.5 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-9202.001.patch > > > The RM state machine decides to put new or running nodes in inactive state > only past the point of either registration or being in the exclude list. This > does not cover the case where a node is in the include list but never > registers, and since all state changes are based on these NodeState > transitions, having NEW nodes be listed as inactive first may help. This > would change the semantics of how inactiveNodes are looked at today. Another > state addition might help this case too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-9206: -- Attachment: YARN-9206-branch-2.8.001.patch > RMServerUtils does not count SHUTDOWN as an accepted state > -- > > Key: YARN-9206 > URL: https://issues.apache.org/jira/browse/YARN-9206 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.3 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9206-branch-2.8.001.patch, > YARN-9206-branch-3.1.001.patch, YARN-9206.001.patch, YARN-9206.002.patch, > YARN-9206.003.patch, YARN-9206.004.patch > > > {code} > if (acceptedStates.contains(NodeState.DECOMMISSIONED) || > acceptedStates.contains(NodeState.LOST) || > acceptedStates.contains(NodeState.REBOOTED)) { > for (RMNode rmNode : context.getInactiveRMNodes().values()) { > if ((rmNode != null) && acceptedStates.contains(rmNode.getState())) { > results.add(rmNode); > } > } > } > return results; > } > {code} > This should include SHUTDOWN state as they are inactive too. This method is > used for node reports and such so might be useful to account for them as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-9206: -- Attachment: YARN-9206-branch-3.1.001.patch > RMServerUtils does not count SHUTDOWN as an accepted state > -- > > Key: YARN-9206 > URL: https://issues.apache.org/jira/browse/YARN-9206 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.3 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9206-branch-3.1.001.patch, YARN-9206.001.patch, > YARN-9206.002.patch, YARN-9206.003.patch, YARN-9206.004.patch > > > {code} > if (acceptedStates.contains(NodeState.DECOMMISSIONED) || > acceptedStates.contains(NodeState.LOST) || > acceptedStates.contains(NodeState.REBOOTED)) { > for (RMNode rmNode : context.getInactiveRMNodes().values()) { > if ((rmNode != null) && acceptedStates.contains(rmNode.getState())) { > results.add(rmNode); > } > } > } > return results; > } > {code} > This should include SHUTDOWN state as they are inactive too. This method is > used for node reports and such so might be useful to account for them as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761251#comment-16761251 ] Kuhu Shukla commented on YARN-9206: --- Thank you [~sunilg] for the commit and the review. Thank you [~Jim_Brennan] for the reviews. The 2.8 version has no Test file in the repo and I didn't find one that was relevant so I skipped the test. Hope that is ok, else if needed I can add one. > RMServerUtils does not count SHUTDOWN as an accepted state > -- > > Key: YARN-9206 > URL: https://issues.apache.org/jira/browse/YARN-9206 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.3 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9206-branch-2.8.001.patch, > YARN-9206-branch-3.1.001.patch, YARN-9206.001.patch, YARN-9206.002.patch, > YARN-9206.003.patch, YARN-9206.004.patch > > > {code} > if (acceptedStates.contains(NodeState.DECOMMISSIONED) || > acceptedStates.contains(NodeState.LOST) || > acceptedStates.contains(NodeState.REBOOTED)) { > for (RMNode rmNode : context.getInactiveRMNodes().values()) { > if ((rmNode != null) && acceptedStates.contains(rmNode.getState())) { > results.add(rmNode); > } > } > } > return results; > } > {code} > This should include SHUTDOWN state as they are inactive too. This method is > used for node reports and such so might be useful to account for them as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-9206: -- Attachment: (was: YARN-9206-branch-3.1.001.patch) > RMServerUtils does not count SHUTDOWN as an accepted state > -- > > Key: YARN-9206 > URL: https://issues.apache.org/jira/browse/YARN-9206 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.3 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9206-branch-3.1.001.patch, YARN-9206.001.patch, > YARN-9206.002.patch, YARN-9206.003.patch, YARN-9206.004.patch > > > {code} > if (acceptedStates.contains(NodeState.DECOMMISSIONED) || > acceptedStates.contains(NodeState.LOST) || > acceptedStates.contains(NodeState.REBOOTED)) { > for (RMNode rmNode : context.getInactiveRMNodes().values()) { > if ((rmNode != null) && acceptedStates.contains(rmNode.getState())) { > results.add(rmNode); > } > } > } > return results; > } > {code} > This should include SHUTDOWN state as they are inactive too. This method is > used for node reports and such so might be useful to account for them as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-9206: -- Attachment: YARN-9206-branch-3.1.001.patch > RMServerUtils does not count SHUTDOWN as an accepted state > -- > > Key: YARN-9206 > URL: https://issues.apache.org/jira/browse/YARN-9206 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.3 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9206-branch-3.1.001.patch, YARN-9206.001.patch, > YARN-9206.002.patch, YARN-9206.003.patch, YARN-9206.004.patch > > > {code} > if (acceptedStates.contains(NodeState.DECOMMISSIONED) || > acceptedStates.contains(NodeState.LOST) || > acceptedStates.contains(NodeState.REBOOTED)) { > for (RMNode rmNode : context.getInactiveRMNodes().values()) { > if ((rmNode != null) && acceptedStates.contains(rmNode.getState())) { > results.add(rmNode); > } > } > } > return results; > } > {code} > This should include SHUTDOWN state as they are inactive too. This method is > used for node reports and such so might be useful to account for them as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-9206: -- Attachment: YARN-9206.004.patch > RMServerUtils does not count SHUTDOWN as an accepted state > -- > > Key: YARN-9206 > URL: https://issues.apache.org/jira/browse/YARN-9206 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.3 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-9206.001.patch, YARN-9206.002.patch, > YARN-9206.003.patch, YARN-9206.004.patch > > > {code} > if (acceptedStates.contains(NodeState.DECOMMISSIONED) || > acceptedStates.contains(NodeState.LOST) || > acceptedStates.contains(NodeState.REBOOTED)) { > for (RMNode rmNode : context.getInactiveRMNodes().values()) { > if ((rmNode != null) && acceptedStates.contains(rmNode.getState())) { > results.add(rmNode); > } > } > } > return results; > } > {code} > This should include SHUTDOWN state as they are inactive too. This method is > used for node reports and such so might be useful to account for them as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756075#comment-16756075 ] Kuhu Shukla commented on YARN-9206: --- Thank you [~sunilg], will update patch shortly. > RMServerUtils does not count SHUTDOWN as an accepted state > -- > > Key: YARN-9206 > URL: https://issues.apache.org/jira/browse/YARN-9206 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.3 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-9206.001.patch, YARN-9206.002.patch, > YARN-9206.003.patch > > > {code} > if (acceptedStates.contains(NodeState.DECOMMISSIONED) || > acceptedStates.contains(NodeState.LOST) || > acceptedStates.contains(NodeState.REBOOTED)) { > for (RMNode rmNode : context.getInactiveRMNodes().values()) { > if ((rmNode != null) && acceptedStates.contains(rmNode.getState())) { > results.add(rmNode); > } > } > } > return results; > } > {code} > This should include SHUTDOWN state as they are inactive too. This method is > used for node reports and such so might be useful to account for them as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750094#comment-16750094 ] Kuhu Shukla commented on YARN-9206: --- Spoke to [~Jim_Brennan] offline and he clarified with the following solution. It is certainly better, but it does add some if blocks to the computation, which I was really trying not to do (at the cost of complexity, which is not ideal). [~sunilg], please review the following suggestion by [~Jim_Brennan]; if it looks ok to you, I will revise my patch.
{code:java}
public static List<RMNode> queryRMNodes(RMContext context,
    EnumSet<NodeState> acceptedStates) {
  // nodes contains nodes that are NEW, RUNNING, UNHEALTHY or DECOMMISSIONING.
  boolean has_active = false;
  boolean has_inactive = false;
  ArrayList<RMNode> results = new ArrayList<RMNode>();
  for (NodeState nodeState : acceptedStates) {
    if (!has_inactive && nodeState.isInactiveState()) {
      has_inactive = true;
    }
    if (!has_active && nodeState.isActiveState()) {
      has_active = true;
    }
    if (has_active && has_inactive) {
      break;
    }
  }
  if (has_inactive) {
    for (RMNode rmNode : context.getInactiveRMNodes().values()) {
      if ((rmNode != null) && acceptedStates.contains(rmNode.getState())) {
        results.add(rmNode);
      }
    }
  }
  if (has_active) {
    for (RMNode rmNode : context.getRMNodes().values()) {
      if (acceptedStates.contains(rmNode.getState())) {
        results.add(rmNode);
      }
    }
  }
  return results;
}
{code}
> RMServerUtils does not count SHUTDOWN as an accepted state > -- > > Key: YARN-9206 > URL: https://issues.apache.org/jira/browse/YARN-9206 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.3 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-9206.001.patch, YARN-9206.002.patch, > YARN-9206.003.patch > > > {code} > if (acceptedStates.contains(NodeState.DECOMMISSIONED) || > acceptedStates.contains(NodeState.LOST) || > acceptedStates.contains(NodeState.REBOOTED)) { > for (RMNode rmNode : context.getInactiveRMNodes().values()) { > if ((rmNode != null) && acceptedStates.contains(rmNode.getState())) { > results.add(rmNode); > } > } > } > return results; > } > {code} > This should include SHUTDOWN state as they are inactive too. This method is > used for node reports and such so might be useful to account for them as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
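[Editor's sketch] The suggestion above hinges on partitioning NodeState into active and inactive groups, via the isActiveState()/isInactiveState() helpers it assumes. The sketch below shows that partition and the scan decision as a self-contained example, with SHUTDOWN counted as inactive (the point of YARN-9206); the NodeState here is a simplified local enum, not the actual org.apache.hadoop.yarn.api.records.NodeState, and the class and method names are illustrative.

```java
import java.util.EnumSet;
import java.util.Set;

public class NodeStateFilter {
    enum NodeState {
        NEW, RUNNING, UNHEALTHY, DECOMMISSIONING,
        DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN;

        // SHUTDOWN joins the inactive group, fixing the omission
        // in the original DECOMMISSIONED/LOST/REBOOTED check.
        boolean isInactive() {
            return this == DECOMMISSIONED || this == LOST
                || this == REBOOTED || this == SHUTDOWN;
        }

        boolean isActive() {
            return !isInactive();
        }
    }

    /** True if the query must scan the inactive-node map. */
    static boolean needsInactiveScan(Set<NodeState> accepted) {
        for (NodeState s : accepted) {
            if (s.isInactive()) {
                return true;
            }
        }
        return false;
    }

    /** True if the query must scan the active-node map. */
    static boolean needsActiveScan(Set<NodeState> accepted) {
        for (NodeState s : accepted) {
            if (s.isActive()) {
                return true;
            }
        }
        return false;
    }
}
```

Keeping the membership test inside the enum means a future state addition only touches one place, instead of every call site that enumerates inactive states by hand.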
[jira] [Commented] (YARN-9018) Add functionality to AuxiliaryLocalPathHandler to return all locations to read for a given path
[ https://issues.apache.org/jira/browse/YARN-9018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750058#comment-16750058 ] Kuhu Shukla commented on YARN-9018: --- [~eepayne], could you help review this patch? Thanks a lot! > Add functionality to AuxiliaryLocalPathHandler to return all locations to > read for a given path > --- > > Key: YARN-9018 > URL: https://issues.apache.org/jira/browse/YARN-9018 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.0.3, 2.8.5 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-9018.001.patch > > > Analogous to LocalDirAllocator#getAllLocalPathsToRead, this will allow aux > services(and other components) to use this function that they rely on when > using the former class objects. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9202) RM does not track nodes that are in the include list and never register
[ https://issues.apache.org/jira/browse/YARN-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749362#comment-16749362 ] Kuhu Shukla commented on YARN-9202: --- [~eepayne], Thank you so much for the review! {quote} I do not see the nodes from the include list in the UI shutdown list. {quote} I tested it on my pseudo-distributed cluster and it works per our offline discussion. I will address other comments, especially around {quote}can they just be reused and moved to the NEW or RUNNING state when the host registers? {quote} in the next patch. > RM does not track nodes that are in the include list and never register > --- > > Key: YARN-9202 > URL: https://issues.apache.org/jira/browse/YARN-9202 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.2, 3.0.3, 2.8.5 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-9202.001.patch > > > The RM state machine decides to put new or running nodes in inactive state > only past the point of either registration or being in the exclude list. This > does not cover the case where a node is in the include list but never > registers, and since all state changes are based on these NodeState > transitions, having NEW nodes be listed as inactive first may help. This > would change the semantics of how inactiveNodes are looked at today. Another > state addition might help this case too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749361#comment-16749361 ] Kuhu Shukla commented on YARN-9206: --- Thank you [~sunilg] for the review. I was not super happy with it either. Can you comment on v2 and v1 patch and see if that makes more sense? Thanks again! > RMServerUtils does not count SHUTDOWN as an accepted state > -- > > Key: YARN-9206 > URL: https://issues.apache.org/jira/browse/YARN-9206 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.3 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-9206.001.patch, YARN-9206.002.patch, > YARN-9206.003.patch > > > {code} > if (acceptedStates.contains(NodeState.DECOMMISSIONED) || > acceptedStates.contains(NodeState.LOST) || > acceptedStates.contains(NodeState.REBOOTED)) { > for (RMNode rmNode : context.getInactiveRMNodes().values()) { > if ((rmNode != null) && acceptedStates.contains(rmNode.getState())) { > results.add(rmNode); > } > } > } > return results; > } > {code} > This should include SHUTDOWN state as they are inactive too. This method is > used for node reports and such so might be useful to account for them as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749333#comment-16749333 ] Kuhu Shukla edited comment on YARN-9206 at 1/23/19 12:37 AM: - Complexity is a bit worse but I tried not to add another boolean. Appreciate corrections, comments.
[jira] [Commented] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749333#comment-16749333 ] Kuhu Shukla commented on YARN-9206: --- Complexity is a bit worse but I tried not to add another boolean. Appreciate corrections, comments.
[jira] [Updated] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-9206: -- Attachment: YARN-9206.003.patch
[jira] [Commented] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749140#comment-16749140 ] Kuhu Shukla commented on YARN-9206: --- I see! Will update patch shortly.
[jira] [Commented] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749021#comment-16749021 ] Kuhu Shukla commented on YARN-9206: --- Thank you [~Jim_Brennan] for the review! bq. I think you need to iterate the acceptedStates() and call isInactiveState() on each one to determine if it contains one. Yes, this was something I was trying to avoid, as EnumSet.contains (at least in my understanding) is faster than iterating over the elements of the enum set. bq. if there should also be a NodeState.isActiveState that can be used in the same way for the first part of QueryRMNodes(). Agreed, that would be a good addition.
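The helper pair being discussed could look roughly like the following. This is a hypothetical sketch only, assuming names and a state list that stand in for YARN's actual NodeState, not the YARN-9206 patch itself; it shows why a precomputed EnumSet makes the membership test cheap (a constant-time bitmask check rather than a loop over accepted states):

```java
import java.util.EnumSet;

public class NodeStateHelper {

  // Illustrative stand-in for org.apache.hadoop.yarn.api.records.NodeState;
  // the values and helper names are assumptions for this sketch.
  enum NodeState {
    NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN;

    // Precomputed once: EnumSet.contains is a constant-time bitmask test,
    // which is the reason given above for preferring it over iterating
    // the accepted states one by one.
    private static final EnumSet<NodeState> INACTIVE =
        EnumSet.of(DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN);

    public boolean isInactiveState() {
      return INACTIVE.contains(this);
    }

    public boolean isActiveState() {
      return !INACTIVE.contains(this);
    }
  }

  public static void main(String[] args) {
    System.out.println(NodeState.SHUTDOWN.isInactiveState()); // true
    System.out.println(NodeState.RUNNING.isActiveState());    // true
  }
}
```

Callers can then ask a single state whether it is inactive without scanning any collection, which keeps both halves of a queryRMNodes-style method symmetric.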
[jira] [Commented] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746600#comment-16746600 ] Kuhu Shukla commented on YARN-9206: --- Thank you for the comments [~leftnoteasy], I guess the v2 patch is what you were looking for? I would have preferred the param to the new method not to be an Enum, but it made more sense than iterating over the acceptedStates. This patch includes a test for inactive node states.
[jira] [Updated] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-9206: -- Attachment: YARN-9206.002.patch
[jira] [Updated] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-9206: -- Attachment: YARN-9206.001.patch
[jira] [Commented] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745595#comment-16745595 ] Kuhu Shukla commented on YARN-9206: --- Patch needs a test still, but just to get things going.
[jira] [Created] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
Kuhu Shukla created YARN-9206: - Summary: RMServerUtils does not count SHUTDOWN as an accepted state Key: YARN-9206 URL: https://issues.apache.org/jira/browse/YARN-9206 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.3 Reporter: Kuhu Shukla Assignee: Kuhu Shukla {code} if (acceptedStates.contains(NodeState.DECOMMISSIONED) || acceptedStates.contains(NodeState.LOST) || acceptedStates.contains(NodeState.REBOOTED)) { for (RMNode rmNode : context.getInactiveRMNodes().values()) { if ((rmNode != null) && acceptedStates.contains(rmNode.getState())) { results.add(rmNode); } } } return results; } {code} This should include the SHUTDOWN state, as such nodes are inactive too. This method is used for node reports and such, so it might be useful to account for them as well.
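To illustrate the fix being proposed, here is a minimal, self-contained sketch. The enum, map, and method below are stand-ins for YARN's NodeState, the RM context, and the queryRMNodes logic, not the actual patch; the key change is that SHUTDOWN is part of the precomputed inactive-state set, so a query for SHUTDOWN nodes now reaches the inactive node map:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InactiveNodeFilter {

  // Stand-in for org.apache.hadoop.yarn.api.records.NodeState.
  enum NodeState { NEW, RUNNING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN }

  // Inactive states the query should consider; the proposed fix is the
  // addition of SHUTDOWN to this set.
  static final EnumSet<NodeState> INACTIVE_STATES = EnumSet.of(
      NodeState.DECOMMISSIONED, NodeState.LOST,
      NodeState.REBOOTED, NodeState.SHUTDOWN);

  // Mirrors the shape of the snippet above: scan the inactive node map
  // only when at least one requested state is an inactive one.
  static List<String> queryInactive(Map<String, NodeState> inactiveNodes,
      EnumSet<NodeState> acceptedStates) {
    List<String> results = new ArrayList<>();
    if (!Collections.disjoint(acceptedStates, INACTIVE_STATES)) {
      for (Map.Entry<String, NodeState> e : inactiveNodes.entrySet()) {
        if (e.getValue() != null && acceptedStates.contains(e.getValue())) {
          results.add(e.getKey());
        }
      }
    }
    return results;
  }

  public static void main(String[] args) {
    Map<String, NodeState> inactive = new LinkedHashMap<>();
    inactive.put("host1", NodeState.SHUTDOWN);
    inactive.put("host2", NodeState.LOST);
    // With the original three-way check, a SHUTDOWN-only query never
    // reached the inactive map at all and returned an empty list.
    System.out.println(queryInactive(inactive, EnumSet.of(NodeState.SHUTDOWN))); // prints [host1]
  }
}
```

Collections.disjoint over two EnumSets keeps the guard a single set intersection instead of one contains() call per hard-coded state.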
[jira] [Commented] (YARN-9202) RM does not track nodes that are in the include list and never register
[ https://issues.apache.org/jira/browse/YARN-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745549#comment-16745549 ] Kuhu Shukla commented on YARN-9202: --- Thank you Jim for the review. Appreciate it. bq. If nodes are in the include list, but never register, what is it that we are missing Currently there is no way to know which nodes should have been a part of the cluster, unless one manually goes and checks the include list. This is different from the NameNode, where nodes that have not registered are still listed as dead or in other categories. bq. Is it just that those nodes are not included in any metrics? More or less, yes; tracking what *should* be there is harder for operations teams. bq. Can the desired result be accomplished by just adding these nodes to the inactive list and leaving them in the NEW state? I did think about that, and since there was no place where NEW nodes were exposed on the UI, I thought maybe moving them to a somewhat terminal state would be nicer. But of course, I like the idea of having NEW nodes in the inactive list as well. I will have to see how much semantic difference it makes in the code, and I will update shortly. bq. testIncludeHostsWithNoRegister() - it's not clear to me why the latter half of the test is needed? Looks like it was copied from the previous test but I don't see why it needs to be repeated in this one? True. I will prune the test in the next version. If keeping the nodes in NEW state is fairly straightforward while they get listed as inactive, the next version will have that change as well.
[jira] [Commented] (YARN-8625) Aggregate Resource Allocation for each job is not present in ATS
[ https://issues.apache.org/jira/browse/YARN-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745285#comment-16745285 ] Kuhu Shukla commented on YARN-8625: --- Thank you for the patch and the report. A minor checkstyle issue needs fixing, but the change seems straightforward. How do you plan to use it on the AHS side, since the new field has not been leveraged yet, I think? Please correct me if I am wrong, [~Prabhu Joseph]. > Aggregate Resource Allocation for each job is not present in ATS > > > Key: YARN-8625 > URL: https://issues.apache.org/jira/browse/YARN-8625 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2 >Affects Versions: 2.7.4 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: 0001-YARN-8625.patch > > > Aggregate Resource Allocation shown on the RM UI for a finished job is a very useful > metric to understand how much resource a job has consumed. But this does not > get stored in ATS.
[jira] [Commented] (YARN-9202) RM does not track nodes that are in the include list and never register
[ https://issues.apache.org/jira/browse/YARN-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745268#comment-16745268 ] Kuhu Shukla commented on YARN-9202: --- The test fails with and without the patch, and YARN-8494 tracks that. [~Jim_Brennan], [~eepayne], [~nroberts]: requesting initial thoughts and comments.
[jira] [Commented] (YARN-6616) YARN AHS shows submitTime for jobs same as startTime
[ https://issues.apache.org/jira/browse/YARN-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745192#comment-16745192 ] Kuhu Shukla commented on YARN-6616: --- Minor comment on {code} @Public @Stable public abstract long getSubmitTime(); @Private @Unstable public abstract void setSubmitTime(long submitTime); {code} Wondering how the getter is considered Stable while the setter is not. I see other methods do the same, but is that intentional for these particular ones or just an artifact from older code? I also wonder how this fix could go into a minor release of 3.2 without breaking compatibility. I am no expert at compatibility, so pinging [~eepayne] and [~haibochen] for help with this. Otherwise the patch looks good to me, and I verified that the test failures are unrelated. We need to check out the mapreduce build failure, however. > YARN AHS shows submitTime for jobs same as startTime > > > Key: YARN-6616 > URL: https://issues.apache.org/jira/browse/YARN-6616 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.3 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Minor > Attachments: 0001-YARN-6616.patch, 0002-YARN-6616.patch, > 0003-YARN-6616.patch > > > YARN AHS returns the startTime value for both submitTime and startTime for > jobs. It looks like the code sets the submitTime with the startTime value. > https://github.com/apache/hadoop/blob/branch-2.7.3/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppInfo.java#L80 > {code} > curl --negotiate -u: > http://prabhuzeppelin3.openstacklocal:8188/ws/v1/applicationhistory/apps > 149501553757414950155375741495016384084 > {code}
[jira] [Commented] (YARN-6616) YARN AHS shows submitTime for jobs same as startTime
[ https://issues.apache.org/jira/browse/YARN-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744435#comment-16744435 ] Kuhu Shukla commented on YARN-6616: --- Taking a look. Thanks for the update!
[jira] [Commented] (YARN-9202) RM does not track nodes that are in the include list and never register
[ https://issues.apache.org/jira/browse/YARN-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744369#comment-16744369 ] Kuhu Shukla commented on YARN-9202: --- Here is an initial patch that tackles this problem by listing new nodes as SHUTDOWN first. This means that nodes can now be shut down and brought back up, making SHUTDOWN a non-terminal state within one life cycle of the RM. Any concerns about this change breaking existing semantics would be good to point out here. I will wait for precommit before formally requesting review, but any ideas on this patch would be awesome!
[jira] [Updated] (YARN-9202) RM does not track nodes that are in the include list and never register
[ https://issues.apache.org/jira/browse/YARN-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-9202: -- Attachment: YARN-9202.001.patch
[jira] [Created] (YARN-9202) RM does not track nodes that are in the include list and never register
Kuhu Shukla created YARN-9202: - Summary: RM does not track nodes that are in the include list and never register Key: YARN-9202 URL: https://issues.apache.org/jira/browse/YARN-9202 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.5, 3.0.3, 2.9.2 Reporter: Kuhu Shukla Assignee: Kuhu Shukla The RM state machine decides to put new or running nodes in inactive state only past the point of either registration or being in the exclude list. This does not cover the case where a node is in the include list but never registers, and since all state changes are based on these NodeState transitions, having NEW nodes be listed as inactive first may help. This would change the semantics of how inactiveNodes are looked at today. Another state addition might help this case too.
[jira] [Commented] (YARN-6616) YARN AHS shows submitTime for jobs same as startTime
[ https://issues.apache.org/jira/browse/YARN-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742520#comment-16742520 ] Kuhu Shukla commented on YARN-6616: --- I am currently reviewing this patch and the proposed change is a good one. My one concern is breaking backward compatibility with the change to the protos and, more importantly, to the newInstance for ApplicationReport, which at least for us is used by upstream and peer projects. Can we add a new constructor/newInstance with submitTime and replace the non-public usages of it, so that the submit time is available for new or modified consumers while keeping the option to use the old (buggy) way if need be?
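The compatibility approach suggested above can be sketched as an overload that delegates. This is illustrative only: the class and method shapes below are assumptions standing in for the real ApplicationReport API, just showing the pattern of keeping the old factory signature working while internal callers migrate to the new one:

```java
public class ReportCompat {

  // Hypothetical stand-in for the report object; only the fields needed
  // for the example.
  public static class Report {
    final long startTime;
    final long submitTime;
    Report(long startTime, long submitTime) {
      this.startTime = startTime;
      this.submitTime = submitTime;
    }
  }

  // Existing factory: old callers keep compiling, and submitTime falls
  // back to startTime, preserving the previous (buggy but expected)
  // behavior for anyone still on the old signature.
  public static Report newInstance(long startTime) {
    return newInstance(startTime, startTime);
  }

  // New overload: non-public-API call sites move to this so the real
  // submit time is surfaced to new or modified consumers.
  public static Report newInstance(long startTime, long submitTime) {
    return new Report(startTime, submitTime);
  }

  public static void main(String[] args) {
    System.out.println(newInstance(100L).submitTime);      // prints 100
    System.out.println(newInstance(100L, 42L).submitTime); // prints 42
  }
}
```

The old single-argument path never has to change its observable behavior, which is what keeps the fix compatible within a minor release.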
[jira] [Updated] (YARN-9018) Add functionality to AuxiliaryLocalPathHandler to return all locations to read for a given path
[ https://issues.apache.org/jira/browse/YARN-9018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-9018: -- Attachment: YARN-9018.001.patch > Add functionality to AuxiliaryLocalPathHandler to return all locations to > read for a given path > --- > > Key: YARN-9018 > URL: https://issues.apache.org/jira/browse/YARN-9018 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.0.3, 2.8.5 >Reporter: Kuhu Shukla >Priority: Major > Attachments: YARN-9018.001.patch > > > Analogous to LocalDirAllocator#getAllLocalPathsToRead, this will allow aux > services (and other components) to use this function that they rely on when > using the former class objects.
[jira] [Assigned] (YARN-9018) Add functionality to AuxiliaryLocalPathHandler to return all locations to read for a given path
[ https://issues.apache.org/jira/browse/YARN-9018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla reassigned YARN-9018: - Assignee: Kuhu Shukla
[jira] [Created] (YARN-9018) Add functionality to AuxiliaryLocalPathHandler to return all locations to read for a given path
Kuhu Shukla created YARN-9018: - Summary: Add functionality to AuxiliaryLocalPathHandler to return all locations to read for a given path Key: YARN-9018 URL: https://issues.apache.org/jira/browse/YARN-9018 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.8.5, 3.0.3 Reporter: Kuhu Shukla Analogous to LocalDirAllocator#getAllLocalPathsToRead, this will allow aux services (and other components) to use this function that they rely on when using the former class objects.
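A rough sketch of what such a method might look like, modeled on the behavior of LocalDirAllocator#getAllLocalPathsToRead: the class name, method name, and error handling below are assumptions for illustration, not the attached patch. The point is that the handler returns every configured local directory that actually contains the requested relative path, rather than just the first match:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class AuxPathHandlerSketch {

  private final List<String> localDirs;

  public AuxPathHandlerSketch(List<String> localDirs) {
    this.localDirs = localDirs;
  }

  // Returns all candidate read locations for relPath across the
  // configured local dirs, so callers (e.g. aux services) can probe
  // every disk that may hold the file.
  public List<Path> getAllLocalPathsForRead(String relPath) throws IOException {
    List<Path> found = new ArrayList<>();
    for (String dir : localDirs) {
      Path candidate = Paths.get(dir, relPath);
      if (Files.exists(candidate)) {
        found.add(candidate);
      }
    }
    if (found.isEmpty()) {
      throw new IOException("Could not find " + relPath + " in any local dir");
    }
    return found;
  }
}
```

A caller that previously used LocalDirAllocator directly could swap in this handler method without changing its iteration logic over the returned paths.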
[jira] [Updated] (YARN-8082) Include LocalizedResource size information in the NM download log for localization
[ https://issues.apache.org/jira/browse/YARN-8082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-8082: -- Attachment: YARN-8082.002.patch > Include LocalizedResource size information in the NM download log for > localization > -- > > Key: YARN-8082 > URL: https://issues.apache.org/jira/browse/YARN-8082 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Minor > Attachments: YARN-8082.001.patch, YARN-8082.002.patch > > > The size of the resource that finished downloading helps with debugging > localization delays and failures. A close approximate local size of the > resource is available in the LocalizedResource object which can be used to > address this minor change. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8082) Include LocalizedResource size information in the NM download log for localization
[ https://issues.apache.org/jira/browse/YARN-8082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-8082: -- Attachment: YARN-8082.001.patch > Include LocalizedResource size information in the NM download log for > localization > -- > > Key: YARN-8082 > URL: https://issues.apache.org/jira/browse/YARN-8082 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Minor > Attachments: YARN-8082.001.patch > > > The size of the resource that finished downloading helps with debugging > localization delays and failures. A close approximate local size of the > resource is available in the LocalizedResource object which can be used to > address this minor change. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8082) Include LocalizedResource size information in the NM download log for localization
Kuhu Shukla created YARN-8082: - Summary: Include LocalizedResource size information in the NM download log for localization Key: YARN-8082 URL: https://issues.apache.org/jira/browse/YARN-8082 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Reporter: Kuhu Shukla Assignee: Kuhu Shukla The size of the resource that finished downloading helps with debugging localization delays and failures. A close approximate local size of the resource is available in the LocalizedResource object which can be used to address this minor change. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
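The change described above amounts to enriching the localization log line with the on-disk size of the finished download. A minimal sketch, with a hypothetical method name and message format (the real patch reads the size from the LocalizedResource object):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: include the local size of a completed download in the NM
// localization log line, so delays and failures can be correlated with
// resource size. The size of the file on disk is a close approximation of
// the localized resource size.
class DownloadLogSketch {
    static String downloadCompletedMessage(String resource, Path localPath)
            throws IOException {
        long size = Files.size(localPath); // approximate localized size
        return "Resource " + resource + " transitioned to LOCALIZED, size=" + size;
    }
}
```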
[jira] [Assigned] (YARN-8054) Improve robustness of the LocalDirsHandlerService MonitoringTimerTask thread
[ https://issues.apache.org/jira/browse/YARN-8054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla reassigned YARN-8054: - Assignee: Jonathan Eagles (was: Jason Lowe) > Improve robustness of the LocalDirsHandlerService MonitoringTimerTask thread > > > Key: YARN-8054 > URL: https://issues.apache.org/jira/browse/YARN-8054 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Fix For: 2.10.0, 2.9.1, 2.8.4, 3.0.2, 3.1.1 > > Attachments: YARN-8054.001.patch, YARN-8054.002.patch > > > The DeprecatedRawLocalFileStatus#loadPermissionInfo can throw a > RuntimeException which can kill the MonitoringTimerTask thread. This can > leave the node in a bad state where all NM local directories are marked "bad" > and there is no automatic recovery. In the case below the error was "too many > open files", but it could be a number of other recoverable states. > {noformat} > 2018-03-18 02:37:42,960 [DiskHealthMonitor-Timer] ERROR > yarn.YarnUncaughtExceptionHandler: Thread > Thread[DiskHealthMonitor-Timer,5,main] threw an Exception.
> java.lang.RuntimeException: Error while running command to get file > permissions : java.io.IOException: Cannot run program "ls": error=24, Too > many open files > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:942) > at org.apache.hadoop.util.Shell.run(Shell.java:898) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289) > at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1078) > at > org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:697) > at > org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:672) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1556) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkAndInitializeLocalDirs(ResourceLocalizationService.java:1521) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$1.onDirsChanged(ResourceLocalizationService.java:271) > at > org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:381) > at > org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:449) > at > org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$500(LocalDirsHandlerService.java:52) > at > org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:166) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > Caused by: java.io.IOException: error=24, Too many open files > at 
java.lang.UNIXProcess.forkAndExec(Native Method) > at java.lang.UNIXProcess.(UNIXProcess.java:247) > at java.lang.ProcessImpl.start(ProcessImpl.java:134) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) > ... 17 more > at > org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:737) > at > org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:672) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1556) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkAndInitializeLocalDirs(ResourceLocalizationService.java:1521) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$1.onDirsChanged(ResourceLocalizationService.java:271) > at > org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:381) > at > org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:449) > at > org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$500(LocalDirsHandlerService.java:52) > at > org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$Moni
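The log-and-continue fix discussed in this issue works because java.util.Timer permanently cancels itself once a TimerTask lets an exception escape run(); catching it inside the task keeps the disk-monitoring schedule alive across transient errors such as "too many open files". A minimal sketch of that pattern, with hypothetical names (the real patch modifies LocalDirsHandlerService$MonitoringTimerTask):

```java
import java.util.TimerTask;

// Sketch: wrap the monitoring body so a RuntimeException from one disk
// check is logged and swallowed instead of killing the Timer thread,
// which would otherwise leave all local dirs marked "bad" with no
// automatic recovery.
class RobustTimerTaskSketch {
    static TimerTask robust(Runnable body) {
        return new TimerTask() {
            @Override
            public void run() {
                try {
                    body.run();
                } catch (Throwable t) {
                    // log.warn in the real service; the timer keeps running
                    System.err.println("Disk check failed, will retry: " + t);
                }
            }
        };
    }
}
```

Without the catch, the first throwing run would be the thread's last; with it, subsequent scheduled runs still fire and a later check can recover the directories.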
[jira] [Commented] (YARN-8054) Improve robustness of the LocalDirsHandlerService MonitoringTimerTask thread
[ https://issues.apache.org/jira/browse/YARN-8054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408088#comment-16408088 ] Kuhu Shukla commented on YARN-8054: --- Since the stack trace is already printed in the NM log, the log.warn seems good. +1 (non-binding). > Improve robustness of the LocalDirsHandlerService MonitoringTimerTask thread > > > Key: YARN-8054 > URL: https://issues.apache.org/jira/browse/YARN-8054 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: YARN-8054.001.patch > > > The DeprecatedRawLocalFileStatus#loadPermissionInfo can throw a > RuntimeException which can kill the MonitoringTimerTask thread. This can > leave the node in a bad state where all NM local directories are marked "bad" > and there is no automatic recovery. In the case below the error was "too many > open files", but it could be a number of other recoverable states. > {noformat} > 2018-03-18 02:37:42,960 [DiskHealthMonitor-Timer] ERROR > yarn.YarnUncaughtExceptionHandler: Thread > Thread[DiskHealthMonitor-Timer,5,main] threw an Exception.
> java.lang.RuntimeException: Error while running command to get file > permissions : java.io.IOException: Cannot run program "ls": error=24, Too > many open files > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:942) > at org.apache.hadoop.util.Shell.run(Shell.java:898) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289) > at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1078) > at > org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:697) > at > org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:672) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1556) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkAndInitializeLocalDirs(ResourceLocalizationService.java:1521) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$1.onDirsChanged(ResourceLocalizationService.java:271) > at > org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:381) > at > org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:449) > at > org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$500(LocalDirsHandlerService.java:52) > at > org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:166) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > Caused by: java.io.IOException: error=24, Too many open files > at 
java.lang.UNIXProcess.forkAndExec(Native Method) > at java.lang.UNIXProcess.(UNIXProcess.java:247) > at java.lang.ProcessImpl.start(ProcessImpl.java:134) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) > ... 17 more > at > org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:737) > at > org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:672) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1556) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkAndInitializeLocalDirs(ResourceLocalizationService.java:1521) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$1.onDirsChanged(ResourceLocalizationService.java:271) > at > org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:381) > at > org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:449) > at > org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$500(LocalDirsHandlerService.java:52) > at > org.apache.hadoop.yarn.server.nodemanager.Local
[jira] [Commented] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289657#comment-16289657 ] Kuhu Shukla commented on YARN-6315: --- Known and unrelated test failures, appreciate any comments on the patch/approach [~jlowe], [~jrottinghuis]. Thanks a lot. > Improve LocalResourcesTrackerImpl#isResourcePresent to return false for > corrupted files > --- > > Key: YARN-6315 > URL: https://issues.apache.org/jira/browse/YARN-6315 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.3, 2.8.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-6315.001.patch, YARN-6315.002.patch, > YARN-6315.003.patch, YARN-6315.004.patch, YARN-6315.005.patch, > YARN-6315.006.patch > > > We currently check if a resource is present by making sure that the file > exists locally. There can be a case where the LocalizationTracker thinks that > it has the resource if the file exists but with size 0 or less than the > "expected" size of the LocalResource. This JIRA tracks the change to harden > the isResourcePresent call to address that case. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-6315: -- Attachment: YARN-6315.006.patch Updated patch that fixes all but one checkstyle issue. The indentation warning seems trivial. Also, the findbugs warning is present with and without the patch. No test failures were reported since the test builds failed with {code} [ERROR] Error occurred in starting fork, check output in log [ERROR] Process Exit Code: 1 [ERROR] ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called? {code} Hopefully this precommit will go through without this error. > Improve LocalResourcesTrackerImpl#isResourcePresent to return false for > corrupted files > --- > > Key: YARN-6315 > URL: https://issues.apache.org/jira/browse/YARN-6315 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.3, 2.8.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-6315.001.patch, YARN-6315.002.patch, > YARN-6315.003.patch, YARN-6315.004.patch, YARN-6315.005.patch, > YARN-6315.006.patch > > > We currently check if a resource is present by making sure that the file > exists locally. There can be a case where the LocalizationTracker thinks that > it has the resource if the file exists but with size 0 or less than the > "expected" size of the LocalResource. This JIRA tracks the change to harden > the isResourcePresent call to address that case. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-6315: -- Attachment: YARN-6315.005.patch Updated patch with a revised approach to keep track of actual size of the file via downloadSize. Changes were also made to YARNRunner and LocalResourceProto for this added field. If download size is not updated and is -1 (it could be changed to a constant to indicate that the value was not set at any point), we ignore the file attribute mismatch. Would appreciate any initial comments/modifications on the approach. Thanks a lot! > Improve LocalResourcesTrackerImpl#isResourcePresent to return false for > corrupted files > --- > > Key: YARN-6315 > URL: https://issues.apache.org/jira/browse/YARN-6315 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.3, 2.8.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-6315.001.patch, YARN-6315.002.patch, > YARN-6315.003.patch, YARN-6315.004.patch, YARN-6315.005.patch > > > We currently check if a resource is present by making sure that the file > exists locally. There can be a case where the LocalizationTracker thinks that > it has the resource if the file exists but with size 0 or less than the > "expected" size of the LocalResource. This JIRA tracks the change to harden > the isResourcePresent call to address that case. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
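The approach described above — track the actual downloaded size, treat -1 as "never set" for compatibility, and fail the presence check on a size mismatch — can be sketched as follows. Names and the exact signature are hypothetical; the real change lives in LocalResourcesTrackerImpl#isResourcePresent.

```java
import java.io.File;

// Sketch of the hardened presence check: a resource counts as present only
// if the file exists AND its on-disk size matches the size recorded at
// download time. The -1 sentinel means the size was never recorded (e.g.
// state from an older version), so the size comparison is skipped.
class ResourcePresenceSketch {
    static final long SIZE_UNKNOWN = -1;

    static boolean isResourcePresent(File local, long expectedSize) {
        if (!local.exists()) {
            return false;               // never downloaded, or cleaned up
        }
        if (expectedSize == SIZE_UNKNOWN) {
            return true;                // no recorded size; keep old behavior
        }
        // a 0-byte or truncated file is treated as corrupted, forcing re-localization
        return local.length() == expectedSize;
    }
}
```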
[jira] [Commented] (YARN-7422) Application History Server URL does not direct to the appropriate UI for failed/killed jobs
[ https://issues.apache.org/jira/browse/YARN-7422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234089#comment-16234089 ] Kuhu Shukla commented on YARN-7422: --- Thank you [~jlowe]. bq. One simple method that would work for Tez and possibly other frameworks would be supporting a history URL during registration in addition to the one already supported at unregistration. That would solve most cases and I agree that AM that dies before registration does not need a valid tracking URL. > Application History Server URL does not direct to the appropriate UI for > failed/killed jobs > --- > > Key: YARN-7422 > URL: https://issues.apache.org/jira/browse/YARN-7422 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.8.1 >Reporter: Kuhu Shukla >Priority: Major > > In cases where AM fails fatally, the AHS page's history link does not work > since AM was not able to update the trackingURL for the job. This JIRA is to > track any last attempt effort we can do from the AM to allow a tracking URL > in cases where the AM failure does not occur immediately at start up. Any > ideas and corrections would be appreciated. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7422) Application History Server URL does not direct to the appropriate UI for failed/killed jobs
Kuhu Shukla created YARN-7422: - Summary: Application History Server URL does not direct to the appropriate UI for failed/killed jobs Key: YARN-7422 URL: https://issues.apache.org/jira/browse/YARN-7422 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.1 Reporter: Kuhu Shukla In cases where AM fails fatally, the AHS page's history link does not work since AM was not able to update the trackingURL for the job. This JIRA is to track any last attempt effort we can do from the AM to allow a tracking URL in cases where the AM failure does not occur immediately at start up. Any ideas and corrections would be appreciated. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225477#comment-16225477 ] Kuhu Shukla commented on YARN-7244: --- [~jlowe], request for comments on the 2.8 version of the patch. Appreciate it! > ShuffleHandler is not aware of disks that are added > --- > > Key: YARN-7244 > URL: https://issues.apache.org/jira/browse/YARN-7244 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Fix For: 2.9.0, 3.0.0 > > Attachments: YARN-7244-branch-2.8.001.patch, > YARN-7244-branch-2.8.002.patch, YARN-7244.001.patch, YARN-7244.002.patch, > YARN-7244.003.patch, YARN-7244.004.patch, YARN-7244.005.patch, > YARN-7244.006.patch, YARN-7244.007.patch, YARN-7244.008.patch, > YARN-7244.009.patch, YARN-7244.010.patch, YARN-7244.011.patch, > YARN-7244.012.patch, YARN-7244.013.patch > > > The ShuffleHandler permanently remembers the list of "good" disks on NM > startup. If disks later are added to the node then map tasks will start using > them but the ShuffleHandler will not be aware of them. The end result is that > the data cannot be shuffled from the node leading to fetch failures and > re-runs of the map tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244-branch-2.8.002.patch Fixing minor new line checkstyle. > ShuffleHandler is not aware of disks that are added > --- > > Key: YARN-7244 > URL: https://issues.apache.org/jira/browse/YARN-7244 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Fix For: 2.9.0, 3.0.0 > > Attachments: YARN-7244-branch-2.8.001.patch, > YARN-7244-branch-2.8.002.patch, YARN-7244.001.patch, YARN-7244.002.patch, > YARN-7244.003.patch, YARN-7244.004.patch, YARN-7244.005.patch, > YARN-7244.006.patch, YARN-7244.007.patch, YARN-7244.008.patch, > YARN-7244.009.patch, YARN-7244.010.patch, YARN-7244.011.patch, > YARN-7244.012.patch, YARN-7244.013.patch > > > The ShuffleHandler permanently remembers the list of "good" disks on NM > startup. If disks later are added to the node then map tasks will start using > them but the ShuffleHandler will not be aware of them. The end result is that > the data cannot be shuffled from the node leading to fetch failures and > re-runs of the map tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244-branch-2.8.001.patch Attaching 2.8 version of the patch which needed some extra changes. The important one is in LocalDirsHandlerService which was missing getLocalPathForRead() method from trunk which went in as part of YARN-3998. I have added just that method rather than change the visibility of getPathToRead(). > ShuffleHandler is not aware of disks that are added > --- > > Key: YARN-7244 > URL: https://issues.apache.org/jira/browse/YARN-7244 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Fix For: 2.9.0, 3.0.0 > > Attachments: YARN-7244-branch-2.8.001.patch, YARN-7244.001.patch, > YARN-7244.002.patch, YARN-7244.003.patch, YARN-7244.004.patch, > YARN-7244.005.patch, YARN-7244.006.patch, YARN-7244.007.patch, > YARN-7244.008.patch, YARN-7244.009.patch, YARN-7244.010.patch, > YARN-7244.011.patch, YARN-7244.012.patch, YARN-7244.013.patch > > > The ShuffleHandler permanently remembers the list of "good" disks on NM > startup. If disks later are added to the node then map tasks will start using > them but the ShuffleHandler will not be aware of them. The end result is that > the data cannot be shuffled from the node leading to fetch failures and > re-runs of the map tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla reopened YARN-7244: --- Re-opening to attach 2.8 version of the patch. > ShuffleHandler is not aware of disks that are added > --- > > Key: YARN-7244 > URL: https://issues.apache.org/jira/browse/YARN-7244 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Fix For: 2.9.0, 3.0.0 > > Attachments: YARN-7244.001.patch, YARN-7244.002.patch, > YARN-7244.003.patch, YARN-7244.004.patch, YARN-7244.005.patch, > YARN-7244.006.patch, YARN-7244.007.patch, YARN-7244.008.patch, > YARN-7244.009.patch, YARN-7244.010.patch, YARN-7244.011.patch, > YARN-7244.012.patch, YARN-7244.013.patch > > > The ShuffleHandler permanently remembers the list of "good" disks on NM > startup. If disks later are added to the node then map tasks will start using > them but the ShuffleHandler will not be aware of them. The end result is that > the data cannot be shuffled from the node leading to fetch failures and > re-runs of the map tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218956#comment-16218956 ] Kuhu Shukla commented on YARN-7244: --- [~jlowe]/[~sunilg] appreciate any comments on the latest patch! Thank you. > ShuffleHandler is not aware of disks that are added > --- > > Key: YARN-7244 > URL: https://issues.apache.org/jira/browse/YARN-7244 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-7244.001.patch, YARN-7244.002.patch, > YARN-7244.003.patch, YARN-7244.004.patch, YARN-7244.005.patch, > YARN-7244.006.patch, YARN-7244.007.patch, YARN-7244.008.patch, > YARN-7244.009.patch, YARN-7244.010.patch, YARN-7244.011.patch, > YARN-7244.012.patch, YARN-7244.013.patch > > > The ShuffleHandler permanently remembers the list of "good" disks on NM > startup. If disks later are added to the node then map tasks will start using > them but the ShuffleHandler will not be aware of them. The end result is that > the data cannot be shuffled from the node leading to fetch failures and > re-runs of the map tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209152#comment-16209152 ] Kuhu Shukla commented on YARN-7244: --- [~jlowe], [~sunilg] request for comments/review. Thanks a lot! > ShuffleHandler is not aware of disks that are added > --- > > Key: YARN-7244 > URL: https://issues.apache.org/jira/browse/YARN-7244 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-7244.001.patch, YARN-7244.002.patch, > YARN-7244.003.patch, YARN-7244.004.patch, YARN-7244.005.patch, > YARN-7244.006.patch, YARN-7244.007.patch, YARN-7244.008.patch, > YARN-7244.009.patch, YARN-7244.010.patch, YARN-7244.011.patch, > YARN-7244.012.patch, YARN-7244.013.patch > > > The ShuffleHandler permanently remembers the list of "good" disks on NM > startup. If disks later are added to the node then map tasks will start using > them but the ShuffleHandler will not be aware of them. The end result is that > the data cannot be shuffled from the node leading to fetch failures and > re-runs of the map tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244.013.patch The build failed with 2 separate and unrelated issues I believe. The second time the cache seems to be picking up the old package for AuxiliaryLocalPathHandler. Re-triggering by uploading the same patch again. Please let me know if I missed something. > ShuffleHandler is not aware of disks that are added > --- > > Key: YARN-7244 > URL: https://issues.apache.org/jira/browse/YARN-7244 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-7244.001.patch, YARN-7244.002.patch, > YARN-7244.003.patch, YARN-7244.004.patch, YARN-7244.005.patch, > YARN-7244.006.patch, YARN-7244.007.patch, YARN-7244.008.patch, > YARN-7244.009.patch, YARN-7244.010.patch, YARN-7244.011.patch, > YARN-7244.012.patch, YARN-7244.013.patch > > > The ShuffleHandler permanently remembers the list of "good" disks on NM > startup. If disks later are added to the node then map tasks will start using > them but the ShuffleHandler will not be aware of them. The end result is that > the data cannot be shuffled from the node leading to fetch failures and > re-runs of the map tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244.012.patch Fixing minor checkstyle issues. :( > ShuffleHandler is not aware of disks that are added > --- > > Key: YARN-7244 > URL: https://issues.apache.org/jira/browse/YARN-7244 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-7244.001.patch, YARN-7244.002.patch, > YARN-7244.003.patch, YARN-7244.004.patch, YARN-7244.005.patch, > YARN-7244.006.patch, YARN-7244.007.patch, YARN-7244.008.patch, > YARN-7244.009.patch, YARN-7244.010.patch, YARN-7244.011.patch, > YARN-7244.012.patch > > > The ShuffleHandler permanently remembers the list of "good" disks on NM > startup. If disks later are added to the node then map tasks will start using > them but the ShuffleHandler will not be aware of them. The end result is that > the data cannot be shuffled from the node leading to fetch failures and > re-runs of the map tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244.011.patch Updated patch addressing comments from [~sunilg]. > ShuffleHandler is not aware of disks that are added > --- > > Key: YARN-7244 > URL: https://issues.apache.org/jira/browse/YARN-7244 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-7244.001.patch, YARN-7244.002.patch, > YARN-7244.003.patch, YARN-7244.004.patch, YARN-7244.005.patch, > YARN-7244.006.patch, YARN-7244.007.patch, YARN-7244.008.patch, > YARN-7244.009.patch, YARN-7244.010.patch, YARN-7244.011.patch > > > The ShuffleHandler permanently remembers the list of "good" disks on NM > startup. If disks later are added to the node then map tasks will start using > them but the ShuffleHandler will not be aware of them. The end result is that > the data cannot be shuffled from the node leading to fetch failures and > re-runs of the map tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208089#comment-16208089 ] Kuhu Shukla commented on YARN-7244: --- Thank you [~sunilg] for the comments! bq. Do you think is it better to have a setter and update AuxiliaryLocalPathHandler to AuxServices rather than changing AuxServices ctor. The constructor change makes sure that we always initialize the pathHandler, which seems safer to me. bq. AuxiliaryLocalPathHandler could be in org.apache.hadoop.yarn.server.api? any reasons to move to api? You are right. This needs to be in the server APIs. bq. All apis in AuxiliaryLocalPathHandlerImpl could have Override annotation. Will do. bq. Does ContainerManagerImpl need to have a getAuxiliaryLocalPathHandler ? I added the getter to assist any future testing and made it package private. I can mark it as VisibleForTesting or take it out; either way would be fine.
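The constructor-vs-setter point above can be illustrated with a minimal, self-contained sketch (the class and field names below are hypothetical stand-ins, not the actual patch code): constructor injection fails fast, so no service can ever observe a half-initialized object with a null path handler, whereas a setter leaves a window where that can happen.

```java
import java.util.Objects;

/** Hypothetical sketch: why constructor injection is safer than a setter. */
class AuxServicesSketch {
    // Stands in for the AuxiliaryLocalPathHandler dependency.
    private final Object pathHandler;

    AuxServicesSketch(Object pathHandler) {
        // Fails fast at construction time: the handler can never be null
        // once an instance exists, so downstream services need no null checks.
        this.pathHandler = Objects.requireNonNull(pathHandler, "pathHandler");
    }

    Object getPathHandler() {
        return pathHandler;
    }
}
```

With a setter instead, every consumer would have to defend against the handler not having been set yet; the constructor makes that state unrepresentable.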
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244.010.patch Attaching revised patch that addresses review comments. Thanks a lot!
[jira] [Commented] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205971#comment-16205971 ] Kuhu Shukla commented on YARN-7244: --- [~jlowe], request for review/comments on the latest patch. Thanks again.
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244.009.patch
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244.008.patch Fixed minor javadoc issues. I think the patch is ready for review!
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244.007.patch Updated patch that fixes the checkstyle issues (almost all; the remaining one asks for a getter in a test, which seems excessive to me) and the testMapFileAccess failure. My setup did not allow that test to run and required overhauling. Verified that it passes now.
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244.006.patch Thank you for the comments/review, [~jlowe]! Updated patch. Will wait for PreCommit before any review requests.
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244.005.patch Fixing TestShuffleHandler failures. The TestDistributedScheduler failure is documented in YARN-7299.
[jira] [Commented] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201883#comment-16201883 ] Kuhu Shukla commented on YARN-7244: --- Test failures are related. Will update shortly.
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244.004.patch Rebasing patch on trunk.
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244.003.patch Updated patch closer to the design Jason mentioned earlier. Adds a new path handler that is passed from the ContainerManager -> AuxServices -> AuxiliaryService -> ShuffleHandler. Appreciate any comments on the approach/patch. Thanks a lot!
[jira] [Commented] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184239#comment-16184239 ] Kuhu Shukla commented on YARN-7244: --- Thank you [~jlowe], [~sunilg] for the review/comments. bq. We could make a pull API where the aux service can essentially directly call the NM's LocalDirHandlerService for getting a path to read or a path to write, then the aux service doesn't even have to manage the directories itself if all it cares about is finding a place to write or read. A pull model where the ShuffleHandler/aux service does not maintain valid-dirs state would be my preference, but the other pull approach would work too. I will start reworking the patch in the meantime and will finalize based on what we decide. Appreciate your thoughts.
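The pull model discussed above can be sketched roughly as follows. This is a simplified, hypothetical illustration (the interface and method names are stand-ins, not the actual API from the patch): the aux service resolves paths per request against whatever directory list is current, rather than caching the "good" dirs once at NM startup.

```java
import java.io.File;
import java.util.List;

/** Hypothetical pull-model contract the NM would expose to aux services. */
interface LocalPathHandler {
    /** Resolve a relative path against the dirs that are good right now. */
    File getPathForRead(String relPath);
}

class PullModelPathHandler implements LocalPathHandler {
    // Refreshed by the NM's disk health checker; volatile so readers
    // always see the latest published list.
    private volatile List<File> goodDirs;

    PullModelPathHandler(List<File> initialGoodDirs) {
        this.goodDirs = initialGoodDirs;
    }

    /** Called by the disk checker when dirs change (e.g. a disk is added). */
    void onDirsChanged(List<File> newGoodDirs) {
        this.goodDirs = newGoodDirs;
    }

    @Override
    public File getPathForRead(String relPath) {
        // Pull model: the aux service asks at read time, so a disk added
        // after NM startup becomes visible as soon as the dir list refreshes.
        for (File dir : goodDirs) {
            File candidate = new File(dir, relPath);
            if (candidate.exists()) {
                return candidate;
            }
        }
        return null; // caller surfaces this as a fetch failure
    }
}
```

Because the ShuffleHandler never holds its own copy of the dir list, there is no stale state to invalidate, which is the appeal of this design over a push/notification approach.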
[jira] [Commented] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182450#comment-16182450 ] Kuhu Shukla commented on YARN-7244: --- Thank you [~sunilg] for the review comments! bq. We could push this config name to LocalDirAllocator and then read from NM end I am not sure how we could initialize the config specifically for LocalDirAllocator (maybe add a constructor?), or what reading it from the NM end would mean. Agreed that code separation is important here. Maybe not having this as a config would help? bq. Do you think, we can improve this to skip as default behavior itself? I did not fully get what you have in mind here. Could you elaborate a bit? Thanks a lot!
[jira] [Comment Edited] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181059#comment-16181059 ] Kuhu Shukla edited comment on YARN-7244 at 9/27/17 12:06 PM: - Thank you [~bibinchundatt] for the review comments! bq. Better to check directory exists first if we are not concerned of permission . thoughts? Makes sense to me. bq. Testcase is successful even if YARN_SHUFFLE_BAD_DIRS_FILTER_ENABLED set to false. That is expected. This config essentially decides whether we stick to the existing behavior of removing bad directories (value=true) or not (value=false).
[jira] [Commented] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181059#comment-16181059 ] Kuhu Shukla commented on YARN-7244: --- bq. Better to check directory exists first if we are not concerned of permission . thoughts? Makes sense to me. bq. Testcase is successful even if YARN_SHUFFLE_BAD_DIRS_FILTER_ENABLED set to false. That is expected. This config essentially decides whether we stick to the existing behavior of removing bad directories (value=true) or not (value=false).
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244.002.patch Fixing a minor test issue for the newly added yarn config key.
[jira] [Updated] (YARN-7244) ShuffleHandler is not aware of disks that are added
[ https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-7244: -- Attachment: YARN-7244.001.patch v1 patch that adds a new LocalDirAllocator#getLocalPathToRead() that decides whether to filter the bad directories based on a boolean. Changing the original call would be more pervasive. The patch does modify the AllocatorPerContext#getLocalPathToRead() signature, since that is a private static class of LocalDirAllocator. The ShuffleHandler uses a yarn config to decide whether or not to filter bad dirs. This value, when false, will never take out bad directories, and hence any changes to local dirs will not impact the shuffle handler reads. Even if the mkdirs and exists checks fail, we want the dirs to be listed in the localdirs member when the config is false. For testing reasons, I have added a getter for the lDirAllocator which is package private. Appreciate any comments/corrections to this patch. Another way to handle this would have been to change the AuxiliaryServices to pass the NMContext or the LocalDirAllocator from the NM. The former approach needs nodemanager dependencies to be added, and the latter is tricky as I am not sure how the AuxServices class would pass the object without adding it as a member. Would appreciate any suggestions on alternative approaches as well.
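The boolean-controlled filtering described in the v1 patch summary can be sketched as follows. This is a simplified, hypothetical illustration of the idea (the class and method names are not the actual LocalDirAllocator code): when the filter is disabled, the read path keeps every configured directory, so disk-health transitions cannot shrink the set of dirs the ShuffleHandler searches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the bad-dir filter switch. */
class DirFilterSketch {
    static List<String> dirsToSearch(List<String> allDirs, Set<String> badDirs,
                                     boolean filterBadDirs) {
        if (!filterBadDirs) {
            // ShuffleHandler read path: keep every configured dir so reads
            // are unaffected by dirs being marked bad and later good again.
            return allDirs;
        }
        // Default behavior: drop dirs currently marked bad.
        List<String> good = new ArrayList<>();
        for (String dir : allDirs) {
            if (!badDirs.contains(dir)) {
                good.add(dir);
            }
        }
        return good;
    }
}
```

The trade-off is that with the filter disabled, reads may probe directories that are genuinely unusable, so callers still need to tolerate per-dir read failures.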
[jira] [Created] (YARN-7244) ShuffleHandler is not aware of disks that are added
Kuhu Shukla created YARN-7244: - Summary: ShuffleHandler is not aware of disks that are added Key: YARN-7244 URL: https://issues.apache.org/jira/browse/YARN-7244 Project: Hadoop YARN Issue Type: Bug Reporter: Kuhu Shukla Assignee: Kuhu Shukla The ShuffleHandler permanently remembers the list of "good" disks on NM startup. If disks later are added to the node then map tasks will start using them but the ShuffleHandler will not be aware of them. The end result is that the data cannot be shuffled from the node leading to fetch failures and re-runs of the map tasks.
[jira] [Commented] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16055849#comment-16055849 ] Kuhu Shukla commented on YARN-6315: --- Adding an 'actualSize' for LocalizedResource and then checking it against the file attributes in isResourcePresent() covers a subset of corruption scenarios, that is, cases where the file size changes after its successful download. I am leaning towards adding actualSize to reflect the "hdfs resource size" and comparing that with the local file size. This will cover any corruption caused during download. Special cases here would be directories and archives. > Improve LocalResourcesTrackerImpl#isResourcePresent to return false for > corrupted files > --- > > Key: YARN-6315 > URL: https://issues.apache.org/jira/browse/YARN-6315 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.3, 2.8.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-6315.001.patch, YARN-6315.002.patch, > YARN-6315.003.patch, YARN-6315.004.patch > > > We currently check if a resource is present by making sure that the file > exists locally. There can be a case where the LocalizationTracker thinks that > it has the resource if the file exists but with size 0 or less than the > "expected" size of the LocalResource. This JIRA tracks the change to harden > the isResourcePresent call to address that case.
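The size-comparison idea above can be sketched as a minimal, hedged illustration (the class and method names are hypothetical, not the actual LocalResourcesTrackerImpl code): presence requires both existence and a matching on-disk size, with directories treated as the special case noted in the comment.

```java
import java.io.File;

/** Hypothetical sketch of a size-aware resource presence check. */
class ResourcePresenceSketch {
    /**
     * Returns true only if the localized file exists AND its on-disk size
     * matches the size recorded at download time (e.g. the HDFS source size).
     * Directories (archives after unpacking) are only existence-checked,
     * since their size is not directly comparable to the source size.
     */
    static boolean isResourcePresent(File localPath, long expectedSize) {
        if (!localPath.exists()) {
            return false; // original check: missing file means not present
        }
        if (localPath.isDirectory()) {
            return true; // special case: directories/archives not size-checked
        }
        // Hardened check: a zero-length or truncated file is treated as absent,
        // forcing re-localization instead of serving a corrupted resource.
        return localPath.length() == expectedSize;
    }
}
```

A size match cannot catch every corruption (same-size bit flips would need a checksum), which is why the comment describes it as covering a subset of scenarios.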
[jira] [Commented] (YARN-6641) Non-public resource localization on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026350#comment-16026350 ] Kuhu Shukla commented on YARN-6641: --- Thanks [~jlowe], let me know if a 2.8 patch is required as well. > Non-public resource localization on a bad disk causes subsequent containers > failure > --- > > Key: YARN-6641 > URL: https://issues.apache.org/jira/browse/YARN-6641 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-6641.001.patch, YARN-6641.002.patch, > YARN-6641.003.patch, YARN-6641.004.patch > > > YARN-3591 added the {{checkLocalResource}} method to {{isResourcePresent()}} > call to allow checking an already localized resource against the list of > good/full directories. > Since LocalResourcesTrackerImpl instantiations for app level resources and > private resources do not use the new constructor, such resources that are on > bad disk will never be checked against good dirs.
[jira] [Updated] (YARN-6641) Non-public resource localization on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-6641: -- Attachment: YARN-6641.004.patch Updated patch to make the getter for dirsHandler package private. Thanks [~jlowe] for the review comments.
[jira] [Commented] (YARN-6641) Non-public resource localization on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025341#comment-16025341 ] Kuhu Shukla commented on YARN-6641: --- [~jlowe], request for some more comments. Thanks a lot!
[jira] [Updated] (YARN-6641) Non-public resource localization on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-6641: -- Attachment: YARN-6641.003.patch Thanks [~jlowe] for the quick response. I have updated the patch.
[jira] [Commented] (YARN-6641) Non-public resource localization on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024814#comment-16024814 ] Kuhu Shukla commented on YARN-6641: --- Minor checkstyle issues. Will fix in upcoming patches. Request for review on the approach and any concerns with this change. [~jlowe] / [~nroberts].
[jira] [Updated] (YARN-6641) Non-public resource localization on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-6641: -- Attachment: YARN-6641.002.patch Fixed minor checkstyle issues. Findbugs warnings are in files this patch has not touched, so leaving them unaddressed for now.
[jira] [Updated] (YARN-6641) Non-public resource localization on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-6641: -- Attachment: YARN-6641.001.patch v1 patch that calls the constructor for LocalResourcesTrackerImpl with the LocalDirsHandlerService object. It does this also in the case where it is trying to recover resources.
[jira] [Created] (YARN-6641) Non-public resource localization on a bad disk causes subsequent containers failure
Kuhu Shukla created YARN-6641: - Summary: Non-public resource localization on a bad disk causes subsequent containers failure Key: YARN-6641 URL: https://issues.apache.org/jira/browse/YARN-6641 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.0 Reporter: Kuhu Shukla Assignee: Kuhu Shukla YARN-3591 added the {{checkLocalResource}} method to the {{isResourcePresent()}} call to allow checking an already localized resource against the list of good/full directories. Since LocalResourcesTrackerImpl instantiations for app-level resources and private resources do not use the new constructor, such resources that are on a bad disk will never be checked against good dirs.
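The "checked against good dirs" idea from the issue description can be sketched as follows. This is a minimal illustration, not the actual LocalResourcesTrackerImpl/YARN-3591 code; the class and method names here are hypothetical.

```java
import java.io.File;
import java.util.List;

class GoodDirCheck {
    // Sketch of the checkLocalResource idea: a localized resource is only
    // trusted if its path lives under one of the currently-good local dirs.
    // A resource sitting under a disk that has been marked bad/full will not
    // match any good-dir prefix and is reported as not present.
    static boolean isUnderGoodDir(File localizedPath, List<String> goodDirs) {
        // Append the separator so "/grid/0/yarn/local-other" does not match
        // the good dir "/grid/0/yarn/local" by string prefix alone.
        String path = localizedPath.getAbsolutePath() + File.separator;
        for (String dir : goodDirs) {
            String prefix = new File(dir).getAbsolutePath() + File.separator;
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

A tracker constructed without the dirs-handler reference (the bug described above) simply never performs this check, so resources on a bad disk keep passing the presence test.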
[jira] [Commented] (YARN-6277) Nodemanager heap memory leak
[ https://issues.apache.org/jira/browse/YARN-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952157#comment-15952157 ] Kuhu Shukla commented on YARN-6277: --- [~Feng Yuan], did you see this after YARN-4095 went in? Thanks! > Nodemanager heap memory leak > > > Key: YARN-6277 > URL: https://issues.apache.org/jira/browse/YARN-6277 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Affects Versions: 2.7.3, 2.8.1, 3.0.0-alpha2 > Reporter: Feng Yuan > Assignee: Feng Yuan > Attachments: YARN-6277.branch-2.8.001.patch > > > Because of the LocalDirsHandlerService/LocalDirAllocator mechanism, massive numbers of LocalFileSystem instances can be created, leading to a heap leak.
[jira] [Commented] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943132#comment-15943132 ] Kuhu Shukla commented on YARN-6315: --- Thank you [~jlowe] for the reviews and for helping find the bug in this approach. I will update my patch shortly. The initial idea, which seemed to work when I tested it, was to add an "actualSize" field to LocalizedResource and use that instead of the request's size. > Improve LocalResourcesTrackerImpl#isResourcePresent to return false for > corrupted files > --- > > Key: YARN-6315 > URL: https://issues.apache.org/jira/browse/YARN-6315 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.3, 2.8.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-6315.001.patch, YARN-6315.002.patch, > YARN-6315.003.patch, YARN-6315.004.patch > > > We currently check if a resource is present by making sure that the file > exists locally. There can be a case where the LocalizationTracker thinks that > it has the resource if the file exists but with size 0 or less than the > "expected" size of the LocalResource. This JIRA tracks the change to harden > the isResourcePresent call to address that case.
[jira] [Commented] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15928563#comment-15928563 ] Kuhu Shukla commented on YARN-6315: --- mvn install is marking a lot of hdfs files as duplicate. I have asked on HDFS-11431 since that seems related. {code} [WARNING] Rule 1: org.apache.maven.plugins.enforcer.BanDuplicateClasses failed with message: Duplicate classes found: Found in: org.apache.hadoop:hadoop-client-api:jar:3.0.0-alpha3-SNAPSHOT:compile org.apache.hadoop:hadoop-client-minicluster:jar:3.0.0-alpha3-SNAPSHOT:compile Duplicate classes: org/apache/hadoop/hdfs/qjournal/protocol/QJournalProtocolProtos$GetJournalStateRequestProto$Builder.class {code}
[jira] [Updated] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-6315: -- Attachment: YARN-6315.004.patch Thank you [~jlowe] for the feedback. I have made the changes accordingly.
[jira] [Commented] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927045#comment-15927045 ] Kuhu Shukla commented on YARN-6315: --- [~jlowe], Request for some more comments on the latest patch. Appreciate it.
[jira] [Updated] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-6315: -- Attachment: YARN-6315.003.patch Thank you Jason for the review. Updated patch. Also, I now catch Exception instead of just IOException to cover the cases where the readAttributes call could throw SecurityException or UnsupportedOperationException. Will wait for precommit before requesting review.
[jira] [Commented] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15907917#comment-15907917 ] Kuhu Shukla commented on YARN-6315: --- [~jlowe], [~eepayne], Request for comments/review. Thanks a lot!
[jira] [Updated] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-6315: -- Attachment: YARN-6315.002.patch Fixing checkstyle warnings.
[jira] [Comment Edited] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15906665#comment-15906665 ] Kuhu Shukla edited comment on YARN-6315 at 3/12/17 7:36 PM: Some performance numbers from instrumenting the test and profiling it through YourKit on my MacBook Pro. The current patch spends an average of 1900 ms for 10,002 runs (189 microseconds per call). An equivalent patch that uses file.isDirectory(), file.exists(), and file.length() as shown below takes 2080.8 ms for 10,002 runs (208 microseconds per call). {code} if ((!file.isDirectory() && file.length() != req.getSize()) || !file.exists()) { ret = false; } else if (dirsHandler != null) { ret = checkLocalResource(rsrc); } {code}
[jira] [Commented] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15906665#comment-15906665 ] Kuhu Shukla commented on YARN-6315: --- Some performance numbers from instrumenting the test and profiling it through YourKit on my MacBook Pro. The current patch spends an average of 1900 ms for 10,002 runs (189 microseconds per call). An equivalent patch that uses file.isDirectory(), file.exists(), and file.length() as shown below takes 2080.8 ms for 10,002 runs (208 microseconds per call). {code} if ((!file.isDirectory() && file.length() != req.getSize()) || !file.exists()) { ret = false; } else if (dirsHandler != null) { ret = checkLocalResource(rsrc); } {code}
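The one-bulk-stat-versus-three-stat-calls comparison above can be reproduced with a rough micro-benchmark along these lines. This is a hedged sketch, not the YourKit-instrumented test from the comment; the class name and exact timings are illustrative only, and results will vary by machine and filesystem.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.attribute.BasicFileAttributes;

class StatBench {
    // Times `runs` presence checks two ways: a single bulk readAttributes()
    // stat per check versus three separate File stat calls per check.
    // Returns {bulkNanos, separateNanos}.
    static long[] time(File f, long expectedSize, int runs) throws IOException {
        long t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            // one stat call fetches existence, type, and size together
            BasicFileAttributes a =
                Files.readAttributes(f.toPath(), BasicFileAttributes.class);
            if (!a.isDirectory() && a.size() != expectedSize) {
                throw new IllegalStateException("size mismatch");
            }
        }
        long bulk = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            // the equivalent check with isDirectory()/length()/exists(),
            // each of which may stat the file separately
            if ((!f.isDirectory() && f.length() != expectedSize) || !f.exists()) {
                throw new IllegalStateException("size mismatch");
            }
        }
        long separate = System.nanoTime() - t0;
        return new long[] { bulk, separate };
    }
}
```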
[jira] [Updated] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-6315: -- Attachment: YARN-6315.001.patch First version of the patch, which uses the readAttributes bulk operation to match the size for resources that are not directories, since the size of a directory may not always match up. It maintains the exists() behavior by setting ret = false when a FileNotFoundException is thrown. The method also catches IOException to maintain the previous behavior/signature.
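The readAttributes-based presence check described in this patch summary can be sketched as below. This is a minimal standalone illustration of the approach under the assumptions stated in the comment, not the actual patch code; the class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.attribute.BasicFileAttributes;

class ResourcePresenceCheck {
    // A single readAttributes() bulk stat replaces separate
    // exists()/isDirectory()/length() calls. Size is compared only for
    // regular files, since directory sizes are filesystem-dependent.
    static boolean isResourcePresent(File file, long expectedSize) {
        try {
            BasicFileAttributes attrs =
                Files.readAttributes(file.toPath(), BasicFileAttributes.class);
            return attrs.isDirectory() || attrs.size() == expectedSize;
        } catch (NoSuchFileException e) {
            // missing file: preserves the old exists() == false behavior
            return false;
        } catch (IOException e) {
            // swallow to keep the previous no-checked-throw signature
            return false;
        }
    }
}
```

A zero-length or truncated localized file now fails the size comparison and is reported as not present, which is the corruption case the JIRA targets.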
[jira] [Created] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
Kuhu Shukla created YARN-6315: - Summary: Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files Key: YARN-6315 URL: https://issues.apache.org/jira/browse/YARN-6315 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.3, 2.8.1 Reporter: Kuhu Shukla Assignee: Kuhu Shukla We currently check if a resource is present by making sure that the file exists locally. There can be a case where the LocalizationTracker thinks that it has the resource if the file exists but with size 0 or less than the "expected" size of the LocalResource. This JIRA tracks the change to harden the isResourcePresent call to address that case.