Re: A basic approach towards implementing the ANTI-AFFINITY in Slider
On 21 May 2015, at 02:25, Rajesh Kartha karth...@gmail.com wrote:

> Thanks a lot Steve. So the suggestion is to use the AMRMClient.updateBlacklist() method to include *nodes where the component is running + unreliable nodes* before the container is requested, instead of discarding allocated containers.

That's right. Two details:

1. We'd have to include in the blacklist those nodes on which instances have already been explicitly requested but not yet granted (the outstanding request tracker knows them).
2. We'd have to ask for each instance one by one. Otherwise, if you ask for more than one container, several can be assigned to the same host.

> I will take a closer look and update SLIDER-82 with my approach.
>
> Thanks,
> Rajesh

That'd be great!
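For illustration, the blacklist-driven flow being agreed on here might be sketched as below. This is a hypothetical sketch, not Slider code: the helper class and its parameter names are invented stand-ins for the RoleHistory and outstanding-request-tracker lookups, and the actual YARN calls (AMRMClient.updateBlacklist() followed by a single-container request) appear only as comments.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of building the blacklist for one anti-affine container request.
 * Before each single-container request, the AM would pass this set to
 * AMRMClient.updateBlacklist(additions, removals); requesting instances
 * one at a time is what guarantees two allocations never share a host.
 */
public class AntiAffineBlacklist {

    /**
     * Union of: nodes already running an instance of the role, nodes with
     * an outstanding (requested-but-not-yet-granted) instance, and nodes
     * considered too unreliable to use.
     */
    static Set<String> blacklistFor(Set<String> activeNodes,
                                    Set<String> outstandingRequestNodes,
                                    Set<String> unreliableNodes) {
        Set<String> blacklist = new HashSet<>(activeNodes);
        blacklist.addAll(outstandingRequestNodes);
        blacklist.addAll(unreliableNodes);
        // Real code would then do, once per instance requested:
        //   amRmClient.updateBlacklist(new ArrayList<>(blacklist), null);
        //   amRmClient.addContainerRequest(requestForOneContainer);
        return blacklist;
    }
}
```

Each request then covers exactly one container; a single request for n containers could otherwise be satisfied by several allocations on the same non-blacklisted host.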
Re: A basic approach towards implementing the ANTI-AFFINITY in Slider
Thanks a lot Steve. So the suggestion is to use the AMRMClient.updateBlacklist() method to include *nodes where the component is running + unreliable nodes* before the container is requested, instead of discarding allocated containers. I will take a closer look and update SLIDER-82 with my approach.

Thanks,
Rajesh

On Tue, May 19, 2015 at 12:34 PM, Steve Loughran ste...@hortonworks.com wrote:

> On 18 May 2015, at 23:41, Rajesh Kartha karth...@gmail.com wrote:
>
>> Hello,
>>
>> One of the early requests we got for improving Slider was to have a way of *ensuring only a single process* of a given application runs on any given node. I have read about the ANTI-AFFINITY flag but was not fully sure about its implementation, so I have been trying to piece things together based on:
>>
>> - the comments in SLIDER-82
>> - Steve's blog at http://steveloughran.blogspot.co.uk/2015/05/dynamic-datacentre-applications.html
>> - the Slider wiki at http://slider.incubator.apache.org/design/rolehistory.html
>> - the code itself
>>
>> Today the flag PlacementPolicy.ANTI_AFFINITY_REQUIRED seems to be a placeholder and is not currently used in the flow.
> I think it's used on restart, where explicit requests for nodes are not made if there's already an instance on that node and the anti-affinity flag is set.
>
>> Also, as I understand it, the main method where the check on containers happens is the event handler *AppState.onContainersAllocated()*. Since this method makes the decision on the allocated containers before launching the role, I was thinking of a simple approach where we could:
>>
>> 1) check for RoleStatus.isAntiAffinePlacement() to be true
>> 2) check whether the NodeInstance to which the current container is allocated is either in RoleHistory.listActiveNodes(roleId) or known to be unreliable
>> 3) discard the container without decrementing the request count for the role
>> 4) if the container does not match the check in #2, proceed with the flow and continue with the launch
>>
>> The launching of the role via the launchService happens after this check, so I would hope these checks may not be that expensive.
>
> I thought about this, but wasn't happy with it. We will end up discarding a lot of containers. These may seem like simple idle capacity, but in a cluster with pre-emption enabled a Slider app could be killing other work to get containers it then discards. Even without that, there's a risk that you end up getting back those same hosts again and again. The same goes for unreliable hosts.
>
>> One other potential place for such a check is RoleHistory.findNodeForNewInstance(role), during the iteration over the list of NodeInstances from getNodesForRoleId(), but based on my experiments listActiveNodes() and getNodesForRoleId() seemed mutually exclusive, hence this check may not be needed there.
>>
>> Again, I am not sure whether the above can address the different scenarios expected from the ANTI-AFFINITY flag, but I was wondering if this was feasible as a first approach to having some ANTI-AFFINITY support.
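[As an aside, the per-container check proposed in steps 1-4 could be sketched roughly as follows. This is an illustrative sketch, not Slider's actual code: RoleStatus.isAntiAffinePlacement() is reduced to a boolean, the node lookups to plain string sets, and the container release appears only as a comment.]

```java
import java.util.Set;

/** Sketch of the per-container anti-affinity check proposed in steps 1-4. */
public class AntiAffinityCheck {

    /**
     * Decide whether an allocated container should be launched or
     * discarded. If the role is anti-affine and the allocation landed on
     * a node that already hosts an instance, or on an unreliable node,
     * the container is released *without* decrementing the role's request
     * count, so a replacement request stays outstanding.
     */
    static boolean shouldLaunch(boolean antiAffinePlacement,
                                String allocatedHost,
                                Set<String> activeNodes,      // from RoleHistory.listActiveNodes(roleId)
                                Set<String> unreliableNodes) {
        if (!antiAffinePlacement) {
            return true;                 // step 1: policy not set, launch normally
        }
        boolean conflict = activeNodes.contains(allocatedHost)
                        || unreliableNodes.contains(allocatedHost);
        // step 3 on conflict: release the container, keep the request count,
        // e.g. amRmClient.releaseAssignedContainer(container.getId());
        return !conflict;                // step 4: no conflict, proceed to launch
    }
}
```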
> I think it's a step in the right direction, but we really need to make the leap to doing what Twill did and use the blacklist to exclude nodes where we either have active containers or reliability is considered too low.
>
> I'm not planning to do any work in that area in the near future, so if you want to sit down and start doing it, feel free!