Re: A basic approach towards implementing the ANTI-AFFINITY in Slider

2015-05-21 Thread Steve Loughran

 On 21 May 2015, at 02:25, Rajesh Kartha karth...@gmail.com wrote:
 
 Thanks a lot Steve,
 
 So the suggestion is to use the AMRMClient.updateBlacklist() method to add
 *nodes where the component is running + unreliable nodes* to the blacklist
 before the container is requested, rather than discarding allocated containers.
 

that's right. 

1. We'd have to include in the blacklist those nodes on which instances have 
already been explicitly requested but not yet granted (the outstanding request 
tracker knows them).
2. We'd have to ask for each instance one by one; otherwise, if you ask for more 
than one container at a time, several of them can be assigned to the same host. 
A rough sketch of both points follows.
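
Purely as an illustration against the raw YARN client API (the three host lists
below are stand-ins for whatever the role history and outstanding request
tracker would supply; they are not actual Slider methods), one request cycle
could look like:

    // Sketch only: anti-affine request using AMRMClient's blacklist.
    // The three host lists are placeholders, not real Slider APIs.
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class AntiAffineRequester {

      private final AMRMClient<ContainerRequest> amrmClient;

      public AntiAffineRequester(AMRMClient<ContainerRequest> amrmClient) {
        this.amrmClient = amrmClient;
      }

      /**
       * Request a single instance, first blacklisting every host we must avoid:
       * hosts already running the component, hosts with requests outstanding,
       * and hosts considered unreliable.
       */
      public void requestOneInstance(Resource capability,
                                     Priority priority,
                                     List<String> activeHosts,
                                     List<String> outstandingHosts,
                                     List<String> unreliableHosts) {
        List<String> additions = new ArrayList<>();
        additions.addAll(activeHosts);
        additions.addAll(outstandingHosts);
        additions.addAll(unreliableHosts);

        // Push the exclusions to the RM before the request goes out.
        amrmClient.updateBlacklist(additions, new ArrayList<String>());

        // Ask for exactly one container per cycle; a request for several at
        // once could be satisfied by multiple allocations on the same host.
        amrmClient.addContainerRequest(
            new ContainerRequest(capability, null, null, priority));
      }
    }

Calling this once per missing instance, refreshing the blacklist before each
call, keeps the one-at-a-time constraint from point 2 explicit.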

 I will take a closer look and update SLIDER-82 with my approach.
 
 Thanks,
 Rajesh
 


that'd be great!


Re: A basic approach towards implementing the ANTI-AFFINITY in Slider

2015-05-20 Thread Rajesh Kartha
Thanks a lot Steve,

So the suggestion is to use the AMRMClient.updateBlacklist() method to add
*nodes where the component is running + unreliable nodes* to the blacklist
before the container is requested, rather than discarding allocated containers.

I will take a closer look and update SLIDER-82 with my approach.

Thanks,
Rajesh


On Tue, May 19, 2015 at 12:34 PM, Steve Loughran ste...@hortonworks.com
wrote:


  On 18 May 2015, at 23:41, Rajesh Kartha karth...@gmail.com wrote:
 
  Hello,
 
  One of the early requests we got for improving Slider was to have a
  way of *ensuring
  only a single process* of the given application runs on any given node. I
  have read about the ANTI-AFFINITY flag but was not fully sure about its
  implementation.
 
  Hence I have been trying to piece things together based on the comments in:
 
  - SLIDER-82
  - Steve's blog at
 
 http://steveloughran.blogspot.co.uk/2015/05/dynamic-datacentre-applications.html
  - The Slider wiki at -
  http://slider.incubator.apache.org/design/rolehistory.html
  - Looking at the code
 
  Today the flag PlacementPolicy.ANTI_AFFINITY_REQUIRED seems like a placeholder
  and is not currently used in the flow.

 I think it's used on restart, where explicit requests for specific hosts are
 not made if there's already an instance on that node and the anti-affinity
 flag is set.

 
  Also, as I understand, the main method where the check on containers happens
  is in the Event Handler:
 
  *AppState.onContainersAllocated()*
 
  Since this method makes the decision on the allocated containers before
  launching the role, I was thinking of a simple approach where we could:
  
  1) check that RoleStatus.isAntiAffinePlacement() is true
  2) check whether the NodeInstance on which the current container is allocated
  is either in RoleHistory.listActiveNodes(roleId) or found to be unreliable
  3) if so, discard the container without decrementing the request count for
  the role
  4) if the container does not meet the check in #2, then proceed with the flow
  and continue with the launch
 
  The launching of the role via the launchService happens after this check,
  so I would hope these checks may not be that expensive.
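
To make the proposal concrete, here is a rough sketch of that check. It is not
the actual AppState code; the abstract hooks at the bottom stand in for
RoleStatus.isAntiAffinePlacement(), RoleHistory.listActiveNodes(roleId), the
node-failure history and the launch service:

    // Sketch only: the proposed discard-on-allocation check, steps 1)-4) above.
    import java.util.List;
    import java.util.Set;

    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public abstract class AntiAffineAllocationCheck {

      private final AMRMClient<ContainerRequest> amrmClient;

      protected AntiAffineAllocationCheck(AMRMClient<ContainerRequest> client) {
        this.amrmClient = client;
      }

      public void onContainersAllocated(int roleId, List<Container> allocated) {
        for (Container container : allocated) {
          String host = container.getNodeId().getHost();
          boolean clash = isAntiAffine(roleId)
              && (activeHosts(roleId).contains(host) || isUnreliable(host));
          if (clash) {
            // Discard the container; the outstanding request count for the
            // role is deliberately left untouched, so a replacement is still
            // expected.
            amrmClient.releaseAssignedContainer(container.getId());
          } else {
            // Host is acceptable: carry on with the normal launch flow.
            launchRole(roleId, container);
          }
        }
      }

      // Placeholder hooks standing in for the real role/placement state.
      protected abstract boolean isAntiAffine(int roleId);
      protected abstract Set<String> activeHosts(int roleId);
      protected abstract boolean isUnreliable(String host);
      protected abstract void launchRole(int roleId, Container container);
    }

Releasing the container without touching the request count is what keeps the AM
asking until it gets a host that passes the check, as step 3 intends.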
 

 I thought about this, but wasn't happy with it.

 We will end up discarding a lot of containers. These may seem like simple idle
 capacity, but in a cluster with pre-emption enabled a Slider app could be
 killing other work to get containers it then discards.
 
 Even without that, there's a risk that you end up getting back those same
 hosts, again and again. The same goes for unreliable hosts.


  One other potential area for such a check is
  RoleHistory.findNodeForNewInstance(role), during the iteration of the list of
  Node Instances from getNodesForRoleId(), but based on my experiments the
  listActiveNodes() and getNodesForRoleId() lists seemed mutually exclusive,
  hence this check may not be needed there.
 
  Again, I am not sure if the above can address the different scenarios that
  are expected from the ANTI-AFFINITY flag, but was wondering if this was
  feasible as a first approach to having some ANTI-AFFINITY support.


 I think it's a step in the right direction, but we really need to make the
 leap to doing what Twill did and use the blacklist to exclude nodes where we
 either have active containers or the node's reliability is considered too low.

 I'm not planning to do any work in that area in the near future, so if you
 want to sit down and start doing it, feel free!



Re: A basic approach towards implementing the ANTI-AFFINITY in Slider

2015-05-19 Thread Steve Loughran

 On 18 May 2015, at 23:41, Rajesh Kartha karth...@gmail.com wrote:
 
 Hello,
 
 One of the early requests we got for improving Slider was to have a
 way of *ensuring
 only a single process* of the given application runs on any given node. I
 have read about the ANTI-AFFINITY flag but was not fully sure about its
 implementation.
 
 Hence I have been trying to piece things together based on the comments in:
 
 - SLIDER-82
 - Steve's blog at
 http://steveloughran.blogspot.co.uk/2015/05/dynamic-datacentre-applications.html
 - The Slider wiki at -
 http://slider.incubator.apache.org/design/rolehistory.html
 - Looking at the code
 
 Today the flag PlacementPolicy.ANTI_AFFINITY_REQUIRED seems like a placeholder
 and is not currently used in the flow.

I think it's used on restart, where explicit requests for specific hosts are 
not made if there's already an instance on that node and the anti-affinity flag 
is set.

 
 Also, as I understand, the main method where the check on containers happens
 is in the Event Handler:
 
 *AppState.onContainersAllocated()*
 
 Since this method makes the decision on the allocated containers before
 launching the role, I was thinking of a simple approach where we could:
 
 1) check that RoleStatus.isAntiAffinePlacement() is true
 2) check whether the NodeInstance on which the current container is allocated
 is either in RoleHistory.listActiveNodes(roleId) or found to be unreliable
 3) if so, discard the container without decrementing the request count for
 the role
 4) if the container does not meet the check in #2, then proceed with the flow
 and continue with the launch
 
 The launching of the role via the launchService happens after this check,
 so I would hope these checks may not be that expensive.
 

I thought about this, but wasn't happy with it.

We will end up discarding a lot of containers. These may seem like simple idle 
capacity, but in a cluster with pre-emption enabled a Slider app could be 
killing other work to get containers it then discards.

Even without that, there's a risk that you end up getting back those same 
hosts, again and again. The same goes for unreliable hosts.


 One other potential area for such a check is
 RoleHistory.findNodeForNewInstance(role), during the iteration of the list of
 Node Instances from getNodesForRoleId(), but based on my experiments the
 listActiveNodes() and getNodesForRoleId() lists seemed mutually exclusive,
 hence this check may not be needed there.
 
 Again, I am not sure if the above can address the different scenarios that
 are expected from the ANTI-AFFINITY flag, but was wondering if this was
 feasible as a first approach to having some ANTI-AFFINITY support.


I think it's a step in the right direction, but we really need to make the leap 
to doing what Twill did and use the blacklist to exclude nodes where we either 
have active containers or the node's reliability is considered too low.
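
For illustration only, the exclusion set might be computed along these lines
(NodeRecord and the failure threshold are made-up names here, not the role
history API):

    // Sketch only: build the host blacklist from per-node state.
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;

    public final class BlacklistBuilder {

      /** Minimal stand-in for whatever per-node state the role history keeps. */
      public static final class NodeRecord {
        final String hostname;
        final int activeInstances;  // containers of this role currently on the node
        final int recentFailures;   // recent container failures seen on the node

        public NodeRecord(String hostname, int activeInstances, int recentFailures) {
          this.hostname = hostname;
          this.activeInstances = activeInstances;
          this.recentFailures = recentFailures;
        }
      }

      /**
       * Hosts to pass to AMRMClient.updateBlacklist(): anywhere the role
       * already has an instance, plus anywhere whose recent failure count
       * crosses the threshold and is therefore considered too unreliable.
       */
      public static List<String> buildBlacklist(Collection<NodeRecord> nodes,
                                                int failureThreshold) {
        List<String> blacklist = new ArrayList<>();
        for (NodeRecord node : nodes) {
          if (node.activeInstances > 0 || node.recentFailures >= failureThreshold) {
            blacklist.add(node.hostname);
          }
        }
        return blacklist;
      }

      private BlacklistBuilder() {
      }
    }

The resulting list is what would go into AMRMClient.updateBlacklist() before
the next round of container requests.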

I'm not planning to do any work in that area in the near future, so if you want 
to sit down and start doing it, feel free!