[jira] [Updated] (MESOS-2919) Framework can overcommit oversubscribable resources during master failover.

2015-06-24 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-2919:
--
Story Points: 3

 Framework can overcommit oversubscribable resources during master failover.
 ---

 Key: MESOS-2919
 URL: https://issues.apache.org/jira/browse/MESOS-2919
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu
Assignee: Jie Yu
Priority: Critical
  Labels: twitter

 This is due to a bug in the hierarchical allocator. Here is the sequence of 
 events:
 1) The slave uses a fixed resource estimator which advertises 4 revocable cpus.
 2) Framework A launches a task that uses all 4 revocable cpus.
 3) The master fails over.
 4) The slave re-registers with the new master and sends an UpdateSlaveMessage 
 with 4 revocable cpus as oversubscribed resources.
 5) Framework A hasn't registered with the new master yet, so the allocator 
 computes the slave's available resources as 4 revocable cpus.
 6) Framework A registers and is offered an additional 4 revocable cpus, so 
 it can launch another task with 4 revocable cpus (that means 8 in total, on a 
 slave that only advertises 4!).
 The problem is the way the allocator calculates the 'allocation' of revocable 
 resources in 'updateSlave'. If the framework is not registered, 'addSlave' 
 skips its used resources (check the 'if' block in 'addSlave'), so the 
 'allocation' computed below is not accurate.
 {code}
 template <class RoleSorter, class FrameworkSorter>
 void
 HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::updateSlave(
     const SlaveID& slaveId,
     const Resources& oversubscribed)
 {
   CHECK(initialized);
   CHECK(slaves.contains(slaveId));
   // Check that all the oversubscribed resources are revocable.
   CHECK_EQ(oversubscribed, oversubscribed.revocable());
   // Update the total resources.
   // First remove the old oversubscribed resources from the total.
   slaves[slaveId].total -= slaves[slaveId].total.revocable();
   // Now add the new estimate of oversubscribed resources.
   slaves[slaveId].total += oversubscribed;
   // Now, update the total resources in the role sorter.
   roleSorter->update(
       slaveId,
       slaves[slaveId].total.unreserved());
   // Calculate the current allocation of oversubscribed resources.
   Resources allocation;
   foreachkey (const std::string& role, roles) {
     allocation += roleSorter->allocation(role, slaveId).revocable();
   }
   // Update the available resources.
   // First remove the old oversubscribed resources from available.
   slaves[slaveId].available -= slaves[slaveId].available.revocable();
   // Now add the new estimate of available oversubscribed resources.
   slaves[slaveId].available += oversubscribed - allocation;
   LOG(INFO) << "Slave " << slaveId << " (" << slaves[slaveId].hostname
             << ") updated with oversubscribed resources " << oversubscribed
             << " (total: " << slaves[slaveId].total
             << ", available: " << slaves[slaveId].available << ")";
   allocate(slaveId);
 }
 template <class RoleSorter, class FrameworkSorter>
 void
 HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::addSlave(
     const SlaveID& slaveId,
     const SlaveInfo& slaveInfo,
     const Resources& total,
     const hashmap<FrameworkID, Resources>& used)
 {
   CHECK(initialized);
   CHECK(!slaves.contains(slaveId));
   roleSorter->add(slaveId, total.unreserved());
   foreachpair (const FrameworkID& frameworkId,
                const Resources& allocated,
                used) {
     if (frameworks.contains(frameworkId)) {
       const std::string& role = frameworks[frameworkId].role;
       // TODO(bmahler): Validate that the reserved resources have the
       // framework's role.
       roleSorter->allocated(role, slaveId, allocated.unreserved());
       frameworkSorters[role]->add(slaveId, allocated);
       frameworkSorters[role]->allocated(
           frameworkId.value(), slaveId, allocated);
     }
   }
   ...
 }
 {code}
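
 The accounting problem can be replayed with a toy model. This is only a 
 sketch under simplifying assumptions, not Mesos code: 'ToyAllocator' and 
 'availableAfterFailover' are hypothetical names, revocable cpus are plain 
 doubles, and there is a single slave and a single framework "A". It mirrors 
 the 'if' block in 'addSlave' and the 'oversubscribed - allocation' 
 computation in 'updateSlave'.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified stand-in for the allocator state.
// Revocable cpus are modelled as doubles; only one slave exists.
struct ToyAllocator {
  std::set<std::string> frameworks;         // frameworks known to the master
  std::map<std::string, double> allocated;  // revocable cpus the allocator has recorded as used
  double available = 0.0;                   // revocable cpus that may be offered

  // Mirrors the 'if (frameworks.contains(frameworkId))' block in 'addSlave':
  // resources used by frameworks that are not registered are skipped.
  void addSlave(const std::map<std::string, double>& used) {
    for (const auto& entry : used) {
      if (frameworks.count(entry.first) > 0) {
        allocated[entry.first] += entry.second;
      }
    }
  }

  // Mirrors 'updateSlave': available = oversubscribed - allocation.
  void updateSlave(double oversubscribed) {
    double allocation = 0.0;
    for (const auto& entry : allocated) {
      allocation += entry.second;
    }
    available = oversubscribed - allocation;
  }
};

// Replays steps 3) to 5): the slave re-registers reporting that framework
// "A" uses 4 revocable cpus, then sends an estimate of 4 revocable cpus.
// Returns the revocable cpus the new master believes are available.
inline double availableAfterFailover(bool frameworkRegisteredFirst) {
  ToyAllocator allocator;
  if (frameworkRegisteredFirst) {
    allocator.frameworks.insert("A");
  }
  allocator.addSlave({{"A", 4.0}});
  allocator.updateSlave(4.0);
  return allocator.available;
}
```

 In this model, 'availableAfterFailover(false)' yields 4 available revocable 
 cpus even though all 4 are already in use, while 
 'availableAfterFailover(true)' yields 0. The difference comes only from the 
 order in which the slave and the framework register with the new master.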



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2919) Framework can overcommit oversubscribable resources during master failover.

2015-06-23 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-2919:
--
Labels: twitter  (was: )

 Framework can overcommit oversubscribable resources during master failover.
 ---

 Key: MESOS-2919
 URL: https://issues.apache.org/jira/browse/MESOS-2919
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu
Priority: Critical
  Labels: twitter



