[jira] [Commented] (MESOS-2919) Framework can overcommit oversubscribable resources during master failover.

2015-06-25, Jie Yu (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601607#comment-14601607 ]

Jie Yu commented on MESOS-2919:
---

commit 1fcb1e447ac52b8baf58dc9c88f186e3dfcaab50
Author: Jie Yu 
Date:   Wed Jun 24 11:12:21 2015 -0700

Replaced slave's 'available' with 'allocated' in hierarchical allocator.

Review: https://reviews.apache.org/r/35836

> Framework can overcommit oversubscribable resources during master failover.
> ---
>
> Key: MESOS-2919
> URL: https://issues.apache.org/jira/browse/MESOS-2919
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Critical
>  Labels: twitter
> Fix For: 0.23.0
>
>
> This is due to a bug in the hierarchical allocator. Here is the sequence of 
> events:
> 1) slave uses a fixed resource estimator which advertises 4 revocable cpus
> 2) framework A launches a task that uses all 4 revocable cpus
> 3) master fails over
> 4) slave re-registers with the new master, and sends UpdateSlaveMessage with 
> 4 revocable cpus as oversubscribed resources
> 5) framework A hasn't re-registered yet; therefore, the slave's available 
> resources will be 4 revocable cpus
> 6) framework A re-registers and will receive an additional 4 revocable cpus, 
> so it can launch another task with 4 revocable cpus (that means 8 in total!)
> The problem is due to the way we calculate the 'allocated' resources in the 
> allocator's 'updateSlave'. If a framework has not re-registered yet, the 
> 'allocation' computed below is not accurate (see the 'if' block in 
> 'addSlave').
> {code}
> template <class RoleSorter, class FrameworkSorter>
> void
> HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::updateSlave(
>     const SlaveID& slaveId,
>     const Resources& oversubscribed)
> {
>   CHECK(initialized);
>   CHECK(slaves.contains(slaveId));
>
>   // Check that all the oversubscribed resources are revocable.
>   CHECK_EQ(oversubscribed, oversubscribed.revocable());
>
>   // Update the total resources.
>   // First remove the old oversubscribed resources from the total.
>   slaves[slaveId].total -= slaves[slaveId].total.revocable();
>
>   // Now add the new estimate of oversubscribed resources.
>   slaves[slaveId].total += oversubscribed;
>
>   // Now, update the total resources in the role sorter.
>   roleSorter->update(slaveId, slaves[slaveId].total.unreserved());
>
>   // Calculate the current allocation of oversubscribed resources.
>   Resources allocation;
>   foreachkey (const std::string& role, roles) {
>     allocation += roleSorter->allocation(role, slaveId).revocable();
>   }
>
>   // Update the available resources.
>   // First remove the old oversubscribed resources from available.
>   slaves[slaveId].available -= slaves[slaveId].available.revocable();
>
>   // Now add the new estimate of available oversubscribed resources.
>   slaves[slaveId].available += oversubscribed - allocation;
>
>   LOG(INFO) << "Slave " << slaveId << " (" << slaves[slaveId].hostname
>             << ") updated with oversubscribed resources " << oversubscribed
>             << " (total: " << slaves[slaveId].total
>             << ", available: " << slaves[slaveId].available << ")";
>
>   allocate(slaveId);
> }
>
> template <class RoleSorter, class FrameworkSorter>
> void
> HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::addSlave(
>     const SlaveID& slaveId,
>     const SlaveInfo& slaveInfo,
>     const Resources& total,
>     const hashmap<FrameworkID, Resources>& used)
> {
>   CHECK(initialized);
>   CHECK(!slaves.contains(slaveId));
>
>   roleSorter->add(slaveId, total.unreserved());
>
>   foreachpair (const FrameworkID& frameworkId,
>                const Resources& allocated,
>                used) {
>     if (frameworks.contains(frameworkId)) {
>       const std::string& role = frameworks[frameworkId].role;
>
>       // TODO(bmahler): Validate that the reserved resources have the
>       // framework's role.
>       roleSorter->allocated(role, slaveId, allocated.unreserved());
>       frameworkSorters[role]->add(slaveId, allocated);
>       frameworkSorters[role]->allocated(
>           frameworkId.value(), slaveId, allocated);
>     }
>   }
>
>   ...
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2919) Framework can overcommit oversubscribable resources during master failover.

2015-06-24, Jie Yu (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599902#comment-14599902 ]

Jie Yu commented on MESOS-2919:
---

https://reviews.apache.org/r/35836/



