[ 
https://issues.apache.org/jira/browse/MESOS-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318877#comment-15318877
 ] 

Anindya Sinha edited comment on MESOS-5448 at 6/7/16 5:01 PM:
--------------------------------------------------------------

This is the proposed solution:

To address the disk space leak that occurs when rmdir does not complete 
after the checkpoint has been updated:

- When handling a `CheckpointResourcesMessage`, the agent updates its 
checkpoint only after the corresponding operations (i.e., rmdir or mkdir) 
have succeeded. If rmdir fails, or has not completed when the agent exits, 
the agent's checkpoint contains only the resources that were handled 
successfully up to that point (i.e., it does not contain the resources whose 
operations failed).
- Assuming no change in reserved resources: when the agent restarts, the 
checkpoint info that the master sends will not match the agent's checkpoint, 
and the operation (say rmdir) will be attempted again.
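
The ordering described above can be sketched roughly as follows. This is only 
an illustration in Python, not Mesos code: `apply_checkpoint`, 
`interrupted_remove`, and the representation of a checkpoint as a set of 
volume paths are all hypothetical. It also assumes one reading of the 
proposal, namely that a volume whose rmdir failed stays in the agent's 
checkpoint, so that it still differs from what the master sends.

```python
import os
import shutil
import tempfile


def apply_checkpoint(current, target, remove=shutil.rmtree, create=os.makedirs):
    """Move the checkpointed volume paths in `current` toward `target`.

    Returns the set of paths to persist as the new checkpoint. A change is
    recorded only after its filesystem operation has succeeded.
    (Hypothetical helper; the set-of-paths model is an assumption.)
    """
    applied = set(current)
    for path in sorted(current - target):    # volumes being destroyed
        try:
            remove(path)
        except OSError:
            continue                          # keep it checkpointed so it is retried
        applied.discard(path)
    for path in sorted(target - current):    # volumes being created
        try:
            create(path)
        except OSError:
            continue                          # not checkpointed until mkdir succeeds
        applied.add(path)
    return applied


def interrupted_remove(path):
    """Stand-in for an rmdir that fails or is cut short by an agent exit."""
    raise OSError("simulated agent exit before rmdir finished")


root = tempfile.mkdtemp()
volume = os.path.join(root, "volume-1")
os.makedirs(volume)

# First attempt: rmdir does not complete, so the volume stays checkpointed.
checkpoint = apply_checkpoint({volume}, set(), remove=interrupted_remove)
print(checkpoint == {volume})

# After restart the master resends the same target; the mismatch triggers a retry.
checkpoint = apply_checkpoint(checkpoint, set())
print(checkpoint == set(), os.path.exists(volume))

shutil.rmtree(root)
```

Because the checkpoint is derived from what actually succeeded, resending the 
same target after a restart is enough to drive the retry; no separate record 
of pending operations is needed in this sketch.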

To ensure that data left in a directory is not leaked to other frameworks in 
the future:

- The agent shall not add a checkpoint for a `CREATE` operation if the path 
exists and the directory is not empty. For `MOUNT` disks, the root may exist, 
but it must be empty for `CREATE` to succeed.
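
A minimal sketch of that precondition, again in Python rather than Mesos C++. 
The helper name `can_create_volume` is hypothetical, and rejecting a path that 
exists but is not a directory is an assumption that the proposal does not 
spell out.

```python
import os
import tempfile


def can_create_volume(path):
    """Return True if a CREATE may proceed at `path` (hypothetical helper)."""
    if not os.path.exists(path):
        return True                  # nothing there yet, safe to create
    if not os.path.isdir(path):
        return False                 # assumption: a non-directory is rejected
    with os.scandir(path) as entries:
        # An existing directory (e.g. a MOUNT disk root) must be empty.
        return next(entries, None) is None


with tempfile.TemporaryDirectory() as root:
    print(can_create_volume(root))                    # empty root
    open(os.path.join(root, "stale-data"), "w").close()
    print(can_create_volume(root))                    # leftover data present
```

Checking for a first entry with `os.scandir` avoids listing a large leftover 
directory in full just to learn that it is not empty.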


> Persistent volume deletion on the agent should survive slave restart
> --------------------------------------------------------------------
>
>                 Key: MESOS-5448
>                 URL: https://issues.apache.org/jira/browse/MESOS-5448
>             Project: Mesos
>          Issue Type: Bug
>          Components: general
>            Reporter: Anindya Sinha
>            Assignee: Anindya Sinha
>              Labels: persistent-volumes
>
> When the master sends a CheckpointResourcesMessage to the agent, the agent 
> attempts to rmdir the persistent volume for a DESTROY operation (if the volume 
> existed before and is no longer in the updated checkpoint in 
> CheckpointResourcesMessage).
> If the slave restarts before the operation finishes, the disk space can be 
> leaked because the rmdir is not reattempted (the checkpoint has already been 
> updated).
> Subsequently, a CREATE on the same path could leak the data to another 
> framework (since the directory was never removed), because the CREATE 
> operation succeeds even if the root directory exists and is not empty.



