Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

2019-02-18 Thread Qian Zhang


> On Feb. 13, 2019, 9:28 p.m., Andrei Budnik wrote:
> > src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp
> > Lines 300 (patched)
> > 
> >
> > Given that `state::checkpoint` is **atomic**, we can not end up in the 
> > state where the file is empty because the agent did not finish writing to 
> > it.
> > 
> > However, an empty file might occur in case of hard reboot of the 
> > agent's host. This happens because page cache is dumped every 20 seconds by 
> > default in Linux. There is a chance that the file is created, but data has 
> > not yet synced on disk.
> > 
> > As we have agreed with Gilbert, we need to ignore empty files **only** 
> > in case of orphan containers.
> 
> Qian Zhang wrote:
> > Given that state::checkpoint is atomic, we can not end up in the state 
> where the file is empty because the agent did not finish writing to it.
> > However, an empty file might occur in case of hard reboot of the 
> agent's host. This happens because page cache is dumped every 20 seconds by 
> default in Linux. There is a chance that the file is created, but data has 
> not yet synced on disk.
> 
> Agree. And I see the comment `// This could happen if the slave died 
> after opening the file for writing but before it checkpointed anything.` in a 
> couple of places in Mesos code (e.g., `slave/state.cpp`, 
> `metadata_manager.cpp`), I think those comments need to be updated as well.
> 
> > As we have agreed with Gilbert, we need to ignore empty files only in 
> case of orphan containers.
> 
> Can you please elaborate a bit? Why do we want to treat orphan containers 
> and recoverable containers differently? How will we handle the recoverable 
> containers in this case?
> 
> Andrei Budnik wrote:
> > I think those comments need to be updated as well.
> 
> We can update these comments in a separate patch later.
> 
> > Why do we want to treat orphan containers and recoverable containers 
> differently? How will we handle the recoverable containers in this case?
> 
> I think that `recoverable` containers could not have empty state files by 
> construction as checkpointing state is atomic. If this invariant does not 
> satisfy, then we definitely have a bug in our code which we need to fix.
> In the case of hard reboots, this invariant might be broken, but all 
> containers are `orphan`.
> 
> If the reasoning above looks acceptable, then we might want to recover 
> after broken invariant for `orphan` containers, while keeping this error for 
> `non-orphan` containers as an assertion (for us, developers) that the 
> invariant could not be broken in normal circumstances.

Agree, and I posted a patch for updating comments here: 
https://reviews.apache.org/r/70001/


- Qian


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
---


On Feb. 13, 2019, 4:26 p.m., Qian Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> ---
> 
> (Updated Feb. 13, 2019, 4:26 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
> https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp 
> a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>



Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

2019-02-15 Thread Andrei Budnik


> On Feb. 13, 2019, 1:28 p.m., Andrei Budnik wrote:
> > src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp
> > Lines 300 (patched)
> > 
> >
> > Given that `state::checkpoint` is **atomic**, we can not end up in the 
> > state where the file is empty because the agent did not finish writing to 
> > it.
> > 
> > However, an empty file might occur in case of hard reboot of the 
> > agent's host. This happens because page cache is dumped every 20 seconds by 
> > default in Linux. There is a chance that the file is created, but data has 
> > not yet synced on disk.
> > 
> > As we have agreed with Gilbert, we need to ignore empty files **only** 
> > in case of orphan containers.
> 
> Qian Zhang wrote:
> > Given that state::checkpoint is atomic, we can not end up in the state 
> where the file is empty because the agent did not finish writing to it.
> > However, an empty file might occur in case of hard reboot of the 
> agent's host. This happens because page cache is dumped every 20 seconds by 
> default in Linux. There is a chance that the file is created, but data has 
> not yet synced on disk.
> 
> Agree. And I see the comment `// This could happen if the slave died 
> after opening the file for writing but before it checkpointed anything.` in a 
> couple of places in Mesos code (e.g., `slave/state.cpp`, 
> `metadata_manager.cpp`), I think those comments need to be updated as well.
> 
> > As we have agreed with Gilbert, we need to ignore empty files only in 
> case of orphan containers.
> 
> Can you please elaborate a bit? Why do we want to treat orphan containers 
> and recoverable containers differently? How will we handle the recoverable 
> containers in this case?

> I think those comments need to be updated as well.

We can update these comments in a separate patch later.

> Why do we want to treat orphan containers and recoverable containers 
> differently? How will we handle the recoverable containers in this case?

I think that `recoverable` containers could not have empty state files by 
construction as checkpointing state is atomic. If this invariant does not 
satisfy, then we definitely have a bug in our code which we need to fix.
In the case of hard reboots, this invariant might be broken, but all containers 
are `orphan`.

If the reasoning above looks acceptable, then we might want to recover after 
broken invariant for `orphan` containers, while keeping this error for 
`non-orphan` containers as an assertion (for us, developers) that the invariant 
could not be broken in normal circumstances.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
---


On Feb. 13, 2019, 8:26 a.m., Qian Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> ---
> 
> (Updated Feb. 13, 2019, 8:26 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
> https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp 
> a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>



Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

2019-02-15 Thread Qian Zhang


> On Feb. 13, 2019, 9:28 p.m., Andrei Budnik wrote:
> > Thanks for the patch!
> > I think we should implement a test for this. Otherwise, it would be very 
> > dangerous to _refactor_ this part of code in the future.
> > If you have no chance to implement a test for this now, please feel free to 
> > file a ticket.

Sure, I posted a unit test here: https://reviews.apache.org/r/69994/


- Qian


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
---


On Feb. 13, 2019, 4:26 p.m., Qian Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> ---
> 
> (Updated Feb. 13, 2019, 4:26 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
> https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp 
> a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>



Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

2019-02-15 Thread Qian Zhang


> On Feb. 13, 2019, 9:28 p.m., Andrei Budnik wrote:
> > src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp
> > Lines 300 (patched)
> > 
> >
> > Given that `state::checkpoint` is **atomic**, we can not end up in the 
> > state where the file is empty because the agent did not finish writing to 
> > it.
> > 
> > However, an empty file might occur in case of hard reboot of the 
> > agent's host. This happens because page cache is dumped every 20 seconds by 
> > default in Linux. There is a chance that the file is created, but data has 
> > not yet synced on disk.
> > 
> > As we have agreed with Gilbert, we need to ignore empty files **only** 
> > in case of orphan containers.

> Given that state::checkpoint is atomic, we can not end up in the state where 
> the file is empty because the agent did not finish writing to it.
> However, an empty file might occur in case of hard reboot of the agent's 
> host. This happens because page cache is dumped every 20 seconds by default 
> in Linux. There is a chance that the file is created, but data has not yet 
> synced on disk.

Agree. And I see the comment `// This could happen if the slave died after 
opening the file for writing but before it checkpointed anything.` in a couple 
of places in Mesos code (e.g., `slave/state.cpp`, `metadata_manager.cpp`), I 
think those comments need to be updated as well.

> As we have agreed with Gilbert, we need to ignore empty files only in case of 
> orphan containers.

Can you please elaborate a bit? Why do we want to treat orphan containers and 
recoverable containers differently? How will we handle the recoverable 
containers in this case?


- Qian


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
---


On Feb. 13, 2019, 4:26 p.m., Qian Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> ---
> 
> (Updated Feb. 13, 2019, 4:26 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
> https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp 
> a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>



Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

2019-02-13 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212803
---



Patch looks great!

Reviews applied: [69972]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' 
CONFIGURATION='--verbose --disable-libtool-wrappers 
--disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; 
./support/docker-build.sh

- Mesos Reviewbot


On Feb. 13, 2019, 8:26 a.m., Qian Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> ---
> 
> (Updated Feb. 13, 2019, 8:26 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
> https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp 
> a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>



Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

2019-02-13 Thread Andrei Budnik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
---



Thanks for the patch!
I think we should implement a test for this. Otherwise, it would be very 
dangerous to _refactor_ this part of code in the future.
If you have no chance to implement a test for this now, please feel free to 
file a ticket.


src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp
Lines 300 (patched)


Given that `state::checkpoint` is **atomic**, we can not end up in the 
state where the file is empty because the agent did not finish writing to it.

However, an empty file might occur in case of hard reboot of the agent's 
host. This happens because page cache is dumped every 20 seconds by default in 
Linux. There is a chance that the file is created, but data has not yet synced 
on disk.

As we have agreed with Gilbert, we need to ignore empty files **only** in 
case of orphan containers.


- Andrei Budnik


On Feb. 13, 2019, 8:26 a.m., Qian Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> ---
> 
> (Updated Feb. 13, 2019, 8:26 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
> https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp 
> a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>



Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

2019-02-13 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212791
---



PASS: Mesos patch 69972 was successfully built and tested.

Reviews applied: `['69972']`

All the build artifacts available at: 
http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2884/mesos-review-69972

- Mesos Reviewbot Windows


On Feb. 13, 2019, 8:26 a.m., Qian Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> ---
> 
> (Updated Feb. 13, 2019, 8:26 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
> https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp 
> a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>