Re: Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-06 Thread Kevin Klues

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/
---

(Updated Dec. 6, 2016, 7:46 p.m.)


Review request for mesos and Jie Yu.


Changes
---

Updated based on all but 1 of Jie's comments.


Bugs: MESOS-6688
https://issues.apache.org/jira/browse/MESOS-6688


Repository: mesos


Description
---

Added implementation of `recover()` to the IOSwitchboard isolator.


Diffs (updated)
-

  src/slave/containerizer/mesos/io/switchboard.hpp 
839665a22aca9b1c1c1cf4992406bc924ee2b065 
  src/slave/containerizer/mesos/io/switchboard.cpp 
19f2b625f2aa4b790fbe80b8dfad44b219f2c24e 

Diff: https://reviews.apache.org/r/54355/diff/


Testing
---

GTEST_FILTER="" make -j check
sudo src/mesos-tests

Test added in follow-on patch.


Thanks,

Kevin Klues



Re: Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-06 Thread Kevin Klues


> On Dec. 6, 2016, 5:31 a.m., Jie Yu wrote:
> > src/slave/containerizer/mesos/io/switchboard.cpp, line 161
> > 
> >
> > or agent crashes right before we create the directory and fork the 
> > server.

Would we actually have an active container state in this case though? If the 
agent crashed at that point, then you would never have gotten the cahnce to 
even try and launch the container itself.


- Kevin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/#review158121
---


On Dec. 6, 2016, 4:55 a.m., Kevin Klues wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54355/
> ---
> 
> (Updated Dec. 6, 2016, 4:55 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6688
> https://issues.apache.org/jira/browse/MESOS-6688
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added implementation of `recover()` to the IOSwitchboard isolator.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/io/switchboard.hpp 
> 839665a22aca9b1c1c1cf4992406bc924ee2b065 
>   src/slave/containerizer/mesos/io/switchboard.cpp 
> 19f2b625f2aa4b790fbe80b8dfad44b219f2c24e 
> 
> Diff: https://reviews.apache.org/r/54355/diff/
> 
> 
> Testing
> ---
> 
> GTEST_FILTER="" make -j check
> sudo src/mesos-tests
> 
> Test added in follow-on patch.
> 
> 
> Thanks,
> 
> Kevin Klues
> 
>



Re: Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-05 Thread Jie Yu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/#review158121
---


Fix it, then Ship it!





src/slave/containerizer/mesos/io/switchboard.cpp (line 161)


or agent crashes right before we create the directory and fork the server.



src/slave/containerizer/mesos/io/switchboard.cpp (line 173)


Failed to get I/O switchboard server pid for ...



src/slave/containerizer/mesos/io/switchboard.cpp (line 175)


Pid file does not exist



src/slave/containerizer/mesos/io/switchboard.cpp (lines 334 - 348)


I would probably move this down right before we fork the server so that if 
anything below fails, we don't need to do anything in recovery.


- Jie Yu


On Dec. 6, 2016, 4:55 a.m., Kevin Klues wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54355/
> ---
> 
> (Updated Dec. 6, 2016, 4:55 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6688
> https://issues.apache.org/jira/browse/MESOS-6688
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added implementation of `recover()` to the IOSwitchboard isolator.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/io/switchboard.hpp 
> 839665a22aca9b1c1c1cf4992406bc924ee2b065 
>   src/slave/containerizer/mesos/io/switchboard.cpp 
> 19f2b625f2aa4b790fbe80b8dfad44b219f2c24e 
> 
> Diff: https://reviews.apache.org/r/54355/diff/
> 
> 
> Testing
> ---
> 
> GTEST_FILTER="" make -j check
> sudo src/mesos-tests
> 
> Test added in follow-on patch.
> 
> 
> Thanks,
> 
> Kevin Klues
> 
>



Re: Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-05 Thread Kevin Klues

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/
---

(Updated Dec. 6, 2016, 4:55 a.m.)


Review request for mesos and Jie Yu.


Changes
---

Updated based on Jie's comments.


Bugs: MESOS-6688
https://issues.apache.org/jira/browse/MESOS-6688


Repository: mesos


Description
---

Added implementation of `recover()` to the IOSwitchboard isolator.


Diffs (updated)
-

  src/slave/containerizer/mesos/io/switchboard.hpp 
839665a22aca9b1c1c1cf4992406bc924ee2b065 
  src/slave/containerizer/mesos/io/switchboard.cpp 
19f2b625f2aa4b790fbe80b8dfad44b219f2c24e 

Diff: https://reviews.apache.org/r/54355/diff/


Testing
---

GTEST_FILTER="" make -j check
sudo src/mesos-tests

Test added in follow-on patch.


Thanks,

Kevin Klues



Re: Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-05 Thread Jie Yu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/#review158097
---




src/slave/containerizer/mesos/io/switchboard.cpp (lines 147 - 148)


Update this comemnts here?



src/slave/containerizer/mesos/io/switchboard.cpp (lines 206 - 210)


Do you need to check the existence of the io_switchboard directory as well. 
It's likely the container we want to recover here does not have io_switchboard 
directory (do not use io switchboard server).



src/slave/containerizer/mesos/io/switchboard.cpp (lines 217 - 220)


I don't think we want to fail the agent recovery if the previous agent run 
crashed at an unfortunate time.

Also, if a container is in active container list (recoverable), that means 
the io switchboard pid must exist with proper value. Otherwise, it should not 
be part of the recoverable list. I'll probably return a Failure for that case.

Also, I would do the recovering of active containers first. Then, deal with 
orphans.



src/slave/containerizer/mesos/io/switchboard.cpp (lines 648 - 650)


`cleanup` might be called for legacy containers that do not have io 
switchboard server. For those containers, we probably don't want to create an 
Info for them.

Therefore, we should probably ignore unknown container here. Please add a 
comment here about why we ignore unknown container here.


- Jie Yu


On Dec. 6, 2016, 12:21 a.m., Kevin Klues wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54355/
> ---
> 
> (Updated Dec. 6, 2016, 12:21 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6688
> https://issues.apache.org/jira/browse/MESOS-6688
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added implementation of `recover()` to the IOSwitchboard isolator.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/io/switchboard.hpp 
> 839665a22aca9b1c1c1cf4992406bc924ee2b065 
>   src/slave/containerizer/mesos/io/switchboard.cpp 
> 19f2b625f2aa4b790fbe80b8dfad44b219f2c24e 
> 
> Diff: https://reviews.apache.org/r/54355/diff/
> 
> 
> Testing
> ---
> 
> GTEST_FILTER="" make -j check
> sudo src/mesos-tests
> 
> Test added in follow-on patch.
> 
> 
> Thanks,
> 
> Kevin Klues
> 
>



Re: Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-05 Thread Kevin Klues

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/
---

(Updated Dec. 6, 2016, 12:21 a.m.)


Review request for mesos and Jie Yu.


Changes
---

Updated based on comments from Jie.


Bugs: MESOS-6688
https://issues.apache.org/jira/browse/MESOS-6688


Repository: mesos


Description
---

Added implementation of `recover()` to the IOSwitchboard isolator.


Diffs (updated)
-

  src/slave/containerizer/mesos/io/switchboard.hpp 
839665a22aca9b1c1c1cf4992406bc924ee2b065 
  src/slave/containerizer/mesos/io/switchboard.cpp 
19f2b625f2aa4b790fbe80b8dfad44b219f2c24e 

Diff: https://reviews.apache.org/r/54355/diff/


Testing
---

GTEST_FILTER="" make -j check
sudo src/mesos-tests

Test added in follow-on patch.


Thanks,

Kevin Klues



Re: Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-05 Thread Jie Yu


> On Dec. 5, 2016, 9:11 p.m., Jie Yu wrote:
> > src/slave/containerizer/mesos/io/switchboard.cpp, lines 537-547
> > 
> >
> > It's likely that the io switchboard server has been forked, but the 
> > agent crashes before it was able to checkpoint the pid.
> > 
> > If that happens, during recovery, we will not maintain Info for that 
> > container. As a result, we won't try to cleanup the socket file potentially 
> > created?
> > 
> > I think we probably need to createa directory for io switchboard 
> > related files (sock and pid files). When we create the directory, it 
> > indicates that the io switchboard server might or might not be created. 
> > During recovery, if we find the directory exists, but pid file does not 
> > exist, we should still create the Info with pid set to None(), and cleanup 
> > the socket file in 'cleanup' method.
> > 
> > Thoughts?
> 
> Kevin Klues wrote:
> That seems reasonable, what should we call the directory? As below?
> ```
> io_switchboard
> |-pid
> -socket
> ```
> 
> Kevin Klues wrote:
> That said, note that the io-switchboard itself creates the sock file, so 
> the only time this would ever happen is if an agent restarted between 
> successfully launching the io-switchboard and checkpointing its pid.

yeah, the layout sounds good to me.


- Jie


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/#review158038
---


On Dec. 5, 2016, 9:46 a.m., Kevin Klues wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54355/
> ---
> 
> (Updated Dec. 5, 2016, 9:46 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6688
> https://issues.apache.org/jira/browse/MESOS-6688
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added implementation of `recover()` to the IOSwitchboard isolator.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/io/switchboard.hpp 
> 839665a22aca9b1c1c1cf4992406bc924ee2b065 
>   src/slave/containerizer/mesos/io/switchboard.cpp 
> d5211b98616e72a27ca6b472a5ee83505c227f22 
> 
> Diff: https://reviews.apache.org/r/54355/diff/
> 
> 
> Testing
> ---
> 
> GTEST_FILTER="" make -j check
> sudo src/mesos-tests
> 
> Test added in follow-on patch.
> 
> 
> Thanks,
> 
> Kevin Klues
> 
>



Re: Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-05 Thread Kevin Klues


> On Dec. 5, 2016, 9:11 p.m., Jie Yu wrote:
> > src/slave/containerizer/mesos/io/switchboard.cpp, lines 537-547
> > 
> >
> > It's likely that the io switchboard server has been forked, but the 
> > agent crashes before it was able to checkpoint the pid.
> > 
> > If that happens, during recovery, we will not maintain Info for that 
> > container. As a result, we won't try to cleanup the socket file potentially 
> > created?
> > 
> > I think we probably need to createa directory for io switchboard 
> > related files (sock and pid files). When we create the directory, it 
> > indicates that the io switchboard server might or might not be created. 
> > During recovery, if we find the directory exists, but pid file does not 
> > exist, we should still create the Info with pid set to None(), and cleanup 
> > the socket file in 'cleanup' method.
> > 
> > Thoughts?
> 
> Kevin Klues wrote:
> That seems reasonable, what should we call the directory? As below?
> ```
> io_switchboard
> |-pid
> -socket
> ```

That said, note that the io-switchboard itself creates the sock file, so the 
only time this would ever happen is if an agent restarted between successfully 
launching the io-switchboard and checkpointing its pid.


- Kevin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/#review158038
---


On Dec. 5, 2016, 9:46 a.m., Kevin Klues wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54355/
> ---
> 
> (Updated Dec. 5, 2016, 9:46 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6688
> https://issues.apache.org/jira/browse/MESOS-6688
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added implementation of `recover()` to the IOSwitchboard isolator.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/io/switchboard.hpp 
> 839665a22aca9b1c1c1cf4992406bc924ee2b065 
>   src/slave/containerizer/mesos/io/switchboard.cpp 
> d5211b98616e72a27ca6b472a5ee83505c227f22 
> 
> Diff: https://reviews.apache.org/r/54355/diff/
> 
> 
> Testing
> ---
> 
> GTEST_FILTER="" make -j check
> sudo src/mesos-tests
> 
> Test added in follow-on patch.
> 
> 
> Thanks,
> 
> Kevin Klues
> 
>



Re: Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-05 Thread Kevin Klues


> On Dec. 5, 2016, 9:11 p.m., Jie Yu wrote:
> > src/slave/containerizer/mesos/io/switchboard.cpp, lines 167-171
> > 
> >
> > I think this flag should apply to containers that are about to be 
> > launched next.
> > 
> > So let's still recover Info for those containers that have io 
> > switchboard server even if this flag has been turned off after agent 
> > restart.
> > 
> > What do you think?

I agree. I removed this check, and added a NOTE about this in the comments 
below.


> On Dec. 5, 2016, 9:11 p.m., Jie Yu wrote:
> > src/slave/containerizer/mesos/io/switchboard.cpp, lines 537-547
> > 
> >
> > It's likely that the io switchboard server has been forked, but the 
> > agent crashes before it was able to checkpoint the pid.
> > 
> > If that happens, during recovery, we will not maintain Info for that 
> > container. As a result, we won't try to cleanup the socket file potentially 
> > created?
> > 
> > I think we probably need to createa directory for io switchboard 
> > related files (sock and pid files). When we create the directory, it 
> > indicates that the io switchboard server might or might not be created. 
> > During recovery, if we find the directory exists, but pid file does not 
> > exist, we should still create the Info with pid set to None(), and cleanup 
> > the socket file in 'cleanup' method.
> > 
> > Thoughts?

That seems reasonable, what should we call the directory? As below?
```
io_switchboard
|-pid
-socket
```


- Kevin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/#review158038
---


On Dec. 5, 2016, 9:46 a.m., Kevin Klues wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54355/
> ---
> 
> (Updated Dec. 5, 2016, 9:46 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6688
> https://issues.apache.org/jira/browse/MESOS-6688
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added implementation of `recover()` to the IOSwitchboard isolator.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/io/switchboard.hpp 
> 839665a22aca9b1c1c1cf4992406bc924ee2b065 
>   src/slave/containerizer/mesos/io/switchboard.cpp 
> d5211b98616e72a27ca6b472a5ee83505c227f22 
> 
> Diff: https://reviews.apache.org/r/54355/diff/
> 
> 
> Testing
> ---
> 
> GTEST_FILTER="" make -j check
> sudo src/mesos-tests
> 
> Test added in follow-on patch.
> 
> 
> Thanks,
> 
> Kevin Klues
> 
>



Re: Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-05 Thread Jie Yu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/#review158038
---




src/slave/containerizer/mesos/io/switchboard.cpp (line 153)


Can we also log the error case? Like:
```
if (pid.isError()) {
  LOG(ERROR) << ...
} else if (pid.None()) {
  // Add a comment about when this will happen.
  // E.g., container that do not enable io switchboard server.
} else {
  ...
}
```



src/slave/containerizer/mesos/io/switchboard.cpp (lines 154 - 162)


orphans will be killed by containerizer. So we don't have to kill the io 
switchboard here. Just recover the pid and rely on 'cleanup' to clean it up.



src/slave/containerizer/mesos/io/switchboard.cpp (lines 167 - 171)


I think this flag should apply to containers that are about to be launched 
next.

So let's still recover Info for those containers that have io switchboard 
server even if this flag has been turned off after agent restart.

What do you think?



src/slave/containerizer/mesos/io/switchboard.cpp (lines 536 - 546)


It's likely that the io switchboard server has been forked, but the agent 
crashes before it was able to checkpoint the pid.

If that happens, during recovery, we will not maintain Info for that 
container. As a result, we won't try to cleanup the socket file potentially 
created?

I think we probably need to createa directory for io switchboard related 
files (sock and pid files). When we create the directory, it indicates that the 
io switchboard server might or might not be created. During recovery, if we 
find the directory exists, but pid file does not exist, we should still create 
the Info with pid set to None(), and cleanup the socket file in 'cleanup' 
method.

Thoughts?


- Jie Yu


On Dec. 5, 2016, 9:46 a.m., Kevin Klues wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54355/
> ---
> 
> (Updated Dec. 5, 2016, 9:46 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6688
> https://issues.apache.org/jira/browse/MESOS-6688
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added implementation of `recover()` to the IOSwitchboard isolator.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/io/switchboard.hpp 
> 839665a22aca9b1c1c1cf4992406bc924ee2b065 
>   src/slave/containerizer/mesos/io/switchboard.cpp 
> d5211b98616e72a27ca6b472a5ee83505c227f22 
> 
> Diff: https://reviews.apache.org/r/54355/diff/
> 
> 
> Testing
> ---
> 
> GTEST_FILTER="" make -j check
> sudo src/mesos-tests
> 
> Test added in follow-on patch.
> 
> 
> Thanks,
> 
> Kevin Klues
> 
>



Re: Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-05 Thread Kevin Klues

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/
---

(Updated Dec. 5, 2016, 9:46 a.m.)


Review request for mesos and Jie Yu.


Changes
---

Updated to properly recover orphans info before killing them.


Bugs: MESOS-6688
https://issues.apache.org/jira/browse/MESOS-6688


Repository: mesos


Description
---

Added implementation of `recover()` to the IOSwitchboard isolator.


Diffs (updated)
-

  src/slave/containerizer/mesos/io/switchboard.hpp 
839665a22aca9b1c1c1cf4992406bc924ee2b065 
  src/slave/containerizer/mesos/io/switchboard.cpp 
d5211b98616e72a27ca6b472a5ee83505c227f22 

Diff: https://reviews.apache.org/r/54355/diff/


Testing
---

GTEST_FILTER="" make -j check
sudo src/mesos-tests

Test added in follow-on patch.


Thanks,

Kevin Klues



Re: Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-04 Thread Kevin Klues

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/
---

(Updated Dec. 5, 2016, 7:11 a.m.)


Review request for mesos and Jie Yu.


Bugs: MESOS-6688
https://issues.apache.org/jira/browse/MESOS-6688


Repository: mesos


Description
---

Added implementation of `recover()` to the IOSwitchboard isolator.


Diffs
-

  src/slave/containerizer/mesos/io/switchboard.hpp 
839665a22aca9b1c1c1cf4992406bc924ee2b065 
  src/slave/containerizer/mesos/io/switchboard.cpp 
d5211b98616e72a27ca6b472a5ee83505c227f22 

Diff: https://reviews.apache.org/r/54355/diff/


Testing (updated)
---

GTEST_FILTER="" make -j check
sudo src/mesos-tests

Test added in follow-on patch.


Thanks,

Kevin Klues



Re: Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-04 Thread Kevin Klues

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/
---

(Updated Dec. 5, 2016, 7:10 a.m.)


Review request for mesos and Jie Yu.


Changes
---

Rebased on master.


Bugs: MESOS-6688
https://issues.apache.org/jira/browse/MESOS-6688


Repository: mesos


Description
---

Added implementation of `recover()` to the IOSwitchboard isolator.


Diffs (updated)
-

  src/slave/containerizer/mesos/io/switchboard.hpp 
839665a22aca9b1c1c1cf4992406bc924ee2b065 
  src/slave/containerizer/mesos/io/switchboard.cpp 
d5211b98616e72a27ca6b472a5ee83505c227f22 

Diff: https://reviews.apache.org/r/54355/diff/


Testing
---

GTEST_FILTER="" make -j check
sudo src/mesos-tests

Unfortunately, I don't have any new tests for this yet. That will follow in a 
subsequent commit.


Thanks,

Kevin Klues



Review Request 54355: Added implementation of `recover()` to the IOSwitchboard isolator.

2016-12-04 Thread Kevin Klues

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54355/
---

Review request for mesos and Jie Yu.


Bugs: MESOS-6688
https://issues.apache.org/jira/browse/MESOS-6688


Repository: mesos


Description
---

Added implementation of `recover()` to the IOSwitchboard isolator.


Diffs
-

  src/slave/containerizer/mesos/io/switchboard.hpp 
23c66bb2b05310a12fec24a5e5eebddd370ba0f0 
  src/slave/containerizer/mesos/io/switchboard.cpp 
778367a268ec350ed438bc9fe9d359d63bdb5503 

Diff: https://reviews.apache.org/r/54355/diff/


Testing
---

GTEST_FILTER="" make -j check
sudo src/mesos-tests

Unfortunately, I don't have any new tests for this yet. That will follow in a 
subsequent commit.


Thanks,

Kevin Klues