Re: Review Request 32543: Documented problem and solution with slave recovery and systemd settings.

2015-07-05 Thread Benjamin Hindman

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32543/#review90433
---

Ship it!


I updated this to be specific to using 'posix' isolation mechanisms and 
committed. I also killed the reference in upgrades.md as asked by AdamB, which 
I agree will probably be weird to maintain there.

Note that there are still other systemd-related issues when using 'cgroups' 
isolation mechanisms that need to get worked out, but hopefully this will be 
helpful for folks who are using systemd with just 'posix' isolation.

- Benjamin Hindman


On March 27, 2015, 2:09 p.m., Joerg Schad wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/32543/
> ---
> 
> (Updated March 27, 2015, 2:09 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov and Brenden Matthews.
> 
> 
> Bugs: Mesos-2555
> https://issues.apache.org/jira/browse/Mesos-2555
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Documented the problem and solution encountered in MESOS-2419.
> 
> 
> Diffs
> ---
> 
>   docs/slave-recovery.md 4bb4a71c6945bd70121743a1e9209a26906773c1 
>   docs/upgrades.md 2a15694607c079ad95ef6cf7f1490872ab9a5976 
> 
> Diff: https://reviews.apache.org/r/32543/diff/
> 
> 
> Testing
> ---
> 
> markdown check
> 
> 
> Thanks,
> 
> Joerg Schad
> 
>



Re: Review Request 32543: Documented problem and solution with slave recovery and systemd settings.

2015-07-05 Thread Benjamin Hindman


> On March 27, 2015, 9:17 a.m., Adam B wrote:
> > docs/slave-recovery.md, line 71
> > 
> >
> > (If the slave does not come back, each executorDriver shuts itself down 
> > after $MESOS_RECOVERY_TIMEOUT.)
> > 
> > Important question: If an executor is killed, does this systemd mode 
> > affect whether its tasks would get killed?
> 
> Alexander Rukletsov wrote:
> Adam, could you please explain what use case you have in mind and how 
> it is related to slave recovery?
> 
> Adam B wrote:
> It's not related to slave recovery necessarily, but to how this KillMode 
> impacts other processes like a custom executor. Some frameworks (like HDFS) 
> have a custom executor that launches task(s) as a separate 
> process/subprocess. If the executor is killed (kill -9, or shut down by the 
> framework/admin), will this change in KillMode affect whether the executor's 
> task subprocesses also get killed?
> I'm mostly worried about this KillMode change suddenly leaving stranded 
> task processes if/when executors are killed.
> 
> Alexander Rukletsov wrote:
> I thought that's exactly why we have containerizers: to clean up all 
> stranded processes.
> 
> Adam B wrote:
> Fair enough, when the slave is running. But what if the executor is 
> killed while the slave (thus also the containerizer) is shut down/recovering?
> I'm not claiming there's anything necessarily wrong with using this 
> KillMode. I just ask the question to make sure we don't recommend a setting 
> that may fix one issue but cause others.
> 
> Alexander Rukletsov wrote:
> I see your point. I would be surprised if this setting caused the 
> issue, but let's check: better safe than sorry.

The KillMode is only relevant when stopping the "root" process of a systemd 
unit (e.g., via 'systemctl stop'). When another process within the same cgroup 
dies, systemd doesn't do anything about it; the normal Linux/init reaping takes 
place. Thus, the suggestion documented in this review is correct. HOWEVER, it 
only applies when using 'posix' isolation, because with 'cgroups' isolation 
the processes are in another cgroup. I updated the documentation accordingly 
before committing.
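
For anyone who wants to sanity-check their own setup, here is a rough sketch 
of the two conditions above: which KillMode a unit file sets, and whether the 
slave runs with only 'posix' isolators. The unit text, the binary path, and 
the 'KillMode=process' value are illustrative assumptions, not copied from the 
committed documentation, and the helper names are made up for this example.

```python
import configparser

# Hypothetical unit file. The binary path and the KillMode value are
# assumptions for illustration; check the committed docs for the real advice.
UNIT_TEXT = """\
[Unit]
Description=Mesos slave (illustrative)

[Service]
ExecStart=/usr/sbin/mesos-slave --isolation=posix/cpu,posix/mem
KillMode=process
"""

# systemd falls back to this value when a unit does not set KillMode.
SYSTEMD_DEFAULT_KILL_MODE = "control-group"


def kill_mode(unit_text):
    """Return the KillMode set in [Service], or systemd's default."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(unit_text)
    return parser.get("Service", "KillMode", fallback=SYSTEMD_DEFAULT_KILL_MODE)


def uses_only_posix_isolation(isolation):
    """True if every isolator in an --isolation value is a posix/* one."""
    isolators = [name.strip() for name in isolation.split(",") if name.strip()]
    return bool(isolators) and all(name.startswith("posix/") for name in isolators)


if __name__ == "__main__":
    print("KillMode:", kill_mode(UNIT_TEXT))
    print("posix only:", uses_only_posix_isolation("posix/cpu,posix/mem"))
```

This only reads configuration; it says nothing about what systemd actually 
does at stop time, so treat it as a checklist aid rather than a test of the 
behavior itself.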


- Benjamin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32543/#review78025
---





Re: Review Request 32543: Documented problem and solution with slave recovery and systemd settings.

2015-06-17 Thread Niklas Nielsen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32543/#review88284
---

Ship it!



docs/slave-recovery.md (line 66)


Should this maybe go under a new subtitle such as 'Known issues' or something 
similar?



docs/upgrades.md (line 15)


Looks like the double space is still there:
s/  / /


- Niklas Nielsen

