Re: Review Request 40177: Re-checkpoint frameworks after agent recovery.

2015-11-18 Thread James Peach


> On Nov. 11, 2015, 9:28 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp, lines 4244-4247
> > 
> >
> > why do it here instead of in recoverFramework() #4363? that feels more 
> > consistent with #1345.
> 
> James Peach wrote:
> I did this after recovery because the original code did not write 
> framework checkpoints if the slave was in RECOVERING state. I did not see a 
> reason for that, but decided to preserve the behavior as much as possible 
> just in case.
> 
> Vinod Kone wrote:
> originally, the slave didn't checkpoint framework during recovery stage 
> because it was not needed. if it is creating a framework object during 
> recovery, it is because it read the checkpointed data. so no need to 
> checkpoint again. 
> 
> but due to the compatibility issue you found, the slave can re-checkpoint 
> framework info during recovery because the framework info is *updated*. so i 
> would recommend moving this down to #1345 and do re-checkpoint if necessary.

Fixed and re-tested. We now upgrade the checkpoint during recovery but only if 
we need to.


- James


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/#review106142
---


On Nov. 12, 2015, 5:41 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40177/
> ---
> 
> (Updated Nov. 12, 2015, 5:41 a.m.)
> 
> 
> Review request for mesos, Kapil Arya and Vinod Kone.
> 
> 
> Bugs: MESOS-3834
> https://issues.apache.org/jira/browse/MESOS-3834
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When performing an upgrade cycle, it is possible for a 0.24 and
> later agent to recover from a framework checkpoint written by 0.22
> or earlier. In this case, we need to compatibly accept a missing
> FrameworkID, and then rewrite the framework checkpoint so that
> subsequent upgrades don't hit the same problem.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
>   src/slave/slave.cpp ac2d0e0153721a66495cd6539b25f5b3cee9d979 
> 
> Diff: https://reviews.apache.org/r/40177/diff/
> 
> 
> Testing
> ---
> 
> make check on CentOS 6.7.
> Manual testing with a rolling upgrade from 0.22
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 40177: Re-checkpoint frameworks after agent recovery.

2015-11-18 Thread James Peach

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/
---

(Updated Nov. 18, 2015, 6:49 p.m.)


Review request for mesos, Kapil Arya and Vinod Kone.


Bugs: MESOS-3834
https://issues.apache.org/jira/browse/MESOS-3834


Repository: mesos


Description
---

When performing an upgrade cycle, it is possible for a 0.24 and
later agent to recover from a framework checkpoint written by 0.22
or earlier. In this case, we need to compatibly accept a missing
FrameworkID, and then rewrite the framework checkpoint so that
subsequent upgrades don't hit the same problem.


Diffs (updated)
-

  src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
  src/slave/slave.cpp d1126f00d947fdb4823b0c495335b743254ac7ee 

Diff: https://reviews.apache.org/r/40177/diff/


Testing
---

make check on CentOS 6.7.
Manual testing with a rolling upgrade from 0.22


Thanks,

James Peach



Re: Review Request 40177: Re-checkpoint frameworks after agent recovery.

2015-11-18 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/#review107070
---



src/slave/slave.cpp (line 4321)


pull this down to #4344

s/upgradeCheckpoint/recheckpoint/



src/slave/slave.cpp (line 4344)


How about:

// In this case, we update `FrameworkInfo.framework_id`  from the directory 
name and re-checkpoint it.



src/slave/slave.cpp (line 4365)


is it possibel for framework->info.checkpoint() to be false if we are here? 
if not, this should be a CHECK instead.



src/slave/slave.cpp (line 4366)


kill this log line.


- Vinod Kone


On Nov. 18, 2015, 6:49 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40177/
> ---
> 
> (Updated Nov. 18, 2015, 6:49 p.m.)
> 
> 
> Review request for mesos, Kapil Arya and Vinod Kone.
> 
> 
> Bugs: MESOS-3834
> https://issues.apache.org/jira/browse/MESOS-3834
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When performing an upgrade cycle, it is possible for a 0.24 and
> later agent to recover from a framework checkpoint written by 0.22
> or earlier. In this case, we need to compatibly accept a missing
> FrameworkID, and then rewrite the framework checkpoint so that
> subsequent upgrades don't hit the same problem.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
>   src/slave/slave.cpp d1126f00d947fdb4823b0c495335b743254ac7ee 
> 
> Diff: https://reviews.apache.org/r/40177/diff/
> 
> 
> Testing
> ---
> 
> make check on CentOS 6.7.
> Manual testing with a rolling upgrade from 0.22
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 40177: Re-checkpoint frameworks after agent recovery.

2015-11-18 Thread James Peach

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/
---

(Updated Nov. 18, 2015, 9:56 p.m.)


Review request for mesos, Kapil Arya and Vinod Kone.


Bugs: MESOS-3834
https://issues.apache.org/jira/browse/MESOS-3834


Repository: mesos


Description
---

When performing an upgrade cycle, it is possible for a 0.24 and
later agent to recover from a framework checkpoint written by 0.22
or earlier. In this case, we need to compatibly accept a missing
FrameworkID, and then rewrite the framework checkpoint so that
subsequent upgrades don't hit the same problem.


Diffs (updated)
-

  src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
  src/slave/slave.cpp d1126f00d947fdb4823b0c495335b743254ac7ee 

Diff: https://reviews.apache.org/r/40177/diff/


Testing
---

make check on CentOS 6.7.
Manual testing with a rolling upgrade from 0.22


Thanks,

James Peach



Re: Review Request 40177: Re-checkpoint frameworks after agent recovery.

2015-11-18 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/#review107087
---


Patch looks great!

Reviews applied: [40177]

Passed command: export OS=ubuntu:14.04;export CONFIGURATION="--verbose";export 
COMPILER=gcc; ./support/docker_build.sh

- Mesos ReviewBot


On Nov. 18, 2015, 6:49 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40177/
> ---
> 
> (Updated Nov. 18, 2015, 6:49 p.m.)
> 
> 
> Review request for mesos, Kapil Arya and Vinod Kone.
> 
> 
> Bugs: MESOS-3834
> https://issues.apache.org/jira/browse/MESOS-3834
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When performing an upgrade cycle, it is possible for a 0.24 and
> later agent to recover from a framework checkpoint written by 0.22
> or earlier. In this case, we need to compatibly accept a missing
> FrameworkID, and then rewrite the framework checkpoint so that
> subsequent upgrades don't hit the same problem.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
>   src/slave/slave.cpp d1126f00d947fdb4823b0c495335b743254ac7ee 
> 
> Diff: https://reviews.apache.org/r/40177/diff/
> 
> 
> Testing
> ---
> 
> make check on CentOS 6.7.
> Manual testing with a rolling upgrade from 0.22
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 40177: Re-checkpoint frameworks after agent recovery.

2015-11-17 Thread Vinod Kone


> On Nov. 11, 2015, 9:28 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp, lines 4244-4247
> > 
> >
> > why do it here instead of in recoverFramework() #4363? that feels more 
> > consistent with #1345.
> 
> James Peach wrote:
> I did this after recovery because the original code did not write 
> framework checkpoints if the slave was in RECOVERING state. I did not see a 
> reason for that, but decided to preserve the behavior as much as possible 
> just in case.

originally, the slave didn't checkpoint framework during recovery stage because 
it was not needed. if it is creating a framework object during recovery, it is 
because it read the checkpointed data. so no need to checkpoint again. 

but due to the compatibility issue you found, the slave can re-checkpoint 
framework info during recovery because the framework info is *updated*. so i 
would recommend moving this down to #1345 and do re-checkpoint if necessary.


- Vinod


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/#review106142
---


On Nov. 12, 2015, 5:41 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40177/
> ---
> 
> (Updated Nov. 12, 2015, 5:41 a.m.)
> 
> 
> Review request for mesos, Kapil Arya and Vinod Kone.
> 
> 
> Bugs: MESOS-3834
> https://issues.apache.org/jira/browse/MESOS-3834
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When performing an upgrade cycle, it is possible for a 0.24 and
> later agent to recover from a framework checkpoint written by 0.22
> or earlier. In this case, we need to compatibly accept a missing
> FrameworkID, and then rewrite the framework checkpoint so that
> subsequent upgrades don't hit the same problem.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
>   src/slave/slave.cpp ac2d0e0153721a66495cd6539b25f5b3cee9d979 
> 
> Diff: https://reviews.apache.org/r/40177/diff/
> 
> 
> Testing
> ---
> 
> make check on CentOS 6.7.
> Manual testing with a rolling upgrade from 0.22
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 40177: Re-checkpoint frameworks after agent recovery.

2015-11-11 Thread Guangya Liu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/#review106061
---

Ship it!


Ship It!

- Guangya Liu


On 十一月 11, 2015, 5:59 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40177/
> ---
> 
> (Updated 十一月 11, 2015, 5:59 a.m.)
> 
> 
> Review request for mesos, Kapil Arya and Vinod Kone.
> 
> 
> Bugs: MESOS-3834
> https://issues.apache.org/jira/browse/MESOS-3834
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When performing an upgrade cycle, it is possible for a 0.24 and
> later agent to recover from a framework checkpoint written by 0.22
> or earlier. In this case, we need to compatibly accept a missing
> FrameworkID, and then rewrite the framework checkpoint so that
> subsequent upgrades don't hit the same problem.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
>   src/slave/slave.cpp ac2d0e0153721a66495cd6539b25f5b3cee9d979 
> 
> Diff: https://reviews.apache.org/r/40177/diff/
> 
> 
> Testing
> ---
> 
> make check on CentOS 6.7.
> Manual testing with a rolling upgrade from 0.22
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 40177: Re-checkpoint frameworks after agent recovery.

2015-11-11 Thread James Peach

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/
---

(Updated Nov. 12, 2015, 5:41 a.m.)


Review request for mesos, Kapil Arya and Vinod Kone.


Bugs: MESOS-3834
https://issues.apache.org/jira/browse/MESOS-3834


Repository: mesos


Description
---

When performing an upgrade cycle, it is possible for a 0.24 and
later agent to recover from a framework checkpoint written by 0.22
or earlier. In this case, we need to compatibly accept a missing
FrameworkID, and then rewrite the framework checkpoint so that
subsequent upgrades don't hit the same problem.


Diffs (updated)
-

  src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
  src/slave/slave.cpp ac2d0e0153721a66495cd6539b25f5b3cee9d979 

Diff: https://reviews.apache.org/r/40177/diff/


Testing
---

make check on CentOS 6.7.
Manual testing with a rolling upgrade from 0.22


Thanks,

James Peach



Re: Review Request 40177: Re-checkpoint frameworks after agent recovery.

2015-11-11 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/#review106200
---


Patch looks great!

Reviews applied: [40177]

All tests passed.

- Mesos ReviewBot


On Nov. 12, 2015, 5:41 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40177/
> ---
> 
> (Updated Nov. 12, 2015, 5:41 a.m.)
> 
> 
> Review request for mesos, Kapil Arya and Vinod Kone.
> 
> 
> Bugs: MESOS-3834
> https://issues.apache.org/jira/browse/MESOS-3834
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When performing an upgrade cycle, it is possible for a 0.24 and
> later agent to recover from a framework checkpoint written by 0.22
> or earlier. In this case, we need to compatibly accept a missing
> FrameworkID, and then rewrite the framework checkpoint so that
> subsequent upgrades don't hit the same problem.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
>   src/slave/slave.cpp ac2d0e0153721a66495cd6539b25f5b3cee9d979 
> 
> Diff: https://reviews.apache.org/r/40177/diff/
> 
> 
> Testing
> ---
> 
> make check on CentOS 6.7.
> Manual testing with a rolling upgrade from 0.22
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 40177: Re-checkpoint frameworks after agent recovery.

2015-11-11 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/#review106142
---



src/slave/slave.cpp (lines 4244 - 4247)


why do it here instead of in recoverFramework() #4363? that feels more 
consistent with #1345.



src/slave/slave.cpp (lines 4928 - 4929)


put these on the same line on #4927 

```
completedExecutors(MAX_COMPLETED_EXECUTORS_PER_FRAMEWORK) {}
```



src/slave/slave.cpp (line 4934)


This should be the responsibility of the caller, i.e., the caller should 
call this only when checkpointing is enabled.


- Vinod Kone


On Nov. 11, 2015, 5:59 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40177/
> ---
> 
> (Updated Nov. 11, 2015, 5:59 a.m.)
> 
> 
> Review request for mesos, Kapil Arya and Vinod Kone.
> 
> 
> Bugs: MESOS-3834
> https://issues.apache.org/jira/browse/MESOS-3834
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When performing an upgrade cycle, it is possible for a 0.24 and
> later agent to recover from a framework checkpoint written by 0.22
> or earlier. In this case, we need to compatibly accept a missing
> FrameworkID, and then rewrite the framework checkpoint so that
> subsequent upgrades don't hit the same problem.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
>   src/slave/slave.cpp ac2d0e0153721a66495cd6539b25f5b3cee9d979 
> 
> Diff: https://reviews.apache.org/r/40177/diff/
> 
> 
> Testing
> ---
> 
> make check on CentOS 6.7.
> Manual testing with a rolling upgrade from 0.22
> 
> 
> Thanks,
> 
> James Peach
> 
>



Review Request 40177: Re-checkpoint frameworks after agent recovery.

2015-11-10 Thread James Peach

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/
---

Review request for mesos, Kapil Arya and Vinod Kone.


Bugs: MESOS-3834
https://issues.apache.org/jira/browse/MESOS-3834


Repository: mesos


Description
---

When performing an upgrade cycle, it is possible for a 0.24 and
later agent to recover from a framework checkpoint written by 0.22
or earlier. In this case, we need to compatibly accept a missing
FrameworkID, and then rewrite the framework checkpoint so that
subsequent upgrades don't hit the same problem.


Diffs
-

  src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
  src/slave/slave.cpp ac2d0e0153721a66495cd6539b25f5b3cee9d979 

Diff: https://reviews.apache.org/r/40177/diff/


Testing
---

make check on CentOS 6.7.
Manual testing with a rolling upgrade from 0.22


Thanks,

James Peach



Re: Review Request 40177: Re-checkpoint frameworks after agent recovery.

2015-11-10 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/#review106018
---


Patch looks great!

Reviews applied: [40177]

All tests passed.

- Mesos ReviewBot


On Nov. 11, 2015, 5:59 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40177/
> ---
> 
> (Updated Nov. 11, 2015, 5:59 a.m.)
> 
> 
> Review request for mesos, Kapil Arya and Vinod Kone.
> 
> 
> Bugs: MESOS-3834
> https://issues.apache.org/jira/browse/MESOS-3834
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When performing an upgrade cycle, it is possible for a 0.24 and
> later agent to recover from a framework checkpoint written by 0.22
> or earlier. In this case, we need to compatibly accept a missing
> FrameworkID, and then rewrite the framework checkpoint so that
> subsequent upgrades don't hit the same problem.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
>   src/slave/slave.cpp ac2d0e0153721a66495cd6539b25f5b3cee9d979 
> 
> Diff: https://reviews.apache.org/r/40177/diff/
> 
> 
> Testing
> ---
> 
> make check on CentOS 6.7.
> Manual testing with a rolling upgrade from 0.22
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 40177: Re-checkpoint frameworks after agent recovery.

2015-11-10 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/#review106020
---


Patch looks great!

Reviews applied: [40177]

All tests passed.

- Mesos ReviewBot


On Nov. 11, 2015, 5:59 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40177/
> ---
> 
> (Updated Nov. 11, 2015, 5:59 a.m.)
> 
> 
> Review request for mesos, Kapil Arya and Vinod Kone.
> 
> 
> Bugs: MESOS-3834
> https://issues.apache.org/jira/browse/MESOS-3834
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When performing an upgrade cycle, it is possible for a 0.24 and
> later agent to recover from a framework checkpoint written by 0.22
> or earlier. In this case, we need to compatibly accept a missing
> FrameworkID, and then rewrite the framework checkpoint so that
> subsequent upgrades don't hit the same problem.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
>   src/slave/slave.cpp ac2d0e0153721a66495cd6539b25f5b3cee9d979 
> 
> Diff: https://reviews.apache.org/r/40177/diff/
> 
> 
> Testing
> ---
> 
> make check on CentOS 6.7.
> Manual testing with a rolling upgrade from 0.22
> 
> 
> Thanks,
> 
> James Peach
> 
>