Re: Review Request 38003: MESOS-3351 (duplicated slave id in master after master failover)

2015-09-17 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38003/#review99394
---


Patch looks great!

Reviews applied: [38003]

All tests passed.

- Mesos ReviewBot


On Sept. 17, 2015, 3:07 p.m., Klaus Ma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38003/
> ---
> 
> (Updated Sept. 17, 2015, 3:07 p.m.)
> 
> 
> Review request for mesos, Ben Mahler, Jie Yu, and Vinod Kone.
> 
> 
> Bugs: MESOS-3351
> https://issues.apache.org/jira/browse/MESOS-3351
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> __Phenomenon:__
> In some race condition, the slave was shutdown when after master failover.
> 
> __Root Cause:__
> The slave was shutdown because of duplicated SlavID: in master, the SlaveID 
> is genereated by masterInfo.id + "-S" + nextSlaveId; when master failover, 
> nextSlaveId was reset to 0 and masterInfo.id (generated by date + ip + port + 
> pid) maybe un-changed which lead to duplicated SlaveID. 
> 
> __Solution/Fix:__
> Generate masterInfo.id by UUID instead of "date + ip + port + pid".
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 1c4e7af 
>   src/tests/master_tests.cpp 06d74c3 
> 
> Diff: https://reviews.apache.org/r/38003/diff/
> 
> 
> Testing
> ---
> 
> make
> make check
> 
> 
> Thanks,
> 
> Klaus Ma
> 
>



Re: Review Request 38003: MESOS-3351 (duplicated slave id in master after master failover)

2015-09-17 Thread Klaus Ma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38003/
---

(Updated Sept. 17, 2015, 3:07 p.m.)


Review request for mesos, Ben Mahler, Jie Yu, and Vinod Kone.


Changes
---

Address comments


Bugs: MESOS-3351
https://issues.apache.org/jira/browse/MESOS-3351


Repository: mesos


Description
---

__Phenomenon:__
In some race condition, the slave was shutdown when after master failover.

__Root Cause:__
The slave was shutdown because of duplicated SlavID: in master, the SlaveID is 
genereated by masterInfo.id + "-S" + nextSlaveId; when master failover, 
nextSlaveId was reset to 0 and masterInfo.id (generated by date + ip + port + 
pid) maybe un-changed which lead to duplicated SlaveID. 

__Solution/Fix:__
Generate masterInfo.id by UUID instead of "date + ip + port + pid".


Diffs (updated)
-

  src/master/master.cpp 1c4e7af 
  src/tests/master_tests.cpp 06d74c3 

Diff: https://reviews.apache.org/r/38003/diff/


Testing
---

make
make check


Thanks,

Klaus Ma



Re: Review Request 38003: MESOS-3351 (duplicated slave id in master after master failover)

2015-09-17 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38003/#review99442
---

Ship it!


feel free to resolve the open issues.


src/tests/master_tests.cpp (line 3696)


you don't need this. just do

StartSlave();



src/tests/master_tests.cpp (line 3731)


why the change in formatting? we allow 80 char comments now?


- Vinod Kone


On Sept. 17, 2015, 3:07 p.m., Klaus Ma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38003/
> ---
> 
> (Updated Sept. 17, 2015, 3:07 p.m.)
> 
> 
> Review request for mesos, Ben Mahler, Jie Yu, and Vinod Kone.
> 
> 
> Bugs: MESOS-3351
> https://issues.apache.org/jira/browse/MESOS-3351
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> __Phenomenon:__
> In some race condition, the slave was shutdown when after master failover.
> 
> __Root Cause:__
> The slave was shutdown because of duplicated SlavID: in master, the SlaveID 
> is genereated by masterInfo.id + "-S" + nextSlaveId; when master failover, 
> nextSlaveId was reset to 0 and masterInfo.id (generated by date + ip + port + 
> pid) maybe un-changed which lead to duplicated SlaveID. 
> 
> __Solution/Fix:__
> Generate masterInfo.id by UUID instead of "date + ip + port + pid".
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 1c4e7af 
>   src/tests/master_tests.cpp 06d74c3 
> 
> Diff: https://reviews.apache.org/r/38003/diff/
> 
> 
> Testing
> ---
> 
> make
> make check
> 
> 
> Thanks,
> 
> Klaus Ma
> 
>



Re: Review Request 38003: MESOS-3351 (duplicated slave id in master after master failover)

2015-09-17 Thread Klaus Ma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38003/
---

(Updated Sept. 18, 2015, 12:13 a.m.)


Review request for mesos, Ben Mahler, Jie Yu, and Vinod Kone.


Changes
---

Address comments


Bugs: MESOS-3351
https://issues.apache.org/jira/browse/MESOS-3351


Repository: mesos


Description
---

__Phenomenon:__
In some race condition, the slave was shutdown when after master failover.

__Root Cause:__
The slave was shutdown because of duplicated SlavID: in master, the SlaveID is 
genereated by masterInfo.id + "-S" + nextSlaveId; when master failover, 
nextSlaveId was reset to 0 and masterInfo.id (generated by date + ip + port + 
pid) maybe un-changed which lead to duplicated SlaveID. 

__Solution/Fix:__
Generate masterInfo.id by UUID instead of "date + ip + port + pid".


Diffs (updated)
-

  src/master/master.cpp 1c4e7af 
  src/tests/master_tests.cpp 06d74c3 

Diff: https://reviews.apache.org/r/38003/diff/


Testing
---

make
make check


Thanks,

Klaus Ma



Re: Review Request 38003: MESOS-3351 (duplicated slave id in master after master failover)

2015-09-17 Thread Klaus Ma


> On Sept. 17, 2015, 9:13 p.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp, line 3733
> > 
> >
> > why the change in formatting? we allow 80 char comments now?

do you mean the following comments? Unfortunately, its length is 82 if in one 
line.

// If both the slaves get the same SlaveID, the re-registration would
// fail here.


- Klaus


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38003/#review99442
---


On Sept. 18, 2015, 12:13 a.m., Klaus Ma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38003/
> ---
> 
> (Updated Sept. 18, 2015, 12:13 a.m.)
> 
> 
> Review request for mesos, Ben Mahler, Jie Yu, and Vinod Kone.
> 
> 
> Bugs: MESOS-3351
> https://issues.apache.org/jira/browse/MESOS-3351
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> __Phenomenon:__
> In some race condition, the slave was shutdown when after master failover.
> 
> __Root Cause:__
> The slave was shutdown because of duplicated SlavID: in master, the SlaveID 
> is genereated by masterInfo.id + "-S" + nextSlaveId; when master failover, 
> nextSlaveId was reset to 0 and masterInfo.id (generated by date + ip + port + 
> pid) maybe un-changed which lead to duplicated SlaveID. 
> 
> __Solution/Fix:__
> Generate masterInfo.id by UUID instead of "date + ip + port + pid".
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 1c4e7af 
>   src/tests/master_tests.cpp 06d74c3 
> 
> Diff: https://reviews.apache.org/r/38003/diff/
> 
> 
> Testing
> ---
> 
> make
> make check
> 
> 
> Thanks,
> 
> Klaus Ma
> 
>



Re: Review Request 38003: MESOS-3351 (duplicated slave id in master after master failover)

2015-09-16 Thread Klaus Ma


> On Sept. 15, 2015, 10:32 p.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp, line 3637
> > 
> >
> > Does this test reliably fail (i.e., every time) without the code change 
> > in master.cpp?

Nop; the repro rate is about 90% (9 in 10 times). The root cause is master host 
re-used port; but if master did not re-use port, this issue will not trigger. 
For example, I can not reproduce this issue in Ubuntu 14.04 by default setting; 
but it's easy repro in OS X.


> On Sept. 15, 2015, 10:32 p.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp, line 3636
> > 
> >
> > Also add a CHECK_NE() check with both the slave ids?

No sure whether it's necessary; if duplicated slave ids in master, master will 
ask the second slave (with the same id) to shutdown; in this case, it will 
failed when waiting for re-register message.


> On Sept. 15, 2015, 10:32 p.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp, lines 3607-3608
> > 
> >
> > Why specify a mock executor and test containerizer? There's a 
> > StartSlave() overload that takes just the detector (and optionally flags), 
> > which you can use?

Yes, it's only for detector; let me try to use StartSlave with detector only.


- Klaus


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38003/#review99112
---


On Sept. 14, 2015, 6:08 p.m., Klaus Ma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38003/
> ---
> 
> (Updated Sept. 14, 2015, 6:08 p.m.)
> 
> 
> Review request for mesos, Ben Mahler, Jie Yu, and Vinod Kone.
> 
> 
> Bugs: MESOS-3351
> https://issues.apache.org/jira/browse/MESOS-3351
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> __Phenomenon:__
> In some race condition, the slave was shutdown when after master failover.
> 
> __Root Cause:__
> The slave was shutdown because of duplicated SlavID: in master, the SlaveID 
> is genereated by masterInfo.id + "-S" + nextSlaveId; when master failover, 
> nextSlaveId was reset to 0 and masterInfo.id (generated by date + ip + port + 
> pid) maybe un-changed which lead to duplicated SlaveID. 
> 
> __Solution/Fix:__
> Generate masterInfo.id by UUID instead of "date + ip + port + pid".
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 5589eca 
>   src/tests/master_tests.cpp 8a6b98b 
> 
> Diff: https://reviews.apache.org/r/38003/diff/
> 
> 
> Testing
> ---
> 
> make
> make check
> 
> 
> Thanks,
> 
> Klaus Ma
> 
>



Re: Review Request 38003: MESOS-3351 (duplicated slave id in master after master failover)

2015-09-15 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38003/#review99112
---



src/tests/master_tests.cpp (line 3596)


new line.



src/tests/master_tests.cpp (lines 3597 - 3598)


// This test ensures that a slave gets a unique SlaveID even after
// master fails over. Please refer to MESOS-3351 for further details.



src/tests/master_tests.cpp (lines 3607 - 3608)


Why specify a mock executor and test containerizer? There's a StartSlave() 
overload that takes just the detector (and optionally flags), which you can use?



src/tests/master_tests.cpp (line 3622)


// Start a new slave and make sure it registers before the old slave.



src/tests/master_tests.cpp (line 3630)


// Now let the first slave re-register.



src/tests/master_tests.cpp (lines 3633 - 3634)


// If both the slaves get the same SlaveID, the re-registration would fail 
here.



src/tests/master_tests.cpp (line 3636)


Also add a CHECK_NE() check with both the slave ids?



src/tests/master_tests.cpp (line 3637)


Does this test reliably fail (i.e., every time) without the code change in 
master.cpp?


- Vinod Kone


On Sept. 14, 2015, 6:08 p.m., Klaus Ma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38003/
> ---
> 
> (Updated Sept. 14, 2015, 6:08 p.m.)
> 
> 
> Review request for mesos, Ben Mahler, Jie Yu, and Vinod Kone.
> 
> 
> Bugs: MESOS-3351
> https://issues.apache.org/jira/browse/MESOS-3351
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> __Phenomenon:__
> In some race condition, the slave was shutdown when after master failover.
> 
> __Root Cause:__
> The slave was shutdown because of duplicated SlavID: in master, the SlaveID 
> is genereated by masterInfo.id + "-S" + nextSlaveId; when master failover, 
> nextSlaveId was reset to 0 and masterInfo.id (generated by date + ip + port + 
> pid) maybe un-changed which lead to duplicated SlaveID. 
> 
> __Solution/Fix:__
> Generate masterInfo.id by UUID instead of "date + ip + port + pid".
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 5589eca 
>   src/tests/master_tests.cpp 8a6b98b 
> 
> Diff: https://reviews.apache.org/r/38003/diff/
> 
> 
> Testing
> ---
> 
> make
> make check
> 
> 
> Thanks,
> 
> Klaus Ma
> 
>



Re: Review Request 38003: MESOS-3351 (duplicated slave id in master after master failover)

2015-09-04 Thread Klaus Ma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38003/
---

(Updated Sept. 5, 2015, 2:46 a.m.)


Review request for mesos and Vinod Kone.


Changes
---

Address Vinod's comments


Summary (updated)
-

MESOS-3351 (duplicated slave id in master after master failover)


Bugs: MESOS-3351
https://issues.apache.org/jira/browse/MESOS-3351


Repository: mesos


Description (updated)
---

__Phenomenon:__
In some race condition, the slave was shutdown when after master failover.

__Root Cause:__
The slave was shutdown because of duplicated SlavID: in master, the SlaveID is 
genereated by masterInfo.id + "-S" + nextSlaveId; when master failover, 
nextSlaveId was reset to 0 and masterInfo.id (generated by date + ip + port + 
pid) maybe un-changed which lead to duplicated SlaveID. 

__Solution/Fix:__
Generate masterInfo.id by UUID instead of "date + ip + port + pid".


Diffs (updated)
-

  src/master/master.cpp 5589eca 
  src/tests/master_tests.cpp 8a6b98b 

Diff: https://reviews.apache.org/r/38003/diff/


Testing
---

make
make check


Thanks,

Klaus Ma



Re: Review Request 38003: MESOS-3351 (duplicated slave id in master after master failover)

2015-09-04 Thread Klaus Ma


> On Sept. 4, 2015, 7:35 p.m., Vinod Kone wrote:
> >
> 
> Vinod Kone wrote:
> Also, please make the summary and description more meaningful than just 
> the ticket ID.

Yes, both summary & description are updated for this fix


> On Sept. 4, 2015, 7:35 p.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 306-317
> > 
> >
> > Just do this.
> > 
> > ```
> > 
> > // Master ID is generated randomly based on UUID.
> > info_.set_id(UUID::random().toString());
> > 
> > ```

addressed


> On Sept. 4, 2015, 7:35 p.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp, line 3607
> > 
> >
> > All our comments are expected to be proper sentences, i.e., start with 
> > a capital letter and end with period. Please fix here and everywhere.

addressed


> On Sept. 4, 2015, 7:35 p.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp, lines 3638-3666
> > 
> >
> > Why do you need to launch a scheduler and task for this test?
> > 
> > I think you can simplify this test by not launching them.

Agree, scheduler & tasks are not necessary, both of them are removed.


- Klaus


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38003/#review97785
---


On Sept. 5, 2015, 2:46 a.m., Klaus Ma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38003/
> ---
> 
> (Updated Sept. 5, 2015, 2:46 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-3351
> https://issues.apache.org/jira/browse/MESOS-3351
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> __Phenomenon:__
> In some race condition, the slave was shutdown when after master failover.
> 
> __Root Cause:__
> The slave was shutdown because of duplicated SlavID: in master, the SlaveID 
> is genereated by masterInfo.id + "-S" + nextSlaveId; when master failover, 
> nextSlaveId was reset to 0 and masterInfo.id (generated by date + ip + port + 
> pid) maybe un-changed which lead to duplicated SlaveID. 
> 
> __Solution/Fix:__
> Generate masterInfo.id by UUID instead of "date + ip + port + pid".
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 5589eca 
>   src/tests/master_tests.cpp 8a6b98b 
> 
> Diff: https://reviews.apache.org/r/38003/diff/
> 
> 
> Testing
> ---
> 
> make
> make check
> 
> 
> Thanks,
> 
> Klaus Ma
> 
>