Re: Review Request 69082: Correctly propagated `close` failures in some instances.

2019-01-24 Thread Chun-Hung Hsiao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69082/#review212309
---


Fix it, then Ship it!




Thanks for putting this up and also thank BenM for a thorough review!


3rdparty/stout/include/stout/os/posix/mktemp.hpp
Line 39 (original), 39 (patched)


I agree with BenM. `string` seems not very readable and shadows 
`std::string`, which I don't like.

How about:
```
s/temp/_path/
s/string/temp/
```


- Chun-Hung Hsiao


On Jan. 24, 2019, 10:30 a.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69082/
> ---
> 
> (Updated Jan. 24, 2019, 10:30 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Chun-Hung Hsiao, and Alex Clemmer.
> 
> 
> Bugs: MESOS-9331
> https://issues.apache.org/jira/browse/MESOS-9331
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When e.g., writing to disk, errors from write might only manifest at
> ::close time. We update some instances to correctly propagate these
> errors instead of dropping them silently. We only propagate the
> ::close error if the write operation succeeded, otherwise we just
> propagate the error from the write operation.
> 
> 
> Diffs
> -
> 
>   3rdparty/stout/include/stout/os/posix/mktemp.hpp 
> 63b3d1a7720d07f877fa1d4eb7f32a548916637a 
>   3rdparty/stout/include/stout/os/write.hpp 
> f7538f94f5a953a7a90a05bc1d2f138b6c17f814 
>   3rdparty/stout/include/stout/protobuf.hpp 
> eb4adef56f1701e3c101284e05e4e6c66eef9180 
> 
> 
> Diff: https://reviews.apache.org/r/69082/diff/4/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 69781: Fixed flakiness of resource provider ContainerTerminationMetric test.

2019-01-24 Thread Chun-Hung Hsiao


> On Jan. 24, 2019, 3:41 p.m., Chun-Hung Hsiao wrote:
> > src/tests/storage_local_resource_provider_tests.cpp
> > Lines 3388 (patched)
> > 
> >
> > IIUC the async loop you mentioned is the one trying to detect the 
> > creation of the domain socket, which is asynchronously called in 
> > `launchContainer`, so waiting for just this function to be dispatched might 
> > not be sufficient before pausing the clock again. How about waiting for 
> > `waitContainer` (to ensure that the post-start hook and thus the async loop 
> > has been finished) after `launchContainer` and before the pause?
> > 
> > Alternatively, since this test doesn't actually care what happens after 
> > `launchContainer`, is it needed to resume the clock at all?
> 
> Chun-Hung Hsiao wrote:
> I believe it's because of some timers we used in, e.g., reaper or cgroup 
> to destroy the container. How about:
> ```
> AWAIT_ASSERT_EQ(0, pluginKilled);
> 
> // Resume the clock so the plugin container can be properly destroyed.
> Clock::resume();
> 
> AWAIT_READY(pluginRestarted);
> 
> Clock::pause();
> ```
> 
> Feel free to drop this. Gave you a ship-it.

Actually, let's just don't pause the clock again.


- Chun-Hung


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69781/#review212275
---


On Jan. 18, 2019, 11:45 a.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69781/
> ---
> 
> (Updated Jan. 18, 2019, 11:45 a.m.)
> 
> 
> Review request for mesos and Chun-Hung Hsiao.
> 
> 
> Bugs: MESOS-9130
> https://issues.apache.org/jira/browse/MESOS-9130
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> While we addressed one source of flakiness of this test in
> `2117f671c450d7c74edc53078e8c0ed6035020aa` we introduced a new source of
> flakiness (agents send more than the expected number of
> `UpdateSlaveMessage`s since they failed a timeout in registration with
> the master).
> 
> This patch ensures that the agent registers successfully before
> continuing with the test.
> 
> 
> Diffs
> -
> 
>   src/tests/storage_local_resource_provider_tests.cpp 
> 164e93a3749d4d668e12de31641aecece384 
> 
> 
> Diff: https://reviews.apache.org/r/69781/diff/2/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 69781: Fixed flakiness of resource provider ContainerTerminationMetric test.

2019-01-24 Thread Chun-Hung Hsiao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69781/#review212307
---


Ship it!




Ship It!

- Chun-Hung Hsiao


On Jan. 18, 2019, 11:45 a.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69781/
> ---
> 
> (Updated Jan. 18, 2019, 11:45 a.m.)
> 
> 
> Review request for mesos and Chun-Hung Hsiao.
> 
> 
> Bugs: MESOS-9130
> https://issues.apache.org/jira/browse/MESOS-9130
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> While we addressed one source of flakiness of this test in
> `2117f671c450d7c74edc53078e8c0ed6035020aa` we introduced a new source of
> flakiness (agents send more than the expected number of
> `UpdateSlaveMessage`s since they failed a timeout in registration with
> the master).
> 
> This patch ensures that the agent registers successfully before
> continuing with the test.
> 
> 
> Diffs
> -
> 
>   src/tests/storage_local_resource_provider_tests.cpp 
> 164e93a3749d4d668e12de31641aecece384 
> 
> 
> Diff: https://reviews.apache.org/r/69781/diff/2/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 69781: Fixed flakiness of resource provider ContainerTerminationMetric test.

2019-01-24 Thread Chun-Hung Hsiao


> On Jan. 24, 2019, 3:41 p.m., Chun-Hung Hsiao wrote:
> > src/tests/storage_local_resource_provider_tests.cpp
> > Lines 3388 (patched)
> > 
> >
> > IIUC the async loop you mentioned is the one trying to detect the 
> > creation of the domain socket, which is asynchronously called in 
> > `launchContainer`, so waiting for just this function to be dispatched might 
> > not be sufficient before pausing the clock again. How about waiting for 
> > `waitContainer` (to ensure that the post-start hook and thus the async loop 
> > has been finished) after `launchContainer` and before the pause?
> > 
> > Alternatively, since this test doesn't actually care what happens after 
> > `launchContainer`, is it needed to resume the clock at all?

I believe it's because of some timers we used in, e.g., reaper or cgroup to 
destroy the container. How about:
```
AWAIT_ASSERT_EQ(0, pluginKilled);

// Resume the clock so the plugin container can be properly destroyed.
Clock::resume();

AWAIT_READY(pluginRestarted);

Clock::pause();
```

Feel free to drop this. Gave you a ship-it.


- Chun-Hung


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69781/#review212275
---


On Jan. 18, 2019, 11:45 a.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69781/
> ---
> 
> (Updated Jan. 18, 2019, 11:45 a.m.)
> 
> 
> Review request for mesos and Chun-Hung Hsiao.
> 
> 
> Bugs: MESOS-9130
> https://issues.apache.org/jira/browse/MESOS-9130
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> While we addressed one source of flakiness of this test in
> `2117f671c450d7c74edc53078e8c0ed6035020aa` we introduced a new source of
> flakiness (agents send more than the expected number of
> `UpdateSlaveMessage`s since they failed a timeout in registration with
> the master).
> 
> This patch ensures that the agent registers successfully before
> continuing with the test.
> 
> 
> Diffs
> -
> 
>   src/tests/storage_local_resource_provider_tests.cpp 
> 164e93a3749d4d668e12de31641aecece384 
> 
> 
> Diff: https://reviews.apache.org/r/69781/diff/2/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 69815: Added a unit test for RPC retry in SLRP.

2019-01-24 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69815/#review212305
---



FAIL: Failed to apply the dependent review: 69811.

Failed command: `python.exe .\support\apply-reviews.py -n -r 69811`

All the build artifacts available at: 
http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2817/mesos-review-69815

Relevant logs:

- 
[apply-review-69811.log](http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2817/mesos-review-69815/logs/apply-review-69811.log):

```
error: patch failed: src/resource_provider/storage/provider.cpp:2017
error: src/resource_provider/storage/provider.cpp: patch does not apply
```

- Mesos Reviewbot Windows


On Jan. 24, 2019, 7:29 p.m., Chun-Hung Hsiao wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69815/
> ---
> 
> (Updated Jan. 24, 2019, 7:29 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9517
> https://issues.apache.org/jira/browse/MESOS-9517
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch adds a unit test to verify that SLRP will retry
> `CreateVolume` and `DeleteVolume` CSI calls with a random exponential
> backoff upon receiving `DEADLINE_EXCEEDED` or `UNAVAILABLE`.
> 
> 
> Diffs
> -
> 
>   src/tests/storage_local_resource_provider_tests.cpp 
> e8ed20f818ed7f1a3ce15758ea3c366520443377 
> 
> 
> Diff: https://reviews.apache.org/r/69815/diff/4/
> 
> 
> Testing
> ---
> 
> make check
> 
> TODO: Will make the test more robust and add testing for `DESTROY_DISK`.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>



Re: Review Request 69815: Added a unit test for RPC retry in SLRP.

2019-01-24 Thread Chun-Hung Hsiao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69815/
---

(Updated Jan. 25, 2019, 3:29 a.m.)


Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Changes
---

Added assertions for RPC metrics.


Bugs: MESOS-9517
https://issues.apache.org/jira/browse/MESOS-9517


Repository: mesos


Description
---

This patch adds a unit test to verify that SLRP will retry
`CreateVolume` and `DeleteVolume` CSI calls with a random exponential
backoff upon receiving `DEADLINE_EXCEEDED` or `UNAVAILABLE`.


Diffs (updated)
-

  src/tests/storage_local_resource_provider_tests.cpp 
e8ed20f818ed7f1a3ce15758ea3c366520443377 


Diff: https://reviews.apache.org/r/69815/diff/4/

Changes: https://reviews.apache.org/r/69815/diff/3-4/


Testing
---

make check

TODO: Will make the test more robust and add testing for `DESTROY_DISK`.


Thanks,

Chun-Hung Hsiao



Re: Review Request 69815: Added a unit test for RPC retry in SLRP.

2019-01-24 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69815/#review212300
---



FAIL: Failed to apply the dependent review: 69811.

Failed command: `python.exe .\support\apply-reviews.py -n -r 69811`

All the build artifacts available at: 
http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2816/mesos-review-69815

Relevant logs:

- 
[apply-review-69811.log](http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2816/mesos-review-69815/logs/apply-review-69811.log):

```
error: patch failed: src/resource_provider/storage/provider.cpp:2017
error: src/resource_provider/storage/provider.cpp: patch does not apply
```

- Mesos Reviewbot Windows


On Jan. 25, 2019, 12:36 a.m., Chun-Hung Hsiao wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69815/
> ---
> 
> (Updated Jan. 25, 2019, 12:36 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9517
> https://issues.apache.org/jira/browse/MESOS-9517
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch adds a unit test to verify that SLRP will retry
> `CreateVolume` and `DeleteVolume` CSI calls with a random exponential
> backoff upon receiving `DEADLINE_EXCEEDED` or `UNAVAILABLE`.
> 
> 
> Diffs
> -
> 
>   src/tests/storage_local_resource_provider_tests.cpp 
> e8ed20f818ed7f1a3ce15758ea3c366520443377 
> 
> 
> Diff: https://reviews.apache.org/r/69815/diff/3/
> 
> 
> Testing
> ---
> 
> make check
> 
> TODO: Will make the test more robust and add testing for `DESTROY_DISK`.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>



Re: Review Request 69815: Added a unit test for RPC retry in SLRP.

2019-01-24 Thread Chun-Hung Hsiao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69815/
---

(Updated Jan. 25, 2019, 12:36 a.m.)


Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Changes
---

Rebased.


Bugs: MESOS-9517
https://issues.apache.org/jira/browse/MESOS-9517


Repository: mesos


Description
---

This patch adds a unit test to verify that SLRP will retry
`CreateVolume` and `DeleteVolume` CSI calls with a random exponential
backoff upon receiving `DEADLINE_EXCEEDED` or `UNAVAILABLE`.


Diffs (updated)
-

  src/tests/storage_local_resource_provider_tests.cpp 
e8ed20f818ed7f1a3ce15758ea3c366520443377 


Diff: https://reviews.apache.org/r/69815/diff/3/

Changes: https://reviews.apache.org/r/69815/diff/2-3/


Testing
---

make check

TODO: Will make the test more robust and add testing for `DESTROY_DISK`.


Thanks,

Chun-Hung Hsiao



Re: Review Request 69787: Added a forwarding mode to the test CSI plugin.

2019-01-24 Thread Chun-Hung Hsiao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69787/
---

(Updated Jan. 25, 2019, 12:35 a.m.)


Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Changes
---

Addressed Benjamin's comments and rename `proxy` to `forward` for a more 
intuitive naming.


Summary (updated)
-

Added a forwarding mode to the test CSI plugin.


Bugs: MESOS-9517
https://issues.apache.org/jira/browse/MESOS-9517


Repository: mesos


Description (updated)
---

If the `--forward` flag is set, the test CSI plugin would forward all gRPC
requests to the specified gRPC server URI, and return the response to
the caller. This can be used to forward all CSI calls to a mock CSI
plugin object in the test process.


Diffs (updated)
-

  src/examples/test_csi_plugin.cpp af183037b280bab65578a4c447196a9ccf261e32 


Diff: https://reviews.apache.org/r/69787/diff/3/

Changes: https://reviews.apache.org/r/69787/diff/2-3/


Testing
---

Manually tweaked the plugin to forward requests to itself and all tests pass.


Thanks,

Chun-Hung Hsiao



Re: Review Request 69825: Added test to ensure backwards compatibility of resources checkpointing.

2019-01-24 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69825/#review212298
---



PASS: Mesos patch 69825 was successfully built and tested.

Reviews applied: `['69790', '69791', '69792', '69793', '69794', '69795', 
'69825']`

All the build artifacts available at: 
http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2815/mesos-review-69825

- Mesos Reviewbot Windows


On Jan. 24, 2019, 10:39 p.m., Gastón Kleiman wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69825/
> ---
> 
> (Updated Jan. 24, 2019, 10:39 p.m.)
> 
> 
> Review request for mesos, Benno Evers and Greg Mann.
> 
> 
> Bugs: MESOS-9356
> https://issues.apache.org/jira/browse/MESOS-9356
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added test to ensure backwards compatibility of resources checkpointing.
> 
> 
> Diffs
> -
> 
>   src/tests/reservation_tests.cpp 3f33885099c3377c6477475ae43dcaa88d920547 
> 
> 
> Diff: https://reviews.apache.org/r/69825/diff/2/
> 
> 
> Testing
> ---
> 
> The test passed 10,000 consecutive runs.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>



Re: Review Request 69787: Added a proxy mode to the test CSI plugin.

2019-01-24 Thread Chun-Hung Hsiao


> On Jan. 18, 2019, 11:03 a.m., Benjamin Bannier wrote:
> > src/examples/test_csi_plugin.cpp
> > Lines 930-931 (patched)
> > 
> >
> > Just accept values here? Move construction of `unique_ptr` is cheap and 
> > possibly elided.

Created the service and the completion queue in `CSIProxy` for proper shutdown 
instead so this is no longer an issue.


> On Jan. 18, 2019, 11:03 a.m., Benjamin Bannier wrote:
> > src/examples/test_csi_plugin.cpp
> > Lines 978 (patched)
> > 
> >
> > Let's use a smart pointer to make clear that we pass ownership below, 
> > e.g.,
> > 
> > std::unique_ptr first(new Call);
> > 
> > // Pass ownership below.
> > ... first.release() ...
> > 
> > (Btw, we pull `unique_ptr` into the current scope with a using decl, 
> > but still spell it out in full in most places).

As discussed, this would lead to an undefined behavior in C++ because of the 
undetermined evaluation order for function arguments. Dropping this. Instead, 
I'll add more comments to explain the ownership transition.


- Chun-Hung


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69787/#review212136
---


On Jan. 23, 2019, 7:14 a.m., Chun-Hung Hsiao wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69787/
> ---
> 
> (Updated Jan. 23, 2019, 7:14 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9517
> https://issues.apache.org/jira/browse/MESOS-9517
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If the `--proxy` flag is set, the test CSI plugin would forward all gRPC
> requests to the specified gRPC server URI, and return the response to
> the caller. This can be used to forward all CSI calls to a mock CSI
> plugin object in the test process.
> 
> 
> Diffs
> -
> 
>   src/examples/test_csi_plugin.cpp af183037b280bab65578a4c447196a9ccf261e32 
> 
> 
> Diff: https://reviews.apache.org/r/69787/diff/2/
> 
> 
> Testing
> ---
> 
> Manually tweaked the plugin to forward requests to itself and all tests pass.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>



Re: Review Request 69793: Added the `ResourceState` agent protobuf message.

2019-01-24 Thread Gastón Kleiman

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69793/
---

(Updated Jan. 24, 2019, 2:36 p.m.)


Review request for mesos, Benno Evers and Greg Mann.


Changes
---

Addressed feedback.


Bugs: MESOS-9356
https://issues.apache.org/jira/browse/MESOS-9356


Repository: mesos


Description
---

This protobuf message will be used by the agent to atomically checkpoint
resources and operations.


Diffs (updated)
-

  src/CMakeLists.txt a574d449dc26b820cbef7ff0b5e94b42b6fe86cf 
  src/Makefile.am cd785255fcdf1302a8f9fa358039e5d1f200e132 
  src/slave/state.proto PRE-CREATION 


Diff: https://reviews.apache.org/r/69793/diff/3/

Changes: https://reviews.apache.org/r/69793/diff/2-3/


Testing
---

Current tests still pass.

Compiled new proto with both cmake and autotools.


Thanks,

Gastón Kleiman



Re: Review Request 69795: Made agent recover atomically checkpointed resources and operations.

2019-01-24 Thread Gastón Kleiman

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69795/
---

(Updated Jan. 24, 2019, 2:37 p.m.)


Review request for mesos, Benno Evers and Greg Mann.


Changes
---

Addressed feedback.


Bugs: MESOS-9356
https://issues.apache.org/jira/browse/MESOS-9356


Repository: mesos


Description
---

Made agent recover atomically checkpointed resources and operations.


Diffs (updated)
-

  src/slave/slave.hpp 2bcd7a93a8f25b77c71c7f931bfaac87649f987c 
  src/slave/slave.cpp ed92f672f5155d70a36ba3619bb6f06fa09bc836 
  src/slave/state.hpp e2180ae37a65a57d1edb29d2ad6cc8232029906e 
  src/slave/state.cpp ae16d6f656cff71bcbdd4ffb02888024530b9274 


Diff: https://reviews.apache.org/r/69795/diff/4/

Changes: https://reviews.apache.org/r/69795/diff/3-4/


Testing
---

Current tests still pass.


Thanks,

Gastón Kleiman



Re: Review Request 69795: Made agent recover atomically checkpointed resources and operations.

2019-01-24 Thread Gastón Kleiman


> On Jan. 23, 2019, 4:58 p.m., Greg Mann wrote:
> > src/slave/slave.cpp
> > Lines 7356-7380 (patched)
> > 
> >
> > If we change the checkpointing code to checkpoint the operation 
> > _before_ it is applied, then we need to apply speculative operations here 
> > as well, since `updateOperation()` only applies non-speculative operations 
> > when they transition to terminal.

Revision two makes the agent apply pending operations and adds some additional 
checks.


> On Jan. 23, 2019, 4:58 p.m., Greg Mann wrote:
> > src/slave/slave.cpp
> > Lines 7381-7384 (patched)
> > 
> >
> > Should we add a `CHECK(operation->latest_status().state() == 
> > OPERATION_FINISHED)` here?

Dropping this issue per offline discussion.

Even though right now operations can either be pending or finished, future 
versions of the agent might use other terminal states, so the check might be 
too strict.


- Gastón


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69795/#review212257
---


On Jan. 24, 2019, 2:37 p.m., Gastón Kleiman wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69795/
> ---
> 
> (Updated Jan. 24, 2019, 2:37 p.m.)
> 
> 
> Review request for mesos, Benno Evers and Greg Mann.
> 
> 
> Bugs: MESOS-9356
> https://issues.apache.org/jira/browse/MESOS-9356
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Made agent recover atomically checkpointed resources and operations.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp 2bcd7a93a8f25b77c71c7f931bfaac87649f987c 
>   src/slave/slave.cpp ed92f672f5155d70a36ba3619bb6f06fa09bc836 
>   src/slave/state.hpp e2180ae37a65a57d1edb29d2ad6cc8232029906e 
>   src/slave/state.cpp ae16d6f656cff71bcbdd4ffb02888024530b9274 
> 
> 
> Diff: https://reviews.apache.org/r/69795/diff/4/
> 
> 
> Testing
> ---
> 
> Current tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>



Re: Review Request 69825: Added test to ensure backwards compatibility of resources checkpointing.

2019-01-24 Thread Gastón Kleiman

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69825/
---

(Updated Jan. 24, 2019, 2:39 p.m.)


Review request for mesos, Benno Evers and Greg Mann.


Changes
---

Addressed feedback.


Bugs: MESOS-9356
https://issues.apache.org/jira/browse/MESOS-9356


Repository: mesos


Description
---

Added test to ensure backwards compatibility of resources checkpointing.


Diffs (updated)
-

  src/tests/reservation_tests.cpp 3f33885099c3377c6477475ae43dcaa88d920547 


Diff: https://reviews.apache.org/r/69825/diff/2/

Changes: https://reviews.apache.org/r/69825/diff/1-2/


Testing
---

The test passed 10,000 consecutive runs.


Thanks,

Gastón Kleiman



Re: Review Request 69793: Added the `ResourceState` agent protobuf message.

2019-01-24 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69793/#review212286
---




src/slave/state.proto
Lines 25 (patched)


This comment seems insufficient, since this could also include some pending 
operations?


- Greg Mann


On Jan. 24, 2019, 12:16 a.m., Gastón Kleiman wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69793/
> ---
> 
> (Updated Jan. 24, 2019, 12:16 a.m.)
> 
> 
> Review request for mesos, Benno Evers and Greg Mann.
> 
> 
> Bugs: MESOS-9356
> https://issues.apache.org/jira/browse/MESOS-9356
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This protobuf message will be used by the agent to atomically checkpoint
> resources and operations.
> 
> 
> Diffs
> -
> 
>   src/CMakeLists.txt a574d449dc26b820cbef7ff0b5e94b42b6fe86cf 
>   src/Makefile.am cd785255fcdf1302a8f9fa358039e5d1f200e132 
>   src/slave/state.proto PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/69793/diff/2/
> 
> 
> Testing
> ---
> 
> Current tests still pass.
> 
> Compiled new proto with both cmake and autotools.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>



Re: Review Request 69792: Added an ostream operator for `Operation`.

2019-01-24 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69792/#review212283
---


Ship it!




Ship It!

- Greg Mann


On Jan. 23, 2019, 11:45 p.m., Gastón Kleiman wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69792/
> ---
> 
> (Updated Jan. 23, 2019, 11:45 p.m.)
> 
> 
> Review request for mesos, Benno Evers and Greg Mann.
> 
> 
> Bugs: MESOS-9356
> https://issues.apache.org/jira/browse/MESOS-9356
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added an ostream operator for Operations.
> 
> 
> Diffs
> -
> 
>   include/mesos/type_utils.hpp 3926f491f41aee255ca13ed6a2731fbc88efb9a5 
>   src/common/type_utils.cpp df888efbdb99e07584e446079c337e7b9eb7cede 
> 
> 
> Diff: https://reviews.apache.org/r/69792/diff/2/
> 
> 
> Testing
> ---
> 
> Current tests still pass.
> 
> Manually looked at the logs generated by the changes in this change, and they 
> look pretty good.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>



Re: Review Request 69791: Made it possible to checkpoint resources without downgrading them.

2019-01-24 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69791/#review212282
---


Ship it!




Ship It!

- Greg Mann


On Jan. 23, 2019, 6:24 p.m., Gastón Kleiman wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69791/
> ---
> 
> (Updated Jan. 23, 2019, 6:24 p.m.)
> 
> 
> Review request for mesos, Benno Evers and Greg Mann.
> 
> 
> Bugs: MESOS-9356
> https://issues.apache.org/jira/browse/MESOS-9356
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> For backwards compatibility the agent currently downgrades resources
> before checkpointing them.
> 
> In order to support providing operation feedback for operation affecting
> agent default resources, we're going to start atomically checkpointing
> resources and operations in a new file.
> 
> Since we're going to checkpoint the resources to a new file that is not
> used by old agents, there is no need to downgrade resources before
> writing them to that file.
> 
> Note: this patch doesn't change the format in which resources are
> currently checkpointed.
> 
> 
> Diffs
> -
> 
>   src/slave/state.hpp e2180ae37a65a57d1edb29d2ad6cc8232029906e 
> 
> 
> Diff: https://reviews.apache.org/r/69791/diff/2/
> 
> 
> Testing
> ---
> 
> Current tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>



Re: Review Request 69781: Fixed flakiness of resource provider ContainerTerminationMetric test.

2019-01-24 Thread Chun-Hung Hsiao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69781/#review212275
---




src/tests/storage_local_resource_provider_tests.cpp
Lines 3388 (patched)


IIUC the async loop you mentioned is the one trying to detect the creation 
of the domain socket, which is asynchronously called in `launchContainer`, so 
waiting for just this function to be dispatched might not be sufficient before 
pausing the clock again. How about waiting for `waitContainer` (to ensure that 
the post-start hook and thus the async loop has been finished) after 
`launchContainer` and before the pause?

Alternatively, since this test doesn't actually care what happens after 
`launchContainer`, is it needed to resume the clock at all?


- Chun-Hung Hsiao


On Jan. 18, 2019, 11:45 a.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69781/
> ---
> 
> (Updated Jan. 18, 2019, 11:45 a.m.)
> 
> 
> Review request for mesos and Chun-Hung Hsiao.
> 
> 
> Bugs: MESOS-9130
> https://issues.apache.org/jira/browse/MESOS-9130
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> While we addressed one source of flakiness of this test in
> `2117f671c450d7c74edc53078e8c0ed6035020aa` we introduced a new source of
> flakiness (agents send more than the expected number of
> `UpdateSlaveMessage`s since they failed a timeout in registration with
> the master).
> 
> This patch ensures that the agent registers successfully before
> continuing with the test.
> 
> 
> Diffs
> -
> 
>   src/tests/storage_local_resource_provider_tests.cpp 
> 164e93a3749d4d668e12de31641aecece384 
> 
> 
> Diff: https://reviews.apache.org/r/69781/diff/2/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 69817: Refactored 'support/verify-reviews.py' to be closer to commit 7412179.

2019-01-24 Thread Armand Grillet


> On Jan. 23, 2019, 8:55 p.m., Vinod Kone wrote:
> > Looks ok to me. I'll let Till ship this.
> > 
> > Has this been tested?

I have tested the changes locally but this was also the case for previous 
review requests on this support script thus I am still not confident regarding 
the reliability of this script.


- Armand


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69817/#review212250
---


On Jan. 23, 2019, 12:47 p.m., Armand Grillet wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69817/
> ---
> 
> (Updated Jan. 23, 2019, 12:47 p.m.)
> 
> 
> Review request for mesos, Till Toenshoff and Vinod Kone.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The goal of this commit is to make 'verify-reviews.py' as close as
> possible to its last known working state (commit
> 74121798f24fca372180b8c4bc00b4df07d46240).
> 
> The diff between 'verify-reviews.py' at this commit and 7412179 is
> available here: https://www.diffchecker.com/cmfaM83O All the new
> features added since 7412179 have been added again and the code
> has been made Python 3 compatible again.
> 
> The goal is to have a diff as minimal as possible with improvements
> regarding logs so that we do not face CI issues anymore. Changes
> have been made regarding how we run commands from the script so
> that they are run from the repository instead of where the
> script is being called.
> 
> 
> Diffs
> -
> 
>   support/verify-reviews.py 71326d34bb649e27a3a2901867d31a2a1fffd4e9 
> 
> 
> Diff: https://reviews.apache.org/r/69817/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Armand Grillet
> 
>



Re: Review Request 69812: Implemented the RPC retry logic for SLRP.

2019-01-24 Thread James DeFelice


> On Jan. 23, 2019, 3:33 p.m., James DeFelice wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 1916 (patched)
> > 
> >
> > what about a metric for call retries?
> 
> Chun-Hung Hsiao wrote:
> `resource_providers/./csi_plugin/rpcs//errors` should be 
> a good approximation. I'll create a follow-up patch for finer-grained error 
> metrics, but probably won't backport it. Is it good enough?

Yes, good enough. Thanks.


- James


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69812/#review212242
---


On Jan. 23, 2019, 7:10 a.m., Chun-Hung Hsiao wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69812/
> ---
> 
> (Updated Jan. 23, 2019, 7:10 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, James DeFelice, Jie Yu, and Jan 
> Schlicht.
> 
> 
> Bugs: MESOS-9517
> https://issues.apache.org/jira/browse/MESOS-9517
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When the CSI plugin returns a retryable error (i.e., `DEADLINE_EXCEEDED`
> or `UNAVAILABLE`) for `CreateVolume` or `DeleteVolume` CSI calls, SLRP
> will now retry indefinitely with a random exponential backoff.
> 
> 
> Diffs
> -
> 
>   src/csi/client.hpp 5d40d54c2abbd03993ce8835d37db23e209c7554 
>   src/csi/client.cpp 61ed410985099828a2f58b1527ab57daa4b379df 
>   src/resource_provider/storage/provider.hpp 
> 331f7b785b14b814c2889488effd53f3a48a1b95 
>   src/resource_provider/storage/provider.cpp 
> d6e20a549ede189c757ae3ae922ab7cb86d2be2c 
> 
> 
> Diff: https://reviews.apache.org/r/69812/diff/2/
> 
> 
> Testing
> ---
> 
> make check
> 
> A unit test will be added later in the chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>



Re: Review Request 69082: Correctly propagated `close` failures in some instances.

2019-01-24 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69082/#review212269
---



FAIL: Some of the unit tests failed. Please check the relevant logs.

Reviews applied: `['69082']`

Failed command: `Start-MesosCITesting`

All the build artifacts available at: 
http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2814/mesos-review-69082

Relevant logs:

- 
[mesos-tests-cmake.log](http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2814/mesos-review-69082/logs/mesos-tests-cmake.log):

```
 
d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\zookeeper.c(3500):
 warning C4996: 'inet_ntoa': Use inet_ntop() or InetNtop() instead or define 
_WINSOCK_DEPRECATED_NO_WARNINGS to disable deprecated API warnings 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\zookeeper.vcxproj]
 [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
 
d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\zookeeper.c(3501):
 warning C4996: 'sprintf': This function or variable may be unsafe. Consider 
using sprintf_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. 
See online help for details. 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\zookeeper.vcxproj]
 [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
 
d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\zookeeper.c(3479):
 warning C4101: 'addrstr': unreferenced local variable 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\zookeeper.vcxproj]
 [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
 
d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\recordio.c(170):
 warning C4267: '=': conversion from 'size_t' to 'int32_t', possible loss of 
data 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\zookeeper.vcxproj]
 [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
 
d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\mt_adaptor.c(496):
 warning C4244: '=': conversion from 'time_t' to 'int32_t', possible loss of 
data 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\zookeeper.vcxproj]
 [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
 
d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\winport.c(256):
 warning C4090: 'function': different 'const' qualifiers 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\zookeeper.vcxproj]
 [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
 
d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\winport.c(166):
 warning C4716: 'pthread_cond_broadcast': must return a value 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\zookeeper.vcxproj]
 [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
 
d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\winport.c(205):
 warning C4716: 'pthread_cond_wait': must return a value 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\zookeeper.vcxproj]
 [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
 
d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(124):
 warning C4996: 'fopen': This function or variable may be unsafe. Consider 
using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See 
online help for details. 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
 
d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(128):
 warning C4267: 'initializing': conversion from 'size_t' to 'int', possible 
loss of data 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
 
d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(279):
 warning C4267: 'function': conversion from 'size_t' to 'int', possible loss of 
data 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
 
d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(301):
 warning C4267: 'initializing': conversion from 'size_t' to 'int', possible 
loss of data 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
 
d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(368):
 warning C4267: 'function': conversion from 'size_t' to 'int', possible loss of 
data 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] 
[D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
 

Re: Review Request 69082: Correctly propagated `close` failures in some instances.

2019-01-24 Thread Benjamin Bannier

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69082/
---

(Updated Jan. 24, 2019, 11:30 a.m.)


Review request for mesos, Benjamin Mahler, Chun-Hung Hsiao, and Alex Clemmer.


Changes
---

Addressed some more comments from bmahler.


Bugs: MESOS-9331
https://issues.apache.org/jira/browse/MESOS-9331


Repository: mesos


Description
---

When e.g., writing to disk, errors from write might only manifest at
::close time. We update some instances to correctly propagate these
errors instead of dropping them silently. We only propagate the
::close error if the write operation succeeded, otherwise we just
propagate the error from the write operation.


Diffs (updated)
-

  3rdparty/stout/include/stout/os/posix/mktemp.hpp 
63b3d1a7720d07f877fa1d4eb7f32a548916637a 
  3rdparty/stout/include/stout/os/write.hpp 
f7538f94f5a953a7a90a05bc1d2f138b6c17f814 
  3rdparty/stout/include/stout/protobuf.hpp 
eb4adef56f1701e3c101284e05e4e6c66eef9180 


Diff: https://reviews.apache.org/r/69082/diff/4/

Changes: https://reviews.apache.org/r/69082/diff/3-4/


Testing
---

`make check`


Thanks,

Benjamin Bannier



Re: Review Request 69082: Correctly propagated `close` failures in some instances.

2019-01-24 Thread Benjamin Bannier


> On Jan. 18, 2019, 6:59 p.m., Benjamin Mahler wrote:
> > 3rdparty/stout/include/stout/os/posix/mktemp.hpp
> > Line 43 (original), 43 (patched)
> > 
> >
> > s/path/template/ and s/string/path/ would be clearer?

I believe this would warrant another path as such a change would have no 
relation to what is done here.

Unfortunately `template` is a (not context sensitive) keyword, so I am not sure 
what exactly would help making this much clearer. I can post a patch if you 
could suggest something.

Dropping for now.


> On Jan. 18, 2019, 6:59 p.m., Benjamin Mahler wrote:
> > 3rdparty/stout/include/stout/os/write.hpp
> > Line 143 (original), 140-146 (patched)
> > 
> >
> > ditto here

See above.


> On Jan. 18, 2019, 6:59 p.m., Benjamin Mahler wrote:
> > 3rdparty/stout/include/stout/protobuf.hpp
> > Lines 164 (patched)
> > 
> >
> > Does `stringify(*fd)` make it fit on one line?

Yes, it would, but AFAIK we don't have that operator defined for `Try`. I 
created https://issues.apache.org/jira/browse/MESOS-7124 for this some time ago 
and also gave some argument why I think this would have better semantics than 
current use of `get`.


- Benjamin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69082/#review212146
---


On Jan. 24, 2019, 11:30 a.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69082/
> ---
> 
> (Updated Jan. 24, 2019, 11:30 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Chun-Hung Hsiao, and Alex Clemmer.
> 
> 
> Bugs: MESOS-9331
> https://issues.apache.org/jira/browse/MESOS-9331
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When e.g., writing to disk, errors from write might only manifest at
> ::close time. We update some instances to correctly propagate these
> errors instead of dropping them silently. We only propagate the
> ::close error if the write operation succeeded, otherwise we just
> propagate the error from the write operation.
> 
> 
> Diffs
> -
> 
>   3rdparty/stout/include/stout/os/posix/mktemp.hpp 
> 63b3d1a7720d07f877fa1d4eb7f32a548916637a 
>   3rdparty/stout/include/stout/os/write.hpp 
> f7538f94f5a953a7a90a05bc1d2f138b6c17f814 
>   3rdparty/stout/include/stout/protobuf.hpp 
> eb4adef56f1701e3c101284e05e4e6c66eef9180 
> 
> 
> Diff: https://reviews.apache.org/r/69082/diff/4/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>