Re: Review Request 71414: Gracefully handled duplicated volumes from non-conforming CSI plugins.
> On Aug. 30, 2019, 11:02 a.m., Benno Evers wrote: > > src/resource_provider/storage/provider.cpp > > Line 1114 (original), 1135 (patched) > > <https://reviews.apache.org/r/71414/diff/1/?file=2163192#file2163192line1142> > > > > It seems like this could be moved to the beginning of the loop, so we > > don't even have to create the raw disk resource at all in case of a > > duplicated volume id. Yes I thought about that. I put it here just for printing out the warning message with the whole resource proto but it's probably not necessary. > On Aug. 30, 2019, 11:02 a.m., Benno Evers wrote: > > src/resource_provider/storage/provider.cpp > > Line 1145 (original), 1167 (patched) > > <https://reviews.apache.org/r/71414/diff/1/?file=2163192#file2163192line1182> > > > > Probably the name doesnt quite match anymore, now that the conversion > > converts more than just metadata. Good point. I'll have a follow up patch to change the name. > On Aug. 30, 2019, 11:02 a.m., Benno Evers wrote: > > src/tests/storage_local_resource_provider_tests.cpp > > Lines 1379 (patched) > > <https://reviews.apache.org/r/71414/diff/1/?file=2163193#file2163193line1379> > > > > Can this test be run when the clock is stopped? If not, can we document > > in a comment why not? Given the many moving parts here, I fear that there's > > a high chance this test will become flaky. In general I usually lean towards stopping the clock only when necessary, and enforce as few synchronizations as possible. A more nodeterministic test would help revealing race problems in the code, but also might reveal race problems in the test (which doesn't worth the time to fix). - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71414/#review217508 --- On Aug. 30, 2019, 12:48 a.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71414/ > --- > > (Updated Aug. 30, 2019, 12:48 a.m.) > > > Review request for mesos, Benjamin Bannier and Jan Schlicht. > > > Bugs: MESOS-9956 > https://issues.apache.org/jira/browse/MESOS-9956 > > > Repository: mesos > > > Description > --- > > If the SLRP uses a plugin that does not conform to the CSI spec and > reports duplicated volumes, the duplicate would be removed. > > > Diffs > - > > src/resource_provider/storage/provider.cpp > 6d632606f411d3ca99d3573a57c9f68b02ba8072 > src/tests/storage_local_resource_provider_tests.cpp > 69b59d48ceefebbb7accefe411c54ac5cecff1c3 > > > Diff: https://reviews.apache.org/r/71414/diff/1/ > > > Testing > --- > > make check > > The test would fail without the fix. > > > Thanks, > > Chun-Hung Hsiao > >
Review Request 71414: Gracefully handled duplicated volumes from non-conforming CSI plugins.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71414/ --- Review request for mesos, Benjamin Bannier and Jan Schlicht. Bugs: MESOS-9956 https://issues.apache.org/jira/browse/MESOS-9956 Repository: mesos Description --- If the SLRP uses a plugin that does not conform to the CSI spec and reports duplicated volumes, the duplicate would be removed. Diffs - src/resource_provider/storage/provider.cpp 6d632606f411d3ca99d3573a57c9f68b02ba8072 src/tests/storage_local_resource_provider_tests.cpp 69b59d48ceefebbb7accefe411c54ac5cecff1c3 Diff: https://reviews.apache.org/r/71414/diff/1/ Testing --- make check The test would fail without the fix. Thanks, Chun-Hung Hsiao
Re: Review Request 71151: Performed periodic storage local provider reconciliations.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71151/#review217315 --- Fix it, then Ship it! src/resource_provider/storage/provider.cpp Lines 777 (patched) <https://reviews.apache.org/r/71151/#comment304613> Shouldn't `alwaysUpdate` be false here? src/resource_provider/storage/provider.cpp Line 753 (original), 820 (patched) <https://reviews.apache.org/r/71151/#comment304616> `resourceVersion` needs to be updated every time `UPDATE_STATE` is sent to avoid races. In the original implementation, since it is initialized to a random UUID in the constructor, we avoided assigning a new random value for the first `UPDATE_STATE` call. In this new codepath, it's possible that we update the random UUID for the first `UPDATE_STATE` call, which is okay, just that the initial random UUID is wasted. If we don't care about this small overhead, I'd suggest that we move this line into the `sendResourceProviderStateUpdate` function. src/resource_provider/storage/provider.cpp Lines 760-775 (original), 827-842 (patched) <https://reviews.apache.org/r/71151/#comment304614> These should be be removed. L741-753 already handled the transition. src/resource_provider/storage/provider.cpp Line 779 (original), 846 (patched) <https://reviews.apache.org/r/71151/#comment304615> This should be removed. L739 resumed the SUM. - Chun-Hung Hsiao On Aug. 19, 2019, 11:58 a.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71151/ > --- > > (Updated Aug. 19, 2019, 11:58 a.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Performed periodic storage local provider reconciliations. > > > Diffs > - > > src/resource_provider/storage/provider.cpp > 6d632606f411d3ca99d3573a57c9f68b02ba8072 > src/tests/storage_local_resource_provider_tests.cpp > 69b59d48ceefebbb7accefe411c54ac5cecff1c3 > > > Diff: https://reviews.apache.org/r/71151/diff/6/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71151: Performed periodic storage local provider reconciliations.
> On Aug. 9, 2019, 7:41 a.m., Chun-Hung Hsiao wrote: > > src/resource_provider/storage/provider.cpp > > Line 752 (original), 811 (patched) > > <https://reviews.apache.org/r/71151/diff/4/?file=2159963#file2159963line813> > > > > As commented in the previous patch, since we're having two functions, > > `reconcileStoragePools` and `reconcileResources`, it's alright to have a > > little bit of repeated code and don't introduce an extra function without a > > very focused purpose. > > > > In `reconcileResources`, we could have the logic that checks for > > `alwaysUpdate`. In `reconcileStoragePools`, since it's only called when > > profiles are updated or volumes of a stale profile are deleted, there's no > > need for the extra `alwaysUpdate` logic there. > > Benjamin Bannier wrote: > I updated the code in question. I am still not sure I understood what > exactly you were after, could you have a look? Since you went for merging `reconfileStoragePools` into `reconcileResources`, please ignore it :) - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71151/#review217143 --- On Aug. 19, 2019, 11:58 a.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71151/ > --- > > (Updated Aug. 19, 2019, 11:58 a.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Performed periodic storage local provider reconciliations. > > > Diffs > - > > src/resource_provider/storage/provider.cpp > 6d632606f411d3ca99d3573a57c9f68b02ba8072 > src/tests/storage_local_resource_provider_tests.cpp > 69b59d48ceefebbb7accefe411c54ac5cecff1c3 > > > Diff: https://reviews.apache.org/r/71151/diff/6/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71150: Factored out storage provider method to update resources.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71150/#review217316 --- src/resource_provider/storage/provider.cpp Lines 722-723 (original), 729-739 (patched) <https://reviews.apache.org/r/71150/#comment304617> Do you think this extra flattern code makes the next continuation more readable? If not maybe let's just keep the original `foreach` loop to save an extra copy and an extra continuation? Otherwise, please feel free to drop this. I already gave you another ship-it :) - Chun-Hung Hsiao On Aug. 19, 2019, 11:58 a.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71150/ > --- > > (Updated Aug. 19, 2019, 11:58 a.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Factored out storage provider method to update resources. > > > Diffs > - > > src/resource_provider/storage/provider.cpp > 6d632606f411d3ca99d3573a57c9f68b02ba8072 > src/tests/storage_local_resource_provider_tests.cpp > 69b59d48ceefebbb7accefe411c54ac5cecff1c3 > > > Diff: https://reviews.apache.org/r/71150/diff/5/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71150: Factored out storage provider method to update resources.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71150/#review217314 --- Ship it! Ship It! - Chun-Hung Hsiao On Aug. 19, 2019, 11:58 a.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71150/ > --- > > (Updated Aug. 19, 2019, 11:58 a.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Factored out storage provider method to update resources. > > > Diffs > - > > src/resource_provider/storage/provider.cpp > 6d632606f411d3ca99d3573a57c9f68b02ba8072 > src/tests/storage_local_resource_provider_tests.cpp > 69b59d48ceefebbb7accefe411c54ac5cecff1c3 > > > Diff: https://reviews.apache.org/r/71150/diff/5/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71151: Performed periodic storage local provider reconciliations.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71151/#review217143 --- Fix it, then Ship it! src/resource_provider/storage/provider.cpp Line 735 (original), 796 (patched) <https://reviews.apache.org/r/71151/#comment304408> Let's check that the `reconciled` future is pending at the very beginning of this function, similar to what we do in `reconcileStoragePools`, because `getExistingVolumes` and `getStoragePools` must be called after a reconciliation starts. src/resource_provider/storage/provider.cpp Line 752 (original), 811 (patched) <https://reviews.apache.org/r/71151/#comment304409> As commented in the previous patch, since we're having two functions, `reconcileStoragePools` and `reconcileResources`, it's alright to have a little bit of repeated code and don't introduce an extra function without a very focused purpose. In `reconcileResources`, we could have the logic that checks for `alwaysUpdate`. In `reconcileStoragePools`, since it's only called when profiles are updated or volumes of a stale profile are deleted, there's no need for the extra `alwaysUpdate` logic there. src/tests/storage_local_resource_provider_tests.cpp Lines 6743 (patched) <https://reviews.apache.org/r/71151/#comment304411> These are kinda ugly. Probably should find some time to refactor them in the future :) src/tests/storage_local_resource_provider_tests.cpp Lines 6871 (patched) <https://reviews.apache.org/r/71151/#comment304412> Let's do a `Clock::settle()` before `Clock::advance()` to ensure that the timer has been set up. src/tests/storage_local_resource_provider_tests.cpp Lines 6881 (patched) <https://reviews.apache.org/r/71151/#comment304413> Ditto. - Chun-Hung Hsiao On Aug. 6, 2019, 2:27 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71151/ > --- > > (Updated Aug. 6, 2019, 2:27 p.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Performed periodic storage local provider reconciliations. > > > Diffs > - > > src/resource_provider/storage/provider.cpp > 6d632606f411d3ca99d3573a57c9f68b02ba8072 > src/tests/storage_local_resource_provider_tests.cpp > 69b59d48ceefebbb7accefe411c54ac5cecff1c3 > > > Diff: https://reviews.apache.org/r/71151/diff/4/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71150: Factored out storage provider method to update resources.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71150/#review217142 --- Fix it, then Ship it! src/resource_provider/storage/provider.cpp Line 718 (original), 728 (patched) <https://reviews.apache.org/r/71150/#comment304403> Move `.then` to the next line. src/resource_provider/storage/provider.cpp Lines 746-750 (original), 739-749 (patched) <https://reviews.apache.org/r/71150/#comment304401> The order of applying the resource conversions in the inner vector is important, so if you prefer having a a flatterned vector, let's add a comment saying that the order is preserved when flatterning. src/resource_provider/storage/provider.cpp Lines 949-950 (original), 931-932 (patched) <https://reviews.apache.org/r/71150/#comment304399> It seems to me now that there's no need to hard-fail the SLRP here. Also, let's establish a convention that the error handling is always handled in the top-level caller that doesn't propagate the failure, to avoid repeatitive log messages. So let's remove the `onFailed` and `onDiscarded`. src/resource_provider/storage/provider.cpp Lines 976 (patched) <https://reviews.apache.org/r/71150/#comment304405> The return type should be `Try` or simply `Nothing` since this is not an asynchoronus function and never returns an error. src/resource_provider/storage/provider.cpp Lines 979-995 (patched) <https://reviews.apache.org/r/71150/#comment304404> Sorry for not making my suggestion clear enough. I was actually thinking about removing `reconcileStoragePools()` and always calling `reconcileResources()` even when we only want to reconcile storage pools. This suggestion makes more sense if we don't reconcile storage pools after destroying a MOUNT disk with a stale profile, as I suggested in the next patch. But if we want to keep this behavior, then this approach would introduce an extra `ListVolumes` call. If you prefer to avoid having this extra grpc call, then adding this extra function seems only give us a little bit of code reuse, and I'm not sure if it worths adding an extra function name to the already-long list of functions in this class. I'd vote for keeping the code in its original place and avoid introducing a function name that doesn't convey its purpose clearly. src/resource_provider/storage/provider.cpp Line 1874 (original), 1902 (patched) <https://reviews.apache.org/r/71150/#comment304402> Let's print out log messages here instead: ``` auto err = [](const Resource& resource, const string& message) { LOG(ERROR) << "Failed to reconcile storage pools after resource '" << resource << "' has been freed: " << message; }; reconciled = sequence.add(...) .onFailed(std::bind(err, resource, lambda::_1)) .onDiscarded(std::bind(err, resource, "future discarded")); ``` - Chun-Hung Hsiao On July 23, 2019, 8:18 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71150/ > --- > > (Updated July 23, 2019, 8:18 p.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Factored out storage provider method to update resources. > > > Diffs > - > > src/resource_provider/storage/provider.cpp > 6d632606f411d3ca99d3573a57c9f68b02ba8072 > > > Diff: https://reviews.apache.org/r/71150/diff/3/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71151: Performed periodic storage local provider reconciliations.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71151/#review217049 --- src/resource_provider/storage/provider.cpp Lines 749-768 (patched) <https://reviews.apache.org/r/71151/#comment304269> ``` loop( self(), std::bind(::after, reconciliationInterval), [this](const Nothing&) { // Poll resource provider state ... reconciled = sequence.add(defer(self(), ::reconcileResources)); return reconciled.then([](const Nothing&) -> ControlFlow { return Continue(); }); }); ``` src/resource_provider/storage/provider.cpp Lines 756-760 (original), 801-821 (patched) <https://reviews.apache.org/r/71151/#comment304272> How about introduce a `bool alwaysUpdate` parameter in this function: ``` bool shouldUpdate = alwaysUpdate; if (result != totalResources) { LOG(INFO) << "Removing"... ; // Update the resource version... totalResources = result; resourceVersion = id::UUID::random(); checkpointResourceProviderState(); shouldUpdate = true; } if (shouldUpdate) { sendResourceProviderStateUpdate(); } ``` Then, in `reconcileResourceProviderState()`: ``` return reconcileOperationStatuses() .then(defer(self(), ::reconcileResources, true)) .then(defer(self(), [this] { statusUpdateManager.resume(); LOG(INFO) << "Resource provider " << info.id() << " is in READY state"; state = READY; return Nothing(); })); ``` So we don't put the state transition logic inside this function? src/resource_provider/storage/provider.cpp Lines 1356 (patched) <https://reviews.apache.org/r/71151/#comment304273> The delay won't guarantee the happens-before order between the first reconciliation and the ones in `watchResources`. How about: ``` reconciled = reconcileResourceProviderState() .onReady(defer(self(), ::watchProfiles)) .onReady(defer(self(), ::watchResources)) .onFailed(...) .onDiscarded(...); src/resource_provider/storage/provider.cpp Lines 1869-1886 (original), 1933-1950 (patched) <https://reviews.apache.org/r/71151/#comment304274> With periodic reconciliation, this won't be necessary. Removing this would also reduce a source of reconciliation, so the reconciliation can only happen in either `watchProfiles` or `watchResources`. - Chun-Hung Hsiao On July 29, 2019, 8:56 a.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71151/ > ------- > > (Updated July 29, 2019, 8:56 a.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Performed periodic storage local provider reconciliations. > > > Diffs > - > > src/resource_provider/storage/provider.cpp > 6d632606f411d3ca99d3573a57c9f68b02ba8072 > src/tests/storage_local_resource_provider_tests.cpp > 69b59d48ceefebbb7accefe411c54ac5cecff1c3 > > > Diff: https://reviews.apache.org/r/71151/diff/2/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71150: Factored out storage provider method to update resources.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71150/#review216947 --- I'm not sure if this is src/resource_provider/storage/provider.cpp Lines 292 (patched) <https://reviews.apache.org/r/71150/#comment304250> Did you accidentally add a new line? src/resource_provider/storage/provider.cpp Lines 728 (patched) <https://reviews.apache.org/r/71150/#comment304270> How about merging this function and `reconcileStoragePools` since the logic is very similar? src/resource_provider/storage/provider.cpp Lines 730 (patched) <https://reviews.apache.org/r/71150/#comment304182> Remove this debug log. src/resource_provider/storage/provider.cpp Lines 735 (patched) <https://reviews.apache.org/r/71150/#comment304183> Remove this debug log. - Chun-Hung Hsiao On July 23, 2019, 8:18 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71150/ > --- > > (Updated July 23, 2019, 8:18 p.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Factored out storage provider method to update resources. > > > Diffs > - > > src/resource_provider/storage/provider.cpp > 6d632606f411d3ca99d3573a57c9f68b02ba8072 > > > Diff: https://reviews.apache.org/r/71150/diff/1/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 70831: Fixed synchronization in MasterTest.UpdateSlaveMessageWithPendingOffers.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70831/#review217026 --- src/tests/master_tests.cpp Lines 9014 (patched) <https://reviews.apache.org/r/70831/#comment304251> Alternatively, we can avoid this expectation and do the following: ``` Future providerId = resourceProvider->process->id(); AWAIT_READY(providerId); ``` Then s/`resourceProvider->process->info.id()`/`providerId.get()`/ I'm fine with either approach so please feel free to drop this. - Chun-Hung Hsiao On June 11, 2019, 12:38 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70831/ > --- > > (Updated June 11, 2019, 12:38 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Jan Schlicht. > > > Repository: mesos > > > Description > --- > > In order for a resource provider to have been assigned and stored an > assigned resource provider ID it needs to receive and process the > corresponding resource provider `SUBSCRIBED` event. > > This patch fixes `MasterTest.UpdateSlaveMessageWithPendingOffers` so we > await the correct event. The test was previously awaiting an > `UpdateSlaveMessage` which is only tangentially related, but does not > guarantee correct event ordering. > > > Diffs > - > > src/tests/master_tests.cpp 097f1b77a59e29c6690210773d1556ebf2bb701e > > > Diff: https://reviews.apache.org/r/70831/diff/1/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 70831: Fixed synchronization in MasterTest.UpdateSlaveMessageWithPendingOffers.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70831/#review217025 --- Ship it! Ship It! - Chun-Hung Hsiao On June 11, 2019, 12:38 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70831/ > --- > > (Updated June 11, 2019, 12:38 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Jan Schlicht. > > > Repository: mesos > > > Description > --- > > In order for a resource provider to have been assigned and stored an > assigned resource provider ID it needs to receive and process the > corresponding resource provider `SUBSCRIBED` event. > > This patch fixes `MasterTest.UpdateSlaveMessageWithPendingOffers` so we > await the correct event. The test was previously awaiting an > `UpdateSlaveMessage` which is only tangentially related, but does not > guarantee correct event ordering. > > > Diffs > - > > src/tests/master_tests.cpp 097f1b77a59e29c6690210773d1556ebf2bb701e > > > Diff: https://reviews.apache.org/r/70831/diff/1/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71148: Explicitly disabled periodic reconciliation for some provider tests.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71148/#review217024 --- Fix it, then Ship it! src/tests/storage_local_resource_provider_tests.cpp Line 330 (original), 330 (patched) <https://reviews.apache.org/r/71148/#comment304249> Fits in one line. - Chun-Hung Hsiao On July 29, 2019, 8:56 a.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71148/ > --- > > (Updated July 29, 2019, 8:56 a.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Explicitly disabled periodic reconciliation for some provider tests. > > > Diffs > - > > src/tests/storage_local_resource_provider_tests.cpp > 69b59d48ceefebbb7accefe411c54ac5cecff1c3 > > > Diff: https://reviews.apache.org/r/71148/diff/2/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71147: Update config factory to set resource provider reconciliation interval.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71147/#review217023 --- Ship it! Ship It! - Chun-Hung Hsiao On July 29, 2019, 8:56 a.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71147/ > --- > > (Updated July 29, 2019, 8:56 a.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Update config factory to set resource provider reconciliation interval. > > > Diffs > - > > src/tests/storage_local_resource_provider_tests.cpp > 69b59d48ceefebbb7accefe411c54ac5cecff1c3 > > > Diff: https://reviews.apache.org/r/71147/diff/2/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 70728: Backed `MockResourceProvider` by a process.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70728/#review216979 --- Fix it, then Ship it! src/tests/mesos.hpp Lines 3038 (patched) <https://reviews.apache.org/r/70728/#comment304212> It seems that I'm wrong about `Self`, which seems to become a type template in a class template. However, I've tested that we can simply use `MockResourceProviderProcess` (whitout any template argument) within this class without introducing a type alias. For example, use `::connectDefault`. That said, please feel free to drop this if you think introducing the `Self` type alias make the code more consistent. src/tests/mesos.hpp Line 3142 (original), 3150 (patched) <https://reviews.apache.org/r/70728/#comment304213> s/since // src/tests/mesos.hpp Lines 3151 (patched) <https://reviews.apache.org/r/70728/#comment304214> This introduces a coupling beween this test and the implementation of the JWT secret generator being synchoronous. How about the following instead: ``` process::Future> token = None(); #ifdef USE_SSL_SOCKET ... token = secretGenerator.generate(principal) .then([](const Secret& secret) -> Option { return secret.value().data(); }); #endif // TODO(bbannier): Remove the `shared_ptr` once we get C++14. auto detector_ = std::make_shared>(std::move(detector)); token.then(defer(self(), [=](const Option& token) { driver.reset(new Driver( std::move(*detector_), ..., token)); driver->start(); return Nothing(); })); ``` src/tests/mesos.hpp Lines 3302 (patched) <https://reviews.apache.org/r/70728/#comment304216> Do we use `FIXME` in the codebase? src/tests/mesos.hpp Lines 3312 (patched) <https://reviews.apache.org/r/70728/#comment304215> s/provider_id/providerId/ for naming consistency. src/tests/mesos.hpp Lines 3359 (patched) <https://reviews.apache.org/r/70728/#comment304217> No need to make it public anymore. src/tests/mesos.hpp Line 3350 (original), 3408 (patched) <https://reviews.apache.org/r/70728/#comment304218> How about s/Mock/Test/ here and below? This is more of a personal naming opinion so please feel free to drop it. src/tests/resource_provider_manager_tests.cpp Lines 1203 (patched) <https://reviews.apache.org/r/70728/#comment304219> s/terminate/stop/ and remove the line break. src/tests/resource_provider_manager_tests.cpp Lines 1436 (patched) <https://reviews.apache.org/r/70728/#comment304220> Ditto. - Chun-Hung Hsiao On July 27, 2019, 7:04 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70728/ > --- > > (Updated July 27, 2019, 7:04 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Jan Schlicht. > > > Bugs: MESOS-9560 > https://issues.apache.org/jira/browse/MESOS-9560 > > > Repository: mesos > > > Description > --- > > We were passing callbacks into `MockResourceProvider` to the HTTP > driver. Since the lifecycle of the callbacks and the mock provider were > decoupled and these callbacks were binding the mock provider instance > the code was not safe as written as the driver could invoke the callback > after the provider had been destructed. > > This patch makes sure that the callbacks are defered to a process. We > also dispatch a number of other functions which are strongly coupled to > the lifecycle of the provider. We still do not hide the provider away > completely so the provider can be mocked in tests. > > > Diffs > - > > src/tests/api_tests.cpp 641eb15153ddb85df322aa6a133ca8e4c6d6a510 > src/tests/master_slave_reconciliation_tests.cpp > 7b6ac50adacc8416b91c0dde55ff7ba46a20515c > src/tests/master_tests.cpp b9ef13c31a9c3ae16e55d3ae8f9b1538a49cf49a > src/tests/mesos.hpp 4612e2e3d2bd32590248df58b546de8756636964 > src/tests/operation_reconciliation_tests.cpp > eae318da2273faae904f0155e49bb23cdb24f007 > src/tests/resource_provider_manager_tests.cpp > 7d48f18e89f046df6c92e52edeef592acfb13b10 > src/tests/slave_tests.cpp abee107720d6b78bb017d2762431ae36c0679026 > > > Diff: https://reviews.apache.org/r/70728/diff/7/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71149: Renamed a storage provider function.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71149/#review216878 --- Ship it! You know what? I got stuck at this step trying to come up with a good name when I worked on this ticket ;) Thanks for coming up with this name! - Chun-Hung Hsiao On July 23, 2019, 8:18 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71149/ > --- > > (Updated July 23, 2019, 8:18 p.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Renamed a storage provider function. > > > Diffs > - > > src/resource_provider/storage/provider.cpp > 6d632606f411d3ca99d3573a57c9f68b02ba8072 > > > Diff: https://reviews.apache.org/r/71149/diff/1/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71148: Explicitly disabled periodic reconciliation for some provider tests.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71148/#review216877 --- src/tests/storage_local_resource_provider_tests.cpp Line 846 (original), 846 (patched) <https://reviews.apache.org/r/71148/#comment304121> How about using zero here, as described in `mesos.proto`? And since most tests require no reconciliation, maybe making zero the default in `setupResourceProviderConfig`? - Chun-Hung Hsiao On July 23, 2019, 8:18 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71148/ > --- > > (Updated July 23, 2019, 8:18 p.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Explicitly disabled periodic reconciliation for some provider tests. > > > Diffs > - > > src/tests/storage_local_resource_provider_tests.cpp > 69861265d94ddf344da96b593797ce145394413e > > > Diff: https://reviews.apache.org/r/71148/diff/1/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71147: Update config factory to set resource provider reconciliation interval.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71147/#review216876 --- src/tests/storage_local_resource_provider_tests.cpp Lines 331 (patched) <https://reviews.apache.org/r/71147/#comment304120> I'm actually thinking about haveing a really short reconciliation interval for testing here, so certain tests won't wait for a long time. WDYT? Or, maybe we can inject proper reconciliation intervals only in the tests that checks for reconciliations. - Chun-Hung Hsiao On July 23, 2019, 8:18 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71147/ > --- > > (Updated July 23, 2019, 8:18 p.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Update config factory to set resource provider reconciliation interval. > > > Diffs > - > > src/tests/storage_local_resource_provider_tests.cpp > 69861265d94ddf344da96b593797ce145394413e > > > Diff: https://reviews.apache.org/r/71147/diff/1/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71146: Clarified a comment in storage local resource provider tests.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71146/#review216875 --- Fix it, then Ship it! src/tests/storage_local_resource_provider_tests.cpp Line 615 (original), 615 (patched) <https://reviews.apache.org/r/71146/#comment304119> The RP "daemon" does not "get subscribed." The RP daemon is started after agent registration, which instantiate the resource providers, which subscribes to the RP manager. So maybe "Since the local resource provider gets subscribed after the agent is registered?" Ditto below. - Chun-Hung Hsiao On July 23, 2019, 8:18 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71146/ > --- > > (Updated July 23, 2019, 8:18 p.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Clarified a comment in storage local resource provider tests. > > > Diffs > - > > src/tests/storage_local_resource_provider_tests.cpp > 69861265d94ddf344da96b593797ce145394413e > > > Diff: https://reviews.apache.org/r/71146/diff/1/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71145: Fixed signature to pass parameter by const ref instead of value.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71145/#review216874 --- Ship it! Ship It! - Chun-Hung Hsiao On July 23, 2019, 8:18 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71145/ > --- > > (Updated July 23, 2019, 8:18 p.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9254 > https://issues.apache.org/jira/browse/MESOS-9254 > > > Repository: mesos > > > Description > --- > > Fixed signature to pass parameter by const ref instead of value. > > > Diffs > - > > src/tests/storage_local_resource_provider_tests.cpp > 69861265d94ddf344da96b593797ce145394413e > > > Diff: https://reviews.apache.org/r/71145/diff/1/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 71144: Added `reconciliation_interval_seconds` for storage resource providers.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71144/ --- (Updated July 26, 2019, 4:26 a.m.) Review request for mesos and Benjamin Bannier. Changes --- Addressed Benjamin's comment. Bugs: MESOS-9254 https://issues.apache.org/jira/browse/MESOS-9254 Repository: mesos Description --- This new configuration option controls how frequent a storage resource provider reconciles existing volumes and storage pools against its CSI plugin to detect new or missing disk resources. Diffs (updated) - include/mesos/mesos.proto 2b4f350815935220c2d2b0dd0e52346bc74c91d9 include/mesos/v1/mesos.proto bafc27499f810791700c4a30dcb1da33b6f31d2e src/Makefile.am 761dde1d63e0f4f1ac4ab86f129f84f3746d3153 src/resource_provider/constants.hpp PRE-CREATION Diff: https://reviews.apache.org/r/71144/diff/2/ Changes: https://reviews.apache.org/r/71144/diff/1-2/ Testing --- Thanks, Chun-Hung Hsiao
Re: Review Request 71143: Moved default constants for CSI RPC retry to a new header.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71143/ --- (Updated July 26, 2019, 4:25 a.m.) Review request for mesos and Benjamin Bannier. Changes --- Rebased. Bugs: MESOS-9254 https://issues.apache.org/jira/browse/MESOS-9254 Repository: mesos Description --- Since the default constants for CSI RPC retry do not depend on CSI versions, these constants are pulled off from version-specific headers to a common header. Diffs (updated) - src/Makefile.am 761dde1d63e0f4f1ac4ab86f129f84f3746d3153 src/csi/constants.hpp PRE-CREATION src/csi/v0_volume_manager.cpp e19dc7c2933e2bbee5e817f3089bd569b259786b src/csi/v0_volume_manager_process.hpp 4cfb5b564af9e187a6e3d81f86964db288ae852e src/csi/v1_volume_manager.cpp e7e032988bce576ef8d6a0c3457f26f1aad9bb58 src/csi/v1_volume_manager_process.hpp 30788c3593d4b4bd6271705a1188f73836bc7a85 src/tests/storage_local_resource_provider_tests.cpp 38233053452259571743317326a6efc74d97bd29 Diff: https://reviews.apache.org/r/71143/diff/2/ Changes: https://reviews.apache.org/r/71143/diff/1-2/ Testing --- make check Thanks, Chun-Hung Hsiao
Review Request 71144: Added `reconciliation_interval_seconds` for storage resource providers.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71144/ --- Review request for mesos and Benjamin Bannier. Bugs: MESOS-9254 https://issues.apache.org/jira/browse/MESOS-9254 Repository: mesos Description --- This new configuration option controls how frequent a storage resource provider reconciles existing volumes and storage pools against its CSI plugin to detect new or missing disk resources. Diffs - include/mesos/mesos.proto 2b4f350815935220c2d2b0dd0e52346bc74c91d9 include/mesos/v1/mesos.proto bafc27499f810791700c4a30dcb1da33b6f31d2e src/Makefile.am 761dde1d63e0f4f1ac4ab86f129f84f3746d3153 src/resource_provider/constants.hpp PRE-CREATION Diff: https://reviews.apache.org/r/71144/diff/1/ Testing --- Thanks, Chun-Hung Hsiao
Review Request 71143: Moved default constants for CSI RPC retry to a new header.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71143/ --- Review request for mesos and Benjamin Bannier. Bugs: MESOS-9254 https://issues.apache.org/jira/browse/MESOS-9254 Repository: mesos Description --- Since the default constants for CSI RPC retry do not depend on CSI versions, these constants are pulled off from version-specific headers to a common header. Diffs - src/Makefile.am 761dde1d63e0f4f1ac4ab86f129f84f3746d3153 src/csi/constants.hpp PRE-CREATION src/csi/v0_volume_manager.cpp e19dc7c2933e2bbee5e817f3089bd569b259786b src/csi/v0_volume_manager_process.hpp 4cfb5b564af9e187a6e3d81f86964db288ae852e src/csi/v1_volume_manager.cpp e7e032988bce576ef8d6a0c3457f26f1aad9bb58 src/csi/v1_volume_manager_process.hpp 30788c3593d4b4bd6271705a1188f73836bc7a85 src/tests/storage_local_resource_provider_tests.cpp 38233053452259571743317326a6efc74d97bd29 Diff: https://reviews.apache.org/r/71143/diff/1/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 70728: Backed `MockResourceProvider` by a process.
thPendingOffers`. src/tests/resource_provider_manager_tests.cpp Line 1167 (original), 1167 (patched) <https://reviews.apache.org/r/70728/#comment303959> We need to have a proper synchronization here to avoid data races. See my comments in `UpdateSlaveMessageWithPendingOffers`. src/tests/resource_provider_manager_tests.cpp Line 1195 (original), 1195 (patched) <https://reviews.apache.org/r/70728/#comment303970> We're accessing the local variable `resourceProvider` in its process context, which seems error prone. How about having a `MockResourceProviderProcess::stop` function that resets its driver, and invoke it here: ``` Invoke(*resourceProvder->process, ::stop) ``` src/tests/resource_provider_manager_tests.cpp Line 1272 (original), 1272 (patched) <https://reviews.apache.org/r/70728/#comment303971> Let's avoid accessing the local variable `resourceProvider` in its process context: ``` Invoke(*resourceProvder->process, ::stop) ``` See my comments in `ResubscribeResourceProvider`. src/tests/resource_provider_manager_tests.cpp Line 1342 (original), 1342 (patched) <https://reviews.apache.org/r/70728/#comment303960> We need to have a proper synchronization here to avoid data races. See my comments in `UpdateSlaveMessageWithPendingOffers`. src/tests/resource_provider_manager_tests.cpp Line 1421 (original), 1420 (patched) <https://reviews.apache.org/r/70728/#comment303972> Let's avoid accessing the local variable `resourceProvider1` in its process context: ``` Invoke(*resourceProvder1->process, ::stop) ``` See my comments in `ResubscribeResourceProvider`. src/tests/slave_tests.cpp Line 10835 (original), 10836 (patched) <https://reviews.apache.org/r/70728/#comment303962> In this test, `applyOperation` being ready could indicate that `subscribed` has been called, which means that `info` has been set up, so no data race here. However, it would be nice to make these more explicit and avoid direct accesses to `process->info` in different threads. See my comments in `UpdateSlaveMessageWithPendingOffers`. src/tests/slave_tests.cpp Lines 10870 (patched) <https://reviews.apache.org/r/70728/#comment303964> Let's use `resourceProviderId` here instead. src/tests/slave_tests.cpp Line 11256 (original), 11258 (patched) <https://reviews.apache.org/r/70728/#comment303965> In this test, `operation` being ready could indicate that `subscribed` has been called, which means that `info` has been set up, so no data race here. However, it would be nice to make these more explicit and avoid direct accesses to `process->info` in the remaining of this test. See my comments in `UpdateSlaveMessageWithPendingOffers`. src/tests/slave_tests.cpp Line 11402 (original), 11406 (patched) <https://reviews.apache.org/r/70728/#comment303967> s/`CHECK`/`ASSERT_TRUE`/. We need to have a proper synchronization here to avoid data races. See my comments in `UpdateSlaveMessageWithPendingOffers`. Also, let's avoid accessing `process->info` directly in the remaining of this test. src/tests/slave_tests.cpp Line 11769 (original), 11775 (patched) <https://reviews.apache.org/r/70728/#comment303975> Let's make `resourceProvider` an `Owned` or `unique_ptr`, and do `resourceProvider.reset()` instead. - Chun-Hung Hsiao On May 27, 2019, 4:32 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70728/ > --- > > (Updated May 27, 2019, 4:32 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Jan Schlicht. > > > Bugs: MESOS-9560 > https://issues.apache.org/jira/browse/MESOS-9560 > > > Repository: mesos > > > Description > --- > > We were passing callbacks into `MockResourceProvider` to the HTTP > driver. Since the lifecycle of the callbacks and the mock provider were > decoupled and these callbacks were binding the mock provider instance > the code was not safe as written as the driver could invoke the callback > after the provider had been destructed. > > This patch makes sure that the callbacks are defered to a process. We > also dispatch a number of other functions which are strongly coupled to > the lifecycle of the provider. We still do not hide the provider away > completely so the provider can be mocked in tests. > > > Diffs > - > > src/tests/api_tests.cpp af1d215f00c8c2224e807677afb4af2d3521235a > src/tests/master_slave_reconciliation_tests.cpp > 7b6ac50adacc8416b91c0dde55ff7ba46a20515c > src/tests/master_tests.cpp 097f1b77a59e29c6690210773d1556ebf2bb701e > src/tests/mesos.hpp d
Review Request 71018: Fixed a race between status updates and acknowledgements in SLRP tests.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71018/ --- Review request for mesos, Benjamin Bannier, Benjamin Mahler, and Greg Mann. Bugs: MESOS-9881 https://issues.apache.org/jira/browse/MESOS-9881 Repository: mesos Description --- SLRP tests `RetryOperationStatusUpdate*` check that no operation update will be sent by SLRP once acknowledgements are sent by master. However, since the acknowledgements are delivered via HTTP, it is possible that operation status updates race with acknowledgements that are still in HTTP pipes. This patch fixes this race condition. Diffs - src/tests/storage_local_resource_provider_tests.cpp 38233053452259571743317326a6efc74d97bd29 Diff: https://reviews.apache.org/r/71018/diff/1/ Testing --- Ran the tests under stress 200 times. Thanks, Chun-Hung Hsiao
Re: Review Request 70788: Garbage-collected disappeared RPs when agent resources remain unchanged.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70788/ --- (Updated June 6, 2019, 6:35 p.m.) Review request for mesos and Benjamin Bannier. Changes --- Addressed Benjamin's comment. Bugs: MESOS-9831 https://issues.apache.org/jira/browse/MESOS-9831 Repository: mesos Description --- Previously when there is a missing resource provider in an `UpdateSlaveMessage` but the agent's total resources remain unchanged, the update message will be completely ignored, so the missing resource provider will still be cached in the master's state, which is not the desired behavior. This patch ensures that the master's state gets updated if any resource provider is missing. Diffs (updated) - src/master/master.cpp 4d7c37cf8814e45432b3fe15173f5343676a372b src/tests/api_tests.cpp 37d0cb11ddb24b32f1e83feefe168f2aa2eb659c Diff: https://reviews.apache.org/r/70788/diff/2/ Changes: https://reviews.apache.org/r/70788/diff/1-2/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 70766: Fixed chaining futures infinitely in `UriDiskProfileAdaptor`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70766/ --- (Updated June 6, 2019, 4:58 a.m.) Review request for mesos, Benjamin Bannier and Benjamin Mahler. Changes --- Made `watchers` a vector and avoided the use of iterators. Bugs: MESOS-9803 https://issues.apache.org/jira/browse/MESOS-9803 Repository: mesos Description --- Previously it is possible to have an infinite chain of futures when `UriDiskProfileAdaptor::watch` is called: if the set of profiles remains fixed for every poll, each poll would satisfy a promise that triggers an asynchronous recursive call to `UriDiskProfileAdaptor::watch` again. This patch fixes the problem by removing the asynchronous recursion. Instead, we maintain a separated promise for each watcher that is never associated to another promise. After each poll, we check if the current set of profiles differs from the known set for a watcher, and satisfy its own promise if so. Diffs (updated) - src/resource_provider/storage/uri_disk_profile_adaptor.hpp a5a34dc4dc1d518dc69aeb15fe62bd124d828ed3 src/resource_provider/storage/uri_disk_profile_adaptor.cpp 215f7f92b5c2a0e60555134ce9887e8a187e3b1d Diff: https://reviews.apache.org/r/70766/diff/2/ Changes: https://reviews.apache.org/r/70766/diff/1-2/ Testing --- make check Manually tested with the unit test in r/70764. The unit test will fail at the 5th poll without the fix and will pass with the fix. Thanks, Chun-Hung Hsiao
Review Request 70788: Garbage-collected disappeared RPs when agent resources remain unchanged.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70788/ --- Review request for mesos and Benjamin Bannier. Bugs: MESOS-9831 https://issues.apache.org/jira/browse/MESOS-9831 Repository: mesos Description --- Previously when there is a missing resource provider in an `UpdateSlaveMessage` but the agent's total resources remain unchanged, the update message will be completely ignored, so the missing resource provider will still be cached in the master's state, which is not the desired behavior. This patch ensures that the master's state gets updated if any resource provider is missing. Diffs - src/master/master.cpp 4d7c37cf8814e45432b3fe15173f5343676a372b src/tests/api_tests.cpp 37d0cb11ddb24b32f1e83feefe168f2aa2eb659c Diff: https://reviews.apache.org/r/70788/diff/1/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 70728: Backed `MockResourceProvider` by a process.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70728/#review215705 --- s/completelt/completely/ in the description. src/tests/master_tests.cpp Line 9017 (original), 9017 (patched) <https://reviews.apache.org/r/70728/#comment302518> Not yours, but it seems to me that the dequeueing of the `UpdateSlaveMessage` in the master does not causally happen before the `subscribedDefault` call (which sets up the resource provider ID). - Chun-Hung Hsiao On May 27, 2019, 4:32 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70728/ > --- > > (Updated May 27, 2019, 4:32 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Jan Schlicht. > > > Repository: mesos > > > Description > --- > > We were passing callbacks into `MockResourceProvider` to the HTTP > driver. Since the lifecycle of the callbacks and the mock provider were > decoupled and these callbacks were binding the mock provider instance > the code was not safe as written as the driver could invoke the callback > after the provider had been destructed. > > This patch makes sure that the callbacks are defered to a process. We > also dispatch a number of other functions which are strongly coupled to > the lifecycle of the provider. We still do not hide the provider away > completelt so the provider can be mocked in tests. > > > Diffs > - > > src/tests/api_tests.cpp 2220cecc22778a86f0c29317adf495927e1a900d > src/tests/master_slave_reconciliation_tests.cpp > 7b6ac50adacc8416b91c0dde55ff7ba46a20515c > src/tests/master_tests.cpp 097f1b77a59e29c6690210773d1556ebf2bb701e > src/tests/mesos.hpp d0c82fa4c1800ed2a1825d563a9462ecd41b77fd > src/tests/operation_reconciliation_tests.cpp > eae318da2273faae904f0155e49bb23cdb24f007 > src/tests/resource_provider_manager_tests.cpp > 7d48f18e89f046df6c92e52edeef592acfb13b10 > src/tests/slave_tests.cpp c2035976713abb31b3646c0d23771fa40df93271 > > > Diff: https://reviews.apache.org/r/70728/diff/1/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Review Request 70765: Renamed struct `ProfileRecord` to `ProfileData` for consistency.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70765/ --- Review request for mesos, Benjamin Bannier and Benjamin Mahler. Repository: mesos Description --- Renamed struct `ProfileRecord` to `ProfileData` for consistency. Diffs - src/resource_provider/storage/uri_disk_profile_adaptor.hpp a5a34dc4dc1d518dc69aeb15fe62bd124d828ed3 src/resource_provider/storage/uri_disk_profile_adaptor.cpp 215f7f92b5c2a0e60555134ce9887e8a187e3b1d Diff: https://reviews.apache.org/r/70765/diff/1/ Testing --- make check Thanks, Chun-Hung Hsiao
Review Request 70766: Fixed chaining futures infinitely in `UriDiskProfileAdaptor`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70766/ --- Review request for mesos, Benjamin Bannier and Benjamin Mahler. Bugs: MESOS-9803 https://issues.apache.org/jira/browse/MESOS-9803 Repository: mesos Description --- Previously it is possible to have an infinite chain of futures when `UriDiskProfileAdaptor::watch` is called: if the set of profiles remains fixed for every poll, each poll would satisfy a promise that triggers an asynchronous recursive call to `UriDiskProfileAdaptor::watch` again. This patch fixes the problem by removing the asynchronous recursion. Instead, we maintain a separated promise for each watcher that is never associated to another promise. After each poll, we check if the current set of profiles differs from the known set for a watcher, and satisfy its own promise if so. Diffs - src/resource_provider/storage/uri_disk_profile_adaptor.hpp a5a34dc4dc1d518dc69aeb15fe62bd124d828ed3 src/resource_provider/storage/uri_disk_profile_adaptor.cpp 215f7f92b5c2a0e60555134ce9887e8a187e3b1d Diff: https://reviews.apache.org/r/70766/diff/1/ Testing --- make check Manually tested with the unit test in r/70764. The unit test will fail at the 5th poll without the fix and will pass with the fix. Thanks, Chun-Hung Hsiao
Re: Review Request 70740: Windows: Removed multiple categories of sources from the build.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70740/#review215558 --- src/CMakeLists.txt Lines 32 (patched) <https://reviews.apache.org/r/70740/#comment302287> Why are you guarding just the CSI v1 but not v0? They're different from the Mesos v0 vs v1 API protos. They are different APIs. Let's just disable CSI on Windows and guard both for now. src/CMakeLists.txt Lines 61 (patched) <https://reviews.apache.org/r/70740/#comment302288> IIRC Windows uses some v2s2 spec? https://github.com/apache/mesos/blob/fa410f2fb8efb988590f4da2d4cfffbb2ce70637/src/uri/fetchers/docker.cpp#L974 - Chun-Hung Hsiao On May 28, 2019, 11:11 p.m., Joseph Wu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70740/ > --- > > (Updated May 28, 2019, 11:11 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Gilbert Song. > > > Repository: mesos > > > Description > --- > > This removes some categories of sources from the Windows build, > where it is possible to do so with minimal ifdef-ing. > > The features removed are all Linux-specific features that cannot be > feasibly ported to Windows, including: > * Container Storage Interface (CSI) > * Docker image provisioning, specifically related to V2 > * Open Container Interface > * Volume GID Manager > > Protobufs are excluded where possible, but many of the above categories > of protobufs are interleaved with other protobufs or source code, > which makes exclusion non-trivial. For example, CSI V0 protobufs > cannot be excluded without a large change; or libseccomp is a Linux-only > feature, but its protobufs are now required to build the Mesos > containerizer's protobufs. > > Docker image provisioning was semi-trivial to exclude, because the > related components (provisioner & URI fetcher) are somewhat modularized. > > > Diffs > - > > src/CMakeLists.txt 1d4f541b73c07a307a8b61f217e0cfad5dc095e4 > src/slave/containerizer/mesos/containerizer.cpp > 043244841a73fa3f5f7119bc38f6d3a04be8990b > src/slave/containerizer/mesos/provisioner/store.cpp > 11fce0eb47e9e6dfce6289afe04a1d58a0c4461a > src/tests/CMakeLists.txt e6b1d8a097246f0921467d7f778034d6754fcff6 > src/tests/uri_fetcher_tests.cpp c727cc52e82a396fe187a00c8cc3c9e78a919c5d > src/uri/fetcher.hpp cc4bd93b3b8bcb7803f8f912f4ad9d3cf45a58a9 > src/uri/fetcher.cpp 8db43eb9763f1cf8040db93a1f03aae0fe9ab3c7 > > > Diff: https://reviews.apache.org/r/70740/diff/1/ > > > Testing > --- > > cmake --build . --target check > > This slightly decreases the memory footprint of the build, and allowed my > build instance (4GB mem) to proceed beyond some agent files (which is where > the Windows CI is also running out of memory). It still ran out of memory > when compiling tests however. After giving the instance more memory (8GB), > the build succeeds. > > > Thanks, > > Joseph Wu > >
Re: Review Request 70651: Notifies master `/api/v1` subscribers about recovered frameworks.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70651/ --- (Updated May 23, 2019, 4:17 a.m.) Review request for mesos and Greg Mann. Bugs: MESOS-9785 https://issues.apache.org/jira/browse/MESOS-9785 Repository: mesos Description --- If one subscribes to master's `/api/v1` endpoint after a master failover but before an agent reregistration, frameworks recovered through the agent registration should be notified to the subscriber, otherwise recovered tasks will have framework IDs referring to frameworks unknown to the subscriber. Diffs - src/common/protobuf_utils.cpp 8b252cb11e17356836988dfc44a63953579a1def src/master/master.cpp 4d7c37cf8814e45432b3fe15173f5343676a372b src/tests/api_tests.cpp 37d0cb11ddb24b32f1e83feefe168f2aa2eb659c Diff: https://reviews.apache.org/r/70651/diff/3/ Testing --- make check Thanks, Chun-Hung Hsiao
Review Request 70702: Serialized all events to master's `/api/v1` subscribers.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70702/ --- Review request for mesos, Benjamin Mahler and Greg Mann. Bugs: MESOS-9785 https://issues.apache.org/jira/browse/MESOS-9785 Repository: mesos Description --- The master needs to create object approvers before sending an event to its `/api/v1` subscribers. The creation calls `process::collect`, which does not have any ordering guarantee. As a result, events might be reordered, which could be unexpected by subscribers. This patch imposes an order between events by serializing the creation of object approvers. The actual creations can still go in parallel, but the returned futures will be completed in the creation order. Diffs - src/master/master.hpp 2771f4c045c877b7d8aa5db042810232c0e40ba0 src/master/master.cpp 4d7c37cf8814e45432b3fe15173f5343676a372b Diff: https://reviews.apache.org/r/70702/diff/1/ Testing --- make check Ran the test updated in r/70651 for 500 times under stress. Thanks, Chun-Hung Hsiao
Re: Review Request 70651: Notifies master `/api/v1` subscribers about recovered frameworks.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70651/ --- (Updated May 22, 2019, 9:41 p.m.) Review request for mesos and Greg Mann. Changes --- Addressed Greg's comments. Bugs: MESOS-9785 https://issues.apache.org/jira/browse/MESOS-9785 Repository: mesos Description --- If one subscribes to master's `/api/v1` endpoint after a master failover but before an agent reregistration, frameworks recovered through the agent registration should be notified to the subscriber, otherwise recovered tasks will have framework IDs referring to frameworks unknown to the subscriber. Diffs (updated) - src/common/protobuf_utils.cpp 8b252cb11e17356836988dfc44a63953579a1def src/master/master.cpp 4d7c37cf8814e45432b3fe15173f5343676a372b src/tests/api_tests.cpp 37d0cb11ddb24b32f1e83feefe168f2aa2eb659c Diff: https://reviews.apache.org/r/70651/diff/2/ Changes: https://reviews.apache.org/r/70651/diff/1-2/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 70651: Notifies master `/api/v1` subscribers about recovered frameworks.
> On May 17, 2019, 10:16 a.m., Greg Mann wrote: > > src/tests/api_tests.cpp > > Lines 2774 (patched) > > <https://reviews.apache.org/r/70651/diff/1/?file=2145475#file2145475line2777> > > > > Nit: make this a const ref? > > Chun-Hung Hsiao wrote: > This is copied because `event` is overwritten later. Do you think that > it's okay to use a ref, since we don't access to it afterwards? > > Greg Mann wrote: > As long as `agentAdded` isn't modified, we can make it a const ref? I was > suggesting we do the same thing with `agentAdded` that we do previously in > this test with `getState`: > ``` > const v1::master::Response::GetState& getState = > event->get().subscribed().get_state(); > ``` > > Chun-Hung Hsiao wrote: > The problem is that `event->get()` will be destructed once `event` is > overwritten, so the ref, const or not, will no longer be valid. So making a > copy is a bit safer (if anyone touches this test in the future). > > Greg Mann wrote: > But it looks like we already use the const ref pattern in this test? It > seems best to me to just be consistent, but I don't have strong feelings. If > you don't want to change it, you can drop the issues. Given that it's not very likely we'll change this test to access these references in the future, I'll take your suggestion for consistency :) - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70651/#review215319 --- On May 16, 2019, 5:11 a.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70651/ > --- > > (Updated May 16, 2019, 5:11 a.m.) > > > Review request for mesos and Greg Mann. > > > Bugs: MESOS-9785 > https://issues.apache.org/jira/browse/MESOS-9785 > > > Repository: mesos > > > Description > --- > > If one subscribes to master's `/api/v1` endpoint after a master failover > but before an agent reregistration, frameworks recovered through the > agent registration should be notified to the subscriber, otherwise > recovered tasks will have framework IDs referring to frameworks unknown > to the subscriber. > > > Diffs > - > > src/common/protobuf_utils.cpp 8b252cb11e17356836988dfc44a63953579a1def > src/master/master.cpp 6c0e30bff3fbfe0ad1f3af44eb3df6f82048f0e1 > src/tests/api_tests.cpp bc19d7e9661f091aebf3072967548c7c3196239d > > > Diff: https://reviews.apache.org/r/70651/diff/1/ > > > Testing > --- > > make check > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 70686: Added a unit test for master operation authorization.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70686/ --- (Updated May 22, 2019, 4:09 a.m.) Review request for mesos, Benjamin Bannier, Gastón Kleiman, and Greg Mann. Changes --- Addressed Benjamin's comments. Bugs: MESOS-9485 https://issues.apache.org/jira/browse/MESOS-9485 Repository: mesos Description --- This test verifies that allowing or denying an action will only result in a success or failure on specific operations but not other operations in an accept call. This is a regression test for MESOS-9474 and MESOS-9480. Diffs (updated) - src/slave/slave.hpp c740bf7c4af1699dd71a31f9169d29c1b4adb40d src/tests/master_authorization_tests.cpp f65b6210a8189ec5e9089bce845a46d325253058 src/tests/mesos.hpp e70e0e1bbebef41557f9bb1f26c379357fc74e2a src/tests/mock_slave.hpp 326a450eb2ae9912a19c6e0220e80f74d2953add src/tests/mock_slave.cpp 1122c2a8b56b9d4e5deea4e892d81061f5e39a8d Diff: https://reviews.apache.org/r/70686/diff/2/ Changes: https://reviews.apache.org/r/70686/diff/1-2/ Testing --- make check Ran the unit test under stress for 500 iterations. Thanks, Chun-Hung Hsiao
Re: Review Request 70686: Added a unit test for master operation authorization.
> On May 21, 2019, 12:07 p.m., Benjamin Bannier wrote: > > src/tests/master_authorization_tests.cpp > > Lines 3317-3320 (patched) > > <https://reviews.apache.org/r/70686/diff/1/?file=2145964#file2145964line3317> > > > > If you get rid of the `vector` `evolve` function in r/70683/ we'd > > create a temporary `RepeatedPtrField` and convert when creating the accept > > call. > > > > > > RepeatedPtrField operations_ = > > evolve( > > {operations.begin(), operations.end()}); > > > > mesos.send(v1::createCallAccept( > > subscribed->framework_id(), > > offers->offers(0), > > {operations_.begin(), operations_.end()})); > > Instead of doing this, the operations are now created with v1 proto messages, since operation feeback is only supported in v1. As a result, we no longer need the vector evolve helper. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70686/#review215405 --- On May 20, 2019, 10:37 p.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70686/ > --- > > (Updated May 20, 2019, 10:37 p.m.) > > > Review request for mesos, Benjamin Bannier, Gastón Kleiman, and Greg Mann. > > > Bugs: MESOS-9485 > https://issues.apache.org/jira/browse/MESOS-9485 > > > Repository: mesos > > > Description > --- > > This test verifies that allowing or denying an action will only result > in a success or failure on specific operations but not other operations > in an accept call. This is a regression test for MESOS-9474 and > MESOS-9480. > > > Diffs > - > > src/slave/slave.hpp c740bf7c4af1699dd71a31f9169d29c1b4adb40d > src/tests/master_authorization_tests.cpp > f65b6210a8189ec5e9089bce845a46d325253058 > src/tests/mesos.hpp e70e0e1bbebef41557f9bb1f26c379357fc74e2a > src/tests/mock_slave.hpp 326a450eb2ae9912a19c6e0220e80f74d2953add > src/tests/mock_slave.cpp 1122c2a8b56b9d4e5deea4e892d81061f5e39a8d > > > Diff: https://reviews.apache.org/r/70686/diff/1/ > > > Testing > --- > > make check > > Ran the unit test under stress for 500 iterations. > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 70683: Added helpers for evolving operation-related proto messages.
> On May 21, 2019, 12:07 p.m., Benjamin Bannier wrote: > > src/internal/evolve.hpp > > Lines 106-114 (original), 122-133 (patched) > > <https://reviews.apache.org/r/70683/diff/1/?file=2145951#file2145951line122> > > > > This helper doesn't seem to fit well, and we'd only use this helper in > > a single spot later on in the chain. > > > > I'd suggest to remove it here and just manually adjust in the later > > patch. Removed all evolve helpers and added a devolve helper instead. In the test introduced in the later patch, all operations are generated with v1 proto messages, since operation feedback is only supported in v1. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70683/#review215403 --- On May 20, 2019, 10:30 p.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70683/ > --- > > (Updated May 20, 2019, 10:30 p.m.) > > > Review request for mesos, Benjamin Bannier and Greg Mann. > > > Bugs: MESOS-9485 > https://issues.apache.org/jira/browse/MESOS-9485 > > > Repository: mesos > > > Description > --- > > Added helpers for evolving operation-related proto messages. > > > Diffs > - > > src/internal/evolve.hpp 1044d9df75b6fc1f60d3704be9cb5751e6d4321d > src/internal/evolve.cpp 19c155967bf090fb2ec39211805ff1385787ab59 > > > Diff: https://reviews.apache.org/r/70683/diff/1/ > > > Testing > --- > > make check > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 70683: Added a helper to devolve offer operations.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70683/ --- (Updated May 22, 2019, 4:06 a.m.) Review request for mesos, Benjamin Bannier and Greg Mann. Changes --- Addressed Benjamin's comment. Summary (updated) - Added a helper to devolve offer operations. Bugs: MESOS-9485 https://issues.apache.org/jira/browse/MESOS-9485 Repository: mesos Description (updated) --- Added a helper to devolve offer operations. Diffs (updated) - src/internal/devolve.hpp a1f8d8d333c5304b4d3795800485fdbfc14d4455 src/internal/devolve.cpp e23ed3ccf6c05f5b8ea91788788469250c70248c Diff: https://reviews.apache.org/r/70683/diff/2/ Changes: https://reviews.apache.org/r/70683/diff/1-2/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 70622: Added a unit test to verify if SLRP allows changes in volume context.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70622/ --- (Updated May 22, 2019, 3:15 a.m.) Review request for mesos and Benjamin Bannier. Changes --- Addressed Benjamin's comments. Bugs: MESOS-9395 https://issues.apache.org/jira/browse/MESOS-9395 Repository: mesos Description --- Added a unit test to verify if SLRP allows changes in volume context. Diffs (updated) - src/examples/test_csi_plugin.cpp 03f782ead136a79c4c5b099496072933f6737598 src/tests/storage_local_resource_provider_tests.cpp ecd7aeef1cb3d1a5b4b3419dfd912d41a8c6 Diff: https://reviews.apache.org/r/70622/diff/2/ Changes: https://reviews.apache.org/r/70622/diff/1-2/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 70621: Used full paths as volume IDs for the test CSI plugin.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70621/ --- (Updated May 22, 2019, 3:03 a.m.) Review request for mesos and Benjamin Bannier. Changes --- Addressed Benjamin's comments. Bugs: MESOS-9395 https://issues.apache.org/jira/browse/MESOS-9395 Repository: mesos Description --- The full paths of simulated volumes are now in their ID instead of their contextual information. This simplifies SLRP tests, and makes it cleaner if we want to customize the contextual information in the future. Diffs (updated) - src/examples/test_csi_plugin.cpp 03f782ead136a79c4c5b099496072933f6737598 src/tests/storage_local_resource_provider_tests.cpp ecd7aeef1cb3d1a5b4b3419dfd912d41a8c6 Diff: https://reviews.apache.org/r/70621/diff/2/ Changes: https://reviews.apache.org/r/70621/diff/1-2/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 70620: Made SLRP allow changes in volume context.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70620/ --- (Updated May 22, 2019, 2:58 a.m.) Review request for mesos, Benjamin Bannier and James DeFelice. Changes --- Addressed Benjamin's comments. Bugs: MESOS-9395 https://issues.apache.org/jira/browse/MESOS-9395 Repository: mesos Description --- To make SLRP more robust against non-conforming CSI plugins that change volume contexts, the `getExistVolumes` method returns a list of resource conversions consisting of one for converting old volume contexts to new volume contexts, and one to remove missing volumes and add new volumes. To make the interfaces consistent, `getStoragePools` now also returns a list of resource conversions consisting of one conversion. Diffs (updated) - include/mesos/mesos.proto dc6a87f14864cb4d46b9856f251c2946a475162d include/mesos/v1/mesos.proto e8086e0f70cac73876a8ae31db3365b5059b5c44 src/resource_provider/storage/provider.cpp 999fe95bb6f38f5a25068accd854b37788b24028 Diff: https://reviews.apache.org/r/70620/diff/2/ Changes: https://reviews.apache.org/r/70620/diff/1-2/ Testing --- sudo make check more testing done later in chain Thanks, Chun-Hung Hsiao
Re: Review Request 70621: Used full paths as volume IDs for the test CSI plugin.
> On May 14, 2019, 11:37 a.m., Benjamin Bannier wrote: > > src/examples/test_csi_plugin.cpp > > Line 1253 (original), 1244 (patched) > > <https://reviews.apache.org/r/70621/diff/1/?file=2144329#file2144329line1267> > > > > It would be great to have some testing that `getVolumePath -> > > parseVolumePath` roundtrips. Let's either add a simple dedicated test or, > > since this is test code, just add an assertion in e.g., `getVolumePath` > > that the created path would roundtrip. Sure will do. However this function does not decode volume names so extra work needs to be done. > On May 14, 2019, 11:37 a.m., Benjamin Bannier wrote: > > src/examples/test_csi_plugin.cpp > > Lines 1250 (patched) > > <https://reviews.apache.org/r/70621/diff/1/?file=2144329#file2144329line1274> > > > > Can we do that "normalization" when we initialize `workDir` instead? Doing that seems to increase the coupling between the initialization of `workDir`, which happens in the constructor, and the code here. Dropping. > On May 14, 2019, 11:37 a.m., Benjamin Bannier wrote: > > src/examples/test_csi_plugin.cpp > > Line 1269 (original), 1269 (patched) > > <https://reviews.apache.org/r/70621/diff/1/?file=2144329#file2144329line1296> > > > > We are leaving `id` default-initialized here. No it's the context which got default-initialized. Let me make it explicit here. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70621/#review215248 --- On May 10, 2019, 1:17 a.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70621/ > --- > > (Updated May 10, 2019, 1:17 a.m.) > > > Review request for mesos and Benjamin Bannier. > > > Bugs: MESOS-9395 > https://issues.apache.org/jira/browse/MESOS-9395 > > > Repository: mesos > > > Description > --- > > The full paths of simulated volumes are now in their ID instead of their > contextual information. This simplifies SLRP tests, and makes it cleaner > if we want to customize the contextual information in the future. > > > Diffs > - > > src/examples/test_csi_plugin.cpp 03f782ead136a79c4c5b099496072933f6737598 > src/tests/storage_local_resource_provider_tests.cpp > ecd7aeef1cb3d1a5b4b3419dfd912d41a8c6 > > > Diff: https://reviews.apache.org/r/70621/diff/1/ > > > Testing > --- > > make check > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 70620: Made SLRP allow changes in volume context.
> On May 14, 2019, 10:04 a.m., Benjamin Bannier wrote: > > src/resource_provider/storage/provider.cpp > > Line 923 (original), 930 (patched) > > <https://reviews.apache.org/r/70620/diff/1/?file=2144328#file2144328line930> > > > > Let's create a dedicated error message. Do you mean printing out an error or making it a hard failure? In theory this should never fail and that's why I make it a hard assertion here. > On May 14, 2019, 10:04 a.m., Benjamin Bannier wrote: > > src/resource_provider/storage/provider.cpp > > Lines 1067 (patched) > > <https://reviews.apache.org/r/70620/diff/1/?file=2144328#file2144328line1076> > > > > I don't think we need optional semantics here, I'd suggest to just work > > with an empty `Labels` here. > > > > If you want you can convert an empty `Labels` to a `None` when invoking > > `createRawDiskResource` below. I think even that function might not need > > optional semantics for labels though. This is for maintaining the following proto2 <-> proto3 semantic, since proto3 doesn't have an OPTIONAL semantic, instead any field w/ the default value will not go onto the wire: Resource.DiskInfo.Source.has_metadata() <=> !VolumeInfo.context.empty() Since in practice there's no difference, we can change this to set up `metadata`to an empty `Labels` even if the `context` map is empty, as long as we handle backward compatibility carefully. But if we don't do this, it seems more reasonable to me to pass in a `None` here than passing a `Labels` here and doing a special check in another function. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70620/#review215220 --- On May 10, 2019, 1:14 a.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70620/ > --- > > (Updated May 10, 2019, 1:14 a.m.) > > > Review request for mesos, Benjamin Bannier and James DeFelice. > > > Bugs: MESOS-9395 > https://issues.apache.org/jira/browse/MESOS-9395 > > > Repository: mesos > > > Description > --- > > To make SLRP more robust against non-conforming CSI plugins that change > volume contexts, the `getExistVolumes` method returns a list of resource > conversions consisting of one for converting old volume contexts to new > volume contexts, and one to remove missing volumes and add new volumes. > > To make the interfaces consistent, `getStoragePools` now also returns a > list of resource conversions consisting of one conversion. > > > Diffs > - > > include/mesos/mesos.proto dc6a87f14864cb4d46b9856f251c2946a475162d > include/mesos/v1/mesos.proto e8086e0f70cac73876a8ae31db3365b5059b5c44 > src/resource_provider/storage/provider.cpp > 999fe95bb6f38f5a25068accd854b37788b24028 > > > Diff: https://reviews.apache.org/r/70620/diff/1/ > > > Testing > --- > > sudo make check > more testing done later in chain > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 70620: Made SLRP allow changes in volume context.
> On May 12, 2019, 12:44 a.m., James DeFelice wrote: > > src/resource_provider/storage/provider.cpp > > Lines 1078 (patched) > > <https://reviews.apache.org/r/70620/diff/1/?file=2144328#file2144328line1087> > > > > how often are these conversions applied? i suppose it's every time that > > reconciliation happens - how often is that? Right now it only happens once when the SLRP is restarted. The plan is to make this happen periodically or on demand through https://issues.apache.org/jira/browse/MESOS-9254. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70620/#review215202 --- On May 10, 2019, 1:14 a.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70620/ > --- > > (Updated May 10, 2019, 1:14 a.m.) > > > Review request for mesos, Benjamin Bannier and James DeFelice. > > > Bugs: MESOS-9395 > https://issues.apache.org/jira/browse/MESOS-9395 > > > Repository: mesos > > > Description > --- > > To make SLRP more robust against non-conforming CSI plugins that change > volume contexts, the `getExistVolumes` method returns a list of resource > conversions consisting of one for converting old volume contexts to new > volume contexts, and one to remove missing volumes and add new volumes. > > To make the interfaces consistent, `getStoragePools` now also returns a > list of resource conversions consisting of one conversion. > > > Diffs > - > > include/mesos/mesos.proto dc6a87f14864cb4d46b9856f251c2946a475162d > include/mesos/v1/mesos.proto e8086e0f70cac73876a8ae31db3365b5059b5c44 > src/resource_provider/storage/provider.cpp > 999fe95bb6f38f5a25068accd854b37788b24028 > > > Diff: https://reviews.apache.org/r/70620/diff/1/ > > > Testing > --- > > sudo make check > more testing done later in chain > > > Thanks, > > Chun-Hung Hsiao > >
Review Request 70686: Added a unit test for master operation authorization.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70686/ --- Review request for mesos, Benjamin Bannier and Greg Mann. Bugs: MESOS-9485 https://issues.apache.org/jira/browse/MESOS-9485 Repository: mesos Description --- This test verifies that allowing or denying an action will only result in a success or failure on specific operations but not other operations in an accept call. This is a regression test for MESOS-9474 and MESOS-9480. Diffs - src/slave/slave.hpp c740bf7c4af1699dd71a31f9169d29c1b4adb40d src/tests/master_authorization_tests.cpp f65b6210a8189ec5e9089bce845a46d325253058 src/tests/mesos.hpp e70e0e1bbebef41557f9bb1f26c379357fc74e2a src/tests/mock_slave.hpp 326a450eb2ae9912a19c6e0220e80f74d2953add src/tests/mock_slave.cpp 1122c2a8b56b9d4e5deea4e892d81061f5e39a8d Diff: https://reviews.apache.org/r/70686/diff/1/ Testing --- make check Ran the unit test under stress for 500 iterations. Thanks, Chun-Hung Hsiao
Review Request 70685: Removed the `TaskStatusUpdateIsTerminalState` matcher.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70685/ --- Review request for mesos, Benjamin Bannier and Gilbert Song. Repository: mesos Description --- The `TaskStatusUpdateIsTerminalState` test matcher does not handle both v0 and v1 status updates. Since introducing such a matcher in the `v1::scheduler` namespace is inconsistent and it is not hard to use the built-in `Truly` matcher to implement the same functionality, this matcher is removed for now. Diffs - src/tests/containerizer/nvidia_gpu_isolator_tests.cpp 7f5bd8cd4c07b724fe2c49068bf327e5afe9ec41 src/tests/mesos.hpp e70e0e1bbebef41557f9bb1f26c379357fc74e2a Diff: https://reviews.apache.org/r/70685/diff/1/ Testing --- make check Thanks, Chun-Hung Hsiao
Review Request 70684: Changed the `*TaskIdEq` test matchers to take a `TaskID`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70684/ --- Review request for mesos, Benjamin Bannier and Greg Mann. Bugs: MESOS-9485 https://issues.apache.org/jira/browse/MESOS-9485 Repository: mesos Description --- The `TaskStatusTaskIdEq` and `TaskStatusUpdateTaskIdEq` matchers previously take a `TaskInfo`, which is not very intuitive and inconsistent with other matchers such as `OptionTaskHasTaskID`. By making these matchers take a `TaskID` we also reduce coupling of this matcher and the structure of `TaskInfo`. Diffs - src/tests/api_tests.cpp bc19d7e9661f091aebf3072967548c7c3196239d src/tests/containerizer/nvidia_gpu_isolator_tests.cpp 7f5bd8cd4c07b724fe2c49068bf327e5afe9ec41 src/tests/default_executor_tests.cpp 93d7a1cfa1a9df838426c8257b6158e3553b5037 src/tests/gc_tests.cpp a583f1d79a9dd9def24ccfeb9e49ea0392173420 src/tests/master_tests.cpp 964d935771a99efaee63187affe46b551146f310 src/tests/mesos.hpp e70e0e1bbebef41557f9bb1f26c379357fc74e2a src/tests/partition_tests.cpp 8148aff6bf1d9637d486b549d7f8c7124566d86c src/tests/slave_tests.cpp 50882a5aa1b20c234f3cec8366f346fa9bfd4f83 Diff: https://reviews.apache.org/r/70684/diff/1/ Testing --- make check Thanks, Chun-Hung Hsiao
Review Request 70683: Added helpers for evolving operation-related proto messages.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70683/ --- Review request for mesos, Benjamin Bannier and Greg Mann. Bugs: MESOS-9485 https://issues.apache.org/jira/browse/MESOS-9485 Repository: mesos Description --- Added helpers for evolving operation-related proto messages. Diffs - src/internal/evolve.hpp 1044d9df75b6fc1f60d3704be9cb5751e6d4321d src/internal/evolve.cpp 19c155967bf090fb2ec39211805ff1385787ab59 Diff: https://reviews.apache.org/r/70683/diff/1/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 70651: Notifies master `/api/v1` subscribers about recovered frameworks.
> On May 17, 2019, 10:16 a.m., Greg Mann wrote: > > src/tests/api_tests.cpp > > Lines 2774 (patched) > > <https://reviews.apache.org/r/70651/diff/1/?file=2145475#file2145475line2777> > > > > Nit: make this a const ref? > > Chun-Hung Hsiao wrote: > This is copied because `event` is overwritten later. Do you think that > it's okay to use a ref, since we don't access to it afterwards? > > Greg Mann wrote: > As long as `agentAdded` isn't modified, we can make it a const ref? I was > suggesting we do the same thing with `agentAdded` that we do previously in > this test with `getState`: > ``` > const v1::master::Response::GetState& getState = > event->get().subscribed().get_state(); > ``` The problem is that `event->get()` will be destructed once `event` is overwritten, so the ref, const or not, will no longer be valid. So making a copy is a bit safer (if anyone touches this test in the future). - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70651/#review215319 ------- On May 16, 2019, 5:11 a.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70651/ > --- > > (Updated May 16, 2019, 5:11 a.m.) > > > Review request for mesos and Greg Mann. > > > Bugs: MESOS-9785 > https://issues.apache.org/jira/browse/MESOS-9785 > > > Repository: mesos > > > Description > --- > > If one subscribes to master's `/api/v1` endpoint after a master failover > but before an agent reregistration, frameworks recovered through the > agent registration should be notified to the subscriber, otherwise > recovered tasks will have framework IDs referring to frameworks unknown > to the subscriber. > > > Diffs > - > > src/common/protobuf_utils.cpp 8b252cb11e17356836988dfc44a63953579a1def > src/master/master.cpp 6c0e30bff3fbfe0ad1f3af44eb3df6f82048f0e1 > src/tests/api_tests.cpp bc19d7e9661f091aebf3072967548c7c3196239d > > > Diff: https://reviews.apache.org/r/70651/diff/1/ > > > Testing > --- > > make check > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 70651: Notifies master `/api/v1` subscribers about recovered frameworks.
> On May 17, 2019, 10:16 a.m., Greg Mann wrote: > > src/tests/api_tests.cpp > > Lines 2774 (patched) > > <https://reviews.apache.org/r/70651/diff/1/?file=2145475#file2145475line2777> > > > > Nit: make this a const ref? This is copied because `event` is overwritten later. Do you think that it's okay to use a ref, since we don't access to it afterwards? - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70651/#review215319 ------- On May 16, 2019, 5:11 a.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70651/ > --- > > (Updated May 16, 2019, 5:11 a.m.) > > > Review request for mesos and Greg Mann. > > > Bugs: MESOS-9785 > https://issues.apache.org/jira/browse/MESOS-9785 > > > Repository: mesos > > > Description > --- > > If one subscribes to master's `/api/v1` endpoint after a master failover > but before an agent reregistration, frameworks recovered through the > agent registration should be notified to the subscriber, otherwise > recovered tasks will have framework IDs referring to frameworks unknown > to the subscriber. > > > Diffs > - > > src/common/protobuf_utils.cpp 8b252cb11e17356836988dfc44a63953579a1def > src/master/master.cpp 6c0e30bff3fbfe0ad1f3af44eb3df6f82048f0e1 > src/tests/api_tests.cpp bc19d7e9661f091aebf3072967548c7c3196239d > > > Diff: https://reviews.apache.org/r/70651/diff/1/ > > > Testing > --- > > make check > > > Thanks, > > Chun-Hung Hsiao > >
Review Request 70651: Notifies master `/api/v1` subscribers about recovered frameworks.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70651/ --- Review request for mesos and Greg Mann. Bugs: MESOS-9785 https://issues.apache.org/jira/browse/MESOS-9785 Repository: mesos Description --- If one subscribes to master's `/api/v1` endpoint after a master failover but before an agent reregistration, frameworks recovered through the agent registration should be notified to the subscriber, otherwise recovered tasks will have framework IDs referring to frameworks unknown to the subscriber. Diffs - src/common/protobuf_utils.cpp 8b252cb11e17356836988dfc44a63953579a1def src/master/master.cpp 6c0e30bff3fbfe0ad1f3af44eb3df6f82048f0e1 src/tests/api_tests.cpp bc19d7e9661f091aebf3072967548c7c3196239d Diff: https://reviews.apache.org/r/70651/diff/1/ Testing --- make check Thanks, Chun-Hung Hsiao
Review Request 70628: Return 409 if `UPDATE_RESOURCE_PROVIDER_CONFIG` names a missing config.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70628/ --- Review request for mesos, Benjamin Bannier, Greg Mann, and James DeFelice. Bugs: MESOS-9779 https://issues.apache.org/jira/browse/MESOS-9779 Repository: mesos Description --- Since 404 is returned if the API endpoint route is not set yet, this error code becomes ambiguous and makes clients hard to programmatically handle it. Therefore, the error code for specifying a missing config in this API call is changed to 409 Conflict. Diffs - include/mesos/agent/agent.proto ff408a4a1f144237e7425961aa3d11cba163e7c0 include/mesos/v1/agent/agent.proto 19d6c424365f8fd6e6d4a5bb7a61fbe5a66f031d src/slave/http.cpp 2c4e792d16ad4fd3303760a9db3cba4269152e7d src/tests/agent_resource_provider_config_api_tests.cpp 7185ac00b5137c7ab208b623e36d03ffd43aab6e Diff: https://reviews.apache.org/r/70628/diff/1/ Testing --- sudo make check Thanks, Chun-Hung Hsiao
Review Request 70627: Explicitly marked agent resource provider config calls as experimental.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70627/ --- Review request for mesos, Benjamin Bannier and James DeFelice. Repository: mesos Description --- The resource provider feature has been marked as experimental, so we should also call out the related config API calls are experimental. Diffs - include/mesos/agent/agent.proto ff408a4a1f144237e7425961aa3d11cba163e7c0 include/mesos/v1/agent/agent.proto 19d6c424365f8fd6e6d4a5bb7a61fbe5a66f031d Diff: https://reviews.apache.org/r/70627/diff/1/ Testing --- N/A Thanks, Chun-Hung Hsiao
Re: Review Request 70579: Added a Task Scheduler to simplify testing.
> On May 10, 2019, 3:31 a.m., Chun-Hung Hsiao wrote: > > src/tests/utils/task_scheduler.cpp > > Lines 289 (patched) > > <https://reviews.apache.org/r/70579/diff/1/?file=2142143#file2142143line289> > > > > ``` > > offers_.erase(std::remove_if(...), offers_.end()); > > ``` > > Although the effect is the same since there will be only one element > > removed, pairing `std::remove_if` and `end` will be more idomatic. > > Benjamin Mahler wrote: > Hm.. why is this an "issue" and why is that more idiomatic..? Erase has > two versions, one take a position and one takes a range. > > Looking at this code again, it's rather unreadable, and I probably will > just write the version with a loop to find the one to erase. The reason it's more idiomatic is because `it = std::remove_if(...)` returns an iterator that points to the new end of valid items, which means that, `[it, offers_.end())` should be removed. Since internally `std::remove_if` moves items past `it` to their "correct" positions before `it`, erasing only `it` but not things after it seems very strange at a first glance. Calling `erase` with one iterator works here because we're relying on the fact that `std::remove_if` will only remove one item in this particular case, which is not trivial to me (before looking into what `offers_` contains and when the lambda returns true. If you prefer to use `erase` with one position, how about the following: ``` offers_.erase(std::find_if( offers_.begin(), offers_.end(), [&](const v1::Offer& offer) { ... })); ``` Also, I just noticed that both the snippet you have and the one I wrote above relies on the invariant where `event.rescind().offer_id()` exists in `offers_`. `vector::erase(pos)` requires that the iterator passed in should be both valid and deferenceable. If there are more than one agents, then this invariant will be violated. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70579/#review215174 --- On May 1, 2019, 9:17 p.m., Benjamin Mahler wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70579/ > ------- > > (Updated May 1, 2019, 9:17 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Greg Mann. > > > Bugs: MESOS-8511 > https://issues.apache.org/jira/browse/MESOS-8511 > > > Repository: mesos > > > Description > --- > > The caller can submit tasks to the scheduler and get back statuses. > This reduces the boilerplate of using the low level scheduler > library. > > > Diffs > - > > include/mesos/v1/scheduler.hpp 2f72129ba48943c891d7020bfd2cad3066355026 > src/Makefile.am 5f97523fbe2d80733fbdcc7706f2761f5a071f9f > src/scheduler/scheduler.cpp 674483aa80bc74b654343c97892a96f49d5c7ed4 > src/tests/CMakeLists.txt e6b1d8a097246f0921467d7f778034d6754fcff6 > src/tests/utils/task_scheduler.hpp PRE-CREATION > src/tests/utils/task_scheduler.cpp PRE-CREATION > > > Diff: https://reviews.apache.org/r/70579/diff/1/ > > > Testing > --- > > Tested it via the subsequent patch. > > > Thanks, > > Benjamin Mahler > >
Re: Review Request 70596: Launched tasks with more memory in SLRP unit tests.
> On May 6, 2019, 11:55 a.m., Benjamin Bannier wrote: > > src/tests/storage_local_resource_provider_tests.cpp > > Line 1795 (original), 1795 (patched) > > <https://reviews.apache.org/r/70596/diff/1/?file=2142847#file2142847line1795> > > > > Could we move this into a test class-specific constant? > > > > Here and below. Made this a function that takes `persistentVolumes` as additional resources and adds cpus and mem to them. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70596/#review215058 ------- On May 3, 2019, 9:09 p.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70596/ > --- > > (Updated May 3, 2019, 9:09 p.m.) > > > Review request for mesos, Benjamin Bannier and Jan Schlicht. > > > Bugs: MESOS-9765 > https://issues.apache.org/jira/browse/MESOS-9765 > > > Repository: mesos > > > Description > --- > > Raised the task memory to 128MB (which are the value used in most persistent > volume tests) in all SLRP tests that launch tasks to avoid OOM. > > > Diffs > - > > src/tests/storage_local_resource_provider_tests.cpp > ecd7aeef1cb3d1a5b4b3419dfd912d41a8c6 > > > Diff: https://reviews.apache.org/r/70596/diff/1/ > > > Testing > --- > > sudo make check > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 70579: Added a Task Scheduler to simplify testing.
(patched) <https://reviews.apache.org/r/70579/#comment301727> Instead of doing the wiring, we could instead make `TaskSchedulerProcess::submit` to take `output` as its second argument. src/tests/utils/task_scheduler.cpp Lines 486 (patched) <https://reviews.apache.org/r/70579/#comment301728> We could avoid this wiring by making `TaskSchedulerProcess::submit` take a second `hashmap` parameter. - Chun-Hung Hsiao On May 1, 2019, 9:17 p.m., Benjamin Mahler wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70579/ > --- > > (Updated May 1, 2019, 9:17 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Greg Mann. > > > Bugs: MESOS-8511 > https://issues.apache.org/jira/browse/MESOS-8511 > > > Repository: mesos > > > Description > --- > > The caller can submit tasks to the scheduler and get back statuses. > This reduces the boilerplate of using the low level scheduler > library. > > > Diffs > - > > include/mesos/v1/scheduler.hpp 2f72129ba48943c891d7020bfd2cad3066355026 > src/Makefile.am 5f97523fbe2d80733fbdcc7706f2761f5a071f9f > src/scheduler/scheduler.cpp 674483aa80bc74b654343c97892a96f49d5c7ed4 > src/tests/CMakeLists.txt e6b1d8a097246f0921467d7f778034d6754fcff6 > src/tests/utils/task_scheduler.hpp PRE-CREATION > src/tests/utils/task_scheduler.cpp PRE-CREATION > > > Diff: https://reviews.apache.org/r/70579/diff/1/ > > > Testing > --- > > Tested it via the subsequent patch. > > > Thanks, > > Benjamin Mahler > >
Review Request 70620: Made SLRP allow changes in volume context.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70620/ --- Review request for mesos, Benjamin Bannier and James DeFelice. Bugs: MESOS-9395 https://issues.apache.org/jira/browse/MESOS-9395 Repository: mesos Description --- To make SLRP more robust against non-conforming CSI plugins that change volume contexts, the `getExistVolumes` method returns a list of resource conversions consisting of one for converting old volume contexts to new volume contexts, and one to remove missing volumes and add new volumes. To make the interfaces consistent, `getStoragePools` now also returns a list of resource conversions consisting of one conversion. Diffs - include/mesos/mesos.proto dc6a87f14864cb4d46b9856f251c2946a475162d include/mesos/v1/mesos.proto e8086e0f70cac73876a8ae31db3365b5059b5c44 src/resource_provider/storage/provider.cpp 999fe95bb6f38f5a25068accd854b37788b24028 Diff: https://reviews.apache.org/r/70620/diff/1/ Testing --- sudo make check more testing done later in chain Thanks, Chun-Hung Hsiao
Review Request 70622: Added a unit test to verify if SLRP allows changes in volume context.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70622/ --- Review request for mesos and Benjamin Bannier. Bugs: MESOS-9395 https://issues.apache.org/jira/browse/MESOS-9395 Repository: mesos Description --- Added a unit test to verify if SLRP allows changes in volume context. Diffs - src/examples/test_csi_plugin.cpp 03f782ead136a79c4c5b099496072933f6737598 src/tests/storage_local_resource_provider_tests.cpp ecd7aeef1cb3d1a5b4b3419dfd912d41a8c6 Diff: https://reviews.apache.org/r/70622/diff/1/ Testing --- make check Thanks, Chun-Hung Hsiao
Review Request 70621: Used full paths as volume IDs for the test CSI plugin.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70621/ --- Review request for mesos and Benjamin Bannier. Bugs: MESOS-9395 https://issues.apache.org/jira/browse/MESOS-9395 Repository: mesos Description --- The full paths of simulated volumes are now in their ID instead of their contextual information. This simplifies SLRP tests, and makes it cleaner if we want to customize the contextual information in the future. Diffs - src/examples/test_csi_plugin.cpp 03f782ead136a79c4c5b099496072933f6737598 src/tests/storage_local_resource_provider_tests.cpp ecd7aeef1cb3d1a5b4b3419dfd912d41a8c6 Diff: https://reviews.apache.org/r/70621/diff/1/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 70184: Fixed flakiness in 'RetryRpcWithExponentialBackoff'.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70184/#review215120 --- Ship it! Ship It! - Chun-Hung Hsiao On May 6, 2019, 9:48 a.m., Jan Schlicht wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70184/ > --- > > (Updated May 6, 2019, 9:48 a.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9594 > https://issues.apache.org/jira/browse/MESOS-9594 > > > Repository: mesos > > > Description > --- > > Under some circumstances, offers would be filtered, resulting in the > test being stuck while waiting for offers. This has been resolved by > settling the clock before accepting new offers. > > > Diffs > - > > src/tests/storage_local_resource_provider_tests.cpp > ecd7aeef1cb3d1a5b4b3419dfd912d41a8c6 > > > Diff: https://reviews.apache.org/r/70184/diff/2/ > > > Testing > --- > > make check > > I could not reproduce the flaky behavior of this test case, hence only assume > that this patch resolves the flakiness. > > > Thanks, > > Jan Schlicht > >
Re: Review Request 70595: Adds a regression test for MESOS-9766.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70595/#review215075 --- Ship it! Ship It! - Chun-Hung Hsiao On May 6, 2019, 10:04 p.m., Benjamin Mahler wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70595/ > --- > > (Updated May 6, 2019, 10:04 p.m.) > > > Review request for mesos, Alexander Rukletsov and Chun-Hung Hsiao. > > > Bugs: MESOS-9766 > https://issues.apache.org/jira/browse/MESOS-9766 > > > Repository: mesos > > > Description > --- > > This test fails on master prior to applying the fix for MESOS-9766. > It attempts to ensure that processes are terminated after the > /__processes__ handler dispatches to them. > > > Diffs > - > > 3rdparty/libprocess/src/tests/process_tests.cpp > 60f3dd653153c2b2ccf9c3a7eae8c75fd6ff025c > > > Diff: https://reviews.apache.org/r/70595/diff/4/ > > > Testing > --- > > Ran in repetition, although it appears to consistently fail on master without > repetition needed. > > > Thanks, > > Benjamin Mahler > >
Re: Review Request 70595: Adds a regression test for MESOS-9766.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70595/#review215074 --- Fix it, then Ship it! Thanks for making this test more deterministic! Got some comments, but IMO this test is robust enough in practice so please feel free to drop my issues. 3rdparty/libprocess/src/tests/process_tests.cpp Lines 2122 (patched) <https://reviews.apache.org/r/70595/#comment301475> If you want libprocess to manager this actor, you can do the following: ``` PID pid = spawn(new TestProcess(), true); Future waited = dispatch(pid, ::wait_for_terminate); ``` Then you won't need to do `process::wait` and `delete` later. If you do so, you don't even need `waited`. 3rdparty/libprocess/src/tests/process_tests.cpp Lines 2138 (patched) <https://reviews.apache.org/r/70595/#comment301474> Although highly unlikely, it is theoratically possible that the terminate event is enqueued before any event of `process` got dequeued. If this happen, this await would fail even w/ the fix. A possible way to fix this is to set up some promise in `wait_for_terminate`, and then wait for the corresponding future before calling `terminate` in this test. That said, fixing this flake that's unlikely to happen might not worth it, so please feel free to drop this issue. - Chun-Hung Hsiao On May 6, 2019, 6:34 p.m., Benjamin Mahler wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70595/ > --- > > (Updated May 6, 2019, 6:34 p.m.) > > > Review request for mesos, Alexander Rukletsov and Chun-Hung Hsiao. > > > Bugs: MESOS-9766 > https://issues.apache.org/jira/browse/MESOS-9766 > > > Repository: mesos > > > Description > --- > > This test fails on master prior to applying the fix for MESOS-9766. > It attempts to ensure that processes are terminated after the > /__processes__ handler dispatches to them. > > > Diffs > - > > 3rdparty/libprocess/src/tests/process_tests.cpp > 60f3dd653153c2b2ccf9c3a7eae8c75fd6ff025c > > > Diff: https://reviews.apache.org/r/70595/diff/2/ > > > Testing > --- > > Ran in repetition, although it appears to consistently fail on master without > repetition needed. > > > Thanks, > > Benjamin Mahler > >
Re: Review Request 70595: Adds a regression test for MESOS-9766.
> On May 3, 2019, 11:31 p.m., Chun-Hung Hsiao wrote: > > 3rdparty/libprocess/src/tests/process_tests.cpp > > Lines 2092-2108 (patched) > > <https://reviews.apache.org/r/70595/diff/1/?file=2142838#file2142838line2092> > > > > I tried this unit test w/o the fix in the following settings: > > 1. `--gtest_repeat=100`, which resulted in 3 passes and 97 failures. > > 2. Invoked the test w/o repetition in a for-loop, which resulted in 60 > > passes and 40 failures. > > > > So it seems to me that this test passes fairly easily if it only runs > > once. > > > > Then I replace this snippet with the following code, which just used 1 > > actor but failed every time in the above two settings: > > ``` > > UPID pid = spawn(new ProcessBase(), true); > > > > dispatch(pid, [] { os::sleep(Milliseconds(50)); }); > > > > http::URL url = http::URL( > > "http", > > process::address().ip, > > process::address().port, > > "/__processes__"); > > > > Future response = http::get(url); > > > > terminate(pid, true); > > ``` > > Alexander Rukletsov wrote: > Agree with Chun, we should avoid races as much as we can. Several > suggestions to further improve Chun's version: > - Pause the clock in the beginnig. We can then sleep for 42 minutes to > make sure the first message is still being processes when `http::get()` is > invoked (will of course require `settle()` before). > - `settle()` after `http::get` to ensure `/__processes__` handler is put > into the `pid`'s mailbaox. > - Check that the resulted JSON does not include the entry for `pid`, > i.e., is empty. I'm not sure if adding settles works: 1. Adding a `settle` before `http::get` makes it wait for the completion of the sleep. 2. Adding a `settle` after `http::get` makes `terminate` wait for the completion of the sleep. IIUC both will increase the likelihood of the test passing w/o che fix. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70595/#review215035 --- On May 3, 2019, 7:58 p.m., Benjamin Mahler wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70595/ > --- > > (Updated May 3, 2019, 7:58 p.m.) > > > Review request for mesos, Alexander Rukletsov and Chun-Hung Hsiao. > > > Bugs: MESOS-9766 > https://issues.apache.org/jira/browse/MESOS-9766 > > > Repository: mesos > > > Description > --- > > This test fails on master prior to applying the fix for MESOS-9766. > It attempts to ensure that processes are terminated after the > /__processes__ handler dispatches to them. > > > Diffs > - > > 3rdparty/libprocess/src/tests/process_tests.cpp > 60f3dd653153c2b2ccf9c3a7eae8c75fd6ff025c > > > Diff: https://reviews.apache.org/r/70595/diff/1/ > > > Testing > --- > > Ran in repetition, although it appears to consistently fail on master without > repetition needed. > > > Thanks, > > Benjamin Mahler > >
Re: Review Request 70594: Fixed an issue where /__processes__ never returns a response.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70594/#review215037 --- Ship it! How about quoting `/__processes__` with backticks? - Chun-Hung Hsiao On May 3, 2019, 7:58 p.m., Benjamin Mahler wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70594/ > --- > > (Updated May 3, 2019, 7:58 p.m.) > > > Review request for mesos, Alexander Rukletsov and Chun-Hung Hsiao. > > > Bugs: MESOS-9766 > https://issues.apache.org/jira/browse/MESOS-9766 > > > Repository: mesos > > > Description > --- > > It's possible for /__processes__ to never return a response if > a process is terminated after the /__processes__ handler dispatches > to it, thus leading the Future to be abandoned. > > This patch ensures we ignore those cases, rather than wait forever. > > See MESOS-9766. > > > Diffs > - > > 3rdparty/libprocess/src/process.cpp > 124836472313721a5dbfe4b1ca55f0da3cecd66b > > > Diff: https://reviews.apache.org/r/70594/diff/1/ > > > Testing > --- > > Added a test in the subsequent patch. > > > Thanks, > > Benjamin Mahler > >
Re: Review Request 70595: Adds a regression test for MESOS-9766.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70595/#review215035 --- 3rdparty/libprocess/src/tests/process_tests.cpp Lines 2086 (patched) <https://reviews.apache.org/r/70595/#comment301415> Should `/__processes__` be quoted by backticks? Ditto below. 3rdparty/libprocess/src/tests/process_tests.cpp Lines 2090 (patched) <https://reviews.apache.org/r/70595/#comment301422> How about `NoProcessesEndpointHang`? 3rdparty/libprocess/src/tests/process_tests.cpp Lines 2092-2108 (patched) <https://reviews.apache.org/r/70595/#comment301419> I tried this unit test w/o the fix in the following settings: 1. `--gtest_repeat=100`, which resulted in 3 passes and 97 failures. 2. Invoked the test w/o repetition in a for-loop, which resulted in 60 passes and 40 failures. So it seems to me that this test passes fairly easily if it only runs once. Then I replace this snippet with the following code, which just used 1 actor but failed every time in the above two settings: ``` UPID pid = spawn(new ProcessBase(), true); dispatch(pid, [] { os::sleep(Milliseconds(50)); }); http::URL url = http::URL( "http", process::address().ip, process::address().port, "/__processes__"); Future response = http::get(url); terminate(pid, true); ``` 3rdparty/libprocess/src/tests/process_tests.cpp Lines 2110-2112 (patched) <https://reviews.apache.org/r/70595/#comment301417> `AWAIT_EXPECT_RESPONSE_STATUS_EQ(http::OK().status, response);` - Chun-Hung Hsiao On May 3, 2019, 7:58 p.m., Benjamin Mahler wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70595/ > --- > > (Updated May 3, 2019, 7:58 p.m.) > > > Review request for mesos, Alexander Rukletsov and Chun-Hung Hsiao. > > > Bugs: MESOS-9766 > https://issues.apache.org/jira/browse/MESOS-9766 > > > Repository: mesos > > > Description > --- > > This test fails on master prior to applying the fix for MESOS-9766. > It attempts to ensure that processes are terminated after the > /__processes__ handler dispatches to them. > > > Diffs > - > > 3rdparty/libprocess/src/tests/process_tests.cpp > 60f3dd653153c2b2ccf9c3a7eae8c75fd6ff025c > > > Diff: https://reviews.apache.org/r/70595/diff/1/ > > > Testing > --- > > Ran in repetition, although it appears to consistently fail on master without > repetition needed. > > > Thanks, > > Benjamin Mahler > >
Review Request 70596: Launched tasks with more memory in SLRP unit tests.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70596/ --- Review request for mesos, Benjamin Bannier and Jan Schlicht. Bugs: MESOS-9765 https://issues.apache.org/jira/browse/MESOS-9765 Repository: mesos Description --- Raised the task memory to 128MB (which are the value used in most persistent volume tests) in all SLRP tests that launch tasks to avoid OOM. Diffs - src/tests/storage_local_resource_provider_tests.cpp ecd7aeef1cb3d1a5b4b3419dfd912d41a8c6 Diff: https://reviews.apache.org/r/70596/diff/1/ Testing --- sudo make check Thanks, Chun-Hung Hsiao
Re: Review Request 70133: Removed unnecessary accept filters in SLRP tests.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70133/ --- (Updated May 2, 2019, 11:40 p.m.) Review request for mesos, Benjamin Bannier and Meng Zhu. Changes --- Removed the accept filters in a newly-added test. Repository: mesos Description --- Removed unnecessary accept filters in SLRP tests. Diffs (updated) - src/tests/storage_local_resource_provider_tests.cpp 8bf4d2331b59770a7b7f3768499047bec2d67677 Diff: https://reviews.apache.org/r/70133/diff/3/ Changes: https://reviews.apache.org/r/70133/diff/2-3/ Testing --- `sudo make check` Especially, tested that each of the three modified tests finishes in 5 seconds. Thanks, Chun-Hung Hsiao
Re: Review Request 69955: Added SLRP unit tests for destroying unpublished persistent volumes.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69955/ --- (Updated May 2, 2019, 10:46 p.m.) Review request for mesos and Benjamin Bannier. Changes --- Addressed Benjamin's comments. Bugs: MESOS-9565 https://issues.apache.org/jira/browse/MESOS-9565 Repository: mesos Description --- This patch adds 3 unit tests: `DestroyUnpublishedPersistentVolume`, `DestroyUnpublishedPersistentVolumeWithRecovery`, and `DestroyUnpublishedPersistentVolumeWithReboot` to test that the SLRP is resilient to misbehaved CSI plugins that fail to publish volumes. Diffs (updated) - src/tests/storage_local_resource_provider_tests.cpp 8bf4d2331b59770a7b7f3768499047bec2d67677 Diff: https://reviews.apache.org/r/69955/diff/4/ Changes: https://reviews.apache.org/r/69955/diff/3-4/ Testing --- make check All tests passed 400 iterations (200 per parameter value) under stress. Thanks, Chun-Hung Hsiao
Re: Review Request 69954: Added a SLRP unit test for persistent block volume creation.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69954/ --- (Updated May 2, 2019, 10:44 p.m.) Review request for mesos and Benjamin Bannier. Changes --- Addressed Benjamin's comment. Bugs: MESOS-9565 https://issues.apache.org/jira/browse/MESOS-9565 Repository: mesos Description --- Test `CreateDestroyPersistentBlockVolume` verifies that SLRP would fail a `CREATE` operation on a BLOCK disk resource, and a followup `DESTROY` will be dropped (instead of failing the SLRP). Diffs (updated) - src/tests/mock_csi_plugin.cpp cbcb59f720642c2801f5b86c438793e24437b414 src/tests/storage_local_resource_provider_tests.cpp 8bf4d2331b59770a7b7f3768499047bec2d67677 Diff: https://reviews.apache.org/r/69954/diff/4/ Changes: https://reviews.apache.org/r/69954/diff/3-4/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 69954: Added a SLRP unit test for persistent block volume creation.
> On April 16, 2019, 1:06 p.m., Benjamin Bannier wrote: > > src/tests/storage_local_resource_provider_tests.cpp > > Lines 3128 (patched) > > <https://reviews.apache.org/r/69954/diff/3/?file=2138790#file2138790line3128> > > > > Are we checking that the resources get returned to the offer pool? In > > that case let's confirm that `offers` looks like what we would expect. > > > > Alternatively we might not want to test this here and could just remove > > this and add a matching ignore when setting up the mock expectation above. > > That seems to make more sense to me. > > Chun-Hung Hsiao wrote: > It's more about making it deterministic, and also as part of the > end-to-end scenario. The contents of the offer has already been checked by > the matcher. > > I could instead ignore the offer by doing a `Times(AtMost(1))`. Do you > think this would make the test simpler? As discussed offline, the offer has been checked with the matcher set in the expectation. Instead of using `AWAIT_READY`, we can use `AWAIT_EXPECT_READY` with logging here. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69954/#review214689 ------- On April 12, 2019, 3:15 a.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/69954/ > --- > > (Updated April 12, 2019, 3:15 a.m.) > > > Review request for mesos and Benjamin Bannier. > > > Bugs: MESOS-9565 > https://issues.apache.org/jira/browse/MESOS-9565 > > > Repository: mesos > > > Description > --- > > Test `CreateDestroyPersistentBlockVolume` verifies that SLRP would fail > a `CREATE` operation on a BLOCK disk resource, and a followup `DESTROY` > will be dropped (instead of failing the SLRP). > > > Diffs > - > > src/tests/mock_csi_plugin.cpp cbcb59f720642c2801f5b86c438793e24437b414 > src/tests/storage_local_resource_provider_tests.cpp > 8bf4d2331b59770a7b7f3768499047bec2d67677 > > > Diff: https://reviews.apache.org/r/69954/diff/3/ > > > Testing > --- > > make check > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 70431: Made the `RetryRpcWithExponentialBackoff` SLRP test work with CSI v1.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70431/ --- (Updated May 2, 2019, 10:42 p.m.) Review request for mesos, Benjamin Bannier and Jan Schlicht. Changes --- Addressed Benjamin's comments. Bugs: MESOS-9627 https://issues.apache.org/jira/browse/MESOS-9627 Repository: mesos Description --- This patch enables the unit test to test against CSI v1 through the following changes: * The forwarding mode of the test CSI plugin now respects the `--api_version` option. When specified, only requests of the proper CSI version will be forwarded. * The expectations of `CreateVolume` and `DeleteVolume` calls in the unit tests are parameterized against the CSI version string. * The mock CSI plugin now provides a default implementation for the `GetCapacity` call so the unit test can be simplified. Diffs (updated) - src/examples/test_csi_plugin.cpp b54d666a878576f5845027fe6f704e10195079e1 src/tests/mock_csi_plugin.cpp cbcb59f720642c2801f5b86c438793e24437b414 src/tests/storage_local_resource_provider_tests.cpp 8bf4d2331b59770a7b7f3768499047bec2d67677 Diff: https://reviews.apache.org/r/70431/diff/2/ Changes: https://reviews.apache.org/r/70431/diff/1-2/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 70521: Renamed variables in `Master::_accept` to improve readability.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70521/ --- (Updated May 1, 2019, 8:36 p.m.) Review request for mesos, Benjamin Bannier, Benjamin Mahler, and Meng Zhu. Changes --- Rebased. Repository: mesos Description --- Renamed variables in `Master::_accept` to improve readability. Diffs (updated) - src/master/master.cpp ad54ae217863a08f4e6d743b39c176b171353084 Diff: https://reviews.apache.org/r/70521/diff/4/ Changes: https://reviews.apache.org/r/70521/diff/3-4/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 70132: Do not implicitly decline speculatively converted resources.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70132/ --- (Updated May 1, 2019, 8:35 p.m.) Review request for mesos, Benjamin Bannier, Benjamin Mahler, and Meng Zhu. Changes --- Addressed BenM's comments. Bugs: MESOS-9616 https://issues.apache.org/jira/browse/MESOS-9616 Repository: mesos Description (updated) --- Currently if a framework accepts an offer with a `RESERVE` operation without a task consuming the reserved resources, the resources will be implicitly declined. This is counter to what one would expect (that only the remaining resources in the offer will be declined): Offer `cpus:10` -> `ACCEPT` with `RESERVE cpus(role):1` *Actual* implicit decline: `cpus:9;cpus(role):1` *Expected* implicit decline: `cpus:9` The same issue is present with other transformational operations (i.e., `UNRESERVE`, `CREATE` and `DESTROY`). This patch fixes this issue by only implicitly declining the "remaining" untransformed resources, computed as follows: Offered = `cpus:10` Remaining = `cpus:9;cpus(role):1` Implicitly declined = remaining - (remaining - offered) = `cpus:9;cpus(role):1` - `cpus:(role):1` = `cpus:9` Diffs (updated) - src/master/master.cpp ad54ae217863a08f4e6d743b39c176b171353084 src/tests/slave_tests.cpp b1c3a01031b917fb9773c8c890a8f88838870559 Diff: https://reviews.apache.org/r/70132/diff/8/ Changes: https://reviews.apache.org/r/70132/diff/7-8/ Testing --- make check More testing done in r/70537. Thanks, Chun-Hung Hsiao
Re: Review Request 70537: Added unit tests for implicit decline.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70537/ --- (Updated April 25, 2019, 10:34 p.m.) Review request for mesos, Benjamin Bannier, Benjamin Mahler, and Meng Zhu. Changes --- Reordered test cases to match the documentation. Bugs: MESOS-9616 https://issues.apache.org/jira/browse/MESOS-9616 Repository: mesos Description --- Added unit tests for implicit decline. Diffs (updated) - src/tests/scheduler_tests.cpp e0ed02900330c678bbf5c609c1f45d05147851ed Diff: https://reviews.apache.org/r/70537/diff/2/ Changes: https://reviews.apache.org/r/70537/diff/1-2/ Testing --- make check Tests `ImplicitDecline3` and `ImplicitDecline5` will fail w/o r/70132. Thanks, Chun-Hung Hsiao
Review Request 70547: Documented `DECLINE` and added examples for implicit decline.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70547/ --- Review request for mesos. Repository: mesos Description --- Documented `DECLINE` and added examples for implicit decline. Diffs - docs/scheduler-http-api.md a5327c229142267836f327f9c382ef50b7e334db Diff: https://reviews.apache.org/r/70547/diff/1/ Testing --- N/A Thanks, Chun-Hung Hsiao
Re: Review Request 70132: Do not implicitly decline speculatively converted resources.
> On April 23, 2019, 10:47 a.m., Benjamin Bannier wrote: > > docs/scheduler-http-api.md > > Line 132 (original), 132 (patched) > > <https://reviews.apache.org/r/70132/diff/5/?file=2140649#file2140649line132> > > > > What do you think of getting rid of "implicitly declined" behavior for > > "cancelling operations"? > > > > It seems that behavior is more driven by the implementation than > > intuitive api behavior; it e.g., forces frameworks to reason differently > > about operations executed in isolation vs. executed together. It seems > > having the identical behavior for both cases would both be easier to > > explain and also program against. The behavior that seems to make most > > sense for me would be to only ever implictly decline "untouched resources", > > e.g., if accepting offered `cpus:4` with `RESERVE(cpus:2, role) && > > UNRESERVE(cpus:2, role)` we would implicitly decline only `cpus:2`. > > Chun-Hung Hsiao wrote: > It seems to me that "cancelling operations" as something that are both 1. > very rare and 2. make little sense for frameworks, so I'm more like > delivering a fix for common cases without making the alrealy-messy code path > more complicated. WDYT? Also @bmahler what's your opinion on @bbannier's > suggestion? IIRC you mentioned something like some are designed behaviors > before, but I didn't know the context. > > Benjamin Mahler wrote: > Thanks for bringing this up, it's certainly a bit bizarre of a use case. > I think the more common case is UNRESERVE on its own, where it still seems a > bit bizarre that the "untouched" resources are declined with the filter and > the UNRESERVE resources are not filtered. That seems a bit arbitrary to me, > but I'm not sure what to do about it without allowing the framework to be > explicit about which part it wants to "decline and filter" when accepting, > and this requires an interface change. > > Personally I would consider RESERVE+UNRESERVE to be "touching" those > resources, but I don't think we should worry about it in this patch (I assume > that wasn't your intent anyway, and you were more wanting to raise this topic > for discussion?) > > Benjamin Bannier wrote: > What I worry most is that this edge case makes explaining suggested > framework behavior harder ("should any of the offer operations in a single > accept call cancel each other out you will not get offered the resources > again until the default offer filter timeout expires (the timeout isn't up to > you here)" -> framework defensively revives after each accept call if it has > more work to do). Instead we would like frameworks to focus on getting their > offer handling and decline behavior correct and only ever revive in > exceptional scenarios (e.g., even "_new_ work arrived"). > > Since this patch tries to fix incorrect master behavior we should make > sure to get the behavior somewhat right or else risk that frameworks > implement suboptimal behavior which will be hard to unlearn. That being said, > the fact that no framework author complained when this bug was introduced > makes me worry that they either do not care about how fast offers arrive or > already implement a overly pessimistc approach (e.g., revive whenever there > is more work to do in their state machine). The timeout is still controlled by the scheduler even in the case where operations cancel each other out. We're trying to move to a world where frameworks are more expilict about what filters they want to set up. With this in mind, it seems to me that neither the current fix nor what you suggested fits into this goal well, because here we're *guessing* what the frameworks' intentions. Say if a framework reserves two CPUs and unreserves the two reserved CPUs and then reserves two CPUs again, there is no way to infer if the framework is trying to use just two CPUs or four CPUs. This could become really messy in terms of both semantics and implementation. Since as you said we're not aware of any complain about this, I'd say let's keep the logic simple and determine the declined resources based on the end result of an `ACCEPT` call. Dropping this for now. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70132/#review214812 --- On April 25, 2019, 10:14 p.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.
Re: Review Request 70132: Do not implicitly decline speculatively converted resources.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70132/ --- (Updated April 25, 2019, 10:14 p.m.) Review request for mesos, Benjamin Bannier, Benjamin Mahler, and Meng Zhu. Changes --- Moved documentation to r/70547. Bugs: MESOS-9616 https://issues.apache.org/jira/browse/MESOS-9616 Repository: mesos Description --- Currently if a framework accepts an offer to perform pipelined operations, e.g., reserving resource, without a final consumer, the converted resources will be implicitly declined. This is an undesired behavior as the framework might want to reserve one resource first but launch a task later in the next allocation cycle. This patch fixes this behavior. But, if the framework accepts an offers with multiple operations that cancel out each other, the resources consumed by these operations are still considered unused and will be declined. Diffs (updated) - src/master/master.cpp ad54ae217863a08f4e6d743b39c176b171353084 src/tests/slave_tests.cpp b1c3a01031b917fb9773c8c890a8f88838870559 Diff: https://reviews.apache.org/r/70132/diff/7/ Changes: https://reviews.apache.org/r/70132/diff/6-7/ Testing --- make check More testing done in r/70537. Thanks, Chun-Hung Hsiao
Re: Review Request 70521: Renamed variables in `Master::_accept` to improve readability.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70521/ --- (Updated April 24, 2019, 2:04 a.m.) Review request for mesos, Benjamin Bannier, Benjamin Mahler, and Meng Zhu. Changes --- Rebased. Repository: mesos Description --- Renamed variables in `Master::_accept` to improve readability. Diffs (updated) - src/master/master.cpp ad54ae217863a08f4e6d743b39c176b171353084 Diff: https://reviews.apache.org/r/70521/diff/2/ Changes: https://reviews.apache.org/r/70521/diff/1-2/ Testing --- make check Thanks, Chun-Hung Hsiao
Review Request 70537: Added unit tests for implicit decline.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70537/ --- Review request for mesos, Benjamin Bannier, Benjamin Mahler, and Meng Zhu. Bugs: MESOS-9616 https://issues.apache.org/jira/browse/MESOS-9616 Repository: mesos Description --- Added unit tests for implicit decline. Diffs - src/tests/scheduler_tests.cpp e0ed02900330c678bbf5c609c1f45d05147851ed Diff: https://reviews.apache.org/r/70537/diff/1/ Testing --- make check Tests `ImplicitDecline3` and `ImplicitDecline5` will fail w/o r/70132. Thanks, Chun-Hung Hsiao
Re: Review Request 70132: Do not implicitly decline speculatively converted resources.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70132/ --- (Updated April 24, 2019, 2:02 a.m.) Review request for mesos, Benjamin Bannier, Benjamin Mahler, and Meng Zhu. Changes --- Addressed some of Benjamin's comments and added examples. Bugs: MESOS-9616 https://issues.apache.org/jira/browse/MESOS-9616 Repository: mesos Description --- Currently if a framework accepts an offer to perform pipelined operations, e.g., reserving resource, without a final consumer, the converted resources will be implicitly declined. This is an undesired behavior as the framework might want to reserve one resource first but launch a task later in the next allocation cycle. This patch fixes this behavior. But, if the framework accepts an offers with multiple operations that cancel out each other, the resources consumed by these operations are still considered unused and will be declined. Diffs (updated) - docs/scheduler-http-api.md a5327c229142267836f327f9c382ef50b7e334db src/master/master.cpp ad54ae217863a08f4e6d743b39c176b171353084 src/tests/slave_tests.cpp b1c3a01031b917fb9773c8c890a8f88838870559 Diff: https://reviews.apache.org/r/70132/diff/6/ Changes: https://reviews.apache.org/r/70132/diff/5-6/ Testing (updated) --- make check More testing done in r/70537. Thanks, Chun-Hung Hsiao
Re: Review Request 70132: Do not implicitly decline speculatively converted resources.
> On April 23, 2019, 10:47 a.m., Benjamin Bannier wrote: > > src/tests/slave_tests.cpp > > Lines 6499 (patched) > > <https://reviews.apache.org/r/70132/diff/5/?file=2140651#file2140651line6499> > > > > Since the changes in this patch are strongly related to behavior > > framework authors need to reason about I strongly feel that we must add a > > test for the expected behavior. > > Chun-Hung Hsiao wrote: > I could add a unit test in a separated patch. This patch itself will be > backported, after discussed with @bmahler. Done in r/70537. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70132/#review214812 ------- On April 23, 2019, 1:15 a.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70132/ > --- > > (Updated April 23, 2019, 1:15 a.m.) > > > Review request for mesos, Benjamin Bannier, Benjamin Mahler, and Meng Zhu. > > > Bugs: MESOS-9616 > https://issues.apache.org/jira/browse/MESOS-9616 > > > Repository: mesos > > > Description > --- > > Currently if a framework accepts an offer to perform pipelined > operations, e.g., reserving resource, without a final consumer, the > converted resources will be implicitly declined. This is an undesired > behavior as the framework might want to reserve one resource first but > launch a task later in the next allocation cycle. This patch fixes this > behavior. > > But, if the framework accepts an offers with multiple operations that > cancel out each other, the resources consumed by these operations are > still considered unused and will be declined. > > > Diffs > - > > docs/scheduler-http-api.md a5327c229142267836f327f9c382ef50b7e334db > src/master/master.cpp ad54ae217863a08f4e6d743b39c176b171353084 > src/tests/slave_tests.cpp b1c3a01031b917fb9773c8c890a8f88838870559 > > > Diff: https://reviews.apache.org/r/70132/diff/5/ > > > Testing > --- > > make check > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 70132: Do not implicitly decline speculatively converted resources.
> On April 23, 2019, 10:47 a.m., Benjamin Bannier wrote: > > docs/scheduler-http-api.md > > Line 132 (original), 132 (patched) > > <https://reviews.apache.org/r/70132/diff/5/?file=2140649#file2140649line132> > > > > What do you think of getting rid of "implicitly declined" behavior for > > "cancelling operations"? > > > > It seems that behavior is more driven by the implementation than > > intuitive api behavior; it e.g., forces frameworks to reason differently > > about operations executed in isolation vs. executed together. It seems > > having the identical behavior for both cases would both be easier to > > explain and also program against. The behavior that seems to make most > > sense for me would be to only ever implictly decline "untouched resources", > > e.g., if accepting offered `cpus:4` with `RESERVE(cpus:2, role) && > > UNRESERVE(cpus:2, role)` we would implicitly decline only `cpus:2`. It seems to me that "cancelling operations" as something that are both 1. very rare and 2. make little sense for frameworks, so I'm more like delivering a fix for common cases without making the alrealy-messy code path more complicated. WDYT? Also @bmahler what's your opinion on @bbannier's suggestion? IIRC you mentioned something like some are designed behaviors before, but I didn't know the context. > On April 23, 2019, 10:47 a.m., Benjamin Bannier wrote: > > src/master/master.cpp > > Lines 5963-5964 (original), 5983-5984 (patched) > > <https://reviews.apache.org/r/70132/diff/5/?file=2140650#file2140650line5983> > > > > Is this a workaround we need until MESOS-4553 gets resolved? If it is, > > let's add a `TODO`. I don't know actually lol. I just copied it from https://github.com/apache/mesos/blob/45c9788618e7123f408a1dffcf6772a1285cd2e5/src/master/master.cpp#L10969-L10972, as @mzhu suggested that if there's an allocation in between there might be offer fragmentation. Is this a workaround for MESOS-4553? > On April 23, 2019, 10:47 a.m., Benjamin Bannier wrote: > > src/tests/slave_tests.cpp > > Lines 6499 (patched) > > <https://reviews.apache.org/r/70132/diff/5/?file=2140651#file2140651line6499> > > > > Since the changes in this patch are strongly related to behavior > > framework authors need to reason about I strongly feel that we must add a > > test for the expected behavior. I could add a unit test in a separated patch. This patch itself will be backported, after discussed with @bmahler. - Chun-Hung ------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70132/#review214812 --- On April 23, 2019, 1:15 a.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70132/ > --- > > (Updated April 23, 2019, 1:15 a.m.) > > > Review request for mesos, Benjamin Bannier, Benjamin Mahler, and Meng Zhu. > > > Bugs: MESOS-9616 > https://issues.apache.org/jira/browse/MESOS-9616 > > > Repository: mesos > > > Description > --- > > Currently if a framework accepts an offer to perform pipelined > operations, e.g., reserving resource, without a final consumer, the > converted resources will be implicitly declined. This is an undesired > behavior as the framework might want to reserve one resource first but > launch a task later in the next allocation cycle. This patch fixes this > behavior. > > But, if the framework accepts an offers with multiple operations that > cancel out each other, the resources consumed by these operations are > still considered unused and will be declined. > > > Diffs > - > > docs/scheduler-http-api.md a5327c229142267836f327f9c382ef50b7e334db > src/master/master.cpp ad54ae217863a08f4e6d743b39c176b171353084 > src/tests/slave_tests.cpp b1c3a01031b917fb9773c8c890a8f88838870559 > > > Diff: https://reviews.apache.org/r/70132/diff/5/ > > > Testing > --- > > make check > > > Thanks, > > Chun-Hung Hsiao > >
Review Request 70521: Renamed variables in `Master::_accept` to improve readability.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70521/ --- Review request for mesos, Benjamin Bannier, Benjamin Mahler, and Meng Zhu. Repository: mesos Description --- Renamed variables in `Master::_accept` to improve readability. Diffs - src/master/master.cpp ad54ae217863a08f4e6d743b39c176b171353084 Diff: https://reviews.apache.org/r/70521/diff/1/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 70132: Do not implicitly decline speculatively converted resources.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70132/ --- (Updated April 23, 2019, 1:15 a.m.) Review request for mesos, Benjamin Bannier, Benjamin Mahler, and Meng Zhu. Changes --- Updated variable names and comments to make it easier to understand. Summary (updated) - Do not implicitly decline speculatively converted resources. Bugs: MESOS-9616 https://issues.apache.org/jira/browse/MESOS-9616 Repository: mesos Description (updated) --- Currently if a framework accepts an offer to perform pipelined operations, e.g., reserving resource, without a final consumer, the converted resources will be implicitly declined. This is an undesired behavior as the framework might want to reserve one resource first but launch a task later in the next allocation cycle. This patch fixes this behavior. But, if the framework accepts an offers with multiple operations that cancel out each other, the resources consumed by these operations are still considered unused and will be declined. Diffs (updated) - docs/scheduler-http-api.md a5327c229142267836f327f9c382ef50b7e334db src/master/master.cpp ad54ae217863a08f4e6d743b39c176b171353084 src/tests/slave_tests.cpp b1c3a01031b917fb9773c8c890a8f88838870559 Diff: https://reviews.apache.org/r/70132/diff/5/ Changes: https://reviews.apache.org/r/70132/diff/4-5/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 65085: WIP: Added a unit test for fd leakage with blocking HTTP calls.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/65085/ --- (Updated April 16, 2019, 10 p.m.) Review request for Benjamin Bannier. Bugs: MESOS-8428 https://issues.apache.org/jira/browse/MESOS-8428 Repository: mesos Description --- When running `ContainerDaemonTest.FileDescriptorLeakage` in repetition, the test will use up all fds. Diffs (updated) - src/tests/container_daemon_tests.cpp 1396f25bf7aa9fbc62af9111ad412238faff432e Diff: https://reviews.apache.org/r/65085/diff/2/ Changes: https://reviews.apache.org/r/65085/diff/1-2/ Testing --- Thanks, Chun-Hung Hsiao
Re: Review Request 69954: Added a SLRP unit test for persistent block volume creation.
> On April 16, 2019, 1:06 p.m., Benjamin Bannier wrote: > > src/tests/storage_local_resource_provider_tests.cpp > > Lines 3128 (patched) > > <https://reviews.apache.org/r/69954/diff/3/?file=2138790#file2138790line3128> > > > > Are we checking that the resources get returned to the offer pool? In > > that case let's confirm that `offers` looks like what we would expect. > > > > Alternatively we might not want to test this here and could just remove > > this and add a matching ignore when setting up the mock expectation above. > > That seems to make more sense to me. It's more about making it deterministic, and also as part of the end-to-end scenario. The contents of the offer has already been checked by the matcher. I could instead ignore the offer by doing a `Times(AtMost(1))`. Do you think this would make the test simpler? - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69954/#review214689 ------- On April 12, 2019, 3:15 a.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/69954/ > --- > > (Updated April 12, 2019, 3:15 a.m.) > > > Review request for mesos and Benjamin Bannier. > > > Bugs: MESOS-9565 > https://issues.apache.org/jira/browse/MESOS-9565 > > > Repository: mesos > > > Description > --- > > Test `CreateDestroyPersistentBlockVolume` verifies that SLRP would fail > a `CREATE` operation on a BLOCK disk resource, and a followup `DESTROY` > will be dropped (instead of failing the SLRP). > > > Diffs > - > > src/tests/mock_csi_plugin.cpp cbcb59f720642c2801f5b86c438793e24437b414 > src/tests/storage_local_resource_provider_tests.cpp > 8bf4d2331b59770a7b7f3768499047bec2d67677 > > > Diff: https://reviews.apache.org/r/69954/diff/3/ > > > Testing > --- > > make check > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 69955: Added SLRP unit tests for destroying unpublished persistent volumes.
> On April 16, 2019, 1:18 p.m., Benjamin Bannier wrote: > > src/tests/storage_local_resource_provider_tests.cpp > > Lines 3328 (patched) > > <https://reviews.apache.org/r/69955/diff/3/?file=2139099#file2139099line3328> > > > > If we need to await offers, lets also check their contents. It seems > > checking the operation status would already be enough. This would eliminate the nondeterminism of the times of `resourceOffers` being called. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69955/#review214690 --- On April 12, 2019, 10:02 p.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/69955/ > --- > > (Updated April 12, 2019, 10:02 p.m.) > > > Review request for mesos and Benjamin Bannier. > > > Bugs: MESOS-9565 > https://issues.apache.org/jira/browse/MESOS-9565 > > > Repository: mesos > > > Description > --- > > This patch adds 3 unit tests: `DestroyUnpublishedPersistentVolume`, > `DestroyUnpublishedPersistentVolumeWithRecovery`, and > `DestroyUnpublishedPersistentVolumeWithReboot` to test that the SLRP is > resilient to misbehaved CSI plugins that fail to publish volumes. > > > Diffs > - > > src/tests/storage_local_resource_provider_tests.cpp > 8bf4d2331b59770a7b7f3768499047bec2d67677 > > > Diff: https://reviews.apache.org/r/69955/diff/3/ > > > Testing > --- > > make check > > All tests passed 400 iterations (200 per parameter value) under stress. > > > Thanks, > > Chun-Hung Hsiao > >
Review Request 70481: Added an image compatibility note for UCR.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70481/ --- Review request for mesos and Gilbert Song. Bugs: MESOS-7617 https://issues.apache.org/jira/browse/MESOS-7617 Repository: mesos Description --- Added an image compatibility note for UCR. Diffs - docs/isolators/docker-runtime.md a6272b4c83401d3c2cf88333f337b55b147fd083 Diff: https://reviews.apache.org/r/70481/diff/1/ Testing --- N/A Thanks, Chun-Hung Hsiao
Re: Review Request 69955: Added SLRP unit tests for destroying unpublished persistent volumes.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69955/ --- (Updated April 12, 2019, 10:02 p.m.) Review request for mesos and Benjamin Bannier. Changes --- Rebased and parameterized the tests. Summary (updated) - Added SLRP unit tests for destroying unpublished persistent volumes. Bugs: MESOS-9565 https://issues.apache.org/jira/browse/MESOS-9565 Repository: mesos Description --- This patch adds 3 unit tests: `DestroyUnpublishedPersistentVolume`, `DestroyUnpublishedPersistentVolumeWithRecovery`, and `DestroyUnpublishedPersistentVolumeWithReboot` to test that the SLRP is resilient to misbehaved CSI plugins that fail to publish volumes. Diffs (updated) - src/tests/storage_local_resource_provider_tests.cpp 8bf4d2331b59770a7b7f3768499047bec2d67677 Diff: https://reviews.apache.org/r/69955/diff/3/ Changes: https://reviews.apache.org/r/69955/diff/2-3/ Testing --- make check Thanks, Chun-Hung Hsiao
Review Request 70468: Fixed crash when recovering a volume failed to publish with CSI v1.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70468/ --- Review request for mesos and Benjamin Bannier. Bugs: MESOS-9729 https://issues.apache.org/jira/browse/MESOS-9729 Repository: mesos Description --- The CSI v1 volume manager falsely assumed that the target path always exists when unpublishing a volume, which is not true if there is a failure when publishing the volume. Diffs - src/csi/v1_volume_manager.cpp bf640f9ee836f6d8cc6bb2364f2cdeddc51a7cf3 Diff: https://reviews.apache.org/r/70468/diff/1/ Testing --- make check Testing of this bug is done later in the chain. Thanks, Chun-Hung Hsiao
Re: Review Request 69954: Added a SLRP unit test for persistent block volume creation.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69954/ --- (Updated April 12, 2019, 3:15 a.m.) Review request for mesos and Benjamin Bannier. Changes --- Rebased and fixed a test flake. Bugs: MESOS-9565 https://issues.apache.org/jira/browse/MESOS-9565 Repository: mesos Description --- Test `CreateDestroyPersistentBlockVolume` verifies that SLRP would fail a `CREATE` operation on a BLOCK disk resource, and a followup `DESTROY` will be dropped (instead of failing the SLRP). Diffs (updated) - src/tests/mock_csi_plugin.cpp cbcb59f720642c2801f5b86c438793e24437b414 src/tests/storage_local_resource_provider_tests.cpp 8bf4d2331b59770a7b7f3768499047bec2d67677 Diff: https://reviews.apache.org/r/69954/diff/3/ Changes: https://reviews.apache.org/r/69954/diff/2-3/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 69954: Added a SLRP unit test for persistent block volume creation.
(allocated: storage):1024; disk(allocated: storage):1024; > > ports(allocated: storage):[31000-32000]; disk(allocated: > > storage)(reservations: > > [(DYNAMIC,storage)])[BLOCK(org.apache.mesos.csi.test.local» > > I0305 10:39:55.772673 58949 master.cpp:12085] Removing offer > > 62009208-e484-499c-ac45-09c08ef1cf4d-O3 > > > > GMOCK WARNING: > > Uninteresting mock function call - returning directly. > > Function call: offerRescinded(0x7ffd5e523d20, @0x7f8874003458 > > 62009208-e484-499c-ac45-09c08ef1cf4d-O3) > > NOTE: You can safely ignore the above warning unless this call should not > > happen. Do not suppress it by blindly adding an EXPECT_CALL() if you don't > > mean to enforce the call. See > > https://github.com/google/googletest/blob/master/googlemock/docs/CookBook.md#knowing-when-to-expect > > for details. > > I0305 10:39:55.784617 58960 master.cpp:9865] Sending offers [ > > 62009208-e484-499c-ac45-09c08ef1cf4d-O4 ] to framework > > 62009208-e484-499c-ac45-09c08ef1cf4d- (default) at > > scheduler-e112e00f-8102-4bc0-9613-0cea89dd77d6@66.70.182.167:41287 > > ../src/tests/storage_local_resource_provider_tests.cpp:2887: Failure > > Mock function called more times than expected - returning directly. > > Function call: resourceOffers(0x7ffd5e523d20, @0x7f897275f2c8 { > > 160-byte object > 00-00 00-00 00-00 00-00 00-00 00-00 00-00 05-00 00-00 05-00 00-00 40-96 > > 03-90 88-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 ... F0-DB > > 01-90 88-7F 00-00 70-DA 01-90 88-7F 00-00 60-AC 02-90 8» > > Expected: to be called once > >Actual: called twice - over-saturated and active > > *** Aborted at 1551778795 (unix time) try "date -d @1551778795" if you are > > using GNU date *** > > ``` > > > > We should also take care to potentially expect `offerRescinded`. Okay I understand what happened here now. Since the `CREATE` and `DESTROY` function cancel each other out, the net effect of the last `ACCEPT` call was a no-op, which caused an implicit decline and then the block volume was re-offered after the default refuse timeout. However, SLRP failed the `CREATE` operation after the block volume was re-offered. Since an speculative operation failed, an `UPDATE_STATE` was triggered, which in turn rescinded the offer. Since the purpose of this test is to verify that creating a PV on a block disk would fail, let me just remove the `DESTROY` part, which would avoid this scenario at all. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69954/#review213431 --- On March 4, 2019, 7:23 p.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/69954/ > ------- > > (Updated March 4, 2019, 7:23 p.m.) > > > Review request for mesos and Benjamin Bannier. > > > Bugs: MESOS-9565 > https://issues.apache.org/jira/browse/MESOS-9565 > > > Repository: mesos > > > Description > --- > > Test `CreateDestroyPersistentBlockVolume` verifies that SLRP would fail > a `CREATE` operation on a BLOCK disk resource, and a followup `DESTROY` > will be dropped (instead of failing the SLRP). > > > Diffs > - > > src/tests/mock_csi_plugin.cpp 10245705ab39872da0fef1afd02213e2c7f345cb > src/tests/storage_local_resource_provider_tests.cpp > a661951a0a326cc342aa0c45dd0967692ae70941 > > > Diff: https://reviews.apache.org/r/69954/diff/2/ > > > Testing > --- > > make check > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 70184: Fixed flakiness in 'RetryRpcWithExponentialBackoff'.
> On March 27, 2019, 5:10 p.m., Chun-Hung Hsiao wrote: > > src/tests/storage_local_resource_provider_tests.cpp > > Lines 4926 (patched) > > <https://reviews.apache.org/r/70184/diff/1/?file=2130933#file2130933line4926> > > > > I'm not sure if this is the most appropriate fix before understanding > > why the offer was filtered. Can you elaborate more on why this is needed? Okay I think I now understand what happened here. It was not because that the offer filter prevented the framework from getting the offer. No filter is set to refuse the newly created volume. What actually happened was that, after the master received `OPERATION_FINISHED`, it would call `allocator->updateAllocation(...)` and `allocator->recoverResources(...)`: https://github.com/apache/mesos/blob/cb4c9660eecae59bc2acbeb37ab19214f7b0e19b/src/master/master.cpp#L11962-L11972 Since these steps were done asynchronously, they raced with the clock advanced that triggered the batch allocation after L4926 in this test. So in the failed run, the batch allocation was done before the resource (i.e., the created volume) was recovered, so the offer allocated to the framework did not have any "new" resource and thus filtered. This can be reproduced by adding an `os::sleep(Milliseconds(10))` right before `allocator->updateAllocation(...)` above. Adding a `REVIVE` call does the trick, because the revive call, which was processed sequentially *after* `OPERATION_FINISHED`, would asynchronuosly queue up an event-based allocation after the above two allocator calls. So even if the race happens, it ensures that another allocation would be triggered after the created volume is recovered. However it seems to me that a more proper fix is the following: ``` EXPECT_EQ(OPERATION_FINISHED, updateOperationStatus->status().state()); - // Advance the clock to trigger a batch allocation. + // Settle the clock to recover the created disk then advance the clock to + // trigger a batch allocation. + Clock::settle(); Clock::advance(masterFlags.allocation_interval); AWAIT_READY(offers); ``` - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70184/#review214121 --- On March 11, 2019, 3:39 p.m., Jan Schlicht wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70184/ > --- > > (Updated March 11, 2019, 3:39 p.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9594 > https://issues.apache.org/jira/browse/MESOS-9594 > > > Repository: mesos > > > Description > --- > > Under some circumstances, offers would be filtered, resulting in the > test being stuck while waiting for offers. This has been resolved by > removing all offer filters before waiting for offers. > > > Diffs > - > > src/tests/storage_local_resource_provider_tests.cpp > 7945384867f26fa15dc734a235ae509d5d6d350f > > > Diff: https://reviews.apache.org/r/70184/diff/1/ > > > Testing > --- > > make check > > I could not reproduce the flaky behavior of this test case, hence only assume > that this patch resolves the flakiness. > > > Thanks, > > Jan Schlicht > >
Re: Review Request 70449: Avoid publishing resources when an HTTP executor resubscribes.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70449/ --- (Updated April 11, 2019, 3:55 p.m.) Review request for mesos, Benjamin Bannier, Greg Mann, and Vinod Kone. Changes --- Fixed a typo. Bugs: MESOS-9711 https://issues.apache.org/jira/browse/MESOS-9711 Repository: mesos Description --- After an agent failover, an HTTP executor may resubscribe before any resource provider resubscribes. If that happens and the executor has tasks consuming resources from an unsubscribed resource provider, the agent will fail to publish the resources and kill the executor, which is an undesired behavior. The patch fixes this issue. Diffs (updated) - src/slave/slave.cpp 794a9c986b266b4916f6fbada670142798245bd1 src/tests/slave_tests.cpp b1c3a01031b917fb9773c8c890a8f88838870559 Diff: https://reviews.apache.org/r/70449/diff/5/ Changes: https://reviews.apache.org/r/70449/diff/4-5/ Testing --- make check Thanks, Chun-Hung Hsiao
Re: Review Request 70449: Avoid publishing resources when an HTTP executor resubscribes.
> On April 11, 2019, 3:35 p.m., Vinod Kone wrote: > > src/slave/slave.cpp > > Lines 5033 (patched) > > <https://reviews.apache.org/r/70449/diff/4/?file=2138629#file2138629line5033> > > > > s/updated/update/ Oops thanks. - Chun-Hung --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70449/#review214586 --- On April 11, 2019, 1:30 a.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70449/ > --- > > (Updated April 11, 2019, 1:30 a.m.) > > > Review request for mesos, Benjamin Bannier, Greg Mann, and Vinod Kone. > > > Bugs: MESOS-9711 > https://issues.apache.org/jira/browse/MESOS-9711 > > > Repository: mesos > > > Description > --- > > After an agent failover, an HTTP executor may resubscribe before any > resource provider resubscribes. If that happens and the executor > has tasks consuming resources from an unsubscribed resource provider, > the agent will fail to publish the resources and kill the executor, > which is an undesired behavior. The patch fixes this issue. > > > Diffs > - > > src/slave/slave.cpp 794a9c986b266b4916f6fbada670142798245bd1 > src/tests/slave_tests.cpp b1c3a01031b917fb9773c8c890a8f88838870559 > > > Diff: https://reviews.apache.org/r/70449/diff/4/ > > > Testing > --- > > make check > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 70454: Fixed potential use-after-free bug in storage local resource provider.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70454/#review214587 --- Ship it! Ship It! - Chun-Hung Hsiao On April 11, 2019, 12:55 p.m., Benjamin Bannier wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70454/ > --- > > (Updated April 11, 2019, 12:55 p.m.) > > > Review request for mesos and Chun-Hung Hsiao. > > > Bugs: MESOS-9711 > https://issues.apache.org/jira/browse/MESOS-9711 > > > Repository: mesos > > > Description > --- > > The storage local resource provider manages a metrics instance which is > shared with the service and volume managers it also holds; these > services do not manage the lifetime of the metrics instance. > > This patch fixes the lifetime of the metrics instance. > > > Diffs > - > > src/resource_provider/storage/provider.cpp > b2ca5d0735cbd11b8a9689359e05d6bd723b8a64 > > > Diff: https://reviews.apache.org/r/70454/diff/1/ > > > Testing > --- > > `make check` > > > Thanks, > > Benjamin Bannier > >
Re: Review Request 70449: Avoid publishing resources when an HTTP executor resubscribes.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70449/ --- (Updated April 11, 2019, 1:30 a.m.) Review request for mesos, Benjamin Bannier, Greg Mann, and Vinod Kone. Changes --- Made the test fail without the fix. Bugs: MESOS-9711 https://issues.apache.org/jira/browse/MESOS-9711 Repository: mesos Description --- After an agent failover, an HTTP executor may resubscribe before any resource provider resubscribes. If that happens and the executor has tasks consuming resources from an unsubscribed resource provider, the agent will fail to publish the resources and kill the executor, which is an undesired behavior. The patch fixes this issue. Diffs (updated) - src/slave/slave.cpp 794a9c986b266b4916f6fbada670142798245bd1 src/tests/slave_tests.cpp b1c3a01031b917fb9773c8c890a8f88838870559 Diff: https://reviews.apache.org/r/70449/diff/4/ Changes: https://reviews.apache.org/r/70449/diff/3-4/ Testing --- make check Thanks, Chun-Hung Hsiao