Hi Anand, thanks for creating the ticket. I will also investigate a bit more. I will probably be in SF on Thursday, so we could discuss in person.
-- Dario

> On Oct 17, 2016, at 12:19 PM, Anand Mazumdar <an...@apache.org> wrote:
>
> Dario,
>
> It's not immediately clear to me where the bottleneck might be. I filed
> MESOS-6405 to write a benchmark that tries to mimic your test setup and
> then go about fixing the issues.
>
> -anand
>
>> On Sun, Oct 16, 2016 at 6:20 PM, Dario Rexin <dre...@apple.com> wrote:
>> Hi Anand,
>>
>> I tested with and without pipelining and it doesn't make a difference.
>> First of all, unlimited pipelining is not a good idea: we still have to
>> handle the responses and need to be able to relate each request to its
>> response when it comes back, i.e. store the context of the request until
>> we receive the response. Also, we want to know as soon as possible when
>> an error occurs, so early returns are very desirable. I agree that it
>> shouldn't make a difference to how fast events can be processed whether
>> they are queued on the master or on the client, but this observation
>> made it very apparent that throughput is a problem on the master. I did
>> not make any requests that would potentially block for a long time, so
>> it's even weirder to me that the throughput is so low. One thing I don't
>> understand, for example, is why all messages go through the master
>> process. The parsing could be done in a completely separate process, and
>> if every connected framework were backed by its own process, the check
>> whether a framework is connected could also be done there (not to
>> mention that this requirement exists only because we need to use
>> multiple connections). Requiring all messages to go through a single
>> process that can block indefinitely is obviously a huge bottleneck. I
>> understand that this problem is not limited to the HTTP API, but I think
>> it has to be fixed.
>>
>> —
>> Dario
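To make that bookkeeping concrete: HTTP/1.1 pipelined responses come back in
request order, so the client essentially has to keep a FIFO of in-flight call
contexts and complete the oldest one whenever a response arrives. A minimal
sketch of that idea (made-up names, not our actual client code):

```cpp
#include <future>
#include <queue>
#include <string>

// Per-connection bookkeeping for pipelined calls. Responses on a pipelined
// HTTP/1.1 connection arrive in request order, so a FIFO of pending call
// contexts is enough to match each response to its request.
struct PendingCall
{
  std::string call;          // e.g. the serialized REVIVE call.
  std::promise<int> status;  // Completed with the HTTP status code.
};

class PipelinedConnection
{
public:
  // Remember the call's context, then write the request on the wire.
  std::future<int> send(const std::string& call)
  {
    pending.push(PendingCall{call, {}});
    std::future<int> result = pending.back().status.get_future();
    // write(call);  // Actual socket write omitted in this sketch.
    return result;
  }

  // Invoked for each response in arrival order; completes the oldest call.
  void onResponse(int statusCode)
  {
    pending.front().status.set_value(statusCode);
    pending.pop();
  }

private:
  std::queue<PendingCall> pending;
};
```

With in-order responses, an early accepted status mostly just moves where the
queueing happens, which is why I agree it shouldn't change how fast calls can
be processed; it only made the master-side limit easier to see.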
>>> On Oct 16, 2016, at 5:52 PM, Anand Mazumdar <mazumdar.an...@gmail.com> wrote:
>>>
>>> Dario,
>>>
>>> Regarding:
>>>
>>> > This is especially concerning, as it means that accepting calls will
>>> > completely stall when a long running call (e.g. retrieving state.json)
>>> > is running.
>>>
>>> How does it help a client when it gets an early accepted response, versus
>>> when accepting of calls is stalled, i.e., queued up on the master actor?
>>> The client does not need to wait for a response before pipelining its
>>> next request to the master anyway. In your tests, do you send the next
>>> REVIVE call only upon receiving the response to the current call? That
>>> might explain the behavior you are seeing.
>>>
>>> -anand
>>>
>>>> On Sun, Oct 16, 2016 at 11:58 AM, tommy xiao <xia...@gmail.com> wrote:
>>>> Interesting topic.
>>>>
>>>> 2016-10-17 2:51 GMT+08:00 Dario Rexin <dre...@apple.com>:
>>>>> Hi Anand,
>>>>>
>>>>> I tested with the current HEAD. After I saw low throughput in our own
>>>>> HTTP API client, I wrote a small server that sends out fake events and
>>>>> accepts calls, and our client was able to send a lot more calls to that
>>>>> server. I also wrote a small tool that simply sends as many calls to
>>>>> Mesos as possible without handling any events, and I get similar
>>>>> results there. I also observe extremely high CPU usage: while my
>>>>> sending tool is using ~10% CPU, Mesos runs at ~185%. The calls I send
>>>>> for testing are all REVIVE, and I don't have any agents connected, so
>>>>> there should be essentially nothing happening.
>>>>>
>>>>> One reason I could think of for the reduced throughput is that all
>>>>> calls are processed in the master process before it sends back an
>>>>> ACCEPTED, leading to effectively single-threaded processing of HTTP
>>>>> calls, interleaved with all other calls that are sent to the master
>>>>> process. Libprocess, however, just forwards messages to the master
>>>>> process and then immediately returns ACCEPTED. It also handles all
>>>>> connections in separate processes, whereas HTTP calls are effectively
>>>>> all handled by the master process. This is especially concerning, as it
>>>>> means that accepting calls will completely stall while a long-running
>>>>> call (e.g. retrieving state.json) is running.
>>>>>
>>>>> Thanks,
>>>>> Dario
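To spell out why routing everything through a single process matters, here is
a toy model (made-up code, not the actual master or libprocess
implementation) of one actor draining its mailbox strictly in order; a single
slow handler delays the accepted response for every call queued behind it:

```cpp
#include <chrono>
#include <functional>
#include <iostream>
#include <queue>
#include <thread>

// Toy model of a single-threaded actor: every handler goes through one
// mailbox and runs to completion before the next one starts, so a slow
// handler (think: serving a large state response) delays the ACCEPTED
// for every call queued behind it. All names here are illustrative only.
int main()
{
  std::queue<std::function<void()>> mailbox;

  // One expensive request followed by many cheap calls.
  mailbox.push([] {
    std::this_thread::sleep_for(std::chrono::milliseconds(500)); // "state.json"
  });
  for (int i = 0; i < 1000; ++i) {
    mailbox.push([] { /* validate + accept a REVIVE call */ });
  }

  const auto start = std::chrono::steady_clock::now();
  while (!mailbox.empty()) {
    mailbox.front()();  // Handlers run strictly one after another.
    mailbox.pop();
  }
  const auto elapsed = std::chrono::steady_clock::now() - start;

  // All 1000 cheap calls are only "accepted" after the slow one finishes.
  std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count()
            << "ms to drain 1001 queued handlers" << std::endl;
}
```

That is the pattern I suspect the v1 path is hitting: every call is parsed,
validated and accepted on the master actor itself, in line behind whatever
else the master happens to be doing.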
>>>>>> On Oct 16, 2016, at 11:01 AM, Anand Mazumdar <an...@apache.org> wrote:
>>>>>>
>>>>>> Dario,
>>>>>>
>>>>>> Thanks for reporting this. Did you test this with 1.0 or the recent
>>>>>> HEAD? We had done performance testing prior to 1.0rc1 and had not
>>>>>> found any substantial discrepancy on the call ingestion path. Hence,
>>>>>> we had focused on fixing the performance issues around writing events
>>>>>> on the stream in MESOS-5222 and MESOS-5457.
>>>>>>
>>>>>> The numbers in the benchmark test pointed to by Haosdent (v0 vs. v1)
>>>>>> differ due to the slowness of the client (scheduler library) in
>>>>>> processing the status update events. We should add another benchmark
>>>>>> that measures just the time taken by the master to write the events.
>>>>>> I will file an issue shortly to address this.
>>>>>>
>>>>>> Do you mind filing an issue with more details on your test setup?
>>>>>>
>>>>>> -anand
>>>>>>
>>>>>>> On Sun, Oct 16, 2016 at 12:05 AM, Dario Rexin <dre...@apple.com> wrote:
>>>>>>> Hi haosdent,
>>>>>>>
>>>>>>> Thanks for the pointer! Your results show exactly what I'm
>>>>>>> experiencing. I think this could be very problematic, especially for
>>>>>>> bigger clusters. It would be great to get some input from the folks
>>>>>>> working on the HTTP API, especially Anand.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dario
>>>>>>>
>>>>>>>> On Oct 16, 2016, at 12:01 AM, haosdent <haosd...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hmm, this is an interesting topic. @anandmazumdar created a benchmark
>>>>>>>> test case to compare the v1 and v0 APIs before. You could run it via
>>>>>>>>
>>>>>>>> ```
>>>>>>>> ./bin/mesos-tests.sh --benchmark --gtest_filter="*SchedulerReconcileTasks_BENCHMARK_Test*"
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Here is the result of running it on my machine:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> [ RUN ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0
>>>>>>>> Reconciling 1000 tasks took 386.451108ms using the scheduler library
>>>>>>>> [ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0 (479 ms)
>>>>>>>> [ RUN ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1
>>>>>>>> Reconciling 10000 tasks took 3.389258444secs using the scheduler library
>>>>>>>> [ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1 (3435 ms)
>>>>>>>> [ RUN ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2
>>>>>>>> Reconciling 50000 tasks took 16.624603964secs using the scheduler library
>>>>>>>> [ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2 (16737 ms)
>>>>>>>> [ RUN ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/3
>>>>>>>> Reconciling 100000 tasks took 33.134018718secs using the scheduler library
>>>>>>>> [ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/3 (33333 ms)
>>>>>>>> [ RUN ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/0
>>>>>>>> Reconciling 1000 tasks took 24.212092ms using the scheduler driver
>>>>>>>> [ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/0 (89 ms)
>>>>>>>> [ RUN ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/1
>>>>>>>> Reconciling 10000 tasks took 316.115078ms using the scheduler driver
>>>>>>>> [ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/1 (385 ms)
>>>>>>>> [ RUN ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/2
>>>>>>>> Reconciling 50000 tasks took 1.239050154secs using the scheduler driver
>>>>>>>> [ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/2 (1379 ms)
>>>>>>>> [ RUN ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/3
>>>>>>>> Reconciling 100000 tasks took 2.38445672secs using the scheduler driver
>>>>>>>> [ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/3 (2711 ms)
>>>>>>>> ```
>>>>>>>>
>>>>>>>> *SchedulerLibrary* is the HTTP API, *SchedulerDriver* is the old way
>>>>>>>> based on libmesos.so.
>>>>>>>>
>>>>>>>>> On Sun, Oct 16, 2016 at 2:41 PM, Dario Rexin <dre...@apple.com> wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I recently did some performance testing on the v1 scheduler API and
>>>>>>>>> found that throughput is around 10x lower than for the v0 API.
>>>>>>>>> Using 1 connection, I don't get much more than 1,500 calls per
>>>>>>>>> second, whereas the v0 API can do ~15,000. If I use multiple
>>>>>>>>> connections, throughput maxes out at 3 connections and ~2,500
>>>>>>>>> calls/s. If I add any more connections, the throughput per
>>>>>>>>> connection drops and the total throughput stays around ~2,500
>>>>>>>>> calls/s. Has anyone done performance testing on the v1 API before?
>>>>>>>>> It seems a little strange to me that it's so much slower, given
>>>>>>>>> that the v0 API also uses HTTP (well, more or less). I would be
>>>>>>>>> thankful for any comments and experience reports from other users.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Dario
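For reference, the shape of a harness that produces throughput numbers like
these: N connections, each sending calls back to back for a fixed window,
then dividing the total by the elapsed time. A bare skeleton (not the actual
tool; the HTTP POST of the REVIVE call is stubbed out and the names are made
up):

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Skeleton of a calls-per-second load generator: one worker per connection,
// each sending calls back to back for a fixed window. The actual HTTP POST
// to /api/v1/scheduler is stubbed out; this only shows the measurement.
int main(int argc, char** argv)
{
  const int connections = argc > 1 ? std::stoi(argv[1]) : 3;
  const auto window = std::chrono::seconds(10);

  std::atomic<long> calls{0};
  std::vector<std::thread> workers;

  for (int i = 0; i < connections; ++i) {
    workers.emplace_back([&] {
      const auto deadline = std::chrono::steady_clock::now() + window;
      while (std::chrono::steady_clock::now() < deadline) {
        // POST the serialized REVIVE call on this worker's connection
        // and wait for / correlate the response (stubbed out here).
        ++calls;
      }
    });
  }

  for (auto& worker : workers) {
    worker.join();
  }

  std::cout << connections << " connections: "
            << calls.load() / window.count() << " calls/s" << std::endl;
}
```

Run with the connection count as the only argument, e.g. something like
`./loadgen 3` if the sketch were built into a binary of that name.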
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards,
>>>>>>>> Haosdent Huang
>>>>
>>>> --
>>>> Deshi Xiao
>>>> Twitter: xds2000
>>>> E-mail: xiaods(AT)gmail.com
>>>
>>> --
>>> Anand Mazumdar