Dario,

It's not immediately clear to me where the bottleneck might be. I filed MESOS-6405 <https://issues.apache.org/jira/browse/MESOS-6405> to write a benchmark that mimics your test setup, so we can then go about fixing the issues.
-anand

On Sun, Oct 16, 2016 at 6:20 PM, Dario Rexin <[email protected]> wrote:

> Hi Anand,
>
> I tested with and without pipelining and it doesn't make a difference.
> First of all, unlimited pipelining is not a good idea: we still have to
> handle the responses and need to be able to relate each request to its
> response upon return, i.e. store the context of the request until we
> receive the response. Also, we want to know as soon as possible when an
> error occurs, so early returns are very desirable. I agree that it
> shouldn't make a difference in how fast events can be processed whether
> they are queued on the master or on the client, but this observation made
> it very apparent that throughput is a problem on the master. I did not
> make any requests that would potentially block for a long time, so it's
> even weirder to me that the throughput is so low. One thing I don't
> understand, for example, is why all messages go through the master
> process. The parsing, for example, could be done in a completely separate
> process, and if every connected framework were backed by its own process,
> the check whether a framework is connected could also be done there (not
> to mention that this requirement exists only because we need to use
> multiple connections). Requiring all messages to go through a single
> process that can block indefinitely is obviously a huge bottleneck. I
> understand that this problem is not limited to the HTTP API, but I think
> it has to be fixed.
>
> —
> Dario
>
> On Oct 16, 2016, at 5:52 PM, Anand Mazumdar <[email protected]> wrote:
>
> Dario,
>
> Regarding:
>
> > This is especially concerning, as it means that accepting calls will
> > completely stall when a long running call (e.g. retrieving state.json)
> > is running.
>
> How does it help a client when it gets an early accepted response versus
> when accepting of calls is stalled, i.e. queued up on the master actor?
> The client does not need to wait for a response before pipelining its
> next request to the master anyway. In your tests, do you send the next
> REVIVE call only upon receiving the response to the current call? That
> might explain the behavior you are seeing.
>
> -anand
>
> On Sun, Oct 16, 2016 at 11:58 AM, tommy xiao <[email protected]> wrote:
>
>> Interesting topic.
>>
>> 2016-10-17 2:51 GMT+08:00 Dario Rexin <[email protected]>:
>>
>>> Hi Anand,
>>>
>>> I tested with the current HEAD. After I saw low throughput with our own
>>> HTTP API client, I wrote a small server that sends out fake events and
>>> accepts calls, and our client was able to send a lot more calls to that
>>> server. I also wrote a small tool that simply sends as many calls to
>>> Mesos as possible without handling any events, and I get similar
>>> results there. I also observe extremely high CPU usage: while my
>>> sending tool is using ~10% CPU, Mesos runs at ~185%. The calls I send
>>> for testing are all REVIVE, and I don't have any agents connected, so
>>> there should be essentially nothing happening. One reason I could think
>>> of for the reduced throughput is that all calls are processed in the
>>> master process before it sends back an ACCEPTED, leading to effectively
>>> single-threaded processing of HTTP calls, interleaved with all other
>>> calls that are sent to the master process. Libprocess, however, just
>>> forwards the messages to the master process and then immediately
>>> returns ACCEPTED. It also handles all connections in separate
>>> processes, whereas HTTP calls are effectively all handled by the master
>>> process. This is especially concerning, as it means that accepting
>>> calls will completely stall while a long-running call (e.g. retrieving
>>> state.json) is running.
>>>
>>> Thanks,
>>> Dario
>>>
>>> On Oct 16, 2016, at 11:01 AM, Anand Mazumdar <[email protected]> wrote:
>>>
>>> Dario,
>>>
>>> Thanks for reporting this. Did you test this with 1.0 or the recent
>>> HEAD?
>>> We had done performance testing prior to 1.0rc1 and had not found any
>>> substantial discrepancy on the call ingestion path. Hence, we had
>>> focused on fixing the performance issues around writing events on the
>>> stream in MESOS-5222 <https://issues.apache.org/jira/browse/MESOS-5222>
>>> and MESOS-5457 <https://issues.apache.org/jira/browse/MESOS-5457>.
>>>
>>> The numbers in the benchmark test pointed to by Haosdent (v0 vs. v1)
>>> differ due to the slowness of the client (scheduler library) in
>>> processing the status update events. We should add another benchmark
>>> that measures just the time taken by the master to write the events. I
>>> will file an issue shortly to address this.
>>>
>>> Do you mind filing an issue with more details on your test setup?
>>>
>>> -anand
>>>
>>> On Sun, Oct 16, 2016 at 12:05 AM, Dario Rexin <[email protected]> wrote:
>>>
>>>> Hi haosdent,
>>>>
>>>> thanks for the pointer! Your results show exactly what I'm
>>>> experiencing. I think this could be very problematic, especially for
>>>> bigger clusters. It would be great to get some input from the folks
>>>> working on the HTTP API, especially Anand.
>>>>
>>>> Thanks,
>>>> Dario
>>>>
>>>> On Oct 16, 2016, at 12:01 AM, haosdent <[email protected]> wrote:
>>>>
>>>> Hmm, this is an interesting topic. @anandmazumdar created a benchmark
>>>> test case to compare the v1 and v0 APIs a while back. You can run it
>>>> via
>>>>
>>>> ```
>>>> ./bin/mesos-tests.sh --benchmark \
>>>>   --gtest_filter="*SchedulerReconcileTasks_BENCHMARK_Test*"
>>>> ```
>>>>
>>>> Here is the result from running it on my machine.
>>>>
>>>> ```
>>>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0
>>>> Reconciling 1000 tasks took 386.451108ms using the scheduler library
>>>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0 (479 ms)
>>>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1
>>>> Reconciling 10000 tasks took 3.389258444secs using the scheduler library
>>>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1 (3435 ms)
>>>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2
>>>> Reconciling 50000 tasks took 16.624603964secs using the scheduler library
>>>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2 (16737 ms)
>>>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/3
>>>> Reconciling 100000 tasks took 33.134018718secs using the scheduler library
>>>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/3 (33333 ms)
>>>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/0
>>>> Reconciling 1000 tasks took 24.212092ms using the scheduler driver
>>>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/0 (89 ms)
>>>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/1
>>>> Reconciling 10000 tasks took 316.115078ms using the scheduler driver
>>>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/1 (385 ms)
>>>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/2
>>>> Reconciling 50000 tasks took 1.239050154secs using the scheduler driver
>>>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/2 (1379 ms)
>>>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/3
>>>> Reconciling 100000 tasks took 2.38445672secs using the scheduler driver
>>>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/3 (2711 ms)
>>>> ```
>>>>
>>>> *SchedulerLibrary* is the HTTP API; *SchedulerDriver* is the old way
>>>> based on libmesos.so.
>>>>
>>>> On Sun, Oct 16, 2016 at 2:41 PM, Dario Rexin <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I recently did some performance testing on the v1 scheduler API and
>>>>> found that throughput is around 10x lower than for the v0 API. Using
>>>>> 1 connection, I don't get much more than 1,500 calls per second,
>>>>> whereas the v0 API can do ~15,000. If I use multiple connections,
>>>>> throughput maxes out at 3 connections and ~2,500 calls/s. If I add
>>>>> any more connections, the throughput per connection drops and the
>>>>> total throughput stays around ~2,500 calls/s. Has anyone done
>>>>> performance testing on the v1 API before? It seems a little strange
>>>>> to me that it's so much slower, given that the v0 API also uses HTTP
>>>>> (well, more or less). I would be thankful for any comments and
>>>>> experience reports from other users.
>>>>>
>>>>> Thanks,
>>>>> Dario
>>>>
>>>> --
>>>> Best Regards,
>>>> Haosdent Huang
>>
>> --
>> Deshi Xiao
>> Twitter: xds2000
>> E-mail: xiaods(AT)gmail.com
>
> --
> Anand Mazumdar
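The ad-hoc harness Dario describes upthread (a small fake server that just accepts calls, plus a client hammering it with REVIVE calls over one persistent connection) can be sketched roughly as follows. This is an illustration, not Dario's actual code: the fake server here is a stand-in for Mesos, and the framework id is hypothetical; only the endpoint path, the REVIVE call shape, and the 202 Accepted reply follow the v1 scheduler API.

```python
import json
import threading
import time
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakeMaster(BaseHTTPRequestHandler):
    """Stand-in for the master: drains each call and replies 202 Accepted."""

    protocol_version = "HTTP/1.1"  # keep-alive, so one connection is reused

    def do_POST(self):
        self.rfile.read(int(self.headers["Content-Length"]))  # drain the call
        self.send_response(202)  # Mesos acknowledges calls with 202 Accepted
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the benchmark output quiet


def measure_calls_per_second(port, num_calls=1000):
    """Send REVIVE calls back-to-back on one connection; return calls/s."""
    conn = HTTPConnection("127.0.0.1", port)
    body = json.dumps({
        "framework_id": {"value": "test-framework"},  # hypothetical id
        "type": "REVIVE",
    })
    headers = {"Content-Type": "application/json"}
    start = time.time()
    for _ in range(num_calls):
        conn.request("POST", "/api/v1/scheduler", body, headers)
        conn.getresponse().read()  # wait for each response (no pipelining)
    elapsed = time.time() - start
    conn.close()
    return num_calls / elapsed


if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), FakeMaster)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    rate = measure_calls_per_second(server.server_address[1])
    print(f"{rate:.0f} calls/s against the fake master")
    server.shutdown()
```

Pointing `measure_calls_per_second` at a real master instead of the fake one reproduces the comparison Dario made: the gap between the two rates is the master-side cost he is reporting.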

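Dario's point about pipelining upthread, that every in-flight call's context must be held until its response returns, comes down to FIFO matching, since HTTP/1.1 delivers responses in request order. A minimal sketch of that bookkeeping, with hypothetical names:

```python
from collections import deque


class PipelinedCallTracker:
    """FIFO bookkeeping for calls in flight on one pipelined connection."""

    def __init__(self):
        self._pending = deque()

    def call_sent(self, context):
        # Hold the call's context until its response comes back.
        self._pending.append(context)

    def response_received(self, response):
        # HTTP/1.1 returns responses in request order, so a response
        # always belongs to the oldest pending call.
        context = self._pending.popleft()
        return context, response

    def in_flight(self):
        return len(self._pending)


# Usage: two pipelined calls, then the first response arrives.
tracker = PipelinedCallTracker()
tracker.call_sent("REVIVE #1")
tracker.call_sent("REVIVE #2")
context, response = tracker.response_received("202 Accepted")
print(context, "->", response)  # matches the oldest pending call
```

The queue itself is cheap; the cost Dario highlights is that with unlimited pipelining this queue (and the memory behind each context) grows without bound, and an error is only observed after everything queued ahead of it has been answered.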
