Hi Anand,

Thanks for creating the ticket. I will also investigate a bit more. I will 
probably be in SF on Thursday, so we could discuss in person.

--
Dario

> On Oct 17, 2016, at 12:19 PM, Anand Mazumdar <an...@apache.org> wrote:
> 
> Dario,
> 
> It's not immediately clear to me where the bottleneck might be. I filed 
> MESOS-6405 to write a benchmark that mimics your test setup; we can then 
> go about fixing the issues.
> 
> -anand
> 
>> On Sun, Oct 16, 2016 at 6:20 PM, Dario Rexin <dre...@apple.com> wrote:
>> Hi Anand,
>> 
>> I tested with and without pipelining, and it doesn’t make a difference. 
>> First of all, unlimited pipelining is not a good idea: we still have to 
>> handle the responses and need to be able to correlate each request with 
>> its response, i.e. store the context of the request until the response 
>> arrives. Also, we want to know as soon as possible when an error occurs, 
>> so early returns are very desirable. I agree that it shouldn’t make a 
>> difference to how fast events can be processed whether they are queued 
>> on the master or on the client, but this observation made it very 
>> apparent that throughput is a problem on the master. I did not make any 
>> requests that would potentially block for a long time, so it’s even 
>> weirder to me that the throughput is so low. One thing I don’t 
>> understand, for example, is why all messages go through the master 
>> process. The parsing, for example, could be done in a completely 
>> separate process, and if every connected framework were backed by its 
>> own process, the check whether a framework is connected could also be 
>> done there (not to mention that this requirement only exists because we 
>> need to use multiple connections). Requiring all messages to go through 
>> a single process that can block indefinitely is obviously a huge 
>> bottleneck. I understand that this problem is not limited to the HTTP 
>> API, but I think it has to be fixed.
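>> 
>> To make this concrete, here is a rough libprocess-style sketch of the 
>> per-framework actor idea (hypothetical names, not actual Mesos code):
>> 
>> ```
>> #include <string>
>> 
>> #include <process/dispatch.hpp>
>> #include <process/process.hpp>
>> 
>> // One such actor per connected framework. libprocess runs each actor's
>> // handlers serially, but different actors make progress in parallel,
>> // so parsing no longer serializes on the master actor.
>> class FrameworkConnectionProcess
>>   : public process::Process<FrameworkConnectionProcess>
>> {
>> public:
>>   // Raw request body from the HTTP layer; parsing happens here,
>>   // off the master's queue.
>>   void receive(const std::string& body)
>>   {
>>     if (!connected) {
>>       return; // The "is this framework connected?" check lives here too.
>>     }
>> 
>>     // Parse into a v1 Call and only then hand the validated call to
>>     // the master actor, e.g.:
>>     //   process::dispatch(master, &Master::receiveCall, parse(body));
>>   }
>> 
>> private:
>>   bool connected = true;
>> };
>> ```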
>> 
>> --
>> Dario
>> 
>>> On Oct 16, 2016, at 5:52 PM, Anand Mazumdar <mazumdar.an...@gmail.com> 
>>> wrote:
>>> 
>>> Dario,
>>> 
>>> Regarding:
>>> 
>>> >This is especially concerning, as it means that accepting calls will 
>>> >completely stall when a long running call (e.g. retrieving state.json) is 
>>> >running. 
>>> 
>>> How does it help a client to get an early accepted response versus 
>>> having the acceptance of calls stalled, i.e. queued up on the master 
>>> actor? The client does not need to wait for a response before 
>>> pipelining its next request to the master anyway. In your tests, do you 
>>> send the next REVIVE call only upon receiving the response to the 
>>> current call? That might explain the behavior you are seeing; see the 
>>> sketch below.
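>>> 
>>> A minimal sketch of the two client behaviors in question (hypothetical 
>>> send()/recv() helpers standing in for a real HTTP client):
>>> 
>>> ```
>>> #include <string>
>>> 
>>> // Hypothetical helpers; assume one persistent HTTP connection.
>>> void send(const std::string& call);  // write one POST, don't wait
>>> std::string recv();                  // read one response
>>> 
>>> void serialized(int numCalls, const std::string& reviveCall)
>>> {
>>>   // Each call waits for its response, so throughput is bounded by
>>>   // the full round-trip time per call.
>>>   for (int i = 0; i < numCalls; ++i) {
>>>     send(reviveCall);
>>>     recv(); // Block until the 202 Accepted arrives.
>>>   }
>>> }
>>> 
>>> void pipelined(int numCalls, const std::string& reviveCall)
>>> {
>>>   // Requests go out back-to-back and responses are drained
>>>   // afterwards, so whether calls queue on the client or on the
>>>   // master should not change total throughput.
>>>   for (int i = 0; i < numCalls; ++i) {
>>>     send(reviveCall);
>>>   }
>>>   for (int i = 0; i < numCalls; ++i) {
>>>     recv();
>>>   }
>>> }
>>> ```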
>>> 
>>> -anand
>>> 
>>>> On Sun, Oct 16, 2016 at 11:58 AM, tommy xiao <xia...@gmail.com> wrote:
>>>> This is an interesting topic.
>>>> 
>>>> 2016-10-17 2:51 GMT+08:00 Dario Rexin <dre...@apple.com>:
>>>>> Hi Anand,
>>>>> 
>>>>> I tested with current HEAD. After I saw low throughput in our own 
>>>>> HTTP API client, I wrote a small server that sends out fake events 
>>>>> and accepts calls, and our client was able to send a lot more calls 
>>>>> to that server. I also wrote a small tool that simply sends as many 
>>>>> calls to Mesos as possible without handling any events, and I get 
>>>>> similar results there. I also observe extremely high CPU usage: 
>>>>> while my sending tool uses ~10% CPU, Mesos runs at ~185%. The calls 
>>>>> I send for testing are all REVIVE, and I don’t have any agents 
>>>>> connected, so there should be essentially nothing happening. One 
>>>>> reason I could think of for the reduced throughput is that all calls 
>>>>> are processed in the master process before it sends back an 
>>>>> ACCEPTED, leading to effectively single-threaded processing of HTTP 
>>>>> calls, interleaved with all the other messages that are sent to the 
>>>>> master process. Libprocess, however, just forwards the messages to 
>>>>> the master process and then immediately returns ACCEPTED. It also 
>>>>> handles all connections in separate processes, whereas HTTP calls 
>>>>> are effectively all handled by the master process. This is 
>>>>> especially concerning, as it means that accepting calls will stall 
>>>>> completely while a long-running call (e.g. retrieving state.json) is 
>>>>> being served.
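>>>>> 
>>>>> For reference, each test call is just a POST to the v1 scheduler 
>>>>> endpoint, roughly like this (the framework ID is a placeholder; on 
>>>>> 1.0+ a Mesos-Stream-Id header from the SUBSCRIBE response is also 
>>>>> required on non-SUBSCRIBE calls):
>>>>> 
>>>>> ```
>>>>> POST /api/v1/scheduler HTTP/1.1
>>>>> Content-Type: application/json
>>>>> 
>>>>> {
>>>>>   "framework_id": {"value": "<framework-id>"},
>>>>>   "type": "REVIVE"
>>>>> }
>>>>> ```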
>>>>> 
>>>>> Thanks,
>>>>> Dario
>>>>> 
>>>>>> On Oct 16, 2016, at 11:01 AM, Anand Mazumdar <an...@apache.org> wrote:
>>>>>> 
>>>>>> Dario,
>>>>>> 
>>>>>> Thanks for reporting this. Did you test this with 1.0 or the recent 
>>>>>> HEAD? We had done performance testing prior to 1.0-rc1 and had not 
>>>>>> found any substantial discrepancy on the call ingestion path. Hence, 
>>>>>> we had focused on fixing the performance issues around writing 
>>>>>> events to the stream in MESOS-5222 and MESOS-5457. 
>>>>>> 
>>>>>> The numbers in the benchmark test pointed to by Haosdent (v0 vs v1) 
>>>>>> differ due to the slowness of the client (scheduler library) in 
>>>>>> processing the status update events. We should add another benchmark 
>>>>>> that measures just the time taken by the master to write the events. 
>>>>>> I will file an issue shortly to address this. 
>>>>>> 
>>>>>> Do you mind filing an issue with more details on your test setup?
>>>>>> 
>>>>>> -anand
>>>>>> 
>>>>>>> On Sun, Oct 16, 2016 at 12:05 AM, Dario Rexin <dre...@apple.com> wrote:
>>>>>>> Hi haosdent,
>>>>>>> 
>>>>>>> Thanks for the pointer! Your results show exactly what I’m 
>>>>>>> experiencing. I think this could be very problematic, especially 
>>>>>>> for bigger clusters. It would be great to get some input from the 
>>>>>>> folks working on the HTTP API, especially Anand.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Dario
>>>>>>> 
>>>>>>>> On Oct 16, 2016, at 12:01 AM, haosdent <haosd...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hmm, this is an interesting topic. @anandmazumdar created a 
>>>>>>>> benchmark test case to compare the v1 and v0 APIs a while back. 
>>>>>>>> You could run it via
>>>>>>>> 
>>>>>>>> ```
>>>>>>>> ./bin/mesos-tests.sh --benchmark 
>>>>>>>> --gtest_filter="*SchedulerReconcileTasks_BENCHMARK_Test*"
>>>>>>>> ```
>>>>>>>> 
>>>>>>>> Here is the result of running it on my machine.
>>>>>>>> 
>>>>>>>> ```
>>>>>>>> [ RUN      ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0
>>>>>>>> Reconciling 1000 tasks took 386.451108ms using the scheduler library
>>>>>>>> [       OK ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0 (479 
>>>>>>>> ms)
>>>>>>>> [ RUN      ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1
>>>>>>>> Reconciling 10000 tasks took 3.389258444secs using the scheduler 
>>>>>>>> library
>>>>>>>> [       OK ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1 (3435 
>>>>>>>> ms)
>>>>>>>> [ RUN      ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2
>>>>>>>> Reconciling 50000 tasks took 16.624603964secs using the scheduler 
>>>>>>>> library
>>>>>>>> [       OK ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2 (16737 
>>>>>>>> ms)
>>>>>>>> [ RUN      ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/3
>>>>>>>> Reconciling 100000 tasks took 33.134018718secs using the scheduler 
>>>>>>>> library
>>>>>>>> [       OK ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/3 (33333 
>>>>>>>> ms)
>>>>>>>> [ RUN      ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/0
>>>>>>>> Reconciling 1000 tasks took 24.212092ms using the scheduler driver
>>>>>>>> [       OK ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/0 (89 ms)
>>>>>>>> [ RUN      ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/1
>>>>>>>> Reconciling 10000 tasks took 316.115078ms using the scheduler driver
>>>>>>>> [       OK ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/1 (385 ms)
>>>>>>>> [ RUN      ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/2
>>>>>>>> Reconciling 50000 tasks took 1.239050154secs using the scheduler driver
>>>>>>>> [       OK ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/2 (1379 
>>>>>>>> ms)
>>>>>>>> [ RUN      ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/3
>>>>>>>> Reconciling 100000 tasks took 2.38445672secs using the scheduler driver
>>>>>>>> [       OK ] 
>>>>>>>> Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/3 (2711 
>>>>>>>> ms)
>>>>>>>> ```
>>>>>>>> 
>>>>>>>> *SchedulerLibrary* is the HTTP API; *SchedulerDriver* is the old 
>>>>>>>> driver-based way, backed by libmesos.so.
>>>>>>>> 
>>>>>>>>> On Sun, Oct 16, 2016 at 2:41 PM, Dario Rexin <dre...@apple.com> wrote:
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> I recently did some performance testing on the v1 scheduler API 
>>>>>>>>> and found that throughput is around 10x lower than with the v0 
>>>>>>>>> API. Using one connection, I don’t get much more than 1,500 calls 
>>>>>>>>> per second, where the v0 API can do ~15,000. If I use multiple 
>>>>>>>>> connections, throughput maxes out at 3 connections and ~2,500 
>>>>>>>>> calls/s. If I add any more connections, the throughput per 
>>>>>>>>> connection drops and the total throughput stays around ~2,500 
>>>>>>>>> calls/s. Has anyone done performance testing on the v1 API 
>>>>>>>>> before? It seems a little strange to me that it’s so much slower, 
>>>>>>>>> given that the v0 API also uses HTTP (well, more or less). I 
>>>>>>>>> would be thankful for any comments and experience reports from 
>>>>>>>>> other users.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Dario
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -- 
>>>>>>>> Best Regards,
>>>>>>>> Haosdent Huang
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Deshi Xiao
>>>> Twitter: xds2000
>>>> E-mail: xiaods(AT)gmail.com
>>> 
>>> 
>>> 
>>> -- 
>>> Anand Mazumdar
>> 
> 
