Interesting topic.

2016-10-17 2:51 GMT+08:00 Dario Rexin <dre...@apple.com>:
> Hi Anand,
>
> I tested with current HEAD. After I saw low throughput on our own HTTP API
> client, I wrote a small server that sends out fake events and accepts calls,
> and our client was able to send a lot more calls to that server. I also
> wrote a small tool that simply sends as many calls to Mesos as possible
> without handling any events, and got similar results there. I also observe
> extremely high CPU usage: while my sending tool is using ~10% CPU, Mesos
> runs at ~185%. The calls I send for testing are all REVIVE and I don't have
> any agents connected, so there should be essentially nothing happening.
>
> One reason I could think of for the reduced throughput is that all calls
> are processed in the master process before it sends back an ACCEPTED,
> leading to effectively single-threaded processing of HTTP calls,
> interleaved with all other messages that are sent to the master process.
> Libprocess, however, just forwards the messages to the master process and
> then immediately returns ACCEPTED. It also handles all connections in
> separate processes, whereas HTTP calls are effectively all handled by the
> master process. This is especially concerning, as it means that accepting
> calls will completely stall while a long-running call (e.g. retrieving
> state.json) is running.
>
> Thanks,
> Dario
>
> On Oct 16, 2016, at 11:01 AM, Anand Mazumdar <an...@apache.org> wrote:
>
> Dario,
>
> Thanks for reporting this. Did you test this with 1.0 or the recent HEAD?
> We had done performance testing prior to 1.0-rc1 and had not found any
> substantial discrepancy on the call ingestion path. Hence, we had focused
> on fixing the performance issues around writing events on the stream in
> MESOS-5222 <https://issues.apache.org/jira/browse/MESOS-5222> and
> MESOS-5457 <https://issues.apache.org/jira/browse/MESOS-5457>.
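The measurement setup Dario describes — a fake server that immediately answers every call with ACCEPTED, plus a tool firing calls as fast as possible — can be sketched roughly like this. This is a minimal stand-in, not his actual tool: the `/api/v1/scheduler` path and the JSON `REVIVE` call shape follow the v1 scheduler API, but the server here is a dummy, the framework ID is made up, and the numbers only illustrate the methodology.

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class FakeMaster(BaseHTTPRequestHandler):
    """Answers every POST with 202 Accepted, mimicking the master's fast path."""
    def do_POST(self):
        # Drain the request body before responding.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(202)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the timing loop quiet

# Start the fake master on an ephemeral port.
server = ThreadingHTTPServer(("127.0.0.1", 0), FakeMaster)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/api/v1/scheduler"

# A REVIVE call in the v1 JSON encoding; the framework ID is hypothetical.
call = json.dumps({"type": "REVIVE",
                   "framework_id": {"value": "test-framework"}}).encode()

n = 200
start = time.time()
for _ in range(n):
    req = urllib.request.Request(
        url, data=call, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        assert resp.status == 202
elapsed = time.time() - start
print(f"{n / elapsed:,.0f} calls/s")
server.shutdown()
```

Pointing the same loop at a real master (with persistent connections, as Dario's tool presumably used) is what would surface the gap he reports.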
> The numbers in the benchmark test pointed to by Haosdent (v0 vs. v1)
> differ due to the slowness of the client (scheduler library) in processing
> the status update events. We should add another benchmark that measures
> just the time taken by the master to write the events. I will file an
> issue shortly to address this.
>
> Do you mind filing an issue with more details on your test setup?
>
> -anand
>
> On Sun, Oct 16, 2016 at 12:05 AM, Dario Rexin <dre...@apple.com> wrote:
>
>> Hi haosdent,
>>
>> Thanks for the pointer! Your results show exactly what I'm experiencing.
>> I think especially for bigger clusters this could be very problematic.
>> It would be great to get some input from the folks working on the HTTP
>> API, especially Anand.
>>
>> Thanks,
>> Dario
>>
>> On Oct 16, 2016, at 12:01 AM, haosdent <haosd...@gmail.com> wrote:
>>
>> Hmm, this is an interesting topic. @anandmazumdar created a benchmark
>> test case to compare the v1 and v0 APIs before. You could run it via
>>
>> ```
>> ./bin/mesos-tests.sh --benchmark \
>>   --gtest_filter="*SchedulerReconcileTasks_BENCHMARK_Test*"
>> ```
>>
>> Here is the result of running it on my machine.
>> ```
>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0
>> Reconciling 1000 tasks took 386.451108ms using the scheduler library
>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0 (479 ms)
>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1
>> Reconciling 10000 tasks took 3.389258444secs using the scheduler library
>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1 (3435 ms)
>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2
>> Reconciling 50000 tasks took 16.624603964secs using the scheduler library
>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2 (16737 ms)
>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/3
>> Reconciling 100000 tasks took 33.134018718secs using the scheduler library
>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/3 (33333 ms)
>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/0
>> Reconciling 1000 tasks took 24.212092ms using the scheduler driver
>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/0 (89 ms)
>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/1
>> Reconciling 10000 tasks took 316.115078ms using the scheduler driver
>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/1 (385 ms)
>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/2
>> Reconciling 50000 tasks took 1.239050154secs using the scheduler driver
>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/2 (1379 ms)
>> [ RUN      ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/3
>> Reconciling 100000 tasks took 2.38445672secs using the scheduler driver
>> [       OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/3 (2711 ms)
>> ```
>>
>> *SchedulerLibrary* is the HTTP API, *SchedulerDriver* is the old way
>> based on libmesos.so.
>>
>> On Sun, Oct 16, 2016 at 2:41 PM, Dario Rexin <dre...@apple.com> wrote:
>>
>>> Hi all,
>>>
>>> I recently did some performance testing on the v1 scheduler API and
>>> found that throughput is around 10x lower than for the v0 API. Using one
>>> connection, I don't get much more than 1,500 calls per second, whereas
>>> the v0 API can do ~15,000. If I use multiple connections, throughput
>>> maxes out at 3 connections and ~2,500 calls/s. If I add any more
>>> connections, the throughput per connection drops and the total
>>> throughput stays around ~2,500 calls/s. Has anyone done performance
>>> testing on the v1 API before? It seems a little strange to me that it's
>>> so much slower, given that the v0 API also uses HTTP (well, more or
>>> less). I would be thankful for any comments and experience reports from
>>> other users.
>>>
>>> Thanks,
>>> Dario
>>
>> --
>> Best Regards,
>> Haosdent Huang

--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com
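For reference, the 10,000-task runs in haosdent's benchmark output imply per-call rates that line up with Dario's ~10x observation. A quick back-of-the-envelope check, using only the numbers quoted in the thread:

```python
# (tasks, seconds) transcribed from the 10,000-task benchmark runs above.
library = (10000, 3.389258444)  # v1 HTTP API (scheduler library)
driver = (10000, 0.316115078)   # v0 API (scheduler driver)

lib_rate = library[0] / library[1]
drv_rate = driver[0] / driver[1]

print(f"v1 library: {lib_rate:,.0f} calls/s")   # ≈ 2,950 calls/s
print(f"v0 driver:  {drv_rate:,.0f} calls/s")   # ≈ 31,634 calls/s
print(f"ratio:      {drv_rate / lib_rate:.1f}x")  # ≈ 10.7x
```

Note that, per Anand's reply, the library-side numbers include client-side event processing, so they overstate the master's share of the gap.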