Re: [Project Clearwater] Performance limit measurement

2016-09-20 Thread Graeme Robertson (projectclearwater.org)
Hi Michael,

It’s worth noting that Project Clearwater is designed to scale horizontally 
rather than vertically, so we would expect multiple less powerful Sprout nodes 
to out-perform a single powerful Sprout node. However, that doesn’t mean that 
your Sprout node isn’t capable of handling the load you’re hitting it with.

We do expose latency measurements over SNMP – see 
http://clearwater.readthedocs.io/en/stable/Clearwater_SNMP_Statistics.html for 
more details. In particular, under Sprout statistics we have various latency 
statistics including latency for SIP requests and latency for requests to 
Homestead. There are a couple of other statistics that might be useful for 
determining where exactly requests are failing – if the number of initial 
registration failures and/or the number of authentication failures is non-zero, 
this would indicate that the bottleneck is actually at Homestead.
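
As a concrete starting point, something like the following run from a management 
host would dump the relevant tables while the test is running. This is only a 
sketch: the community string and OID below are placeholders, and the real values 
should be taken from the SNMP statistics documentation linked above.

  # Placeholder community string and OID - substitute the values documented
  # for your deployment before running this.
  snmpwalk -v2c -c <community> <sprout-ip> <sprout-statistics-oid>

  # Things to watch during the run:
  #  - SIP request latency and Homestead request latency, to see where time
  #    is being spent as the load ramps up
  #  - initial registration failure and authentication failure counts; if
  #    these are non-zero the bottleneck is likely Homestead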

It does sound as though Sprout is reporting itself as overloaded even though it 
could handle more requests. As I mentioned previously, Sprout will tweak its 
overload controls to rectify this, but it won’t be immediate, which might 
explain the failures. I know you’ve already tried tweaking the token controls, 
but it might be worth looking at them again. Over the course of the minute of 
your test, I think we expect to receive 60,000 REGISTERs (2 per subscriber), 
and they should be evenly distributed, so we’re expecting 1,000 requests per 
second. Have you tried setting init_token_rate to 1,000? You’ll want to make 
sure this change is picked up on both Sprout and Homestead – you can do this by 
editing /etc/clearwater/shared_config on a single node and running 
/usr/share/clearwater/clearwater-config-manager/scripts/upload_shared_config. 
After a few minutes the change will have propagated around the deployment.
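
As a sketch of that procedure (the init_token_rate=1000 value simply reflects the 
1,000 requests per second worked out above, and assumes the usual key=value 
format of shared_config):

  # On any one node, edit the shared configuration and set the token rate,
  # i.e. add or change this line in /etc/clearwater/shared_config:
  #
  #   init_token_rate=1000
  #
  # then push the change out to the rest of the deployment:
  /usr/share/clearwater/clearwater-config-manager/scripts/upload_shared_config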

Thanks,
Graeme

From: Clearwater [mailto:clearwater-boun...@lists.projectclearwater.org] On 
Behalf Of Μιχάλης Κατσούλης
Sent: 19 September 2016 08:42
To: clearwater@lists.projectclearwater.org
Subject: Re: [Project Clearwater] Performance limit measurement

Hi Graeme,

I created a simpler scenario compared to what the SIP stress testing uses. In 
each scenario two subscribers just try to register to the IMS and do not make 
any call to each other. I ran this scenario for 15000 pairs of subscribers 
(30000 subscribers in total). The REGISTER requests are distributed over 1 
minute. It seems that the Sprout node is the bottleneck. The return code of 
most of the failed messages is 503 (Service Unavailable) and of some of them 
408 (Request Timeout). I have added resources to Sprout (4 CPUs and 8 GB 
memory), so I don't believe that resources are the issue.
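
For illustration only (the scenario file itself is not reproduced here; the file 
names, rate and address below are placeholders), the load shape described 
corresponds roughly to a SIPp run like this, since 15000 pairs spread over 60 
seconds is about 250 scenario starts per second:

  # register.xml: the two-subscriber REGISTER scenario described above
  # users.csv:    the injection file with the subscriber identities
  # -r 250:       start 250 scenario instances per second
  # -m 15000:     stop after all 15000 pairs have been injected (~60 s)
  sipp -sf register.xml -inf users.csv -r 250 -m 15000 <sprout-ip>:5060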

Does Sprout somehow expose the latency measurements that lead to the 
throttling? We would like to take a look at them.



Here is the XML file. [scenario XML not preserved in the archive]

Best Regards,
Michael Katsoulis



2016-09-16 21:25 GMT+03:00 Graeme Robertson (projectclearwater.org) 
<gra...@projectclearwater.org>:
Hi Michael,

Can you tell me more about your scenario? It sounds like you’re not using the 
clearwater-sip-stress package, or at least not in exactly the form we package 
up. If you’re not using the clearwater-sip-stress package then please can you 
send details of your stress scenario?

Depending on how powerful your Sprout node is, I would expect 15000 calls per 
second to be towards the upper limit of its performance powers. However, if the 
CPU usage is not particularly high then that would suggest that Sprout’s 
throttling controls might require further tuning. Do you know what return code 
the “unexpected messages” have? 503s indicate that there is overload somewhere. 
Sprout does adjust its throttling controls to match the load it’s able to 
process, but that process is not immediate, and we recommend building stress up 
gradually rather than immediately firing 15000 calls per second into the system 
– for more information on that, see 
http://www.projectclearwater.org/clearwater-performance-and-our-load-monitor/.
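
One simple way to build the load up gradually, so that the overload controls 
have time to adapt, is to step the offered rate up in stages rather than 
starting at the full rate. A rough sketch (scenario file, rates and durations 
are illustrative):

  # Run ~60 seconds at each rate, stepping up towards the target load.
  for rate in 100 250 500 750 1000; do
      echo "Offering ${rate} scenario starts/second"
      sipp -sf register.xml -inf users.csv -r "${rate}" -m $((rate * 60)) <sprout-ip>:5060
  done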

One final thought I had was that the node you’re running stress on might be 
overloaded. If the stress node is not responding to messages in a timely 
fashion then that will generate timeouts and unexpected messages.
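
A quick way to rule this out is to watch the stress node itself while the test 
runs, for example:

  # On the stress node, during the run:
  uptime      # load average versus the number of cores
  vmstat 5    # CPU, run queue and swap activity every 5 seconds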

Thanks,
Graeme

From: Clearwater 
[mailto:clearwater-boun...@lists.projectclearwater.org] On Behalf Of Μιχάλης Κατσούλης
Sent: 16 September 2016 15:16
To: 
clearwater@lists.projectclearwater.org
Subject: Re: [Project Clearwater] Performance limit measurement

Hi Graeme,

thanks a lot for your response.

In our scenario we are using the Stress node to generate 15000 calls in 60 seconds.

Re: [Project Clearwater] Performance limit measurement

2016-09-19 Thread Μιχάλης Κατσούλης
Hi Graeme,

I created a simpler scenario compared to what the SIP stress testing uses.
In each scenario two subscribers just try to register to the IMS and do not
make any call to each other. I ran this scenario for 15000 pairs of
subscribers (30000 subscribers in total). The REGISTER requests are
distributed over 1 minute. It seems that the Sprout node is the bottleneck.
The return code of most of the failed messages is 503 (Service Unavailable)
and of some of them 408 (Request Timeout). I have added resources to Sprout
(4 CPUs and 8 GB memory), so I don't believe that resources are the issue.

Does Sprout somehow expose the latency measurements that lead to the
throttling? We would like to take a look at them.



*Here is the XML file.* [scenario XML not preserved in the archive]

Best Regards,
Michael Katsoulis



2016-09-16 21:25 GMT+03:00 Graeme Robertson (projectclearwater.org) <
gra...@projectclearwater.org>:

> Hi Michael,
>
>
>
> Can you tell me more about your scenario? It sounds like you’re not using
> the clearwater-sip-stress package, or at least not in exactly the form we
> package up. If you’re not using the clearwater-sip-stress package then
> please can you send details of your stress scenario?
>
>
>
> Depending on how powerful your Sprout node is, I would expect 15000 calls
> per second to be towards the upper limit of its performance powers.
> However, if the CPU usage is not particularly high then that would suggest
> that Sprout’s throttling controls might require further tuning. Do you know
> what return code the “unexpected messages” have? 503s indicate that there is
> overload somewhere. Sprout does adjust its throttling controls to match the
> load it’s able to process, but that process is not immediate, and we
> recommend building stress up gradually rather than immediately firing 15000
> calls per second into the system – for more information on that, see
> http://www.projectclearwater.org/clearwater-performance-and-our-load-monitor/.
>
>
>
> One final thought I had was that the node you’re running stress on might
> be overloaded. If the stress node is not responding to messages in a timely
> fashion then that will generate timeouts and unexpected messages.
>
>
>
> Thanks,
> Graeme
>
>
>
> *From:* Clearwater [mailto:clearwater-boun...@lists.projectclearwater.org]
> *On Behalf Of *Μιχάλης Κατσούλης
> *Sent:* 16 September 2016 15:16
> *To:* clearwater@lists.projectclearwater.org
> *Subject:* Re: [Project Clearwater] Performance limit measurement
>
>
>
> Hi Graeme,
>
>
>
> thanks a lot for your response.
>
>
>
> In our scenario we are using the Stress node to generate 15000 calls in 60
> seconds. The number of
>
> unsuccessful calls varies from ~500 to ~5000 even in subsequent
> repetitions of the same scenario.
>
> According to Wireshark the failures happen because Sprout does not send the
> correct responses in time
>
> and so we get "time-outs" and "unexpected messages" in the Stress node.
>
> The Sprout node has sufficient CPU and memory resources.
>
> What could be the reason for this instability in our deployment?
>
>
>
> Thank you in advance,
>
> Michael Katsoulis
>
> 2016-09-16 16:14 GMT+03:00 Graeme Robertson (projectclearwater.org) <
> gra...@projectclearwater.org>:
>
> Hi Michael,
>
>
>
> How many successes and failures are you seeing? We primarily use the
> clearwater-sip-stress package to check we haven’t introduced crashes under
> load, and to check we haven’t significantly regressed the performance of
> Project Clearwater. Unfortunately clearwater-sip-stress is not reliable
> enough to generate completely accurate performance numbers for Project
> Clearwater (and we don’t accurately measure Project Clearwater performance
> or provide numbers). We tend to see around 1% failures when running
> clearwater-sip-stress. If your failure numbers are fluctuating at around 1%
> then this is probably down to the test scripts not being completely
> reliable, and you won’t have actually hit the deployment’s limit until you
> start seeing more failures than this.
>
>
>
> Thanks,
>
> Graeme
>
>
>
>
>
> *From:* Clearwater [mailto:clearwater-boun...@lists.projectclearwater.org]
> *On Behalf Of *Μιχάλης Κατσούλης
> *Sent:* 16 September 2016 10:17
> *To:* Clearwater@lists.projectclearwater.org
> *Subject:* [Project Clearwater] Performance limit measurement

[Project Clearwater] Performance limit measurement

2016-09-16 Thread Μιχάλης Κατσούλης
Hi all,

we are running stress tests against our Clearwater deployment using the SIP
Stress node.
We have noticed that the results are not consistent, as the number of
successful calls changes during repetitions of the same test scenario.

We have tried to increase the values of max_tokens, init_token_rate,
min_token_rate and target_latency_us, but we did not observe any difference.
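
For reference, these overload-control settings live in /etc/clearwater/shared_config
alongside the rest of the shared configuration; an illustrative block (values are
examples only, not recommendations) looks like:

  # Overload control settings in /etc/clearwater/shared_config
  # (example values - after changing them, push the file out with the
  # clearwater-config-manager upload_shared_config script)
  target_latency_us=100000
  max_tokens=1000
  init_token_rate=100.0
  min_token_rate=10.0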

What is the proposed way to discover the deployment's limit on how many
requests per second can
be served?

Thanks in advance,
Michael Katsoulis
___
Clearwater mailing list
Clearwater@lists.projectclearwater.org
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org