Fwd: Coordinated Omission (CO) - possible strategies

Gil Tene Fri, 18 Oct 2013 17:09:38 -0700

[FYI - this is resent out of order due to a bounce. Some replies to this have 
already been posted].


I don't think the thread model is the core of the Coordinated Omission problem. 
Unless we consider the only solution to be sending no more than one request per 
20 minutes from any given thread a threading model fix. It's more of a 
configuration choice the way I see it, but a pretty impossible one. The thread 
model may need work for other reasons, but CO is not one of them.

In JMeter, as with all other synchronous testers, Coordinated Omission is a 
per-thread issue. It's easy to demonstrate CO with JMeter with a single client 
thread testing an application that has only a single client connection in the 
real world, or with 15 client threads testing an application that has exactly 
15 real-world clients communicating at high rates (common with muxed 
environments, messaging, ESBs, trading systems, etc.). No amount of threading 
or concurrency will help capture better test results for these very real 
systems. Any occurrence of CO will make the JMeter results seriously bogus.

When any one thread misses a planned request sending time, CO has already 
occurred, and there is no way to avoid it at that point. You can certainly detect 
that CO has happened. The question is what to do about it in JMeter once you 
detect it. The major options are:

1. Ignore it and keep working with the data as if it actually meant anything. 
This amounts to http://tinyurl.com/o46doqf .

2. You can try to change the tester behavior to avoid CO going forward. E.g. 
you can try to adjust the number of threads up AND, at the same time, adjust 
the frequency at which each thread sends requests, which will amount to 
drastically changing the test plan in reaction to system behavior. In my 
opinion, changing behavior dynamically will have very limited effectiveness for 
two reasons. The first is that the problem has already occurred, so all the 
data up to and including the observed CO is already bogus and has to be thrown 
away unless it can be corrected somehow. Only after you have auto-adjusted 
enough times to not see CO for a long time might your results during that time 
be valid. The second is that changing the test scenario is valid (and possible) 
for very few real-world systems.

3. You can try to correct for CO when you observe it. There are various ways 
this can be done, and most of them will amount to re-creating missing test 
sample results by projecting from past results. This can help correct the 
results data set so that it would better approximate what a tester that was not 
synchronous, and would have kept issuing requests per the actual test plan, 
would have experienced in the test.

4. Something else we hadn't yet thought about.
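
To make option 3 concrete, here is a minimal sketch of one simple way to 
re-create missing samples: a linear back-fill that assumes a constant planned 
interval between requests on a thread. The class and method names are 
hypothetical, and this is not the pattern-based approach OutlierCorrector 
takes; it only illustrates the general idea of projecting the samples a 
non-synchronous tester would have recorded during a stall.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of JMeter or OutlierCorrector. */
final class CoBackfill {
    /**
     * Returns the observed latency followed by synthetic latencies for the
     * requests that should have been sent while the thread was blocked.
     * Assumes one request was planned every plannedIntervalMs.
     */
    static List<Long> correct(long observedLatencyMs, long plannedIntervalMs) {
        if (plannedIntervalMs <= 0) {
            throw new IllegalArgumentException("plannedIntervalMs must be positive");
        }
        List<Long> samples = new ArrayList<>();
        samples.add(observedLatencyMs);
        // Each missed send slot starts one interval later than the previous
        // one, so it would have waited one interval less for the same stall.
        for (long missing = observedLatencyMs - plannedIntervalMs;
             missing >= plannedIntervalMs;
             missing -= plannedIntervalMs) {
            samples.add(missing);
        }
        return samples;
    }
}
```

With a planned interval of 100 ms, a single 250 ms response yields the samples 
250 and 150, while a 50 ms response yields only itself, since no send slot was 
missed.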

Some example correction and detection work can be found at:
https://github.com/OutlierCorrector/jmeter/commit/34c34cae673fd0871a423035a9f262d049f3d9e9
It uses code from https://github.com/OutlierCorrector/OutlierCorrector . 
Michael Chmiel worked at Azul Systems over the summer on this problem, and the 
OutlierCorrector package and the small patch to JMeter (under the docs-2.9 
branch) are some of the results of that work. This fix approach appears to work 
well as long as no explicitly random behavior is stated in the test scenarios: 
the outlier detector detects a test pattern and repeats it in repairing the 
data, and expressly random scenarios will not exhibit a detectable pattern.

-- Gil.

On Oct 17, 2013, at 11:47 PM, Kirk Pepperdine <[email protected]> wrote:

Hi Sebb,

In my testing, the option of creating threads on demand instead of all at once 
has made a huge difference in my being able to control the rate of arrivals on 
the server. It has convinced me that simply using the throughput controller 
isn't enough and that the threading model in JMeter *must* change. It is the 
threading model that is the biggest source of CO in JMeter. Unfortunately we 
weren't able to find a non-disruptive way to change JMeter to make this 
happen.

The model I was proposing would have JMeter generate an event heap sorted by 
the time when a sampler should be fired. A thread pool would be used to eat 
off of the heap and fire the events as scheduled. This would allow JMeter 
to break the inappropriate relationship of a thread being a user. The solution 
is not perfect in that you will still have to fight with thread schedulers and 
hypervisors to get things to happen on cue. However, I believe the end result 
will be a far more scalable product that will require far fewer threads to 
produce far higher loads on the server.
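
As a rough illustration of the event heap idea, here is a minimal sketch of a 
time-ordered heap of planned sampler firings. All names are hypothetical and 
nothing here exists in JMeter. The worker threads that would drain the heap 
and fire the samplers are left out; the methods are synchronized so that a 
pool could share one heap.

```java
import java.util.PriorityQueue;

/** Hypothetical sketch; none of these names exist in JMeter. */
final class SamplerEventHeap {
    /** One planned sampler firing, ordered by its planned fire time. */
    static final class Event implements Comparable<Event> {
        final long fireAtMs;
        final String samplerName;

        Event(long fireAtMs, String samplerName) {
            this.fireAtMs = fireAtMs;
            this.samplerName = samplerName;
        }

        @Override
        public int compareTo(Event other) {
            return Long.compare(fireAtMs, other.fireAtMs);
        }
    }

    private final PriorityQueue<Event> heap = new PriorityQueue<>();

    /** Plans `count` firings of one sampler at a fixed interval. */
    synchronized void schedule(String samplerName, long startMs, long intervalMs, int count) {
        for (int i = 0; i < count; i++) {
            heap.add(new Event(startMs + i * intervalMs, samplerName));
        }
    }

    /** Returns the earliest event due at or before nowMs, or null if none is due. */
    synchronized Event pollDue(long nowMs) {
        Event head = heap.peek();
        return (head != null && head.fireAtMs <= nowMs) ? heap.poll() : null;
    }
}
```

Because each event carries its own planned fire time, any free worker can pick 
up the next due event, and the planned time stays available for measuring 
latency even when the event is fired late.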

As for your idea of using the throughput controller: IMHO, triggering an 
assert only worsens the CO problem. In fact, if the response times from the 
timeouts are not added to the results, in other words if they are omitted from 
the data set, you've only made the problem worse, as you are filtering bad data 
points out of the result sets, making the results look better than they should. 
Peter Lawrey's (included here for the purpose of this discussion) technique for 
correcting CO is simply to recognize when the event should have been triggered 
and start the timer for that event at that time. So the latency reported 
will include the time spent waiting before the event was actually triggered.
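
A minimal sketch of the timing technique just described, assuming a fixed 
planned interval between requests on one thread. The class name and its 
methods are hypothetical. The point is only that the timer starts at the 
planned send time, not at the moment the blocked thread actually got around 
to sending.

```java
/** Hypothetical sketch; measures latency from the planned send time. */
final class IntendedStartClock {
    private final long intervalMs;
    private long intendedStartMs;

    IntendedStartClock(long firstIntendedStartMs, long intervalMs) {
        this.intendedStartMs = firstIntendedStartMs;
        this.intervalMs = intervalMs;
    }

    /**
     * Call when the next planned request completes. Returns its latency
     * measured from when it should have been sent.
     */
    long recordCompletion(long completedAtMs) {
        long latencyMs = completedAtMs - intendedStartMs;
        // The plan advances by one interval regardless of any stall.
        intendedStartMs += intervalMs;
        return latencyMs;
    }
}
```

For example, with a 100 ms interval starting at time 0, suppose the first 
request stalls and completes at 350 ms, and the next two are then sent late 
and complete at 360 ms and 370 ms. Timing from the actual send would report 
about 10 ms for each of the latter two; timing from the planned send reports 
350, 260 and 170 ms.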

Gil Tene's done some work with JMeter. I'll leave it up to him to post what 
he's done. The interesting bit that he's created is HdrHistogram 
(https://github.com/giltene/HdrHistogram). It is not only a better way to 
report results, it also offers techniques to calculate and correct for CO. Also, 
Gil might be able to point you to a more recent version of his talk on CO. It 
might be nice to have a new sampler that incorporates this work.

On a side note, I've got a Servlet filter that is a JMX component and measures 
a bunch of stats from the server's POV. It's something that could be 
contributed, as it could be used to help understand the source of CO, if not 
just to complement JMeter's view of latency.

Regards,
Kirk


On 2013-10-18, at 12:27 AM, sebb <[email protected]> wrote:

It looks to be quite difficult to avoid the issue of Coordinated
Omission without a major redesign of JMeter.

However, it may be a lot easier to detect when the condition has occurred.
This would potentially allow the test settings to be changed to reduce
or eliminate the occurrences - e.g. by increasing the number of
threads or spreading the load across more JMeter instances.

The Constant Throughput Controller calculates the desired wait time,
and if this is less than zero - i.e. a sample should already have been
generated - it could trigger the creation of a failed Assertion
showing the time difference.
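
A minimal sketch of that check, assuming a fixed planned interval per
thread. The class is hypothetical and does not use JMeter's timer or
assertion classes; it only shows the "wait time less than zero" test.

```java
/** Hypothetical sketch; does not use JMeter's timer or assertion classes. */
final class MissedSlotDetector {
    private final long intervalMs;
    private long nextPlannedSendMs;

    MissedSlotDetector(long firstPlannedSendMs, long intervalMs) {
        this.nextPlannedSendMs = firstPlannedSendMs;
        this.intervalMs = intervalMs;
    }

    /**
     * Call just before sending a sample. Returns 0 if the thread is on
     * time, otherwise the number of milliseconds by which it is late.
     */
    long checkBeforeSend(long nowMs) {
        long waitMs = nextPlannedSendMs - nowMs; // the desired wait time
        nextPlannedSendMs += intervalMs;
        return waitMs < 0 ? -waitMs : 0;
    }
}
```

A non-zero return value is the condition under which the failed
Assertion could be created, with the returned value as the time
difference to show.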

Would this be sufficient to detect all CO occurrences?
If not, what other metric needs to be checked?

Even if it is not the only possible cause, would it be useful as a
starting point?

