Re: Coordinated Omission (CO) - possible strategies

Kirk Pepperdine Fri, 18 Oct 2013 11:56:43 -0700

On 2013-10-18, at 7:43 PM, Gil Tene <[email protected]> wrote:

> I'm not saying the threading model doesn't have its own issues, or that
> those issues could not in themselves cause coordinated omission. I'm saying 
> there is already a dominant, demonstrable, and classic case of CO in JMeter 
> that doesn't have anything to do with the threading model, and will not go 
> away no matter what is done to the threading model. As long as JMeter test 
> plans are expressed as describing instructions for serial, synchronous, 
> do-these-one-after-the-other scripts for what the tester should do for a 
> given client, coordinated omission will easily occur in executing those 
> instructions. I believe that this will not go away without changing how all
> JMeter test plans are expressed, and that is probably a non-starter. As a 
> result, I think that building in logic that will correct for coordinated 
> omission when it inevitably occurs, as opposed to trying to avoid its
> occurrence, is the only way to go for JMeter.


I can't disagree with you in that CO is present in a single threaded test. 
However, the nature of the type of load testing is that you play out a scenario 
because the results of the previous request are needed for the current request. 
Under those conditions you can't do much but wait until the back pressure 
clears or your initial request is retired. I think the best you can do under
these circumstances is what Sebb has suggested: flag the problem and
move on. I wouldn't fail or omit the result but I'm not sure how you can
correct because the back pressure in this case will result in lower loads which 
will allow requests to retire at a rate higher than one should normally expect.

That said, when users meet this type of system they will most likely abandon,
which is in itself a failure. JMeter doesn't have direct facilities to
support this type of behaviour.
> 
> Coordinated Omission is a basic problem that can happen due to many, many 
> different reasons and causes. It is made up of two simple things: One is the 
> Omission of some results or samples from the final data set. Second is the 
> Coordination of such omissions with other behavior, such that it is not 
> random. Random omission is usually not a problem. That's just sampling, and 
> random sampling works. Coordinated Omission is a problem because it is 
> effectively [highly] biased sampling. When Coordinated Omission occurs, the 
> resulting data set is biased towards certain behaviors (like good response 
> times), leading ALL statistics on the resulting data set to be highly suspect 
> (read: "usually completely wrong and off by orders of magnitude") in 
> describing response time or latency behavior of the observed system.
> 
> In JMeter, Coordinated Omission occurs whenever a thread doesn't execute its
> test plan as planned, and does so in reaction to behavior it encounters. This 
> is most often caused by the simple and inherent synchronous nature of test 
> plans as they are stated in JMeter: when a specific request takes longer to 
> respond than it would have taken the thread to send the next request in the
> plan, the very fact that the thread did not send the next request out on time 
> as planned is a coordinated omission: It is the effective removal of a 
> response time result that would have been in the data set had the 
> coordination not happened. It is "omission" since a measurement that should 
> have occurred didn't happen and was not recorded. It is "coordinated" because 
> the omission is not random, and is correlated-with/influenced-by the 
> occurrence of another longer than normal response time occurrence.
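
A minimal sketch of the effect described above, assuming a hypothetical single synchronous client thread, integer millisecond timings, and a fixed planned send interval. The function name and the numbers are illustrative only and are not taken from JMeter:

```python
def synchronous_run(service_times_ms, interval_ms):
    """Simulate one synchronous client thread (hypothetical model, not JMeter code).

    Each request is timed from its actual send. While the thread is blocked on
    a slow response, the planned send slots that pass are never measured.
    Returns (recorded_latencies, omitted_slot_count).
    """
    recorded = []
    omitted = 0
    for service in service_times_ms:
        recorded.append(service)
        # Planned slots at interval, 2*interval, ... strictly before the response.
        omitted += max(0, (service - 1) // interval_ms)
    return recorded, omitted


# Plan: one request every 10 ms. The server answers in 1 ms, except one 1000 ms stall.
recorded, omitted = synchronous_run([1] * 100 + [1000] + [1] * 100, interval_ms=10)
slow = sum(1 for v in recorded if v > 10)
print(len(recorded), omitted, slow)
```

In this toy run the data set holds 201 samples of which only one is slow, while 99 planned sends that fell inside the stall were never issued or recorded.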
> 
> The work done with the OutlierCorrector in JMeter focused on detecting CO in 
> streams of measured results reported to listeners, and inserting "fake" 
> results into the stream to represent the missing, omitted results that should 
> have been there. OutlierCorrector also has a log file corrector that can fix 
> JMeter logs offline, and after the fact by applying the same logic.
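
As a rough illustration of inserting "fake" results for omitted samples, here is a sketch modeled on the expected-interval idea used by HdrHistogram. It is not the OutlierCorrector algorithm, and the function name and the fixed expected interval are assumptions:

```python
def backfill(latencies_ms, expected_interval_ms):
    """Return a copy of the samples with synthetic entries for omitted requests.

    Assumption: requests were planned at a fixed interval. For every recorded
    latency longer than that interval, add the latencies that requests planned
    during the stall would roughly have seen.
    """
    if expected_interval_ms <= 0:
        raise ValueError("expected_interval_ms must be positive")
    corrected = []
    for value in latencies_ms:
        corrected.append(value)
        missing = value - expected_interval_ms
        while missing >= expected_interval_ms:
            corrected.append(missing)  # synthetic sample, not a real measurement
            missing -= expected_interval_ms
    return corrected


print(len(backfill([1, 1000, 1], expected_interval_ms=10)))
```

A single 1000 ms sample at a 10 ms planned interval gains 99 synthetic samples (990, 980, ..., 10). As the thread notes, this only makes sense when a fixed rate is a fair description of the intended load.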

Right, but this is for a fixed transactional rate which is typically seen in 
machine to machine HFTS. In Web apps, perhaps the most common use case for 
JMeter, client back-off due to back pressure is a common behaviour and it's one 
that doesn't harm the testing process in the sense that if the server can't
retire transactions fast enough, JMeter will expose it. If you want to prove
five 9's, then I agree, you've got a problem.

It's not that I disagree with you or I don't understand what you're saying, 
it's just that I'm having difficulty mapping it back to the world that people 
on this list have to deal with. With respect, I've a feeling that our views are
somewhat tainted by the worlds we live in. In the HTTP world, CO exists and I
accept it as natural behaviour. In your world CO exists but it cannot be 
accepted. The problem is that mechanical sympathy is mostly about your world. I 
think there is a commonality between the two worlds but I think to find it we 
need more discussion. I'm not sure that this list is good for this purpose so 
I'm going to flip back to mechanical sympathy instead of hijacking this mailing 
list.

-- Kirk

> 
> -- Gil.
> 
> 
> On Oct 18, 2013, at 9:54 AM, Kirk Pepperdine <[email protected]> 
> wrote:
> 
>> Hi Gil,
>> 
>> I would have to disagree as in this case I believe there is CO due to the 
>> threading model, CO on a per-thread basis as well as plain old omission. I 
>> believe these conditions are in addition to the conditions you're pointing 
>> to.
>> 
>> You may test at a fixed rate for HFT but in most worlds, random is 
>> necessary. Unfortunately that makes the problem more difficult to deal with.
>> 
>> Regards,
>> Kirk
>> 
>> On 2013-10-18, at 5:32 PM, Gil Tene <[email protected]> wrote:
>> 
>>> I don't think the thread model is the core of the Coordinated Omission 
>>> problem. Unless we consider the only solution to be sending no more than 
>>> one request per 20 minutes from any given thread a threading model fix. 
>>> It's more of a configuration choice the way I see it, but a pretty 
>>> impossible one. The thread model may need work for other reasons, but CO is 
>>> not one of them. 
>>> 
>>> In JMeter, as with all other synchronous testers, Coordinated Omission is a 
>>> per-thread issue. It's easy to demonstrate CO with JMeter with a single 
>>> client thread testing an application that has only a single client 
>>> connection in the real world, or with 15 client threads testing an 
>>> application that has exactly 15 real-world clients communicating at high 
>>> rates (common with muxed environments, messaging, ESBs, trading systems, 
>>> etc.). No amount of threading or concurrency will help capture better test
>>> results for these very real systems. Any occurrence of CO will
>>> make the JMeter results seriously bogus.
>>> 
>>> When any one thread misses a planned request sending time, CO has already 
>>> occurred, and there is no way to avoid it at that point. You can certainly
>>> detect that CO has happened. The question is what to do about it in JMeter
>>> once you detect it. The major options are:
>>> 
>>> 1. Ignore it and keep working with the data as if it actually meant 
>>> anything. This amounts to http://tinyurl.com/o46doqf .
>>> 
>>> 2. You can try to change the tester behavior to avoid CO going forward. 
>>> E.g. you can try to adjust the number of threads up AND at the same time
>>> adjust the frequency at which each thread sends requests, which will
>>> amount to drastically changing the test plan in reaction to system 
>>> behavior. In my opinion, changing behavior dynamically will have very 
>>> limited effectiveness for two reasons: The first is that the problem had 
>>> already occurred, so all the data up to and including the observed CO is
>>> already bogus and has to be thrown away unless it can be corrected somehow.
>>> Only after you auto-adjust enough times to not see CO for a long time may
>>> your results during that time be valid. The second is that changing the test
>>> scenario is valid (and possible) for very few real world systems.
>>> 
>>> 3. You can try to correct for CO when you observe it. There are various 
>>> ways this can be done, and most of them will amount to re-creating missing 
>>> test sample results by projecting from past results. This can help correct 
>>> the results data set so that it would better approximate what a tester that 
>>> was not synchronous, and would have kept issuing requests per the actual 
>>> test plan, would have experienced in the test.
>>> 
>>> 4. Something else we hadn't yet thought about.
>>> 
>>> Some correction and detection example work can be found at:
>>> https://github.com/OutlierCorrector/jmeter/commit/34c34cae673fd0871a423035a9f262d049f3d9e9
>>> which uses code at https://github.com/OutlierCorrector/OutlierCorrector .
>>> Michael Chmiel worked at Azul Systems over the summer on this problem,
>>> and the OutlierCorrector package and the small patch to JMeter (under the
>>> docs-2.9 branch) are some of the results of that work. This fix approach
>>> appears to work well as long as no explicitly random behavior is stated in
>>> the test scenarios: the outlier detector detects a test pattern and repeats
>>> it in repairing the data, and expressly random scenarios will not exhibit a
>>> detectable pattern.
>>> 
>>> -- Gil.
>>> 
>>> On Oct 17, 2013, at 11:47 PM, Kirk Pepperdine <[email protected]>
>>>  wrote:
>>> 
>>>> Hi Sebb,
>>>> 
>>>> In my testing, the option of creating threads on demand instead of all at
>>>> once has made a huge difference in my being able to control rate of 
>>>> arrivals on the server. It has convinced me that simply using the 
>>>> throughput controller isn't enough and that the threading model in JMeter 
>>>> *must* change. It is the threading model that is the biggest source of CO 
>>>> in JMeter. Unfortunately we weren't able to come to some way of a 
>>>> non-disruptive change in JMeter to make this happen.
>>>> 
>>>> The model I was proposing would have JMeter generate an event heap sorted 
>>>> by the time when a sampler should be fired. A thread pool should be used 
>>>> to eat off of the heap and fire the events as per scheduled. This would 
>>>> allow JMeter to break the inappropriate relationship of a thread being a 
>>>> user. The solution is not perfect in that you will still have to fight 
>>>> with thread schedulers and hypervisors to get things to happen on cue.
>>>> However, I believe the end result will be a far more scalable product that 
>>>> will require far fewer threads to produce far higher loads on the server.
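
A minimal sketch of the event-heap model proposed above, assuming a single dispatcher that pops events in time order and hands them to a small thread pool. All names here are hypothetical and this is not JMeter code:

```python
import heapq
import time
from concurrent.futures import ThreadPoolExecutor


def run_schedule(events, sampler, workers=4):
    """Fire samplers from a heap ordered by intended fire time.

    events: iterable of (offset_seconds, label) relative to the start of the run.
    sampler: callable(label, intended_time), run on a pool thread.
    Returns the sampler results in the order the events were fired.
    """
    # The sequence number breaks ties so labels are never compared.
    heap = [(offset, seq, label) for seq, (offset, label) in enumerate(events)]
    heapq.heapify(heap)
    start = time.monotonic()
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while heap:
            offset, _, label = heapq.heappop(heap)
            delay = start + offset - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            # Pass the intended time so the sampler can time from it.
            futures.append(pool.submit(sampler, label, start + offset))
    return [f.result() for f in futures]


def demo_sampler(label, intended_time):
    """Report how late the event fired relative to its schedule."""
    return label, time.monotonic() - intended_time


results = run_schedule([(0.02, "b"), (0.0, "a"), (0.01, "c")], demo_sampler)
print([label for label, _ in results])
```

Because a slow sampler occupies only a pool thread, later events are still dispatched at their scheduled times as long as the pool has a free worker. That is the decoupling of "thread" from "user" described above.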
>>>> 
>>>> As for your idea of using the throughput controller, IMHO triggering
>>>> an assert only worsens the CO problem. In fact, if the response times from 
>>>> the timeouts are not added into the results, in other words they are 
>>>> omitted from the data set, you've only made the problem worse as you are
>>>> filtering out bad data points from the result sets, making the results better
>>>> than they should be. Peter Lawrey's (included here for the purpose of this
>>>> discussion) technique for correcting CO is to simply recognize when the 
>>>> event should have been triggered and then start the timer for that event 
>>>> at that time. So the latency reported will include the time before event 
>>>> triggering.
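
The timing technique described above can be sketched as follows, assuming one synchronous thread, a fixed planned interval, and integer millisecond times. The function name and the numbers are illustrative:

```python
def measure(service_times_ms, interval_ms):
    """Return (service_time, latency_from_intended_start) for each request.

    Request i is intended to start at i * interval_ms, but a synchronous
    thread cannot send it before the previous response has arrived.
    """
    results = []
    free_at = 0  # time at which the thread can send again
    for i, service in enumerate(service_times_ms):
        intended = i * interval_ms
        actual = max(intended, free_at)
        end = actual + service
        # First value: timer started at the actual send.
        # Second value: timer started when the event should have fired.
        results.append((end - actual, end - intended))
        free_at = end
    return results


print(measure([25, 1, 1], interval_ms=10))
```

With a 25 ms response followed by two 1 ms responses, timing from the actual send reports 25, 1, 1, while timing from the intended start reports 25, 16, 7, so the delay before triggering shows up in the latency.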
>>>> 
>>>> Gil Tene's done some work with JMeter. I'll leave it up to him to post 
>>>> what he's done. The interesting bit that he's created is HdrHistogram
>>>> (https://github.com/giltene/HdrHistogram). It is not only a better way to
>>>> report results, it offers techniques to calculate and correct for CO. Also
>>>> Gil might be able to point you to a more recent version of his talk on CO.
>>>> It might be nice to have a new sampler that incorporates this work.
>>>> 
>>>> On a side note, I've got a Servlet filter that is a JMX component that
>>>> measures a bunch of stats from the server's POV. It's something that could
>>>> be contributed as it could be used to help understand the source of CO,
>>>> if not just complement JMeter's view of latency.
>>>> 
>>>> Regards,
>>>> Kirk
>>>> 
>>>> 
>>>> On 2013-10-18, at 12:27 AM, sebb <[email protected]> wrote:
>>>> 
>>>>> It looks to be quite difficult to avoid the issue of Coordinated
>>>>> Omission without a major redesign of JMeter.
>>>>> 
>>>>> However, it may be a lot easier to detect when the condition has occurred.
>>>>> This would potentially allow the test settings to be changed to reduce
>>>>> or eliminate the occurrences - e.g. by increasing the number of
>>>>> threads or spreading the load across more JMeter instances.
>>>>> 
>>>>> The Constant Throughput Controller calculates the desired wait time,
>>>>> and if this is less than zero - i.e. a sample should already have been
>>>>> generated - it could trigger the creation of a failed Assertion
>>>>> showing the time difference.
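
The detection idea above can be sketched like this, assuming a fixed planned interval and a list of times at which the thread became free to send. This is a hypothetical helper, not the JMeter timer code:

```python
def find_late_sends(ready_times_ms, interval_ms):
    """Flag requests whose computed wait time is negative.

    ready_times_ms[i] is when the thread became free to send request i;
    request i was planned for i * interval_ms.
    Returns a list of (request_index, lateness_ms).
    """
    late = []
    for i, ready in enumerate(ready_times_ms):
        wait = i * interval_ms - ready
        if wait < 0:
            # The sample should already have been generated.
            late.append((i, -wait))
    return late


print(find_late_sends([0, 25, 26, 27], interval_ms=10))
```

Each flagged entry carries the time difference, which is the information the failed Assertion would show. It detects the per-thread case only; it says nothing about samples delayed for other reasons.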
>>>>> 
>>>>> Would this be sufficient to detect all CO occurrences?
>>>>> If not, what other metric needs to be checked?
>>>>> 
>>>>> Even if it is not the only possible cause, would it be useful as a
>>>>> starting point?
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>
