On 2013-10-18, at 11:12 PM, Gil Tene <g...@azulsystems.com> wrote:

> This is not a problem that matters only in HFT or financial trading. Coordinated Omission is just as prevalent in making all your stats completely wrong in HTTP apps as it is in high frequency apps. It is unfortunately universal.
I'm not disagreeing with you. I am trying to sort out if it matters. One of the questions I have is: what is the human response to back pressure in an HTTP based system? Next question: does my load injector (in this case JMeter) behave in the same way?

-- Kirk

> On Oct 18, 2013, at 11:55 AM, Kirk Pepperdine <kirk.pepperd...@gmail.com> wrote:

>> On 2013-10-18, at 7:43 PM, Gil Tene <g...@azulsystems.com> wrote:

>>> I'm not saying the threading model doesn't have its own issues, or that those issues could not in themselves cause coordinated omission. I'm saying there is already a dominant, demonstrable, and classic case of CO in JMeter that doesn't have anything to do with the threading model, and will not go away no matter what is done to the threading model. As long as JMeter test plans are expressed as instructions describing serial, synchronous, do-these-one-after-the-other scripts for what the tester should do for a given client, coordinated omission will easily occur in executing those instructions. I believe that this will not go away without changing how all JMeter test plans are expressed, and that is probably a non-starter. As a result, I think that building in logic that will correct for coordinated omission when it inevitably occurs, as opposed to trying to avoid its occurrence, is the only way to go for JMeter.

>> I can't disagree with you in that CO is present in a single threaded test. However, the nature of this type of load testing is that you play out a scenario because the results of the previous request are needed for the current request. Under those conditions you can't do much but wait until the back pressure clears or your initial request is retired. I think the best you can do under these circumstances, just as Sebb has suggested, is to flag the problem and move on. I wouldn't fail nor omit the result, but I'm not sure how you can correct, because the back pressure in this case will result in lower loads, which will allow requests to retire at a rate higher than one should normally expect.

> The only "correct" way to deal with detected coordinated omission is to either correct for it, or to throw away all latency or response time data acquired with it. "Flagging" it or "moving on" and keeping the other data for analysis is the same as saying "This data is meaningless, selective, and selectively represents only the best results the system demonstrated, while specifically dropping the vast majority of actual indications of bad behavior encountered during the test".

> To be clear, I'm saying that all response time data, including the average, the 90%'ile, or any other statistic, that JMeter collects in the presence of coordinated omission is completely wrong. They are wrong because the data they are all based on is wrong. It's easily demonstrable to be off by several orders of magnitude in real world situations, and in real world web applications.

>> That said, when users meet this type of system they will most likely abandon... which is in itself a failure. JMeter doesn't have direct facilities to support this type of behaviour.

> Failure, abandon, and backoff conditions are interesting commentary that does not replace the need to include them in percentiles and averages.
> When someone says "my web application exhibits a 99%'ile response time of 700 msec", the recipient of this information doesn't hear "99% of the good results will be 700 msec or less". They hear "99% of ALL requests attempted with this system will respond within 700 msec or less". That includes all requests that may have resulted in users walking away in anger, or seeing long response times while the system was stalled for some reason.

>>> Coordinated Omission is a basic problem that can happen due to many, many different reasons and causes. It is made up of two simple things: one is the Omission of some results or samples from the final data set. The second is the Coordination of such omissions with other behavior, such that it is not random. Random omission is usually not a problem. That's just sampling, and random sampling works. Coordinated Omission is a problem because it is effectively [highly] biased sampling. When Coordinated Omission occurs, the resulting data set is biased towards certain behaviors (like good response times), leading ALL statistics on the resulting data set to be highly suspect (read: "usually completely wrong and off by orders of magnitude") in describing the response time or latency behavior of the observed system.

>>> In JMeter, Coordinated Omission occurs whenever a thread doesn't execute its test plan as planned, and does so in reaction to behavior it encounters. This is most often caused by the simple and inherent synchronous nature of test plans as they are stated in JMeter: when a specific request takes longer to respond than it would have taken the thread to send the next request in the plan, the very fact that the thread did not send the next request out on time as planned is a coordinated omission. It is the effective removal of a response time result that would have been in the data set had the coordination not happened. It is "omission" since a measurement that should have occurred didn't happen and was not recorded. It is "coordinated" because the omission is not random, and is correlated with / influenced by the occurrence of another longer-than-normal response time.

>>> The work done with the OutlierCorrector in JMeter focused on detecting CO in streams of measured results reported to listeners, and inserting "fake" results into the stream to represent the missing, omitted results that should have been there. OutlierCorrector also has a log file corrector that can fix JMeter logs offline and after the fact by applying the same logic.

>> Right, but this is for a fixed transactional rate, which is typically seen in machine-to-machine HFTS. In Web apps, perhaps the most common use case for JMeter, client back-off due to back pressure is a common behaviour, and it's one that doesn't harm the testing process in the sense that if the server can't retire transactions fast enough, JMeter will expose it. If you want to prove 5 9's, then I agree, you've got a problem.

> Actually the corrector adjusts to the current transactional rate with a configurable moving window average.
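(For illustration only, a rough sketch of the back-fill idea described above: when a sample comes back much later than the planned send interval, synthesize the results that a non-coordinating tester would have recorded in the meantime. This is not the actual OutlierCorrector code; the class and method names are made up, and the expected interval would in practice come from something like the moving window average mentioned above.)

    import java.util.ArrayList;
    import java.util.List;

    class CoBackfillSketch {
        /**
         * Given one measured latency that blocked the thread past its planned send time,
         * return synthetic latencies for the sends that were skipped while it was blocked:
         * latency - interval, latency - 2*interval, ... down to the expected interval.
         */
        static List<Long> syntheticSamples(long measuredLatencyMs, long expectedIntervalMs) {
            List<Long> filled = new ArrayList<>();
            for (long missing = measuredLatencyMs - expectedIntervalMs;
                 missing >= expectedIntervalMs;
                 missing -= expectedIntervalMs) {
                filled.add(missing);   // a request sent on schedule would have waited about this long
            }
            return filled;
        }
    }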
> Fixed transactional rates are no more common in HFTS than they are in web applications, and client backoff is just as common there. But this has nothing to do with HFTS. In all systems with synchronous clients, whether they take 20 usec or 2 seconds for a typical response, the characterization and description of response time behavior should have nothing to do with backing off. And in both types of systems, once backoff happens, coordinated omission has kicked in and your data is contaminated.

> As a trivial hypothetical, imagine a web system that regularly stalls for 3 consecutive seconds out of every 40 seconds of operation under a regular load of 200 requests per second, but responds promptly in 20 msec or less the rest of the time. This scenario can easily be found in the real world with GC pauses under high load in untuned systems. The 95%'ile response time of such a system clearly can't be described as lower than 2 seconds without outright lying. But here is what JMeter will report for such a system (see the back-of-the-envelope arithmetic sketched below):

> - For 2 threads running 100 requests per second each: the 95%'ile will show 20 msec.
> - For 20 threads running 10 requests per second each: the 95%'ile will show 20 msec.
> - For 200 threads running 1 request per second each: the 95%'ile will show 20 msec.
> - For 400 threads running 1 request every 2 seconds: the 95%'ile will show 20 msec.
> - For 2000 threads running 1 request every 10 seconds: the 95%'ile will show ~1 second.

> Clearly there is a configuration for JMeter that would expose this behavior; it's the one where the gap between requests any one client would send is higher than the largest pause the system ever exhibits. It is also the only one in the list that does not exhibit coordinated omission. But what if the real world for this system simply involved 200 interactive clients that actually hit it with a request once every 1 second or so (think simple web based games)? JMeter's test plan for that actual real world scenario would show a 95%'ile response time result that is 100x off from reality.

> And yes, in this world the real users playing this game would probably "back off". Or they may do the reverse (click repeatedly in frustration, getting a response after 3 seconds). Neither behavior would improve the 95%'ile behavior or excuse a 100x-off report.

>> It's not that I disagree with you or that I don't understand what you're saying; it's just that I'm having difficulty mapping it back to the world that people on this list have to deal with. W.r.t. that, I've a feeling that our views are somewhat tainted by the worlds we live in. In the HTTP world, CO exists and I accept it as natural behaviour. In your world CO exists but it cannot be accepted.

> Just because I also dabble in high frequency trading systems and 10 usec responses doesn't mean I forgot about large and small scale web applications with real people at the end, and with human response times.

> My world covers many more HTTP people than it does HFT people. Many people's first reaction to realizing how badly Coordinated Omission affects the accuracy of reported response times is "this applies to someone else's domain, but in mine things are still ok because of X". Unfortunately, the problem is almost universal in synchronous testers and in synchronous internal monitoring systems, and 95%+ ( ;-) ) of the Web testing environments I have encountered have dramatically under-reported their 95%, 99%, and all other %'iles. Unless those systems actually don't care about 5% failure rates, their business decisions are currently being based on bad data.
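(Back-of-the-envelope arithmetic for the hypothetical stalling system above; the figures are illustrative, not a simulation of JMeter. It shows why a synchronous, per-thread tester records almost none of the stall-affected requests, while the plan actually calls for roughly 7.5% of requests to land in the stall, more than enough to push the 95%'ile away from 20 msec.)

    public class StallArithmeticSketch {
        public static void main(String[] args) {
            double windowSec = 40.0;      // one stall cycle
            double stallSec = 3.0;        // stall length within each cycle
            double offeredRps = 200.0;    // planned aggregate request rate

            double plannedPerWindow = offeredRps * windowSec;   // 8000 planned requests per cycle
            double plannedInStall = offeredRps * stallSec;      // ~600 of them fall inside the stall

            // Synchronous tester, e.g. 20 threads at 10 req/s each: during the stall each thread
            // is stuck on a single request, so it records one long sample and silently skips
            // the ~29 other sends its plan called for in those 3 seconds.
            double threads = 20;
            double recordedBad = threads;  // roughly one long sample per thread per stall

            System.out.printf("bad samples recorded: %.0f of %.0f (%.2f%%)%n",
                    recordedBad, plannedPerWindow, 100 * recordedBad / plannedPerWindow);
            System.out.printf("bad samples per plan: %.0f of %.0f (%.2f%%)%n",
                    plannedInStall, plannedPerWindow, 100 * plannedInStall / plannedPerWindow);

            // ~0.25% recorded vs ~7.5% planned: with CO the reported 95%'ile sits at ~20 msec,
            // without it the 95%'ile is dominated by the stall.
        }
    }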
>> The problem is that mechanical sympathy is mostly about your world. I think there is a commonality between the two worlds, but I think to find it we need more discussion. I'm not sure that this list is good for this purpose, so I'm going to flip back to mechanical sympathy instead of hijacking this mailing list.

>> -- Kirk

>>> -- Gil.

>>> On Oct 18, 2013, at 9:54 AM, Kirk Pepperdine <kirk.pepperd...@gmail.com> wrote:

>>>> Hi Gil,

>>>> I would have to disagree, as in this case I believe there is CO due to the threading model, CO on a per-thread basis, as well as plain old omission. I believe these conditions are in addition to the conditions you're pointing to.

>>>> You may test at a fixed rate for HFT, but in most worlds, random is necessary. Unfortunately that makes the problem more difficult to deal with.

>>>> Regards,
>>>> Kirk

>>>> On 2013-10-18, at 5:32 PM, Gil Tene <g...@azulsystems.com> wrote:

>>>>> I don't think the thread model is the core of the Coordinated Omission problem. Unless we consider the only solution to be sending no more than one request per 20 minutes from any given thread a threading model fix. It's more of a configuration choice the way I see it, but a pretty impossible one. The thread model may need work for other reasons, but CO is not one of them.

>>>>> In JMeter, as with all other synchronous testers, Coordinated Omission is a per-thread issue. It's easy to demonstrate CO with JMeter with a single client thread testing an application that has only a single client connection in the real world, or with 15 client threads testing an application that has exactly 15 real-world clients communicating at high rates (common with muxed environments, messaging, ESBs, trading systems, etc.). No amount of threading or concurrency will help capture better test results for these very real systems. Any occurrence of CO will make the JMeter results seriously bogus.

>>>>> When any one thread misses a planned request sending time, CO has already occurred, and there is no way to avoid it at that point. You can certainly detect that CO has happened. The question is what to do about it in JMeter once you detect it. The major options are:

>>>>> 1. Ignore it and keep working with the data as if it actually meant anything. This amounts to http://tinyurl.com/o46doqf .

>>>>> 2. You can try to change the tester behavior to avoid CO going forward. E.g. you can try to adjust the number of threads up AND, at the same time, the frequency at which each thread sends requests, which will amount to drastically changing the test plan in reaction to system behavior. In my opinion, changing behavior dynamically will have very limited effectiveness for two reasons. The first is that the problem has already occurred, so all the data up to and including the observed CO is already bogus and has to be thrown away unless it can be corrected somehow; only after you auto-adjust enough times to not see CO for a long time can your results during that time be valid. The second is that changing the test scenario is valid (and possible) for very few real world systems.

>>>>> 3. You can try to correct for CO when you observe it. There are various ways this can be done, and most of them will amount to re-creating missing test sample results by projecting from past results.
>>>>> This can help correct the results data set so that it better approximates what a tester that was not synchronous, and would have kept issuing requests per the actual test plan, would have experienced in the test.

>>>>> 4. Something else we hadn't yet thought about.

>>>>> Some correction and detection example work can be found at https://github.com/OutlierCorrector/jmeter/commit/34c34cae673fd0871a423035a9f262d049f3d9e9 , which uses code at https://github.com/OutlierCorrector/OutlierCorrector . Michael Chmiel worked at Azul Systems over the summer on this problem, and the OutlierCorrector package and the small patch to JMeter (under the docs-2.9 branch) are some of the results of that work. This fix approach appears to work well as long as no explicitly random behavior is stated in the test scenarios (the outlier detector detects a test pattern and repeats it in repairing the data; expressly random scenarios will not exhibit a detectable pattern).

>>>>> -- Gil.

>>>>> On Oct 17, 2013, at 11:47 PM, Kirk Pepperdine <kirk.pepperd...@gmail.com> wrote:

>>>>>> Hi Sebb,

>>>>>> In my testing, the option of creating threads on demand instead of all at once has made a huge difference in my being able to control the rate of arrivals on the server. It has convinced me that simply using the throughput controller isn't enough and that the threading model in JMeter *must* change. It is the threading model that is the biggest source of CO in JMeter. Unfortunately we weren't able to come to some way of making a non-disruptive change in JMeter to make this happen.

>>>>>> The model I was proposing would have JMeter generate an event heap sorted by the time when a sampler should be fired. A thread pool would be used to eat off of the heap and fire the events as scheduled. This would allow JMeter to break the inappropriate relationship of a thread being a user. The solution is not perfect in that you will still have to fight with thread schedulers and hypervisors to get things to happen on cue. However, I believe the end result would be a far more scalable product that requires far fewer threads to produce far higher loads on the server.

>>>>>> As for your idea on using the throughput controller: IMHO, triggering an assert only worsens the CO problem. In fact, if the response times from the timeouts are not added into the results (in other words, they are omitted from the data set), you've only made the problem worse, as you are filtering out bad data points from the result sets, making the results better than they should be. Peter Lawrey's (included here for the purpose of this discussion) technique for correcting CO is to simply recognize when the event should have been triggered and then start the timer for that event at that time. So the latency reported will include the time before event triggering.

>>>>>> Gil Tene's done some work with JMeter. I'll leave it up to him to post what he's done. The interesting bit that he's created is HdrHistogram (https://github.com/giltene/HdrHistogram). It is not only a better way to report results, it offers techniques to calculate and correct for CO. Also, Gil might be able to point you to a more recent version of his talk on CO. It might be nice to have a new sampler that incorporates this work.
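(For illustration, a rough sketch of how those two ideas can be combined when recording a sample: start the clock at the time the plan intended to send the request, and let HdrHistogram's expected-interval recording back-fill the sends a stalled thread never made. The Histogram calls are the published HdrHistogram API; the surrounding class and parameter names are made up, and this is a sketch rather than the patch discussed above.)

    import org.HdrHistogram.Histogram;

    class CorrectedRecorderSketch {
        // Track response times up to 1 hour, in microseconds, with 3 significant digits.
        private final Histogram histogram = new Histogram(3_600_000_000L, 3);

        /**
         * @param intendedStartUs    when the test plan said this request should have gone out
         *                           (the clock starts here, not when the thread finally sent it)
         * @param doneUs             when the response was received
         * @param expectedIntervalUs the planned interval between sends; HdrHistogram uses it to
         *                           synthesize samples for sends that a blocked thread skipped
         */
        void record(long intendedStartUs, long doneUs, long expectedIntervalUs) {
            long responseTimeUs = doneUs - intendedStartUs;
            histogram.recordValueWithExpectedInterval(responseTimeUs, expectedIntervalUs);
        }

        long valueAtPercentile(double percentile) {
            return histogram.getValueAtPercentile(percentile);
        }
    }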
>>>>>> On a side note, I've got a Servlet filter that is a JMX component that measures a bunch of stats from the server's POV. It's something that could be contributed, as it could be used to help understand the source of CO, if not just complement JMeter's view of latency.

>>>>>> Regards,
>>>>>> Kirk

>>>>>> On 2013-10-18, at 12:27 AM, sebb <seb...@gmail.com> wrote:

>>>>>>> It looks to be quite difficult to avoid the issue of Coordinated Omission without a major redesign of JMeter.

>>>>>>> However, it may be a lot easier to detect when the condition has occurred. This would potentially allow the test settings to be changed to reduce or eliminate the occurrences - e.g. by increasing the number of threads or spreading the load across more JMeter instances.

>>>>>>> The Constant Throughput Controller calculates the desired wait time, and if this is less than zero - i.e. a sample should already have been generated - it could trigger the creation of a failed Assertion showing the time difference.

>>>>>>> Would this be sufficient to detect all CO occurrences? If not, what other metric needs to be checked?

>>>>>>> Even if it is not the only possible cause, would it be useful as a starting point?

>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: user-unsubscr...@jmeter.apache.org
>>>>>>> For additional commands, e-mail: user-h...@jmeter.apache.org
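(For illustration, a minimal sketch of the detection check sebb describes: if the wait computed from the planned schedule is negative, the next sample is already overdue and coordinated omission has just occurred. The names here are illustrative, not actual JMeter APIs.)

    final class OverdueCheckSketch {
        /**
         * @param nextScheduledSendMs when the plan says the next request should go out
         * @param nowMs               the current time
         * @return how many milliseconds overdue the next send is, or 0 if it is still in the
         *         future; a positive value means CO has occurred and could be reported,
         *         e.g. as a failed assertion carrying the time difference
         */
        static long overdueMillis(long nextScheduledSendMs, long nowMs) {
            long wait = nextScheduledSendMs - nowMs;   // what a constant-throughput pacing calculation yields
            return wait < 0 ? -wait : 0;
        }
    }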