On 2013-10-18, at 11:12 PM, Gil Tene <g...@azulsystems.com> wrote:

> This is not a problem that matters only in HFT or financial trading. Coordinated Omission is just as prevalent in making all your stats completely wrong in HTTP apps as it is in high frequency apps. It is unfortunately universal.
I'm not disagreeing with you. I am trying to sort out if it matters. One of the questions I have is: what is the human response to back pressure in an HTTP based system? Next question: does my load injector (in this case JMeter) behave in the same way?

-- Kirk

> On Oct 18, 2013, at 11:55 AM, Kirk Pepperdine <kirk.pepperd...@gmail.com> wrote:

>> On 2013-10-18, at 7:43 PM, Gil Tene <g...@azulsystems.com> wrote:

>>> I'm not saying the threading model doesn't have its own issues, or that those issues could not in themselves cause coordinated omission. I'm saying there is already a dominant, demonstrable, and classic case of CO in JMeter that doesn't have anything to do with the threading model, and will not go away no matter what is done to the threading model. As long as JMeter test plans are expressed as instructions describing serial, synchronous, do-these-one-after-the-other scripts for what the tester should do for a given client, coordinated omission will easily occur in executing those instructions. I believe that this will not go away without changing how all JMeter test plans are expressed, and that is probably a non-starter. As a result, I think that building in logic that will correct for coordinated omission when it inevitably occurs, as opposed to trying to avoid its occurrence, is the only way to go for JMeter.

>> I can't disagree with you in that CO is present in a single threaded test. However, the nature of this type of load testing is that you play out a scenario because the results of the previous request are needed for the current request. Under those conditions you can't do much but wait until the back pressure clears or your initial request is retired. I think the best you can do under these circumstances, just as Sebb has suggested, is to flag the problem and move on. I wouldn't fail nor omit the result, but I'm not sure how you can correct, because the back pressure in this case will result in lower loads, which will allow requests to retire at a rate higher than one should normally expect.

> The only "correct" way to deal with detected coordinated omission is to either correct for it, or to throw away all latency or response time data acquired with it. "Flagging" it or "moving on" and keeping the other data for analysis is the same as saying "This data is meaningless, selective, and selectively represents only the best results the system demonstrated, while specifically dropping the vast majority of actual indications of bad behavior encountered during the test".

> To be clear, I'm saying that all response time data, including the average, the 90%'ile, or any other statistic, that JMeter collects in the presence of coordinated omission is completely wrong. They are wrong because the data they are all based on is wrong. It's easily demonstrable to be off by several orders of magnitude in real world situations, and in real world web applications.

>> That said, when users meet this type of system they will most likely abandon... which is in itself a failure. JMeter doesn't have direct facilities to support this type of behaviour.

> Failure, abandon, and backoff conditions are interesting commentary that does not replace the need to include them in percentiles and averages.
> When someone says "my web application exhibits a 99%'ile response time of 700 msec", the recipient of this information doesn't hear "99% of the good results will be 700 msec or less". They hear "99% of ALL requests attempted with this system will respond within 700 msec or less". That includes all requests that may have resulted in users walking away in anger, or seeing long response times while the system was stalled for some reason.

>>> Coordinated Omission is a basic problem that can happen due to many, many different reasons and causes. It is made up of two simple things: one is the Omission of some results or samples from the final data set. The second is the Coordination of such omissions with other behavior, such that it is not random. Random omission is usually not a problem. That's just sampling, and random sampling works. Coordinated Omission is a problem because it is effectively [highly] biased sampling. When Coordinated Omission occurs, the resulting data set is biased towards certain behaviors (like good response times), leading ALL statistics on the resulting data set to be highly suspect (read: "usually completely wrong and off by orders of magnitude") in describing the response time or latency behavior of the observed system.

>>> In JMeter, Coordinated Omission occurs whenever a thread doesn't execute its test plan as planned, and does so in reaction to behavior it encounters. This is most often caused by the simple and inherent synchronous nature of test plans as they are stated in JMeter: when a specific request takes longer to respond than it would have taken the thread to send the next request in the plan, the very fact that the thread did not send the next request out on time as planned is a coordinated omission. It is the effective removal of a response time result that would have been in the data set had the coordination not happened. It is "omission" since a measurement that should have occurred didn't happen and was not recorded. It is "coordinated" because the omission is not random, and is correlated with / influenced by the occurrence of another longer-than-normal response time.

>>> The work done with the OutlierCorrector in JMeter focused on detecting CO in streams of measured results reported to listeners, and inserting "fake" results into the stream to represent the missing, omitted results that should have been there. OutlierCorrector also has a log file corrector that can fix JMeter logs offline and after the fact by applying the same logic.

>> Right, but this is for a fixed transactional rate, which is typically seen in machine-to-machine HFTS. In Web apps, perhaps the most common use case for JMeter, client back-off due to back pressure is a common behaviour, and it's one that doesn't harm the testing process in the sense that if the server can't retire transactions fast enough, JMeter will expose it. If you want to prove 5 9's, then I agree, you've got a problem.

> Actually the corrector adjusts to the current transactional rate with a configurable moving window average.
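(For illustration only, a rough sketch of the back-fill idea described above: when a sample comes back much later than the planned send interval, synthesize the results that a non-coordinating tester would have recorded in the meantime. This is not the actual OutlierCorrector code; the class and method names are made up, and the expected interval would in practice come from something like the moving window average mentioned above.)

    import java.util.ArrayList;
    import java.util.List;

    class CoBackfillSketch {
        /**
         * Given one measured latency that blocked the thread past its planned send time,
         * return synthetic latencies for the sends that were skipped while it was blocked:
         * latency - interval, latency - 2*interval, ... down to the expected interval.
         */
        static List<Long> syntheticSamples(long measuredLatencyMs, long expectedIntervalMs) {
            List<Long> filled = new ArrayList<>();
            for (long missing = measuredLatencyMs - expectedIntervalMs;
                 missing >= expectedIntervalMs;
                 missing -= expectedIntervalMs) {
                filled.add(missing);   // a request sent on schedule would have waited about this long
            }
            return filled;
        }
    }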
> Fixed transactional rates are no more common in HFTS than they are in web applications, and client backoff is just as common there. But this has nothing to do with HFTS. In all systems with synchronous clients, whether they take 20 usec or 2 seconds for a typical response, the characterization and description of response time behavior should have nothing to do with backing off. And in both types of systems, once backoff happens, coordinated omission has kicked in and your data is contaminated.

> As a trivial hypothetical, imagine a web system that regularly stalls for 3 consecutive seconds out of every 40 seconds of operation under a regular load of 200 requests per second, but responds promptly in 20 msec or less the rest of the time. This scenario can easily be found in the real world with GC pauses under high load in untuned systems. The 95%'ile response time of such a system clearly can't be described as lower than 2 seconds without outright lying. But here is what JMeter will report for such a system (see the back-of-the-envelope arithmetic sketched below):

> - For 2 threads running 100 requests per second each: the 95%'ile will show 20 msec.
> - For 20 threads running 10 requests per second each: the 95%'ile will show 20 msec.
> - For 200 threads running 1 request per second each: the 95%'ile will show 20 msec.
> - For 400 threads running 1 request every 2 seconds: the 95%'ile will show 20 msec.
> - For 2000 threads running 1 request every 10 seconds: the 95%'ile will show ~1 second.

> Clearly there is a configuration for JMeter that would expose this behavior; it's the one where the gap between requests any one client would send is higher than the largest pause the system ever exhibits. It is also the only one in the list that does not exhibit coordinated omission. But what if the real world for this system simply involved 200 interactive clients that actually hit it with a request once every 1 second or so (think simple web based games)? JMeter's test plan for that actual real world scenario would show a 95%'ile response time result that is 100x off from reality.

> And yes, in this world the real users playing this game would probably "back off". Or they may do the reverse (click repeatedly in frustration, getting a response after 3 seconds). Neither behavior would improve the 95%'ile behavior or excuse a 100x-off report.

>> It's not that I disagree with you or that I don't understand what you're saying; it's just that I'm having difficulty mapping it back to the world that people on this list have to deal with. W.r.t. that, I've a feeling that our views are somewhat tainted by the worlds we live in. In the HTTP world, CO exists and I accept it as natural behaviour. In your world CO exists but it cannot be accepted.

> Just because I also dabble in high frequency trading systems and 10 usec responses doesn't mean I forgot about large and small scale web applications with real people at the end, and with human response times.

> My world covers many more HTTP people than it does HFT people. Many people's first reaction to realizing how badly Coordinated Omission affects the accuracy of reported response times is "this applies to someone else's domain, but in mine things are still ok because of X". Unfortunately, the problem is almost universal in synchronous testers and in synchronous internal monitoring systems, and 95%+ ( ;-) ) of the Web testing environments I have encountered have dramatically under-reported their 95%, 99%, and all other %'iles. Unless those systems actually don't care about 5% failure rates, their business decisions are currently being based on bad data.
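(Back-of-the-envelope arithmetic for the hypothetical stalling system above; the figures are illustrative, not a simulation of JMeter. It shows why a synchronous, per-thread tester records almost none of the stall-affected requests, while the plan actually calls for roughly 7.5% of requests to land in the stall, more than enough to push the 95%'ile away from 20 msec.)

    public class StallArithmeticSketch {
        public static void main(String[] args) {
            double windowSec = 40.0;      // one stall cycle
            double stallSec = 3.0;        // stall length within each cycle
            double offeredRps = 200.0;    // planned aggregate request rate

            double plannedPerWindow = offeredRps * windowSec;   // 8000 planned requests per cycle
            double plannedInStall = offeredRps * stallSec;      // ~600 of them fall inside the stall

            // Synchronous tester, e.g. 20 threads at 10 req/s each: during the stall each thread
            // is stuck on a single request, so it records one long sample and silently skips
            // the ~29 other sends its plan called for in those 3 seconds.
            double threads = 20;
            double recordedBad = threads;  // roughly one long sample per thread per stall

            System.out.printf("bad samples recorded: %.0f of %.0f (%.2f%%)%n",
                    recordedBad, plannedPerWindow, 100 * recordedBad / plannedPerWindow);
            System.out.printf("bad samples per plan: %.0f of %.0f (%.2f%%)%n",
                    plannedInStall, plannedPerWindow, 100 * plannedInStall / plannedPerWindow);

            // ~0.25% recorded vs ~7.5% planned: with CO the reported 95%'ile sits at ~20 msec,
            // without it the 95%'ile is dominated by the stall.
        }
    }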
>> The problem is that mechanical sympathy is mostly about your world. I think there is a commonality between the two worlds, but I think to find it we need more discussion. I'm not sure that this list is good for this purpose, so I'm going to flip back to mechanical sympathy instead of hijacking this mailing list.

>> -- Kirk

>>> -- Gil.

>>> On Oct 18, 2013, at 9:54 AM, Kirk Pepperdine <kirk.pepperd...@gmail.com> wrote:

>>>> Hi Gil,

>>>> I would have to disagree, as in this case I believe there is CO due to the threading model, CO on a per-thread basis, as well as plain old omission. I believe these conditions are in addition to the conditions you're pointing to.

>>>> You may test at a fixed rate for HFT, but in most worlds, random is necessary. Unfortunately that makes the problem more difficult to deal with.

>>>> Regards,
>>>> Kirk

>>>> On 2013-10-18, at 5:32 PM, Gil Tene <g...@azulsystems.com> wrote:

>>>>> I don't think the thread model is the core of the Coordinated Omission problem. Unless we consider the only solution to be sending no more than one request per 20 minutes from any given thread a threading model fix. It's more of a configuration choice the way I see it, but a pretty impossible one. The thread model may need work for other reasons, but CO is not one of them.

>>>>> In JMeter, as with all other synchronous testers, Coordinated Omission is a per-thread issue. It's easy to demonstrate CO with JMeter with a single client thread testing an application that has only a single client connection in the real world, or with 15 client threads testing an application that has exactly 15 real-world clients communicating at high rates (common with muxed environments, messaging, ESBs, trading systems, etc.). No amount of threading or concurrency will help capture better test results for these very real systems. Any occurrence of CO will make the JMeter results seriously bogus.

>>>>> When any one thread misses a planned request sending time, CO has already occurred, and there is no way to avoid it at that point. You can certainly detect that CO has happened. The question is what to do about it in JMeter once you detect it. The major options are:

>>>>> 1. Ignore it and keep working with the data as if it actually meant anything. This amounts to http://tinyurl.com/o46doqf .

>>>>> 2. You can try to change the tester behavior to avoid CO going forward. E.g. you can try to adjust the number of threads up AND, at the same time, the frequency at which each thread sends requests, which will amount to drastically changing the test plan in reaction to system behavior. In my opinion, changing behavior dynamically will have very limited effectiveness for two reasons. The first is that the problem has already occurred, so all the data up to and including the observed CO is already bogus and has to be thrown away unless it can be corrected somehow; only after you auto-adjust enough times to not see CO for a long time can your results during that time be valid. The second is that changing the test scenario is valid (and possible) for very few real world systems.

>>>>> 3. You can try to correct for CO when you observe it. There are various ways this can be done, and most of them will amount to re-creating missing test sample results by projecting from past results.
>>>>> This can help correct the results data set so that it better approximates what a tester that was not synchronous, and would have kept issuing requests per the actual test plan, would have experienced in the test.

>>>>> 4. Something else we hadn't yet thought about.

>>>>> Some correction and detection example work can be found at https://github.com/OutlierCorrector/jmeter/commit/34c34cae673fd0871a423035a9f262d049f3d9e9 , which uses code at https://github.com/OutlierCorrector/OutlierCorrector . Michael Chmiel worked at Azul Systems over the summer on this problem, and the OutlierCorrector package and the small patch to JMeter (under the docs-2.9 branch) are some of the results of that work. This fix approach appears to work well as long as no explicitly random behavior is stated in the test scenarios (the outlier detector detects a test pattern and repeats it in repairing the data; expressly random scenarios will not exhibit a detectable pattern).

>>>>> -- Gil.

>>>>> On Oct 17, 2013, at 11:47 PM, Kirk Pepperdine <kirk.pepperd...@gmail.com> wrote:

>>>>>> Hi Sebb,

>>>>>> In my testing, the option of creating threads on demand instead of all at once has made a huge difference in my being able to control the rate of arrivals on the server. It has convinced me that simply using the throughput controller isn't enough and that the threading model in JMeter *must* change. It is the threading model that is the biggest source of CO in JMeter. Unfortunately we weren't able to come to some way of making a non-disruptive change in JMeter to make this happen.

>>>>>> The model I was proposing would have JMeter generate an event heap sorted by the time when a sampler should be fired. A thread pool would be used to eat off of the heap and fire the events as scheduled. This would allow JMeter to break the inappropriate relationship of a thread being a user. The solution is not perfect in that you will still have to fight with thread schedulers and hypervisors to get things to happen on cue. However, I believe the end result would be a far more scalable product that requires far fewer threads to produce far higher loads on the server.

>>>>>> As for your idea on using the throughput controller: IMHO, triggering an assert only worsens the CO problem. In fact, if the response times from the timeouts are not added into the results (in other words, they are omitted from the data set), you've only made the problem worse, as you are filtering out bad data points from the result sets, making the results better than they should be. Peter Lawrey's (included here for the purpose of this discussion) technique for correcting CO is to simply recognize when the event should have been triggered and then start the timer for that event at that time. So the latency reported will include the time before event triggering.

>>>>>> Gil Tene's done some work with JMeter. I'll leave it up to him to post what he's done. The interesting bit that he's created is HdrHistogram (https://github.com/giltene/HdrHistogram). It is not only a better way to report results, it offers techniques to calculate and correct for CO. Also, Gil might be able to point you to a more recent version of his talk on CO. It might be nice to have a new sampler that incorporates this work.
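(For illustration, a rough sketch of how those two ideas can be combined when recording a sample: start the clock at the time the plan intended to send the request, and let HdrHistogram's expected-interval recording back-fill the sends a stalled thread never made. The Histogram calls are the published HdrHistogram API; the surrounding class and parameter names are made up, and this is a sketch rather than the patch discussed above.)

    import org.HdrHistogram.Histogram;

    class CorrectedRecorderSketch {
        // Track response times up to 1 hour, in microseconds, with 3 significant digits.
        private final Histogram histogram = new Histogram(3_600_000_000L, 3);

        /**
         * @param intendedStartUs    when the test plan said this request should have gone out
         *                           (the clock starts here, not when the thread finally sent it)
         * @param doneUs             when the response was received
         * @param expectedIntervalUs the planned interval between sends; HdrHistogram uses it to
         *                           synthesize samples for sends that a blocked thread skipped
         */
        void record(long intendedStartUs, long doneUs, long expectedIntervalUs) {
            long responseTimeUs = doneUs - intendedStartUs;
            histogram.recordValueWithExpectedInterval(responseTimeUs, expectedIntervalUs);
        }

        long valueAtPercentile(double percentile) {
            return histogram.getValueAtPercentile(percentile);
        }
    }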
>>>>>> On a side note, I've got a Servlet filter that is a JMX component that measures a bunch of stats from the server's POV. It's something that could be contributed, as it could be used to help understand the source of CO, if not just complement JMeter's view of latency.

>>>>>> Regards,
>>>>>> Kirk

>>>>>> On 2013-10-18, at 12:27 AM, sebb <seb...@gmail.com> wrote:

>>>>>>> It looks to be quite difficult to avoid the issue of Coordinated Omission without a major redesign of JMeter.

>>>>>>> However, it may be a lot easier to detect when the condition has occurred. This would potentially allow the test settings to be changed to reduce or eliminate the occurrences - e.g. by increasing the number of threads or spreading the load across more JMeter instances.

>>>>>>> The Constant Throughput Controller calculates the desired wait time, and if this is less than zero - i.e. a sample should already have been generated - it could trigger the creation of a failed Assertion showing the time difference.

>>>>>>> Would this be sufficient to detect all CO occurrences? If not, what other metric needs to be checked?

>>>>>>> Even if it is not the only possible cause, would it be useful as a starting point?

>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: user-unsubscr...@jmeter.apache.org
>>>>>>> For additional commands, e-mail: user-h...@jmeter.apache.org
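(For illustration, a minimal sketch of the detection check sebb describes: if the wait computed from the planned schedule is negative, the next sample is already overdue and coordinated omission has just occurred. The names here are illustrative, not actual JMeter APIs.)

    final class OverdueCheckSketch {
        /**
         * @param nextScheduledSendMs when the plan says the next request should go out
         * @param nowMs               the current time
         * @return how many milliseconds overdue the next send is, or 0 if it is still in the
         *         future; a positive value means CO has occurred and could be reported,
         *         e.g. as a failed assertion carrying the time difference
         */
        static long overdueMillis(long nextScheduledSendMs, long nowMs) {
            long wait = nextScheduledSendMs - nowMs;   // what a constant-throughput pacing calculation yields
            return wait < 0 ? -wait : 0;
        }
    }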