To focus on the "how to deal with Coordinated Omission" part: There are two main ways to deal with CO in your actual executed behavior:
1. Change the behavior to avoid CO to begin with.
2. Detect it and correct it.

There is a "detect it and report it" option too, but I don't think it is of any real use: detection without correction will just tell you that your data can't be believed at all, but won't tell you anything about what can be. Since CO can move percentile magnitudes and positions by literally multiple orders of magnitude (I have multiple measured real-world production behaviors that show this), "hoping it is not too bad" when you know it is there amounts to burying your head in the sand.

Avoiding CO [option 1] is obviously preferable where possible. E.g., in load generators this can be achieved if everything the load generator does is made asynchronous, or by making sure that any synchronous part will never attempt to send messages closer together in time than the largest possible stall the system under test may ever experience (with some extra padding, this means "no closer than 10 minutes apart"). But avoiding CO in your actual measured results is unfortunately impractical for many systems. E.g., in systems where actual individual clients interact with the system using in-order transports (like TCP), with actual inter-request time gaps that are shorter than the stalls that occur in the system, CO will absolutely occur, both in the real world and in any tester that emulates it.

Correcting CO [option 2] is what you have to do if CO exists in the data measured by actual-executed-stuff. Correction inevitably amounts to "filling in the gaps" by projecting (without certainty or actual knowledge) a modeled behavior onto those gaps and adding data points to the data set that did not actually get measured, but "would have" had CO not stopped the measurements from being taken at the right points. There are various ways to correct CO in such data sets, and how well they do depends on how much we know about the behavior of the system around the gaps and how much we know about the gaps themselves (e.g.
knowing an actual complete stall occurred is very useful).

I think JMeter falls squarely into the synchronous tester camp, and that's not going to change. Given that many (most?) systems it measures use TCP as a transport and naturally exhibit system stalls that are longer than inter-request times in actual use behaviors, I see eliminating CO from JMeter's actual measured results as hopeless. Coordinated Omission in JMeter is just part of life, and we have to deal with it. I therefore focus on the "how to correct" part of the equation.

Having played with correction techniques, I can say that random operation sequences (not random timing) are the hardest thing to deal with. Not necessarily impossible, but really hard. Random timing, on the other hand, is easily dealt with for correction purposes, as projecting known, non-random sequences of operations into the CO gaps can be done just as well based on averaged timing data.

So Kirk, is the random behavior you need one of random timing, or random operation sequencing (or both)?

On Oct 18, 2013, at 10:48 PM, "Kirk Pepperdine" wrote:

On 2013-10-19, at 1:33 AM, Gil Tene wrote:

I guess we look at human response back pressure in different ways. It's a question of whether or not you consider the humans to be part of the system you are testing, and what you think your stats are supposed to represent.

You've seen my presentations and so you know that I do believe that human and non-human actors are definitively part of the system. They provide the dynamics for the system being tested. A change in how that layer in my model works can and does make a huge difference in how the other layers work to support the overall system.

Some people will take the "forgiving" approach, which considers the client behavior as part of the overall system behavior.
In such an approach, if a human responded to slow behavior by not asking any more questions for a while, that's simply what the overall system did, and the stats reported should reflect only the actual attempts that actual humans would have made, including their slowing down their requests in response to slow reaction times.

Sort of. I want to know that a user was inhibited from making forward progress because the previous step in their workflow blew stated tolerances. In some cases I'd like to have that user abandon. I'm not sure I'd call this forgiving, though I am looking to see what the overall system can do to answer the question: is it good enough and, if not, why not?

I'm not going to suggest your view is incorrect. I think it's quite valid. I don't believe the two views are orthogonal; there are elements of both in each. The question here, in more practical terms, is: what needs to be done to reduce the level of CO that currently occurs in JMeter, and how should we react to it? Throwing out entire datasets from runs seems like an academic answer to a more practical question: will our application stand up under load? From my point of view, for JMeter to better answer that question.

A web site being completely down for 5 minutes an hour would generate a lot of human back pressure response. It may even slow down request rates so much during the outage that 99%+ of the overall actual requests by end users during an hour that included such a 5 minute outage would still be very good. Reporting on those (actual requests by humans) would be very different from reporting on what would have happened without human back pressure. But it's easy to examine which of the two reporting methods would be accepted by a reader of such reports.

But then that 5 minute outage is going to show up somewhere, and if you bury it in how you report.... that would seem to be a problem. This whole argument suggests that what you want is a better regime for the treatment of the data.
If that is what you're saying, we're in complete agreement. The 5 minute pause should not be filtered out of the data!

IMHO, the first thing to do is eliminate or reduce the known sources of CO from JMeter. I'm not sure that tackling the CTT is the best way to go. In fact I'd prefer a combination of approaches that includes things like how jHiccup works with a GC STW detector. As you've mentioned before, even with a fix to the threading model in JMeter, CO will still occur.

Regards,
Kirk
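
For readers following the thread, here is a minimal sketch of the "filling in the gaps" style of correction discussed at the top of this message. It is only one simple model, not necessarily the technique either author has in mind: it assumes the tester intended to issue requests at a fixed, known interval, and that every request it failed to send during a stall would have waited for the remainder of that stall. The names correct_for_co, expected_interval_ms and percentile are hypothetical and chosen for illustration.

```python
import math


def correct_for_co(latencies_ms, expected_interval_ms):
    """Return the measured latencies plus back-filled samples.

    Assumption (not from the thread): requests were meant to be sent every
    expected_interval_ms. When one measured latency exceeds that interval,
    a synchronous tester skipped the requests that were due in the meantime,
    so we project linearly decreasing latencies for each skipped request.
    """
    if expected_interval_ms <= 0:
        raise ValueError("expected_interval_ms must be positive")
    corrected = []
    for latency in latencies_ms:
        corrected.append(latency)  # keep the sample that was really measured
        missing = latency - expected_interval_ms
        while missing >= expected_interval_ms:
            corrected.append(missing)  # projected sample, never measured
            missing -= expected_interval_ms
    return corrected


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Made-up data: 999 fast responses and one 10-second stall, with the tester
# intending to send one request every 100 ms.
measured = [1] * 999 + [10000]
corrected = correct_for_co(measured, 100)

print("raw p99:", percentile(measured, 99))
print("corrected p99:", percentile(corrected, 99))
```

With this made-up data set the raw 99th percentile is 1 ms, while the corrected one is 9000 ms, which is the kind of multiple-orders-of-magnitude shift described above. The projected samples are modeled, not measured, so the result is only as good as the assumption about what the skipped requests would have experienced.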
