Re: Coordinated Omission (CO) - possible strategies

Gil Tene Sat, 19 Oct 2013 00:57:38 -0700

To focus on the "how to deal with Coordinated Omission" part:

There are two main ways to deal with CO in your actual executed behavior:


1. Change the behavior to avoid CO to begin with.

2. Detect it and correct it.

There is a "detect it and report it" one too, but I don't think it is of any real 
use, as detection without correction will just tell you your data can't be 
believed at all, but won't tell you anything about what can be. Since CO can 
move percentile magnitudes and positions by literally multiple orders of 
magnitude (I have multiple measured real-world production behaviors that show 
this), "hoping it is not too bad" when you know it is there amounts to burying 
your head in the sand.

Avoiding CO [option 1] is obviously preferable where possible. E.g. in load 
generators this can be achieved if everything the load generator does is made 
asynchronous, or by making sure that any synchronous part will never attempt to 
send messages closer together in time than the largest possible stall the 
system under test may ever experience (with some extra padding, this means "no 
closer than 10 minutes apart").

But avoiding CO in your actual measured results is unfortunately impractical 
for many systems. E.g. in systems where actual individual clients interact with 
the system using in-order transports (like TCP) with actual inter-request time 
gaps that are shorter than stalls that occur in the system, CO will absolutely 
occur, both in the real world and in any tester that emulates it.

Correcting CO [option 2] is what you have to do if CO exists in the data 
measured by actual-executed-stuff. Correction inevitably amounts to "filling in 
the gaps" by projecting (without certainty or actual knowledge) a modeled 
behavior onto those gaps and adding data points to the data set that did not 
actually get measured, but "would have" had CO not stopped the measurements 
from being taken at the right points. There are various ways to correct CO in 
such data sets, and how well they do depends on how much we know about the 
behavior of the system around the gaps and how much we know about the gaps 
themselves (e.g. knowing an actual complete stall occurred is very useful).
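
A minimal sketch of one simple way to "fill in the gaps", assuming the tester 
intended to issue one request per fixed interval and that any latency longer 
than that interval held back the requests that should have followed it. The 
names correct_for_co, percentile and expected_interval_ms are hypothetical and 
chosen only for illustration; this is one possible model, not a description of 
what any particular tool does.

```python
def correct_for_co(latencies_ms, expected_interval_ms):
    """Return a new list with synthetic samples filling the CO gaps.

    Assumption: one request was intended every expected_interval_ms, and a
    latency longer than that means the following requests were never sent.
    """
    if expected_interval_ms <= 0:
        raise ValueError("expected_interval_ms must be positive")
    corrected = []
    for latency in latencies_ms:
        corrected.append(latency)
        # A request that should have gone out one interval later would have
        # seen the remainder of the stall, and so on until the stall ended.
        missing = latency - expected_interval_ms
        while missing >= expected_interval_ms:
            corrected.append(missing)
            missing -= expected_interval_ms
    return corrected


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# 99 fast responses and one 1000 ms stall, with a 10 ms intended interval.
raw = [1] * 99 + [1000]
corrected = correct_for_co(raw, expected_interval_ms=10)

print(percentile(raw, 99))        # 1
print(percentile(corrected, 99))  # 990
```

With these made-up numbers the single stall adds 99 projected samples (990 ms 
down to 10 ms), and the 99th percentile moves from 1 ms to 990 ms. The added 
points are modeled, not measured, which is exactly the caveat above.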

I think JMeter falls squarely into the synchronous tester camp, and that's not 
going to change. Given that many (most?) systems it measures use TCP as a 
transport and naturally exhibit system stalls that are longer than 
inter-request times in actual use behaviors, I see eliminating CO from JMeter's 
actual measured results as hopeless. Coordinated Omission in JMeter is just part 
of life, and we have to deal with it. I therefore focus on the "how to correct" 
part of the equation.

Having played with correction techniques, I can say that random operation 
sequences (not random timing) are the hardest thing to deal with. Not 
necessarily impossible, but really hard. Random timing, on the other hand, is 
easily dealt with for correction purposes, as projecting known, non-random 
sequences of operations into the CO gaps can be done just as well based on 
averaged timing data.

So Kirk, is the random behavior you need one of random timing, or random 
operation sequencing (or both)?


On Oct 18, 2013, at 10:48 PM, "Kirk Pepperdine" <[email protected]> wrote:


On 2013-10-19, at 1:33 AM, Gil Tene <[email protected]> wrote:

I guess we look at human response back pressure in different ways. It's a 
question of whether or not you consider the humans to be part of the system you 
are testing, and what you think your stats are supposed to represent.

You've seen my presentations and so you know that I do believe that human and 
non-human actors are definitively part of the system. They provide the dynamics 
for the system being tested. A change in how that layer in my model works can 
and does make a huge difference in how the other layers work to support the 
overall system.

Some people will take the "forgiving" approach, which considers the client 
behavior as part of the overall system behavior. In such an approach, if 
a human responded to slow behavior by not asking any more questions for a 
while, that's simply what the overall system did, and the stats reported should 
reflect only the actual attempts that actual humans would have made, including 
their slowing down their requests in response to slow reaction times.

Sort of. I want to know that a user was inhibited from making forward progress 
because the previous step in their workflow blew stated tolerances. In some 
cases I'd like to have that user abandon. I'm not sure I'd call this forgiving, 
though; I am looking to see what the overall system can do to answer the 
question: is it good enough and, if not, why not?

I'm not going to suggest your view is incorrect. I think it's quite valid. I 
don't believe the two views are orthogonal; there are elements of both in 
each. The question here, in more practical terms, is: what needs to be done to 
reduce the level of CO that currently occurs in JMeter, and how should we react 
to it? Throwing out entire datasets from runs seems like an academic answer to 
a more practical question: will our application stand up under load? From 
my point of view, for JMeter to better answer that question.


A web site being completely down for 5 minutes an hour would generate a lot of 
human back pressure response. It may even slow down request rates so much 
during the outage that 99%+ of the overall actual requests by end users during 
an hour that included such a 5 minute outage would still be very good. 
Reporting on those (actual requests by humans) would be very different from 
reporting on what would have happened without human back pressure. But it's 
easy to examine which of the two reporting methods would be accepted by a 
reader of such reports.
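
To put rough numbers on that outage scenario, here is a small worked sketch. 
The request rate (one attempt per second across all users) and the 10 ms 
healthy response time are assumptions made up for illustration, as are the 
function names; only the "5 minutes down per hour" figure comes from the text 
above.

```python
HOUR_S = 3600           # length of the reporting window, in seconds
OUTAGE_S = 300          # the 5 minute outage
GOOD_LATENCY_S = 0.01   # assumed response time while the site is healthy


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def with_back_pressure():
    """Only the attempts users actually made: one gets stuck, the rest wait."""
    attempts = HOUR_S - OUTAGE_S           # nobody asks again during the outage
    return [GOOD_LATENCY_S] * (attempts - 1) + [float(OUTAGE_S)]


def without_back_pressure():
    """One attempt every second regardless of how the site is behaving."""
    healthy = [GOOD_LATENCY_S] * (HOUR_S - OUTAGE_S)
    # An attempt made k seconds into the outage waits for the rest of it.
    stalled = [float(OUTAGE_S - k) for k in range(OUTAGE_S)]
    return healthy + stalled


actual = with_back_pressure()
intended = without_back_pressure()

print(percentile(actual, 99))     # 0.01
print(percentile(intended, 99))   # 264.0
```

Under these assumptions the report built from actual attempts shows a 99th 
percentile of 10 ms, while the report built from intended attempts shows 264 
seconds, for the same hour and the same outage.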

But then that 5 minute outage is going to show up somewhere, and if you bury it 
in how you report... that would seem to be a problem. This whole argument 
suggests that what you want is a better regime for the treatment of the data. 
If that is what you're saying, we're in complete agreement. The 5 minute pause 
should not be filtered out of the data!

IMHO, the first thing to do is eliminate or reduce the known sources of CO from 
JMeter. I'm not sure that tackling the CTT is the best way to go. In fact I'd 
prefer a combination of approaches that includes things like how jHiccup works 
with a GC STW detector. As you've mentioned before, even with a fix to the 
threading model in JMeter, CO will still occur.

Regards,
Kirk
