Re: [DISCUSS] How to do Sliding Windows in Profiler

2017-01-27 Thread Casey Stella
So, I did some digging into the implementation of Median Absolute Deviation
that we ship.  I think we're doing the right thing here for sliding
windows.  The key assumption that we were making as part of this discussion
is that were storing 2 distributions in the state:

   - The distribution of the values
   - The distribution of the deviation from the median

In fact, we are actually storing 4 distributions:

   - The distribution of the values for this tick
  - This is used ONLY when merging states
   - The distribution of the values for the whole sliding window
  - This is used ONLY for scoring
   - The distribution of the deviation from the median of the sliding
   window for this tick
  - This is used ONLY when merging states
  - The median used here is the median of the merged tick-granular
  distribution of values across the whole window
   - The distribution of the deviation from the median of the sliding
   window for the whole sliding window
  - This is used ONLY for scoring
  - The median used here is the median of the merged tick-granular
  distribution of values across the whole window


The tick-granular distributions are used for merging, so we do not get
overlapping distributions of values or deviations from medians.  The
window-granular distributions are used for the actual scoring, so you get
the full context of the window.

I would urge you guys to check my work and let me know if I interpreted it
wrong by looking at the State class here

.

All that being said, I still think that providing the ability to structure
multiple profiles at once as Matt suggested would be extremely nice and I'm
in favor of it, on the whole.

Casey

On Wed, Jan 25, 2017 at 2:06 PM, Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> Hey guys,
>
> 1. I'm going to Google that brandy analogy later. It sounds tasty :)
> 2. The option for reading/writing multiple profiles makes sense to me and
> provides much greater flexibility and efficiency.
> 3. My gut reaction to reading through this is "wow." This seems really
> complicated. I'm all for the flexibility provided by these composable
> functions and think this would be much better for users with either some
> templates or other form of abstraction to simplify common patterns.
>
> Mike
>
>
> On Wed, Jan 25, 2017 at 8:02 AM, Casey Stella  wrote:
>
> > So, I think the key takeaway, at least for me, and my key feedback is
> that:
> >
> >- We should separate the wrong behavior of MAD with the ability to
> >improve windowing.  We can fix one without the other and, in my
> opinion,
> >the key fix for MAD is keeping the value distribution independent of
> the
> >MAD state, as Matt noted at the end of this email.
> >- Because the above necessitates tracking two things, it'd be nice to
> >adopt Matt's suggestion to be able to sort of merge profiles track two
> >results.
> >
> > One more thing, in general, if you ever want to merge outputs on read,
> then
> > you should never write out the merged data (unless the data you're
> writing
> > has a way to untangle the overlap).  They are two separate semantics.
> >
> > On merge on read, you choose the window at read time.  An example of this
> > for a hypothetically adjusted MAD to have a OUTLIER_MAD_INIT function
> which
> > takes the value distribution would be:
> > {
> >   "profiles": [
> > {
> >   "profile": "[ 'sketchy_values', 'sketchy_mad' ]",
> >   "foreach": "'global'",
> >   "init" : {
> > "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
> > 'global', 5, 'MINUTES')))",
> > "val_dist": "STATS_INIT()"
> > },
> >   "update": {
> > "val_dist": "STATS_ADD(val_dist, value)",
> > "s": "OUTLIER_MAD_ADD(s, value)"
> > },
> >   "result": {
> > "sketchy_values": "val_dist",
> > "sketchy_mad": "s",
> > }
> > }
> >   ]
> > }
> >
> > You would then read this in the enrichment topology or wherever
> > via OUTLIER_MAD_SCORE(OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
> > 'global', 5, 'MINUTES')), value)
> >
> >
> > On merge on write, you choose the window size on write time and should
> read
> > only the latest profile entry on read, so that would look like:
> > {
> >   "profiles": [
> > {
> >   "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
> >   "foreach": "'global'",
> >   "init" : {
> > "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
> > 'global', 5, 'MINUTES')))",
> > "val_dist": "STATS_INIT()"
> > },
> >   "update": {
> > "val_dist": "STATS_ADD(val_dist, value)",
> > "s": "OUTLIER_MAD_ADD(s, value)"
> > },
> >   "result": {
> >  

Re: [DISCUSS] How to do Sliding Windows in Profiler

2017-01-25 Thread Michael Miklavcic
Hey guys,

1. I'm going to Google that brandy analogy later. It sounds tasty :)
2. The option for reading/writing multiple profiles makes sense to me and
provides much greater flexibility and efficiency.
3. My gut reaction to reading through this is "wow." This seems really
complicated. I'm all for the flexibility provided by these composable
functions and think this would be much better for users with either some
templates or other form of abstraction to simplify common patterns.

Mike


On Wed, Jan 25, 2017 at 8:02 AM, Casey Stella  wrote:

> So, I think the key takeaway, at least for me, and my key feedback is that:
>
>- We should separate the wrong behavior of MAD with the ability to
>improve windowing.  We can fix one without the other and, in my opinion,
>the key fix for MAD is keeping the value distribution independent of the
>MAD state, as Matt noted at the end of this email.
>- Because the above necessitates tracking two things, it'd be nice to
>adopt Matt's suggestion to be able to sort of merge profiles track two
>results.
>
> One more thing, in general, if you ever want to merge outputs on read, then
> you should never write out the merged data (unless the data you're writing
> has a way to untangle the overlap).  They are two separate semantics.
>
> On merge on read, you choose the window at read time.  An example of this
> for a hypothetically adjusted MAD to have a OUTLIER_MAD_INIT function which
> takes the value distribution would be:
> {
>   "profiles": [
> {
>   "profile": "[ 'sketchy_values', 'sketchy_mad' ]",
>   "foreach": "'global'",
>   "init" : {
> "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
> 'global', 5, 'MINUTES')))",
> "val_dist": "STATS_INIT()"
> },
>   "update": {
> "val_dist": "STATS_ADD(val_dist, value)",
> "s": "OUTLIER_MAD_ADD(s, value)"
> },
>   "result": {
> "sketchy_values": "val_dist",
> "sketchy_mad": "s",
> }
> }
>   ]
> }
>
> You would then read this in the enrichment topology or wherever
> via OUTLIER_MAD_SCORE(OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
> 'global', 5, 'MINUTES')), value)
>
>
> On merge on write, you choose the window size on write time and should read
> only the latest profile entry on read, so that would look like:
> {
>   "profiles": [
> {
>   "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
>   "foreach": "'global'",
>   "init" : {
> "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
> 'global', 5, 'MINUTES')))",
> "val_dist": "STATS_INIT()"
> },
>   "update": {
> "val_dist": "STATS_ADD(val_dist, value)",
> "s": "OUTLIER_MAD_ADD(s, value)"
> },
>   "result": {
> "sketchy_values": "val_dist",
> "sketchy_mad": "s",
> "windowed_mad" : "OUTLIER_MAD_STATE_MERGE(s,
> PROFILE_GET('sketchy_mad', 'global', 4, 'MINUTES'))"
> }
> }
>   ]
> }
>
>
> You would then score pulling only the most recent MAD state.  Merging
> across them would double count entries, as Matt mentioned above
> OUTLIER_MAD_SCORE(GET_LAST(PROFILE_GET('sketchy_mad', 'global', 1,
> 'MINUTES')), value)
>
> So, tl;dr, I'm in favor of Matt's syntax to merge profiles and give
> multiple outputs.  Also, different but coupled problem, MAD needs fixing to
> decouple tracking value distributions from the distribution of the
> deviation from the median.
>
> Casey
>
> On Wed, Jan 25, 2017 at 9:28 AM, Nick Allen  wrote:
>
> > That was a lot to digest so I apologize if I have missed some of your
> > thought process.  I probably need to read through this a couple more
> times.
> >
> >
> > Can anyone specify a better way to do Sliding Window profiles correctly
> > > with current functionality?
> >
> >
> > Could we not change the underlying implementation of the profiler to
> > support sliding windows?  Ideally the same profile definition could be
> > applied to either a tumbling or sliding window, rather than requiring a
> > change in the profile definition itself.
> >
> > The work that I had done for METRON-590, made it possible to configure
> > either a tumbling or sliding window.  The downside of that work is that
> it
> > was an all-or-nothing change, all profiles (in the same topology) were
> > either tumbling or sliding.  But perhaps we can enhance it from there.
> >
> >
> > A much more elegant solution would be to allow “result” to write two
> > > objects.
> >
> >
> > Yes, I like this! I have had the same thought myself.  This would be a
> > simple change and would provide the user with some flexibility.  Does
> this
> > change solve the entire problem though?
> >
> >
> > Obviously, I do need to sit down and think on this a bit more.  I really
> > wish we could handle these use cases with a profile definition that
> remains
> > simple and easy to grok.  Thanks 

[DISCUSS] How to do Sliding Windows in Profiler

2017-01-24 Thread Matt Foley
Hi all,

Casey and I had an interesting chat yesterday, during which we agreed that the 
example code for Outlier Analysis in 
https://github.com/apache/incubator-metron/blob/master/metron-analytics/metron-statistics/README.md
 and the revised example code in 
https://issues.apache.org/jira/browse/METRON-668 (as of 23 January) both do not 
correctly implement the desired Sliding Window model.  This email gives the 
argument for why, and proposes a couple ways to do it right.  Your input and 
preferences are requested.

 

First a couple statements about the STATS object that underlies most 
interesting Profile work:

· The STATS object is a t-digest.  It summarizes a set of data points, 
such as those received during a sampling period, in a way that is nominally 
O(1) regardless of the input number of data points, and preserves the info 
about the “shape” of the statistical distribution of those points.  Not only 
info about averages and standard deviations, but also about medians and 
percentiles (which, btw, is a very different kind of information), is preserved 
and can be calculated correctly to within a given error epsilon.  Since it is a 
summary, however, time information is lost.

· STATS objects, these digests of sampling periods, are MERGEABLE, 
meaning if you have a digest from time(1) to time(2), and another digest from 
time(2) to time (3), you can merge them and get a digest that is statistically 
equivalent to a digest from time(1) to time(3) continuously.

· They are sort of idempotent, in that if you take two identical 
digests and merge them, you get almost the same object.  However, the result 
object will be scaled as summarizing twice the number of input data points.

· Which is why it DOESN’T work to try to merge overlapping sampling 
periods.  To give a crude example, if you have a digest from time(1) to time(3) 
and another digest from time(2) to time(4), and merge them, the samples from 
time(2) to time(3) will be over-represented by a factor of 2x, which should be 
expected to skew the distribution (unless the distribution really is constant 
for all sub-windows – which would mean we don’t need Sliding Windows because 
nothing changes).

 

The Outlier Analysis profiles linked above try to implement a sliding window, 
in which each profile period summarizes the Median Absolute Deviation 
distribution of the last five profile periods only.  An “Outlier MAD Score” can 
then be determined by comparing the deviation of a new data point to the last 
MAD distribution recorded in the Profile.  This allows for changes over time in 
the statistical distribution of inputs, but does not make the measure unduly 
sensitive to just the last minute or two.  This is a typical use case for 
Sliding Windows.

 

Both example codes trip on how to do the sliding window in the context of a 
Profile.  At sampling period boundaries, both either compose the “result” 
digest or initialize the “next” digest by reading the previous 5 result 
digests.  That is wrong, because it ignores the fact that those digests aren’t 
just for their time periods.  They too were composed with THEIR preceding 5 
digests, each of which were composed with their preceding 5 digests, which in 
turn… etc.  The end result is sort of like the way Madieras or some brandies 
are aged via the Solera process with fractional blending.  You don’t get a true 
sliding window, which sharply drops the past, you get a continuous dilution of 
the past. In fact, it’s wrong to assume that the “tail” of the far past is more 
diluted than the near past! – It would be with some algorithms, but the alg 
used in these two examples causes the far past to become an exponentially MORE 
important fraction of the overall data than the near past – much worse than 
simply turning on digesting at time(0) and leaving it on, with no attempt at 
windowing.  (Simulate it in a spreadsheet and you’ll see.)

 

We need a Profiler structure that assists in creating Sliding Window profiles.  
The problem is that Profiles let you save only one thing (quantity or object) 
per sampling period, and that’s typically a different “thing” (object type or 
scale) than you want to use to compose the result for each windowed span.  One 
way to do it correctly would be with two Profiles, like this:

 

(SOLUTION A)

{

  "profiles": [

{

  "profile": "sketchy_mad",

  "foreach": "'global'",

  "init" : {

"s": "OUTLIER_MAD_INIT()"

},

  "update": {

"s": "OUTLIER_MAD_ADD(s, value)"

},

  "result": "s"

},

{

  "profile": "windowed_mad",

  "foreach": "'global'",

  "init" : { },

  "update": { },

  "result": "OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',

'global', 5, 'MINUTES'))"

}

  ]

}

 

This is typical.  You have a fine-grain sampling period that you want to 
“tumble”, and a broader window that you want to “slide” or “roll” along the