Re: [openstack-dev] [elastic-recheck] Thoughts on next steps

2014-01-07 Thread Joe Gordon
Everything sounds good!


On Mon, Jan 6, 2014 at 6:52 PM, Sean Dague s...@dague.net wrote:

 On 01/06/2014 07:04 PM, Joe Gordon wrote:

 Overall this looks really good, and very spot on.


 On Thu, Jan 2, 2014 at 6:29 PM, Sean Dague s...@dague.net wrote:

 A lot of elastic recheck this fall has been based on the ad hoc
 needs of the moment, in between diving down into the race bugs that
 were uncovered by it. This week away from it all helped provide a
 little perspective on what I think we need to do to call it *done*
 (i.e. something akin to a 1.0 even though we are CDing it).

 Here is my current thinking on the next major things that should
 happen. Opinions welcomed.

 (These are roughly in implementation order based on urgency)

 = Split of web UI =

 The elastic recheck page is becoming a mishmash of what was needed at
 the time. I think what we really have emerging is:
   * Overall Gate Health
   * Known (to ER) Bugs
   * Unknown (to ER) Bugs - more below

 I think the landing page should be Known Bugs, as that's where we
 want both bug hunters to go to prioritize things, as well as where
 people looking for known bugs should start.

 I think the overall Gate Health graphs should move to the zuul
 status page. Possibly as part of the collection of graphs at the
 bottom.

 We should have a secondary page (maybe column?) of the
 un-fingerprinted recheck bugs, largely to use as candidates for
 fingerprinting. This will let us eventually take over /recheck.


 I think it would be cool to collect the list of unclassified failures
 (not by recheck bug), so we can see how many (and what percentage) need
 to be classified. This isn't gate health but more of e-r health or
 something like that.


 Agreed. I've got the percentage in check_success today, but I agree that
 every gate job that fails without a matching fingerprint should be listed
 somewhere so we can work through them.


 = Data Analysis / Graphs =

 I spent a bunch of time playing with pandas over break
 (http://dague.net/2013/12/30/ipython-notebook-experiments/), it's

 kind of awesome. It also made me rethink our approach to handling
 the data.

 I think the rolling average approach we were taking is more precise
 than accurate. As these are statistical events they really need
 error bars. Because when we have a quiet night, and 1 job fails at
 6am in the morning, the 100% failure rate it reflects in grenade
 needs to be quantified that it was 1 of 1, not 50 of 50.


 So my feeling is we should move away from the point graphs we have,
 and present these as weekly and daily failure rates (with graphs and
 error bars). And slice those per job. My suggestion is that we do
 the actual visualization with matplotlib because it's super easy to
 output that from pandas data sets.


 The one thing that the current graph does, that weekly and daily failure
 rates don't show, is a sudden spike in one of the lines.  If you stare
 at the current graphs for long enough and can read through the noise,
 you can see when the gate collectively crashes or if just the neutron
 related gates start failing. So I think one more graph is needed.


 The point of the visualizations is to make sense to people that don't
 understand all the data, especially core members of various teams that are
 trying to figure out "if I attack 1 bug right now, what's the biggest bang
 for my buck."


Yes, that is one of the big uses for a visualization. The one I had in
mind was being able to see if a new unclassified bug appeared.



  Basically we'll be mining Elastic Search -> Pandas TimeSeries ->
 transforms and analysis -> output tables and graphs. This is
 different enough from our current jquery graphing that I want to get
 ACKs before doing a bunch of work here and finding out people don't
 like it in reviews.

 Also in this process upgrade the metadata that we provide for each
 of those bugs so it's a little more clear what you are looking at.


 For example?


 We should always be listing the bug title, not just the number. We should
 also list what projects it's filed against. I've stared at these bugs as
 much as anyone, and I still need to click through the top 4 to figure out
 which one is the ssh bug. :)


  = Take over of /recheck =

 There is still a bunch of useful data coming in on recheck bug
  data which hasn't been curated into ER queries. I think the
 right thing to do is treat these as a work queue of bugs we should
 be building patterns out of (or completely invalidating). I've got a
 preliminary gerrit bulk query piece of code that does this, which
 would remove the need of the daemon the way that's currently
 happening. The gerrit queries are a little long right now, 

Re: [openstack-dev] [elastic-recheck] Thoughts on next steps

2014-01-07 Thread Matt Riedemann



On 1/2/2014 8:29 PM, Sean Dague wrote:

A lot of elastic recheck this fall has been based on the ad hoc needs of
the moment, in between diving down into the race bugs that were
uncovered by it. This week away from it all helped provide a little
perspective on what I think we need to do to call it *done* (i.e.
something akin to a 1.0 even though we are CDing it).

Here is my current thinking on the next major things that should happen.
Opinions welcomed.

(These are roughly in implementation order based on urgency)

= Split of web UI =

The elastic recheck page is becoming a mishmash of what was needed at the
time. I think what we really have emerging is:
  * Overall Gate Health
  * Known (to ER) Bugs
  * Unknown (to ER) Bugs - more below

I think the landing page should be Known Bugs, as that's where we want
both bug hunters to go to prioritize things, as well as where people
looking for known bugs should start.

I think the overall Gate Health graphs should move to the zuul status
page. Possibly as part of the collection of graphs at the bottom.

We should have a secondary page (maybe column?) of the un-fingerprinted
recheck bugs, largely to use as candidates for fingerprinting. This will
let us eventually take over /recheck.

= Data Analysis / Graphs =

I spent a bunch of time playing with pandas over break
(http://dague.net/2013/12/30/ipython-notebook-experiments/), it's kind
of awesome. It also made me rethink our approach to handling the data.

I think the rolling average approach we were taking is more precise than
accurate. As these are statistical events they really need error bars.
Because when we have a quiet night, and 1 job fails at 6am in the
morning, the 100% failure rate it reflects in grenade needs to be
quantified that it was 1 of 1, not 50 of 50.

So my feeling is we should move away from the point graphs we have, and
present these as weekly and daily failure rates (with graphs and error
bars). And slice those per job. My suggestion is that we do the actual
visualization with matplotlib because it's super easy to output that
from pandas data sets.

Basically we'll be mining Elastic Search -> Pandas TimeSeries ->
transforms and analysis -> output tables and graphs. This is different
enough from our current jquery graphing that I want to get ACKs before
doing a bunch of work here and finding out people don't like it in reviews.

Also in this process upgrade the metadata that we provide for each of
those bugs so it's a little more clear what you are looking at.

= Take over of /recheck =

There is still a bunch of useful data coming in on recheck bug 
data which hasn't been curated into ER queries. I think the right thing
to do is treat these as a work queue of bugs we should be building
patterns out of (or completely invalidating). I've got a preliminary
gerrit bulk query piece of code that does this, which would remove the
need of the daemon the way that's currently happening. The gerrit
queries are a little long right now, but I think if we are only doing
this on hourly cron, the additional load will be negligible.

This would get us into a single view, which I think would be more
informative than the one we currently have.

= Categorize all the jobs =

We need a bit of refactoring to let us comment on all the jobs (not just
tempest ones). Basically we assumed pep8 and docs don't fail in the gate
at the beginning. Turns out they do, and are good indicators of infra /
external factor bugs. They are a part of the story so we should put them
in.

= Multi Line Fingerprints =

We've definitely found bugs where we never had a really satisfying
single line match, but we had some great matches if we could do multi line.

We could do that in ER, however it will mean giving up logstash as our
UI, because those queries can't be done in logstash. So in order to do
this we'll really need to implement some tools - cli minimum, which will
let us easily test a bug. A custom web UI might be in order as well,
though that's going to be its own chunk of work, that we'll need more
volunteers for.

This would put us in a place where we should have all the infrastructure
to track 90% of the race conditions, and talk about them with certainty as
1%, 5%, 0.1% bugs.

 -Sean



Let's add regexp query support to elastic-recheck so that I could have 
fixed this better:


https://review.openstack.org/#/c/65303/

Then I could have just filtered the build_name with this:

build_name:/(check|gate)-(tempest|grenade)-[a-z\-]+/

--

Thanks,

Matt Riedemann


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [elastic-recheck] Thoughts on next steps

2014-01-07 Thread Matt Riedemann



On 1/7/2014 5:26 PM, Sean Dague wrote:

On 01/07/2014 06:20 PM, Matt Riedemann wrote:



On 1/2/2014 8:29 PM, Sean Dague wrote:

A lot of elastic recheck this fall has been based on the ad hoc needs of
the moment, in between diving down into the race bugs that were
uncovered by it. This week away from it all helped provide a little
perspective on what I think we need to do to call it *done* (i.e.
something akin to a 1.0 even though we are CDing it).

Here is my current thinking on the next major things that should happen.
Opinions welcomed.

(These are roughly in implementation order based on urgency)

= Split of web UI =

The elastic recheck page is becoming a mishmash of what was needed at the
time. I think what we really have emerging is:
  * Overall Gate Health
  * Known (to ER) Bugs
  * Unknown (to ER) Bugs - more below

I think the landing page should be Known Bugs, as that's where we want
both bug hunters to go to prioritize things, as well as where people
looking for known bugs should start.

I think the overall Gate Health graphs should move to the zuul status
page. Possibly as part of the collection of graphs at the bottom.

We should have a secondary page (maybe column?) of the un-fingerprinted
recheck bugs, largely to use as candidates for fingerprinting. This will
let us eventually take over /recheck.

= Data Analysis / Graphs =

I spent a bunch of time playing with pandas over break
(http://dague.net/2013/12/30/ipython-notebook-experiments/), it's kind
of awesome. It also made me rethink our approach to handling the data.

I think the rolling average approach we were taking is more precise than
accurate. As these are statistical events they really need error bars.
Because when we have a quiet night, and 1 job fails at 6am in the
morning, the 100% failure rate it reflects in grenade needs to be
quantified that it was 1 of 1, not 50 of 50.

So my feeling is we should move away from the point graphs we have, and
present these as weekly and daily failure rates (with graphs and error
bars). And slice those per job. My suggestion is that we do the actual
visualization with matplotlib because it's super easy to output that
from pandas data sets.

Basically we'll be mining Elastic Search -> Pandas TimeSeries ->
transforms and analysis -> output tables and graphs. This is different
enough from our current jquery graphing that I want to get ACKs before
doing a bunch of work here and finding out people don't like it in
reviews.

Also in this process upgrade the metadata that we provide for each of
those bugs so it's a little more clear what you are looking at.

= Take over of /recheck =

There is still a bunch of useful data coming in on recheck bug 
data which hasn't been curated into ER queries. I think the right thing
to do is treat these as a work queue of bugs we should be building
patterns out of (or completely invalidating). I've got a preliminary
gerrit bulk query piece of code that does this, which would remove the
need of the daemon the way that's currently happening. The gerrit
queries are a little long right now, but I think if we are only doing
this on hourly cron, the additional load will be negligible.

This would get us into a single view, which I think would be more
informative than the one we currently have.

= Categorize all the jobs =

We need a bit of refactoring to let us comment on all the jobs (not just
tempest ones). Basically we assumed pep8 and docs don't fail in the gate
at the beginning. Turns out they do, and are good indicators of infra /
external factor bugs. They are a part of the story so we should put them
in.

= Multi Line Fingerprints =

We've definitely found bugs where we never had a really satisfying
single line match, but we had some great matches if we could do multi
line.

We could do that in ER, however it will mean giving up logstash as our
UI, because those queries can't be done in logstash. So in order to do
this we'll really need to implement some tools - cli minimum, which will
let us easily test a bug. A custom web UI might be in order as well,
though that's going to be its own chunk of work, that we'll need more
volunteers for.

This would put us in a place where we should have all the infrastructure
to track 90% of the race conditions, and talk about them with certainty as
1%, 5%, 0.1% bugs.

 -Sean



Let's add regexp query support to elastic-recheck so that I could have
fixed this better:

https://review.openstack.org/#/c/65303/

Then I could have just filtered the build_name with this:

build_name:/(check|gate)-(tempest|grenade)-[a-z\-]+/


If you want to extend the query files with:

regex:
- build_name: /(check|gate)-(tempest|grenade)-[a-z\-]+/
- some_other_field: /some other regex/

And make it work with the query builder, I think we should consider it.
It would be good to know how much more expensive those queries get
though, because our ES is under decent load as it is.

 -Sean





Yeah, alternatively we could turn on 

Re: [openstack-dev] [elastic-recheck] Thoughts on next steps

2014-01-07 Thread Sean Dague

On 01/07/2014 06:44 PM, Matt Riedemann wrote:



On 1/7/2014 5:26 PM, Sean Dague wrote:

On 01/07/2014 06:20 PM, Matt Riedemann wrote:



On 1/2/2014 8:29 PM, Sean Dague wrote:

A lot of elastic recheck this fall has been based on the ad hoc
needs of
the moment, in between diving down into the race bugs that were
uncovered by it. This week away from it all helped provide a little
perspective on what I think we need to do to call it *done* (i.e.
something akin to a 1.0 even though we are CDing it).

Here is my current thinking on the next major things that should
happen.
Opinions welcomed.

(These are roughly in implementation order based on urgency)

= Split of web UI =

The elastic recheck page is becoming a mishmash of what was needed at
the
time. I think what we really have emerging is:
  * Overall Gate Health
  * Known (to ER) Bugs
  * Unknown (to ER) Bugs - more below

I think the landing page should be Known Bugs, as that's where we want
both bug hunters to go to prioritize things, as well as where people
looking for known bugs should start.

I think the overall Gate Health graphs should move to the zuul status
page. Possibly as part of the collection of graphs at the bottom.

We should have a secondary page (maybe column?) of the un-fingerprinted
recheck bugs, largely to use as candidates for fingerprinting. This
will
let us eventually take over /recheck.

= Data Analysis / Graphs =

I spent a bunch of time playing with pandas over break
(http://dague.net/2013/12/30/ipython-notebook-experiments/), it's kind
of awesome. It also made me rethink our approach to handling the data.

I think the rolling average approach we were taking is more precise
than
accurate. As these are statistical events they really need error bars.
Because when we have a quiet night, and 1 job fails at 6am in the
morning, the 100% failure rate it reflects in grenade needs to be
quantified that it was 1 of 1, not 50 of 50.

So my feeling is we should move away from the point graphs we have, and
present these as weekly and daily failure rates (with graphs and error
bars). And slice those per job. My suggestion is that we do the actual
visualization with matplotlib because it's super easy to output that
from pandas data sets.

Basically we'll be mining Elastic Search -> Pandas TimeSeries ->
transforms and analysis -> output tables and graphs. This is different
enough from our current jquery graphing that I want to get ACKs before
doing a bunch of work here and finding out people don't like it in
reviews.

Also in this process upgrade the metadata that we provide for each of
those bugs so it's a little more clear what you are looking at.

= Take over of /recheck =

There is still a bunch of useful data coming in on recheck bug 
data which hasn't been curated into ER queries. I think the right thing
to do is treat these as a work queue of bugs we should be building
patterns out of (or completely invalidating). I've got a preliminary
gerrit bulk query piece of code that does this, which would remove the
need of the daemon the way that's currently happening. The gerrit
queries are a little long right now, but I think if we are only doing
this on hourly cron, the additional load will be negligible.

This would get us into a single view, which I think would be more
informative than the one we currently have.

= Categorize all the jobs =

We need a bit of refactoring to let us comment on all the jobs (not
just
tempest ones). Basically we assumed pep8 and docs don't fail in the
gate
at the beginning. Turns out they do, and are good indicators of infra /
external factor bugs. They are a part of the story so we should put
them
in.

= Multi Line Fingerprints =

We've definitely found bugs where we never had a really satisfying
single line match, but we had some great matches if we could do multi
line.

We could do that in ER, however it will mean giving up logstash as our
UI, because those queries can't be done in logstash. So in order to do
this we'll really need to implement some tools - cli minimum, which
will
let us easily test a bug. A custom web UI might be in order as well,
though that's going to be its own chunk of work, that we'll need more
volunteers for.

This would put us in a place where we should have all the
infrastructure
to track 90% of the race conditions, and talk about them with
certainty as
1%, 5%, 0.1% bugs.

 -Sean



Let's add regexp query support to elastic-recheck so that I could have
fixed this better:

https://review.openstack.org/#/c/65303/

Then I could have just filtered the build_name with this:

build_name:/(check|gate)-(tempest|grenade)-[a-z\-]+/


If you want to extend the query files with:

regex:
- build_name: /(check|gate)-(tempest|grenade)-[a-z\-]+/
- some_other_field: /some other regex/

And make it work with the query builder, I think we should consider it.
It would be good to know how much more expensive those queries get
though, because our ES is under decent load as it is.
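
As a rough sketch only (this is not the actual elastic-recheck
query_builder; the filtered/regexp request body and everything beyond the
existing query: key in the yaml files are assumptions, and the query string
in the example is just a placeholder), the expansion could look something
like:

import yaml

def build_query(query_file_contents):
    """Turn a query yaml file (with the proposed regex: section) into an
    ES request body: the usual query_string plus one regexp filter per
    listed field."""
    spec = yaml.safe_load(query_file_contents)
    filters = []
    for entry in spec.get("regex", []):
        for field, pattern in entry.items():
            # regexp filters take the bare pattern, without the /.../
            # wrapping used in query_string syntax
            filters.append({"regexp": {field: pattern.strip("/")}})
    query = {"query_string": {"query": spec["query"]}}
    if filters:
        query = {"filtered": {"query": query, "filter": {"and": filters}}}
    return {"query": query}

example = r"""
query: 'filename:"console.html" AND message:"FAILED"'
regex:
  - build_name: /(check|gate)-(tempest|grenade)-[a-z\-]+/
"""
print(build_query(example))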

 -Sean




Re: [openstack-dev] [elastic-recheck] Thoughts on next steps

2014-01-06 Thread Joe Gordon
Overall this looks really good, and very spot on.


On Thu, Jan 2, 2014 at 6:29 PM, Sean Dague s...@dague.net wrote:

 A lot of elastic recheck this fall has been based on the ad hoc needs of
 the moment, in between diving down into the race bugs that were uncovered
 by it. This week away from it all helped provide a little perspective on
 what I think we need to do to call it *done* (i.e. something akin to a 1.0
 even though we are CDing it).

 Here is my current thinking on the next major things that should happen.
 Opinions welcomed.

 (These are roughly in implementation order based on urgency)

 = Split of web UI =

 The elastic recheck page is becoming a mishmash of what was needed at the
 time. I think what we really have emerging is:
  * Overall Gate Health
  * Known (to ER) Bugs
  * Unknown (to ER) Bugs - more below

 I think the landing page should be Known Bugs, as that's where we want both
 bug hunters to go to prioritize things, as well as where people looking for
 known bugs should start.

 I think the overall Gate Health graphs should move to the zuul status
 page. Possibly as part of the collection of graphs at the bottom.

 We should have a secondary page (maybe column?) of the un-fingerprinted
 recheck bugs, largely to use as candidates for fingerprinting. This will
 let us eventually take over /recheck.


I think it would be cool to collect the list of unclassified failures (not
by recheck bug), so we can see how many (and what percentage) need to be
classified. This isn't gate health but more of e-r health or something like
that.


 = Data Analysis / Graphs =

 I spent a bunch of time playing with pandas over break (
 http://dague.net/2013/12/30/ipython-notebook-experiments/), it's kind of
 awesome. It also made me rethink our approach to handling the data.

 I think the rolling average approach we were taking is more precise than
 accurate. As these are statistical events they really need error bars.
 Because when we have a quiet night, and 1 job fails at 6am in the morning,
 the 100% failure rate it reflects in grenade needs to be quantified that it
 was 1 of 1, not 50 of 50.


 So my feeling is we should move away from the point graphs we have, and
 present these as weekly and daily failure rates (with graphs and error
 bars). And slice those per job. My suggestion is that we do the actual
 visualization with matplotlib because it's super easy to output that from
 pandas data sets.


The one thing that the current graph does, that weekly and daily failure
rates don't show, is a sudden spike in one of the lines.  If you stare at
the current graphs for long enough and can read through the noise, you can
see when the gate collectively crashes or if just the neutron related gates
start failing. So I think one more graph is needed.



 Basically we'll be mining Elastic Search -> Pandas TimeSeries ->
 transforms and analysis -> output tables and graphs. This is different
 enough from our current jquery graphing that I want to get ACKs before
 doing a bunch of work here and finding out people don't like it in reviews.

 Also in this process upgrade the metadata that we provide for each of
 those bugs so it's a little more clear what you are looking at.


For example?



 = Take over of /recheck =

 There is still a bunch of useful data coming in on recheck bug  data
 which hasn't been curated into ER queries. I think the right thing to do is
 treat these as a work queue of bugs we should be building patterns out of
 (or completely invalidating). I've got a preliminary gerrit bulk query
 piece of code that does this, which would remove the need of the daemon the
 way that's currently happening. The gerrit queries are a little long right
 now, but I think if we are only doing this on hourly cron, the additional
 load will be negligible.

 This would get us into a single view, which I think would be more
 informative than the one we currently have.


Treating /recheck as a work queue sounds great, but this needs a bit more
fleshing out I think.

I imagine the workflow as something like this:

* State 1: Patch author files a bug saying 'gate broke, I didn't do it and
don't know why it broke'.
* State 2: Someone investigates the bug and determines if the bug is valid and
if it's a duplicate or not. Root cause still isn't known.
* State 3: Someone writes a fingerprint for this bug and commits it to
elastic-recheck.

Assuming we agree on this general workflow, it would be nice if /recheck
distinguished between bugs in states 1 and 2, and there is no need to list
bugs in state 3 as e-r bot will automatically tell a developer when he hits
it.



 = Categorize all the jobs =

 We need a bit of refactoring to let us comment on all the jobs (not just
 tempest ones). Basically we assumed pep8 and docs don't fail in the gate at
 the beginning. Turns out they do, and are good indicators of infra /
 external factor bugs. They are a part of the story so we should put them in.


Don't forget grenade



 = Multi Line 

Re: [openstack-dev] [elastic-recheck] Thoughts on next steps

2014-01-06 Thread Sean Dague

On 01/06/2014 07:04 PM, Joe Gordon wrote:

Overall this looks really good, and very spot on.


On Thu, Jan 2, 2014 at 6:29 PM, Sean Dague s...@dague.net wrote:

A lot of elastic recheck this fall has been based on the ad hoc
needs of the moment, in between diving down into the race bugs that
were uncovered by it. This week away from it all helped provide a
little perspective on what I think we need to do to call it *done*
(i.e. something akin to a 1.0 even though we are CDing it).

Here is my current thinking on the next major things that should
happen. Opinions welcomed.

(These are roughly in implementation order based on urgency)

= Split of web UI =

The elastic recheck page is becoming a mishmash of what was needed at
the time. I think what we really have emerging is:
  * Overall Gate Health
  * Known (to ER) Bugs
  * Unknown (to ER) Bugs - more below

I think the landing page should be Known Bugs, as that's where we
want both bug hunters to go to prioritize things, as well as where
people looking for known bugs should start.

I think the overall Gate Health graphs should move to the zuul
status page. Possibly as part of the collection of graphs at the bottom.

We should have a secondary page (maybe column?) of the
un-fingerprinted recheck bugs, largely to use as candidates for
fingerprinting. This will let us eventually take over /recheck.


I think it would be cool to collect the list of unclassified failures
(not by recheck bug), so we can see how many (and what percentage) need
to be classified. This isn't gate health but more of e-r health or
something like that.


Agreed. I've got the percentage in check_success today, but I agree that
every gate job that fails without a matching fingerprint should be
listed somewhere so we can work through them.
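
As a minimal sketch of that (assuming we already have the failed gate
builds out of logstash and the set of build_uuids that some fingerprint
matched; the helper and remaining names here are made up):

def classification_report(failed_builds, classified_uuids):
    """failed_builds: one dict per failed gate job, carrying the
    build_uuid/build_name fields the logstash workers tag;
    classified_uuids: build_uuids matched by at least one fingerprint."""
    unclassified = [b for b in failed_builds
                    if b['build_uuid'] not in classified_uuids]
    total = len(failed_builds)
    pct = 100.0 * (total - len(unclassified)) / total if total else 0.0
    # pct is the e-r health number; unclassified is the work queue of
    # failures that still need a fingerprint written
    return pct, unclassified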




= Data Analysis / Graphs =

I spent a bunch of time playing with pandas over break
(http://dague.net/2013/12/30/ipython-notebook-experiments/), it's
kind of awesome. It also made me rethink our approach to handling
the data.

I think the rolling average approach we were taking is more precise
than accurate. As these are statistical events they really need
error bars. Because when we have a quiet night, and 1 job fails at
6am in the morning, the 100% failure rate it reflects in grenade
needs to be quantified that it was 1 of 1, not 50 of 50.


So my feeling is we should move away from the point graphs we have,
and present these as weekly and daily failure rates (with graphs and
error bars). And slice those per job. My suggestion is that we do
the actual visualization with matplotlib because it's super easy to
output that from pandas data sets.


The one thing that the current graph does, that weekly and daily failure
rates don't show, is a sudden spike in one of the lines.  If you stare
at the current graphs for long enough and can read through the noise,
you can see when the gate collectively crashes or if just the neutron
related gates start failing. So I think one more graph is needed.


The point of the visualizations is to make sense to people that don't
understand all the data, especially core members of various teams that
are trying to figure out "if I attack 1 bug right now, what's the
biggest bang for my buck."



Basically we'll be mining Elastic Search -> Pandas TimeSeries ->
transforms and analysis -> output tables and graphs. This is
different enough from our current jquery graphing that I want to get
ACKs before doing a bunch of work here and finding out people don't
like it in reviews.

Also in this process upgrade the metadata that we provide for each
of those bugs so it's a little more clear what you are looking at.


For example?


We should always be listing the bug title, not just the number. We 
should also list what projects it's filed against. I've stared at these
bugs as much as anyone, and I still need to click through the top 4 to 
figure out which one is the ssh bug. :)



= Take over of /recheck =

There is still a bunch of useful data coming in on recheck bug
 data which hasn't been curated into ER queries. I think the
right thing to do is treat these as a work queue of bugs we should
be building patterns out of (or completely invalidating). I've got a
preliminary gerrit bulk query piece of code that does this, which
would remove the need of the daemon the way that's currently
happening. The gerrit queries are a little long right now, but I
think if we are only doing this on hourly cron, the additional load
will be negligible.

This would get us into a single view, which I think would be more
informative than the one we currently have.


treating /recheck as a work queue sounds great, but this needs a bit
more 

Re: [openstack-dev] [elastic-recheck] Thoughts on next steps

2014-01-05 Thread James E. Blair
Sean Dague s...@dague.net writes:

 I think the main user-visible aspect of this decision is the delay
 before unprocessed bugs are made visible.  If a bug starts affecting a
 number of jobs, it might be nice to see what bug numbers people are
 using for rechecks without waiting for the next cron run.

 So my experience is that most rechecks happen  1 hr after a patch
 fails. And the people that are sitting on patches for bugs that have
 never been seen before find their way to IRC.

 The current state of the world is not all roses and unicorns. The
 recheck daemon has died, and not been noticed that it was dead for
 *weeks*. So a guarantee that we are only 1 hr delayed would actually
 be on average better than the delays we've seen over the last six
 months of following the event stream.

I wasn't suggesting that we keep the recheck daemon, I was suggesting
moving the real-time observation of rechecks into the elastic-recheck
daemon which will remain an important component of this system for the
foreseeable future.  It is fairly reliable and if it does die, we will
desperately want to get it running again and fix the underlying problem
because it is so helpful.

 I also think that caching should probably actually happen in gerritlib
 itself. There is a concern that too many things are hitting gerrit,
 and the result is that everyone is implementing their own client side
 caching to try to be nice. (like the pickles in Russell's review stats
 programs). This seems like the wrong place to be doing it.

That's not a bad idea, however it doesn't really address the fact that
you're looking for events -- you need to run a very large bulk query to
find all of the reviews over a certain amount of time.  You could reduce
this by caching results and then only querying reviews that are newer
than the last update.  But even so, you'll always have to query for that
window.  That's not as bad as querying for the same two weeks of data
every X minutes, but since there's already a daemon watching all of the
events anyway in real time, you already have the information if you just
don't discard it.

 But, part of the reason for this email was to sort these sorts of
 issues out, so let me know if you think the caching issue is an
 architectural blocker.

 Because if we're generally agreed on the architecture forward and are
 just reviewing for correctness, the code can move fast, and we can
 actually have ER 1.0 by the end of the month. Architecture review in
 gerrit is where we grind to a halt.

It looks like the bulk queries take about 4 full minutes of Gerrit CPU
time to fetch data from the last two weeks (and the last two weeks have
been quiet; I'd expect the next two weeks to take longer).  I don't
think it's going to kill us, but I think there are some really easy ways
to make this way more efficient, which isn't just about being nice to
Gerrit, but is also about being responsive for users.

My first preference is still to use the real-time data that the e-r
daemon collects already and feed it to the dashboard.

If you feel like the inter-process communication needed for that will
slow you down too much, then my second preference would be to introduce
local caching of the results so that you can query for
-age:query-interval instead of the full two weeks every time.  (And
if it's generalized enough, sure let's add that to gerritlib.)

I really think we at least ought to do one of those.  Running the same
bulk query repeatedly is, in this case, so inefficient that I think this
little bit of optimization is not premature.

Thanks again for working on this.  I really appreciate it and the time
you're spending on architecture.

-Jim

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [elastic-recheck] Thoughts on next steps

2014-01-05 Thread Sean Dague

On 01/05/2014 05:49 PM, James E. Blair wrote:

Sean Dague s...@dague.net writes:


I think the main user-visible aspect of this decision is the delay
before unprocessed bugs are made visible.  If a bug starts affecting a
number of jobs, it might be nice to see what bug numbers people are
using for rechecks without waiting for the next cron run.


So my experience is that most rechecks happen  1 hr after a patch
fails. And the people that are sitting on patches for bugs that have
never been seen before find their way to IRC.

The current state of the world is not all roses and unicorns. The
recheck daemon has died, and not been noticed that it was dead for
*weeks*. So a guarantee that we are only 1 hr delayed would actually
be on average better than the delays we've seen over the last six
months of following the event stream.


I wasn't suggesting that we keep the recheck daemon, I was suggesting
moving the real-time observation of rechecks into the elastic-recheck
daemon which will remain an important component of this system for the
foreseeable future.  It is fairly reliable and if it does die, we will
desperately want to get it running again and fix the underlying problem
because it is so helpful.


That's a possible place to put it. The daemon is a bit of a mess at the 
moment, so I was hoping to not refactor it until the end of the month as 
part of the cleaning up to handle the additional jobs.



I also think that caching should probably actually happen in gerritlib
itself. There is a concern that too many things are hitting gerrit,
and the result is that everyone is implementing their own client side
caching to try to be nice. (like the pickles in Russell's review stats
programs). This seems like the wrong place to be doing it.


That's not a bad idea, however it doesn't really address the fact that
you're looking for events -- you need to run a very large bulk query to
find all of the reviews over a certain amount of time.  You could reduce
this by caching results and then only querying reviews that are newer
than the last update.  But even so, you'll always have to query for that
window.  That's not as bad as querying for the same two weeks of data
every X minutes, but since there's already a daemon watching all of the
events anyway in real time, you already have the information if you just
don't discard it.


I don't really want to trust us not failing, because we do. So we're 
going to need replay ability anyway.



But, part of the reason for this email was to sort these sorts of
issues out, so let me know if you think the caching issue is an
architectural blocker.

Because if we're generally agreed on the architecture forward and are
just reviewing for correctness, the code can move fast, and we can
actually have ER 1.0 by the end of the month. Architecture review in
gerrit is where we grind to a halt.


It looks like the bulk queries take about 4 full minutes of Gerrit CPU
time to fetch data from the last two weeks (and the last two weeks have
been quiet; I'd expect the next two weeks to take longer).  I don't
think it's going to kill us, but I think there are some really easy ways
to make this way more efficient, which isn't just about being nice to
Gerrit, but is also about being responsive for users.


Interesting, I thought this was more like 1 minute. 4 definitely gets a 
bit wonkier.



My first preference is still to use the real-time data that the e-r
daemon collects already and feed it to the dashboard.

If you feel like the inter-process communication needed for that will
slow you down too much, then my second preference would be to introduce
local caching of the results so that you can query for
-age:query-interval instead of the full two weeks every time.  (And
if it's generalized enough, sure let's add that to gerritlib.)


Yeh, the biggest complexity is the result merge. I was finding that 
-age:4h still ended up returning nearly 20% of the entire dataset, and
wasn't as much quicker as you'd expect.


But the new data and the old data are overlapping a lot, because you can 
only query by time on the review, not on the comments. And those are 
leaves in funny ways.


I think the right way to do that would be to build on top of pandas data
series merge functionality. All good things, just new building blocks we 
don't have yet.
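
As a minimal sketch of that merge (assuming the cached results and a fresh
-age:<window> query have both been flattened into DataFrames with one row
per review comment; the column names are made up):

import pandas as pd

def merge_comment_frames(cached, fresh):
    """Concatenate the overlapping windows and drop duplicate comments,
    keyed on review number plus comment timestamp."""
    merged = pd.concat([cached, fresh], ignore_index=True)
    merged = merged.drop_duplicates(subset=['review', 'timestamp'])
    return merged.sort_values('timestamp').reset_index(drop=True)

# the max timestamp of the merged frame is what the next run would use to
# size its -age: window (plus a little slack for out-of-order comments)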



I really think we at least ought to do one of those.  Running the same
bulk query repeatedly is, in this case, so inefficient that I think this
little bit of optimization is not premature.


Sure, I wonder how the various other review stats tools are handling 
this case. Putting Russell and Ilya (Stackalytics) into the mix. Because 
it seems like we should have a common solution here for all the tools 
hitting gerrit on cron for largely the same info.


-Sean

--
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net

___
OpenStack-dev mailing list

Re: [openstack-dev] [elastic-recheck] Thoughts on next steps

2014-01-03 Thread James E. Blair
Sean Dague s...@dague.net writes:

 So my feeling is we should move away from the point graphs we have,
 and present these as weekly and daily failure rates (with graphs and
 error bars). And slice those per job. My suggestion is that we do the
 actual visualization with matplotlib because it's super easy to output
 that from pandas data sets.

I am very excited about this and everything above it!

 = Take over of /recheck =

 There is still a bunch of useful data coming in on recheck bug 
 data which hasn't been curated into ER queries. I think the right
 thing to do is treat these as a work queue of bugs we should be
 building patterns out of (or completely invalidating). I've got a
 preliminary gerrit bulk query piece of code that does this, which
 would remove the need of the daemon the way that's currently
 happening. The gerrit queries are a little long right now, but I think
 if we are only doing this on hourly cron, the additional load will be
 negligible.

I think this is fine and am all for reducing complexity, but consider
this alternative: over the break, I moved both components of
elastic-recheck onto a new server (status.openstack.org).  Since they
are now co-located, you could have the component of e-r that watches the
stream to provide responses to gerrit also note recheck actions.  You
could stick the data in a file, memcache, trove database, etc, and the
status page could display that work queue.  No extra daemons required.

I think the main user-visible aspect of this decision is the delay
before unprocessed bugs are made visible.  If a bug starts affecting a
number of jobs, it might be nice to see what bug numbers people are
using for rechecks without waiting for the next cron run.

On another topic, it's worth mentioning that we now (again, this is new
from over the break) have timeouts _inside_ the devstack-gate jobs that
should hit before the Jenkins timeout, so log collection for
devstack-gate jobs that run long and hit the timeout should still happen
(meaning that e-r can now see these failures).

Thanks for all your work on this.  I think it's extremely useful and
exciting!

-Jim

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [elastic-recheck] Thoughts on next steps

2014-01-02 Thread Sean Dague
A lot of elastic recheck this fall has been based on the ad hoc needs of 
the moment, in between diving down into the race bugs that were 
uncovered by it. This week away from it all helped provide a little 
perspective on what I think we need to do to call it *done* (i.e. 
something akin to a 1.0 even though we are CDing it).


Here is my current thinking on the next major things that should happen. 
Opinions welcomed.


(These are roughly in implementation order based on urgency)

= Split of web UI =

The elastic recheck page is becoming a mishmash of what was needed at the
time. I think what we really have emerging is:

 * Overall Gate Health
 * Known (to ER) Bugs
 * Unknown (to ER) Bugs - more below

I think the landing page should be Known Bugs, as that's where we want
both bug hunters to go to prioritize things, as well as where people 
looking for known bugs should start.


I think the overall Gate Health graphs should move to the zuul status 
page. Possibly as part of the collection of graphs at the bottom.


We should have a secondary page (maybe column?) of the un-fingerprinted 
recheck bugs, largely to use as candidates for fingerprinting. This will 
let us eventually take over /recheck.


= Data Analysis / Graphs =

I spent a bunch of time playing with pandas over break 
(http://dague.net/2013/12/30/ipython-notebook-experiments/), it's kind 
of awesome. It also made me rethink our approach to handling the data.


I think the rolling average approach we were taking is more precise than 
accurate. As these are statistical events they really need error bars. 
Because when we have a quiet night, and 1 job fails at 6am in the 
morning, the 100% failure rate it reflects in grenade needs to be 
quantified that it was 1 of 1, not 50 of 50.


So my feeling is we should move away from the point graphs we have, and 
present these as weekly and daily failure rates (with graphs and error 
bars). And slice those per job. My suggestion is that we do the actual 
visualization with matplotlib because it's super easy to output that 
from pandas data sets.


Basically we'll be mining Elastic Search -> Pandas TimeSeries ->
transforms and analysis -> output tables and graphs. This is different
enough from our current jquery graphing that I want to get ACKs before 
doing a bunch of work here and finding out people don't like it in reviews.
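
As a minimal sketch of the pandas/matplotlib end of that pipeline (assuming
the ES hits are already pulled down as a list of dicts carrying the
@timestamp, build_name and build_status fields the logstash workers tag;
the rest of the names here are made up):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def daily_failure_rates(hits, job):
    """Daily failure rate for one job, with Laplace-adjusted binomial
    error bars so a 1-of-1 failure day gets a wide bar instead of a
    confident-looking 100%."""
    df = pd.DataFrame(hits)
    df['@timestamp'] = pd.to_datetime(df['@timestamp'])
    df = df[df['build_name'] == job].set_index('@timestamp')
    failed = (df['build_status'] == 'FAILURE').astype(float)
    grouped = failed.groupby(failed.index.date)
    fails, runs = grouped.sum(), grouped.count()
    daily = pd.DataFrame({'rate': grouped.mean(), 'runs': runs})
    p_adj = (fails + 1.0) / (runs + 2.0)
    daily['err'] = np.sqrt(p_adj * (1 - p_adj) / (runs + 2.0))
    return daily

def plot_rates(daily, job):
    plt.errorbar(daily.index, daily['rate'], yerr=daily['err'], fmt='o-')
    plt.ylabel('failure rate')
    plt.title('%s daily failure rate (error bars = 1 std err)' % job)
    plt.savefig('%s-daily.png' % job)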


Also in this process upgrade the metadata that we provide for each of 
those bugs so it's a little more clear what you are looking at.


= Take over of /recheck =

There is still a bunch of useful data coming in on recheck bug  
data which hasn't been curated into ER queries. I think the right thing 
to do is treat these as a work queue of bugs we should be building 
patterns out of (or completely invalidating). I've got a preliminary 
gerrit bulk query piece of code that does this, which would remove the 
need of the daemon the way that's currently happening. The gerrit 
queries are a little long right now, but I think if we are only doing 
this on hourly cron, the additional load will be negligible.


This would get us into a single view, which I think would be more 
informative than the one we currently have.


= Categorize all the jobs =

We need a bit of refactoring to let us comment on all the jobs (not just 
tempest ones). Basically we assumed pep8 and docs don't fail in the gate 
at the beginning. Turns out they do, and are good indicators of infra / 
external factor bugs. They are a part of the story so we should put them in.


= Multi Line Fingerprints =

We've definitely found bugs where we never had a really satisfying 
single line match, but we had some great matches if we could do multi line.


We could do that in ER, however it will mean giving up logstash as our 
UI, because those queries can't be done in logstash. So in order to do 
this we'll really need to implement some tools - cli minimum, which will 
let us easily test a bug. A custom web UI might be in order as well, 
though that's going to be its own chunk of work, that we'll need more
volunteers for.
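
As a minimal sketch of what such a cli tool would do with a multi line
fingerprint (each log line is its own document in ES, so the tool would run
one query per line and intersect build_uuids client side; run_query stands
in for whatever ES helper gets used, and the queries below are just
placeholders):

def builds_matching(run_query, fingerprint_lines):
    """Return the build_uuids whose logs matched every line of the
    fingerprint; run_query(q) is assumed to return raw ES hits."""
    matched = None
    for line_query in fingerprint_lines:
        hits = run_query(line_query)
        uuids = set(hit['_source']['build_uuid'] for hit in hits)
        matched = uuids if matched is None else matched & uuids
    return matched or set()

# a hypothetical two line fingerprint:
lines = [
    'message:"Timed out waiting for thing" AND filename:"logs/screen-n-cpu.txt"',
    'message:"SSHTimeout" AND filename:"console.html"',
]
# failing_builds = builds_matching(es_helper, lines)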


This would put us in a place where we should have all the infrastructure 
to track 90% of the race conditions, and talk about them with certainty as
1%, 5%, 0.1% bugs.


-Sean

--
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [elastic-recheck] Thoughts on next steps

2014-01-02 Thread Clark Boylan
On Thu, Jan 2, 2014 at 6:29 PM, Sean Dague s...@dague.net wrote:
 A lot of elastic recheck this fall has been based on the ad hoc needs of the
 moment, in between diving down into the race bugs that were uncovered by it.
 This week away from it all helped provide a little perspective on what I
 think we need to do to call it *done* (i.e. something akin to a 1.0 even
 though we are CDing it).

 Here is my current thinking on the next major things that should happen.
 Opinions welcomed.

 (These are roughly in implementation order based on urgency)

 = Split of web UI =

 The elastic recheck page is becoming a mishmash of what was needed at the
 time. I think what we really have emerging is:
  * Overall Gate Health
  * Known (to ER) Bugs
  * Unknown (to ER) Bugs - more below

 I think the landing page should be Known Bugs, as that's where we want both
 bug hunters to go to prioritize things, as well as where people looking for
 known bugs should start.

 I think the overall Gate Health graphs should move to the zuul status page.
 Possibly as part of the collection of graphs at the bottom.

 We should have a secondary page (maybe column?) of the un-fingerprinted
 recheck bugs, largely to use as candidates for fingerprinting. This will let
 us eventually take over /recheck.

 = Data Analysis / Graphs =

 I spent a bunch of time playing with pandas over break
 (http://dague.net/2013/12/30/ipython-notebook-experiments/), it's kind of
 awesome. It also made me rethink our approach to handling the data.

 I think the rolling average approach we were taking is more precise than
 accurate. As these are statistical events they really need error bars.
 Because when we have a quiet night, and 1 job fails at 6am in the morning,
 the 100% failure rate it reflects in grenade needs to be quantified that it
 was 1 of 1, not 50 of 50.

 So my feeling is we should move away from the point graphs we have, and
 present these as weekly and daily failure rates (with graphs and error
 bars). And slice those per job. My suggestion is that we do the actual
 visualization with matplotlib because it's super easy to output that from
 pandas data sets.

 Basically we'll be mining Elastic Search -> Pandas TimeSeries -> transforms
 and analysis -> output tables and graphs. This is different enough from our
 current jquery graphing that I want to get ACKs before doing a bunch of work
 here and finding out people don't like it in reviews.

 Also in this process upgrade the metadata that we provide for each of those
 bugs so it's a little more clear what you are looking at.

 = Take over of /recheck =

 There is still a bunch of useful data coming in on recheck bug  data
 which hasn't been curated into ER queries. I think the right thing to do is
 treat these as a work queue of bugs we should be building patterns out of
 (or completely invalidating). I've got a preliminary gerrit bulk query piece
 of code that does this, which would remove the need of the daemon the way
 that's currently happening. The gerrit queries are a little long right now,
 but I think if we are only doing this on hourly cron, the additional load
 will be negligible.

 This would get us into a single view, which I think would be more
 informative than the one we currently have.

 = Categorize all the jobs =

 We need a bit of refactoring to let us comment on all the jobs (not just
 tempest ones). Basically we assumed pep8 and docs don't fail in the gate at
 the beginning. Turns out they do, and are good indicators of infra /
 external factor bugs. They are a part of the story so we should put them in.

 = Multi Line Fingerprints =

 We've definitely found bugs where we never had a really satisfying single
 line match, but we had some great matches if we could do multi line.

 We could do that in ER, however it will mean giving up logstash as our UI,
 because those queries can't be done in logstash. So in order to do this
 we'll really need to implement some tools - cli minimum, which will let us
 easily test a bug. A custom web UI might be in order as well, though that's
 going to be its own chunk of work, that we'll need more volunteers for.

 This would put us in a place where we should have all the infrastructure to
 track 90% of the race conditions, and talk about them with certainty as 1%,
 5%, 0.1% bugs.

 -Sean

 --
 Sean Dague
 Samsung Research America
 s...@dague.net / sean.da...@samsung.com
 http://dague.net

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

This is great stuff. Out of curiosity, is doing the graphing with
pandas and ES vs graphite so that we can graph things in a more ad hoc
fashion? Also, for the dashboard, Kibana3 does a lot more stuff than
Kibana2 which we currently use. I have been meaning to get Kibana3
running alongside Kibana2 and I think it may be able to do multi line
queries (I need to double 

Re: [openstack-dev] [elastic-recheck] Thoughts on next steps

2014-01-02 Thread Clark Boylan
On Thu, Jan 2, 2014 at 6:44 PM, Clark Boylan clark.boy...@gmail.com wrote:
 On Thu, Jan 2, 2014 at 6:29 PM, Sean Dague s...@dague.net wrote:
 A lot of elastic recheck this fall has been based on the ad hoc needs of the
 moment, in between diving down into the race bugs that were uncovered by it.
 This week away from it all helped provide a little perspective on what I
 think we need to do to call it *done* (i.e. something akin to a 1.0 even
 though we are CDing it).

 Here is my current thinking on the next major things that should happen.
 Opinions welcomed.

 (These are roughly in implementation order based on urgency)

 = Split of web UI =

 The elastic recheck page is becoming a mishmash of what was needed at the
 time. I think what we really have emerging is:
  * Overall Gate Health
  * Known (to ER) Bugs
  * Unknown (to ER) Bugs - more below

 I think the landing page should be Known Bugs, as that's where we want both
 bug hunters to go to prioritize things, as well as where people looking for
 known bugs should start.

 I think the overall Gate Health graphs should move to the zuul status page.
 Possibly as part of the collection of graphs at the bottom.

 We should have a secondary page (maybe column?) of the un-fingerprinted
 recheck bugs, largely to use as candidates for fingerprinting. This will let
 us eventually take over /recheck.

 = Data Analysis / Graphs =

 I spent a bunch of time playing with pandas over break
 (http://dague.net/2013/12/30/ipython-notebook-experiments/), it's kind of
 awesome. It also made me rethink our approach to handling the data.

 I think the rolling average approach we were taking is more precise than
 accurate. As these are statistical events they really need error bars.
 Because when we have a quiet night, and 1 job fails at 6am in the morning,
 the 100% failure rate it reflects in grenade needs to be quantified that it
 was 1 of 1, not 50 of 50.

 So my feeling is we should move away from the point graphs we have, and
 present these as weekly and daily failure rates (with graphs and error
 bars). And slice those per job. My suggestion is that we do the actual
 visualization with matplotlib because it's super easy to output that from
 pandas data sets.

 Basically we'll be mining Elastic Search -> Pandas TimeSeries -> transforms
 and analysis -> output tables and graphs. This is different enough from our
 current jquery graphing that I want to get ACKs before doing a bunch of work
 here and finding out people don't like it in reviews.

 Also in this process upgrade the metadata that we provide for each of those
 bugs so it's a little more clear what you are looking at.

 = Take over of /recheck =

 There is still a bunch of useful data coming in on recheck bug  data
 which hasn't been curated into ER queries. I think the right thing to do is
 treat these as a work queue of bugs we should be building patterns out of
 (or completely invalidating). I've got a preliminary gerrit bulk query piece
 of code that does this, which would remove the need of the daemon the way
 that's currently happening. The gerrit queries are a little long right now,
 but I think if we are only doing this on hourly cron, the additional load
 will be negligible.

 This would get us into a single view, which I think would be more
 informative than the one we currently have.

 = Categorize all the jobs =

 We need a bit of refactoring to let us comment on all the jobs (not just
 tempest ones). Basically we assumed pep8 and docs don't fail in the gate at
 the beginning. Turns out they do, and are good indicators of infra /
 external factor bugs. They are a part of the story so we should put them in.

 = Multi Line Fingerprints =

 We've definitely found bugs where we never had a really satisfying single
 line match, but we had some great matches if we could do multi line.

 We could do that in ER, however it will mean giving up logstash as our UI,
 because those queries can't be done in logstash. So in order to do this
 we'll really need to implement some tools - cli minimum, which will let us
 easily test a bug. A custom web UI might be in order as well, though that's
 going to be its own chunk of work, that we'll need more volunteers for.

 This would put us in a place where we should have all the infrastructure to
 track 90% of the race conditions, and talk about them with certainty as 1%,
 5%, 0.1% bugs.

 -Sean

 --
 Sean Dague
 Samsung Research America
 s...@dague.net / sean.da...@samsung.com
 http://dague.net

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

 This is great stuff. Out of curiosity, is doing the graphing with
 pandas and ES vs graphite so that we can graph things in a more ad hoc
 fashion? Also, for the dashboard, Kibana3 does a lot more stuff than
 Kibana2 which we currently use. I have been meaning to get Kibana3
 running alongside 

Re: [openstack-dev] [elastic-recheck] Thoughts on next steps

2014-01-02 Thread Sean Dague

On 01/02/2014 09:44 PM, Clark Boylan wrote:
snip

This is great stuff. Out of curiousity is doing the graphing with
pandas and ES vs graphite so that we can graph things in a more ad hoc
fashion?


So, we need to go to ES for the fingerprints anyway (because that's 
where we mine them from), which means we need a way to process ES data 
into TimeSeries. In order to calculate frequencies we need largely 
equivalent TimeSeries that are baselines for # of jobs run of
particular types. Given that we can get that with an ES query, it 
prevents the need of having to have a different data transformation 
process to get to the same kind of TimeSeries.


It also lets us bulk query. With 1 ~20second ES query we get all states, 
of all jobs, across all queues, over the last 7 days (as well as 
information on review). And the transform to slice is super easy because 
it's 10s of thousands of records that are dictionaries, which makes for 
good input. You'd need to do a bunch of unbinning and transforms to 
massage the graphite data to pair with what we have in the fingerprint data.


Eventually having tools to do the same thing with graphite is probably
a good thing, largely for other analysis people want to do on that (I 
think long term having some data kits for our bulk data to let people 
play with it is goodness). I'd just put it after a 1.0 as I think it's 
not really needed.



Also, for the dashboard, Kibana3 does a lot more stuff than
Kibana2 which we currently use. I have been meaning to get Kibana3
running alongside Kibana2 and I think it may be able to do multi line
queries (I need to double check that but it has a lot more query and
graphing capability). I think Kibana3 is worth looking into as well
before we go too far down the road of custom UI.


Absolutely. There is a reason that's all the way at the bottom of the 
list, and honestly, something I almost didn't put in there. But I 
figured we needed to understand the implications of multi line matches 
with our current UI, and the fact that they will make some things 
better, but discovering those matches will be harder with the existing UI.


If Kibana3 solves it, score. One less thing to do. Because I'd really 
like to not be in the business of maintaining a custom web UI just for 
discovery of fingerprints.


-Sean

--
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev