Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Andrew Dunstan

Joshua D. Drake wrote:

Andrew Dunstan wrote:

Tom Lane wrote:

Martijn van Oosterhout  writes:

But I'm just sprouting ideas here, the proof is in the pudding. If the
logs are easily available (or a subset of, say the last month) then
people could play with that and see what happens...

Anyone who wants to play around can replicate what I did, which was to
download the table that Andrew made available upthread, and then pull
the log files matching interesting rows.


To save people this trouble, I have made an extract for the last 3
months, augmented by log field, which is pretty much the last stage log.
The dump is 27Mb and can be got at

Should we just automate this and make it a weekly?


Sure. Talk to me offline about it - very simple to do.



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Joshua D. Drake
Andrew Dunstan wrote:
> Tom Lane wrote:
>> Martijn van Oosterhout  writes:
>>> But I'm just sprouting ideas here, the proof is in the pudding. If the
>>> logs are easily available (or a subset of, say the last month) then
>>> people could play with that and see what happens...
>> Anyone who wants to play around can replicate what I did, which was to
>> download the table that Andrew made available upthread, and then pull
>> the log files matching interesting rows.
> [snip]
> To save people this trouble, I have made an extract for the last 3
> months, augmented by log field, which is pretty much the last stage log.
> The dump is 27Mb and can be got at

Should we just automate this and make it a weekly?

> cheers
> andrew




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Andrew Dunstan

Tom Lane wrote:

Martijn van Oosterhout  writes:

But I'm just sprouting ideas here, the proof is in the pudding. If the
logs are easily available (or a subset of, say the last month) then
people could play with that and see what happens...

Anyone who wants to play around can replicate what I did, which was to
download the table that Andrew made available upthread, and then pull
the log files matching interesting rows.  


To save people this trouble, I have made an extract for the last 3 
months, augmented by log field, which is pretty much the last stage log. 
The dump is 27Mb and can be got at




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Tom Lane
Martijn van Oosterhout  writes:
> But I'm just sprouting ideas here, the proof is in the pudding. If the
> logs are easily available (or a subset of, say the last month) then
> people could play with that and see what happens...

Anyone who wants to play around can replicate what I did, which was to
download the table that Andrew made available upthread, and then pull
the log files matching interesting rows.  I used the attached functions
to generate URLs for the failing stage logs, and then a shell script
looping over lwp-download ...

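-- Build the query string for a failure's last stage log.  The URL prefix
-- literal ('' below) appears to have been stripped by the list archive.
CREATE FUNCTION lastfile(mfailures) RETURNS text
AS $$
select replace(
'' || $1.sysname || '&dt=' || $1.snapshot ||
'&stg=' ||
replace($1.log_archive_filenames[array_upper($1.log_archive_filenames, 1)],
'.log', ''),
  ' ', '%20')
$$ LANGUAGE sql;

-- Prepend the server address (also stripped here) to get the full URL.
CREATE FUNCTION lastlog(mfailures) RETURNS text
AS $$
select '' || lastfile($1)
$$ LANGUAGE sql;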
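
For anyone who would rather build the URLs outside the database, here is a rough Python counterpart of the two functions above. It is only a sketch: the `base` and `server` arguments stand in for the URL literals that are missing from this copy of the message, so they are placeholders, not the real buildfarm addresses.

```python
def lastfile(base, sysname, snapshot, log_archive_filenames):
    """Query string for the last stage log of one failure row.

    `base` is a placeholder for the URL prefix that is missing above.
    """
    # array_upper(arr, 1) in the SQL picks the last array element,
    # i.e. the log of the last stage that ran.
    last_stage = log_archive_filenames[-1].replace(".log", "")
    url = f"{base}{sysname}&dt={snapshot}&stg={last_stage}"
    # Like the SQL version, escape only the spaces.
    return url.replace(" ", "%20")


def lastlog(server, base, sysname, snapshot, log_archive_filenames):
    """Full URL: `server` is a placeholder for the stripped host part."""
    return server + lastfile(base, sysname, snapshot, log_archive_filenames)
```

The resulting strings can be written to a file, one per line, and fed to whatever download loop you prefer.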

regards, tom lane


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Martijn van Oosterhout
On Tue, Mar 20, 2007 at 11:36:09AM -0400, Andrew Dunstan wrote:
> My biggest worry apart from maintenance (which doesn't matter that much 
> - if people don't enter the regexes they don't get the tags they want) 
> is that the regexes will not be specific enough, and so give false 
> positives on the tags. Then if you're looking for things that aren't 
> tagged you'd be even more likely than today to miss the outliers. Lord 

I think you could solve that by displaying the text that matched the
regex. If it starts matching odd things it'd be visible.
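
A minimal sketch of what "displaying the text that matched" could look like, in Python. The function name, the `width` parameter and the sample patterns are all made up for illustration; nothing here is existing buildfarm code.

```python
import re


def match_with_context(pattern, log_text, width=30):
    """Return (matched_text, surrounding_snippet) for every hit of pattern."""
    hits = []
    for m in re.finditer(pattern, log_text):
        # Keep a little text on each side so a reviewer can judge the hit.
        start = max(0, m.start() - width)
        end = min(len(log_text), m.end() + width)
        hits.append((m.group(0), log_text[start:end]))
    return hits
```

Showing `matched_text` next to the tag makes an over-broad regex obvious: a pattern like `[Ee]rror` would list hits from many unrelated lines, while a specific one shows the same string every time.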

But I'm just sprouting ideas here, the proof is in the pudding. If the
logs are easily available (or a subset of, say the last month) then
people could play with that and see what happens...

Have a nice day,
Martijn van Oosterhout
> From each according to his ability. To each according to his ability to 
> litigate.


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Andrew Dunstan

Arturo Perez wrote:

I don't know if this has come up yet but

In terms of tagging errors we might be able to use some machine 
learning techniques.

There are NLP/learning systems that interpret logs.  They learn over 
time what is normal and what isn't and can flag things that are abnormal.

We can make extracts of the database (including the log data) available 
to anyone who wants to do research using any learning technique that 
appeals to them.




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Arturo Perez

I don't know if this has come up yet but

In terms of tagging errors we might be able to use some machine  
learning techniques.

There are NLP/learning systems that interpret logs.  They learn over  
time what is normal and what isn't and can flag things that are abnormal.

For example, people are using support vector machines (SVM) analysis  
on log files to do intrusion detection.  Here's a link for intrusion  
detection called Robust Anomaly Detection Using Support Vector Machines.

This paper from IBM gives some more background information on how  
such a thing might work. 

I have previously used an open source toolkit from CMU called rainbow  
to do these types of analysis.



Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> The wrinkle is that applying the tags on the fly is probably not a great 
> idea - the status page query is already in desperate need of overhauling 
> because it's too slow. So we'd need a daemon to set up the tags in the 
> background. But that's an implementation detail. Screen real estate on 
> the dashboard page is also in very short supply. Maybe we could play 
> with the background colour, so that a tagged failure had, say, a blue 
> background, as opposed to the red/pink/yellow we use for failures now. 
> Again - an implementation detail.

I'm not sure that the current status dashboard needs to pay any attention
to the tags.  The view that I would like to have of "recent failures
across all machines in a branch" is the one that needs to be tag-aware,
and perhaps also the existing display of a given machine's branch history.

> My biggest worry apart from maintenance (which doesn't matter that much 
> - if people don't enter the regexes they don't get the tags they want) 
> is that the regexes will not be specific enough, and so give false 
> positives on the tags.

True.  I strongly suggest that we want an interactive search-and-tag
capability *before* worrying about automatic tagging --- one of the
reasons for that is to provide a way to test a regex that you might
then consider adding to the automatic filter for future reports.

> This would be a fine SOC project - I at least won't have time to develop 
> it for quite some time.

Agreed.  Who's maintaining the SOC project list page?

regards, tom lane


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Andrew Dunstan

Tom Lane wrote:

The point I think you are missing is that having something like this
will *eliminate* repetitive, boring work, namely recognizing multiple
reports of the same problem.  The buildfarm has gotten big enough that
some way of dealing with that is desperately needed, else our ability
to spot infrequently-reported issues will disappear entirely.

OK. How about if we have a table of description, start_date> plus some webby transactions for approved users 
to edit this?

The wrinkle is that applying the tags on the fly is probably not a great 
idea - the status page query is already in desperate need of overhauling 
because it's too slow. So we'd need a daemon to set up the tags in the 
background. But that's an implementation detail. Screen real estate on 
the dashboard page is also in very short supply. Maybe we could play 
with the background colour, so that a tagged failure had, say, a blue 
background, as opposed to the red/pink/yellow we use for failures now. 
Again - an implementation detail.

My biggest worry apart from maintenance (which doesn't matter that much 
- if people don't enter the regexes they don't get the tags they want) 
is that the regexes will not be specific enough, and so give false 
positives on the tags. Then if you're looking for things that aren't 
tagged you'd be even more likely than today to miss the outliers. Lord 
knows that regexes are hard to get right - I've been using them for a 
couple of decades and they've earned me lots of money, and I still get 
them wrong regularly (including several cases on the buildfarm). But 
maybe we need to take the plunge and see how it works.

This would be a fine SOC project - I at least won't have time to develop 
it for quite some time.




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Martijn van Oosterhout wrote:
>> Maybe a simple compromise would be being able to setup a set of regexes
>> that search the output and set a flag if that string is found. If you
>> find the string, it gets marked with a flag, which means that when you
>> look at mongoose, any failures that don't have the flag become easier
>> to spot.
>> It also means that once you've found a common failure, you can create
>> the regex and then any other failures with the same string get tagged
>> also, making unexplained ones easier to spot.

> You need to show first that this is an adequate tagging mechanism, both 
> in tagging things adequately and in not picking up false positives, 
> which would make things worse, not better. And even then you need 
> someone to do the analysis to create the regex.

Well, my experiment over the weekend with doing exactly that convinced
me that regexes could be used successfully to identify common-mode
failures.  So I think Martijn has a fine idea here.  And I don't see a
problem with lack of motivation, at least for those of us who try to pay
attention to buildfarm results --- once you've looked at a couple of
reports of the same issue, you really don't want to have to repeat the
analysis over and over.  But just assuming that every report on a
particular day reflects the same breakage is exactly the risk I wish
we didn't have to take.

For a lot of cases there is not a need for an ongoing filter: we break
something, we get a pile of reports, we fix it, and then we want to tag
all the reports of that something so that we can see if anything else
happened in the same interval.  So for this, something based on an
interactive search API would work fine.  You could even use that for
repetitive problems such as buildfarm misconfigurations, though having
to repeat the search every so often would get old in the end.  The main
thing though is for the database to remember the tags once made.
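
To make the "search an interval, tag the hits, remember the tags" workflow concrete, here is a small Python sketch using an in-memory SQLite database. The `failures` table, its columns and the sample rows are invented for the example; the real buildfarm schema is different.

```python
import re
import sqlite3


def make_demo_db():
    """Tiny stand-in for the failures table (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE failures ("
        "id INTEGER PRIMARY KEY, snapshot TEXT, log TEXT, tag TEXT)"
    )
    conn.executemany(
        "INSERT INTO failures (id, snapshot, log) VALUES (?, ?, ?)",
        [
            (1, "2007-03-17 10:00:00", "ERROR: could not open relation 1"),
            (2, "2007-03-17 11:00:00", "ERROR: could not open relation 2"),
            (3, "2007-03-17 12:00:00", "No space left on device"),
            (4, "2007-03-19 09:00:00", "ERROR: could not open relation 3"),
        ],
    )
    return conn


def tag_interval(conn, pattern, start, end, tag):
    """Tag untagged failures in [start, end] whose log matches pattern."""
    rx = re.compile(pattern)
    rows = conn.execute(
        "SELECT id, log FROM failures "
        "WHERE tag IS NULL AND snapshot BETWEEN ? AND ? ORDER BY id",
        (start, end),
    ).fetchall()
    matched = [row_id for row_id, log in rows if rx.search(log)]
    # Store the tag so the search never has to be repeated.
    conn.executemany(
        "UPDATE failures SET tag = ? WHERE id = ?",
        [(tag, row_id) for row_id in matched],
    )
    conn.commit()
    return matched


def untagged(conn, start, end):
    """Whatever else happened in the same interval."""
    rows = conn.execute(
        "SELECT id FROM failures "
        "WHERE tag IS NULL AND snapshot BETWEEN ? AND ? ORDER BY id",
        (start, end),
    )
    return [row[0] for row in rows]
```

After one `tag_interval` call for a known breakage, `untagged` over the same interval returns only the reports that still need a look.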

> The buildfarm works because it leverages our strength, namely automating 
> things. But all the tagging suggestions I've seen will involve regular, 
> repetitive and possibly boring work, precisely the thing we are not good 
> at as a group.

Well, responding to bug reports could be called regular and repetitive
work --- in reality I don't find it so, because every bug is different.
The point I think you are missing is that having something like this
will *eliminate* repetitive, boring work, namely recognizing multiple
reports of the same problem.  The buildfarm has gotten big enough that
some way of dealing with that is desperately needed, else our ability
to spot infrequently-reported issues will disappear entirely.

regards, tom lane


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Andrew Dunstan

Alvaro Herrera wrote:

Andrew Dunstan wrote:

The buildfarm works because it leverages our strength, namely automating 
things. But all the tagging suggestions I've seen will involve regular, 
repetitive and possibly boring work, precisely the thing we are not good 
at as a group.

You may be forgetting that Martijn and others tagged the database.  Now, there are some untagged errors, but
I'd say that that's because we don't control the tool, so we cannot fix
it if there are false positives.  We do control the buildfarm however,
so we can develop systematic solutions for widespread problems (instead
of forcing us to check and tag every single occurrence of
widespread problems).


Well, I'm sure we can provide appropriate access or data for anyone who 
wants to do research in this area and prove me wrong.




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Alvaro Herrera
Andrew Dunstan wrote:

> The buildfarm works because it leverages our strength, namely automating 
> things. But all the tagging suggestions I've seen will involve regular, 
> repetitive and possibly boring work, precisely the thing we are not good 
> at as a group.

You may be forgetting that Martijn and others tagged the database.  Now, there are some untagged errors, but
I'd say that that's because we don't control the tool, so we cannot fix
it if there are false positives.  We do control the buildfarm however,
so we can develop systematic solutions for widespread problems (instead
of forcing us to check and tag every single occurrence of
widespread problems).

Alvaro Herrera
The PostgreSQL Company - Command Prompt, Inc.


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Andrew Dunstan

Stefan Kaltenbrunner wrote:
however as a buildfarm admin I occasionally wished I had a way to 
invalidate reports generated from my boxes to prevent someone wasting 
time investigating them (like errors caused by system 
upgrades, configuration problems or other local issues).

It would be extremely simple to provide a 'revoke report' API and 
client. Good idea.

But that's quite different from what we have been discussing.




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Stefan Kaltenbrunner

Andrew Dunstan wrote:

Martijn van Oosterhout wrote:

On Tue, Mar 20, 2007 at 02:57:13AM -0400, Tom Lane wrote:

Maybe we should think about filtering the noise.  Like, say, discarding
every report from mongoose that involves an icc core dump ... 

Maybe a simple compromise would be being able to setup a set of regexes
that search the output and set a flag if that string is found. If you
find the string, it gets marked with a flag, which means that when you
look at mongoose, any failures that don't have the flag become easier
to spot.

It also means that once you've found a common failure, you can create
the regex and then any other failures with the same string get tagged
also, making unexplained ones easier to spot.


You need to show first that this is an adequate tagging mechanism, both 
in tagging things adequately and in not picking up false positives, 
which would make things worse, not better. And even then you need 
someone to do the analysis to create the regex.

The buildfarm works because it leverages our strength, namely automating 
things. But all the tagging suggestions I've seen will involve regular, 
repetitive and possibly boring work, precisely the thing we are not good 
at as a group.

This is probably true. However, as a buildfarm admin I have occasionally 
wished I had a way to invalidate reports generated from my boxes, to 
prevent someone wasting time investigating them (like errors caused by 
system upgrades, configuration problems or other local issues).

But I agree that it might be difficult to make that "manual tagging" 
process scalable and reliable enough so that it really is an improvement 
over what we have now.



Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Andrew Dunstan

Martijn van Oosterhout wrote:

On Tue, Mar 20, 2007 at 02:57:13AM -0400, Tom Lane wrote:

Maybe we should think about filtering the noise.  Like, say, discarding
every report from mongoose that involves an icc core dump ...

Maybe a simple compromise would be being able to setup a set of regexes
that search the output and set a flag if that string is found. If you
find the string, it gets marked with a flag, which means that when you
look at mongoose, any failures that don't have the flag become easier
to spot.

It also means that once you've found a common failure, you can create
the regex and then any other failures with the same string get tagged
also, making unexplained ones easier to spot.


You need to show first that this is an adequate tagging mechanism, both 
in tagging things adequately and in not picking up false positives, 
which would make things worse, not better. And even then you need 
someone to do the analysis to create the regex.

The buildfarm works because it leverages our strength, namely automating 
things. But all the tagging suggestions I've seen will involve regular, 
repetitive and possibly boring work, precisely the thing we are not good 
at as a group.

If we had some staff they could be given this task (among others), 
assuming we show that it actually works. We don't, so they can't.




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-20 Thread Martijn van Oosterhout
On Tue, Mar 20, 2007 at 02:57:13AM -0400, Tom Lane wrote:
> Maybe we should think about filtering the noise.  Like, say, discarding
> every report from mongoose that involves an icc core dump ...

Maybe a simple compromise would be being able to setup a set of regexes
that search the output and set a flag if that string is found. If you
find the string, it gets marked with a flag, which means that when you
look at mongoose, any failures that don't have the flag become easier
to spot.

It also means that once you've found a common failure, you can create
the regex and then any other failures with the same string get tagged
also, making unexplained ones easier to spot.
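
As a rough illustration of the idea, here is a Python sketch of a regex rule set that flags known failures and leaves the unexplained ones standing out. The rule names and patterns are hypothetical; real ones would have to be written against actual buildfarm logs.

```python
import re

# Hypothetical rules: flag name -> regex. Neither is a real buildfarm rule.
RULES = {
    "icc-core-dump": re.compile(r"icc.*core dumped"),
    "disk-full": re.compile(r"No space left on device"),
}


def flags_for(log_text, rules=RULES):
    """Names of all rules whose regex is found in the log."""
    return sorted(name for name, rx in rules.items() if rx.search(log_text))


def unexplained(failures, rules=RULES):
    """failures is an iterable of (report_name, log_text) pairs.

    Returns the reports that no rule matched, i.e. the ones worth reading.
    """
    return [name for name, log in failures if not flags_for(log, rules)]
```

Adding a rule once a common failure has been diagnosed shrinks the `unexplained` list for every past and future report with the same string.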

Have a nice day,
Martijn van Oosterhout
> From each according to his ability. To each according to his ability to 
> litigate.


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-19 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> But we've already had a couple of cases of interesting failures going
>> unnoticed because of the noise level.  Between duplicate reports about
>> busted patches and transient problems on particular build machines
>> (out of disk space, misconfiguration, etc) it's pretty hard to not miss
>> the once-in-a-while failures.  Is there some other way we could attack
>> that problem?

> The real issue is the one you identify of stuff getting lost in the 
> noise. But I'm not sure there's any realistic cure for that.

Maybe we should think about filtering the noise.  Like, say, discarding
every report from mongoose that involves an icc core dump ...

That's only semi-serious, but I do think that it's getting harder to
pluck the wheat from the chaff.  My investigations over the weekend
showed that we have got basically three categories of reports:

1. genuine code breakage from unportable patches: normally multiple
reports over a short period until we fix or revert the cause.
2. failures on a single buildfarm member due to misconfiguration,
hardware flakiness, etc.  These are sometimes repeatable and sometimes not.
3. all the rest, of which some fraction represents bugs we need to fix,
only we don't know they're there.

In category 1 the buildfarm certainly pays for itself, but we'd hoped
that it would help us spot less-reproducible errors too.  The problem
I'm seeing is that category 2 is overwhelming our ability to recognize
patterns within category 3.  How can we dial down the noise level?

regards, tom lane


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-19 Thread Andrew Dunstan

Tom Lane wrote:

Andrew Dunstan <[EMAIL PROTECTED]> writes:

Tom Lane wrote:

Actually what I *really* want is something closer to "show me all the
unexplained failures", but unless Andrew is willing to support some way
of tagging failures in the master database, I suppose that won't happen.


Who would do the tagging, and how?

Well, that's the hard part isn't it?  I was sort of envisioning a group
of users who'd be authorized to log in and set tags on database entries
somehow.  I'm not sure about details.  One issue is that the majority
of failures come in batches (when one of us commits a bad patch).
With the current web interface it would be real tedious to verify which
of the failures in a particular time interval matched the symptoms of
a failure.  What I did for my experiment this weekend was to download
the last-stage-log of each failed build, which required an hour or so
of setup time; then I could use grep to confirm which logs matched a
failure that I'd identified.  Doing that through the current webpage
would involve lots of clicking and waiting.  If we could expose a
text-search-style API for grepping the stage logs, it'd be a lot easier
to collect related failures.  Then maybe a few widgets to let authorized
users apply a tag to the search results ...

I'm not entirely sure that this infrastructure would pay for itself,
though.  Without some users willing to take the time to separate
explained from unexplained failures, it'd be a waste of effort.
But we've already had a couple of cases of interesting failures going
unnoticed because of the noise level.  Between duplicate reports about
busted patches and transient problems on particular build machines
(out of disk space, misconfiguration, etc) it's pretty hard to not miss
the once-in-a-while failures.  Is there some other way we could attack
that problem?

I'm not too sanguine about having a team of eager taggers.

I think we probably need to work on a usable API for extracting data in 
small or large amounts, and maybe some good text search facilities.

The real issue is the one you identify of stuff getting lost in the 
noise. But I'm not sure there's any realistic cure for that.




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-19 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Actually what I *really* want is something closer to "show me all the
>> unexplained failures", but unless Andrew is willing to support some way
>> of tagging failures in the master database, I suppose that won't happen.

> Who would do the tagging, and how?

Well, that's the hard part isn't it?  I was sort of envisioning a group
of users who'd be authorized to log in and set tags on database entries
somehow.  I'm not sure about details.  One issue is that the majority
of failures come in batches (when one of us commits a bad patch).
With the current web interface it would be real tedious to verify which
of the failures in a particular time interval matched the symptoms of
a failure.  What I did for my experiment this weekend was to download
the last-stage-log of each failed build, which required an hour or so
of setup time; then I could use grep to confirm which logs matched a
failure that I'd identified.  Doing that through the current webpage
would involve lots of clicking and waiting.  If we could expose a
text-search-style API for grepping the stage logs, it'd be a lot easier
to collect related failures.  Then maybe a few widgets to let authorized
users apply a tag to the search results ...
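
The grep step described above can be sketched in a few lines of Python, assuming the last-stage logs have already been downloaded into one directory as `*.log` files. The directory layout and the function name are assumptions made for the example.

```python
import re
from pathlib import Path


def grep_logs(log_dir, pattern):
    """Names of the *.log files under log_dir whose text matches pattern."""
    rx = re.compile(pattern)
    hits = []
    for path in sorted(Path(log_dir).glob("*.log")):
        # Logs can contain stray bytes; do not let decoding errors stop the scan.
        text = path.read_text(errors="replace")
        if rx.search(text):
            hits.append(path.name)
    return hits
```

A server-side search API would do the same thing against the stored logs and hand back report identifiers instead of file names.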

I'm not entirely sure that this infrastructure would pay for itself,
though.  Without some users willing to take the time to separate
explained from unexplained failures, it'd be a waste of effort.
But we've already had a couple of cases of interesting failures going
unnoticed because of the noise level.  Between duplicate reports about
busted patches and transient problems on particular build machines
(out of disk space, misconfiguration, etc) it's pretty hard to not miss
the once-in-a-while failures.  Is there some other way we could attack
that problem?

regards, tom lane


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-19 Thread Andrew Dunstan

I wrote:

2. I was annoyed repeatedly that some buildfarm members weren't
reporting log_archive_filenames entries, which forced going the long
way round in the process I was using.  Seems like we need some more
proactive means for getting buildfarm owners to keep their script
versions up-to-date.  Not sure what that should look like exactly,
as long as it's not "you can run an ancient version as long as you


Modern clients report the versions of the two scripts involved (see 
script_version and web_script_version in reported config) so we could 
easily enforce a minimum version on these.
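
A sketch of such a check in Python. It assumes the version strings are dotted numbers such as "2.17", which is a guess about their format, and the minimum values passed in are whatever the server admin chooses.

```python
def parse_version(text):
    """Turn '2.17' into (2, 17) so that comparison is numeric, not textual."""
    return tuple(int(part) for part in text.split("."))


def client_is_current(config, min_script, min_web_script):
    """config is the reported config as a dict.

    Clients too old to report their versions are treated as out of date.
    """
    script = config.get("script_version")
    web_script = config.get("web_script_version")
    if script is None or web_script is None:
        return False
    return (
        parse_version(script) >= parse_version(min_script)
        and parse_version(web_script) >= parse_version(min_web_script)
    )
```

Comparing tuples rather than strings matters here: as text, "2.9" sorts after "2.17", which would let an old client through.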

Meanwhile, the owner of the main 2 offending machines has said he will 
upgrade them.




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-19 Thread Andrew Dunstan

Tom Lane wrote:

I think what would be nice is some way to view all the failures for a
given branch, extending back not-sure-how-far.  Right now the only way
to see past failures is to look at individual machines' histories, which
is not real satisfactory when you want a broader view.

Actually what I *really* want is something closer to "show me all the
unexplained failures", but unless Andrew is willing to support some way
of tagging failures in the master database, I suppose that won't happen.


Well, if I understood how it might work it might happen.

Who would do the tagging, and how?




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-19 Thread Tom Lane
"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> The current buildfarm webpages make it easy to see when a branch tip
>> is seriously broken, but it's not very easy to investigate transient
>> failures, such as a regression test race condition that only
>> materializes once in a while.

> If the data is already there and just not represented, just let me know
> exactly what you want and I will implement pages for that data happily.

I think what would be nice is some way to view all the failures for a
given branch, extending back not-sure-how-far.  Right now the only way
to see past failures is to look at individual machines' histories, which
is not real satisfactory when you want a broader view.

Actually what I *really* want is something closer to "show me all the
unexplained failures", but unless Andrew is willing to support some way
of tagging failures in the master database, I suppose that won't happen.

regards, tom lane


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-19 Thread Gregory Stark

"Gregory Stark" <[EMAIL PROTECTED]> writes:

> "Tom Lane" <[EMAIL PROTECTED]> writes:
>>  row-ordering discrepancy in rowtypes test| 
>> 2007-02-10 03:00:02 | 3
> Is this because the test is fixed or unfixable? If not shouldn't the test get
> an ORDER BY clause so that it will reliably pass on future versions? 

Hm, I took a quick look at this test and while there are a couple tests
missing ORDER BY clauses I can't see how they could possibly generate results
that are out of order. Perhaps the ones that do have ORDER BY clauses only
recently acquired them?

  Gregory Stark


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-19 Thread Tom Lane
Gregory Stark <[EMAIL PROTECTED]> writes:
> "Tom Lane" <[EMAIL PROTECTED]> writes:
>> missing BYTE_ORDER definition for Solaris| 
>> 2007-01-10 14:18:23 | 1

> What is this BYTE_ORDER macro? Should I be using it instead of the
> AC_C_BIGENDIAN test in configure for the packed varlena patch?

Actually, if we start to rely on AC_C_BIGENDIAN, I'd prefer to see us
get rid of direct usages of BYTE_ORDER.  It looks like only
contrib/pgcrypto is depending on it today, but we've got lots of
cruft in the include/port/ files supporting that.

>> row-ordering discrepancy in rowtypes test| 
>> 2007-02-10 03:00:02 | 3

> Is this because the test is fixed or unfixable?

It's fixed.

regards, tom lane


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-19 Thread Andrew Dunstan

Tom Lane wrote:

BTW, before I forget, this little project turned up a couple of
small improvements for the current buildfarm infrastructure:

1.  There are half a dozen entries with obviously bogus timestamps:

bfarm=# select sysname,snapshot,branch from mfailures where snapshot < 
  sysname   |      snapshot       | branch 
------------+---------------------+--------
 corgi      | 1997-10-14 14:20:10 | HEAD
 kookaburra | 1970-01-01 01:23:00 | HEAD
 corgi      | 1997-09-30 11:47:08 | HEAD
 corgi      | 1997-10-17 14:20:11 | HEAD
 corgi      | 1997-12-21 15:20:11 | HEAD
 corgi      | 1997-10-15 14:20:10 | HEAD
 corgi      | 1997-09-28 11:47:09 | HEAD
 corgi      | 1997-09-28 11:47:08 | HEAD
(8 rows)

indicating wrong system clock settings on these buildfarm machines.
(Indeed, IIRC these failures were actually caused by the ridiculous
clock settings --- we have at least one regression test that checks
century >= 21 ...)  Perhaps the buildfarm server should bounce
reports with timestamps more than a day in the past or a few minutes in
the future.  I think though that a more useful answer would be to
include "time of receipt of report" in the permanent record, and then
subsequent analysis could make its own decisions about whether to
believe the snapshot timestamp --- plus we could track elapsed times for
builds, which could be interesting in itself.

We actually do timestamp the reports - I just didn't include that in the 
extract. I will alter the view it's based on. We started doing this in 
Nov 2005, so I'm going to restrict the view to cases where the 
report_time is not null - I doubt we're interested in ancient history.

A revised extract is available at

We already reject snapshot times that are in the future.

Use of NTP is highly recommended to buildfarm members, but I'm reluctant 
to make it mandatory, as they might not have it available. I think we 
can do this: alter the client script to report its idea of current time 
at the time it makes the web transaction. If it's off from the server 
time by more than some small value (say 60 secs), adjust the snapshot 
time accordingly. If they don't report it then we can reject insane 
dates (more than 24 hours ago seems about right).
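
The rule proposed above can be sketched roughly as follows. This is only an illustration in Python, not the buildfarm server code: the function name, its parameters and the two constants are assumptions made for the example, with the thresholds taken from the figures mentioned in the message (60 seconds, 24 hours).

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_SKEW = timedelta(seconds=60)  # "some small value (say 60 secs)"
MAX_AGE = timedelta(hours=24)     # "more than 24 hours ago"


def accept_snapshot(snapshot: datetime,
                    server_now: datetime,
                    client_now: Optional[datetime] = None) -> Optional[datetime]:
    """Return the snapshot time to store, or None if the report is rejected.

    client_now is the client's idea of the current time when it makes the
    web transaction; older clients do not send it (None).
    """
    if client_now is not None:
        # Client told us its clock: shift the snapshot by the observed skew
        # when the clocks disagree by more than the tolerance.
        skew = server_now - client_now
        if abs(skew) > MAX_SKEW:
            snapshot = snapshot + skew
    elif server_now - snapshot > MAX_AGE:
        # No client clock reported: fall back to rejecting insane dates.
        return None
    # Snapshot times in the future are rejected in every case.
    if snapshot > server_now:
        return None
    return snapshot
```

With a client whose clock is stuck in 1997, a snapshot taken one hour before the upload ends up stored as one hour before the server's receipt time, which also keeps elapsed build times meaningful.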

So I agree with both your suggestions ;-)

> 2. I was annoyed repeatedly that some buildfarm members weren't
> reporting log_archive_filenames entries, which forced going the long
> way round in the process I was using.  Seems like we need some more
> proactive means for getting buildfarm owners to keep their script
> versions up-to-date.  Not sure what that should look like exactly,
> as long as it's not "you can run an ancient version as long as you


Modern clients report the versions of the two scripts involved (see 
script_version and web_script_version in reported config) so we could 
easily enforce a minimum version on these.
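
A minimal sketch of such a check, in Python for illustration. The config keys script_version and web_script_version are the ones named above; the minimum values and the assumption that versions are dotted numbers (such as "1.71") are hypothetical, since the thread does not state either.

```python
from typing import List, Mapping, Tuple

# Hypothetical minimums; the thread does not specify any values.
MIN_VERSIONS = {"script_version": "1.70", "web_script_version": "1.60"}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '1.71' into (1, 71) for comparison."""
    return tuple(int(part) for part in text.split("."))


def outdated_scripts(config: Mapping[str, str],
                     minimums: Mapping[str, str] = MIN_VERSIONS) -> List[str]:
    """Return the names of scripts that are unreported or below the minimum."""
    stale = []
    for name, minimum in minimums.items():
        reported = config.get(name)
        if reported is None or parse_version(reported) < parse_version(minimum):
            stale.append(name)
    return stale
```

Comparing parsed tuples rather than raw strings matters here: as strings, "1.9" sorts after "1.60", while as revision numbers 1.9 is the older one.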




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-19 Thread Stefan Kaltenbrunner

Gregory Stark wrote:
> "Tom Lane" <[EMAIL PROTECTED]> writes:
>> Also, for completeness, the causes I wrote off as not interesting
>> (anymore, in some cases):
>>
>>  missing BYTE_ORDER definition for Solaris | 2007-01-10 14:18:23 | 1
>
> What is this BYTE_ORDER macro? Should I be using it instead of the
> AC_C_BIGENDIAN test in configure for the packed varlena patch?

FYI: this is the relevant commit (the affected buildfarm member was 



Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-19 Thread Gregory Stark

"Tom Lane" <[EMAIL PROTECTED]> writes:

> Also, for completeness, the causes I wrote off as not interesting
> (anymore, in some cases):
>  missing BYTE_ORDER definition for Solaris | 2007-01-10 14:18:23 | 1

What is this BYTE_ORDER macro? Should I be using it instead of the
AC_C_BIGENDIAN test in configure for the packed varlena patch?

>  row-ordering discrepancy in rowtypes test | 2007-02-10 03:00:02 | 3

Is this because the test is fixed or unfixable? If not shouldn't the test get
an ORDER BY clause so that it will reliably pass on future versions? 

  Gregory Stark


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-18 Thread Tom Lane
BTW, before I forget, this little project turned up a couple of
small improvements for the current buildfarm infrastructure:

1.  There are half a dozen entries with obviously bogus timestamps:

bfarm=# select sysname,snapshot,branch from mfailures where snapshot < 
  sysname   |      snapshot       | branch 
------------+---------------------+--------
 corgi      | 1997-10-14 14:20:10 | HEAD
 kookaburra | 1970-01-01 01:23:00 | HEAD
 corgi      | 1997-09-30 11:47:08 | HEAD
 corgi      | 1997-10-17 14:20:11 | HEAD
 corgi      | 1997-12-21 15:20:11 | HEAD
 corgi      | 1997-10-15 14:20:10 | HEAD
 corgi      | 1997-09-28 11:47:09 | HEAD
 corgi      | 1997-09-28 11:47:08 | HEAD
(8 rows)

indicating wrong system clock settings on these buildfarm machines.
(Indeed, IIRC these failures were actually caused by the ridiculous
clock settings --- we have at least one regression test that checks
century >= 21 ...)  Perhaps the buildfarm server should bounce
reports with timestamps more than a day in the past or a few minutes in
the future.  I think though that a more useful answer would be to
include "time of receipt of report" in the permanent record, and then
subsequent analysis could make its own decisions about whether to
believe the snapshot timestamp --- plus we could track elapsed times for
builds, which could be interesting in itself.

2. I was annoyed repeatedly that some buildfarm members weren't
reporting log_archive_filenames entries, which forced going the long
way round in the process I was using.  Seems like we need some more
proactive means for getting buildfarm owners to keep their script
versions up-to-date.  Not sure what that should look like exactly,
as long as it's not "you can run an ancient version as long as you

regards, tom lane


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-18 Thread Tom Lane
Jeremy Drake <[EMAIL PROTECTED]> writes:
> These on mongoose are most likely a result of flaky hardware.

Yeah, I saw a pretty fair number of irreproducible issues that are
probably hardware flake-outs.  Of course you can't tell which are those
and which are low-probability software bugs for many moons...

I believe that a large fraction of the buildfarm consists of
semi-retired equipment that is probably more prone to this sort of
problem than newer stuff would be.  But that's the price we must pay
for building such a large test farm on a shoestring.  What we need to do
to deal with it, I think, is institutionalize some kind of long-term
tracking so that we can tell the recurrent from the non-recurrent
issues.  I don't quite know how to do that; what I did over this past
weekend was labor-intensive and not scalable.
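
One possible shape for that kind of tracking, sketched in Python purely as an illustration: once failures carry a reason tag, a reason counts as recurrent if it shows up repeatedly over a long enough period. The function name and both thresholds are assumptions made for this example, not anything agreed in the thread.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recurrent_reasons(failures, min_count=3, min_span=timedelta(days=7)):
    """Return the set of reasons seen at least min_count times over min_span.

    failures is an iterable of (sysname, snapshot, reason) tuples, where
    snapshot is a datetime. Both thresholds are hypothetical defaults.
    """
    seen = defaultdict(list)
    for _sysname, snapshot, reason in failures:
        seen[reason].append(snapshot)
    return {
        reason
        for reason, times in seen.items()
        if len(times) >= min_count and max(times) - min(times) >= min_span
    }
```

Requiring a minimum time span as well as a minimum count keeps a burst of failures from one broken commit from being mistaken for a long-running intermittent problem.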

SoC project perhaps?

regards, tom lane


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-18 Thread Tom Lane
"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>> Some of these might possibly be interesting to other people ...

> If you provide the various greps, etc... I will put it into the website 
> proper...

Unfortunately I didn't keep notes on exactly what I searched for in each
case.  Some of them were not based on grep at all, but rather "this
failure looks similar to those others and happened in the period between
a known bad patch commit and its fix".  The goal was essentially to
group together failures that probably arose from the same cause --- I
may have made a mistake or two along the way ...
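
The non-grep grouping described above (a failure that falls between a known bad commit and its fix) could be approximated along these lines. This is a sketch under assumptions: the BreakageWindow type, its field names and the function name are invented for the example, and it ignores the "looks similar" judgment that was applied by hand.

```python
from datetime import datetime
from typing import Iterable, NamedTuple, Optional


class BreakageWindow(NamedTuple):
    branch: str
    broken_at: datetime  # commit time of the known bad patch
    fixed_at: datetime   # commit time of its fix
    reason: str


def reason_from_window(branch: str,
                       snapshot: datetime,
                       windows: Iterable[BreakageWindow]) -> Optional[str]:
    """Return the reason of the first window covering this snapshot, if any.

    A snapshot taken at or after the bad commit and before the fix contains
    the breakage; a snapshot taken at or after the fix does not.
    """
    for window in windows:
        if window.branch == branch and window.broken_at <= snapshot < window.fixed_at:
            return window.reason
    return None
```

A match here is only a candidate classification; the failure log still has to resemble the others in the group before it is tagged.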

regards, tom lane


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-18 Thread Jeremy Drake
On Sun, 18 Mar 2007, Tom Lane wrote:

>  another icc crash    | 2007-02-03 10:50:01 |  1
>  icc "internal error" | 2007-03-16 16:30:01 | 29

These on mongoose are most likely a result of flaky hardware.  They tend
to occur most often when either
a) I am doing something else on the box when the build runs, or
b) the ambient temperature in the room is > ~72degF

I need to bring down this box at some point and try to figure out if it is
bad memory or what.

Anyway, ICC seems to be one of the few things that are really susceptible
to hardware issues (on this box at least, it is mostly ICC and firefox),
and I apologize for the noise this caused in the buildfarm logs...

American business long ago gave up on demanding that prospective
employees be honest and hardworking.  It has even stopped hoping for
employees who are educated enough that they can tell the difference
between the men's room and the women's room without having little
pictures on the doors.
-- Dave Barry, "Urine Trouble, Mister"


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-18 Thread Joshua D. Drake

>    | 2007-01-31 17:30:01 | 16
>
>  use of // comment          | 2007-02-16 09:23:02 |  1
>  xml code teething problems | 2007-02-16 16:01:05 | 79
> (54 rows)
>
> Some of these might possibly be interesting to other people ...

If you provide the various greps, etc... I will put it into the website
proper...

Joshua D. Drake

> regards, tom lane


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-18 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> OK, for anyone that wants to play, I have created an extract that 
> contains a summary of every non-CVS-related failure we've had. It's a 
> single table looking like this:

I did some analysis on this data.  Attached is a text dump of a table
declared as

CREATE TABLE mreasons (
    sysname text,
    snapshot timestamp without time zone,
    branch text,
    reason text,
    known boolean
);

where the sysname/snapshot/branch data is taken from your table,
"reason" is a brief sketch of the failure, and "known" indicates
whether the cause is known ... although as I went along it sort
of evolved into "does this seem worthy of more investigation?".

I looked at every failure back through early December.  I'd intended to
go back further, but decided I'd hit a point of diminishing returns.
However, failures back to the beginning of July that matched grep
searches for recent symptoms are classified in the table.
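
The grep-style part of that classification can be pictured as a small pattern table applied to each log. The sketch below is in Python and is only illustrative: the reason labels are ones that appear in the results further down, but the regular expressions are guesses at what such a search might look for, not the searches that were actually run.

```python
import re
from typing import Optional

# (pattern, reason) pairs; the patterns are hypothetical examples.
PATTERNS = [
    (re.compile(r"No rule to make target"), "No rule to make target"),
    (re.compile(r"could not open relation with OID \d+"),
     "could not open relation with OID"),
    (re.compile(r"tablespace .* is not empty"), "tablespace is not empty"),
]


def classify_log(log_text: str) -> Optional[str]:
    """Return the reason of the first matching pattern, or None if unmatched."""
    for pattern, reason in PATTERNS:
        if pattern.search(log_text):
            return reason
    return None
```

Logs that come back as None are the ones that still need a human look.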

The gross stats are: 2231 failures classified, 71 distinct reason
codes, 81 failures (with 18 reasons) that seem worthy of closer

bfarm=# select reason,branch,max(snapshot) as latest, count(*) from mreasons 
where not known group by 1,2 order by 1,2 ;
                              reason                              |    branch     |       latest        | count
------------------------------------------------------------------+---------------+---------------------+-------
 Input/output error - possible hardware problem                   | HEAD          | 2007-03-06 10:30:01 |     1
 No rule to make target                                           | HEAD          | 2007-02-08 15:30:01 |     6
 No rule to make target                                           | REL8_0_STABLE | 2007-02-28 03:15:02 |     9
 No rule to make target                                           | REL8_2_STABLE | 2006-12-17 20:00:01 |     1
 could not open relation with OID                                 | HEAD          | 2007-03-16 16:45:01 |     2
 could not open relation with OID                                 | REL8_1_STABLE | 2006-08-29 23:30:07 |     2
 createlang not found?                                            | REL8_1_STABLE | 2007-02-28 02:50:00 |     1
 irreproducible contrib/sslinfo build failure, likely not our bug | HEAD          | 2007-02-03 07:03:02 |     1
 irreproducible opr_sanity failure                                | HEAD          | 2006-12-18 19:15:02 |     2
 libintl.h rejected by configure                                  | HEAD          | 2007-01-11 20:35:00 |     3
 libintl.h rejected by configure                                  | REL8_0_STABLE | 2007-03-01 20:28:04 |    22
 postmaster failed to start                                       | REL7_4_STABLE | 2007-02-28 22:23:20 |     1
 postmaster failed to start                                       | REL8_0_STABLE | 2007-02-28 22:30:44 |     1
 random Solaris configure breakage                                | HEAD          | 2007-01-14 05:30:00 |     1
 random Windows breakage                                          | HEAD          | 2007-03-16 09:48:31 |     3
 random Windows breakage                                          | REL8_0_STABLE | 2007-03-15 03:15:09 |     7
 segfault during bootstrap                                        | HEAD          | 2007-03-12 23:03:03 |     1
 server does not shut down                                        | HEAD          | 2007-01-08 03:03:03 |     3
 tablespace is not empty                                          | HEAD          | 2007-02-24 15:00:10 |     6
 tablespace is not empty                                          | REL8_1_STABLE | 2007-01-25 02:30:01 |     2
 unexpected statement_timeout failure                             | HEAD          | 2007-01-25 05:05:06 |     1
 unexplained tsearch2 crash                                       | HEAD          | 2007-01-10 22:05:02 |     1
 weird DST-transition-like timestamp test failure                 | HEAD          | 2007-02-04 07:25:04 |     1
 weird assembler failure, likely not our bug                      | HEAD          | 2006-12-26 17:02:01 |     1
 weird assembler failure, likely not our bug                      | REL8_2_STABLE | 2007-02-03 23:47:01 |     1
 weird install failure                                            | HEAD          | 2007-01-25 12:35:00 |     1
(26 rows)

I think I know the cause of the recent 'could not open relation with
OID' failures in HEAD, but the rest of these maybe need a look.
Any volunteers?
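
For anyone without a PostgreSQL instance handy, the grouping query shown earlier in this message can be tried against a few rows in an in-memory SQLite database. This is a sketch under assumptions: the function name is made up, snapshots are stored as ISO-format text (so max() compares them as strings), known is stored as 0/1, and the positional "group by 1,2" is spelled out with column names.

```python
import sqlite3


def uninvestigated_summary(rows):
    """Summarize failures whose cause is not known, per reason and branch.

    rows is an iterable of (sysname, snapshot, branch, reason, known) tuples,
    with snapshot as 'YYYY-MM-DD HH:MM:SS' text and known as 0 or 1.
    Returns a list of (reason, branch, latest, count) tuples.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE mreasons ("
            " sysname text, snapshot text, branch text,"
            " reason text, known boolean)"
        )
        conn.executemany("INSERT INTO mreasons VALUES (?, ?, ?, ?, ?)", rows)
        return conn.execute(
            "SELECT reason, branch, max(snapshot) AS latest, count(*)"
            " FROM mreasons WHERE NOT known"
            " GROUP BY reason, branch ORDER BY reason, branch"
        ).fetchall()
    finally:
        conn.close()
```

Feeding it the real text dump would need a loader for that file's format, which is left out here.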

Also, for completeness, the causes I wrote off as not interesting
(anymore, in some cases):

bfarm=# select reason,max(snapshot) as latest, count(*) from mreasons where 
known group by 1 order by 1 ;
latest| count 
 DST transition test failure  

Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-18 Thread Josh Berkus


> Lastly, note that some buildfarm enhancements are on the SOC project
> list. I have no idea if anyone will express any interest in that, of
> course. It's not very glamorous work.

On the other hand, I think there are a lot more student perl hackers and 
web people than there are folks with the potential to do backend stuff. 
 So who knows?



Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-16 Thread Andrew Dunstan
Jeremy Drake wrote:
>> The dump is just under 1Mb and can be downloaded from
> Sure about that?
> HTTP request sent, awaiting response... 200 OK
> Length: 9,184,142 (8.8M) [text/plain]

Damn these new specs. They made me skip a digit.




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-16 Thread Jeremy Drake
On Fri, 16 Mar 2007, Andrew Dunstan wrote:

> OK, for anyone that wants to play, I have created an extract that contains a
> summary of every non-CVS-related failure we've had. It's a single table
> looking like this:
> CREATE TABLE mfailures (
>sysname text,
>snapshot timestamp without time zone,
>stage text,
>conf_sum text,
>branch text,
>changed_this_run text,
>changed_since_success text,
>log_archive_filenames text[],
>build_flags text[]
> );

Sweet.  Should be interesting to look at.

> The dump is just under 1Mb and can be downloaded from

Sure about that?

   => `mfailures.dump'
Connecting to||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9,184,142 (8.8M) [text/plain]

BOO!  We changed Coke again!  BLEAH!  BLEAH!


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-16 Thread Andrew Dunstan

Tom Lane wrote:

> Andrew Dunstan <[EMAIL PROTECTED]> writes:
>> Well, the db is currently running around 13Gb, so that's not something
>> to be exported lightly ;-)
>
> Yeah.  I would assume though that the vast bulk of that is captured log
> files.  For the purposes I'm imagining, it'd be sufficient to export
> only the rest of the database --- or ideally, records including all the
> other fields and a URL for each log file.  For the small number of log
> files you actually need to examine, you'd chase the URL.


OK, for anyone that wants to play, I have created an extract that 
contains a summary of every non-CVS-related failure we've had. It's a 
single table looking like this:

CREATE TABLE mfailures (
   sysname text,
   snapshot timestamp without time zone,
   stage text,
   conf_sum text,
   branch text,
   changed_this_run text,
   changed_since_success text,
   log_archive_filenames text[],
   build_flags text[]
);

The dump is just under 1Mb and can be downloaded from

If this is useful we can create it or something like it on a regular 
basis (say nightly).

The summary log for a given build can be got from:

To look at the log for a given run stage select 
- the stage names available (if any) are the entries in 
log_archive_filenames, stripped of the ".log" suffix.
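
Deriving the stage names from a row's log_archive_filenames array is a one-liner; a minimal Python sketch follows. The function name is made up for the example, and it assumes the array has already been read into a Python list (or None when the member did not report it).

```python
from typing import List, Optional

LOG_SUFFIX = ".log"


def stage_names(log_archive_filenames: Optional[List[str]]) -> List[str]:
    """Strip the '.log' suffix from each archived log file name.

    Names without the suffix are passed through unchanged; a missing
    array (None) yields an empty list.
    """
    return [
        name[: -len(LOG_SUFFIX)] if name.endswith(LOG_SUFFIX) else name
        for name in (log_archive_filenames or [])
    ]
```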

We can make these available over an API that isn't plain http if people 
want. Or we can provide a version of the buildlog that is stripped of the 




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-16 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Well, the db is currently running around 13Gb, so that's not something 
> to be exported lightly ;-)

Yeah.  I would assume though that the vast bulk of that is captured log
files.  For the purposes I'm imagining, it'd be sufficient to export
only the rest of the database --- or ideally, records including all the
other fields and a URL for each log file.  For the small number of log
files you actually need to examine, you'd chase the URL.

regards, tom lane


Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-16 Thread Joshua D. Drake

> Well, the db is currently running around 13Gb, so that's not something
> to be exported lightly ;-)
> If we upgraded from Postgres 8.0.x to 8.2.x we could make use of some
> features, like dynamic partitioning and copy from queries, that might
> make life easier (CP people: that's a hint :-) )

Yeah, Yeah... I need to get you off that machine as a whole :) Which is
on the list but I am waiting for 8.3 *badda bing*.


Joshua D. Drake




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-16 Thread Andrew Dunstan

Tom Lane wrote:

> The current buildfarm webpages make it easy to see when a branch tip
> is seriously broken, but it's not very easy to investigate transient
> failures, such as a regression test race condition that only
> materializes once in a while.  I would like to have a way of seeing
> just the failed build attempts across all machines running a given
> branch.  Ideally it would be possible to tag failures as to the cause
> (if known) and/or symptom pattern, and then be able to examine just
> the ones without known cause or having similar symptoms.
>
> I'm not sure how much of this is reasonable to try to do with webpages
> similar to what we've got.  But the data is all in a database AIUI,
> so another possibility is to do this work via SQL.  That'd require
> having the ability to pull the information from the buildfarm database
> so someone else could manipulate it.
>
> So I guess the first question is can you make the build data available,
> and the second is whether you're interested in building more flexible
> views or just want to let someone else do that.  Also, if anyone does
> make an effort to tag failures, it'd be good to somehow push that data
> back into the master database, so that we don't end up duplicating such
> work.

Well, the db is currently running around 13Gb, so that's not something 
to be exported lightly ;-)

If we upgraded from Postgres 8.0.x to 8.2.x we could make use of some 
features, like dynamic partitioning and copy from queries, that might 
make life easier (CP people: that's a hint :-) )

I don't want to fragment effort, but I also know CP don't want open 
access, for obvious reasons.

We can also look at a safe API that we could make available freely. I've 
already done this over SOAP (see example client at 
). Doing updates is a whole other matter, of course.

Lastly, note that some buildfarm enhancements are on the SOC project 
list. I have no idea if anyone will express any interest in that, of 
course. It's not very glamorous work.




Re: [HACKERS] Buildfarm feature request: some way to track/classify failures

2007-03-16 Thread Joshua D. Drake
Tom Lane wrote:
> The current buildfarm webpages make it easy to see when a branch tip
> is seriously broken, but it's not very easy to investigate transient
> failures, such as a regression test race condition that only
> materializes once in a while.  I would like to have a way of seeing
> just the failed build attempts across all machines running a given
> branch.  Ideally it would be possible to tag failures as to the cause
> (if known) and/or symptom pattern, and then be able to examine just
> the ones without known cause or having similar symptoms.
> I'm not sure how much of this is reasonable to try to do with webpages
> similar to what we've got.  But the data is all in a database AIUI,
> so another possibility is to do this work via SQL.  That'd require
> having the ability to pull the information from the buildfarm database
> so someone else could manipulate it.
> So I guess the first question is can you make the build data available,
> and the second is whether you're interested in building more flexible
> views or just want to let someone else do that.  Also, if anyone does
> make an effort to tag failures, it'd be good to somehow push that data
> back into the master database, so that we don't end up duplicating such
> work.

If the data is already there and just not represented, just let me know
exactly what you want and I will implement pages for that data happily.

Joshua D. Drake

>   regards, tom lane


