Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2015-01-21 Thread Matt Kelly

 Sure, but nobody who is not a developer is going to care about that.
 A typical user who sees pgstat wait timeout, or doesn't, isn't going
 to be able to make anything at all out of that.


As a user, I wholeheartedly disagree.

That warning helped me massively in diagnosing an unhealthy database server
in the past at TripAdvisor (i.e. a high-end server-class box, not a Raspberry
Pi).  I have realtime monitoring that looks at pg_stat_database at regular
intervals, particularly at the rate of change of the xact_commit and
xact_rollback columns, similar to how check_postgres does it.
https://github.com/bucardo/check_postgres/blob/master/check_postgres.pl#L4234
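
To sketch the idea (the sample table and query below are illustrative only,
not the actual check_postgres or dashboard code):

-- Illustrative only: persist a counter sample on each poll...
CREATE TABLE IF NOT EXISTS tps_samples (
    sampled_at    timestamptz NOT NULL DEFAULT now(),
    datname       name        NOT NULL,
    xact_commit   bigint      NOT NULL,
    xact_rollback bigint      NOT NULL
);

INSERT INTO tps_samples (datname, xact_commit, xact_rollback)
SELECT datname, xact_commit, xact_rollback
  FROM pg_stat_database
 WHERE datname = current_database();

-- ...and derive transactions/sec from the delta between consecutive samples.
SELECT sampled_at,
       (xact_commit - lag(xact_commit) OVER w)
         / GREATEST(EXTRACT(EPOCH FROM sampled_at - lag(sampled_at) OVER w), 1)
         AS commits_per_sec
  FROM tps_samples
WINDOW w AS (ORDER BY sampled_at)
 ORDER BY sampled_at DESC
 LIMIT 1;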

When one of those servers was unhealthy, it stopped reporting statistics
for 30+ seconds at a time.  My dashboard, which polled far more frequently
than that, indicated the server was mostly processing 0 tps with
intermittent spikes.  I went directly onto the server and sampled
pg_stat_database.  That warning was the only thing that directly indicated
that the statistics collector was not to be trusted.  The collector was
obviously a victim of whatever was going on in the server, but it's pretty
important to know when your methods for measuring server health are lying
to you.  At first glance the spiky TPS looks like some sort of livelock,
not just an overloaded server.

Now, I know: 0 change in stats = collector broken.  Rereading the docs:

 Also, the collector itself emits a new report at most once per
 PGSTAT_STAT_INTERVAL milliseconds (500 ms unless altered while building
 the server).


Without context this merely reads: "We sleep for 500ms, plus the time to
write the file, plus whenever the OS decides to enforce the timer
interrupt... so roughly 550-650ms."  It doesn't read: "When the server is
unhealthy, but _still_ serving queries, the stats collector might not be
able to keep up and will just stop reporting stats altogether."

I think the warning is incredibly valuable.  Along those lines I'd also
love to see a pg_stat_snapshot_timestamp() for monitoring code to use to
determine if it's using a stale snapshot, as well as to help smooth
graphs of statistics that are based on highly granular snapshotting.
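
For instance, monitoring code could then guard against trusting a stale
snapshot.  A minimal sketch, assuming the proposed function existed and
returned the timestamp at which the current stats snapshot was read:

-- Hypothetical: pg_stat_snapshot_timestamp() is only proposed here, not yet in core.
SELECT now() - pg_stat_snapshot_timestamp() > interval '30 seconds' AS stats_stale;
-- If stats_stale is true, flag the sample as untrustworthy instead of
-- plotting an apparent 0 tps.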

- Matt Kelly


Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2015-01-21 Thread Andres Freund
On 2015-01-21 22:43:03 -0500, Matt Kelly wrote:
 
  Sure, but nobody who is not a developer is going to care about that.
  A typical user who sees pgstat wait timeout, or doesn't, isn't going
  to be able to make anything at all out of that.
 
 
 As a user, I wholeheartedly disagree.

Note that the change we discussed wasn't removing the message. It
was changing the log level from WARNING to LOG, which means the message
is not sent to the client anymore, but still goes to the server log
(perhaps surprisingly, the likelihood of the latter actually increases).

 I think the warning is incredibly valuable.  Along those lines I'd also
 love to see a pg_stat_snapshot_timestamp() for monitoring code to use to
 determine if its using a stale snapshot, as well as helping to smooth
 graphs of the statistics that are based on highly granular snapshotting.

I can see that being useful.

Greetings,

Andres Freund




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2015-01-20 Thread Tomas Vondra
On 25.12.2014 22:28, Tomas Vondra wrote:
 On 25.12.2014 21:14, Andres Freund wrote:

 That's indeed odd. Seems to have been lost when the statsfile was
 split into multiple files. Alvaro, Tomas?
 
 The goal was to keep the logic as close to the original as possible.
 IIRC there were pgstat wait timeout issues before, and in most cases
 the conclusion was that it's probably because of overloaded I/O.
 
 But maybe there actually was another bug, and it's entirely possible
 that the split introduced a new one, and that's what we're seeing now.
 The strange thing is that the split happened ~2 years ago, which is
 inconsistent with the sudden increase of this kind of issues. So maybe
 something changed on that particular animal (a failing SD card causing
 I/O stalls, perhaps)?
 
 Anyway, I happen to have a spare Raspberry PI, so I'll try to reproduce
 and analyze the issue locally. But that won't happen until January.

I've tried to reproduce this on my Raspberry PI 'machine' and it's not
very difficult to trigger this. About 7 out of 10 'make check' runs fail
because of 'pgstat wait timeout'.

All the occurrences I've seen were right after some sort of VACUUM
(sometimes plain, sometimes ANALYZE or FREEZE), and the I/O at the time
looked something like this:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await  svctm  %util
mmcblk0           0.00    75.00    0.00    8.00     0.00    36.00     9.00     5.73 15633.75     0.00 15633.75 125.00 100.00

So pretty terrible (this is a Class 4 SD card, supposedly able to handle
4 MB/s). If hamster had a faulty SD card, it might have been much worse, I
guess.

This of course does not prove the absence of a bug - I plan to dig into
this a bit more. Feel free to point out some suspicious scenarios that
might be worth reproducing and analyzing.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2015-01-20 Thread Tomas Vondra
On 21.1.2015 00:38, Michael Paquier wrote:
 On Wed, Jan 21, 2015 at 1:08 AM, Tomas Vondra

 I've tried to reproduce this on my Raspberry PI 'machine' and it's not
 very difficult to trigger this. About 7 out of 10 'make check' runs fail
 because of 'pgstat wait timeout'.

 All the occurrences I've seen were right after some sort of VACUUM
 (sometimes plain, sometimes ANALYZE or FREEZE), and the I/O at the time
 looked something like this:

 Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await  svctm  %util
 mmcblk0           0.00    75.00    0.00    8.00     0.00    36.00     9.00     5.73 15633.75     0.00 15633.75 125.00 100.00

 So pretty terrible (this is a Class 4 SD card, supposedly able to
 handle 4 MB/s). If hamster had a faulty SD card, it might have been
 much worse, I guess.

 In my experience, at least a class 10 card is necessary, along with a
 minimum amount of memory, to minimize the appearance of those warnings;
 hamster now has an 8GB class 10 card.

Well, my goal was exactly to produce those warnings ;-) and see if I can
identify some strange cases. That's why I chose just a class 4 card. But
even then it produces a rather low number of those warnings (one or two
per check run), and mostly at the expected places with significant I/O
overload. So I'm not any wiser :-(

regards
Tomas






Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2015-01-20 Thread Michael Paquier
On Wed, Jan 21, 2015 at 1:08 AM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 On 25.12.2014 22:28, Tomas Vondra wrote:
 On 25.12.2014 21:14, Andres Freund wrote:

 That's indeed odd. Seems to have been lost when the statsfile was
 split into multiple files. Alvaro, Tomas?

 The goal was to keep the logic as close to the original as possible.
 IIRC there were pgstat wait timeout issues before, and in most cases
 the conclusion was that it's probably because of overloaded I/O.

 But maybe there actually was another bug, and it's entirely possible
 that the split introduced a new one, and that's what we're seeing now.
 The strange thing is that the split happened ~2 years ago, which is
 inconsistent with the sudden increase of this kind of issues. So maybe
 something changed on that particular animal (a failing SD card causing
 I/O stalls, perhaps)?

 Anyway, I happen to have a spare Raspberry PI, so I'll try to reproduce
 and analyze the issue locally. But that won't happen until January.

 I've tried to reproduce this on my Raspberry PI 'machine' and it's not
 very difficult to trigger this. About 7 out of 10 'make check' runs fail
 because of 'pgstat wait timeout'.

 All the occurrences I've seen were right after some sort of VACUUM
 (sometimes plain, sometimes ANALYZE or FREEZE), and the I/O at the time
 looked something like this:

 Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await  svctm  %util
 mmcblk0           0.00    75.00    0.00    8.00     0.00    36.00     9.00     5.73 15633.75     0.00 15633.75 125.00 100.00

 So pretty terrible (this is a Class 4 SD card, supposedly able to handle
 4 MB/s). If hamster had a faulty SD card, it might have been much worse, I
 guess.
In my experience, at least a class 10 card is necessary, along with a
minimum amount of memory, to minimize the appearance of those warnings;
hamster now has an 8GB class 10 card.
-- 
Michael




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-28 Thread Robert Haas
On Sat, Dec 27, 2014 at 8:51 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Sat, Dec 27, 2014 at 7:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 This would have the effect of transferring all responsibility for
 dead-stats-entry cleanup to autovacuum.  For ordinary users, I think
 that'd be just fine.  It might be less fine though for people who
 disable autovacuum, if there still are any.

 -1.  I don't think it's a good idea to inflict pain on people who want
 to schedule their vacuums manually (and yes, there are some) to get
 clean buildfarm runs.

 Did you read the rest of it?

Yeah...  since when do I not read your emails?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-27 Thread Heikki Linnakangas

On 12/27/2014 12:16 AM, Alvaro Herrera wrote:

Tom Lane wrote:


The argument that autovac workers need fresher stats than anything else
seems pretty dubious to start with.  Why shouldn't we simplify that down
to they use PGSTAT_STAT_INTERVAL like everybody else?


The point of wanting fresher stats than that, eons ago, was to avoid a
worker vacuuming a table that some other worker vacuumed more recently
than PGSTAT_STAT_INTERVAL.  I realize now that the semantics we really
want are something like "stats no older than XYZ", where the given value
is the timestamp at which we start checking; if we get anything newer
than that it would be okay, but we currently reject it because of lack
of a more appropriate API.  (If it takes more than PGSTAT_STAT_INTERVAL
to get the stats back, a regular backend would ask for fresher stats,
but to an autovac worker they would be good enough as long as they are
newer than its recheck start time.)

Nowadays we can probably disregard the whole issue, since starting a new
vacuum just after the prior one finished should not cause much stress to
the system thanks to the visibility map.


Vacuuming is far from free, even if the visibility map says that most 
pages are visible to all: you still scan all indexes, if you remove any 
dead tuples at all.


- Heikki





Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-27 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 12/27/2014 12:16 AM, Alvaro Herrera wrote:
 Tom Lane wrote:
 The argument that autovac workers need fresher stats than anything else
 seems pretty dubious to start with.  Why shouldn't we simplify that down
 to they use PGSTAT_STAT_INTERVAL like everybody else?

 The point of wanting fresher stats than that, eons ago, was to avoid a
 worker vacuuming a table that some other worker vacuumed more recently
 than PGSTAT_STAT_INTERVAL. ...
 Nowadays we can probably disregard the whole issue, since starting a new
 vacuum just after the prior one finished should not cause much stress to
 the system thanks to the visibility map.

 Vacuuming is far from free, even if the visibility map says that most 
 pages are visible to all: you still scan all indexes, if you remove any 
 dead tuples at all.

With typical autovacuum settings, I kinda doubt that there's much value in
reducing the window for this problem from 500ms to 10ms.  As Alvaro says,
this was just a partial, kluge solution from the start --- if we're
worried about such duplicate vacuuming, we should undertake a real
solution that closes the window altogether.  In any case, timeouts
occurring inside autovacuum are not directly causing the buildfarm
failures, since autovacuum's log entries don't reflect into regression
outputs.  (It's possible that autovacuum's tight tolerance is contributing
to the failures by increasing the load on the stats collector, but I'm
not sure I believe that.)

To get back to that original complaint about buildfarm runs failing,
I notice that essentially all of those failures are coming from wait
timeout warnings reported by manual VACUUM commands.  Now, VACUUM itself
has no need to read the stats files.  What's actually causing these
messages is failure to get a timely response in pgstat_vacuum_stat().
So let me propose a drastic solution: let's dike out this bit in vacuum.c:

/*
 * Send info about dead objects to the statistics collector, unless we are
 * in autovacuum --- autovacuum.c does this for itself.
 */
if ((vacstmt->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
    pgstat_vacuum_stat();

This would have the effect of transferring all responsibility for
dead-stats-entry cleanup to autovacuum.  For ordinary users, I think
that'd be just fine.  It might be less fine though for people who
disable autovacuum, if there still are any.  To analyze the effects
on them, let's break down what pgstat_vacuum_stat() actually does:

1. It causes drops of whole databases from pgstat's tables.  I claim
we don't need to do that at this level.  There's no issue unless
the stats collector missed the PgStat_MsgDropdb message when the database
was dropped; and even if it did, the negative consequences of having
a dead DB's stats still lying around are pretty minimal since we split
up the stats files.  There will be no extra I/O except for one small
record in the global stats file.  So I think we can well afford to say
that with autovac off, such missed databases only get cleaned up the
next time an autovac is forced for antiwraparound.

2. Within the current database, it causes drops of pgstat entries for
dropped tables.  This would be a tad annoying if you have lots of
transient tables and no or minimal autovacuuming.  However, lots of
transient tables can be a pain point anyway, because we don't currently
have any mechanism other than pgstat_vacuum_stat() for cleaning up
per-table stats for dropped tables.  It seems like it might be time to do
something about that.  We could probably extend the PgStat_TableXactStatus
infrastructure to keep track of whether a table was created or deleted in
the current transaction, and solve the problem without very much new code.
Or maybe we could move the responsibility for reporting stale entries
into the code that reads the stats files for the stats views.

3. Within the current database, it causes drops of pgstat entries for
dropped functions, if you're tracking per-function execution stats.
Since that's not the default, maybe we could get away with saying that
we don't clean up such dead entries very often unless you're running
autovacuum.  I don't really want to propose building the infrastructure
to support dropping such entries retail.

Or, much more simply, we could conclude that it's not that important
if pgstat_vacuum_stat() is unable to get up-to-date stats, and rejigger
the code so that we don't bleat when the file-reading request comes from
that source.  This should be a back-patchable fix, whereas #2 above might
be a bit too complicated for that.

Thoughts?

regards, tom lane




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-27 Thread Robert Haas
On Sat, Dec 27, 2014 at 7:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 12/27/2014 12:16 AM, Alvaro Herrera wrote:
 Tom Lane wrote:
 The argument that autovac workers need fresher stats than anything else
 seems pretty dubious to start with.  Why shouldn't we simplify that down
 to they use PGSTAT_STAT_INTERVAL like everybody else?

 The point of wanting fresher stats than that, eons ago, was to avoid a
 worker vacuuming a table that some other worker vacuumed more recently
 than PGSTAT_STAT_INTERVAL. ...
 Nowadays we can probably disregard the whole issue, since starting a new
 vacuum just after the prior one finished should not cause much stress to
 the system thanks to the visibility map.

 Vacuuming is far from free, even if the visibility map says that most
 pages are visible to all: you still scan all indexes, if you remove any
 dead tuples at all.

 With typical autovacuum settings, I kinda doubt that there's much value in
 reducing the window for this problem from 500ms to 10ms.  As Alvaro says,
 this was just a partial, kluge solution from the start --- if we're
 worried about such duplicate vacuuming, we should undertake a real
 solution that closes the window altogether.  In any case, timeouts
 occurring inside autovacuum are not directly causing the buildfarm
 failures, since autovacuum's log entries don't reflect into regression
 outputs.  (It's possible that autovacuum's tight tolerance is contributing
 to the failures by increasing the load on the stats collector, but I'm
 not sure I believe that.)

 To get back to that original complaint about buildfarm runs failing,
 I notice that essentially all of those failures are coming from wait
 timeout warnings reported by manual VACUUM commands.  Now, VACUUM itself
 has no need to read the stats files.  What's actually causing these
 messages is failure to get a timely response in pgstat_vacuum_stat().
 So let me propose a drastic solution: let's dike out this bit in vacuum.c:

 /*
  * Send info about dead objects to the statistics collector, unless we are
  * in autovacuum --- autovacuum.c does this for itself.
  */
 if ((vacstmt->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
     pgstat_vacuum_stat();

 This would have the effect of transferring all responsibility for
 dead-stats-entry cleanup to autovacuum.  For ordinary users, I think
 that'd be just fine.  It might be less fine though for people who
 disable autovacuum, if there still are any.

-1.  I don't think it's a good idea to inflict pain on people who want
to schedule their vacuums manually (and yes, there are some) to get
clean buildfarm runs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-27 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sat, Dec 27, 2014 at 7:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 This would have the effect of transferring all responsibility for
 dead-stats-entry cleanup to autovacuum.  For ordinary users, I think
 that'd be just fine.  It might be less fine though for people who
 disable autovacuum, if there still are any.

 -1.  I don't think it's a good idea to inflict pain on people who want
 to schedule their vacuums manually (and yes, there are some) to get
 clean buildfarm runs.

Did you read the rest of it?

regards, tom lane




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-26 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Tom Lane wrote:
 Yeah, I've been getting more annoyed by that too lately.  I keep wondering
 though whether there's an actual bug underneath that behavior that we're
 failing to see.

 I think the first thing to do is reconsider usage of PGSTAT_RETRY_DELAY
 instead of PGSTAT_STAT_INTERVAL in autovacuum workers.  That decreases
 the wait time 50-fold, if I recall this correctly, and causes large
 amounts of extra I/O traffic.

Yeah --- that means that instead of the normal behavior that a stats file
newer than 500 msec is good enough, an autovac worker insists on a stats
file newer than 10 msec.  I did some experimentation on prairiedog, and
found that it's not hard at all to see autovac workers demanding multiple
stats writes per second:

2014-12-26 16:26:52.958 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:26:53.128 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:26:53.188 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:26:54.903 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:26:55.058 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:00.022 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:00.285 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:00.792 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:01.010 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:01.163 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:01.193 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:03.595 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:03.673 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:03.839 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:03.878 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:05.878 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:06.571 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:07.001 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:07.769 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:07.950 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:10.256 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:11.039 EST 21026 LOG:  sending inquiry for database 45116
2014-12-26 16:27:11.402 EST 21026 LOG:  sending inquiry for database 45116

The argument that autovac workers need fresher stats than anything else
seems pretty dubious to start with.  Why shouldn't we simplify that down
to they use PGSTAT_STAT_INTERVAL like everybody else?

regards, tom lane




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-26 Thread Alvaro Herrera
Tom Lane wrote:

 The argument that autovac workers need fresher stats than anything else
 seems pretty dubious to start with.  Why shouldn't we simplify that down
 to they use PGSTAT_STAT_INTERVAL like everybody else?

The point of wanting fresher stats than that, eons ago, was to avoid a
worker vacuuming a table that some other worker vacuumed more recently
than PGSTAT_STAT_INTERVAL.  I realize now that the semantics we really
want are something like "stats no older than XYZ", where the given value
is the timestamp at which we start checking; if we get anything newer
than that it would be okay, but we currently reject it because of lack
of a more appropriate API.  (If it takes more than PGSTAT_STAT_INTERVAL
to get the stats back, a regular backend would ask for fresher stats,
but to an autovac worker they would be good enough as long as they are
newer than its recheck start time.)

Nowadays we can probably disregard the whole issue, since starting a new
vacuum just after the prior one finished should not cause much stress to
the system thanks to the visibility map.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-25 Thread Andres Freund
Hi,

We quite regularly have buildfarm failures that are caused by 'WARNING:
pgstat wait timeout' at random points during the build. Especially on
some of the ARM buildfarm animals those are really frequent, to the
point that it's hard to know the actual state of the buildfarm without
looking at several animals. The ARM ones are probably affected more
frequently because they'll often run on sd cards and such.

So I think a better way to deal with that warning would be a good
idea. Besides somehow making the mechanism itself more robust, there are
two ways to attack this that I can think of, neither of them awe inspiring:

1) Make that WARNING a LOG message instead, since those don't get sent
to the client with default settings...
2) Increase PGSTAT_MAX_WAIT_TIME even further than what 99b545 increased
it to.

Does anybody have an actually good idea?

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-25 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 So I think a better way to deal with that warning would be a good
 idea. Besides somehow making the mechanism itself more robust, there are
 two ways to attack this that I can think of, neither of them awe inspiring:

 1) Make that WARNING a LOG message instead, since those don't get sent
 to the client with default settings...
 2) Increase PGSTAT_MAX_WAIT_TIME even further than what 99b545 increased
 it to.

Yeah, I've been getting more annoyed by that too lately.  I keep wondering
though whether there's an actual bug underneath that behavior that we're
failing to see.  PGSTAT_MAX_WAIT_TIME is already 10 seconds; it's hard to
credit that increasing it still further would be fixing anything.
The other change would also mainly just sweep the issue under the rug,
if there is any issue and it's not just that we're overloading
underpowered buildfarm machines.  (Maybe a better fix would be to reduce
MAX_CONNECTIONS for the tests on these machines?)

I wonder whether when multiple processes are demanding statsfile updates,
there's some misbehavior that causes them to suck CPU away from the stats
collector and/or convince it that it doesn't need to write anything.
There are odd things in the logs in some of these events.  For example in
today's failure on hamster,
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2014-12-25%2016%3A00%3A07
there are two client-visible wait-timeout warnings, one each in the
gist and spgist tests.  But in the postmaster log we find these in
fairly close succession:

[549c38ba.724d:2] WARNING:  pgstat wait timeout
[549c39b1.73e7:10] WARNING:  pgstat wait timeout
[549c38ba.724d:3] WARNING:  pgstat wait timeout

Correlating these with other log entries shows that the first and third
are from the autovacuum launcher while the second is from the gist test
session.  So the spgist failure failed to get logged, and in any case the
big picture is that we had four timeout warnings occurring in a pretty
short span of time, in a parallel test set that's not all that demanding
(12 parallel tests, well below our max).  Not sure what to make of that.

BTW, I notice that in the current state of pgstat.c, all the logic for
keeping track of request arrival times is dead code, because nothing is
actually looking at DBWriteRequest.request_time.  This makes me think that
somebody simplified away some logic we maybe should have kept.  But if
we're going to leave it like this, we could replace the DBWriteRequest
data structure with a simple OID list and save a fair amount of code.

regards, tom lane




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-25 Thread Andres Freund
On 2014-12-25 14:36:42 -0500, Tom Lane wrote:
 I wonder whether when multiple processes are demanding statsfile updates,
 there's some misbehavior that causes them to suck CPU away from the stats
 collector and/or convince it that it doesn't need to write anything.
 There are odd things in the logs in some of these events.  For example in
 today's failure on hamster,
 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2014-12-25%2016%3A00%3A07
 there are two client-visible wait-timeout warnings, one each in the
 gist and spgist tests.  But in the postmaster log we find these in
 fairly close succession:
 
 [549c38ba.724d:2] WARNING:  pgstat wait timeout
 [549c39b1.73e7:10] WARNING:  pgstat wait timeout
 [549c38ba.724d:3] WARNING:  pgstat wait timeout
 
 Correlating these with other log entries shows that the first and third
 are from the autovacuum launcher while the second is from the gist test
 session.  So the spgist failure failed to get logged, and in any case the
 big picture is that we had four timeout warnings occurring in a pretty
 short span of time, in a parallel test set that's not all that demanding
 (12 parallel tests, well below our max).  Not sure what to make of that.

My guess is that a checkpoint happened at that time. Maybe it'd be a
good idea to make pg_regress start postgres with log_checkpoints
enabled? My guess is that we'd find horrendous 'sync' times.

Michael: Could you perhaps turn log_checkpoints on in the config?

 BTW, I notice that in the current state of pgstat.c, all the logic for
 keeping track of request arrival times is dead code, because nothing is
 actually looking at DBWriteRequest.request_time.  This makes me think that
 somebody simplified away some logic we maybe should have kept.  But if
 we're going to leave it like this, we could replace the DBWriteRequest
 data structure with a simple OID list and save a fair amount of code.

That's indeed odd. Seems to have been lost when the statsfile was split
into multiple files. Alvaro, Tomas?

I wondered for a second whether the split could be responsible somehow,
but there's reports of that in older backbranches as well:
http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=mereswine&dt=2014-12-23%2019%3A17%3A41

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-25 Thread Tomas Vondra
On 25.12.2014 20:36, Tom Lane wrote:

 Yeah, I've been getting more annoyed by that too lately. I keep
 wondering though whether there's an actual bug underneath that
 behavior that we're failing to see. PGSTAT_MAX_WAIT_TIME is already
 10 seconds; it's hard to credit that increasing it still further
 would be fixing anything. The other change would also mainly just
 sweep the issue under the rug, if there is any issue and it's not
 just that we're overloading underpowered buildfarm machines. (Maybe a
 better fix would be to reduce MAX_CONNECTIONS for the tests on these
 machines?)

I agree that increasing the limit further is not a good solution.

 BTW, I notice that in the current state of pgstat.c, all the logic
 for keeping track of request arrival times is dead code, because
 nothing is actually looking at DBWriteRequest.request_time. This
 makes me think that somebody simplified away some logic we maybe
 should have kept. But if we're going to leave it like this, we could
 replace the DBWriteRequest data structure with a simple OID list and
 save a fair amount of code.

Really? Which part of the code is dead? I see pgstat_recv_inquiry() is
updating the request_time after receiving the inquiry, and
pgstat_db_requested() is looking at it when writing the files.

If we can simplify the code by keeping just OIDs, let's do that. I think
the main reason why we haven't done that in 187492b6 was to keep as much
of the existing logic (and maybe change it in a separate patch).

regards
Tomas




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-25 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 On 25.12.2014 20:36, Tom Lane wrote:
 BTW, I notice that in the current state of pgstat.c, all the logic
 for keeping track of request arrival times is dead code, because
 nothing is actually looking at DBWriteRequest.request_time.

 Really? Which part of the code is dead? I see pgstat_recv_inquiry() is
 updating the request_time after receiving the inquiry, and
 pgstat_db_requested() is looking at it when writing the files.

Where is pgstat_db_requested() looking at request_time?

 If we can simplify the code by keeping just OIDs, let's do that. I think
 the main reason why we haven't done that in 187492b6 was to keep as much
 of the existing logic (and maybe change it in a separate patch).

The real point here is that I think that commit *already* changed the
existing logic, because the time-of-receipt used to matter.  In
particular, there used to be a throttle on how often the stats file
could get written, which seems to have vanished.  I seriously doubt
that that was a good change, especially on write-bandwidth-challenged
platforms.

regards, tom lane




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-25 Thread Tomas Vondra
On 25.12.2014 21:14, Andres Freund wrote:
 On 2014-12-25 14:36:42 -0500, Tom Lane wrote:

 My guess is that a checkpoint happened at that time. Maybe it'd be a 
 good idea to make pg_regress start postgres with log_checkpoints 
 enabled? My guess is that we'd find horrendous 'sync' times.
 
 Michael: Could you perhaps turn log_checkpoints on in the config?

Logging timestamps (using log_line_prefix) would also be helpful.

 
 BTW, I notice that in the current state of pgstat.c, all the logic
 for keeping track of request arrival times is dead code, because
 nothing is actually looking at DBWriteRequest.request_time. This
 makes me think that somebody simplified away some logic we maybe
 should have kept. But if we're going to leave it like this, we
 could replace the DBWriteRequest data structure with a simple OID
 list and save a fair amount of code.
 
 That's indeed odd. Seems to have been lost when the statsfile was
 split into multiple files. Alvaro, Tomas?

The goal was to keep the logic as close to the original as possible.
IIRC there were pgstat wait timeout issues before, and in most cases
the conclusion was that it's probably because of overloaded I/O.

But maybe there actually was another bug, and it's entirely possible
that the split introduced a new one, and that's what we're seeing now.
The strange thing is that the split happened ~2 years ago, which is
inconsistent with the sudden increase of this kind of issues. So maybe
something changed on that particular animal (a failing SD card causing
I/O stalls, perhaps)?

Anyway, I happen to have a spare Raspberry PI, so I'll try to reproduce
and analyze the issue locally. But that won't happen until January.

 I wondered for a second whether the split could be responsible
 somehow, but there's reports of that in older backbranches as well:
 http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=mereswine&dt=2014-12-23%2019%3A17%3A41


Tomas




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-25 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 The strange thing is that the split happened ~2 years ago, which is
 inconsistent with the sudden increase of this kind of issues. So maybe
 something changed on that particular animal (a failing SD card causing
 I/O stalls, perhaps)?

I think that hamster has basically got a tin can and string for an I/O
subsystem.  It's not real clear to me whether there's actually been an
increase in wait timeout failures recently; somebody would have to
go through and count them before I'd have much faith in that thesis.
However, I notice that at least the last few occurrences on hamster
all seem to have been in this parallel block:

test: brin gin gist spgist privileges security_label collate matview lock 
replica_identity rowsecurity object_address

which as recently as 9.4 contained just these tests:

test: privileges security_label collate matview lock replica_identity

I'm fairly sure that the index-related tests in this batch are I/O
intensive, and since they were not there at all six months ago, it's not
hard to believe that this block of tests has far greater I/O demands than
used to exist.  Draw your own conclusions ...

regards, tom lane




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-25 Thread Tomas Vondra
On 25.12.2014 22:16, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 On 25.12.2014 20:36, Tom Lane wrote:
 BTW, I notice that in the current state of pgstat.c, all the logic
 for keeping track of request arrival times is dead code, because
 nothing is actually looking at DBWriteRequest.request_time.
 
 Really? Which part of the code is dead? I see pgstat_recv_inquiry() is
 updating the request_time after receiving the inquiry, and
 pgstat_db_requested() is looking at it when writing the files.
 
 Where is pgstat_db_requested() looking at request_time?

Oh, right. pgstat_db_requested() is not looking at the timestamp - it
only checks the OID of the database. But pgstat_recv_inquiry() is
checking it, comparing it to cutoff_time etc.

ISTM the main change related to this is that this:

if (last_statwrite < last_statrequest)
    pgstat_write_statsfile(false);

got replaced by this:

if (pgstat_write_statsfile_needed())
    pgstat_write_statsfiles(false, false);

where pgstat_write_statsfile_needed() only checks if the list is empty
(and ignores the request/write timestamps).

If I understand that correctly, this change would lead to writing the files
more often, which matters given that we're dealing with a timeout.

 If we can simplify the code by keeping just OIDs, let's do that. I think
 the main reason why we haven't done that in 187492b6 was to keep as much
 of the existing logic (and maybe change it in a separate patch).
 
 The real point here is that I think that commit *already* changed
 the existing logic, because the time-of-receipt used to matter. In 
 particular, there used to be a throttle on how often the stats file 
 could get written, which seems to have vanished. I seriously doubt 
 that that was a good change, especially on
 write-bandwidth-challenged platforms.

Yes - if that change causes writing the files being written more
frequently, it's not a good change.

But I think the time-of-receipt still matters - the pgstat_recv_inquiry
logic remained the same, just applied per database. ISTM the only thing
that disappeared is the (last_statwrite < last_statrequest) check.

I have to think about this a bit more, I haven't seen this code since
the split patch.

Tomas




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-25 Thread Tomas Vondra
On 25.12.2014 22:40, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 The strange thing is that the split happened ~2 years ago, which is
 inconsistent with the sudden increase of this kind of issues. So maybe
 something changed on that particular animal (a failing SD card causing
 I/O stalls, perhaps)?
 
 I think that hamster has basically got a tin can and string for an I/O
 subsystem.  It's not real clear to me whether there's actually been an

Yes. It's called SD card.

 increase in wait timeout failures recently; somebody would have to
 go through and count them before I'd have much faith in that thesis.

That's what I did. On hamster I see this (in the HEAD):

2014-12-25 16:00:07 yes
2014-12-24 16:00:07 yes
2014-12-23 16:00:07 yes
2014-12-22 16:00:07 yes
2014-12-19 16:00:07 yes
2014-12-15 16:00:11 no
2014-10-25 16:00:06 no
2014-10-24 16:00:06 no
2014-10-23 16:00:06 no
2014-10-22 16:00:06 no
2014-10-21 16:00:07 no
2014-10-19 16:00:06 no
2014-09-28 16:00:06 no
2014-09-26 16:00:07 no
2014-08-28 16:00:06 no
2014-08-12 16:00:06 no
2014-08-05 22:04:48 no
2014-07-19 01:53:30 no
2014-07-06 16:00:06 no
2014-07-04 16:00:06 no
2014-06-29 16:00:06 no
2014-05-09 16:00:04 no
2014-05-07 16:00:04 no
2014-05-04 16:00:04 no
2014-04-28 16:00:04 no
2014-04-18 16:00:04 no
2014-04-04 16:00:04 no

(where "yes" means 'pgstat wait timeout' is in the logs). On chipmunk,
the trend is much less convincing (but there are far fewer failures, and
only 3 of them failed because of the pgstat wait timeout).

However, it's worth mentioning that all the pgstat failures happened at
16:00:07 and most of the older failures are before that. So it may be
that it was failing because of something else, and the pgstat timeout
could not happen because of that. OTOH, there's a fair amount of
successful runs.


 However, I notice that at least the last few occurrences on hamster
 all seem to have been in this parallel block:
 
 test: brin gin gist spgist privileges security_label collate matview
 lock replica_identity rowsecurity object_address
 
 which as recently as 9.4 contained just these tests:
 
 test: privileges security_label collate matview lock replica_identity
 
 I'm fairly sure that the index-related tests in this batch are I/O 
 intensive, and since they were not there at all six months ago, it's
 not hard to believe that this block of tests has far greater I/O
 demands than used to exist. Draw your own conclusions ...

Yes, that might be the culprit here. Would be interesting to know what's
happening on the machine while the tests are running to confirm this
hypothesis.

regards
Tomas




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-25 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 On 25.12.2014 22:40, Tom Lane wrote:
 I think that hamster has basically got a tin can and string for an I/O
 subsystem.  It's not real clear to me whether there's actually been an
 increase in wait timeout failures recently; somebody would have to
 go through and count them before I'd have much faith in that thesis.

 That's what I did. On hamster I see this (in the HEAD):

 2014-12-25 16:00:07 yes
 2014-12-24 16:00:07 yes
 2014-12-23 16:00:07 yes
 2014-12-22 16:00:07 yes
 2014-12-19 16:00:07 yes
 2014-12-15 16:00:11 no
 2014-10-25 16:00:06 no
 2014-10-24 16:00:06 no
 2014-10-23 16:00:06 no
 2014-10-22 16:00:06 no
 2014-10-21 16:00:07 no
 2014-10-19 16:00:06 no
 2014-09-28 16:00:06 no
 2014-09-26 16:00:07 no
 2014-08-28 16:00:06 no
 2014-08-12 16:00:06 no
 2014-08-05 22:04:48 no
 2014-07-19 01:53:30 no
 2014-07-06 16:00:06 no
 2014-07-04 16:00:06 no
 2014-06-29 16:00:06 no
 2014-05-09 16:00:04 no
 2014-05-07 16:00:04 no
 2014-05-04 16:00:04 no
 2014-04-28 16:00:04 no
 2014-04-18 16:00:04 no
 2014-04-04 16:00:04 no

 (where yes means pgstat wait timeout is in the logs). On chipmunk,
 the trend is much less convincing (but there's much less failures, and
 only 3 of them failed because of the pgstat wait timeout).

mereswine's history is also pretty interesting in this context.  That
series makes it look like the probability of pgstat wait timeout took
a big jump around the beginning of December, especially if you make the
unproven-but-not-unreasonable assumption that the two pg_upgradecheck
failures since then were also wait timeout failures.  That's close enough
after commit 88fc71926392115c (Nov 19) to make me suspect that that was
what put us over the edge: that added a bunch more I/O *and* a bunch more
statistics demands to this one block of parallel tests.

But even if we are vastly overstressing the I/O subsystem on these boxes,
why is it manifesting like this?  pgstat never fsyncs the stats temp file,
so it should not have to wait for physical I/O I'd think.  Or perhaps the
file rename() operations get fsync'd behind the scenes by the filesystem?

regards, tom lane




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-25 Thread Tomas Vondra
On 26.12.2014 02:59, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 On 25.12.2014 22:40, Tom Lane wrote:
 I think that hamster has basically got a tin can and string for an I/O
 subsystem.  It's not real clear to me whether there's actually been an
 increase in wait timeout failures recently; somebody would have to
 go through and count them before I'd have much faith in that thesis.
 
 That's what I did. On hamster I see this (in the HEAD):
 
 2014-12-25 16:00:07 yes
 2014-12-24 16:00:07 yes
 2014-12-23 16:00:07 yes
 2014-12-22 16:00:07 yes
 2014-12-19 16:00:07 yes
 2014-12-15 16:00:11 no
 2014-10-25 16:00:06 no
 2014-10-24 16:00:06 no
 2014-10-23 16:00:06 no
 2014-10-22 16:00:06 no
 2014-10-21 16:00:07 no
 2014-10-19 16:00:06 no
 2014-09-28 16:00:06 no
 2014-09-26 16:00:07 no
 2014-08-28 16:00:06 no
 2014-08-12 16:00:06 no
 2014-08-05 22:04:48 no
 2014-07-19 01:53:30 no
 2014-07-06 16:00:06 no
 2014-07-04 16:00:06 no
 2014-06-29 16:00:06 no
 2014-05-09 16:00:04 no
 2014-05-07 16:00:04 no
 2014-05-04 16:00:04 no
 2014-04-28 16:00:04 no
 2014-04-18 16:00:04 no
 2014-04-04 16:00:04 no
 
 (where yes means pgstat wait timeout is in the logs). On
 chipmunk, the trend is much less convincing (but there's much less
 failures, and only 3 of them failed because of the pgstat wait
 timeout).
 
 mereswine's history is also pretty interesting in this context. That 
 series makes it look like the probability of pgstat wait timeout
 took a big jump around the beginning of December, especially if you
 make the unproven-but-not-unreasonable assumption that the two
 pg_upgradecheck failures since then were also wait timeout failures.
 That's close enough after commit 88fc71926392115c (Nov 19) to make me
 suspect that that was what put us over the edge: that added a bunch
 more I/O *and* a bunch more statistics demands to this one block of
 parallel tests.

Interesting. But even if this commit tipped us over the edge, it's not a
proof that the split patch was perfectly correct.

 But even if we are vastly overstressing the I/O subsystem on these
 boxes, why is it manifesting like this? pgstat never fsyncs the stats
 temp file, so it should not have to wait for physical I/O I'd think.
 Or perhaps the file rename() operations get fsync'd behind the scenes
 by the filesystem?

My guess is that the amount of dirty data in the page cache reaches
dirty_ratio/dirty_bytes, effectively forcing the writes to go directly
to the disks. Those ARM machines have rather low amounts of RAM
(typically 256-512MB), and the default value for dirty_ratio is ~20%
IIRC. So that's ~50-100MB.

Tomas




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-25 Thread Michael Paquier
On Fri, Dec 26, 2014 at 6:28 AM, Tomas Vondra t...@fuzzy.cz wrote:
 On 25.12.2014 21:14, Andres Freund wrote:
 On 2014-12-25 14:36:42 -0500, Tom Lane wrote:

 My guess is that a checkpoint happened at that time. Maybe it'd be a
 good idea to make pg_regress start postgres with log_checkpoints
 enabled? My guess is that we'd find horrendous 'sync' times.

 Michael: Could you perhaps turn log_checkpoints on in the config?

 Logging timestamps (using log_line_prefix) would also be helpful.
Done. I have been wondering about those failures as well but didn't
get the time to look around. FYI, hamster is running with an 8GB class
10 SD card able to do 40MB/s in read, a card that I changed 2 weeks ago
btw; the former 4GB card was beginning to show I/O errors, RIP to it. So
maybe that is what was triggering the wait timeouts more frequently... But I
have no real explanation why REL9_4_STABLE shows no signs of failures.

For the time being, what about putting pg_stats_tmp onto a ram disk to
relieve the I/O? Whatever we come up with, I imagine that I/O
will still be tight on hamster.
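
A rough sketch of that idea on 9.4+, assuming /run/pg_stats_tmp already
exists as a tmpfs mount writable by the postgres user (the path is
illustrative; the same setting can of course go straight into
postgresql.conf instead):

-- stats_temp_directory only requires a reload, not a restart.
ALTER SYSTEM SET stats_temp_directory = '/run/pg_stats_tmp';
SELECT pg_reload_conf();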
-- 
Michael




Re: [HACKERS] Better way of dealing with pgstat wait timeout during buildfarm runs?

2014-12-25 Thread Alvaro Herrera
Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  So I think a better way to deal with that warning would be a good
  idea. Besides somehow making the mechanism itself more robust, there are
  two ways to attack this that I can think of, neither of them awe inspiring:
 
  1) Make that WARNING a LOG message instead, since those don't get sent
  to the client with default settings...
  2) Increase PGSTAT_MAX_WAIT_TIME even further than what 99b545 increased
  it to.
 
 Yeah, I've been getting more annoyed by that too lately.  I keep wondering
 though whether there's an actual bug underneath that behavior that we're
 failing to see.

I think the first thing to do is reconsider usage of PGSTAT_RETRY_DELAY
instead of PGSTAT_STAT_INTERVAL in autovacuum workers.  That decreases
the wait time 50-fold, if I recall this correctly, and causes large
amounts of extra I/O traffic.

 BTW, I notice that in the current state of pgstat.c, all the logic for
 keeping track of request arrival times is dead code, because nothing is
 actually looking at DBWriteRequest.request_time.  This makes me think that
 somebody simplified away some logic we maybe should have kept.

I will have a look.  I remember being confused about this at some point
when reviewing that patch.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

