Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

2017-06-04 Thread Noah Misch
On Tue, Apr 18, 2017 at 09:59:26PM +0900, Michael Paquier wrote:
> On Tue, Apr 18, 2017 at 9:35 PM, Andrew Dunstan 
>  wrote:
> > On 04/18/2017 08:23 AM, Michael Paquier wrote:
> >> Increasing wal_sender_timeout and wal_receiver_timeout can help in
> >> reducing the failures seen.
> >
> > OK, but you're only talking about a handful of these, right?
> 
> Yup, that would be one solution but that's not attacking the problem
> at its root.
> 
> > Lets's say we have a bunch of possible environment settings with names
> > that all begin with "PG_TAP_" PostgresNode.pm could check for the
> > existence of these and take action accordingly, and you could set them
> > on a buildfarm animal in the config file, or for interactive use in your
> > .profile.
> 
> That's the point I am trying to make upthread: slow buildfarm animals
> should have minimal impact on core code modifications. We could for
> example have one environment variable that lists all the parameters to
> modify in a single string and appends them at the end of
> postgresql.conf. But honestly I don't think that this is necessary if
> there is only one variable able to define a base directory for
> temporary statistics as the real bottleneck comes from there at least
> in the case of hamster. When initializing a node via PostgresNode.pm,
> we would just check for this variable, and the init() routine just
> creates a temporary folder in it, setting up temp_stats_path in
> postgresql.conf.

Each of the above approaches has fairly low impact on the code, so we should
use other criteria to choose.  I'd welcome a feature for augmenting every
postgresql.conf of every test suite (a generalization of "pg_regress
--temp-config", which has proven its value).  I can envision using it with
force_parallel_mode, default_transaction_isolation, log_*, wal_*_timeout,
autovacuum_naptime, and others.

Even for hamster, I'm skeptical that changing stats_temp_directory would
suffice.  Every hamster BinInstallCheck failure since 2017-02-13 had a "LOG:
terminating walsender process due to replication timeout".  Most, but not all,
of those replication timeouts followed a "LOG:  using stale statistics instead
of current ones because stats collector is not responding".  For the remaining
minority, I expect to eventually need wal_sender_timeout.  Example:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2017-02-24%2016%3A00%3A06


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

2017-04-19 Thread Michael Paquier
On Wed, Apr 19, 2017 at 7:03 AM, Michael Paquier
 wrote:
> On Wed, Apr 19, 2017 at 12:48 AM, Tom Lane  wrote:
>> Alvaro Herrera  writes:
>>> Tom Lane wrote:
 FWIW, I'm a bit suspicious of relocating the temp stats directory as
 being a reliable fix for this.
>>
>>> It's an SD card (the kind typically used in cameras and phones), not SSD.
>>> Saying it's slow is an understatement.  It's *excruciatingly* slow.
>>
>> Oh, I misread it ... but still, the modern definition of "excruciatingly
>> slow" doesn't seem all that far off what 90s-era hard drives could do.
>> It is clear from googling though that there's an enormous performance
>> range in SD cards' random write performance, eg wikipedia's entry has
>> a link to
>>
>> http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/
>>
>> Seems like it's hard to judge this without knowing exactly which
>> SD card Michael has got in that thing.
>
> Professional micro SD HC 8GB, with class 10, which is utterly slow in
> the things I have tested up to now and of course I cannot find its
> read/write properties. I had a SanDisk class 10 in it that died some
> months ago, and that was way faster. RIP.

Until something happens, I have removed --enable-tap-tests from the
script of this buildfarm machine. There is no point to let it in red
without a clear resolution plan on the table first.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

2017-04-18 Thread Michael Paquier
On Wed, Apr 19, 2017 at 12:48 AM, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> Tom Lane wrote:
>>> FWIW, I'm a bit suspicious of relocating the temp stats directory as
>>> being a reliable fix for this.
>
>> It's an SD card (the kind typically used in cameras and phones), not SSD.
>> Saying it's slow is an understatement.  It's *excruciatingly* slow.
>
> Oh, I misread it ... but still, the modern definition of "excruciatingly
> slow" doesn't seem all that far off what 90s-era hard drives could do.
> It is clear from googling though that there's an enormous performance
> range in SD cards' random write performance, eg wikipedia's entry has
> a link to
>
> http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/
>
> Seems like it's hard to judge this without knowing exactly which
> SD card Michael has got in that thing.

Professional micro SD HC 8GB, with class 10, which is utterly slow in
the things I have tested up to now and of course I cannot find its
read/write properties. I had a SanDisk class 10 in it that died some
months ago, and that was way faster. RIP.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

2017-04-18 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> FWIW, I'm a bit suspicious of relocating the temp stats directory as
>> being a reliable fix for this.

> It's an SD card (the kind typically used in cameras and phones), not SSD.
> Saying it's slow is an understatement.  It's *excruciatingly* slow.

Oh, I misread it ... but still, the modern definition of "excruciatingly
slow" doesn't seem all that far off what 90s-era hard drives could do.
It is clear from googling though that there's an enormous performance
range in SD cards' random write performance, eg wikipedia's entry has
a link to

http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/

Seems like it's hard to judge this without knowing exactly which
SD card Michael has got in that thing.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

2017-04-18 Thread Alvaro Herrera
Tom Lane wrote:
> Michael Paquier  writes:
> > That's the point I am trying to make upthread: slow buildfarm animals
> > should have minimal impact on core code modifications. We could for
> > example have one environment variable that lists all the parameters to
> > modify in a single string and appends them at the end of
> > postgresql.conf. But honestly I don't think that this is necessary if
> > there is only one variable able to define a base directory for
> > temporary statistics as the real bottleneck comes from there at least
> > in the case of hamster.
> 
> FWIW, I'm a bit suspicious of relocating the temp stats directory as
> being a reliable fix for this.  It looks to me like hamster isn't that
> much slower than gaur/pademelon, ie the same machine that was my primary
> development machine for well over a decade, and on which I have NEVER
> seen a "stats collector not responding" failure.  Plus, if hamster's
> main storage is SSD, that seems unlikely to be slower than the spinning
> rust in gaur/pademelon.  So I can't escape the suspicion that something
> else is going on there.  Seemingly-unexplainable stats collector problems
> have been a bugaboo for a long time ...

It's an SD card (the kind typically used in cameras and phones), not SSD.
Saying it's slow is an understatement.  It's *excruciatingly* slow.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

2017-04-18 Thread Tom Lane
Michael Paquier  writes:
> That's the point I am trying to make upthread: slow buildfarm animals
> should have minimal impact on core code modifications. We could for
> example have one environment variable that lists all the parameters to
> modify in a single string and appends them at the end of
> postgresql.conf. But honestly I don't think that this is necessary if
> there is only one variable able to define a base directory for
> temporary statistics as the real bottleneck comes from there at least
> in the case of hamster.

FWIW, I'm a bit suspicious of relocating the temp stats directory as
being a reliable fix for this.  It looks to me like hamster isn't that
much slower than gaur/pademelon, ie the same machine that was my primary
development machine for well over a decade, and on which I have NEVER
seen a "stats collector not responding" failure.  Plus, if hamster's
main storage is SSD, that seems unlikely to be slower than the spinning
rust in gaur/pademelon.  So I can't escape the suspicion that something
else is going on there.  Seemingly-unexplainable stats collector problems
have been a bugaboo for a long time ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

2017-04-18 Thread Andrew Dunstan


On 04/18/2017 08:59 AM, Michael Paquier wrote:
>> Lets's say we have a bunch of possible environment settings with names
>> that all begin with "PG_TAP_" PostgresNode.pm could check for the
>> existence of these and take action accordingly, and you could set them
>> on a buildfarm animal in the config file, or for interactive use in your
>> .profile.
> That's the point I am trying to make upthread: slow buildfarm animals
> should have minimal impact on core code modifications. We could for
> example have one environment variable that lists all the parameters to
> modify in a single string and appends them at the end of
> postgresql.conf. But honestly I don't think that this is necessary if
> there is only one variable able to define a base directory for
> temporary statistics as the real bottleneck comes from there at least
> in the case of hamster. When initializing a node via PostgresNode.pm,
> we would just check for this variable, and the init() routine just
> creates a temporary folder in it, setting up temp_stats_path in
> postgresql.conf.



How is that going to deal with your wal_*_timeout etc settings?

cheers

andrew

-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

2017-04-18 Thread Michael Paquier
On Tue, Apr 18, 2017 at 9:35 PM, Andrew Dunstan
 wrote:
>
>
> On 04/18/2017 08:23 AM, Michael Paquier wrote:
>> On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan
>>  wrote:
>>> Yeah, but the way you have done it could also to lead to errors unless
>>> you're very careful, as I found on axolotl (which died recently,
>>> unfortunately). There I had to set the stats_temp directory to a
>>> branch-specific name so a crash on branch A didn't leave stats picked up
>>> by a run on branch B. I had a script that ran before every buildfarm run
>>> that made sure the branch-specific directories existed on the tmpfs.
>> I didn't consider that. Still hamster is only set to run master so
>> that's not a problem. Running more branches would be too much for this
>> little dear as it includes TAP tests.
>>
  but it is not possible to set up that with the
 TAP tests. I could always disable --enable-tap-tests on this machine
 but I don't think that this is a correct answer. Something that could
 be done is to use an environment variable to set up a base directory
 for all the nodes created in PostgresNode.pm, and use that for
 temporary statistics with a custom folder name. But that's not
 scalable if we want to enforce more parameter.
>>> What more parameters do you want?
>> Increasing wal_sender_timeout and wal_receiver_timeout can help in
>> reducing the failures seen.
>
> OK, but you're only talking about a handful of these, right?

Yup, that would be one solution but that's not attacking the problem
at its root.

> Lets's say we have a bunch of possible environment settings with names
> that all begin with "PG_TAP_" PostgresNode.pm could check for the
> existence of these and take action accordingly, and you could set them
> on a buildfarm animal in the config file, or for interactive use in your
> .profile.

That's the point I am trying to make upthread: slow buildfarm animals
should have minimal impact on core code modifications. We could for
example have one environment variable that lists all the parameters to
modify in a single string and appends them at the end of
postgresql.conf. But honestly I don't think that this is necessary if
there is only one variable able to define a base directory for
temporary statistics as the real bottleneck comes from there at least
in the case of hamster. When initializing a node via PostgresNode.pm,
we would just check for this variable, and the init() routine just
creates a temporary folder in it, setting up temp_stats_path in
postgresql.conf.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

2017-04-18 Thread Andrew Dunstan


On 04/18/2017 08:23 AM, Michael Paquier wrote:
> On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan
>  wrote:
>> Yeah, but the way you have done it could also to lead to errors unless
>> you're very careful, as I found on axolotl (which died recently,
>> unfortunately). There I had to set the stats_temp directory to a
>> branch-specific name so a crash on branch A didn't leave stats picked up
>> by a run on branch B. I had a script that ran before every buildfarm run
>> that made sure the branch-specific directories existed on the tmpfs.
> I didn't consider that. Still hamster is only set to run master so
> that's not a problem. Running more branches would be too much for this
> little dear as it includes TAP tests.
>
>>>  but it is not possible to set up that with the
>>> TAP tests. I could always disable --enable-tap-tests on this machine
>>> but I don't think that this is a correct answer. Something that could
>>> be done is to use an environment variable to set up a base directory
>>> for all the nodes created in PostgresNode.pm, and use that for
>>> temporary statistics with a custom folder name. But that's not
>>> scalable if we want to enforce more parameter.
>> What more parameters do you want?
> Increasing wal_sender_timeout and wal_receiver_timeout can help in
> reducing the failures seen.


OK, but you're only talking about a handful of these, right?

Lets's say we have a bunch of possible environment settings with names
that all begin with "PG_TAP_" PostgresNode.pm could check for the
existence of these and take action accordingly, and you could set them
on a buildfarm animal in the config file, or for interactive use in your
.profile.

cheers

andrew

-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

2017-04-18 Thread Michael Paquier
On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan
 wrote:
> Yeah, but the way you have done it could also to lead to errors unless
> you're very careful, as I found on axolotl (which died recently,
> unfortunately). There I had to set the stats_temp directory to a
> branch-specific name so a crash on branch A didn't leave stats picked up
> by a run on branch B. I had a script that ran before every buildfarm run
> that made sure the branch-specific directories existed on the tmpfs.

I didn't consider that. Still hamster is only set to run master so
that's not a problem. Running more branches would be too much for this
little dear as it includes TAP tests.

>>  but it is not possible to set up that with the
>> TAP tests. I could always disable --enable-tap-tests on this machine
>> but I don't think that this is a correct answer. Something that could
>> be done is to use an environment variable to set up a base directory
>> for all the nodes created in PostgresNode.pm, and use that for
>> temporary statistics with a custom folder name. But that's not
>> scalable if we want to enforce more parameter.
>
> What more parameters do you want?

Increasing wal_sender_timeout and wal_receiver_timeout can help in
reducing the failures seen.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

2017-04-18 Thread Andrew Dunstan


On 04/18/2017 03:07 AM, Michael Paquier wrote:
> Hi all,
>
> Some of you may have noticed that hamster is heavily red on the
> buildfarm. I have done a bit of investigation, and I am able to
> reproduce the failure manually. But actually after looking at the logs
> the error has obviously showed up:
> 2017-04-16 05:07:19.650 JST [18282] LOG:  database system is ready to
> accept connections
> 2017-04-16 05:08:36.725 JST [18296] LOG:  using stale statistics
> instead of current ones because stats collector is not responding
> 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
> terminating walsender process due to replication timeout
> 2017-04-16 05:10:30.180 JST [18306] LOG:  using stale statistics
> instead of current ones because stats collector is not responding
>
> Stale regressions means that the system is just constrained so much
> that things are timing out.
>
> In order to avoid such failures with normal regression tests, I have
> set up extra_config so as stats_temp_directory goes to a tmpfs to
> avoid stale statistics,


Yeah, but the way you have done it could also to lead to errors unless
you're very careful, as I found on axolotl (which died recently,
unfortunately). There I had to set the stats_temp directory to a
branch-specific name so a crash on branch A didn't leave stats picked up
by a run on branch B. I had a script that ran before every buildfarm run
that made sure the branch-specific directories existed on the tmpfs.



>  but it is not possible to set up that with the
> TAP tests. I could always disable --enable-tap-tests on this machine
> but I don't think that this is a correct answer. Something that could
> be done is to use an environment variable to set up a base directory
> for all the nodes created in PostgresNode.pm, and use that for
> temporary statistics with a custom folder name. But that's not
> scalable if we want to enforce more parameter.
>
>

What more parameters do you want?

cheers

andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

2017-04-18 Thread Michael Paquier
On Tue, Apr 18, 2017 at 4:15 PM, Andres Freund  wrote:
> Hi,
>
> On 2017-04-18 16:07:38 +0900, Michael Paquier wrote:
>> Some of you may have noticed that hamster is heavily red on the
>> buildfarm. I have done a bit of investigation, and I am able to
>> reproduce the failure manually. But actually after looking at the logs
>> the error has obviously showed up:
>> 2017-04-16 05:07:19.650 JST [18282] LOG:  database system is ready to
>> accept connections
>> 2017-04-16 05:08:36.725 JST [18296] LOG:  using stale statistics
>> instead of current ones because stats collector is not responding
>> 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
>> terminating walsender process due to replication timeout
>> 2017-04-16 05:10:30.180 JST [18306] LOG:  using stale statistics
>> instead of current ones because stats collector is not responding
>>
>> Stale regressions means that the system is just constrained so much
>> that things are timing out.
>>
>> In order to avoid such failures with normal regression tests, I have
>> set up extra_config so as stats_temp_directory goes to a tmpfs to
>> avoid stale statistics
>
> How high do you need to make the hardcoded limit for this to succeed
> without a tmpfs?

Increasing wal_sender_timeout helps visibly to reduce the failure
rate. With 10 attempts I can see before at least 3 failures, and
nothing after.

> If hamster fails this regularly I think we have to do
> something about it, rather than paper over it.  What's the storage
> situation currently like?

The SD card of this RPI is half-empty.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check

2017-04-18 Thread Andres Freund
Hi,

On 2017-04-18 16:07:38 +0900, Michael Paquier wrote:
> Some of you may have noticed that hamster is heavily red on the
> buildfarm. I have done a bit of investigation, and I am able to
> reproduce the failure manually. But actually after looking at the logs
> the error has obviously showed up:
> 2017-04-16 05:07:19.650 JST [18282] LOG:  database system is ready to
> accept connections
> 2017-04-16 05:08:36.725 JST [18296] LOG:  using stale statistics
> instead of current ones because stats collector is not responding
> 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
> terminating walsender process due to replication timeout
> 2017-04-16 05:10:30.180 JST [18306] LOG:  using stale statistics
> instead of current ones because stats collector is not responding
> 
> Stale regressions means that the system is just constrained so much
> that things are timing out.
> 
> In order to avoid such failures with normal regression tests, I have
> set up extra_config so as stats_temp_directory goes to a tmpfs to
> avoid stale statistics

How high do you need to make the hardcoded limit for this to succeed
without a tmpfs?  If hamster fails this regularly I think we have to do
something about it, rather than paper over it.  What's the storage
situation currently like?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers