Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
On Tue, Apr 18, 2017 at 09:59:26PM +0900, Michael Paquier wrote: > On Tue, Apr 18, 2017 at 9:35 PM, Andrew Dunstan > wrote: > > On 04/18/2017 08:23 AM, Michael Paquier wrote: > >> Increasing wal_sender_timeout and wal_receiver_timeout can help in > >> reducing the failures seen. > > > > OK, but you're only talking about a handful of these, right? > > Yup, that would be one solution but that's not attacking the problem > at its root. > > > Lets's say we have a bunch of possible environment settings with names > > that all begin with "PG_TAP_" PostgresNode.pm could check for the > > existence of these and take action accordingly, and you could set them > > on a buildfarm animal in the config file, or for interactive use in your > > .profile. > > That's the point I am trying to make upthread: slow buildfarm animals > should have minimal impact on core code modifications. We could for > example have one environment variable that lists all the parameters to > modify in a single string and appends them at the end of > postgresql.conf. But honestly I don't think that this is necessary if > there is only one variable able to define a base directory for > temporary statistics as the real bottleneck comes from there at least > in the case of hamster. When initializing a node via PostgresNode.pm, > we would just check for this variable, and the init() routine just > creates a temporary folder in it, setting up temp_stats_path in > postgresql.conf. Each of the above approaches has fairly low impact on the code, so we should use other criteria to choose. I'd welcome a feature for augmenting every postgresql.conf of every test suite (a generalization of "pg_regress --temp-config", which has proven its value). I can envision using it with force_parallel_mode, default_transaction_isolation, log_*, wal_*_timeout, autovacuum_naptime, and others. Even for hamster, I'm skeptical that changing stats_temp_directory would suffice. Every hamster BinInstallCheck failure since 2017-02-13 had a "LOG: terminating walsender process due to replication timeout". Most, but not all, of those replication timeouts followed a "LOG: using stale statistics instead of current ones because stats collector is not responding". For the remaining minority, I expect to eventually need wal_sender_timeout. Example: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2017-02-24%2016%3A00%3A06 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
On Wed, Apr 19, 2017 at 7:03 AM, Michael Paquier wrote: > On Wed, Apr 19, 2017 at 12:48 AM, Tom Lane wrote: >> Alvaro Herrera writes: >>> Tom Lane wrote: FWIW, I'm a bit suspicious of relocating the temp stats directory as being a reliable fix for this. >> >>> It's an SD card (the kind typically used in cameras and phones), not SSD. >>> Saying it's slow is an understatement. It's *excruciatingly* slow. >> >> Oh, I misread it ... but still, the modern definition of "excruciatingly >> slow" doesn't seem all that far off what 90s-era hard drives could do. >> It is clear from googling though that there's an enormous performance >> range in SD cards' random write performance, eg wikipedia's entry has >> a link to >> >> http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/ >> >> Seems like it's hard to judge this without knowing exactly which >> SD card Michael has got in that thing. > > Professional micro SD HC 8GB, with class 10, which is utterly slow in > the things I have tested up to now and of course I cannot find its > read/write properties. I had a SanDisk class 10 in it that died some > months ago, and that was way faster. RIP. Until something happens, I have removed --enable-tap-tests from the script of this buildfarm machine. There is no point to let it in red without a clear resolution plan on the table first. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
On Wed, Apr 19, 2017 at 12:48 AM, Tom Lane wrote: > Alvaro Herrera writes: >> Tom Lane wrote: >>> FWIW, I'm a bit suspicious of relocating the temp stats directory as >>> being a reliable fix for this. > >> It's an SD card (the kind typically used in cameras and phones), not SSD. >> Saying it's slow is an understatement. It's *excruciatingly* slow. > > Oh, I misread it ... but still, the modern definition of "excruciatingly > slow" doesn't seem all that far off what 90s-era hard drives could do. > It is clear from googling though that there's an enormous performance > range in SD cards' random write performance, eg wikipedia's entry has > a link to > > http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/ > > Seems like it's hard to judge this without knowing exactly which > SD card Michael has got in that thing. Professional micro SD HC 8GB, with class 10, which is utterly slow in the things I have tested up to now and of course I cannot find its read/write properties. I had a SanDisk class 10 in it that died some months ago, and that was way faster. RIP. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
Alvaro Herrera writes: > Tom Lane wrote: >> FWIW, I'm a bit suspicious of relocating the temp stats directory as >> being a reliable fix for this. > It's an SD card (the kind typically used in cameras and phones), not SSD. > Saying it's slow is an understatement. It's *excruciatingly* slow. Oh, I misread it ... but still, the modern definition of "excruciatingly slow" doesn't seem all that far off what 90s-era hard drives could do. It is clear from googling though that there's an enormous performance range in SD cards' random write performance, eg wikipedia's entry has a link to http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/ Seems like it's hard to judge this without knowing exactly which SD card Michael has got in that thing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
Tom Lane wrote: > Michael Paquier writes: > > That's the point I am trying to make upthread: slow buildfarm animals > > should have minimal impact on core code modifications. We could for > > example have one environment variable that lists all the parameters to > > modify in a single string and appends them at the end of > > postgresql.conf. But honestly I don't think that this is necessary if > > there is only one variable able to define a base directory for > > temporary statistics as the real bottleneck comes from there at least > > in the case of hamster. > > FWIW, I'm a bit suspicious of relocating the temp stats directory as > being a reliable fix for this. It looks to me like hamster isn't that > much slower than gaur/pademelon, ie the same machine that was my primary > development machine for well over a decade, and on which I have NEVER > seen a "stats collector not responding" failure. Plus, if hamster's > main storage is SSD, that seems unlikely to be slower than the spinning > rust in gaur/pademelon. So I can't escape the suspicion that something > else is going on there. Seemingly-unexplainable stats collector problems > have been a bugaboo for a long time ... It's an SD card (the kind typically used in cameras and phones), not SSD. Saying it's slow is an understatement. It's *excruciatingly* slow. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
Michael Paquier writes: > That's the point I am trying to make upthread: slow buildfarm animals > should have minimal impact on core code modifications. We could for > example have one environment variable that lists all the parameters to > modify in a single string and appends them at the end of > postgresql.conf. But honestly I don't think that this is necessary if > there is only one variable able to define a base directory for > temporary statistics as the real bottleneck comes from there at least > in the case of hamster. FWIW, I'm a bit suspicious of relocating the temp stats directory as being a reliable fix for this. It looks to me like hamster isn't that much slower than gaur/pademelon, ie the same machine that was my primary development machine for well over a decade, and on which I have NEVER seen a "stats collector not responding" failure. Plus, if hamster's main storage is SSD, that seems unlikely to be slower than the spinning rust in gaur/pademelon. So I can't escape the suspicion that something else is going on there. Seemingly-unexplainable stats collector problems have been a bugaboo for a long time ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
On 04/18/2017 08:59 AM, Michael Paquier wrote: >> Lets's say we have a bunch of possible environment settings with names >> that all begin with "PG_TAP_" PostgresNode.pm could check for the >> existence of these and take action accordingly, and you could set them >> on a buildfarm animal in the config file, or for interactive use in your >> .profile. > That's the point I am trying to make upthread: slow buildfarm animals > should have minimal impact on core code modifications. We could for > example have one environment variable that lists all the parameters to > modify in a single string and appends them at the end of > postgresql.conf. But honestly I don't think that this is necessary if > there is only one variable able to define a base directory for > temporary statistics as the real bottleneck comes from there at least > in the case of hamster. When initializing a node via PostgresNode.pm, > we would just check for this variable, and the init() routine just > creates a temporary folder in it, setting up temp_stats_path in > postgresql.conf. How is that going to deal with your wal_*_timeout etc settings? cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
On Tue, Apr 18, 2017 at 9:35 PM, Andrew Dunstan wrote: > > > On 04/18/2017 08:23 AM, Michael Paquier wrote: >> On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan >> wrote: >>> Yeah, but the way you have done it could also to lead to errors unless >>> you're very careful, as I found on axolotl (which died recently, >>> unfortunately). There I had to set the stats_temp directory to a >>> branch-specific name so a crash on branch A didn't leave stats picked up >>> by a run on branch B. I had a script that ran before every buildfarm run >>> that made sure the branch-specific directories existed on the tmpfs. >> I didn't consider that. Still hamster is only set to run master so >> that's not a problem. Running more branches would be too much for this >> little dear as it includes TAP tests. >> but it is not possible to set up that with the TAP tests. I could always disable --enable-tap-tests on this machine but I don't think that this is a correct answer. Something that could be done is to use an environment variable to set up a base directory for all the nodes created in PostgresNode.pm, and use that for temporary statistics with a custom folder name. But that's not scalable if we want to enforce more parameter. >>> What more parameters do you want? >> Increasing wal_sender_timeout and wal_receiver_timeout can help in >> reducing the failures seen. > > OK, but you're only talking about a handful of these, right? Yup, that would be one solution but that's not attacking the problem at its root. > Lets's say we have a bunch of possible environment settings with names > that all begin with "PG_TAP_" PostgresNode.pm could check for the > existence of these and take action accordingly, and you could set them > on a buildfarm animal in the config file, or for interactive use in your > .profile. That's the point I am trying to make upthread: slow buildfarm animals should have minimal impact on core code modifications. We could for example have one environment variable that lists all the parameters to modify in a single string and appends them at the end of postgresql.conf. But honestly I don't think that this is necessary if there is only one variable able to define a base directory for temporary statistics as the real bottleneck comes from there at least in the case of hamster. When initializing a node via PostgresNode.pm, we would just check for this variable, and the init() routine just creates a temporary folder in it, setting up temp_stats_path in postgresql.conf. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
On 04/18/2017 08:23 AM, Michael Paquier wrote: > On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan > wrote: >> Yeah, but the way you have done it could also to lead to errors unless >> you're very careful, as I found on axolotl (which died recently, >> unfortunately). There I had to set the stats_temp directory to a >> branch-specific name so a crash on branch A didn't leave stats picked up >> by a run on branch B. I had a script that ran before every buildfarm run >> that made sure the branch-specific directories existed on the tmpfs. > I didn't consider that. Still hamster is only set to run master so > that's not a problem. Running more branches would be too much for this > little dear as it includes TAP tests. > >>> but it is not possible to set up that with the >>> TAP tests. I could always disable --enable-tap-tests on this machine >>> but I don't think that this is a correct answer. Something that could >>> be done is to use an environment variable to set up a base directory >>> for all the nodes created in PostgresNode.pm, and use that for >>> temporary statistics with a custom folder name. But that's not >>> scalable if we want to enforce more parameter. >> What more parameters do you want? > Increasing wal_sender_timeout and wal_receiver_timeout can help in > reducing the failures seen. OK, but you're only talking about a handful of these, right? Lets's say we have a bunch of possible environment settings with names that all begin with "PG_TAP_" PostgresNode.pm could check for the existence of these and take action accordingly, and you could set them on a buildfarm animal in the config file, or for interactive use in your .profile. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan wrote: > Yeah, but the way you have done it could also to lead to errors unless > you're very careful, as I found on axolotl (which died recently, > unfortunately). There I had to set the stats_temp directory to a > branch-specific name so a crash on branch A didn't leave stats picked up > by a run on branch B. I had a script that ran before every buildfarm run > that made sure the branch-specific directories existed on the tmpfs. I didn't consider that. Still hamster is only set to run master so that's not a problem. Running more branches would be too much for this little dear as it includes TAP tests. >> but it is not possible to set up that with the >> TAP tests. I could always disable --enable-tap-tests on this machine >> but I don't think that this is a correct answer. Something that could >> be done is to use an environment variable to set up a base directory >> for all the nodes created in PostgresNode.pm, and use that for >> temporary statistics with a custom folder name. But that's not >> scalable if we want to enforce more parameter. > > What more parameters do you want? Increasing wal_sender_timeout and wal_receiver_timeout can help in reducing the failures seen. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
On 04/18/2017 03:07 AM, Michael Paquier wrote: > Hi all, > > Some of you may have noticed that hamster is heavily red on the > buildfarm. I have done a bit of investigation, and I am able to > reproduce the failure manually. But actually after looking at the logs > the error has obviously showed up: > 2017-04-16 05:07:19.650 JST [18282] LOG: database system is ready to > accept connections > 2017-04-16 05:08:36.725 JST [18296] LOG: using stale statistics > instead of current ones because stats collector is not responding > 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG: > terminating walsender process due to replication timeout > 2017-04-16 05:10:30.180 JST [18306] LOG: using stale statistics > instead of current ones because stats collector is not responding > > Stale regressions means that the system is just constrained so much > that things are timing out. > > In order to avoid such failures with normal regression tests, I have > set up extra_config so as stats_temp_directory goes to a tmpfs to > avoid stale statistics, Yeah, but the way you have done it could also to lead to errors unless you're very careful, as I found on axolotl (which died recently, unfortunately). There I had to set the stats_temp directory to a branch-specific name so a crash on branch A didn't leave stats picked up by a run on branch B. I had a script that ran before every buildfarm run that made sure the branch-specific directories existed on the tmpfs. > but it is not possible to set up that with the > TAP tests. I could always disable --enable-tap-tests on this machine > but I don't think that this is a correct answer. Something that could > be done is to use an environment variable to set up a base directory > for all the nodes created in PostgresNode.pm, and use that for > temporary statistics with a custom folder name. But that's not > scalable if we want to enforce more parameter. > > What more parameters do you want? cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
On Tue, Apr 18, 2017 at 4:15 PM, Andres Freund wrote: > Hi, > > On 2017-04-18 16:07:38 +0900, Michael Paquier wrote: >> Some of you may have noticed that hamster is heavily red on the >> buildfarm. I have done a bit of investigation, and I am able to >> reproduce the failure manually. But actually after looking at the logs >> the error has obviously showed up: >> 2017-04-16 05:07:19.650 JST [18282] LOG: database system is ready to >> accept connections >> 2017-04-16 05:08:36.725 JST [18296] LOG: using stale statistics >> instead of current ones because stats collector is not responding >> 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG: >> terminating walsender process due to replication timeout >> 2017-04-16 05:10:30.180 JST [18306] LOG: using stale statistics >> instead of current ones because stats collector is not responding >> >> Stale regressions means that the system is just constrained so much >> that things are timing out. >> >> In order to avoid such failures with normal regression tests, I have >> set up extra_config so as stats_temp_directory goes to a tmpfs to >> avoid stale statistics > > How high do you need to make the hardcoded limit for this to succeed > without a tmpfs? Increasing wal_sender_timeout helps visibly to reduce the failure rate. With 10 attempts I can see before at least 3 failures, and nothing after. > If hamster fails this regularly I think we have to do > something about it, rather than paper over it. What's the storage > situation currently like? The SD card of this RPI is half-empty. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
Hi, On 2017-04-18 16:07:38 +0900, Michael Paquier wrote: > Some of you may have noticed that hamster is heavily red on the > buildfarm. I have done a bit of investigation, and I am able to > reproduce the failure manually. But actually after looking at the logs > the error has obviously showed up: > 2017-04-16 05:07:19.650 JST [18282] LOG: database system is ready to > accept connections > 2017-04-16 05:08:36.725 JST [18296] LOG: using stale statistics > instead of current ones because stats collector is not responding > 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG: > terminating walsender process due to replication timeout > 2017-04-16 05:10:30.180 JST [18306] LOG: using stale statistics > instead of current ones because stats collector is not responding > > Stale regressions means that the system is just constrained so much > that things are timing out. > > In order to avoid such failures with normal regression tests, I have > set up extra_config so as stats_temp_directory goes to a tmpfs to > avoid stale statistics How high do you need to make the hardcoded limit for this to succeed without a tmpfs? If hamster fails this regularly I think we have to do something about it, rather than paper over it. What's the storage situation currently like? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers