Re: [HACKERS] jacana hung after failing to acquire random number
On 12/12/2016 09:02 AM, Heikki Linnakangas wrote: On 12/12/2016 03:40 PM, Andrew Dunstan wrote: Or should we at least get pg_regress to try to shut down the postmaster if it can't connect after 120 seconds? I think that makes a lot of sense, independently of this random stuff. I will add it to my long TODO list. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jacana hung after failing to acquire random number
Heikki Linnakangas writes: > On 12/12/2016 03:40 PM, Andrew Dunstan wrote: >> Should one or more of these errors be fatal? Or should we at least get >> pg_regress to try to shut down the postmaster if it can't connect after >> 120 seconds? > Making it fatal, i.e. bringing down the server, doesn't seem like an > improvement. If the failure is transient, you don't want to kill the > whole server, when one connection attempt fails. > It would be nice to fail earlier if it's permanently failing, though. > Like, if someone does "rm /dev/urandom". Perhaps we should perform one > pg_strong_random() call at postmaster startup, and if that fails, refuse > to start up. That's sort of contradictory. If you're worried about transient failures, allowing a single failed try to cause postmaster startup failure isn't the way to make things more robust. Giving up after a bunch of failed tries over a very short interval isn't much better. I'm not sure how hard we need to work here. The case at hand seems to be one of simply not having gotten the bugs out of the initial implementation, so maybe we shouldn't read too much into it. I do agree that the buildfarm needs to be more robust against broken postmasters, because finding bugs is its raison d' etre. But I'm not convinced that it's a good idea to have the postmaster itself conclude that there's something wrong with its configured random-number source. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jacana hung after failing to acquire random number
On 12/12/2016 03:40 PM, Andrew Dunstan wrote: Or should we at least get pg_regress to try to shut down the postmaster if it can't connect after 120 seconds? I think that makes a lot of sense, independently of this random stuff. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jacana hung after failing to acquire random number
On 12/12/2016 03:40 PM, Andrew Dunstan wrote: On 12/12/2016 02:32 AM, Heikki Linnakangas wrote: On 12/12/2016 05:58 AM, Michael Paquier wrote: On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan wrote: jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make check". After starting the autovacuum launcher there are 120 messages on its log about "Could not acquire random number". Then nothing. So I suspect the problem here is commit fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in detail. Shouldn't we want the postmaster to shut down if it's not going to go further? Note that frogmouth, also mingw, which builds with openssl, doesn't have this issue. Did you unlock it in some way at the end? Here is the shape of the report for others: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2016-12-10%2022%3A00%3A15 And here is of course the interesting bit: 2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG: could not acquire random number 2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG: could not acquire random number 2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG: could not acquire random number I am not seeing any problems with MSVC without openssl, so that's a problem proper to MinGW. I am getting to wonder if it is actually a good idea to cache the crypt context and then re-use it. Using a new context all the time is definitely not performance-wise though. Actually, looking at the config.log on jacana, it's trying to use /dev/urandom: configure:15028: checking for /dev/urandom configure:15041: result: yes configure:15054: checking which random number source to use configure:15073: result: /dev/urandom And looking closer at configure.in, I can see why: elif test "$PORTNAME" = x"win32" ; then USE_WIN32_RANDOM=1 That test is broken. It looks like the x"$VAR" = x"constant" idiom, but the left side of the comparison doesn't have the 'x'. Oops. Fixed that, let's see if it made jacana happy again. This makes me wonder if we should work a bit harder to get a good error message, if acquiring a random number fails for any reason. This needs to work in the frontend as well backend, but we could still have an elog(LOG, ...) there, inside an #ifndef FRONTEND block. I see you have now improved the messages in postmaster.c, which is good. Well, I only wordsmithed them a bit, it still doesn't give much clue on why it failed. We should add more details to it. However, the bigger problem (ISTM) is that when this failed I had a system which was running but where every connection immediately failed: == creating temporary instance== == initializing database system == == starting postmaster== pg_regress: postmaster did not respond within 120 seconds Examine c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/log/postmaster.log for the reason make: *** [check] Error 2 Should one or more of these errors be fatal? Or should we at least get pg_regress to try to shut down the postmaster if it can't connect after 120 seconds? Making it fatal, i.e. bringing down the server, doesn't seem like an improvement. If the failure is transient, you don't want to kill the whole server, when one connection attempt fails. It would be nice to fail earlier if it's permanently failing, though. Like, if someone does "rm /dev/urandom". Perhaps we should perform one pg_strong_random() call at postmaster startup, and if that fails, refuse to start up. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jacana hung after failing to acquire random number
On 12/12/2016 02:32 AM, Heikki Linnakangas wrote: On 12/12/2016 05:58 AM, Michael Paquier wrote: On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan wrote: jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make check". After starting the autovacuum launcher there are 120 messages on its log about "Could not acquire random number". Then nothing. So I suspect the problem here is commit fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in detail. Shouldn't we want the postmaster to shut down if it's not going to go further? Note that frogmouth, also mingw, which builds with openssl, doesn't have this issue. Did you unlock it in some way at the end? Here is the shape of the report for others: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2016-12-10%2022%3A00%3A15 And here is of course the interesting bit: 2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG: could not acquire random number 2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG: could not acquire random number 2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG: could not acquire random number I am not seeing any problems with MSVC without openssl, so that's a problem proper to MinGW. I am getting to wonder if it is actually a good idea to cache the crypt context and then re-use it. Using a new context all the time is definitely not performance-wise though. Actually, looking at the config.log on jacana, it's trying to use /dev/urandom: configure:15028: checking for /dev/urandom configure:15041: result: yes configure:15054: checking which random number source to use configure:15073: result: /dev/urandom And looking closer at configure.in, I can see why: elif test "$PORTNAME" = x"win32" ; then USE_WIN32_RANDOM=1 That test is broken. It looks like the x"$VAR" = x"constant" idiom, but the left side of the comparison doesn't have the 'x'. Oops. Fixed that, let's see if it made jacana happy again. This makes me wonder if we should work a bit harder to get a good error message, if acquiring a random number fails for any reason. This needs to work in the frontend as well backend, but we could still have an elog(LOG, ...) there, inside an #ifndef FRONTEND block. I see you have now improved the messages in postmaster.c, which is good. However, the bigger problem (ISTM) is that when this failed I had a system which was running but where every connection immediately failed: == creating temporary instance== == initializing database system == == starting postmaster== pg_regress: postmaster did not respond within 120 seconds Examine c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/log/postmaster.log for the reason make: *** [check] Error 2 Should one or more of these errors be fatal? Or should we at least get pg_regress to try to shut down the postmaster if it can't connect after 120 seconds? [In answer to Michael's question above, I forcibly shut down the postmaster by hand. Otherwise it would still be running, and we would not have got the report on the buildfarm server.] cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jacana hung after failing to acquire random number
On Mon, Dec 12, 2016 at 4:32 PM, Heikki Linnakangas wrote: > That test is broken. It looks like the x"$VAR" = x"constant" idiom, but the > left side of the comparison doesn't have the 'x'. Oops. Good catch. > This makes me wonder if we should work a bit harder to get a good error > message, if acquiring a random number fails for any reason. This needs to > work in the frontend as well backend, but we could still have an elog(LOG, > ...) there, inside an #ifndef FRONTEND block. Yeah, this has been itching me as well. We could allocate an error string in a psprintf()'d string and let the callers of pg_strong_backend() use it as they are responsible for the error handling. What do you think? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jacana hung after failing to acquire random number
On 12/12/2016 05:58 AM, Michael Paquier wrote: On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan wrote: jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make check". After starting the autovacuum launcher there are 120 messages on its log about "Could not acquire random number". Then nothing. So I suspect the problem here is commit fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in detail. Shouldn't we want the postmaster to shut down if it's not going to go further? Note that frogmouth, also mingw, which builds with openssl, doesn't have this issue. Did you unlock it in some way at the end? Here is the shape of the report for others: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2016-12-10%2022%3A00%3A15 And here is of course the interesting bit: 2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG: could not acquire random number 2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG: could not acquire random number 2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG: could not acquire random number I am not seeing any problems with MSVC without openssl, so that's a problem proper to MinGW. I am getting to wonder if it is actually a good idea to cache the crypt context and then re-use it. Using a new context all the time is definitely not performance-wise though. Actually, looking at the config.log on jacana, it's trying to use /dev/urandom: configure:15028: checking for /dev/urandom configure:15041: result: yes configure:15054: checking which random number source to use configure:15073: result: /dev/urandom And looking closer at configure.in, I can see why: elif test "$PORTNAME" = x"win32" ; then USE_WIN32_RANDOM=1 That test is broken. It looks like the x"$VAR" = x"constant" idiom, but the left side of the comparison doesn't have the 'x'. Oops. Fixed that, let's see if it made jacana happy again. This makes me wonder if we should work a bit harder to get a good error message, if acquiring a random number fails for any reason. This needs to work in the frontend as well backend, but we could still have an elog(LOG, ...) there, inside an #ifndef FRONTEND block. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jacana hung after failing to acquire random number
On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan wrote: > > jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make > check". After starting the autovacuum launcher there are 120 messages on its > log about "Could not acquire random number". Then nothing. > > > So I suspect the problem here is commit > fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in > detail. > > > Shouldn't we want the postmaster to shut down if it's not going to go > further? Note that frogmouth, also mingw, which builds with openssl, doesn't > have this issue. Did you unlock it in some way at the end? Here is the shape of the report for others: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2016-12-10%2022%3A00%3A15 And here is of course the interesting bit: 2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG: could not acquire random number 2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG: could not acquire random number 2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG: could not acquire random number I am not seeing any problems with MSVC without openssl, so that's a problem proper to MinGW. I am getting to wonder if it is actually a good idea to cache the crypt context and then re-use it. Using a new context all the time is definitely not performance-wise though. A second difference are the missing CRYPT_MACHINE_KEYSET and CRYP_NEWKEYSET. So, with something like the patch attached, do you see improvements? What this patch does is to use a different context at each call, and with the key container flags to allow proper access to it (as there are winxp-only limitations here). I have tried to compile with MinGW in my environment but my gcc compiler keep crashing, so I cannot reproduce directly the problem I am afraid. -- Michael diff --git a/src/port/pg_strong_random.c b/src/port/pg_strong_random.c index 6d3aa38..34cab6f 100644 --- a/src/port/pg_strong_random.c +++ b/src/port/pg_strong_random.c @@ -31,14 +31,6 @@ #include #endif -#ifdef WIN32 -/* - * Cache a global crypto provider that only gets freed when the process - * exits, in case we need random numbers more than once. - */ -static HCRYPTPROV hProvider = 0; -#endif - #if defined(USE_DEV_URANDOM) /* * Read (random) bytes from a file. @@ -111,28 +103,36 @@ pg_strong_random(void *buf, size_t len) * Windows has CryptoAPI for strong cryptographic numbers. */ #elif defined(USE_WIN32_RANDOM) - if (hProvider == 0) + HCRYPTPROV hProvider; + DWORD flags; + + flags = CRYPT_VERIFYCONTEXT | CRYPT_SILENT | CRYPT_MACHINE_KEYSET; + + /* Create a crypto provider */ + if (!CryptAcquireContext(&hProvider, + NULL, + MS_DEF_PROV, + PROV_RSA_FULL, + flags)) { + /* If previous creation failed, try with a new key container */ + flags |= CRYPT_NEWKEYSET; if (!CryptAcquireContext(&hProvider, NULL, MS_DEF_PROV, PROV_RSA_FULL, - CRYPT_VERIFYCONTEXT | CRYPT_SILENT)) - { - /* - * On failure, set back to 0 in case the value was for some reason - * modified. - */ - hProvider = 0; - } + flags)) + return false; } - /* Re-check in case we just retrieved the provider */ - if (hProvider != 0) + + if (!CryptGenRandom(hProvider, len, buf)) { - if (CryptGenRandom(hProvider, len, buf)) - return true; + CryptReleaseContext(hProvider, 0); + return false; } - return false; + + CryptReleaseContext(hProvider, 0); + return true; /* * Read /dev/urandom ourselves. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] jacana hung after failing to acquire random number
jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make check". After starting the autovacuum launcher there are 120 messages on its log about "Could not acquire random number". Then nothing. So I suspect the problem here is commit fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in detail. Shouldn't we want the postmaster to shut down if it's not going to go further? Note that frogmouth, also mingw, which builds with openssl, doesn't have this issue. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers