Re: [HACKERS] jacana hung after failing to acquire random number

2016-12-12 Thread Andrew Dunstan



On 12/12/2016 09:02 AM, Heikki Linnakangas wrote:

On 12/12/2016 03:40 PM, Andrew Dunstan wrote:

Or should we at least get pg_regress to try to shut down the
postmaster if it can't connect after 120 seconds?


I think that makes a lot of sense, independently of this random stuff.





I will add it to my long TODO list.

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jacana hung after failing to acquire random number

2016-12-12 Thread Tom Lane
Heikki Linnakangas  writes:
> On 12/12/2016 03:40 PM, Andrew Dunstan wrote:
>> Should one or more of these errors be fatal? Or should we at least get
>> pg_regress to try to shut down the postmaster if it can't connect after
>> 120 seconds?

> Making it fatal, i.e. bringing down the server, doesn't seem like an 
> improvement. If the failure is transient, you don't want to kill the 
> whole server, when one connection attempt fails.

> It would be nice to fail earlier if it's permanently failing, though. 
> Like, if someone does "rm /dev/urandom". Perhaps we should perform one 
> pg_strong_random() call at postmaster startup, and if that fails, refuse 
> to start up.

That's sort of contradictory.  If you're worried about transient failures,
allowing a single failed try to cause postmaster startup failure isn't the
way to make things more robust.  Giving up after a bunch of failed tries
over a very short interval isn't much better.

I'm not sure how hard we need to work here.  The case at hand seems
to be one of simply not having gotten the bugs out of the initial
implementation, so maybe we shouldn't read too much into it.

I do agree that the buildfarm needs to be more robust against broken
postmasters, because finding bugs is its raison d' etre.  But I'm not
convinced that it's a good idea to have the postmaster itself conclude
that there's something wrong with its configured random-number source.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jacana hung after failing to acquire random number

2016-12-12 Thread Heikki Linnakangas

On 12/12/2016 03:40 PM, Andrew Dunstan wrote:

Or should we at least get pg_regress to try to shut down the
postmaster if it can't connect after 120 seconds?


I think that makes a lot of sense, independently of this random stuff.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jacana hung after failing to acquire random number

2016-12-12 Thread Heikki Linnakangas

On 12/12/2016 03:40 PM, Andrew Dunstan wrote:



On 12/12/2016 02:32 AM, Heikki Linnakangas wrote:

On 12/12/2016 05:58 AM, Michael Paquier wrote:

On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan 
wrote:


jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make
check". After starting the autovacuum launcher there are 120
messages on its
log about "Could not acquire random number". Then nothing.


So I suspect the problem here is commit
fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in
detail.


Shouldn't we want the postmaster to shut down if it's not going to go
further? Note that frogmouth, also mingw, which builds with openssl,
doesn't
have this issue.


Did you unlock it in some way at the end? Here is the shape of the
report for others:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana=2016-12-10%2022%3A00%3A15

And here is of course the interesting bit:
2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG:  could not acquire
random number
2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG:  could not acquire
random number
2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG:  could not acquire
random number

I am not seeing any problems with MSVC without openssl, so that's a
problem proper to MinGW. I am getting to wonder if it is actually a
good idea to cache the crypt context and then re-use it. Using a new
context all the time is definitely not performance-wise though.


Actually, looking at the config.log on jacana, it's trying to use
/dev/urandom:

configure:15028: checking for /dev/urandom
configure:15041: result: yes
configure:15054: checking which random number source to use
configure:15073: result: /dev/urandom

And looking closer at configure.in, I can see why:

  elif test "$PORTNAME" = x"win32" ; then
USE_WIN32_RANDOM=1

That test is broken. It looks like the x"$VAR" = x"constant" idiom,
but the left side of the comparison doesn't have the 'x'. Oops.

Fixed that, let's see if it made jacana happy again.

This makes me wonder if we should work a bit harder to get a good
error message, if acquiring a random number fails for any reason. This
needs to work in the frontend as well backend, but we could still have
an elog(LOG, ...) there, inside an #ifndef FRONTEND block.



I see you have now improved the messages in postmaster.c, which is good.


Well, I only wordsmithed them a bit, it still doesn't give much clue on 
why it failed. We should add more details to it.



However, the bigger problem (ISTM) is that when this failed I had a
system which was running but where every connection immediately failed:

== creating temporary instance==
== initializing database system   ==
== starting postmaster==

pg_regress: postmaster did not respond within 120 seconds
Examine 
c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/log/postmaster.log
 for the reason
make: *** [check] Error 2

Should one or more of these errors be fatal? Or should we at least get
pg_regress to try to shut down the postmaster if it can't connect after
120 seconds?


Making it fatal, i.e. bringing down the server, doesn't seem like an 
improvement. If the failure is transient, you don't want to kill the 
whole server, when one connection attempt fails.


It would be nice to fail earlier if it's permanently failing, though. 
Like, if someone does "rm /dev/urandom". Perhaps we should perform one 
pg_strong_random() call at postmaster startup, and if that fails, refuse 
to start up.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jacana hung after failing to acquire random number

2016-12-12 Thread Andrew Dunstan



On 12/12/2016 02:32 AM, Heikki Linnakangas wrote:

On 12/12/2016 05:58 AM, Michael Paquier wrote:
On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan  
wrote:


jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make
check". After starting the autovacuum launcher there are 120 
messages on its

log about "Could not acquire random number". Then nothing.


So I suspect the problem here is commit
fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in
detail.


Shouldn't we want the postmaster to shut down if it's not going to go
further? Note that frogmouth, also mingw, which builds with openssl, 
doesn't

have this issue.


Did you unlock it in some way at the end? Here is the shape of the
report for others:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana=2016-12-10%2022%3A00%3A15 


And here is of course the interesting bit:
2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG:  could not acquire
random number
2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG:  could not acquire
random number
2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG:  could not acquire
random number

I am not seeing any problems with MSVC without openssl, so that's a
problem proper to MinGW. I am getting to wonder if it is actually a
good idea to cache the crypt context and then re-use it. Using a new
context all the time is definitely not performance-wise though.


Actually, looking at the config.log on jacana, it's trying to use 
/dev/urandom:


configure:15028: checking for /dev/urandom
configure:15041: result: yes
configure:15054: checking which random number source to use
configure:15073: result: /dev/urandom

And looking closer at configure.in, I can see why:

  elif test "$PORTNAME" = x"win32" ; then
USE_WIN32_RANDOM=1

That test is broken. It looks like the x"$VAR" = x"constant" idiom, 
but the left side of the comparison doesn't have the 'x'. Oops.


Fixed that, let's see if it made jacana happy again.

This makes me wonder if we should work a bit harder to get a good 
error message, if acquiring a random number fails for any reason. This 
needs to work in the frontend as well backend, but we could still have 
an elog(LOG, ...) there, inside an #ifndef FRONTEND block.



I see you have now improved the messages in postmaster.c, which is good.

However, the bigger problem (ISTM) is that when this failed I had a 
system which was running but where every connection immediately failed:


   == creating temporary instance==
   == initializing database system   ==
   == starting postmaster==

   pg_regress: postmaster did not respond within 120 seconds
   Examine 
c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/log/postmaster.log
 for the reason
   make: *** [check] Error 2

Should one or more of these errors be fatal? Or should we at least get 
pg_regress to try to shut down the postmaster if it can't connect after 
120 seconds?



[In answer to Michael's question above, I forcibly shut down the 
postmaster by hand. Otherwise it would still be running, and we would 
not have got the report on the buildfarm server.]


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jacana hung after failing to acquire random number

2016-12-11 Thread Michael Paquier
On Mon, Dec 12, 2016 at 4:32 PM, Heikki Linnakangas  wrote:
> That test is broken. It looks like the x"$VAR" = x"constant" idiom, but the 
> left side of the comparison doesn't have the 'x'. Oops.

Good catch.

> This makes me wonder if we should work a bit harder to get a good error
> message, if acquiring a random number fails for any reason. This needs to
> work in the frontend as well backend, but we could still have an elog(LOG,
> ...) there, inside an #ifndef FRONTEND block.

Yeah, this has been itching me as well. We could allocate an error
string in a psprintf()'d string and let the callers of
pg_strong_backend() use it as they are responsible for the error
handling. What do you think?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jacana hung after failing to acquire random number

2016-12-11 Thread Heikki Linnakangas

On 12/12/2016 05:58 AM, Michael Paquier wrote:

On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan  wrote:


jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make
check". After starting the autovacuum launcher there are 120 messages on its
log about "Could not acquire random number". Then nothing.


So I suspect the problem here is commit
fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in
detail.


Shouldn't we want the postmaster to shut down if it's not going to go
further? Note that frogmouth, also mingw, which builds with openssl, doesn't
have this issue.


Did you unlock it in some way at the end? Here is the shape of the
report for others:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana=2016-12-10%2022%3A00%3A15
And here is of course the interesting bit:
2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG:  could not acquire
random number
2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG:  could not acquire
random number
2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG:  could not acquire
random number

I am not seeing any problems with MSVC without openssl, so that's a
problem proper to MinGW. I am getting to wonder if it is actually a
good idea to cache the crypt context and then re-use it. Using a new
context all the time is definitely not performance-wise though.


Actually, looking at the config.log on jacana, it's trying to use 
/dev/urandom:


configure:15028: checking for /dev/urandom
configure:15041: result: yes
configure:15054: checking which random number source to use
configure:15073: result: /dev/urandom

And looking closer at configure.in, I can see why:

  elif test "$PORTNAME" = x"win32" ; then
USE_WIN32_RANDOM=1

That test is broken. It looks like the x"$VAR" = x"constant" idiom, but 
the left side of the comparison doesn't have the 'x'. Oops.


Fixed that, let's see if it made jacana happy again.

This makes me wonder if we should work a bit harder to get a good error 
message, if acquiring a random number fails for any reason. This needs 
to work in the frontend as well backend, but we could still have an 
elog(LOG, ...) there, inside an #ifndef FRONTEND block.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jacana hung after failing to acquire random number

2016-12-11 Thread Michael Paquier
On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan  wrote:
>
> jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make
> check". After starting the autovacuum launcher there are 120 messages on its
> log about "Could not acquire random number". Then nothing.
>
>
> So I suspect the problem here is commit
> fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in
> detail.
>
>
> Shouldn't we want the postmaster to shut down if it's not going to go
> further? Note that frogmouth, also mingw, which builds with openssl, doesn't
> have this issue.

Did you unlock it in some way at the end? Here is the shape of the
report for others:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana=2016-12-10%2022%3A00%3A15
And here is of course the interesting bit:
2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG:  could not acquire
random number
2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG:  could not acquire
random number
2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG:  could not acquire
random number

I am not seeing any problems with MSVC without openssl, so that's a
problem proper to MinGW. I am getting to wonder if it is actually a
good idea to cache the crypt context and then re-use it. Using a new
context all the time is definitely not performance-wise though. A
second difference are the missing CRYPT_MACHINE_KEYSET and
CRYP_NEWKEYSET. So, with something like the patch attached, do you see
improvements? What this patch does is to use a different context at
each call, and with the key container flags to allow proper access to
it (as there are winxp-only limitations here). I have tried to compile
with MinGW in my environment but my gcc compiler keep crashing, so I
cannot reproduce directly the problem I am afraid.
-- 
Michael
diff --git a/src/port/pg_strong_random.c b/src/port/pg_strong_random.c
index 6d3aa38..34cab6f 100644
--- a/src/port/pg_strong_random.c
+++ b/src/port/pg_strong_random.c
@@ -31,14 +31,6 @@
 #include 
 #endif
 
-#ifdef WIN32
-/*
- * Cache a global crypto provider that only gets freed when the process
- * exits, in case we need random numbers more than once.
- */
-static HCRYPTPROV hProvider = 0;
-#endif
-
 #if defined(USE_DEV_URANDOM)
 /*
  * Read (random) bytes from a file.
@@ -111,28 +103,36 @@ pg_strong_random(void *buf, size_t len)
 	 * Windows has CryptoAPI for strong cryptographic numbers.
 	 */
 #elif defined(USE_WIN32_RANDOM)
-	if (hProvider == 0)
+	HCRYPTPROV	hProvider;
+	DWORD		flags;
+
+	flags = CRYPT_VERIFYCONTEXT | CRYPT_SILENT | CRYPT_MACHINE_KEYSET;
+
+	/* Create a crypto provider */
+	if (!CryptAcquireContext(,
+			 NULL,
+			 MS_DEF_PROV,
+			 PROV_RSA_FULL,
+			 flags))
 	{
+		/* If previous creation failed, try with a new key container */
+		flags |= CRYPT_NEWKEYSET;
 		if (!CryptAcquireContext(,
  NULL,
  MS_DEF_PROV,
  PROV_RSA_FULL,
- CRYPT_VERIFYCONTEXT | CRYPT_SILENT))
-		{
-			/*
-			 * On failure, set back to 0 in case the value was for some reason
-			 * modified.
-			 */
-			hProvider = 0;
-		}
+ flags))
+			return false;
 	}
-	/* Re-check in case we just retrieved the provider */
-	if (hProvider != 0)
+
+	if (!CryptGenRandom(hProvider, len, buf))
 	{
-		if (CryptGenRandom(hProvider, len, buf))
-			return true;
+		CryptReleaseContext(hProvider, 0);
+		return false;
 	}
-	return false;
+
+	CryptReleaseContext(hProvider, 0);
+	return true;
 
 	/*
 	 * Read /dev/urandom ourselves.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] jacana hung after failing to acquire random number

2016-12-10 Thread Andrew Dunstan


jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make 
check". After starting the autovacuum launcher there are 120 messages on 
its log about "Could not acquire random number". Then nothing.



So I suspect the problem here is commit 
fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in 
detail.



Shouldn't we want the postmaster to shut down if it's not going to go 
further? Note that frogmouth, also mingw, which builds with openssl, 
doesn't have this issue.



cheers


andrew





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers