Re: [HACKERS] Query cancel seems to be broken in master since Oct 17

2016-10-19 Thread Noah Misch
On Tue, Oct 18, 2016 at 10:03:39AM -0400, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > On 10/18/2016 04:13 PM, Tom Lane wrote:
> >> There's a smoking gun in the postmaster log:
> >> 2016-10-18 09:10:34.547 EDT [18502] LOG:  wrong key in cancel request for 
> >> process 18491
> 
> > Ok, I've reverted that commit for now. It clearly needs more thought, 
> > because of this, and the pademelon failure discussed on the other thread.
> 
> I think that was an overreaction.  The problem is pretty obvious after
> adding some instrumentation:

I endorse Heikki's revert.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Query cancel seems to be broken in master since Oct 17

2016-10-18 Thread Michael Paquier
On Tue, Oct 18, 2016 at 11:40 PM, Tom Lane  wrote:
> I wrote:
>> The cleanest fix might be to change those various "long" variables
>> to uint32.  You'd have to think about how to handle the ntohl/htonl
>> calls that are used on them, though.
>
> Or actually, no, you wouldn't have to think very hard.  I was supposing
> that those calls were declared to traffic in "long"s, but they aren't
> and never have been, at least not since SUSv2:
>
> uint32_t htonl(uint32_t hostlong);
> uint16_t htons(uint16_t hostshort);
> uint32_t ntohl(uint32_t netlong);
> uint16_t ntohs(uint16_t netshort);
>
> So s/long/uint32/ would probably fix it just fine.

The pg_strong_random patch has been reverted, what's done is done.
FWIW only doing that is needed to fix the problem as already mentioned
by Tom:
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index f232083..634578d 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -38,7 +38,7 @@ volatile uint32 CritSectionCount = 0;
 intMyProcPid;
 pg_time_t  MyStartTime;
 struct Port *MyProcPort;
-long   MyCancelKey;
+uint32 MyCancelKey;
 intMyPMChildSlot;

 /*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 78545da..5e623a1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -162,7 +162,7 @@ extern PGDLLIMPORT int MyProcPid;
 extern PGDLLIMPORT pg_time_t MyStartTime;
 extern PGDLLIMPORT struct Port *MyProcPort;
 extern PGDLLIMPORT struct Latch *MyLatch;
-extern long MyCancelKey;
+extern uint32 MyCancelKey;
 extern int MyPMChildSlot;

 extern char OutputFileName[];

I'll send an update for that, as well as a solution for the pademelon
problem on the SCRAM thread.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Query cancel seems to be broken in master since Oct 17

2016-10-18 Thread Tom Lane
I wrote:
> The cleanest fix might be to change those various "long" variables
> to uint32.  You'd have to think about how to handle the ntohl/htonl
> calls that are used on them, though.

Or actually, no, you wouldn't have to think very hard.  I was supposing
that those calls were declared to traffic in "long"s, but they aren't
and never have been, at least not since SUSv2:

uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);

So s/long/uint32/ would probably fix it just fine.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Query cancel seems to be broken in master since Oct 17

2016-10-18 Thread Tom Lane
Heikki Linnakangas  writes:
> On 10/18/2016 04:13 PM, Tom Lane wrote:
>> There's a smoking gun in the postmaster log:
>> 2016-10-18 09:10:34.547 EDT [18502] LOG:  wrong key in cancel request for 
>> process 18491

> Ok, I've reverted that commit for now. It clearly needs more thought, 
> because of this, and the pademelon failure discussed on the other thread.

I think that was an overreaction.  The problem is pretty obvious after
adding some instrumentation:

2016-10-18 09:57:47.508 EDT [21229] LOG:  wrong key (0x7B7E4D5E, expected 
0xF0F804017B7E4D5E) in cancel request for process 21228

To wit, the various cancel_key backend variables are declared as "long",
and the new code

if (!pg_strong_random(, sizeof(MyCancelKey)))

is therefore computing an 8-byte random value on 64-bit-long machines.
But only 4 bytes go to the client and come back.

The cleanest fix might be to change those various "long" variables
to uint32.  You'd have to think about how to handle the ntohl/htonl
calls that are used on them, though.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Query cancel seems to be broken in master since Oct 17

2016-10-18 Thread Heikki Linnakangas

On 10/18/2016 04:13 PM, Tom Lane wrote:

Magnus Hagander  writes:

On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:

The test executes "select pg_sleep(10)" and tries to cancel it. In recent
master builds, cancel seems to be ignored, and the statement lasts for 10
seconds.



My guess is it's related to this:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9e083fd4683294f41544e6d0d72f6e258ff3a77c
That's certainly not intended to break things, but that was changed on Oct
17 and it relates to cancel keys.
What platform does the postgres server run on? Can can you check if query
cancel works on libpq or if it's completely broken?


I can confirm that query cancel is broken in HEAD on RHEL6.

regression=# select pg_sleep(10);
^CCancel request sent
... nothing happens for the balance of the 10 seconds ...
regression=#

There's a smoking gun in the postmaster log:

2016-10-18 09:10:34.547 EDT [18502] LOG:  wrong key in cancel request for 
process 18491


Ok, I've reverted that commit for now. It clearly needs more thought, 
because of this, and the pademelon failure discussed on the other thread.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Query cancel seems to be broken in master since Oct 17

2016-10-18 Thread Tom Lane
Magnus Hagander  writes:
> On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <
> sitnikov.vladi...@gmail.com> wrote:
>> The test executes "select pg_sleep(10)" and tries to cancel it. In recent
>> master builds, cancel seems to be ignored, and the statement lasts for 10
>> seconds.

> My guess is it's related to this:
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9e083fd4683294f41544e6d0d72f6e258ff3a77c
> That's certainly not intended to break things, but that was changed on Oct
> 17 and it relates to cancel keys.
> What platform does the postgres server run on? Can can you check if query
> cancel works on libpq or if it's completely broken?

I can confirm that query cancel is broken in HEAD on RHEL6.

regression=# select pg_sleep(10);
^CCancel request sent
... nothing happens for the balance of the 10 seconds ...
regression=# 

There's a smoking gun in the postmaster log:

2016-10-18 09:10:34.547 EDT [18502] LOG:  wrong key in cancel request for 
process 18491

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Query cancel seems to be broken in master since Oct 17

2016-10-18 Thread Vladimir Gordiychuk
>What platform does the postgres server run on?

Ubuntu
OS name: "linux", version: "3.19.0-66-generic", arch: "amd64", family:
"unix"

2016-10-18 11:05 GMT+03:00 Magnus Hagander :

>
>
> On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <
> sitnikov.vladi...@gmail.com> wrote:
>
>> Hi,
>>
>> In pgjdbc we have regular regression testing against "build from
>> master" PostgreSQL, and recent master builds fail for "statement cancel"
>> test.
>>
>> The PostgreSQL as of Mon Oct 17 00:09:39 UTC 2016 was fine,
>> then "statement cancel" started to fail.
>> The test executes "select pg_sleep(10)" and tries to cancel it. In recent
>> master builds, cancel seems to be ignored, and the statement lasts for 10
>> seconds.
>>
>> Exactly the same driver and test version works fine for 8.4, 9.1, 9.2,
>> 9.3, 9.4, 9.5, and 9.6:
>> Here's a sample build report: https://travis-ci.org/
>> pgjdbc/pgjdbc/builds/168444341
>>
>> Any hints what could be the issue?
>> Was the breakage intentional?
>>
>>
> My guess is it's related to this: https://git.postgresql.org/
> gitweb/?p=postgresql.git;a=commitdiff;h=9e083fd4683294f41544e6d0d72f6e
> 258ff3a77c
>
> That's certainly not intended to break things, but that was changed on Oct
> 17 and it relates to cancel keys.
>
> What platform does the postgres server run on? Can can you check if query
> cancel works on libpq or if it's completely broken?
>
>
> --
>  Magnus Hagander
>  Me: http://www.hagander.net/
>  Work: http://www.redpill-linpro.com/
>


Re: [HACKERS] Query cancel seems to be broken in master since Oct 17

2016-10-18 Thread Magnus Hagander
On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:

> Hi,
>
> In pgjdbc we have regular regression testing against "build from
> master" PostgreSQL, and recent master builds fail for "statement cancel"
> test.
>
> The PostgreSQL as of Mon Oct 17 00:09:39 UTC 2016 was fine,
> then "statement cancel" started to fail.
> The test executes "select pg_sleep(10)" and tries to cancel it. In recent
> master builds, cancel seems to be ignored, and the statement lasts for 10
> seconds.
>
> Exactly the same driver and test version works fine for 8.4, 9.1, 9.2,
> 9.3, 9.4, 9.5, and 9.6:
> Here's a sample build report: https://travis-ci.org/
> pgjdbc/pgjdbc/builds/168444341
>
> Any hints what could be the issue?
> Was the breakage intentional?
>
>
My guess is it's related to this:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9e083fd4683294f41544e6d0d72f6e258ff3a77c

That's certainly not intended to break things, but that was changed on Oct
17 and it relates to cancel keys.

What platform does the postgres server run on? Can can you check if query
cancel works on libpq or if it's completely broken?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


[HACKERS] Query cancel seems to be broken in master since Oct 17

2016-10-18 Thread Vladimir Sitnikov
Hi,

In pgjdbc we have regular regression testing against "build from
master" PostgreSQL, and recent master builds fail for "statement cancel"
test.

The PostgreSQL as of Mon Oct 17 00:09:39 UTC 2016 was fine, then "statement
cancel" started to fail.
The test executes "select pg_sleep(10)" and tries to cancel it. In recent
master builds, cancel seems to be ignored, and the statement lasts for 10
seconds.

Exactly the same driver and test version works fine for 8.4, 9.1, 9.2, 9.3,
9.4, 9.5, and 9.6:
Here's a sample build report:
https://travis-ci.org/pgjdbc/pgjdbc/builds/168444341

Any hints what could be the issue?
Was the breakage intentional?

Vladimir