Re: [HACKERS] Still something fishy in the fastpath lock stuff

2014-03-31 Thread Robert Haas
On Wed, Mar 26, 2014 at 12:59 AM, Robert Haas robertmh...@gmail.com wrote:
 It seems to me that the simplest thing to do might be
 just this:

 -static FastPathStrongRelationLockData *FastPathStrongRelationLocks;
 +static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;

 Do you see any problem with that approach?

Hearing nothing, I went ahead and did this.  I see that the failure on
prairiedog only occurred once, so I don't know how we'll know whether
this fixed the problem that caused that failure or merely fixed a
problem that could cause a failure, but I'm not sure what else we can
really do at this point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Still something fishy in the fastpath lock stuff

2014-03-31 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Mar 26, 2014 at 12:59 AM, Robert Haas robertmh...@gmail.com wrote:
 It seems to me that the simplest thing to do might be
 just this:
 
 -static FastPathStrongRelationLockData *FastPathStrongRelationLocks;
 +static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
 
 Do you see any problem with that approach?

 Hearing nothing, I went ahead and did this.

Sorry, I've been preoccupied with personal matters for the past little
while, and hadn't had time to think about this.  After a review of the
code it looks probably all right to me.

 I see that the failure on
 prairiedog only occurred once, so I don't know how we'll know whether
 this fixed the problem that caused that failure or merely fixed a
 problem that could cause a failure, but I'm not sure what else we can
 really do at this point.

Well, it's only one failure, but it occurred after prairiedog had run
a grand total of 20 buildfarm cycles on 9.2 and newer, so I'm thinking
that the probability of failure on that machine is more than epsilon.
If we don't see it again for a good long while then I'll believe that
this fixed it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Still something fishy in the fastpath lock stuff

2014-03-26 Thread Heikki Linnakangas

On 03/26/2014 06:59 AM, Robert Haas wrote:

On Tue, Mar 25, 2014 at 11:58 AM, Tom Lane t...@sss.pgh.pa.us wrote:

Robert Haas robertmh...@gmail.com writes:

I really think we
should change that rule; it leads to ugly code, and bugs.


No doubt, but are we prepared to believe that we have working compiler
barriers other than volatile?  (Which, I will remind you, is in the C
standard.  Unlike compiler barriers.)


I'm prepared to believe that it would take some time to be certain
that we've got that right on all compilers we support.  The most
common ones are easily dealt with, I think, but it might be harder to
verify the behavior on more obscure compilers.  But I'd rather deal
with bullet-proofing SpinLockAcquire() and SpinLockRelease() *in
general* than deal with bullet-proofing every call site individually,
which is what we have to do right now.


+1. To support porting to new compilers, we can fall back to e.g calling 
a dummy function through a function pointer, if we don't have a proper 
implementation.


Aside from the correctness issue: A while ago, while working on the 
xloginsert slots stuff, I looked at the assembler that gcc generated for 
ReserverXLogInsertLocation(). That function is basically:


SpinLockAcquire()
modify and read a few variables in shared memory
SpinLockRelease()
fairly expensive calculations based on values read

Gcc decided to move the release of the lock so that it became:

SpinLockAcquire()
modify and read a few variables in shared memory
fairly expensive calculations based on values read
SpinLockRelease()

I went through some effort while writing the patch to make sure the 
spinlock is held for as short duration as possible. But gcc undid that 
to some extent :-(.


I tried to put a compiler barrier there and run some short performance 
tests, but it didn't make any noticeable difference. So it doesn't seem 
matter in practice - maybe the CPU reorders things anyway, or the 
calculations are not expensive enough to matter after all.


I just looked at the assembler generated for LWLockAcquire, and a 
similar thing is happening there. The code is:


...
641 /* We are done updating shared state of the lock itself. */
642 SpinLockRelease(lock-mutex);
643
644 TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), mode);
645
646 /* Add lock to list of locks held by this backend */
647 held_lwlocks[num_held_lwlocks++] = l;
...

gcc reordered part of the held_lwlocks assignment after releasing the 
spinlock:


.L21:
.loc 1 647 0
movslq  num_held_lwlocks(%rip), %rax
addq$16, %r12
.LVL23:
.loc 1 652 0
testl   %ebx, %ebx
.loc 1 642 0
movb$0, (%r14)  --- SpinLockRelease
.loc 1 652 0
leal-1(%rbx), %r13d
.loc 1 647 0
leal1(%rax), %edx
movq%r14, held_lwlocks(,%rax,8)
.LVL24:
movl%edx, num_held_lwlocks(%rip)

The held_lwlocks assignment was deliberately put outside the 
spinlock-protected section, to minimize the time the spinlock is held, 
but the compiler moved some of it back in: the basic block .L21.


A compiler barrier would avoid that kind of reordering. Not using 
volatile would also allow the compiler to reorder and optimize the 
instructions *within* the spinlock-protected block, which might shave a 
few instructions while a spinlock is held. Granted, spinlock-protected 
blocks are small so there isn't much room for optimization, but still.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Still something fishy in the fastpath lock stuff

2014-03-25 Thread Robert Haas
On Tue, Mar 25, 2014 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Buildfarm member prairiedog crashed on today's HEAD with this:

 TRAP: FailedAssertion(!(FastPathStrongRelationLocks-count[fasthashcode]  
 0), File: lock.c, Line: 1240)

 I have the core file, and gdb says:

 #0  0x90047dac in kill ()
 #1  0x9012d7b4 in abort ()
 #2  0x003b1c38 in ExceptionalCondition (conditionName=0x0, errorType=0x25 , 
 fileName=0x2 , lineNumber=109) at assert.c:54
 #3  0x002960bc in RemoveLocalLock (locallock=0x280b414) at lock.c:1240
 #4  0x002968a4 in LockReleaseAll (lockmethodid=1, allLocks=0 '\0') at 
 lock.c:2087
 #5  0x00299550 in ProcReleaseLocks (isCommit=1 '\001') at proc.c:752
 #6  0x003de2d4 in ResourceOwnerReleaseInternal (owner=0x2802184, 
 phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001') at 
 resowner.c:307
 #7  0x003de71c in ResourceOwnerRelease (owner=0x2802184, 
 phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001') at 
 resowner.c:212
 #8  0x00083b7c in CommitTransaction () at xact.c:1998
 #9  0x00083ea4 in CommitTransactionCommand () at xact.c:2713
 #10 0x002ab9b4 in finish_xact_command () at postgres.c:2408
 #11 0x002af2b0 in exec_simple_query (query_string=0x2807e1c CREATE TABLE t3 
 (name TEXT, n INTEGER);) at postgres.c:1076
 #12 0x002b08bc in PostgresMain (argc=41975324, argv=0xbfffdd3c, 
 dbname=0x2800b6c regression, username=0x51 ) at postgres.c:4004
 #13 0x0023d5bc in ServerLoop () at postmaster.c:4089
 #14 0x0023f11c in PostmasterMain (argc=6, argv=0x46d9d0) at postmaster.c:1222
 #15 0x001bb89c in main (argc=6, argv=0x2100710) at main.c:203

 Since this machine has only been running the buildfarm for a week,
 I rather imagine we will see more of these.  Thoughts?

 (This is a PPC machine, but only a single processor, so it's hard
 to see how memory ordering issues might enter into it ...)

Well, it's possible that the issue could be related to compiler
reordering, since it's still the rule that SpinLockAcquire/Release
must be memory barriers but need not be compiler barriers, and
FastPathStrongRelationLocks is not volatile-ized.  I really think we
should change that rule; it leads to ugly code, and bugs.  But to
determine whether that's the issue, we'd probably need to disassemble
the relevant code and see whether the compiler did in fact shift
things around.

But it seems equally possible that the bug is the result of some more
pedestrian error.  I don't have a great idea where to look for such a
mistake, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Still something fishy in the fastpath lock stuff

2014-03-25 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Mar 25, 2014 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Since this machine has only been running the buildfarm for a week,
 I rather imagine we will see more of these.  Thoughts?
 
 (This is a PPC machine, but only a single processor, so it's hard
 to see how memory ordering issues might enter into it ...)

 Well, it's possible that the issue could be related to compiler
 reordering, since it's still the rule that SpinLockAcquire/Release
 must be memory barriers but need not be compiler barriers, and
 FastPathStrongRelationLocks is not volatile-ized.

Yeah, I was just about to complain about that.  What made you think
you could get away with ignoring that coding rule?

 I really think we
 should change that rule; it leads to ugly code, and bugs.

No doubt, but are we prepared to believe that we have working compiler
barriers other than volatile?  (Which, I will remind you, is in the C
standard.  Unlike compiler barriers.)

 But to
 determine whether that's the issue, we'd probably need to disassemble
 the relevant code and see whether the compiler did in fact shift
 things around.

I looked at the asm code around line 1240, and it seems all right, but
if this is a locking problem then the actual failure was probably in some
preceding spinlocked segment.  I haven't the patience to look at each one
of these segments, and it's sort of irrelevant anyway, because whether
this particular machine did reorder the asm code or not, the fact remains
that *this C code is broken*.  The fact that you don't like the existing
coding rules concerning spinlocks is not license to write unsafe code.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Still something fishy in the fastpath lock stuff

2014-03-25 Thread Robert Haas
On Tue, Mar 25, 2014 at 11:58 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Well, it's possible that the issue could be related to compiler
 reordering, since it's still the rule that SpinLockAcquire/Release
 must be memory barriers but need not be compiler barriers, and
 FastPathStrongRelationLocks is not volatile-ized.

 Yeah, I was just about to complain about that.  What made you think
 you could get away with ignoring that coding rule?

Um, nothing.  It was just an oversight.  If the implication here is
that I deliberately ignore PostgreSQL coding rules, then I resent
that.  If the implication is that I don't understand them, the fact
that I made explicit reference to the existence of this precise rule
in my previous response would seem to militate against that
interpretation.

 I really think we
 should change that rule; it leads to ugly code, and bugs.

 No doubt, but are we prepared to believe that we have working compiler
 barriers other than volatile?  (Which, I will remind you, is in the C
 standard.  Unlike compiler barriers.)

I'm prepared to believe that it would take some time to be certain
that we've got that right on all compilers we support.  The most
common ones are easily dealt with, I think, but it might be harder to
verify the behavior on more obscure compilers.  But I'd rather deal
with bullet-proofing SpinLockAcquire() and SpinLockRelease() *in
general* than deal with bullet-proofing every call site individually,
which is what we have to do right now.

Keep in mind that we don't require the use of volatile for critical
sections involving LWLockAcquire() and LWLockRelease(), so we're
implicitly assume that those functions DO act as compiler barriers.  I
think we're mostly just hoping that the compiler isn't smart enough to
make deductions about the behavior of a function in another
translation unit, or that if it is the combination of
SpinLockAcquire() + SpinLockRelease() adds up to a compiler barrier
even if they aren't each individually a barrier.  Whatever the reason,
our coding rule is currently that you've got to volatile-ize all
references within sections protected by a spinlock, but not sections
protected by an lwlock.  Without intending to defend the fact that I
failed to do it correctly in this case, I don't find it surprising
that people flub that occasionally: this is not the first time someone
has made such an error and it doubtless won't be the last.  One of the
things I did to Andres's replication slot patch when reviewing it was
fix numerous errors of this type.  I don't think he'll be the last
person to get it wrong, either.

 I looked at the asm code around line 1240, and it seems all right, but
 if this is a locking problem then the actual failure was probably in some
 preceding spinlocked segment.  I haven't the patience to look at each one
 of these segments, and it's sort of irrelevant anyway, because whether
 this particular machine did reorder the asm code or not, the fact remains
 that *this C code is broken*.  The fact that you don't like the existing
 coding rules concerning spinlocks is not license to write unsafe code.

Tom, I never said otherwise.  Flaming me is neither necessary nor
productive.  It seems to me that the simplest thing to do might be
just this:

-static FastPathStrongRelationLockData *FastPathStrongRelationLocks;
+static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;

Do you see any problem with that approach?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers