Re: [HACKERS] [BUGS] BUG #6748: sequence value may be conflict in some cases

2012-07-24 Thread Tom Lane
I wrote:
 The tuple in buffers has log_cnt = 1, is_called = false, but the initial
 XLOG_SEQ_LOG record shows log_cnt = 0, is_called = true.  So if we crash
 at this point, after recovery it looks like one nextval() has already
 been done.  However, AlterSequence generates another XLOG_SEQ_LOG record
 based on what's in shared buffers, so after replay of that, we're back
 to the original state where it does not appear that any nextval() has
 been done.

 I'm of the opinion that this kluge needs to be removed; it's just insane
 that we're not logging the same state we leave in our buffers.

Attached is a WIP patch for that.  I had to leave nextval() logging
a different state from what it puts into buffers, since without that
we can't have the optimization of one-WAL-record-per-N-nextval-calls,
which seems worth preserving.  But I was able to get rid of the other
routines' doing it.  I think it's all right for nextval to do it because
the only difference between what it logs and what it leaves in the
buffer is that last_value is advanced and log_cnt is zero; in particular
there is not a difference in is_called state.

For awhile I was convinced that nextval() was broken in another way,
because it does MarkBufferDirty() before changing the buffer contents,
which sure *looks* wrong; what if a checkpoint writes the
not-yet-modified buffer contents and then clears the dirty bit?
However, I now think this is safe because we hold the buffer content
lock exclusively until we've finished making the changes.  A checkpoint
might observe the dirty bit set and attempt to write the page, but it
will block trying to get shared buffer content lock until we complete
the changes.  So that should be all right, and I've added some comments
to explain it, but if anyone thinks it's wrong speak up!

Another thing that I think is a real bug is that AlterSequence needs
to reset log_cnt to zero any time it changes any of the sequence
generation parameters, to ensure that the new values will be applied
promptly.  I've done that in the attached.

I also modified the code so that log_cnt is initialized to zero not one
when a sequence is created or reset to is_called = false.  This doesn't
make any functional difference given the change to force logging a WAL
record when is_called becomes set, but I think it's a good change
anyway: it doesn't make any sense anymore to imply that one sequence
value has been pre-reserved.  However, if anyone's particularly annoyed
by the first diff hunk in the regression tests, we could undo that.

Lastly, I modified read_seq_tuple (nee read_info) to return a HeapTuple
pointer to the sequence's tuple, and made use of that to make the rest
of the code simpler and less dependent on page-layout assumptions.
At this point this is only a code-beautification change; it was
functionally necessary in an earlier version of the patch, but I didn't
back it out when I got rid of the change that made it necessary.

Oh, there are also some hunks in here to add debug logging, which I was
using for testing.  I'm undecided whether to remove those or clean them
up some more and wrap them in something like #ifdef SEQUENCE_DEBUG.

One other point about the regression test diffs: formerly, the typical
state seen at line 191 of sequence.out was log_cnt = 32.  Now it is 31,
because the first nextval call increases log_cnt to 32 and then the
second decreases it to 31.  The failure mode described in
http://archives.postgresql.org/pgsql-hackers/2008-08/msg01359.php
is no longer possible, because a checkpoint before the first nextval
call won't change its behavior.  But instead, if a checkpoint happens
*between* those two nextvals, the second nextval will decide to create
a WAL entry and it will increase log_cnt to 32.  So we still need the
variant expected file sequence_1.out, but now it shows log_cnt = 32
instead of the other way around.

Comments anyone?

regards, tom lane

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 34b74f6c3844a3a172cd660357f8ccbc48f57f3c..69054b546cec1bb71ff79b18702a0e1edb3a45fb 100644
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
***
*** 38,44 
  /*
   * We don't want to log each fetching of a value from a sequence,
   * so we pre-log a few fetches in advance. In the event of
!  * crash we can lose as much as we pre-logged.
   */
  #define SEQ_LOG_VALS	32
  
--- 38,44 
  /*
   * We don't want to log each fetching of a value from a sequence,
   * so we pre-log a few fetches in advance. In the event of
!  * crash we can lose (skip over) as many values as we pre-logged.
   */
  #define SEQ_LOG_VALS	32
  
*** typedef struct SeqTableData
*** 73,79 
  	int64		cached;			/* last value already cached for nextval */
  	/* if last != cached, we have not used up all the cached values */
  	int64		increment;		/* copy of sequence's increment field */
! 	/* note that increment is zero until we 

Re: [HACKERS] [BUGS] BUG #6748: sequence value may be conflict in some cases

2012-07-23 Thread Tom Lane
meixiangm...@huawei.com writes:
 [ freshly-created sequence has wrong state after crash ]

I didn't believe this at first, but sure enough, it fails just as
described if you force a crash between the first and second nextval
calls for the sequence.  This used to work ...

The change that broke it turns out to be the ALTER SEQUENCE OWNED BY
call that we added to serial-column creation circa 8.2; although on
closer inspection I think any ALTER SEQUENCE before the first nextval
call would be problematic.  The real issue is the ancient kluge in
sequence creation that writes something different into the WAL log
than what it leaves behind in shared buffers:

/* We do not log first nextval call, so advance sequence here */
/* Note we are scribbling on local tuple, not the disk buffer */
newseq-is_called = true;
newseq-log_cnt = 0;

The tuple in buffers has log_cnt = 1, is_called = false, but the initial
XLOG_SEQ_LOG record shows log_cnt = 0, is_called = true.  So if we crash
at this point, after recovery it looks like one nextval() has already
been done.  However, AlterSequence generates another XLOG_SEQ_LOG record
based on what's in shared buffers, so after replay of that, we're back
to the original state where it does not appear that any nextval() has
been done.

I'm of the opinion that this kluge needs to be removed; it's just insane
that we're not logging the same state we leave in our buffers.  To do
that, we need to fix nextval() so that the first nextval call generates
an xlog entry; that is, if we are changing is_called to true we ought to
consider that as a reason to force an xlog entry.  I think way back when
we thought it was a good idea to avoid making two xlog entries during
creation and immediate use of a sequence, but considering all the other
xlog entries involved in creation of a sequence object, this is a pretty
silly optimization.  (Besides, it merely postpones the first
nextval-driven xlog entry from the first to the second nextval call.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers