Re: [HACKERS] XLOG_NO_TRAN and XLogRecord.xl_xid

2007-02-22 Thread Heikki Linnakangas

Florian G. Pflug wrote:

Hi

After further reading I fear I have to bother you with another question ;-)
There is a flag XLOG_NO_TRAN passed via the info parameter to XLogInsert.

Now, for example the following comment in clog.c
/*
 * Write a TRUNCATE xlog record
 *
 * We must flush the xlog record to disk before returning --- see notes
 * in TruncateCLOG().
 *
 * Note: xlog record is marked as outside transaction control, since we
 * want it to be redone whether the invoking transaction commits or not.
 */
static void
WriteTruncateXlogRec(int pageno)
...

seems to imply that (some?) WAL redo records only actually get redone
if the transaction that caused them eventually committed. But given the
way postgres MVCC works that doesn't make sense to me, and I also can't
find any code that would actually skip xlog entries.


That comment is a bit misleading, I agree. We don't skip xlog entries, 
they're always replayed.


The xid in the WAL record is used by some WAL resource managers to 
reconstruct the original data. For that purpose, it might as well not be 
in the header, but in the data portion.


It's also used in PITR to recover up to a certain transaction, and it's 
used to advance the next xid counter to the next unused xid after replay.
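
The "advance the next xid counter" step can be illustrated with a small,
self-contained sketch. This is not the actual xlog.c code: the helper names
(xid_follows_or_equals, advance_next_xid) are hypothetical, and the only
assumptions taken from PostgreSQL are that xids are 32-bit counters that wrap
around and that xids below 3 are reserved as special values.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* Assumption: xids below this value are reserved (invalid/special). */
#define FIRST_NORMAL_XID ((TransactionId) 3)

/* Wraparound-aware "a is at or after b", comparing modulo 2^32. */
static int
xid_follows_or_equals(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) >= 0;
}

/*
 * Hypothetical helper: after replaying a record carrying record_xid,
 * make sure *next_xid points past it.
 */
static void
advance_next_xid(TransactionId *next_xid, TransactionId record_xid)
{
    /* Records not tied to a transaction carry no usable xid. */
    if (record_xid < FIRST_NORMAL_XID)
        return;

    if (xid_follows_or_equals(record_xid, *next_xid))
    {
        *next_xid = record_xid + 1;
        /* On wraparound, skip over the reserved xids. */
        if (*next_xid < FIRST_NORMAL_XID)
            *next_xid = FIRST_NORMAL_XID;
    }
}
```

Calling advance_next_xid once per replayed record leaves the counter at the
first xid not seen in the log, regardless of the order in which the records
appear.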



On a related note - Looking at e.g. heap_xlog_insert, it seems that
the original page (before the crash), and the one reconstructed via
heap_xlog_insert are only functionally equivalent, but not the same
byte-wise? At least this is what doing
HeapTupleHeaderSetCmin(htup, FirstCommandId);
seems to imply - surely the original command id could have been higher, no?


Yep, that's right. The reconstructed page is not always byte-to-byte 
identical to the original.
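
To make "functionally equivalent but not byte-identical" concrete, here is a
minimal sketch. TupleSketch and both comparison functions are hypothetical
and far simpler than a real heap tuple header; the only point is that replay
stores a fixed command id (0 in this sketch) instead of the original one, so
a byte-wise comparison fails while every field that other transactions look
at still matches.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for a heap tuple. */
typedef struct
{
    uint32_t xmin;      /* inserting transaction */
    uint32_t cmin;      /* inserting command within that transaction */
    char     data[8];   /* user data */
} TupleSketch;

/* Byte-wise comparison: sensitive to cmin. */
static bool
tuples_identical(const TupleSketch *a, const TupleSketch *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}

/* Functional comparison: ignores cmin, which only the inserter cared about. */
static bool
tuples_equivalent(const TupleSketch *a, const TupleSketch *b)
{
    return a->xmin == b->xmin &&
           memcmp(a->data, b->data, sizeof(a->data)) == 0;
}
```

A tuple inserted by the sixth command of a transaction and its replayed copy
would compare as equivalent but not identical under these definitions.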


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] XLOG_NO_TRAN and XLogRecord.xl_xid

2007-02-22 Thread Heikki Linnakangas

Heikki Linnakangas wrote:

Florian G. Pflug wrote:

seems to imply that (some?) WAL redo records only actually get redone
if the transaction that caused them eventually committed. But given the
way postgres MVCC works that doesn't make sense to me, and I also can't
find any code that would actually skip xlog entries.


That comment is a bit misleading, I agree. We don't skip xlog entries, 
they're always replayed.


The xid in the WAL record is used by some WAL resource managers to 
reconstruct the original data. For that purpose, it might as well not be 
in the header, but in the data portion.


It's also used in PITR to recover up to a certain transaction, and it's 
used to advance the next xid counter to the next unused xid after replay.


Also, we skip clog update and writing the commit record if the 
transaction hasn't written any WAL records that are tied to the transaction.
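
A rough sketch of that bookkeeping, under stated assumptions: SKETCH_NO_TRAN
stands in for XLOG_NO_TRAN (its value here is illustrative), and XactSketch,
sketch_insert and sketch_commit are hypothetical names, not the real
XLogInsert or commit code. It only shows the idea that records flagged as
outside transaction control do not, by themselves, force a commit record.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for XLOG_NO_TRAN; the bit value is illustrative only. */
#define SKETCH_NO_TRAN 0x01

/* Hypothetical per-transaction state. */
typedef struct
{
    bool wrote_transactional_wal;   /* any record tied to this xact? */
    int  records_written;           /* total records emitted */
} XactSketch;

/* Emit a record; remember whether it was tied to the transaction. */
static void
sketch_insert(XactSketch *xact, uint8_t info)
{
    xact->records_written++;
    if ((info & SKETCH_NO_TRAN) == 0)
        xact->wrote_transactional_wal = true;
}

/* Returns true if a commit record (and clog update) was needed. */
static bool
sketch_commit(XactSketch *xact)
{
    if (!xact->wrote_transactional_wal)
        return false;           /* nothing tied to the xid: skip both */

    sketch_insert(xact, 0);     /* the commit record itself */
    return true;
}
```

With this shape, a transaction whose only WAL output was a clog TRUNCATE
record marked SKETCH_NO_TRAN would commit without writing anything further.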


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] XLOG_NO_TRAN and XLogRecord.xl_xid

2007-02-22 Thread Tom Lane
Heikki Linnakangas writes:
 Florian G. Pflug wrote:
 * Note: xlog record is marked as outside transaction control, since we
 * want it to be redone whether the invoking transaction commits or not.

 That comment is a bit misleading, I agree. We don't skip xlog entries, 
 they're always replayed.

Yeah, this distinction is another bit of effectively-dead code left over
from Vadim's original plan of using WAL for UNDO.  I haven't worried
about ripping it out because it doesn't cost much and it seems that
distinguishing transactional from nontransactional changes might be
useful for log analysis if nothing else.

 Yep, that's right. The reconstructed page is not always byte-to-byte 
 identical to the original.

We don't worry about recovering cmin/cmax since only the originating
transaction would have cared.  I think physical location of tuples on
a page isn't reliably reproduced either.

regards, tom lane
