Re: [HACKERS] PageIsAllVisible()'s trustworthiness in Hot Standby

2012-12-14 Thread Robert Haas
On Wed, Dec 12, 2012 at 12:31 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 On Wed, Dec 12, 2012 at 1:35 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Dec 4, 2012 at 12:10 PM, Pavan Deolasee
 pavan.deola...@gmail.com wrote:
 Hmm. Yeah, I do not have guts to prove that either. I'll probably write up a
 comment for your consideration to explain why we don't trust PD_ALL_VISIBLE
 in Hot standby for seq scans, but still trust VM for index-only scans.

 Sure.


 Here is a small patch that adds comments to heap scan code explaining
 why we don't trust the all-visible flag in the page, still continue to
 support index-only scans on hot standby.

Committed, with a few modifications to the last part.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PageIsAllVisible()'s trustworthiness in Hot Standby

2012-12-12 Thread Pavan Deolasee
On Wed, Dec 12, 2012 at 1:35 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Dec 4, 2012 at 12:10 PM, Pavan Deolasee
 pavan.deola...@gmail.com wrote:
 Hmm. Yeah, I do not have guts to prove that either. I'll probably write up a
 comment for your consideration to explain why we don't trust PD_ALL_VISIBLE
 in Hot standby for seq scans, but still trust VM for index-only scans.

 Sure.


Here is a small patch that adds comments to heap scan code explaining
why we don't trust the all-visible flag in the page, still continue to
support index-only scans on hot standby.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


hot-standby-pd-all-visible-heapscan.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PageIsAllVisible()'s trustworthiness in Hot Standby

2012-12-11 Thread Robert Haas
On Tue, Dec 4, 2012 at 12:10 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 Hmm. Yeah, I do not have guts to prove that either. I'll probably write up a
 comment for your consideration to explain why we don't trust PD_ALL_VISIBLE
 in Hot standby for seq scans, but still trust VM for index-only scans.

Sure.

 Another piece of code that caught my attention is this (sorry for digging up
 topics that probably have been discussed to death before and I might not
 have paid attention to then):

 Whenever we clear PD_ALL_VISIBLE flag, we do that while holding the heap
 page lock and also WAL log that action (in fact, piggy-back with
 insert/update/delete). The only exception that I saw is in lazy_scan_heap()

  916 else if (PageIsAllVisible(page)  has_dead_tuples)
  917 {
  918 elog(WARNING, page containing dead tuples is marked as
 all-visible in relation \%s\ page %u,
  919  relname, blkno);
  920 PageClearAllVisible(page);
  921 MarkBufferDirty(buf);
  922 visibilitymap_clear(onerel, blkno, vmbuffer);
  923 }

 Obviously, this is not a case that we expect to occur. But if it does occur
 for whatever reason, then even though we will correct the page flag on
 master, this would never be populated to the standby. So the standby may
 continue to have that tiny bit set on the page, at least until another
 insert/update/delete happens on the page. Not sure if there is anything to
 worry about, but looks a bit strange. I looked at the archive but can't see
 any report of the warning, so may be we are safe after all.

The text of that warning message was different in previous releases,
and I have seen the older text come up.  I don't really want to invent
a new WAL record type just to cover that case, because that seems like
it's more likely to cause problems than to fix them.  We could
consider emitting an XLOG_HEAP_NEWPAGE record, perhaps.  I'm not sure
whether it's worth it, though.  This is only going to arise
(hopefully) if your master is corrupted - that little hunk of code is
really a $0.02 solution to a $1.00 problem.  I don't know that we want
to pretend that it's any more than a band-aid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PageIsAllVisible()'s trustworthiness in Hot Standby

2012-12-04 Thread Pavan Deolasee
I was looking at the following code in heapam.c:

 261 /*
 262  * If the all-visible flag indicates that all tuples on the page
are
 263  * visible to everyone, we can skip the per-tuple visibility
tests. But
 264  * not in hot standby mode. A tuple that's already visible to all
 265  * transactions in the master might still be invisible to a
read-only
 266  * transaction in the standby.
 267  */
 268 all_visible = PageIsAllVisible(dp) 
!snapshot-takenDuringRecovery;

Isn't the check for !snapshot-takenDuringRecovery redundant now in master
or whenever since we added crash-safety for VM ? In fact, this comment made
me think if we are really handling index-only scans correctly or not on the
Hot Standby. But apparently we are by forcing conflicting transactions to
abort before redoing VM bit set operation on the standby. The same
mechanism should protect us against the above case. Now I concede that the
entire magic around setting and clearing the page level all-visible bit and
the VM bit and our ability to keep them in sync is something I don't fully
understand, but I see that every operation that sets the page
level PD_ALL_VISIBLE flag also sets the VM bit while holding the buffer
lock and emits a WAL record. So AFAICS  the conflict resolution logic will
take care of the above too.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] PageIsAllVisible()'s trustworthiness in Hot Standby

2012-12-04 Thread Andres Freund
On 2012-12-04 18:38:11 +0530, Pavan Deolasee wrote:
 I was looking at the following code in heapam.c:

  261 /*
  262  * If the all-visible flag indicates that all tuples on the page
 are
  263  * visible to everyone, we can skip the per-tuple visibility
 tests. But
  264  * not in hot standby mode. A tuple that's already visible to all
  265  * transactions in the master might still be invisible to a
 read-only
  266  * transaction in the standby.
  267  */
  268 all_visible = PageIsAllVisible(dp) 
 !snapshot-takenDuringRecovery;

 Isn't the check for !snapshot-takenDuringRecovery redundant now in master
 or whenever since we added crash-safety for VM ? In fact, this comment made
 me think if we are really handling index-only scans correctly or not on the
 Hot Standby. But apparently we are by forcing conflicting transactions to
 abort before redoing VM bit set operation on the standby. The same
 mechanism should protect us against the above case. Now I concede that the
 entire magic around setting and clearing the page level all-visible bit and
 the VM bit and our ability to keep them in sync is something I don't fully
 understand, but I see that every operation that sets the page
 level PD_ALL_VISIBLE flag also sets the VM bit while holding the buffer
 lock and emits a WAL record. So AFAICS  the conflict resolution logic will
 take care of the above too.

Yes, that really seems strange. As you say, were already relying on the
VM to be correct. I think it was simply missed that we can rely on this
since the conflict handling on all-visible was added in
3424bff90f40532527b9cf4f2ad9eaff750682f7.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PageIsAllVisible()'s trustworthiness in Hot Standby

2012-12-04 Thread Robert Haas
On Tue, Dec 4, 2012 at 8:08 AM, Pavan Deolasee pavan.deola...@gmail.com wrote:

 I was looking at the following code in heapam.c:

  261 /*
  262  * If the all-visible flag indicates that all tuples on the page
 are
  263  * visible to everyone, we can skip the per-tuple visibility tests.
 But
  264  * not in hot standby mode. A tuple that's already visible to all
  265  * transactions in the master might still be invisible to a
 read-only
  266  * transaction in the standby.
  267  */
  268 all_visible = PageIsAllVisible(dp) 
 !snapshot-takenDuringRecovery;

 Isn't the check for !snapshot-takenDuringRecovery redundant now in master
 or whenever since we added crash-safety for VM ? In fact, this comment made
 me think if we are really handling index-only scans correctly or not on the
 Hot Standby. But apparently we are by forcing conflicting transactions to
 abort before redoing VM bit set operation on the standby. The same mechanism
 should protect us against the above case. Now I concede that the entire
 magic around setting and clearing the page level all-visible bit and the VM
 bit and our ability to keep them in sync is something I don't fully
 understand, but I see that every operation that sets the page level
 PD_ALL_VISIBLE flag also sets the VM bit while holding the buffer lock and
 emits a WAL record. So AFAICS  the conflict resolution logic will take care
 of the above too.

I wasn't sure whether that could be safely changed.  There's a subtle
distinction here: the PD_ALL_VISIBLE bit isn't the same as the
visibility map bit.  And, technically, the WAL record only fully
protects the setting of *the visibility map bit* not the
PD_ALL_VISIBLE page-level bit.  The purpose of that WAL logging is to
make sure that the page-level bit is never clear while the
visibility-map bit is set; it does not guarantee that the page-level
bit can never be set without issuing a WAL record.  So, for example,
it's perfectly possible for a crash on the master might leave the
page-level bit set while the VM bit is clear.  Now, if that page
somehow makes its way to the standby - via a base backup or a
full-page image - before the tuples it contains are all-visible
according to the standby's xmin horizon, we've got a problem.  Can
that happen?  It seems unlikely, but can we prove it's not possible?
Perhaps, but I wasn't sure.

Index-only scans are safe, because they're looking at the visibility
map itself, not the page-level bit, but the analysis is a little
murkier for sequential scans.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PageIsAllVisible()'s trustworthiness in Hot Standby

2012-12-04 Thread Andres Freund
On 2012-12-04 08:38:48 -0500, Robert Haas wrote:
 On Tue, Dec 4, 2012 at 8:08 AM, Pavan Deolasee pavan.deola...@gmail.com 
 wrote:
 
  I was looking at the following code in heapam.c:
 
   261 /*
   262  * If the all-visible flag indicates that all tuples on the page
  are
   263  * visible to everyone, we can skip the per-tuple visibility tests.
  But
   264  * not in hot standby mode. A tuple that's already visible to all
   265  * transactions in the master might still be invisible to a
  read-only
   266  * transaction in the standby.
   267  */
   268 all_visible = PageIsAllVisible(dp) 
  !snapshot-takenDuringRecovery;
 
  Isn't the check for !snapshot-takenDuringRecovery redundant now in master
  or whenever since we added crash-safety for VM ? In fact, this comment made
  me think if we are really handling index-only scans correctly or not on the
  Hot Standby. But apparently we are by forcing conflicting transactions to
  abort before redoing VM bit set operation on the standby. The same mechanism
  should protect us against the above case. Now I concede that the entire
  magic around setting and clearing the page level all-visible bit and the VM
  bit and our ability to keep them in sync is something I don't fully
  understand, but I see that every operation that sets the page level
  PD_ALL_VISIBLE flag also sets the VM bit while holding the buffer lock and
  emits a WAL record. So AFAICS  the conflict resolution logic will take care
  of the above too.

 I wasn't sure whether that could be safely changed.  There's a subtle
 distinction here: the PD_ALL_VISIBLE bit isn't the same as the
 visibility map bit.  And, technically, the WAL record only fully
 protects the setting of *the visibility map bit* not the
 PD_ALL_VISIBLE page-level bit.  The purpose of that WAL logging is to
 make sure that the page-level bit is never clear while the
 visibility-map bit is set; it does not guarantee that the page-level
 bit can never be set without issuing a WAL record.  So, for example,
 it's perfectly possible for a crash on the master might leave the
 page-level bit set while the VM bit is clear.  Now, if that page
 somehow makes its way to the standby - via a base backup or a
 full-page image - before the tuples it contains are all-visible
 according to the standby's xmin horizon, we've got a problem.  Can
 that happen?  It seems unlikely, but can we prove it's not possible?
 Perhaps, but I wasn't sure.

Youre right, it currently seems to be possible, there's no LSN interlock
prohibiting this as far as I can see.

in lazy_scan_heap we do:

if (!PageIsAllVisible(page))
{
PageSetAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_set(onerel, blkno, 
InvalidXLogRecPtr, vmbuffer,
  
visibility_cutoff_xid);
}

So buf can be written out independently from the xlog record that
visibilitymap_set emits. ISTM we should set the LSN of the heap page as
well as the one of the visibilitymap page. Not sure about the
performance impact of that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PageIsAllVisible()'s trustworthiness in Hot Standby

2012-12-04 Thread Robert Haas
On Tue, Dec 4, 2012 at 9:00 AM, Andres Freund and...@2ndquadrant.com wrote:
 Youre right, it currently seems to be possible, there's no LSN interlock
 prohibiting this as far as I can see.

Yeah, there certainly isn't that.  Now you could perhaps make an
argument that no operation that can propagate a set bit from master to
standby can arrive until after the standby's xmin horizon is
sufficiently current, but that does feel a little fragile to me...
even if it's true now, new WAL record types might break it, for
example.

 in lazy_scan_heap we do:

 if (!PageIsAllVisible(page))
 {
 PageSetAllVisible(page);
 MarkBufferDirty(buf);
 visibilitymap_set(onerel, blkno, 
 InvalidXLogRecPtr, vmbuffer,
   
 visibility_cutoff_xid);
 }

 So buf can be written out independently from the xlog record that
 visibilitymap_set emits. ISTM we should set the LSN of the heap page as
 well as the one of the visibilitymap page. Not sure about the
 performance impact of that.

It would result in a massive increase in WAL traffic when vacuuming an
insert-only table; that's why we didn't do it originally.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PageIsAllVisible()'s trustworthiness in Hot Standby

2012-12-04 Thread Andres Freund
On 2012-12-04 09:33:28 -0500, Robert Haas wrote:
 On Tue, Dec 4, 2012 at 9:00 AM, Andres Freund and...@2ndquadrant.com wrote:
  Youre right, it currently seems to be possible, there's no LSN interlock
  prohibiting this as far as I can see.

 Yeah, there certainly isn't that.  Now you could perhaps make an
 argument that no operation that can propagate a set bit from master to
 standby can arrive until after the standby's xmin horizon is
 sufficiently current, but that does feel a little fragile to me...
 even if it's true now, new WAL record types might break it, for
 example.

I wouldn't want to rely on it...

  in lazy_scan_heap we do:
 
  if (!PageIsAllVisible(page))
  {
  PageSetAllVisible(page);
  MarkBufferDirty(buf);
  visibilitymap_set(onerel, blkno, 
  InvalidXLogRecPtr, vmbuffer,

  visibility_cutoff_xid);
  }
 
  So buf can be written out independently from the xlog record that
  visibilitymap_set emits. ISTM we should set the LSN of the heap page as
  well as the one of the visibilitymap page. Not sure about the
  performance impact of that.

 It would result in a massive increase in WAL traffic when vacuuming an
 insert-only table; that's why we didn't do it originally.

I wonder if we could solve that by having an in-memory-only LSN that
only interlocks the hint bit writes, but doesn't cause full page
writes...

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PageIsAllVisible()'s trustworthiness in Hot Standby

2012-12-04 Thread Robert Haas
On Tue, Dec 4, 2012 at 10:38 AM, Andres Freund and...@2ndquadrant.com wrote:
 I wonder if we could solve that by having an in-memory-only LSN that
 only interlocks the hint bit writes, but doesn't cause full page
 writes...

It's not really a hint bit, because if it fails to get set when the
visibility map bit gets set, you've got queries returning wrong
answers, because the next insert/update/delete on the heap page will
fail to clear the visibility-map bit.

But leaving that aside, I think that might work.  You'd essentially be
preventing the page from being written out of shared_buffers until the
WAL record has hit the disk, and it seems like that should be
sufficient.  Whether it's worth adding that much mechanism for this
problem, I'm less sure about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PageIsAllVisible()'s trustworthiness in Hot Standby

2012-12-04 Thread Pavan Deolasee
On Tue, Dec 4, 2012 at 8:03 PM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Dec 4, 2012 at 9:00 AM, Andres Freund and...@2ndquadrant.com
 wrote:
  Youre right, it currently seems to be possible, there's no LSN interlock
  prohibiting this as far as I can see.

 Yeah, there certainly isn't that.  Now you could perhaps make an
 argument that no operation that can propagate a set bit from master to
 standby can arrive until after the standby's xmin horizon is
 sufficiently current, but that does feel a little fragile to me...
 even if it's true now, new WAL record types might break it, for
 example.


Hmm. Yeah, I do not have guts to prove that either. I'll probably write up
a comment for your consideration to explain why we don't trust
PD_ALL_VISIBLE in Hot standby for seq scans, but still trust VM for
index-only scans.

Another piece of code that caught my attention is this (sorry for digging
up topics that probably have been discussed to death before and I might not
have paid attention to then):

 Whenever we clear PD_ALL_VISIBLE flag, we do that while holding the heap
page lock and also WAL log that action (in fact, piggy-back with
insert/update/delete). The only exception that I saw is in lazy_scan_heap()

 916 else if (PageIsAllVisible(page)  has_dead_tuples)
 917 {
 918 elog(WARNING, page containing dead tuples is marked as
all-visible in relation \%s\ page %u,
 919  relname, blkno);
 920 PageClearAllVisible(page);
 921 MarkBufferDirty(buf);
 922 visibilitymap_clear(onerel, blkno, vmbuffer);
 923 }

Obviously, this is not a case that we expect to occur. But if it does occur
for whatever reason, then even though we will correct the page flag on
master, this would never be populated to the standby. So the standby may
continue to have that tiny bit set on the page, at least until another
insert/update/delete happens on the page. Not sure if there is anything to
worry about, but looks a bit strange. I looked at the archive but can't see
any report of the warning, so may be we are safe after all.

Another side issue: The way we set visibility map bits in the VACUUM, we do
that in the first phase of the vacuum. Now the very fact that we have
selected the page for vacuuming, it will have one or more dead tuples. As
soon as we find a dead tuple in the page, we decide not to mark the page
all-visible and rightly so. But this page would not get marked all-visible
even if we remove all dead tuples in the second phase. Why don't we push
that work to the second phase or at least retry that again ? Of course, we
can't just do it that way without scanning the page again because new
tuples may get added or updated/deleted in the meanwhile. But that can be
addressed with some tricks - like tracking LSN of every such (which has
only dead and live tuples, but not recently dead or insert-in-progress or
delete-in-progress) and then comparing that LSN with the page LSN during
the second phase. If an update had happened in between, we will know that
and we won't deal with the visibility bits.

Another idea would be to track this intermediate state in the page header
itself, something like PD_ALL_VISIBLE_OR_DEAD. If an intermediate
update/delete/insert comes in, it will clear the bit. If we see the bit set
during the second phase of the vacuum, we will know that it contains only
dead and live tuples and the dead ones are being removed now. So we can
mark the page PD_ALL_VISIBLE.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee