Re: [HACKERS] Crash safe visibility map vs hint bits

2010-12-15 Thread Bruce Momjian
Heikki Linnakangas wrote:
 On 04.12.2010 09:14, jes...@krogh.cc wrote:
  There has been a lot discussion about index-only scans and how to make the 
  visibillity map crash safe. Then followed by a good discussion about hint 
  bits.
 
  What seems to be the main concern is the added wal volume and it makes me 
  wonder if there is a way in-between that looks more like hint bits.
 
  How about lazily wal-log the complete visibility map say every X minutes or 
  N amount of tuple updates and make the wal recovery jobs of rechecking 
  visibility of pages touched by the wal stream on recovery.
 
 If you WAL-log the visibility map changes after-the-fact, it doesn't 
 solve the race condition we're struggling with: the visibility map 
 change might hit the disk before the PD_ALL_VISIBLE to the heap page. If 
 you crash, you can end up with a situation where the PD_ALL_VISIBLE flag 
 on the heap page is not set, but the bit in the visibility map is. Which 
 causes serious issues later on.

Based on hacker emails and a discussion I had with Heikki while we were
in Germany, I have updated the index-only scans wiki to document a known
solution to making the visibility map crash-safe for use by index-only
scan use:


http://wiki.postgresql.org/wiki/Index-only_scans#Making_the_Visibility_Map_Crash-Safe

Making the Visibility Map Crash-Safe

Currently, a heap page that has all-visible tuples is marked by vacuum
as PD_ALL_VISIBLE and the visibility map (VM) bit is set. This is
currently unlogged, and a crash could require these to be set again.

The complexity is that for index-only scans, the VM bit has meaning, and
cannot be incorrectly set (though it can be incorrectly cleared because
that would just result in additional heap access). If both
PD_ALL_VISIBLE and the VM bit were to be set, and a crash resulted the
VM bit being written to disk, but not the PD_ALL_VISIBLE bit, a later
heap access that wrote a conditionally-visible row would not know to
clear the VM bit, causing incorrect results for index-only scans.

The solution is to WAL log the VM set bit activity. This will cause
full-page writes for the VM page, but this is much less than WAL-logging
each heap page because a VM page represents many heap pages. This
requires that the VM page not be written to disk until its VM-set WAL
record is fsynced to disk. Also, during crash recovering, reading the
VM-set WAL record would cause both the VM-set and heap PD_ALL_VISIBLE to
be set. 

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Crash safe visibility map vs hint bits

2010-12-04 Thread jes...@krogh.cc
Den 4 Dec 2010 kl. 08:48 skrev Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com:

 On 04.12.2010 09:14, jes...@krogh.cc wrote:
 There has been a lot discussion about index-only scans and how to make the 
 visibillity map crash safe. Then followed by a good discussion about hint 
 bits.
 
 What seems to be the main concern is the added wal volume and it makes me 
 wonder if there is a way in-between that looks more like hint bits.
 
 How about lazily wal-log the complete visibility map say every X minutes or 
 N amount of tuple updates and make the wal recovery jobs of rechecking 
 visibility of pages touched by the wal stream on recovery.
 
 If you WAL-log the visibility map changes after-the-fact, it doesn't solve 
 the race condition we're struggling with: the visibility map change might hit 
 the disk before the PD_ALL_VISIBLE to the heap page. If you crash, you can 
 end up with a situation where the PD_ALL_VISIBLE flag on the heap page is not 
 set, but the bit in the visibility map is. Which causes serious issues later 
 on.

My imagination is probably not as good, but if you at time A wallog the 
complete map and at A+1 you update a tuple so the visibility bit is cleared but 
the map bit change does not happen due to a crash. Then at wal replay time you 
restore the map from time A and if the tuple change at A+1 is represented in 
the wal stream the you also update the visibility map.  This is the situation 
where the heap tuple hit disk but the map is left in a broken state?  Or is it 
a different similar looking situation?

The tuple change in the wal stream will require the system to reinspect the 
page anyway so there shouldn't be any additional disk io on replay due to this.

Jesper
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Crash safe visibility map vs hint bits

2010-12-04 Thread Heikki Linnakangas

On 04.12.2010 10:22, jes...@krogh.cc wrote:

Den 4 Dec 2010 kl. 08:48 skrev Heikki 
Linnakangasheikki.linnakan...@enterprisedb.com:

If you WAL-log the visibility map changes after-the-fact, it doesn't solve the 
race condition we're struggling with: the visibility map change might hit the 
disk before the PD_ALL_VISIBLE to the heap page. If you crash, you can end up 
with a situation where the PD_ALL_VISIBLE flag on the heap page is not set, but 
the bit in the visibility map is. Which causes serious issues later on.


My imagination is probably not as good, but if you at time A wallog the 
complete map and at A+1 you update a tuple so the visibility bit is cleared but 
the map bit change does not happen due to a crash. Then at wal replay time you 
restore the map from time A and if the tuple change at A+1 is represented in 
the wal stream the you also update the visibility map.  This is the situation 
where the heap tuple hit disk but the map is left in a broken state?  Or is it 
a different similar looking situation?


The problem is when a bit is *set* in the visibility map. Clearing a bit 
is not a problem, we already handle that reliably. If you set the flag 
on the heap page and set the bit on the visibility map page, and you 
don't emit a WAL record on either of those operations, the VM page might 
be flushed to disk before the heap page.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Crash safe visibility map vs hint bits

2010-12-04 Thread jes...@krogh.cc
 My imagination is probably not as good, but if you at time A wallog the 
 complete map and at A+1 you update a tuple so the visibility bit is cleared 
 but the map bit change does not happen due to a crash. Then at wal replay 
 time you restore the map from time A and if the tuple change at A+1 is 
 represented in the wal stream the you also update the visibility map.  This 
 is the situation where the heap tuple hit disk but the map is left in a 
 broken state?  Or is it a different similar looking situation?
 
 The problem is when a bit is *set* in the visibility map. Clearing a bit is 
 not a problem, we already handle that reliably. If you set the flag on the 
 heap page and set the bit on the visibility map page, and you don't emit a 
 WAL record on either of those operations, the VM page might be flushed to 
 disk before the heap page.

Ah got it, I thought there was an implicit wal stream representing the change. 
Which there isn't. 

My initial suggestion was actually to trash the map on recovery and write it 
safe out on stop and let it be lazily created/updated on reads.  But I can see 
that the different performance patterns from normal operation and fresh 
recovery can be hard to accept although it would be sufficient/acceptable for 
many of us. 

It is nice that a recovery brings the database into the same state as before 
the crash in all perspectives but in real would the application still have a 
huge performance drop due to cold shared buffers and cold page cache on the os.

A database wide select would the be needed to bring the map up to date. Would 
it be ok to update the inmemory bitmap as a sideeffect on selects?


Jesper
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Crash safe visibility map vs hint bits

2010-12-03 Thread jes...@krogh.cc
There has been a lot discussion about index-only scans and how to make the 
visibillity map crash safe. Then followed by a good discussion about hint bits.

What seems to be the main concern is the added wal volume and it makes me 
wonder if there is a way in-between that looks more like hint bits.

How about lazily wal-log the complete visibility map say every X minutes or N 
amount of tuple updates and make the wal recovery jobs of rechecking visibility 
of pages touched by the wal stream on recovery.

This seems a lot like the checkpoint mechanism but I can't see if it can just 
follow the same pattern directly.

This may also just demonstrate my total lack of understanding of PGs intervals.

Jesper



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Crash safe visibility map vs hint bits

2010-12-03 Thread Heikki Linnakangas

On 04.12.2010 09:14, jes...@krogh.cc wrote:

There has been a lot discussion about index-only scans and how to make the 
visibillity map crash safe. Then followed by a good discussion about hint bits.

What seems to be the main concern is the added wal volume and it makes me 
wonder if there is a way in-between that looks more like hint bits.

How about lazily wal-log the complete visibility map say every X minutes or N 
amount of tuple updates and make the wal recovery jobs of rechecking visibility 
of pages touched by the wal stream on recovery.


If you WAL-log the visibility map changes after-the-fact, it doesn't 
solve the race condition we're struggling with: the visibility map 
change might hit the disk before the PD_ALL_VISIBLE to the heap page. If 
you crash, you can end up with a situation where the PD_ALL_VISIBLE flag 
on the heap page is not set, but the bit in the visibility map is. Which 
causes serious issues later on.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers