Re: [HACKERS] crash-safe visibility map, take three

2011-01-07 Thread Jim Nasby
On Jan 5, 2011, at 8:10 PM, Robert Haas wrote:
 On Wed, Jan 5, 2011 at 3:22 PM, Jesper Krogh jes...@krogh.cc wrote:
 Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE bit
 serve?
 
 If we modify a page on which PD_ALL_VISIBLE isn't set, we don't
 attempt to update the visibility map.  In theory, this is an important
 optimization to reduce contention on the visibility map page, since
 there are something like 64K heap pages per visibility map page.  In
 practice, I'm not sure in what workloads it matters or by how much.

What specific locking are you worried about? The page locks themselves? Isn't 
changing the bit essentially a single instruction operation?

This is sounding like premature optimization... ;)
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2011-01-07 Thread Robert Haas
On Fri, Jan 7, 2011 at 1:28 PM, Jim Nasby j...@nasby.net wrote:
 On Jan 5, 2011, at 8:10 PM, Robert Haas wrote:
 On Wed, Jan 5, 2011 at 3:22 PM, Jesper Krogh jes...@krogh.cc wrote:
 Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE bit
 serve?

 If we modify a page on which PD_ALL_VISIBLE isn't set, we don't
 attempt to update the visibility map.  In theory, this is an important
 optimization to reduce contention on the visibility map page, since
 there are something like 64K heap pages per visibility map page.  In
 practice, I'm not sure in what workloads it matters or by how much.

 What specific locking are you worried about? The page locks themselves? Isn't 
 changing the bit essentially a single instruction operation?

 This is sounding like premature optimization... ;)

I'm not quite invested enough in this to get worried about it, but if
I were, I'd probably start with the buffer content lock, and move on
to the buffer header spinlock and the buf mapping locks.  Changing the
bit is a single instruction once you've got the page pinned and
locked, but that's not free.

(And even if you could hypothetically figure out some clever lock-free
algorithm to avoid some of this work, there's still going to be cache
line contention, which is quite expensive as it turns out.  See the
relatively recent discussions of why our backend startup cost is so
high.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2011-01-05 Thread Jesper Krogh

On 2010-11-30 05:57, Robert Haas wrote:

Last week, I posted a couple of possible designs for making the
visibility map crash-safe, which did not elicit much comment.  Since
this is an important prerequisite to index-only scans, I'm trying
again.


The logic seems to be:

* If the visibillity map should be crash-safe if should be WAL-logged.
* PD_ALL_VISIBLE is currently not being WAL-logged when vacuum sets it.
* WAL logging the visibillity map bit is not that bad (small size).
* WAL-logging the PD_ALL_VISIBLE bit would can WAL-record for the entire
  relation to be written out (potentially huge).

Would the problem not be solved by not trying to keep the two bits in 
sync but

simply removing the PD_ALL_VISIBLE bit in the page-header in favor
for the bit in the visibillity map, that is now WAL-logged and thus safe 
to trust?


Then vacuum could emit WAL records for setting the visibillity map bits, 
combined

with changes on the page could clear it?

The question probably boils down to:

Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE 
bit serve?


I've probably just missed some logic?

Having index-only scans per-table ajustable would make quite some sense..

I have a couple of tables with a high turn-over rate that never get out 
of the OS-cache
anyway, the benefit of index-only scans are quite small, especially if 
they come with

additional overhead on INSERT/UPDATE/DELETE operations, whereas I also have
huge tables with a very small amount of changes. Just the saved IO of 
not having to

go to the heap in some cases would be good.

I could see some benefits in having the index-only scan work on 
tuple-level visibillity information

and not page-level, but that would require a bigger map
(allthough still less than 1% of the heap size if my calculations are 
correct), but
would enable visibillity testing of a tuple without going to the heap 
even other (unrelated)

changes happend on the same page.

Jesper

--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2011-01-05 Thread Robert Haas
On Wed, Jan 5, 2011 at 3:22 PM, Jesper Krogh jes...@krogh.cc wrote:
 Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE bit
 serve?

If we modify a page on which PD_ALL_VISIBLE isn't set, we don't
attempt to update the visibility map.  In theory, this is an important
optimization to reduce contention on the visibility map page, since
there are something like 64K heap pages per visibility map page.  In
practice, I'm not sure in what workloads it matters or by how much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2011-01-05 Thread Jesper Krogh

On 2011-01-06 03:10, Robert Haas wrote:

On Wed, Jan 5, 2011 at 3:22 PM, Jesper Kroghjes...@krogh.cc  wrote:

Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE bit
serve?

If we modify a page on which PD_ALL_VISIBLE isn't set, we don't
attempt to update the visibility map.  In theory, this is an important
optimization to reduce contention on the visibility map page, since
there are something like 64K heap pages per visibility map page.  In
practice, I'm not sure in what workloads it matters or by how much


If that is the only reason, I would suggest just making the visibillity map
pages more sparse. If you just had 500 or 1K heap pages per visibillity map
page, then it would not change a thing for the performance. With 1K heap 
pages

per VM map page a relation with 100GB of data would have a VM map of
the VM map of 100MB, even 100 heap pages per VM-map page would still
enable the database to have visibillity information of 100GB data stored 
in 1GB

of memory.

But it would need testing and benchmarking to find out.

Jesper

--
Jesper




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-03 Thread Florian Weimer
* Robert Haas:

 Those hint bit tests are a single machine instruction.  It's tough
 to beat that.  It's tough to get within two orders of magnitude.
 I'd like to, but I don't see how.

For some scans, it might be possible to hoist the checks out of inner
loops.  (At least in principle, I'm not sure how much that would
interfere with the executor architecture.)

-- 
Florian Weimerfwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-02 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 Or maybe I do.  One other thing I've been thinking about with regard
 to hint bit updates is that we might choose to mark that are
 hint-bit-updated as untidy rather than dirty.  The background

Please review archives, you'll find the idea discussed and some patches
to implement it, by Simon. I suppose you could begin here:

  http://archives.postgresql.org/pgsql-patches/2008-06/msg00113.php

Regards,  
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-02 Thread Robert Haas
On Thu, Dec 2, 2010 at 6:37 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 Robert Haas robertmh...@gmail.com writes:
 Or maybe I do.  One other thing I've been thinking about with regard
 to hint bit updates is that we might choose to mark that are
 hint-bit-updated as untidy rather than dirty.  The background

 Please review archives, you'll find the idea discussed and some patches
 to implement it, by Simon. I suppose you could begin here:

  http://archives.postgresql.org/pgsql-patches/2008-06/msg00113.php

Thanks for the pointer.  I guess that demonstrates that good ideas
will keep floating back up to the surface.  It seems like the idea was
met with generally positive feedback, except that most people seemed
to want a slightly simpler system than what Simon was proposing.  I
suspect that this is mostly suffering from a lack of round tuits.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-02 Thread Jeff Davis
On Wed, 2010-12-01 at 23:22 -0500, Robert Haas wrote:
 Well, let's think about what we'd need to do to make CRCs work
 reliably.  There are two problems.
 
 1. [...]  If we CRC the entire page, the torn pages are never
 acceptable, so every action that modifies the page must be WAL-logged.
 
 2. Currently, we allow hint bits on a page to be updated while holding

[...]

The way I see it, here are the rules we are breaking, and why:

* We don't get an exclusive lock when dirtying a page with hint bits
- Why: we write while reading, and we want good concurrency.
- Why': because after a bulk load, we don't have any hint bits, and the
only way to get them set without VACUUM is to write while reading. I've
never been entirely sure why VACUUM isn't good enough in this case,
aside from the fact that a user might not run VACUUM (and autovacuum
might not either, if it was only a bulk load and no updates/deletes).

* We don't WAL log setting hint bits (which dirties a page)
- Why: because after a bulk load, we don't want to write the data a 4th
time

Hypothetically, if we had a bulk loading strategy, these problems would
go away, and we could follow the rules. Right? Is there a case other
than bulk loading which demands that we break these rules?

And, if we had a bulk loading path, we could probably get away with
writing the data only twice (today, we write it 3 times including the
hint bits) or maybe once if WAL archiving is off.

So, is there a case other than bulk loading for which we need to break
these rules? If not, perhaps we should consider bulk loading a different
problem, and simplify the design of all of these other features (and
allow new storage-touching features to come about, like CRCs, without
exponentially increasing the complexity with each one).

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-02 Thread Kevin Grittner
Jeff Davis pg...@j-davis.com wrote:
 
 And, if we had a bulk loading path, we could probably get away
 with writing the data only twice (today, we write it 3 times
 including the hint bits) or maybe once if WAL archiving is off.
 
If you're counting WAL writes, you're low.  If you don't go out of
your way to avoid it, you are likely to write the data to the table
once during the bulk load, a second time on first read to set the
hint bits, and a third time to freeze data to prevent wrap-around. 
The initial write may or may not be WAL-logged.  The freezing
typically is WAL-logged.
 
So, you can easily write the data to disk four or five times.  With
luck these are spread out uniformly or happen during off-peak
periods.  Unmanaged, a WAL-logged freeze of bulk-loaded data is
somewhat more likely to occur, however, during hours of heavy OLTP
load, because transaction numbers are consumed so quickly.
 
Currently, a VACUUM FREEZE after a bulk load collapses at least two
of those writes to one.  With luck, some pages might still be dirty
in cache, and you can save two of the writes.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-02 Thread Robert Haas
On Thu, Dec 2, 2010 at 2:01 PM, Jeff Davis pg...@j-davis.com wrote:
 * We don't get an exclusive lock when dirtying a page with hint bits
 - Why: we write while reading, and we want good concurrency.
 - Why': because after a bulk load, we don't have any hint bits, and the
 only way to get them set without VACUUM is to write while reading. I've
 never been entirely sure why VACUUM isn't good enough in this case,
 aside from the fact that a user might not run VACUUM (and autovacuum
 might not either, if it was only a bulk load and no updates/deletes).

 * We don't WAL log setting hint bits (which dirties a page)
 - Why: because after a bulk load, we don't want to write the data a 4th
 time

 Hypothetically, if we had a bulk loading strategy, these problems would
 go away, and we could follow the rules. Right? Is there a case other
 than bulk loading which demands that we break these rules?

I'm not really convinced that this problem is confined to bulk
loading.  Every INSERT or UPDATE results in a new tuple that may need
hit bits set and eventually to be frozen.  A bulk load is just a time
when you do lots of inserts all at once; it seems to me that a large
update would cause all the same problems, plus bloat.  The triple I/O
problem exists for small transactions as well (and isn't desirable
there either); it's just less noticeable because the second and third
writes are, like the first one, small.

 And, if we had a bulk loading path, we could probably get away with
 writing the data only twice (today, we write it 3 times including the
 hint bits) or maybe once if WAL archiving is off.

It seems to me that a COPY command executed in a transaction with no
other open snapshots writing to a table created or truncated within
the same transaction should be able to write frozen tuples from the
get-go, regardless of anything else we do.

 So, is there a case other than bulk loading for which we need to break
 these rules? If not, perhaps we should consider bulk loading a different
 problem, and simplify the design of all of these other features (and
 allow new storage-touching features to come about, like CRCs, without
 exponentially increasing the complexity with each one).

I don't think we're exponentially increasing complexity - I think
we're incrementally improving our algorithms.  If you want to propose
a bulk loading path, great.  Propose away!  But without something a
bit more concrete, I don't think it would be appropriate to hold off
making the visibility map crash-safe, on the off chance that our
design for so doing might complicate something else we want to do
later.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-02 Thread Jeff Davis
On Thu, 2010-12-02 at 17:00 -0500, Robert Haas wrote:
 I'm not really convinced that this problem is confined to bulk
 loading.  Every INSERT or UPDATE results in a new tuple that may need
 hit bits set and eventually to be frozen.  A bulk load is just a time
 when you do lots of inserts all at once; it seems to me that a large
 update would cause all the same problems, plus bloat.

A big UPDATE does a lot of work, I don't see any near-term approach for
solving that. Avoiding WAL (and full page writes) for the updates to
PD_ALL_VISIBLE, hint bits, VM bits, etc., would probably be the least of
my concerns in that case.

   The triple I/O
 problem exists for small transactions as well (and isn't desirable
 there either); it's just less noticeable because the second and third
 writes are, like the first one, small.

Bulk loading poses some unique challenges because there is no
opportunity to set PD_ALL_VISIBLE or hint bits before the loading is
complete; and by that time, many checkpoints will have already happened,
and the pages have already hit disk. That means we need to re-read them,
modify them, and write them again (plus WAL, if we were following the
rules).

Small transactions don't suffer from the same problems. They generally
begin and end without an intervening checkpoint. That means that you
have an opportunity to set PD_ALL_VISIBLE or hint bits before the
checkpoint happens, thus avoiding unnecessary extra writes.

Additionally, small transaction workloads will generally have, to some
approximation, some working set of pages. So, even if you do a read of a
single tuple, write PD_ALL_VISIBLE and hint bits (thus dirtying the
page), there is a reasonable chance that someone will come by later and
do an insert/update/delete (thus forcing WAL anyway).

And if the small transaction workload is completely random and you
aren't touching the same pages between checkpoints, then setting hint
bits one-at-a-time is not a good strategy anyway. It would be much
better to do it in bulk with a VACUUM. And if VACUUM does anything
significant to a page, it's going to WAL anyway.

I'm having trouble seeing a case other than bulk-loading which causes a
real problem. Maybe a small-transaction workload with a few long-running
transactions? Even that doesn't seem so bad.

 It seems to me that a COPY command executed in a transaction with no
 other open snapshots writing to a table created or truncated within
 the same transaction should be able to write frozen tuples from the
 get-go, regardless of anything else we do.

Well, some transaction might pick up a snapshot between the time you
begin the copy and the time it commits. We'd need to prevent such a
transaction from actually reading the table.

 I don't think it would be appropriate to hold off
 making the visibility map crash-safe, on the off chance that our
 design for so doing might complicate something else we want to do
 later.

I'm not suggesting we hold off on it at all. To the contrary, I'm
suggesting that we simply log updates of PD_ALL_VISIBLE as well as VM
bits, at least until a performance problem presents itself. That will
_simplify_ the design.

Then, when a performance problem does present itself for a certain use
case, we can see how to fix it. If many cases are affected, then we
might choose one of these more creative solutions that breaks the rules
in controlled ways, understanding the trade-offs. If only bulk loading
is affected, we might choose to address that case directly.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-02 Thread Robert Haas
On Thu, Dec 2, 2010 at 6:37 PM, Jeff Davis pg...@j-davis.com wrote:
 It seems to me that a COPY command executed in a transaction with no
 other open snapshots writing to a table created or truncated within
 the same transaction should be able to write frozen tuples from the
 get-go, regardless of anything else we do.

 Well, some transaction might pick up a snapshot between the time you
 begin the copy and the time it commits. We'd need to prevent such a
 transaction from actually reading the table.

Oh, hmm.  That's awkward.  I guess under present semantics it can see
the table - but not its contents - once the inserting transaction has
committed.  That stinks.

 I don't think it would be appropriate to hold off
 making the visibility map crash-safe, on the off chance that our
 design for so doing might complicate something else we want to do
 later.

 I'm not suggesting we hold off on it at all. To the contrary, I'm
 suggesting that we simply log updates of PD_ALL_VISIBLE as well as VM
 bits, at least until a performance problem presents itself. That will
 _simplify_ the design.

 Then, when a performance problem does present itself for a certain use
 case, we can see how to fix it. If many cases are affected, then we
 might choose one of these more creative solutions that breaks the rules
 in controlled ways, understanding the trade-offs. If only bulk loading
 is affected, we might choose to address that case directly.

I don't think that you can seriously suggest that emitting that volume
of FPIs isn't going to be a problem immediately.  We have to have some
solution to that problem out of the gate.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-02 Thread Jeff Davis
On Thu, 2010-12-02 at 19:06 -0500, Robert Haas wrote:
 I don't think that you can seriously suggest that emitting that volume
 of FPIs isn't going to be a problem immediately.  We have to have some
 solution to that problem out of the gate.

Fair enough. I think you understand my point, and it's easy enough to
remove complexity later if it makes sense to do so.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Bruce Momjian
Heikki Linnakangas wrote:
 On 01.12.2010 03:35, Bruce Momjian wrote:
  Heikki Linnakangas wrote:
  Let's recap what happens when a VM bit is set: You set the
  PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it
  usually isn't), and then set the bit in the VM while keeping the heap
  page locked.
 
  What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to
  happen so the heap page is guaranteed to be on disk, then on next read,
  if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the
  VM bit.
 
 Hmm, you'd somehow have to know if a checkpoint has happened since the 
 flag was set. And it might be a long wait, which makes it less 

Well, doesn't the page LSN tell you that already?

 attractive for index-only scans.

My assumption is that this page will remain read-only for a while, so I
don't see the big downside in a delay of max 5-10 minutes.  For sites
where pages go frequently in and out of read-only status, I don't think
index-only scans are every going to be a big win.  Long-running queries
alone are going to delay how quickly we can set PD_ALL_VISIBLE, so I
don't see an additional 5-10 minutes as a big problem.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Heikki Linnakangas

On 01.12.2010 15:39, Bruce Momjian wrote:

Heikki Linnakangas wrote:

On 01.12.2010 03:35, Bruce Momjian wrote:

Heikki Linnakangas wrote:

Let's recap what happens when a VM bit is set: You set the
PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it
usually isn't), and then set the bit in the VM while keeping the heap
page locked.


What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to
happen so the heap page is guaranteed to be on disk, then on next read,
if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the
VM bit.


Hmm, you'd somehow have to know if a checkpoint has happened since the
flag was set. And it might be a long wait, which makes it less


Well, doesn't the page LSN tell you that already?


If we update the LSN when we set the flag, then we have to write the 
full-page-image. That's very expensive. If we don't update the LSN, then 
the LSN says nothing about when the flag was set.



attractive for index-only scans.


My assumption is that this page will remain read-only for a while, so I
don't see the big downside in a delay of max 5-10 minutes.  For sites
where pages go frequently in and out of read-only status, I don't think
index-only scans are every going to be a big win.  Long-running queries
alone are going to delay how quickly we can set PD_ALL_VISIBLE, so I
don't see an additional 5-10 minutes as a big problem.


Perhaps we could live with it, but it would be annoying to have to 
checkpoint after a data load, before index-only scans kick in.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Kevin Grittner
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
 
 it would be annoying to have to checkpoint after a data load
 
Heck, in my world it's currently pretty much a necessity to run
VACUUM FREEZE ANALYZE on a table after a data load before it's
reasonable to expose the table to production use.  It would hardly
be an inconvenience to also run a CHECKPOINT.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Bruce Momjian
Heikki Linnakangas wrote:
 On 01.12.2010 15:39, Bruce Momjian wrote:
  Heikki Linnakangas wrote:
  On 01.12.2010 03:35, Bruce Momjian wrote:
  Heikki Linnakangas wrote:
  Let's recap what happens when a VM bit is set: You set the
  PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it
  usually isn't), and then set the bit in the VM while keeping the heap
  page locked.
 
  What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to
  happen so the heap page is guaranteed to be on disk, then on next read,
  if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the
  VM bit.
 
  Hmm, you'd somehow have to know if a checkpoint has happened since the
  flag was set. And it might be a long wait, which makes it less
 
  Well, doesn't the page LSN tell you that already?
 
 If we update the LSN when we set the flag, then we have to write the 
 full-page-image. That's very expensive. If we don't update the LSN, then 
 the LSN says nothing about when the flag was set.

Oh, we don't update the LSN when we set the PD_ALL_VISIBLE flag?  OK,
please let me think some more.  Thanks.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Robert Haas
On Wed, Dec 1, 2010 at 9:57 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:

 it would be annoying to have to checkpoint after a data load

 Heck, in my world it's currently pretty much a necessity to run
 VACUUM FREEZE ANALYZE on a table after a data load before it's
 reasonable to expose the table to production use.  It would hardly
 be an inconvenience to also run a CHECKPOINT.

The goal here is to make that better, not worse!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Robert Haas
On Wed, Dec 1, 2010 at 10:36 AM, Bruce Momjian br...@momjian.us wrote:
 Oh, we don't update the LSN when we set the PD_ALL_VISIBLE flag?  OK,
 please let me think some more.  Thanks.

As far as I can tell, there are basically two viable solutions on the
table here.

1. Every time we observe a page as all-visible, (a) set the
PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the
bit in the visibility map page, bumping the LSN as usual, and (c) emit
a WAL record indicating the relation and block number.  On redo of
this record, set both the page-level bit and the visibility map bit.
The heap page may hit the disk before the WAL record, but that's OK;
it just might result in a little extra work until some subsequent
operation gets the visibility map bit set.  The visibility map page
page may hit the disk before the heap page, but that's OK too, because
the WAL record will already be on disk due to the LSN interlock.  If a
crash occurs before the heap page is flushed, redo will fix the heap
page.  (The heap page will get flushed as part of the next checkpoint,
if not sooner, so by the time the redo pointer advances past the WAL
record, there's no longer a risk.)

2. Every time we observe a page as all-visible, (a) set the
PD_ALL_VISIBLE bit on the page, without bumping the LSN, (b) set the
bit in the visibility map page, bumping the LSN if a WAL record is
issued (which only happens sometimes, read on), and (c) emit a WAL
record indicating the chunk of 128 visibility map bits which
contains the bit we just set - but only if we're now dealing with a
new group of 128 visibility map bits or if a checkpoint has intervened
since the last such record we emitted.  On redo of this record, clear
the visibility map bits in each chunk.  The heap page may hit the disk
before the WAL record, but that's OK for the same reasons as in plan
#1.  The visibility map page may hit the disk before the heap page,
but that's OK too, because the WAL record will already be on disk to
due the LSN interlock.  If a crash occurs before the heap page makes
it to disk, then redo will clear the visibility map bits, leaving them
to be reset by a subsequent VACUUM.

As is typical with good ideas, neither of these seems terribly
complicated in retrospect.  Kudos to Heikki for thinking them up and
explaining them.

After some thought, I think that approach #1 is probably better,
because it propagates visibility map bits to the standby.  During
index-only scans, the standby will have to ignore them during HS
operation just as it currently ignores the PD_ALL_VISIBLE page-level
bit, but if and when the standby is promoted to master, it's important
to have those bits already set, both for index-only scans and also
because, absent that, the first autovacuum on each table will end up
scanning the whole things and dirtying tremendous gobs of data setting
all those bits, which is just the sort of ugly surprise that we don't
want to give people right after they've been forced to perform a
failover.

I think we can improve this a bit further by also introducing a
HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with
FrozenXID.  This allows us to freeze tuples aggressively - if we want
- without losing any forensic information.  We can then modify the
above algorithm slightly, so that when we observe that a page is all
visible, we not only set PD_ALL_VISIBLE on the page but also
HEAP_XMIN_FROZEN on each tuple.  The WAL record marking the page as
all-visible then doubles as a WAL record marking it frozen,
eliminating the need to dirty the page yet again at anti-wraparound
vacuum time.  It'll still be a net increase in WAL volume (as Heikki
pointed out) but the added WAL volume is small compared with the I/O
involved in writing out the dirty heap pages (as Tom pointed out), so
it should hopefully be OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Heikki Linnakangas

On 01.12.2010 18:25, Robert Haas wrote:

I think we can improve this a bit further by also introducing a
HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with
FrozenXID.  This allows us to freeze tuples aggressively - if we want
- without losing any forensic information.  We can then modify the
above algorithm slightly, so that when we observe that a page is all
visible, we not only set PD_ALL_VISIBLE on the page but also
HEAP_XMIN_FROZEN on each tuple.


Hmm, actually, if we're willing to believe PD_ALL_VISIBLE in the page 
header over the xmin/xmax on the tuples, we could simply not bother 
doing anti-wraparound vacuums for pages that have the flag set. I'm not 
sure what changes that would require outside heapam.c, as we'd have to 
be careful to not trust the xmin/xmax if the flag was set.


The first update on the page that clears the flag would need to freeze 
all the tuples in that scheme.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 As far as I can tell, there are basically two viable solutions on the
 table here.

 1. Every time we observe a page as all-visible, (a) set the
 PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the
 bit in the visibility map page, bumping the LSN as usual, and (c) emit
 a WAL record indicating the relation and block number.  On redo of
 this record, set both the page-level bit and the visibility map bit.
 The heap page may hit the disk before the WAL record, but that's OK;

Um, no it isn't.  Suppose the heap page gets to disk but we crash before
the WAL record does.  Now we have a persistent state where the heap page
is marked PD_ALL_VISIBLE but the corresponding VM bit is not set.  The
VM bit will never become set, either, because operations on the heap
page will see PD_ALL_VISIBLE and assume it already is set.  This state
of affairs might be acceptable from a correctness standpoint, but not
from a performance standpoint.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Hmm, actually, if we're willing to believe PD_ALL_VISIBLE in the page 
 header over the xmin/xmax on the tuples, we could simply not bother 
 doing anti-wraparound vacuums for pages that have the flag set. I'm not 
 sure what changes that would require outside heapam.c, as we'd have to 
 be careful to not trust the xmin/xmax if the flag was set.

That seems pretty ugly/dangerous.  If we're going to try to do something
here, I much prefer Robert's approach of marking each tuple in the tuple
header.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Heikki Linnakangas

On 01.12.2010 18:40, Tom Lane wrote:

Robert Haasrobertmh...@gmail.com  writes:

As far as I can tell, there are basically two viable solutions on the
table here.



1. Every time we observe a page as all-visible, (a) set the
PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the
bit in the visibility map page, bumping the LSN as usual, and (c) emit
a WAL record indicating the relation and block number.  On redo of
this record, set both the page-level bit and the visibility map bit.
The heap page may hit the disk before the WAL record, but that's OK;


Um, no it isn't.  Suppose the heap page gets to disk but we crash before
the WAL record does.  Now we have a persistent state where the heap page
is marked PD_ALL_VISIBLE but the corresponding VM bit is not set.  The
VM bit will never become set, either, because operations on the heap
page will see PD_ALL_VISIBLE and assume it already is set.  This state
of affairs might be acceptable from a correctness standpoint, but not
from a performance standpoint.


The next vacuum will fix it. We already handle that. It's no different 
from the situation where neither change makes it to the disk.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 01.12.2010 18:40, Tom Lane wrote:
 Um, no it isn't.  Suppose the heap page gets to disk but we crash before
 the WAL record does.  Now we have a persistent state where the heap page
 is marked PD_ALL_VISIBLE but the corresponding VM bit is not set.  The
 VM bit will never become set, either, because operations on the heap
 page will see PD_ALL_VISIBLE and assume it already is set.  This state
 of affairs might be acceptable from a correctness standpoint, but not
 from a performance standpoint.

 The next vacuum will fix it. We already handle that. It's no different 
 from the situation where neither change makes it to the disk.

Well, as long as we're careful not to optimize away setting the VM bit
on the grounds that PD_ALL_VISIBLE is already set, I suppose that will
work.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I think we can improve this a bit further by also introducing a
 HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with
 FrozenXID.  This allows us to freeze tuples aggressively - if we want
 - without losing any forensic information.

So far so good ...

 We can then modify the
 above algorithm slightly, so that when we observe that a page is all
 visible, we not only set PD_ALL_VISIBLE on the page but also
 HEAP_XMIN_FROZEN on each tuple.  The WAL record marking the page as
 all-visible then doubles as a WAL record marking it frozen,
 eliminating the need to dirty the page yet again at anti-wraparound
 vacuum time.

but this seems a lot more dubious/fragile.  The basic problem is that
it's not clear whether HEAP_XMIN_FROZEN is a hint bit or essential
data.  If you want to set it without the overhead of an LSN bump or a
possible FPI in WAL, then it's a hint bit.  But if you're using it to
protect clog truncation then it's essential data.  Perhaps you can make
this work but there are some nonobvious requirements:

1. Seeing PD_ALL_VISIBLE set does not excuse vacuum from having to
iterate through all the tuples on the page checking for
HEAP_XMIN_FROZEN.  This is because the non-logged update of the page
might have been torn on the way to disk, such that PD_ALL_VISIBLE got
set but not all of the FROZEN bits did.

2. During an anti-wraparound vacuum, you *need to* emit a WAL record
when setting HEAP_XMIN_FROZEN.  It's not a hint, any more than writing
FrozenXID is now.

Actually, #2 isn't even good enough.  What if vacuum passes over a page
and finds all the FROZEN bits set, but the reason they're set is that
somebody else updated them in hint fashion microseconds before?  It
seems possible that those bits might not make it to disk before a
subsequent crash.  The only way to be really sure those bits are set is
to emit a WAL record that says to set them, whether or not they seem to
be set already.  While the WAL record could be small, you'd need one for
every page, making the argument that this saves I/O somewhat dubious.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Robert Haas
On Wed, Dec 1, 2010 at 11:40 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 01.12.2010 18:25, Robert Haas wrote:

 I think we can improve this a bit further by also introducing a
 HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with
 FrozenXID.  This allows us to freeze tuples aggressively - if we want
 - without losing any forensic information.  We can then modify the
 above algorithm slightly, so that when we observe that a page is all
 visible, we not only set PD_ALL_VISIBLE on the page but also
 HEAP_XMIN_FROZEN on each tuple.

 Hmm, actually, if we're willing to believe PD_ALL_VISIBLE in the page header
 over the xmin/xmax on the tuples, we could simply not bother doing
 anti-wraparound vacuums for pages that have the flag set. I'm not sure what
 changes that would require outside heapam.c, as we'd have to be careful to
 not trust the xmin/xmax if the flag was set.

 The first update on the page that clears the flag would need to freeze all
 the tuples in that scheme.

That seems more complex for no particular gain.  I guess it saves an
infomask bit, but I'm willing to burn one to reduce code complexity.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Robert Haas
On Wed, Dec 1, 2010 at 12:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I think we can improve this a bit further by also introducing a
 HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with
 FrozenXID.  This allows us to freeze tuples aggressively - if we want
 - without losing any forensic information.

 So far so good ...

 We can then modify the
 above algorithm slightly, so that when we observe that a page is all
 visible, we not only set PD_ALL_VISIBLE on the page but also
 HEAP_XMIN_FROZEN on each tuple.  The WAL record marking the page as
 all-visible then doubles as a WAL record marking it frozen,
 eliminating the need to dirty the page yet again at anti-wraparound
 vacuum time.

 but this seems a lot more dubious/fragile.  The basic problem is that
 it's not clear whether HEAP_XMIN_FROZEN is a hint bit or essential
 data.  If you want to set it without the overhead of an LSN bump or a
 possible FPI in WAL, then it's a hint bit.  But if you're using it to
 protect clog truncation then it's essential data.  Perhaps you can make
 this work but there are some nonobvious requirements:

 1. Seeing PD_ALL_VISIBLE set does not excuse vacuum from having to
 iterate through all the tuples on the page checking for
 HEAP_XMIN_FROZEN.  This is because the non-logged update of the page
 might have been torn on the way to disk, such that PD_ALL_VISIBLE got
 set but not all of the FROZEN bits did.

Good point.  If we see the bit set in the visibility map set, it
should be safe to infer that the PD_ALL_VISIBLE bit and all
HEAP_XMIN_FROZEN bits are set.  But if the visibility map bit is NOT
set, we must check PD_ALL_VISIBLE and, whether it's set or not, each
individual HEAP_XMIN_FROZEN bit.

 2. During an anti-wraparound vacuum, you *need to* emit a WAL record
 when setting HEAP_XMIN_FROZEN.  It's not a hint, any more than writing
 FrozenXID is now.

 Actually, #2 isn't even good enough.  What if vacuum passes over a page
 and finds all the FROZEN bits set, but the reason they're set is that
 somebody else updated them in hint fashion microseconds before?  It
 seems possible that those bits might not make it to disk before a
 subsequent crash.  The only way to be really sure those bits are set is
 to emit a WAL record that says to set them, whether or not they seem to
 be set already.  While the WAL record could be small, you'd need one for
 every page, making the argument that this saves I/O somewhat dubious.

I think that we would only ever allow HEAP_XMIN_FROZEN to be set as
part of a WAL-logged operation.  Either we are marking the page
all-visible  - in which case we're emitting the new WAL record type
XLOG_HEAP_ALLVISIBLE - or we're freezing individual tuples on a page
where very old and very new tuples are intermixed - in which case we
emit the existing XLOG_HEAP2_FREEZE.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Jeff Davis
On Wed, 2010-12-01 at 11:25 -0500, Robert Haas wrote:
 1. Every time we observe a page as all-visible, (a) set the
 PD_ALL_VISIBLE bit on the page, without bumping the LSN;

...

 2. Every time we observe a page as all-visible, (a) set the
 PD_ALL_VISIBLE bit on the page, without bumping the LSN,

My concern here is that both of these proposals introduce something new
that is hint-bit-like, in the sense that we change a data page and
potentially write it to disk without necessarily writing WAL. This will
make it even harder to do CRCs in the future, which are also an
important feature.

I understand the motivation: we don't want to write a page to disk yet
one more time. I think this is only a problem for bulk loading, where it
does seem pretty ridiculous. But for cases other than bulk loading, it
doesn't seem like a problem (likely, there will be other changes to the
page anyway).

Is it worth taking a step back, and saying that we're trying to hard to
use the exact same mechanism for OLTP and bulk loading?

I know bulk loading poses a number of problems. But it might be easier
to make a passable bulk-loading mechanism (not necessarily a general
one) than to continue to fight the WAL-before-data invariant on every
new feature[1].

I'm not proposing anything in specific, just a potential re-framing of
the problem.

Regards,
Jeff Davis

[1] If we obey WAL-before-data everywhere WAL is required, the
correctness can be reasonably understood by mere mortals. But when we
start breaking it for special cases, observe how few people can even
participate in such discussions.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Robert Haas
On Wed, Dec 1, 2010 at 3:31 PM, Jeff Davis pg...@j-davis.com wrote:
 On Wed, 2010-12-01 at 11:25 -0500, Robert Haas wrote:
 1. Every time we observe a page as all-visible, (a) set the
 PD_ALL_VISIBLE bit on the page, without bumping the LSN;

 ...

 2. Every time we observe a page as all-visible, (a) set the
 PD_ALL_VISIBLE bit on the page, without bumping the LSN,

 My concern here is that both of these proposals introduce something new
 that is hint-bit-like, in the sense that we change a data page and
 potentially write it to disk without necessarily writing WAL. This will
 make it even harder to do CRCs in the future, which are also an
 important feature.

Well, there's an intermediate concept between logged and unlogged,
which is an update that is WAL-logged but doesn't require torn-page
protection because writing some but not all of the 512-byte sectors in
the page won't break anything.  We're using the LSN to track both
whether the page is logged and whether the update necessitates
torn-page protection.  That might be fixable.

As for CRCs, there's a pretty direct chain of inference here:

1. CRCs are hard (really impossible) because we have hint bits.
2. Hint bits are necessary because an old XID can't be viewed as
guaranteed committed.
3. An old XID can't be viewed as guaranteed committed because we clean
up aborted transactions lazily rather than eagerly.

Changing (3) would amount to a rewrite of our whole MVCC architecture.
 If we switched from per-tuple MVCC based on XIDs to per-page MVCC
based on LSNs and a rollback segment, all of this stuff would go out
the window.  Hint bits, gone.  Anti-wraparound VACUUM, gone.  CRCs,
feasible.  Visibility map... we might still need that, but the
page-level bits go away.

Of course, it would also create new problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
  If we switched from per-tuple MVCC based on XIDs to per-page MVCC
 based on LSNs and a rollback segment, all of this stuff would go out
 the window.  Hint bits, gone.  Anti-wraparound VACUUM, gone.  CRCs,
 feasible.  Visibility map... we might still need that, but the
 page-level bits go away.

 Of course, it would also create new problems.

Yup, we've seen that proposal before.  It's called Oracle.  There's
no good reason to believe that we'd have a net win after we were done
switching over ... not to mention the likelihood that they hold a ton
of patents about particular aspects of doing things that way.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Jeff Davis
On Wed, 2010-12-01 at 15:59 -0500, Robert Haas wrote:
 As for CRCs, there's a pretty direct chain of inference here:
 
 1. CRCs are hard (really impossible) because we have hint bits.

I would disagree with impossible. If we don't set hint bits during
reading; and when we do set them, we log them (including full page
writes); then we can do CRCs.

Those things have costs, but we might be willing to pay them if we had a
bulk loading strategy that avoids or mitigates the costs.

The reason we can't do CRCs now is because hint bits violate the
WAL-before-data rule; not because of hint bits themselves. We're talking
about adding another feature that breaks the rule, in a more complex way
than hint bits.

I just wanted to step back for a second and consider the problem from a
different angle before we committed to that.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Jim Nasby
On Dec 1, 2010, at 2:59 PM, Robert Haas wrote:
 2. Hint bits are necessary because an old XID can't be viewed as
 guaranteed committed.

Hmm... I thought hint bits were necessary because it's too expensive to query 
CLOG for every tuple. If my understanding is correct then if we fix the CLOG 
performance issue we don't need hint bits anymore.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Robert Haas
On Wed, Dec 1, 2010 at 5:24 PM, Jeff Davis pg...@j-davis.com wrote:
 On Wed, 2010-12-01 at 15:59 -0500, Robert Haas wrote:
 As for CRCs, there's a pretty direct chain of inference here:

 1. CRCs are hard (really impossible) because we have hint bits.

 I would disagree with impossible. If we don't set hint bits during
 reading; and when we do set them, we log them (including full page
 writes); then we can do CRCs.

 Those things have costs, but we might be willing to pay them if we had a
 bulk loading strategy that avoids or mitigates the costs.

 The reason we can't do CRCs now is because hint bits violate the
 WAL-before-data rule; not because of hint bits themselves. We're talking
 about adding another feature that breaks the rule, in a more complex way
 than hint bits.

 I just wanted to step back for a second and consider the problem from a
 different angle before we committed to that.

Well, let's think about what we'd need to do to make CRCs work
reliably.  There are two problems.

1. Currently, hint bits are not vulnerable to the torn-page problem,
because the hint bit change is to single byte, and neither of the two
possible values for the affected byte invalidate the contents of the
block.  Thus, they do not need to be WAL-logged - we're happy if they
all make it to disk, but if some or none of them make it to disk,
that's OK.  If we CRC the entire page, the torn pages are never
acceptable, so every action that modifies the page must be WAL-logged.

2. Currently, we allow hint bits on a page to be updated while holding
a shared-content lock; we also allow the page to be written while
holding only a shared-content lock.  This makes it a bit
nondeterministic whether the hint bit update is included in the write,
but we don't care.  If we were to compute a CRC and write that into
the page before writing it out to the OS, it would be unacceptable for
the page contents to change thereafter in any way.

So, to make CRCs work, we'd need to (a) WAL-log every hint bit update
and (b) change either hint bit updates or page write-outs to require
an exclusive content lock rather than a shared one.  The first would
result in an increase in I/O, while the second would result in a
reduction in concurrency.  Thinking about it a bit, I wonder if we
couldn't mitigate (b) quite a bit by adding a new level for buffer
content locks, share exclusive.  This would conflict with itself and
with exclusive but not with share locks, and would be required to set
hint bits or write the buffer.  When setting hint bits with only a
share lock, we'd attempt to do a non-blocking upgrade to share
exclusive.  If that failed - because someone else already held a
share-exclusive lock - we'd just skip the hint bit update.  I have no
idea what to do about (a), though.

*thinks some more*

Or maybe I do.  One other thing I've been thinking about with regard
to hint bit updates is that we might choose to mark that are
hint-bit-updated as untidy rather than dirty.  The background
writer could treat these pages as dirty, but checkpoints and backends
doing desperation-buffer-reclamation could treat them as clean.  This
would allow hint bit updates to trickle out to disk in the background,
without letting them bottleneck anything on the critical path.  Maybe
we could do this - if CRCs are enabled and we are the background
writer cleaning scan, write dirty buffers in the usual way and write
untidy buffers to a double-write buffer (to borrow a page from
InnoDB) along with the current LSN.  At the conclusion of the scan,
fsync() the double-write buffer and then write the buffers a second
time in the normal fashion if their mappings haven't changed and they
are still untidy.  On redo, when you reach an LSN recorded in the
double-write buffer, restore the FPI.  In general, a double-write
buffer is inferior to our existing FPI system, because you end up
needing to fsync both the double-write buffer and the WAL stream.  But
it might be OK in this case, if it's all happening as background work.

--

With respect to your concerns about this method, after some thought, I
think #2 isn't an issue at all, because I don't believe we can risk
having our update to HEAP_XMIN_FROZEN stomped on by someone else
trying to set HEAP_XMIN_COMMITTED, so I think that when making a page
all-visible we'll need an exclusive (or share-exclusive) content lock
anyway.  As to #1, I think we could restore the WAL-before-data rules
if we kept a bit somewhere in the buffer descriptor indicating whether
a given buffer has had an FPI since the last checkpoint.  Then,
perhaps, WAL records that are torn-page-safe could bump the TLI
without emitting a FPI.  The next WAL record to come along would be
able to determine that one was still needed.  Of course, to make CRCs
work with this, you still need to emit FPIs or use a double-write
buffer.  That sucks, and I don't know what to do about it.  Since our
current hint-bit updates are not WAL-logged, a CRC implementation over
it could 

Re: [HACKERS] crash-safe visibility map, take three

2010-12-01 Thread Robert Haas
On Wed, Dec 1, 2010 at 6:41 PM, Jim Nasby j...@nasby.net wrote:
 On Dec 1, 2010, at 2:59 PM, Robert Haas wrote:
 2. Hint bits are necessary because an old XID can't be viewed as
 guaranteed committed.

 Hmm... I thought hint bits were necessary because it's too expensive to query 
 CLOG for every tuple. If my understanding is correct then if we fix the CLOG 
 performance issue we don't need hint bits anymore.

My point is - in InnoDB, when an XID aborts, it's not removed from
their equivalent of the ProcArray until it has been fully rolled back.
 So if you see an XID prior to GlobalXmin, you don't need a hint bit
to tell you whether it's committed.  Go directly to yes.

It is also correct to say that it's too expensive to query CLOG for
every tuple.  But InnoDB isn't solving that by having a cheaper CLOG;
they're solving it by not having CLOG at all.  I doubt that it's
possible to make CLOG lookups so cheap that we don't need hint bits
any more.  Take a look at HeapTupleSatisfiesMVCC().  Those hint bit
tests are a single machine instruction.  It's tough to beat that.
It's tough to get within two orders of magnitude.  I'd like to, but I
don't see how.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Rob Wultsch
On Mon, Nov 29, 2010 at 9:57 PM, Robert Haas robertmh...@gmail.com wrote:
 1. Pin each visibility map page.  If any VM_BECOMING_ALL_VISIBLE bits
 are set, take the exclusive content lock for long enough to clear
 them.

I wonder what the performance hit will be to workloads with contention
and if this feature should be optional.

-- 
Rob Wultsch
wult...@gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Some care is needed with checkpoints. Setting visibility map bits in step 2
 is safe because crash recovery will replay the intent XLOG record and clear
 any incorrectly set bits. But if a checkpoint has happened after the intent
 XLOG record was written, that's not true. This can be avoided by checking
 RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed
 since the last intent XLOG record was written.

It seems like you'll need to hold some kind of lock between the time
you examine RedoRecPtr and the time you actually examine the bit.
WALInsertLock in shared mode, maybe?

 There's a small race condition in the way a visibility map bit is currently
 cleared. When a heap page is updated, it is locked, the update is
 WAL-logged, and the lock is released. The visibility map page is updated
 only after that. If the final vacuum XLOG record is written just after
 updating the heap page, but before the visibility map bit is cleared,
 replaying the final XLOG record will set a bit that should not have been
 set.

Well, if that final XLOG record isn't necessary for correctness
anyway, the obvious thing to do seems to be - don't write it.  Crashes
are not so common that loss of even a full hour's visibility map bits
in the event that we have one seems worth killing ourselves over.  And
not everybody sets checkpoint_timeout to an hour, and not all
checkpoints are triggered by checkpoint_timeout, and not all crashes
happen just before it expires.  Seems like we might be better off
writing that much less WAL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 30.11.2010 06:57, Robert Haas wrote:
 I can't say I'm totally in love with any of these designs.  Anyone
 else have any ideas, or any opinions about which one is best?

 Well, the design I've been pondering goes like this:

Wouldn't it be easier and more robust to just consider VM bit changes to
be part of the WAL-logged actions?  That would include updating LSNs on
VM pages and flushing VM pages to disk during checkpoint based on their
LSN values.  All of these other schemes seem too complicated and not
provably correct.

Of course, that'd mean doing the bit changes inside the critical
sections for the related actions, so it's not a trivial change
code-wise, but neither are these other ideas.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 17:32, Robert Haas wrote:

On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Some care is needed with checkpoints. Setting visibility map bits in step 2
is safe because crash recovery will replay the intent XLOG record and clear
any incorrectly set bits. But if a checkpoint has happened after the intent
XLOG record was written, that's not true. This can be avoided by checking
RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed
since the last intent XLOG record was written.


It seems like you'll need to hold some kind of lock between the time
you examine RedoRecPtr and the time you actually examine the bit.
WALInsertLock in shared mode, maybe?


It's enough to hold an exclusive lock on the visibility map page. You 
have to set the bit first, and then check RedoRecPtr, and if it changed, 
write the XLOG record before releasing the lock. If RedoRecPtr changes 
any time before we check RedoRecPtr, we'll write the XLOG record so 
we're safe. If it changes after that, we're safe because the checkpoint 
will flush the updated heap page and visibility map page.



There's a small race condition in the way a visibility map bit is currently
cleared. When a heap page is updated, it is locked, the update is
WAL-logged, and the lock is released. The visibility map page is updated
only after that. If the final vacuum XLOG record is written just after
updating the heap page, but before the visibility map bit is cleared,
replaying the final XLOG record will set a bit that should not have been
set.


Well, if that final XLOG record isn't necessary for correctness
anyway, the obvious thing to do seems to be - don't write it.  Crashes
are not so common that loss of even a full hour's visibility map bits
in the event that we have one seems worth killing ourselves over.  And
not everybody sets checkpoint_timeout to an hour, and not all
checkpoints are triggered by checkpoint_timeout, and not all crashes
happen just before it expires.  Seems like we might be better off
writing that much less WAL.


Yeah, possibly. It also means that the set bits will not propagate to 
standby servers, though.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 10:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 30.11.2010 06:57, Robert Haas wrote:
 I can't say I'm totally in love with any of these designs.  Anyone
 else have any ideas, or any opinions about which one is best?

 Well, the design I've been pondering goes like this:

 Wouldn't it be easier and more robust to just consider VM bit changes to
 be part of the WAL-logged actions?  That would include updating LSNs on
 VM pages and flushing VM pages to disk during checkpoint based on their
 LSN values.  All of these other schemes seem too complicated and not
 provably correct.

What WAL-logged actions?

The problem case is where a page has no tuples or line pointers that
need to be removed, and all we need to do is mark it all-visible.  We
don't current WAL-log anything in that case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 17:38, Tom Lane wrote:

Heikki Linnakangasheikki.linnakan...@enterprisedb.com  writes:

On 30.11.2010 06:57, Robert Haas wrote:

I can't say I'm totally in love with any of these designs.  Anyone
else have any ideas, or any opinions about which one is best?



Well, the design I've been pondering goes like this:


Wouldn't it be easier and more robust to just consider VM bit changes to
be part of the WAL-logged actions?  That would include updating LSNs on
VM pages and flushing VM pages to disk during checkpoint based on their
LSN values.  All of these other schemes seem too complicated and not
provably correct.


The vm bit can be set once all the tuples on the page become visible to 
everyone. There is no WAL-logged action at that point we could piggyback on.


Clearing the bit is already handled like that - replay of heap 
insert/update/delete records clear the visibility map bit.



Of course, that'd mean doing the bit changes inside the critical
sections for the related actions, so it's not a trivial change
code-wise, but neither are these other ideas.


Yeah, I'm not terribly excited about any of these schemes. The intent 
record seems like the simplest one, but even that is quite different 
from the traditional WAL-logging we do that it makes me slightly nervous.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 10:43 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 It seems like you'll need to hold some kind of lock between the time
 you examine RedoRecPtr and the time you actually examine the bit.
 WALInsertLock in shared mode, maybe?

 It's enough to hold an exclusive lock on the visibility map page. You have
 to set the bit first, and then check RedoRecPtr, and if it changed, write
 the XLOG record before releasing the lock. If RedoRecPtr changes any time
 before we check RedoRecPtr, we'll write the XLOG record so we're safe. If it
 changes after that, we're safe because the checkpoint will flush the updated
 heap page and visibility map page.

Brilliant.  I assume that we need to call GetRedoRecPtr() after taking
the exclusive lock on the page, though?

 Yeah, possibly. It also means that the set bits will not propagate to
 standby servers, though.

That's definitely sucky, but in some ways it would be more complicated
if they did, because I don't think all-visible on the master implies
all-visible on the standby.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

Here's one more idea:

The trivial solution to this is to WAL-log setting the visibility map 
bit, like we WAL-log any other operation. Lock the heap page, lock the 
visibility map page, write WAL-record, and release locks. That works, 
but the problem is that it creates quite a lot of new WAL traffic.


We could reduce the WAL traffic by simply updating multiple pages at a 
time. Lock N pages, lock the visibility map page, write one WAL record, 
and release locks. If N=10, for example, we only need to WAL-log a 
couple of bytes per page, so the WAL volume should be acceptable. The 
downside is that you need to keep more pages locked at the same time, 
but maybe that's not too bad.


This wouldn't require anything special, which means fewer hard-to-debug 
visibility  recovery bugs.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 30.11.2010 17:38, Tom Lane wrote:
 Wouldn't it be easier and more robust to just consider VM bit changes to
 be part of the WAL-logged actions?  That would include updating LSNs on
 VM pages and flushing VM pages to disk during checkpoint based on their
 LSN values.  All of these other schemes seem too complicated and not
 provably correct.

 The vm bit can be set once all the tuples on the page become visible to 
 everyone. There is no WAL-logged action at that point we could piggyback on.

So you start emitting a WAL entry for the act of setting the VM bit
(and I guess the page header hint bit would be included in that too).

 Yeah, I'm not terribly excited about any of these schemes. The intent 
 record seems like the simplest one, but even that is quite different 
 from the traditional WAL-logging we do that it makes me slightly nervous.

I'm not convinced it works at all.  Consider write intent record,
checkpoint, set bit, crash before completing vacuum.  There will be
no second intent record at which you could clean up if things are
inconsistent.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 The trivial solution to this is to WAL-log setting the visibility map 
 bit, like we WAL-log any other operation. Lock the heap page, lock the 
 visibility map page, write WAL-record, and release locks. That works, 
 but the problem is that it creates quite a lot of new WAL traffic.

How much is quite a lot?  Do we have any real reason to think that
this solution is unacceptable performance-wise?

I'd also suggest that if you want to prevent torn-page syndrome on VM
pages (and if you want to rely on their LSN values, you do) then you
probably don't have any choice anyway.  VM pages will have to adhere to
the same write-full-page-on-first-mod-after-checkpoint rule as any other
page.  I'd guess that this will swamp any savings from cutesy schemes
for reducing the number of WAL records.

 We could reduce the WAL traffic by simply updating multiple pages at a 
 time. Lock N pages, lock the visibility map page, write one WAL record, 
 and release locks.

I don't think that will work, because you have to hold the lock on a
page from the time you check that it's all-visible to the time you apply
the update.  The loss of concurrency against updates would be pretty
bad, and I think you'd be creating significant risk of deadlocks from
holding multiple buffer locks at once.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 How much is quite a lot?  Do we have any real reason to think that
 this solution is unacceptable performance-wise?

Well, let's imagine a 1GB insert-only table.  It has 128K pages.  If
you XLOG setting the bit on each page, you'll need to write 128K WAL
records, each containing a 12-byte relfilenode and a 4-byte block
offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.

But you did just dirty a gigabyte of data.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:22 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 How much is quite a lot?  Do we have any real reason to think that
 this solution is unacceptable performance-wise?

 Well, let's imagine a 1GB insert-only table.  It has 128K pages.  If
 you XLOG setting the bit on each page, you'll need to write 128K WAL
 records, each containing a 12-byte relfilenode and a 4-byte block
 offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.

 But you did just dirty a gigabyte of data.

Oh, but it's worse than that.  When you XLOG a WAL record for each of
those pages, you're going to trigger full-page writes for all of them.
 So now you've turned 1GB of data to write into 2+ GB of data to
write.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 18:22, Robert Haas wrote:

On Tue, Nov 30, 2010 at 11:16 AM, Tom Lanet...@sss.pgh.pa.us  wrote:

How much is quite a lot?  Do we have any real reason to think that
this solution is unacceptable performance-wise?


Well, let's imagine a 1GB insert-only table.  It has 128K pages.  If
you XLOG setting the bit on each page, you'll need to write 128K WAL
records, each containing a 12-byte relfilenode and a 4-byte block
offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.


Plus WAL headers, I think it's something like 32 or 40 bytes of WAL per 
page.



But you did just dirty a gigabyte of data.


Good point.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 18:10, Tom Lane wrote:

Heikki Linnakangasheikki.linnakan...@enterprisedb.com  writes:

Yeah, I'm not terribly excited about any of these schemes. The intent
record seems like the simplest one, but even that is quite different
from the traditional WAL-logging we do that it makes me slightly nervous.


I'm not convinced it works at all.  Consider write intent record,
checkpoint, set bit, crash before completing vacuum.  There will be
no second intent record at which you could clean up if things are
inconsistent.


That's why you need to check the RedoRecPtr when you set the bit. If it 
has changed, ie. a checkpoint has happened, the set bit step will write 
a new intent record.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Oh, but it's worse than that.  When you XLOG a WAL record for each of
 those pages, you're going to trigger full-page writes for all of them.
  So now you've turned 1GB of data to write into 2+ GB of data to
 write.

No, because only the first mod of each VM page would trigger a full page
write, at least assuming a reasonable ordering of the operations.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 30.11.2010 18:10, Tom Lane wrote:
 I'm not convinced it works at all.  Consider write intent record,
 checkpoint, set bit, crash before completing vacuum.  There will be
 no second intent record at which you could clean up if things are
 inconsistent.

 That's why you need to check the RedoRecPtr when you set the bit. If it 
 has changed, ie. a checkpoint has happened, the set bit step will write 
 a new intent record.

Oh, you explained the proposal poorly then.  I thought you meant recheck
and write another intent record just once, immediately before sending
the final xlog record.

It still seems rickety and not clearly correct, especially when you
start thinking about all the other constraints we have on xlog behavior
(eg, does this work while taking a base backup).

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 That's definitely sucky, but in some ways it would be more complicated
 if they did, because I don't think all-visible on the master implies
 all-visible on the standby.

Ouch.  That seems like it could shoot down all these proposals.  There
definitely isn't any way to make VM crash-safe if there is no WAL-driven
mechanism for setting the bits.

I guess what we need is a way to delay the application of such a WAL
record on the slave until it's safe, which means the record also has to
carry some indication of the youngest XMIN on the page.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:33 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Oh, but it's worse than that.  When you XLOG a WAL record for each of
 those pages, you're going to trigger full-page writes for all of them.
  So now you've turned 1GB of data to write into 2+ GB of data to
 write.

 No, because only the first mod of each VM page would trigger a full page
 write, at least assuming a reasonable ordering of the operations.

I'm not worried about the full-page writes from updating the
visibility map - I'm worried about the full-page writes from updating
the heap.  It doesn't matter a whit if we fail to set a bit in the
visibility map. What matters is if we DO set the bit in the visibility
map but FAIL TO set the bit in the heap, because then a subsequent
update to the heap page won't check the visibility map and clear the
bit.  The *heap* updates are the ones that have to be guaranteed to
make it to disk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 18:40, Tom Lane wrote:

Robert Haasrobertmh...@gmail.com  writes:

That's definitely sucky, but in some ways it would be more complicated
if they did, because I don't think all-visible on the master implies
all-visible on the standby.


Ouch.  That seems like it could shoot down all these proposals.  There
definitely isn't any way to make VM crash-safe if there is no WAL-driven
mechanism for setting the bits.


Note that this is only a problem for *hot* standby. After failover, all 
the tuples that were visible to everyone in the master are also visible 
to all new transactions in the standby.


We dealt with this in 9.0 already, with the killed flag in index 
tuples and the PD_ALL_VISIBLE flag in heap scans. We simply don't 
believe them in hot standby mode, and check visibility even if the flag 
is set.



I guess what we need is a way to delay the application of such a WAL
record on the slave until it's safe, which means the record also has to
carry some indication of the youngest XMIN on the page.


Something like that would certainly be nice. With index-only scans, it 
can be a big disappointment if you can't do an index-only scan in hot 
standby.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 18:33, Tom Lane wrote:

Robert Haasrobertmh...@gmail.com  writes:

Oh, but it's worse than that.  When you XLOG a WAL record for each of
those pages, you're going to trigger full-page writes for all of them.
  So now you've turned 1GB of data to write into 2+ GB of data to
write.


No, because only the first mod of each VM page would trigger a full page
write, at least assuming a reasonable ordering of the operations.


If you change the LSN on the heap pages, you have to write full page 
images of those as well.


Let's recap what happens when a VM bit is set: You set the 
PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it 
usually isn't), and then set the bit in the VM while keeping the heap 
page locked.


Can we get away with not setting the LSN on the heap page, even though 
we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page 
can be flushed to disk before the WAL record, but I think that's fine 
because it's OK to have the flag set in the heap page even if the VM bit 
is not set.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 That's definitely sucky, but in some ways it would be more complicated
 if they did, because I don't think all-visible on the master implies
 all-visible on the standby.

 Ouch.  That seems like it could shoot down all these proposals.  There
 definitely isn't any way to make VM crash-safe if there is no WAL-driven
 mechanism for setting the bits.

Heikki's intent method works fine, because the WAL record only clears
the visibility map bits on redo; it never sets them.

 I guess what we need is a way to delay the application of such a WAL
 record on the slave until it's safe, which means the record also has to
 carry some indication of the youngest XMIN on the page.

I'm unexcited about inventing more ways to delay XLOG application on
the standby.  We have enough of those already.

We could actually allow the slave to set the visibility map bits based
on its own xmin horizon.  The only problem is that you wouldn't be
able to write the intent XLOG records.  I suppose you could have a
separate file which is just used to store the intent records, and
designate a range of very-high numbered LSNs to mean blocks of the
intent file rather than a position in the regular WAL stream.  VACUUM
is so much fun on the master, let's have it on the standby too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Can we get away with not setting the LSN on the heap page, even though 
 we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page 
 can be flushed to disk before the WAL record, but I think that's fine 
 because it's OK to have the flag set in the heap page even if the VM bit 
 is not set.

Why is that fine?  It's certainly not fine from the standpoint of
someone wondering why his index-only scan performs so badly.

I think all this hair-splitting about cases where it's okay to have one
bit set and not the other is misguided.  To me, crash-safety of the VM
means that its copy of the page-header bit is right.  Period.  Yes, it
will cost something to ensure that; so what?  If we don't get more than
enough compensating performance gain from index-only scans, the whole
patch is going to end up reverted.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Ouch.  That seems like it could shoot down all these proposals.  There
 definitely isn't any way to make VM crash-safe if there is no WAL-driven
 mechanism for setting the bits.

 Heikki's intent method works fine, because the WAL record only clears
 the visibility map bits on redo; it never sets them.

Uh, no, because he also had that final WAL record that would set the
bits.

 We could actually allow the slave to set the visibility map bits based
 on its own xmin horizon.

Not in a crash-safe way, which is exactly the problem here.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:49 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 30.11.2010 18:33, Tom Lane wrote:

 Robert Haasrobertmh...@gmail.com  writes:

 Oh, but it's worse than that.  When you XLOG a WAL record for each of
 those pages, you're going to trigger full-page writes for all of them.
  So now you've turned 1GB of data to write into 2+ GB of data to
 write.

 No, because only the first mod of each VM page would trigger a full page
 write, at least assuming a reasonable ordering of the operations.

 If you change the LSN on the heap pages, you have to write full page images
 of those as well.

 Let's recap what happens when a VM bit is set: You set the PD_ALL_VISIBLE
 flag on the heap page (assuming it's not set already, it usually isn't), and
 then set the bit in the VM while keeping the heap page locked.

 Can we get away with not setting the LSN on the heap page, even though we
 set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page can be
 flushed to disk before the WAL record, but I think that's fine because it's
 OK to have the flag set in the heap page even if the VM bit is not set.

I don't immediately see why that wouldn't work.  As long as you bump
the LSN on the visibility map page, and also bump the LSN of the
visibility map page every time you clear a bit, I think you should be
OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:55 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Can we get away with not setting the LSN on the heap page, even though
 we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page
 can be flushed to disk before the WAL record, but I think that's fine
 because it's OK to have the flag set in the heap page even if the VM bit
 is not set.

 Why is that fine?  It's certainly not fine from the standpoint of
 someone wondering why his index-only scan performs so badly.

 I think all this hair-splitting about cases where it's okay to have one
 bit set and not the other is misguided.  To me, crash-safety of the VM
 means that its copy of the page-header bit is right.  Period.  Yes, it
 will cost something to ensure that; so what?  If we don't get more than
 enough compensating performance gain from index-only scans, the whole
 patch is going to end up reverted.

We're not going to double the cost of VACUUM to get index-only scans.
And that's exactly what will happen if you do full-page writes of
every heap page to set a single bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Ouch.  That seems like it could shoot down all these proposals.  There
 definitely isn't any way to make VM crash-safe if there is no WAL-driven
 mechanism for setting the bits.

 Heikki's intent method works fine, because the WAL record only clears
 the visibility map bits on redo; it never sets them.

 Uh, no, because he also had that final WAL record that would set the
 bits.

Well, as already discussed upthread, that WAL record causes some other
problems, so make it Heikki's intent method, without the final WAL
record that breaks things.

 We could actually allow the slave to set the visibility map bits based
 on its own xmin horizon.

 Not in a crash-safe way, which is exactly the problem here.

Brilliant selective quoting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 We're not going to double the cost of VACUUM to get index-only scans.
 And that's exactly what will happen if you do full-page writes of
 every heap page to set a single bit.

It's ridiculous to claim that that doubles the cost of VACUUM.  In the
worst case, it will add 25% to the cost of setting an all-visible bit on
a page where there is no other work to do.  (You already are writing out
the heap page and the VM page, plus a WAL image of the heap page, so a
WAL image of the VM page adds 25%.  But only if you did not set any
other bits on the same VM page, which is probably not a real common
case.)  Given that VACUUM has a lot of other cleanup besides visibility
bit setting, I'm not convinced that this would even be noticeable.

I think the burden is on people who are proposing complicated mechanisms
to show that there's actually a strong need for them.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 12:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 We're not going to double the cost of VACUUM to get index-only scans.
 And that's exactly what will happen if you do full-page writes of
 every heap page to set a single bit.

 It's ridiculous to claim that that doubles the cost of VACUUM.  In the
 worst case, it will add 25% to the cost of setting an all-visible bit on
 a page where there is no other work to do.  (You already are writing out
 the heap page and the VM page,

True.

 plus a WAL image of the heap page, so a

False.  That is exactly what we are NOT doing now and what we must
find a way to avoid doing.

 WAL image of the VM page adds 25%.  But only if you did not set any
 other bits on the same VM page, which is probably not a real common
 case.)  Given that VACUUM has a lot of other cleanup besides visibility
 bit setting, I'm not convinced that this would even be noticeable.

 I think the burden is on people who are proposing complicated mechanisms
 to show that there's actually a strong need for them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Nov 30, 2010 at 12:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 It's ridiculous to claim that that doubles the cost of VACUUM.  In the
 worst case, it will add 25% to the cost of setting an all-visible bit on
 a page where there is no other work to do.  (You already are writing out
 the heap page and the VM page,

 True.

 plus a WAL image of the heap page, so a

 False.  That is exactly what we are NOT doing now and what we must
 find a way to avoid doing.

I do not accept that argument.  You can't make an omelette without
breaking eggs, and the cost of index-only scans is going to be that
it costs more to get the visibility bits set in the first place.

But having said that, I wonder whether we need a full-page image for
a WAL-logged action that is known to involve only setting a single bit
and updating LSN.  Would omitting the FPI be any more risky than what
happens now (ie, the page does get written back to disk at some point,
without any image from which it can be rewritten if the write fails...)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 19:22, Tom Lane wrote:

But having said that, I wonder whether we need a full-page image for
a WAL-logged action that is known to involve only setting a single bit
and updating LSN.  Would omitting the FPI be any more risky than what
happens now (ie, the page does get written back to disk at some point,
without any image from which it can be rewritten if the write fails...)


You have to write a full-page image if you update the LSN, because 
otherwise the next update that comes along will not write a full page image.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 12:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 But having said that, I wonder whether we need a full-page image for
 a WAL-logged action that is known to involve only setting a single bit
 and updating LSN.  Would omitting the FPI be any more risky than what
 happens now (ie, the page does get written back to disk at some point,
 without any image from which it can be rewritten if the write fails...)

That's pretty much exactly what Heikki proposed 35 minutes ago, and
you objected 6 minutes later.  I still think it might work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 12:25 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 30, 2010 at 12:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 But having said that, I wonder whether we need a full-page image for
 a WAL-logged action that is known to involve only setting a single bit
 and updating LSN.  Would omitting the FPI be any more risky than what
 happens now (ie, the page does get written back to disk at some point,
 without any image from which it can be rewritten if the write fails...)

 That's pretty much exactly what Heikki proposed 35 minutes ago, and
 you objected 6 minutes later.  I still think it might work.

Oh, I see the difference now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 30.11.2010 19:22, Tom Lane wrote:
 But having said that, I wonder whether we need a full-page image for
 a WAL-logged action that is known to involve only setting a single bit
 and updating LSN.

 You have to write a full-page image if you update the LSN, because 
 otherwise the next update that comes along will not write a full page image.

Um.  Drat.  I was thinking about the replay side, where I think it would
actually work --- but you're right, it would break the logic on the
generation side.  Unless you want to put in some kind of flag saying
this was only a visibility bit update, any bigger update still needs
to write an FPI.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Bruce Momjian
Heikki Linnakangas wrote:
 On 30.11.2010 18:33, Tom Lane wrote:
  Robert Haasrobertmh...@gmail.com  writes:
  Oh, but it's worse than that.  When you XLOG a WAL record for each of
  those pages, you're going to trigger full-page writes for all of them.
So now you've turned 1GB of data to write into 2+ GB of data to
  write.
 
  No, because only the first mod of each VM page would trigger a full page
  write, at least assuming a reasonable ordering of the operations.
 
 If you change the LSN on the heap pages, you have to write full page 
 images of those as well.
 
 Let's recap what happens when a VM bit is set: You set the 
 PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it 
 usually isn't), and then set the bit in the VM while keeping the heap 
 page locked.

What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to
happen so the heap page is guaranteed to be on disk, then on next read,
if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the
VM bit.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 01.12.2010 03:35, Bruce Momjian wrote:

Heikki Linnakangas wrote:

Let's recap what happens when a VM bit is set: You set the
PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it
usually isn't), and then set the bit in the VM while keeping the heap
page locked.


What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to
happen so the heap page is guaranteed to be on disk, then on next read,
if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the
VM bit.


Hmm, you'd somehow have to know if a checkpoint has happened since the 
flag was set. And it might be a long wait, which makes it less 
attractive for index-only scans.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] crash-safe visibility map, take three

2010-11-29 Thread Robert Haas
Last week, I posted a couple of possible designs for making the
visibility map crash-safe, which did not elicit much comment.  Since
this is an important prerequisite to index-only scans, I'm trying
again.

http://archives.postgresql.org/pgsql-hackers/2010-11/msg01474.php
http://archives.postgresql.org/pgsql-hackers/2010-11/msg01529.php

Here's a third possible design.

Instead of representing each page with a single bit in the visibility
map, use two bits.  One bit indicates whether all tuples on the page
are all-visible (same as the current bit) - call this VM_ALL_VISIBLE.
The other bit is only used during VACUUM and indicates whether VACUUM
is trying to set the all-visible bit - call this
VM_BECOMING_ALL_VISIBLE.  We observe the rule that any operation that
clears PD_ALL_VISIBLE must clear both the VM_ALL_VISIBLE and
VM_BECOMING_ALL_VISIBLE bits for that page in the visibility map.
VACUUM precedes as follows:

1. Pin each visibility map page.  If any VM_BECOMING_ALL_VISIBLE bits
are set, take the exclusive content lock for long enough to clear
them.
2. Scan the heap.  When a page is observed to be all-visible, set
VM_BECOMING_ALL_VISIBLE and PD_ALL_VISIBLE.
3. Loop over shared buffers and write out every page to the OS which
belongs to the target relation, was marked all-visible in step 2, and
is still dirty.  Note that this may require a WAL flush.
4. fsync() the heap.
5. Pin each visibility map page.  If any VM_BECOMING_ALL_VISIBLE bits
are set, take the exclusive content lock, clear each such bit, set the
corresponding VM_ALL_VISIBLE bits and XLOG the page.

One might actually want to do steps 2-5 incrementally, in 1GB chunks,
so that you don't fsync() too much of the relation all at once.

If you tilt your head just right, the recurring problem in all of this
is that the heap page and the visibility map page can go to disk in
either order, and we have no way of knowing which one.  A more radical
solution to this problem (and, thus, a fourth possible design) would
be to add a field to the buffer descriptor allowing one page to wire
another page into shared buffers.  If the value is 0, it's the number
of a buffer it's currently wiring.  If the value is 0, it's the
number of other buffers that have wired this buffer.  A buffer both
wire another buffer and itself be wired at the same time.  If the
value is =0, everything's normal.  To make this work, though, you'd
have to complicate the checkpoint logic pretty considerably - make
sure all the unwired buffers are written and fsync'd first, thus
unwiring the remaining ones to be written and fsync'd in a second
pass; and there are also possible problems with very small relations,
where the number of wired buffers might grow to an uncomfortably high
percentage of the total.  Waving my hands at all this complexity, you
could then make the heap pages wire the visibility map pages.

I can't say I'm totally in love with any of these designs.  Anyone
else have any ideas, or any opinions about which one is best?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-29 Thread Heikki Linnakangas

On 30.11.2010 06:57, Robert Haas wrote:

I can't say I'm totally in love with any of these designs.  Anyone
else have any ideas, or any opinions about which one is best?


Well, the design I've been pondering goes like this:

At vacuum:

1. Write an intent XLOG record listing a chunk of visibility map bits 
that are not currently set, that we are going to try to set. A chunk of 
say 100 bits would be about right.


2. Scan the 100 heap pages as we currently do, setting the visibility 
map bits as we go.


3. After the scan, lock the visibility map page, check which of the bits 
that we set in step 2 are still set (concurrent updates might've cleared 
some), and write a final XLOG record listing the set bits. This step 
isn't necessary for correctness, BTW, but without it you lose all the 
set bits if you crash before next checkpoint.


At replay, when we see the intent XLOG record, clear all the bits listed 
in it. This ensures that if we crashed and some of the visibility map 
bits were flushed to disk but the corresponding changes to the heap 
pages were not, the bits are cleared. When we see the final XLOG record, 
we set the bits.


Some care is needed with checkpoints. Setting visibility map bits in 
step 2 is safe because crash recovery will replay the intent XLOG record 
and clear any incorrectly set bits. But if a checkpoint has happened 
after the intent XLOG record was written, that's not true. This can be 
avoided by checking RedoRecPtr in step 2, and writing a new intent XLOG 
record if it has changed since the last intent XLOG record was written.


There's a small race condition in the way a visibility map bit is 
currently cleared. When a heap page is updated, it is locked, the update 
is WAL-logged, and the lock is released. The visibility map page is 
updated only after that. If the final vacuum XLOG record is written just 
after updating the heap page, but before the visibility map bit is 
cleared, replaying the final XLOG record will set a bit that should not 
have been set.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers