Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-31 Thread Bruce Momjian
On Thu, May 30, 2013 at 09:47:22AM -0400, Robert Haas wrote:
 Well, as Heikki points out, I think that's unacceptably dangerous.
 Loss or corruption of a single visibility map page means possible loss
 of half a gigabyte of data.
 
 Also, if we go that route, looking at the visibility map is no longer
 an optimization; it's essential for correctness.  We can't decide to
 skip it when it seems expensive, for example, as Jeff was proposing.

Isn't the visibility map already required for proper results, as we
use it for index-only scans?  I think the optimization-only ship has
sailed.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-31 Thread Andres Freund
On 2013-05-31 13:14:13 -0400, Bruce Momjian wrote:
 On Thu, May 30, 2013 at 09:47:22AM -0400, Robert Haas wrote:
  Well, as Heikki points out, I think that's unacceptably dangerous.
  Loss or corruption of a single visibility map page means possible loss
  of half a gigabyte of data.
  
  Also, if we go that route, looking at the visibility map is no longer
  an optimization; it's essential for correctness.  We can't decide to
  skip it when it seems expensive, for example, as Jeff was proposing.
 
 Isn't the visibility map already required for proper results, as we
 use it for index-only scans?  I think the optimization-only ship has
 sailed.

At the moment we can remove it without causing corruption. If we were to
use it for freezing we couldn't anymore. So there's a difference - how
big it is I am not sure.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-31 Thread Josh Berkus

 Isn't the visibility map already required for proper results, as we
 use it for index-only scans?  I think the optimization-only ship has
 sailed.
 
 At the moment we can remove it without causing corruption. If we were to
 use it for freezing we couldn't anymore. So there's a difference - how
 big it is I am not sure.

Depends on your definition of corruption, really.

But yes, right now, the vismap can lose bits without causing any
corruption, and making all-frozen depend on it would eliminate that.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-31 Thread Bruce Momjian
On Fri, May 31, 2013 at 10:28:12AM -0700, Josh Berkus wrote:
 
  Isn't the visibility map already required for proper results, as we
  use it for index-only scans?  I think the optimization-only ship has
  sailed.
  
  At the moment we can remove it without causing corruption. If we were to
  use it for freezing we couldn't anymore. So there's a difference - how
  big it is I am not sure.
 
 Depends on your definition of corruption, really.
 
 But yes, right now, the vismap can lose bits without causing any
 corruption, and making all-frozen depend on it would eliminate that.

Robert's statement was:

 Loss or corruption of a single visibility map page means possible loss
 of half a gigabyte of data.

Certainly unidentified corruption of a visibility map page could easily
cause incorrect results.  So, technically, _adding_ bits would cause
corruption.


-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-31 Thread Josh Berkus
Bruce,

 Robert's statement was:
 
 Loss or corruption of a single visibility map page means possible loss
 of half a gigabyte of data.

I fail to be alarmed at this; currently losing a single page of the clog
causes just as widespread corruption (worse, actually, since it's not
confined to one table).  It does point to the eventual need to checksum
these things, though.

 Certainly unidentified corruption of a visibility map page could easily
 cause incorrect results.  So, technically, _adding_ bits would cause
 corruption.

Yes, that's already true.  I'm pointing out that if we depend on the
vismap for all-frozen, then losing bits *also* causes corruption, so
that's something we need to test for.  Right now, there is no possible
corruption from losing bits; we simply end up scanning more pages than
we have to.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-31 Thread Bruce Momjian
On Fri, May 31, 2013 at 11:00:19AM -0700, Josh Berkus wrote:
 Bruce,
 
  Robert's statement was:
  
  Loss or corruption of a single visibility map page means possible loss
  of half a gigabyte of data.
 
 I fail to be alarmed at this; currently losing a single page of the clog
 causes just as widespread corruption (worse, actually, since it's not
 confined to one table).  It does point to the eventual need to checksum
 these things, though.
 
  Certainly unidentified corruption of a visibility map page could easily
  cause incorrect results.  So, technically, _adding_ bits would cause
  corruption.
 
 Yes, that's already true.  I'm pointing out that if we depend on the
 vismap for all-frozen, then losing bits *also* causes corruption, so
 that's something we need to test for.  Right now, there is no possible
 corruption from losing bits; we simply end up scanning more pages than
 we have to.

Right, and it is hard to see why losing bits would be any more or less
likely than adding them, so it seems we already rely on the visibility
map being correct.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-31 Thread Stephen Frost
* Bruce Momjian (br...@momjian.us) wrote:
 Right, and it is hard to see why losing bits would be any more or less
 likely than adding them, so it seems we already rely on the visibility
 map being correct.

Yes, but there's also a way to get *back* to a valid state if things go
south with the visibility map.  It's similar to an index today: sure,
you might get corruption, and that might give you wrong results, but if
you detect it, it's easy to correct: rebuild the index, drop the
visibility map, etc.  With CRCs, it'll be even easier to detect and
discover when you need to do such a thing.

Corruption in the heap is certainly bad too: you may lose some records,
but you could zero those pages out and move on with only losing 8k or
what-have-you.  Clog corruption is certainly a bad thing, but if nearly
everything in your DB is already frozen, it's not as bad as it *could*
be.  I wonder if you could rebuild a portion of clog from the WAL...

There are a lot of different tradeoffs here, certainly.  It's certainly
nice that single-bit or single-page corruption can often be recovered
from by rebuilding an index or rebuilding the visibility map or just
nuking a few records from a table.  CRCs help with identifying when
corruption has happened, but they do nothing for what to *do* when it
happens, and restoring completely from backup isn't a great answer when
you've got terabytes of data.

Where I'm going with this whole thing is simply that I do worry a bit
about using a bitmap for freeze, or similar, information and not being
able to reconstruct that bitmap from the heap.  Perhaps that's overly
paranoid, but, well, we also write the same data out to disk in multiple
places multiple times- some might call that paranoid too. ;)

Thanks,

Stephen




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-31 Thread Josh Berkus
On 05/31/2013 12:22 PM, Stephen Frost wrote:
 Where I'm going with this whole thing is simply that I do worry a bit
 about using a bitmap for freeze, or similar, information and not being
 able to reconstruct that bitmap from the heap.  Perhaps that's overly
 paranoid, but, well, we also write the same data out to disk in multiple
 places multiple times- some might call that paranoid too. ;)

On the other hand, we could combine Heikki's proposal (epoch numbers in
the page header) together with using the visibility map for pages we
don't need to vacuum freeze, and get vastly better behavior without
increasing corruption risk that I can see.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-31 Thread Stephen Frost
* Josh Berkus (j...@agliodbs.com) wrote:
 On 05/31/2013 12:22 PM, Stephen Frost wrote:
  Where I'm going with this whole thing is simply that I do worry a bit
  about using a bitmap for freeze, or similar, information and not being
  able to reconstruct that bitmap from the heap.  Perhaps that's overly
  paranoid, but, well, we also write the same data out to disk in multiple
  places multiple times- some might call that paranoid too. ;)
 
 On the other hand, we could combine Heikki's proposal (epoch numbers in
 the page header) together with using the visibility map for pages we
 don't need to vacuum freeze, and get vastly better behavior without
 increasing corruption risk that I can see.

That was actually my thinking as well..

Thanks,

Stephen




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-31 Thread Robert Haas
On Fri, May 31, 2013 at 1:44 PM, Bruce Momjian br...@momjian.us wrote:
 On Fri, May 31, 2013 at 10:28:12AM -0700, Josh Berkus wrote:
  Isn't the visibility map already required for proper results, as we
  use it for index-only scans?  I think the optimization-only ship has
  sailed.
 
  At the moment we can remove it without causing corruption. If we were to
  use it for freezing we couldn't anymore. So there's a difference - how
  big it is I am not sure.

 Depends on your definition of corruption, really.

 But yes, right now, the vismap can lose bits without causing any
 corruption, and making all-frozen depend on it would eliminate that.

 Robert's statement was:

 Loss or corruption of a single visibility map page means possible loss
 of half a gigabyte of data.

 Certainly unidentified corruption of a visibility map page could easily
 cause incorrect results.  So, technically, _adding_ bits would cause
 corruption.

Adding bits could cause tuples that ought to be invisible to be
treated as visible.  Currently, removing bits is harmless (except to
performance), but if we used the VM bit to indicate whether the page
was frozen in lieu of actually freezing it, a cleared bit would
potentially cause vacuum to nuke everything on that page.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-31 Thread Robert Haas
On Fri, May 31, 2013 at 3:45 PM, Josh Berkus j...@agliodbs.com wrote:
 On 05/31/2013 12:22 PM, Stephen Frost wrote:
 Where I'm going with this whole thing is simply that I do worry a bit
 about using a bitmap for freeze, or similar, information and not being
 able to reconstruct that bitmap from the heap.  Perhaps that's overly
 paranoid, but, well, we also write the same data out to disk in multiple
 places multiple times- some might call that paranoid too. ;)

 On the other hand, we could combine Heikki's proposal (epoch numbers in
 the page header) together with using the visibility map for pages we
 don't need to vacuum freeze, and get vastly better behavior without
 increasing corruption risk that I can see.

Yeah, I was thinking about that as well.  In fact, under either
Heikki's proposal or Andres's proposal or my variants of either one of
them, anti-wraparound vacuums no longer need to scan all-visible
pages.  Under Andres's proposal (and variants), all-visible pages are
ipso facto frozen and therefore need not be scanned for freezing.  But
under Heikki's proposal (and variants), anti-wraparound vacuums only
need to remove dead tuples; freezing live ones is a no-op.  And
all-visible pages don't contain any dead tuples, so we're right back
in the same place.[1]

Where things diverge a little is what you do when an anti-wraparound scan
encounters a page that isn't all-visible and can't be marked
all-visible.  Under the XID epoch family of proposals, we need to
truncate any dead tuples on the page to line pointers, and that's it.
Under the "treat all-visible as frozen" family of proposals, we *also*
need to do old-style freezing on any aged but live tuples on the page.
 So the XID epoch saves write I/O in this case, because we don't
dirty pages or write WAL just because we see a tuple with a high XID
age.  Only pages that contain actual dead tuples get dirtied.

So on further reflection, I'm not seeing any advantage to combining
the two proposals.  The XID epoch approach has the complications of
requiring page format changes and consuming space on every page,
although I think I may have worked out a variant approach that will
avoid that (which no one has commented on, so maybe I'm
hallucinating).  But in other respects it seems better all around.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] I said upthread that Heikki's idea would require a separate freeze
map, but now I think I was wrong.




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 12:06 AM, Jeff Davis pg...@j-davis.com wrote:
 AFAICS, the main benefit of eliminating PD_ALL_VISIBLE is that we
 eliminate one write cycle; that is, we won't dirty the page once to
 hint it and then again to mark it all-visible.  But as of 9.3, that
 should really only be a problem in the insert-only case.  And in that
 case, my proposal to consider all-visible pages as frozen would be a
 huge win, because you'd only need to emit XLOG_HEAP_VISIBLE for every
 page in the heap, rather than XLOG_HEAP_FREEZE.

 Agreed.

Just to quantify that a bit more, I ran this command a couple of times:

dropdb rhaas ; sleep 5 ; createdb ; sleep 5 ; pgbench -i -s 1000 -n;
sleep 5 ; time psql -c checkpoint ; time psql -c 'vacuum'

And also this one:

dropdb rhaas ; sleep 5 ; createdb ; sleep 5 ; pgbench -i -s 1000 -n;
sleep 5 ; time psql -c checkpoint ; time psql -c 'vacuum freeze'

In the first one, the vacuum at the end takes about 25 seconds.  In
the second one, it takes about 15 minutes, during which time there's
one CPU core running at about 10%; the remainder of the time is spent
waiting for disk I/O.  A little follow-up testing shows that the
vacuum emits 88MB of WAL, while the vacuum freeze emits 13GB of WAL.
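
Those numbers are roughly what you would predict from first principles.
A back-of-the-envelope check (a sketch only; the ~13GB heap size for
pgbench scale 1000 and the ~50-byte size of an all-visible WAL record
are approximations, not measured values):

#include <stdio.h>

int
main(void)
{
	double	heap_bytes = 13e9;	/* assumed pgbench_accounts heap, scale 1000 */
	double	blcksz = 8192;		/* default BLCKSZ */
	double	pages = heap_bytes / blcksz;	/* ~1.6M heap pages */

	/* VACUUM FREEZE dirties every page, so with full_page_writes on it
	 * pays roughly one full-page image (~BLCKSZ bytes) per page. */
	printf("freeze WAL ~ %.0f GB\n", pages * blcksz / 1e9);

	/* Plain VACUUM mostly emits one small all-visible record per page. */
	printf("vacuum WAL ~ %.0f MB\n", pages * 50 / 1e6);

	return 0;
}

That prints about 13 GB and about 79 MB, in line with the 13GB and 88MB
observed above.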

This is on the 16-core, 64-thread IBM POWER box with the following
non-default configuration settings:

shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p] '

Andres' proposal for freezing at the same time we mark pages
all-visible relies on emitting FPIs when we mark pages all-visible,
but I hope that the test above is convincing evidence that it would be
*really* expensive for some users.  My proposal to consider
all-visible pages as frozen avoids that cost, but as far as I can see,
it also requires PD_ALL_VISIBLE to stick around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-30 Thread Andres Freund
On 2013-05-30 07:54:38 -0400, Robert Haas wrote:
 On Thu, May 30, 2013 at 12:06 AM, Jeff Davis pg...@j-davis.com wrote:
  AFAICS, the main benefit of eliminating PD_ALL_VISIBLE is that we
  eliminate one write cycle; that is, we won't dirty the page once to
  hint it and then again to mark it all-visible.  But as of 9.3, that
  should really only be a problem in the insert-only case.  And in that
  case, my proposal to consider all-visible pages as frozen would be a
  huge win, because you'd only need to emit XLOG_HEAP_VISIBLE for every
  page in the heap, rather than XLOG_HEAP_FREEZE.
 
  Agreed.
 
 Just to quantify that a bit more, I ran this command a couple of times:
 
 dropdb rhaas ; sleep 5 ; createdb ; sleep 5 ; pgbench -i -s 1000 -n;
 sleep 5 ; time psql -c checkpoint ; time psql -c 'vacuum'
 
 And also this one:
 
 dropdb rhaas ; sleep 5 ; createdb ; sleep 5 ; pgbench -i -s 1000 -n;
 sleep 5 ; time psql -c checkpoint ; time psql -c 'vacuum freeze'
 
 In the first one, the vacuum at the end takes about 25 seconds.  In
 the second one, it takes about 15 minutes, during which time there's
 one CPU core running at about 10%; the remainder of the time is spent
 waiting for disk I/O.  A little follow-up testing shows that the
 vacuum emits 88MB of WAL, while the vacuum freeze emits 13GB of WAL.
 
 This is on the 16-core, 64-thread IBM POWER box with the following
 non-default configuration settings:
 
 shared_buffers = 8GB
 maintenance_work_mem = 1GB
 synchronous_commit = off
 checkpoint_segments = 300
 checkpoint_timeout = 15min
 checkpoint_completion_target = 0.9
 log_line_prefix = '%t [%p] '
 
 Andres' proposal for freezing at the same time we mark pages
 all-visible relies on emitting FPIs when we mark pages all-visible,
 but I hope that the test above is convincing evidence that it would be
 *really* expensive for some users.  My proposal to consider
 all-visible pages as frozen avoids that cost

I think I basically suggested treating all-visible as frozen, didn't I?
If not, I had lost sync between my fingers and my thoughts, which happens
too often ;).
You had noticed that my proposal was lacking a bit around when we omit
FPIs for the page while setting all-visible, but we both thought that we
might find a workaround for that - which looking at the page-level flag
first basically is.

As far as I understand, the trick basically is that we can rely on an FPI
being logged when an action unsetting ALL_VISIBLE is performed. That
FPI would then make sure the hint bits marking individual tuples
as frozen would hit disk. For that we need to add some more work though;
consider:

1) write tuples on a page
2) freeze the page by setting ALL_VISIBLE and setting hint
bits. Setting ALL_VISIBLE is WAL-logged
3) crash
4) replay ALL_VISIBLE, set it on the page level. The individual tuples
   are *not* guaranteed to be marked frozen.
5) update a tuple on the page, unsetting all-visible. Emits an FPI which
   does *not* have the tuples marked as frozen.

Easy enough and fairly cheap to fix by having a function that
updates the hint bits on a page when unsetting all-visible, since we
can just set them for all pre-existing tuples.

 but as far as I can see, it also requires PD_ALL_VISIBLE to stick
 around.

Now, I am far from being convinced it's a good idea to get rid of
PD_ALL_VISIBLE, but I don't think it does. Except that it is currently
legal for the page-level ALL_VISIBLE to be set while the corresponding
visibility map bit isn't, there's not much prohibiting us fundamentally
from looking in the vm when we need to know whether the page is all
visible, is there?
To the contrary, this actually seems to be a pretty good case for Jeff's
proposed behaviour, since it would allow freezing while only writing the
vm?
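
For concreteness, a sketch of what that lookup could look like in a scan
loop (using the 9.3-era visibilitymap_test() API; rel and nblocks stand
for the surrounding scan state, and error handling is omitted):

Buffer		vmbuffer = InvalidBuffer;
BlockNumber	blkno;

for (blkno = 0; blkno < nblocks; blkno++)
{
	/* Pins the VM page covering blkno into vmbuffer (reusing the pin
	 * across the ~64K consecutive heap pages it covers) and reads the
	 * bit without taking the VM page's content lock. */
	bool		all_visible = visibilitymap_test(rel, blkno, &vmbuffer);

	/* ... if all_visible, skip the per-tuple visibility checks on the
	 * heap page, exactly as PD_ALL_VISIBLE is used today ... */
}

if (BufferIsValid(vmbuffer))
	ReleaseBuffer(vmbuffer);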

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-30 Thread Heikki Linnakangas

On 30.05.2013 15:12, Andres Freund wrote:

Now, I am far from being convinced its a good idea to get rid of
PD_ALL_VISIBLE, but I don't think it does. Except that it currently is
legal for the page level ALL_VISIBLE being set while the corresponding
visibilitymap one isn't there's not much prohibiting us fundamentally
from looking in the vm when we need to know whether the page is all
visible, is there?


Hmm, so you're suggesting that the visibility map would be *required* to 
interpret the pages correctly. I.e. if you just zapped the visibility 
map, you'd lose critical information and the heap would appear to be 
corrupt. I guess that's possible, but it makes me quite uneasy. At the 
moment, it's relieving to know that it's always safe to just truncate 
the visibility map in case of emergency.


- Heikki




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-30 Thread Andres Freund
On 2013-05-30 15:34:04 +0300, Heikki Linnakangas wrote:
 On 30.05.2013 15:12, Andres Freund wrote:
 Now, I am far from being convinced it's a good idea to get rid of
 PD_ALL_VISIBLE, but I don't think it does. Except that it is currently
 legal for the page-level ALL_VISIBLE to be set while the corresponding
 visibility map bit isn't, there's not much prohibiting us fundamentally
 from looking in the vm when we need to know whether the page is all
 visible, is there?
 
 Hmm, so you're suggesting that the visibility map would be *required* to
 interpret the pages correctly. I.e. if you just zapped the visibility map,
 you'd lose critical information and the heap would appear to be corrupt. I
 guess that's possible, but it makes me quite uneasy. At the moment, it's
 relieving to know that it's always safe to just truncate the visibility map
 in case of emergency.

I didn't say it's a good idea, just that I don't think Robert's
conclusion is necessarily valid. But requiring only the few kbytes of
the vm to be written instead of all of the heap during freezing (or
whatever we would call it) has quite some allure, I admit that.

But I think that should be a separate project from re-engineering how
freezing works.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 8:12 AM, Andres Freund and...@2ndquadrant.com wrote:
 As far as I understand, the trick basically is that we can rely on an FPI
 being logged when an action unsetting ALL_VISIBLE is performed. That
 FPI would then make sure the hint bits marking individual tuples
 as frozen would hit disk. For that we need to add some more work though;
 consider:

 1) write tuples on a page
 2) freeze the page by setting ALL_VISIBLE and setting hint
 bits. Setting ALL_VISIBLE is WAL-logged
 3) crash
 4) replay ALL_VISIBLE, set it on the page level. The individual tuples
    are *not* guaranteed to be marked frozen.
 5) update a tuple on the page, unsetting all-visible. Emits an FPI which
    does *not* have the tuples marked as frozen.

 Easy enough and fairly cheap to fix by having a function that
 updates the hint bits on a page when unsetting all-visible, since we
 can just set them for all pre-existing tuples.

Basically, yes, though I would say infomask bits rather than hint
bits, since not all of them are only hints, and this case would not
be merely a hint.
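
A sketch of what such a function could look like (hypothetical name, and
using classic freezing, i.e. overwriting xmin, for simplicity; it assumes
the caller holds an exclusive lock on the buffer and clears all-visible
immediately afterwards, so the resulting FPI captures these bits):

/*
 * Hypothetical helper, per the discussion above: called just before
 * all-visible is cleared, so that the caller's FPI durably captures the
 * frozen state of every existing tuple on the page.
 */
static void
heap_page_freeze_existing_tuples(Page page)
{
	OffsetNumber offnum,
				maxoff = PageGetMaxOffsetNumber(page);

	for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
	{
		ItemId		itemid = PageGetItemId(page, offnum);
		HeapTupleHeader htup;

		if (!ItemIdIsNormal(itemid))
			continue;
		htup = (HeapTupleHeader) PageGetItem(page, itemid);

		/* Every tuple on an all-visible page is committed and visible
		 * to everyone, so stamping it frozen is safe. */
		HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
		htup->t_infomask |= HEAP_XMIN_COMMITTED;
	}
}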

 but as far as I can see, it also requires PD_ALL_VISIBLE to stick
 around.

 Now, I am far from being convinced it's a good idea to get rid of
 PD_ALL_VISIBLE, but I don't think it does. Except that it is currently
 legal for the page-level ALL_VISIBLE to be set while the corresponding
 visibility map bit isn't, there's not much prohibiting us fundamentally
 from looking in the vm when we need to know whether the page is all
 visible, is there?
 To the contrary, this actually seems to be a pretty good case for Jeff's
 proposed behaviour, since it would allow freezing while only writing the
 vm?

Well, as Heikki points out, I think that's unacceptably dangerous.
Loss or corruption of a single visibility map page means possible loss
of half a gigabyte of data.

Also, if we go that route, looking at the visibility map is no longer
an optimization; it's essential for correctness.  We can't decide to
skip it when it seems expensive, for example, as Jeff was proposing.

There's another thing that's bothering me about this whole discussion,
too.  If looking at another page for the information we need to make
visibility decisions is so cheap that we need not be concerned with
it, then why do we need hint bits?  I realize that it's not quite the
same thing, because CLOG doesn't have as much locality of access as
the visibility map; you're guaranteed to find all the information you
need for a single heap page on a single VM page.  Also, CLOG is
per-tuple, not per-page, and we get a decent speed-up from checking
once for the whole page rather than for each tuple.  Nonetheless, I
think the contrast between Jeff's tests,  which aren't showing much
impact from the increased visibility map traffic, and previous
hint-bit removal tests, which have crashed and burned, may be caused
in part by the fact that our algorithms and locking regimen for
shared_buffers are much more sophisticated than for SLRU.  I'm not
eager to have our design decisions driven by that gap.
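
(For reference, the half-a-gigabyte figure is just the VM's fan-out with
the default 8KB block size:)

#include <stdio.h>

int
main(void)
{
	long	blcksz = 8192;			/* default BLCKSZ */
	long	heap_pages = blcksz * 8;	/* one VM bit per heap page */

	/* 65536 heap pages x 8KB each = 512MB of heap per VM page */
	printf("%ld heap pages, %ld MB of heap per VM page\n",
		   heap_pages, heap_pages * blcksz / (1024L * 1024L));
	return 0;
}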

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-29 Thread Robert Haas
On Wed, May 29, 2013 at 1:11 PM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2013-05-28 at 19:51 -0400, Robert Haas wrote:
  If we just wanted to reduce read cost, why not just take a simpler
  approach and give the visibility map an isfrozen bit?  Then we'd know
  which pages didn't need rescanning without nearly as much complexity.

 That would break pg_upgrade, which would have to remove visibility map
 forks when upgrading.  More importantly, it would require another
 round of complex changes to the write-ahead logging in this area.
 It's not obvious to me that we'd end up ahead of where we are today,
 although perhaps I am a pessimist.

 If we removed PD_ALL_VISIBLE, then this would be very simple, right? We
 would just follow normal logging rules for setting the visible or frozen
 bit.

I don't see how that makes Josh's proposal any simpler.  His proposal
was to change, in a backward-incompatible fashion, the contents of the
visibility map.  Getting rid of PD_ALL_VISIBLE will not eliminate that
backward-incompatibility.  Neither will it eliminate the need to keep
the visibility/freeze map in sync with the heap itself.  Whether we
get rid of PD_ALL_VISIBLE or not, we'll still have to go look at every
type of WAL record that clears the visibility map bit and make it
clear both the visibility and freeze bits.  We'll still need a WAL
record to set the visibility map bit, just as we do today, and we'll
also need a new WAL record type (or a change to the existing WAL
record type) to set the all-frozen bit, when applicable.

Now, independently of Josh's proposal, we could change PD_ALL_VISIBLE
to emit FPIs for every heap page it touches.  For pages that have been
hit by updates or deletes, this would be pretty much free, in 9.3,
since the PD_ALL_VISIBLE bit will probably be set at the same time
we're setting dead line pointers to unused, which is a WAL-logged
operation anyway.  However, for pages that have been hit only by
inserts, this would emit many extra FPIs.

Again independently of Josh's proposal, we could eliminate
PD_ALL_VISIBLE.  This would require either surrendering the
optimization whereby sequential scans can skip visibility checks on
individual tuples within the page, or referring to the visibility map
to get the bit that way.  I know you tested this and couldn't measure
an impact, but with all respect I find that result hard to accept.
Contention around buffer locks and pins is very real; why should it
matter on other workloads and not matter on this one?  It would also
require page modifications prior to consistency to clear the
visibility map bit unconditionally, instead of only when
PD_ALL_VISIBLE is set on the page (though I think it'd be OK to pay
that price if it ended there).

AFAICS, the main benefit of eliminating PD_ALL_VISIBLE is that we
eliminate one write cycle; that is, we won't dirty the page once to
hint it and then again to mark it all-visible.  But as of 9.3, that
should really only be a problem in the insert-only case.  And in that
case, my proposal to consider all-visible pages as frozen would be a
huge win, because you'd only need to emit XLOG_HEAP_VISIBLE for every
page in the heap, rather than XLOG_HEAP_FREEZE.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-29 Thread Jeff Davis
On Wed, 2013-05-29 at 22:46 -0400, Robert Haas wrote:
 Again independently of Josh's proposal, we could eliminate
 PD_ALL_VISIBLE.  This would require either surrendering the
 optimization whereby sequential scans can skip visibility checks on
 individual tuples within the page, or referring to the visibility map
 to get the bit that way.  I know you tested this and couldn't measure
 an impact, but with all respect I find that result hard to accept.
 Contention around buffer locks and pins is very real; why should it
 matter on other workloads and not matter on this one?

The number of pins required during a sequential scan without my patch
is:

   PAGES_IN_HEAP

The number of pins required during a sequential scan with my patch is:

   PAGES_IN_HEAP + ceil(PAGES_IN_HEAP/HEAP_PAGES_PER_VM_PAGE)

Because HEAP_PAGES_PER_VM_PAGE is huge, the second term only matters
when PAGES_IN_HEAP is very small and the ceil is significant. So, I simply elected
not to use the VM if the table is less than 32 pages in size. For such
small tables, the benefit of using a page-at-a-time visibility check was
not apparent in my tests anyway.
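
To put numbers on that formula (a sketch; HEAP_PAGES_PER_VM_PAGE is
8192 * 8 = 65536 with the default block size, and 32 is the threshold
used by the patch):

#include <math.h>
#include <stdio.h>

#define HEAP_PAGES_PER_VM_PAGE	(8192 * 8)	/* one VM bit per heap page */
#define VM_THRESHOLD			32			/* patch skips the VM below this */

static long
pins_for_scan(long heap_pages)
{
	if (heap_pages < VM_THRESHOLD)
		return heap_pages;		/* VM not consulted at all */
	return heap_pages +
		(long) ceil((double) heap_pages / HEAP_PAGES_PER_VM_PAGE);
}

int
main(void)
{
	long	sizes[] = {1, 31, 33, 65536, 131072};
	int		i;

	for (i = 0; i < 5; i++)
		printf("%6ld heap pages -> %6ld pins\n",
			   sizes[i], pins_for_scan(sizes[i]));
	return 0;
}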

 AFAICS, the main benefit of eliminating PD_ALL_VISIBLE is that we
 eliminate one write cycle; that is, we won't dirty the page once to
 hint it and then again to mark it all-visible.  But as of 9.3, that
 should really only be a problem in the insert-only case.  And in that
 case, my proposal to consider all-visible pages as frozen would be a
 huge win, because you'd only need to emit XLOG_HEAP_VISIBLE for every
 page in the heap, rather than XLOG_HEAP_FREEZE.

Agreed.

Regards,
Jeff Davis






Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-22 Thread Jeff Davis
On Mon, 2013-01-21 at 12:00 +0200, Heikki Linnakangas wrote:
 On 21.01.2013 11:10, Jeff Davis wrote:
  That confuses me. The testing was to show it didn't hurt other workloads
  (like scans or inserts/updates/deletes); so the best possible result is
  that they don't show signs either way.
 
 I went back to look at the initial test results that demonstrated that 
 keeping the pin on the VM buffer mitigated the overhead of pinning the 
 vm page. The obvious next question is, what is the impact when that's 
 inefficient, ie. when you update pages across different 512 MB sections, 
 so that the vm pin has to be dropped and reacquired repeatedly.
 
 I tried to construct a worst case scenario for that:

I confirmed this result in a single connection (no concurrency). I used
a shared_buffers of 2GB so that the whole table would fit.

Also, I fixed a bug that I noticed along the way, which was an
uninitialized variable. New version attached.

FWIW, I'm considering this patch to be rejected; I just didn't want to
leave a patch with a bug in it.

Regards,
Jeff Davis


rm-pd-all-visible-20130122.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-21 Thread Jeff Davis
On Mon, 2013-01-21 at 12:49 +0530, Pavan Deolasee wrote:

 At the minimum your patch will need to have one additional buffer
 pinned for every K < 8192 * 8 heap pages.

I assume it's the same K I referred to when responding to Robert: the
max number of heap buffers we read before we unpin and repin the VM
buffer. Right now it's unlimited, because there is currently no need to
have it at all -- I was just offering the solution in case we did want
to do VM page cleanup in the future or something.

  For most cases, the value of K will be large enough to ignore the
 overhead, but for small values of K, I'm not sure if we can say that
 with confidence.

It's a constant, and we can easily set it high enough that it wouldn't
matter. There's no reason at all to choose a small value for K. Consider
that an index root page pin is held for the entire scan, and we don't
have a problem with that.
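
As a sketch, capping K in the scan loop could look like this
(VM_REPIN_INTERVAL is a hypothetical constant; rel and nblocks stand for
the surrounding scan state):

#define VM_REPIN_INTERVAL	4096	/* hypothetical cap: heap pages per VM pin */

Buffer		vmbuffer = InvalidBuffer;
BlockNumber	blkno;

for (blkno = 0; blkno < nblocks; blkno++)
{
	/* Periodically release the VM pin so the page stays evictable. */
	if (blkno % VM_REPIN_INTERVAL == 0 && BufferIsValid(vmbuffer))
	{
		ReleaseBuffer(vmbuffer);
		vmbuffer = InvalidBuffer;
	}

	/* visibilitymap_test() re-pins the needed VM page on demand. */
	(void) visibilitymap_test(rel, blkno, &vmbuffer);

	/* ... process heap page blkno ... */
}

if (BufferIsValid(vmbuffer))
	ReleaseBuffer(vmbuffer);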

 Of course, for very small tables the real contention might be
 somewhere else and so this change may not show anything on the
 benchmarks. Don't your own tests for 33-page tables confirm that
 there is indeed contention in this case? I agree it's not huge, but I
 don't know if there are workloads where it will matter.

That appears to happen because of the fraction of pinned pages being too
high (aside: it was fairly challenging to construct a test where that
happened). I believe it was mostly solved by adding a threshold to use
the VM, and I can probably solve it completely by doing some more
experiments and finding a better threshold value.
 
 I understand that my patch is probably not going in, 
 
 Sorry about that. I know how that feels. But we need some more reasons
 to justify this change, especially because the performance numbers
 themselves are not showing any signs either way.

That confuses me. The testing was to show it didn't hurt other workloads
(like scans or inserts/updates/deletes); so the best possible result is
that they don't show signs either way.

But yes, I see that others are not interested in the benefits offered by
the patch, which is why I'm giving up on it. If people are concerned
about the costs, then I can fix those; but there's nothing I can do to
increase the benefits.

Regards,
Jeff Davis






Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-21 Thread Heikki Linnakangas

On 21.01.2013 11:10, Jeff Davis wrote:

That confuses me. The testing was to show it didn't hurt other workloads
(like scans or inserts/updates/deletes); so the best possible result is
that they don't show signs either way.


I went back to look at the initial test results that demonstrated that 
keeping the pin on the VM buffer mitigated the overhead of pinning the 
vm page. The obvious next question is, what is the impact when that's 
inefficient, ie. when you update pages across different 512 MB sections, 
so that the vm pin has to be dropped and reacquired repeatedly.


I tried to construct a worst case scenario for that:

create unlogged table testtable (i int4);
insert into testtable select generate_series(1, 1500);
insert into testtable select generate_series(1, 1500);
create index testtable_index on testtable (i);

When you traverse tuples using that index, the tuples will come 
alternating from low-numbered pages and high-numbered pages, which 
defeats keeping the last vm page pinned. To test, I ran this:


set enable_bitmapscan=off; set enable_seqscan=off;
begin;
delete from testtable where i >= 0;
rollback;

I repeated a few times with and without the patch 
(rm-pd-all-visible-0118.patch). According to \timing, the delete takes 
about 12.6 seconds without the patch, and 15.3 seconds with it.


This is a worst-case scenario, and the slowdown isn't huge, so maybe 
it's a worthwhile tradeoff. But it shows that removing PD_ALL_VISIBLE is 
not completely free.


- Heikki




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-21 Thread Robert Haas
On Mon, Jan 21, 2013 at 1:52 AM, Jeff Davis pg...@j-davis.com wrote:
  Of course, there is an argument that this patch will
 simplify the code, but I'm not sure if it's enough to justify the
 additional contention which may or may not show up in the benchmarks
 we are running, but we know it's there.

 What additional contention? How do you know it's there?

We know it's there because every additional page that you access has
to be looked up and locked and the memory that contains it brought
into the CPU cache.  If you look up and lock more pages, you WILL have
more contention - both for memory, and for locks - and more runtime
cost.  Whether or not you can currently detect that contention and/or
runtime cost experimentally is another matter.  Failure to do so could
indicate either that you haven't got the right test case - Heikki
seems to have found one that works - or that it's being masked by
other things, but might not be always.

To raise an example which I believe is similar to this one, consider
Kevin Grittner's work on SSI.  He set himself a goal that SSI should
impose a performance regression of no more than 2% - and he met that
goal, at the time the code was committed.  But then, my read
scalability work during the 9.2 cycle blew the lid off of read
performance for certain workloads, and now SSI is FAR slower on those
workloads.   His code always had a scalability problem, but you
couldn't measure it experimentally before I did that work, because
there were equally bad scalability problems elsewhere.  Now there
aren't, and you can, and I have.  We might well have refused to commit
that patch with the locking scheme it uses today if my scalability
work had been done first - or perhaps we would have decided that the
case where the difference is large is too narrow to be worth worrying
about, but I think it would have at least provoked discussion.

My biggest concern about the visibility map is that it will act as a
focal point for contention.  Map half a gigabyte of heap down to 1
page of visibility map and it's easy to think that you could have some
situation in which that page gets very, very hot.  We know that cache
line stealing is a significant cost of ProcArray manipulation, which
is why the PGXACT patch makes a significant difference in write
throughput.  Now your argument seems to be that we can't measure any
such effect here, but maybe we're just not measuring the right thing.
Heikki's example reminded me that I was concerned about the cost of
repeatedly pinning different VM buffers during an index-only scan, if
the relation were very large.  That's seems easy to justify on the
grounds that you're saving so much I/O and memory traffic that the
pins will seem cheap by comparison ... but they don't, always.  IIRC,
an index-only scan on a fully-cached relation is not necessarily
faster than a regular index-scan, even if the heap is entirely
all-visible.

I realize these examples aren't directly applicable to this case, so
I'm not sure if they help to advance the discussion or not.  I advance
them only as evidence that simple tests don't always manage to capture
the real costs and benefits of a feature or optimization, and that
predicting the system-wide effects of changes in this area is hard.
For a large benefit I think we would perhaps bite the bullet and take
our chances, but in my (human and imperfect) judgement the benefit
here is not large enough to justify it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-20 Thread Robert Haas
On Thu, Jan 17, 2013 at 5:50 PM, Jeff Davis pg...@j-davis.com wrote:
 If the "without interruption" part becomes a practical problem, it seems
 fairly easy to fix: drop the pin and pick it up again once every K
 pages. Unless I'm missing something, this is a minor concern.

I think probably so.

 Test plan:

   1. Take current patch (without skip VM check for small tables
 optimization mentioned above).
   2. Create 500 tables each about 1MB.
   3. VACUUM them all.
   4. Start 500 connections (one for each table)
   5. Time the running of a loop that executes a COUNT(*) on that
 connection's table 100 times.

 I think shared_buffers=64MB is probably appropriate. We want some memory
 pressure so that it has to find and evict pages to satisfy the queries.
 But we don't want it to be totally exhausted and unable to even pin a
 new page; that really doesn't tell us much except that max_connections
 is too high.

 Sound reasonable?

Well, it's certainly a data point, but each table in that case has 128
pages, so the effect is still pretty small.  The place where you're
going to run into trouble is when many of those tables have 4 pages
each, or 2 pages each, or 1 page each.

 All of which is to say that I think this patch is premature.  If we
 adopt something like this, we're likely never going to revert back to
 the way we do it now, and whatever cache-pressure or other costs this
 approach carries will be here to stay - so we had better think awfully
 carefully before we do that.

 What about this patch makes it hard to undo/rework in the future?

Well, if you have a bunch of code, and it preserves property X at all
times, it is pretty easy to decide that new code need no longer
preserve property X.  It is essentially a no-op.  The reverse, going
through a bunch of existing code that does not consistently preserve
property X and making it do so, is always much harder, because you've
got to audit all the code you've already got.  I don't want to undo
the work that's been done on this over the last four years without a
really good reason, and I'm not seeing one.

 My mistake. I believe that is already fixed, and certainly not a
 fundamental issue.

It at least calls for a repetition of any performance testing that has
already been done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-20 Thread Robert Haas
On Fri, Jan 18, 2013 at 3:31 AM, Jeff Davis pg...@j-davis.com wrote:
 So, I attached a new version of the patch that doesn't look at the VM
 for tables with fewer than 32 pages. That's the only change.

That certainly seems worthwhile, but I still don't want to get rid of
this code.  I'm just not seeing a reason why that's something that
desperately needs to be done.  I don't think this is a barrier to
anything else we want to do, and it might well be that the situations
where this patch would hurt us are currently masked by other
bottlenecks, but won't be always.  Right now, the vast majority of
heap updates don't need to pin the visibility map page; with this
change, all of them do.  Now, I accept that your test results show
that that doesn't matter, but how can that not be an artifact of some
kind?  Can we really credit that accessing two pages costs no more
than accessing one?  To what countervailing factor could we plausibly
attribute that?

Now, even if it costs more in some narrow sense, the difference might
not be enough to matter.  But without some big gain on the other side,
why tinker with it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-20 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Jan 18, 2013 at 3:31 AM, Jeff Davis pg...@j-davis.com wrote:
 So, I attached a new version of the patch that doesn't look at the VM
 for tables with fewer than 32 pages. That's the only change.

 That certainly seems worthwhile, but I still don't want to get rid of
 this code.  I'm just not seeing a reason why that's something that
 desperately needs to be done.

Yeah, I'm having the same problem.  Despite Jeff's test results, I can't
help thinking that lack of PD_ALL_VISIBLE *will* hurt us under some
workloads, and it's not obvious to me what benefit we get from dropping
it.

regards, tom lane




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-20 Thread Pavan Deolasee
On Mon, Jan 21, 2013 at 8:49 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Jan 18, 2013 at 3:31 AM, Jeff Davis pg...@j-davis.com wrote:
 So, I attached a new version of the patch that doesn't look at the VM
 for tables with fewer than 32 pages. That's the only change.

 That certainly seems worthwhile, but I still don't want to get rid of
 this code.  I'm just not seeing a reason why that's something that
 desperately needs to be done.

 Yeah, I'm having the same problem.  Despite Jeff's test results, I can't
 help thinking that lack of PD_ALL_VISIBLE *will* hurt us under some
 workloads, and it's not obvious to me what benefit we get from dropping
 it.

I tend to agree. When I looked at the patch, I thought since it's
removing a WAL record (and associated redo logic), it has some
additional value. But that was kind of broken (sorry, I haven't looked
at the latest patch to see if Jeff fixed it without reinstating
the WAL logging). I also thought that bug invalidated the performance
numbers reported. Of course, there is an argument that this patch will
simplify the code, but I'm not sure if it's enough to justify the
additional contention which may or may not show up in the benchmarks
we are running, but we know it's there.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-20 Thread Jeff Davis
On Sun, 2013-01-20 at 22:19 -0500, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Fri, Jan 18, 2013 at 3:31 AM, Jeff Davis pg...@j-davis.com wrote:
  So, I attached a new version of the patch that doesn't look at the VM
  for tables with fewer than 32 pages. That's the only change.
 
  That certainly seems worthwhile, but I still don't want to get rid of
  this code.  I'm just not seeing a reason why that's something that
  desperately needs to be done.
 
 Yeah, I'm having the same problem.  Despite Jeff's test results, I can't
 help thinking that lack of PD_ALL_VISIBLE *will* hurt us under some
 workloads, and it's not obvious to me what benefit we get from dropping
 it.

OK. I respectfully disagree with the arguments I've seen so far, but we
can all move on.

I already had some more test results (again, thanks to Nathan Boley), so
I finished them up and attached them to the bottom of this email for the
archives (there's always hope, right?).

Regards,
Jeff Davis


Test goal: is 32 is an appropriate threshold for using the VM after we
remove PD_ALL_VISIBLE?

Test setup: 500 31-page tables and 500 33-page tables. Test recent build
against patched version, with varying shared_buffers. The first test is
500 connections each doing 20 iterations of COUNT(*) against the 500
31-page tables. The next test is the same, except against the 33-page
tables.

Test results:

  (There were a few outliers I discarded as being too fast. It always
happened in the first run, and I strongly suspect the connections began
unevenly, leading to lower concurrency. They didn't seem to favor one
build over another.)

master, 31-page (first column is shared_buffers, second is range of
seconds):
32MB:  5.8 -  6.1
64MB:  3.1 -  3.7
96MB:  1.7 -  2.2
   112MB:  0.6 -  1.1
   128MB:  0.4 -  0.4
   256MB:  0.4 -  0.4

patch, 31-page (doesn't use VM because 31 is below threshold):
32MB:  5.1 -  5.9  
64MB:  3.4 -  3.8
96MB:  1.7 -  2.0
   112MB:  0.7 -  1.1
   128MB:  0.4 -  0.5
   256MB:  0.4 -  0.5

master, 33-page:
32MB:  5.9 -  7.0
64MB:  4.2 -  4.7
96MB:  2.4 -  2.8
   112MB:  1.2 -  1.6
   128MB:  0.5 -  0.5
   256MB:  0.4 -  0.5

patch, 33-page (does use VM because 33 is above threshold):
32MB:  6.2 -  7.2
64MB:  4.4 -  4.7
96MB:  2.8 -  3.0
   112MB:  1.7 -  1.8
   128MB:  0.5 -  1.0
   256MB:  0.4 -  0.5

Conclusion:

32 pages is a low enough threshold that skipping the VM is
insignificant.

When the 500 tables are 33 pages, and it does use the VM, we do see a
measurable cost under cache pressure. The penalty is fairly small (10%
ballpark), and this is a worst-case, so I don't think it's a problem.
From the last test results, we know it gets back to even by the time the
tables are 128 pages (1MB). So it could be that there is a slightly
higher threshold (between 32 and 128) that would be slightly better. But
given how specific this case is and how small the penalty is, I think 32
is a fine threshold.

Also, to reiterate, I focused my tests almost entirely on scans, though
some of the concern was around inserts/updates/deletes. I tried, but was
unable to show any difference on those tests. I suspect that the
bottleneck is elsewhere.






Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-20 Thread Jeff Davis
On Mon, 2013-01-21 at 11:27 +0530, Pavan Deolasee wrote:
 I tend to agree. When I looked at the patch, I thought since it's
 removing a WAL record (and associated redo logic), it has some
 additional value. But that was kind of broken (sorry, I haven't looked
 at the latest patch to see if Jeff fixed it without reinstating
 the WAL logging). I also thought that bug invalidated the performance
 numbers reported.

I revalidated the performance numbers after reinstating the WAL logging.
No difference (which is unsurprising, since my tests were read-only).

  Of course, there is an argument that this patch will
 simplify the code, but I'm not sure if it's enough to justify the
 additional contention which may or may not show up in the benchmarks
 we are running, but we know it's there.

What additional contention? How do you know it's there?

The design is to keep the VM page pinned, and to read it without
requiring locks (like index-only scans already do). So I don't see any
obvious additional contention there, unless you're talking about the
work the CPU does for cache coherency (which did not show up in any of
my tests).

I understand that my patch is probably not going in, but I would like to
be clear about what is a known practical problem, what is a theoretical
problem that has eluded my testing capabilities, and what is no problem
at all.

Regards,
Jeff Davis






Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-20 Thread Pavan Deolasee
On Mon, Jan 21, 2013 at 12:22 PM, Jeff Davis pg...@j-davis.com wrote:


 On Mon, 2013-01-21 at 11:27 +0530, Pavan Deolasee wrote:
   Of course, there is an argument that this patch will
  simplify the code, but I'm not sure if it's enough to justify the
  additional contention which may or may not show up in the benchmarks
  we are running, but we know it's there.

 What additional contention? How do you know it's there?


At the minimum your patch will need to have one additional buffer pinned
for every K < 8192 * 8 heap pages. For most cases, the value of K will be
large enough to ignore the overhead, but for small values of K, I'm not
sure if we can say that with confidence. Of course, for very small tables
the real contention might be somewhere else and so this change may not show
anything on the benchmarks. Don't your own tests for 33-page tables
confirm that there is indeed contention in this case? I agree it's not
huge, but I don't know if there are workloads where it will matter.


 I understand that my patch is probably not going in,


Sorry about that. I know how that feels. But we need some more reasons to
justify this change, especially because the performance numbers themselves
are not showing any signs either way. One could have argued that this saves
us a valuable page level bit, but with pg_upgrade etc it has become hard to
re-utilize page level bits for other purposes and we don't yet have a
pressing need for more bits.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-18 Thread Jeff Davis
On Thu, 2013-01-17 at 14:53 -0800, Jeff Davis wrote:
 Test plan:
 
   1. Take current patch (without skip VM check for small tables
 optimization mentioned above).
   2. Create 500 tables each about 1MB.
   3. VACUUM them all.
   4. Start 500 connections (one for each table)
   5. Time the running of a loop that executes a COUNT(*) on that
 connection's table 100 times.

Done, with a few extra variables. Again, thanks to Nathan Boley for
lending me the 64-core box. Test program attached.

I did both 1MB tables and 1 tuple tables, but I ended up throwing out
the 1-tuple table results. First of all, as I said, that's a pretty easy
problem to solve, so not really what I want to test. Second, I had to do
so many iterations that I don't think I was testing anything useful. I
did see what might have been a couple differences, but I would need to
explore in more detail and I don't think it's worth it, so I'm just
reporting on the 1MB tables.

For each test, each of 500 connections runs 10 iterations of a COUNT(*)
on its own 1MB table (which is vacuumed and has the VM bit set). The
query is prepared once. The table has only an int column.

The variable is shared_buffers, going from 32MB (near exhaustion for 500
connections) to 2048MB (everything fits).

The last column is the time range in seconds. I included the range this
time, because there was more variance in the runs but I still think they
are good test results.

master:
32MB: 16.4 - 18.9
64MB: 16.9 - 17.3
   128MB: 17.5 - 17.9
   256MB: 14.7 - 15.8
   384MB:  8.1 -  9.3
   448MB:  4.3 -  9.2
   512MB:  1.7 -  2.2
   576MB:  0.6 -  0.6
  1024MB:  0.6 -  0.6
  2048MB:  0.6 -  0.6

patch:
32MB: 16.8 - 17.6
64MB: 17.1 - 17.5
   128MB: 17.2 - 18.0
   256MB: 14.8 - 16.2
   384MB:  8.0 - 10.1
   448MB:  4.6 -  7.2
   512MB:  2.0 -  2.6
   576MB:  0.6 -  0.6
  1024MB:  0.6 -  0.6
  2048MB:  0.6 -  0.6

Conclusion:

I see about what I expect: a precipitous drop in runtime after
everything fits in shared_buffers (500 1MB tables means the inflection
point around 512MB makes a lot of sense). There does seem to be a
measurable difference right around that inflection point, but it's not
much. Considering that this is the worst case that I could devise, I am
not too concerned about this.

However, it is interesting to see that there really is a lot of
maintenance work being done when we need to move pages in and out of
shared buffers. I'm not sure that it's related to the freelists though.

For the extra pins to really be a problem, I think a much higher
percentage of the buffers would need to be pinned. Since the case we are
worried about involves scans (if it involved indexes, that would already
be using more than one pin per scan), then that means the only way to
get to a high percentage of pinned buffers is by having very small
tables. But we don't really need to use the VM when scanning very small
tables (the overhead would be elsewhere), so I think we're OK.

So, I attached a new version of the patch that doesn't look at the VM
for tables with fewer than 32 pages. That's the only change.
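
To make the shape of that change concrete, here is a minimal sketch of
such a cutoff (the names here are hypothetical illustrations, not the
identifiers used in the attached patch):

	/* Hypothetical cutoff: below this, pinning a VM page isn't worth it. */
	#define SKIP_VM_THRESHOLD 32		/* heap pages */

	static bool
	scan_should_use_vm(BlockNumber nblocks)
	{
		/*
		 * For smaller tables, check tuple visibility directly; the
		 * table is so small that the per-page VM lookup savings
		 * can't pay for the extra buffer pin.
		 */
		return nblocks >= SKIP_VM_THRESHOLD;
	}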

Regards,
Jeff Davis

#include <libpq-fe.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

#define QSIZE 256

void
test(char *query, int procnum, int niter)
{
  PGconn	*conn;
  PGresult	*result;
  int		 i;

  (void) procnum;		/* unused; kept for symmetry with the call site */

  conn = PQconnectdb("host=/tmp dbname=postgres");
  if (PQstatus(conn) != CONNECTION_OK)
    {
      fprintf(stderr, "connection failed!\n");
      exit(1);
    }

  result = PQprepare(conn, "q", query, 0, NULL);
  if (PQresultStatus(result) != PGRES_COMMAND_OK)
    {
      fprintf(stderr, "PREPARE failed: %s", PQerrorMessage(conn));
      PQclear(result);
      exit(1);
    }
  PQclear(result);

  for (i = 0; i < niter; i++)
    {
      result = PQexecPrepared(conn, "q", 0, NULL, NULL, NULL, 0);
      if (PQresultStatus(result) != PGRES_TUPLES_OK)
	{
	  fprintf(stderr, "EXECUTE PREPARED failed: %s\n", PQerrorMessage(conn));
	  PQclear(result);
	  exit(1);
	}
      PQclear(result);
    }

  PQfinish(conn);
}

int
main(int argc, char *argv[])
{
  int	 niter;
  int	 nprocs;
  char	 query[QSIZE];
  int	 i;
  pid_t *procs;
  struct timeval tv1, tv2;

  if (argc != 3)
    {
      fprintf(stderr, "expected 2 arguments, got %d\n", argc - 1);
      exit(1);
    }

  nprocs = atoi(argv[1]);
  niter = atoi(argv[2]);

  procs = malloc(sizeof(pid_t) * nprocs);

  gettimeofday(&tv1, NULL);

  /* fork one worker per table; worker i hammers table mb_i */
  for (i = 0; i < nprocs; i++)
    {
      pid_t pid = fork();
      if (pid == 0)
	{
	  snprintf(query, QSIZE, "SELECT COUNT(*) FROM mb_%d;", i);
	  test(query, i, niter);
	  exit(0);
	}
      else
	{
	  procs[i] = pid;
	}
    }

  for (i = 0; i < nprocs; i++)
    {
      int status;
      waitpid(procs[i], &status, 0);
      if (!WIFEXITED(status))
	{
	  fprintf(stderr, "child did not exit!\n");
	  exit(1);
	}
      if (WEXITSTATUS(status) != 0)
	{
	  fprintf(stderr, "child exited with status %d\n", WEXITSTATUS(status));
	  exit(1);
	}
    }

  gettimeofday(&tv2, NULL);

  free(procs);

  /* report total wall-clock time for all workers */
  printf("%f\n", (tv2.tv_sec - tv1.tv_sec) +
	 (tv2.tv_usec - tv1.tv_usec) / 1000000.0);

  return 0;
}

Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Simon Riggs
On 17 January 2013 03:02, Jeff Davis pg...@j-davis.com wrote:

 Rebased patch attached. No significant changes.

Jeff, can you summarise/collate why we're doing this, what concerns it
raises and how you've dealt with them? That will help decide whether
to commit.

Thanks

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Pavan Deolasee
On Thu, Jan 17, 2013 at 2:11 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 17 January 2013 03:02, Jeff Davis pg...@j-davis.com wrote:

  Rebased patch attached. No significant changes.

 Jeff, can you summarise/collate why we're doing this, what concerns it
 raises and how you've dealt with them? That will help decide whether
 to commit.


+1. On another thread, "Set visibility map bit after HOT prune", Robert
mentioned that it's not such a good idea to remove the PD_ALL_VISIBLE
flag, because it helps us to reduce the contention on the VM page,
especially when we need to reset the VM bit. Here is an excerpt from
Robert's comment on that thread:

Sure, but you're zipping rather blithely past the disadvantages of
such an approach.  Jeff Davis recently proposed getting rid of
PD_ALL_VISIBLE, and Tom and I both expressed considerable skepticism
about that; this proposal has the same problems.  One of the major
benefits of PD_ALL_VISIBLE is that, when it isn't set, inserts,
updates, and deletes to the page can ignore the visibility map.  That
means that a server under heavy concurrency is much less likely to
encounter contention on the visibility map blocks.  Now, maybe that's
not really a problem, but I sure haven't seen enough evidence to make
me believe it.  If it's really true that PD_ALL_VISIBLE needn't fill
this role, then Heikki wasted an awful lot of time implementing it,
and I wasted an awful lot of time keeping it working when I made the
visibility map crash-safe for IOS.  That could be true, but I tend to
think it isn't.

May be you've already addressed that concern with the proven
performance numbers, but I'm not sure.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Abhijit Menon-Sen
At 2013-01-17 08:41:37 +, si...@2ndquadrant.com wrote:

 Jeff, can you summarise/collate why we're doing this, what concerns it
 raises and how you've dealt with them?

Since I was just looking at the original patch and discussion, and since
Pavan has posted an excerpt from one objection to it, here's an excerpt
from Jeff's original post titled "Do we need so many hint bits?":

http://www.postgresql.org/message-id/1353026577.14335.91.camel@sussancws0025

Also, I am wondering about PD_ALL_VISIBLE. It was originally
introduced in the visibility map patch, apparently as a way to know
when to clear the VM bit when doing an update. It was then also used
for scans, which showed a significant speedup. But I wonder: why not
just use the visibilitymap directly from those places? It can be
used for the scan because it is crash safe now (not possible
before). And since it's only one lookup per scanned page, then I
don't think it would be a measurable performance loss there.
Inserts/updates/deletes also do a significant amount of work, so
again, I doubt it's a big drop in performance there -- maybe under a
lot of concurrency or something.

The benefit of removing PD_ALL_VISIBLE would be significantly
higher. It's quite common to load a lot of data, and then do some
reads for a while (setting hint bits and flushing them to disk), and
then do a VACUUM a while later, setting PD_ALL_VISIBLE and writing
all of the pages again. Also, if I remember correctly, Robert went
to significant effort when making the VM crash-safe to keep the
PD_ALL_VISIBLE and VM bits consistent. Maybe this was all discussed
before?

There was considerable discussion after this (accessible through the
archives link above), which I won't attempt to summarise.

-- Abhijit




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Pavan Deolasee
On Thu, Jan 17, 2013 at 2:57 PM, Abhijit Menon-Sen a...@2ndquadrant.comwrote:



 There was considerable discussion after this (accessible through the
 archives link above), which I won't attempt to summarise.


I thought Robert made those comments after considerable discussion of
Jeff's approach. So he probably still stands by his objections, or at
least is not satisfied with (or has not seen) the numbers.

Now that I look at the patch, I wonder if there is another fundamental
issue with the patch. Since the patch removes WAL logging for the VM set
operation, this can happen:

1. Vacuum kicks in and clears all dead tuples in a page and decides that
its all-visible
2. Vacuum WAL-logs the cleanup activity and marks the page dirty
3. Vacuum sets the visibility bit and marks the VM page dirty
4. Say the VM page gets written to the disk. The heap page is not yet
written neither the WAL log corresponding to the cleanup operation
5. CRASH

After recovery, the VM bit will remain set because the VM page got written
before the crash. But since the heap page's cleanup WAL did not make it to
disk, those operations won't be replayed. The heap page will be left with
not-all-visible tuples in that case, and that's not a good state to be in.

The original code does not have this problem because the VM set WAL gets
written after the heap page cleanup WAL. So it's guaranteed that the VM bit
will be set during recovery only if the cleanup WAL is replayed too (there
is more magic than meets the eye, and I think it's not fully
documented).

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Robert Haas
On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 May be you've already addressed that concern with the proven
 performance numbers, but I'm not sure.

It would be nice to hear what Heikki's reasons were for adding
PD_ALL_VISIBLE in the first place.

Jeff's approach of holding the VM pins for longer certainly mitigates
some of the damage, in the sense that it reduces buffer lookups and
pin/unpin cycles - and might be worth doing independently of the rest
of the patch if we think it's a win.  Index-only scans already use a
similar optimization, so extending it to inserts, updates, and deletes
is surely worth considering.  The main question in my mind is whether
there are any negative consequences to holding a VM buffer pin for
that long without interruption.  The usual consideration - namely,
blocking vacuum - doesn't apply here because vacuum does not take a
cleanup lock on the visibility map page, only the heap page, but I'm
not sure if there are others.

The other part of the issue is cache pressure. I don't think I can say
it better than what Tom already wrote:

# I'd be worried about the case of a lot of sessions touching a lot of
# unrelated tables.  This change implies doubling the number of buffers
# that are held pinned by any given query, and the distributed overhead
# from that (eg, adding cycles to searches for free buffers) is what you
# ought to be afraid of.

I agree that we ought to be afraid of that.  A pgbench test isn't
going to find a problem in this area because there you have a bunch of
sessions all accessing the same small group of tables.  To find a
problem of the type above, you'd need lots of backends accessing lots
of different, small tables.  That's not the use case we usually
benchmark, but I think there are a fair number of people doing things
like that - for example, imagine a hosting provider or web application
with many databases or schemas on a single instance.  AFAICS, Jeff
hasn't tried to test this scenario.

Now, on the flip side, we should also be thinking about what we would
gain from this patch, and what other ways there might be to achieve
the same goals.  As far as I can see, the main gain is that if you
bulk-load a table, don't vacuum it right away, get all the hint bits
set by some other mechanism, and then vacuum the table, you'll only
read the whole table instead of rewriting the whole table.  So ISTM
that, for example, if we adopted the idea of making HOT prune set
visibility-map bits, most of the benefit of this change evaporates,
but whatever costs it may have will remain.  There are other possible
ways of getting the same benefit as well - for example, I've been
thinking for a while now that we should try to find some judicious way
of vacuuming insert-only tables, perhaps only in small chunks when
there's nothing else going on.  That wouldn't as clearly obsolete this
patch, but it would address a very similar use case and would also
preset hint bits, which would help a lot of people.  Some of the ideas
we've had about a HEAP_XMIN_FREEZE intersect with this idea, too - if
we can do an early freeze without losing forensic information, we can
roll setting the hint bit, setting PD_ALL_VISIBLE, and freezing the
page into a single write.

All of which is to say that I think this patch is premature.  If we
adopt something like this, we're likely never going to revert back to
the way we do it now, and whatever cache-pressure or other costs this
approach carries will be here to stay - so we had better think awfully
carefully before we do that.  And, FWIW, I don't believe that there is
sufficient time in this release cycle to carefully test this patch and
the other alternative designs that aim toward the same ends.  Even if
there were, this is exactly the sort of thing that should be committed
at the beginning of a release cycle, not the end, so as to allow
adequate time for discovery of unforeseen consequences before the code
ends up out in the wild.

Of course, there's another issue here too, which is that as Pavan
points out, the patch throws crash-safety out the window, which breaks
index-only scans (if you have a crash).  HEAP_XLOG_VISIBLE is intended
principally to protect the VM bit, not PD_ALL_VISIBLE, but the patch
rips it out even though its purpose is to remove the latter, not the
former.  Removing PD_ALL_VISIBLE eliminates the need to keep the
visibility-map bit consistent with PD_ALL_VISIBLE, but it does not
eliminate the need to keep PD_ALL_VISIBLE consistent with the page
contents.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 16:53, Robert Haas wrote:

On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
pavan.deola...@gmail.com  wrote:

May be you've already addressed that concern with the proven
performance numbers, but I'm not sure.


It would be nice to hear what Heikki's reasons were for adding
PD_ALL_VISIBLE in the first place.


The idea was to avoid clearing the bit in the VM page on every update, 
when the bit is known to not be set, i.e. when the PD_ALL_VISIBLE flag is
not set. I assumed the traffic and contention on the VM page would be a 
killer otherwise. I don't remember if I ever actually tested that 
though. Maybe I was worrying about nothing and hitting the VM page on 
every update is ok.
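
For context, the fast path this flag enables in the write path looks
roughly like the following sketch (using the PageIsAllVisible /
visibilitymap_clear calls; error handling and WAL details elided):

	/*
	 * Only touch the VM when the page-level flag says the bit may
	 * actually be set; otherwise skip the VM buffer entirely.
	 */
	if (PageIsAllVisible(page))
	{
		PageClearAllVisible(page);
		visibilitymap_clear(relation, blkno, vmbuffer);
	}

Without the flag, every insert/update/delete has to consult the VM
instead, which is the contention I was worried about.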


- Heikki




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote:
 Now that I look at the patch, I wonder if there is another fundamental
 issue with the patch. Since the patch removes WAL logging for the VM
 set operation, this can happen:
 
Thank you. I think I was confused by this comment here:

When we *set* a visibility map bit during VACUUM, we must write WAL.  This
may seem counterintuitive, since the bit is basically a hint: if it is
clear, it may still be the case that every tuple on the page is visible
to all transactions; we just don't know that for certain.  The
difficulty is that there are two bits which are typically set together:
the PD_ALL_VISIBLE bit on the page itself, and the visibility map bit.
If a crash occurs after the visibility map page makes it to disk and
before the updated heap page makes it to disk, redo must set the bit on
the heap page. Otherwise, the next insert, update, or delete on the heap
page will fail to realize that the visibility map bit must be cleared,
possibly causing index-only scans to return wrong answers.

Which led me to believe that I could just rip out the WAL-related code
if PD_ALL_VISIBLE goes away, which is incorrect. But the incorrectness
doesn't have to do with the WAL directly, it's because the VM page's LSN
is not bumped past the LSN of the related heap page cleanup, so it can
be written too early.

Of course, the way to bump the LSN is to write WAL for the
visibilitymap_set operation. But that would be a very simple WAL
routine, rather than the complex one that exists without the patch.

I suppose we could even try to bump the LSN without writing WAL somehow,
but it doesn't seem worth reasoning through that (setting a VM bit is
rare enough).
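
To be concrete, the simple routine amounts to something like this
sketch (9.2-era XLogInsert/XLogRecData interface; the record name and
struct here are hypothetical stand-ins, not the actual patch):

	xl_vm_set	xlrec;		/* hypothetical: {node, heapBlk} */
	XLogRecData	rdata;
	XLogRecPtr	recptr;

	xlrec.node = rel->rd_node;
	xlrec.heapBlk = heapBlk;

	rdata.data = (char *) &xlrec;
	rdata.len = sizeof(xl_vm_set);
	rdata.buffer = InvalidBuffer;
	rdata.next = NULL;

	recptr = XLogInsert(RM_HEAP2_ID, XLOG_VM_SET, &rdata);

	/* bump the VM page's LSN past the heap cleanup record */
	PageSetLSN(BufferGetPage(vmbuffer), recptr);

The PageSetLSN() call is the part that matters: it prevents the VM page
from being flushed ahead of the WAL that justifies the bit.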

Regards,
Jeff Davis





Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 10:39 -0800, Jeff Davis wrote:
 On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote:
  Now that I look at the patch, I wonder if there is another fundamental
  issue with the patch. Since the patch removes WAL logging for the VM
  set operation, this can happen:
  
 Thank you. 

New patch attached with simple WAL logging.

Regards,
Jeff Davis



rm-pd-all-visible-20130117.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Simon Riggs
On 17 January 2013 15:14, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 17.01.2013 16:53, Robert Haas wrote:

 On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
 pavan.deola...@gmail.com  wrote:

 May be you've already addressed that concern with the proven
 performance numbers, but I'm not sure.


 It would be nice to hear what Heikki's reasons were for adding
 PD_ALL_VISIBLE in the first place.


 The idea was to avoid clearing the bit in the VM page on every update, when
 the bit is known to not be set, ie. when the PD_ALL_VISIBLE flag is not set.
 I assumed the traffic and contention on the VM page would be a killer
 otherwise. I don't remember if I ever actually tested that though. Maybe I
 was worrying about nothing and hitting the VM page on every update is ok.

Presumably we remember the state of the VM so we can skip the re-visit
after every write?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 17:14 +0200, Heikki Linnakangas wrote:
 I don't remember if I ever actually tested that 
 though. Maybe I was worrying about nothing and hitting the VM page on 
 every update is ok.

I tried, but was unable to show really anything at all, even without
keeping the VM page pinned. I think the bottleneck is elsewhere;
although I am keeping the page pinned in this patch to prevent it from
becoming a bottleneck.

Regards,
Jeff Davis





Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 19:58 +, Simon Riggs wrote:
 Presumably we remember the state of the VM so we can skip the re-visit
 after every write?

That was not a part of my patch, although I remember that you mentioned
that previously and I thought it could be a good way to mitigate a
problem if it ever came up.

However, the tests I did didn't show any problem there. The tests were
somewhat noisy, so perhaps I was doing something wrong, but it didn't
appear that looking at the VM page for every update was a bottleneck.

Regards,
Jeff Davis






Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 09:53 -0500, Robert Haas wrote:
 The main question in my mind is whether
 there are any negative consequences to holding a VM buffer pin for
 that long without interruption.  The usual consideration - namely,
 blocking vacuum - doesn't apply here because vacuum does not take a
 cleanup lock on the visibility map page, only the heap page, but I'm
 not sure if there are others.

If the "without interruption" part becomes a practical problem, it seems
fairly easy to fix: drop the pin and pick it up again once every K
pages. Unless I'm missing something, this is a minor concern.
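
Something like this sketch in the scan loop would do (assuming the
existing visibilitymap_pin()/ReleaseBuffer() interfaces; K is a
hypothetical constant):

	/* every K heap pages, let go of the VM pin and re-acquire it */
	if (BufferIsValid(vmbuffer) && (heapBlk % K) == 0)
	{
		ReleaseBuffer(vmbuffer);
		vmbuffer = InvalidBuffer;
	}
	visibilitymap_pin(scan->rs_rd, heapBlk, &vmbuffer);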

 The other part of the issue is cache pressure. I don't think I can say
 it better than what Tom already wrote:
 
 # I'd be worried about the case of a lot of sessions touching a lot of
 # unrelated tables.  This change implies doubling the number of buffers
 # that are held pinned by any given query, and the distributed overhead
 # from that (eg, adding cycles to searches for free buffers) is what you
 # ought to be afraid of.
 
 I agree that we ought to be afraid of that.

It's a legitimate concern, but I think "being afraid" goes too far (more
below).

 A pgbench test isn't
 going to find a problem in this area because there you have a bunch of
 sessions all accessing the same small group of tables.  To find a
 problem of the type above, you'd need lots of backends accessing lots
 of different, small tables.  That's not the use case we usually
 benchmark, but I think there are a fair number of people doing things
 like that - for example, imagine a hosting provider or web application
 with many databases or schemas on a single instance.  AFAICS, Jeff
 hasn't tried to test this scenario.

The concern here is over a lot of different, small tables (e.g.
multi-tenancy or something similar) as you say. If we're talking about
nearly empty tables, that's easy to fix: just don't use the VM on tables
smaller than N pages.

You could say that small tables are really 1-10MB each, and you could
have a zillion of those. I will try to create a worst-case here and see
what numbers come out. Perhaps the extra time to look for a free buffer
does add up.

Test plan:

  1. Take current patch (without skip VM check for small tables
optimization mentioned above).
  2. Create 500 tables each about 1MB.
  3. VACUUM them all.
  4. Start 500 connections (one for each table)
  5. Time the running of a loop that executes a COUNT(*) on that
connection's table 100 times.

I think shared_buffers=64MB is probably appropriate. We want some memory
pressure so that it has to find and evict pages to satisfy the queries.
But we don't want it to be totally exhausted and unable to even pin a
new page; that really doesn't tell us much except that max_connections
is too high.

Sound reasonable?
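
For steps 2 and 3, a small libpq setup program along these lines would
do (a sketch: the mb_%d naming matches the test driver, and ~30,000
single-int rows come to roughly 1MB of heap):

#include <libpq-fe.h>
#include <stdio.h>
#include <stdlib.h>

#define QSIZE 256

int
main(int argc, char *argv[])
{
  PGconn	*conn;
  PGresult	*res;
  char		 sql[QSIZE];
  int		 ntables = (argc > 1) ? atoi(argv[1]) : 500;
  int		 i;

  conn = PQconnectdb("host=/tmp dbname=postgres");
  if (PQstatus(conn) != CONNECTION_OK)
    {
      fprintf(stderr, "connection failed!\n");
      exit(1);
    }

  for (i = 0; i < ntables; i++)
    {
      /* step 2: one ~1MB table per future connection */
      snprintf(sql, QSIZE,
	       "CREATE TABLE mb_%d AS "
	       "SELECT g AS val FROM generate_series(1, 30000) g;", i);
      res = PQexec(conn, sql);
      if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
	  fprintf(stderr, "CREATE failed: %s", PQerrorMessage(conn));
	  exit(1);
	}
      PQclear(res);

      /* step 3: VACUUM so the VM bits (and PD_ALL_VISIBLE) get set */
      snprintf(sql, QSIZE, "VACUUM mb_%d;", i);
      res = PQexec(conn, sql);
      if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
	  fprintf(stderr, "VACUUM failed: %s", PQerrorMessage(conn));
	  exit(1);
	}
      PQclear(res);
    }

  PQfinish(conn);
  return 0;
}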

 Now, on the flip side, we should also be thinking about what we would
 gain from this patch, and what other ways there might be to achieve
 the same goals.

One thing to keep in mind is that the current code to maintain a
crash-safe PD_ALL_VISIBLE and VM bit is quite complex and doesn't play
by the normal rules. If you want to talk about distributed costs, that
complexity has some very real costs in terms of development effort. For
instance, my checksums patch took me extra time to write (despite the
patch being the simplest checksums design on the table) and will take
others extra time to review.

So, if the only things keeping that code in place are theoretical fears,
let's take them one by one and see if they are real problems or not.

 All of which is to say that I think this patch is premature.  If we
 adopt something like this, we're likely never going to revert back to
 the way we do it now, and whatever cache-pressure or other costs this
 approach carries will be here to stay - so we had better think awfully
 carefully before we do that.

What about this patch makes it hard to undo/rework in the future?

  Even if
 there were, this is exactly the sort of thing that should be committed
 at the beginning of a release cycle, not the end, so as to allow
 adequate time for discovery of unforeseen consequences before the code
 ends up out in the wild.

I'm concerned that such a comment at this stage will cut review short,
which could also prevent it from being committed early in 9.4.

 Of course, there's another issue here too, which is that as Pavan
 points out, the patch throws crash-safety out the window

My mistake. I believe that is already fixed, and certainly not a
fundamental issue.

Regards,
Jeff Davis






Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-16 Thread Jeff Davis
Rebased patch attached. No significant changes.

Regards,
Jeff Davis


rm-pd-all-visible-20130116.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Removing PD_ALL_VISIBLE

2012-12-05 Thread Jeff Davis
On Fri, 2012-11-30 at 13:16 -0800, Jeff Davis wrote:
 I tried for quite a while to show any kind of performance difference
 between checking the VM and checking PD_ALL_VISIBLE on a 12-core box (24
 if you count HT).
 
 Three patches in question:
   1. Current unpatched master
   2. patch that naively always checks the VM page, pinning and unpinning
 each time
   3. Same as #2, but tries to keep buffers pinned (I had to fix a bug in
 this patch though -- new version forthcoming)

New patch attached.

Nathan Boley kindly lent me access to a 64-core box, and that shows a
much more interesting result. The previous test (on the 12-core)
basically showed no difference between any of the patches.

Now I see why: on the 64-core box, the interesting region seems to be
around 32 concurrent connections.

The left column is the concurrency, and the right is the runtime. This
test was for concurrent scans of a 350MB table (each process did 4 scans
and quit). Test program attached.

Patch 1 (scan test):

001 004.299533
002 004.434378
004 004.708533
008 004.518470
012 004.487033
016 004.513915
024 004.765459
032 006.425780
048 007.089146
064 007.908850
072 009.461419
096 013.098646
108 015.278592
128 019.797206

Patch 2 (scan test):

001 004.385206
002 004.596340
004 004.616684
008 004.832248
012 004.858336
016 004.689959
024 005.016797
032 006.857642
048 012.049407
064 025.774772
072 032.680710
096 059.147500
108 083.654806
128 120.350200

Patch 3 (scan test):

001 004.464991
002 004.95
004 004.562364
008 004.649633
012 004.628159
016 004.518748
024 004.768348
032 004.834177
048 007.003305
064 008.242714
072 009.732261
096 013.231056
108 014.996977
128 020.488570

As you can see, patch #2 starts to show a difference at around 32 and
completely falls over by 48 connections. This is expected because it's
the naive approach that pins the VM page every time it needs it.

Patch #1 and #3 are effectively the same; subsequent runs (and with more
measurements around concurrency 32) show that the differences are just
noise (which seems to be greater around the inflection point of 32). All
of the numbers that seem to show any difference can end up with patch #1
better or patch #3 better, depending on the run.

I tried the delete test, too, but I still couldn't see any difference.
(I neglected to mention in my last email: I aborted after each delete so
that it would be repeatable). The inflection point there is
significantly lower, so I assume it must be contending over something
else. I tried making the table unlogged to see if that would change
things, but it didn't change much. This test only scales linearly to
about 8 or so. Or, there could be something wrong with my test.

So, I conclude that contention is certainly a problem for scans for
patch #2, but patch #3 seems to fix that completely by holding the
buffer pins. The deletes are somewhat inconclusive, but I just can't see
a difference.

Holding more pins does have a distributed cost in theory, as Tom points
out, but I don't know where to begin testing that. We'll have to make a
decision between (a) maintaining the extra complexity and doing the
extra page writes involved with PD_ALL_VISIBLE; or (b) holding onto one
extra pin per table being scanned. Right now, if PD_ALL_VISIBLE did not
exist, it would be pretty hard to justify putting it in as far as I can
tell.

Regards,
Jeff Davis

#include <libpq-fe.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

void
test(char *query, int niter)
{
  PGconn	*conn;
  PGresult	*result;
  int		 i;

  conn = PQconnectdb("host=/tmp dbname=postgres");
  if (PQstatus(conn) != CONNECTION_OK)
    {
      fprintf(stderr, "connection failed!\n");
      exit(1);
    }

  result = PQprepare(conn, "", query, 0, NULL);
  if (PQresultStatus(result) != PGRES_COMMAND_OK)
    {
      fprintf(stderr, "PREPARE failed: %s", PQerrorMessage(conn));
      PQclear(result);
      exit(1);
    }
  PQclear(result);

  for (i = 0; i < niter; i++)
    {
      result = PQexecPrepared(conn, "", 0, NULL, NULL, NULL, 0);
      if (PQresultStatus(result) != PGRES_TUPLES_OK)
	{
	  fprintf(stderr, "EXECUTE PREPARED failed: %s\n", PQerrorMessage(conn));
	  PQclear(result);
	  exit(1);
	}
      PQclear(result);
    }

  PQfinish(conn);
}

int
main(int argc, char *argv[])
{
  int	 niter;
  int	 nprocs;
  char	*query = "SELECT COUNT(*) FROM foo;";
  int	 i;
  pid_t *procs;
  struct timeval tv1, tv2;

  if (argc != 3)
    {
      fprintf(stderr, "expected 2 arguments, got %d\n", argc - 1);
      exit(1);
    }

  nprocs = atoi(argv[1]);
  niter = atoi(argv[2]);

  procs = malloc(sizeof(pid_t) * nprocs);

  gettimeofday(&tv1, NULL);

  /* fork nprocs workers, each scanning the same table niter times */
  for (i = 0; i < nprocs; i++)
    {
      pid_t pid = fork();
      if (pid == 0)
	{
	  test(query, niter);
	  exit(0);
	}
      else
	{
	  procs[i] = pid;
	}
    }

  for (i = 0; i < nprocs; i++)
    {
      int status;
      waitpid(procs[i], &status, 0);
      if (!WIFEXITED(status))
	{
	  fprintf(stderr, "child did not exit!\n");
	  exit(1);
	}
      if (WEXITSTATUS(status) != 0)
	{
	  fprintf(stderr, "child exited with status %d\n", WEXITSTATUS(status));
	  exit(1);
	}
    }

  gettimeofday(&tv2, NULL);

  free(procs);

  /* report total wall-clock time for all workers */
  printf("%f\n", (tv2.tv_sec - tv1.tv_sec) +
	 (tv2.tv_usec - tv1.tv_usec) / 1000000.0);

  return 0;
}

Re: [HACKERS] Removing PD_ALL_VISIBLE

2012-11-30 Thread Jeff Davis
On Mon, 2012-11-26 at 16:55 -0600, Merlin Moncure wrote:
  Based on previous measurements, I think there *is* contention pinning
  the root of an index.  Currently, I believe it's largely overwhelmed
  by contention from other sources, such as the buffer manager lwlocks
  and the very-evil ProcArrayLock.  However, I believe that as we fix
  those problems, this will start to percolate up towards the top of the
  heap.
 
 Yup -- it (buffer pin contention on high-traffic buffers) has been caught
 in the wild -- just maintaining the pin count was enough to do it in
 at least one documented case.  Pathological workloads demonstrate
 contention today, and there's no good reason to assume it's limited to
 index root nodes -- I'm strongly suspicious that buffer spinlock issues
 are behind some other malfeasance we've seen recently.

I tried for quite a while to show any kind of performance difference
between checking the VM and checking PD_ALL_VISIBLE on a 12-core box (24
if you count HT).

Three patches in question:
  1. Current unpatched master
  2. patch that naively always checks the VM page, pinning and unpinning
each time
  3. Same as #2, but tries to keep buffers pinned (I had to fix a bug in
this patch though -- new version forthcoming)

I tested from 1 to 64 concurrent connections.

One test was just concurrent scans of a ~400MB table.

The other test was a DELETE FROM foo WHERE ctid BETWEEN '(N,0)' AND
'(N,256)' where N is the worker number in the test program. The table
here is only about 2 MB. The idea is that the scan will happen quickly,
leading to many quick deletes, but the deletes won't actually touch the
same pages (aside from the VM page). So, this was designed to be
uncontended except for pinning the VM page.
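
Each worker built its statement along these lines (a sketch; the
transaction is rolled back so the test stays repeatable):

	snprintf(query, QSIZE,
		 "BEGIN; "
		 "DELETE FROM foo WHERE ctid BETWEEN '(%d,0)' AND '(%d,256)'; "
		 "ROLLBACK;", n, n);

where n is the worker number.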

On the scan test, it was really hard to see any difference in the test
noise, but I may have seen about a 3-4% degradation between patch #1 and
patch #2 at higher concurrencies. It was difficult for me to reproduce
this result -- it usually wouldn't show up. I didn't see any difference
between patch #1 and patch #3.

On the delete test I detected no difference between #1 and #2 at all.

I think someone with access to a larger box may need to test this. Or,
if someone has a more specific suggestion about how I can create a worst
case, then let me know.

Regards,
Jeff Davis





Re: [HACKERS] Removing PD_ALL_VISIBLE

2012-11-26 Thread Jeff Davis
On Sun, 2012-11-25 at 22:30 -0500, Tom Lane wrote:
 I'd be worried about the case of a lot of sessions touching a lot of
 unrelated tables.  This change implies doubling the number of buffers
 that are held pinned by any given query, and the distributed overhead
 from that (eg, adding cycles to searches for free buffers) is what you
 ought to be afraid of.

That's a good point. Doubling might be an exaggeration if indexes are
involved, but it's still a concern. The cost of this might be difficult
to measure though.

 Another possibly important point is that reducing the number of
 pin/unpin cycles for a given VM page might actually hurt the chances of
 it being found in shared buffers, because IIRC the usage_count is bumped
 once per pin/unpin.  That algorithm is based on the assumption that
 buffer pins are not drastically different in lifespan, but I think you
 just broke that for pins on VM pages.

If doing a bunch of simple key lookups using an index, then the root
page of the index is only pinned once per query, but we expect that to stay
in shared buffers. I see the VM page as about the same: one pin per
query (or maybe a couple for large tables).

I don't see how the lifetime of the pin matters a whole lot in this
case; it's more about the rate of pins/unpins, right?

 I'm not particularly concerned about devising solutions for these
 problems, though, because I think this idea is a loser from the get-go;
 the increase in contention for VM pages is alone going to destroy any
 possible benefit.

Your intuition here is better than mine, but I am still missing
something here. If we keep the buffer pinned, then there will be very
few pin/unpin cycles here, so I don't see where the contention would
come from (any more than there is contention pinning the root of an
index).

Do you still think I need a shared lock here or something? If so, then
index-only scans have a pretty big problem right now, too.

I'll try to quantify some of these effects you've mentioned, and see how
the numbers turn out. I'm worried that I'll need more than 4 cores to
show anything though, so perhaps someone with a many-core box would be
interested to test this out?

Regards,
Jeff Davis





Re: [HACKERS] Removing PD_ALL_VISIBLE

2012-11-26 Thread Robert Haas
On Mon, Nov 26, 2012 at 3:29 PM, Jeff Davis pg...@j-davis.com wrote:
 Your intuition here is better than mine, but I am still missing
 something here. If we keep the buffer pinned, then there will be very
 few pin/unpin cycles here, so I don't see where the contention would
 come from (any more than there is contention pinning the root of an
 index).

Based on previous measurements, I think there *is* contention pinning
the root of an index.  Currently, I believe it's largely overwhelmed
by contention from other sources, such as the buffer manager lwlocks
and the very-evil ProcArrayLock.  However, I believe that as we fix
those problems, this will start to percolate up towards the top of the
heap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Removing PD_ALL_VISIBLE

2012-11-26 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 On Sun, 2012-11-25 at 22:30 -0500, Tom Lane wrote:
 Another possibly important point is that reducing the number of
 pin/unpin cycles for a given VM page might actually hurt the chances of
 it being found in shared buffers, because IIRC the usage_count is bumped
 once per pin/unpin.

 If doing a bunch of simple key lookups using an index, then the root
 page of the index is only pinned once per query, but we expect that to stay
 in shared buffers. I see the VM page as about the same: one pin per
 query (or maybe a couple for large tables).

Hmmm ... that seems like a valid analogy.  I may be worried about
nothing as far as this point goes.

 Do you still think I need a shared lock here or something? If so, then
 index-only scans have a pretty big problem right now, too.

There's still the issue of whether the IOS code is safe in machines with
weak memory ordering.  I think that it probably is safe, but the
argument for it in the current code comment is wrong; most likely, a
correct argument has to depend on read/write barriers associated with
taking snapshots.  I think what you ought to do is work through that,
fix the existing comment, and then see whether the same argument works
for what you want to do.

regards, tom lane




Re: [HACKERS] Removing PD_ALL_VISIBLE

2012-11-26 Thread Jeff Davis
On Mon, 2012-11-26 at 16:10 -0500, Tom Lane wrote:
 There's still the issue of whether the IOS code is safe in machines with
 weak memory ordering.  I think that it probably is safe, but the
 argument for it in the current code comment is wrong; most likely, a
 correct argument has to depend on read/write barriers associated with
 taking snapshots.  I think what you ought to do is work through that,
 fix the existing comment, and then see whether the same argument works
 for what you want to do.

As a part of the patch, I did change the comment, and here's what I came
up with:

  * Note on Memory Ordering Effects: visibilitymap_test does not lock
  * the visibility map buffer, and therefore the result we read here
  * could be slightly stale.  However, it can't be stale enough to
  * matter.
  *
  * We need to detect clearing a VM bit due to an insert right away,
  * because the tuple is present in the index page but not visible. The
  * reading of the TID by this scan (using a shared lock on the index
  * buffer) is serialized with the insert of the TID into the index
  * (using an exclusive lock on the index buffer). Because the VM bit is
  * cleared before updating the index, and locking/unlocking of the
  * index page acts as a full memory barrier, we are sure to see the
  * cleared bit if we see a recently-inserted TID.
  *
  * Deletes do not update the index page (only VACUUM will clear out the
  * TID), so the clearing of the VM bit by a delete is not serialized
  * with this test below, and we may see a value that is significantly
  * stale. However, we don't care about the delete right away, because
  * the tuple is still visible until the deleting transaction commits or
  * the statement ends (if it's our transaction). In either case, the
  * lock on the VM buffer will have been released (acting as a write
  * barrier) after clearing the bit. And for us to have a snapshot that
  * includes the deleting transaction (making the tuple invisible), we
  * must have acquired ProcArrayLock after that time, acting as a read
  * barrier.
  *
  * It's worth going through this complexity to avoid needing to lock
  * the VM buffer, which could cause significant contention.

And I updated the comment in visibilitymap.c as well (reformatted for
this email):

To test a bit in the visibility map, most callers should have a pin on
the VM buffer, and at least a shared lock on the data buffer. Any
process that clears the VM bit must have an exclusive lock on the data
buffer, so that will serialize access to the appropriate bit. Because
lock acquisition and release are full memory barriers, then there is no
danger of seeing the state of the bit before it was last cleared.
Callers that don't have the data buffer yet, such as an index only scan
or a VACUUM that is skipping pages, must handle the concurrency as
appropriate.
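
In code form, the pattern that comment describes is just (a sketch
using the existing visibilitymap_test() interface):

	LockBuffer(heapBuf, BUFFER_LOCK_SHARE);
	all_visible = visibilitymap_test(rel, heapBlk, &vmbuffer);
	LockBuffer(heapBuf, BUFFER_LOCK_UNLOCK);

Any process clearing the bit must hold an exclusive content lock on the
same heap buffer, so the lock acquisition above doubles as the memory
barrier that makes the unlocked VM read safe.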

Regards,
Jeff Davis





Re: [HACKERS] Removing PD_ALL_VISIBLE

2012-11-26 Thread Merlin Moncure
On Mon, Nov 26, 2012 at 3:03 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Nov 26, 2012 at 3:29 PM, Jeff Davis pg...@j-davis.com wrote:
 Your intuition here is better than mine, but I am still missing
 something here. If we keep the buffer pinned, then there will be very
 few pin/unpin cycles here, so  I don't see where the contention would
 come from (any more than there is contention pinning the root of an
 index).

 Based on previous measurements, I think there *is* contention pinning
 the root of an index.  Currently, I believe it's largely overwhelmed
 by contention from other sources, such as the buffer manager lwlocks
 and the very-evil ProcArrayLock.  However, I believe that as we fix
 those problems, this will start to percolate up towards the top of the
 heap.

Yup -- it (buffer pin contention on high-traffic buffers) has been caught
in the wild -- just maintaining the pin count was enough to do it in
at least one documented case.  Pathological workloads demonstrate
contention today, and there's no good reason to assume it's limited to
index root nodes -- I'm strongly suspicious that buffer spinlock issues
are behind some other malfeasance we've seen recently.

merlin




Re: [HACKERS] Removing PD_ALL_VISIBLE

2012-11-25 Thread Jeff Davis
On Wed, 2012-11-21 at 18:25 -0800, Jeff Davis wrote:
 Follow up to discussion:
 http://archives.postgresql.org/pgsql-hackers/2012-11/msg00817.php
 
 I worked out a patch that replaces PD_ALL_VISIBLE with calls to
 visibilitymap_test. It rips out a lot of complexity, with a net drop of
 about 300 lines (not a lot, but some of that code is pretty complex).

Updated patch attached.

Now it tries to keep the VM buffer pinned during scans, inserts,
updates, and deletes. This should avoid increased contention pinning the
VM pages, but performance tests are required.

For updates, it currently only tries to hold a pin on the VM buffer for
the page of the original tuple. For HOT updates, that's always the same
as the new buffer anyway. For cold updates, we could also try to keep a
pin on the buffer for the new tuple, but right now I don't see an
obvious need for that complexity. It may plausibly be a problem when
doing a bulk update on a freshly-loaded table.

It occurred to me that it might be difficult to test this patch without
a fairly large test case. A big assumption of my patch is that there
will be locality of access (and the VM page you already have a pin on is
likely to be needed the next time), which is obvious during a scan but
not so obvious during I/U/D. But a single 8K VM page represents some 64K
heap pages (8192 bytes * 8 bits), or about 512MB of data. So anything
less than that means that
there is only one VM page, and locality is trivial... it seems like any
test on a table less than 5GB would not be fair.

Then again, if a 5GB table is being randomly accessed, an extra pin is
unlikely to matter. Also, without locality, the contention would not be
nearly as bad either. I'm still pretty unclear what the worst case for
this patch is supposed to look like.

Regards,
Jeff Davis



rm-pd-all-visible-20121125.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Removing PD_ALL_VISIBLE

2012-11-25 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 Now it tries to keep the VM buffer pinned during scans, inserts,
 updates, and deletes. This should avoid increased contention pinning the
 VM pages, but performance tests are required.
 ...
 Then again, if a 5GB table is being randomly accessed, an extra pin is
 unlikely to matter. Also, without locality, the contention would not be
 nearly as bad either. I'm still pretty unclear what the worst case for
 this patch is supposed to look like.

I'd be worried about the case of a lot of sessions touching a lot of
unrelated tables.  This change implies doubling the number of buffers
that are held pinned by any given query, and the distributed overhead
from that (eg, adding cycles to searches for free buffers) is what you
ought to be afraid of.

Another possibly important point is that reducing the number of
pin/unpin cycles for a given VM page might actually hurt the chances of
it being found in shared buffers, because IIRC the usage_count is bumped
once per pin/unpin.  That algorithm is based on the assumption that
buffer pins are not drastically different in lifespan, but I think you
just broke that for pins on VM pages.

I'm not particularly concerned about devising solutions for these
problems, though, because I think this idea is a loser from the get-go;
the increase in contention for VM pages is alone going to destroy any
possible benefit.

regards, tom lane




[HACKERS] Removing PD_ALL_VISIBLE

2012-11-21 Thread Jeff Davis
Follow up to discussion:
http://archives.postgresql.org/pgsql-hackers/2012-11/msg00817.php

I worked out a patch that replaces PD_ALL_VISIBLE with calls to
visibilitymap_test. It rips out a lot of complexity, with a net drop of
about 300 lines (not a lot, but some of that code is pretty complex).

The patch is quite rough, and I haven't yet added the code to keep the
VM buffer pins around longer, but I still don't see any major problems.
I'm fairly certain that scans will not be adversely affected, and I
think that inserts/updates/deletes should be fine as well.

The main worry I have with inserts/updates/deletes is that there will be
less spatial locality for allocating new buffers or for modifying
existing buffers. So keeping a pinned VM buffer might not help as much,
because it might need to pin a different one, anyway.

Adding to January commitfest, as it's past 11/15.

Regards,
Jeff Davis


rm-pd-all-visible-20121121.patch.gz
Description: GNU Zip compressed data
