Re: [HACKERS] hint bit cache v6

2011-07-05 Thread Merlin Moncure
On Thu, Jun 30, 2011 at 3:42 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, Jun 30, 2011 at 11:44 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure mmonc...@gmail.com wrote:
 I think the basic problem is that the cache pages are so large.  It's
 hard to make them smaller because that increases the cost of accessing
 the cache, as you already noted, but at the same time, making an
 eviction decision on a cache that holds 64K entries based on 100 data
 points seems like waving around a pretty large-caliber weapon in a
 fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
 it's hard to avoid the niggling fear that someone might accidentally
 get shot.

 Right...it's essentially a rolling statistical sampling of transaction
 IDs based on past accesses.  Going from say, 100 to 1000 will probably
 drop your error margin quite a bit, but I really wonder how much benefit
 you can get from going past that.

 An interesting idea might be to forcibly cause a replacement every 100
 tuples (or every 1000 tuples) and see if that causes a noticeable
 performance regression.  If it doesn't, that's a good data point.

 To test this I disabled the cache check on top of
 HeapTupleSatisfiesMVCC and forced a cache entry in place of it with a
 bogus xid (so every tuple scanned was treated as a 'miss').  Scanning 1
 million records in memory over and over went up a few percentage
 points -- converging on about 280ms instead of 270ms.  This is with
 the miss array bumped to 1000 entries:

 oprofile said:
 regular hinted tuple case:
 120      10.2041  advance_aggregates
 107       9.0986  heapgettup_pagemode
 77        6.5476  ExecProject
 74        6.2925  heapgetpage
 72        6.1224  ExecProcNode
 72        6.1224  ExecScan
 69        5.8673  advance_transition_function
 66        5.6122  heap_getnext
 58        4.9320  HeapTupleSatisfiesMVCC

 hinted tuple with pathological cache entry:
 111       8.9588  advance_aggregates
 109       8.7974  heapgettup_pagemode
 85        6.8604  ExecProject
 81        6.5375  heapgetpage
 77        6.2147  ExecScan
 70        5.6497  ExecStoreTuple
 70        5.6497  HeapTupleSatisfiesMVCC

 this was a fairly short profiling run but the numbers are fairly
 consistent.  the replacement is fairly cheap...sorting 1000 ints
 doesn't take all that long. 100 is even faster.

 I think the sour spot for this whole idea is likely to be the case
 where you have a workload that is 90% read and 10% write, or something
 like that.  On an all-read workload (like pgbench -S), you're
 hopefully going to converge to a state where all the hint-bits are
 set, once VACUUM kicks in.  And on an all-write workload I think that
 you might lose the effect in the noise.  Not sure if we have an easy
 way to set up such a workload with pgbench, though.

 it's trivial to test with pgbench -- just use the random number
 facility to generate an id for some table and update where random() >
 .9.  However this does not generate nearly enough 'misses' to
 exercise cache replacement in any meaningful way.  My workstation vm
 can punch out about 5k transactions/sec.  With only 10% getting a new
 xid and writing back a tuple to the table, only 500 xids are getting
 generated/second.  At that rate it takes quite a while to burn through
 the 256k transactions the cache ranges over.  Also, this case is not
 bound by scan performance anyway, making profiling it a little silly
 -- HeapTupleSatisfiesMVCC is simply not as big of a bottleneck in OLTP
 workloads.  Scan-heavy loads are where this matters, and scan-heavy
 loads naturally tend to generate fewer xids per tuple scanned.

 Just for the sake of argument, let's suppose we made an array with 64K
 elements, each one representing one 64K chunk of the XID space.  Each
 element is a 4-byte unsigned integer, so the array takes up 256kB of
 memory... probably not good, but I'm just thinking out loud here.
 Every time we're asked about an XID, we bump the corresponding
 counter.  Periodically (say, every 64K probes) we zip through all the
 counters and consider performing a cache replacement.  We look at the
 not-already-cached page that has the highest counter value and compare
 it to the counter value for the cached pages.  If it is higher and the
 difference exceeds some safety margin (to protect against hysteresis),
 then we do the replacement.  I think something like this would allow
 you to have a lot more information available to make a direct
 apples-to-apples comparison at cache replacement time, as compared
 with what you have now.

 yeah -- what you've done here is basically two things: 1. by mapping
 static ranges you get to skip the sort (but not the scan) during the
 replacement, and 2. by vastly expanding the sampling size you
 eliminate thrashing via noise in the rolling sample.  This comes at a
 significant memory cost and loss of some nimbleness in the cache (i'm
 not completely sure more aggressive replacement is 'bad') -- although
 that could be mitigated by replacing at more frequent intervals.

Re: [HACKERS] hint bit cache v6

2011-06-30 Thread Robert Haas
On Thu, Jun 30, 2011 at 12:31 AM, Jim Nasby j...@nasby.net wrote:
 Would it be reasonable to keep a second level cache that stores individual
 XIDs instead of blocks? That would provide protection for XIDs that are 
 extremely common but don't have a good fit with the pattern of XID ranges 
 that we're caching. I would expect this to happen if you had a transaction 
 that touched a bunch of data (ie: bulk load or update) some time ago (so the 
 other XIDs around it are less likely to be interesting) but not old enough to 
 have been frozen yet. Obviously you couldn't keep too many XIDs in this 
 secondary cache, but if you're just trying to prevent certain pathological 
 cases then hopefully you wouldn't need to keep that many.

Maybe, but I think that's probably still papering around the problem.
I'd really like to find an algorithm that bounds how often we can
flush a page out of the cache to some number of tuples significantly
greater than 100.  The one I suggested yesterday has that property,
for example, although it may have other problems I'm not thinking of.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] hint bit cache v6

2011-06-30 Thread Merlin Moncure
On Wed, Jun 29, 2011 at 3:18 PM, Robert Haas robertmh...@gmail.com wrote:
 With the algorithm as coded, to fully flush the cache you'd have to
 find a series of *unhinted* tuples distributed across a minimum of four
 64k-wide page ranges, with the number of tuples being over the 5%
 noise floor.  Furthermore, to eject a cache page you have to have more
 hits than a page already in the cache got -- hits are tracked on the
 existing pages, the candidate pages are ranked alongside the existing
 pages based on hits, and the best four are picked.  Also, is it
 reasonable to expect the cache to help in these types of cases
 anyway?

 *scratches head*

 Well, surely you must need to reset the counters for the pages you've
 chosen to include in the cache at some point.  If you don't, then
 there's a strong likelihood that after some period of time the pages
 you have in there will become so firmly nailed into the cache that
 they can never be evicted.

yup -- hit counts are reset on replacement.

 Yeah, I am inclined to think that going very far in that direction is
 going to be a big pile of fail.  TBH, I'm somewhat surprised that you
 managed to get what you already have to work without a performance
 regression.  That code path is extremely hot, and injecting more
 complexity seems like it ought to be a loser.  For purposes of this
 review, I'm taking it as given that you've tested this carefully, but
 I'll admit to lingering skepticism.

You are absolutely right to be skeptical.  Here's the rub -- it's
incredibly difficult to contrive a set of circumstances where the
cache is aggressively replaced, and even if you can somehow coerce
that to happen, the underlying data is permanently altered so that it
can't happen again.  Cheating by mucking around with the xid or
altering the visibility routines to force replacement isn't really
fair.  AFAICT, unhinted tuples basically always tend to be both recent
and grouped into a fairly narrow transaction range.  Sooner or later,
at least for tables that matter, a vacuum is going to roll around and
smear the bits back into the table.  My point is that even for the
worst case I could think of -- pgbench continually jamming records
into the table -- replacements tend to be quite rare.  Let's assume,
though, that I'm wrong and the pathological case is likely to happen.

 I think the basic problem is that the cache pages are so large.  It's
 hard to make them smaller because that increases the cost of accessing
 the cache, as you already noted, but at the same time, making an
 eviction decision on a cache that holds 64K entries based on 100 data
 points seems like waving around a pretty large-caliber weapon in a
 fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
 it's hard to avoid the niggling fear that someone might accidentally
 get shot.

Right...it's essentially a rolling statistical sampling of transaction
IDs based on past accesses.  Going from say, 100 to 1000 will probably
drop your error margin quite a bit, but I really wonder how much benefit
you can get from going past that.

 Just for the sake of argument, let's suppose we made an array with 64K
 elements, each one representing one 64K chunk of the XID space.  Each
 element is a 4-byte unsigned integer, so the array takes up 256kB of
 memory... probably not good, but I'm just thinking out loud here.
 Every time we're asked about an XID, we bump the corresponding
 counter.  Periodically (say, every 64K probes) we zip through all the
 counters and consider performing a cache replacement.  We look at the
 not-already-cached page that has the highest counter value and compare
 it to the counter value for the cached pages.  If it is higher and the
 difference exceeds some safety margin (to protect against hysteresis),
 then we do the replacement.  I think something like this would allow
 you to have a lot more information available to make a direct
 apples-to-apples comparison at cache replacement time, as compared
 with what you have now.

yeah -- what you've done here is basically two things: 1. by mapping
static ranges you get to skip the sort (but not the scan) during the
replacement, and 2. by vastly expanding the sampling size you
eliminate thrashing via noise in the rolling sample.  This comes at a
significant memory cost and loss of some nimbleness in the cache (i'm
not completely sure more aggressive replacement is 'bad')  -- although
that could be mitigated by replacing at more frequent intervals.
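
For concreteness, a rough sketch of your counting scheme might look
like this (all the names here are made up for illustration -- the
cache_* hooks stand in for whatever the real cache exposes):

#include <stdint.h>
#include <string.h>

#define XID_PAGE_BITS   16      /* one bucket per 64K XIDs */
#define NBUCKETS        65536   /* buckets covering the full 32-bit XID space */
#define PROBE_INTERVAL  65536   /* consider replacement every 64K probes */
#define SAFETY_MARGIN   1000    /* hysteresis guard; tunable */

static uint32_t probe_counts[NBUCKETS];     /* the 256kB array discussed above */
static uint32_t nprobes = 0;

/* hypothetical hooks into the existing cache */
extern int      cache_bucket_is_cached(uint32_t bucket);
extern uint32_t cache_min_cached_count(void);
extern void     cache_replace_coldest_with(uint32_t bucket);

static void
count_xid_probe(uint32_t xid)
{
    uint32_t    i,
                best = 0,
                best_count = 0;

    probe_counts[xid >> XID_PAGE_BITS]++;
    if (++nprobes < PROBE_INTERVAL)
        return;
    nprobes = 0;

    /* find the hottest not-already-cached bucket */
    for (i = 0; i < NBUCKETS; i++)
    {
        if (!cache_bucket_is_cached(i) && probe_counts[i] > best_count)
        {
            best = i;
            best_count = probe_counts[i];
        }
    }

    /* replace only if it beats the coldest cached page by a clear margin */
    if (best_count > cache_min_cached_count() + SAFETY_MARGIN)
        cache_replace_coldest_with(best);

    memset(probe_counts, 0, sizeof(probe_counts));  /* restart sampling */
}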

 Now, the big problem here is that a 256kB array is probably too big,
 both because we don't want to blow out the lower-level CPU caches and
 because every time we consider performing a cache replacement
 we're going to want to zero the whole thing, and that has to be fast.
 But we probably don't really need the array to cover the entire XID
 space.  vacuum_freeze_min_age defaults to 50 million transactions, and
 vacuum_freeze_table_age defaults to 150 million transactions.  If we
 only concern ourselves 

Re: [HACKERS] hint bit cache v6

2011-06-30 Thread Robert Haas
On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure mmonc...@gmail.com wrote:
 I think the basic problem is that the cache pages are so large.  It's
 hard to make them smaller because that increases the cost of accessing
 the cache, as you already noted, but at the same time, making an
 eviction decision on a cache that holds 64K entries based on 100 data
 points seems like waving around a pretty large-caliber weapon in a
 fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
 it's hard to avoid the niggling fear that someone might accidentally
 get shot.

 Right...it's essentially a rolling statistical sampling of transaction
 IDs based on past accesses.  Going from say, 100 to 1000 will probably
 drop your error margin quite a bit, but I really wonder how much benefit
 you can get from going past that.

An interesting idea might be to forcibly cause a replacement every 100
tuples (or every 1000 tuples) and see if that causes a noticeable
performance regression.  If it doesn't, that's a good data point.

I think the sour spot for this whole idea is likely to be the case
where you have a workload that is 90% read and 10% write, or something
like that.  On an all-read workload (like pgbench -S), you're
hopefully going to converge to a state where all the hint-bits are
set, once VACUUM kicks in.  And on an all-write workload I think that
you might lose the effect in the noise.  Not sure if we have an easy
way to set up such a workload with pgbench, though.

 Just for the sake of argument, let's suppose we made an array with 64K
 elements, each one representing one 64K chunk of the XID space.  Each
 element is a 4-byte unsigned integer, so the array takes up 256kB of
 memory... probably not good, but I'm just thinking out loud here.
 Every time we're asked about an XID, we bump the corresponding
 counter.  Periodically (say, every 64K probes) we zip through all the
 counters and consider performing a cache replacement.  We look at the
 not-already-cached page that has the highest counter value and compare
 it to the counter value for the cached pages.  If it is higher and the
 difference exceeds some safety margin (to protect against hysteresis),
 then we do the replacement.  I think something like this would allow
 you to have a lot more information available to make a direct
 apples-to-apples comparison at cache replacement time, as compared
 with what you have now.

 yeah -- what you've done here is basically two things: 1. by mapping
 static ranges you get to skip the sort (but not the scan) during the
 replacement, and 2. by vastly expanding the sampling size you
 eliminate thrashing via noise in the rolling sample.  This comes at a
 significant memory cost and loss of some nimbleness in the cache (i'm
 not completely sure more aggressive replacement is 'bad')  -- although
 that could be mitigated by replacing at more frequent intervals.

Well, that gets at an interesting question: how often SHOULD we
replace cache pages?  My gut is that it's 10K-100K and your gut is
that it's 100-1000, which is a hundred-fold difference.  Seems like we
need some kind of empirical data to decide what actually works well in
real life.

 Your range counting strategy might work -- I think doing it right, so
 that you don't have to map the entire range, is going to require two
 levels of ranges, such that you activate a very wide range first before
 looking at 64k subranges.  That way you could reduce memory consumption
 significantly without sorting during the replacement.

I don't think you need two levels of ranges.  Just keep track of the
minimum and maximum XID covered by the array (these would always be 64M
transactions apart if, say, the array has 1024 entries and each cache
page covers 64K XIDs).  When you see a given XID, and it's between the
minimum and the maximum, then bump the counter in bucket XID/64K.  If
you see an XID that is older than the minimum, then ignore it; we
won't consider devoting cache pages to really old XIDs.  If you see an
XID that is newer than the maximum, then just increase the minimum and
maximum by enough to put the new XID in range.  For every 64K
increment by which you advance the minimum and maximum, you need to
zero one entry in the array (since it's no longer covering the same
portion of the XID space) but all the other entries remain intact.

(There is a small problem here of how to work out how to initially set
the minimum and maximum to something sensible, and to make sure that
they don't get out of whack if a backend sits idle for a few billion
transactions, but that seems like it should be solvable.)
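
A rough sketch of that sliding window, assuming 1024 buckets of 64K
XIDs each (names are illustrative, and initialization plus XID
wraparound are glossed over, per the parenthetical above):

#include <stdint.h>

#define BUCKET_BITS     16      /* each bucket covers 64K XIDs */
#define WINDOW_BUCKETS  1024    /* 1024 * 64K = 64M XIDs covered */

static uint32_t counts[WINDOW_BUCKETS];
static uint32_t min_xid;        /* window covers [min_xid, max_xid) */
static uint32_t max_xid;        /* always min_xid + 64M; setup elided */

static void
window_bump(uint32_t xid)
{
    if (xid < min_xid)
        return;                 /* too old: not worth a cache page */

    /*
     * Slide the window up until the new XID fits.  The bucket vacated
     * at the bottom is the very slot the new top range maps to, so we
     * zero just that one entry -- all the other counts stay intact.
     */
    while (xid >= max_xid)
    {
        counts[(min_xid >> BUCKET_BITS) % WINDOW_BUCKETS] = 0;
        min_xid += 1 << BUCKET_BITS;
        max_xid += 1 << BUCKET_BITS;
    }

    counts[(xid >> BUCKET_BITS) % WINDOW_BUCKETS]++;
}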

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] hint bit cache v6

2011-06-30 Thread Merlin Moncure
On Thu, Jun 30, 2011 at 11:44 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure mmonc...@gmail.com wrote:
 I think the basic problem is that the cache pages are so large.  It's
 hard to make them smaller because that increases the cost of accessing
 the cache, as you already noted, but at the same time, making an
 eviction decision on a cache that holds 64K entries based on 100 data
 points seems like waving around a pretty large-caliber weapon in a
 fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
 it's hard to avoid the niggling fear that someone might accidentally
 get shot.

 Right...it's essentially a rolling statistical sampling of transaction
 IDs based on past accesses.  Going from say, 100 to 1000 will probably
 drop your error margin quite a bit, but I really wonder how much benefit
 you can get from going past that.

 An interesting idea might be to forcibly cause a replacement every 100
 tuples (or every 1000 tuples) and see if that causes a noticeable
 performance regression.  If it doesn't, that's a good data point.

To test this I disabled the cache check on top of
HeapTupleSatisfiesMVCC and forced a cache entry in place of it with a
bogus xid (so every tuple scanned was treated as a 'miss').  Scanning 1
million records in memory over and over went up a few percentage
points -- converging on about 280ms instead of 270ms.  This is with
the miss array bumped to 1000 entries:

oprofile said:
regular hinted tuple case:
120      10.2041  advance_aggregates
107       9.0986  heapgettup_pagemode
77        6.5476  ExecProject
74        6.2925  heapgetpage
72        6.1224  ExecProcNode
72        6.1224  ExecScan
69        5.8673  advance_transition_function
66        5.6122  heap_getnext
58        4.9320  HeapTupleSatisfiesMVCC

hinted tuple with pathological cache entry:
111       8.9588  advance_aggregates
109       8.7974  heapgettup_pagemode
85        6.8604  ExecProject
81        6.5375  heapgetpage
77        6.2147  ExecScan
70        5.6497  ExecStoreTuple
70        5.6497  HeapTupleSatisfiesMVCC

this was a fairly short profiling run but the numbers are fairly
consistent.  the replacement is fairly cheap...sorting 1000 ints
doesn't take all that long. 100 is even faster.
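
To give a sense of why the replacement pass is cheap, here's an
illustrative sketch of the tally step -- sort the miss array so
identical page numbers become adjacent runs, then count the runs.
This is not the patch's actual code, just the shape of the operation:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define MISS_ARRAY_SIZE 1000    /* the bumped size tested above */

static uint32_t misses[MISS_ARRAY_SIZE];    /* cache page of each recent miss */

static int
cmp_uint32(const void *a, const void *b)
{
    uint32_t    x = *(const uint32_t *) a;
    uint32_t    y = *(const uint32_t *) b;

    return (x > y) - (x < y);
}

static void
tally_misses(void)
{
    int         i,
                j;

    qsort(misses, MISS_ARRAY_SIZE, sizeof(uint32_t), cmp_uint32);

    /* count run lengths to find the most-missed 64K ranges */
    for (i = 0; i < MISS_ARRAY_SIZE; i = j)
    {
        for (j = i; j < MISS_ARRAY_SIZE && misses[j] == misses[i]; j++)
            ;
        printf("page %u: %d misses\n", (unsigned) misses[i], j - i);
    }
}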

 I think the sour spot for this whole idea is likely to be the case
 where you have a workload that is 90% read and 10% write, or something
 like that.  On an all-read workload (like pgbench -S), you're
 hopefully going to converge to a state where all the hint-bits are
 set, once VACUUM kicks in.  And on an all-write workload I think that
 you might lose the effect in the noise.  Not sure if we have an easy
 way to set up such a workload with pgbench, though.

it's trivial to test with pgbench -- just use the random number
facility to generate an id for some table and update where random() >
.9.  However this does not generate nearly enough 'misses' to
exercise cache replacement in any meaningful way.  My workstation vm
can punch out about 5k transactions/sec.  With only 10% getting a new
xid and writing back a tuple to the table, only 500 xids are getting
generated/second.  At that rate it takes quite a while to burn through
the 256k transactions the cache ranges over.  Also, this case is not
bound by scan performance anyway, making profiling it a little silly
-- HeapTupleSatisfiesMVCC is simply not as big of a bottleneck in OLTP
workloads.  Scan-heavy loads are where this matters, and scan-heavy
loads naturally tend to generate fewer xids per tuple scanned.

 Just for the sake of argument, let's suppose we made an array with 64K
 elements, each one representing one 64K chunk of the XID space.  Each
 element is a 4-byte unsigned integer, so the array takes up 256kB of
 memory... probably not good, but I'm just thinking out loud here.
 Every time we're asked about an XID, we bump the corresponding
 counter.  Periodically (say, every 64K probes) we zip through all the
 counters and consider performing a cache replacement.  We look at the
 not-already-cached page that has the highest counter value and compare
 it to the counter value for the cached pages.  If it is higher and the
 difference exceeds some safety margin (to protect against hysteresis),
 then we do the replacement.  I think something like this would allow
 you to have a lot more information available to make a direct
 apples-to-apples comparison at cache replacement time, as compared
 with what you have now.

 yeah -- what you've done here is basically two things: 1. by mapping
 static ranges you get to skip the sort (but not the scan) during the
 replacement, and 2. by vastly expanding the sampling size you
 eliminate thrashing via noise in the rolling sample.  This comes at a
 significant memory cost and loss of some nimbleness in the cache (i'm
 not completely sure more aggressive replacement is 'bad')  -- although
 that could be mitigated by replacing at more frequent intervals.

 Well, that gets at an interesting question: how often SHOULD we
 replace cache pages?

Re: [HACKERS] hint bit cache v6

2011-06-29 Thread Robert Haas
On Fri, May 13, 2011 at 3:48 PM, Merlin Moncure mmonc...@gmail.com wrote:
 what's changed:
 *) as advertised, i'm no longer bothering to cache invalid bits.  hint
 bit i/o via rolled-back transactions is not a big deal IMNSHO, so they
 will work as they have always done.
 *) all the tuple visibility routines are now participating in the cache
 *) xmax commit bits are now being cached, not just xmin.  this
 prevents hint bit i/o following deletes.
 *) juggled the cache interaction a bit so the changes to the
 visibility routines could be a lot more surgical. specifically, I
 reverted SetHintBits() to return void and put the cache adjustment
 inside 'SetHintBitsCache()' which sets the bits, dirties, and adjusts
 the cache.  SetHintBitsNonDirty() only sets the bits without dirtying
 the page, and is called when you get a cache hit.
 *) various minor cleanups, decided to keep pageindex as type
 TransactionId since that's what clog does.
 *) made a proper init function, put cache data into
 CacheMemoryContext, and moved the cache data pointer references into
 the page structure

 performance testing done:
 *) select-only pgbench and standard i/o pgbench tests don't show a
 performance difference beyond statistical noise.

 The test I need to do, a cache and clog thrashing test, I haven't
 found a clean way to put together yet.  I'm really skeptical that any
 problems will show up there.  That said, I am satisfied the patch does
 what it is supposed to do -- eliminates hint bit i/o for cases where
 it is a big headache, without penalizing other cases.  Anyone
 that would like to comment on the technique or other points of the
 patch please do so (otherwise it's time for the 'fest).

OK, so I read through this patch.  I agree with Simon's comment on the
other thread that it's probably not ready for commit right at the
moment, but also want to offer a few other thoughts.

The patch fails to conform to our coding style in a variety of ways,
but I don't want to get bogged down in that at this point.  The first
question we need to answer here is: What is this patch doing?  And do
we want that?

With regards to the first question, it seems to me that the patch is
doing two separate but related things.

1. It allocates four 8kB pages in backend-local memory, each of which
can store one bit for each XID in a 64K range.  The bit is set if the
XID is known to be committed.  If we find this bit set, then we
needn't consult CLOG.  This reduces CLOG traffic, at the cost of
potentially doing more work in the case where the cache misses.  Every
so often, we reconsider which XID ranges should be represented in our
cache and consider replacing an existing page in the cache with some
other one.

2. When we probe the cache and hit, we set the hint bit on the tuple
*without dirtying the page*.  Thus, the hint bit will only get written
back to disk if the page is dirtied for some other reason.  This will
frequently reduce the amount of write I/O generated by large table
scans, at the cost of possibly generating additional work for other
backends who try to read this data later.
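
In code terms, the probe in #1 amounts to something like the
following -- a sketch with made-up names, not the patch itself:

#include <stdbool.h>
#include <stdint.h>

#define HBCACHE_PAGES   4
#define XIDS_PER_PAGE   65536                   /* one bit per XID */
#define PAGE_BYTES      (XIDS_PER_PAGE / 8)     /* = 8192, one 8kB page */

typedef struct
{
    uint32_t    base_xid;               /* first XID this page covers */
    uint8_t     bits[PAGE_BYTES];       /* bit set => XID known committed */
} HintBitCachePage;

static HintBitCachePage cache[HBCACHE_PAGES];   /* backend-local */

/*
 * True if the cache knows xid committed; false means consult CLOG
 * (and count a miss toward the replacement logic).
 */
static bool
hbcache_xid_committed(uint32_t xid)
{
    int         i;

    for (i = 0; i < HBCACHE_PAGES; i++)
    {
        uint32_t    off = xid - cache[i].base_xid;  /* wraps huge if below base */

        if (off < XIDS_PER_PAGE)
            return (cache[i].bits[off / 8] >> (off % 8)) & 1;
    }
    return false;
}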

Now, the interesting thing about the patch is that the downsides of #2
are considerably mitigated by #1.  If we assume for the sake of
argument that the backend-local cache is really, really fast, and that
the cache replacement policy is sound, then one is just as good as the
other, and the only time we need the hint bits is when the cache
overflows, which just so happens to be exactly when the patch forces
them to be written out to disk.  That's pretty clever.  The exact way
this is implemented is a little bit grotty, but I feel like it can
work.  So I'm inclined to think that, at 10,000 feet, if we can
convince ourselves that the basic idea of sticking a cache in here is
OK, then the additional refinement of declining to dirty the page when
the cache, rather than CLOG, tells us that the XID is committed is
probably OK too.

With respect to the cache itself, I think the part that concerns me
most is the cache replacement algorithm.  It's obvious that straight
LRU can't work here, since that could easily result in horrible
thrashing behavior.  But it's not clear to me that the actual
algorithm, which considers evicting a page after every 100 misses, is
all that great either.  Each page stores 64K transaction IDs, and
after being in cache for a while, you might have 25% or 50% of those
bits set, which is quite an asset.  Now you hit a run of 100 tuples
with some XID not represented in that cache and you chuck that page
out the window and replace it with a page that has just that one bit
set.  Now you hit another run of 100 tuples with XIDs that would have
hit the old page and flip it back; but now you've lost all those
valuable bits you had set.  I'm not sure how likely that is to happen
in real life, but it certainly seems like it could be a bit painful if
it did.  The cache replacement algorithm is not cheap.

Rather than just fiddling with the thresholds, it 

Re: [HACKERS] hint bit cache v6

2011-06-29 Thread Merlin Moncure
On Wed, Jun 29, 2011 at 11:11 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, May 13, 2011 at 3:48 PM, Merlin Moncure mmonc...@gmail.com wrote:
 what's changed:
 *) as advertised, i'm no longer bothering to cache invalid bits.  hint
 bit i/o via rolled-back transactions is not a big deal IMNSHO, so they
 will work as they have always done.
 *) all the tuple visibility routines are now participating in the cache
 *) xmax commit bits are now being cached, not just xmin.  this
 prevents hint bit i/o following deletes.
 *) juggled the cache interaction a bit so the changes to the
 visibility routines could be a lot more surgical. specifically, I
 reverted SetHintBits() to return void and put the cache adjustment
 inside 'SetHintBitsCache()' which sets the bits, dirties, and adjusts
 the cache.  SetHintBitsNonDirty() only sets the bits without dirtying
 the page, and is called when you get a cache hit.
 *) various minor cleanups, decided to keep pageindex as type
 TransactionId since that's what clog does.
 *) made a proper init function, put cache data into
 CacheMemoryContext, and moved the cache data pointer references into
 the page structure

 performance testing done:
 *) select-only pgbench and standard i/o pgbench tests don't show a
 performance difference beyond statistical noise.

 The test I need to do, a cache and clog thrashing test, I haven't
 found a clean way to put together yet.  I'm really skeptical that any
 problems will show up there.  That said, I am satisfied the patch does
 what it is supposed to do -- eliminates hint bit i/o for cases where
 it is a big headache, without penalizing other cases.  Anyone
 that would like to comment on the technique or other points of the
 patch please do so (otherwise it's time for the 'fest).

 OK, so I read through this patch.  I agree with Simon's comment on the
 other thread that it's probably not ready for commit right at the
 moment, but also want to offer a few other thoughts.

 The patch fails to conform to our coding style in a variety of ways,
 but I don't want to get bogged down in that at this point.  The first
 question we need to answer here is: What is this patch doing?  And do
 we want that?

Well, thanks for the review!  Considering the feedback, I'll pull it
from the current 'fest and see if it can be brought up to snuff.

 With regards to the first question, it seems to me that the patch is
 doing two separate but related things.

 1. It allocates four 8kB pages in backend-local memory, each of which
 can store one bit for each XID in a 64K range.  The bit is set if the
 XID is known to be committed.  If we find this bit set, then we
 needn't consult CLOG.  This reduces CLOG traffic, at the cost of
 potentially doing more work in the case where the cache misses.  Every
 so often, we reconsider which XID ranges should be represented in our
 cache and consider replacing an existing page in the cache with some
 other one.

 2. When we probe the cache and hit, we set the hint bit on the tuple
 *without dirtying the page*.  Thus, the hint bit will only get written
 back to disk if the page is dirtied for some other reason.  This will
 frequently reduce the amount of write I/O generated by large table
 scans, at the cost of possibly generating additional work for other
 backends who try to read this data later.

Right -- exactly.  If you work through all the cases, the main price
to pay is the overhead of the cache itself.

 Now, the interesting thing about the patch is that the downsides of #2
 are considerably mitigated by #1.  If we assume for the sake of
 argument that the backend-local cache is really, really fast, and that
 the cache replacement policy is sound, then one is just as good as the
 other, and the only time we need the hint bits is when the cache
 overflows, which just so happens to be exactly when the patch forces
 them to be written out to disk.  That's pretty clever.  The exact way
 this is implemented is a little bit grotty, but I feel like it can
 work.  So I'm inclined to think that, at 10,000 feet, if we can
 convince ourselves that the basic idea of sticking a cache in here is
 OK, then the additional refinement of declining to dirty the page when
 the cache, rather than CLOG, tells us that the XID is committed is
 probably OK too.

 With respect to the cache itself, I think the part that concerns me
 most is the cache replacement algorithm.  It's obvious that straight
 LRU can't work here, since that could easily result in horrible
 thrashing behavior.  But it's not clear to me that the actual
 algorithm, which considers evicting a page after every 100 misses, is
 all that great either.  Each page stores 64K transaction IDs, and
 after being in cache for a while, you might have 25% or 50% of those
 bits set, which is quite an asset.  Now you hit a run of 100 tuples
 with some XID not represented in that cache and you chuck that page
 out the window and replace it with a page that has just that one bit
 set.  Now you hit another run of 100 tuples with XIDs that would have
 hit the old page and flip it back; but now you've lost all those
 valuable bits you had set.

Re: [HACKERS] hint bit cache v6

2011-06-29 Thread Robert Haas
On Wed, Jun 29, 2011 at 3:16 PM, Merlin Moncure mmonc...@gmail.com wrote:
 I think it's a fair point to ask how often thrashing cases will truly
 come up where you don't have some other significant cost like i/o.
 Even when you do thrash, you are just falling back on stock postgres
 behaviors minus the maintenance costs.

Agree.

 With the algorithm as coded, to fully flush the cache you'd have to
 find a series of *unhinted* tuples distributed across a minimum of four
 64k-wide page ranges, with the number of tuples being over the 5%
 noise floor.  Furthermore, to eject a cache page you have to have more
 hits than a page already in the cache got -- hits are tracked on the
 existing pages, the candidate pages are ranked alongside the existing
 pages based on hits, and the best four are picked.  Also, is it
 reasonable to expect the cache to help in these types of cases
 anyway?

*scratches head*

Well, surely you must need to reset the counters for the pages you've
chosen to include in the cache at some point.  If you don't, then
there's a strong likelihood that after some period of time the pages
you have in there will become so firmly nailed into the cache that
they can never be evicted.

To be clear, I don't really think it matters how sensitive the cache
is to a *complete* flush.  The question I want to ask is: how much
does it take to knock ONE page out of cache?  And what are the chances
of that happening too frequently?  It seems to me that if a run of 100
tuples with the same previously-unseen XID is enough to knock over the
applecart, then that's not a real high bar - you could easily hit that
limit on a single page.  And if that isn't enough, then I don't
understand the algorithm.

 If your concern is the cost of replacement, then you can
 micro-optimize (a hand-rolled qsort alone should squeeze out quite a
 bit of fat) or experiment with a new algorithm as you suggested.
 However I should note that at least for pgbench fsync=off oltp tests
 (the only way I could think of to exercise cache replacement),
 replacement costs did not show up on the radar.

It might be useful, just for debugging purposes, to add an elog(LOG)
in there that triggers whenever a cache flush occurs.  And then play
with different workloads and try to make it happen as often as you
can.  One idea I had was to hack the XID generator so that it only
returns every (say) 200th XID, instead of consecutive ones.  That
might make it easier to identify the sorts of workloads that are prone
to causing problems, and then we could further analyze those workloads
and see whether they are a problem in real life, and/or what can be
done to minimize the impact.
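
For the logging piece, even something this simple would do -- the
routine and its arguments are hypothetical, just to show the idea:

#include "postgres.h"

/*
 * Hypothetical instrumentation hook: call from the replacement routine
 * with the page being evicted and its successor.
 */
static void
hbcache_log_replacement(TransactionId evicted_base, int evicted_hits,
                        TransactionId incoming_base, int incoming_misses)
{
    elog(LOG, "hint bit cache: evicting xid page %u (hits=%d) "
         "for xid page %u (misses=%d)",
         evicted_base, evicted_hits, incoming_base, incoming_misses);
}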

 OTOH, if your main concern is about losing data out of the cache via
 page swap, increasing the number of cache pages (or using smaller ones)
 might work, but this incrementally makes testing the cache more
 expensive.

Yeah, I am inclined to think that going very far in that direction is
going to be a big pile of fail.  TBH, I'm somewhat surprised that you
managed to get what you already have to work without a performance
regression.  That code path is extremely hot, and injecting more
complexity seems like it ought to be a loser.  For purposes of this
review, I'm taking it as given that you've tested this carefully, but
I'll admit to lingering skepticism.

 I did briefly experiment with trying to bootstrap incoming cache pages
 with data gathered out of the clog -- but it didn't help much and was
 a lot more trouble than it was worth.

I believe it.

 One possible enhancement would be to try to bias pages with more data
 in them so that it's more difficult to get them out -- or even
 maintain separate pools with ejection priority.  I'm not in a hurry
 and suggestions are welcome.

I think the basic problem is that the cache pages are so large.  It's
hard to make them smaller because that increases the cost of accessing
the cache, as you already noted, but at the same time, making an
eviction decision on a cache that holds 64K entries based on 100 data
points seems like waving around a pretty large-caliber weapon in a
fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
it's hard to avoid the niggling fear that someone might accidentally
get shot.

Just for the sake of argument, let's suppose we made an array with 64K
elements, each one representing one 64K chunk of the XID space.  Each
element is a 4-byte unsigned integer, so the array takes up 256kB of
memory... probably not good, but I'm just thinking out loud here.
Every time we're asked about an XID, we bump the corresponding
counter.  Periodically (say, every 64K probes) we zip through all the
counters and consider performing a cache replacement.  We look at the
not-already-cached page that has the highest counter value and compare
it to the counter value for the cached pages.  If it is higher and the
difference exceeds some safety margin (to protect against hysteresis),
then we do the replacement.  I think something like this would allow
you to have a lot more information available to make a direct
apples-to-apples comparison at cache replacement time, as compared
with what you have now.

Re: [HACKERS] hint bit cache v6

2011-06-29 Thread Jim Nasby
On Jun 29, 2011, at 3:18 PM, Robert Haas wrote:
 To be clear, I don't really think it matters how sensitive the cache
 is to a *complete* flush.  The question I want to ask is: how much
 does it take to knock ONE page out of cache?  And what are the chances
 of that happening too frequently?  It seems to me that if a run of 100
 tuples with the same previously-unseen XID is enough to knock over the
 applecart, then that's not a real high bar - you could easily hit that
 limit on a single page.  And if that isn't enough, then I don't
 understand the algorithm.

Would it be reasonable to keep a second level cache that stores individual XIDs
instead of blocks? That would provide protection for XIDs that are extremely 
common but don't have a good fit with the pattern of XID ranges that we're 
caching. I would expect this to happen if you had a transaction that touched a 
bunch of data (ie: bulk load or update) some time ago (so the other XIDs around 
it are less likely to be interesting) but not old enough to have been frozen 
yet. Obviously you couldn't keep too many XIDs in this secondary cache, but if 
you're just trying to prevent certain pathological cases then hopefully you 
wouldn't need to keep that many.
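
Such a second-level cache could be as simple as a small direct-mapped
table of individual XIDs (purely illustrative, names made up):

#include <stdbool.h>
#include <stdint.h>

#define XID_CACHE_SLOTS 64      /* tiny: only has to absorb a few hot stragglers */

static uint32_t xid_slots[XID_CACHE_SLOTS];     /* 0 = empty; no valid XID is 0 */

static bool
xid_cache_lookup(uint32_t xid)
{
    return xid_slots[xid % XID_CACHE_SLOTS] == xid;
}

static void
xid_cache_insert(uint32_t xid)
{
    /*
     * Direct-mapped: a collision simply overwrites, which is fine for
     * a cache whose only job is to catch a few very common XIDs.
     */
    xid_slots[xid % XID_CACHE_SLOTS] = xid;
}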
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





[HACKERS] hint bit cache v6

2011-05-13 Thread Merlin Moncure
what's changed:
*) as advertised, i'm no longer bothering to cache invalid bits.  hint
bit i/o via rolled-back transactions is not a big deal IMNSHO, so they
will work as they have always done.
*) all the tuple visibility routines are now participating in the cache
*) xmax commit bits are now being cached, not just xmin.  this
prevents hint bit i/o following deletes.
*) juggled the cache interaction a bit so the changes to the
visibility routines could be a lot more surgical. specifically, I
reverted SetHintBits() to return void and put the cache adjustment
inside 'SetHintBitsCache()' which sets the bits, dirties, and adjusts
the cache.  SetHintBitsNonDirty() only sets the bits without dirtying
the page, and is called when you get a cache hit (both setters are
sketched below).
*) various minor cleanups, decided to keep pageindex as type
TransactionId since that's what clog does.
*) made a proper init function, put cache data into
CacheMemoryContext, and moved the cache data pointer references into
the page structure
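
To illustrate the SetHintBits() split -- the bodies below are a
sketch patterned on the stock SetHintBits() in tqual.c, not the
patch's actual code, and hbcache_note_commit() is a hypothetical
cache-adjustment hook:

#include "postgres.h"
#include "access/htup.h"
#include "storage/bufmgr.h"

extern void hbcache_note_commit(TransactionId xid);     /* hypothetical */

/*
 * cache hit: set the bit but do NOT dirty the page, so the hint
 * reaches disk only if something else dirties the page
 */
static void
SetHintBitsNonDirty(HeapTupleHeader tuple, uint16 infomask)
{
    tuple->t_infomask |= infomask;
}

/*
 * commit learned from clog: set the bit, dirty the page as stock
 * SetHintBits() does, and remember the commit in the local cache
 */
static void
SetHintBitsCache(HeapTupleHeader tuple, Buffer buffer,
                 uint16 infomask, TransactionId xid)
{
    tuple->t_infomask |= infomask;
    SetBufferCommitInfoNeedsSave(buffer);   /* the 9.1-era dirty-for-hints call */
    hbcache_note_commit(xid);
}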

performance testing done:
*) select-only pgbench and standard i/o pgbench tests don't show a
performance difference beyond statistical noise.

The test I need to do, a cache and clog thrashing test, I haven't
found a clean way to put together yet.  I'm really skeptical that any
problems will show up there.  That said, I am satisfied the patch does
what it is supposed to do -- eliminates hint bit i/o for cases where
it is a big headache, without penalizing other cases.  Anyone
that would like to comment on the technique or other points of the
patch please do so (otherwise it's time for the 'fest).

merlin


hbache_v6.patch
Description: Binary data
