Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-09-26 Thread Bruce Momjian

This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---

Simon Riggs wrote:
 On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
  Simon Riggs [EMAIL PROTECTED] writes:
   On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
   A quick grep suggests that VACUUM FULL might be at risk here.
  
   No we're clear: I caught that issue specifically for VACUUM FULL fairly
   early on. VF assumes all hint bits are set after the first scan, so we
   flush prior to the scan to ensure its safe to set the hint bits.
  
  Flush what prior to the scan?
  
  The methodology I suggested earlier (involving tracking LSN only at the
  level of pg_clog pages) isn't going to make that work, unless you
  somehow force the XID counter forward to the next page boundary.
  It might be that that level of tracking is too coarse anyway, since
  it essentially says that you can't hint any transaction until the
  next 32K-transaction boundary is reached.
 
 Solutions I'm going for are these:
 
 - Force XLogFlush() prior to initial VF scan. Tqual will set hint bits
 if WAL has been flushed, else it will be deferred, so no WAL flushes
 will be forced by normal hint bit setting and VF will work without
 needing any crufty special cases or rework of VF logic.
 
 - Use Tom's LSN tracking at clog page level. Make the LSN tracking store
 an array of LSNs rather than just one. Array size is fixed at
 NUMBER_TRACKED_LSNS_PER_PAGE, so that each LSN covers
 32,000/NUMBER_TRACKED_LSNS_PER_PAGE transactions. I'd guess that storing
 8 per page would be optimal, so each stored xid would track 4,000
 transactions - probably around 1 sec worth of transactions when the
 feature is used.
 
 Comments?
 
 -- 
   Simon Riggs
   EnterpriseDB  http://www.enterprisedb.com
 
 
 ---(end of broadcast)---
 TIP 6: explain analyze is your friend

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-09-26 Thread Simon Riggs
On Wed, 2007-09-26 at 04:33 -0400, Bruce Momjian wrote:
 This has been saved for the 8.4 release:
 
   http://momjian.postgresql.org/cgi-bin/pgpatches_hold

Already applied.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-07-18 Thread Simon Riggs
On Wed, 2007-07-18 at 01:01 +0100, Gregory Stark wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
 
  Where are we on this?
 
 Well Simon just sent the reworked patch yesterday so the answer is we haven't
 started tuning this parameter. (Bruce's message is referring to the discussion
 about what the optimal value of lsns per clog page would be.)

From discussion, Greg wishes to see a certain parameter higher, though
I'm fairly comfortable with the value now. Thanks to Greg for making me
think this through in more detail. There is some confirmation required,
but no significant tuning effort required.

Here's the thinking:

When we commit, we store an LSN that covers a group of transactions.
When we set hint bits we check the appropriate LSN. If it has been
flushed, then we set the hint bit.

If we're using synch commits then we can always set hint bits.
If we're mixing async commits with sync commits then the clog LSNs will
very often be flushed already, so we'll be able to set most hint bits.

If we are doing *only* simple async commits then only the WAL writer
flushes WAL. In that case when we try to set hint bits we will only be
prevented from setting hint bits for the last few transactions. So how
many is that exactly?

We are flushing WAL every W seconds and doing T transactions per second,
then we will flush WAL every T*W transactions. e.g. W = 0.2 (default), T
~ 10,000 = T*W = 2,000 transactions.

We have divided up each clog page up into N groups of transactions, each
with their own LSN, then we will have 32,000/N transactions per group.

The best situation is that we set N such that we never defer the setting
of a hint bit for a transaction that has been flushed.
There is no point setting N any lower than
32,000/N = T*W
i.e. N = W * (32,000/T)

My laptop can do T=1000 with W=0.2 with pgbench, so N = 6. 

By definition, anyone using async commit will have a high transaction
rate (else there is no benefit), so N seems able to be reasonably small.
If the transaction rate drops off momentarily the number of unflushed
committed async transactions doesn't rise, it falls quickly to zero as
the WAL writer flushes WAL up to the last async LSN.

The patch sets a value of 16, which seems likely to be enough to cover
most situations. 

Having thought that through, I see two changes I'd like to make:

- I realise that I've arranged the LSNs to be striped rather than
grouped. I'll change this part of the patch - couple of lines in
GetLSNIndex().

- The LSN of sync commits is also used to update the LSN of the clog
page. That is unnecessary and we only need set the clog LSN for async
commits.

I'll wait for other review comments before cutting version 23

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-07-17 Thread Bruce Momjian

Where are we on this?

---

Gregory Stark wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 
  I'd guess that storing 8 per page would be optimal, so each stored xid would
  track 4,000 transactions - probably around 1 sec worth of transactions when
  the feature is used.
 
 This is something we can experiment with but I suspect that while 8 might be
 sufficient for many use cases there would be others where more would be
 better. The cost to having more lsns stored in the clog would be pretty small.
 
 On TPCC which has longer transactions on moderate hardware we only see order
 of 1,000 txn/min. So a setting like 128 which allows a granularity of 256
 transactions would be about 15s which is not so much longer than the xmin
 horizon of the 90th percentile response time of 2*5s.
 
 -- 
   Gregory Stark
   EnterpriseDB  http://www.enterprisedb.com
 
 
 ---(end of broadcast)---
 TIP 4: Have you searched our list archives?
 
http://archives.postgresql.org

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-07-17 Thread Gregory Stark
Bruce Momjian [EMAIL PROTECTED] writes:

 Where are we on this?

Well Simon just sent the reworked patch yesterday so the answer is we haven't
started tuning this parameter. (Bruce's message is referring to the discussion
about what the optimal value of lsns per clog page would be.)

I intend to do some benchmarking on it and testing what the best value of this
parameter is one of the things I aim to determine with these benchmarks.

I'm not 100% sure yet what to measure though. There are two questions here:

1) What are some good workloads to test which will require larger values for
   this parameter or which will be hurt by excessively large values?

I think any short-transaction workload should be basically good enough. I'm
thinking of using just pgbench's default workload. Does anyone think there are
other workloads that we need to specifically test?

2) What metric do I use to determine if the value is large enough or too
   large?

The gross measurement would be to look at tps. I would prefer to actually
count events which we want to minimize such as xlogflushes because the clog
lsn is too old and how much extra work the visibility checks do. I'm not sure
exactly how much of this we can really measure though. Are there any other
events having an insufficiently large or excessively large value of this
parameter will cause which we can count?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-07-05 Thread Simon Riggs
On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
  A quick grep suggests that VACUUM FULL might be at risk here.
 
  No we're clear: I caught that issue specifically for VACUUM FULL fairly
  early on. VF assumes all hint bits are set after the first scan, so we
  flush prior to the scan to ensure its safe to set the hint bits.
 
 Flush what prior to the scan?
 
 The methodology I suggested earlier (involving tracking LSN only at the
 level of pg_clog pages) isn't going to make that work, unless you
 somehow force the XID counter forward to the next page boundary.
 It might be that that level of tracking is too coarse anyway, since
 it essentially says that you can't hint any transaction until the
 next 32K-transaction boundary is reached.

Solutions I'm going for are these:

- Force XLogFlush() prior to initial VF scan. Tqual will set hint bits
if WAL has been flushed, else it will be deferred, so no WAL flushes
will be forced by normal hint bit setting and VF will work without
needing any crufty special cases or rework of VF logic.

- Use Tom's LSN tracking at clog page level. Make the LSN tracking store
an array of LSNs rather than just one. Array size is fixed at
NUMBER_TRACKED_LSNS_PER_PAGE, so that each LSN covers
32,000/NUMBER_TRACKED_LSNS_PER_PAGE transactions. I'd guess that storing
8 per page would be optimal, so each stored xid would track 4,000
transactions - probably around 1 sec worth of transactions when the
feature is used.

Comments?

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-07-05 Thread Gregory Stark
Simon Riggs [EMAIL PROTECTED] writes:

 I'd guess that storing 8 per page would be optimal, so each stored xid would
 track 4,000 transactions - probably around 1 sec worth of transactions when
 the feature is used.

This is something we can experiment with but I suspect that while 8 might be
sufficient for many use cases there would be others where more would be
better. The cost to having more lsns stored in the clog would be pretty small.

On TPCC which has longer transactions on moderate hardware we only see order
of 1,000 txn/min. So a setting like 128 which allows a granularity of 256
transactions would be about 15s which is not so much longer than the xmin
horizon of the 90th percentile response time of 2*5s.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-07-02 Thread Simon Riggs
On Fri, 2007-06-29 at 11:13 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
  The methodology I suggested earlier (involving tracking LSN only at the
  level of pg_clog pages) isn't going to make that work, unless you
  somehow force the XID counter forward to the next page boundary.
 
  If you completely flush WAL after the AccessExclusiveLock has been taken
  by VF, then you are guaranteed to have flushed all asynchronous commits
  that touch the table.
 
 I don't believe this is correct (see system catalogs) and in any case
 it's far too fragile for my taste.  I think it'd be a lot better to just
 stop referencing the hint bits directly in VACUUM FULL.

Well, according to the comments in repair_frag(), line 1799-1815,
specifically line 1810 says we wouldnt be in repair_frag() at all if
the tuple was in a system catalog, inserted or deleted by a not yet
committed transaction. If we have flushed all committed and/or aborted
transactions then we're good.

Maybe the comment is wrong? Wouldn't be the first time. Either way,
please explain your concern in more detail.

I'd rather do this the easy way since VF seems soon to die (payback for
all the torture its caused us down the years).

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-06-29 Thread Simon Riggs
On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
  A quick grep suggests that VACUUM FULL might be at risk here.
 
  No we're clear: I caught that issue specifically for VACUUM FULL fairly
  early on. VF assumes all hint bits are set after the first scan, so we
  flush prior to the scan to ensure its safe to set the hint bits.
 
 Flush what prior to the scan?
 
 The methodology I suggested earlier (involving tracking LSN only at the
 level of pg_clog pages) isn't going to make that work, unless you
 somehow force the XID counter forward to the next page boundary.
 It might be that that level of tracking is too coarse anyway, since
 it essentially says that you can't hint any transaction until the
 next 32K-transaction boundary is reached.

If you completely flush WAL after the AccessExclusiveLock has been taken
by VF, then you are guaranteed to have flushed all asynchronous commits
that touch the table.

You're right that I just broke VF again by changing the implementation
back to deferred hint setting. I'll fix again. Hmmm, I can add a test
for LockHeldByMe(AccessExclusiveLock) in tqual, or just have a
VacuumHintStrategy() call that alters the behaviour of the tqual
routines. Latter sounds best?

My aim is to get this all wrapped up by end of Monday night.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-06-29 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
 The methodology I suggested earlier (involving tracking LSN only at the
 level of pg_clog pages) isn't going to make that work, unless you
 somehow force the XID counter forward to the next page boundary.

 If you completely flush WAL after the AccessExclusiveLock has been taken
 by VF, then you are guaranteed to have flushed all asynchronous commits
 that touch the table.

I don't believe this is correct (see system catalogs) and in any case
it's far too fragile for my taste.  I think it'd be a lot better to just
stop referencing the hint bits directly in VACUUM FULL.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-06-28 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes:
 So we suspected an interaction between multiple processes each holding
 a SHARE lock and setting/checking different bits in the infomask and
 we could theoritically say that such interaction can potentially lead to
 missing hint bit updates.

Yeah.  This is in fact something that's been foreseen, but I guess it
didn't occur to anyone that those Asserts would fail.  I concur with
removing them.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-06-28 Thread Heikki Linnakangas

Pavan Deolasee wrote:

During one of HOT stress tests, an asserition failed at tqual.c:1178
in HeapTupleSatisfiesVacuum(). The assertion failure looked really
strange because the assertion checks for HEAP_XMAX_COMMITTED
which we set just couple of lines above. I inspected the core dump
and found that the flag is *set* properly. That was even more strange.
I confirmed that we are holding a SHARE lock on the buffer as we
do at several other places while checking/setting the infomask bits.

We had a theory that somebody clears the flag after the asserting
process sets it and before it checks it. The other process also sets it
back before core dump is generated because core shows the flag
being set properly. The chances of this happening are very slim and
can further be ruled out because I carefully looked at the code and found
that the flag can only be cleared holding an exclusive lock on the buffer.

So we suspected an interaction between multiple processes each holding
a SHARE lock and setting/checking different bits in the infomask and
we could theoritically say that such interaction can potentially lead to
missing hint bit updates. I can think of the following:


FWIW, this can be reproduced by single-stepping with a debugger:

First, you need a tuple that's committed dead but no hint bits have been 
set:


BEGIN; truncate foo; INSERT INTO foo values (1,'foo'); DELETE FROM Foo; 
commit;


In one backend, set a breakpoint to HeapTupleSatisfiesMVCC lin 953 where 
it sets the XMIN_COMMITED hint bit:


 else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
 {
 tuple-t_infomask |= HEAP_XMIN_COMMITTED;
 SetBufferCommitInfoNeedsSave(buffer);
 }

Issue a SELECT * FROM foo, and step a single instruction that fetches 
the infomask field from memory to a register.


Open another backend, set a breakpoint to HeapTupleSatisfiesVacuum line 
1178:


 else if (TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple)))
 {
 tuple-t_infomask |= HEAP_XMAX_COMMITTED;
 SetBufferCommitInfoNeedsSave(buffer);
 }
 else
 {
 /*
  * Not in Progress, Not Committed, so either Aborted or 
crashed

  */
 tuple-t_infomask |= HEAP_XMAX_INVALID;
 SetBufferCommitInfoNeedsSave(buffer);
 return HEAPTUPLE_LIVE;
 }
 /* Should only get here if we set XMAX_COMMITTED */
 Assert(tuple-t_infomask  HEAP_XMAX_COMMITTED);
 }

And issue VACUUM foo. It'll stop on that breakpoint.

Let the first backend continue. It will clear the XMAX_COMMITTED field.

Now let the 2nd backend to continue and you get an assertion failure.


AFAICS, we can just simply remove the assertion. But is there any 
codepaths that assume that after calling HeapTupleSatisfiesSnapshot, all 
appropriate hint bits are set?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-06-28 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 AFAICS, we can just simply remove the assertion. But is there any 
 codepaths that assume that after calling HeapTupleSatisfiesSnapshot, all 
 appropriate hint bits are set?

There had better not be, since we are going to postpone setting hint
bits for recently-committed transactions as part of the async-commit
patch.

A quick grep suggests that VACUUM FULL might be at risk here.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-06-28 Thread Alvaro Herrera
Tom Lane escribió:
 Heikki Linnakangas [EMAIL PROTECTED] writes:
  AFAICS, we can just simply remove the assertion. But is there any 
  codepaths that assume that after calling HeapTupleSatisfiesSnapshot, all 
  appropriate hint bits are set?
 
 There had better not be, since we are going to postpone setting hint
 bits for recently-committed transactions as part of the async-commit
 patch.
 
 A quick grep suggests that VACUUM FULL might be at risk here.

That particular case seems easily fixed since VACUUM FULL must hold an
exclusive lock; and we can forcibly set sync commit for VACUUM FULL.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-06-28 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tom Lane escribió:
 A quick grep suggests that VACUUM FULL might be at risk here.

 That particular case seems easily fixed since VACUUM FULL must hold an
 exclusive lock; and we can forcibly set sync commit for VACUUM FULL.

Uh, that wouldn't help.  The problem is that if VACUUM FULL is *looking
at* a recently-committed tuple, tqual.c might decide it can't set the
hint bit yet because it's not certain the commit record for that other
transaction is flushed.

We could possibly hack things so that inside a VACUUM FULL (maybe plain
vacuum too?), we prefer flushing xlog to leaving hint bits unset.
That's likely to be messy though.

Probably a cleaner and more robust answer is to make VACUUM FULL call
tqual.c again in the places where it currently assumes it can look
directly at the hint bits.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-06-28 Thread Simon Riggs
On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
 Heikki Linnakangas [EMAIL PROTECTED] writes:
  AFAICS, we can just simply remove the assertion. But is there any 
  codepaths that assume that after calling HeapTupleSatisfiesSnapshot, all 
  appropriate hint bits are set?
 
 There had better not be, since we are going to postpone setting hint
 bits for recently-committed transactions as part of the async-commit
 patch.
 
 A quick grep suggests that VACUUM FULL might be at risk here.

No we're clear: I caught that issue specifically for VACUUM FULL fairly
early on. VF assumes all hint bits are set after the first scan, so we
flush prior to the scan to ensure its safe to set the hint bits. There
are no concurrent hint bit setters, so we are good.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-06-28 Thread Simon Riggs
On Thu, 2007-06-28 at 15:29 -0400, Alvaro Herrera wrote:
 Tom Lane escribió:
  Heikki Linnakangas [EMAIL PROTECTED] writes:
   AFAICS, we can just simply remove the assertion. But is there any 
   codepaths that assume that after calling HeapTupleSatisfiesSnapshot, all 
   appropriate hint bits are set?
  
  There had better not be, since we are going to postpone setting hint
  bits for recently-committed transactions as part of the async-commit
  patch.
  
  A quick grep suggests that VACUUM FULL might be at risk here.
 
 That particular case seems easily fixed since VACUUM FULL must hold an
 exclusive lock; and we can forcibly set sync commit for VACUUM FULL.

Exactly what it does!

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] SetBufferCommitInfoNeedsSave and race conditions

2007-06-28 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
 A quick grep suggests that VACUUM FULL might be at risk here.

 No we're clear: I caught that issue specifically for VACUUM FULL fairly
 early on. VF assumes all hint bits are set after the first scan, so we
 flush prior to the scan to ensure its safe to set the hint bits.

Flush what prior to the scan?

The methodology I suggested earlier (involving tracking LSN only at the
level of pg_clog pages) isn't going to make that work, unless you
somehow force the XID counter forward to the next page boundary.
It might be that that level of tracking is too coarse anyway, since
it essentially says that you can't hint any transaction until the
next 32K-transaction boundary is reached.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend