Re: [HACKERS] Freezing without write I/O

2014-05-15 Thread Robert Haas
On Wed, May 14, 2014 at 8:46 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 +1.  I can't think of many things we might do that would be more
 important.

 Can anyone guess how likely this approach is to make it into 9.5?  I've been
 pondering some incremental improvements over what we have now, but if this
 revolutionary approach has a high chance of landing then any work on
 incremental improvements would be pointless.

Well, Heikki was saying on another thread that he had kind of gotten
cold feet about this, so I gather he's not planning to pursue it.  Not
sure if I understood that correctly.  If so, I guess it depends on
whether someone else can pick it up, but we might first want to
establish why he got cold feet and how worrying those problems seem to
other people.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2014-05-14 Thread Jeff Janes
On Monday, January 27, 2014, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Jan 27, 2014 at 4:16 PM, Simon Riggs si...@2ndquadrant.com wrote:
  On 26 January 2014 12:58, Andres Freund and...@2ndquadrant.com wrote:
  On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
  Shouldn't this patch be in the January commitfest?
 
  I think we previously concluded that there wasn't much chance to get
  this into 9.4 and there's significant work to be done on the patch
  before new reviews are required, so not submitting it imo makes sense.
 
  I think we should make this a priority feature for 9.5

 +1.  I can't think of many things we might do that would be more important.


Can anyone guess how likely this approach is to make it into 9.5?  I've
been pondering some incremental improvements over what we have now, but if
this revolutionary approach has a high chance of landing then any work on
incremental improvements would be pointless.

Thanks,

Jeff


Re: [HACKERS] Freezing without write I/O

2014-05-14 Thread David Fetter
On Wed, May 14, 2014 at 05:46:49PM -0700, Jeff Janes wrote:
 On Monday, January 27, 2014, Robert Haas robertmh...@gmail.com wrote:
 
  On Mon, Jan 27, 2014 at 4:16 PM, Simon Riggs si...@2ndquadrant.com wrote:
   On 26 January 2014 12:58, Andres Freund and...@2ndquadrant.com wrote:
   On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
   Shouldn't this patch be in the January commitfest?
  
   I think we previously concluded that there wasn't much chance to get
   this into 9.4 and there's significant work to be done on the patch
   before new reviews are required, so not submitting it imo makes sense.
  
   I think we should make this a priority feature for 9.5
 
  +1.  I can't think of many things we might do that would be more important.
 
 Can anyone guess how likely this approach is to make it into 9.5?
 I've been pondering some incremental improvements over what we have
 now, but if this revolutionary approach has a high chance of landing
 then any work on incremental improvements would be pointless.

How much work do you anticipate for the description and PoC of
whatever it is you've been pondering?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Freezing without write I/O

2014-01-27 Thread Simon Riggs
On 26 January 2014 12:58, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
 Shouldn't this patch be in the January commitfest?

 I think we previously concluded that there wasn't much chance to get
 this into 9.4 and there's significant work to be done on the patch
 before new reviews are required, so not submitting it imo makes sense.

I think we should make this a priority feature for 9.5

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2014-01-27 Thread Robert Haas
On Mon, Jan 27, 2014 at 4:16 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 26 January 2014 12:58, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
 Shouldn't this patch be in the January commitfest?

 I think we previously concluded that there wasn't much chance to get
 this into 9.4 and there's significant work to be done on the patch
 before new reviews are required, so not submitting it imo makes sense.

 I think we should make this a priority feature for 9.5

+1.  I can't think of many things we might do that would be more important.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freezing without write I/O

2014-01-26 Thread Andres Freund
On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
 Shouldn't this patch be in the January commitfest?

I think we previously concluded that there wasn't much chance to get
this into 9.4 and there's significant work to be done on the patch
before new reviews are required, so not submitting it imo makes sense.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2014-01-25 Thread Peter Geoghegan
Shouldn't this patch be in the January commitfest?

-- 
Peter Geoghegan




Re: [HACKERS] Freezing without write I/O

2013-11-23 Thread Jeff Davis
On Wed, 2013-09-25 at 12:31 +0300, Heikki Linnakangas wrote:
 On 19.09.2013 16:24, Andres Freund wrote:
 ...
  There's probably more.

I think _bt_check_unique is also a problem.

 Hmm, some of those are trivial, but others, like rewrite_heap_tuple(), are 
 currently only passed the HeapTuple, with no reference to the page where 
 the tuple came from. Instead of changing code to always pass that along 
 with a tuple, I think we should add a boolean to HeapTupleData, to 
 indicate if the tuple came from a mature page. If the flag is set, the 
 tuple should be considered visible to everyone, without looking at the 
 xmin/xmax. This might affect extensions, although usually external C 
 code that has to deal with HeapTuples will use functions like 
 heap_form_tuple to do so, and won't need to deal with low-level stuff or 
 visibility themselves.

How bad would the code be to just pass along the buffer when it's
needed? After looking around, it doesn't seem necessarily worse than
adding the struct field (and I agree with Andres that it seems more
proper to pass the buffer along).

I also have a separate question about this patch:

It seems like this gets us closer to 64-bit transaction IDs. Can we get
some help from the compiler to distinguish the two cases in a less
error-prone way? Is it worth thinking about making 64-bit the norm, and
32-bit an optimization in certain places (e.g. the heap and maybe the
xip array of a snapshot)?
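For readers following along: the error-prone part of 32-bit XIDs alluded to here is the modulo-2^32 circular comparison. A minimal sketch of that comparison (simplified; the real logic lives in TransactionIdPrecedes(), which also special-cases permanent XIDs such as FrozenTransactionId):

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* Circular 32-bit XID comparison: "a precedes b" holds when the
 * unsigned difference, reinterpreted as signed, is negative. This is
 * only valid while a and b are less than 2^31 transactions apart,
 * which is why wraparound maintenance (freezing) is needed at all. */
static int
xid_precedes(TransactionId a, TransactionId b)
{
    int32_t diff = (int32_t) (a - b);

    return diff < 0;
}
```

With 64-bit XIDs a plain `<` would be correct, which is exactly the class of error the compiler could then help catch.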

Regards,
Jeff Davis






Re: [HACKERS] Freezing without write I/O

2013-11-18 Thread Andres Freund
On 2013-09-25 12:31:20 +0300, Heikki Linnakangas wrote:
 Hmm, some of those are trivial, but others, like rewrite_heap_tuple(), are
 currently only passed the HeapTuple, with no reference to the page where the
 tuple came from. Instead of changing code to always pass that along with a
 tuple, I think we should add a boolean to HeapTupleData, to indicate if the
 tuple came from a mature page. If the flag is set, the tuple should be
 considered visible to everyone, without looking at the xmin/xmax. This might
 affect extensions, although usually external C code that has to deal with
 HeapTuples will use functions like heap_form_tuple to do so, and won't need
 to deal with low-level stuff or visibility themselves.
 
 Attached is a new version, which adds that field to HeapTupleData. Most of
 the issues you listed above have been fixed, plus a bunch of other bugs I
 found myself. The bug that Jeff ran into with his count.pl script has also
 been fixed.

This seems a bit hacky to me. Change a widely used struct because a few
functions don't get passed enough information? Visibility checks are
properly done with a buffer passed along; that some places have cut
corners around that doesn't mean we have to continue to do so. The
pullups around rs_pagemature are imo indicative of this (there's even a
bug because a page can be all visible before it's mature, right? That
could then cause an assert somewhere down the line when we check page
and tuple are coherent).

Ok, making another scan through this:
* Why can we do PageSetLSN(page, RangeSwitchLSN) in heap_* when the action
  doesn't need WAL? And why is that correct? I can see doing that in
  vacuum itself, but doing it when we write *new* data to the page?
* heap_freeze_page(): The PageSetLSN(page, RangeSwitchLSN) if there's
  nothing to log is not WAL logged. Which means that if we crash it
  won't necessarily be set, so the VM and the heap lsn might get out of
  sync. That probably doesn't have any bad effects, but imo deserves a
  comment.
* heap_freeze_page(): PageUpdateNeedsFreezing() can now be true before
  and after. That's a bit confusing :/
* GetSafeClogTruncationPoint truncates the xid-lsn ranges, but there's
  also an, uncommented, TruncateXidLSNRanges. At least how they work together
  should be described better.
* There are quite a few FIXMEs around
* Let's move the xid-lsn ranges code from GetNewTransactionId() into
  its own function.
* PageMatureLSN and RangeSwitchLSN should be documented somewhere. They
  are implicitly, but they are used widely enough that that doesn't seem
  sufficient.
* pg_controldata should output pageMatureLSN

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] Freezing without write I/O

2013-10-01 Thread Andres Freund
On 2013-10-01 04:47:42 +0300, Ants Aasma wrote:
 I still think we should have a macro for the volatile memory accesses.
 As a rule, each one of those needs a memory barrier, and if we
 consolidate them, or optimize them out, the considerations why this is
 safe should be explained in a comment. Race prone memory accesses
 should stick out like a sore thumb.

Agreed. The wait-free LW_SHARED thing[1] I posted recently had a simple

#define pg_atomic_read(atomic) (*(volatile uint32 *)(atomic))

That should be sufficient and easily greppable, right?
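For illustration, a self-contained sketch of that macro plus a symmetric write counterpart (the write macro is an assumption added here for symmetry, not part of the quoted patch). Note that the volatile cast only forces the compiler to emit a real load or store each time; it implies no CPU-level memory barrier:

```c
#include <stdint.h>

typedef uint32_t uint32;

/* Force a genuine load from memory on every use; the compiler may not
 * cache the value in a register across calls. No hardware barrier. */
#define pg_atomic_read(atomic) (*(volatile uint32 *) (atomic))

/* Hypothetical write counterpart, sketched the same way. */
#define pg_atomic_write(atomic, val) \
    (*(volatile uint32 *) (atomic) = (val))
```

Being a single greppable macro, every race-prone access then "sticks out like a sore thumb," as Ants put it.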

I generally think we need to provide some primitives for doing atomic
stuff. There's lots of stuff that's not easy to accelerate further without them.

Greetings,

Andres Freund

[1] 
http://archives.postgresql.org/message-id/20130926225545.GB26663%40awork2.anarazel.de

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] Freezing without write I/O

2013-10-01 Thread Ants Aasma
On Tue, Oct 1, 2013 at 2:13 PM, Andres Freund and...@2ndquadrant.com wrote:
 Agreed. The wait free LW_SHARED thing[1] I posted recently had a simple

 #define pg_atomic_read(atomic) (*(volatile uint32 *)(atomic))

 That should be sufficient and easily greppable, right?

Looks good enough for me. I would consider using a naming scheme that
accounts for possible future uint64 atomics.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Freezing without write I/O

2013-09-30 Thread Ants Aasma
Just found this in my drafts folder. Sorry for the late response.

On Fri, Sep 20, 2013 at 3:32 PM, Robert Haas robertmh...@gmail.com wrote:
 I am entirely unconvinced that we need this.  Some people use acquire
 + release fences, some people use read + write fences, and there are
 all combinations (read-acquire, read-release, read-acquire-release,
 write-acquire, write-release, write-acquire-release, both-acquire,
 both-release, both-acquire-release).  I think we're going to have
 enough trouble deciding between the primitives we already have, let
 alone with a whole mess of new ones.  Mistakes will only manifest
 themselves on certain platforms and in many cases the races are so
 tight that actual failures are very unlikely to be observed in
 regression testing.

I have to retract my proposal to try to emulate C11 atomics in C89. I
guess this goes to show why one shouldn't try to conjure up atomic
API's at 2AM. I forgot the fact that while acquire-release semantics
are enough to ensure sequentially consistent behavior, the actual
barriers need to be paired with specific atomic accesses to achieve
that. It's not possible to use freestanding acquire/release barriers
to implement this, nor would it be possible to include barriers in
the atomic accesses themselves without causing significant
pessimization.

Sequential consistency (along with causal transitivity and total
store ordering that come with it) should be regarded as a goal. I'm
not able to reason about concurrent programs without those guarantees,
and I suspect no one else is either. Sequential consistency is
guaranteed if atomic variables (including locks) are accessed with
appropriate acquire and release semantics. We just need to use a
hodge-podge of read/write/full barriers and volatile memory accesses
to actually implement the semantics until some distant future date
where we can start relying on compilers getting it right.

I still think we should have a macro for the volatile memory accesses.
As a rule, each one of those needs a memory barrier, and if we
consolidate them, or optimize them out, the considerations why this is
safe should be explained in a comment. Race prone memory accesses
should stick out like a sore thumb.

 Personally, I think the biggest change that would help here is to
 mandate that spinlock operations serve as compiler fences.  That would
 eliminate the need for scads of volatile references all over the
 place.

+1. The commits that you showed fixing issues in this area show quite
well why this is a good idea.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Freezing without write I/O

2013-09-25 Thread Heikki Linnakangas

On 18.09.2013 22:55, Jeff Janes wrote:

On Mon, Sep 16, 2013 at 6:59 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
Here's a rebased version of the patch, including the above-mentioned
fixes. Nothing else new.


I've applied this to 0892ecbc015930d, the last commit to which it applies
cleanly.

When I test this by repeatedly incrementing a counter in a randomly chosen
row, then querying the whole table and comparing the results to what my
driver knows they should be, I get discrepancies.


Ok, I found the bug. The problem was that when a HOT chain began with a
dead tuple and the page was frozen, the dead tuple was not removed,
but the xmin of the live tuple in the chain was replaced with FrozenXid.
That breaks the HOT-chain following code, which checks that the xmin of 
the next tuple in the chain matches the xmax of the previous tuple.
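As an illustrative sketch of the invariant being broken here (the struct and function below are hypothetical stand-ins, not the real HeapTupleHeaderData API):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define FrozenTransactionId ((TransactionId) 2)

/* Minimal stand-in for the xmin/xmax fields of a heap tuple header. */
typedef struct
{
    TransactionId xmin;
    TransactionId xmax;
} FakeTupleHeader;

/* The HOT-chain-following check described above: the next tuple's xmin
 * must equal the previous tuple's xmax. Overwriting the live tuple's
 * xmin with FrozenTransactionId while a dead predecessor keeps its
 * original xmax breaks this equality, which is the reported bug. */
static bool
hot_chain_link_valid(const FakeTupleHeader *prev,
                     const FakeTupleHeader *next)
{
    return prev->xmax == next->xmin;
}
```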


I fixed that by simply not freezing a page which contains any dead 
tuples. That's OK because the page will be visited by vacuum before it 
becomes old enough to be mature. However, it means that the invariant 
that a page can only contain XIDs within one XID-LSN range, determined 
by the LSN, is no longer true. AFAICS everything still works, but I
would feel more comfortable if we could uphold that invariant, for 
debugging reasons if nothing else. Will have to give that some more 
thought..


Thanks for the testing! I just posted an updated version of the patch 
elsewhere in this thread.


- Heikki




Re: [HACKERS] Freezing without write I/O

2013-09-25 Thread Peter Eisentraut
On 9/25/13 5:31 AM, Heikki Linnakangas wrote:
 Attached is a new version, which adds that field to HeapTupleData. Most
 of the issues you listed above have been fixed, plus a bunch of other
 bugs I found myself. The bug that Jeff ran into with his count.pl script
 has also been fixed.

This patch doesn't apply at all:

Reversed (or previously applied) patch detected!  Assuming -R.
1 out of 39 hunks FAILED -- saving rejects to file
src/backend/access/heap/heapam.c.rej
5 out of 9 hunks FAILED -- saving rejects to file
src/backend/access/heap/rewriteheap.c.rej
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
10 out of 12 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlog.c.rej
13 out of 18 hunks FAILED -- saving rejects to file
src/backend/commands/cluster.c.rej
Reversed (or previously applied) patch detected!  Assuming -R.
15 out of 17 hunks FAILED -- saving rejects to file
src/backend/commands/vacuum.c.rej
5 out of 24 hunks FAILED -- saving rejects to file
src/backend/commands/vacuumlazy.c.rej
3 out of 9 hunks FAILED -- saving rejects to file
src/backend/postmaster/autovacuum.c.rej
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
1 out of 4 hunks FAILED -- saving rejects to file
src/backend/utils/misc/guc.c.rej
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
1 out of 2 hunks FAILED -- saving rejects to file
src/include/commands/cluster.h.rej
1 out of 1 hunk FAILED -- saving rejects to file
src/include/commands/vacuum.h.rej
Reversed (or previously applied) patch detected!  Assuming -R.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-09-25 Thread Heikki Linnakangas

On 25.09.2013 15:48, Peter Eisentraut wrote:

On 9/25/13 5:31 AM, Heikki Linnakangas wrote:

Attached is a new version, which adds that field to HeapTupleData. Most
of the issues you listed above have been fixed, plus a bunch of other
bugs I found myself. The bug that Jeff ran into with his count.pl script
has also been fixed.


This patch doesn't apply at all:


Huh, that's strange. Merged with latest changes from master, here's a 
new version that applies.


- Heikki


xid-lsn-ranges-5.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Freezing without write I/O

2013-09-23 Thread Robert Haas
On Fri, Sep 20, 2013 at 11:11 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
 I think we should go through the various implementations and make sure
 they are actual compiler barriers and then change the documented policy.

 From a quick look
 * S_UNLOCK for PPC isn't a compiler barrier
 * S_UNLOCK for MIPS isn't a compiler barrier
 * I don't know enough about unixware (do we still support that as a
 platform even) to judge
 * Tru64 Alpha I have no clue about
 * PA-RISCs tas() might not be a compiler barrier for !GCC
 * PA-RISCs S_UNLOCK might not be a compiler barrier
 * HP-UX !GCC might not
 * IRIX 5 seems to be a compiler barrier
 * SINIX - I don't care
 * AIX PPC - compiler barrier
 * Sun - TAS is implemented in external assembly, normal function call,
   compiler barrier
 * Win(32|64) - compiler barrier
 * Generic S_UNLOCK *NOT* necessarily a compiler barrier.

 Ok, so I might have been a bit too optimistic...

Yeah, it seems worth fixing, but it's not going to be a 5-minute
project, I fear.

But why do you think that this is not a compiler barrier (PPC):

__asm__ __volatile__ (" sync \n"); \
*((volatile slock_t *) (lock)) = 0; \

Surely, the __asm__ __volatile__ must be a compiler barrier, but the
compiler could presumably allow the store to the lock itself to move
forward past other non-volatilized stores during optimization?  Is
that what you're concerned about?  If so, that's easily fixed by
sticking __asm__ __volatile__("" ::: "memory") in there.  But some of
the other cases are less clear.
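For reference, a sketch of the GCC-style construct under discussion; an empty asm with a "memory" clobber is a pure compiler barrier and emits no CPU instruction. The PPC S_UNLOCK variant in the comment is an assumption about what the eventual fix might look like, not committed code:

```c
/* Tell the compiler that memory may have changed across this point: it
 * must not cache values in registers across the barrier, nor reorder
 * loads and stores around it. No hardware fence is emitted. */
#define pg_compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical PPC S_UNLOCK with the clobber folded in:
 *
 * #define S_UNLOCK(lock) \
 *     do { \
 *         __asm__ __volatile__ ("sync" ::: "memory"); \
 *         *((volatile slock_t *) (lock)) = 0; \
 *     } while (0)
 */
```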

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freezing without write I/O

2013-09-23 Thread Andres Freund
On 2013-09-23 11:50:05 -0400, Robert Haas wrote:
 On Fri, Sep 20, 2013 at 11:11 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
  I think we should go through the various implementations and make sure
  they are actual compiler barriers and then change the documented policy.
 
  From a quick look
  * S_UNLOCK for PPC isn't a compiler barrier
  * S_UNLOCK for MIPS isn't a compiler barrier
  * I don't know enough about unixware (do we still support that as a
  platform even) to judge
  * True64 Alpha I have no clue about
  * PA-RISCs tas() might not be a compiler barrier for !GCC
  * PA-RISCs S_UNLOCK might not be a compiler barrier
  * HP-UX !GCC might not
  * IRIX 5 seems to be a compiler barrier
  * SINIX - I don't care
  * AIX PPC - compiler barrier
  * Sun - TAS is implemented in external assembly, normal function call,
compiler barrier
  * Win(32|64) - compiler barrier
  * Generic S_UNLOCK *NOT* necessarily a compiler barrier.
 
  Ok, so I might have been a bit too optimistic...
 
 Yeah, it seems worth fixing, but it's not going to be a 5-minute
 project, I fear.

Yea :(. I think we should start by trimming the above list by removing
some platforms:
* SINIX - doesn't actually seem to be supported
* Tru64 - not even a zombie anymore
* IRIX - ...

The others probably can't be removed?

 But why do you think that this is not a compiler barrier (PPC):
 
  __asm__ __volatile__ (" sync \n"); \
  *((volatile slock_t *) (lock)) = 0; \
 
  Surely, the __asm__ __volatile__ must be a compiler barrier, but the
  compiler could presumably allow the store to the lock itself to move
  forward past other non-volatilized stores during optimization?  Is
  that what you're concerned about?  If so, that's easily fixed by
  sticking __asm__ __volatile__("" ::: "memory") in there.

Yes, the memory clobber is missing for PPC and MIPS.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-09-23 Thread Andres Freund
Just some notes, before I forget them again.

On 2013-09-23 11:50:05 -0400, Robert Haas wrote:
 On Fri, Sep 20, 2013 at 11:11 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
  I think we should go through the various implementations and make sure
  they are actual compiler barriers and then change the documented policy.
 
  From a quick look
  * S_UNLOCK for PPC isn't a compiler barrier
  * S_UNLOCK for MIPS isn't a compiler barrier

Needs a memory clobber; looks easy enough given only gcc seems to be
supported.

  * I don't know enough about unixware (do we still support that as a
  platform even) to judge

Looks like another thing to remove?

  * PA-RISCs tas() might not be a compiler barrier for !GCC
  * PA-RISCs S_UNLOCK might not be a compiler barrier

http://h21007.www2.hp.com/portal/download/files/unprot/ddk/mem_ordering_pa_ia.pdf

_MF(); seems to work on both PA-RISC and itanic.

  * HP-UX !GCC might not

The document linked from the source code explains how to implement it:
http://h21007.www2.hp.com/portal/download/files/unprot/itanium/spinlocks.pdf

#define COMP_FENCE _Asm_fence(_UP_MEM_FENCE|_DOWN_MEM_FENCE)

  * Generic S_UNLOCK *NOT* necessarily a compiler barrier.

Not sure how to handle that one.


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-09-20 Thread Robert Haas
On Thu, Sep 19, 2013 at 6:27 PM, Ants Aasma a...@cybertec.at wrote:
 I'm tackling similar issues in my patch. What I'm thinking is that we should
 change our existing supposedly atomic accesses to be more explicit and make
 the API compatible with C11 atomics[1]. For now I think the changes should be
 something like this:

 * Have separate typedefs for atomic variants of datatypes (I don't think we
   have a whole lot of them). With C11 atomics support, this would change to

 typedef _Atomic TransactionId AtomicTransactionId;

What's the point of this?

 * Require that pg_atomic_init(type, var, val) be used to init atomic
   variables. Initially it would just pass through to assignment, as all
   supported platforms can do 32bit atomic ops. C11 compatible compilers will
   delegate to atomic_init().

I don't think this is a bad idea for decoration, but again I'm not
sure how much fundamental value it adds.  If it makes it easier for
people to write code that works in C11 and fails on C89, we lose.

 * Create atomic read and write macros that look something like:

 #define pg_atomic_load(type, var) (*((volatile type*) (var)))

   and

 #define pg_atomic_store(type, var, val) do { \
 *((volatile type*) (var)) = (val); \
 } while(0)

   C11 would pass through to the compilers implementation with relaxed memory
   ordering.

Same comment.

 * Rename pg_read_barrier()/pg_write_barrier()/pg_memory_barrier() to
   pg_acquire_fence()/pg_release_fence()/pg_acq_rel_fence(). Herb Sutter makes
   a convincing argument why loosening up the barrier semantics is the sane
   choice here. [2] C11 support can then pass though to atomic_thread_fence().

I am entirely unconvinced that we need this.  Some people use acquire
+ release fences, some people use read + write fences, and there are
all combinations (read-acquire, read-release, read-acquire-release,
write-acquire, write-release, write-acquire-release, both-acquire,
both-release, both-acquire-release).  I think we're going to have
enough trouble deciding between the primitives we already have, let
alone with a whole mess of new ones.  Mistakes will only manifest
themselves on certain platforms and in many cases the races are so
tight that actual failures are very unlikely to be observed in
regression testing.

Personally, I think the biggest change that would help here is to
mandate that spinlock operations serve as compiler fences.  That would
eliminate the need for scads of volatile references all over the
place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freezing without write I/O

2013-09-20 Thread Andres Freund
Hi,

I agree with most of what you said - I think that's a little bit too much
change for too little benefit.

On 2013-09-20 08:32:29 -0400, Robert Haas wrote:
 Personally, I think the biggest change that would help here is to
 mandate that spinlock operations serve as compiler fences.  That would
 eliminate the need for scads of volatile references all over the
 place.

They effectively already do, don't they? It's an external, non-inlinable
function call (s_lock, not the actual TAS). And even if it were to get
inlined via LTO optimization, the inline assembler we're using is doing
the __asm__ __volatile__ (..., "memory") dance. That's a full compiler
barrier.
The non-asm implementations call to OS/compiler primitives that are also
fences.

In the case I brought up here there is no spinlock or something similar.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-09-20 Thread Robert Haas
On Fri, Sep 20, 2013 at 8:40 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-09-20 08:32:29 -0400, Robert Haas wrote:
 Personally, I think the biggest change that would help here is to
 mandate that spinlock operations serve as compiler fences.  That would
 eliminate the need for scads of volatile references all over the
 place.

 They effectively already do, don't they? It's an external, non-inlinable
 function call (s_lock, not the actual TAS). And even if it were to get
 inlined via LTO optimization, the inline assembler we're using is doing
 the __asm__ __volatile__ (..., "memory") dance. That's a full compiler
 barrier.
 The non-asm implementations call to OS/compiler primitives that are also
 fences.

 In the case I brought up here there is no spinlock or something similar.

Well, that doesn't match my previous discussions with Tom Lane, or this comment:

 *  Another caution for users of these macros is that it is the caller's
 *  responsibility to ensure that the compiler doesn't re-order accesses
 *  to shared memory to precede the actual lock acquisition, or follow the
 *  lock release.  Typically we handle this by using volatile-qualified
 *  pointers to refer to both the spinlock itself and the shared data
 *  structure being accessed within the spinlocked critical section.
 *  That fixes it because compilers are not allowed to re-order accesses
 *  to volatile objects relative to other such accesses.

That's not to disagree with your point.  If TAS is a compiler barrier,
then we oughta be OK.  Unless something can migrate into the spot
between a failed TAS and the call to s_lock?  But I'm pretty sure that
we've repeatedly had to change code to keep things from falling over
in this area, see e.g. commits
fa72121594534dda6cc010f0bf6b3e8d22987526,
07eeb9d109c057854b20707562ce517efa591761,
d3fc362ec2ce1cf095950dddf24061915f836c22, and
584f818bef68450d23d1b75afbaf19febe38fd37 (the last apparently a live
bug).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-09-20 Thread Andres Freund
On 2013-09-20 08:54:26 -0400, Robert Haas wrote:
 On Fri, Sep 20, 2013 at 8:40 AM, Andres Freund and...@2ndquadrant.com wrote:
  On 2013-09-20 08:32:29 -0400, Robert Haas wrote:
  Personally, I think the biggest change that would help here is to
  mandate that spinlock operations serve as compiler fences.  That would
  eliminate the need for scads of volatile references all over the
  place.
 
  They effectively already do, don't they? It's an external, non-inlineable
  function call (s_lock, not the actual TAS). And even if it were to get
  inlined via LTO optimization, the inline assembler we're using is doing
  the __asm__ __volatile__ (..., memory) dance. That's a full compiler
  barrier.
  The non-asm implementations call OS/compiler primitives that are also
  fences.
 
  In the case I brought up here there is no spinlock or anything similar.
 
 Well, that doesn't match my previous discussions with Tom Lane, or this 
 comment:
 
  *  Another caution for users of these macros is that it is the caller's
  *  responsibility to ensure that the compiler doesn't re-order accesses
  *  to shared memory to precede the actual lock acquisition, or follow the
  *  lock release.  Typically we handle this by using volatile-qualified
  *  pointers to refer to both the spinlock itself and the shared data
  *  structure being accessed within the spinlocked critical section.
  *  That fixes it because compilers are not allowed to re-order accesses
  *  to volatile objects relative to other such accesses.

To me that doesn't really make much sense, to be honest. Note that the next
paragraph tells us that

 *  On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
 *  S_UNLOCK() macros must further include hardware-level memory fence
 *  instructions to prevent similar re-ordering at the hardware level.
 *  TAS() and TAS_SPIN() must guarantee that loads and stores issued after
 *  the macro are not executed until the lock has been obtained.  
Conversely,
 *  S_UNLOCK() must guarantee that loads and stores issued before the macro
 *  have been executed before the lock is released.

So, TAS has to work as a memory barrier if the architecture doesn't itself
have strong cache-coherency guarantees. But memory barriers have to be
compiler barriers as well, don't they?

 That's not to disagree with your point.  If TAS is a compiler barrier,
 then we oughta be OK.  Unless something can migrate into the spot
 between a failed TAS and the call to s_lock?

We're talking compiler barriers right. In that case failure or success
do not play a role, or am I missing something?

 But I'm pretty sure that
 we've repeatedly had to change code to keep things from falling over
 in this area, see e.g. commits

 d3fc362ec2ce1cf095950dddf24061915f836c22, and
 584f818bef68450d23d1b75afbaf19febe38fd37 (the last apparently a live
 bug).

I've quickly checked those out, and things looked mighty different back
then. And incidentally several of the inline assembly implementations
back then didn't specify that they clobber memory (which is what makes
it a compiler barrier).

 fa72121594534dda6cc010f0bf6b3e8d22987526,

Not sure here. Several of the inline assembly bits where changed to
clobber memory, but not all.

 07eeb9d109c057854b20707562ce517efa591761,

Hm. Those mostly look to be overly cautious to me.

I think we should go through the various implementations and make sure
they are actual compiler barriers and then change the documented policy.


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-09-20 Thread Alvaro Herrera
Andres Freund wrote:
 Hi,
 
 On 2013-09-19 14:42:19 +0300, Heikki Linnakangas wrote:
  On 18.09.2013 16:22, Andres Freund wrote:
  * Why can we do a GetOldestXmin(allDbs = false) in
 BeginXidLSNRangeSwitch()?
  
  Looks like a bug. I think I got the arguments backwards, was supposed to be
  allDbs = true and ignoreVacuum = false. I'm not sure if vacuums could be
  ignored, but 'false' is the safe option.
 
 Not sure either...

The ignoreVacuum flag specifies to ignore backends running non-full
VACUUM, that is, processes that are known never to generate new Xids,
never obtain Xmin, and never to insert tuples anywhere.  With all these
restrictions, I think it's safe to use ignoreVacuum=true here.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-09-20 Thread Andres Freund
On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
 I think we should go through the various implementations and make sure
 they are actual compiler barriers and then change the documented policy.

From a quick look
* S_UNLOCK for PPC isn't a compiler barrier
* S_UNLOCK for MIPS isn't a compiler barrier
* I don't know enough about UnixWare (do we even still support that as a
  platform?) to judge
* Tru64 Alpha I have no clue about
* PA-RISC's tas() might not be a compiler barrier for !GCC
* PA-RISC's S_UNLOCK might not be a compiler barrier
* HP-UX !GCC might not be
* IRIX 5 seems to be a compiler barrier
* SINIX - I don't care
* AIX PPC - compiler barrier
* Sun - TAS is implemented in external assembly, as a normal function call,
  so compiler barrier
* Win(32|64) - compiler barrier
* Generic S_UNLOCK is *NOT* necessarily a compiler barrier.

Ok, so I might have been a bit too optimistic...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-09-20 Thread didier
Hi


On Fri, Sep 20, 2013 at 5:11 PM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
  I think we should go through the various implementations and make sure
  they are actual compiler barriers and then change the documented policy.

 From a quick look
 * S_UNLOCK for PPC isn't a compiler barrier
 * S_UNLOCK for MIPS isn't a compiler barrier
 * I don't know enough about UnixWare (do we even still support that as a
 platform?) to judge
 * Tru64 Alpha I have no clue about
 * PA-RISC's tas() might not be a compiler barrier for !GCC
 * PA-RISC's S_UNLOCK might not be a compiler barrier
 * HP-UX !GCC might not be
 * IRIX 5 seems to be a compiler barrier
 * SINIX - I don't care
 * AIX PPC - compiler barrier
 * Sun - TAS is implemented in external assembly, as a normal function call,
   so compiler barrier
 * Win(32|64) - compiler barrier
 * Generic S_UNLOCK is *NOT* necessarily a compiler barrier.

 Ok, so I might have been a bit too optimistic...

 Greetings,

 Andres Freund

 --
  Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] Freezing without write I/O

2013-09-20 Thread didier
Hi,

IMO it's a bug if S_UNLOCK is not a compiler barrier.

Moreover, regarding volatile, remember:
https://www.securecoding.cert.org/confluence/display/seccode/DCL17-C.+Beware+of+miscompiled+volatile-qualified+variables

Who is double checking compiler output? :)

regards
Didier



On Fri, Sep 20, 2013 at 5:11 PM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
  I think we should go through the various implementations and make sure
  they are actual compiler barriers and then change the documented policy.

 From a quick look
 * S_UNLOCK for PPC isn't a compiler barrier
 * S_UNLOCK for MIPS isn't a compiler barrier
 * I don't know enough about UnixWare (do we even still support that as a
 platform?) to judge
 * Tru64 Alpha I have no clue about
 * PA-RISC's tas() might not be a compiler barrier for !GCC
 * PA-RISC's S_UNLOCK might not be a compiler barrier
 * HP-UX !GCC might not be
 * IRIX 5 seems to be a compiler barrier
 * SINIX - I don't care
 * AIX PPC - compiler barrier
 * Sun - TAS is implemented in external assembly, as a normal function call,
   so compiler barrier
 * Win(32|64) - compiler barrier
 * Generic S_UNLOCK is *NOT* necessarily a compiler barrier.

 Ok, so I might have been a bit too optimistic...

 Greetings,

 Andres Freund

 --
  Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] Freezing without write I/O

2013-09-19 Thread Heikki Linnakangas

On 18.09.2013 16:22, Andres Freund wrote:

On 2013-09-16 16:59:28 +0300, Heikki Linnakangas wrote:

Here's a rebased version of the patch, including the above-mentioned fixes.
Nothing else new.


* We need some higher-level description of the algorithm somewhere in the
   source. I don't think I've understood the concept from the patch alone
   without having read the thread previously.
* Why do we need to do the PageUpdateNeedsFreezing() dance in
   heap_page_prune? No xids should change during it.


Ah, but its LSN does change. Moving the LSN forward might move the LSN 
from one xid-lsn range to another, requiring any XIDs on the page that 
fall outside the xid range of the new xid-lsn range to be frozen.



* Why can we do a GetOldestXmin(allDbs = false) in
   BeginXidLSNRangeSwitch()?


Looks like a bug. I think I got the arguments backwards, was supposed to 
be allDbs = true and ignoreVacuum = false. I'm not sure if vacuums could 
be ignored, but 'false' is the safe option.



* Is there any concrete reasoning behind the current values for
   XID_LSN_RANGE_INTERVAL and NUM_XID_LSN_RANGES or just gut feeling?


The values in the patch are:

#define NUM_XID_LSN_RANGES 100
#define XID_LSN_RANGE_INTERVAL 100

The interval is probably way too small for production, but a small value 
is handy for testing, as you can get through ranges faster. Something 
like 100 million would probably be a more suitable value for production. 
With a smaller interval, you need to freeze more often, while with a 
large interval, you can't truncate the clog as early.


NUM_XID_LSN_RANGES is also quite arbitrary. I don't have a good feeling 
on what the appropriate sizing for that would be. Something like 5 or 10 
would probably be enough. Although at the moment, the patch doesn't 
handle the situation that you run out of slots very well. You could 
merge some old ranges to make room, but ATM, it just stops creating new 
ones.



* the lsn ranges file can possibly become bigger than 512 bytes (the size
   we assume to be written atomically) and you write it in place. If we
   fail halfway through writing, we seem to be able to recover by using
   the pageMatureLSN from the last checkpoint, but it seems better to
   do the fsync(), rename(), fsync() dance either way.


The lsn-range array is also written to the xlog in toto whenever it 
changes, so on recovery, the ranges will be read from the WAL, and the 
ranges file will be recreated at the end-of-recovery checkpoint.



* Should we preemptively freeze tuples on a page in lazy_scan_heap if we
   already have dirtied the page? That would make future modifications
   cheaper.


Hmm, dunno. Any future modification will also dirty the page, so you're 
not actually saving any I/O by freezing it earlier. You're just choosing 
to do the CPU work and WAL insertion earlier than you have to. And if 
the page is not modified later, it is a waste of cycles. That said, 
maybe it would indeed be better to do it in vacuum when you have a 
chance to reduce latency in the critical path, even if it might be a waste.



* lazy_scan_heap now blocks acquiring a cleanup lock on every buffer
   that contains dead tuples. Shouldn't we use some kind of cutoff xid
   there? That might block progress too heavily. Also the comment above
   it still refers to the old logic.


Hmm, you mean like skip the page even if there are some dead tuples on 
it, as long as the dead tuples are not older than X xids? I guess we 
could do that. Or the other thing mentioned in the comments, ie. 
remember the page and come back to it later.


For now though, I think it's good enough as it is.


* There's no way to force a full table vacuum anymore, that seems
   problematic to me.


Yeah, it probably would be good to have that. An ignore visibility map 
option.



* I wonder if CheckPointVarsup() doesn't need to update
   minRecoveryPoint.


Hmm,  I guess it should.


StartupVarsup() should be ok, because we should only
   read one from the future during a basebackup?


You might read one from the future during recovery, whether it's crash 
or archive recovery, but as soon as you've replayed up to the min 
recovery point, or the end of WAL on crash recovery, the XID ranges 
array in memory should be consistent with the rest of the system, 
because changes to the array are WAL logged.



* xidlsnranges_recently[_dirtied] are not obvious on a first glance. Why
   can't we just reset dirty before the WriteXidLSNRangesFile() call?
   There's only one process doing the writeout. Just because the
   checkpointing process could be killed?


Right, if the write fails, you need to retry on the next checkpoint.


* I think we should either not require consuming a multixactid or use a
   function that doesn't need MultiXactIdSetOldestMember(). If the
   transaction doing so lives for long it will unnecessarily prevent
   truncation of mxacts.


Agreed, that was just a quick kludge to get it working.



Re: [HACKERS] Freezing without write I/O

2013-09-19 Thread Andres Freund
Hi,

On 2013-09-19 14:42:19 +0300, Heikki Linnakangas wrote:
 On 18.09.2013 16:22, Andres Freund wrote:
 * Why can we do a GetOldestXmin(allDbs = false) in
BeginXidLSNRangeSwitch()?
 
 Looks like a bug. I think I got the arguments backwards, was supposed to be
 allDbs = true and ignoreVacuum = false. I'm not sure if vacuums could be
 ignored, but 'false' is the safe option.

Not sure either...

 * the lsn ranges file can possibly become bigger than 512 bytes (the size
we assume to be written atomically) and you write it in place. If we
fail halfway through writing, we seem to be able to recover by using
the pageMatureLSN from the last checkpoint, but it seems better to
do the fsync(),rename(),fsync() dance either way.
 
 The lsn-range array is also written to the xlog in toto whenever it changes,
 so on recovery, the ranges will be read from the WAL, and the ranges file
 will be recreated at the end-of-recovery checkpoint.

But we will read the file before we read the WAL record covering it,
won't we?

 * Should we preemptively freeze tuples on a page in lazy_scan_heap if we
already have dirtied the page? That would make future modifications
cheaper.

 Hmm, dunno. Any future modification will also dirty the page, so you're not
 actually saving any I/O by freezing it earlier. You're just choosing to do
 the CPU work and WAL insertion earlier than you have to. And if the page is
 not modified later, it is a waste of cycles. That said, maybe it would
 indeed be better to do it in vacuum when you have a chance to reduce latency
 in the critical path, even if it might be a waste.

Well, if we already dirtied the page (as in vacuum itself, to remove
dead tuples) we already have to do WAL. True, it's a separate record, but
that should probably be fixed.

 * lazy_scan_heap now blocks acquiring a cleanup lock on every buffer
that contains dead tuples. Shouldn't we use some kind of cutoff xid
there? That might block progress too heavily. Also the comment above
it still refers to the old logic.
 
 Hmm, you mean like skip the page even if there are some dead tuples on it,
 as long as the dead tuples are not older than X xids? I guess we could do
 that. Or the other thing mentioned in the comments, ie. remember the page
 and come back to it later.

 For now though, I think it's good enough as it is.

I have seen vacuum being slowed down considerably because we couldn't
acquire a cleanup lock. It's not that infrequent to have pages that are
pinned most of the time. And that was when we only acquired the cleanup
lock when necessary for freezing.

Not processing the page will not mark it as all-visible which basically
is remembering it...

 * switchFinishXmin and nextSwitchXid should probably be either volatile
or have a compiler barrier between accessing shared memory and
checking them. The compiler very well could optimize them away and
access shmem all the time which could lead to weird results.
 
 Hmm, that would be a weird optimization. Is that really legal for the
 compiler to do? Better safe than sorry, I guess.

I think it is. The compiler doesn't know anything about shared memory or
threads, so it doesn't see that those parameters can change. It will be
forced to load them as soon as a non-inlined function is called, which
actually might make it safe in this case - unless you compile in LTO
mode where it recognizes that TransactionIdFollowsOrEquals() won't
modify things...

 * I wonder whether the fact that we're doing the range switches after
acquiring an xid could be problematic if we're preventing xid
allocation due to the checks earlier in that function?
 
 You mean that you might get into a situation where you cannot assign a new
 XID because you've reached xidStopLimit, and you cannot advance xidStopLimit
 because you can't switch to a new xid-lsn range, because no new XIDs are
 being assigned? Yeah, that would be nasty. The range management stuff should
 really be moved out of GetNewTransaction(), maybe do them in walwriter or
 bgwriter as Alvaro suggested.

Yes, I think we should do that. I am not sure yet where it's best to put
it, though. Doing it in the WAL writer doesn't seem to be a good idea, as
doing more there will increase latency for normal backends.

 * I think heap_lock_tuple() needs to unset all-visible, otherwise we
won't vacuum that page again which can lead to problems since we
don't do full-table vacuums again?
 
 It's OK if the page is never vacuumed again, the whole point of the patch is
 that old XIDs can be just left lingering in the table. The next time the
 page is updated, perhaps to lock a tuple again, the page will be frozen and
 the old xmax will be cleared.

Yes, you're right, it should be possible to make it work that way. But
currently, we look at xmax and infomask of a tuple in heap_lock_tuple()
*before* the PageUpdateNeedsFreezing() call. Currently we would thus
create a new multixact with long dead xids as members and such.

Re: [HACKERS] Freezing without write I/O

2013-09-19 Thread Andres Freund
On 2013-09-19 14:40:35 +0200, Andres Freund wrote:
  * I think heap_lock_tuple() needs to unset all-visible, otherwise we
 won't vacuum that page again which can lead to problems since we
 don't do full-table vacuums again?
  
  It's OK if the page is never vacuumed again, the whole point of the patch is
  that old XIDs can be just left lingering in the table. The next time the
  page is updated, perhaps to lock a tuple again, the page will be frozen and
  the old xmax will be cleared.
 
 Yes, you're right, it should be possible to make it work that way. But
 currently, we look at xmax and infomask of a tuple in heap_lock_tuple()
 *before* the PageUpdateNeedsFreezing() call. Currently we would thus
 create a new multixact with long dead xids as members and such.

That reminds me of something:

There are lots of checks sprayed around that unconditionally look at
xmin, xmax or infomask.

Things I noticed in a quick look after thinking about the previous
statement:
* AlterEnum() looks at xmin
* The PLs look at xmin
* So do, AFAIR, VACUUM FREEZE, COPY and such.
* Several HeapTupleSatisfiesVacuum() callers look at xmin and stuff
  without checking for maturity or freezing the page first.
  * lazy_scan_heap() itself if the page isn't already marked all_visible
  * heap_page_is_all_visible()
* rewrite_heap_tuple() does an unconditional heap_freeze_tuple() without
  considering maturity/passing in Invalid*
* heap_page_prune_opt() shouldn't do anything on a mature (or even all
  visible) page.
* heap_page_prune() marks a buffer dirty, writes a new xid without
  changing the lsn.
* heap_prune_chain() looks at tuples without considering page maturity,
  but the current implementation actually looks safe.
* CheckForSerializableConflictOut() looks at xmins.
* pgrowlocks() looks at xmax unconditionally.
* heap_update() computes, just like heap_lock_tuple(), a new
  xmax/infomask before freezing.
* Same for heap_lock_updated_tuple(). Not sure if that's an actual
  concern, but it might if PageMatureLSN advances or such.
* heap_getsysattr() should probably return different things for mature
  pages.

There's probably more.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-09-19 Thread Ants Aasma
On Thu, Sep 19, 2013 at 2:42 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 * switchFinishXmin and nextSwitchXid should probably be either volatile
or have a compiler barrier between accessing shared memory and
checking them. The compiler very well could optimize them away and
access shmem all the time which could lead to weird results.

 Hmm, that would be a weird optimization. Is that really legal for the
 compiler to do? Better safe than sorry, I guess.

C doesn't really have a multi-threaded memory model before C11, so the
compiler makers have just made one up that suits them best. For unlocked
memory reads the compiler is free to assume that no one else is accessing the
variable, so yes that optimization would be legal according to their rules.

I'm tackling similar issues in my patch. What I'm thinking is that we should
change our existing supposedly atomic accesses to be more explicit and make
the API compatible with C11 atomics[1]. For now I think the changes should be
something like this:

* Have separate typedefs for atomic variants of datatypes (I don't think we
  have a whole lot of them). With C11 atomics support, this would change to

typedef _Atomic TransactionId AtomicTransactionId;

* Require that pg_atomic_init(type, var, val) be used to init atomic
  variables. Initially it would just pass through to assignment, as all
  supported platforms can do 32bit atomic ops. C11 compatible compilers will
  delegate to atomic_init().

* Create atomic read and write macros that look something like:

#define pg_atomic_load(type, var) (*((volatile type*) (var)))

  and

#define pg_atomic_store(type, var, val) do { \
*((volatile type*) (var)) = (val); \
} while(0)

  C11 would pass through to the compilers implementation with relaxed memory
  ordering.

* Rename pg_read_barrier()/pg_write_barrier()/pg_memory_barrier() to
  pg_acquire_fence()/pg_release_fence()/pg_acq_rel_fence(). Herb Sutter makes
  a convincing argument why loosening up the barrier semantics is the sane
  choice here. [2] C11 support can then pass though to atomic_thread_fence().

This way we have a relatively future proof baseline for lockfree programming,
with options to expand with other primitives. We could also only do the
volatile access macros part, at least it would make the intention more clear
than the current approach of sprinkling around volatile pointers.

Regards,
Ants Aasma

[1] http://en.cppreference.com/w/c/atomic
[2] (long video about atomics)
http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-09-18 Thread Andres Freund
On 2013-09-16 16:59:28 +0300, Heikki Linnakangas wrote:
 Here's a rebased version of the patch, including the above-mentioned fixes.
 Nothing else new.

* We need some higher-level description of the algorithm somewhere in the
  source. I don't think I've understood the concept from the patch alone
  without having read the thread previously.
* Why do we need to do the PageUpdateNeedsFreezing() dance in
  heap_page_prune? No xids should change during it.
* Why can we do a GetOldestXmin(allDbs = false) in
  BeginXidLSNRangeSwitch()?
* Is there any concrete reasoning behind the current values for
  XID_LSN_RANGE_INTERVAL and NUM_XID_LSN_RANGES or just gut feeling?
* the lsn ranges file can possibly become bigger than 512 bytes (the size
  we assume to be written atomically) and you write it in place. If we
  fail halfway through writing, we seem to be able to recover by using
  the pageMatureLSN from the last checkpoint, but it seems better to
  do the fsync(), rename(), fsync() dance either way.
* Should we preemptively freeze tuples on a page in lazy_scan_heap if we
  already have dirtied the page? That would make future modifications
  cheaper.
* lazy_scan_heap now blocks acquiring a cleanup lock on every buffer
  that contains dead tuples. Shouldn't we use some kind of cutoff xid
  there? That might block progress too heavily. Also the comment above
  it still refers to the old logic.
* There's no way to force a full table vacuum anymore, that seems
  problematic to me.
* I wonder if CheckPointVarsup() doesn't need to update
  minRecoveryPoint. StartupVarsup() should be ok, because we should only
  read one from the future during a basebackup?
* xidlsnranges_recently[_dirtied] are not obvious on a first glance. Why
  can't we just reset dirty before the WriteXidLSNRangesFile() call?
  There's only one process doing the writeout. Just because the
  checkpointing process could be killed?
* I think we should either not require consuming a multixactid or use a
  function that doesn't need MultiXactIdSetOldestMember(). If the
  transaction doing so lives for long it will unnecessarily prevent
  truncation of mxacts.
* switchFinishXmin and nextSwitchXid should probably be either volatile
  or have a compiler barrier between accessing shared memory and
  checking them. The compiler very well could optimize them away and
  access shmem all the time which could lead to weird results.
* I wonder whether the fact that we're doing the range switches after
  acquiring an xid could be problematic if we're preventing xid
  allocation due to the checks earlier in that function?
* I think heap_lock_tuple() needs to unset all-visible, otherwise we
  won't vacuum that page again which can lead to problems since we
  don't do full-table vacuums again?

So, I think that's enough for a first look. Will think about general
issues a bit more.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-09-18 Thread Jeff Janes
On Mon, Sep 16, 2013 at 6:59 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:


 Here's a rebased version of the patch, including the above-mentioned
 fixes. Nothing else new.


I've applied this to 0892ecbc015930d, the last commit to which it applies
cleanly.

When I test this by repeatedly incrementing a counter in a randomly chosen
row, then querying the whole table and comparing the results to what my
driver knows they should be, I get discrepancies.

No crash/recovery needs to be done to get the behavior.

The number of rows is correct, so one version of every row is visible, but
it is the wrong version.

The discrepancy arises shortly after the first time this type of message
appears:

6930 UPDATE 2013-09-18 12:36:34.519 PDT:LOG:  started new XID range, XIDs
133-, MultiXIDs 1-, tentative LSN 0/FA517F8
6930 UPDATE 2013-09-18 12:36:34.519 PDT:STATEMENT:  update foo set
count=count+1 where index=$1
6928 UPDATE 2013-09-18 12:36:34.521 PDT:LOG:  closed old XID range at
1000193 (LSN 0/FA58A08)
6928 UPDATE 2013-09-18 12:36:34.521 PDT:STATEMENT:  update foo set
count=count+1 where index=$1

I'll work on getting the driver to shutdown the database the first time it
finds a problem so that autovac doesn't destroy evidence.

Cheers,

Jeff


Re: [HACKERS] Freezing without write I/O

2013-09-17 Thread Peter Eisentraut
On 9/16/13 9:59 AM, Heikki Linnakangas wrote:
 Here's a rebased version of the patch, including the above-mentioned
 fixes. Nothing else new.

It still fails to apply.  You might need to rebase it again.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-09-17 Thread Andres Freund
On 2013-09-17 09:41:47 -0400, Peter Eisentraut wrote:
 On 9/16/13 9:59 AM, Heikki Linnakangas wrote:
  Here's a rebased version of the patch, including the above-mentioned
  fixes. Nothing else new.
 
 It still fails to apply.  You might need to rebase it again.

FWIW, I don't think it's too important that this specific patch applies
all the time. From the look I had and the discussions on-list, what it
needs so far is mostly architecture review and such.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-09-17 Thread Alvaro Herrera
Heikki Linnakangas wrote:

 Here's a rebased version of the patch, including the above-mentioned
 fixes. Nothing else new.

Nice work.  I apologize for the conflicts I created yesterday.

I would suggest renaming varsup_internal.h to varsup_xlog.h.

You added a FIXME comment to HeapTupleHeaderIsOnlyLocked.  I think the
answer to the question is yes, because checking whether a tuple is locked
might involve examining its Xmax, which cannot be done if the page is
mature; perhaps the check for maturity needs to happen only after the
infomask has been checked.

The new stuff in GetNewTransactionId() is said to be very expensive
(it might involve grabbing the ProcArrayLock and scanning the whole
procarray, for example.)  There's a comment about having this be done
only in other processes, and I think that's a good idea, otherwise we
risk adding a lot of latency, randomly, to client-connected processes
which might be latency sensitive.  I think autovacuum is not a good
choice however (it might not even be running).  How about using the
bgwriter or walwriter for this?  As far as I can tell, there's no need
for this process to actually be able to run transactions or scan
catalogs; the ability to lock and scan ProcArray appears sufficient.
Using an RTC (instead of the Xid counter % 128) to determine when to run
this doesn't look like a problem to me.  Something that sleeps for too
long might be, so we would need to ensure that it happens timely.  Not
sure what's an appropriate time for this, however.

Another option is to have backends check the Xid counter, and every 128
ticks set a flag in shmem so that the background process knows to
execute the switch.
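The two alternatives above (polling on a timer, or backends setting a shared flag every 128 XIDs) can be sketched roughly as follows. This is purely illustrative pseudocode with invented names, not the actual PostgreSQL implementation; the point is only that XID allocation stays cheap while the expensive switch runs in a background process:

```python
class SharedState:
    """Stand-in for shared memory visible to all processes."""
    def __init__(self):
        self.next_xid = 0
        self.switch_pending = False
        self.switches_done = 0

def get_new_transaction_id(shmem):
    """Allocating backend: cheap work only, just sets a flag every 128 XIDs."""
    xid = shmem.next_xid
    shmem.next_xid += 1
    if shmem.next_xid % 128 == 0:
        shmem.switch_pending = True
    return xid

def background_process_tick(shmem):
    """bgwriter/walwriter: does the expensive range switch off the
    client-connected path when it sees the flag."""
    if shmem.switch_pending:
        shmem.switch_pending = False
        shmem.switches_done += 1  # stand-in for scanning the procarray etc.
```

The latency-sensitive path never does the expensive work itself; it only publishes that the work is due.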


heap_freeze_tuple() receives one Xid and one MultiXactId; they can be
passed as Invalid to mean forced freezing.  Do we really need to offer
the possibility of freezing xids but not multis, and multis but not
xids?  From a conceptual PoV, it seems to make more sense to me to pass
a boolean force meaning freeze everything, and ignore the numerical
values.


The greatest risk in this patch is the possibility that some
heap_freeze_page() call is forgotten.  I think you covered all cases in
heapam.c.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-09-16 Thread Heikki Linnakangas

On 27.08.2013 19:37, Heikki Linnakangas wrote:

On 27.08.2013 18:56, Heikki Linnakangas wrote:

Here's an updated patch.


Ah, forgot one thing:

Here's a little extension I've been using to test this. It contains two
functions; one to simply consume N xids, making it faster to hit the
creation of new XID ranges and wraparound. The other,
print_xidlsnranges(), prints out the contents of the current XID-LSN
range array.

Also, just ran into two silly bugs in the patch version I posted, while
checking that that xidtest extension hasn't bitrotted. A fixed version
has been pushed to the git repository, so make sure you use that version
if you want to actually run it.


Here's a rebased version of the patch, including the above-mentioned 
fixes. Nothing else new.


- Heikki



xid-lsn-ranges-3.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Freezing without write I/O

2013-09-03 Thread Robert Haas
On Mon, Sep 2, 2013 at 3:16 PM, Jeff Davis pg...@j-davis.com wrote:
 On Fri, 2013-08-30 at 20:34 +0200, Andres Freund wrote:
 I have a quick question: The reason I'd asked about the status of the
 patch was that I was thinking about the state of the forensic freezing
 patch. After a quick look at your proposal, we still need to freeze in
 some situations (old & new data on the same page basically), so I'd say
 it still makes sense to apply the forensic freezing patch, right?

 Differing Opinions?

 The Freeze Forensically patch is nice because (if I understand it
 correctly) it allows us to freeze at the same time as we mark
 PD_ALL_VISIBLE, which avoids the potential extra page write.

The patch itself doesn't actually make that change, but it removes one
major objection to such a change.

 But that's
 not such a big advantage if we don't ordinarily have to write again for
 freezing, anyway.

That was my thought as well.

 However, there are still some cases where we'd be able to preserve the
 forensic information. If nothing else, that might help debug this patch,
 if necessary. There might also be cases where we can freeze more eagerly
 to avoid the case where very old (but unfrozen) and very new tuples mix
 on the same page. Perhaps Robert has some thoughts here, as well.

I basically agree.  I think if we adopt Heikki's patch forensic
freezing becomes much less important, but we might find there's still
a reason to do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freezing without write I/O

2013-09-02 Thread Jeff Davis
On Fri, 2013-08-30 at 20:34 +0200, Andres Freund wrote:
 I have a quick question: The reason I'd asked about the status of the
 patch was that I was thinking about the state of the forensic freezing
 patch. After a quick look at your proposal, we still need to freeze in
 some situations (old & new data on the same page basically), so I'd say
 it still makes sense to apply the forensic freezing patch, right?
 
 Differing Opinions?

The Freeze Forensically patch is nice because (if I understand it
correctly) it allows us to freeze at the same time as we mark
PD_ALL_VISIBLE, which avoids the potential extra page write. But that's
not such a big advantage if we don't ordinarily have to write again for
freezing, anyway.

However, there are still some cases where we'd be able to preserve the
forensic information. If nothing else, that might help debug this patch,
if necessary. There might also be cases where we can freeze more eagerly
to avoid the case where very old (but unfrozen) and very new tuples mix
on the same page. Perhaps Robert has some thoughts here, as well.

Regards,
Jeff Davis






Re: [HACKERS] Freezing without write I/O

2013-08-30 Thread Andres Freund
Hi Heikki,

On 2013-08-27 18:56:15 +0300, Heikki Linnakangas wrote:
 Here's an updated patch. The race conditions I mentioned above have been
 fixed.

Thanks for posting the new version!

I have a quick question: The reason I'd asked about the status of the
patch was that I was thinking about the state of the forensic freezing
patch. After a quick look at your proposal, we still need to freeze in
some situations (old & new data on the same page basically), so I'd say
it still makes sense to apply the forensic freezing patch, right?

Differing Opinions?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-08-27 Thread Heikki Linnakangas

On 10.06.2013 21:58, Heikki Linnakangas wrote:

On 01.06.2013 23:21, Robert Haas wrote:

On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

We define a new page-level bit, something like PD_RECENTLY_FROZEN.
When this bit is set, it means there are no unfrozen tuples on the
page with XIDs that predate the current half-epoch. Whenever we know
this to be true, we set the bit. If the page LSN crosses more than
one half-epoch boundary at a time, we freeze the page and set the bit.
If the page LSN crosses exactly one half-epoch boundary, then (1) if
the bit is set, we clear it and (2) if the bit is not set, we freeze
the page and set the bit.
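The quoted PD_RECENTLY_FROZEN rule can be sketched like this (names and structure invented for illustration; not the actual implementation):

```python
class Page:
    def __init__(self):
        self.recently_frozen = False  # the PD_RECENTLY_FROZEN bit
        self.frozen_count = 0         # times we froze (for illustration)

def freeze_page(page):
    # Stand-in for freezing all old tuples on the page.
    page.frozen_count += 1

def note_boundary_crossing(page, old_half_epoch, new_half_epoch):
    """Apply the rule when the page's LSN advances from old_half_epoch
    to new_half_epoch: crossing more than one boundary freezes and sets
    the bit; crossing exactly one clears the bit if set, else freezes
    and sets it."""
    crossed = new_half_epoch - old_half_epoch
    if crossed == 0:
        return
    if crossed > 1:
        freeze_page(page)
        page.recently_frozen = True
    elif page.recently_frozen:
        page.recently_frozen = False
    else:
        freeze_page(page)
        page.recently_frozen = True
```

The bit thus records "already frozen within the current half-epoch", so a page touched frequently is frozen at most once per half-epoch.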


Yep, I think that would work. Want to write the patch, or should I? ;-)


Have at it.


Here's a first draft. A lot of stuff missing and broken, but make
check passes :-).

In the patch, instead of working with half-epochs, there are XID-LSN
ranges, which can be of arbitrary size. An XID-LSN range consists of
three values:

minlsn: The point in WAL where this range begins.
minxid - maxxid: The range of XIDs allowed in this range.

Every point in WAL belongs to exactly one range. The minxid-maxxid of
the ranges can overlap. For example:

1. XIDs 25000942 - 2703 LSN 0/3BB9938
2. XIDs 23000742 - 2603 LSN 0/2AB9288
3. XIDs 22000721 - 2503 LSN 0/1AB8BE0
4. XIDs 2202 - 2403 LSN 0/10B1550

The invariant with the ranges is that a page with a certain LSN is only
allowed to contain XIDs that belong to the range specified by that LSN.
For example, if a page has LSN 0/350, it belongs to the 2nd range,
and can only contain XIDs between 23000742 - 2603. If a backend
updates the page, so that its LSN is updated to, say, 0/3D12345, all
XIDs on the page older than 25000942 must be frozen first, to avoid
violating the rule.
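The invariant can be sketched as follows: the page's LSN selects exactly one range, and every XID on the page must fall within that range's minxid..maxxid. All numbers and names below are made up for illustration:

```python
# Tracked XID-LSN ranges, newest first: (minlsn, minxid, maxxid).
RANGES = [
    (0x3BB9938, 25_000_942, 27_000_000),
    (0x2AB9288, 23_000_742, 26_000_000),
    (0x1AB8BE0, 22_000_721, 25_000_000),
    (0x10B1550, 20_000_002, 24_000_000),
]

def range_for_lsn(ranges, lsn):
    """Return (minxid, maxxid) of the range that covers this LSN."""
    for minlsn, minxid, maxxid in ranges:
        if lsn >= minlsn:
            return minxid, maxxid
    raise ValueError("LSN predates the oldest tracked range")

def page_satisfies_invariant(ranges, page_lsn, page_xids):
    """True iff all XIDs on the page lie within the range its LSN selects."""
    minxid, maxxid = range_for_lsn(ranges, page_lsn)
    return all(minxid <= xid <= maxxid for xid in page_xids)
```

An update that moves the page's LSN into a newer range must first freeze any XIDs below the new range's minxid, or the invariant would be violated.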

The system keeps track of a small number of these XID-LSN ranges. Where
we currently truncate clog, we can also truncate the ranges with maxxid
< the clog truncation point. Vacuum removes any dead tuples and updates
relfrozenxid as usual, to make sure that there are no dead tuples or
aborted XIDs older than the minxid of the oldest tracked XID-LSN range.
It no longer needs to freeze old committed XIDs, however - that's the
gain from this patch (except to uphold the invariant, if it has to
remove some dead tuples on the page and update its LSN).

A new range is created whenever we reach the maxxid on the current one.
The new range's minxid is set to the current global oldest xmin value,
and maxxid is just the old maxxid plus a fixed number (1 million in the
patch, but we probably want a higher value in reality). This ensures
that if you modify a page and update its LSN, all the existing XIDs on
the page that cannot be frozen yet are greater than the minxid of the
latest range. In other words, you can always freeze old XIDs on a page,
so that any remaining non-frozen XIDs are within the minxid-maxxid of
the latest range.
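Creating a new range when the current one's maxxid is reached might look roughly like this (illustrative only; the step value mirrors the 1 million placeholder mentioned above):

```python
RANGE_STEP = 1_000_000  # XIDs per new range; a placeholder value

def maybe_advance_range(ranges, next_xid, oldest_xmin, current_lsn):
    """ranges is newest-first [(minlsn, minxid, maxxid), ...].
    If next_xid exceeds the newest range's maxxid, prepend a new range
    whose minxid is the current global oldest xmin."""
    _minlsn, _minxid, maxxid = ranges[0]
    if next_xid <= maxxid:
        return ranges
    new_range = (current_lsn, oldest_xmin, maxxid + RANGE_STEP)
    return [new_range] + ranges
```

Because the new minxid is the global oldest xmin, any XID that cannot yet be frozen is guaranteed to fit inside the latest range.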

The HeapTupleSatisfies functions are modified to look at the page's LSN
first. If it's old enough, it doesn't look at the XIDs on the page level
at all, and just considers everything on the page is visible to everyone
(I'm calling this state a mature page).
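The mature-page fast path in the visibility checks amounts to a one-line shortcut before the usual per-XID logic (a sketch with invented names, not the real HeapTupleSatisfies code):

```python
def tuple_is_visible(page_lsn, oldest_tracked_minlsn, clog_check):
    """If the page's LSN predates the oldest tracked XID-LSN range, the
    page is 'mature': everything on it is visible to everyone, and no
    per-tuple XID or clog lookups are needed. clog_check() stands in
    for the ordinary visibility logic."""
    if page_lsn < oldest_tracked_minlsn:
        return True
    return clog_check()
```

This is where the patch saves work: old pages are decided by one LSN comparison instead of XID checks.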


I think the tricky part is going to be figuring out the
synchronization around half-epoch boundaries.


Yep. I skipped all those difficult parts in this first version. There
are two race conditions that need to be fixed:

1. When a page is updated, we check if it needs to be frozen. If its LSN
is greater than the latest range's LSN. IOW, if we've already modified
the page, and thus frozen all older tuples, within the current range.
However, it's possible that a new range is created immediately after
we've checked that. When we then proceed to do the actual update on the
page and WAL-log that, the new LSN falls within the next range, and we
should've frozen the page. I'm planning to fix that by adding a parity
bit on the page header. Each XID-LSN range is assigned a parity bit, 0
or 1. When we check if a page needs to be frozen on update, we make note
of the latest range's parity bit, and write it in the page header.
Later, when we look at the page's LSN to determine which XID-LSN range
it belongs to, we compare the parity. If the parity doesn't match, we
know that the race condition happened, so we treat the page to belong to
the previous range, not the one it would normally belong to, per the LSN.
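The parity-bit disambiguation might look like this (a sketch under the assumptions above; names invented): each range carries a parity bit, and a page stamped with a parity that disagrees with the range its LSN selects is attributed to the previous, older range:

```python
def range_for_page(ranges, page_lsn, page_parity):
    """ranges newest-first: (minlsn, parity). Return the index of the
    range the page is treated as belonging to."""
    for i, (minlsn, parity) in enumerate(ranges):
        if page_lsn >= minlsn:
            if parity != page_parity and i + 1 < len(ranges):
                # Race detected: a new range appeared between the freeze
                # check and WAL-logging, so use the previous range.
                return i + 1
            return i
    raise ValueError("LSN predates the oldest tracked range")
```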

2. When we look at a page, and determine that it's not old enough to be
matured, we then check the clog as usual. A page is considered mature,
if the XID-LSN range (and corresponding clog pages) has already been
truncated away. It's possible that between those steps, the XID-LSN
range and clog is truncated away, so that the backend tries to access a
clog page that doesn't exist anymore. To fix that, the XID-LSN range and
clog truncation needs to be done in two steps. First, mark the
truncation point in shared memory. Then somehow wait until 

Re: [HACKERS] Freezing without write I/O

2013-08-27 Thread Heikki Linnakangas

On 27.08.2013 18:56, Heikki Linnakangas wrote:

Here's an updated patch.


Ah, forgot one thing:

Here's a little extension I've been using to test this. It contains two 
functions; one to simply consume N xids, making it faster to hit the 
creation of new XID ranges and wraparound. The other, 
print_xidlsnranges(), prints out the contents of the current XID-LSN 
range array.


Also, just ran into two silly bugs in the patch version I posted, while 
checking that that xidtest extension hasn't bitrotted. A fixed version 
has been pushed to the git repository, so make sure you use that version 
if you want to actually run it.


- Heikki


xidtest.tar.gz
Description: GNU Zip compressed data



Re: [HACKERS] Freezing without write I/O

2013-06-10 Thread Heikki Linnakangas

On 01.06.2013 23:21, Robert Haas wrote:

On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

We define a new page-level bit, something like PD_RECENTLY_FROZEN.
When this bit is set, it means there are no unfrozen tuples on the
page with XIDs that predate the current half-epoch.  Whenever we know
this to be true, we set the bit.  If the page LSN crosses more than
one half-epoch boundary at a time, we freeze the page and set the bit.
   If the page LSN crosses exactly one half-epoch boundary, then (1) if
the bit is set, we clear it and (2) if the bit is not set, we freeze
the page and set the bit.


Yep, I think that would work. Want to write the patch, or should I? ;-)


Have at it.


Here's a first draft. A lot of stuff missing and broken, but make 
check passes :-).


In the patch, instead of working with half-epochs, there are XID-LSN 
ranges, which can be of arbitrary size. An XID-LSN range consists of 
three values:


minlsn: The point in WAL where this range begins.
minxid - maxxid: The range of XIDs allowed in this range.

Every point in WAL belongs to exactly one range. The minxid-maxxid of 
the ranges can overlap. For example:


1. XIDs 25000942 - 2703 LSN 0/3BB9938
2. XIDs 23000742 - 2603 LSN 0/2AB9288
3. XIDs 22000721 - 2503 LSN 0/1AB8BE0
4. XIDs 2202 - 2403 LSN 0/10B1550

The invariant with the ranges is that a page with a certain LSN is only 
allowed to contain XIDs that belong to the range specified by that LSN. 
For example, if a page has LSN 0/350, it belongs to the 2nd range, 
and can only contain XIDs between 23000742 - 2603. If a backend 
updates the page, so that its LSN is updated to, say, 0/3D12345, all
XIDs on the page older than 25000942 must be frozen first, to avoid 
violating the rule.


The system keeps track of a small number of these XID-LSN ranges. Where 
we currently truncate clog, we can also truncate the ranges with maxxid 
< the clog truncation point. Vacuum removes any dead tuples and updates
relfrozenxid as usual, to make sure that there are no dead tuples or 
aborted XIDs older than the minxid of the oldest tracked XID-LSN range. 
It no longer needs to freeze old committed XIDs, however - that's the 
gain from this patch (except to uphold the invariant, if it has to 
remove some dead tuples on the page and update its LSN).


A new range is created whenever we reach the maxxid on the current one. 
The new range's minxid is set to the current global oldest xmin value, 
and maxxid is just the old maxxid plus a fixed number (1 million in the 
patch, but we probably want a higher value in reality). This ensures 
that if you modify a page and update its LSN, all the existing XIDs on 
the page that cannot be frozen yet are greater than the minxid of the 
latest range. In other words, you can always freeze old XIDs on a page, 
so that any remaining non-frozen XIDs are within the minxid-maxxid of 
the latest range.


The HeapTupleSatisfies functions are modified to look at the page's LSN 
first. If it's old enough, it doesn't look at the XIDs on the page level 
at all, and just considers everything on the page is visible to everyone 
(I'm calling this state a mature page).



I think the tricky part is going to be figuring out the
synchronization around half-epoch boundaries.


Yep. I skipped all those difficult parts in this first version. There 
are two race conditions that need to be fixed:


1. When a page is updated, we check if it needs to be frozen. If its LSN 
is greater than the latest range's LSN. IOW, if we've already modified 
the page, and thus frozen all older tuples, within the current range. 
However, it's possible that a new range is created immediately after 
we've checked that. When we then proceed to do the actual update on the 
page and WAL-log that, the new LSN falls within the next range, and we 
should've frozen the page. I'm planning to fix that by adding a parity 
bit on the page header. Each XID-LSN range is assigned a parity bit, 0 
or 1. When we check if a page needs to be frozen on update, we make note 
of the latest range's parity bit, and write it in the page header. 
Later, when we look at the page's LSN to determine which XID-LSN range 
it belongs to, we compare the parity. If the parity doesn't match, we 
know that the race condition happened, so we treat the page to belong to 
the previous range, not the one it would normally belong to, per the LSN.


2. When we look at a page, and determine that it's not old enough to be 
matured, we then check the clog as usual. A page is considered mature, 
if the XID-LSN range (and corresponding clog pages) has already been 
truncated away. It's possible that between those steps, the XID-LSN 
range and clog is truncated away, so that the backend tries to access a 
clog page that doesn't exist anymore. To fix that, the XID-LSN range and 
clog truncation needs to be done in two steps. First, mark the 
truncation point in shared memory. Then 

Re: [HACKERS] Freezing without write I/O

2013-06-10 Thread Simon Riggs
On 10 June 2013 19:58, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 01.06.2013 23:21, Robert Haas wrote:

 On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
 hlinnakan...@vmware.com  wrote:

 We define a new page-level bit, something like PD_RECENTLY_FROZEN.
 When this bit is set, it means there are no unfrozen tuples on the
 page with XIDs that predate the current half-epoch.  Whenever we know
 this to be true, we set the bit.  If the page LSN crosses more than
 one half-epoch boundary at a time, we freeze the page and set the bit.
If the page LSN crosses exactly one half-epoch boundary, then (1) if
 the bit is set, we clear it and (2) if the bit is not set, we freeze
 the page and set the bit.


 Yep, I think that would work. Want to write the patch, or should I? ;-)


 Have at it.


 Here's a first draft. A lot of stuff missing and broken, but make check
 passes :-).

Well done, looks like good progress.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-06-10 Thread Robert Haas
On Mon, Jun 10, 2013 at 4:48 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Well done, looks like good progress.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freezing without write I/O

2013-06-07 Thread Robert Haas
On Thu, Jun 6, 2013 at 6:28 PM, Greg Stark st...@mit.edu wrote:
 On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 That will keep OldestXmin from advancing. Which will keep vacuum from
 advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings
 about wrap-around, then stops new XIDs from being generated, and finally a
 forced shutdown.

 The forced shutdown will actually happen some time before going beyond 2
 billion XIDs. So it is not possible to have a long-lived transaction, older
 than 2 B XIDs, still live in the system. But let's imagine that you somehow
 bypass the safety mechanism:

 Ah, so if you do the epoch in the page header thing or Robert's LSN
 trick that I didn't follow then you'll need a new safety check against
 this. Since relfrozenxid/datfrozenxid will no longer be necessary.

Nothing proposed here gets rid of either of those, that I can see.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freezing without write I/O

2013-06-07 Thread Heikki Linnakangas

On 07.06.2013 20:54, Robert Haas wrote:

On Thu, Jun 6, 2013 at 6:28 PM, Greg Stark st...@mit.edu wrote:

On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

That will keep OldestXmin from advancing. Which will keep vacuum from
advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings
about wrap-around, then stops new XIDs from being generated, and finally a
forced shutdown.

The forced shutdown will actually happen some time before going beyond 2
billion XIDs. So it is not possible to have a long-lived transaction, older
than 2 B XIDs, still live in the system. But let's imagine that you somehow
bypass the safety mechanism:


Ah, so if you do the epoch in the page header thing or Robert's LSN
trick that I didn't follow then you'll need a new safety check against
this. Since relfrozenxid/datfrozenxid will no longer be necessary.


Nothing proposed here gets rid of either of those, that I can see.


Right. The meaning of relfrozenxid/datfrozenxid changes somewhat; it no 
longer means that all XIDs older than frozenxid are replaced with 
FrozenXid. Instead, it will mean that all XIDs older than frozenxid are 
committed, ie. all dead tuples older than that have been vacuumed away.


- Heikki




Re: [HACKERS] Freezing without write I/O

2013-06-07 Thread Simon Riggs
On 7 June 2013 19:08, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 07.06.2013 20:54, Robert Haas wrote:

 On Thu, Jun 6, 2013 at 6:28 PM, Greg Stark st...@mit.edu wrote:

 On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas
 hlinnakan...@vmware.com  wrote:

 That will keep OldestXmin from advancing. Which will keep vacuum from
 advancing relfrozenxid/datfrozenxid. Which will first trigger the
 warnings
 about wrap-around, then stops new XIDs from being generated, and finally
 a
 forced shutdown.

 The forced shutdown will actually happen some time before going beyond 2
 billion XIDs. So it is not possible to have a long-lived transaction,
 older
 than 2 B XIDs, still live in the system. But let's imagine that you
 somehow
 bypass the safety mechanism:


 Ah, so if you do the epoch in the page header thing or Robert's LSN
 trick that I didn't follow then you'll need a new safety check against
 this. Since relfrozenxid/datfrozenxid will no longer be necessary.


 Nothing proposed here gets rid of either of those, that I can see.


 Right. The meaning of relfrozenxid/datfrozenxid changes somewhat; it no
 longer means that all XIDs older than frozenxid are replaced with FrozenXid.
 Instead, it will mean that all XIDs older than frozenxid are committed, ie.
 all dead tuples older than that have been vacuumed away.

Now that I consider Greg's line of thought, the idea we focused on
here was about avoiding freezing. But Greg makes me think that we may
also wish to look at allowing queries to run longer than one epoch as
well, if the epoch wrap time is likely to come down substantially.

To do that I think we'll need to hold epoch for relfrozenxid as well,
amongst other things.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-06-07 Thread Heikki Linnakangas

On 07.06.2013 21:33, Simon Riggs wrote:

Now that I consider Greg's line of thought, the idea we focused on
here was about avoiding freezing. But Greg makes me think that we may
also wish to look at allowing queries to run longer than one epoch as
well, if the epoch wrap time is likely to come down substantially.

To do that I think we'll need to hold epoch for relfrozenxid as well,
amongst other things.


The biggest problem I see with that is that if a snapshot can be older 
than 2 billion XIDs, it must be possible to store XIDs on the same page 
that are more than 2 billion XIDs apart. All the discussed schemes where 
we store the epoch at the page level, either explicitly or derived from 
the LSN, rely on the fact that it's not currently necessary to do that. 
Currently, when one XID on a page is older than 2 billion XIDs, that old 
XID can always be replaced with FrozenXid, because there cannot be a 
snapshot old enough to not see it.


I agree that it would be nice if you could find a way around that. You 
had a suggestion on making room on the tuple header for the epoch. I'm 
not sure I like that particular proposal, but we would need something 
like that. If we settle for snapshots that can be at most, say, 512 
billion transactions old, instead of 2 billion, then we would only need 
one byte to store an epoch offset in the tuple header. Ie. deduce the 
epoch of tuples on the page from the LSN on the page header, but allow 
individual tuples to specify an offset from that deduced epoch.
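The arithmetic of the one-byte epoch offset can be sketched as follows. This is an entirely hypothetical layout, just illustrating why a single byte buys roughly 512 billion transactions of snapshot age (256 offsets times a 2-billion-XID horizon):

```python
HALF_EPOCH = 2_000_000_000  # one wraparound horizon, as in the discussion

def full_xid(page_epoch, tuple_epoch_offset, xid32):
    """Reconstruct a 'full' XID from the epoch deduced from the page's
    LSN, the tuple's one-byte backwards offset (0..255), and the 32-bit
    XID stored in the tuple header."""
    assert 0 <= tuple_epoch_offset <= 255
    return (page_epoch - tuple_epoch_offset) * HALF_EPOCH + xid32
```

Tuples on the same page may thus differ by up to 255 epoch units, which is what a snapshot far older than 2 billion XIDs would require.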


In practice, I think we're still quite far from people running into that 
2 billion XID limit on snapshot age. But maybe in a few years, after 
we've solved all the more pressing vacuum and wraparound issues that 
people currently run into before reaching that stage...


- Heikki




Re: [HACKERS] Freezing without write I/O

2013-06-07 Thread Simon Riggs
On 7 June 2013 19:56, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 07.06.2013 21:33, Simon Riggs wrote:

 Now that I consider Greg's line of thought, the idea we focused on
 here was about avoiding freezing. But Greg makes me think that we may
 also wish to look at allowing queries to run longer than one epoch as
 well, if the epoch wrap time is likely to come down substantially.

 To do that I think we'll need to hold epoch for relfrozenxid as well,
 amongst other things.

 The biggest problem I see with that is that if a snapshot can be older than
 2 billion XIDs, it must be possible to store XIDs on the same page that are
 more than 2 billion XIDs apart. All the discussed schemes where we store the
 epoch at the page level, either explicitly or derived from the LSN, rely on
 the fact that it's not currently necessary to do that. Currently, when one
 XID on a page is older than 2 billion XIDs, that old XID can always be
 replaced with FrozenXid, because there cannot be a snapshot old enough to
 not see it.

It does seem that there are two problems: avoiding freezing AND long
running queries.

The long running query problem hasn't ever been looked at, it seems,
until here and now.

 I agree that it would be nice if you could find a way around that. You had a
 suggestion on making room on the tuple header for the epoch. I'm not sure I
 like that particular proposal, but we would need something like that. If we
 settle for snapshots that can be at most, say, 512 billion transactions old,
 instead of 2 billion, then we would only need one byte to store an epoch
 offset in the tuple header. Ie. deduce the epoch of tuples on the page
 from the LSN on the page header, but allow individual tuples to specify an
 offset from that deduced epoch.

I like the modification you propose. And I like it even better because
it uses just 1 byte, which is even more easily squeezed into the
existing tuple header, whether we go with my proposed squeezing route
or not.

 In practice, I think we're still quite far from people running into that 2
 billion XID limit on snapshot age. But maybe in a few years, after we've
 solved all the more pressing vacuum and wraparound issues that people
 currently run into before reaching that stage...

Your WALInsert lock patch will fix that. ;-)

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-06-07 Thread Robert Haas
On Fri, Jun 7, 2013 at 3:10 PM, Simon Riggs si...@2ndquadrant.com wrote:
 The long running query problem hasn't ever been looked at, it seems,
 until here and now.

For what it's worth (and that may not be much), I think most people
will die a horrible death due to bloat after holding a transaction
open for a tiny fraction of 2B XIDs.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freezing without write I/O

2013-06-07 Thread Andres Freund
On 2013-06-07 20:10:55 +0100, Simon Riggs wrote:
 On 7 June 2013 19:56, Heikki Linnakangas hlinnakan...@vmware.com wrote:
  On 07.06.2013 21:33, Simon Riggs wrote:
 
  Now that I consider Greg's line of thought, the idea we focused on
  here was about avoiding freezing. But Greg makes me think that we may
  also wish to look at allowing queries to run longer than one epoch as
  well, if the epoch wrap time is likely to come down substantially.
 
  To do that I think we'll need to hold epoch for relfrozenxid as well,
  amongst other things.
 
  The biggest problem I see with that is that if a snapshot can be older than
  2 billion XIDs, it must be possible to store XIDs on the same page that are
  more than 2 billion XIDs apart. All the discussed schemes where we store the
  epoch at the page level, either explicitly or derived from the LSN, rely on
  the fact that it's not currently necessary to do that. Currently, when one
  XID on a page is older than 2 billion XIDs, that old XID can always be
  replaced with FrozenXid, because there cannot be a snapshot old enough to
  not see it.
 
 It does seem that there are two problems: avoiding freezing AND long
 running queries
 
 The long running query problem hasn't ever been looked at, it seems,
 until here and now.

I'd say that's because it's prohibitive to run so long transactions
anyway since it causes too much unremovable bloat. 2 billion transactions
really is quite a bit; I don't think it's a relevant restriction. Yet.

Let's discuss it if we have solved the other problems ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-06-07 Thread Simon Riggs
On 7 June 2013 20:16, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-06-07 20:10:55 +0100, Simon Riggs wrote:
 On 7 June 2013 19:56, Heikki Linnakangas hlinnakan...@vmware.com wrote:
  On 07.06.2013 21:33, Simon Riggs wrote:
 
  Now that I consider Greg's line of thought, the idea we focused on
  here was about avoiding freezing. But Greg makes me think that we may
  also wish to look at allowing queries to run longer than one epoch as
  well, if the epoch wrap time is likely to come down substantially.
 
  To do that I think we'll need to hold epoch for relfrozenxid as well,
  amongst other things.

  The biggest problem I see with that is that if a snapshot can be older than
  2 billion XIDs, it must be possible to store XIDs on the same page that are
  more than 2 billion XIDs apart. All the discussed schemes where we store the
  epoch at the page level, either explicitly or derived from the LSN, rely on
  the fact that it's not currently necessary to do that. Currently, when one
  XID on a page is older than 2 billion XIDs, that old XID can always be
  replaced with FrozenXid, because there cannot be a snapshot old enough to
  not see it.

 It does seem that there are two problems: avoiding freezing AND long
 running queries

 The long running query problem hasn't ever been looked at, it seems,
 until here and now.

 I'd say that's because it's prohibitively expensive to run such long
 transactions anyway, since they cause too much unremovable bloat. 2
 billion transactions really is quite a bit; I don't think it's a
 relevant restriction. Yet.

 Let's discuss it if we have solved the other problems ;)

Let me say that I think that problem is solvable also.

At the moment we allow all visible tuple versions to be linked
together, so that the last visible and latest update are linked by a
chain. If we break that assumption and say that we will never follow
an update chain from a snapshot in the distant past, then we can
remove intermediate dead rows. We currently regard those as recently
dead, but that just requires some extra thought. We still keep all
*visible* tuple versions, we just don't bother to keep all the
intermediate ones as well.

Perhaps another day, but one day.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Avoiding bloat in the presence of a long-running transaction (Re: [HACKERS] Freezing without write I/O)

2013-06-07 Thread Heikki Linnakangas

On 07.06.2013 22:15, Robert Haas wrote:

On Fri, Jun 7, 2013 at 3:10 PM, Simon Riggssi...@2ndquadrant.com  wrote:

The long running query problem hasn't ever been looked at, it seems,
until here and now.


For what it's worth (and that may not be much), I think most people
will die a horrible death due to bloat after holding a transaction
open for a tiny fraction of 2B XIDs.  :-(


Yeah, we should fix that too ;-).

While we're at it: I've been thinking that we should try harder to 
vacuum dead tuples that are no longer visible to any snapshot, even if 
there's an even older snapshot around. The stereotypical scenario is a 
table with a single row that's updated very frequently, like a counter. 
Autovacuum can normally keep it in check, but if you have a long-running 
transaction, it will bloat horrendously. But if you only have one 
long-running transaction with one really old snapshot, and everything 
else is recent, you'd really only need to keep one old tuple around for 
the old snapshot to see, and a recent version or two for the rest. At 
worst, the database needs to bloat to double its size, but not more than 
that.


To know which tuples are dead at such a fine-grained level, vacuum would 
need to know in more detail what snapshots the backends have. I'm really 
excited about Ants Aasma's proposal to use a CSN for snapshots, or more 
precisely the variant using the commit record's LSN for that. If a 
snapshot is just a single integer, it becomes much easier for backends 
to share their snapshots in a limited amount of shared memory.
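To illustrate why a single-integer snapshot is so appealing, here is a
minimal sketch of CSN-based visibility. All names here are hypothetical
(the toy commit log, XidVisibleInSnapshot); this is not a proposed API,
just the shape of the idea:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t CommitSeqNo;   /* e.g. the commit record's LSN */

/* Toy commit log mapping xid -> CSN; 0 means aborted or in progress. */
static CommitSeqNo toy_clog[16];

static CommitSeqNo
XidGetCommitSeqNo(uint32_t xid)
{
    CommitSeqNo csn = toy_clog[xid % 16];

    return csn == 0 ? UINT64_MAX : csn;
}

/* With a CSN-based snapshot, the visibility test collapses to a single
 * comparison: the inserting transaction is visible iff it committed at
 * or before the point the snapshot was taken. */
static bool
XidVisibleInSnapshot(uint32_t xid, CommitSeqNo snapshot_csn)
{
    return XidGetCommitSeqNo(xid) <= snapshot_csn;
}
```

Everything about xmax, subtransactions and hint bits is glossed over
here; the point is only that a snapshot reduces to one integer that
backends could publish cheaply in shared memory.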


- Heikki




Re: [HACKERS] Freezing without write I/O

2013-06-07 Thread Hannu Krosing
On 06/07/2013 08:56 PM, Heikki Linnakangas wrote:
 On 07.06.2013 21:33, Simon Riggs wrote:
 Now that I consider Greg's line of thought, the idea we focused on
 here was about avoiding freezing. But Greg makes me think that we may
 also wish to look at allowing queries to run longer than one epoch as
 well, if the epoch wrap time is likely to come down substantially.

 To do that I think we'll need to hold epoch for relfrozenxid as well,
 amongst other things.

 The biggest problem I see with that is that if a snapshot can be older
 than 2 billion XIDs, it must be possible to store XIDs on the same
 page that are more than 2 billion XIDs apart. 
It could be possible to recognise the situation and save the new XIDs on
*another* page?


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] Freezing without write I/O

2013-06-07 Thread Hannu Krosing
On 06/07/2013 09:16 PM, Andres Freund wrote:
 On 2013-06-07 20:10:55 +0100, Simon Riggs wrote:
 On 7 June 2013 19:56, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 07.06.2013 21:33, Simon Riggs wrote:
 Now that I consider Greg's line of thought, the idea we focused on
 here was about avoiding freezing. But Greg makes me think that we may
 also wish to look at allowing queries to run longer than one epoch as
 well, if the epoch wrap time is likely to come down substantially.

 To do that I think we'll need to hold epoch for relfrozenxid as well,
 amongst other things.
 The biggest problem I see with that is that if a snapshot can be older than
 2 billion XIDs, it must be possible to store XIDs on the same page that are
 more than 2 billion XIDs apart. All the discussed schemes where we store the
 epoch at the page level, either explicitly or derived from the LSN, rely on
 the fact that it's not currently necessary to do that. Currently, when one
 XID on a page is older than 2 billion XIDs, that old XID can always be
 replaced with FrozenXid, because there cannot be a snapshot old enough to
 not see it.
 It does seem that there are two problems: avoiding freezing AND long
 running queries

 The long running query problem hasn't ever been looked at, it seems,
 until here and now.
 I'd say that's because it's prohibitively expensive to run such long
 transactions anyway, since they cause too much unremovable bloat. 2
 billion transactions really is quite a bit; I don't think it's a
 relevant restriction. Yet.
2 billion transactions means at least 2G rows inserted, updated or
deleted, which may or may not be too much.

In my simple key/value update tests I was able to do 40k
single-transaction updates per second on a moderately sized server. At
that rate, 2G transactions take about 15 hours, which is not really
very long.
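Working that number through (a back-of-envelope sketch, not PostgreSQL
code): 2e9 XIDs at 40,000 write transactions per second comes out to
roughly 14 hours, consistent with the figure above.

```c
/* Hours of sustained write traffic needed to consume ~2 billion XIDs
 * at a given rate of write transactions per second. */
static double
hours_to_consume_2b_xids(double tps)
{
    return 2e9 / tps / 3600.0;  /* 40000 tps -> ~13.9 hours */
}
```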

One may for example have an unfinished 2PC transaction lingering over
weekend and while of course one should have monitoring in place to
avoid this, it still would be nice if this did not cause database shutdown.



-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] Freezing without write I/O

2013-06-06 Thread Greg Stark
On Fri, May 31, 2013 at 3:04 AM, Robert Haas robertmh...@gmail.com wrote:
 Even at a more modest 10,000 tps, with default
 settings, you'll do anti-wraparound vacuums of the entire cluster
 about every 8 hours.  That's not fun.

I've forgotten now. What happens if you have a long-lived transaction
still alive from > 2B XIDs ago?


-- 
greg




Re: [HACKERS] Freezing without write I/O

2013-06-06 Thread Heikki Linnakangas

On 06.06.2013 15:16, Greg Stark wrote:

On Fri, May 31, 2013 at 3:04 AM, Robert Haasrobertmh...@gmail.com  wrote:

Even at a more modest 10,000 tps, with default
settings, you'll do anti-wraparound vacuums of the entire cluster
about every 8 hours.  That's not fun.


I've forgotten now. What happens if you have a long-lived transaction
still alive from > 2B XIDs ago?


That will keep OldestXmin from advancing. Which will keep vacuum from 
advancing relfrozenxid/datfrozenxid. Which will first trigger the 
warnings about wrap-around, then stops new XIDs from being generated, 
and finally a forced shutdown.


The forced shutdown will actually happen some time before going beyond 2 
billion XIDs. So it is not possible to have a long-lived transaction, 
older than 2 B XIDs, still live in the system. But let's imagine that 
you somehow bypass the safety mechanism:


After wraparound, old tuples will look like being in the future, and 
will become invisible to new transactions. That happens even if there 
are no old transactions around. I'm not sure what exactly will happen if 
there is still a transaction alive with an XID and/or snapshots older 
than 2^31 XIDs. New tuples that are not supposed to be visible to the 
old snapshot would suddenly become visible, I guess.
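The look-like-being-in-the-future effect follows from the circular XID
comparison PostgreSQL uses. A simplified sketch (the real
TransactionIdPrecedes() also special-cases the permanent XIDs, which is
ignored here):

```c
#include <stdbool.h>
#include <stdint.h>

/* Circular 32-bit XID comparison: a precedes b iff a lies within the
 * 2^31 XIDs logically "behind" b.  Once a tuple's xmin falls further
 * than 2^31 behind the current XIDs, this comparison flips sign and
 * the tuple appears to be in the future. */
static bool
xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```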


- Heikki




Re: [HACKERS] Freezing without write I/O

2013-06-06 Thread Greg Stark
On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 That will keep OldestXmin from advancing. Which will keep vacuum from
 advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings
 about wrap-around, then stops new XIDs from being generated, and finally a
 forced shutdown.

 The forced shutdown will actually happen some time before going beyond 2
 billion XIDs. So it is not possible to have a long-lived transaction, older
 than 2 B XIDs, still live in the system. But let's imagine that you somehow
 bypass the safety mechanism:

Ah, so if you do the epoch-in-the-page-header thing, or Robert's LSN
trick that I didn't follow, then you'll need a new safety check against
this, since relfrozenxid/datfrozenxid will no longer be necessary.

-- 
greg




Re: [HACKERS] Freezing without write I/O

2013-06-01 Thread Simon Riggs
On 30 May 2013 14:33, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 Since we're bashing around ideas around freezing, let me write down the idea
 I've been pondering and discussing with various people for years. I don't
 think I invented this myself, apologies to whoever did for not giving
 credit.

 The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
 and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
 that would waste a lot of space. The trick is to add a field to the page header
 indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
 32-bit wide (*).

 The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
 wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
 could implicitly treat anything older than relfrozenxid as frozen.

 That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
 doesn't need to dirty pages that contain no dead tuples.

I have to say this is pretty spooky. I'd not read hackers all week, so
I had no idea so many other people were thinking about freezing as
well. This idea is damn near identical to what I've suggested. My
suggestion came about because I was looking to get rid of fields in the
tuple header, which didn't come to much. The good news is that it's
complete chance, so it must mean we're on the right track.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-06-01 Thread Simon Riggs
On 30 May 2013 19:39, Robert Haas robertmh...@gmail.com wrote:
 On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
 and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
 that would waste a lot of space. The trick is to add a field to the page header
 indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
 32-bit wide (*).

 Check.

 The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
 wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
 could implicitly treat anything older than relfrozenxid as frozen.

 Check.

 That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
 doesn't need to dirty pages that contain no dead tuples.

 Check.

Yes, this is the critical point. Large insert-only tables don't need
to be completely re-written twice.


 Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
 replace the XIDs with FrozenXid whenever the difference between the smallest
 and largest XID on a page exceeds 2^31. But that would only happen when
 you're updating the page, in which case the page is dirtied anyway, so it
 wouldn't cause any extra I/O.

 It would cause some extra WAL activity, but it wouldn't dirty the page
 an extra time.

 This would also be the first step in allowing the clog to grow larger than 2
 billion transactions, eliminating the need for anti-wraparound freezing
 altogether. You'd still want to truncate the clog eventually, but it would
 be nice to not be pressed against the wall with run vacuum freeze now, or
 the system will shut down.

 Interesting.  That seems like a major advantage.

 (*) Adding an epoch is inaccurate, but I like to use that as my mental
 model. If you just add a 32-bit epoch field, then you cannot have xids from
 different epochs on the page, which would be a problem. In reality, you
 would store one 64-bit XID value in the page header, and use that as the
 reference point for all the 32-bit XIDs on the tuples. See existing
 convert_txid() function for how that works. Another method is to store the
 32-bit xid values in tuple headers as offsets from the per-page 64-bit
 value, but then you'd always need to have the 64-bit value at hand when
 interpreting the XIDs, even if they're all recent.

 As I see it, the main downsides of this approach are:

 (1) It breaks binary compatibility (unless you do something to
 provided for it, like put the epoch in the special space).

 (2) It consumes 8 bytes per page.  I think it would be possible to get
 this down to say 5 bytes per page pretty easily; we'd simply decide
 that the low-order 3 bytes of the reference XID must always be 0.
 Possibly you could even do with 4 bytes, or 4 bytes plus some number
 of extra bits.

Yes, the idea of having a base XID on every page is complicated and
breaks compatibility. The same idea can work well if we do it via tuple
headers.


 (3) You still need to periodically scan the entire relation, or else
 have a freeze map as Simon and Josh suggested.

I don't think that is needed with this approach.

(The freeze map was Andres' idea, not mine. I just accepted it as what
I thought was the only way forward. Now I see other ways.)

 The upsides of this approach as compared with what Andres and I are
 proposing are:

 (1) It provides a stepping stone towards allowing indefinite expansion
 of CLOG, which is quite appealing as an alternative to a hard
 shut-down.

I would be against expansion of the CLOG beyond its current size. If
we have removed all aborted rows and marked hints, then we don't need
the CLOG values and can trim that down.

I don't mind the hints; it's the freezing we don't need.


 convert_txid() function for how that works. Another method is to store the
 32-bit xid values in tuple headers as offsets from the per-page 64-bit
 value, but then you'd always need to have the 64-bit value at hand when
 interpreting the XIDs, even if they're all recent.

You've touched here on the idea of putting the epoch in the tuple
header, which is where what I posted comes together. We don't need
anything at page level, we just need something on each tuple.

Please can you look at my recent post on how to put this in the tuple header?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-06-01 Thread Heikki Linnakangas

On 31.05.2013 06:02, Robert Haas wrote:

On Thu, May 30, 2013 at 2:39 PM, Robert Haasrobertmh...@gmail.com  wrote:

Random thought: Could you compute the reference XID based on the page
LSN?  That would eliminate the storage overhead.


After mulling this over a bit, I think this is definitely possible.
We begin a new half-epoch every 2 billion transactions.  We remember
the LSN at which the current half-epoch began and the LSN at which the
previous half-epoch began.  When a new half-epoch begins, the first
backend that wants to stamp a tuple with an XID from the new
half-epoch must first emit a new half-epoch WAL record, which
becomes the starting LSN for the new half-epoch.


Clever! Pages in unlogged tables need some extra treatment, as they 
don't normally have a valid LSN, but that shouldn't be too hard.



We define a new page-level bit, something like PD_RECENTLY_FROZEN.
When this bit is set, it means there are no unfrozen tuples on the
page with XIDs that predate the current half-epoch.  Whenever we know
this to be true, we set the bit.  If the page LSN crosses more than
one half-epoch boundary at a time, we freeze the page and set the bit.
  If the page LSN crosses exactly one half-epoch boundary, then (1) if
the bit is set, we clear it and (2) if the bit is not set, we freeze
the page and set the bit.


Yep, I think that would work. Want to write the patch, or should I? ;-)

- Heikki




Re: [HACKERS] Freezing without write I/O

2013-06-01 Thread Simon Riggs
On 1 June 2013 19:48, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 31.05.2013 06:02, Robert Haas wrote:

 On Thu, May 30, 2013 at 2:39 PM, Robert Haasrobertmh...@gmail.com
 wrote:

 Random thought: Could you compute the reference XID based on the page
 LSN?  That would eliminate the storage overhead.


 After mulling this over a bit, I think this is definitely possible.
 We begin a new half-epoch every 2 billion transactions.  We remember
 the LSN at which the current half-epoch began and the LSN at which the
 previous half-epoch began.  When a new half-epoch begins, the first
 backend that wants to stamp a tuple with an XID from the new
 half-epoch must first emit a new half-epoch WAL record, which
 becomes the starting LSN for the new half-epoch.


 Clever! Pages in unlogged tables need some extra treatment, as they don't
 normally have a valid LSN, but that shouldn't be too hard.

I like the idea of using the LSN to indicate the epoch. It saves any
other work we might consider, such as setting page or tuple level
epochs.


 We define a new page-level bit, something like PD_RECENTLY_FROZEN.
 When this bit is set, it means there are no unfrozen tuples on the
 page with XIDs that predate the current half-epoch.  Whenever we know
 this to be true, we set the bit.  If the page LSN crosses more than
 one half-epoch boundary at a time, we freeze the page and set the bit.
   If the page LSN crosses exactly one half-epoch boundary, then (1) if
 the bit is set, we clear it and (2) if the bit is not set, we freeze
 the page and set the bit.


 Yep, I think that would work. Want to write the patch, or should I? ;-)

If we set a bit, surely we need to write the page. Isn't that what we
were trying to avoid?

Why set a bit at all? If we know the LSN of the page and we know the
epoch boundaries, then we can work out when the page was last written
to and infer that the page is virtually frozen.

As soon as we make a change to a virtually frozen page, we can
actually freeze it and then make the change.

But we still have the problem of knowing which pages have been frozen
and which haven't.

Can we clear up those points first? Or at least my understanding of them.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-06-01 Thread Robert Haas
On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 We define a new page-level bit, something like PD_RECENTLY_FROZEN.
 When this bit is set, it means there are no unfrozen tuples on the
 page with XIDs that predate the current half-epoch.  Whenever we know
 this to be true, we set the bit.  If the page LSN crosses more than
 one half-epoch boundary at a time, we freeze the page and set the bit.
   If the page LSN crosses exactly one half-epoch boundary, then (1) if
 the bit is set, we clear it and (2) if the bit is not set, we freeze
 the page and set the bit.

 Yep, I think that would work. Want to write the patch, or should I? ;-)

Have at it.  I think the tricky part is going to be figuring out the
synchronization around half-epoch boundaries.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freezing without write I/O

2013-06-01 Thread Robert Haas
On Sat, Jun 1, 2013 at 3:22 PM, Simon Riggs si...@2ndquadrant.com wrote:
 If we set a bit, surely we need to write the page. Isn't that what we
 were trying to avoid?

No, the bit only gets set in situations when we were going to dirty
the page for some other reason anyway.  Specifically, if a page
modification discovers that we've switched epochs (but just once) and
the bit isn't already set, we can set it in lieu of scanning the
entire page for tuples that need freezing.

Under this proposal, pages that don't contain any dead tuples needn't
be dirtied for freezing, ever.  Smells like awesome.
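For concreteness, the boundary-crossing rules could be sketched roughly
like this. All names are hypothetical; real code would sit in the
page-modification path and would also have to WAL-log the freeze:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint64_t lsn;             /* page LSN */
    bool     recently_frozen; /* the proposed PD_RECENTLY_FROZEN bit */
} PageHdrSketch;

/* Given the start LSNs of the current and previous half-epochs, decide
 * whether a page being modified must be frozen now, updating the bit
 * per the scheme above.  Returns true if the page must be frozen. */
static bool
crossing_requires_freeze(PageHdrSketch *page,
                         uint64_t prev_halfepoch_lsn,
                         uint64_t cur_halfepoch_lsn)
{
    if (page->lsn >= cur_halfepoch_lsn)
        return false;               /* same half-epoch: nothing to do */

    if (page->lsn >= prev_halfepoch_lsn)
    {
        /* crossed exactly one half-epoch boundary */
        if (page->recently_frozen)
        {
            page->recently_frozen = false;
            return false;
        }
        page->recently_frozen = true;
        return true;                /* freeze and set the bit */
    }

    /* crossed more than one boundary: always freeze and set the bit */
    page->recently_frozen = true;
    return true;
}
```

Note that the freeze only ever happens when the page is already being
modified, and hence dirtied anyway, which is the whole point of the
scheme.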

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freezing without write I/O

2013-06-01 Thread Simon Riggs
On 1 June 2013 21:26, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Jun 1, 2013 at 3:22 PM, Simon Riggs si...@2ndquadrant.com wrote:
 If we set a bit, surely we need to write the page. Isn't that what we
 were trying to avoid?

 No, the bit only gets set in situations when we were going to dirty
 the page for some other reason anyway.  Specifically, if a page
 modification discovers that we've switched epochs (but just once) and
 the bit isn't already set, we can set it in lieu of scanning the
 entire page for tuples that need freezing.

 Under this proposal, pages that don't contain any dead tuples needn't
 be dirtied for freezing, ever.  Smells like awesome.

Agreed, well done both.

What I especially like about it is how little logic it will require,
and no page format changes.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-05-31 Thread Heikki Linnakangas

On 31.05.2013 00:06, Bruce Momjian wrote:

On Thu, May 30, 2013 at 04:33:50PM +0300, Heikki Linnakangas wrote:

This would also be the first step in allowing the clog to grow
larger than 2 billion transactions, eliminating the need for
anti-wraparound freezing altogether. You'd still want to truncate
the clog eventually, but it would be nice to not be pressed against
the wall with run vacuum freeze now, or the system will shut down.


Keep in mind that autovacuum_freeze_max_age is 200M to allow faster clog
truncation.  Does this help that?


No. If you want to keep autovacuum_freeze_max_age set at less than 2 
billion, you don't need support for more than 2 billion transactions. 
But for those who would like to set autovacuum_freeze_max_age higher 
than 2B, it would be nice to allow it.


Actually, even with autovacuum_freeze_max_age = 200 M, it would be nice 
to not have the hard stop at 2 billion, in case autovacuum falls behind 
really badly. With autovacuum_freeze_max_age = 200M, there's a lot of 
safety margin, but with 1000M or so, not so much.


- Heikki




Re: [HACKERS] Freezing without write I/O

2013-05-31 Thread Bruce Momjian
On Thu, May 30, 2013 at 10:04:23PM -0400, Robert Haas wrote:
  Hm. Why? If freezing gets notably cheaper I don't really see much need
  for keeping that much clog around? If we still run into anti-wraparound
  areas, there has to be some major operational problem.
 
 That is true, but we have a decent number of customers who do in fact
 have such problems.  I think that's only going to get more common.  As
 hardware gets faster and PostgreSQL improves, people are going to
 process more and more transactions in shorter and shorter periods of
 time.  Heikki's benchmark results for the XLOG scaling patch show
 rates of 80,000 tps.  Even at a more modest 10,000 tps, with default
 settings, you'll do anti-wraparound vacuums of the entire cluster
 about every 8 hours.  That's not fun.

Are you assuming these are all write transactions, hence consuming xids?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Freezing without write I/O

2013-05-31 Thread Robert Haas
On Fri, May 31, 2013 at 1:26 PM, Bruce Momjian br...@momjian.us wrote:
 On Thu, May 30, 2013 at 10:04:23PM -0400, Robert Haas wrote:
  Hm. Why? If freezing gets notably cheaper I don't really see much need
  for keeping that much clog around? If we still run into anti-wraparound
  areas, there has to be some major operational problem.

 That is true, but we have a decent number of customers who do in fact
 have such problems.  I think that's only going to get more common.  As
 hardware gets faster and PostgreSQL improves, people are going to
 process more and more transactions in shorter and shorter periods of
 time.  Heikki's benchmark results for the XLOG scaling patch show
 rates of 80,000 tps.  Even at a more modest 10,000 tps, with default
 settings, you'll do anti-wraparound vacuums of the entire cluster
 about every 8 hours.  That's not fun.

 Are you assuming these are all write transactions, hence consuming xids?

Well, there might be read-only transactions as well, but the point is
about how many write transactions there can be.  10,000 tps or more is
not out of the question even today, and progressively higher numbers
are only going to become more and more common.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Josh Berkus
Heikki,

This sounds a lot like my idea for 9.3, which didn't go anywhere.
You've worked out the issues I couldn't, I think.

 Another method is
 to store the 32-bit xid values in tuple headers as offsets from the
 per-page 64-bit value, but then you'd always need to have the 64-bit
 value at hand when interpreting the XIDs, even if they're all recent.

Yeah, -1 on the latter, not least because it would require a 100%
rewrite of the tables in order to upgrade.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
 and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
 that would waste a lot of space. The trick is to add a field to the page header
 indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
 32-bit wide (*).

Check.

 The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
 wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
 could implicitly treat anything older than relfrozenxid as frozen.

Check.

 That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
 doesn't need to dirty pages that contain no dead tuples.

Check.

 Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
 replace the XIDs with FrozenXid whenever the difference between the smallest
 and largest XID on a page exceeds 2^31. But that would only happen when
 you're updating the page, in which case the page is dirtied anyway, so it
 wouldn't cause any extra I/O.

It would cause some extra WAL activity, but it wouldn't dirty the page
an extra time.

 This would also be the first step in allowing the clog to grow larger than 2
 billion transactions, eliminating the need for anti-wraparound freezing
 altogether. You'd still want to truncate the clog eventually, but it would
 be nice to not be pressed against the wall with run vacuum freeze now, or
 the system will shut down.

Interesting.  That seems like a major advantage.

 (*) Adding an epoch is inaccurate, but I like to use that as my mental
 model. If you just add a 32-bit epoch field, then you cannot have xids from
 different epochs on the page, which would be a problem. In reality, you
 would store one 64-bit XID value in the page header, and use that as the
 reference point for all the 32-bit XIDs on the tuples. See existing
 convert_txid() function for how that works. Another method is to store the
 32-bit xid values in tuple headers as offsets from the per-page 64-bit
 value, but then you'd always need to have the 64-bit value at hand when
 interpreting the XIDs, even if they're all recent.

As I see it, the main downsides of this approach are:

(1) It breaks binary compatibility (unless you do something to
provided for it, like put the epoch in the special space).

(2) It consumes 8 bytes per page.  I think it would be possible to get
this down to say 5 bytes per page pretty easily; we'd simply decide
that the low-order 3 bytes of the reference XID must always be 0.
Possibly you could even do with 4 bytes, or 4 bytes plus some number
of extra bits.

(3) You still need to periodically scan the entire relation, or else
have a freeze map as Simon and Josh suggested.

The upsides of this approach as compared with what Andres and I are
proposing are:

(1) It provides a stepping stone towards allowing indefinite expansion
of CLOG, which is quite appealing as an alternative to a hard
shut-down.

(2) It doesn't place any particular requirements on PD_ALL_VISIBLE.  I
don't personally find this much of a benefit as I want to keep
PD_ALL_VISIBLE, but I know Jeff and perhaps others disagree.

Random thought: Could you compute the reference XID based on the page
LSN?  That would eliminate the storage overhead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Merlin Moncure
On Thu, May 30, 2013 at 1:39 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
 and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
 that would waste a lot space. The trick is to add a field to the page header
 indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
 32-bit wide (*).

 (3) You still need to periodically scan the entire relation, or else
 have a freeze map as Simon and Josh suggested.

Why is this scan required?

Also, what happens if you delete a tuple on a page when another tuple
on the same page, with age > 2^32, is still in an open transaction?

merlin




Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Heikki Linnakangas

On 30.05.2013 21:46, Merlin Moncure wrote:

On Thu, May 30, 2013 at 1:39 PM, Robert Haas robertmh...@gmail.com wrote:

On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
that would waste a lot space. The trick is to add a field to the page header
indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
32-bit wide (*).


(3) You still need to periodically scan the entire relation, or else
have a freeze map as Simon and Josh suggested.


Why is this scan required?


To find all the dead tuples and remove them, and advance relfrozenxid. 
That in turn is required so that you can truncate the clog. This scheme 
relies on assuming that everything older than relfrozenxid committed, so 
if there are any aborted XIDs present in the table, you can't advance 
relfrozenxid past them.


Come to think of it, if there are no aborted XIDs in a range of XIDs, 
only commits, then you could just advance relfrozenxid past that range 
and truncate away the clog, without scanning the table. But that's quite 
a special case - generally there would be at least a few aborted XIDs - 
so it's probably not worth adding any special code to take advantage of 
that.



Also, what happens if you delete a tuple on a page when another tuple
on the same page, with age > 2^32, is still in an open transaction?


Can't let that happen. Same as today.

- Heikki




Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Andres Freund
On 2013-05-30 14:39:46 -0400, Robert Haas wrote:
  Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
  replace the XIDs with FrozenXid whenever the difference between the smallest
  and largest XID on a page exceeds 2^31. But that would only happen when
  you're updating the page, in which case the page is dirtied anyway, so it
  wouldn't cause any extra I/O.
 
 It would cause some extra WAL activity, but it wouldn't dirty the page
 an extra time.

You probably could do it similarly to how we currently do
XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The
insert/update/delete record will already contain a FPI if necessary, so
that should be safe.

  This would also be the first step in allowing the clog to grow larger than 2
  billion transactions, eliminating the need for anti-wraparound freezing
  altogether. You'd still want to truncate the clog eventually, but it would
  be nice to not be pressed against the wall with "run vacuum freeze now, or
  the system will shut down".

 Interesting.  That seems like a major advantage.

Hm. Why? If freezing gets notably cheaper I don't really see much need
for keeping that much clog around? If we still run into anti-wraparound
areas, there has to be some major operational problem.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Bruce Momjian
On Thu, May 30, 2013 at 04:33:50PM +0300, Heikki Linnakangas wrote:
 This would also be the first step in allowing the clog to grow
 larger than 2 billion transactions, eliminating the need for
 anti-wraparound freezing altogether. You'd still want to truncate
 the clog eventually, but it would be nice to not be pressed against
 the wall with "run vacuum freeze now, or the system will shut down".

Keep in mind that autovacuum_freeze_max_age is 200M to allow faster clog
truncation.  Does this help that?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 3:22 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-05-30 14:39:46 -0400, Robert Haas wrote:
  Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
  replace the XIDs with FrozenXid whenever the difference between the smallest
  and largest XID on a page exceeds 2^31. But that would only happen when
  you're updating the page, in which case the page is dirtied anyway, so it
  wouldn't cause any extra I/O.

 It would cause some extra WAL activity, but it wouldn't dirty the page
 an extra time.

 You probably could do it similarly to how we currently do
 XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The
 insert/update/delete record will already contain a FPI if necessary, so
 that should be safe.

Ah, good point.

  This would also be the first step in allowing the clog to grow larger than 2
  billion transactions, eliminating the need for anti-wraparound freezing
  altogether. You'd still want to truncate the clog eventually, but it would
  be nice to not be pressed against the wall with "run vacuum freeze now, or
  the system will shut down".

 Interesting.  That seems like a major advantage.

 Hm. Why? If freezing gets notably cheaper I don't really see much need
 for keeping that much clog around? If we still run into anti-wraparound
 areas, there has to be some major operational problem.

That is true, but we have a decent number of customers who do in fact
have such problems.  I think that's only going to get more common.  As
hardware gets faster and PostgreSQL improves, people are going to
process more and more transactions in shorter and shorter periods of
time.  Heikki's benchmark results for the XLOG scaling patch show
rates of 80,000 tps.  Even at a more modest 10,000 tps, with default
settings, you'll do anti-wraparound vacuums of the entire cluster
about every 8 hours.  That's not fun.

Being able to do such scans only of the not-all-visible pages would be
a huge step forward, of course.  But not having to do them on any
particular deadline would be a whole lot better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 2:39 PM, Robert Haas robertmh...@gmail.com wrote:
 Random thought: Could you compute the reference XID based on the page
 LSN?  That would eliminate the storage overhead.

After mulling this over a bit, I think this is definitely possible.
We begin a new half-epoch every 2 billion transactions.  We remember
the LSN at which the current half-epoch began and the LSN at which the
previous half-epoch began.  When a new half-epoch begins, the first
backend that wants to stamp a tuple with an XID from the new
half-epoch must first emit a new "half-epoch" WAL record, which
becomes the starting LSN for the new half-epoch.

We define a new page-level bit, something like PD_RECENTLY_FROZEN.
When this bit is set, it means there are no unfrozen tuples on the
page with XIDs that predate the current half-epoch.  Whenever we know
this to be true, we set the bit.  If the page LSN crosses more than
one half-epoch boundary at a time, we freeze the page and set the bit.
If the page LSN crosses exactly one half-epoch boundary, then (1) if
the bit is set, we clear it and (2) if the bit is not set, we freeze
the page and set the bit.  The advantage of this is that we avoid an
epidemic of freezing right after a half-epoch change.  Immediately
after a half-epoch change, many pages will mix tuples from the current
and previous half-epoch - but relatively few pages will have tuples
from the current half-epoch and a half-epoch more than one in the
past.
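
A rough sketch of that rule (hypothetical names; the half-epoch numbers would in reality be derived from the page LSN and the recorded boundary LSNs):

```python
# Sketch of the PD_RECENTLY_FROZEN rule described above.  Half-epochs
# are represented by their ordinal number here; names are illustrative
# assumptions, not PostgreSQL code.

def on_page_update(page_half_epoch, current_half_epoch, recently_frozen):
    """Return (must_freeze_page, new_recently_frozen_bit)."""
    crossed = current_half_epoch - page_half_epoch
    if crossed == 0:
        return (False, recently_frozen)  # same half-epoch: nothing to do
    if crossed > 1:
        return (True, True)              # crossed several boundaries: freeze now
    if recently_frozen:
        return (False, False)            # bit vouches for the page; just clear it
    return (True, True)                  # crossed one boundary unfrozen: freeze
```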

As things stand today, we really only need to remember the last two
half-epoch boundaries; they could be stored, for example, in the
control file.  But if we someday generalize CLOG to allow indefinite
retention as you suggest, we could instead remember all half-epoch
boundaries that have ever occurred; just maintain a file someplace
with 8 bytes of data for every 2 billion XIDs consumed over the
lifetime of the cluster.  In fact, we might want to do it that way
anyhow, just to keep our options open, and perhaps for forensics.
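
Under that scheme the lookup could be as simple as the following sketch (the constant and names are illustrative assumptions, not a definitive layout):

```python
# Sketch of the half-epoch boundary file described above: entry i is
# the LSN at which half-epoch i began, i.e. 8 bytes of data per
# 2 billion XIDs consumed over the lifetime of the cluster.

HALF_EPOCH_XIDS = 2_000_000_000  # "a new half-epoch every 2 billion transactions"

def half_epoch_of(full_xid):
    """Ordinal of the half-epoch that a 64-bit XID belongs to."""
    return full_xid // HALF_EPOCH_XIDS

def boundary_lsn(boundary_lsns, full_xid):
    """Starting LSN of the half-epoch containing full_xid."""
    return boundary_lsns[half_epoch_of(full_xid)]

lsns = [0, 0x1000, 0x2000]  # one entry per half-epoch, oldest first
assert boundary_lsn(lsns, 4_500_000_000) == 0x2000
```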

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

