Re: [HACKERS] logical changeset generation v6.8

2013-12-17 Thread Andres Freund
On 2013-12-16 00:53:10 -0500, Robert Haas wrote:
  Yes, I think we could mostly reuse it, we'd probably want to add a field
  or two more (application_name, sync_prio?). I have been wondering
  whether some of the code in replication/logical/logical.c shouldn't be
  in replication/slot.c or similar. So far I've opted for leaving it in
  its current place since it would have to change a bit for a more general
  role.
 
 I strongly favor moving the slot-related code to someplace with slot
 in the name, and replication/slot.c seems about right.  Even if we
 don't extend them to cover non-logical replication in this release,
 we'll probably do it eventually, and it'd be better if that didn't
 require moving large amounts of code between files.

Any opinion on the storage location of the slot files? It's currently
pg_llog/$slotname/state[.tmp]. It's a directory so we have a location
during logical decoding to spill data to...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.8

2013-12-17 Thread Robert Haas
On Tue, Dec 17, 2013 at 7:48 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-12-16 00:53:10 -0500, Robert Haas wrote:
  Yes, I think we could mostly reuse it, we'd probably want to add a field
  or two more (application_name, sync_prio?). I have been wondering
  whether some of the code in replication/logical/logical.c shouldn't be
  in replication/slot.c or similar. So far I've opted for leaving it in
  its current place since it would have to change a bit for a more general
  role.

 I strongly favor moving the slot-related code to someplace with slot
 in the name, and replication/slot.c seems about right.  Even if we
 don't extend them to cover non-logical replication in this release,
 we'll probably do it eventually, and it'd be better if that didn't
 require moving large amounts of code between files.

 Any opinion on the storage location of the slot files? It's currently
 pg_llog/$slotname/state[.tmp]. It's a directory so we have a location
 during logical decoding to spill data to...

pg_replslot?  pg_replication_slot?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.8

2013-12-16 Thread Andres Freund
Hi Robert,

On 2013-12-16 00:53:10 -0500, Robert Haas wrote:
 On Wed, Dec 11, 2013 at 7:08 PM, Andres Freund and...@2ndquadrant.com wrote:
  I don't think there's much point in including remapping in all of
  the error messages.  It adds burden for translators and users won't
  know what a remapping file is anyway.
 
  It helps in locating wich part of the code caused a problem. I utterly
  hate to get reports with error messages that I can't correlate to a
  sourcefile. Yes, I know that verbose error output exists, but usually
  you don't get it that way... That said, I'll try to make the messages
  simpler.
 
 Well, we could adopt a policy of not making messages originating from
 different locations in the code the same.  However, it looks to me
 like that's NOT the current policy, because some care has been taken
 to reuse messages rather than distinguish them.

To me that mostly looks like cases where we either don't want to tell
more for various security-ish purposes or where they have been copy and
pasted...

 There's no hard and
 fast rule here, because some cases are distinguished, but my gut
 feeling is that all of the errors your patch introduces are
 sufficiently obscure cases that separate messages with separate
 translations are not warranted.

Perhaps we should just introduce a marker that some such strings are not
to be translated if they are of the unexpected kind. That would probably
make debugging easier too ;)

 I need to examine that logic in more detail,
 but I had trouble following it and was hoping the next version would
 be better-commented.

Yea, I've started expanding the comments about the higher level concerns
- I've been so knee deep in this that I didn't realize they weren't
there.

  b) make hot_standby_feedback work across disconnections of the walsender
 connection (i.e peg xmin, not just for catalogs though)
 
 Check; might need to be optional.

Yea, I am pretty sure it will. It'd probably pretty nasty to set
min_recovery_apply_delay=7d and force xmin kept to that...

  c) Make sure we can transport those across cascading
 replication.
 
 Not sure I follow.

Consider a replication scenario like primary - standby-1 -
standby-2. The primary may not only not remove data that standby-1
requires, but also not data that standby-2 needs. Not really necessary
for WAL since that will also reside on standby-1 but definitely for the
xmin horizon.
So standby-1 will need to signal not only his own needs, but also of the
nodes below.

  The hard questions that I see are like:
  * How do we manage standby registration? Does the admin have to do that
manually? Or does a standby register itself automatically if some config
paramter is set?
  * If automatically, how do we deal with the situation that registrant
dies before noting his own identifier somewhere persistent? My best idea
is a two phase registration process where registration in phase 1 are
thrown away after a restart, but yuck.

 If you don't know the answers to these questions for the kind of
 replication that we have now, then how do you know the answers for
 logical replication?  Conversely, what makes the answers that you've
 selected for logical replication unsuitable for our existing
 replication?

There's a pretty fundamental difference imo - with the logical decoding
stuff we only supply support for change producing nodes, with physical
rep we supply both.
There's no need to decide about the way node ids are stored in in-core logical
rep. consumers since there are no in-core ones. Yet. Also, physical rep
by now is a pretty established thing, we need to be much more careful
about compatibility there.

 I have to admit that before I saw your design for the logical
 replication slots, the problem of making this work for our existing
 replication stuff seemed almost intractable to me; I had no idea how
 that was going to work.

Believe me, it caused me some headaches to deceive it for decoding
too. Oh, and I think I watched just about all episodes of some stupid TV
show during it ;)

 Keeping the data in shared memory,
 persisting them across shutdowns, and managing them via either
 function calls or the replication command language seems perfect.

Thanks. I think the concept has quite some merits. The implementation is
a bit simplistic atm, we e.g. might want to work harder at coalescing
fsync()s and such, but that's a further step when we see whether it's
worthwile in the real world.

 Now, in terms of how registration works, whether for physical
 replication or logical, it seems to me that the DBA will have to be
 responsible for telling each client the name of the slot to which that
 client should connect (omitting it at said DBA's option if, as for
 physical replication, the slot is not mandatory).

It seems reasonable to me to reuse the application_name for the slot's
name, similar to the way it's used for synchronous rep. It seems odd to
use two different ways to identify nodes. t should 

Re: [HACKERS] logical changeset generation v6.8

2013-12-16 Thread Robert Haas
On Mon, Dec 16, 2013 at 6:01 AM, Andres Freund and...@2ndquadrant.com wrote:
 There's no hard and
 fast rule here, because some cases are distinguished, but my gut
 feeling is that all of the errors your patch introduces are
 sufficiently obscure cases that separate messages with separate
 translations are not warranted.

 Perhaps we should just introduce a marker that some such strings are not
 to be translated if they are of the unexpected kind. That would probably
 make debugging easier too ;)

Well, we have that: it's called elog.  But that doesn't seem like the
right thing here.

  b) make hot_standby_feedback work across disconnections of the walsender
 connection (i.e peg xmin, not just for catalogs though)

 Check; might need to be optional.

 Yea, I am pretty sure it will. It'd probably pretty nasty to set
 min_recovery_apply_delay=7d and force xmin kept to that...

Yes, that would be... unfortunate.

  c) Make sure we can transport those across cascading
 replication.

 Not sure I follow.

 Consider a replication scenario like primary - standby-1 -
 standby-2. The primary may not only not remove data that standby-1
 requires, but also not data that standby-2 needs. Not really necessary
 for WAL since that will also reside on standby-1 but definitely for the
 xmin horizon.
 So standby-1 will need to signal not only his own needs, but also of the
 nodes below.

True.

  The hard questions that I see are like:
  * How do we manage standby registration? Does the admin have to do that
manually? Or does a standby register itself automatically if some config
paramter is set?
  * If automatically, how do we deal with the situation that registrant
dies before noting his own identifier somewhere persistent? My best idea
is a two phase registration process where registration in phase 1 are
thrown away after a restart, but yuck.

 If you don't know the answers to these questions for the kind of
 replication that we have now, then how do you know the answers for
 logical replication?  Conversely, what makes the answers that you've
 selected for logical replication unsuitable for our existing
 replication?

 There's a pretty fundamental difference imo - with the logical decoding
 stuff we only supply support for change producing nodes, with physical
 rep we supply both.

I'm not sure I follow this.  Both what and what?

 There's no need to decide about the way node ids are stored in in-core logical
 rep. consumers since there are no in-core ones. Yet.

I don't know that we have or need to make any judgements about how to
store node IDs.  You have decided that slots have names, and I see no
problem there.

 Also, physical rep
 by now is a pretty established thing, we need to be much more careful
 about compatibility there.

I don't think we should change anything in backward-incompatible
fashion.  If we add any new behavior, it'd surely be optional.

 I think we need to improve the monitoring facilities a bit, and that
 should be it. Like
 * expose xmin in pg_stat_activity, pg_prepared_xacts,
   pg_replication_slots (or whatever it's going to be called)
 * expose the last restartpoint's redo pointer in pg_stat_replication, 
 pg_replication_slots

+1.

 That said, the consequences can be a bit harsher than a full disk - the
 anti-wraparound security measures might kick in requiring a restart into
 single user mode. That's way more confusing than cleaning up a bit of
 space on the disk.

Yes, true.  I'm not sure what approach to that problem is best.  It's
long seemed to me that well before we get to the point of shutting
down the whole cluster we ought to just start killing sessions with
old xmins.  But that doesn't generalize well to prepared transactions,
which can't just be rolled back or committed without guidance; and
killing slots seems a bit dicey too.

 Consider what happens though, if you promote a node for physical rep. As
 soon as it's promoted, it will accept writes and then start a
 checkpoint. Unless other standbys have started to follow that node
 before either that checkpoint happens (removing WAL) or
 autovacuuming/hot-pruning is performed (creating recovery conflicts),
 we'll possibly loose the data required to let the standbys follow the
 promotion. Note that wal_keep_segments and vacuum_defer_cleanup_age both
 sorta work for that...

True.

 Could somebody please deliver me a time dilation device?

Upon reflection, I am less concerned with actually having physical
slots in this release than I am with making sure we're not boxing
ourselves into a corner that will make them hard to add later.  If
we've got a clear design that can be generalized to that case, but the
SMOP required exceeds what can be done in the time available, I am OK
to punt it.  But I am not sure we're at that point yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make 

Re: [HACKERS] logical changeset generation v6.8

2013-12-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Dec 16, 2013 at 6:01 AM, Andres Freund and...@2ndquadrant.com wrote:
 Perhaps we should just introduce a marker that some such strings are not
 to be translated if they are of the unexpected kind. That would probably
 make debugging easier too ;)

 Well, we have that: it's called elog.  But that doesn't seem like the
 right thing here.

errmsg_internal?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.8

2013-12-16 Thread Robert Haas
On Mon, Dec 16, 2013 at 12:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Mon, Dec 16, 2013 at 6:01 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 Perhaps we should just introduce a marker that some such strings are not
 to be translated if they are of the unexpected kind. That would probably
 make debugging easier too ;)

 Well, we have that: it's called elog.  But that doesn't seem like the
 right thing here.

 errmsg_internal?

There's that, too.  But again, these messages are not can't-happen
scenarios.  The argument is just whether to reuse existing error
message text (like could not write file) or invent a new variation
(like could not write remapping file).  Andres' argument (which is
valid) is that distinguished messages make it easier to troubleshoot
without needing to turn on verbose error messages.  My argument (which
I think is also valid) is that a user isn't likely to know what a
remapping file is, and having more messages increases the translation
burden.  Is there a project policy on this topic?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.8

2013-12-16 Thread Alvaro Herrera
Robert Haas escribió:

 There's that, too.  But again, these messages are not can't-happen
 scenarios.  The argument is just whether to reuse existing error
 message text (like could not write file) or invent a new variation
 (like could not write remapping file).  Andres' argument (which is
 valid) is that distinguished messages make it easier to troubleshoot
 without needing to turn on verbose error messages.  My argument (which
 I think is also valid) is that a user isn't likely to know what a
 remapping file is, and having more messages increases the translation
 burden.  Is there a project policy on this topic?

I would vote for a generic could not write file %s where the %s lets
the troubleshooter know the path of the file, and thus in what context
it is being read.  We already have a similar case where slru.c reports
error as pertaining to transaction 12345 but the path is
pg_subtrans/xyz or multixact etc; while it doesn't explicitely say
what module is raising the error, it's pretty clear from the path.

Would that not work here?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.8

2013-12-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 There's that, too.  But again, these messages are not can't-happen
 scenarios.  The argument is just whether to reuse existing error
 message text (like could not write file) or invent a new variation
 (like could not write remapping file).

As long as the message includes the file name, which it surely oughta,
I don't see that we need any explanation of what Postgres thinks the
file is for.  If someone cares about that they can reverse-engineer
it from the file name; while as you said upthread, most of the time
the directory path is going to be the key piece of useful information.

So +1 for could not write file.

 Andres' argument (which is
 valid) is that distinguished messages make it easier to troubleshoot
 without needing to turn on verbose error messages.  My argument (which
 I think is also valid) is that a user isn't likely to know what a
 remapping file is, and having more messages increases the translation
 burden.  Is there a project policy on this topic?

I think Andres' argument is a thinly veiled version of let's put the
routine name into the message text, which there definitely is project
policy against (see 49.3.13 in the message style guide).  If you want to
know the code location where the error was thrown, the answer is to get
a verbose log, not to put identifying information into the user-facing
message text.  And this is only partially-identifying information,
which seems like the worst of both worlds: you've got confused users and
overworked translators, and you still don't know exactly where it was
thrown from.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.8

2013-12-15 Thread Robert Haas
On Wed, Dec 11, 2013 at 7:08 PM, Andres Freund and...@2ndquadrant.com wrote:
 I don't think there's much point in including remapping in all of
 the error messages.  It adds burden for translators and users won't
 know what a remapping file is anyway.

 It helps in locating wich part of the code caused a problem. I utterly
 hate to get reports with error messages that I can't correlate to a
 sourcefile. Yes, I know that verbose error output exists, but usually
 you don't get it that way... That said, I'll try to make the messages
 simpler.

Well, we could adopt a policy of not making messages originating from
different locations in the code the same.  However, it looks to me
like that's NOT the current policy, because some care has been taken
to reuse messages rather than distinguish them.  There's no hard and
fast rule here, because some cases are distinguished, but my gut
feeling is that all of the errors your patch introduces are
sufficiently obscure cases that separate messages with separate
translations are not warranted.  The time when this is likely to fail
is when someone borks the permissions on the data directory, and the
user probably won't need to care exactly which file we can't write.

 +   if (!TransactionIdIsNormal(xmax))
 +   {
 +   /*
 +* no xmax is set, can't have any permanent ones, so
 this check is
 +* sufficient
 +*/
 +   }
 +   else if (HEAP_XMAX_IS_LOCKED_ONLY(new_tuple-t_data-t_infomask))
 +   {
 +   /* only locked, we don't care */
 +   }
 +   else if (!TransactionIdPrecedes(xmax, cutoff))
 +   {
 +   /* tuple has been deleted recently, log */
 +   do_log_xmax = true;
 +   }

 Obfuscated.  Rewrite without empty blocks.

 I don't understand why having an empty block is less clear than
 including a condition in several branches? Especially if the individual
 conditions might need to be documented?

It's not a coding style we typically use, to my knowledge.  Of course,
what is actually clearest is a matter of opinion, but my own
experience is that such blocks typically end up seeming clearer to me
when the individual comments are joined together into a paragraph that
explains the logic in full sentences what's going on.

 +   /*
 +* Write out buffer everytime we've too many in-memory entries.
 +*/
 +   if (state-rs_num_rewrite_mappings = 1000 /* arbitrary number */)
 +   logical_heap_rewrite_flush_mappings(state);

 What happens if the number of items per hash table entry is small but
 the number of entries is large?

 rs_num_rewrite_mappings is the overall number of in-memory mappings, not
 the number of per-entry mappings. That means we flush, even if all
 entries have only a couple of mappings, as soon as 1000 in memory
 entries have been collected. A bit simplistic, but seems okay enough?

Possibly, not sure yet.  I need to examine that logic in more detail,
but I had trouble following it and was hoping the next version would
be better-commented.

 I'm still unhappy that we're introducing logical decoding slots but no
 analogue for physical replication.  If we had the latter, would it
 share meaningful amounts of structure with this?

 Yes, I think we could mostly reuse it, we'd probably want to add a field
 or two more (application_name, sync_prio?). I have been wondering
 whether some of the code in replication/logical/logical.c shouldn't be
 in replication/slot.c or similar. So far I've opted for leaving it in
 its current place since it would have to change a bit for a more general
 role.

I strongly favor moving the slot-related code to someplace with slot
in the name, and replication/slot.c seems about right.  Even if we
don't extend them to cover non-logical replication in this release,
we'll probably do it eventually, and it'd be better if that didn't
require moving large amounts of code between files.

 I still think that the technical parts of properly supporting persistent
 slots for physical rep aren't that hard, but that the behavioural
 decisions are. I think there are primarily two things for SR that we
 want to prevent using slots:
 a) removal of still used WAL (i.e. maintain knowledge about a standby's
last required REDO location)

Check.

 b) make hot_standby_feedback work across disconnections of the walsender
connection (i.e peg xmin, not just for catalogs though)

Check; might need to be optional.

 c) Make sure we can transport those across cascading
replication.

Not sure I follow.

 once those are there it's also useful to keep a bit more information
 about the state of replicas:
 * write, flush, apply lsn

Check.

 The hard questions that I see are like:
 * How do we manage standby registration? Does the admin have to do that
   manually? Or does a standby register itself automatically if some config
   paramter is set?
 * If automatically, how do we deal with the situation 

Re: [HACKERS] logical changeset generation v6.8

2013-12-13 Thread David Rowley
On Wed, Dec 11, 2013 at 1:11 PM, Robert Haas robertmh...@gmail.com wrote:

 Committed #1 (again).  Regarding this:


This introduced a new compiler warning on the visual studios build:
  d:\postgres\b\src\backend\utils\cache\relcache.c(3958): warning C4715:
'RelationGetIndexAttrBitmap' : not all control paths return a value
[D:\Postgres\b\postgres.vcxproj]

The attached patch fixes it.

Regards

David Rowley


 +   /* XXX: we could also do this unconditionally, the space is used
 anyway
 +   if (copy_oid)
 +   HeapTupleSetOid(key_tuple, HeapTupleGetOid(tp));

 I would like to put in a big +1 for doing that unconditionally.  I
 didn't make that change before committing, but I think it'd be a very
 good idea.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



relcache_fix_warning.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.8

2013-12-13 Thread Andres Freund
On 2013-12-13 20:58:24 +1300, David Rowley wrote:
 On Wed, Dec 11, 2013 at 1:11 PM, Robert Haas robertmh...@gmail.com wrote:
 This introduced a new compiler warning on the visual studios build:
   d:\postgres\b\src\backend\utils\cache\relcache.c(3958): warning C4715:
 'RelationGetIndexAttrBitmap' : not all control paths return a value
 [D:\Postgres\b\postgres.vcxproj]
 
 The attached patch fixes it.

I thought we'd managed to get elog(ERROR) properly annotated as noreturn
on msvc as well?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.8

2013-12-13 Thread David Rowley
On Sat, Dec 14, 2013 at 12:12 AM, Andres Freund and...@2ndquadrant.comwrote:

 On 2013-12-13 20:58:24 +1300, David Rowley wrote:
  On Wed, Dec 11, 2013 at 1:11 PM, Robert Haas robertmh...@gmail.com
 wrote:
  This introduced a new compiler warning on the visual studios build:
d:\postgres\b\src\backend\utils\cache\relcache.c(3958): warning C4715:
  'RelationGetIndexAttrBitmap' : not all control paths return a value
  [D:\Postgres\b\postgres.vcxproj]
 
  The attached patch fixes it.

 I thought we'd managed to get elog(ERROR) properly annotated as noreturn
 on msvc as well?


It looks like this is down to the elog macro, where the elevel is being
assigned to a variable elevel_ then we're only doing pg_unreachable(); if
elevel_ = ERROR. The compiler must not be confident enough to optimise out
the if condition even though the elevel is not changed after it is set from
the constant.

Regards

David Rowley


 Greetings,

 Andres Freund

 --
  Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] logical changeset generation v6.8

2013-12-11 Thread Andres Freund
On 2013-12-10 19:11:03 -0500, Robert Haas wrote:
 Committed #1 (again).

Thanks!

  Regarding this:
 
 +   /* XXX: we could also do this unconditionally, the space is used 
 anyway
 +   if (copy_oid)
 +   HeapTupleSetOid(key_tuple, HeapTupleGetOid(tp));
 
 I would like to put in a big +1 for doing that unconditionally.  I
 didn't make that change before committing, but I think it'd be a very
 good idea.

Ok. I wasn't sure if it wouldn't be wierd to include the oid in the
tuple logged for a replica identity that doesn't cover the oid. But the
downside is pretty small...

Will send a patch.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.8

2013-12-11 Thread Andres Freund
On 2013-12-10 19:11:03 -0500, Robert Haas wrote:
 Committed #1 (again).  Regarding this:
 
 +   /* XXX: we could also do this unconditionally, the space is used 
 anyway
 +   if (copy_oid)
 +   HeapTupleSetOid(key_tuple, HeapTupleGetOid(tp));
 
 I would like to put in a big +1 for doing that unconditionally.  I
 didn't make that change before committing, but I think it'd be a very
 good idea.

Patch attached.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
From 47c93d8e7afbcaef268de66246571cdc2f134c97 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Wed, 11 Dec 2013 17:20:27 +0100
Subject: [PATCH] Dep't of second thoughts: Always include oids in WAL logged
 replica identities.

Since replica identities are logged using the normal format for heap
tuples, the space for oids in WITH OIDS tables was already used, so
there's little point in only including the oid if it is included in
the configured IDENTITY.

Per comment from Robert Haas.
---
 src/backend/access/heap/heapam.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 249fffe..e647453 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6638,7 +6638,6 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed, bool *
 	TupleDesc	idx_desc;
 	char		replident = relation-rd_rel-relreplident;
 	HeapTuple	key_tuple = NULL;
-	bool		copy_oid = false;
 	bool		nulls[MaxHeapAttributeNumber];
 	Datum		values[MaxHeapAttributeNumber];
 	int			natt;
@@ -6698,7 +6697,8 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed, bool *
 		int attno = idx_rel-rd_index-indkey.values[natt];
 
 		if (attno == ObjectIdAttributeNumber)
-			copy_oid = true;
+			/* copied below */
+			;
 		else if (attno  0)
 			elog(ERROR, system column in index);
 		else
@@ -6709,8 +6709,12 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed, bool *
 	*copy = true;
 	RelationClose(idx_rel);
 
-	/* XXX: we could also do this unconditionally, the space is used anyway */
-	if (copy_oid)
+	/*
+	 * Always copy oids if the table has them, even if not included in the
+	 * index. The space in the logged tuple is used anyway, so there's little
+	 * point in not including the information.
+	 */
+	if (relation-rd_rel-relhasoids)
 		HeapTupleSetOid(key_tuple, HeapTupleGetOid(tp));
 
 	/*
-- 
1.8.5.rc2.dirty


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.8

2013-12-11 Thread Robert Haas
On Wed, Dec 11, 2013 at 11:25 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-12-10 19:11:03 -0500, Robert Haas wrote:
 Committed #1 (again).  Regarding this:

 +   /* XXX: we could also do this unconditionally, the space is used 
 anyway
 +   if (copy_oid)
 +   HeapTupleSetOid(key_tuple, HeapTupleGetOid(tp));

 I would like to put in a big +1 for doing that unconditionally.  I
 didn't make that change before committing, but I think it'd be a very
 good idea.

 Patch attached.

Committed with kibitzing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.8

2013-12-11 Thread Andres Freund
Hi,

Lots of sensible comments removed, I plan to make changes to
address them.

On 2013-12-10 22:17:44 -0500, Robert Haas wrote:
 - I think this needs SGML documentation, same kind of thing we have
 for background workers, except probably significantly more.  A design
 document with ASCII art in a different patch does not constitute
 usable documentation.  I think it would fit best under section VII,
 internals, perhaps entitled Writing a Logical Decoding Plugin.
 Looking at it, I rather wonder if the Background Worker Processes
 ought to be there as well, rather than under section V, server
 programming.

Completely agreed it needs that. I'd like the UI decisions in
http://www.postgresql.org/message-id/20131205001520.ga8...@awork2.anarazel.de
resolved before though.

 +   logical_xmin =
 +   ((volatile LogicalDecodingCtlData*) LogicalDecodingCtl)-xmin;
 
 Ugly.

We really should have a macro for this...

 Message style.  I suggest treating a short write as ENOSPC; there is
 precedent elsewhere.
 
 I don't think there's much point in including remapping in all of
 the error messages.  It adds burden for translators and users won't
 know what a remapping file is anyway.

It helps in locating wich part of the code caused a problem. I utterly
hate to get reports with error messages that I can't correlate to a
sourcefile. Yes, I know that verbose error output exists, but usually
you don't get it that way... That said, I'll try to make the messages
simpler.

 +   /* use *GetUpdateXid to correctly deal with multixacts */
 +   xmax = HeapTupleHeaderGetUpdateXid(new_tuple-t_data);

 I don't feel enlightened by that comment.

Well, it will return the update xid of a tuple locked with NO KEY SHARE
and updated (with NO KEY UPDATE). I.e. 9.3+ foreign key locking stuff.

 +   if (!TransactionIdIsNormal(xmax))
 +   {
 +   /*
 +* no xmax is set, can't have any permanent ones, so
 this check is
 +* sufficient
 +*/
 +   }
 +   else if (HEAP_XMAX_IS_LOCKED_ONLY(new_tuple-t_data-t_infomask))
 +   {
 +   /* only locked, we don't care */
 +   }
 +   else if (!TransactionIdPrecedes(xmax, cutoff))
 +   {
 +   /* tuple has been deleted recently, log */
 +   do_log_xmax = true;
 +   }
 
 Obfuscated.  Rewrite without empty blocks.

I don't understand why having an empty block is less clear than
including a condition in several branches? Especially if the individual
conditions might need to be documented?

 +   /*
 +* Write out buffer everytime we've too many in-memory entries.
 +*/
 +   if (state-rs_num_rewrite_mappings = 1000 /* arbitrary number */)
 +   logical_heap_rewrite_flush_mappings(state);
 
 What happens if the number of items per hash table entry is small but
 the number of entries is large?

rs_num_rewrite_mappings is the overall number of in-memory mappings, not
the number of per-entry mappings. That means we flush, even if all
entries have only a couple of mappings, as soon as 1000 in memory
entries have been collected. A bit simplistic, but seems okay enough?

 +   /* XXX: should we warn about such files? */
 
 Nah.

Ok, will remove that comment from a couple locations then...

 +   /* XXX: we could replay the transaction and
 prepare it as well. */
 
 Should we do that?

It would allow a neat feature, namely using 2PC to make sure that a
transaction commit on all the nodes connected using changeset
extraction. But I think that's a feature for later. Its use would have
to be optional anyway.


 All right, I'm giving up for now.  These patches need a LOT of work on
 comments and documentation to be committable, even if the underlying
 architecture is basically sound.

 +* clog. If we're doing logical replication we can't do that though, 
 so
 +* hold the lock for a moment longer.
 
 ...because why?

That's something pretty icky imo. But more in the existing HS code than
in this. Without the changes in the locking we can have the situation
that transactions are marked as running in the xl_running_xact record,
but are actually already committed. There's some code for HS that tries
to cope with that situation but I don't trust it very much, and it'd be
more complicated to make it work for logical decoding. I could reproduce
problems for the latter without those changes.

I'll add a comment explaining this.

 I'm still unhappy that we're introducing logical decoding slots but no
 analogue for physical replication.  If we had the latter, would it
 share meaningful amounts of structure with this?

Yes, I think we could mostly reuse it, we'd probably want to add a field
or two more (application_name, sync_prio?). I have been wondering
whether some of the code in replication/logical/logical.c shouldn't be
in replication/slot.c or similar. So far I've opted for leaving it in
its current 

Re: [HACKERS] logical changeset generation v6.8

2013-12-10 Thread Robert Haas
On Wed, Dec 4, 2013 at 10:55 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-12-03 15:19:26 -0500, Robert Haas wrote:
 Yeah, you're right.  I think the current logic will terminate when all
 flags are set to false or all attribute numbers have been checked, but
 it doesn't know that if HOT's been disproven then we needn't consider
 further HOT columns.  I think the way to fix that is to tweak this
 part:

 +   if (next_hot_attnum  FirstLowInvalidHeapAttributeNumber)
 check_now = next_hot_attnum;
 +   else if (next_key_attnum  
 FirstLowInvalidHeapAttributeNumber)
 +   check_now = next_key_attnum;
 +   else if (next_id_attnum  FirstLowInvalidHeapAttributeNumber)
 +   check_now = next_id_attnum;
 else
 +   break;

 What I think we ought to do there is change each of those criteria to
 say if (hot_result  next_hot_attnum 
 FirstLowInvalidHeapAttributeNumber) and similarly for the other two.
 That way we consider each set a valid source of attribute numbers only
 until the result flag for that set flips false.

 That seems to work well, yes.

 Updated  rebased series attached.

 * Rebased since the former patch 01 has been applied
 * Lots of smaller changes in the wal_level=logical patch
   * Use Robert's version of wal_level=logical, with the above fixes
   * Use only macros for RelationIsAccessibleInLogicalDecoding/LogicallyLogged
   * Moved a mit more logic into ExtractReplicaIdentity
   * some comment copy-editing
   * Bug noted by Euler fixed, testcase added
 * Some copy editing in later patches, nothing significant.

 I've primarily sent this, because I don't know of further required
 changes in 0001-0003. I am trying reviewing the other patches in detail
 atm.

Committed #1 (again).  Regarding this:

+   /* XXX: we could also do this unconditionally, the space is used anyway
+   if (copy_oid)
+   HeapTupleSetOid(key_tuple, HeapTupleGetOid(tp));

I would like to put in a big +1 for doing that unconditionally.  I
didn't make that change before committing, but I think it'd be a very
good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.8

2013-12-10 Thread Robert Haas
On Wed, Dec 4, 2013 at 10:55 AM, Andres Freund and...@2ndquadrant.com wrote:
 I've primarily sent this, because I don't know of further required
 changes in 0001-0003. I am trying reviewing the other patches in detail
 atm.

Committed #3 also.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.8

2013-12-10 Thread Robert Haas
On Wed, Dec 4, 2013 at 10:55 AM, Andres Freund and...@2ndquadrant.com wrote:
 [ updated logical decoding patches ]

Regarding patch #4, introduce wal decoding via catalog timetravel,
which seems to be the bulk of what's not committed at this point...

- I think this needs SGML documentation, same kind of thing we have
for background workers, except probably significantly more.  A design
document with ASCII art in a different patch does not constitute
usable documentation.  I think it would fit best under section VII,
internals, perhaps entitled Writing a Logical Decoding Plugin.
Looking at it, I rather wonder if the Background Worker Processes
ought to be there as well, rather than under section V, server
programming.

+   /* setup the redirected t_self for the benefit
of logical decoding */
...
+   /* reset to original, non-redirected, tid */

Clear as mud.

+ * rrow to disk but only do so in batches when we've collected several of them

Typo.

+ * position before those records back. Independently from WAL logging,

before those records back?

+ * position before those records back. Independently from WAL logging,
+ * everytime a rewrite is finished all generated mapping files are directly

I would delete Independently from WAL logging from this sentence.
And everytime is two words.

+ * file. That leaves the tail end that has not yet been fsync()ed to disk open
...
+ * fsynced.

Pick a spelling and stick with it.  My suggestion is flushed to
disk, not actually using fsync per se at all.

+ * TransactionDidCommit() check. But we want to support physical replication
+ * for availability and to support logical decoding on the standbys.

What is physical replication for availability?

+ * necessary. If we detect that we don't need to log anything we'll prevent
+ * any further action by logical_*rewrite*

Sentences should end with a period, and the reason for the asterisks
is not clear.

+   logical_xmin =
+   ((volatile LogicalDecodingCtlData*) LogicalDecodingCtl)-xmin;

Ugly.

+errmsg(failed to write to
logical remapping file: %m)));

Message style.

+   ereport(ERROR,
+   (errcode_for_file_access(),
+errmsg(incomplete write to
logical remapping file, wrote %d of %u,
+   written, len)));

Message style.  I suggest treating a short write as ENOSPC; there is
precedent elsewhere.

I don't think there's much point in including remapping in all of
the error messages.  It adds burden for translators and users won't
know what a remapping file is anyway.

+   /*
+* We intentionally violate the usual WAL coding
practices here and
+* write to the file *first*. This way an eventual
checkpoint will
+* sync any part of the file that's not guaranteed to
be recovered by
+* the XLogInsert(). We deal with the potential
corruption in the tail
+* of the file by truncating it to the last safe point
during WAL
+* replay and by checking whether the xid performing
the mapping has
+* committed.
+*/

Don't have two different comments explaining this.  Either replace
this one with a reference to the other one, or get rid of the other
one and just keep this one.  I vote for the latter.

I don't see a clear explanation anywhere of what the
rs_logical_mappings hash is actually doing.  This is badly needed.
This code basically presupposes that you know what it's try to
accomplish, and even though I sort of do, it leaves a lot to be
desired in terms of clarity.

+   /* nothing to do if we're not working on a catalog table */
+   if (!state-rs_logical_rewrite)
+   return;

Comment doesn't accurately describe code.

+   /* use *GetUpdateXid to correctly deal with multixacts */
+   xmax = HeapTupleHeaderGetUpdateXid(new_tuple-t_data);

I don't feel enlightened by that comment.

+   if (!TransactionIdIsNormal(xmax))
+   {
+   /*
+* no xmax is set, can't have any permanent ones, so
this check is
+* sufficient
+*/
+   }
+   else if (HEAP_XMAX_IS_LOCKED_ONLY(new_tuple-t_data-t_infomask))
+   {
+   /* only locked, we don't care */
+   }
+   else if (!TransactionIdPrecedes(xmax, cutoff))
+   {
+   /* tuple has been deleted recently, log */
+   do_log_xmax = true;
+   }

Obfuscated.  Rewrite without empty blocks.

+   /*
+* Write out buffer everytime we've too many in-memory entries.
+*/
+   if (state-rs_num_rewrite_mappings = 1000 /* arbitrary number */)
+   logical_heap_rewrite_flush_mappings(state);

What happens if the number of items per hash table entry is small but
the