Re: [HACKERS] logical changeset generation v6.4

2013-10-25 Thread Andres Freund
Hi,

On 2013-10-22 16:07:16 +0200, Andres Freund wrote:
 On 2013-10-21 20:16:29 +0200, Andres Freund wrote:
  Current draft is:
  ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL|DEFAULT
  ALTER TABLE ... REPLICA IDENTITY USING INDEX ...;
  
  which leaves the door open for
  
  ALTER TABLE ... REPLICA IDENTITY USING '(' column_name_list ')';
  
  Does anybody have a strong feeling about requiring support for CREATE
  TABLE for this?
 
 Attached is a patch ontop of master implementing this syntax. It's not
 wired up to the changeset extraction patch yet as I am not sure whether
 others agree about the storage.

So, I am currently wondering about how to store the old tuple, based
on this. Currently it is stored using the TupleDesc of the index the old
tuple is based on. But if we want to allow transporting the entire tuple
that obviously cannot be the only option.
One option would be to change the stored format based on what's
configured, using the relation's TupleDesc if FULL is used. But I think
always using the heap relation's desc is better.
The not-logged columns would then just be represented as NULLs. That
will make old primary keys bigger if the relation has a high number of
columns and the key small, but I don't think it matters enough.

Opinions?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-25 Thread Robert Haas
On Fri, Oct 25, 2013 at 10:58 AM, Andres Freund and...@2ndquadrant.com wrote:
 So, I am currently wondering about how to store the old tuple, based
 on this. Currently it is stored using the TupleDesc of the index the old
 tuple is based on. But if we want to allow transporting the entire tuple
 that obviously cannot be the only option.
 One option would be to change the stored format based on what's
 configured, using the relation's TupleDesc if FULL is used. But I think
 always using the heap relation's desc is better.

I heartily agree.

 The not-logged columns would then just be represented as NULLs. That
 will make old primary keys bigger if the relation has a high number of
 columns and the key small, but I don't think it matters enough.

Even if it does matter, the cure seems likely to be worse than the disease.

My only other comment is that if NONE is selected, we ought to omit
the old tuple altogether, not store one that is all-nulls.  But I bet
you had that in mind anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-22 Thread Robert Haas
On Tue, Oct 22, 2013 at 10:07 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-10-21 20:16:29 +0200, Andres Freund wrote:
 On 2013-10-18 20:50:58 +0200, Andres Freund wrote:
  How about modifying the selection to go from:
  * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
  * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
  * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)

 Current draft is:
 ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL|DEFAULT
 ALTER TABLE ... REPLICA IDENTITY USING INDEX ...;

 which leaves the door open for

 ALTER TABLE ... REPLICA IDENTITY USING '(' column_name_list ')';

 Does anybody have a strong feeling about requiring support for CREATE
 TABLE for this?

 Attached is a patch ontop of master implementing this syntax. It's not
 wired up to the changeset extraction patch yet as I am not sure whether
 others agree about the storage.

 pg_class grew a 'relreplident' char, storing:
 * 'd' default
 * 'n' nothing
 * 'f' full
 * 'i' index with indisreplident set, or default if index has been
   dropped
 pg_index grew a 'indisreplident' bool indicating it is set as the
 replica identity for a replident = 'i' relation.

 Both changes shouldn't change the width of the affected relations, they
 should reuse existing padding.

 Does somebody prefer a different storage?

I had imagined that the storage might consist simply of a pg_attribute
boolean.  So full would turn them all on, null would turn them all of,
etc.  But that does make it hard to implement the whatever the pkey
happens to be right now behavior, so maybe your idea is better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-21 Thread Robert Haas
On Fri, Oct 18, 2013 at 2:50 PM, Andres Freund and...@2ndquadrant.com wrote:
 How about modifying the selection to go from:
 * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
 * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
 * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
 * primary key
 * candidate key with the smallest oid

 Including the candidate key will help people using changeset extration
 for auditing that do not have primary key. That really isn't an
 infrequent usecase.

 I've chosen REPLICA IDENTITY; NOTHIN; FULL; because those are all
 existing keywords, and afaics shouldn't generate any conflicts. On a
 green field we probably name them differently, but ...

I'm really pretty much dead set against the candidate key with the
smallest OID proposal.  I think that's just plain old bad idea.  It's
just magical behavior which will result in users being surprised and
unhappy.  I don't think there's really a problem with saying, hey, if
you configure changeset extraction and you don't configure a replica
identity, then you don't get any columns from the old tuple.  If you
don't like that, change the configuration.  It's always nice to spare
users unnecessary configuration, of course, but trying to make things
simpler than they really are tends to hurt more than it helps.

On the naming, I find REPLICA IDENTITY to be pretty good.  We've
already places where we're using the REPLICA keyword to indicate
places where we've got core support intended to dovetail with external
replication solutions, and this seems to fit that paradigm nicely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-21 Thread Andres Freund
On 2013-10-21 09:40:13 -0400, Robert Haas wrote:
 On Fri, Oct 18, 2013 at 2:50 PM, Andres Freund and...@2ndquadrant.com wrote:
  How about modifying the selection to go from:
  * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
  * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
  * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
  * primary key
  * candidate key with the smallest oid
 
  Including the candidate key will help people using changeset extration
  for auditing that do not have primary key. That really isn't an
  infrequent usecase.
 
  I've chosen REPLICA IDENTITY; NOTHIN; FULL; because those are all
  existing keywords, and afaics shouldn't generate any conflicts. On a
  green field we probably name them differently, but ...
 
 I'm really pretty much dead set against the candidate key with the
 smallest OID proposal.  I think that's just plain old bad idea.  It's
 just magical behavior which will result in users being surprised and
 unhappy.  I don't think there's really a problem with saying, hey, if
 you configure changeset extraction and you don't configure a replica
 identity, then you don't get any columns from the old tuple.

I have a hard time to understand why you dislike it so much. Think of a
big schema where you want to add auditing via changeset
extraction. Because of problems with reindexing primary key you've just
used candidate keys so far. Why should you go through each of a couple
of hundred tables and explictly choose an index when you just want an
identifier of changed rows?
By nature of it being a candidate key it is *guranteed* to uniquely
identify a row? And you can make the output plugin give you the used
columns/the indexname without a problem.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-21 Thread Hannu Krosing
On 10/18/2013 08:50 PM, Andres Freund wrote:
 On 2013-10-18 08:11:29 -0400, Robert Haas wrote:
...
 2. If that seems too complicated, how about just logging the whole old
 tuple for version 1?
 I think that'd make the patch much less useful because it bloats WAL
 unnecessarily for the primary user (replication) of it. I'd rather go
 for primary keys only if that proves to be the contentious point.

 How about modifying the selection to go from:
 * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
 * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
 * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
 * primary key
 * candidate key with the smallest oid

 Including the candidate key will help people using changeset extration
 for auditing that do not have primary key. That really isn't an
 infrequent usecase.
As I understand it for a table with *no* unique index,
the candidate key is the full tuple, so if we get an UPDATE for
it then this should be replicated as
UPDATE first row matching (NOT DISTINCT FROM) all columns
which on replay side will be equivalent to
CREATE CURSOR ...; FETCH 1 ...; UPDATE ... WHERE CURRENT...'

I know that this will slow down replication, as you can not use direct
index updates internally - at least not easily - but need to let postgreSQL
actually plan this, but such single row update is no faster on origin
either.

Of course when it is a full-table update on a table with no
indexes, then doing the same one tuple at a time is really slow.



-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-21 Thread Andres Freund
On 2013-10-21 16:40:43 +0200, Hannu Krosing wrote:
 On 10/18/2013 08:50 PM, Andres Freund wrote:
  On 2013-10-18 08:11:29 -0400, Robert Haas wrote:
 ...
  2. If that seems too complicated, how about just logging the whole old
  tuple for version 1?
  I think that'd make the patch much less useful because it bloats WAL
  unnecessarily for the primary user (replication) of it. I'd rather go
  for primary keys only if that proves to be the contentious point.
 
  How about modifying the selection to go from:
  * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
  * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
  * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
  * primary key
  * candidate key with the smallest oid
 
  Including the candidate key will help people using changeset extration
  for auditing that do not have primary key. That really isn't an
  infrequent usecase.

 As I understand it for a table with *no* unique index,
 the candidate key is the full tuple, so if we get an UPDATE for
 it then this should be replicated as
 UPDATE first row matching (NOT DISTINCT FROM) all columns
 which on replay side will be equivalent to
 CREATE CURSOR ...; FETCH 1 ...; UPDATE ... WHERE CURRENT...'

No, it's not a candidate key since it's not uniquely identifying a
row. You can play tricks as you describe, but that still doesn't make
the whole row a candidate key.

But anyway,  I suggest allowing for logging all columns above...

 I know that this will slow down replication, as you can not use direct
 index updates internally - at least not easily - but need to let postgreSQL
 actually plan this, but such single row update is no faster on origin
 either.

That's not actually true. Consider somebody doing something like:
UPDATE big_table_without_indexes SET column = ...;
On the source side that's essentialy O(n). If you replicate on a
row-by-row basis it will be O(n^2) on the replay side.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-21 Thread Robert Haas
On Mon, Oct 21, 2013 at 9:51 AM, Andres Freund and...@2ndquadrant.com wrote:
 I have a hard time to understand why you dislike it so much. Think of a
 big schema where you want to add auditing via changeset
 extraction. Because of problems with reindexing primary key you've just
 used candidate keys so far. Why should you go through each of a couple
 of hundred tables and explictly choose an index when you just want an
 identifier of changed rows?
 By nature of it being a candidate key it is *guranteed* to uniquely
 identify a row? And you can make the output plugin give you the used
 columns/the indexname without a problem.

Sure, well, if a particular user wants to choose candidate keys
essentially at random from among the unique indexes present, there's
nothing to prevent them from writing a script to do that.  But
assuming that one unique index is just as good as another is just
wrong.  If you pick a candidate key that doesn't actually represent
the users' notion of row identity, then your audit log will be
thoroughly useless, even if it does uniquely identify the rows
involved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-21 Thread Andres Freund
On 2013-10-21 11:14:37 -0400, Robert Haas wrote:
 On Mon, Oct 21, 2013 at 9:51 AM, Andres Freund and...@2ndquadrant.com wrote:
  I have a hard time to understand why you dislike it so much. Think of a
  big schema where you want to add auditing via changeset
  extraction. Because of problems with reindexing primary key you've just
  used candidate keys so far. Why should you go through each of a couple
  of hundred tables and explictly choose an index when you just want an
  identifier of changed rows?
  By nature of it being a candidate key it is *guranteed* to uniquely
  identify a row? And you can make the output plugin give you the used
  columns/the indexname without a problem.
 
 Sure, well, if a particular user wants to choose candidate keys
 essentially at random from among the unique indexes present, there's
 nothing to prevent them from writing a script to do that.  But
 assuming that one unique index is just as good as another is just
 wrong.  If you pick a candidate key that doesn't actually represent
 the users' notion of row identity, then your audit log will be
 thoroughly useless, even if it does uniquely identify the rows
 involved.

Why? If the columns are specified in the log, by definition the values
will be sufficient to identify a row. Even if a nicer key might exist.

Since I seemingly can't convince you, I'll modify things that way for
now as it can easily be changed later, but I still don't see the
problem.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-21 Thread Hannu Krosing
On 10/21/2013 05:06 PM, Andres Freund wrote:
 On 2013-10-21 16:40:43 +0200, Hannu Krosing wrote:
 On 10/18/2013 08:50 PM, Andres Freund wrote:
 On 2013-10-18 08:11:29 -0400, Robert Haas wrote:
 ...
 2. If that seems too complicated, how about just logging the whole old
 tuple for version 1?
 I think that'd make the patch much less useful because it bloats WAL
 unnecessarily for the primary user (replication) of it. I'd rather go
 for primary keys only if that proves to be the contentious point.

 How about modifying the selection to go from:
 * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
 * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
 * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
 * primary key
 * candidate key with the smallest oid

 Including the candidate key will help people using changeset extration
 for auditing that do not have primary key. That really isn't an
 infrequent usecase.
 As I understand it for a table with *no* unique index,
 the candidate key is the full tuple, so if we get an UPDATE for
 it then this should be replicated as
 UPDATE first row matching (NOT DISTINCT FROM) all columns
 which on replay side will be equivalent to
 CREATE CURSOR ...; FETCH 1 ...; UPDATE ... WHERE CURRENT...'
 No, it's not a candidate key since it's not uniquely identifying a
 row. You can play tricks as you describe, but that still doesn't make
 the whole row a candidate key.

 But anyway,  I suggest allowing for logging all columns above...
I the all columns option this ?

How about modifying the selection to go from:
* all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;

for some reason I thought it to be option to either log or not log PK column 
... 


 I know that this will slow down replication, as you can not use direct
 index updates internally - at least not easily - but need to let postgreSQL
 actually plan this, but such single row update is no faster on origin
 either.
 That's not actually true. Consider somebody doing something like:
 UPDATE big_table_without_indexes SET column = ...;
 On the source side that's essentialy O(n). If you replicate on a
 row-by-row basis it will be O(n^2) on the replay side.
Probably more like  O(n^2 / 2) but yes, this is what I meant with the
sentence
after that ;)

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-21 Thread Andres Freund
On 2013-10-18 20:50:58 +0200, Andres Freund wrote:
 How about modifying the selection to go from:
 * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
 * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
 * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)

Current draft is:
ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL|DEFAULT
ALTER TABLE ... REPLICA IDENTITY USING INDEX ...;

which leaves the door open for

ALTER TABLE ... REPLICA IDENTITY USING '(' column_name_list ')';

Does anybody have a strong feeling about requiring support for CREATE
TABLE for this?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-18 Thread Robert Haas
On Mon, Oct 14, 2013 at 9:12 AM, Andres Freund and...@2ndquadrant.com wrote:
 Attached you can find version 6.4 of the patchset:

So I'm still unhappy with the arbitrary logic in what's now patch 1
for choosing the candidate key.  On another thread, someone mentioned
that they might want the entire old tuple, and that got me thinking:
there's no particular reason why the user has to want exactly the
columns that exist in some unique, immediate, non-partial index (what
a name).  So I have two proposals:

1. Instead of allowing the user to choose the index to be used, or
picking it for them, how about if we let them choose the old-tuple
columns they want logged?  This could be a per-column option.  If the
primary key can be assumed known and unchanging, then the answer might
be that the user wants *no* old-tuple columns logged.  Contrariwise
someone might want everything logged, or anything in the middle.

2. If that seems too complicated, how about just logging the whole old
tuple for version 1?

I'm basically fine with the rest of what's in the first two patches,
but we need to sort out some kind of consensus on this issue.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-18 Thread Merlin Moncure
On Fri, Oct 18, 2013 at 7:11 AM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Oct 14, 2013 at 9:12 AM, Andres Freund and...@2ndquadrant.com wrote:
 Attached you can find version 6.4 of the patchset:

 So I'm still unhappy with the arbitrary logic in what's now patch 1
 for choosing the candidate key.  On another thread, someone mentioned
 that they might want the entire old tuple, and that got me thinking:
 there's no particular reason why the user has to want exactly the
 columns that exist in some unique, immediate, non-partial index (what
 a name).  So I have two proposals:

Aside: what's an immediate index?  Is this speaking to the constraint?
(immediate vs deferred?)

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-18 Thread Andres Freund
On 2013-10-18 08:11:29 -0400, Robert Haas wrote:
 On Mon, Oct 14, 2013 at 9:12 AM, Andres Freund and...@2ndquadrant.com wrote:
  Attached you can find version 6.4 of the patchset:
 
 So I'm still unhappy with the arbitrary logic in what's now patch 1
 for choosing the candidate key.  On another thread, someone mentioned
 that they might want the entire old tuple, and that got me thinking:
 there's no particular reason why the user has to want exactly the
 columns that exist in some unique, immediate, non-partial index (what
 a name).  So I have two proposals:

 1. Instead of allowing the user to choose the index to be used, or
 picking it for them, how about if we let them choose the old-tuple
 columns they want logged?  This could be a per-column option.  If the
 primary key can be assumed known and unchanging, then the answer might
 be that the user wants *no* old-tuple columns logged.  Contrariwise
 someone might want everything logged, or anything in the middle.

I definitely can see the usecase for logging anything or nothing,
arbitrary column select seems to be too complicated for now.

 2. If that seems too complicated, how about just logging the whole old
 tuple for version 1?

I think that'd make the patch much less useful because it bloats WAL
unnecessarily for the primary user (replication) of it. I'd rather go
for primary keys only if that proves to be the contentious point.

How about modifying the selection to go from:
* all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
* index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
* [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)
* primary key
* candidate key with the smallest oid

Including the candidate key will help people using changeset extration
for auditing that do not have primary key. That really isn't an
infrequent usecase.

I've chosen REPLICA IDENTITY; NOTHIN; FULL; because those are all
existing keywords, and afaics shouldn't generate any conflicts. On a
green field we probably name them differently, but ...

Comments?

Greetings,

Andres Freund

PS: candidate key implies a key which is: immediate (aka not deferred),
unique, non-partial and only contains NOT NULL columns.

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers