Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 14:40, Heikki Linnakangas wrote:

> The reason we need an origin id in this scenario is that otherwise this will
> happen:
>
> 1. A row is updated on node A
> 2. Node B receives the WAL record from A, and updates the corresponding row
> in B. This generates a new WAL record.
> 3. Node A receives the WAL record from B, and updates the rows again. This
> again generates a new WAL record, which is replicated to B, and you loop
> indefinitely.
>
> If each WAL record carries an origin id, node A can use it to refrain from
> applying the WAL record it receives from B, which breaks the loop.
>
> However, note that in this simple scenario, if the logical log replay /
> conflict resolution is smart enough to recognize that the row has already
> been updated, because the old and the new rows are identical, the loop is
> broken at step 3 even without the origin id. That works for the
> newest-update-wins and similar strategies. So the origin id is not
> absolutely necessary in this case.

Including the origin id in the WAL allows us to filter out WAL records
when we generate LCRs, so we can completely avoid step 3, including
all of the CPU, disk and network overhead that implies. Simply put, we
know the change came from A, so no need to send the change to A again.
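
As a sketch of the filtering this enables (the names here are
illustrative, not from the patch):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of a WAL record header carrying the proposed field. */
typedef struct WalRecord
{
    uint16_t    origin_node_id;     /* node the change originated on */
    /* ... rmgr id, length, payload ... */
} WalRecord;

/* Drop changes that originated on the subscriber itself, breaking the
 * loop at the source so none of the overhead of step 3 is ever paid. */
static bool
should_send_lcr(const WalRecord *rec, uint16_t subscriber_node_id)
{
    return rec->origin_node_id != subscriber_node_id;
}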

If we do allow step 3 to exist, we still need to send the origin id.
This is because there is no a priori way to know the origin id is not
required, such as the case where we have concurrent updates, which are
effectively a race condition between actions on separate nodes. Not
sending the origin id because it is not required in some cases is
equivalent to saying we can skip locking because a race condition does
not happen in all cases. Making a case that the race condition is rare
is still not a case for skipping locking. Same here: we need the
information to avoid making errors in the general case.

> Another interesting scenario is that you maintain a global counter, like in
> an inventory system, and conflicts are resolved by accumulating the updates.
> For example, if you do "UPDATE SET counter = counter + 1" simultaneously on
> two nodes, the result is that the counter is incremented by two. The
> avoid-update-if-already-identical optimization doesn't help in this case,
> the origin id is necessary.
>
> Now, let's take the inventory system example further. There are actually two
> ways to update a counter. One is when an item is checked in or out of the
> warehouse, ie. "UPDATE counter = counter + 1". Those updates should
> accumulate. But another operation resets the counter to a specific value,
> "UPDATE counter = 10", like when taking an inventory. That should not
> accumulate with other changes, but should be newest-update-wins. The origin
> id is not enough for that, because by looking at the WAL record and the
> origin id, you don't know which type of an update it was.

Yes, of course. Conflict handling in the general case requires much
additional work. This thread is one minor change of many related
changes. The patches are being submitted in smaller chunks to ease
review, so sometimes there are cross links between things.

> So, I don't like the idea of adding the origin id to the record header. It's
> only required on some occasions, and on some record types.

That conclusion doesn't follow from your stated arguments.

> And I'm worried
> it might not even be enough in more complicated scenarios.

It is not the only required conflict mechanism, and has never been
claimed to be so. It is simply one piece of information needed, at
various times.


> Perhaps we need a more generic WAL record annotation system, where a plugin
> can tack arbitrary information to WAL records. The extra information could
> be stored in the WAL record after the rmgr payload, similar to how backup
> blocks are stored. WAL replay could just ignore the annotations, but a
> replication system could use it to store the origin id or whatever extra
> information it needs.

Additional information required for logical replication will be handled
by a new wal_level.

The discussion here is about adding origin_node_id *only*, which needs
to be added on each WAL record.

One question I raised in my review was whether this extra information
should be added via a variable-length header, so this very question has
already been asked. So far there is no evidence that the additional code
complexity would be warranted. If it becomes warranted in the future, the
format can be modified again. At this stage there is no need, so the
proposal is to add the field to every WAL record regardless of the
setting of wal_level, because there is no measurable downside to doing
so. The downsides of additional complexity are clear and real, however,
so I wish to avoid them.
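
For illustration, the fixed-width approach amounts to roughly the
following (a sketch with invented field names, not the actual XLogRecord
definition):

#include <stdint.h>

typedef struct WalRecordHeader
{
    uint32_t    total_len;          /* total record length */
    uint32_t    xid;                /* transaction id */
    uint8_t     rmgr_id;            /* resource manager */
    uint8_t     info;               /* flag bits */
    uint16_t    origin_node_id;     /* proposed: originating node, always present */
} WalRecordHeader;

Every record pays a fixed two bytes, but readers never have to parse a
variable-length header to find the payload.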

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] pgbench--new transaction type

2012-06-20 Thread Heikki Linnakangas

On 01.06.2012 03:02, Jeff Janes wrote:

I've attached a new patch which addresses several of your concerns,
and adds the documentation.  The description is much longer than the
descriptions of other nearby options, which mostly just give a simple
statement of what they do rather than a description of why that is
useful.  I don't know if that means I'm starting a good trend, or a
bad one, or I'm just putting the exposition in the wrong place.

In addition to showing the benefits of coding things on the server
side when that is applicable, it also allows hackers to stress parts
of the server code that are not easy to stress otherwise.


As you mentioned in your original email over a year ago, most of this 
could be done as a custom script. It's nice to have another workload to 
test, but then again, there's an infinite number of workloads that might 
be interesting.


You can achieve the same with this custom script:

\set loops 512

do $$ DECLARE  sum integer default 0; amount integer; account_id 
integer; BEGIN FOR i IN 1..:loops LOOP   account_id=1 + floor(random() * 
:scale);   SELECT abalance into strict amount FROM pgbench_accounts 
WHERE aid = account_id;   sum := sum + amount; END LOOP; END; $$;


It's a bit awkward because it has to be all on one line, and you don't 
get the auto-detection of scale. But those are really the only 
differences AFAICS.


I think we would be better off improving pgbench to support those things 
in custom scripts. It would be nice to be able to write initialization 
steps that only run once in the beginning of the test. You could then 
put the "SELECT COUNT(*) FROM pgbench_branches" there, to do the scale 
auto-detection.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] pgbench--new transaction type

2012-06-20 Thread Simon Riggs
On 20 June 2012 15:32, Heikki Linnakangas wrote:
> On 01.06.2012 03:02, Jeff Janes wrote:
>>
>> I've attached a new patch which addresses several of your concerns,
>> and adds the documentation.  The description is much longer than the
>> descriptions of other nearby options, which mostly just give a simple
>> statement of what they do rather than a description of why that is
>> useful.  I don't know if that means I'm starting a good trend, or a
>> bad one, or I'm just putting the exposition in the wrong place.
>>
>> In addition to showing the benefits of coding things on the server
>> side when that is applicable, it also allows hackers to stress parts
>> of the server code that are not easy to stress otherwise.
>
>
> As you mentioned in your original email over a year ago, most of this could
> be done as a custom script. It's nice to have another workload to test, but
> then again, there's an infinite number of workloads that might be
> interesting.
>
> You can achieve the same with this custom script:
>
> \set loops 512
>
> do $$ DECLARE  sum integer default 0; amount integer; account_id integer;
> BEGIN FOR i IN 1..:loops LOOP   account_id=1 + floor(random() * :scale);
> SELECT abalance into strict amount FROM pgbench_accounts WHERE aid =
> account_id;   sum := sum + amount; END LOOP; END; $$;
>
> It's a bit awkward because it has to be all on one line, and you don't get
> the auto-detection of scale. But those are really the only differences
> AFAICS.
>
> I think we would be better off improving pgbench to support those things in
> custom scripts. It would be nice to be able to write initialization steps
> that only run once in the beginning of the test. You could then put the
> "SELECT COUNT(*) FROM pgbench_branches" there, to do the scale
> auto-detection.

I'm sure Jeff submitted this because of the need for a standard test,
rather than the wish to actually modify pgbench itself.

Can I suggest that we include a list of standard scripts with pgbench
for this purpose? These can then be copied alongside the binary when
we do an install.

That way this patch can become a standard script, plus an entry in the
docs describing it.

We could even include scripts for the usual cases, to allow them to be
more easily viewed/copied/modified.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 10:32, Simon Riggs wrote:

On 20 June 2012 14:40, Heikki Linnakangas wrote:

And I'm worried
it might not even be enough in more complicated scenarios.


It is not the only required conflict mechanism, and has never been
claimed to be so. It is simply one piece of information needed, at
various times.


So, if the origin id is not sufficient for some conflict resolution 
mechanisms, what extra information do you need for those, and where do 
you put it?



Perhaps we need a more generic WAL record annotation system, where a plugin
can tack arbitrary information to WAL records. The extra information could
be stored in the WAL record after the rmgr payload, similar to how backup
blocks are stored. WAL replay could just ignore the annotations, but a
replication system could use it to store the origin id or whatever extra
information it needs.


Additional information required for logical replication will be handled
by a new wal_level.

The discussion here is about adding origin_node_id *only*, which needs
to be added on each WAL record.


If that's all we can discuss here, and all other options are off the
table, then I'll have to just outright object to this patch. Let's
implement what we can without the origin id, and revisit this later.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



[HACKERS] Too frequent message of pgbench -i?

2012-06-20 Thread Tatsuo Ishii
Currently pgbench -i prints the following message every 10k tuples created.

fprintf(stderr, "%d tuples done.\n", j);

The frequency of this message seemed appropriate a long time ago, but
computers have gotten so fast these days that a message every 10k tuples
seems too frequent to me. Can we change the frequency from 10k to 100k?
Or should we make the frequency adjustable from the command line? Or,
even better, should the frequency be changed automatically according to
the scale factor?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 15:45, Heikki Linnakangas wrote:
> On 20.06.2012 10:32, Simon Riggs wrote:
>>
>> On 20 June 2012 14:40, Heikki Linnakangas
>>>
>>> And I'm worried
>>>
>>> it might not even be enough in more complicated scenarios.
>>
>>
>> It is not the only required conflict mechanism, and has never been
>> claimed to be so. It is simply one piece of information needed, at
>> various times.
>
>
> So, if the origin id is not sufficient for some conflict resolution
> mechanisms, what extra information do you need for those, and where do you
> put it?

As explained elsewhere, wal_level = logical (or similar) would be used
to provide any additional logical information required.

Update and Delete WAL records already need to be different in that
mode, so additional info would be placed there, if there were any.

In the case of reflexive updates you raised, a typical response in
other DBMSs would be to represent the query
  UPDATE SET counter = counter + 1
by sending just the "+1" part, not the current value of counter, as
would be the case with the non-reflexive update
  UPDATE SET counter = 1

Handling such things in Postgres would require some subtlety, which
would not be resolved in the first release, but it is pretty certain not
to require any changes to the WAL record header as a way of resolving
it. Having already thought about it, I'd estimate that is a very long
discussion and not relevant to the topic of this thread, but if you wish
to have it here, I won't stop you.
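
To make the distinction concrete, a logical change record for such a
scheme might carry an update kind (purely hypothetical types, sketched
for illustration):

typedef enum LcrUpdateKind
{
    LCR_SET_VALUE,      /* UPDATE ... SET counter = 1: newest wins */
    LCR_APPLY_DELTA     /* UPDATE ... SET counter = counter + 1: accumulate */
} LcrUpdateKind;

typedef struct LcrCounterUpdate
{
    LcrUpdateKind   kind;
    long            value;          /* new value, or delta to apply */
} LcrCounterUpdate;

Applying an LCR_APPLY_DELTA record commutes with concurrent deltas,
which is what makes accumulate-on-conflict resolution possible.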


>>> Perhaps we need a more generic WAL record annotation system, where a
>>> plugin
>>> can tack arbitrary information to WAL records. The extra information
>>> could
>>> be stored in the WAL record after the rmgr payload, similar to how backup
>>> blocks are stored. WAL replay could just ignore the annotations, but a
>>> replication system could use it to store the origin id or whatever extra
>>> information it needs.
>>
>>
>> Additional information required for logical replication will be handled
>> by a new wal_level.
>>
>> The discussion here is about adding origin_node_id *only*, which needs
>> to be added on each WAL record.
>
>
> If that's all we can discuss here, and all other options are off the
> table, then I'll have to just outright object to this patch. Let's implement
> what we can without the origin id, and revisit this later.

As explained, we can do nothing without the origin id. It is not
optional or avoidable in the way you've described.

We have the choice to add the required information as a static or as a
variable length addition to the WAL record header. Since there is no
additional requirement for expansion of the header at this point and
no measurable harm in doing so, I suggest we avoid the complexity and
lack of robustness that a variable length header would cause. Of
course, we can go there later if needed, but there is no current need.
If a variable length header had been suggested, it is certain somebody
would have said that was overcooked and to just do it as static.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 11:17, Simon Riggs wrote:

On 20 June 2012 15:45, Heikki Linnakangas wrote:

On 20.06.2012 10:32, Simon Riggs wrote:


On 20 June 2012 14:40, Heikki Linnakangas wrote:


And I'm worried
it might not even be enough in more complicated scenarios.


It is not the only required conflict mechanism, and has never been
claimed to be so. It is simply one piece of information needed, at
various times.


So, if the origin id is not sufficient for some conflict resolution
mechanisms, what extra information do you need for those, and where do you
put it?


As explained elsewhere, wal_level = logical (or similar) would be used
to provide any additional logical information required.

Update and Delete WAL records already need to be different in that
mode, so additional info would be placed there, if there were any.

In the case of reflexive updates you raised, a typical response in
other DBMSs would be to represent the query
   UPDATE SET counter = counter + 1
by sending just the "+1" part, not the current value of counter, as
would be the case with the non-reflexive update
   UPDATE SET counter = 1

Handling such things in Postgres would require some subtlety, which
would not be resolved in the first release, but it is pretty certain not
to require any changes to the WAL record header as a way of resolving
it. Having already thought about it, I'd estimate that is a very long
discussion and not relevant to the topic of this thread, but if you wish
to have it here, I won't stop you.


Yeah, I'd like to hear briefly how you would handle that without any 
further changes to the WAL record header.



Additional information required for logical replication will be handled
by a new wal_level.

The discussion here is about adding origin_node_id *only*, which needs
to be added on each WAL record.


If that's all we can discuss here, and all other options are off the
table, then I'll have to just outright object to this patch. Let's
implement what we can without the origin id, and revisit this later.


As explained, we can do nothing without the origin id. It is not
optional or avoidable in the way you've described.


It's only needed for multi-master replication, where the same table can 
be updated from multiple nodes. Just leave that out for now. There's 
plenty of functionality and issues left even without that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



[HACKERS] Release versioning inconsistency

2012-06-20 Thread Marti Raudsepp
Hi list,

The recent 9.2 beta releases have used a slightly different numbering
scheme than all previous releases.

It used to be that tarballs for version $VER were always available at:
  http://ftp.postgresql.org/pub/source/v$VER/postgresql-$VER.tar.bz2

However, the new releases now use "v9.2.0beta2" for the directory
name, but "9.2beta2" in the tarball file. No big deal for most people,
but it will confuse people who have scripts to download PostgreSQL
tarballs automatically (e.g. packagers).

For example:
  http://ftp.postgresql.org/pub/source/v9.2.0beta2/postgresql-9.2beta2.tar.bz2

Is there any reason behind this change?

Regards,
Marti



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 16:23, Heikki Linnakangas wrote:

> It's only needed for multi-master replication, where the same table can be
> updated from multiple nodes. Just leave that out for now. There's plenty of
> functionality and issues left even without that.

Huh? Multi-master replication is what is being built here and many
people want that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 11:34, Simon Riggs wrote:

On 20 June 2012 16:23, Heikki Linnakangas wrote:


It's only needed for multi-master replication, where the same table can be
updated from multiple nodes. Just leave that out for now. There's plenty of
functionality and issues left even without that.


Huh? Multi-master replication is what is being built here and many
people want that.


Sure, but presumably you're going to implement master-slave first, and 
build multi-master on top of that. What I'm saying is that we can leave 
out the origin-id for now, since we don't have agreement on it, and 
revisit it after master-slave replication is working.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Too frequent message of pgbench -i?

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 11:04, Tatsuo Ishii wrote:

Currently pgbench -i prints the following message every 10k tuples created.

fprintf(stderr, "%d tuples done.\n", j);

The frequency of this message seemed appropriate a long time ago, but
computers have gotten so fast these days that a message every 10k tuples
seems too frequent to me. Can we change the frequency from 10k to 100k?
Or should we make the frequency adjustable from the command line? Or,
even better, should the frequency be changed automatically according to
the scale factor?


Or print the message every n seconds, instead of every n tuples. I think
a new command-line option would be overkill. I would be fine with just
bumping it to every 100k tuples.
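
The time-based variant could be as simple as this (a sketch, not the
actual pgbench source):

#include <stdio.h>
#include <time.h>

/* Call once per row generated; prints at most one message per
 * interval_secs, no matter how fast rows are being created. */
static void
report_progress(int rows_done, int interval_secs)
{
    static time_t last_report = 0;
    time_t      now = time(NULL);

    if (now - last_report >= interval_secs)
    {
        fprintf(stderr, "%d tuples done.\n", rows_done);
        last_report = now;
    }
}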


So +1 on doing something about it, not sure what.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 02:35:59 AM Robert Haas wrote:
> > On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne wrote:
> >> On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas wrote:
> >>> Btw, what do you mean with "conflating" the stream? I don't really see
> >>> that being proposed.
> >> 
> >> It seems to me that you are intent on using the WAL stream as the
> >> logical change stream.  I think that's a bad design.  Instead, you
> >> should extract changes from WAL and then ship them around in a format
> >> that is specific to logical replication.
> > 
> > Yeah, that seems worth elaborating on.
> > 
> > What has been said several times is that it's pretty necessary to
> > capture the logical changes into WAL.  That seems pretty needful, in
> > order that the replication data gets fsync()ed avidly, and so that we
> > don't add in the race condition of needing to fsync() something *else*
> > almost exactly as avidly as is the case for WAL today..
> 
> Check.
> 
> > But it's undesirable to pull *all* the bulk of contents of WAL around
> > if it's only part of the data that is going to get applied.  On a
> > "physical streaming" replica, any logical data that gets captured will
> > be useless.  And on a "logical replica," they "physical" bits of WAL
> > will be useless.
> > 
> > What I *want* you to mean is that there would be:
> > a) WAL readers that pull the "physical bits", and
> > b) WAL readers that just pull "logical bits."
> > 
> > I expect it would be fine to have a tool that pulls LCRs out of WAL to
> > prepare that to be sent to remote locations.  Is that what you have in
> > mind?
> Yes.  I think it should be possible to generate LCRs from WAL, but I
> think that the on-the-wire format for LCRs should be different from
> the WAL format.  Trying to use the same format for both things seems
> like an unpleasant straightjacket.  This discussion illustrates why:
> we're talking about consuming scarce bit-space in WAL records for a
> feature that only a tiny minority of users will use, and it's still
> not really enough bit space.  That stinks.  If LCR transmission is a
> separate protocol, this problem can be engineered away at a higher
> level.
As I said before, I definitely agree that we want to have a separate
transport format once we have decoding nailed down. We still need to
ship WAL around if the decoding happens in a different instance, but
*after* that it can be shipped in something more convenient/appropriate.

> Suppose we have three servers, A, B, and C, that are doing
> multi-master replication in a loop.  A sends LCRs to B, B sends them
> to C, and C sends them back to A.  Obviously, we need to make sure
> that each server applies each set of changes just once, but it
> suffices to have enough information in WAL to distinguish between
> replication transactions and non-replication transactions - that is,
> one bit.  So suppose a change is made on server A.  A generates LCRs
> from WAL, and tags each LCR with node_id = A.  It then sends those
> LCRs to B.  B applies them, flagging the apply transaction in WAL as a
> replication transaction, AND ALSO sends the LCRs to C.  The LCR
> generator on B sees the WAL from apply, but because it's flagged as a
> replication transaction, it does not generate LCRs.  So C receives
> LCRs from B just once, without any need for the node_id to be known
> in WAL.  C can now also apply those LCRs (again flagging the apply
> transaction as replication) and it can also skip sending them to A,
> because it sees that they originated at A.
One bit is fine if you have only very simple replication topologies. Once
you think about globally distributed databases it's a bit different. You
describe some of that below, but just to reiterate:
Imagine having 6 nodes, 3 on each of two continents (ABC in North
America, DEF in Europe). You may only want to have full intercontinental
interconnect between two of those (say A and D). If you only have one bit
to represent the origin, that's not going to work, because you won't be
able to discern the changes from B and C on A from those originating on
DEF.

Another interesting topology is circular replication (i.e. changes get
shipped A->B, B->C, C->A), which is a sensible topology if you only have
a low change rate and a relatively high number of nodes, because you
don't need the full combinatorial number of connections.

You still have the origin_ids be meaningful in the local context though.
As described before, in the communication between the different nodes you
can simply replace the 16-bit node id with some fancy UUID or such, and
do the reverse when replaying LCRs.
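
A sketch of that replacement step (the node table and names are invented
for illustration):

#include <stddef.h>
#include <stdint.h>

/* Maps the compact 16-bit id stored in WAL to the globally unique
 * identifier used on the wire between nodes. */
typedef struct NodeIdMapEntry
{
    uint16_t    local_id;
    char        uuid[37];           /* textual UUID, NUL-terminated */
} NodeIdMapEntry;

static const char *
origin_uuid_for_wire(const NodeIdMapEntry *map, int nentries,
                     uint16_t local_id)
{
    for (int i = 0; i < nentries; i++)
    {
        if (map[i].local_id == local_id)
            return map[i].uuid;
    }
    return NULL;                    /* unknown origin */
}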

> Now suppose we have a more complex topology.  Suppose we have a
> cluster of four servers A .. D which, for improved tolerance against
> network outages, are all connected pairwise.  Normally all the links
> are up, so each server sends all the LCRs it generates directly to all
> other servers.  But how do we prevent cycles?  A gen

Re: [HACKERS] Release versioning inconsistency

2012-06-20 Thread Magnus Hagander
On Wed, Jun 20, 2012 at 10:28 AM, Marti Raudsepp wrote:
> Hi list,
>
> The recent 9.2 beta releases have used a slightly different numbering
> scheme than all previous releases.
>
> It used to be that tarballs for version $VER were always available at:
>  http://ftp.postgresql.org/pub/source/v$VER/postgresql-$VER.tar.bz2
>
> However, the new releases now use "v9.2.0beta2" for the directory
> name, but "9.2beta2" in the tarball file. No big deal for most people,
> but it will confuse people who have scripts to download PostgreSQL
> tarballs automatically (e.g. packagers).
>
> For example:
>  http://ftp.postgresql.org/pub/source/v9.2.0beta2/postgresql-9.2beta2.tar.bz2
>
> Is there any reason behind this change?

Not behind the first one. I believe that's just me and a bad memory -
I got it wrong. (I do believe that using the v9.2.0beta marker is
*better*, because then it sorts properly. But likely not enough better
to justify being inconsistent with previous versions.)

For beta2, the only reason was to keep it consistent with beta1.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Release versioning inconsistency

2012-06-20 Thread Marti Raudsepp
On Wed, Jun 20, 2012 at 12:18 PM, Magnus Hagander wrote:
> (I do believe that using the v9.2.0beta marker is
> *better*, because then it sorts properly. But likely not enough better
> to justify being inconsistent with previous versions.)

Good point. Maybe that's a reason to change the versioning scheme and
stick with "9.2.0betaX" everywhere. Including calling the final
release "9.2.0" instead of simply "9.2"?

Regards,
Marti



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 16:44, Heikki Linnakangas wrote:
> On 20.06.2012 11:34, Simon Riggs wrote:
>>
>> On 20 June 2012 16:23, Heikki Linnakangas wrote:
>>
>>> It's only needed for multi-master replication, where the same table can
>>> be
>>> updated from multiple nodes. Just leave that out for now. There's plenty
>>> of
>>> functionality and issues left even without that.
>>
>>
>> Huh? Multi-master replication is what is being built here and many
>> people want that.
>
>
> Sure, but presumably you're going to implement master-slave first, and build
> multi-master on top of that. What I'm saying is that we can leave out the
> origin-id for now, since we don't have agreement on it, and revisit it after
> master-slave replication is working.

I am comfortable with the idea of deferring applying the patch, but I
don't see any need to defer agreeing that the patch is OK, so it can be
applied easily later. That does beg the question of when exactly we
would defer it to, though. When would that be?

If you have a reason for disagreement, please raise it now, having
seen explanations/comments on various concerns. Of course, people have
made initial objections, which is fine, but it's not reasonable to
assume that such complaints still stand after those responses. Perhaps
there are other thoughts?

The idea that logical rep is some kind of useful end goal in itself is
slightly misleading. If the thought is to block multi-master
completely on that basis, that would be a shame. Logical rep is the
mechanism for implementing multi-master.

Deferring this could easily end up with a huge patch in the last CF, and
then it will be rejected/deferred. Patch submission here is following
the requested process - as early as possible, production ready, small
meaningful patches that build towards a final goal. This is especially
true for format changes, which is why this patch is here now. Doing it
differently just makes patch wrangling and review more difficult,
which reduces overall quality and slows down development.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Eisentraut
On Sun, 2012-06-17 at 23:58 +0100, Peter Geoghegan wrote:
> So if you take the word "Aßlar" here - that is equivalent to "Asslar",
> and so strcoll("Aßlar", "Asslar") will return 0 if you have the right
> LC_COLLATE

This is not actually correct.  glibc will sort Asslar before Aßlar, and
that is correct in my mind.

When a Wikipedia page on some particular language's alphabet says
something like "$letterA and $letterB are equivalent", what it really
means is that they are sorted the same compared to other letters, but
are distinct when ties are broken.

>  (if you tried this out for yourself and found that I was
> actually lying through my teeth, pretend I said Hungarian instead of
> German and "some really obscure character" rather than ß).

Yeah, there are obviously exceptions, which led to the original change
being made, but they are not as widespread as they appear to be.

The real issue in this area, I suspect, will be dealing with Unicode
combining sequences versus equivalent precombined characters.  But
support for that is generally crappy, so it's not urgent to deal with
it.




Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-20 Thread Andres Freund
Hi Robert, Hi All!

On Wednesday, June 20, 2012 03:08:48 AM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 2:23 PM, Andres Freund wrote:
> >> Well, the words are fuzzy, but I would define logical replication to
> >> be something which is independent of the binary format in which stuff
> >> gets stored on disk.  If it's not independent of the disk format, then
> >> you can't do heterogeneous replication (between versions, or between
> >> products).  That precise limitation is the main thing that drives
> >> people to use anything other than SR in the first place, IME.
> > 
> > Not in mine. The main limitation I see is that you cannot write anything
> > on the standby. Which sucks majorly for many things. It's pretty much
> > impossible to "fix" that for SR outside of very limited cases.
> > While many scenarios don't need multimaster *many* need to write outside
> > of the standby's replication set.
> Well, that's certainly a common problem, even if it's not IME the most
> common, but I don't think we need to argue about which one is more
> common, because I'm not arguing against it.  The point, though, is
> that if the logical format is independent of the on-disk format, the
> things we can do are a strict superset of the things we can do if it
> isn't.  I don't want to insist that catalogs be the same (or else you
> get garbage when you decode tuples).  I want to tolerate the fact that
> they may very well be different.  That will in no way preclude writing
> outside the standby's replication set, nor will it prevent
> multi-master replication.  It will, however, enable heterogeneous
> replication, which is a very important use case.  It will also mean
> that innocent mistakes (like somehow ending up with a column that is
> text on one server and numeric on another server) produce
> comprehensible error messages, rather than garbage.
I agree with most of that. I think that some parts of the above need to
be optional because otherwise you lose too much for other scenarios.
I *definitely* want to build the *infrastructure* which makes it easy to
implement all of the above, but I find it a bit much to require that from
the get-go. It's important that everything is reusable for that, yes.
Does a patchset that wants to implement tightly coupled multimaster need
to implement everything for that? No.
If we raise the barrier for anything around this topic so high we will
*NEVER* get anywhere. It's a huge topic with loads of people wanting
loads of different things. And that will hurt people wanting some feature
which matches 90% of the proposed goals *far* more.

> > It's not only the logging side which is a limitation in today's replication
> > scenarios. The apply side scales even worse because it's *very* hard to
> > distribute it between multiple backends.

> I don't think that making LCR format = on-disk format is going to
> solve that problem.  To solve that problem, we need to track
> dependencies between transactions, so that if tuple A is modified by
> T1 and T2, in that order, we apply T1 before T2.  But if T3 - which
> committed after both T1 and T2 - touches none of the same data as T1
> or T2 - then we can apply it in parallel, so long as we don't commit
> until T1 and T2 have committed (because allowing T3 to commit early
> would produce a serialization anomaly from the point of view of a
> concurrent reader).
Well, doing apply at such a low level, without re-encoding the data,
increased throughput nearly threefold even for trivial types. So it
pushes off the point where we need to do the above quite a bit.
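
For reference, the dependency test Robert describes above can be stated
as an intersection check over the tuples each transaction modified (a
toy sketch, not from any patch):

#include <stdbool.h>
#include <stdint.h>

typedef struct TxnChangeSet
{
    const uint64_t *tuple_ids;      /* sorted ids of tuples the txn modified */
    int             ntuples;
} TxnChangeSet;

/* True if the two transactions touched a tuple in common, i.e. they
 * must be applied in commit order rather than in parallel. */
static bool
txns_conflict(const TxnChangeSet *a, const TxnChangeSet *b)
{
    int         i = 0, j = 0;

    while (i < a->ntuples && j < b->ntuples)
    {
        if (a->tuple_ids[i] == b->tuple_ids[j])
            return true;
        if (a->tuple_ids[i] < b->tuple_ids[j])
            i++;
        else
            j++;
    }
    return false;
}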

> >> Because the routines that decode tuples don't include enough sanity
> >> checks to prevent running off the end of the block, or even the end of
> >> memory completely.  Consider a corrupt TOAST pointer that indicates
> >> that there is a gigabyte of data stored in an 8kB block.  One of the
> >> common symptoms of corruption IME is TOAST requests for -3 bytes of
> >> memory.
> > Yes, but we need to put safeguards against that sort of thing anyway. So
> > sure, we can have bugs but this is not a fundamental limitation.
> There's a reason we haven't done that already, though: it's probably
> going to stink for performance.  If it turns out that it doesn't stink
> for performance, great.  But if it causes a 5% slowdown on common use
> cases, I suspect we're not gonna do it, and I bet I can construct a
> case where it's worse than that (think: 400 column table with lots of
> varlenas, sorting by column 400 to return column 399).  I think it's
> treading on dangerous ground to assume we're going to be able to "just
> go fix" this.
I am talking about ensuring that the catalog is the same on the decoding
site, not about making all decoding totally safe in the face of corrupted
information.

> > Postgis uses one information table in a few more complex functions but
> > not in anything low-level. Evidenced by the fact that it was totally
> > normal for that to go out of sync before 2.0.
> > 
> > But even if

Re: [HACKERS] performance regression in 9.2 when loading lots of small tables

2012-06-20 Thread Jeff Janes
On Tue, Jun 19, 2012 at 8:06 PM, Robert Haas wrote:
> On Tue, Jun 19, 2012 at 10:56 PM, Jeff Janes wrote:
>> But in the 9.2 branch, the slow phenotype was re-introduced in
>> 1575fbcb795fc331f4, although perhaps the details of who is locking
>> what differs.  I haven't yet sorted that out.
>
> It very much does.  That commit prevents people from creating a
> relation in - or renaming a relation into - a schema that is being
> concurrently dropped, which in previous releases would have resulted
> in inconsistent catalog contents.  I admit that it harms your test
> case, but how likely is it that someone is going to put every single
> table into its own schema?

Yep, I see that even having 10 tables per schema (which I think was
the case for the person who started this line of inquiry) rather than
just 1 greatly reduces this regression.  It basically goes from about
2x slower to about 1.1x slower.  So I think that pretty much settles
the issue for 9.2.


> And have shared_buffers low enough for
> this to be the dominant cost?

That one is not so unlikely.  After all, lowering shared_buffers is a
defensive move to get around the FlushRelationBuffers problem, so
hopefully people faced with the task of restoring such a dump would
take advantage of it, if they have a maintenance window in which to do
so.  Also, the FlushRelationBuffers issue is linear while the locks
are quadratic, so eventually the locking would regain dominance even
if shared_buffers is large.

> I think in real-world scenarios this
> isn't going to be a problem - although, of course, making the lock
> manager faster would be nifty if we can do it, and this might be a
> good test case.

The resource owner reassign lock patch does effectively solve the lock
manager problem of restoring dumps of such databases, as well as the
similar problem in creating such dumps.

Cheers,

Jeff



Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables

2012-06-20 Thread Etsuro Fujita
Hi KaiGai-san,

Thank you for the review.

> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kohei KaiGai
> Sent: Wednesday, June 20, 2012 1:26 AM
> To: Etsuro Fujita
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file
> foreign tables
>
> Hi Fujita-san,
>
> Could you rebase this patch towards the latest tree?
> I was unable to apply the patch cleanly.

Sorry, I updated the patch.  Please find attached an updated version of the
patch.

> I looked over the patch, then noticed a few points.
>
> At ProcessCopyOptions(), defel->arg can be NIL, can't it?
> If so, cstate->convert_binary is not a suitable flag for checking a
> redundant option. It seems to me cstate->convert_selectively is a more
> suitable flag for that check.
>
> +   else if (strcmp(defel->defname, "convert_binary") == 0)
> +   {
> +   if (cstate->convert_binary)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_SYNTAX_ERROR),
> +errmsg("conflicting or redundant options")));
> +   cstate->convert_selectively = true;
> +   if (defel->arg == NULL || IsA(defel->arg, List))
> +   cstate->convert_binary = (List *) defel->arg;
> +   else
> +   ereport(ERROR,
> +   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +errmsg("argument to option \"%s\" must be a
> list of column names",
> +   defel->defname)));
> +   }

Yes, defel->arg can be NIL.  defel->arg is a List structure listing all
the columns that need to be converted to binary representation, which is
NIL in the case where no columns need to be converted.  For example,
defel->arg is NIL for SELECT COUNT(*).  In this case, while
cstate->convert_selectively is set to true, no columns are converted at
NextCopyFrom().  Most efficient case!  In short,
cstate->convert_selectively represents whether to do selective binary
conversion at NextCopyFrom(), and cstate->convert_binary represents all
the columns to be converted at NextCopyFrom(), which can be NIL.

> At NextCopyFrom(), this routine computes default values if configured.
> In cases where these values are not referenced, it might be possible to
> skip unnecessary calculations.
> Would it be possible to add logic to avoid constructing cstate->defmap
> for unreferenced columns at BeginCopyFrom()?

I think that we don't need to add the above logic because file_fdw does
BeginCopyFrom() with attnamelist = NIL, in which case, BeginCopyFrom()
doesn't construct cstate->defmap at all.

I fixed a bug and made some minor optimizations in
check_binary_conversion(), which is renamed to
check_selective_binary_conversion() in the updated version; file_fdw now
gives up selective binary conversion in the following cases:

  a) BINARY format
  b) CSV/TEXT format and whole row reference
  c) CSV/TEXT format and all the user attributes needed
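
Stated as code, the decision is roughly the following (names invented
for illustration, not the actual function):

#include <stdbool.h>

static bool
selective_conversion_possible(bool binary_format,
                              bool whole_row_referenced,
                              int nattrs_referenced,
                              int nattrs_total)
{
    if (binary_format)
        return false;               /* case a: BINARY is converted anyway */
    if (whole_row_referenced)
        return false;               /* case b: every column is needed */
    if (nattrs_referenced >= nattrs_total)
        return false;               /* case c: no savings possible */
    return true;
}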


Best regards,
Etsuro Fujita

> Thanks,
>
> 2012/5/11 Etsuro Fujita :
> >> -Original Message-
> >> From: Robert Haas [mailto:robertmh...@gmail.com]
> >> Sent: Friday, May 11, 2012 1:36 AM
> >> To: Etsuro Fujita
> >> Cc: pgsql-hackers@postgresql.org
> >> Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV
> >> file foreign tables
> >>
> >> On Tue, May 8, 2012 at 7:26 AM, Etsuro Fujita
> > 
> >> wrote:
> >> > I would like to propose to improve parsing efficiency of
> >> > contrib/file_fdw by selective parsing proposed by Alagiannis et
> >> > al.[1], which means that for a CSV/TEXT file foreign table,
> >> > file_fdw performs binary conversion only for the columns needed for
> >> > query processing.  Attached is a WIP patch implementing the feature.
> >>
> >> Can you add this to the next CommitFest?  Looks interesting.
> >
> > Done.
> >
> > Best regards,
> > Etsuro Fujita
> >
> >> --
> >> Robert Haas
> >> EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL
> >> Company
> >>
> >
> >
> >
>
>
>
> --
> KaiGai Kohei 
>
>




Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 20 June 2012 11:00, Peter Eisentraut wrote:
> On Sun, 2012-06-17 at 23:58 +0100, Peter Geoghegan wrote:
>> So if you take the word "Aßlar" here - that is equivalent to "Asslar",
>> and so strcoll("Aßlar", "Asslar") will return 0 if you have the right
>> LC_COLLATE
>
> This is not actually correct.  glibc will sort Asslar before Aßlar, and
> that is correct in my mind.

Uh, what happened here was that I assumed it was correct, then went to
verify it before sending the mail and found that it wasn't; since I
couldn't immediately find any hard data about which characters this does
apply to, I decided to turn it into a joke. I say this, and yet you've
included that bit of the e-mail inline in your reply, so maybe it just
wasn't a very good joke.

> When a Wikipedia page on some particular language's alphabet says
> something like "$letterA and $letterB are equivalent", what it really
> means is that they are sorted the same compared to other letters, but
> are distinct when ties are broken.

I know.

>>  (if you tried this out for yourself and found that I was
>> actually lying through my teeth, pretend I said Hungarian instead of
>> German and "some really obscure character" rather than ß).
>
> Yeah, there are obviously exceptions, which led to the original change
> being made, but they are not as widespread as they appear to be.

True.

> The real issue in this area, I suspect, will be dealing with Unicode
> combining sequences versus equivalent precombined characters.  But
> support for that is generally crappy, so it's not urgent to deal with
> it.

I agree that it isn't urgent. However, I have an ulterior motive, which
is that in allowing for this, we remove the need to strcmp() after each
strcoll(), and consequently it becomes possible to use strxfrm() instead.
Now, we could also use a hack that's going to make the strxfrm() blobs
even bulkier (basically, concatenate the original text to the blob before
strcmp()), but I don't want to go there if it can possibly be avoided.
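
For reference, the two strategies look roughly like this (a simplified
sketch, not the actual comparator in varlena.c):

#include <string.h>

/* Today: strcoll() with a strcmp() tiebreak, so strings that collate
 * equal but differ in bytes still compare deterministically. */
static int
compare_with_tiebreak(const char *a, const char *b)
{
    int         r = strcoll(a, b);

    return (r != 0) ? r : strcmp(a, b);
}

/* Without the tiebreak, each string can be transformed once and then
 * compared with plain strcmp(); strxfrm() returns the space needed, so
 * callers can detect a too-small buffer and retry. */
static size_t
transform_key(char *buf, size_t buflen, const char *s)
{
    return strxfrm(buf, s, buflen);
}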

I should also point out that we pride ourselves on following the
letter of the standard when that makes sense, and we are currently not
doing that with respect to the Unicode standard.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] libpq compression

2012-06-20 Thread Florian Pflug
On Jun19, 2012, at 17:36 , Robert Haas wrote:
> On Mon, Jun 18, 2012 at 1:42 PM, Martijn van Oosterhout wrote:
>> On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote:
>>> The fly in the ointment with any of these ideas is that the "configure
>>> list" is not a list of exact cipher names, as per Magnus' comment that
>>> the current default includes tests like "!aNULL".  I am not sure that
>>> we know how to evaluate such conditions if we are applying an
>>> after-the-fact check on the selected cipher.  Does OpenSSL expose any
>>> API for evaluating whether a selected cipher meets such a test?
>> 
>> I'm not sure whether there's an API for it, but you can certainly check
>> manually with "openssl ciphers -v", for example:
>> 
>> $ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP'
>> NULL-SHASSLv3 Kx=RSA  Au=RSA  Enc=None  Mac=SHA1
>> NULL-MD5SSLv3 Kx=RSA  Au=RSA  Enc=None  Mac=MD5
>> 
>> ...etc...
>> 
>> So unless the openssl includes the code twice there must be a way to
>> extract the list from the library.
> 
> There doubtless is, but I'd be willing to wager that you won't be
> able to figure out the exact method without reading the source code
> for 'opennssl ciphers' to see how it was done there, and most likely
> you'll find that at least one of the functions they use has no man
> page.  Documentation isn't their strong point.

Yes, unfortunately.

I wonder though if we shouldn't restrict the allowed ciphers list to
being a simple list of supported ciphers. If our goal is to support
multiple SSL libraries transparently then surely having openssl-specific
syntax in the config file isn't exactly great anyway...

best regards,
Florian Pflug




Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 05:01:16 AM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 4:22 PM, Andres Freund wrote:
> >> > 1. dllist.h has double the element overhead by having an inline value
> >> > pointer (which is not needed when embedding) and a pointer to the list
> >> > (which I have a hard time seing as being useful)
> >> > 2. only double linked list, mine provided single and double linked
> >> > ones 3. missing macros to use when embedded in a larger struct
> >> > (containerof() wrappers and for(...) support basically)
> >> > 4. most things are external function calls...
> >> > 5. way more branches/complexity in most of the functions. My
> >> > implementation doesn't use any branches for the typical easy
> >> > modifications (push, pop, remove element somewhere) and only one for
> >> > the typical tests (empty, has-next, ...)
> >> > 
> >> > The performance and memory aspects were crucial for the aforementioned
> >> > toy project (slab allocator for postgres). Its not that crucial for
> >> > the applycache where the lists currently are mostly used although its
> >> > also relatively performance sensitive and obviously does a lot of
> >> > list manipulation/iteration.
> >> > 
> >> > If I had to decide I would add the missing api in dllist.h to my
> >> > implementation and then remove it. Its barely used - and only in an
> >> > embedded fashion - as far as I can see.
> >> > I can understand though if that argument is met with doubt by others
> >> > ;). If thats the way it has to go I would add some more convenience
> >> > support for embedding data to dllist.h and settle for that.
> >> 
> >> I think it might be simpler to leave the name as Dllist and just
> >> overhaul the implementation along the lines you suggest, rather than
> >> replacing it with something completely different.  Mostly, I don't
> >> want to add a third thing if we can avoid it, given that Dllist as it
> >> exists today is used only lightly.
> > 
> > Well, if its the name, I have no problem with changing it, but I don't
> > see how you can keep the api as it currently is and address my points.
> > 
> > If there is some buyin I can try to go either way (keeping the existing
> > name, changing the api, adjusting the callers or just adjust the
> > callers, throw away the old implementation) I just don't want to get
> > into that just to see somebody isn't agreeing with the fundamental idea.
> My guess is that it wouldn't be too hard to remove some of the extra
> pointers.  Anyone who is using Dllist as a non-inline list could be
> converted to List * instead. 
There are only three users of the whole dllist.h: catcache, autovacuum
and postmaster. The latter two just keep a list of databases around. So
any change will only be moderately intrusive.

> Also, the performance-critical things
> could be reimplemented as macros. 

> I question, though, whether we really need both singly and doubly linked
> lists.  That seems like it's almost certainly a micro-optimization that we
> are better off not doing.
It was certainly worthwhile for the memory manager (lower per-allocation
overhead). You might be right that it's not worth it for many other
possible use cases in pg. It's not much code though.

*looks around*

A quick grep found:

Singly linked list like code: guc_private.h, aset.c, elog.h, regguts.h
(ok, irrelevant), dynahash.c, resowner.c, extension.c, pgstat.c, xlog.c
Doubly linked like code: shmqueue.c, lwlock.c, dynahash.c, xact.c

I stopped at that point because, while surely not all of the above use
cases could be replaced by a common implementation, several could from a
quick look. Also, several pg_list.h users could benefit from a
conversion. So I think adding a singly linked list implementation is
worthwhile.
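
To illustrate what "embedded" buys (a minimal sketch in the spirit of
the proposed interface, with invented names):

#include <stddef.h>

typedef struct slist_node
{
    struct slist_node *next;
} slist_node;

typedef struct slist_head
{
    slist_node *head;
} slist_head;

/* Branch-free push: the node lives inside the caller's struct, so no
 * separate list cell is ever allocated. */
static inline void
slist_push(slist_head *list, slist_node *node)
{
    node->next = list->head;
    list->head = node;
}

/* containerof-style wrapper to recover the enclosing struct. */
#define slist_container(type, member, ptr) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))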

> > The most contentious point is probably relying on USE_INLINE being
> > available anywhere. Which I believe to be the point now that we have
> > gotten rid of some platforms.
> I would be hesitant to chuck that even though I realize it's unlikely
> that we really need !USE_INLINE.  But see sortsupport for an example
> of how we've handled this in the recent past.
I agree it's possible to resolve this. But hell, do we really need to
add all that ugliness in 2012? I don't think it's worth the effort to
support ancient compilers that don't have inline anymore. If we could
stop catering to probably non-existent compilers we could remove some
*very* ugly long macros, for example (e.g. in htup.h).

If support for !USE_INLINE is required I would prefer to have a header
define the functions like

#ifdef USE_INLINE
#define OPTIONALLY_INLINE static inline
#define USE_LINKED_LIST_IMPL
#endif

#ifdef USE_LINKED_LIST_IMPL

OPTIONALLY_INLINE void
myFuncCall(void)
{
...
}
#endif

which then gets included, with USE_LINKED_LIST_IMPL defined and
OPTIONALLY_INLINE defined to empty, by some .c file when !USE_INLINE.
It's too much code to duplicate imo.

Andres
-- 
 Andres Freund http://w

Re: [HACKERS] Ability to listen on two unix sockets

2012-06-20 Thread Honza Horak

On 06/15/2012 05:40 PM, Honza Horak wrote:

I realized the patch has some difficulties -- namely the socket path in
the data dir lock file, which currently uses one port for the socket and
the same one for the TCP interface. So to allow users to use an arbitrary
port for all unix sockets, we'd need to add another line only for the
unix socket, which doesn't apply on other platforms. Or we could just say
that the first socket will always use the default port (PostPortNumber),
which is the solution I currently prefer, but I will be glad for any
other opinion. This is also why there is still unnecessary string
splitting in pg_ctl.c, which will be removed after the issue above is
solved.



This is an enhanced patch, which forbids using a port number in the 
first socket directory entry.


Honza
diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index cfdb33a..679c40a 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -838,7 +838,7 @@ omicron bryanh  guest1
 unix_socket_permissions (and possibly
 unix_socket_group) configuration parameters as
 described in .  Or you
-could set the unix_socket_directory
+could set the unix_socket_directories
 configuration parameter to place the socket file in a suitably
 restricted directory.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 074afee..d1727f8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -453,17 +453,26 @@ SET ENABLE_SEQSCAN TO OFF;
   
  
 
- 
-  unix_socket_directory (string)
+ 
+  unix_socket_directories (string)
   
-   unix_socket_directory configuration parameter
+   unix_socket_directories configuration parameter
   
   

-Specifies the directory of the Unix-domain socket on which the
+Specifies the directories of the Unix-domain sockets on which the
 server is to listen for
 connections from client applications.  The default is normally
 /tmp, but can be changed at build time.
+Directories are separated by ',' and an additional port 
+number can be set, separated from the directory by ':'. The port number 
+is only used as a part of the socket file name. For example,
+'/var/run, /tmp:5431' would create socket files
+/var/run/.s.PGSQL.5432 and
+/tmp/.s.PGSQL.5431. 
+It is not possible to set a port number in the first
+entry; if set, it will be ignored and the default port
+will be used. 
 This parameter can only be set at server start.

 
@@ -472,7 +481,7 @@ SET ENABLE_SEQSCAN TO OFF;
 .s.PGSQL. where
  is the server's port number, an ordinary file
 named .s.PGSQL..lock will be
-created in the unix_socket_directory directory.  Neither
+created in the unix_socket_directories directories.  Neither
 file should ever be removed manually.

 
@@ -6593,7 +6602,7 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)


 -k x
-unix_socket_directory = x
+unix_socket_directories = x


 -l
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 7ba18f0..6c74844 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1784,7 +1784,7 @@ pg_dumpall -p 5432 | psql -d postgres -p 5433
   
The simplest way to prevent spoofing for local
connections is to use a Unix domain socket directory () that has write permission only
+   linkend="guc-unix-socket-directories">) that has write permission only
for a trusted local user.  This prevents a malicious user from creating
their own socket file in that directory.  If you are concerned that
some applications might still reference /tmp for the
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index e3ae92d..50d4167 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -31,6 +31,7 @@
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
+#include "postmaster/postmaster.h"
 #include "replication/walreceiver.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
@@ -349,7 +350,20 @@ AuxiliaryProcessMain(int argc, char *argv[])
 
 	/* If standalone, create lockfile for data directory */
 	if (!IsUnderPostmaster)
-		CreateDataDirLockFile(false);
+	{
+#ifdef HAVE_UNIX_SOCKETS
+		List	   *socketsList;
+		char	   *mainSocket = NULL;
+#endif
+		/*
+		 * We need to parse UnixSocketDirs here, because we want the first 
+		 * socket directory, which will be used in postmaster.pid. 
+		 */
+#ifdef HAVE_UNIX_SOCKETS
+		SplitUnixDirectories(UnixSocketDirs, &socketsList, &mainSocket);
+#endif
+		CreateDataDirLockFile(false, mainSocket);
+	}
 
 	SetProcessingMode(BootstrapProcessing);
 	IgnoreSystemIndexes = true;
diff -

Re: [HACKERS] [ADMIN] pg_basebackup blocking all queries with horrible performance

2012-06-20 Thread Magnus Hagander
On Mon, Jun 11, 2012 at 6:06 PM, Fujii Masao  wrote:
> On Tue, Jun 12, 2012 at 12:47 AM, Magnus Hagander  wrote:
>> On Mon, Jun 11, 2012 at 5:37 PM, Fujii Masao  wrote:
>>> On Mon, Jun 11, 2012 at 3:24 AM, Magnus Hagander  
>>> wrote:
 On Sun, Jun 10, 2012 at 6:08 PM, Fujii Masao  wrote:
> On Sun, Jun 10, 2012 at 11:45 PM, Magnus Hagander  
> wrote:
>> On Sun, Jun 10, 2012 at 4:29 PM, Fujii Masao  
>> wrote:
>>> On Sun, Jun 10, 2012 at 11:10 PM, Magnus Hagander  
>>> wrote:
 On Sun, Jun 10, 2012 at 4:08 PM, Fujii Masao  
 wrote:
> On Sun, Jun 10, 2012 at 10:34 PM, Fujii Masao  
> wrote:
>> On Sun, Jun 10, 2012 at 9:25 PM, Fujii Masao  
>> wrote:
>>> On Sun, Jun 10, 2012 at 7:43 PM, Magnus Hagander 
>>>  wrote:
 On Sat, Jun 9, 2012 at 2:51 PM, Tom Lane  
 wrote:
> Fujii Masao  writes:
>> This seems a bug. I think we should prevent pg_basebackup from
>> becoming synchronous standby. Thought?
>
> Absolutely.  If we have replication clients that are not actually
> capable of being standbys, there *must* be a way for the master
> to know that.

 I thought we fixed this already by sending InvalidXlogRecPtr as flush
 location? And that this only applied in 9.2?

 Are you saying we picked pg_basebackup *in backup mode* (not log
 streaming) as synchronous standby?
>>>
>>> Yes.
>>>
 If so then yes, that is
 *definitely* a bug that should be fixed. We should never select a
 connection that's not even streaming log as standby!
>>>
>>> Agreed. Attached patch prevents pg_basebackup from becoming a sync
>>> standby. Also this patch fixes another problem: currently only a
>>> walsender which reaches STREAMING state can become the sync walsender.
>>> OTOH, the sync walsender assumes that a walsender with higher priority
>>> will be the sync one whether or not its state is STREAMING, and switches
>>> itself to potential sync walsender. So when a standby with higher
>>> priority connects to the master, we might have no sync standby until it
>>> reaches the STREAMING state. To fix this problem, the patch switches the
>>> walsender's state from sync to potential *after* the walsender with
>>> higher priority has reached the STREAMING state.
>>>
>>> We also should not select (1) the background stream process forked from
>>> pg_basebackup or (2) pg_receivexlog as sync standby, because they
>>> don't send back replication progress. To address this, I'm thinking of
>>> introducing a new option "NOSYNC" in the "START_REPLICATION" command
>>> as follows, and changing (1) and (2) so that they specify NOSYNC.
>>>
>>>    START_REPLICATION XXX/XXX [NOSYNC]
>>>
>>> If the standby specifies the NOSYNC option, it's never assigned as sync
>>> standby even if its name is in synchronous_standby_names. Thoughts?
>>
>> A standby which always sends InvalidXLogRecPtr back should not
>> become the sync one. So instead of a NOSYNC option, by checking whether
>> InvalidXLogRecPtr is sent, we can avoid a problematic sync standby.
>
> We should not do this because Magnus is proposing a patch
> (http://archives.postgresql.org/pgsql-hackers/2012-06/msg00348.php)
> which breaks the above assumption entirely. So we should introduce
> something like a NOSYNC option.

 Wouldn't the better choice there in that case be to give a switch to
 pg_receivexlog if you *want* it to be able to become a sync replica,
 and by default disallow it? And then keep the backend just treating
 InvalidXlogRecPtr as don't-become-sync-replica.
>>>
>>> I don't object to making pg_receivexlog a sync standby at all. So at
>>> least for me, that switch is not necessary. What I'm worried about is
>>> the background stream process forked from pg_basebackup. I think that
>>> it should not run as a sync standby, but sending back its replication
>>> progress seems helpful because a user can see the progress from
>>> pg_stat_replication. So I'm thinking that something like a NOSYNC
>>> option is required.
>>
>> On principle, no. By default, yes.
>>
>> How about:
>> pg_basebackup background: *never* sends flush location, and therefore
>> won't become a sync replica
>> pg_receivexlog: *optionally* sends flush location; by default it won't
>> become a sync replica, but can be made so with a switch
>
> Wouldn't a user who sees NULL in flush_location from pg_stat_replication
> misunderstand

Re: [HACKERS] WAL format changes

2012-06-20 Thread Magnus Hagander
On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas  wrote:
> On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
>  wrote:
>> Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
>> a uint64, on top of my other WAL format patches. I think we should go ahead
>> with this.
>
> +1.
>
>> The LSNs on pages are still stored in the old format, to avoid changing the
>> on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
>> file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
>> is required.
>
> Seems fine.
>
>> Should we keep the old representation in the replication protocol messages?
>> That would make it simpler to write a client that works with different
>> server versions (like pg_receivexlog). Or, while we're at it, perhaps we
>> should mandate network-byte order for all the integer and XLogRecPtr fields
>> in the replication protocol. That would make it easier to write a client
>> that works across different architectures, in >= 9.3. The contents of the
>> WAL would of course be architecture-dependent, but it would be nice if
>> pg_receivexlog and similar tools could nevertheless be
>> architecture-independent.
>
> I share Andres' question about how we're doing this already.  I think
> if we're going to break this, I'd rather do it in 9.3 than 5 years
> from now.  At this point it's just a minor annoyance, but it'll
> probably get worse as people write more tools that understand WAL.

If we are looking at breaking it, and we are especially concerned
about something like pg_receivexlog... Is it something we could/should
change in the protocol *now* for 9.2, to make it non-broken in any
released version? As in, can we extract just the protocol change and
backpatch that to 9.2beta?
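
For readers following along, the shape of the change Heikki describes is
roughly this (a sketch; the Old/New names are mine, the field layout reflects
the historical record pointer):

#include <stdint.h>

/* historical representation: a pair of 32-bit values, which forces
 * macro-based comparison and arithmetic on record pointers */
typedef struct
{
    uint32_t xlogid;      /* log file #, i.e. the high 32 bits */
    uint32_t xrecoff;     /* byte offset within that log file */
} XLogRecPtrOld;

/* proposed representation: a flat 64-bit byte position in the WAL stream,
 * so ordinary integer comparison and arithmetic just work */
typedef uint64_t XLogRecPtrNew;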

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Release versioning inconsistency

2012-06-20 Thread Magnus Hagander
On Wed, Jun 20, 2012 at 11:23 AM, Marti Raudsepp  wrote:
> On Wed, Jun 20, 2012 at 12:18 PM, Magnus Hagander  wrote:
>> (I do believe that using the v9.2.0beta marker is
>> *better*, because then it sorts properly. But likely not enough better
>> to justify being inconsistent with previous versions)
>
> Good point. Maybe that's a reason to change the versioning scheme and
> stick with "9.2.0betaX" everywhere. Including calling the final
> release "9.2.0" instead of simply "9.2"?

That might actually be a good idea. We can't really change the way we
named the betas, but it's not too late to consider naming the actual
release as 9.2.0...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Release versioning inconsistency

2012-06-20 Thread Dickson S. Guedes
2012/6/20 Magnus Hagander :
> On Wed, Jun 20, 2012 at 11:23 AM, Marti Raudsepp  wrote:
>> On Wed, Jun 20, 2012 at 12:18 PM, Magnus Hagander  
>> wrote:
>>> (I do believe that using the v9.2.0beta marker is
>>> *better*, because then it sorts properly. But likely not enough better
>>> to justify being inconsistent with previous versions)
>>
>> Good point. Maybe that's a reason to change the versioning scheme and
>> stick with "9.2.0betaX" everywhere. Including calling the final
>> release "9.2.0" instead of simply "9.2"?
>
> That might actually be a good idea. We can't really change the way we
> named the betas, but it's not too late to consider naming the actual
> release as 9.2.0...


Maybe a symlink could be created just to fit the same pattern that other
versions use and keep the actual links (for beta) working.

I'm using the same pattern in `pgvm` [1] and it is failing to fetch
beta versions :(


[1] https://github.com/guedes/pgvm/blob/master/include/sites


regards
-- 
Dickson S. Guedes
mail/xmpp: gue...@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br



Re: [HACKERS] [ADMIN] pg_basebackup blocking all queries with horrible performance

2012-06-20 Thread Simon Riggs
On 11 June 2012 23:47, Magnus Hagander  wrote:

>> You agreed to add something like NOSYNC option into START_REPLICATION 
>> command?
>
> I'm on the fence. I was hoping somebody else would chime in with an
> opinion as well.

Why would you add it to synchronous_standby_names and then explicitly ignore it?

I don't see why you'd want this.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Release versioning inconsistency

2012-06-20 Thread Peter Eisentraut
On ons, 2012-06-20 at 13:26 +0200, Magnus Hagander wrote:
> On Wed, Jun 20, 2012 at 11:23 AM, Marti Raudsepp  wrote:
> > On Wed, Jun 20, 2012 at 12:18 PM, Magnus Hagander  
> > wrote:
> >> (I do believe that using the v9.2.0beta marker is
> >> *better*, because then it sorts properly. But likely not enough much
> >> better to be inconsistent with previous versions)
> >
> > Good point. Maybe that's a reason to change the versioning scheme and
> > stick with "9.2.0betaX" everywhere. Including calling the final
> > release "9.2.0" instead of simply "9.2"?
> 
> That might actually be a good idea. We can't really change the way we
> named the betas, but it's not too late to consider naming the actual
> release as 9.2.0...

The final release was always going to be called 9.2.0, but naming the
beta 9.2.0betaX is wrong.  There was a previous discussion about that
particular point.




[HACKERS] reviving AC_PROG_INSTALL

2012-06-20 Thread Peter Eisentraut
As I had written in [0], using /usr/bin/install instead of install-sh
can significantly reduce the time to run make install, and hence make
check.  Using /usr/bin/install is standard in all autotools-using
projects.  PostgreSQL removed the use of /usr/bin/install because of
the incident described in [1], which was basically configure picking up
some random "install" program in the path that didn't do what we
wanted.  The Autoconf test has been much improved in the meantime: it
now checks whether calling the program actually installs a file, so it's
unlikely that another false positive would be picked up.

Therefore I propose to put back the AC_PROG_INSTALL configure check and
associated bits that were taken out back then.  See attached patch.


[0]: 
http://petereisentraut.blogspot.fi/2012/03/postgresql-make-install-times.html
[1]: http://archives.postgresql.org/pgsql-hackers/2001-03/msg00312.php
diff --git a/configure b/configure
index 6f8ebd8..fd29770 100755
--- a/configure
+++ b/configure
@@ -693,6 +693,9 @@ MKDIR_P
 AWK
 LN_S
 TAR
+INSTALL_DATA
+INSTALL_SCRIPT
+INSTALL_PROGRAM
 WINDRES
 DLLWRAP
 DLLTOOL
@@ -6855,6 +6858,106 @@ fi
 
 fi
 
+# Find a good install program.  We prefer a C program (faster),
+# so one script is as good as another.  But avoid the broken or
+# incompatible versions:
+# SysV /etc/install, /usr/sbin/install
+# SunOS /usr/etc/install
+# IRIX /sbin/install
+# AIX /bin/install
+# AmigaOS /C/install, which installs bootblocks on floppy discs
+# AIX 4 /usr/bin/installbsd, which doesn't work without a -g flag
+# AFS /usr/afsws/bin/install, which mishandles nonexistent args
+# SVR4 /usr/ucb/install, which tries to use the nonexistent group "staff"
+# OS/2's system install, which has a completely different semantic
+# ./install, which can be erroneously created by make from ./install.sh.
+# Reject install programs that cannot install multiple files.
+{ $as_echo "$as_me:$LINENO: checking for a BSD-compatible install" >&5
+$as_echo_n "checking for a BSD-compatible install... " >&6; }
+if test -z "$INSTALL"; then
+if test "${ac_cv_path_install+set}" = set; then
+  $as_echo_n "(cached) " >&6
+else
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+  # Account for people who put trailing slashes in PATH elements.
+case $as_dir/ in
+  ./ | .// | /cC/* | \
+  /etc/* | /usr/sbin/* | /usr/etc/* | /sbin/* | /usr/afsws/bin/* | \
+  ?:\\/os2\\/install\\/* | ?:\\/OS2\\/INSTALL\\/* | \
+  /usr/ucb/* ) ;;
+  *)
+# OSF1 and SCO ODT 3.0 have their own names for install.
+# Don't use installbsd from OSF since it installs stuff as root
+# by default.
+for ac_prog in ginstall scoinst install; do
+  for ac_exec_ext in '' $ac_executable_extensions; do
+	if { test -f "$as_dir/$ac_prog$ac_exec_ext" && $as_test_x "$as_dir/$ac_prog$ac_exec_ext"; }; then
+	  if test $ac_prog = install &&
+	grep dspmsg "$as_dir/$ac_prog$ac_exec_ext" >/dev/null 2>&1; then
+	# AIX install.  It has an incompatible calling convention.
+	:
+	  elif test $ac_prog = install &&
+	grep pwplus "$as_dir/$ac_prog$ac_exec_ext" >/dev/null 2>&1; then
+	# program-specific install script used by HP pwplus--don't use.
+	:
+	  else
+	rm -rf conftest.one conftest.two conftest.dir
+	echo one > conftest.one
+	echo two > conftest.two
+	mkdir conftest.dir
+	if "$as_dir/$ac_prog$ac_exec_ext" -c conftest.one conftest.two "`pwd`/conftest.dir" &&
+	  test -s conftest.one && test -s conftest.two &&
+	  test -s conftest.dir/conftest.one &&
+	  test -s conftest.dir/conftest.two
+	then
+	  ac_cv_path_install="$as_dir/$ac_prog$ac_exec_ext -c"
+	  break 3
+	fi
+	  fi
+	fi
+  done
+done
+;;
+esac
+
+done
+IFS=$as_save_IFS
+
+rm -rf conftest.one conftest.two conftest.dir
+
+fi
+  if test "${ac_cv_path_install+set}" = set; then
+INSTALL=$ac_cv_path_install
+  else
+# As a last resort, use the slow shell script.  Don't cache a
+# value for INSTALL within a source directory, because that will
+# break other packages using the cache if that directory is
+# removed, or if the value is a relative name.
+INSTALL=$ac_install_sh
+  fi
+fi
+{ $as_echo "$as_me:$LINENO: result: $INSTALL" >&5
+$as_echo "$INSTALL" >&6; }
+
+# Use test -z because SunOS4 sh mishandles braces in ${var-val}.
+# It thinks the first close brace ends the variable substitution.
+test -z "$INSTALL_PROGRAM" && INSTALL_PROGRAM='${INSTALL}'
+
+test -z "$INSTALL_SCRIPT" && INSTALL_SCRIPT='${INSTALL}'
+
+test -z "$INSTALL_DATA" && INSTALL_DATA='${INSTALL} -m 644'
+
+# When Autoconf chooses install-sh as install program it tries to generate
+# a relative path to it in each makefile where it substitutes it. This clashes
+# with our Makefile.global concept. This workaround helps.
+case $INSTALL in
+  *install-sh*) INSTALL='';;
+esac
+
 # Extract the first word of "tar", so it can be a program name with args.
 set 

Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 19 June 2012 19:44, Peter Geoghegan  wrote:
> PostgreSQL supported Unicode before 2005, when the tie-breaker was
> introduced. I know at least one Swede who used Postgres95. I just took
> a look at the REL6_4 branch, and it looks much the same in 1999 as it
> did in 2005, in that there is no tie-breaker after the strcoll(). Now,
> that being the case, and Hungarian in particular having a whole bunch
> of these equivalencies, I have to wonder if the original complainant's
> problem really was diagnosed correctly. It could have had something to
> do with the fact that texteq() was confused about whether it reported
> equality or equivalency - it may have taken that long for the (len1 !=
> len2) fastpath thing (only holds for equality, not equivalence,
> despite the fact that the 2005-era strcoll() call checks equivalence
> within texteq() ) to trip someone out, because texteq() would have
> thereby given inconsistent answers in a very subtle way, that were not
> correct either according to the Hungarian locale, nor according to
> simple bitwise equality.

It seems likely that this is more-or-less correct. The two equivalent
strings had different lengths, so the (len1 != len2) fastpath made
texteq() disagree with varstr_cmp(), even though texteq() itself also
only had a single strcoll() call.

So there were some tuples with the Hungarian string "potty" in the 2005
bug report. They were not visible to any of the queries seen in the test
cases, even the "good" ones. There were a few tuples with equivalent
strings like "potyty" that were visible.

The index scans didn't fail to return the expected tuples because the
indexes were internally inconsistent or otherwise corrupt. Rather, the
_bt_checkkeys() function returned early because the faulty "half
equivalence, half equality" texteq() comparator reported that the
tuple returned didn't satisfy the qual, and on that basis the index
scan stopped. This usually wouldn't happen with Swedish, because their
equivalencies tend to be one character long, and are less common.

So far, so good, but how did this not blow up sooner? Did Hungarians
only start using Postgres in late 2005, immediately after the 8.1
release? Hardly.

Commit c1d62bfd00f4d1ea0647e12947ca1de9fea39b33, made in late 2003,
"Add operator strategy and comparison-value datatype fields to
ScanKey", may be part of the problem here.

Consider this test within _bt_checkkeys(), that was changed by that commit:

-   if (key->sk_flags & SK_COMMUTE)
-   test = FunctionCall2(&key->sk_func,
-key->sk_argument, datum);
-   else
-   test = FunctionCall2(&key->sk_func,
-datum, key->sk_argument);
+   test = FunctionCall2(&key->sk_func, datum, key->sk_argument);

-   if (DatumGetBool(test) == !!(key->sk_flags & SK_NEGATE))
+   if (!DatumGetBool(test))

I think that this change may have made the difference between the
Hungarians getting away with it and not getting away with it. Might it
have been that for text, they were using some operator that wasn't '='
(perhaps one which has no fastpath, and thus correctly made a
representation about equivalency) rather than texteq prior to this
commit? I didn't eyeball the pg_amop entries of the era myself, but it
seems quite possible. In any case, I find it hard to believe that it
took at least ten years for this problem to manifest itself just
because it took that long for a Hungarian with a strcoll()
implementation that correctly represented equivalency to use Postgres.
A year is a more plausible window.

If we do introduce an idea of equivalency to make all this work, that
means there'll have to be equivalency verification wherever equality
verification returns false, in a number of places, including the
above. For GiST, the equivalent test, dubbed "the consistent function",
will still vary based on the strategy number used, which seems
analogous to the above.

So, you're going to have an extra strcoll()/strxfrm() + strcmp() here,
as part of a "not-equal-but-maybe-equivalent" test, which is bad.
However, if that means that we can cache a text constant as a
strxfrm() blob, and compare in a strxfrm()-wise fashion, that will
more than pay for itself, even for btree traversal alone. For a naive
strxfrm() + strcoll() implementation, that will save just under half
of the work, and everyone knows that the cost of the comparison is
what dominates here, particularly for certain collations.
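
A minimal sketch of that caching idea, using plain POSIX strxfrm()/strcmp()
and hypothetical names (error handling omitted):

#include <stdlib.h>
#include <string.h>

typedef struct CachedKey
{
    char   *blob;           /* strxfrm()-transformed comparison constant */
} CachedKey;

/* transform the constant once, up front */
static void
cache_key(CachedKey *key, const char *constant)
{
    size_t needed = strxfrm(NULL, constant, 0);    /* size query */

    key->blob = malloc(needed + 1);
    strxfrm(key->blob, constant, needed + 1);
}

/* each datum then costs one strxfrm() plus a cheap binary strcmp(),
 * instead of a full strcoll() against the constant every time */
static int
compare_to_key(const CachedKey *key, const char *datum)
{
    size_t  needed = strxfrm(NULL, datum, 0);
    char   *buf = malloc(needed + 1);
    int     cmp;

    strxfrm(buf, datum, needed + 1);
    cmp = strcmp(buf, key->blob);
    free(buf);
    return cmp;
}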

We'd probably formalise it to the point where there'd be a btree
strategy number and fully-fledged equivalency operator that the user
could conceivably use themselves.

There seems to be scope-creep here. I'm not sure that I should
continue with this as part of this review. Maybe this should be
something that I work on for the next commitfest.

It would be nice to hear what others thought of these ideas before I
actually start writing a patch that both fixes these problems (our
behaviour is incorrect for 

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Kevin Grittner
> Heikki Linnakangas  wrote:
 
> I don't like the idea of adding the origin id to the record header.
> It's only required in some occasions, and on some record types.
 
Right.
 
> And I'm worried it might not even be enough in more complicated
> scenarios.
>
> Perhaps we need a more generic WAL record annotation system, where
> a plugin can tack arbitrary information to WAL records. The extra
> information could be stored in the WAL record after the rmgr
> payload, similar to how backup blocks are stored. WAL replay could
> just ignore the annotations, but a replication system could use it
> to store the origin id or whatever extra information it needs.
 
Not only would that handle absolute versus relative updates and
origin id, but application frameworks could take advantage of such a
system for passing transaction metadata.  I've held back on one
concern so far that I'll bring up now because this suggestion would
address it nicely.
 
Our current trigger-driven logical replication includes a summary
which includes transaction run time, commit time, the transaction
type identifier, the source code line from which that transaction was
invoked, the user ID with which the user connected to the application
(which isn't the same as the database login), etc.  Being able to
"decorate" a database transaction with arbitrary (from the DBMS POV)
metadata would be very valuable.  In fact, our shop can't maintain
the current level of capabilities without *some* way to associate
such information with a transaction.
 
I think that using up the only unused space in the fixed header to
capture one piece of the transaction metadata needed for logical
replication, and that only in some configurations, is short-sighted.
If we solve the general problem of transaction metadata, this one
specific case will fall out of that.
 
I think removing origin ID from this patch and submitting a separate
patch for a generalized transaction metadata system is the sensible
way to go.
 
-Kevin



Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 6:59 AM, Andres Freund  wrote:
>> My guess is that it wouldn't be too hard to remove some of the extra
>> pointers.  Anyone who is using Dllist as a non-inline list could be
>> converted to List * instead.
> There are only three users of the whole dllist.h. Catcache, autovacuum and
> postmaster. The latter two just keep a list of databases around. So any change
> will only be moderately intrusive.

Yeah.

>> Also, the performance-critical things
>> could be reimplemented as macros.
>
>> I question, though, whether we really need both single and doubly linked
>> lists.  That seems like it's almost certainly micro-optimization that we are
>> better off not doing.
> It was certainly worthwhile for the memory manager (lower per-allocation
> overhead). You might be right that it's not worth it for many other possible
> use cases in pg. It's not much code though.
>
> *looks around*
>
> A quick grep found:
>
> singly-linked-list-like code:  guc_private.h, aset.c, elog.h, regguts.h (ok,
> irrelevant), dynahash.c, resowner.c, extension.c, pgstat.c, xlog.c
> doubly-linked-list-like code: shmqueue.c, lwlock.c, dynahash.c, xact.c
>
> I stopped at that point because while surely not all of the above use cases
> could be replaced by a common implementation, several could, from a quick look.
> Also, several pg_list.h users could benefit from a conversion. So I think
> adding a singly linked list implementation is worthwhile.

I can believe that, although I fear it may be a distraction in the
grand scheme of getting logical replication implemented.  There should
be very few places where this is actually performance-critical, and
Tom will complain about large amounts of code churn that don't improve
performance.

If we're going to do that, how about transforming dllist.h into the
doubly-linked list and adding sllist.h for the singly-linked list?

>> > The most contentious point is probably relying on USE_INLINE being
>> > available anywhere. Which I believe to be the point now that we have
>> > gotten rid of some platforms.
>> I would be hesitant to chuck that even though I realize it's unlikely
>> that we really need !USE_INLINE.  But see sortsupport for an example
>> of how we've handled this in the recent past.
> I agree it's possible to resolve this. But hell, do we really need to add all
> that ugliness in 2012? I don't think it's worth the effort of supporting
> ancient compilers that don't support inline anymore. If we could stop catering
> for probably non-existent compilers we could remove some *very* ugly long
> macros, for example in htup.h.

I don't feel qualified to make a decision on this one, so will defer
to the opinions of others.

> If support for !USE_INLINE is required I would prefer to have a header define
> the functions like
>
> #ifdef USE_INLINE
> #define OPTIONALLY_INLINE static inline
> #define USE_LINKED_LIST_IMPL
> #endif
>
> #ifdef USE_LINKED_LIST_IMPL
>
> OPTIONALLY_INLINE void myFuncCall(){
> ...
> }
> #endif
>
> which then gets included, with USE_LINKED_LIST_IMPL defined, by some C file
> that defines OPTIONALLY_INLINE to something empty if !USE_INLINE.
> It's too much code to duplicate, imo.

Neat trick.  Maybe we should revise the sortsupport stuff to do it that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund  wrote:
> As I said before, I definitely agree that we want to have a separate transport
> format once we have decoding nailed down. We still need to ship wal around if
> the decoding happens in a different instance, but *after* that it can be
> shipped in something more convenient/appropriate.

Right, OK, we agree on this then.

> One bit is fine if you have only very simple replication topologies. Once you
> think about globally distributed databases it's a bit different. You describe
> some of that below, but just to reiterate:
> Imagine having 6 nodes, 3 on each of two continents (ABC in North America, DEF
> in Europe). You may only want to have a full intercontinental interconnect
> between two of those (say A and D). If you only have one bit to represent the
> origin, that's not going to work, because you won't be able to discern the
> changes from B and C on A from those originating on D, E, or F.

I don't see the problem.  A certainly knows via which link the LCRs arrived.

So: change happens on A.  A sends the change to B, C, and D.  B and C
apply the change.  One bit is enough to keep them from regenerating
new LCRs that get sent back to A.  So they're fine.  D also receives
the changes (from A) and applies them, but it also does not need to
regenerate LCRs.  Instead, it can take the LCRs that it has already
got (from A) and send those to E and F.

Or: change happens on B.  B sends the changes to A.  Since A knows the
network topology, it sends the changes to C and D.  D sends them to E
and F.  Nobody except B needs to *generate* LCRs.  All any other node
needs to do is suppress *redundant* LCR generation.

> Another topology which is interesting is circular replication (i.e. changes
> get shipped A->B, B->C, C->A), which is a sensible topology if you only have a
> low change rate and a relatively high number of nodes, because you don't need
> the full combinatorial number of connections.

I think this one is OK too.  You just generate LCRs on the origin node
and then pass them around the ring at every step.  When the next hop
would be the origin node then you're done.

I think you may be imagining that A generates LCRs and sends them to
B.  B applies them, and then from the WAL just generated, it produces
new LCRs which then get sent to C.  If you do that, then, yes,
everything that you need to disentangle various network topologies
must be present in WAL.  But what I'm saying is: don't do it like
that.  Generate the LCRs just ONCE, at the origin node, and then pass
them around the network, applying them at every node.  Then, the
information that is needed in WAL is confined to one bit: the
knowledge of whether or not a particular transaction is local (and
thus LCRs should be generated) or non-local (and thus they shouldn't,
because the origin already generated them and thus we're just handing
them around to apply everywhere).
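
A toy sketch of that rule, with entirely hypothetical names and stubs standing
in for the real apply/transport machinery:

typedef struct LCR LCR;              /* opaque logical change record */

typedef struct Node
{
    struct Node **forward_to;        /* nodes this one feeds, per topology */
    int           nforward;
} Node;

static void apply_lcr(Node *n, const LCR *lcr) { (void) n; (void) lcr; }
static void send_lcr(Node *to, const LCR *lcr) { (void) to; (void) lcr; }

/* LCRs are generated exactly once, on the origin node; everyone else just
 * applies what was received and forwards the same LCRs onward.  The WAL
 * written while applying is marked non-local, so it is never decoded into
 * new LCRs -- one bit of origin information suffices. */
static void
receive_lcr(Node *n, const LCR *lcr)
{
    apply_lcr(n, lcr);
    for (int i = 0; i < n->nforward; i++)
        send_lcr(n->forward_to[i], lcr);
}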

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 02:51:30 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 6:59 AM, Andres Freund  
wrote:
> >> Also, the performance-critical things
> >> could be reimplemented as macros.
> >> 
> >> I question, though, whether we really need both single and doubly linked
> >> lists.  That seems like it's almost certainly micro-optimization that we
> >> are better off not doing.
> > 
> > It was certainly worthwhile for the memory manager (lower per-allocation
> > overhead). You might be right that it's not worth it for many other
> > possible use cases in pg. It's not much code though.
> > 
> > *looks around*
> > 
> > A quick grep found:
> > 
> > singly-linked-list-like code:  guc_private.h, aset.c, elog.h, regguts.h
> > (ok, irrelevant), dynahash.c, resowner.c, extension.c, pgstat.c, xlog.c
> > doubly-linked-list-like code: shmqueue.c, lwlock.c, dynahash.c, xact.c
> > 
> > I stopped at that point because while surely not all of the above
> > use cases could be replaced by a common implementation, several could,
> > from a quick look. Also, several pg_list.h users could benefit from a
> > conversion. So I think adding a singly linked list implementation is
> > worthwhile.
> 
> I can believe that, although I fear it may be a distraction in the
> grand scheme of getting logical replication implemented.  There should
> be very few places where this is actually performance-critical, and
> Tom will complain about large amounts of code churn that don't improve
> performance.
Uh, I don't want to just go around and replace things randomly. Actually I 
don't want to change anything for now except what's necessary to get the patch 
in. The point I tried to make was just that the relatively widespread usage of 
similar structures makes it likely that this can be used in more places in the 
future.

> If we're going to do that, how about transforming dllist.h into the
> doubly-linked list and adding sllist.h for the singly-linked list?
I would be fine with that.

I will go and try to cook up a patch, assuming for now that we rely on inline; 
the ugliness can be added back afterwards.

> >> > The most contentious point is probably relying on USE_INLINE being
> >> > available anywhere. Which I believe to be the point now that we have
> >> > gotten rid of some platforms.

> >> I would be hesitant to chuck that even though I realize it's unlikely
> >> that we really need !USE_INLINE.  But see sortsupport for an example
> >> of how we've handled this in the recent past.
> > 
> > I agree it's possible to resolve this. But hell, do we really need to add
> > all that ugliness in 2012? I don't think it's worth the effort of
> > supporting ancient compilers that don't support inline anymore. If we
> > could stop catering for probably non-existent compilers we could remove
> > some *very* ugly long macros, for example in htup.h.
> 
> I don't feel qualified to make a decision on this one, so will defer
> to the opinions of others.
Ok.

> > If support for !USE_INLINE is required I would prefer to have a header
> > define the functions like
> > 
> > #ifdef USE_INLINE
> > #define OPTIONALLY_INLINE static inline
> > #define USE_LINKED_LIST_IMPL
> > #endif
> > 
> > #ifdef USE_LINKED_LIST_IMPL
> > 
> > OPTIONALLY_INLINE void myFuncCall(){
> > ...
> > }
> > #endif
> > 
> > which then gets included, with USE_LINKED_LIST_IMPL defined, by some C file
> > that defines OPTIONALLY_INLINE to something empty if !USE_INLINE.
> > It's too much code to duplicate, imo.
> 
> Neat trick.  Maybe we should revise the sortsupport stuff to do it that
> way.
Either that or at least add a comment to both that it's duplicated...

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs  wrote:
> The idea that logical rep is some kind of useful end goal in itself is
> slightly misleading. If the thought is to block multi-master
> completely on that basis, that would be a shame. Logical rep is the
> mechanism for implementing multi-master.

If you're saying that single-master logical replication isn't useful,
I disagree.  Of course, having both single-master and multi-master
replication together is even more useful.  But I think getting even
single-master logical replication working well in a single release
cycle is going to be a job and a half.  Thinking that we're going to
get MMR in one release is not realistic.  The only way to make it
realistic is to put MMR ahead of every other goal that people have for
logical replication, including robustness and stability.  It's
entirely premature to be designing features for MMR when we don't even
have the design for SMR nailed down yet.  And that's even assuming we
EVER want MMR in core, which has not even really been argued, let
alone agreed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 03:19:55 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs  wrote:
> > The idea that logical rep is some kind of useful end goal in itself is
> > slightly misleading. If the thought is to block multi-master
> > completely on that basis, that would be a shame. Logical rep is the
> > mechanism for implementing multi-master.
> 
> If you're saying that single-master logical replication isn't useful,
> I disagree.  Of course, having both single-master and multi-master
> replication together is even more useful.  But I think getting even
> single-master logical replication working well in a single release
> cycle is going to be a job and a half.  Thinking that we're going to
> get MMR in one release is not realistic.  The only way to make it
> realistic is to put MMR ahead of every other goal that people have for
> logical replication, including robustness and stability.  It's
> entirely premature to be designing features for MMR when we don't even
> have the design for SMR nailed down yet.  And that's even assuming we
> EVER want MMR in core, which has not even really been argued, let
> alone agreed.
I agree it has not been agreed upon, but I certainly would consider 
submitting a prototype implementing it an argument for doing it ;)

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-20 Thread Amit Kapila

> I'm almost inclined to suggest that we not get next-LSN from WAL, but
> by scanning all the pages in the main data store and computing the max
> observed LSN.  This is clearly not very attractive from a performance
> standpoint, but it would avoid the obvious failure mode where you lost
> some recent WAL segments along with pg_control.

According to my analysis, this will have some problems. 
I will explain by taking an example scenario.

Example scenario -
Let us assume that the database crashes and could be recovered by doing crash 
recovery.
Now assume we have the data files and WAL files intact and only the control 
file is lost. 
Now the user uses pg_resetxlog to generate the pg_control file, and we use the 
new algorithm to generate the next-LSN.

Summary of events before the database crash -
1. A checkpoint was in progress; it had already noted the next-LSN location 
(LSN-107) and marked the dirty pages as BM_CHECKPOINT_NEEDED.
2. At this point a new transaction dirties 2 pages: first it dirties a fresh 
page (for this change, LSN-108), 
   and then it dirties one which is already marked as BM_CHECKPOINT_NEEDED (for 
this change, LSN-109).
3. The checkpoint starts flushing pages.
4. It flushes the page with LSN-109 but not the page with LSN-108.
5. The checkpoint finishes.
6. The database crashes.

Normal crash recovery - 
It will start the replay from LSN-107, and after recovery the database will be 
in a consistent state.

pg_resetxlog -
It will compute the next-LSN as 109, which when used for recovery will 
produce an inconsistent database.
However, if we had relied on WAL, it would have got LSN-107 as the next-LSN.

This is just an example case to show that there can be some problems with the 
algorithm of generating the next-LSN from pages. However, it doesn't prove that 
generating it from WAL will be correct.

Please correct my understanding if I am wrong.
 
With Regards,
Amit Kapila.




[HACKERS] [PATCH] Add some more documentation for array indexes/operators

2012-06-20 Thread Ryan Kelly
I had trouble finding out from the documentation for the array data type
which operators arrays support, which of those have index support, or even
that arrays can be indexed at all. So, patch.
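
For instance, the point about indexed operations boils down to something like
this (a sketch; the table follows the docs' sal_emp example, and the built-in
GIN support for arrays is assumed):

-- a GIN index lets the containment/overlap operators use an index scan
-- instead of visiting every row
CREATE INDEX sal_emp_pay_idx ON sal_emp USING gin (pay_by_quarter);

-- can now locate employees whose pay_by_quarter contains 10000 via the index
SELECT * FROM sal_emp WHERE pay_by_quarter @> ARRAY[10000];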

-Ryan Kelly
diff --git a/doc/src/sgml/array.sgml b/doc/src/sgml/array.sgml
index 3508ba3..51d996d 100644
--- a/doc/src/sgml/array.sgml
+++ b/doc/src/sgml/array.sgml
@@ -573,6 +573,33 @@ SELECT * FROM
   This function is described in .
  
 
+ 
+  You can also search with arrays using the @>,
+  <@, and && operators which check
+  whether the left operand contains, is contained by, or overlaps with
+  the right operand, respectively. For instance,
+
+
+SELECT * FROM sal_emp WHERE pay_by_quarter && ARRAY[1];
+
+
+  will find all rows where the pay_by_quarter
+  overlaps with an array containing 1, that is, any
+  employee with a quarterly salary of 1 will be matched.
+ 
+
+ 
+  See  for more details about array operator
+  behavior.
+ 
+
+ 
+  Additionally, a number of operators support indexed operations. This means
+  that the example above could utilize an index to efficiently locate those 
+  employees with the desired quarterly salary. See 
+  for more details about which operators support indexed operations.
+ 
+
  
   
Arrays are not sets; searching for specific array elements
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index cd374ac..2cef8e4 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10282,7 +10282,8 @@ SELECT NULLIF(value, '(none)') ...
 
   
See  for more details about array operator
-   behavior.
+   behavior. See  for more details about
+   which operators support indexed operations.
   
 
   



Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 9:12 AM, Andres Freund  wrote:
> Uh, I don't want to just go around and replace things randomly. Actually I
> don't want to change anything for now except what's necessary to get the patch
> in. The point I tried to make was just that the relatively widespread usage of
> similar structures makes it likely that this can be used in more places in the
> future.

Well, the question is for anywhere you might be thinking of using
this: why not just use List?  We do that in a lot of other places, and
there's not much reason to invent something new unless there is a
problem with what we already have.  I assume this is related to
logical replication somehow, but it's not clear to me exactly what
problem you hit doing this in the obvious way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 21:19, Robert Haas  wrote:
> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs  wrote:
>> The idea that logical rep is some kind of useful end goal in itself is
>> slightly misleading. If the thought is to block multi-master
>> completely on that basis, that would be a shame. Logical rep is the
>> mechanism for implementing multi-master.
>
> If you're saying that single-master logical replication isn't useful,
> I disagree.  Of course, having both single-master and multi-master
> replication together is even more useful.

> But I think getting even
> single-master logical replication working well in a single release
> cycle is going to be a job and a half.

OK, so your estimate is 1.5 people to do that. And if we have more
people, should they sit around doing nothing?

> Thinking that we're going to
> get MMR in one release is not realistic.

If you block it, then the above becomes true, whether or not it starts true.

You may not want MMR, but others do. I see no reason to prevent people
from having it, which is what you suggest.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 03:24:58 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 9:12 AM, Andres Freund  
wrote:
> > Uh, I don't want to just go around and replace things randomly.
> > Actually I don't want to change anything for now except what's necessary
> > to get the patch in. The point I tried to make was just that the
> > relatively widespread usage of similar structures makes it likely that
> > this can be used in more places in the future.
> 
> Well, the question is for anywhere you might be thinking of using
> this: why not just use List?  We do that in a lot of other places, and
> there's not much reason to invent something new unless there is a
> problem with what we already have.  I assume this is related to
> logical replication somehow, but it's not clear to me exactly what
> problem you hit doing this in the obvious way.
It incurs a rather high performance overhead due to added memory allocations 
and added pointer indirections. That's fine for most of the current users of 
the List interface, but certainly not for all. In other places you cannot even 
have memory allocations because the list lives in shared memory.

E.g. in the ApplyCache, where I use the submitted ilist.h stuff, when 
reconstructing transactions you add to a potentially really long linked list 
of individual changes for every interesting WAL record. Before I prevented 
memory allocations in that path it took about 12-14% of the time when applying 
changes in the same backend. Afterwards it wasn't visible in the profile 
anymore.
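
For readers unfamiliar with the idea, here is a minimal sketch of such an
embedded ("intrusive") list; the names are illustrative, not necessarily the
submitted ilist.h API. The links live inside the payload struct, so appending
a change allocates nothing beyond the change itself:

#include <stddef.h>

typedef struct dlist_node
{
    struct dlist_node *prev;
    struct dlist_node *next;
} dlist_node;

/* circular list with a sentinel head */
static void
dlist_init(dlist_node *head)
{
    head->prev = head->next = head;
}

static void
dlist_push_tail(dlist_node *head, dlist_node *node)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* the payload embeds its own link: no per-element allocation, and one
 * less pointer indirection when walking the list */
typedef struct ApplyChange
{
    dlist_node node;                 /* embedded link */
    /* ... decoded change data ... */
} ApplyChange;

/* recover the payload from a link pointer */
#define dlist_container(type, field, ptr) \
    ((type *) ((char *) (ptr) - offsetof(type, field)))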

Several of the pieces of code I pointed out in a previous email use open-coded 
list implementations exactly to prevent those problems.

If you look at the parsing, planning & execution of trivial statements you 
will also notice the overhead of memory allocations. A good bit of that is 
caused by list manipulation. Check Stephen Frost's "Pre-alloc ListCell's 
optimization" for workarounds.

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-20 Thread Aidan Van Dyk
On Wed, Jun 20, 2012 at 9:21 AM, Amit Kapila  wrote:

> Example Scenario -

> Now assume we have Data files and WAL files intact and only control file is 
> lost.


Just so I understand correctly, the aim of this is to "fix" the
situation where out of the thousands of files and 100s of GB of data
in my pg directory, the *only* corruption is that a single file,
pg_control, is missing?

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs  wrote:
> On 20 June 2012 21:19, Robert Haas  wrote:
>> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs  wrote:
>>> The idea that logical rep is some kind of useful end goal in itself is
>>> slightly misleading. If the thought is to block multi-master
>>> completely on that basis, that would be a shame. Logical rep is the
>>> mechanism for implementing multi-master.
>>
>> If you're saying that single-master logical replication isn't useful,
>> I disagree.  Of course, having both single-master and multi-master
>> replication together is even more useful.
>
>> But I think getting even
>> single-master logical replication working well in a single release
>> cycle is going to be a job and a half.
>
> OK, so your estimate is 1.5 people to do that. And if we have more
> people, should they sit around doing nothing?

Oh, give me a break.  You're willfully missing my point.  And to quote
Fred Brooks, nine women can't make a baby in one month.

>> Thinking that we're going to
>> get MMR in one release is not realistic.
>
> If you block it, then the above becomes true, whether or not it starts true.

If I had no rational basis for my objections, that would be true.
You've got four people objecting to this patch now, all of whom happen
to be committers.  Whether or not MMR goes into core, who knows, but
it doesn't seem that this patch is going to fly.

My main point in bringing this up is that if you pick a project that
is too large, you will fail.  As I would rather see this project
succeed, I recommend that you don't do that.  Both you and Andres seem
to believe that MMR is a reasonable first target to shoot at, but I
don't think anyone else - including the Slony developers who have
commented on this issue - endorses that position.  At PGCon, you were
talking about getting a new set of features into PG over the next 3-5
years.  Now, it seems like you want to compress that timeline to a
year.  I don't think that's going to work.  You also requested that
people tell you sooner when large patches were in danger of not making
the release.  Now I'm doing that, VERY early, and you're apparently
angry about it.  If the only satisfactory outcome of this conversation
is that everyone agrees with the design pattern you've already decided
on, then you haven't left yourself very much room to walk away
satisfied.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 03:02:28 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund  
wrote:
> > One bit is fine if you have only very simple replication topologies. Once
> > you think about globally distributed databases it's a bit different. You
> > describe some of that below, but just to reiterate:
> > Imagine having 6 nodes, 3 on each of two continents (ABC in North America,
> > DEF in Europe). You may only want to have a full intercontinental
> > interconnect between two of those (say A and D). If you only have one
> > bit to represent the origin, that's not going to work, because you won't
> > be able to discern the changes from B and C on A from those originating
> > on D, E, or F.
> 
> I don't see the problem.  A certainly knows via which link the LCRs
> arrived.

> So: change happens on A.  A sends the change to B, C, and D.  B and C
> apply the change.  One bit is enough to keep them from regenerating
> new LCRs that get sent back to A.  So they're fine.  D also receives
> the changes (from A) and applies them, but it also does not need to
> regenerate LCRs.  Instead, it can take the LCRs that it has already
> got (from A) and send those to E and F.

> Or: change happens on B.  B sends the changes to A.  Since A knows the
> network topology, it sends the changes to C and D.  D sends them to E
> and F.  Nobody except B needs to *generate* LCRs.  All any other node
> needs to do is suppress *redundant* LCR generation.
> 
> > Another topology which is interesting is circular replication (i.e.
> > changes get shipped A->B, B->C, C->A), which is a sensible topology if
> > you only have a low change rate and a relatively high number of nodes,
> > because you don't need the full combinatorial number of connections.
> 
> I think this one is OK too.  You just generate LCRs on the origin node
> and then pass them around the ring at every step.  When the next hop
> would be the origin node then you're done.
> 
> I think you may be imagining that A generates LCRs and sends them to
> B.  B applies them, and then from the WAL just generated, it produces
> new LCRs which then get sent to C. 
Yes, that's what I am proposing.

> If you do that, then, yes,
> everything that you need to disentangle various network topologies
> must be present in WAL.  But what I'm saying is: don't do it like
> that.  Generate the LCRs just ONCE, at the origin node, and then pass
> them around the network, applying them at every node.  Then, the
> information that is needed in WAL is confined to one bit: the
> knowledge of whether or not a particular transaction is local (and
> thus LCRs should be generated) or non-local (and thus they shouldn't,
> because the origin already generated them and thus we're just handing
> them around to apply everywhere).
Sure, you can do it that way, but I don't think it's a good idea. If you do it 
my way, you *guarantee* that when replaying changes from node B on node C you 
have replayed changes from A at least as far as B has. That's a really nice 
property for MM.
You *can* get the same with your solution, but it starts to get complicated 
rather fast, while my/our proposed solution is trivial to implement.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 20:37, Kevin Grittner  wrote:
>> Heikki Linnakangas  wrote:
>
>> I don't like the idea of adding the origin id to the record header.
>> It's only required in some occasions, and on some record types.
>
> Right.

Wrong, as explained.

>> And I'm worried it might not even be enough in more complicated
>> scenarios.
>>
>> Perhaps we need a more generic WAL record annotation system, where
>> a plugin can tack arbitrary information to WAL records. The extra
>> information could be stored in the WAL record after the rmgr
>> payload, similar to how backup blocks are stored. WAL replay could
>> just ignore the annotations, but a replication system could use it
>> to store the origin id or whatever extra information it needs.
>
> Not only would that handle absolute versus relative updates and
> origin id, but application frameworks could take advantage of such a
> system for passing transaction metadata.  I've held back on one
> concern so far that I'll bring up now because this suggestion would
> address it nicely.
>
> Our current trigger-driven logical replication includes a summary
> which includes transaction run time, commit time, the transaction
> type identifier, the source code line from which that transaction was
> invoked, the user ID with which the user connected to the application
> (which isn't the same as the database login), etc.  Being able to
> "decorate" a database transaction with arbitrary (from the DBMS POV)
> metadata would be very valuable.  In fact, our shop can't maintain
> the current level of capabilities without *some* way to associate
> such information with a transaction.

> I think that using up the only unused space in the fixed header to
> capture one piece of the transaction metadata needed for logical
> replication, and that only in some configurations, is short-sighted.
> If we solve the general problem of transaction metadata, this one
> specific case will fall out of that.

The proposal now includes flag bits that would allow the addition of a
variable length header, should that ever become necessary. So the
unused space in the fixed header is not being "used up" as you say. In
any case, the fixed header still has 4 wasted bytes on 64-bit systems
even after the patch is applied. So this claim of short-sightedness is
just plain wrong.

It isn't true that this is needed only for some configurations of
multi-master, per discussion.

This is not transaction metadata, it is WAL record metadata required
for multi-master replication, see later point.

We need to add information to every WAL record that is used as the
source for generating LCRs. It is also possible to add this to HEAP
and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
using the *currently wasted* bytes on a WAL record header does *not*
bloat the WAL stream.

> I think removing origin ID from this patch and submitting a separate
> patch for a generalized transaction metadata system is the sensible
> way to go.

We already have a very flexible WAL system for recording data of
interest to various resource managers. If you wish to annotate a
transaction, you can either generate a new kind of WAL record or you
can enhance a commit record. There are already unused flag bits on
commit records for just such a purpose.

XLOG_NOOP records can already be generated by your application if you
wish to inject additional metadata to the WAL stream. So no changes
are required for you to implement the generalised transaction metadata
scheme you say you require.
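
For concreteness, a rough sketch of such an injection against the
9.2-era WAL API (untested, and the helper name is made up): an
XLOG_NOOP record carries an opaque payload that WAL replay ignores but
a decoder could read back.

#include "postgres.h"
#include "access/xlog.h"
#include "catalog/pg_control.h"

static XLogRecPtr
log_txn_annotation(char *annotation, uint32 len)
{
    XLogRecData rdata;

    rdata.data = annotation;      /* opaque payload for a decoder */
    rdata.len = len;
    rdata.buffer = InvalidBuffer; /* not associated with any data page */
    rdata.buffer_std = false;
    rdata.next = NULL;

    return XLogInsert(RM_XLOG_ID, XLOG_NOOP, &rdata);
}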

Not sure how or why that relates to requirements for multi-master.


Please note that I've suggested review changes to Andres' work myself.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Hannu Valtonen
On 06/19/2012 01:47 AM, Christopher Browne wrote:
> That numbering scheme gets pretty anti-intuitive fairly quickly,
> from whence we took the approach of having a couple digits
> indicating data centre followed by a digit indicating which node in
> that data centre.
> 
> If that all sounds incoherent, well, the more nodes you have
> around, the more difficult it becomes to make sure you *do* have a
> coherent picture of your cluster.
> 
> I recall the Slony-II project having a notion of attaching a
> permanent UUID-based node ID to each node.  As long as there is
> somewhere decent to find a symbolically significant node "name," I
> like the idea of the ID *not* being in a tiny range, and being
> UUID/OID-like...
> 

Just as a side note, MySQL's new global transaction IDs use a UUID for
the server-id part. [1]

- - Hannu Valtonen

[1]
http://d2-systems.blogspot.fi/2012/04/global-transaction-identifiers-are-in.html



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund  wrote:
>> If you do that, then, yes,
>> everything that you need to disentangle various network topologies
>> must be present in WAL.  But what I'm saying is: don't do it like
>> that.  Generate the LCRs just ONCE, at the origin node, and then pass
>> them around the network, applying them at every node.  Then, the
>> information that is needed in WAL is confined to one bit: the
>> knowledge of whether or not a particular transaction is local (and
>> thus LCRs should be generated) or non-local (and thus they shouldn't,
>> because the origin already generated them and thus we're just handing
>> them around to apply everywhere).
> Sure, you can do it that way, but I don't think it's a good idea. If you do it
> my way you *guarantee* that when replaying changes from node B on node C you
> have replayed changes from A at least as far as B has. That's a really nice
> property for MM.
> You *can* get the same with your solution, but it starts to get complicated
> rather fast, while my/our proposed solution is trivial to implement.

That's an interesting point.  I agree that's a useful property, and
might be a reason not to just use a single-bit flag, but I still think
you'd be better off handling that requirement via some other method,
like logging the node ID once per transaction or something.  That lets
you have as much metadata as you end up needing, which is a lot more
flexible than a 16-bit field, as Kevin, Heikki, and Tom have also
said.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 03:42:39 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs  wrote:
> > On 20 June 2012 21:19, Robert Haas  wrote:
> >> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs  wrote:
> >>> The idea that logical rep is some kind of useful end goal in itself is
> >>> slightly misleading. If the thought is to block multi-master
> >>> completely on that basis, that would be a shame. Logical rep is the
> >>> mechanism for implementing multi-master.
> >> 
> >> If you're saying that single-master logical replication isn't useful,
> >> I disagree.  Of course, having both single-master and multi-master
> >> replication together is even more useful.
> >> 
> >> But I think getting even
> >> single-master logical replication working well in a single release
> >> cycle is going to be a job and a half.
> >> Thinking that we're going to
> >> get MMR in one release is not realistic.
> > If you block it, then the above becomes true, whether or not it starts
> > true.
> My main point in bringing this up is that if you pick a project that
> is too large, you will fail.
We're not the only ones here performing scope creep, though... I think just 
about everyone who has posted in the whole thread, except maybe Tom and 
Marko, is guilty of doing so.

I still think it's rather sensible to focus on exactly duplicated schemas in a 
very first version, just because that leaves out some of the complexity while 
paving the road for other nice things.

> You've got four people objecting to this patch now, all of whom happen
> to be committers.  Whether or not MMR goes into core, who knows, but
> it doesn't seem that this patch is going to fly.
I find it a bit too early to say that. Sure, it won't fly exactly as proposed, 
but hell, who cares? What I want to get in is a solution to the specific 
problem the patch targets. At least you have (not sure about others) accepted 
that the problem needs a solution.
We do not agree yet on how that solution should look, but that's not exactly 
surprising as we started discussing the problem only a good day ago.

If people agree that your proposed way of just one flag bit is the way to go, 
we will have to live with that. But that's different from saying the whole 
thing is dead.

> As I would rather see this project
> succeed, I recommend that you don't do that.  Both you and Andres seem
> to believe that MMR is a reasonable first target to shoot at, but I
> don't think anyone else - including the Slony developers who have
> commented on this issue - endorses that position. 
I don't think we get full MMR into 9.3. What I am proposing is that we build 
in the few pieces that are required to implement MMR *on top* of what's 
hopefully in 9.3.
And I think that's a realistic goal.

> At PGCon, you were
> talking about getting a new set of features into PG over the next 3-5
> years.  Now, it seems like you want to compress that timeline to a
> year.
Well, I obviously would like everything to be done in one release, but I also 
would like to go hiking for a year, have a restored sailing boat and some 
more. That doesn't make it reality...
To make it absolutely clear: I definitely don't think it's realistic to have 
everything in 9.3. And I don't think Simon does either. What I want is to 
have the basic building blocks in 9.3.
The difference probably is just what counts as a building block.

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 03:54:43 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund  wrote:
> >> If you do that, then, yes,
> >> everything that you need to disentangle various network topologies
> >> must be present in WAL.  But what I'm saying is: don't do it like
> >> that.  Generate the LCRs just ONCE, at the origin node, and then pass
> >> them around the network, applying them at every node.  Then, the
> >> information that is needed in WAL is confined to one bit: the
> >> knowledge of whether or not a particular transaction is local (and
> >> thus LCRs should be generated) or non-local (and thus they shouldn't,
> >> because the origin already generated them and thus we're just handing
> >> them around to apply everywhere).
> > 
> > Sure, you can do it that way, but I don't think it's a good idea. If you
> > do it my way you *guarantee* that when replaying changes from node B on
> > node C you have replayed changes from A at least as far as B has. That's
> > a really nice property for MM.
> > You *can* get the same with your solution, but it starts to get complicated
> > rather fast, while my/our proposed solution is trivial to implement.
> 
> That's an interesting point.  I agree that's a useful property, and
> might be a reason not to just use a single-bit flag, but I still think
> you'd be better off handling that requirement via some other method,
> like logging the node ID once per transaction or something.  That lets
> you have as much metadata as you end up needing, which is a lot more
> flexible than a 16-bit field, as Kevin, Heikki, and Tom have also
> said.
If it comes down to that, I can definitely live with it. I still think that 
making the filtering trivial, so it can be done without any logic at a low 
level, is a very desirable property; but if not, so be it.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 21:42, Robert Haas  wrote:
> On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs  wrote:
>> On 20 June 2012 21:19, Robert Haas  wrote:
>>> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs  wrote:
>>>> The idea that logical rep is some kind of useful end goal in itself is
>>>> slightly misleading. If the thought is to block multi-master
>>>> completely on that basis, that would be a shame. Logical rep is the
>>>> mechanism for implementing multi-master.
>>>
>>> If you're saying that single-master logical replication isn't useful,
>>> I disagree.  Of course, having both single-master and multi-master
>>> replication together is even more useful.
>>
>>> But I think getting even
>>> single-master logical replication working well in a single release
>>> cycle is going to be a job and a half.
>>
>> OK, so your estimate is 1.5 people to do that. And if we have more
>> people, should they sit around doing nothing?
>
> Oh, give me a break.  You're willfully missing my point.  And to quote
> Fred Brooks, nine women can't make a baby in one month.


No, I'm not. The question is not how quickly can N people achieve a
single thing, but how long will it take a few skilled people working
on carefully selected tasks that have few dependencies between them to
achieve something.

We have significantly more preparation, development time and resources
than any other project previously undertaken for PostgreSQL that I am
aware of.

Stating that it is impossible to perform a task in a certain period of
time without even considering those points is clearly rubbish. I've
arrived at my thinking based upon detailed project planning of what
was possible in the time.

How exactly did you arrive at your conclusion? Why is yours right and
mine wrong?



>>> Thinking that we're going to
>>> get MMR in one release is not realistic.
>>
>> If you block it, then the above becomes true, whether or not it starts true.
>
> If I had no rational basis for my objections, that would be true.
> You've got four people objecting to this patch now, all of whom happen
> to be committers.  Whether or not MMR goes into core, who knows, but
> it doesn't seem that this patch is going to fly.

No, I have four people who had initial objections and who have not
commented on the fact that the points made are regrettably incorrect.
I don't expect everybody commenting on the design to have perfect
knowledge of the whole design, so I expect people to make errors in
their comments. I also expect people to take note of what has been
said before making further objections or drawing conclusions.

Since at least 3 of the people making such comments did not attend the
full briefing meeting in Ottawa, I am not particularly surprised.
However, I do expect people that didn't come to the meeting to
recognise that they are likely to be missing information and to listen
closely, as I listen to them.

"When the facts change, I change my mind. What do you do, sir?"


> My main point in bringing this up is that if you pick a project that
> is too large, you will fail.  As I would rather see this project
> succeed, I recommend that you don't do that.  Both you and Andres seem
> to believe that MMR is a reasonable first target to shoot at, but I
> don't think anyone else - including the Slony developers who have
> commented on this issue - endorses that position.  At PGCon, you were
> talking about getting a new set of features into PG over the next 3-5
> years.  Now, it seems like you want to compress that timeline to a
> year.  I don't think that's going to work.  You also requested that
> people tell you sooner when large patches were in danger of not making
> the release.  Now I'm doing that, VERY early, and you're apparently
> angry about it.  If the only satisfactory outcome of this conversation
> is that everyone agrees with the design pattern you've already decided
> on, then you haven't left yourself very much room to walk away
> satisfied.

Note that I have already myself given review feedback to Andres and
that change has visibly occurred during this thread via public debate.

Claiming that I only stick to what has already been decided is
patently false, with me at least.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] sortsupport for text

2012-06-20 Thread Greg Stark
On Sun, Jun 17, 2012 at 9:26 PM, Tom Lane  wrote:
> The trick for hashing such datatypes is to be able to guarantee that
> "equal" values hash to the same hash code, which is typically possible
> as long as you know the equality rules well enough.  We could possibly
> do that for text with pure-strcoll equality if we knew all the details
> of what strcoll would consider "equal", but we do not.

It occurs to me that strxfrm would answer this question. If we made
the hash function hash the result of strxfrm then we could make
equality use strcoll and not fall back to strcmp.
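
As a minimal sketch (standard C, assuming setlocale(LC_COLLATE, ...) has
been called; the FNV-1a hash here is just a stand-in for whatever hash
function is actually used): hashing the strxfrm() image guarantees that
any two strings strcoll() treats as equal produce the same hash.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static uint32_t
collation_aware_hash(const char *s)
{
    size_t      n = strxfrm(NULL, s, 0) + 1;    /* room incl. the NUL */
    char       *buf = malloc(n);
    uint32_t    h = 2166136261u;                /* FNV-1a offset basis */

    strxfrm(buf, s, n);
    for (size_t i = 0; buf[i] != '\0'; i++)
    {
        h ^= (unsigned char) buf[i];
        h *= 16777619;                          /* FNV-1a prime */
    }
    free(buf);
    return h;
}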

I suspect in a green field that's what we would do, though the CPU
cost might be enough to make us think hard about it. I'm not sure it's
worth considering switching now, though.

The case where it matters to users, incidentally, is when you have a
multi-column sort order and have values that are supposed to sort
equal in the first column but print differently. Given that there
seems to be some controversy in the locale definitions -- most locales
seem to use "insignificant" factors like accents or ligatures as
tie-breakers and avoid claiming different sequences are equal even
when the language usually treats them as equivalent -- it doesn't seem
super important to maintain the property for the few locales that fall
the other way. Unless my impression is wrong and there's a good
principled reason why some locales treat nearly equivalent strings one
way and some treat them the other.

-- 
greg



Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 20 June 2012 15:10, Greg Stark  wrote:
> On Sun, Jun 17, 2012 at 9:26 PM, Tom Lane  wrote:
>> The trick for hashing such datatypes is to be able to guarantee that
>> "equal" values hash to the same hash code, which is typically possible
>> as long as you know the equality rules well enough.  We could possibly
>> do that for text with pure-strcoll equality if we knew all the details
>> of what strcoll would consider "equal", but we do not.
>
> It occurs to me that strxfrm would answer this question. If we made
> the hash function hash the result of strxfrm then we could make
> equality use strcoll and not fall back to strcmp.

What about per-column collations?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] foreign key locks

2012-06-20 Thread Jaime Casanova
On Thu, Jun 14, 2012 at 11:41 AM, Alvaro Herrera  wrote:
> Hi,
>
> This is v12 of the foreign key locks patch.
>

Hi Álvaro,

Just noticed that this patch needs a rebase because of the refactoring
Tom did in ri_triggers.c

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 16:23, Heikki Linnakangas  wrote:
> On 20.06.2012 11:17, Simon Riggs wrote:
>>
>> On 20 June 2012 15:45, Heikki Linnakangas  wrote:
>>>
>>> On 20.06.2012 10:32, Simon Riggs wrote:


>>>> On 20 June 2012 14:40, Heikki Linnakangas wrote:
>>>>
>>>>> And I'm worried
>>>>> it might not even be enough in more complicated scenarios.
>>>>
>>>> It is not the only required conflict mechanism, and has never been
>>>> claimed to be so. It is simply one piece of information needed, at
>>>> various times.
>>>
>>>
>>> So, if the origin id is not sufficient for some conflict resolution
>>> mechanisms, what extra information do you need for those, and where
>>> do you put it?
>>
>>
>> As explained elsewhere, wal_level = logical (or similar) would be used
>> to provide any additional logical information required.
>>
>> Update and Delete WAL records already need to be different in that
>> mode, so additional info would be placed there, if there were any.
>>
>> In the case of reflexive updates you raised, a typical response in
>> other DBMS would be to represent the query
>>   UPDATE SET counter = counter + 1
>> by sending just the "+1" part, not the current value of counter, as
>> would be the case with the non-reflexive update
>>   UPDATE SET counter = 1
>>
>> Handling such things in Postgres would require some subtlety, which
>> would not be resolved in first release but is pretty certain not to
>> require any changes to the WAL record header as a way of resolving it.
>> Having already thought about it, I'd estimate that is a very long
>> discussion and not relevant to the OT, but if you wish to have it
>> here, I won't stop you.
>
>
> Yeah, I'd like to hear briefly how you would handle that without any further
> changes to the WAL record header.

I already did:

>> Update and Delete WAL records already need to be different in that
>> mode, so additional info would be placed there, if there were any.

The case you mentioned relates to UPDATEs only, so I would suggest
that we add that information to a new form of update record only.

That has nothing to do with the WAL record header.
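
To make the accumulate-versus-overwrite distinction concrete, here is an
illustrative sketch (hypothetical structs, not PostgreSQL code) of how an
apply process could treat a delta-carrying update record differently from
an absolute one:

typedef enum { LCR_ASSIGN, LCR_INCREMENT } UpdateKind;

typedef struct CounterLCR
{
    UpdateKind  kind;
    long        value;          /* absolute value, or the "+1" delta */
    long        commit_ts;      /* for newest-update-wins resolution */
} CounterLCR;

static void
apply_counter_lcr(long *counter, long *counter_ts, const CounterLCR *lcr)
{
    if (lcr->kind == LCR_INCREMENT)
        *counter += lcr->value;         /* deltas always accumulate */
    else if (lcr->commit_ts >= *counter_ts)
    {
        *counter = lcr->value;          /* newest assignment wins */
        *counter_ts = lcr->commit_ts;
    }
}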

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] sortsupport for text

2012-06-20 Thread Greg Stark
On Wed, Jun 20, 2012 at 3:19 PM, Peter Geoghegan  wrote:
>> It occurs to me that strxfrm would answer this question. If we made
>> the hash function hash the result of strxfrm then we could make
>> equality use strcoll and not fall back to strcmp.
>
> What about per-column collations?

Well, collations aren't really per-column; they're specific to the
comparison in the expression at hand. The per-column collation is just
the default for comparisons against that column.

So for a hash join, for example, you would build the hash table
using strxfrm for the collation being used by the join expression.
For hash indexes, the index would only be valid for a specific
collation, just like a btree index is.

But this all seems like a lot of work for a case that most locales
seem to think isn't worth worrying about.

-- 
greg



Re: [HACKERS] [ADMIN] PANIC while doing failover (streaming replication)

2012-06-20 Thread Matheus Ricardo Espanhol
Hi,

Any news on this issue? My slave server cannot be trusted; all the warnings
below relate to indexes of the main tables:

2012-06-14 11:46:23 BRT [18654]: [34603-1] user=,db= LOG:  recovery restart
point at 435/4E899CE8
2012-06-14 11:46:23 BRT [18654]: [34604-1] user=,db= DETAIL:  last
completed transaction was at log time 2012-06-14 11:46:22.770348-03
2012-06-14 11:48:53 BRT [18654]: [34605-1] user=,db= LOG:  restartpoint
starting: time
2012-06-14 11:51:09 BRT [18653]: [18-1] user=,db= LOG:  received promote
request
2012-06-14 11:51:09 BRT [11879]: [2-1] user=,db= FATAL:  terminating
walreceiver process due to administrator command
2012-06-14 11:51:09 BRT [18653]: [19-1] user=,db= LOG:  invalid record
length at 435/5079CAE8
2012-06-14 11:51:09 BRT [18653]: [20-1] user=,db= LOG:  redo done at
435/5079CAA8
2012-06-14 11:51:09 BRT [18653]: [21-1] user=,db= LOG:  last completed
transaction was at log time 2012-06-14 11:51:08.969771-03
2012-06-14 11:51:09 BRT [18653]: [22-1] user=,db= LOG:  selected new
timeline ID: 2
2012-06-14 11:51:09 BRT [18653]: [23-1] user=,db= LOG:  archive recovery
complete
2012-06-14 11:51:09 BRT [18653]: [24-1] user=,db= WARNING:  page 193569 of
relation base/16407/80416524 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [25-1] user=,db= WARNING:  page 129229 of
relation base/16407/38334886 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [26-1] user=,db= WARNING:  page 134146 of
relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [27-1] user=,db= WARNING:  page 318013 of
relation base/16407/38334887 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [28-1] user=,db= WARNING:  page 134143 of
relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [29-1] user=,db= WARNING:  page 156203 of
relation base/16407/38334883 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [30-1] user=,db= WARNING:  page 318009 of
relation base/16407/38334887 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [31-1] user=,db= WARNING:  page 370150 of
relation base/16407/38334888 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [32-1] user=,db= WARNING:  page 133811 of
relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [33-1] user=,db= WARNING:  page 317963 of
relation base/16407/38334884 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [34-1] user=,db= WARNING:  page 133808 of
relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [35-1] user=,db= WARNING:  page 133809 of
relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [36-1] user=,db= WARNING:  page 133810 of
relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [37-1] user=,db= WARNING:  page 129231 of
relation base/16407/38334886 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [38-1] user=,db= WARNING:  page 14329 of
relation base/16407/80430266 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [39-1] user=,db= WARNING:  page 134145 of
relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [40-1] user=,db= WARNING:  page 134144 of
relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [41-1] user=,db= WARNING:  page 317957 of
relation base/16407/38334884 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [42-1] user=,db= WARNING:  page 129230 of
relation base/16407/38334886 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [43-1] user=,db= WARNING:  page 156201 of
relation base/16407/38334883 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [44-1] user=,db= PANIC:  WAL contains
references to invalid pages
2012-06-14 11:51:09 BRT [18651]: [4-1] user=,db= LOG:  startup process (PID
18653) was terminated by signal 6: Aborted
2012-06-14 11:51:09 BRT [18651]: [5-1] user=,db= LOG:  terminating any
other active server processes

PostgreSQL version 9.1.3.

Thanks in advance for any suggestions.


-Matheus



2011/7/1 Fujii Masao 

> On Fri, Jul 1, 2011 at 2:18 PM, Mikko Partio  wrote:
> > Hello list
> > I have two a machine cluster with PostgreSQL 9.0.4 and streaming
> > replication. In a normal situation I did a failover -- touched the
> trigger
> > file in standby to promote it to production mode. I have done this
> > previously without any complications but now the
> to-be-production-database
> > had a PANIC and shut itself down. From the logs:
> > postgres[10751]: [2-1] 2011-06-30 17:25:24 EEST [10751]: [1-1] user=,db=
> > LOG:  streaming replication successfully connected to primary
> > postgres[10736]: [10-1] 2011-07-01 07:50:29 EEST [10736]: [10-1]
> user=,db=
> > LOG:  trigger file found: /postgresql/data/finish_replication.trigger
> > postgres[10751]: [3-1] 2011-07-01 07:50:29 EEST [10751]: [2-1] user=,db=
> > FATAL:  terminating walreceiver process due to administrator command
> > postgres[10736]: [11-1] 2011-07-01 07:50:30 EEST [10736]: [11-1]
> user=,db=
> > LOG:  redo done at AE/8B1855C8
> > postgres[10736]: [12-1] 20

Re: [HACKERS] sortsupport for text

2012-06-20 Thread Tom Lane
Peter Geoghegan  writes:
> I think that this change may have made the difference between the
> Hungarians getting away with it and not getting away with it. Might it
> have been that for text, they were using some operator that wasn't '='
> (perhaps one which has no fastpath, and thus correctly made a
> representation about equivalency) rather than texteq prior to this
> commit?

Uh, no.  There aren't any magic variants of equality now, and there were
not back then either.  I'm inclined to think that the "Hungarian
problem" did exist long before we fixed it.

> So, you're going to have an extra strcoll()/strxfrm() + strcmp() here,
> as part of a "not-equal-but-maybe-equivalent" test, which is bad.
> However, if that means that we can cache a text constant as a
> strxfrm() blob, and compare in a strxfrm()-wise fashion, that will
> more than pay for itself, even for btree traversal alone.

Um ... are you proposing to replace text btree index entries with
strxfrm values?  Because if you aren't, I don't see that this is likely
to win anything.  And if you are, it loses on the grounds of (a) index
bloat and (b) loss of ability to do index-only scans.

> It would be nice to hear what others thought of these ideas before I
> actually start writing a patch that both fixes these problems (our
> behaviour is incorrect for some locales according to the Unicode
> standard),  facilitates a strxfrm() optimisation, and actually adds a
> strxfrm() optimisation.

Personally I think this is not a direction we want to go in.  I think
that it's going to end up a significant performance loss in many cases,
break backwards compatibility in numerous ways, and provide a useful
behavioral improvement to only a small minority of users.  If you check
the archives, we get far more complaints from people who think their
locale definition is wrong than from people who are unhappy because
we don't hew exactly to the corner cases of their locale.

regards, tom lane



[HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread David Fetter
Folks,

A co-worker filed a bug against file_fdw where the columns in a
FOREIGN TABLE were scrambled on SELECT.  It turned out that this comes
from the (yes, it's documented, but since it's documented in a place
not obviously linked to the bug, it's pretty useless) "feature" of
COPY CSV HEADER whereby the header line is totally ignored in COPY
OUT.

Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
the header line should be parsed to establish which columns are
where, and the output rearranged if needed.
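
For illustration only, a sketch of the column-name matching (quoted-field
CSV parsing elided; the names and helper are hypothetical): map each file
column to the table attribute with the same name, or -1 if none matches.

#include <string.h>

static void
build_column_map(char *header, char **attnames, int natts, int *map)
{
    int     filecol = 0;
    char   *tok = strtok(header, ",");  /* note: strtok modifies header */

    while (tok != NULL && filecol < natts)
    {
        map[filecol] = -1;              /* -1: no matching table column */
        for (int i = 0; i < natts; i++)
            if (strcmp(tok, attnames[i]) == 0)
                map[filecol] = i;
        filecol++;
        tok = strtok(NULL, ",");
    }
}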

I'm proposing to make the code change here:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436

and a suitable doc change that talks about reading the header only for
the purpose of matching column names to columns, and throwing away the
output as before.

What say?

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] libpq compression

2012-06-20 Thread Alvaro Herrera

Excerpts from Florian Pflug's message of Wed Jun 20 06:35:29 -0400 2012:
> On Jun19, 2012, at 17:36 , Robert Haas wrote:
> > On Mon, Jun 18, 2012 at 1:42 PM, Martijn van Oosterhout  wrote:
> >> On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote:
> >>> The fly in the ointment with any of these ideas is that the "configure
> >>> list" is not a list of exact cipher names, as per Magnus' comment that
> >>> the current default includes tests like "!aNULL".  I am not sure that
> >>> we know how to evaluate such conditions if we are applying an
> >>> after-the-fact check on the selected cipher.  Does OpenSSL expose any
> >>> API for evaluating whether a selected cipher meets such a test?
> >> 
> >> I'm not sure whether there's an API for it, but you can certainly check
> >> manually with "openssl ciphers -v", for example:
> >> 
> >> $ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP'
> >> NULL-SHA                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=SHA1
> >> NULL-MD5                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=MD5
> >> 
> >> ...etc...
> >> 
> >> So unless the openssl includes the code twice there must be a way to
> >> extract the list from the library.
> > 
> > There doubtless is, but I'd being willing to wager that you won't be
> > able to figure out the exact method without reading the source code
> > for 'opennssl ciphers' to see how it was done there, and most likely
> > you'll find that at least one of the functions they use has no man
> > page.  Documentation isn't their strong point.
> 
> Yes, unfortunately.

I looked at the code (apps/ciphers.c) and it looks pretty easy to obtain
the list of ciphers starting from the stringified configuration
parameter and iterate on them.  The problem is figuring out whether any
given cipher meets some criteria; all the stuff that the command prints
after the cipher name comes from a "get cipher description" API call and
it doesn't look like there's any simple way of getting the individual
bits in some better form (assuming we don't want to parse the
description string).

Now if the cipher name is enough for whatever it is that we want, then
that looks easy.
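
For reference, a sketch of that easy part, modeled loosely on
apps/ciphers.c (error handling omitted; pre-1.1 OpenSSL initialization
assumed): expand a spec such as "DEFAULT:!aNULL" into concrete cipher
names, which could then be matched against the active cipher.

#include <stdio.h>
#include <openssl/ssl.h>

static void
print_expanded_ciphers(const char *spec)
{
    SSL_CTX    *ctx;
    SSL        *ssl;
    STACK_OF(SSL_CIPHER) *sk;
    int         i;

    SSL_library_init();
    ctx = SSL_CTX_new(SSLv23_method());
    SSL_CTX_set_cipher_list(ctx, spec);
    ssl = SSL_new(ctx);
    sk = SSL_get_ciphers(ssl);

    for (i = 0; i < sk_SSL_CIPHER_num(sk); i++)
        printf("%s\n", SSL_CIPHER_get_name(sk_SSL_CIPHER_value(sk, i)));

    SSL_free(ssl);
    SSL_CTX_free(ctx);
}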

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-20 Thread Tom Lane
Amit Kapila  writes:
>> I'm almost inclined to suggest that we not get next-LSN from WAL, but
>> by scanning all the pages in the main data store and computing the max
>> observed LSN.  This is clearly not very attractive from a performance
>> standpoint, but it would avoid the obvious failure mode where you lost
>> some recent WAL segments along with pg_control.

> According to my analysis, this will have some problem. 

I think you're missing the point.  There is no possible way to guarantee
database consistency after applying pg_resetxlog, unless the database
had been cleanly shut down beforehand.  The reset will lose the xlog
information that was needed to restore consistency.  So arguing from
examples that demonstrate this is rather pointless.  Rather, the value
of pg_resetxlog is to be able to start the database at all so that info
can be extracted from it.  What we are looking for is not perfection,
because that's impossible, but just to not make a bad situation worse.
The reason I'm concerned about selecting a next-LSN that's certainly
beyond every LSN in the database is that not doing so could result in
introducing further corruption, which would be entirely avoidable with
more care in choosing the next-LSN.
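
A standalone sketch of that scan (assumptions: default 8kB blocks, the
pre-9.3 two-uint32 LSN layout at the start of each page header, and the
same byte order as the machine that wrote the files):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
    FILE       *f;
    unsigned char page[BLCKSZ];
    uint32_t    max_hi = 0, max_lo = 0;

    if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL)
        return 1;

    while (fread(page, 1, BLCKSZ, f) == BLCKSZ)
    {
        uint32_t hi, lo;

        memcpy(&hi, page, 4);           /* pd_lsn.xlogid */
        memcpy(&lo, page + 4, 4);       /* pd_lsn.xrecoff */
        if (hi > max_hi || (hi == max_hi && lo > max_lo))
        {
            max_hi = hi;
            max_lo = lo;
        }
    }
    fclose(f);
    printf("max observed LSN: %X/%X\n", (unsigned) max_hi, (unsigned) max_lo);
    return 0;
}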

> Pg_resetxlog -
> It will generate the next-LSN point as 109 which when used for recovery will 
> generate inconsistent database.
> However if we would have relied on WAL, it would have got next-LSN as 107.

Umm ... the entire point of pg_resetxlog is to throw away WAL.  Not to
rely on it.

It's conceivable that there would be some use in a tool that searches
the available WAL files for the latest checkpoint record and recreates a
pg_control file pointing at that checkpoint, without zapping the WAL
files.  This would be much different in purpose and usage from
pg_resetxlog, though.

regards, tom lane



Re: [HACKERS] Release versioning inconsistency

2012-06-20 Thread Tom Lane
Peter Eisentraut  writes:
> On ons, 2012-06-20 at 13:26 +0200, Magnus Hagander wrote:
>> That might actually be a good idea. We can't really change the way we
>> named the betas, but it's not too late to consider naming the actual
>> release as 9.2.0...

> The final release was always going to be called 9.2.0, but naming the
> beta 9.2.0betaX is wrong.  There was a previous discussion about that
> particular point.

Yes.  There is no reason to change the naming scheme we have been using
for years now (at least since version_stamp.pl was invented for 7.4).
The only problem is that somebody got the name of the directory wrong on
the FTP server.

regards, tom lane



Re: [HACKERS] Pg default's verbosity?

2012-06-20 Thread Peter Eisentraut
On tis, 2012-06-19 at 02:15 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > There might be something to the idea of demoting a few of the things
> > we've traditionally had as NOTICEs, though.  IME, the following two
> > messages account for a huge percentage of the chatter:
> 
> > NOTICE:  CREATE TABLE will create implicit sequence "foo_a_seq" for
> > serial column "foo.a"
> > NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
> > "foo_pkey" for table "foo"
> 
> Personally, I'd have no problem with flat-out dropping (not demoting)
> both of those two specific messages.  I seem to recall that Bruce has
> lobbied for them heavily in the past, though.

I don't like these messages any more than the next guy, but why drop
only those, and not any of the other NOTICE-level messages?  The meaning
of NOTICE is pretty much, if this is the first time you're using
PostgreSQL, let me tell you a little bit about how we're doing things
here.  If you've run your SQL script more than 3 times, you won't need
them anymore.  So set your client_min_messages to WARNING then.  That
should be pretty much standard for running SQL scripts, in addition to
all the other stuff listed here:
http://petereisentraut.blogspot.fi/2010/03/running-sql-scripts-with-psql.html




Re: [HACKERS] libpq compression

2012-06-20 Thread Tom Lane
Florian Pflug  writes:
> I wonder though if we shouldn't restrict the allowed ciphers list to being
> a simple list of supported ciphers. If our goal is to support multiple
> SSL libraries transparently then surely having openssl-specific syntax
> in the config file isn't exactly great anyway...

No, we don't want to go there, because then we'd have to worry about
keeping the default list in sync with what's supported by the particular
version of the particular library we chance to be using.  That's about
as far from transparent as you can get.  A notation like "DEFAULT"
is really quite ideal for our purposes in that respect.

regards, tom lane



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Kevin Grittner
Simon Riggs  wrote:
> Kevin Grittner  wrote:
>>> Heikki Linnakangas  wrote:
>>
>>> I don't like the idea of adding the origin id to the record
>>> header. It's only required in some occasions, and on some record
>>> types.
>>
>> Right.
> 
> Wrong, as explained.
 
The point is not wrong; you are simply not responding to what is
being said.
 
You have not explained why an origin ID is required when there is no
replication, or if there is master/slave logical replication, or
there are multiple masters with non-overlapping primary keys
replicating to a single table in a consolidated database, or each
master replicates to all other masters directly, or any of various
other scenarios raised on this thread.  You've only explained why
it's necessary for certain configurations of multi-master
replication where all rows in a table can be updated on any of the
masters.  I understand that this is the configuration you find most
interesting, at least for initial implementation.  That does not
mean that the other situations don't exist as use cases or should
not be considered in the overall design.
 
I don't think there is anyone here who would not love to see this
effort succeed, all the way to multi-master replication in the
configuration you are emphasizing.  What is happening is that people
are expressing concerns about parts of the design which they feel
are problematic, and brainstorming about possible alternatives.  As
I'm sure you know, fixing a design problem at this stage in
development is a lot less expensive than letting the problem slide
and trying to deal with it later.
 
> It isn't true that this is needed only for some configurations of
> multi-master, per discussion.
 
I didn't get that out of the discussion; I saw a lot of cases
mentioned as not needing it to which you simply did not respond.
 
> This is not transaction metadata, it is WAL record metadata
> required for multi-master replication, see later point.
> 
> We need to add information to every WAL record that is used as the
> source for generating LCRs.
 
If the origin ID of a transaction doesn't count as transaction
metadata (i.e., data about the transaction), what does?  It may be a
metadata element about which you have special concerns, but it is
transaction metadata.  You don't plan on supporting individual WAL
records within a transaction containing different values for origin
ID, do you?  If not, why is it something to store in every WAL
record rather than once per transaction?  That's not intended to be
a rhetorical question.  I think it's because you're still thinking
of the WAL stream as *the medium* for logical replication data
rather than *the source* of logical replication data.
 
As long as the WAL stream is the medium, options are very
constrained.  You can code a very fast engine to handle a single
type of configuration that way, and perhaps that should be a
supported feature, but it's not a configuration I've needed yet. 
(Well, on reflection, if it had been available and easy to use, I
can think of *one* time I *might* have used it for a pair of nodes.)
It seems to me that you are so focused on this one use case that you
are not considering how design choices which facilitate fast
development of that use case paint us into a corner in terms of
expanding to other use cases.
 
>> I think removing origin ID from this patch and submitting a
>> separate patch for a generalized transaction metadata system is
>> the sensible way to go.
> 
> We already have a very flexible WAL system for recording data of
> interest to various resource managers. If you wish to annotate a
> transaction, you can either generate a new kind of WAL record or
> you can enhance a commit record.
 
Right.  Like many of us are suggesting should be done for origin ID.
 
> XLOG_NOOP records can already be generated by your application if
> you wish to inject additional metadata to the WAL stream. So no
> changes are required for you to implement the generalised
> transaction metadata scheme you say you require.
 
I'm glad it's that easy.  Are there SQL functions for that yet?
 
> Not sure how or why that relates to requirements for multi-master.
 
That depends on whether you want to leave the door open to other
logical replication than the one use case on which you are currently
focused.  I even consider some of those other cases multi-master,
especially when multiple databases are replicating to a single table
on another server.  I'm not clear on your definition -- it seems to
be rather more narrow.  Maybe we need to define some terms somewhere
to facilitate discussion.  Is there a Wiki page where that would
make sense?
 
-Kevin



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund  wrote:
> We're not the only ones here performing scope creep, though... I think just
> about everyone who has posted in the whole thread, except maybe Tom and
> Marko, is guilty of doing so.
>
> I still think it's rather sensible to focus on exactly duplicated schemas in a
> very first version, just because that leaves out some of the complexity while
> paving the road for other nice things.

Well, I guess what I want to know is: what does focusing on exactly
duplicated schemas mean?  If it means we'll disable DDL for tables
when we turn on replication, that's basically the Slony approach: when
you want to make a DDL change, you have to quiesce replication, do it,
and then resume replication.  I would possibly be OK with that
approach.  If it means that we'll hope that the schemas are duplicated
and start spewing garbage data when they're not, then I'm definitely
not OK with that approach.  If it means using event triggers to keep
the catalogs synchronized, then I don't think that's adequately
robust.  The user could add more event triggers that run before or
after the ones the replication system adds, and then you are back to
garbage decoding (or crashes).  They could also modify the catalogs
directly, although it's possible we don't care quite as much about
that case (but on the other hand people do sometimes need to do it to
solve real problems).  Although I am 100% OK with paring back the
initial feature set - indeed, I strongly recommend it - I think that
robustness is not a feature which can be left out in v1 and added in
later.  All the robustness has to be designed in at the start, or we
will never have it.

On the whole, I think we're spending far too much time talking about
code and far too little time talking about what the overall design
should look like.  We are having a discussion about whether or not MMR
should be supported by sticking a 16-bit node ID into every WAL record
without having first decided whether we should support MMR, whether
that requires node IDs, whether they should be integers, whether those
integers should be 16 bits in size, whether they should be present in
WAL, and whether or not the record header is the right place to put
them.  There's a right order in which to resolve those questions, and
this isn't it.  More generally, I think there is a ton of complexity
that we're probably overlooking here in focusing in on specific coding
details.  I think the most interesting comment made to date is Steve
Singer's observation that very little of Slony is concerned with
changeset extraction or apply.  Now, on the flip side, all of these
patches seem to be concerned with changeset extraction and apply.
That suggests that we're missing some pretty significant pieces
somewhere in this design.  I think those pieces are things like error
recovery, fault tolerance, user interface design, and control logic.
Slony has spent years trying to get those things right.  Whether or
not they actually have gotten them right is of course an arguable
point, but we're unlikely to do better by ignoring all of those issues
and implementing whatever is most technically expedient.

>> You've got four people objecting to this patch now, all of whom happen
>> to be committers.  Whether or not MMR goes into core, who knows, but
>> it doesn't seem that this patch is going to fly.
> I find it a bit too early to say that. Sure, it won't fly exactly as proposed,
> but hell, who cares? What I want to get in is a solution to the specific
> problem the patch targets. At least you have (not sure about others) accepted
> that the problem needs a solution.
> We do not agree yet on how that solution should look, but that's not exactly
> surprising as we started discussing the problem only a good day ago.

Oh, no argument with any of that.  I strongly object to the idea of
shoving this patch through as-is, but I don't object to solving the
problem in some other, more appropriate way.  I think that won't look
much like this patch, though; it will be some new patch.

> If people agree that your proposed way of just one flag bit is the way to go,
> we will have to live with that. But that's different from saying the whole
> thing is dead.

I think you've convinced me that a single flag-bit is not enough, but
I don't think you've convinced anyone that it belongs in the record
header.

>> As I would rather see this project
>> succeed, I recommend that you don't do that.  Both you and Andres seem
>> to believe that MMR is a reasonable first target to shoot at, but I
>> don't think anyone else - including the Slony developers who have
>> commented on this issue - endorses that position.
> I don't think we get full MMR into 9.3. What I am proposing is that we build
> in the few pieces that are required to implement MMR *on top* of what's
> hopefully in 9.3.
> And I think that's a realistic goal.

I can't quite follow that sentence, but my general sense is that,
wh

Re: [HACKERS] libpq compression

2012-06-20 Thread Florian Pflug
On Jun20, 2012, at 17:34 , Tom Lane wrote:
> Florian Pflug  writes:
>> I wonder though if we shouldn't restrict the allowed ciphers list to being
>> a simple list of supported ciphers. If our goal is to support multiple
>> SSL libraries transparently then surely having openssl-specific syntax
>> in the config file isn't exactly great anyway...
> 
> No, we don't want to go there, because then we'd have to worry about
> keeping the default list in sync with what's supported by the particular
> version of the particular library we chance to be using.  That's about
> as far from transparent as you can get.  A notation like "DEFAULT"
> is really quite ideal for our purposes in that respect.

No argument with that, but does that mean we have to allow the full
syntax supported by OpenSSL (i.e., those +,-,! prefixes)? Maybe we could
map an empty list to DEFAULT and otherwise interpret it as a list of 
ciphers?

It'd make the whole NULL-cipher business easy, because once we know that
the cipher spec doesn't contain !NULL (which removes NULL *permanently*),
we can simply append NULL to allow "all these ciphers plus NULL".

best regards,
Florian Pflug




Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Tom Lane
David Fetter  writes:
> Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
> the header line in should be parsed to establish which columns are
> where and rearranging the output if needed.

This is not "fix a POLA violation".  This is a non-backwards-compatible
new feature, which would have to have a switch to turn it on.

regards, tom lane



Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Andrew Dunstan



On 06/20/2012 11:02 AM, David Fetter wrote:

> Folks,
>
> A co-worker filed a bug against file_fdw where the columns in a
> FOREIGN TABLE were scrambled on SELECT.  It turned out that this comes
> from the (yes, it's documented, but since it's documented in a place
> not obviously linked to the bug, it's pretty useless) "feature" of
> COPY CSV HEADER whereby the header line is totally ignored in COPY
> OUT.
>
> Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
> the header line should be parsed to establish which columns are
> where, and the output rearranged if needed.
>
> I'm proposing to make the code change here:
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436
>
> and a suitable doc change that talks about reading the header only for
> the purpose of matching column names to columns, and throwing away the
> output as before.
>
> What say?



First you are talking about COPY IN, not COPY OUT, surely.

This is not a bug, it is documented in exactly the place that all other 
COPY options are documented. The file_fdw page refers the reader to the 
COPY docs for details. Unless you want us to duplicate the entire COPY 
docs in the file_fdw page this seems entirely reasonable.


The current behaviour was discussed at some length back when we 
implemented the HEADER feature, IIRC, and is quite intentional. I don't 
think we should alter the current behaviour, as plenty of people rely on 
it, some to my certain knowledge. I do see a reasonable case for adding 
a new behaviour which takes notice of the header line, although it's 
likely to have plenty of wrinkles.


Reordering columns like you suggest might well have a significant impact 
on COPY performance, BTW. Also note that I created the file_text_array 
FDW precisely for people who want to be able to cherry pick and reorder 
columns. See 



cheers

andrew



Re: [HACKERS] libpq compression

2012-06-20 Thread Tom Lane
Alvaro Herrera  writes:
> I looked at the code (apps/ciphers.c) and it looks pretty easy to obtain
> the list of ciphers starting from the stringified configuration
> parameter and iterate on them.

Do you mean that it will produce an expansion of the set of ciphers
meeting criteria like "!aNULL"?  If so, I think we are set; we can
easily check to see if the active cipher is in that list, no?

regards, tom lane



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
> Simon Riggs  wrote:
> > This is not transaction metadata, it is WAL record metadata
> > required for multi-master replication, see later point.

> > We need to add information to every WAL record that is used as the
> > source for generating LCRs.
> If the origin ID of a transaction doesn't count as transaction
> metadata (i.e., data about the transaction), what does?  It may be a
> metadata element about which you have special concerns, but it is
> transaction metadata.  You don't plan on supporting individual WAL
> records within a transaction containing different values for origin
> ID, do you?  If not, why is it something to store in every WAL
> record rather than once per transaction?  That's not intended to be
> a rhetorical question.
It's definitely possible to store it per transaction (see the discussion around 
http://archives.postgresql.org/message-id/201206201605.43634.and...@2ndquadrant.com); 
it just makes filtering by the originating node a considerably more complex 
thing. With our proposal you can do it without any complexity involved, at a 
low level. Storing it per transaction means you can only stream out the data 
to other nodes *after* fully reassembling the transaction. That's a pity, 
especially if we go for a design where the decoding happens in a proxy 
instance.
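
A sketch of the contrast (made-up types, not actual code): with a
per-record origin the decision is stateless, while a per-commit origin
forces buffering of the whole transaction first.

#include <stdbool.h>
#include <stdint.h>

typedef struct Record
{
    uint32_t    xid;
    uint16_t    origin;     /* per-record scheme: known immediately */
    bool        is_commit;  /* per-commit scheme: origin known only here */
} Record;

/* Per-record origin: forward or drop as each record is decoded. */
static bool
forward_record(const Record *r, uint16_t dest)
{
    return r->origin != dest;
}

/*
 * Per-commit origin, in outline:
 *   on record r:  append r to buffer[r->xid]
 *   on commit c:  if (c->origin != dest) flush buffer[c->xid] downstream,
 *                 else discard buffer[c->xid]
 * i.e. nothing can be streamed until the transaction is reassembled.
 */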

Other metadata will not be needed on such a low level.

I also have to admit that I am very hesitant to start developing some generic 
"transaction metadata" framework atm. That seems to be a good way to spend a 
lot of time in discussion and disagreement. IMO that's something for 
later.

> I think it's because you're still thinking
> of the WAL stream as *the medium* for logical replication data
> rather than *the source* of logical replication data.
I don't think that's true. See the above-referenced subthread for reasons why I 
think the origin id is important.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs  wrote:
>>>> But I think getting even
>>>> single-master logical replication working well in a single release
>>>> cycle is going to be a job and a half.
>>>
>>> OK, so your estimate is 1.5 people to do that. And if we have more
>>> people, should they sit around doing nothing?
>>
>> Oh, give me a break.  You're willfully missing my point.  And to quote
>> Fred Brooks, nine women can't make a baby in one month.
>
> No, I'm not. The question is not how quickly can N people achieve a
> single thing, but how long will it take a few skilled people working
> on carefully selected tasks that have few dependencies between them to
> achieve something.

The bottleneck is getting the design right, not writing the code.
Selecting tasks for people to work on without an agreement on the
design will not advance the process, unless we just accept whatever
code you choose to write based on whatever design you happen to pick.

> How exactly did you arrive at your conclusion? Why is yours right and
> mine wrong?

I estimated the amount of work that would be required to do this right
and compared it to other large projects that have been successfully
done in the past.  I think you are looking at something on the order
of magnitude of the Windows port, which took about four releases to
become stable, or the SE-Linux project, which still isn't
feature-complete.  Even if it's only a HS-sized project, that took two
releases, as did streaming replication.  SSI got committed within one
release cycle, but there were several years of design and advocacy
work before any code was written, so that, too, was really a
multi-year project.

I'll confine my comments on the second part of the question to the
observation that it is a bit early to know who is right and who is
wrong, but the question could just as easily be turned on its head.

> No, I have four people who had initial objections and who have not
> commented on the fact that the points made are regrettably incorrect.

I think Kevin addressed this point better than I can.  Asserting
something doesn't make it true, and you haven't offered any rational
argument against the points that have been made, probably because
there isn't one.  We *cannot* justify stealing 100% of the available
bit space for a feature that many people won't use and may not be
enough to address the real requirement anyway.

> Since at least 3 of the people making such comments did not attend the
> full briefing meeting in Ottawa, I am not particularly surprised.
> However, I do expect people that didn't come to the meeting to
> recognise that they are likely to be missing information and to listen
> closely, as I listen to them.

Participation in the community development process is not contingent
on having flown to Ottawa in May, or on having decided to spend that
evening at your briefing meeting.  Attributing to ignorance what is
adequately explained by honest disagreement is impolite.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 20 June 2012 15:55, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> I think that this change may have made the difference between the
>> Hungarians getting away with it and not getting away with it. Might it
>> have been that for text, they were using some operator that wasn't '='
>> (perhaps one which has no fastpath, and thus correctly made a
>> representation about equivalency) rather than texteq prior to this
>> commit?
>
> Uh, no.  There aren't any magic variants of equality now, and there were
> not back then either.  I'm inclined to think that the "Hungarian
> problem" did exist long before we fixed it.

I was suggesting that an operator other than equality may have been
used for some reason. It probably isn't a significant issue though.

> Um ... are you proposing to replace text btree index entries with
> strxfrm values?  Because if you aren't, I don't see that this is likely
> to win anything.  And if you are, it loses on the grounds of (a) index
> bloat and (b) loss of ability to do index-only scans.

No, I'm suggesting it would probably be at least a bit of a win here
to cache the constant, and only have to do a strxfrm() + strcmp() per
comparison. Not enough to justify all the added complexity of course,
but I wouldn't seek to justify this on the basis of improvements to
the speed of btree traversal.  It would obviously be much more
valuable for tuple sorting.
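
(A minimal sketch of the caching idea against the plain C library; this is
not the actual sortsupport code, and buffer-growth handling is omitted:)

    #include <stdlib.h>
    #include <string.h>

    /* transform the constant scankey once ... */
    static char *
    cache_xfrm(const char *constant)
    {
        size_t  need = strxfrm(NULL, constant, 0) + 1;
        char   *blob = malloc(need);

        strxfrm(blob, constant, need);
        return blob;
    }

    /* ... then each comparison is one strxfrm() plus a cheap strcmp() */
    static int
    cmp_against_cached(const char *cached_blob, const char *tuple_val,
                       char *scratch, size_t scratchlen)
    {
        strxfrm(scratch, tuple_val, scratchlen);  /* assume it fits */
        return strcmp(cached_blob, scratch);
    }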

> Personally I think this is not a direction we want to go in.  I think
> that it's going to end up a significant performance loss in many cases,
> break backwards compatibility in numerous ways, and provide a useful
> behavioral improvement to only a small minority of users.

I may have over-emphasized the improvement in correctness that this
would bring, which I personally consider to be very much a secondary
benefit. The fact is that this is likely to be a fairly significant
performance win, because strxfrm() is quite simply the way you're
supposed to do collation-aware sorting, and is documented as such. For
that reason, C standard library implementations should not be expected
to emphasize its performance - they assume that you're using strxfrm()
+ their highly optimised strcmp() (as I've said, the glibc
implementation is written in ASM, and makes use of hand-written SSSE3
instructions on x86_64 for example).

The only performance downside that I can think of is the added check
for equivalence for each tuple within _bt_checkkeys() - perhaps you
were thinking of something else that hasn't occurred to me though.
Were you?

The added overhead is only going to be paid once per index scan,
and not once per tuple returned by an index scan. Since equality
implies equivalence for us, there is no need to check equivalence
until something is unequal, and when that happens, and equivalence
doesn't hold, we're done. Meanwhile, that entire index scan just got
measurably faster from caching the constant's strxfrm() blob.

Now, maybe it is a bit funky that this equivalence test has to happen
even though the vast majority of locales don't care. If the only
alternative is to bloat up the strxfrm() blobs with the original
string, I'd judge that the funkiness is well worth it.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-20 Thread Fujii Masao
On Wed, Jun 20, 2012 at 12:40 PM, Amit Kapila  wrote:
>>> I believe if WAL files are proper as mentioned in Alvaro's mail, the
>>> proposed logic should generate correct values.
>>> Do you see any problem in the logic proposed in my original mail?
>>> Can I resume my work on this feature?
>
>> Maybe I'm missing your point, but... why don't you just use PITR to
>> recover from the corruption of pg_control?
>
> AFAIK PITR can be used in a scenario where there is a base backup and we
> have archived the WAL files after that; it can then apply those WAL files
> on top of the base backup.

Yes. If you want to recover the database from the media crash like the
corruption of pg_control file, you basically should take a base backup
and set up continuous archiving.

> In this scenario we don't know a point from where to start the next replay.
> So I believe it will be difficult to use PITR in this scenario.

You can find out the point from the complete pg_control file which was
restored from the backup.
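
For example, pg_controldata run against the restored data directory reports
the checkpoint locations (the values here are only illustrative):

    $ pg_controldata /path/to/restored/data
    Latest checkpoint location:           0/2000060
    Latest checkpoint's REDO location:    0/2000028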

If pg_control is corrupted, we can easily imagine that other database files
would also be corrupted. I wonder how many cases there are where only the
pg_control file gets corrupted. In that case, pg_resetxlog is no help at
all. You need to use PITR instead.

Regards,

-- 
Fujii Masao

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread David Fetter
On Wed, Jun 20, 2012 at 11:47:14AM -0400, Tom Lane wrote:
> David Fetter  writes:
> > Rather than being totally ignored in the COPY OUT (CSV HEADER)
> > case, the header line should be parsed to establish which
> > columns are where, and the output rearranged if needed.
> 
> This is not "fix a POLA violation".  This is a
> non-backwards-compatible new feature, which would have to have a
> switch to turn it on.

OK, new proposal:

COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
an option which requires that HEADER be enabled and mandates that the
column names in the header match the columns coming in.

Has someone got a better name for this option than
KEEP_HEADER_COLUMN_NAMES?

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Tom Lane
David Fetter  writes:
> OK, new proposal:

> COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
> an option which requires that HEADER be enabled and mandates that the
> column names in the header match the columns coming in.

> Has someone got a better name for this option than
> KEEP_HEADER_COLUMN_NAMES?

Well, if it's just checking that the list matches, then maybe
CHECK_HEADER would do.

In your original proposal, I was rather wondering what should happen if
the incoming file didn't have the same set of columns called out in the
COPY command's column list.  (That is, while *rearranging* the columns
might be thought non-astonishing, I'm less convinced about a copy
operation that inserts or defaults columns differently from what the
command said should happen.)  If we're just checking for a match, that
question goes away.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] proposal and patch : support INSERT INTO...RETURNING with partitioned table using rule

2012-06-20 Thread John Lumby

---
Problem I'm trying to solve:
    For partitioned tables, make it possible to use a RETURNING clause on
    INSERT INTO together with a DO INSTEAD rule

[  Note  -  wherever I say INSERT I also mean UPDATE and DELETE ]

---
Current behaviour :

    An INSERT which has a RETURNING clause and which is to be rewritten based
    on a rule will be accepted if the rule is an "unconditional DO INSTEAD".
    In general I believe "unconditional" means "no WHERE clause", but in
    practice, if the rule is of the form
        CREATE RULE insert_part_history as ON INSERT to history
            DO INSTEAD SELECT history_insert_partitioned(NEW) returning NEW.id
    this is treated as conditional and the query is rejected.

    Testcase:
    A table T is partitioned and has two or more columns, one of which
    is an id column declared as
        id bigint DEFAULT nextval('history_id_seq'::regclass) NOT NULL
    and the application issues
        "INSERT into history (column-list which excludes id) values ()
         RETURNING id"

    I can get the re-direction of the INSERT *without* RETURNING to work using
    either trigger or rule, in which the trigger/rule invokes a procedure, but
    whichever way I do it, I could not get this RETURNING clause to work.

    For a trigger, the INSERT ... RETURNING was accepted but returned no
    rows (as I would expect), and for the rule, the INSERT ... RETURNING
    was rejected with:

        ERROR:  cannot perform INSERT RETURNING on relation "history"
        HINT:  You need an unconditional ON INSERT DO INSTEAD rule with a
               RETURNING clause.

    but this hint was not much help, since:

    For a rule,
        CREATE RULE insert_part_history as ON INSERT to history
            DO INSTEAD SELECT history_insert_partitioned(NEW) returning NEW.id
    ERROR:  syntax error at or near "returning"
    LINE 1: ...DO INSTEAD SELECT history_insert_partitioned(NEW) returning ...

    Here the function history_insert_partitioned is something like
        CREATE FUNCTION history_insert_partitioned( NEW public.history )
        RETURNS BIGINT AS $$
        DECLARE
            ...
        BEGIN
            ...
            <access NEW fields, e.g. timestamp>
            <construct partitioned table name>
            <EXECUTE 'INSERT INTO ' || partitioned-table-name ...>
            ...
            RETURN history_id;
        END;
        $$
        LANGUAGE plpgsql

---
Some references to discussion of this requirement:
  .  http://wiki.postgresql.org/wiki/Todo
     item "Make it possible to use RETURNING together with conditional
     DO INSTEAD rules, such as for partitioning setups"
  .  http://archives.postgresql.org/pgsql-general/2012-06/msg00377.php
  .  http://archives.postgresql.org/pgsql-general/2010-12/msg00542.php
  .  http://acodapella.blogspot.it/2011/06/hibernate-postgresql-table-partitioning.html

---
Proposal:
  .  I propose to extend the rule system to recognize cases where the INSERT
     query specifies RETURNING and the rule promises to return a row, and to
     then permit this query to run and return the expected row.  In effect,
     to widen the definition of "unconditional" to handle cases such as my
     testcase.
  .  One comment is that all the infrastructure for returning one row from
     the re-written query is already present in the code, and the non-trivial
     question is how to ensure the new design is safe in preventing any
     rewrite that actually would not return a row.
  .  In this patch, I have chosen to make use of the LIMIT clause -
     I add a side-effect implication to a LIMIT clause when it occurs in a
     rewrite of an INSERT, to mean "this rule will return a row".
     So, with my patch, same testcase, same function
     history_insert_partitioned, and new rule
        CREATE RULE insert_part_history as ON INSERT to history
            DO INSTEAD SELECT history_insert_partitioned(NEW) LIMIT 1
     the INSERT is accepted and returns the id.
     This use of the LIMIT clause is probably contentious, but I wished to
     avoid introducing new SQL syntax, and the LIMIT clause does have a
     connotation of returning rows.

---
I attach a patch based on a clone of postgresql.git as of yesterday
(120619-145751 EST).
I have tested the patch with INSERT and UPDATE (not tested with DELETE, but
it should work).
The patch is not expected to be final but just to show how I did it.

John
  --- src/backend/optimizer/plan/planner.c.orig	2012-06-19 14:59:21.264574275 -0400
+++ src/backend/optimizer/plan/planner.c	2012-06-19 15:10:54.776590758 -0400
@@ -226,7 +226,8 @@ standard_planner(Query *parse, int curso
 
 	result->commandType = parse->commandType;
 	result->queryId = parse->queryId;
-	result->hasReturning = (

Re: [HACKERS] libpq compression

2012-06-20 Thread Alvaro Herrera
Excerpts from Tom Lane's message of Wed Jun 20 11:49:51 -0400 2012:
> 
> Alvaro Herrera  writes:
> > I looked at the code (apps/ciphers.c) and it looks pretty easy to obtain
> > the list of ciphers starting from the stringified configuration
> > parameter and iterate on them.
> 
> Do you mean that it will produce an expansion of the set of ciphers
> meeting criteria like "!aNULL"?

Attached is a simple program that does that.  You pass 'ALL:!aNULL' as
its first arg and it produces such a list.

> If so, I think we are set; we can
> easily check to see if the active cipher is in that list, no?

Great.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <openssl/ssl.h>
#include <openssl/err.h>

int main(int argc, char *argv[])
{
const SSL_METHOD *method = TLSv1_client_method();
SSL_CTX    *ctx;
SSL        *ssl = NULL;
char       *ciphers;
int         i;

if (argc < 2)
{
fprintf(stderr, "ciphers not specified\n");
exit(1);
}

ciphers = argv[1];

SSL_library_init();

ctx = SSL_CTX_new(method);
if (!ctx)
{
fprintf(stderr, "something went wrong\n");
exit(1);
}

if (!SSL_CTX_set_cipher_list(ctx, ciphers))
{
fprintf(stderr, "unable to set cipher list\n");
exit(1);
}

ssl = SSL_new(ctx);
if (!ssl)
{
fprintf(stderr, "unable to create the SSL object\n");
exit(1);
}

for (i = 0;; i++)
{
const char   *cipher;

cipher = SSL_get_cipher_list(ssl, i);
if (cipher == NULL)
{
fprintf(stderr, "end of cipher list?\n");
break;
}
printf("cipher: %s\n", cipher);
}

SSL_CTX_free(ctx);
SSL_free(ssl);

return 0;
}
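
(To try it: compile with something like "cc listciphers.c -lssl -lcrypto" -
the file name is made up - and run "./a.out 'ALL:!aNULL'"; each cipher in
the expanded list is printed on its own line.)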

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Tom Lane
Peter Geoghegan  writes:
> No, I'm suggesting it would probably be at least a bit of a win here
> to cache the constant, and only have to do a strxfrm() + strcmp() per
> comparison.

Um, have you got any hard evidence to support that notion?  The
traditional advice is that strcoll is faster than using strxfrm unless
the same strings are to be compared repeatedly.  I'm not convinced that
saving the strxfrm on just one side will swing the balance.

> The fact is that this is likely to be a fairly significant
> performance win, because strxfrm() is quite simply the way you're
> supposed to do collation-aware sorting, and is documented as such. For
> that reason, C standard library implementations should not be expected
> to emphasize its performance - they assume that you're using strxfrm()
> + their highly optimised strcmp()

Have you got any evidence in support of this claim, or is it just
wishful thinking about what's likely to be inside libc?  I'd also note
that any comparisons you may have seen about this are certainly not
accounting for the effects of data bloat from strxfrm (ie, possible
spill to disk, more merge passes, etc).

In any case, if you have to redefine the meaning of equality in order
to justify a performance patch, I'm prepared to walk away at the start.
The range of likely performance costs/benefits across different locales
and different implementations is so wide that if you can't show it to be
a win even with the strcmp tiebreaker, it's not likely to be a reliable
win without that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-20 Thread Magnus Hagander
On Wed, Jun 20, 2012 at 12:35 PM, Florian Pflug  wrote:
> On Jun19, 2012, at 17:36 , Robert Haas wrote:
>> On Mon, Jun 18, 2012 at 1:42 PM, Martijn van Oosterhout
>>  wrote:
>>> On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote:
>>>> The fly in the ointment with any of these ideas is that the "configure
>>>> list" is not a list of exact cipher names, as per Magnus' comment that
>>>> the current default includes tests like "!aNULL".  I am not sure that
>>>> we know how to evaluate such conditions if we are applying an
>>>> after-the-fact check on the selected cipher.  Does OpenSSL expose any
>>>> API for evaluating whether a selected cipher meets such a test?
>>>
>>> I'm not sure whether there's an API for it, but you can certainly check
>>> manually with "openssl ciphers -v", for example:
>>>
>>> $ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP'
>>> NULL-SHA                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=SHA1
>>> NULL-MD5                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=MD5
>>>
>>> ...etc...
>>>
>>> So unless the openssl includes the code twice there must be a way to
>>> extract the list from the library.
>>
>> There doubtless is, but I'd be willing to wager that you won't be
>> able to figure out the exact method without reading the source code
>> for 'openssl ciphers' to see how it was done there, and most likely
>> you'll find that at least one of the functions they use has no man
>> page.  Documentation isn't their strong point.
>
> Yes, unfortunately.
>
> I wonder though if we shouldn't restrict the allowed ciphers list to being
> a simple list of supported ciphers. If our goal is to support multiple
> SSL libraries transparently then surely having openssl-specific syntax
> in the config file isn't exactly great anyway...

That is a very good point. Before we design *another* feature that
relies on it, we should verify whether the syntax is compatible in the
other libraries that would be interesting (gnutls and NSS primarily),
and if it's not, that at least the *functionality* exists in a
compatible way, so we don't put ourselves in a position where we can't
proceed.

And yes, that's vapourware so far. But I know at least Claes (added to
CC) has said he wants to work on it during this summer, and I've
promised to help him out and review as well, if/when he gets that far.
But even without that, we should try to keep the door to these other
library implementations as open as possible.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread David Fetter
On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote:
> David Fetter  writes:
> > OK, new proposal:
> 
> > COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
> > an option which requires that HEADER be enabled and mandates that the
> > column names in the header match the columns coming in.
> 
> > Has someone got a better name for this option than
> > KEEP_HEADER_COLUMN_NAMES?
> 
> Well, if it's just checking that the list matches, then maybe
> CHECK_HEADER would do.

OK

> In your original proposal, I was rather wondering what should happen
> if the incoming file didn't have the same set of columns called out
> in the COPY command's column list.  (That is, while *rearranging*
> the columns might be thought non-astonishing, I'm less convinced
> about a copy operation that inserts or defaults columns differently
> from what the command said should happen.)  If we're just checking
> for a match, that question goes away.

Let's imagine we have a table foo(id serial, t text, n numeric) and a
CSV file foo.csv with headers (n, t).  Just to emphasize, the column
ordering is different in the places where they match.

Would

COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER, CHECK_HEADER)

now "just work" (possibly with some performance penalty) and 

COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER)

only "work," i.e. import gobbledygook, in the case where every t entry
in foo.csv happened to be castable to NUMERIC?

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Alvaro Herrera

Excerpts from David Fetter's message of Wed Jun 20 12:43:31 -0400 2012:
> On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote:

> > In your original proposal, I was rather wondering what should happen
> > if the incoming file didn't have the same set of columns called out
> > in the COPY command's column list.  (That is, while *rearranging*
> > the columns might be thought non-astonishing, I'm less convinced
> > about a copy operation that inserts or defaults columns differently
> > from what the command said should happen.)  If we're just checking
> > for a match, that question goes away.
> 
> Let's imagine we have a table foo(id serial, t text, n numeric) and a
> CSV file foo.csv with headers (n, t).  Just to emphasize, the column
> ordering is different in the places where they match.

For the record, IIRC we had one person trying to do this on the Spanish
list not long ago.

Another related case: you have a file with headers and columns (n, t, x, y, z)
but your table only has n and t.  How would you tell COPY to discard the
junk columns?  Currently it just complains that they are there.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
Hi,

On Wednesday, June 20, 2012 05:44:09 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund  
wrote:
> > Were not the only ones here that are performing scope creep though... I
> > think about all people who have posted in the whole thread except maybe
> > Tom and Marko are guilty of doing so.
> > 
> > I still think its rather sensible to focus on exactly duplicated schemas
> > in a very first version just because that leaves out some of the
> > complexity while paving the road for other nice things.
> 
> Well, I guess what I want to know is: what does focusing on exactly
> duplicated schemas mean?  If it means we'll disable DDL for tables
> when we turn on replication, that's basically the Slony approach: when
> you want to make a DDL change, you have to quiesce replication, do it,
> and then resume replication.  I would possibly be OK with that
> approach.  If it means that we'll hope that the schemas are duplicated
> and start spewing garbage data when they're not, then I'm definitely
> not OK with that approach.  If it means using event
> triggers to keep the catalogs synchronized, then I don't think
> that's adequately robust.  The user could add more event
> triggers that run before or after the ones the replication system
> adds, and then you are back to garbage decoding (or crashes). 
I would prefer the event trigger way because that seems to be the most 
extensible/reusable. It would allow fully replicated databases and 
catalog-only instances.
I think we need to design event triggers in a way you cannot simply circumvent 
them. We already have the case that if users try to screw around with system 
triggers we give back wrong answers, with the planner relying on foreign keys, 
btw.
If the problem is having user triggers after system triggers: let's make that 
impossible. Forbidding DDL on the other instances once we have that isn't that 
hard.

Perhaps all that will get simpler if we can make reading the catalog via 
custom-built snapshots work, as you proposed elsewhere in this thread. That 
would make checking errors much easier, even if you just want to apply to 
a database with exactly the same schema. That's the next thing I plan to work 
on.

> They could also modify the catalogs directly, although it's possible we
> don't care quite as much about that case (but on the other hand people
> do sometimes need to do it to solve real problems).
With that you already can crash the database perfectly fine today. I think 
trying to care for that is a waste of time.

> Although I am
> 100% OK with pairing back the initial feature set - indeed, I strongly
> recommend it - I think that robustness is not a feature which can be
> left out in v1 and added in later.  All the robustness has to be
> designed in at the start, or we will never have it.
I definitely don't intend to cut down on robustness.

> On the whole, I think we're spending far too much time talking about
> code and far too little time talking about what the overall design
> should look like.
Agreed.

> We are having a discussion about whether or not MMR
> should be supported by sticking a 16-bit node ID into every WAL record
> without having first decided whether we should support MMR, whether
> that requires node IDs, whether they should be integers, whether those
> integers should be 16 bits in size, whether they should be present in
> WAL, and whether or not the record header is the right place to put
> them.  There's a right order in which to resolve those questions, and
> this isn't it.  More generally, I think there is a ton of complexity
> that we're probably overlooking here in focusing in on specific coding
> details.  I think the most interesting comment made to date is Steve
> Singer's observation that very little of Slony is concerned with
> changeset extraction or apply.  Now, on the flip side, all of these
> patches seem to be concerned with changeset extraction and apply.
> That suggests that we're missing some pretty significant pieces
> somewhere in this design.  I think those pieces are things like error
> recovery, fault tolerance, user interface design, and control logic.
> Slony has spent years trying to get those things right.  Whether or
> not they actually have gotten them right is of course an arguable
> point, but we're unlikely to do better by ignoring all of those issues
> and implementing whatever is most technically expedient.
I agree that the focus isn't 100% optimal and that there are *loads* of issues 
we haven't even started to look at. But you need a point to start, and 
extraction & apply seems to be a good one, because you can actually test it 
without the other issues solved, which is not really the case the other way 
round.
Also it's possible to plug the newly built changeset extraction into 
existing solutions to make them more efficient while retaining most of their 
respective frameworks.

So I disagree that that's the wrong part to start with.

> >> You've got f

Re: [HACKERS] foreign key locks

2012-06-20 Thread Tom Lane
Jaime Casanova  writes:
> On Thu, Jun 14, 2012 at 11:41 AM, Alvaro Herrera
>  wrote:
>> This is v12 of the foreign key locks patch.

> Just noticed that this patch needs a rebase because of the refactoring
> Tom did in ri_triggers.c

Hold on a bit before you work on that code --- I've got one more bit of
hacking I want to try before walking away from it.  I did some oprofile
work on Dean's example from

and noticed that it looks like ri_FetchConstraintInfo is eating a
noticeable fraction of the runtime, which is happening because it is
called to deconstruct the relevant pg_constraint row for each tuple
we consider firing the trigger for (and then again, if we do fire the
trigger).  I'm thinking it'd be worth maintaining a backend-local cache
for the deconstructed data, and am going to go try that.
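
(One shape such a backend-local cache could take, sketched with dynahash;
the entry layout and the ri_DeconstructConstraint() helper are hypothetical,
not the actual patch, and invalidation on pg_constraint changes is omitted:)

    /* hypothetical cache of deconstructed pg_constraint rows, by OID */
    typedef struct RIConstraintCacheEntry
    {
        Oid         constraint_id;      /* hash key */
        RI_ConstraintInfo info;         /* the deconstructed row */
    } RIConstraintCacheEntry;

    static HTAB *ri_constraint_cache = NULL;

    static RI_ConstraintInfo *
    ri_LookupConstraintInfo(Oid conoid)
    {
        RIConstraintCacheEntry *entry;
        bool        found;

        if (ri_constraint_cache == NULL)
        {
            HASHCTL     ctl;

            MemSet(&ctl, 0, sizeof(ctl));
            ctl.keysize = sizeof(Oid);
            ctl.entrysize = sizeof(RIConstraintCacheEntry);
            ctl.hash = oid_hash;
            ri_constraint_cache = hash_create("RI constraint cache", 64,
                                              &ctl,
                                              HASH_ELEM | HASH_FUNCTION);
        }

        entry = hash_search(ri_constraint_cache, &conoid, HASH_ENTER, &found);
        if (!found)
            ri_DeconstructConstraint(conoid, &entry->info);  /* fill once */
        return &entry->info;
    }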

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Andrew Dunstan



On 06/20/2012 12:18 PM, Tom Lane wrote:
> David Fetter  writes:
>> OK, new proposal:
>> COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
>> an option which requires that HEADER be enabled and mandates that the
>> column names in the header match the columns coming in.
>> Has someone got a better name for this option than
>> KEEP_HEADER_COLUMN_NAMES?
>
> Well, if it's just checking that the list matches, then maybe
> CHECK_HEADER would do.
>
> In your original proposal, I was rather wondering what should happen if
> the incoming file didn't have the same set of columns called out in the
> COPY command's column list.  (That is, while *rearranging* the columns
> might be thought non-astonishing, I'm less convinced about a copy
> operation that inserts or defaults columns differently from what the
> command said should happen.)  If we're just checking for a match, that
> question goes away.

In the past people have asked me to have copy use the CSV header line in 
place of supplying a column list in the COPY command. I can certainly 
see some utility in that, and I think it might achieve what David wants. 
Using that scenario it would be an error to supply an explicit column 
list and also use the header line. But then I don't think CHECK_HEADER 
would be the right name for the option. In any case, specifying a name 
before we settle on an exact behaviour seems wrong :-)


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Andrew Dunstan



On 06/20/2012 12:50 PM, Alvaro Herrera wrote:
> Another related case: you have a file with headers and columns (n, t,
> x, y, z) but your table only has n and t. How would you tell COPY to
> discard the junk columns? Currently it just complains that they are
> there.

That's one of the main use cases for file_text_array_fdw.
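
(Sketched with the in-core file_fdw for illustration - file_text_array_fdw
instead reads each line as a single text[] column, so the cherry-picking is
positional; the server and file names below are made up:)

    CREATE EXTENSION file_fdw;
    CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

    -- declare all five file columns ...
    CREATE FOREIGN TABLE foo_raw (n numeric, t text, x text, y text, z text)
        SERVER csv_files
        OPTIONS (filename '/path/to/foo.csv', format 'csv', header 'true');

    -- ... and keep only the two the table wants
    INSERT INTO foo (n, t) SELECT n, t FROM foo_raw;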

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 23:34, Kevin Grittner  wrote:
> Simon Riggs  wrote:
>> Kevin Grittner  wrote:
>>> Heikki Linnakangas  wrote:
>>>
>>>> I don't like the idea of adding the origin id to the record
>>>> header. It's only required in some occasions, and on some record
>>>> types.
>>>
>>> Right.
>>
>> Wrong, as explained.
>
> The point is not wrong; you are simply not responding to what is
> being said.

Heikki said that the origin ID was not required for all MMR
configs/scenarios. IMHO that is wrong, with explanation given.

By agreeing with him, I assumed you were sharing that assertion,
rather than saying something else.


> You have not explained why an origin ID is required when there is no
> replication, or if there is master/slave logical replication, or
...

You're right; I never claimed it was needed. Origin Id is only needed
for multi-master replication and that is the only context I've
discussed it.


>> This is not transaction metadata, it is WAL record metadata
>> required for multi-master replication, see later point.
>>
>> We need to add information to every WAL record that is used as the
>> source for generating LCRs.
>
> If the origin ID of a transaction doesn't count as transaction
> metadata (i.e., data about the transaction), what does?  It may be a
> metadata element about which you have special concerns, but it is
> transaction metadata.  You don't plan on supporting individual WAL
> records within a transaction containing different values for origin
> ID, do you?  If not, why is it something to store in every WAL
> record rather than once per transaction?  That's not intended to be
> a rhetorical question.  I think it's because you're still thinking
> of the WAL stream as *the medium* for logical replication data
> rather than *the source* of logical replication data.

> As long as the WAL stream is the medium, options are very
> constrained.  You can code a very fast engine to handle a single
> type of configuration that way, and perhaps that should be a
> supported feature, but it's not a configuration I've needed yet.
> (Well, on reflection, if it had been available and easy to use, I
> can think of *one* time I *might* have used it for a pair of nodes.)
> It seems to me that you are so focused on this one use case that you
> are not considering how design choices which facilitate fast
> development of that use case paint us into a corner in terms of
> expanding to other use cases.

>>> I think removing origin ID from this patch and submitting a
>>> separate patch for a generalized transaction metadata system is
>>> the sensible way to go.
>>
>> We already have a very flexible WAL system for recording data of
>> interest to various resource managers. If you wish to annotate a
>> transaction, you can either generate a new kind of WAL record or
>> you can enhance a commit record.
>
> Right.  Like many of us are suggesting should be done for origin ID.
>
>> XLOG_NOOP records can already be generated by your application if
>> you wish to inject additional metadata to the WAL stream. So no
>> changes are required for you to implement the generalised
>> transaction metadata scheme you say you require.
>
> I'm glad it's that easy.  Are there SQL functions for that yet?


Yes, another possible design is to generate a new kind of WAL record
for the origin id.

Doing it that way will slow down multi-master by a measurable amount,
and slightly bloat the WAL stream.

The proposed way uses space that is currently wasted and likely to
remain so. Only 2 bytes of 6 bytes available are proposed for use,
with a flag design that allows future extension if required. When MMR
is not in use, the WAL records would look completely identical to the
way they look now, in size, settings and speed of writing them.

Putting the origin id onto each WAL record allows very fast and simple
stateless filtering. I suggest using it because those bytes have been
sitting there unused for close to 10 years now and no better use
springs to mind.
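
(For reference, a sketch of the 9.2-era record header showing where those
bytes sit; the origin-id field name is assumed, the other fields and the
padding comment paraphrase the existing xlog.h:)

    typedef struct XLogRecord
    {
        pg_crc32      xl_crc;       /* CRC for this record */
        XLogRecPtr    xl_prev;      /* ptr to previous record in log */
        TransactionId xl_xid;       /* xact id */
        uint32        xl_tot_len;   /* total len of entire record */
        uint32        xl_len;       /* total len of rmgr data */
        uint8         xl_info;      /* flag bits */
        RmgrId        xl_rmid;      /* resource manager for this record */
        uint16        xl_origin_id; /* proposed: originating node id;
                                     * today 2 (or 6) wasted padding bytes,
                                     * depending on MAXALIGN */
    } XLogRecord;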

The proposed design is the fastest way of implementing MMR, without
any loss for non-users.

As I noted before, slowing down MMR by a small amount causes geometric
losses in performance across the whole cluster.


>> Not sure how or why that relates to requirements for multi-master.
>
> That depends on whether you want to leave the door open to other
> logical replication than the one use case on which you are currently
> focused.  I even consider some of those other cases multi-master,
> especially when multiple databases are replicating to a single table
> on another server.  I'm not clear on your definition -- it seems to
> be rather more narrow.  Maybe we need to define some terms somewhere
> to facilitate discussion.  Is there a Wiki page where that would
> make sense?

The project is called BiDirectional Replication to ensure that people
understood this is not just multi-master. But that doesn't mean that
multi-master can't have its own specific requirements.

Adding origin

Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Josh Berkus

> In the past people have asked me to have copy use the CSV header line in
> place of supplying a column list in the COPY command. I can certainly
> see some utility in that, and I think it might achieve what David wants.
> Using that scenario it would be an error to supply an explicit column
> list and also use the header line. But then I don't think CHECK_HEADER
> would be the right name for the option. In any case, specifying a name
> before we settle on an exact behaviour seems wrong :-)

Actually, I can see three valid and valuable-to-users behaviors here:

1) current behavior, header line is ignored.

2) CHECK HEADER: a column list is supplied, but a "check header" flag is
set.  If the column list and header list don't match *exactly*, you get
an error.

3) USE HEADER: no column list is supplied, but a "use header" flag is
set.  A column list is created to match the columns from the CSV header.
Of necessity, this will consist of all or some of the columns in the
table.  If columns are supplied which are not in the table, then you get
an error (as well as if columns are missing which are NOT NULL, as you
get at present).

(2) is more valuable to people who want to check data integrity
rigorously, and test for unexpected API changes.  (3) is more valuable
for regular users who want CSV import to "just work".  (1) is valuable
for backwards compatibility, and for cases where the CSV header was
generated by another program (Excel?) so the column names don't match.
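
For concreteness, the three behaviours with the option names floated so far
(none of this syntax exists today; the option names are placeholders):

    -- (1) current: the header line is read and discarded
    COPY foo (t, n) FROM 'foo.csv' WITH (FORMAT csv, HEADER);

    -- (2) proposed: error unless the header matches the column list exactly
    COPY foo (t, n) FROM 'foo.csv' WITH (FORMAT csv, HEADER, CHECK_HEADER);

    -- (3) proposed: take the column list from the header itself
    COPY foo FROM 'foo.csv' WITH (FORMAT csv, USE_HEADER);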

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Christopher Browne
On Wed, Jun 20, 2012 at 11:50 AM, Andres Freund  wrote:
> On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
>> Simon Riggs  wrote:
>> > This is not transaction metadata, it is WAL record metadata
>> > required for multi-master replication, see later point.
>
>> > We need to add information to every WAL record that is used as the
>> > source for generating LCRs.
>> If the origin ID of a transaction doesn't count as transaction
>> metadata (i.e., data about the transaction), what does?  It may be a
>> metadata element about which you have special concerns, but it is
>> transaction metadata.  You don't plan on supporting individual WAL
>> records within a transaction containing different values for origin
>> ID, do you?  If not, why is it something to store in every WAL
>> record rather than once per transaction?  That's not intended to be
>> a rhetorical question.
> It's definitely possible to store it per transaction (see the discussion around
> http://archives.postgresql.org/message-id/201206201605.43634.and...@2ndquadrant.com);
> it just makes the filtering via the originating node considerably more complex.
> With our proposal you can do it without any complexity involved, on a low
> level. Storing it per transaction means you can only stream out the data to
> other nodes *after* fully reassembling the transaction. That's a pity,
> especially if we go for a design where the decoding happens in a proxy
> instance.

I guess I'm not seeing the purpose to having the origin node id in the
WAL stream either.

We have it in the Slony sl_log_* stream, however there is a crucial
difference, in that sl_log_* is expressly a shared structure.  In
contrast, WAL isn't directly sharable; you don't mix together multiple
WAL streams.

It seems as though the point in time at which you need to know the
origin ID is the moment at which you're deciding to read data from the
WAL files, and knowing which stream you are reading from is an
assertion that might be satisfied by looking at configuration that
doesn't need to be in the WAL stream itself.  It might be *nice* for
the WAL stream to be self-identifying, but that doesn't seem to be
forcibly necessary.

The case where it *would* be needful is if you are in the process of
assembling together updates coming in from multiple masters, and need
to know:
   - This INSERT was replicated from node #1, so should be ignored downstream
   - That INSERT was replicated from node #2, so should be ignored downstream
   - This UPDATE came from the local node, so needs to be passed to
downstream users

Or perhaps something else is behind the node id being deeply embedded
into the stream that I'm not seeing altogether.

> Other metadata will not be needed on such a low level.
>
> I also have to admit that I am very hesitant to start developing some generic
> "transaction metadata" framework atm. That seems to be a good way to spend a
> lot of time in discussion and disagreement. IMO that's something for
> later.

Well, I see there being a use in there being at least 3 sorts of LCR records:
a) Capturing literal SQL that is to replayed downstream.  This
parallels two use cases existing in existing replication systems:
   i) In pre-2.2 versions of Slony, statements are replayed literally.
 So there's a stream of INSERT/UPDATE/DELETE statements.
   ii) DDL capture and replay.  In existing replication systems, DDL
isn't captured implicitly, the way Dimitri's Event Triggers are to do,
but rather is captured explicitly.
   There should be a function to allow injecting such SQL explicitly;
that is sure to be a useful sort of thing to be able to do.

b) Capturing tuple updates in a binary form that can be turned readily
into heap updates on a replica.
Unfortunately, this form is likely not to play well when
replicating across platforms or Postgres versions, so I suspect that
this performance optimization should be implemented as a *last*
resort, rather than first.  Michael Jackson had some "rules of
optimization" that said "don't do it", and, for the expert, "don't do
it YET..."

c) Capturing tuple data in some reasonably portable and readily
re-writable form.
Slony 2.2 changes from "SQL fragments" (of a) i) above) to storing
updates as an array of text values indicating:
 - relation name
 - attribute names
 - attribute values, serialized into strings
I don't know that this provably represents the *BEST*
representation, but it definitely will be portable where b) would not
be, and lends itself to being able to reuse query plans, where a)
requires extraordinary amounts of parsing work, today.  So I'm pretty
sure it's better than a) and b) for a sizable set of cases.
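
(Loosely, a single captured UPDATE in that third style might serialize as
something like the following; purely illustrative:)

    relation:   public.accounts
    attnames:   {id, balance}
    attvalues:  {"42", "107.50"}
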
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund  wrote:
> I would prefer the event trigger way because that seems to be the most
> extensible/reusable. It would allow a fully replicated databases and catalog
> only instances.
> I think we need to design event triggers in a way you cannot simply circumvent
> them. We already have the case that if users try to screw around system
> triggers we give back wrong answers with the planner relying on foreign keys
> btw.
> If the problem is having user trigger after system triggers: Lets make that
> impossible. Forbidding DDL on the other instances once we have that isn't that
> hard.

So, this is interesting.  I think something like this could solve the
problem, but then why not just make it built-in code that runs from
the same place as the event trigger rather than using the trigger
mechanism per se?  Presumably the "trigger" code's purpose is going to
be to inject additional data into the WAL stream (am I wrong?) which
is not something you're going to be able to do from PL/pgsql anyway,
so you don't really need a trigger, just a call to some C function -
which not only has the advantage of being not bypassable, but is also
faster.

> Perhaps all that will get simpler if we can make reading the catalog via
> custom-built snapshots work, as you proposed elsewhere in this thread. That
> would make checking errors much easier, even if you just want to apply to
> a database with exactly the same schema. That's the next thing I plan to work
> on.

I realized a problem with that idea this morning: it might work for
reading things, but if anyone attempts to write data you've got big
problems.  Maybe we could get away with forbidding that, not sure.
Would be nice to get some input from other hackers on this.

>> They could also modify the catalogs directly, although it's possible we
>> don't care quite as much about that case (but on the other hand people
>> do sometimes need to do it to solve real problems).
> With that you already can crash the database perfectly fine today. I think
> trying to care for that is a waste of time.

You're probably right.

> I agree that the focus isn't 100% optimal and that there are *loads* of issues
> we haven't event started to look at. But you need a point to start and
> extraction & apply seems to be a good one because you can actually test it
> without the other issues solved which is not really the case the other way
> round.
> Also its possible to plug in the newly built changeset extraction into
> existing solutions to make them more efficient while retaining most of their
> respective framework.
>
> So I disagree that thats the wrong part to start with.

I think extraction is a very sensible place to start; actually, I
think it's the best possible place to start.  But this particular
thread is about adding origin_ids to WAL, which I think is definitely
not the best place to start.

> I definitely do want to provide code that generates a textual representation
> of the changes. As you say, even if it's not used for anything it's needed for
> debugging. Not sure if it should be SQL or maybe the new Slony representation.
> If that's provided and reusable it should make sure that on top of that other
> solutions can be built.

Oh, yeah.  If we can get that, I will throw a party.

> I find your supposition that I/we just want to get MMR without regard for
> anything else a bit offensive. I wrote at least three times in this thread
> that I do think it's likely that we will not get more than the minimal basis
> for implementing MMR into 9.3. I wrote multiple times that I want to provide
> the basis for multiple solutions. The prototype - while obviously being
> incomplete - tried hard to be modular.
> You cannot blame us for wanting the work we do to *also* be usable for
> one of our major aims.
> What can I do to convince you/others that I am not planning to do something
> "evil" but that I try to reach as many goals at once as possible?

Sorry.  I don't think you're planning to do something evil, but earlier
I thought you said you did NOT want to write the code to extract
changes as text or something similar.  I think that would be a really
bad thing to skip for all kinds of reasons.  I think we need that as a
foundational technology before we do much else.  Now, once we have
that, if we can safely detect cases where it's OK to bypass decoding
to text and skip it in just those cases, I think that's great
(although possibly difficult to implement correctly).  I basically
feel that without decode-to-text, this can't possibly be a basis for
multiple solutions; it will be a basis only for itself, and extremely
difficult to debug, too.  No other replication solution can even
theoretically have any use for the raw on-disk tuple, at least not
without horrible kludgery.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] sortsupport for text

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 12:41 PM, Tom Lane  wrote:
>> The fact is that this is likely to be a fairly significant
>> performance win, because strxfrm() is quite simply the way you're
>> supposed to do collation-aware sorting, and is documented as such. For
>> that reason, C standard library implementations should not be expected
>> to emphasize its performance - they assume that you're using strxfrm()
>> + their highly optimised strcmp()
>
> Have you got any evidence in support of this claim, or is it just
> wishful thinking about what's likely to be inside libc?  I'd also note
> that any comparisons you may have seen about this are certainly not
> accounting for the effects of data bloat from strxfrm (ie, possible
> spill to disk, more merge passes, etc).
>
> In any case, if you have to redefine the meaning of equality in order
> to justify a performance patch, I'm prepared to walk away at the start.
> The range of likely performance costs/benefits across different locales
> and different implementations is so wide that if you can't show it to be
> a win even with the strcmp tiebreaker, it's not likely to be a reliable
> win without that.

On the testing I've done, the strcmp() tie-breaking rarely gets run
anyway.  Unless you're sorting data with only a few distinct values,
most comparisons are between values that are distinct under any choice
of collation, and therefore strcoll() returns 0 very rarely, and
therefore the additional runtime it consumes does not matter very
much.  Also, it's quite a bit faster than strcoll() anyway, so even
when it does run it doesn't add much to the total time.

I think the elephant in the room here is that we're relying on the OS
to do everything for us, and the OS API we use (strcoll) requires an
extra memcpy and is also dreadfully slow.  If we could solve that
problem, it would save us a lot more than worrying about the extra
strcmp().  Of course, solving that problem is hard: we either have to
get the glibc and FreeBSD libc folks to provide a better API (that
takes lengths for each input instead of relying on trailing nul
bytes), or reimplement locales within PG, or store trailing NUL bytes
that we don't really need in the index so we can apply strcoll
directly, none of which are very appealing.
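
(The extra memcpy in question, sketched: text datums carry a length and no
terminator, so something along these lines is unavoidable with strcoll();
varstr_cmp() does essentially this, falling back to palloc for long strings:)

    /* sketch: collate two counted (non-NUL-terminated) strings */
    static int
    counted_strcoll(const char *a, int alen, const char *b, int blen)
    {
        char    buf1[1024], buf2[1024];  /* assumes alen, blen < 1024 here */

        memcpy(buf1, a, alen);
        buf1[alen] = '\0';
        memcpy(buf2, b, blen);
        buf2[blen] = '\0';
        return strcoll(buf1, buf2);
    }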

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Marc Mamin
>  -----Original Message-----
>  From: pgsql-hackers-ow...@postgresql.org on behalf of Josh Berkus
>  Sent: Wed 6/20/2012 7:06
>  To: pgsql-hackers@postgresql.org
>  Subject: Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

...

>   (1) is valuable
>  for backwards compatibility, and for cases where the CSV header was
>  generated by another program (Excel?) so the column names don't match.


4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...)

to cover the case when column names do not match.

my 2 pence,

Marc Mamin


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 20 June 2012 17:41, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> No, I'm suggesting it would probably be at least a bit of a win here
>> to cache the constant, and only have to do a strxfrm() + strcmp() per
>> comparison.
>
> Um, have you got any hard evidence to support that notion?  The
> traditional advice is that strcoll is faster than using strxfrm unless
> the same strings are to be compared repeatedly.  I'm not convinced that
> saving the strxfrm on just one side will swing the balance.

Fair enough, but I'd suggest that that traditional advice assumes
that strcoll()'s space-efficiency and general avoidance of dynamic
allocation is an important factor, which, for us, it clearly isn't,
since we can re-use a single buffer in the manner of Robert's text
sortsupport patch for each and every per-tuple strxfrm() blob (when
traversing an index). This is a direct quote from glibc's strcoll_l.c
:

 Perform the first pass over the string and while doing this find
 and store the weights for each character.  Since we want this to
 be as fast as possible we are using `alloca' to store the temporary
 values.  But since there is no limit on the length of the string
 we have to use `malloc' if the string is too long.  We should be
 very conservative here.

Here, alloca is used to allocate space in a stack frame. I believe
that this is an entirely inappropriate trade-off for Postgres to be
making. strxfrm(), in contrast, leaves buffer sizing and management
up to the caller. That has to be a big part of the problem here.

>> The fact is that this is likely to be a fairly significant
>> performance win, because strxfrm() is quite simply the way you're
>> supposed to do collation-aware sorting, and is documented as such. For
>> that reason, C standard library implementations should not be expected
>> to emphasize its performance - they assume that you're using strxfrm()
>> + their highly optimised strcmp()
>
> Have you got any evidence in support of this claim, or is it just
> wishful thinking about what's likely to be inside libc?

According to the Single UNIX Specification's strcoll() documentation,
"The strxfrm() and strcmp() functions should be used for sorting large
lists". If that isn't convincing enough for you, there is the fact
that glibc's strcmp() is clearly highly optimised for each and every
architecture, and that we are currently throwing away an extra
strcmp() in the event of strcoll() equality.
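
The pattern the spec has in mind looks roughly like this (a sketch
only; memory management and error handling omitted): transform each
key once, then let the optimised strcmp() service all of the
O(n log n) comparisons:

#include <stdlib.h>
#include <string.h>

typedef struct
{
    char *orig;     /* original string */
    char *blob;     /* strxfrm() output */
} Item;

static int
cmp_blobs(const void *a, const void *b)
{
    return strcmp(((const Item *) a)->blob, ((const Item *) b)->blob);
}

static void
sort_collated(Item *items, size_t n)
{
    size_t i;

    /* Transform each key exactly once... */
    for (i = 0; i < n; i++)
    {
        size_t len = strxfrm(NULL, items[i].orig, 0) + 1;

        items[i].blob = malloc(len);
        strxfrm(items[i].blob, items[i].orig, len);
    }
    /* ...then every comparison is a plain, heavily-optimised strcmp(). */
    qsort(items, n, sizeof(Item), cmp_blobs);
}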

> I'd also note that any comparisons you may have seen about this are
> certainly not accounting for the effects of data bloat from strxfrm
> (ie, possible spill to disk, more merge passes, etc).

What about the fact that strcoll() may be repeatedly allocating and
freeing memory per comparison? The blobs really aren't that much
larger than the strings to be sorted, which are typically quite short.

> In any case, if you have to redefine the meaning of equality in order
> to justify a performance patch, I'm prepared to walk away at the start.

The advantage of my proposed implementation is precisely that I won't
have to redefine the meaning of equality, and that only the text
datatype will have to care about equivalency, so you can just skip
over an explanation of equivalency for most audiences.

If you feel that strongly about it, and I have no possible hope of
getting this accepted, I'm glad that I know now rather than after
completing a significant amount of work on this. I would like to hear
other people's opinions before I drop it, though.

> The range of likely performance costs/benefits across different locales
> and different implementations is so wide that if you can't show it to be
> a win even with the strcmp tiebreaker, it's not likely to be a reliable
> win without that.

glibc is the implementation that really matters. My test case used
en_US.UTF-8 as its locale, which has to be one of the least stressful
to strcoll() - I probably could have shown a larger improvement just
by selecting a locale known to make more passes, like, say,
hu_HU.UTF-8.

I expect to have access to a 16 core server next week, which Bull have
made available to me. Maybe I'll get some interesting performance
numbers from it.

The reason that I don't want to use the blob-with-original-string hack
is that it's ugly, space-inefficient, unnecessary and objectively
incorrect, since it forces us to violate conformance requirement C9 of
Unicode 3.0, marginal though that may be.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Josh Berkus

> 
> 4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...)
> 
> to cover the case when column names do not match.

Personally, I think that's going way beyond what we want to do with
COPY.  At that point, check out the CSV-array FDW.

Of course, if someone writes a WIP patch which implements the above, we
can evaluate it then.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com





Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 07:17:57 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund wrote:
> > I would prefer the event trigger way because that seems to be the most
> > extensible/reusable. It would allow a fully replicated databases and
> > catalog only instances.
> > I think we need to design event triggers in a way that you cannot
> > simply circumvent them. We already have the case that, if users try
> > to screw around with system triggers, we give back wrong answers,
> > with the planner relying on foreign keys, btw.
> > If the problem is having user triggers fire after system triggers:
> > let's make that impossible. Forbidding DDL on the other instances
> > once we have that isn't that hard.
> 
> So, this is interesting.  I think something like this could solve the
> problem, but then why not just make it built-in code that runs from
> the same place as the event trigger rather than using the trigger
> mechanism per se?  Presumably the "trigger" code's purpose is going to
> be to inject additional data into the WAL stream (am I wrong?) which
> is not something you're going to be able to do from PL/pgsql anyway,
> so you don't really need a trigger, just a call to some C function -
> which not only has the advantage of being not bypassable, but is also
> faster.
I would be totally fine with that. As long as event triggers provide the 
infrastructure that shouldn't be a big problem.
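
To make that concrete, something like this is what I understand Robert
to be suggesting (all names here are invented for illustration; no such
function exists today):

typedef struct Node Node;       /* stand-in for the real parse-tree type */

extern void LogicalDDLCapture(Node *parsetree);     /* invented name */

static void
process_utility_sketch(Node *parsetree)
{
    /* ... normal utility-statement processing ... */

    /*
     * Built-in call at the point where an event trigger would otherwise
     * fire.  Because it is compiled in rather than installed as a
     * trigger, it cannot be dropped, disabled or reordered from SQL,
     * and there is no PL call overhead.
     */
    LogicalDDLCapture(parsetree);
}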

> > Perhaps all that will get simpler if we can make reading the catalog via
> > custom-built snapshots work, as you proposed elsewhere in this thread.
> > That would make checking errors much easier, even if you just want to
> > apply to a database with exactly the same schema. That's the next thing
> > I plan to work on.
> I realized a problem with that idea this morning: it might work for
> reading things, but if anyone attempts to write data you've got big
> problems.  Maybe we could get away with forbidding that, not sure.
Hm, why is writing a problem? You mean I/O conversion routines writing data?
Yes, that will be a problem. I am fine with simply forbidding that; we should
be able to catch that and provide a sensible error message - we have had the
support for that since SSI went in.

> Would be nice to get some input from other hackers on this.
Oh, yes!


> > I agree that the focus isn't 100% optimal and that there are *loads* of
> > issues we haven't even started to look at. But you need a point to
> > start, and extraction & apply seems to be a good one, because you can
> > actually test it without the other issues solved, which is not really
> > the case the other way round.
> > Also it's possible to plug the newly built changeset extraction into
> > existing solutions to make them more efficient while retaining most of
> > their respective framework.
> >
> > So I disagree that that's the wrong part to start with.
> 
> I think extraction is a very sensible place to start; actually, I
> think it's the best possible place to start.  But this particular
> thread is about adding origin_ids to WAL, which I think is definitely
> not the best place to start.
Yep. I think the reason everyone started with it is that the patch was
actually really simple ;).
Note that the WAL enrichment & decoding patches were before the origin_id
patch in the patch series ;)

> > I definitely do want to provide code that generates a textual
> > representation of the changes. As you say, even if it's not used for
> > anything else, it's needed for debugging. Not sure if it should be SQL
> > or maybe the new Slony representation. If that's provided and reusable,
> > it should make sure that other solutions can be built on top of it.
> Oh, yeah.  If we can get that, I will throw a party.
Good ;)

> > I find your supposition that I/we just want to get MMR without regard
> > for anything else a bit offensive. I wrote at least three times in this
> > thread that I do think it's likely that we will not get more than the
> > minimal basis for implementing MMR into 9.3. I wrote multiple times that
> > I want to provide the basis for multiple solutions. The prototype -
> > while obviously being incomplete - tried hard to be modular.
> > You cannot blame us for wanting the work we do to *also* be usable for
> > one of our major aims, can you?
> > What can I do to convince you/others that I am not planning to do
> > something "evil" but that I am trying to reach as many goals at once as
> > possible?
> Sorry.  I don't think you're planning to do something evil, but before
> I thought you said you did NOT want to write the code to extract
> changes as text or something similar.
Hm. I might have been a bit ambiguous when saying that I do not want to 
provide everything for that use-case.
Once we have a callpoint that has a correct catalog snapshot for exactly the 
tuple in question, text conversion is damn near trivial. The point where you
get passed all that information (action, tuple, table, snapshot) is the one I 
think the patch should mainly provide.

> I think that

Re: [HACKERS] pgbench--new transaction type

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 3:48 AM, Simon Riggs  wrote:
> I'm sure Jeff submitted this because of the need for a standard test,
> rather than the wish to actually modify pgbench itself.
>
> Can I suggest that we include a list of standard scripts with pgbench
> for this purpose? These can then be copied alongside the binary when
> we do an install.

I was thinking along similar lines myself.  At the least, I think we
can't continue to add a short option for every new test type.
Instead, maybe we could have --test-type=WHATEVER, and perhaps that
then reads whatever.sql from some compiled-in directory.  That would
allow us to sanely support a moderately large number of tests.
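
A minimal sketch of the lookup, assuming an invented compiled-in
directory PGBENCH_SCRIPT_DIR (both the macro and the function name are
hypothetical):

#include <stdio.h>

#ifndef PGBENCH_SCRIPT_DIR
#define PGBENCH_SCRIPT_DIR "/usr/local/pgsql/share/pgbench"   /* assumption */
#endif

/* Map --test-type=foo to <dir>/foo.sql; caller reports an error on NULL. */
static FILE *
open_test_script(const char *test_type)
{
    char path[1024];

    snprintf(path, sizeof(path), "%s/%s.sql", PGBENCH_SCRIPT_DIR, test_type);
    return fopen(path, "r");
}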

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WAL format changes

2012-06-20 Thread Fujii Masao
On Wed, Jun 20, 2012 at 8:19 PM, Magnus Hagander  wrote:
> On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas  wrote:
>> On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
>>  wrote:
>>> Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
>>> a uint64, on top of my other WAL format patches. I think we should go ahead
>>> with this.
>>
>> +1.
>>
>>> The LSNs on pages are still stored in the old format, to avoid changing the
>>> on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
>>> file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
>>> is required.
>>
>> Seems fine.
>>
>>> Should we keep the old representation in the replication protocol messages?
>>> That would make it simpler to write a client that works with different
>>> server versions (like pg_receivexlog). Or, while we're at it, perhaps we
>>> should mandate network-byte order for all the integer and XLogRecPtr fields
>>> in the replication protocol. That would make it easier to write a client
>>> that works across different architectures, in >= 9.3. The contents of the
>>> WAL would of course be architecture-dependent, but it would be nice if
>>> pg_receivexlog and similar tools could nevertheless be
>>> architecture-independent.
>>
>> I share Andres' question about how we're doing this already.  I think
>> if we're going to break this, I'd rather do it in 9.3 than 5 years
>> from now.  At this point it's just a minor annoyance, but it'll
>> probably get worse as people write more tools that understand WAL.
>
> If we are looking at breaking it, and we are especially concerned
> about something like pg_receivexlog... Is it something we could/should
> change in the protocol *now* for 9.2, to make it non-broken in any
> released version? As in, can we extract just the protocol change and
> backpatch that to 9.2beta?

pg_receivexlog in 9.2 cannot correctly handle the WAL location "FF"
(which was skipped in 9.2 and before). For example, pg_receivexlog calls
XLByteAdvance(), which always skips "FF". So even if we change the protocol,
ISTM pg_receivexlog in 9.2 cannot work well with a 9.3 server, which
might send "FF". No?
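
For anyone following along, the change under discussion is essentially
this (simplified; the real typedefs live in access/xlogdefs.h, and of
course only one form exists at a time):

/* 9.2 and earlier: a WAL location as two 32-bit halves */
typedef struct XLogRecPtr
{
    uint32      xlogid;         /* log file #, each spanning 4 GB */
    uint32      xrecoff;        /* byte offset within the log file */
} XLogRecPtr;

/* proposed: a flat 64-bit byte position in the WAL stream */
typedef uint64 XLogRecPtr;

With the two-part form, the last 16MB segment ("FF") of each logical
file was never used, and 9.2 clients bake that assumption in - which is
exactly the incompatibility described above.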

Regards,

-- 
Fujii Masao



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 23:56, Robert Haas  wrote:
> On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs  wrote:
>>>>> But I think getting even
>>>>> single-master logical replication working well in a single release
>>>>> cycle is going to be a job and a half.
>>>>
>>>> OK, so your estimate is 1.5 people to do that. And if we have more
>>>> people, should they sit around doing nothing?
>>>
>>> Oh, give me a break.  You're willfully missing my point.  And to quote
>>> Fred Brooks, nine women can't make a baby in one month.
>>
>> No, I'm not. The question is not how quickly can N people achieve a
>> single thing, but how long will it take a few skilled people working
>> on carefully selected tasks that have few dependencies between them to
>> achieve something.
>
> The bottleneck is getting the design right, not writing the code.
> Selecting tasks for people to work on without an agreement on the
> design will not advance the process, unless we just accept whatever
> code you choose to write based on whatever design you happen to pick.

Right. Which is why there are multiple people doing design on
different aspects and much work going into prototyping to give some
useful information into debates that would otherwise be bikeshedding.


>> How exactly did you arrive at your conclusion? Why is yours right and
>> mine wrong?
>
> I estimated the amount of work that would be required to do this right
> and compared it to other large projects that have been successfully
> done in the past.  I think you are looking at something on the order
> of magnitude of the Windows port, which took about four releases to
> become stable, or the SE-Linux project, which still isn't
> feature-complete.  Even if it's only a HS-sized project, that took two
> releases, as did streaming replication.  SSI got committed within one
> release cycle, but there were several years of design and advocacy
> work before any code was written, so that, too, was really a
> multi-year project.

The current BDR project uses concepts that have been in discussion for
many years, something like 6-7 years. I first wrote multi-master for
Postgres in userspace in 2007, and the current designs build upon that.
My detailed design for WAL-based logical replication was produced in
2009. I'm not really breaking new ground in the design, and we're
reusing significant parts of existing code. You don't know that because
you didn't think to ask, which is a little surprising.

On top of that, Andres has refined many of the basic ideas and what he
presents here is a level above my initial design work. I have every
confidence that he'll work independently and produce useful code while
other aspects are considered.


> I'll confine my comments on the second part of the question to the
> observation that it is a bit early to know who is right and who is
> wrong, but the question could just as easily be turned on its head.
>
>> No, I have four people who had initial objections and who have not
>> commented on the fact that the points made are regrettably incorrect.
>
> I think Kevin addressed this point better than I can.  Asserting
> something doesn't make it true, and you haven't offered any rational
> argument against the points that have been made, probably because
> there isn't one.  We *cannot* justify stealing 100% of the available
> bit space for a feature that many people won't use and may not be
> enough to address the real requirement anyway.


You've made some assertions above that aren't correct, so it's good
that you recognise that assertions may not be true, on both sides. I
ask that you listen a little before casting judgements. We won't get
anywhere otherwise.

The current wasted space on 64-bit systems is 6 bytes per record. The
current suggestion is to use 2 of those, in a flexible manner that
allows future expansion if it is required, and only if it is required.
That flexibility covers anything else we need. Asserting that "100% of
the available space" is being used is wrong, and to continue to assert
that even when told otherwise multiple times is something else.
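
For reference, the record header in question looks roughly like this
(paraphrased from access/xlog.h of this era; pg_crc32, XLogRecPtr etc.
are PostgreSQL's own typedefs):

typedef struct XLogRecord
{
    pg_crc32      xl_crc;       /* CRC for this record */
    XLogRecPtr    xl_prev;      /* ptr to previous record in log */
    TransactionId xl_xid;       /* xact id */
    uint32        xl_tot_len;   /* total len of entire record */
    uint32        xl_len;       /* total len of rmgr data */
    uint8         xl_info;      /* flag bits */
    RmgrId        xl_rmid;      /* resource manager for this record */

    /*
     * With 8-byte MAXALIGN there are 6 unused padding bytes here (2 with
     * 4-byte MAXALIGN).  The suggestion is to carve a 2-byte origin node
     * id out of them, which costs nothing on 64-bit systems and still
     * leaves 4 bytes spare for future use.
     */
} XLogRecord;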


>> Since at least 3 of the people making such comments did not attend the
>> full briefing meeting in Ottawa, I am not particularly surprised.
>> However, I do expect people that didn't come to the meeting to
>> recognise that they are likely to be missing information and to listen
>> closely, as I listen to them.
>
> Participation in the community development process is not contingent
> on having flown to Ottawa in May, or on having decided to spend that
> evening at your briefing meeting.  Attributing to ignorance what is
> adequately explained by honest disagreement is impolite.

I don't think people need to have attended that briefing. But if they
did not, then they do need to be aware they missed hours of community
discussion and explanation, built on top of months of careful
investigation, all of which was aimed at helping everybody understand.

Jumping straight into a discussion is what the community is about

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 1:40 PM, Andres Freund  wrote:
>> I realized a problem with that idea this morning: it might work for
>> reading things, but if anyone attempts to write data you've got big
>> problems.  Maybe we could get away with forbidding that, not sure.
> Hm, why is writing a problem? You mean I/O conversion routines writing data?
> Yes, that will be a problem. I am fine with simply forbidding that; we should
> be able to catch that and provide a sensible error message - we have had the
> support for that since SSI went in.

I think we could do something a little more vigorous than that, like
maybe error out if anyone tries to do anything that would write WAL or
acquire an XID.  Of course, then the question becomes: then what?  We
probably need to think about what happens after that - we don't want
an error replicating one row in one table to permanently break
replication for the entire system.
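
A hypothetical sketch of that guard, placed wherever XIDs are handed
out (the flag name is invented; ereport()/errcode()/errmsg()/errhint()
are the usual error-reporting primitives):

/* Inside the XID-assignment path, under an invented flag: */
if (InLogicalChangeExtraction)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("cannot assign a transaction ID during logical change extraction"),
             errhint("Output plugins and conversion routines must not write to the database.")));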

>> Sorry.  I don't think you're planning to do something evil, but before
>> I thought you said you did NOT want to write the code to extract
>> changes as text or something similar.
> Hm. I might have been a bit ambiguous when saying that I do not want to
> provide everything for that use-case.
> Once we have a callpoint that has a correct catalog snapshot for exactly the
> tuple in question, text conversion is damn near trivial. The point where you
> get passed all that information (action, tuple, table, snapshot) is the one I
> think the patch should mainly provide.

This is actually a very interesting list.  We could rephrase the
high-level question about the design of this feature as "what is the
best way to make sure that you have these things available?".  Action
and tuple are trivial to get, and table isn't too hard either.  It's
really the snapshot - and all the downstream information that can only
be obtained via using that snapshot - that is the hard part.
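
Purely as illustration, here is that quartet bundled up as what a
change-extraction callpoint might hand to a consumer (the struct and
field names are invented; HeapTuple, Relation and Snapshot are the
existing internal types):

typedef struct LogicalChange
{
    char      action;       /* 'I'nsert, 'U'pdate or 'D'elete: trivial */
    HeapTuple tuple;        /* decoded from the WAL record: trivial */
    Relation  table;        /* resolved from the relfilenode: not hard */
    Snapshot  snapshot;     /* catalog snapshot valid at this point in
                             * the WAL stream: the hard part */
} LogicalChange;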

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 21 June 2012 01:40, Andres Freund  wrote:

>> I think extraction is a very sensible place to start; actually, I
>> think it's the best possible place to start.  But this particular
>> thread is about adding origin_ids to WAL, which I think is definitely
>> not the best place to start.

> Yep. I think the reason everyone started with it is that the patch was
> actually really simple ;).
> Note that the WAL enrichment & decoding patches were before the origin_id
> patch in the patch series ;)

Actually, I thought it would be non-contentious...

A format change, so early in the release cycle. Already openly discussed
with no comment. Using wasted space for efficiency, for information
already in use by Slony for similar reasons.

Oh well.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


