Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 11:26, Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Simon Riggs si...@2ndquadrant.com wrote:
 The proposal is to use WAL to generate the logical change stream.
 That has been shown in testing to be around x4 faster than having
 a separate change stream, which must also be WAL logged (as Jan
 noted).

 Sure, that's why I want it.

 I think this argument is basically circular.  The reason it's 4x faster
 is that the WAL stream doesn't actually contain all the information
 needed to generate LCRs (thus all the angst about maintaining catalogs
 in sync, what to do about unfriendly datatypes, etc).  By the time the
 dust has settled and you have a workable system, you will have bloated
 WAL and given back a large chunk of that multiple, thereby invalidating
 the design premise.  Or at least that's my prediction.

The tests were conducted with the additional field added, so your
prediction is not verified.

The additional fields do not bloat WAL records - they take up exactly
the same space as before.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 01:27, Kevin Grittner wrote:

Andres Freund and...@2ndquadrant.com wrote:


Yes, that's definitely a valid use-case. But that doesn't preclude
the other - also not uncommon - use-case where you want to have
different masters which all contain up-to-date data.


I agree.  I was just saying that while one requires an origin_id,
the other doesn't.  And those not doing MM replication definitely
don't need it.


I think it would be helpful to list a few concrete examples of 
this. The stereotypical multi-master scenario is that you have a single 
table that's replicated to two servers, and you can insert/update/delete 
on either server. Conflict resolution strategies vary.


The reason we need an origin id in this scenario is that otherwise this 
will happen:


1. A row is updated on node A
2. Node B receives the WAL record from A, and updates the corresponding 
row in B. This generates a new WAL record.
3. Node A receives the WAL record from B, and updates the row again. 
This again generates a new WAL record, which is replicated to B, and you 
loop indefinitely.


If each WAL record carries an origin id, node A can use it to refrain 
from applying the WAL record it receives from B, which breaks the loop.


However, note that in this simple scenario, if the logical log replay / 
conflict resolution is smart enough to recognize that the row has 
already been updated, because the old and the new rows are identical, 
the loop is broken at step 3 even without the origin id. That works for 
the newest-update-wins and similar strategies. So the origin id is not 
absolutely necessary in this case.
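
To make the filtering concrete, a minimal sketch of the idea (names and types are hypothetical, not the actual patch code) might look like this:

    #include <stdint.h>

    /* Sketch: skip changes that originated on this node so they are not
     * applied - and re-replicated - a second time. */
    typedef struct LogicalChange
    {
        uint16_t    origin_node_id;     /* node on which the change was first made */
        /* ... change payload ... */
    } LogicalChange;

    static void
    apply_change(LogicalChange *change, uint16_t local_node_id)
    {
        if (change->origin_node_id == local_node_id)
            return;                     /* came from here originally: break the loop */

        /* apply the change; any WAL generated by the apply keeps the
         * original origin_node_id, so other nodes can filter the same way */
    }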


Another interesting scenario is that you maintain a global counter, like 
in an inventory system, and conflicts are resolved by accumulating the 
updates. For example, if you do "UPDATE SET counter = counter + 1" 
simultaneously on two nodes, the result is that the counter is 
incremented by two. The avoid-update-if-already-identical optimization 
doesn't help in this case; the origin id is necessary.


Now, let's take the inventory system example further. There are actually 
two ways to update a counter. One is when an item is checked in or out 
of the warehouse, i.e. "UPDATE counter = counter + 1". Those updates 
should accumulate. But another operation resets the counter to a 
specific value, "UPDATE counter = 10", like when taking an inventory. 
That should not accumulate with other changes, but should be 
newest-update-wins. The origin id is not enough for that, because by 
looking at the WAL record and the origin id, you don't know which type 
of update it was.


So, I don't like the idea of adding the origin id to the record header. 
It's only required on some occasions, and for some record types. And I'm 
worried it might not even be enough in more complicated scenarios.


Perhaps we need a more generic WAL record annotation system, where a 
plugin can tack arbitrary information to WAL records. The extra 
information could be stored in the WAL record after the rmgr payload, 
similar to how backup blocks are stored. WAL replay could just ignore 
the annotations, but a replication system could use it to store the 
origin id or whatever extra information it needs.
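
As an illustration only, such an annotation could be laid out roughly like the sketch below, trailing the rmgr payload the same way backup blocks do (a hypothetical structure, not a proposal for the exact format):

    #include <stdint.h>

    /* Hypothetical trailer appended after the rmgr payload of a WAL record.
     * Replay can skip over it; a replication plugin can interpret it. */
    typedef struct XLogRecordAnnotation
    {
        uint8_t     ann_id;     /* which plugin/annotation type this is */
        uint8_t     ann_len;    /* number of opaque data bytes that follow */
        /* ann_len bytes of plugin data follow, e.g. an origin node id */
    } XLogRecordAnnotation;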


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 08:35, Robert Haas robertmh...@gmail.com wrote:

 I expect it would be fine to have a tool that pulls LCRs out of WAL to
 prepare that to be sent to remote locations.  Is that what you have in
 mind?

 Yes.  I think it should be possible to generate LCRs from WAL, but I
 think that the on-the-wire format for LCRs should be different from
 the WAL format.  Trying to use the same format for both things seems
 like an unpleasant straightjacket.

You're confusing things here.

Yes, we can use a different on-the-wire format. No problem.

As I've already said, the information needs to be in WAL first before
we can put it into LCRs, so the "don't use the same format" argument is
not relevant as to why the info must be on the WAL record in the first
place. And as said elsewhere, doing that does not cause problems in
any of these areas: WAL bloat, performance, or long-term restrictions on
the number of node ids, contrary to what has so far been claimed on this thread.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 14:40, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 The reason we need an origin id in this scenario is that otherwise this will
 happen:

 1. A row is updated on node A
 2. Node B receives the WAL record from A, and updates the corresponding row
 in B. This generates a new WAL record.
 3. Node A receives the WAL record from B, and updates the row again. This
 again generates a new WAL record, which is replicated to B, and you loop
 indefinitely.

 If each WAL record carries an origin id, node A can use it to refrain from
 applying the WAL record it receives from B, which breaks the loop.

 However, note that in this simple scenario, if the logical log replay /
 conflict resolution is smart enough to recognize that the row has already
 been updated, because the old and the new rows are identical, the loop is
 broken at step 3 even without the origin id. That works for the
 newest-update-wins and similar strategies. So the origin id is not
 absolutely necessary in this case.

Including the origin id in the WAL allows us to filter out WAL records
when we generate LCRs, so we can completely avoid step 3, including
all of the CPU, disk and network overhead that implies. Simply put, we
know the change came from A, so no need to send the change to A again.

If we do allow step 3 to exist, we still need to send the origin id.
This is because there is no a priori way to know the origin id is not
required, such as the case where we have concurrent updates, which are
effectively a race condition between actions on separate nodes. Not
sending the origin id because it is not required in some cases is
equivalent to saying we can skip locking because a race condition does
not happen in all cases. Making a case that the race condition is rare
is still not a case for skipping locking. Same here: we need the
information to avoid making errors in the general case.

 Another interesting scenario is that you maintain a global counter, like in
 an inventory system, and conflicts are resolved by accumulating the updates.
 For example, if you do UPDATE SET counter = counter + 1 simultaneously on
 two nodes, the result is that the counter is incremented by two. The
 avoid-update-if-already-identical optimization doesn't help in this case,
 the origin id is necessary.

 Now, let's take the inventory system example further. There are actually two
 ways to update a counter. One is when an item is checked in or out of the
 warehouse, ie. UPDATE counter = counter + 1. Those updates should
 accumulate. But another operation resets the counter to a specific value,
 UPDATE counter = 10, like when taking an inventory. That should not
 accumulate with other changes, but should be newest-update-wins. The origin
 id is not enough for that, because by looking at the WAL record and the
 origin id, you don't know which type of an update it was.

Yes, of course. Conflict handling in the general case requires much
additional work. This thread is one minor change among many related
changes. The patches are being submitted in smaller chunks to ease
review, so sometimes there are cross-links between things.

 So, I don't like the idea of adding the origin id to the record header. It's
 only required in some occasions, and on some record types.

That conclusion doesn't follow from your stated arguments.

 And I'm worried
 it might not even be enough in more complicated scenarios.

It is not the only required conflict mechanism, and has never been
claimed to be so. It is simply one piece of information needed, at
various times.


 Perhaps we need a more generic WAL record annotation system, where a plugin
 can tack arbitrary information to WAL records. The extra information could
 be stored in the WAL record after the rmgr payload, similar to how backup
 blocks are stored. WAL replay could just ignore the annotations, but a
 replication system could use it to store the origin id or whatever extra
 information it needs.

Additional information required for logical replication will be handled
by a new wal_level.

The discussion here is about adding origin_node_id *only*, which needs
to be added on each WAL record.

One question I raised in my review was whether this extra information
should be added by a variable length header, so I already asked this
very question. So far there is no evidence that the additional code
complexity would be warranted. If it became so in the future, it can
be modified again. At this stage there is no need, so the proposal is
to add the field to every WAL record without regard to the setting of
wal_level because there is no measurable downside to doing so. The
downsides of additional complexity are clear and real however, so I
wish to avoid them.
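
For illustration only, the static variant being proposed amounts to something like the following sketch; the field name, width and placement are assumptions for the example, not the actual patch:

    #include <stdint.h>

    /* Sketch of a WAL record header with a fixed-width origin field added.
     * The first two fields stand in for the existing header contents. */
    typedef struct XLogRecordSketch
    {
        uint32_t    xl_tot_len;     /* existing: total record length */
        uint32_t    xl_xid;         /* existing: transaction id */
        /* ... remaining existing header fields ... */
        uint16_t    xl_origin_id;   /* new: node the change originated on */
    } XLogRecordSketch;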

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pgbench--new transaction type

2012-06-20 Thread Heikki Linnakangas

On 01.06.2012 03:02, Jeff Janes wrote:

I've attached a new patch which addresses several of your concerns,
and adds the documentation.  The description is much longer than the
descriptions of other nearby options, which mostly just give a simple
statement of what they do rather than a description of why that is
useful.  I don't know if that means I'm starting a good trend, or a
bad one, or I'm just putting the exposition in the wrong place.

In addition to showing the benefits of coding things on the server
side when that is applicable, it also allows hackers to stress parts
of the server code that are not easy to stress otherwise.


As you mentioned in your original email over a year ago, most of this 
could be done as a custom script. It's nice to have another workload to 
test, but then again, there's an infinite number of workloads that might 
be interesting.


You can achieve the same with this custom script:

\set loops 512

do $$
DECLARE
  sum integer default 0;
  amount integer;
  account_id integer;
BEGIN
  FOR i IN 1..:loops LOOP
    account_id = 1 + floor(random() * :scale);
    SELECT abalance INTO STRICT amount FROM pgbench_accounts WHERE aid = account_id;
    sum := sum + amount;
  END LOOP;
END;
$$;


It's a bit awkward because in a pgbench script file it has to be all on
one line, and you don't get the auto-detection of scale. But those are really the only 
differences AFAICS.


I think we would be better off improving pgbench to support those things 
in custom scripts. It would be nice to be able to write initialization 
steps that only run once at the beginning of the test. You could then 
put the "SELECT COUNT(*) FROM pgbench_branches" there, to do the scale 
auto-detection.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench--new transaction type

2012-06-20 Thread Simon Riggs
On 20 June 2012 15:32, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 01.06.2012 03:02, Jeff Janes wrote:

 I've attached a new patch which addresses several of your concerns,
 and adds the documentation.  The description is much longer than the
 descriptions of other nearby options, which mostly just give a simple
 statement of what they do rather than a description of why that is
 useful.  I don't know if that means I'm starting a good trend, or a
 bad one, or I'm just putting the exposition in the wrong place.

 In addition to showing the benefits of coding things on the server
 side when that is applicable, it also allows hackers to stress parts
 of the server code that are not easy to stress otherwise.


 As you mentioned in your original email over a year ago, most of this could
 be done as a custom script. It's nice to have another workload to test, but
 then again, there's an infinite number of workloads that might be
 interesting.

 You can achieve the same with this custom script:

 \set loops 512

 do $$ DECLARE  sum integer default 0; amount integer; account_id integer;
 BEGIN FOR i IN 1..:loops LOOP   account_id=1 + floor(random() * :scale);
 SELECT abalance into strict amount FROM pgbench_accounts WHERE aid =
 account_id;   sum := sum + amount; END LOOP; END; $$;

 It's a bit awkward because it has to be all on one line, and you don't get
 the auto-detection of scale. But those are really the only differences
 AFAICS.

 I think we would be better off improving pgbench to support those things in
 custom scripts. It would be nice to be able to write initialization steps
 that only run once in the beginning of the test. You could then put the
 SELECT COUNT(*) FROM pgbench_branches there, to do the scale
 auto-detection.

I'm sure Jeff submitted this because of the need for a standard test,
rather than the wish to actually modify pgbench itself.

Can I suggest that we include a list of standard scripts with pgbench
for this purpose? These can then be copied alongside the binary when
we do an install.

That way this patch can become a standard script, plus an entry in the
docs describing it.

We could even include scripts for the usual cases, to allow them to be
more easily viewed/copied/modified.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 10:32, Simon Riggs wrote:

On 20 June 2012 14:40, Heikki Linnakangas

And I'm worried
it might not even be enough in more complicated scenarios.


It is not the only required conflict mechanism, and has never been
claimed to be so. It is simply one piece of information needed, at
various times.


So, if the origin id is not sufficient for some conflict resolution 
mechanisms, what extra information do you need for those, and where do 
you put it?



Perhaps we need a more generic WAL record annotation system, where a plugin
can tack arbitrary information to WAL records. The extra information could
be stored in the WAL record after the rmgr payload, similar to how backup
blocks are stored. WAL replay could just ignore the annotations, but a
replication system could use it to store the origin id or whatever extra
information it needs.


Additional information required for logical replication will be handled
by a new wal_level.

The discussion here is about adding origin_node_id *only*, which needs
to be added on each WAL record.


If that's all we can discuss here, and all other options are off the 
table, then I'll have to just outright object to this patch. Let's 
implement what we can without the origin id, and revisit this later.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Too frequent message of pgbench -i?

2012-06-20 Thread Tatsuo Ishii
Currently pgbench -i prints the following message every 10k tuples created.

fprintf(stderr, "%d tuples done.\n", j);

I think it was long ago that this message frequency seemed appropriate;
computers are getting so fast these days that a message every 10k tuples
seems too frequent to me. Can we change the frequency from 10k to 100k?
Or should we make the frequency adjustable from the command line? Or even
better, should the frequency be changed automatically according to the
scale factor?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 15:45, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 20.06.2012 10:32, Simon Riggs wrote:

 On 20 June 2012 14:40, Heikki Linnakangas

 And I'm worried

 it might not even be enough in more complicated scenarios.


 It is not the only required conflict mechanism, and has never been
 claimed to be so. It is simply one piece of information needed, at
 various times.


 So, if the origin id is not sufficient for some conflict resolution
 mechanisms, what extra information do you need for those, and where do you
 put it?

As explained elsewhere, wal_level = logical (or similar) would be used
to provide any additional logical information required.

Update and Delete WAL records already need to be different in that
mode, so additional info would be placed there, if there were any.

In the case of reflexive updates you raised, a typical response in
other DBMS would be to represent the query
  UPDATE SET counter = counter + 1
by sending just the +1 part, not the current value of counter, as
would be the case with the non-reflexive update
  UPDATE SET counter = 1

Handling such things in Postgres would require some subtlety, which
would not be resolved in the first release but is pretty certain not to
require any changes to the WAL record header as a way of resolving it.
Having already thought about it, I'd estimate that is a very long
discussion and not relevant to the OT, but if you wish to have it
here, I won't stop you.
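
To make that concrete, a change record could carry the operation kind explicitly, along the lines of this sketch (illustrative names only, not a proposed format):

    #include <stdint.h>

    /* Hypothetical logical change for a single counter column. */
    typedef enum { LCR_COUNTER_SET, LCR_COUNTER_DELTA } LcrCounterOp;

    typedef struct LcrCounterChange
    {
        LcrCounterOp op;
        int64_t      value;     /* absolute value for SET, increment for DELTA */
    } LcrCounterChange;

    static int64_t
    apply_counter_change(int64_t current, const LcrCounterChange *c)
    {
        if (c->op == LCR_COUNTER_DELTA)
            return current + c->value;  /* "counter = counter + 1" accumulates */
        return c->value;                /* "counter = 10" is newest-update-wins */
    }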


 Perhaps we need a more generic WAL record annotation system, where a
 plugin
 can tack arbitrary information to WAL records. The extra information
 could
 be stored in the WAL record after the rmgr payload, similar to how backup
 blocks are stored. WAL replay could just ignore the annotations, but a
 replication system could use it to store the origin id or whatever extra
 information it needs.


 Additional information required for logical replication will be handled
 by a new wal_level.

 The discussion here is about adding origin_node_id *only*, which needs
 to be added on each WAL record.


 If that's all we can discuss here is, and all other options are off the
 table, then I'll have to just outright object to this patch. Let's implement
 what we can without the origin id, and revisit this later.

As explained, we can do nothing without the origin id. It is not
optional or avoidable in the way you've described.

We have the choice to add the required information as a static or as a
variable length addition to the WAL record header. Since there is no
additional requirement for expansion of the header at this point and
no measurable harm in doing so, I suggest we avoid the complexity and
lack of robustness that a variable length header would cause. Of
course, we can go there later if needed, but there is no current need.
If a variable-length header had been suggested first, it is certain somebody
would have said that was overcooked and to just do it as a static field.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 11:17, Simon Riggs wrote:

On 20 June 2012 15:45, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 20.06.2012 10:32, Simon Riggs wrote:


On 20 June 2012 14:40, Heikki Linnakangas


And I'm worried
it might not even be enough in more complicated scenarios.


It is not the only required conflict mechanism, and has never been
claimed to be so. It is simply one piece of information needed, at
various times.


So, if the origin id is not sufficient for some conflict resolution
mechanisms, what extra information do you need for those, and where do you
put it?


As explained elsewhere, wal_level = logical (or similar) would be used
to provide any additional logical information required.

Update and Delete WAL records already need to be different in that
mode, so additional info would be placed there, if there were any.

In the case of reflexive updates you raised, a typical response in
other DBMS would be to represent the query
   UPDATE SET counter = counter + 1
by sending just the +1 part, not the current value of counter, as
would be the case with the non-reflexive update
   UPDATE SET counter = 1

Handling such things in Postgres would require some subtlety, which
would not be resolved in first release but is pretty certain not to
require any changes to the WAL record header as a way of resolving it.
Having already thought about it, I'd estimate that is a very long
discussion and not relevant to the OT, but if you wish to have it
here, I won't stop you.


Yeah, I'd like to hear briefly how you would handle that without any 
further changes to the WAL record header.



Additional information required for logical replication will be handled
by a new wal_level.

The discussion here is about adding origin_node_id *only*, which needs
to be added on each WAL record.


If that's all we can discuss here, and all other options are off the
table, then I'll have to just outright object to this patch. Let's implement
what we can without the origin id, and revisit this later.


As explained, we can do nothing without the origin id. It is not
optional or avoidable in the way you've described.


It's only needed for multi-master replication, where the same table can 
be updated from multiple nodes. Just leave that out for now. There's 
plenty of functionality and issues left even without that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Release versioning inconsistency

2012-06-20 Thread Marti Raudsepp
Hi list,

The recent 9.2 beta releases have used a slightly different numbering
scheme than all previous releases.

It used to be that tarballs for version $VER were always available at:
  http://ftp.postgresql.org/pub/source/v$VER/postgresql-$VER.tar.bz2

However, the new releases now use "v9.2.0beta2" for the directory
name, but "9.2beta2" in the tarball file name. No big deal for most people,
but it will confuse people who have scripts to download PostgreSQL
tarballs automatically (e.g. packagers).

For example:
  http://ftp.postgresql.org/pub/source/v9.2.0beta2/postgresql-9.2beta2.tar.bz2

Is there any reason behind this change?

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 16:23, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 It's only needed for multi-master replication, where the same table can be
 updated from multiple nodes. Just leave that out for now. There's plenty of
 functionality and issues left even without that.

Huh? Multi-master replication is what is being built here and many
people want that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 11:34, Simon Riggs wrote:

On 20 June 2012 16:23, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


It's only needed for multi-master replication, where the same table can be
updated from multiple nodes. Just leave that out for now. There's plenty of
functionality and issues left even without that.


Huh? Multi-master replication is what is being built here and many
people want that.


Sure, but presumably you're going to implement master-slave first, and 
build multi-master on top of that. What I'm saying is that we can leave 
out the origin-id for now, since we don't have agreement on it, and 
revisit it after master-slave replication is working.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Too frequent message of pgbench -i?

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 11:04, Tatsuo Ishii wrote:

Currently pgbench -i prints the following message every 10k tuples created.

fprintf(stderr, "%d tuples done.\n", j);

I think it was long ago that this message frequency seemed appropriate;
computers are getting so fast these days that a message every 10k tuples
seems too frequent to me. Can we change the frequency from 10k to 100k?
Or should we make the frequency adjustable from the command line? Or even
better, should the frequency be changed automatically according to the
scale factor?


Or print the message every n seconds, instead of every n lines. I think 
a new command-line option would be overkill. I would be fine with just 
bumping it to every 100k lines.
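
A time-based variant could look roughly like the sketch below (illustrative only, not the actual pgbench code; the function name is made up):

    #include <stdio.h>
    #include <time.h>

    static time_t last_report = 0;

    /* Print progress at most once per second instead of every 10k rows. */
    static void
    maybe_report_progress(int tuples_done)
    {
        time_t now = time(NULL);

        if (now != last_report)
        {
            fprintf(stderr, "%d tuples done.\n", tuples_done);
            last_report = now;
        }
    }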


So +1 on doing something about it, not sure what.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 02:35:59 AM Robert Haas wrote:
 On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne cbbro...@gmail.com 
wrote:
  On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas robertmh...@gmail.com 
wrote:
  Btw, what do you mean with conflating the stream? I don't really see
  that being proposed.
  
  It seems to me that you are intent on using the WAL stream as the
  logical change stream.  I think that's a bad design.  Instead, you
  should extract changes from WAL and then ship them around in a format
  that is specific to logical replication.
  
  Yeah, that seems worth elaborating on.
  
  What has been said several times is that it's pretty necessary to
  capture the logical changes into WAL.  That seems pretty needful, in
  order that the replication data gets fsync()ed avidly, and so that we
  don't add in the race condition of needing to fsync() something *else*
  almost exactly as avidly as is the case for WAL today..
 
 Check.
 
  But it's undesirable to pull *all* the bulk of contents of WAL around
  if it's only part of the data that is going to get applied.  On a
  physical streaming replica, any logical data that gets captured will
  be useless.  And on a logical replica, they physical bits of WAL
  will be useless.
  
  What I *want* you to mean is that there would be:
  a) WAL readers that pull the physical bits, and
  b) WAL readers that just pull logical bits.
  
  I expect it would be fine to have a tool that pulls LCRs out of WAL to
  prepare that to be sent to remote locations.  Is that what you have in
  mind?
 Yes.  I think it should be possible to generate LCRs from WAL, but I
 think that the on-the-wire format for LCRs should be different from
 the WAL format.  Trying to use the same format for both things seems
 like an unpleasant straightjacket.  This discussion illustrates why:
 we're talking about consuming scarce bit-space in WAL records for a
 feature that only a tiny minority of users will use, and it's still
 not really enough bit space.  That stinks.  If LCR transmission is a
 separate protocol, this problem can be engineered away at a higher
 level.
As I said before, I definitely agree that we want to have a separate transport 
format once we have decoding nailed down. We still need to ship wal around if 
the decoding happens in a different instance, but *after* that it can be 
shipped in something more convenient/appropriate.

 Suppose we have three servers, A, B, and C, that are doing
 multi-master replication in a loop.  A sends LCRs to B, B sends them
 to C, and C sends them back to A.  Obviously, we need to make sure
 that each server applies each set of changes just once, but it
 suffices to have enough information in WAL to distinguish between
 replication transactions and non-replication transactions - that is,
 one bit.  So suppose a change is made on server A.  A generates LCRs
 from WAL, and tags each LCR with node_id = A.  It then sends those
 LCRs to B.  B applies them, flagging the apply transaction in WAL as a
 replication transaction, AND ALSO sends the LCRs to C.  The LCR
 generator on B sees the WAL from apply, but because it's flagged as a
 replication transaction, it does not generate LCRs.  So C receives
 LCRs from B just once, without any need for the node_id to to be known
 in WAL.  C can now also apply those LCRs (again flagging the apply
 transaction as replication) and it can also skip sending them to A,
 because it seems that they originated at A.
One bit is fine if you have only very simple replication topologies. Once you 
think about globally distributed databases it's a bit different. You describe 
some of that below, but just to reiterate: 
Imagine having 6 nodes, 3 on each of two continents (ABC in north america, DEF 
in europe). You may only want to have full intercontinental interconnect 
between two of those (say A and D). If you only have one bit to represent the 
origin, that's not going to work, because on A you won't be able to discern the 
changes from B and C from those originating on D, E and F.

Another interesting topology is circular replication (i.e. changes 
get shipped A->B, B->C, C->A), which is sensible if you only have a 
low change rate and a relatively high number of nodes, because you don't need 
the full combinatorial number of connections.

The origin_ids still only need to be meaningful in the local context though. As 
described before, in the communication between the different nodes you can 
simply replace the 16bit node id with some fancy UUID or such, and do the reverse 
when replaying LCRs.
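
For example, the translation between the compact local id stored in WAL and a wire-format identifier could be a simple lookup, sketched here with hypothetical names:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: map the compact local origin id to a globally unique
     * identifier when shipping LCRs, and back again when applying. */
    typedef struct OriginMapping
    {
        uint16_t    local_id;           /* what is stored in WAL */
        char        global_uuid[37];    /* what goes over the wire */
    } OriginMapping;

    static const char *
    local_to_global(const OriginMapping *map, int nmappings, uint16_t local_id)
    {
        for (int i = 0; i < nmappings; i++)
            if (map[i].local_id == local_id)
                return map[i].global_uuid;
        return NULL;                    /* unknown origin */
    }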

 Now suppose we have a more complex topology.  Suppose we have a
 cluster of four servers A .. D which, for improved tolerance against
 network outages, are all connected pairwise.  Normally all the links
 are up, so each server sends all the LCRs it generates directly to all
 other servers.  But how do we prevent cycles?  A generates a change
 and sends it to B, C, and D.  B then sees that the change 

Re: [HACKERS] Release versioning inconsistency

2012-06-20 Thread Magnus Hagander
On Wed, Jun 20, 2012 at 10:28 AM, Marti Raudsepp ma...@juffo.org wrote:
 Hi list,

 The recent 9.2 beta releases have used a slightly different numbering
 scheme than all previous releases.

 It used to be that tarballs for version $VER were always available at:
  http://ftp.postgresql.org/pub/source/v$VER/postgresql-$VER.tar.bz2

 However, the new releases now use v9.2.0beta2 for the directory
 name, but 9.2beta2 in the tarball file. No big deal for most people,
 but it will confuse people who have scripts to download PostgreSQL
 tarballs automatically (e.g. packagers).

 For example:
  http://ftp.postgresql.org/pub/source/v9.2.0beta2/postgresql-9.2beta2.tar.bz2

 Is there any reason behind this change?

Not behind the first one. I believe that's just me and a bad memory -
I got it wrong. (I do believe that using the v9.2.0beta marker is
*better*, because then it sorts properly. But likely not enough
better to justify being inconsistent with previous versions.)

For beta2, the only reason was to keep it consistent with beta1.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Release versioning inconsistency

2012-06-20 Thread Marti Raudsepp
On Wed, Jun 20, 2012 at 12:18 PM, Magnus Hagander mag...@hagander.net wrote:
 (I do believe that using the v9.2.0beta marker is
 *better*, because then it sorts properly. But likely not enough much
 better to be inconsistent with previous versions)

Good point. Maybe that's a reason to change the versioning scheme and
stick with 9.2.0betaX everywhere. Including calling the final
release 9.2.0 instead of simply 9.2?

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 16:44, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 20.06.2012 11:34, Simon Riggs wrote:

 On 20 June 2012 16:23, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 It's only needed for multi-master replication, where the same table can
 be
 updated from multiple nodes. Just leave that out for now. There's plenty
 of
 functionality and issues left even without that.


 Huh? Multi-master replication is what is being built here and many
 people want that.


 Sure, but presumably you're going to implement master-slave first, and build
 multi-master on top of that. What I'm saying is that we can leave out the
 origin-id for now, since we don't have agreement on it, and revisit it after
 master-slave replication is working.

I am comfortable with the idea of deferring applying the patch, but I
don't see any need to defer agreeing that the patch is OK, so it can be
applied easily later. It does raise the question of exactly when we
would defer it to, though. When would that be?

If you have a reason for disagreement, please raise it now, having
seen the explanations/comments on the various concerns. Of course, people have
made initial objections, which is fine, but it's not reasonable to
assume that such complaints still stand afterwards. Perhaps there are
other thoughts?

The idea that logical rep is some kind of useful end goal in itself is
slightly misleading. If the thought is to block multi-master
completely on that basis, that would be a shame. Logical rep is the
mechanism for implementing multi-master.

Deferring this could easily end up with a huge patch in the last CF, and
then it will be rejected/deferred. Patch submission here is following
the requested process - as early as possible, production ready, small
meaningful patches that build towards a final goal. This is especially
true for format changes, which is why this patch is here now. Doing it
differently just makes patch wrangling and review more difficult,
which reduces overall quality and slows down development.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Eisentraut
On sön, 2012-06-17 at 23:58 +0100, Peter Geoghegan wrote:
 So if you take the word Aßlar here - that is equivalent to Asslar,
 and so strcoll("Aßlar", "Asslar") will return 0 if you have the right
 LC_COLLATE

This is not actually correct.  glibc will sort Asslar before Aßlar, and
that is correct in my mind.

When a Wikipedia page on some particular language's alphabet says
something like $letterA and $letterB are equivalent, what it really
means is that they are sorted the same compared to other letters, but
are distinct when ties are broken.

  (if you tried this out for yourself and found that I was
 actually lying through my teeth, pretend I said Hungarian instead of
 German and some really obscure character rather than ß).

Yeah, there are obviously exceptions, which led to the original change
being made, but they are not as wide-spread as they appear to be.

The real issue in this area, I suspect, will be dealing with Unicode
combining sequences versus equivalent precombined characters.  But
support for that is generally crappy, so it's not urgent to deal with
it.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-20 Thread Andres Freund
Hi Robert, Hi All!

On Wednesday, June 20, 2012 03:08:48 AM Robert Haas wrote:
 On Tue, Jun 19, 2012 at 2:23 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  Well, the words are fuzzy, but I would define logical replication to
  be something which is independent of the binary format in which stuff
  gets stored on disk.  If it's not independent of the disk format, then
  you can't do heterogenous replication (between versions, or between
  products).  That precise limitation is the main thing that drives
  people to use anything other than SR in the first place, IME.
  
  Not in mine. The main limitation I see is that you cannot write anything
  on the standby. Which sucks majorly for many things. Its pretty much
  impossible to fix that for SR outside of very limited cases.
  While many scenarios don't need multimaster *many* need to write outside
  of the standby's replication set.
 Well, that's certainly a common problem, even if it's not IME the most
 common, but I don't think we need to argue about which one is more
 common, because I'm not arguing against it.  The point, though, is
 that if the logical format is independent of the on-disk format, the
 things we can do are a strict superset of the things we can do if it
 isn't.  I don't want to insist that catalogs be the same (or else you
 get garbage when you decode tuples).  I want to tolerate the fact that
 they may very well be different.  That will in no way preclude writing
 outside the standby's replication set, nor will it prevent
 multi-master replication.  It will, however, enable heterogenous
 replication, which is a very important use case.  It will also mean
 that innocent mistakes (like somehow ending up with a column that is
 text on one server and numeric on another server) produce
 comprehensible error messages, rather than garbage.
I agree with most of that. I think that some parts of the above need to be 
optional because you lose too much in other scenarios.
I *definitely* want to build the *infrastructure* which makes it easy to 
implement all of the above, but I find it a bit much to require that from the 
get-go. It's important that everything is reusable for that, yes. Does a 
patchset that wants to implement tightly coupled multimaster need to implement 
everything for that? No.
If we raise the barrier for anything around this topic so high, we will *NEVER* 
get anywhere. It's a huge topic with loads of people wanting loads of different 
things. And that will hurt people wanting some feature which matches 90% of 
the proposed goals *far* more.

  Its not only the logging side which is a limitation in todays replication
  scenarios. The apply side scales even worse because its *very* hard to
  distribute it between multiple backends.

 I don't think that making LCR format = on-disk format is going to
 solve that problem.  To solve that problem, we need to track
 dependencies between transactions, so that if tuple A is modified by
 T1 and T2, in that order, we apply T1 before T2.  But if T3 - which
 committed after both T1 and T2 - touches none of the same data as T1
 or T2 - then we can apply it in parallel, so long as we don't commit
 until T1 and T2 have committed (because allowing T3 to commit early
 would produce a serialization anomaly from the point of view of a
 concurrent reader).
Well, doing apply at such a low level, without re-encoding the data, increased 
throughput nearly threefold even for trivial types. So it pushes out the point 
where we need to do the above quite a bit.
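
The dependency tracking described above could be sketched roughly as follows (hypothetical, heavily simplified structures, just to illustrate the ordering constraint; the real thing would also have to handle commit ordering):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct TouchedTuple
    {
        uint32_t    relation_oid;
        uint64_t    tuple_id;
    } TouchedTuple;

    typedef struct ReplayTxn
    {
        uint32_t      xid;
        int           ntouched;
        TouchedTuple *touched;
    } ReplayTxn;

    /* A transaction can be applied in parallel with an earlier one as long as
     * they touched no common tuple; its commit must still wait until the
     * earlier transaction has committed. */
    static bool
    can_apply_in_parallel(const ReplayTxn *txn, const ReplayTxn *earlier)
    {
        for (int i = 0; i < txn->ntouched; i++)
            for (int j = 0; j < earlier->ntouched; j++)
                if (txn->touched[i].relation_oid == earlier->touched[j].relation_oid &&
                    txn->touched[i].tuple_id == earlier->touched[j].tuple_id)
                    return false;
        return true;
    }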

  Because the routines that decode tuples don't include enough sanity
  checks to prevent running off the end of the block, or even the end of
  memory completely.  Consider a corrupt TOAST pointer that indicates
  that there is a gigabyte of data stored in an 8kB block.  One of the
  common symptoms of corruption IME is TOAST requests for -3 bytes of
  memory.
  Yes, but we need to put safeguards against that sort of thing anyway. So
  sure, we can have bugs but this is not a fundamental limitation.
 There's a reason we haven't done that already, though: it's probably
 going to stink for performance.  If it turns out that it doesn't stink
 for performance, great.  But if it causes a 5% slowdown on common use
 cases, I suspect we're not gonna do it, and I bet I can construct a
 case where it's worse than that (think: 400 column table with lots of
 varlenas, sorting by column 400 to return column 399).  I think it's
 treading on dangerous ground to assume we're going to be able to just
 go fix this.
I am talking about ensuring that the catalog is the same on the decoding site, 
not about making all decoding totally safe in the face of corrupted 
information.

  Postgis uses one information table in a few more complex functions but
  not in anything low-level. Evidenced by the fact that it was totally
  normal for that to go out of sync before  2.0.
  
  But even if such a thing would be needed, it wouldn't be problematic to
  make extension 

Re: [HACKERS] performance regression in 9.2 when loading lots of small tables

2012-06-20 Thread Jeff Janes
On Tue, Jun 19, 2012 at 8:06 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jun 19, 2012 at 10:56 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 But in the 9.2 branch, the slow phenotype was re-introduced in
 1575fbcb795fc331f4, although perhaps the details of who is locking
 what differs.  I haven't yet sorted that out.

 It very much does.  That commit prevents people from creating a
 relation in - or renaming a relation into - a schema that is being
 concurrently dropped, which in previous releases would have resulted
 in inconsistent catalog contents.  I admit that it harms your test
 case, but how likely is it that someone is going to put every single
 table into its own schema?

Yep, I see that even having 10 tables per schema (which I think was
the case for the person who started this line of inquiry) rather than
just 1 greatly reduces this regression.  It basically goes from about
2x slower to about 1.1x slower.  So I think that pretty much settles
the issue for 9.2.


 And have shared_buffers low enough for
 this to be the dominant cost?

That one is not so unlikely.  After all, lowering shared_buffers is a
defensive move to get around the FlushRelationBuffers problem, so
hopefully people faced with the task of restoring such a dump would
take advantage of it, if they have a maintenance window in which to do
so.  Also, the FlushRelationBuffers issue is linear while the locks
are quadratic, so eventually the locking would regain dominance even
if shared_buffers is large.

 I think in real-world scenarios this
 isn't going to be a problem - although, of course, making the lock
 manager faster would be nifty if we can do it, and this might be a
 good test case.

The resource owner reassign lock patch does effectively solve the lock
manager problem of restoring dumps of such databases, as well as the
similar problem in creating such dumps.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables

2012-06-20 Thread Etsuro Fujita
Hi KaiGai-san,

Thank you for the review.

 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org
 [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kohei KaiGai
 Sent: Wednesday, June 20, 2012 1:26 AM
 To: Etsuro Fujita
 Cc: Robert Haas; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file
 foreign tables

 Hi Fujita-san,

 Could you rebase this patch towards the latest tree?
 It was unavailable to apply the patch cleanly.

Sorry, I updated the patch.  Please find attached an updated version of the
patch.

 I looked over the patch, then noticed a few points.

 At ProcessCopyOptions(), defel->arg can be NIL, can't it?
 If so, cstate->convert_binary is not a suitable flag to check for a redundant
 option. It seems to me cstate->convert_selectively is a more suitable flag
 to check it.

 +   else if (strcmp(defel->defname, "convert_binary") == 0)
 +   {
 +       if (cstate->convert_binary)
 +           ereport(ERROR,
 +                   (errcode(ERRCODE_SYNTAX_ERROR),
 +                    errmsg("conflicting or redundant options")));
 +       cstate->convert_selectively = true;
 +       if (defel->arg == NULL || IsA(defel->arg, List))
 +           cstate->convert_binary = (List *) defel->arg;
 +       else
 +           ereport(ERROR,
 +                   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 +                    errmsg("argument to option \"%s\" must be a list of column names",
 +                           defel->defname)));
 +   }

Yes, defel->arg can be NIL.  defel->arg is a List structure listing all the
columns that need to be converted to binary representation, which is NIL for
the case where no columns need to be converted.  For example,
defel->arg is NIL for SELECT COUNT(*).  In this case, while
cstate->convert_selectively is set to true, no columns are converted at
NextCopyFrom().  Most efficient case!  In short, cstate->convert_selectively
represents whether to do selective binary conversion at NextCopyFrom(), and
cstate->convert_binary represents all the columns to be converted at
NextCopyFrom(), which can be NIL.

 At NextCopyFrom(), this routine computes default values if configured.
 In cases where these values are not referenced, it might be possible to skip
 unnecessary calculations.
 Would it be possible to add logic to avoid constructing cstate->defmap for
 unreferenced columns at BeginCopyFrom()?

I think that we don't need to add the above logic because file_fdw does
BeginCopyFrom() with attnamelist = NIL, in which case BeginCopyFrom()
doesn't construct cstate->defmap at all.

I fixed a bug plus some minor optimization in check_binary_conversion() that
is renamed to check_selective_binary_conversion() in the updated version,
and now file_fdw gives up selective binary conversion for the following
cases:

  a) BINARY format
  b) CSV/TEXT format and whole row reference
  c) CSV/TEXT format and all the user attributes needed


Best regards,
Etsuro Fujita

 Thanks,

 2012/5/11 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
  -Original Message-
  From: Robert Haas [mailto:robertmh...@gmail.com]
  Sent: Friday, May 11, 2012 1:36 AM
  To: Etsuro Fujita
  Cc: pgsql-hackers@postgresql.org
  Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV
  file foreign tables
 
  On Tue, May 8, 2012 at 7:26 AM, Etsuro Fujita
  fujita.ets...@lab.ntt.co.jp
  wrote:
   I would like to propose to improve parsing efficiency of
   contrib/file_fdw by selective parsing proposed by Alagiannis et
   al.[1], which means that for a CSV/TEXT file foreign table,
   file_fdw performs binary conversion only for the columns needed for
   query processing.  Attached is a WIP patch implementing the feature.
 
  Can you add this to the next CommitFest?  Looks interesting.
 
  Done.
 
  Best regards,
  Etsuro Fujita
 
  --
  Robert Haas
  EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL
  Company
 
 
 
 
  --
  Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To
  make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-hackers



 --
 KaiGai Kohei kai...@kaigai.gr.jp

 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make
 changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




[Attachment: file_fdw_sel_bin_conv_v2.patch]

Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 20 June 2012 11:00, Peter Eisentraut pete...@gmx.net wrote:
 On sön, 2012-06-17 at 23:58 +0100, Peter Geoghegan wrote:
 So if you take the word Aßlar here - that is equivalent to Asslar,
 and so strcoll("Aßlar", "Asslar") will return 0 if you have the right
 LC_COLLATE

 This is not actually correct.  glibc will sort Asslar before Aßlar, and
 that is correct in my mind.

Uh, what happened here was that I assumed that it was correct, then
went to verify it before sending the mail and found that it wasn't.
Since I couldn't immediately find any hard data about which
characters this did apply to, I decided to turn it into a joke. I say
this, and yet you've included that bit of the e-mail inline in your
reply, so maybe it just wasn't a very good joke.

 When a Wikipedia page on some particular language's alphabet says
 something like $letterA and $letterB are equivalent, what it really
 means is that they are sorted the same compared to other letters, but
 are distinct when ties are broken.

I know.

  (if you tried this out for yourself and found that I was
 actually lying through my teeth, pretend I said Hungarian instead of
 German and some really obscure character rather than ß).

 Yeah, there are obviously exceptions, which led to the original change
 being made, but they are not as wide-spread as they appear to be.

True.

 The real issue in this area, I suspect, will be dealing with Unicode
 combining sequences versus equivalent precombined characters.  But
 support for that is generally crappy, so it's not urgent to deal with
 it.

I agree that it isn't urgent. However, I have an ulterior motive,
which is that in allowing for this, we remove the need to strcmp()
after each strcoll(), and consequently it becomes possible to use
strxfrm() instead. Now, we could also use a hack that's going to make
the strxfrm() blobs even bulkier still (basically, concatenate the
original text to the blob before strcmp()), but I don't want to go
there if it can possibly be avoided.
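
For illustration, the strxfrm()-based approach alluded to here is roughly the following sketch: transform each string once, then compare the transformed keys with plain byte comparison instead of calling strcoll() for every comparison during the sort.

    #include <stdlib.h>
    #include <string.h>

    /* Build a collation key for 's' in the current LC_COLLATE locale.
     * strcmp() on two such keys orders them the same way strcoll() would
     * order the original strings. */
    static char *
    make_sort_key(const char *s)
    {
        size_t  len = strxfrm(NULL, s, 0);
        char   *key = malloc(len + 1);

        if (key != NULL)
            strxfrm(key, s, len + 1);
        return key;
    }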

I should also point out that we pride ourselves on following the
letter of the standard when that makes sense, and we are currently not
doing that in respect of the Unicode standard.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-20 Thread Florian Pflug
On Jun19, 2012, at 17:36 , Robert Haas wrote:
 On Mon, Jun 18, 2012 at 1:42 PM, Martijn van Oosterhout
 klep...@svana.org wrote:
 On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote:
 The fly in the ointment with any of these ideas is that the configure
 list is not a list of exact cipher names, as per Magnus' comment that
 the current default includes tests like !aNULL.  I am not sure that
 we know how to evaluate such conditions if we are applying an
 after-the-fact check on the selected cipher.  Does OpenSSL expose any
 API for evaluating whether a selected cipher meets such a test?
 
 I'm not sure whether there's an API for it, but you can certainly check
 manually with openssl ciphers -v, for example:
 
 $ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP'
 NULL-SHASSLv3 Kx=RSA  Au=RSA  Enc=None  Mac=SHA1
 NULL-MD5SSLv3 Kx=RSA  Au=RSA  Enc=None  Mac=MD5
 
 ...etc...
 
 So unless the openssl includes the code twice there must be a way to
 extract the list from the library.
 
 There doubtless is, but I'd being willing to wager that you won't be
 able to figure out the exact method without reading the source code
 for 'opennssl ciphers' to see how it was done there, and most likely
 you'll find that at least one of the functions they use has no man
 page.  Documentation isn't their strong point.

Yes, unfortunately.

I wonder though if we shouldn't restrict the allowed ciphers list to being
a simple list of supported ciphers. If our goal is to support multiple
SSL libraries transparently then surely having openssl-specific syntax
in the config file isn't exactly great anyway...
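
If the setting were a plain list, an after-the-fact check of the negotiated cipher could be as simple as the sketch below. SSL_get_current_cipher() and SSL_CIPHER_get_name() are existing OpenSSL calls; the rest is illustrative and omits error handling.

    #include <stdlib.h>
    #include <string.h>
    #include <openssl/ssl.h>

    /* Return 1 if the cipher negotiated on 'ssl' appears in the
     * colon-separated allow-list, 0 otherwise. */
    static int
    cipher_is_allowed(SSL *ssl, const char *allowed_list)
    {
        const char *name = SSL_CIPHER_get_name(SSL_get_current_cipher(ssl));
        char       *list = strdup(allowed_list);
        char       *tok;
        int         ok = 0;

        for (tok = strtok(list, ":"); tok != NULL; tok = strtok(NULL, ":"))
        {
            if (strcmp(tok, name) == 0)
            {
                ok = 1;
                break;
            }
        }
        free(list);
        return ok;
    }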

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 05:01:16 AM Robert Haas wrote:
 On Tue, Jun 19, 2012 at 4:22 PM, Andres Freund and...@2ndquadrant.com 
wrote:
   1. dllist.h has double the element overhead by having an inline value
   pointer (which is not needed when embedding) and a pointer to the list
   (which I have a hard time seing as being useful)
   2. only double linked list, mine provided single and double linked
   ones 3. missing macros to use when embedded in a larger struct
   (containerof() wrappers and for(...) support basically)
   4. most things are external function calls...
   5. way much more branches/complexity in most of the functions. My
   implementation doesn't use any branches for the typical easy
   modifications (push, pop, remove element somewhere) and only one for
   the typical tests (empty, has-next, ...)
   
   The performance and memory aspects were crucial for the aforementioned
   toy project (slab allocator for postgres). Its not that crucial for
   the applycache where the lists currently are mostly used although its
   also relatively performance sensitive and obviously does a lot of
   list manipulation/iteration.
   
   If I had to decide I would add the missing api in dllist.h to my
   implementation and then remove it. Its barely used - and only in an
   embedded fashion - as far as I can see.
   I can understand though if that argument is met with doubt by others
   ;). If thats the way it has to go I would add some more convenience
   support for embedding data to dllist.h and settle for that.
  
  I think it might be simpler to leave the name as Dllist and just
  overhaul the implementation along the lines you suggest, rather than
  replacing it with something completely different.  Mostly, I don't
  want to add a third thing if we can avoid it, given that Dllist as it
  exists today is used only lightly.
  
  Well, if its the name, I have no problem with changing it, but I don't
  see how you can keep the api as it currently is and address my points.
  
  If there is some buyin I can try to go either way (keeping the existing
  name, changing the api, adjusting the callers or just adjust the
  callers, throw away the old implementation) I just don't want to get
  into that just to see somebody isn't agreeing with the fundamental idea.
 My guess is that it wouldn't be too hard to remove some of the extra
 pointers.  Anyone who is using Dllist as a non-inline list could be
 converted to List * instead. 
There are only three users of the whole dllist.h: catcache, autovacuum and
postmaster. The latter two just keep a list of databases around. So any change
will only be moderately intrusive.

 Also, the performance-critical things
 could be reimplemented as macros. 

 I question, though, whether we really need both single and doubly linked
 lists.  That seems like it's almost certainly micro-optimization that we are
 better off not doing.
It was certainly worthwhile for the memory manager (lower per-allocation
overhead). You might be right that it's not worth it for many other possible
use cases in pg. It's not much code though.

*looks around*

A quick grep found:

Singly linked list like code:  guc_private.h, aset.c, elog.h, regguts.h (ok,
irrelevant), dynahash.c, resowner.c, extension.c, pgstat.c, xlog.c
Doubly linked list like code: shmqueue.c, lwlock.c, dynahash.c, xact.c

I stopped at that point because, while surely not all of the above use cases
could be replaced by a common implementation, several could, from a quick look.
Also, several pg_list.h users could benefit from a conversion. So I think
adding a singly linked list implementation is worthwhile.
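
To make the embedding idea concrete, here is a minimal, self-contained
sketch of an embedded (intrusive) doubly linked list with a containerof()
style accessor; the names are illustrative only and deliberately not the
submitted API:

#include <stddef.h>
#include <stdio.h>

typedef struct dlist_node
{
    struct dlist_node *prev;
    struct dlist_node *next;
} dlist_node;

typedef struct dlist_head
{
    dlist_node  head;           /* circular list with a dummy head node */
} dlist_head;

#define dlist_container(type, membername, ptr) \
    ((type *) ((char *) (ptr) - offsetof(type, membername)))

static void
dlist_init(dlist_head *h)
{
    h->head.prev = h->head.next = &h->head;
}

static void
dlist_push_tail(dlist_head *h, dlist_node *n)
{
    n->prev = h->head.prev;
    n->next = &h->head;
    n->prev->next = n;
    h->head.prev = n;
}

typedef struct Change           /* e.g. one change tracked by an apply cache */
{
    int         xid;
    dlist_node  node;           /* embedded: no allocation per list cell */
} Change;

int
main(void)
{
    dlist_head  changes;
    dlist_node *cur;
    Change      a = {1}, b = {2};

    dlist_init(&changes);
    dlist_push_tail(&changes, &a.node);
    dlist_push_tail(&changes, &b.node);

    for (cur = changes.head.next; cur != &changes.head; cur = cur->next)
        printf("xid %d\n", dlist_container(Change, node, cur)->xid);
    return 0;
}

The point is that the node lives inside the containing struct, so pushing and
popping need no per-cell allocation and no branches, unlike pg_list.h's List.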

  The most contentious point is probably relying on USE_INLINE being
  available anywhere. Which I believe to be the point now that we have
  gotten rid of some platforms.
 I would be hesitant to chuck that even though I realize it's unlikely
 that we really need !USE_INLINE.  But see sortsupport for an example
 of how we've handled this in the recent past.
I agree it's possible to resolve this. But hell, do we really need to add all
that ugliness in 2012? I don't think it's worth the effort of supporting ancient
compilers that don't support inline anymore. If we could stop catering for
probably non-existing compilers we could remove some *very* ugly long macros
(e.g. in htup.h).

If support for !USE_INLINE is required I would prefer to have a header define
the functions like

#ifdef USE_INLINE
#define OPTIONALLY_INLINE static inline
#define USE_LINKED_LIST_IMPL
#endif

#ifdef USE_LINKED_LIST_IMPL

OPTIONALLY_INLINE void myFuncCall(){
...
}
#endif

which then gets included, with USE_LINKED_LIST_IMPL defined, by some .c file
that defines OPTIONALLY_INLINE to something empty if !USE_INLINE.
It's too much code to duplicate imo.
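
To spell that out, the matching .c file could look roughly like this (a
sketch only; the file name mylist.h and the macros are the hypothetical ones
from above):

/* mylist.c - only does anything when the compiler can't be trusted with inline */
#ifndef USE_INLINE
#define OPTIONALLY_INLINE           /* emit ordinary, non-inline functions */
#define USE_LINKED_LIST_IMPL        /* ask the header to emit the bodies here */
#include "mylist.h"
#endif

(The header would additionally need plain extern declarations of those
functions in the !USE_INLINE case, so that other translation units can call
them.)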

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list 

Re: [HACKERS] Ability to listen on two unix sockets

2012-06-20 Thread Honza Horak

On 06/15/2012 05:40 PM, Honza Horak wrote:

I realized the patch has some difficulties -- namely the socket path in the
data dir lock file, which currently uses one port for the socket and the same
one for the interface. So to allow users to use an arbitrary port for all unix
sockets, we'd need to add another line just for the unix socket, which doesn't
apply to other platforms. Or we could just say that the first socket will
always use the default port (PostPortNumber), which is the solution I currently
prefer, but I will be glad to hear any other opinion. This is also why there is
still unnecessary string splitting in pg_ctl.c, which will be removed after the
issue above is solved.



This is an enhanced patch, which forbids using a port number in the 
first socket directory entry.


Honza
diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index cfdb33a..679c40a 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -838,7 +838,7 @@ omicron bryanh  guest1
 varnameunix_socket_permissions/varname (and possibly
 varnameunix_socket_group/varname) configuration parameters as
 described in xref linkend=runtime-config-connection.  Or you
-could set the varnameunix_socket_directory/varname
+could set the varnameunix_socket_directories/varname
 configuration parameter to place the socket file in a suitably
 restricted directory.
/para
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 074afee..d1727f8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -453,17 +453,26 @@ SET ENABLE_SEQSCAN TO OFF;
   /listitem
  /varlistentry
 
- varlistentry id=guc-unix-socket-directory xreflabel=unix_socket_directory
-  termvarnameunix_socket_directory/varname (typestring/type)/term
+ varlistentry id=guc-unix-socket-directories xreflabel=unix_socket_directories
+  termvarnameunix_socket_directories/varname (typestring/type)/term
   indexterm
-   primaryvarnameunix_socket_directory/ configuration parameter/primary
+   primaryvarnameunix_socket_directories/ configuration parameter/primary
   /indexterm
   listitem
para
-Specifies the directory of the Unix-domain socket on which the
+Specifies the directories of the Unix-domain sockets on which the
 server is to listen for
 connections from client applications.  The default is normally
 filename/tmp/filename, but can be changed at build time.
+Directories are separated by ',' and additional replaceableport/ 
+number can be set, separated from directory by ':'. Port number will 
+only be used as a part of the socket file name. For example,
+literal'/var/run, /tmp:5431'/literal would create socket files
+literal/var/run/.s.PGSQL.5432/literal and
+literal/tmp/.s.PGSQL.5431/literal. 
+It is not possible to define replaceableport/ number in the first
+entry. If set, it will be ignored and default varnameport/
+will be used. 
 This parameter can only be set at server start.
/para
 
@@ -472,7 +481,7 @@ SET ENABLE_SEQSCAN TO OFF;
 literal.s.PGSQL.replaceable//literal where
 replaceable/ is the server's port number, an ordinary file
 named literal.s.PGSQL.replaceable/.lock/literal will be
-created in the varnameunix_socket_directory/ directory.  Neither
+created in the varnameunix_socket_directories/ directories.  Neither
 file should ever be removed manually.
/para
 
@@ -6593,7 +6602,7 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
/row
row
 entryoption-k replaceablex/replaceable/option/entry
-entryliteralunix_socket_directory = replaceablex/replaceable//entry
+entryliteralunix_socket_directories = replaceablex/replaceable//entry
/row
row
 entryoption-l/option/entry
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 7ba18f0..6c74844 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1784,7 +1784,7 @@ pg_dumpall -p 5432 | psql -d postgres -p 5433
   para
The simplest way to prevent spoofing for literallocal/
connections is to use a Unix domain socket directory (xref
-   linkend=guc-unix-socket-directory) that has write permission only
+   linkend=guc-unix-socket-directories) that has write permission only
for a trusted local user.  This prevents a malicious user from creating
their own socket file in that directory.  If you are concerned that
some applications might still reference filename/tmp/ for the
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index e3ae92d..50d4167 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -31,6 +31,7 @@
 #include postmaster/bgwriter.h
 #include 

Re: [HACKERS] [ADMIN] pg_basebackup blocking all queries with horrible performance

2012-06-20 Thread Magnus Hagander
On Mon, Jun 11, 2012 at 6:06 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Jun 12, 2012 at 12:47 AM, Magnus Hagander mag...@hagander.net wrote:
 On Mon, Jun 11, 2012 at 5:37 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Jun 11, 2012 at 3:24 AM, Magnus Hagander mag...@hagander.net 
 wrote:
 On Sun, Jun 10, 2012 at 6:08 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Jun 10, 2012 at 11:45 PM, Magnus Hagander mag...@hagander.net 
 wrote:
 On Sun, Jun 10, 2012 at 4:29 PM, Fujii Masao masao.fu...@gmail.com 
 wrote:
 On Sun, Jun 10, 2012 at 11:10 PM, Magnus Hagander mag...@hagander.net 
 wrote:
 On Sun, Jun 10, 2012 at 4:08 PM, Fujii Masao masao.fu...@gmail.com 
 wrote:
 On Sun, Jun 10, 2012 at 10:34 PM, Fujii Masao masao.fu...@gmail.com 
 wrote:
 On Sun, Jun 10, 2012 at 9:25 PM, Fujii Masao masao.fu...@gmail.com 
 wrote:
 On Sun, Jun 10, 2012 at 7:43 PM, Magnus Hagander 
 mag...@hagander.net wrote:
 On Sat, Jun 9, 2012 at 2:51 PM, Tom Lane t...@sss.pgh.pa.us 
 wrote:
 Fujii Masao masao.fu...@gmail.com writes:
 This seems a bug. I think we should prevent pg_basebackup from
 becoming synchronous standby. Thought?

 Absolutely.  If we have replication clients that are not actually
 capable of being standbys, there *must* be a way for the master
 to know that.

 I thought we fixed this already by sending InvalidXlogRecPtr as 
 flush
 location? And that this only applied in 9.2?

 Are you saying we picked pg_basebackup *in backup mode* (not log
 streaming) as synchronous standby?

 Yes.

 If so then yes, that is
 *definitely* a bug that should be fixed. We should never select a
 connection that's not even streaming log as standby!

 Agreed. Attached patch prevents pg_basebackup from becoming sync
 standby. Also this patch fixes another problem: currently only 
 walsender
 which reaches STREAMING state can become sync walsender. OTOH,
 sync walsender thinks that walsender with higher priority will be 
 sync one
 whether its state is STREAMING, and switches to potential sync 
 walsender.
 So when the standby with higher priority connects to the master, we
 might have no sync standby until it reaches the STREAMING state.
 To fix this problem, the patch switches walsender's state from sync 
 to
 potential *after* walsender with higher priority has reached the
 STREAMING state.

 We also should not select (1) background stream process forked from
 pg_basebackup and (2) pg_receivexlog as sync standby because they
 don't send back replication progress. To address this, I'm thinking 
 to
 introduce new option NOSYNC in START_REPLICATION command
 as follows, and to change (1) and (2) so that they specify NOSYNC.

    START_REPLICATION XXX/XXX [NOSYNC]

 If the standby specifies NOSYNC option, it's never assigned as sync
 standby even if its name is in synchronous_standby_names. Thought?

 The standby which always sends InvalidXLogRecPtr back should not
 become sync one. So instead of NOSYNC option, by checking whether
 InvalidXLogRecPtr is sent, we can avoid problematic sync standby.

 We should not do this because Magnus is proposing the patch
 (http://archives.postgresql.org/pgsql-hackers/2012-06/msg00348.php)
 which breaks the above assumption at all. So we should introduce
 something like NOSYNC option.

 Wouldn't the better choice there in that case be to give a switch to
 pg_receivexlog if you *want* it to be able to become a sync replica,
 and by default disallow it? And then keep the backend just treating
 InvalidXlogRecPtr as don't-become-sync-replica.

 I don't object to making pg_receivexlog as sync standby at all. So at 
 least
 for me, that switch is not necessary. What I'm worried about is the
 background stream process forked from pg_basebackup. I think that
 it should not run as sync standby but sending back its replication 
 progress
 seems helpful because a user can see the progress from 
 pg_stat_replication.
 So I'm thinking that something like NOSYNC option is required.

 On principle, no. By default, yes.

 How about:
 pg_basebackup background: *never* sends flush location, and therefore
 won't become a sync replica
 pg_receivexlog: *optionally* sends flush location. By default it won't
 become a sync replica, but can be made so with a switch

 Wouldn't a user who sees NULL in flush_location from pg_stat_replication
 misunderstand that pg_receivexlog (in default mode) and pg_basebackup
 background don't flush WAL files at all?

 That sounds like a documentable issue.

 But maybe you're right, and we need the never become sync as a flag.

 You agreed to add something like NOSYNC option into START_REPLICATION 
 command?

 I'm on the fence. I was hoping somebody else would chime in with an
 opinion as well.

 +1

Nobody else with any opinion on this? :(

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Re: [HACKERS] WAL format changes

2012-06-20 Thread Magnus Hagander
On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
 a uint64, on top of my other WAL format patches. I think we should go ahead
 with this.

 +1.

 The LSNs on pages are still stored in the old format, to avoid changing the
 on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
 file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
 is required.

 Seems fine.

 Should we keep the old representation in the replication protocol messages?
 That would make it simpler to write a client that works with different
 server versions (like pg_receivexlog). Or, while we're at it, perhaps we
 should mandate network-byte order for all the integer and XLogRecPtr fields
 in the replication protocol. That would make it easier to write a client
 that works across different architectures, in = 9.3. The contents of the
 WAL would of course be architecture-dependent, but it would be nice if
 pg_receivexlog and similar tools could nevertheless be
 architecture-independent.

 I share Andres' question about how we're doing this already.  I think
 if we're going to break this, I'd rather do it in 9.3 than 5 years
 from now.  At this point it's just a minor annoyance, but it'll
 probably get worse as people write more tools that understand WAL.

If we are looking at breaking it, and we are especially concerned
about something like pg_receivexlog... Is it something we could/should
change in the protocl *now* for 9.2, to make it non-broken in any
released version? As in, can we extract just the protocol change and
backpatch that to 9.2beta?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Release versioning inconsistency

2012-06-20 Thread Magnus Hagander
On Wed, Jun 20, 2012 at 11:23 AM, Marti Raudsepp ma...@juffo.org wrote:
 On Wed, Jun 20, 2012 at 12:18 PM, Magnus Hagander mag...@hagander.net wrote:
 (I do believe that using the v9.2.0beta marker is
 *better*, because then it sorts properly. But likely not enough much
 better to be inconsistent with previous versions)

 Good point. Maybe that's a reason to change the versioning scheme and
 stick with 9.2.0betaX everywhere. Including calling the final
 release 9.2.0 instead of simply 9.2?

That might actually be a good idea. We can't really change the way we
named the betas, but it's not too late to consider naming the actual
release as 9.2.0...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Release versioning inconsistency

2012-06-20 Thread Dickson S. Guedes
2012/6/20 Magnus Hagander mag...@hagander.net:
 On Wed, Jun 20, 2012 at 11:23 AM, Marti Raudsepp ma...@juffo.org wrote:
 On Wed, Jun 20, 2012 at 12:18 PM, Magnus Hagander mag...@hagander.net 
 wrote:
 (I do believe that using the v9.2.0beta marker is
 *better*, because then it sorts properly. But likely not enough much
 better to be inconsistent with previous versions)

 Good point. Maybe that's a reason to change the versioning scheme and
 stick with 9.2.0betaX everywhere. Including calling the final
 release 9.2.0 instead of simply 9.2?

 That might actually be a good idea. We can't really change the way we
 named the betas, but it's not too late to consider naming the actual
 release as 9.2.0...


Maybe a symlink could be created just to fit the same pattern that other
versions use and keep the actual links (for beta) working.

I'm using the same pattern in `pgvm` [1]  and it is failing to fetch
beta versions :(


[1] https://github.com/guedes/pgvm/blob/master/include/sites


regards
-- 
Dickson S. Guedes
mail/xmpp: gue...@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_basebackup blocking all queries with horrible performance

2012-06-20 Thread Simon Riggs
On 11 June 2012 23:47, Magnus Hagander mag...@hagander.net wrote

 You agreed to add something like NOSYNC option into START_REPLICATION 
 command?

 I'm on the fence. I was hoping somebody else would chime in with an
 opinion as well.

Why would you add it to synchronous_standby_names and then explicitly ignore it?

I don't see why you'd want this.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Release versioning inconsistency

2012-06-20 Thread Peter Eisentraut
On ons, 2012-06-20 at 13:26 +0200, Magnus Hagander wrote:
 On Wed, Jun 20, 2012 at 11:23 AM, Marti Raudsepp ma...@juffo.org wrote:
  On Wed, Jun 20, 2012 at 12:18 PM, Magnus Hagander mag...@hagander.net 
  wrote:
  (I do believe that using the v9.2.0beta marker is
  *better*, because then it sorts properly. But likely not enough much
  better to be inconsistent with previous versions)
 
  Good point. Maybe that's a reason to change the versioning scheme and
  stick with 9.2.0betaX everywhere. Including calling the final
  release 9.2.0 instead of simply 9.2?
 
 That might actually be a good idea. We can't really change the way we
 named the betas, but it's not too late to consider naming the actual
 release as 9.2.0...

The final release was always going to be called 9.2.0, but naming the
beta 9.2.0betaX is wrong.  There was a previous discussion about that
particular point.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] reviving AC_PROG_INSTALL

2012-06-20 Thread Peter Eisentraut
As I had written in [0], using /usr/bin/install instead of install-sh
can significantly speed up the time to run make install, and hence make
check.  Using /usr/bin/install is standard in all autotools-using
projects.  PostgreSQL has removed the use of /usr/bin/install because of
the incident described in [1], which was basically configure picking up
some random install program in the path, which didn't do what we
wanted.  The Autoconf test has been much improved in the meantime, it
checks whether calling the program actually installs a file now, so it's
unlikely that another false positive would be picked up.

Therefore I propose to put back the AC_PROG_INSTALL configure check and
associated bits that were taken out back then.  See attached patch.


[0]: 
http://petereisentraut.blogspot.fi/2012/03/postgresql-make-install-times.html
[1]: http://archives.postgresql.org/pgsql-hackers/2001-03/msg00312.php
diff --git a/configure b/configure
index 6f8ebd8..fd29770 100755
--- a/configure
+++ b/configure
@@ -693,6 +693,9 @@ MKDIR_P
 AWK
 LN_S
 TAR
+INSTALL_DATA
+INSTALL_SCRIPT
+INSTALL_PROGRAM
 WINDRES
 DLLWRAP
 DLLTOOL
@@ -6855,6 +6858,106 @@ fi
 
 fi
 
+# Find a good install program.  We prefer a C program (faster),
+# so one script is as good as another.  But avoid the broken or
+# incompatible versions:
+# SysV /etc/install, /usr/sbin/install
+# SunOS /usr/etc/install
+# IRIX /sbin/install
+# AIX /bin/install
+# AmigaOS /C/install, which installs bootblocks on floppy discs
+# AIX 4 /usr/bin/installbsd, which doesn't work without a -g flag
+# AFS /usr/afsws/bin/install, which mishandles nonexistent args
+# SVR4 /usr/ucb/install, which tries to use the nonexistent group staff
+# OS/2's system install, which has a completely different semantic
+# ./install, which can be erroneously created by make from ./install.sh.
+# Reject install programs that cannot install multiple files.
+{ $as_echo $as_me:$LINENO: checking for a BSD-compatible install 5
+$as_echo_n checking for a BSD-compatible install...  6; }
+if test -z $INSTALL; then
+if test ${ac_cv_path_install+set} = set; then
+  $as_echo_n (cached)  6
+else
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH
+do
+  IFS=$as_save_IFS
+  test -z $as_dir  as_dir=.
+  # Account for people who put trailing slashes in PATH elements.
+case $as_dir/ in
+  ./ | .// | /cC/* | \
+  /etc/* | /usr/sbin/* | /usr/etc/* | /sbin/* | /usr/afsws/bin/* | \
+  ?:\\/os2\\/install\\/* | ?:\\/OS2\\/INSTALL\\/* | \
+  /usr/ucb/* ) ;;
+  *)
+# OSF1 and SCO ODT 3.0 have their own names for install.
+# Don't use installbsd from OSF since it installs stuff as root
+# by default.
+for ac_prog in ginstall scoinst install; do
+  for ac_exec_ext in '' $ac_executable_extensions; do
+	if { test -f $as_dir/$ac_prog$ac_exec_ext  $as_test_x $as_dir/$ac_prog$ac_exec_ext; }; then
+	  if test $ac_prog = install 
+	grep dspmsg $as_dir/$ac_prog$ac_exec_ext /dev/null 21; then
+	# AIX install.  It has an incompatible calling convention.
+	:
+	  elif test $ac_prog = install 
+	grep pwplus $as_dir/$ac_prog$ac_exec_ext /dev/null 21; then
+	# program-specific install script used by HP pwplus--don't use.
+	:
+	  else
+	rm -rf conftest.one conftest.two conftest.dir
+	echo one  conftest.one
+	echo two  conftest.two
+	mkdir conftest.dir
+	if $as_dir/$ac_prog$ac_exec_ext -c conftest.one conftest.two `pwd`/conftest.dir 
+	  test -s conftest.one  test -s conftest.two 
+	  test -s conftest.dir/conftest.one 
+	  test -s conftest.dir/conftest.two
+	then
+	  ac_cv_path_install=$as_dir/$ac_prog$ac_exec_ext -c
+	  break 3
+	fi
+	  fi
+	fi
+  done
+done
+;;
+esac
+
+done
+IFS=$as_save_IFS
+
+rm -rf conftest.one conftest.two conftest.dir
+
+fi
+  if test ${ac_cv_path_install+set} = set; then
+INSTALL=$ac_cv_path_install
+  else
+# As a last resort, use the slow shell script.  Don't cache a
+# value for INSTALL within a source directory, because that will
+# break other packages using the cache if that directory is
+# removed, or if the value is a relative name.
+INSTALL=$ac_install_sh
+  fi
+fi
+{ $as_echo $as_me:$LINENO: result: $INSTALL 5
+$as_echo $INSTALL 6; }
+
+# Use test -z because SunOS4 sh mishandles braces in ${var-val}.
+# It thinks the first close brace ends the variable substitution.
+test -z $INSTALL_PROGRAM  INSTALL_PROGRAM='${INSTALL}'
+
+test -z $INSTALL_SCRIPT  INSTALL_SCRIPT='${INSTALL}'
+
+test -z $INSTALL_DATA  INSTALL_DATA='${INSTALL} -m 644'
+
+# When Autoconf chooses install-sh as install program it tries to generate
+# a relative path to it in each makefile where it subsitutes it. This clashes
+# with our Makefile.global concept. This workaround helps.
+case $INSTALL in
+  *install-sh*) INSTALL='';;
+esac
+
 # Extract the first word of tar, so it can be a program name with args.
 set dummy tar; ac_word=$2
 { $as_echo $as_me:$LINENO: checking for $ac_word 5
@@ 

Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 19 June 2012 19:44, Peter Geoghegan pe...@2ndquadrant.com wrote:
 PostgreSQL supported Unicode before 2005, when the tie-breaker was
 introduced. I know at least one Swede who used Postgres95. I just took
 a look at the REL6_4 branch, and it looks much the same in 1999 as it
 did in 2005, in that there is no tie-breaker after the strcoll(). Now,
 that being the case, and Hungarian in particular having a whole bunch
 of these equivalencies, I have to wonder if the original complainant's
 problem really was diagnosed correctly. It could have had something to
 do with the fact that texteq() was confused about whether it reported
 equality or equivalency - it may have taken that long for the (len1 !=
 len2) fastpath thing (only holds for equality, not equivalence,
 despite the fact that the 2005-era strcoll() call checks equivalence
 within texteq() ) to trip someone out, because texteq() would have
 thereby given inconsistent answers in a very subtle way, which were
 correct neither according to the Hungarian locale nor according to
 simple bitwise equality.

It seems likely that this is more-or-less correct. The two equivalent
strings had a variable number of characters, so the fastpath made
texteq not accord with varstr_cmp(), even though texteq() itself also
only had a single strcoll() call.
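
To illustrate the kind of inconsistency being described, a toy comparator
(emphatically not the real texteq()/varstr_cmp()) might look like this;
whether the two Hungarian strings really compare as equivalent depends on
the hu_HU collation data installed on the system:

/*
 * A length fastpath is only valid for equality; strcoll() tests equivalence,
 * so the two can disagree when a locale treats differently-sized strings as
 * equivalent (e.g. Hungarian "potty"/"potyty").
 */
#include <locale.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool
texteq_like(const char *a, const char *b)
{
    if (strlen(a) != strlen(b))
        return false;               /* holds for equality, not equivalence */
    return strcoll(a, b) == 0;      /* but this tests equivalence */
}

int
main(void)
{
    setlocale(LC_COLLATE, "hu_HU.UTF-8");
    /* may report "not equal" here ... */
    printf("texteq-like: %d\n", texteq_like("potty", "potyty"));
    /* ... while strcoll() may report 0, i.e. the strings are equivalent */
    printf("strcoll:     %d\n", strcoll("potty", "potyty"));
    return 0;
}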

So there were some tuples with the Hungarian string potty in the 2005
bug report. They were not visible to any of the queries seen in the test
cases, even the good ones. There were a few tuples with equivalent
strings like potyty that were visible.

The index scans didn't fail to return the expected tuples because the
indexes were internally inconsistent or otherwise corrupt. Rather, the
_bt_checkkeys() function returned early because the faulty half
equivalence, half equality texteq() comparator reported that the
tuple returned didn't satisfy the qual, and on that basis the index
scan stopped. This usually wouldn't happen with Swedish, because their
equivalencies tend to be one character long, and are less common.

So far, so good, but how did this not blow-up sooner? Did Hungarians
only start using Postgres in late 2005, immediately after the 8.1
release? Hardly.

Commit c1d62bfd00f4d1ea0647e12947ca1de9fea39b33, made in late 2003
(Add operator strategy and comparison-value datatype fields to
ScanKey), may be part of the problem here.

Consider this test within _bt_checkkeys(), that was changed by that commit:

-   if (key->sk_flags & SK_COMMUTE)
-       test = FunctionCall2(&key->sk_func,
-                            key->sk_argument, datum);
-   else
-       test = FunctionCall2(&key->sk_func,
-                            datum, key->sk_argument);
+   test = FunctionCall2(&key->sk_func, datum, key->sk_argument);

-   if (DatumGetBool(test) == !!(key->sk_flags & SK_NEGATE))
+   if (!DatumGetBool(test))

I think that this change may have made the difference between the
Hungarians getting away with it and not getting away with it. Might it
have been that for text, they were using some operator that wasn't '='
(perhaps one which has no fastpath, and thus correctly made a
representation about equivalency) rather than texteq prior to this
commit? I didn't eyeball the pg_amop entries of the era myself, but it
seems quite possible. In any case, I find it hard to believe that it
took at least ten years for this problem to manifest itself just
because it took that long for a Hungarian with a strcoll()
implementation that correctly represented equivalency to use Postgres.
A year is a more plausible window.

If we do introduce an idea of equivalency to make all this work, that
means there'll have to be equivalency verification where equality
verification returns false, in a number of places, including the
above. For GiST, the equivalent test - the consistent function - will
still vary based on the strategy number used, which seems analogous to
the above.

So, you're going to have an extra strcoll()/strxfrm() + strcmp() here,
as part of a not-equal-but-maybe-equivalent test, which is bad.
However, if that means that we can cache a text constant as a
strxfrm() blob, and compare in a strxfrm()-wise fashion, that will
more than pay for itself, even for btree traversal alone. For a naive
strxfrm() + strcoll() implementation, that will save just under half
of the work, and everyone knows that the cost of the comparison is
what dominates here, particularly for certain collations.
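
As a rough standalone sketch of that caching idea (not PostgreSQL code; the
cached_key_* names are invented), the constant is transformed once and every
subsequent comparison costs one strxfrm() of the datum plus a strcmp():

#include <stdlib.h>
#include <string.h>

typedef struct CachedKey
{
    char       *xfrm;               /* strxfrm() blob of the constant */
} CachedKey;

static CachedKey *
cached_key_create(const char *constant)
{
    CachedKey  *key = malloc(sizeof(CachedKey));
    size_t      len = strxfrm(NULL, constant, 0);

    key->xfrm = malloc(len + 1);
    strxfrm(key->xfrm, constant, len + 1);
    return key;
}

/*
 * Same sign as strcoll(datum, constant) under the current LC_COLLATE, but
 * the constant's half of the work has already been paid for.
 */
static int
cached_key_cmp(const CachedKey *key, const char *datum)
{
    size_t      len = strxfrm(NULL, datum, 0);
    char       *buf = malloc(len + 1);
    int         cmp;

    strxfrm(buf, datum, len + 1);
    cmp = strcmp(buf, key->xfrm);
    free(buf);
    return cmp;
}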

We'd probably formalise it to the point where there'd be a btree
strategy number and fully-fledged equivalency operator that the user
could conceivably use themselves.

There seems to be scope-creep here. I'm not sure that I should
continue with this as part of this review. Maybe this should be
something that I work on for the next commitfest.

It would be nice to hear what others thought of these ideas before I
actually start writing a patch that both fixes these problems (our
behaviour is incorrect for some locales 

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Kevin Grittner
 Heikki Linnakangas  wrote:
 
 I don't like the idea of adding the origin id to the record header.
 It's only required in some occasions, and on some record types.
 
Right.
 
 And I'm worried it might not even be enough in more complicated
 scenarios.

 Perhaps we need a more generic WAL record annotation system, where
 a plugin can tack arbitrary information to WAL records. The extra
 information could be stored in the WAL record after the rmgr
 payload, similar to how backup blocks are stored. WAL replay could
 just ignore the annotations, but a replication system could use it
 to store the origin id or whatever extra information it needs.
 
Not only would that handle absolute versus relative updates and
origin id, but application frameworks could take advantage of such a
system for passing transaction metadata.  I've held back on one
concern so far that I'll bring up now because this suggestion would
address it nicely.
 
Our current trigger-driven logical replication includes a summary
containing transaction run time, commit time, the transaction
type identifier, the source code line from which that transaction was
invoked, the user ID with which the user connected to the application
(which isn't the same as the database login), etc.  Being able to
decorate a database transaction with arbitrary (from the DBMS POV)
metadata would be very valuable.  In fact, our shop can't maintain
the current level of capabilities without *some* way to associate
such information with a transaction.
 
I think that using up the only unused space in the fixed header to
capture one piece of the transaction metadata needed for logical
replication, and that only in some configurations, is short-sighted.
If we solve the general problem of transaction metadata, this one
specific case will fall out of that.
 
I think removing origin ID from this patch and submitting a separate
patch for a generalized transaction metadata system is the sensible
way to go.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 6:59 AM, Andres Freund and...@2ndquadrant.com wrote:
 My guess is that it wouldn't be too hard to remove some of the extra
 pointers.  Anyone who is using Dllist as a non-inline list could be
 converted to List * instead.
 There are only three users of the whole dllist.h. Catcache, autovacuum and
 postmaster. The latter two just keep a list of databases around. So any change
 will only be moderatively intrusive.

Yeah.

 Also, the performance-critical things
 could be reimplemented as macros.

 I question, though, whether we really need both single and doubly linked
 lists.  That seems like it's almost certainly micro-optimization that we are
 better off not doing.
 It was certainly worthwile for the memory manager (lower per allocation
 overhead). You might be right that its not worth it for many other possible
 usecases in pg. Its not much code though.

 *looks around*

 A quick grep found:

 single linked list like code:  guc_private.h, aset.c, elog.h, regguts.h (ok,
 irrelevant), dynahash.c, resowner.c, extension.c, pgstat.c, xlog.c
 Double linked like code: shmqueue.c, lwlock.c, dynahash.c, xact.c

 I stopped at that point because while surely not of all of the above usecases
 could be replaced by a common implementation several could from a quick look.
 Also, several pg_list.h users could benefit from a conversion. So I think
 adding a single linked list implementation is worthwile.

I can believe that, although I fear it may be a distraction in the
grand scheme of getting logical replication implemented.  There should
be very few places where this is actually performance-critical, and
Tom will complain about large amounts of code churn that don't improve
performance.

If we're going to do that, how about transforming dllist.h into the
doubly-linked list and adding sllist.h for the singly-linked list?

  The most contentious point is probably relying on USE_INLINE being
  available anywhere. Which I believe to be the point now that we have
  gotten rid of some platforms.
 I would be hesitant to chuck that even though I realize it's unlikely
 that we really need !USE_INLINE.  But see sortsupport for an example
 of how we've handled this in the recent past.
 I agree its possible to resolve this. But hell, do we really need to add all
 that ugliness in 2012? I don't think its worth the effort of support ancient
 compilers that don't support inline anymore. If we could stop trying to
 support catering for probably non-existing compilers we could remove some
 *very* ugly long macros for example (e.g. in htup.h).

I don't feel qualified to make a decision on this one, so will defer
to the opinions of others.

 If support for !USE_INLINE is required I would prefer to have an header define
 the functions like

 #ifdef USE_INLINE
 #define OPTIONALLY_INLINE static inline
 #define USE_LINKED_LIST_IMPL
 #endif

 #ifdef USE_LINKED_LIST_IMPL

 OPTIONALLY_INLINE void myFuncCall(){
 ...
 }
 #endif

 which then gets included with #define USE_LINKED_LIST_IMPL by some c file
 defining OPTIONALLY_INLINE to something empty if !USE_INLINE.
 Its too much code to duplicate imo.

Neat trick.  Maybe we should revise the sortsupport stuff to do it that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund and...@2ndquadrant.com wrote:
 As I said before, I definitely agree that we want to have a separate transport
 format once we have decoding nailed down. We still need to ship wal around if
 the decoding happens in a different instance, but *after* that it can be
 shipped in something more convenient/appropriate.

Right, OK, we agree on this then.

 One bit is fine if you have only very simple replication topologies. Once you
 think about globally distributed databases its a bit different. You describe
 some of that below, but just to reiterate:
 Imagine having 6 nodes, 3 on one of two continents (ABC in north america, DEF
 in europe). You may only want to have full intercontinental interconnect
 between two of those (say A and D). If you only have one bit to represent the
 origin thats not going to work because you won't be able discern the changes
 from BC on A from the changes from those originating on DEF.

I don't see the problem.  A certainly knows via which link the LCRs arrived.

So: change happens on A.  A sends the change to B, C, and D.  B and C
apply the change.  One bit is enough to keep them from regenerating
new LCRs that get sent back to A.  So they're fine.  D also receives
the changes (from A) and applies them, but it also does not need to
regenerate LCRs.  Instead, it can take the LCRs that it has already
got (from A) and send those to E and F.

Or: change happens on B.  B sends the changes to A.  Since A knows the
network topology, it sends the changes to C and D.  D sends them to E
and F.  Nobody except B needs to *generate* LCRs.  All any other node
needs to do is suppress *redundant* LCR generation.

 Another topology which is interesting is circular replications (i.e. changes
 get shipped A->B, B->C, C->A) which is a sensible topology if you only have a
 low change rate and a relatively high number of nodes because you don't need
 the full combinatorial amount of connections.

I think this one is OK too.  You just generate LCRs on the origin node
and then pass them around the ring at every step.  When the next hop
would be the origin node then you're done.

I think you may be imagining that A generates LCRs and sends them to
B.  B applies them, and then from the WAL just generated, it produces
new LCRs which then get sent to C.  If you do that, then, yes,
everything that you need to disentangle various network topologies
must be present in WAL.  But what I'm saying is: don't do it like
that.  Generate the LCRs just ONCE, at the origin node, and then pass
them around the network, applying them at every node.  Then, the
information that is needed in WAL is confined to one bit: the
knowledge of whether or not a particular transaction is local (and
thus LCRs should be generated) or non-local (and thus they shouldn't,
because the origin already generated them and thus we're just handing
them around to apply everywhere).
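
For what it's worth, the one-bit rule described above boils down to something
like this toy sketch (all types and names are invented, purely illustrative):

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for a WAL record: the only thing the scheme needs in
 * WAL is one bit saying "this change originated locally". */
typedef struct FakeWalRecord
{
    bool        local_origin;
    const char *description;
} FakeWalRecord;

/* Decode a record into LCRs only if it originated on this node; changes that
 * arrived via replication apply are skipped, because the origin node already
 * generated their LCRs and they are just being passed around the topology. */
static void
maybe_generate_lcrs(const FakeWalRecord *rec)
{
    if (rec->local_origin)
        printf("generate LCRs for: %s\n", rec->description);
    else
        printf("skip, LCRs already exist upstream: %s\n", rec->description);
}

int
main(void)
{
    FakeWalRecord local = {true, "UPDATE executed on this node"};
    FakeWalRecord remote = {false, "UPDATE applied from node A"};

    maybe_generate_lcrs(&local);
    maybe_generate_lcrs(&remote);
    return 0;
}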

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 02:51:30 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 6:59 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  Also, the performance-critical things
  could be reimplemented as macros.
  
  I question, though, whether we really need both single and doubly linked
  lists.  That seems like it's almost certainly micro-optimization that we
  are better off not doing.
  
  It was certainly worthwile for the memory manager (lower per allocation
  overhead). You might be right that its not worth it for many other
  possible usecases in pg. Its not much code though.
  
  *looks around*
  
  A quick grep found:
  
  single linked list like code:  guc_private.h, aset.c, elog.h, regguts.h
  (ok, irrelevant), dynahash.c, resowner.c, extension.c, pgstat.c, xlog.c
  Double linked like code: shmqueue.c, lwlock.c, dynahash.c, xact.c
  
  I stopped at that point because while surely not of all of the above
  usecases could be replaced by a common implementation several could from
  a quick look. Also, several pg_list.h users could benefit from a
  conversion. So I think adding a single linked list implementation is
  worthwile.
 
 I can believe that, although I fear it may be a distraction in the
 grand scheme of getting logical replication implemented.  There should
 be very few places where this is actually performance-critical, and
 Tom will complain about large amounts of code churn that don't improve
 performance.
Uh. I don't want to just go around and replace anything randomly. Actually I
don't want to change anything for now except what's necessary to get the patch
in. The point I tried to make was just that the relatively widespread usage of
similar structures makes it likely that it can be used in more places in future.

 If we're going to do that, how about transforming dllist.h into the
 doubly-linked list and adding sllist.h for the singly-linked list?
I would be fine with that.

I will go and try to cook up a patch, assuming for now that we rely on inline;
the ugliness can be added back afterwards.

   The most contentious point is probably relying on USE_INLINE being
   available anywhere. Which I believe to be the point now that we have
   gotten rid of some platforms.

  I would be hesitant to chuck that even though I realize it's unlikely
  that we really need !USE_INLINE.  But see sortsupport for an example
  of how we've handled this in the recent past.
  
  I agree its possible to resolve this. But hell, do we really need to add
  all that ugliness in 2012? I don't think its worth the effort of support
  ancient compilers that don't support inline anymore. If we could stop
  trying to support catering for probably non-existing compilers we could
  remove some *very* ugly long macros for example (e.g. in htup.h).
 
 I don't feel qualified to make a decision on this one, so will defer
 to the opinions of others.
Ok.

  If support for !USE_INLINE is required I would prefer to have an header
  define the functions like
  
  #ifdef USE_INLINE
  #define OPTIONALLY_INLINE static inline
  #define USE_LINKED_LIST_IMPL
  #endif
  
  #ifdef USE_LINKED_LIST_IMPL
  
  OPTIONALLY_INLINE void myFuncCall(){
  ...
  }
  #endif
  
  which then gets included with #define USE_LINKED_LIST_IMPL by some c file
  defining OPTIONALLY_INLINE to something empty if !USE_INLINE.
  Its too much code to duplicate imo.
 
 Neat trick.  Maybe we should revise the sortsupport stuff to do it that
 way.
Either that or at least add a comment to both that it's duplicated...

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The idea that logical rep is some kind of useful end goal in itself is
 slightly misleading. If the thought is to block multi-master
 completely on that basis, that would be a shame. Logical rep is the
 mechanism for implementing multi-master.

If you're saying that single-master logical replication isn't useful,
I disagree.  Of course, having both single-master and multi-master
replication together is even more useful.  But I think getting even
single-master logical replication working well in a single release
cycle is going to be a job and a half.  Thinking that we're going to
get MMR in one release is not realistic.  The only way to make it
realistic is to put MMR ahead of every other goal that people have for
logical replication, including robustness and stability.  It's
entirely premature to be designing features for MMR when we don't even
have the design for SMR nailed down yet.  And that's even assuming we
EVER want MMR in core, which has not even really been argued, let
alone agreed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 03:19:55 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote:
  The idea that logical rep is some kind of useful end goal in itself is
  slightly misleading. If the thought is to block multi-master
  completely on that basis, that would be a shame. Logical rep is the
  mechanism for implementing multi-master.
 
 If you're saying that single-master logical replication isn't useful,
 I disagree.  Of course, having both single-master and multi-master
 replication together is even more useful.  But I think getting even
 single-master logical replication working well in a single release
 cycle is going to be a job and a half.  Thinking that we're going to
 get MMR in one release is not realistic.  The only way to make it
 realistic is to put MMR ahead of every other goal that people have for
 logical replication, including robustness and stability.  It's
 entirely premature to be designing features for MMR when we don't even
 have the design for SMR nailed down yet.  And that's even assuming we
 EVER want MMR in core, which has not even really been argued, let
 alone agreed.
I agree it has not been agreed upon, but I certainly would consider
submitting a prototype implementing it an argument for doing it ;)

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-20 Thread Amit Kapila

 I'm almost inclined to suggest that we not get next-LSN from WAL, but
 by scanning all the pages in the main data store and computing the max
 observed LSN.  This is clearly not very attractive from a performance
 standpoint, but it would avoid the obvious failure mode where you lost
 some recent WAL segments along with pg_control.

According to my analysis, this will have some problems.
I will explain the problem by taking an example scenario.

Example Scenario -
Let us assume that the database crashes and it can be recovered by doing crash
recovery.
Now assume we have the data files and WAL files intact and only the control
file is lost.
Now the user uses pg_resetxlog to generate the pg_control file, and we use the
new algorithm to generate next-LSN.

Summary of events before the database crash -
1. A checkpoint was in progress; it has already noted the next-LSN location
(LSN-107) and marked the dirty pages as BM_CHECKPOINT_NEEDED.
2. At this point a new transaction dirties 2 pages: first it dirties a fresh
page (for this change, LSN-108),
   and then it dirties one which is already marked as BM_CHECKPOINT_NEEDED (for
this change, LSN-109).
3. The checkpoint starts flushing pages.
4. It will now flush the page with LSN-109 but not the page with LSN-108.
5. The checkpoint finishes.
6. The database crashes.

Normal crash recovery -
It will start the replay from 107, and after recovery the database will be in a
consistent state.

pg_resetxlog -
It will generate the next-LSN point as 109, which, when used for recovery, will
produce an inconsistent database.
However, if we had relied on WAL, it would have gotten next-LSN as 107.

This is just an example case to show that there can be some problems with using
the algorithm for generating next-LSN from pages. However, it doesn't prove
that generating it from WAL will be correct.

Please correct my understanding if I am wrong.
 
With Regards,
Amit Kapila.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] Add some more documentation for array indexes/operators

2012-06-20 Thread Ryan Kelly
I had trouble finding what operators arrays supported, or which ones
had index support, or even determining that arrays could be indexed, from
the documentation for the array data type. So, patch.

-Ryan Kelly
diff --git a/doc/src/sgml/array.sgml b/doc/src/sgml/array.sgml
index 3508ba3..51d996d 100644
--- a/doc/src/sgml/array.sgml
+++ b/doc/src/sgml/array.sgml
@@ -573,6 +573,33 @@ SELECT * FROM
   This function is described in xref linkend=functions-srf-subscripts.
  /para
 
+ para
+  You can also search with arrays using the literal@gt;/,
+  literallt;@/, and literalamp;amp;/ operators which check
+  whether the left operand contains, is contained by, or overlaps with
+  the right operand, respectively. For instance,
+
+programlisting
SELECT * FROM sal_emp WHERE pay_by_quarter && ARRAY[1];
+/programlisting
+
+  Will find all rows where the structfieldpay_by_quarter/structfield
+  overlaps with an array containing literal1/, that is, any
+  employee with a quarterly salary of literal1/ will be matched.
+ /para
+
+ para
+  See xref linkend=arrays for more details about array operator
+  behavior.
+ /para
+
+ para
+  Additionally, a number of operators support indexed operations. This means
+  that the example above could utilize an index to efficiently locate those 
+  employees with the desired quarterly salary. See xref linkend=indexes-types
+  for more details about which operators support indexed operations.
+ /para
+
  tip
   para
Arrays are not sets; searching for specific array elements
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index cd374ac..2cef8e4 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10282,7 +10282,8 @@ SELECT NULLIF(value, '(none)') ...
 
   para
See xref linkend=arrays for more details about array operator
-   behavior.
+   behavior. See xref linkend=indexes-types for more details about
+   which operators support indexed operations.
   /para
 
   para

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 9:12 AM, Andres Freund and...@2ndquadrant.com wrote:
 Uh. I don't want to just go around and replace anything randomly. Actually I
 don't want to change anything for now except whats necessary to get the patch
 in. The point I tried to make was just that the relatively widespread usage of
 similar structure make it likely that it can be used in more places in future.

Well, the question is for anywhere you might be thinking of using
this: why not just use List?  We do that in a lot of other places, and
there's not much reason to invent something new unless there is a
problem with what we already have.  I assume this is related to
logical replication somehow, but it's not clear to me exactly what
problem you hit doing this in the obvious way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 21:19, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The idea that logical rep is some kind of useful end goal in itself is
 slightly misleading. If the thought is to block multi-master
 completely on that basis, that would be a shame. Logical rep is the
 mechanism for implementing multi-master.

 If you're saying that single-master logical replication isn't useful,
 I disagree.  Of course, having both single-master and multi-master
 replication together is even more useful.

 But I think getting even
 single-master logical replication working well in a single release
 cycle is going to be a job and a half.

OK, so your estimate is 1.5 people to do that. And if we have more
people, should they sit around doing nothing?

 Thinking that we're going to
 get MMR in one release is not realistic.

If you block it, then the above becomes true, whether or not it starts true.

You may not want MMR, but others do. I see no reason to prevent people
from having it, which is what you suggest.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 03:24:58 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 9:12 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  Uh. I don't want to just go around and replace anything randomly.
  Actually I don't want to change anything for now except whats necessary
  to get the patch in. The point I tried to make was just that the
  relatively widespread usage of similar structure make it likely that it
  can be used in more places in future.
 
 Well, the question is for anywhere you might be thinking of using
 this: why not just use List?  We do that in a lot of other places, and
 there's not much reason to event something new unless there is a
 problem with what we already have.  I assume this is related to
 logical replication somehow, but it's not clear to me exactly what
 problem you hit doing this in the obvious way.
It incurs a rather high performance overhead due to added memory allocations
and added pointer indirections. That's fine for most of the current users of
the List interface, but certainly not for all. In other places you cannot even
have memory allocations because the list lives in shared memory.

E.g. in the ApplyCache, where I use the submitted ilist.h stuff, when
reconstructing transactions you add to a potentially really long linked list
of individual changes for every interesting WAL record. Before I prevented
memory allocations in that path, it took about 12-14% of the time when applying
changes in the same backend. Afterwards it wasn't visible in the profile
anymore.

Several of the pieces of code I pointed out in a previous email use open-coded
list implementations exactly to prevent those problems.

If you look at the parsing, planning and execution of trivial statements, you
will also notice the overhead of memory allocations. A good bit of that is
caused by list manipulation. Check Stephen Frost's pre-alloc ListCell
optimization for workarounds.

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-20 Thread Aidan Van Dyk
On Wed, Jun 20, 2012 at 9:21 AM, Amit Kapila amit.kap...@huawei.com wrote:

 Example Scenario -

 Now assume we have Data files and WAL files intact and only control file is 
 lost.


Just so I understand correctly, the aim of this is to fix the
situation where, out of the thousands of files and 100s of GB of data
in my pg directory, the *only* corruption is that a single file,
pg_control, is missing?

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 20 June 2012 21:19, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The idea that logical rep is some kind of useful end goal in itself is
 slightly misleading. If the thought is to block multi-master
 completely on that basis, that would be a shame. Logical rep is the
 mechanism for implementing multi-master.

 If you're saying that single-master logical replication isn't useful,
 I disagree.  Of course, having both single-master and multi-master
 replication together is even more useful.

 But I think getting even
 single-master logical replication working well in a single release
 cycle is going to be a job and a half.

 OK, so your estimate is 1.5 people to do that. And if we have more
 people, should they sit around doing nothing?

Oh, give me a break.  You're willfully missing my point.  And to quote
Fred Brooks, nine women can't make a baby in one month.

 Thinking that we're going to
 get MMR in one release is not realistic.

 If you block it, then the above becomes true, whether or not it starts true.

If I had no rational basis for my objections, that would be true.
You've got four people objecting to this patch now, all of whom happen
to be committers.  Whether or not MMR goes into core, who knows, but
it doesn't seem that this patch is going to fly.

My main point in bringing this up is that if you pick a project that
is too large, you will fail.  As I would rather see this project
succeed, I recommend that you don't do that.  Both you and Andres seem
to believe that MMR is a reasonable first target to shoot at, but I
don't think anyone else - including the Slony developers who have
commented on this issue - endorses that position.  At PGCon, you were
talking about getting a new set of features into PG over the next 3-5
years.  Now, it seems like you want to compress that timeline to a
year.  I don't think that's going to work.  You also requested that
people tell you sooner when large patches were in danger of not making
the release.  Now I'm doing that, VERY early, and you're apparently
angry about it.  If the only satisfactory outcome of this conversation
is that everyone agrees with the design pattern you've already decided
on, then you haven't left yourself very much room to walk away
satisfied.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 03:02:28 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  One bit is fine if you have only very simple replication topologies. Once
  you think about globally distributed databases it's a bit different. You
  describe some of that below, but just to reiterate:
  Imagine having 6 nodes, 3 on each of two continents (ABC in North America,
  DEF in Europe). You may only want to have full intercontinental
  interconnect between two of those (say A and D). If you only have one
  bit to represent the origin that's not going to work, because you won't be
  able to discern the changes from B and C on A from those originating
  on D, E and F.
 
 I don't see the problem.  A certainly knows via which link the LCRs
 arrived.

 So: change happens on A.  A sends the change to B, C, and D.  B and C
 apply the change.  One bit is enough to keep them from regenerating
 new LCRs that get sent back to A.  So they're fine.  D also receives
 the changes (from A) and applies them, but it also does not need to
 regenerate LCRs.  Instead, it can take the LCRs that it has already
 got (from A) and send those to E and F.

 Or: change happens on B.  B sends the changes to A.  Since A knows the
 network topology, it sends the changes to C and D.  D sends them to E
 and F.  Nobody except B needs to *generate* LCRs.  All any other node
 needs to do is suppress *redundant* LCR generation.
 
  Another topology which is interesting is circular replication (i.e.
  changes get shipped A -> B, B -> C, C -> A), which is a sensible topology if
  you only have a low change rate and a relatively high number of nodes,
  because you don't need the full combinatorial number of connections.
 
 I think this one is OK too.  You just generate LCRs on the origin node
 and then pass them around the ring at every step.  When the next hop
 would be the origin node then you're done.
 
 I think you may be imagining that A generates LCRs and sends them to
 B.  B applies them, and then from the WAL just generated, it produces
 new LCRs which then get sent to C. 
Yes, that's what I am proposing.

 If you do that, then, yes,
 everything that you need to disentangle various network topologies
 must be present in WAL.  But what I'm saying is: don't do it like
 that.  Generate the LCRs just ONCE, at the origin node, and then pass
 them around the network, applying them at every node.  Then, the
 information that is needed in WAL is confined to one bit: the
 knowledge of whether or not a particular transaction is local (and
 thus LCRs should be generated) or non-local (and thus they shouldn't,
 because the origin already generated them and thus we're just handing
 them around to apply everywhere).
Sure, you can do it that way, but I don't think it's a good idea. If you do it 
my way you *guarantee* that when replaying changes from node B on node C you 
have replayed changes from A at least as far as B has. That's a really nice 
property for MM.
You *can* get the same with your solution, but it starts to get complicated 
rather fast, while my/our proposed solution is trivial to implement.
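
To make the low-level part concrete, here is a hedged sketch (made-up names 
and types, not the patch itself) of the filter this buys you: with the 
origin carried on every record, deciding whether to forward a change to a 
peer is a single integer comparison.

#include <stdbool.h>
#include <stdint.h>

typedef uint16_t RepOriginId;   /* illustrative only */

/*
 * Never ship a change back to the node it originated on; forward
 * everything else.  Because forwarded changes keep their origin, this
 * also preserves the "replaying B implies having replayed A up to the
 * same point" property described above.
 */
static bool
change_should_be_shipped(RepOriginId record_origin, RepOriginId peer_id)
{
    return record_origin != peer_id;
}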

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 20:37, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Heikki Linnakangas  wrote:

 I don't like the idea of adding the origin id to the record header.
 It's only required in some occasions, and on some record types.

 Right.

Wrong, as explained.

 And I'm worried it might not even be enough in more complicated
 scenarios.

 Perhaps we need a more generic WAL record annotation system, where
 a plugin can tack arbitrary information to WAL records. The extra
 information could be stored in the WAL record after the rmgr
 payload, similar to how backup blocks are stored. WAL replay could
 just ignore the annotations, but a replication system could use it
 to store the origin id or whatever extra information it needs.

 Not only would that handle absolute versus relative updates and
 origin id, but application frameworks could take advantage of such a
 system for passing transaction metadata.  I've held back on one
 concern so far that I'll bring up now because this suggestion would
 address it nicely.

 Our current trigger-driven logical replication includes a summary
 which includes transaction run time, commit time, the transaction
 type identifier, the source code line from which that transaction was
 invoked, the user ID with which the user connected to the application
 (which isn't the same as the database login), etc.  Being able to
 decorate a database transaction with arbitrary (from the DBMS POV)
 metadata would be very valuable.  In fact, our shop can't maintain
 the current level of capabilities without *some* way to associate
 such information with a transaction.

 I think that using up the only unused space in the fixed header to
 capture one piece of the transaction metadata needed for logical
 replication, and that only in some configurations, is short-sighted.
 If we solve the general problem of transaction metadata, this one
 specific case will fall out of that.

The proposal now includes flag bits that would allow the addition of a
variable-length header, should that ever become necessary. So the
unused space in the fixed header is not being used up as you say. In
any case, the fixed header still has 4 wasted bytes on 64-bit systems
even after the patch is applied. So this claim of short-sightedness is
just plain wrong.

It isn't true that this is needed only for some configurations of
multi-master, per discussion.

This is not transaction metadata, it is WAL record metadata required
for multi-master replication, see later point.

We need to add information to every WAL record that is used as the
source for generating LCRs. It is also possible to add this to HEAP
and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
using the *currently wasted* bytes on a WAL record header does *not*
bloat the WAL stream.
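
As a simplified illustration only (this is not the real XLogRecord layout; 
the field names and widths below are invented), alignment is what makes a 
16-bit id free: with 8-byte alignment the struct is padded to the same size 
whether or not the last field is there.

#include <stdint.h>
#include <stdio.h>

typedef struct FixedHeader
{
    uint32_t    crc;            /* 4 padding bytes still follow on 64-bit builds */
    uint64_t    prev_lsn;       /* forces 8-byte alignment */
    uint32_t    xid;
    uint32_t    tot_len;
    uint32_t    len;
    uint8_t     info;
    uint8_t     rmid;
    uint16_t    origin_id;      /* fills bytes that were previously padding */
} FixedHeader;

int
main(void)
{
    /* prints 32 on an LP64 machine, with or without origin_id present */
    printf("%zu\n", sizeof(FixedHeader));
    return 0;
}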

 I think removing origin ID from this patch and submitting a separate
 patch for a generalized transaction metadata system is the sensible
 way to go.

We already have a very flexible WAL system for recording data of
interest to various resource managers. If you wish to annotate a
transaction, you can either generate a new kind of WAL record or you
can enhance a commit record. There are already unused flag bits on
commit records for just such a purpose.

XLOG_NOOP records can already be generated by your application if you
wish to inject additional metadata to the WAL stream. So no changes
are required for you to implement the generalised transaction metadata
scheme you say you require.

Not sure how or why that relates to requirements for multi-master.


Please note that I've suggested review changes to Andres' work myself.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Hannu Valtonen

On 06/19/2012 01:47 AM, Christopher Browne wrote:
 That numbering scheme gets pretty anti-intuitive fairly quickly,
 from whence we took the approach of having a couple digits
 indicating data centre followed by a digit indicating which node in
 that data centre.
 
 If that all sounds incoherent, well, the more nodes you have
 around, the more difficult it becomes to make sure you *do* have a
 coherent picture of your cluster.
 
 I recall the Slony-II project having a notion of attaching a
 permanent UUID-based node ID to each node.  As long as there is
 somewhere decent to find a symbolically significant node name, I
 like the idea of the ID *not* being in a tiny range, and being
 UUID/OID-like...
 

Just as a side note, MySQL's new global transaction IDs use a UUID for
the server-id part of them. [1]

- Hannu Valtonen

[1]
http://d2-systems.blogspot.fi/2012/04/global-transaction-identifiers-are-in.html

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund and...@2ndquadrant.com wrote:
 If you do that, then, yes,
 everything that you need to disentangle various network topologies
 must be present in WAL.  But what I'm saying is: don't do it like
 that.  Generate the LCRs just ONCE, at the origin node, and then pass
 them around the network, applying them at every node.  Then, the
 information that is needed in WAL is confined to one bit: the
 knowledge of whether or not a particular transaction is local (and
 thus LCRs should be generated) or non-local (and thus they shouldn't,
 because the origin already generated them and thus we're just handing
 them around to apply everywhere).
 Sure, you can do it that way, but I don't think it's a good idea. If you do it
 my way you *guarantee* that when replaying changes from node B on node C you
 have replayed changes from A at least as far as B has. That's a really nice
 property for MM.
 You *can* get the same with your solution, but it starts to get complicated
 rather fast, while my/our proposed solution is trivial to implement.

That's an interesting point.  I agree that's a useful property, and
might be a reason not to just use a single-bit flag, but I still think
you'd be better off handling that requirement via some other method,
like logging the node ID once per transaction or something.  That lets
you have as much metadata as you end up needing, which is a lot more
flexible than a 16-bit field, as Kevin, Heikki, and Tom have also
said.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 03:42:39 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote:
  On 20 June 2012 21:19, Robert Haas robertmh...@gmail.com wrote:
  On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com 
wrote:
  The idea that logical rep is some kind of useful end goal in itself is
  slightly misleading. If the thought is to block multi-master
  completely on that basis, that would be a shame. Logical rep is the
  mechanism for implementing multi-master.
  
  If you're saying that single-master logical replication isn't useful,
  I disagree.  Of course, having both single-master and multi-master
  replication together is even more useful.
  
  But I think getting even
  single-master logical replication working well in a single release
  cycle is going to be a job and a half.
  Thinking that we're going to
  get MMR in one release is not realistic.
  If you block it, then the above becomes true, whether or not it starts
  true.
 My main point in bringing this up is that if you pick a project that
 is too large, you will fail.
We're not the only ones here who are performing scope creep, though... I think 
just about everyone who has posted in the whole thread, except maybe Tom and 
Marko, is guilty of doing so.

I still think it's rather sensible to focus on exactly duplicated schemas in a 
very first version, just because that leaves out some of the complexity while 
paving the road for other nice things.

 You've got four people objecting to this patch now, all of whom happen
 to be committers.  Whether or not MMR goes into core, who knows, but
 it doesn't seem that this patch is going to fly.
I find that a bit too early to say. Sure it won't fly exactly as proposed, but 
hell, who cares? What I want to get in is a solution to the specific problem 
the patch targets. At least you have, not sure about others, accepted that the 
problem needs a solution.
We do not agree yet what that solution should look like, but that's not exactly 
surprising as we started discussing the problem only a good day ago.

If people agree that your proposed way of just one flag bit is the way to go 
we will have to live with that. But that's different from saying the whole 
thing is dead.

 As I would rather see this project
 succeed, I recommend that you don't do that.  Both you and Andres seem
 to believe that MMR is a reasonable first target to shoot at, but I
 don't think anyone else - including the Slony developers who have
 commented on this issue - endorses that position. 
I don't think we get full MMR into 9.3. What I am proposing is that we build 
in the few pieces that are required to implement MMR *on top* of what's 
hopefully in 9.3.
And I think that's a realistic goal.

 At PGCon, you were
 talking about getting a new set of features into PG over the next 3-5
 years.  Now, it seems like you want to compress that timeline to a
 year.
Well, I obviously would like everything to be done in one release, but I also 
would like to go hiking for a year, have a restored sailing boat and some 
more. That doesn't make it reality...
To make it absolutely clear: I definitely don't think it's realistic to have 
everything in 9.3. And I don't think Simon does either. What I want is to 
have the basic building blocks in 9.3.
The difference probably just is what counts as a building block.

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 03:54:43 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  If you do that, then, yes,
  everything that you need to disentangle various network topologies
  must be present in WAL.  But what I'm saying is: don't do it like
  that.  Generate the LCRs just ONCE, at the origin node, and then pass
  them around the network, applying them at every node.  Then, the
  information that is needed in WAL is confined to one bit: the
  knowledge of whether or not a particular transaction is local (and
  thus LCRs should be generated) or non-local (and thus they shouldn't,
  because the origin already generated them and thus we're just handing
  them around to apply everywhere).
  
  Sure, you can do it that way, but I don't think it's a good idea. If you
  do it my way you *guarantee* that when replaying changes from node B on
  node C you have replayed changes from A at least as far as B has. That's
  a really nice property for MM.
  You *can* get the same with your solution, but it starts to get complicated
  rather fast, while my/our proposed solution is trivial to implement.
 
 That's an interesting point.  I agree that's a useful property, and
 might be a reason not to just use a single-bit flag, but I still think
 you'd be better off handling that requirement via some other method,
 like logging the node ID once per transaction or something.  That lets
 you have as much metadata as you end up needing, which is a lot more
 flexible than a 16-bit field, as Kevin, Heikki, and Tom have also
 said.
If it comes down to that I can definitely live with it. I still think making 
the filtering trivial enough that it can be done at a low level, without any 
additional logic, is a very desirable property, but if not, so be it.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 21:42, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 20 June 2012 21:19, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The idea that logical rep is some kind of useful end goal in itself is
 slightly misleading. If the thought is to block multi-master
 completely on that basis, that would be a shame. Logical rep is the
 mechanism for implementing multi-master.

 If you're saying that single-master logical replication isn't useful,
 I disagree.  Of course, having both single-master and multi-master
 replication together is even more useful.

 But I think getting even
 single-master logical replication working well in a single release
 cycle is going to be a job and a half.

 OK, so your estimate is 1.5 people to do that. And if we have more
 people, should they sit around doing nothing?

 Oh, give me a break.  You're willfully missing my point.  And to quote
 Fred Brooks, nine women can't make a baby in one month.


No, I'm not. The question is not how quickly N people can achieve a
single thing, but how long it will take a few skilled people, working
on carefully selected tasks that have few dependencies between them,
to achieve something.

We have significantly more preparation, development time and resources
than any other project previously performed for PostgreSQL, that I am
aware of.

Stating that it is impossible to perform a task in a certain period of
time without even considering those points is clearly rubbish. I've
arrived at my thinking based upon detailed project planning of what
was possible in the time.

How exactly did you arrive at your conclusion? Why is yours right and
mine wrong?



 Thinking that we're going to
 get MMR in one release is not realistic.

 If you block it, then the above becomes true, whether or not it starts true.

 If I had no rational basis for my objections, that would be true.
 You've got four people objecting to this patch now, all of whom happen
 to be committers.  Whether or not MMR goes into core, who knows, but
 it doesn't seem that this patch is going to fly.

No, I have four people who had initial objections and who have not
commented on the fact that the points made are regrettably incorrect.
I don't expect everybody commenting on the design to have perfect
knowledge of the whole design, so I expect people to make errors in
their comments. I also expect people to take note of what has been
said before making further objections or drawing conclusions.

Since at least 3 of the people making such comments did not attend the
full briefing meeting in Ottawa, I am not particularly surprised.
However, I do expect people that didn't come to the meeting to
recognise that they are likely to be missing information and to listen
closely, as I listen to them.

When the facts change, I change my mind. What do you do, sir?


 My main point in bringing this up is that if you pick a project that
 is too large, you will fail.  As I would rather see this project
 succeed, I recommend that you don't do that.  Both you and Andres seem
 to believe that MMR is a reasonable first target to shoot at, but I
 don't think anyone else - including the Slony developers who have
 commented on this issue - endorses that position.  At PGCon, you were
 talking about getting a new set of features into PG over the next 3-5
 years.  Now, it seems like you want to compress that timeline to a
 year.  I don't think that's going to work.  You also requested that
 people tell you sooner when large patches were in danger of not making
 the release.  Now I'm doing that, VERY early, and you're apparently
 angry about it.  If the only satisfactory outcome of this conversation
 is that everyone agrees with the design pattern you've already decided
 on, then you haven't left yourself very much room to walk away
 satisfied.

Note that I have already myself given review feedback to Andres and
that change has visibly occurred during this thread via public debate.

Claiming that I only stick to what has already been decided is
patently false, with me at least.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Greg Stark
On Sun, Jun 17, 2012 at 9:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The trick for hashing such datatypes is to be able to guarantee that
 equal values hash to the same hash code, which is typically possible
 as long as you know the equality rules well enough.  We could possibly
 do that for text with pure-strcoll equality if we knew all the details
 of what strcoll would consider equal, but we do not.

It occurs to me that strxfrm would answer this question. If we made
the hash function hash the result of strxfrm then we could make
equality use strcoll and not fall back to strcmp.
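
For illustration, a minimal standalone sketch of that idea (plain C, not 
PostgreSQL's hash_any()/texteq; setlocale() and error handling omitted): 
hash the strxfrm() image, and let equality be pure strcoll().

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* strings that strcoll() considers equal get identical strxfrm() images */
static uint32_t
collation_aware_hash(const char *s)
{
    size_t      need = strxfrm(NULL, s, 0) + 1;
    char       *buf = malloc(need);
    uint32_t    h = 2166136261u;            /* FNV-1a */

    strxfrm(buf, s, need);
    for (size_t i = 0; buf[i] != '\0'; i++)
    {
        h ^= (unsigned char) buf[i];
        h *= 16777619u;
    }
    free(buf);
    return h;
}

/* equality without any strcmp() tie-break */
static bool
collation_aware_equal(const char *a, const char *b)
{
    return strcoll(a, b) == 0;
}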

I suspect in a green field that's what we would do, though the CPU
cost might be enough to think hard about it. I'm not sure it's worth
considering switching, though.

The cases where it matters to users, incidentally, are when you have a
multi-column sort order and values that are supposed to sort
equal in the first column but print differently. Given that there
seems to be some controversy in the locale definitions -- most locales
seem to use insignificant factors like accents or ligatures as
tie-breakers and avoid claiming different sequences are equal even
when the language usually treats them as equivalent -- it doesn't seem
super important to maintain the property for the few locales that fall
the other way. Unless my impression is wrong and there's a good
principled reason why some locales treat nearly equivalent strings one
way and some treat them the other.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 20 June 2012 15:10, Greg Stark st...@mit.edu wrote:
 On Sun, Jun 17, 2012 at 9:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The trick for hashing such datatypes is to be able to guarantee that
 equal values hash to the same hash code, which is typically possible
 as long as you know the equality rules well enough.  We could possibly
 do that for text with pure-strcoll equality if we knew all the details
 of what strcoll would consider equal, but we do not.

 It occurs to me that strxfrm would answer this question. If we made
 the hash function hash the result of strxfrm then we could make
 equality use strcoll and not fall back to strcmp.

What about per-column collations?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] foreign key locks

2012-06-20 Thread Jaime Casanova
On Thu, Jun 14, 2012 at 11:41 AM, Alvaro Herrera
alvhe...@alvh.no-ip.org wrote:
 Hi,

 This is v12 of the foreign key locks patch.


Hi Álvaro,

Just noticed that this patch needs a rebase because of the refactoring
Tom did in ri_triggers.c

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 16:23, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 20.06.2012 11:17, Simon Riggs wrote:

 On 20 June 2012 15:45, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 On 20.06.2012 10:32, Simon Riggs wrote:


 On 20 June 2012 14:40, Heikki Linnakangas


 And I'm worried
 it might not even be enough in more complicated scenarios.


 It is not the only required conflict mechanism, and has never been
 claimed to be so. It is simply one piece of information needed, at
 various times.


 So, if the origin id is not sufficient for some conflict resolution
 mechanisms, what extra information do you need for those, and where do
 you
 put it?


 As explained elsewhere, wal_level = logical (or similar) would be used
 to provide any additional logical information required.

 Update and Delete WAL records already need to be different in that
 mode, so additional info would be placed there, if there were any.

 In the case of reflexive updates you raised, a typical response in
 other DBMS would be to represent the query
   UPDATE SET counter = counter + 1
 by sending just the +1 part, not the current value of counter, as
 would be the case with the non-reflexive update
   UPDATE SET counter = 1

 Handling such things in Postgres would require some subtlety, which
 would not be resolved in first release but is pretty certain not to
 require any changes to the WAL record header as a way of resolving it.
 Having already thought about it, I'd estimate that is a very long
 discussion and not relevant to the OT, but if you wish to have it
 here, I won't stop you.


 Yeah, I'd like to hear briefly how you would handle that without any further
 changes to the WAL record header.

I already did:

 Update and Delete WAL records already need to be different in that
 mode, so additional info would be placed there, if there were any.

The case you mentioned relates to UPDATEs only, so I would suggest
that we add that information to a new form of update record only.

That has nothing to do with the WAL record header.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Greg Stark
On Wed, Jun 20, 2012 at 3:19 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 It occurs to me that strxfrm would answer this question. If we made
 the hash function hash the result of strxfrm then we could make
 equality use strcoll and not fall back to strcmp.

 What about per-column collations?

Well collations aren't really per-column, they're specific to the
comparison in the expression at hand. The per-column collation is just
the default for comparisons against that column.

So for a hash join, for example, you would build the hash table
using strxfrm for the collation being used for the join expression.
For hash indexes the index would only be valid for a specific
collation, just like a btree index is.

But this all seems like a lot of work for a case that most locales
seem to think isn't worth worrying about.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] PANIC while doing failover (streaming replication)

2012-06-20 Thread Matheus Ricardo Espanhol
Hi,

Any news on this issue? My slave server can no longer be trusted; all the
warnings below relate to indexes of the main tables:

2012-06-14 11:46:23 BRT [18654]: [34603-1] user=,db= LOG:  recovery restart
point at 435/4E899CE8
2012-06-14 11:46:23 BRT [18654]: [34604-1] user=,db= DETAIL:  last
completed transaction was at log time 2012-06-14 11:46:22.770348-03
2012-06-14 11:48:53 BRT [18654]: [34605-1] user=,db= LOG:  restartpoint
starting: time
2012-06-14 11:51:09 BRT [18653]: [18-1] user=,db= LOG:  received promote
request
2012-06-14 11:51:09 BRT [11879]: [2-1] user=,db= FATAL:  terminating
walreceiver process due to administrator command
2012-06-14 11:51:09 BRT [18653]: [19-1] user=,db= LOG:  invalid record
length at 435/5079CAE8
2012-06-14 11:51:09 BRT [18653]: [20-1] user=,db= LOG:  redo done at
435/5079CAA8
2012-06-14 11:51:09 BRT [18653]: [21-1] user=,db= LOG:  last completed
transaction was at log time 2012-06-14 11:51:08.969771-03
2012-06-14 11:51:09 BRT [18653]: [22-1] user=,db= LOG:  selected new
timeline ID: 2
2012-06-14 11:51:09 BRT [18653]: [23-1] user=,db= LOG:  archive recovery
complete
2012-06-14 11:51:09 BRT [18653]: [24-1] user=,db= WARNING:  page 193569 of
relation base/16407/80416524 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [25-1] user=,db= WARNING:  page 129229 of
relation base/16407/38334886 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [26-1] user=,db= WARNING:  page 134146 of
relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [27-1] user=,db= WARNING:  page 318013 of
relation base/16407/38334887 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [28-1] user=,db= WARNING:  page 134143 of
relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [29-1] user=,db= WARNING:  page 156203 of
relation base/16407/38334883 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [30-1] user=,db= WARNING:  page 318009 of
relation base/16407/38334887 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [31-1] user=,db= WARNING:  page 370150 of
relation base/16407/38334888 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [32-1] user=,db= WARNING:  page 133811 of
relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [33-1] user=,db= WARNING:  page 317963 of
relation base/16407/38334884 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [34-1] user=,db= WARNING:  page 133808 of
relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [35-1] user=,db= WARNING:  page 133809 of
relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [36-1] user=,db= WARNING:  page 133810 of
relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [37-1] user=,db= WARNING:  page 129231 of
relation base/16407/38334886 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [38-1] user=,db= WARNING:  page 14329 of
relation base/16407/80430266 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [39-1] user=,db= WARNING:  page 134145 of
relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [40-1] user=,db= WARNING:  page 134144 of
relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [41-1] user=,db= WARNING:  page 317957 of
relation base/16407/38334884 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [42-1] user=,db= WARNING:  page 129230 of
relation base/16407/38334886 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [43-1] user=,db= WARNING:  page 156201 of
relation base/16407/38334883 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [44-1] user=,db= PANIC:  WAL contains
references to invalid pages
2012-06-14 11:51:09 BRT [18651]: [4-1] user=,db= LOG:  startup process (PID
18653) was terminated by signal 6: Aborted
2012-06-14 11:51:09 BRT [18651]: [5-1] user=,db= LOG:  terminating any
other active server processes

PostgreSQL version 9.1.3.

Thanks in advance for any suggestions.


-Matheus



2011/7/1 Fujii Masao masao.fu...@gmail.com

 On Fri, Jul 1, 2011 at 2:18 PM, Mikko Partio mpar...@gmail.com wrote:
  Hello list
  I have a two-machine cluster with PostgreSQL 9.0.4 and streaming
  replication. In a normal situation I did a failover -- touched the trigger
  file in standby to promote it to production mode. I have done this
  previously without any complications, but now the to-be-production database
  had a PANIC and shut itself down. From the logs:
  postgres[10751]: [2-1] 2011-06-30 17:25:24 EEST [10751]: [1-1] user=,db=
  LOG:  streaming replication successfully connected to primary
  postgres[10736]: [10-1] 2011-07-01 07:50:29 EEST [10736]: [10-1]
 user=,db=
  LOG:  trigger file found: /postgresql/data/finish_replication.trigger
  postgres[10751]: [3-1] 2011-07-01 07:50:29 EEST [10751]: [2-1] user=,db=
  FATAL:  terminating walreceiver process due to administrator command
  postgres[10736]: [11-1] 2011-07-01 07:50:30 EEST [10736]: [11-1]
 user=,db=
  LOG:  redo done at AE/8B1855C8
  postgres[10736]: 

Re: [HACKERS] sortsupport for text

2012-06-20 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 I think that this change may have made the difference between the
 Hungarians getting away with it and not getting away with it. Might it
 have been that for text, they were using some operator that wasn't '='
 (perhaps one which has no fastpath, and thus correctly made a
 representation about equivalency) rather than texteq prior to this
 commit?

Uh, no.  There aren't any magic variants of equality now, and there were
not back then either.  I'm inclined to think that the Hungarian
problem did exist long before we fixed it.

 So, you're going to have an extra strcoll()/strxfrm() + strcmp() here,
 as part of a not-equal-but-maybe-equivalent test, which is bad.
 However, if that means that we can cache a text constant as a
 strxfrm() blob, and compare in a strxfrm()-wise fashion, that will
 more than pay for itself, even for btree traversal alone.

Um ... are you proposing to replace text btree index entries with
strxfrm values?  Because if you aren't, I don't see that this is likely
to win anything.  And if you are, it loses on the grounds of (a) index
bloat and (b) loss of ability to do index-only scans.

It would be nice to hear what others thought of these ideas before I
actually start writing a patch that fixes these problems (our
behaviour is incorrect for some locales according to the Unicode
standard), facilitates a strxfrm() optimisation, and actually adds a
 strxfrm() optimisation.

Personally I think this is not a direction we want to go in.  I think
that it's going to end up a significant performance loss in many cases,
break backwards compatibility in numerous ways, and provide a useful
behavioral improvement to only a small minority of users.  If you check
the archives, we get far more complaints from people who think their
locale definition is wrong than from people who are unhappy because
we don't hew exactly to the corner cases of their locale.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread David Fetter
Folks,

A co-worker filed a bug against file_fdw where the columns in a
FOREIGN TABLE were scrambled on SELECT.  It turned out that this comes
from the (yes, it's documented, but since it's documented in a place
not obviously linked to the bug, it's pretty useless) feature of
COPY CSV HEADER whereby the header line is totally ignored in COPY
OUT.

Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
the header line should be parsed to establish which columns are
where, and the output rearranged if needed.
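
By way of a hypothetical sketch (plain C, nothing to do with copy.c; all 
names made up), the mapping step amounts to this:

#include <stdio.h>
#include <string.h>

/* record, for each field in the header, which declared column it names (-1 if none) */
static void
map_header(char *header, const char *columns[], int ncols, int *map)
{
    int         field = 0;

    for (char *tok = strtok(header, ","); tok != NULL && field < ncols;
         tok = strtok(NULL, ","), field++)
    {
        map[field] = -1;
        for (int i = 0; i < ncols; i++)
            if (strcmp(tok, columns[i]) == 0)
                map[field] = i;
    }
}

int
main(void)
{
    const char *columns[] = {"a", "b", "c"};
    char        header[] = "b,c,a";         /* header in a different order */
    int         map[3];

    map_header(header, columns, 3, map);
    for (int i = 0; i < 3; i++)
        printf("file field %d -> table column %d\n", i, map[i]);
    return 0;
}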

I'm proposing to make the code change here:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436

and a suitable doc change that talks about reading the header only for
the purpose of matching column names to columns, and throwing away the
output as before.

What say?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-20 Thread Alvaro Herrera

Excerpts from Florian Pflug's message of mié jun 20 06:35:29 -0400 2012:
 On Jun19, 2012, at 17:36 , Robert Haas wrote:
  On Mon, Jun 18, 2012 at 1:42 PM, Martijn van Oosterhout
  klep...@svana.org wrote:
  On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote:
  The fly in the ointment with any of these ideas is that the configure
  list is not a list of exact cipher names, as per Magnus' comment that
  the current default includes tests like !aNULL.  I am not sure that
  we know how to evaluate such conditions if we are applying an
  after-the-fact check on the selected cipher.  Does OpenSSL expose any
  API for evaluating whether a selected cipher meets such a test?
  
  I'm not sure whether there's an API for it, but you can certainly check
  manually with openssl ciphers -v, for example:
  
  $ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP'
  NULL-SHASSLv3 Kx=RSA  Au=RSA  Enc=None  Mac=SHA1
  NULL-MD5SSLv3 Kx=RSA  Au=RSA  Enc=None  Mac=MD5
  
  ...etc...
  
  So unless the openssl includes the code twice there must be a way to
  extract the list from the library.
  
  There doubtless is, but I'd be willing to wager that you won't be
  able to figure out the exact method without reading the source code
  for 'openssl ciphers' to see how it was done there, and most likely
  you'll find that at least one of the functions they use has no man
  page.  Documentation isn't their strong point.
 
 Yes, unfortunately.

I looked at the code (apps/ciphers.c) and it looks pretty easy to obtain
the list of ciphers, starting from the stringified configuration
parameter, and iterate over them.  The problem is figuring out whether any
given cipher meets some criteria; all the stuff that the command prints
after the cipher name comes from a "get cipher description" API call, and
it doesn't look like there's any simple way of getting the individual
bits in some better form (assuming we don't want to parse the
description string).

Now if the cipher name is enough for whatever it is that we want, then
that looks easy.
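
For what it's worth, a rough sketch of that approach (OpenSSL 1.0-era calls, 
error handling omitted; roughly what apps/ciphers.c does, boiled down):

#include <stdio.h>
#include <openssl/ssl.h>

static void
list_ciphers(const char *spec)
{
    SSL_CTX    *ctx;
    SSL        *ssl;
    STACK_OF(SSL_CIPHER) *sk;
    int         i;

    SSL_library_init();
    ctx = SSL_CTX_new(SSLv23_method());
    SSL_CTX_set_cipher_list(ctx, spec);
    ssl = SSL_new(ctx);
    sk = SSL_get_ciphers(ssl);

    for (i = 0; i < sk_SSL_CIPHER_num(sk); i++)
    {
        const SSL_CIPHER *c = sk_SSL_CIPHER_value(sk, i);

        /* the name is structured; everything else needs the description string */
        printf("%s\n", SSL_CIPHER_get_name(c));
    }

    SSL_free(ssl);
    SSL_CTX_free(ctx);
}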

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-20 Thread Tom Lane
Amit Kapila amit.kap...@huawei.com writes:
 I'm almost inclined to suggest that we not get next-LSN from WAL, but
 by scanning all the pages in the main data store and computing the max
 observed LSN.  This is clearly not very attractive from a performance
 standpoint, but it would avoid the obvious failure mode where you lost
 some recent WAL segments along with pg_control.

 According to my analysis, this will have some problem. 

I think you're missing the point.  There is no possible way to guarantee
database consistency after applying pg_resetxlog, unless the database
had been cleanly shut down beforehand.  The reset will lose the xlog
information that was needed to restore consistency.  So arguing from
examples that demonstrate this is rather pointless.  Rather, the value
of pg_resetxlog is to be able to start the database at all so that info
can be extracted from it.  What we are looking for is not perfection,
because that's impossible, but just to not make a bad situation worse.
The reason I'm concerned about selecting a next-LSN that's certainly
beyond every LSN in the database is that not doing so could result in
introducing further corruption, which would be entirely avoidable with
more care in choosing the next-LSN.
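
The scan-the-data-files idea quoted above could look roughly like the sketch 
below (assumptions of mine, not from this thread: 8kB pages, and a page 
header that starts with the LSN stored as two 32-bit halves, high word first; 
real code would use PageGetLSN() and walk the whole data directory):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 8192             /* assumed default BLCKSZ */

static uint64_t
max_lsn_in_file(FILE *fp)
{
    char        page[BLOCK_SIZE];
    uint64_t    max_lsn = 0;

    while (fread(page, 1, BLOCK_SIZE, fp) == BLOCK_SIZE)
    {
        uint32_t    hi, lo;
        uint64_t    lsn;

        memcpy(&hi, page, sizeof(hi));              /* pd_lsn, high half */
        memcpy(&lo, page + sizeof(hi), sizeof(lo)); /* pd_lsn, low half */
        lsn = ((uint64_t) hi << 32) | lo;
        if (lsn > max_lsn)                          /* all-zero pages contribute 0 */
            max_lsn = lsn;
    }
    return max_lsn;
}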

 Pg_resetxlog -
 It will generate the next-LSN point as 109, which when used for recovery will
 produce an inconsistent database.
 However, if we had relied on WAL, it would have got the next-LSN as 107.

Umm ... the entire point of pg_resetxlog is to throw away WAL.  Not to
rely on it.

It's conceivable that there would be some use in a tool that searches
the available WAL files for the latest checkpoint record and recreates a
pg_control file pointing at that checkpoint, without zapping the WAL
files.  This would be much different in purpose and usage from
pg_resetxlog, though.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Release versioning inconsistency

2012-06-20 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 On ons, 2012-06-20 at 13:26 +0200, Magnus Hagander wrote:
 That might actually be a good idea. We can't really change the way we
 named the betas, but it's not too late to consider naming the actual
 release as 9.2.0...

 The final release was always going to be called 9.2.0, but naming the
 beta 9.2.0betaX is wrong.  There was a previous discussion about that
 particular point.

Yes.  There is no reason to change the naming scheme we have been using
for years now (at least since version_stamp.pl was invented for 7.4).
The only problem is that somebody got the name of the directory wrong on
the FTP server.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pg default's verbosity?

2012-06-20 Thread Peter Eisentraut
On tis, 2012-06-19 at 02:15 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  There might be something to the idea of demoting a few of the things
  we've traditionally had as NOTICEs, though.  IME, the following two
  messages account for a huge percentage of the chatter:
 
  NOTICE:  CREATE TABLE will create implicit sequence foo_a_seq for
  serial column foo.a
  NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
  foo_pkey for table foo
 
 Personally, I'd have no problem with flat-out dropping (not demoting)
 both of those two specific messages.  I seem to recall that Bruce has
 lobbied for them heavily in the past, though.

I don't like these messages any more than the next guy, but why drop
only those, and not any of the other NOTICE-level messages?  The meaning
of NOTICE is pretty much, if this is the first time you're using
PostgreSQL, let me tell you a little bit about how we're doing things
here.  If you've run your SQL script more than 3 times, you won't need
them anymore.  So set your client_min_messages to WARNING then.  That
should be pretty much standard for running SQL scripts, in addition to
all the other stuff listed here:
http://petereisentraut.blogspot.fi/2010/03/running-sql-scripts-with-psql.html


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-20 Thread Tom Lane
Florian Pflug f...@phlo.org writes:
 I wonder though if we shouldn't restrict the allowed ciphers list to being
 a simple list of supported ciphers. If our goal is to support multiple
 SSL libraries transparently then surely having openssl-specific syntax
 in the config file isn't exactly great anyway...

No, we don't want to go there, because then we'd have to worry about
keeping the default list in sync with what's supported by the particular
version of the particular library we chance to be using.  That's about
as far from transparent as you can get.  A notation like DEFAULT
is really quite ideal for our purposes in that respect.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Kevin Grittner
Simon Riggs si...@2ndquadrant.com wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Heikki Linnakangas  wrote:

 I don't like the idea of adding the origin id to the record
 header. It's only required in some occasions, and on some record
 types.

 Right.
 
 Wrong, as explained.
 
The point is not wrong; you are simply not responding to what is
being said.
 
You have not explained why an origin ID is required when there is no
replication, or if there is master/slave logical replication, or
there are multiple masters with non-overlapping primary keys
replicating to a single table in a consolidated database, or each
master replicates to all other masters directly, or any of various
other scenarios raised on this thread.  You've only explained why
it's necessary for certain configurations of multi-master
replication where all rows in a table can be updated on any of the
masters.  I understand that this is the configuration you find most
interesting, at least for initial implementation.  That does not
mean that the other situations don't exist as use cases or should be
not be considered in the overall design.
 
I don't think there is anyone here who would not love to see this
effort succeed, all the way to multi-master replication in the
configuration you are emphasizing.  What is happening is that people
are expressing concerns about parts of the design which they feel
are problematic, and brainstorming about possible alternatives.  As
I'm sure you know, fixing a design problem at this stage in
development is a lot less expensive than letting the problem slide
and trying to deal with it later.
 
 It isn't true that this is needed only for some configurations of
 multi-master, per discussion.
 
I didn't get that out of the discussion; I saw a lot of cases
mentioned as not needing it to which you simply did not respond.
 
 This is not transaction metadata, it is WAL record metadata
 required for multi-master replication, see later point.
 
 We need to add information to every WAL record that is used as the
 source for generating LCRs.
 
If the origin ID of a transaction doesn't count as transaction
metadata (i.e., data about the transaction), what does?  It may be a
metadata element about which you have special concerns, but it is
transaction metadata.  You don't plan on supporting individual WAL
records within a transaction containing different values for origin
ID, do you?  If not, why is it something to store in every WAL
record rather than once per transaction?  That's not intended to be
a rhetorical question.  I think it's because you're still thinking
of the WAL stream as *the medium* for logical replication data
rather than *the source* of logical replication data.
 
As long as the WAL stream is the medium, options are very
constrained.  You can code a very fast engine to handle a single
type of configuration that way, and perhaps that should be a
supported feature, but it's not a configuration I've needed yet. 
(Well, on reflection, if it had been available and easy to use, I
can think of *one* time I *might* have used it for a pair of nodes.)
It seems to me that you are so focused on this one use case that you
are not considering how design choices which facilitate fast
development of that use case paint us into a corner in terms of
expanding to other use cases.
 
 I think removing origin ID from this patch and submitting a
 separate patch for a generalized transaction metadata system is
 the sensible way to go.
 
 We already have a very flexible WAL system for recording data of
 interest to various resource managers. If you wish to annotate a
 transaction, you can either generate a new kind of WAL record or
 you can enhance a commit record.
 
Right.  Like many of us are suggesting should be done for origin ID.
 
 XLOG_NOOP records can already be generated by your application if
 you wish to inject additional metadata to the WAL stream. So no
 changes are required for you to implement the generalised
 transaction metadata scheme you say you require.
 
I'm glad it's that easy.  Are there SQL functions for that yet?
 
 Not sure how or why that relates to requirements for multi-master.
 
That depends on whether you want to leave the door open to other
logical replication than the one use case on which you are currently
focused.  I even consider some of those other cases multi-master,
especially when multiple databases are replicating to a single table
on another server.  I'm not clear on your definition -- it seems to
be rather more narrow.  Maybe we need to define some terms somewhere
to facilitate discussion.  Is there a Wiki page where that would
make sense?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund and...@2ndquadrant.com wrote:
 We're not the only ones here who are performing scope creep, though... I think
 just about everyone who has posted in the whole thread, except maybe Tom and
 Marko, is guilty of doing so.

 I still think it's rather sensible to focus on exactly duplicated schemas in a
 very first version, just because that leaves out some of the complexity while
 paving the road for other nice things.

Well, I guess what I want to know is: what does focusing on exactly
duplicated schemas mean?  If it means we'll disable DDL for tables
when we turn on replication, that's basically the Slony approach: when
you want to make a DDL change, you have to quiesce replication, do it,
and then resume replication.  I would possibly be OK with that
approach.  If it means that we'll hope that the schemas are duplicated
and start spewing garbage data when they're not, then I'm definitely
not OK with that approach.  If it means using event triggers to keep
the catalogs synchronized, then I don't think that's adequately
robust.  The user could add more event triggers that run before or
after the ones the replication system adds, and then you are back to
garbage decoding (or crashes).  They could also modify the catalogs
directly, although it's possible we don't care quite as much about
that case (but on the other hand people do sometimes need to do it to
solve real problems).  Although I am 100% OK with paring back the
initial feature set - indeed, I strongly
recommend it - I think that robustness is not a feature which can be
left out in v1 and added in later.  All the robustness has to be
designed in at the start, or we will never have it.

On the whole, I think we're spending far too much time talking about
code and far too little time talking about what the overall design
should look like.  We are having a discussion about whether or not MMR
should be supported by sticking a 16-bit node ID into every WAL record
without having first decided whether we should support MMR, whether
that requires node IDs, whether they should be integers, whether those
integers should be 16 bits in size, whether they should be present in
WAL, and whether or not the record header is the right place to put
them.  There's a right order in which to resolve those questions, and
this isn't it.  More generally, I think there is a ton of complexity
that we're probably overlooking here in focusing in on specific coding
details.  I think the most interesting comment made to date is Steve
Singer's observation that very little of Slony is concerned with
changeset extraction or apply.  Now, on the flip side, all of these
patches seem to be concerned with changeset extraction and apply.
That suggests that we're missing some pretty significant pieces
somewhere in this design.  I think those pieces are things like error
recovery, fault tolerance, user interface design, and control logic.
Slony has spent years trying to get those things right.  Whether or
not they actually have gotten them right is of course an arguable
point, but we're unlikely to do better by ignoring all of those issues
and implementing whatever is most technically expedient.

 You've got four people objecting to this patch now, all of whom happen
 to be committers.  Whether or not MMR goes into core, who knows, but
 it doesn't seem that this patch is going to fly.
 I find that a bit too early to say. Sure it won't fly exactly as proposed, but
 hell, who cares? What I want to get in is a solution to the specific problem
 the patch targets. At least you have, not sure about others, accepted that the
 problem needs a solution.
 We do not agree yet what that solution should look like, but that's not exactly
 surprising as we started discussing the problem only a good day ago.

Oh, no argument with any of that.  I strongly object to the idea of
shoving this patch through as-is, but I don't object to solving the
problem in some other, more appropriate way.  I think that won't look
much like this patch, though; it will be some new patch.

 If people agree that your proposed way of just one flag bit is the way to go
 we will have to live with that. But that's different from saying the whole
 thing is dead.

I think you've convinced me that a single flag-bit is not enough, but
I don't think you've convinced anyone that it belongs in the record
header.

 As I would rather see this project
 succeed, I recommend that you don't do that.  Both you and Andres seem
 to believe that MMR is a reasonable first target to shoot at, but I
 don't think anyone else - including the Slony developers who have
 commented on this issue - endorses that position.
 I don't think we get full MMR into 9.3. What I am proposing is that we build
 in the few pieces that are required to implement MMR *on top* of what's
 hopefully in 9.3.
 And I think that's a realistic goal.

I can't quite follow that sentence, but my general sense is that,
while you're 

Re: [HACKERS] libpq compression

2012-06-20 Thread Florian Pflug
On Jun20, 2012, at 17:34 , Tom Lane wrote:
 Florian Pflug f...@phlo.org writes:
 I wonder though if we shouldn't restrict the allowed ciphers list to being
 a simple list of supported ciphers. If our goal is to support multiple
 SSL libraries transparently then surely having openssl-specific syntax
 in the config file isn't exactly great anyway...
 
 No, we don't want to go there, because then we'd have to worry about
 keeping the default list in sync with what's supported by the particular
 version of the particular library we chance to be using.  That's about
 as far from transparent as you can get.  A notation like DEFAULT
 is really quite ideal for our purposes in that respect.

No argument with that, but does that mean we have to allow the full
syntax supported by OpenSSL (i.e., those +,-,! prefixes)? Maybe we could
map an empty list to DEFAULT and otherwise interpret it as a list of 
ciphers?

It'd make the whole NULL-cipher business easy, because once we know that
the cipher specified doesn't contain !NULL (which removes NULL *permanently*),
we can simply append NULL to allow all these ciphers plus NULL.
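
As a rough sketch of that mapping (illustration only; the helper name is
hypothetical, only SSL_CTX_set_cipher_list() is real OpenSSL API, and the
empty-list-means-DEFAULT mapping is the one suggested above):

#include <stdio.h>
#include <string.h>

#include <openssl/ssl.h>

/*
 * Hypothetical helper: pass the configured cipher string through, but unless
 * the user removed NULL permanently with "!NULL", append ":NULL" so that the
 * unencrypted ciphers stay available for the compression-over-SSL case.
 */
static int
set_cipher_list_allowing_null(SSL_CTX *ctx, const char *configured)
{
    char        buf[1024];

    if (configured == NULL || configured[0] == '\0')
        configured = "DEFAULT";         /* empty list means library default */

    if (strstr(configured, "!NULL") != NULL)
        return SSL_CTX_set_cipher_list(ctx, configured);

    snprintf(buf, sizeof(buf), "%s:NULL", configured);
    return SSL_CTX_set_cipher_list(ctx, buf);
}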

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Tom Lane
David Fetter da...@fetter.org writes:
 Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
 the header line in should be parsed to establish which columns are
 where and rearranging the output if needed.

This does not fix a POLA violation.  This is a non-backwards-compatible
new feature, which would have to have a switch to turn it on.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Andrew Dunstan



On 06/20/2012 11:02 AM, David Fetter wrote:

Folks,

A co-worker filed a bug against file_fdw where the columns in a
FOREIGN TABLE were scrambled on SELECT.  It turned out that this comes
from the (yes, it's documented, but since it's documented in a place
not obviously linked to the bug, it's pretty useless) feature of
COPY CSV HEADER whereby the header line is totally ignored in COPY
OUT.

Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
the header line in should be parsed to establish which columns are
where and rearranging the output if needed.

I'm proposing to make the code change here:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436

and a suitable doc change that talks about reading the header only for
the purpose of matching column names to columns, and throwing away the
output as before.

What say?



First you are talking about COPY IN, not COPY OUT, surely.

This is not a bug, it is documented in exactly the place that all other 
COPY options are documented. The file_fdw page refers the reader to the 
COPY docs for details. Unless you want us to duplicate the entire COPY 
docs in the file_fdw page this seems entirely reasonable.


The current behaviour was discussed at some length back when we 
implemented the HEADER feature, IIRC, and is quite intentional. I don't 
think we should alter the current behaviour, as plenty of people rely on 
it, some to my certain knowledge. I do see a reasonable case for adding 
a new behaviour which takes notice of the header line, although it's 
likely to have plenty of wrinkles.


Reordering columns like you suggest might well have a significant impact 
on COPY performance, BTW. Also note that I created the file_text_array 
FDW precisely for people who want to be able to cherry pick and reorder 
columns. See https://github.com/adunstan/file_text_array_fdw



cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-20 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 I looked at the code (apps/ciphers.c) and it looks pretty easy to obtain
 the list of ciphers starting from the stringified configuration
 parameter and iterate on them.

Do you mean that it will produce an expansion of the set of ciphers
meeting criteria like !aNULL?  If so, I think we are set; we can
easily check to see if the active cipher is in that list, no?
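
Something like the following sketch, presumably (illustration only: "conn" is
the live connection, "probe" is a throwaway SSL object whose cipher list was
set from the criteria string; SSL_get_current_cipher(), SSL_CIPHER_get_name()
and SSL_get_cipher_list() are the real OpenSSL calls):

#include <string.h>

#include <openssl/ssl.h>

/*
 * Return 1 if the cipher actually negotiated on "conn" appears in the
 * expansion OpenSSL produced for the criteria string on "probe".
 */
static int
active_cipher_matches_criteria(SSL *conn, SSL *probe)
{
    const char *active = SSL_CIPHER_get_name(SSL_get_current_cipher(conn));
    const char *candidate;
    int         i;

    for (i = 0; (candidate = SSL_get_cipher_list(probe, i)) != NULL; i++)
    {
        if (strcmp(candidate, active) == 0)
            return 1;
    }
    return 0;
}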

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
 Simon Riggs si...@2ndquadrant.com wrote:
  This is not transaction metadata, it is WAL record metadata
  required for multi-master replication, see later point.

  We need to add information to every WAL record that is used as the
  source for generating LCRs.
 If the origin ID of a transaction doesn't count as transaction
 metadata (i.e., data about the transaction), what does?  It may be a
 metadata element about which you have special concerns, but it is
 transaction metadata.  You don't plan on supporting individual WAL
 records within a transaction containing different values for origin
 ID, do you?  If not, why is it something to store in every WAL
 record rather than once per transaction?  That's not intended to be
 a rhetorical question.
It's definitely possible to store it per transaction (see the discussion around 
http://archives.postgresql.org/message-
id/201206201605.43634.and...@2ndquadrant.com) it just makes the filtering via 
the originating node a considerably more complex thing. With our proposal you 
can do it without any complexity involved, on a low level. Storing it per 
transaction means you can only stream out the data to other nodes *after* 
fully reassembling the transaction. That's a pity, especially if we go for a 
design where the decoding happens in a proxy instance.
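
For illustration, the low-level filter described above could be as simple as
this sketch (the names are hypothetical - there is no xl_origin_id field in
the record header today - but it shows why no transaction reassembly is
needed before forwarding):

#include <stdbool.h>

typedef unsigned short RepNodeId;       /* 16-bit node identifier */

/* Hypothetical, trimmed-down record header: only the field that matters here. */
typedef struct XLogRecordStub
{
    RepNodeId   xl_origin_id;           /* node on which the change originated */
    /* ... the rest of the header and the payload would follow ... */
} XLogRecordStub;

/*
 * Stateless per-record filter: forward everything that did not originate on
 * the local node; drop what did, so it is never looped back to us.
 */
static bool
should_forward(const XLogRecordStub *rec, RepNodeId local_node)
{
    return rec->xl_origin_id != local_node;
}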

Other metadata will not be needed on such a low level.

I also have to admit that I am very hesitant to start developing some generic 
transaction metadata framework atm. That seems to be a good way to spend a 
good deal of time in discussion and disagreement. Imo that's something for 
later.

 I think it's because you're still thinking
 of the WAL stream as *the medium* for logical replication data
 rather than *the source* of logical replication data.
I don't think that's true. See the above-referenced subthread for reasons why I 
think the origin id is important.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs si...@2ndquadrant.com wrote:
 But I think getting even
 single-master logical replication working well in a single release
 cycle is going to be a job and a half.

 OK, so your estimate is 1.5 people to do that. And if we have more
 people, should they sit around doing nothing?

 Oh, give me a break.  You're willfully missing my point.  And to quote
 Fred Brooks, nine women can't make a baby in one month.

 No, I'm not. The question is not how quickly can N people achieve a
 single thing, but how long will it take a few skilled people working
 on carefully selected tasks that have few dependencies between them to
 achieve something.

The bottleneck is getting the design right, not writing the code.
Selecting tasks for people to work on without an agreement on the
design will not advance the process, unless we just accept whatever
code you choose to write based on whatever design you happen to pick.

 How exactly did you arrive at your conclusion? Why is yours right and
 mine wrong?

I estimated the amount of work that would be required to do this right
and compared it to other large projects that have been successfully
done in the past.  I think you are looking at something on the order
of magnitude of the Windows port, which took about four releases to
become stable, or the SE-Linux project, which still isn't
feature-complete.  Even if it's only a HS-sized project, that took two
releases, as did streaming replication.  SSI got committed within one
release cycle, but there were several years of design and advocacy
work before any code was written, so that, too, was really a
multi-year project.

I'll confine my comments on the second part of the question to the
observation that it is a bit early to know who is right and who is
wrong, but the question could just as easily be turned on its head.

 No, I have four people who had initial objections and who have not
 commented on the fact that the points made are regrettably incorrect.

I think Kevin addressed this point better than I can.  Asserting
something doesn't make it true, and you haven't offered any rational
argument against the points that have been made, probably because
there isn't one.  We *cannot* justify stealing 100% of the available
bit space for a feature that many people won't use and may not be
enough to address the real requirement anyway.

 Since at least 3 of the people making such comments did not attend the
 full briefing meeting in Ottawa, I am not particularly surprised.
 However, I do expect people that didn't come to the meeting to
 recognise that they are likely to be missing information and to listen
 closely, as I listen to them.

Participation in the community development process is not contingent
on having flown to Ottawa in May, or on having decided to spend that
evening at your briefing meeting.  Attributing to ignorance what is
adequately explained by honest disagreement is impolite.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 20 June 2012 15:55, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Geoghegan pe...@2ndquadrant.com writes:
 I think that this change may have made the difference between the
 Hungarians getting away with it and not getting away with it. Might it
 have been that for text, they were using some operator that wasn't '='
 (perhaps one which has no fastpath, and thus correctly made a
 representation about equivalency) rather than texteq prior to this
 commit?

 Uh, no.  There aren't any magic variants of equality now, and there were
 not back then either.  I'm inclined to think that the Hungarian
 problem did exist long before we fixed it.

I was suggesting that an operator other than equality may have been
used for some reason. It probably isn't a significant issue though.

 Um ... are you proposing to replace text btree index entries with
 strxfrm values?  Because if you aren't, I don't see that this is likely
 to win anything.  And if you are, it loses on the grounds of (a) index
 bloat and (b) loss of ability to do index-only scans.

No, I'm suggesting it would probably be at least a bit of a win here
to cache the constant, and only have to do a strxfrm() + strcmp() per
comparison. Not enough to justify all the added complexity of course,
but I wouldn't seek to justify this on the basis of improvements to
the speed of btree traversal.  It would obviously be much more
valuable for tuple sorting.

 Personally I think this is not a direction we want to go in.  I think
 that it's going to end up a significant performance loss in many cases,
 break backwards compatibility in numerous ways, and provide a useful
 behavioral improvement to only a small minority of users.

I may have over-emphasized the improvement in correctness that this
would bring, which I personally consider to be very much a secondary
benefit. The fact is that this is likely to be a fairly significant
performance win, because strxfrm() is quite simply the way you're
supposed to do collation-aware sorting, and is documented as such. For
that reason, C standard library implementations should not be expected
to emphasize its performance - they assume that you're using strxfrm()
+ their highly optimised strcmp() (as I've said, the glibc
implementation is written in ASM, and makes use of hand-written SSSE3
instructions on x86_64 for example).

The only performance downside that I can think of is the added check
for equivalence for each tuple within _bt_checkkeys() - perhaps you
were thinking of something else that hasn't occurred to me though.
Were you?

The added overhead is only going to be paid once per index scan, and
not once per tuple returned by an index scan. Since equality
implies equivalence for us, there is no need to check equivalence
until something is unequal, and when that happens, and equivalence
doesn't hold, we're done. Meanwhile, that entire index scan just got
measurably faster from caching the constant's strxfrm() blob.
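
As a rough standalone illustration of that caching pattern (plain C, not
PostgreSQL code; locale taken from the environment, error handling omitted):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Transform a string once with strxfrm() into a freshly allocated buffer. */
static char *
xfrm_dup(const char *s)
{
    size_t      need = strxfrm(NULL, s, 0);
    char       *blob = malloc(need + 1);

    strxfrm(blob, s, need + 1);
    return blob;
}

int
main(void)
{
    const char *constant = "zygote";    /* e.g. the comparison constant */
    const char *probes[] = {"aardvark", "zebra", "zygote"};
    char       *cached;
    int         i;

    setlocale(LC_COLLATE, "");

    /* pay the strxfrm() cost for the constant exactly once */
    cached = xfrm_dup(constant);

    for (i = 0; i < 3; i++)
    {
        /* per comparison: one strxfrm() of the other string plus strcmp() */
        char       *other = xfrm_dup(probes[i]);

        printf("%s vs %s: %d\n", constant, probes[i], strcmp(cached, other));
        free(other);
    }

    free(cached);
    return 0;
}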

Now, maybe it is a bit funky that this equivalence test has to happen
even though the vast majority of locales don't care. If the only
alternative is to bloat up the strxfrm() blobs with the original
string, I'd judge that the funkiness is well worth it.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-20 Thread Fujii Masao
On Wed, Jun 20, 2012 at 12:40 PM, Amit Kapila amit.kap...@huawei.com wrote:
 I believe if WAL files are proper as mentioned in Alvaro's mail, the
 proposed logic should generate
 correct values.
 Do you see any problem in the logic proposed in my original mail.
 Can I resume my work on this feature?

 Maybe I'm missing your point, but... why don't you just use PITR to
 recover from the corruption of pg_control?

 AFAIK PITR can be used in a scenario where there is a base back-up and we
 have archived
 the WAL files after that, now it can use WAL files to apply on base-backup.

Yes. If you want to recover the database from a media crash like the
corruption of the pg_control file, you basically should take a base backup
and set up continuous archiving.

 In this scenario we don't know a point from where to start the next replay.
 So I believe it will be difficult to use PITR in this scenario.

You can find out the point from the complete pg_control file which was
restored from the backup.

If pg_control is corrupted, we can easily imagine that other database files
would also be corrupted. I wonder how many cases there are where only the
pg_control file gets corrupted. In that case, pg_resetxlog isn't helpful at all.
You need to use PITR instead.

Regards,

-- 
Fujii Masao

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread David Fetter
On Wed, Jun 20, 2012 at 11:47:14AM -0400, Tom Lane wrote:
 David Fetter da...@fetter.org writes:
  Rather than being totally ignored in the COPY OUT (CSV HEADER)
  case, the header line in should be parsed to establish which
  columns are where and rearranging the output if needed.
 
 This does not fix a POLA violation.  This is a
 non-backwards-compatible new feature, which would have to have a
 switch to turn it on.

OK, new proposal:

COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
an option which requires that HEADER be enabled and mandates that the
column names in the header match the columns coming in.

Has someone got a better name for this option than
KEEP_HEADER_COLUMN_NAMES?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Tom Lane
David Fetter da...@fetter.org writes:
 OK, new proposal:

 COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
 an option which requires that HEADER be enabled and mandates that the
 column names in the header match the columns coming in.

 Has someone got a better name for this option than
 KEEP_HEADER_COLUMN_NAMES?

Well, if it's just checking that the list matches, then maybe
CHECK_HEADER would do.

In your original proposal, I was rather wondering what should happen if
the incoming file didn't have the same set of columns called out in the
COPY command's column list.  (That is, while *rearranging* the columns
might be thought non-astonishing, I'm less convinced about a copy
operation that inserts or defaults columns differently from what the
command said should happen.)  If we're just checking for a match, that
question goes away.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] proposal and patch : support INSERT INTO...RETURNING with partitioned table using rule

2012-06-20 Thread John Lumby

---
Problem I'm trying to solve:
    For partitioned tables, make it possible to use RETURNING clause on INSERT INTO
    together with DO INSTEAD rule

[  Note  -  wherever I say INSERT I also mean UPDATE and DELETE ]

---
Current behaviour :

    An INSERT which has a RETURNING clause and which is to be rewritten based on
    a rule will be accepted if the rule is an unconditional DO INSTEAD.
    In general I believe "unconditional" means no WHERE clause, but in practice
    if the rule is of the form
   CREATE RULE insert_part_history as ON INSERT to history \
  DO INSTEAD SELECT history_insert_partitioned(NEW) returning NEW.id
    this is treated as "conditional" and the query is rejected.

    Testcase:
    A table T is partitioned and has two or more columns,  one of which 
    is an id column declared as
 id bigint DEFAULT nextval('history_id_seq'::regclass) NOT NULL
    and the application issues
  INSERT into history (column-list which excludes id) values () RETURNING id

    I can get the re-direction of the INSERT *without* RETURNING to work using
    either trigger or rule, in which the trigger/rule invokes a procedure, but
    whichever way I do it, I could not get this RETURNING clause to work.

    For a trigger,
  the INSERT ... RETURNING was accepted but returned no rows, (as I would
  expect), and for the RULE, the INSERT ... RETURNING was rejected with :
 
  ERROR:  cannot perform INSERT RETURNING on relation "history"
  HINT:  You need an unconditional ON INSERT DO INSTEAD rule with a RETURNING clause.

  but this hint was not much help,  since :

   For a rule,
 CREATE RULE insert_part_history as ON INSERT to history \
 DO INSTEAD SELECT history_insert_partitioned(NEW) returning NEW.id
    ERROR:  syntax error at or near "returning"
    LINE 1: ...DO INSTEAD SELECT history_insert_partitioned(NEW) returning ...

    Here the function history_insert_partitioned is something like
    CREATE FUNCTION history_insert_partitioned( NEW public.history) RETURNS BIGINT AS $$
    DECLARE
   ...
    BEGIN
   ...
   access NEW fields e.g. timestamp
   construct partitioned table name
   EXECUTE 'INSERT INTO ' partitioned table
   ...
    RETURN history_id;
    END;
    $$
    LANGUAGE plpgsql

---
Some references to discussion of this requirement :
  .  http://wiki.postgresql.org/wiki/Todo
 item Make it possible to use RETURNING together with conditional DO INSTEAD rules,
   such as for partitioning setups
  .  http://archives.postgresql.org/pgsql-general/2012-06/msg00377.php
  .  http://archives.postgresql.org/pgsql-general/2010-12/msg00542.php
  .  http://acodapella.blogspot.it/2011/06/hibernate-postgresql-table-partitioning.html

---
Proposal:
  .  I propose to extend the rule system to recognize cases where the INSERT
     query specifies RETURNING and the rule promises to return a row, and to
     then permit this query to run and return the expected row.  In effect, to
     widen the definition of "unconditional" to handle cases such as my testcase.
  .  One comment is that all the infrastructure for returning one row from the
     re-written query is already present in the code, and the non-trivial
     question is how to ensure the new design is safe in preventing any rewrite
     that actually would not return a row.
  .  In this patch, I have chosen to make use of the LIMIT clause -
     I add a side-effect implication to a LIMIT clause when it occurs in a
     rewrite of an INSERT to mean "this rule will return a row".
     So, with my patch, same testcase, same function history_insert_partitioned
     and new rule
    CREATE RULE insert_part_history as ON INSERT to history \
   DO INSTEAD SELECT history_insert_partitioned(NEW) LIMIT 1
     the INSERT is accepted and returns the id.
     This use of the LIMIT clause is probably contentious but I wished to avoid
     introducing new SQL syntax, and the LIMIT clause does have a connotation of
     returning rows.

---
I attach a patch based on a clone of postgresql.git as of yesterday (120619-145751 EST).
I have tested the patch with INSERT and UPDATE (not tested with DELETE but should work).
The patch is not expected to be final but just to show how I did it.

John
  --- src/backend/optimizer/plan/planner.c.orig	2012-06-19 14:59:21.264574275 -0400
+++ src/backend/optimizer/plan/planner.c	2012-06-19 15:10:54.776590758 -0400
@@ -226,7 +226,8 @@ standard_planner(Query *parse, int curso
 
 	result->commandType = parse->commandType;
 	result->queryId = parse->queryId;
-	result->hasReturning = (parse->returningList != 

Re: [HACKERS] libpq compression

2012-06-20 Thread Alvaro Herrera
Excerpts from Tom Lane's message of mié jun 20 11:49:51 -0400 2012:
 
 Alvaro Herrera alvhe...@commandprompt.com writes:
  I looked at the code (apps/ciphers.c) and it looks pretty easy to obtain
  the list of ciphers starting from the stringified configuration
  parameter and iterate on them.
 
 Do you mean that it will produce an expansion of the set of ciphers
 meeting criteria like !aNULL?

Attached is a simple program that does that.  You pass 'ALL:!aNULL' as
its first arg and it produces such a list.

 If so, I think we are set; we can
 easily check to see if the active cipher is in that list, no?

Great.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <openssl/err.h>
#include <openssl/ssl.h>

int main(int argc, char *argv[])
{
    const SSL_METHOD *method = TLSv1_client_method();
    SSL_CTX    *ctx;
    SSL        *ssl = NULL;
    char       *ciphers;
    int         i;

    if (argc < 2)
    {
        fprintf(stderr, "ciphers not specified\n");
        exit(1);
    }

    ciphers = argv[1];

    SSL_library_init();

    ctx = SSL_CTX_new(method);
    if (!ctx)
    {
        fprintf(stderr, "something went wrong\n");
        exit(1);
    }

    if (!SSL_CTX_set_cipher_list(ctx, ciphers))
    {
        fprintf(stderr, "unable to set cipher list\n");
        exit(1);
    }

    ssl = SSL_new(ctx);
    if (!ssl)
    {
        fprintf(stderr, "unable to create the SSL object\n");
        exit(1);
    }

    for (i = 0;; i++)
    {
        const char *cipher;

        cipher = SSL_get_cipher_list(ssl, i);
        if (cipher == NULL)
        {
            fprintf(stderr, "end of cipher list?\n");
            break;
        }
        printf("cipher: %s\n", cipher);
    }

    SSL_CTX_free(ctx);
    SSL_free(ssl);

    return 0;
}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 No, I'm suggesting it would probably be at least a bit of a win here
 to cache the constant, and only have to do a strxfrm() + strcmp() per
 comparison.

Um, have you got any hard evidence to support that notion?  The
traditional advice is that strcoll is faster than using strxfrm unless
the same strings are to be compared repeatedly.  I'm not convinced that
saving the strxfrm on just one side will swing the balance.

 The fact is that this is likely to be a fairly significant
 performance win, because strxfrm() is quite simply the way you're
 supposed to do collation-aware sorting, and is documented as such. For
 that reason, C standard library implementations should not be expected
 to emphasize its performance - they assume that you're using strxfrm()
 + their highly optimised strcmp()

Have you got any evidence in support of this claim, or is it just
wishful thinking about what's likely to be inside libc?  I'd also note
that any comparisons you may have seen about this are certainly not
accounting for the effects of data bloat from strxfrm (ie, possible
spill to disk, more merge passes, etc).

In any case, if you have to redefine the meaning of equality in order
to justify a performance patch, I'm prepared to walk away at the start.
The range of likely performance costs/benefits across different locales
and different implementations is so wide that if you can't show it to be
a win even with the strcmp tiebreaker, it's not likely to be a reliable
win without that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-20 Thread Magnus Hagander
On Wed, Jun 20, 2012 at 12:35 PM, Florian Pflug f...@phlo.org wrote:
 On Jun19, 2012, at 17:36 , Robert Haas wrote:
 On Mon, Jun 18, 2012 at 1:42 PM, Martijn van Oosterhout
 klep...@svana.org wrote:
 On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote:
 The fly in the ointment with any of these ideas is that the configure
 list is not a list of exact cipher names, as per Magnus' comment that
 the current default includes tests like !aNULL.  I am not sure that
 we know how to evaluate such conditions if we are applying an
 after-the-fact check on the selected cipher.  Does OpenSSL expose any
 API for evaluating whether a selected cipher meets such a test?

 I'm not sure whether there's an API for it, but you can certainly check
 manually with openssl ciphers -v, for example:

 $ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP'
 NULL-SHA                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=SHA1
 NULL-MD5                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=MD5

 ...etc...

 So unless the openssl includes the code twice there must be a way to
 extract the list from the library.

 There doubtless is, but I'd be willing to wager that you won't be
 able to figure out the exact method without reading the source code
 for 'opennssl ciphers' to see how it was done there, and most likely
 you'll find that at least one of the functions they use has no man
 page.  Documentation isn't their strong point.

 Yes, unfortunately.

 I wonder though if we shouldn't restrict the allowed ciphers list to being
 a simple list of supported ciphers. If our goal is to support multiple
 SSL libraries transparently then surely having openssl-specific syntax
 in the config file isn't exactly great anyway...

That is a very good point. Before we design *another* feature that
relies on it, we should verify if the syntax is compatible in the
other libraries that would be interesting (gnutls and NSS primarily),
and if it's not that at least the *functionality* exists ina
compatible way. So we don't put ourselves in a position where we can't
proceed.

And yes, that's vapourware so far. But I know at least Claes (added to
CC) has said he wants to work on it during this summer, and I've
promised to help him out and review as well, if/when he gets that far.
But even without that, we should try to keep the door to these other
library implementations as open as possible.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread David Fetter
On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote:
 David Fetter da...@fetter.org writes:
  OK, new proposal:
 
  COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
  an option which requires that HEADER be enabled and mandates that the
  column names in the header match the columns coming in.
 
  Has someone got a better name for this option than
  KEEP_HEADER_COLUMN_NAMES?
 
 Well, if it's just checking that the list matches, then maybe
 CHECK_HEADER would do.

OK

 In your original proposal, I was rather wondering what should happen
 if the incoming file didn't have the same set of columns called out
 in the COPY command's column list.  (That is, while *rearranging*
 the columns might be thought non-astonishing, I'm less convinced
 about a copy operation that inserts or defaults columns differently
 from what the command said should happen.)  If we're just checking
 for a match, that question goes away.

Let's imagine we have a table foo(id serial, t text, n numeric) and a
CSV file foo.csv with headers (n, t).  Just to emphasize, the column
ordering is different in the places where they match.

Would

COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER, CHECK_HEADER)

now just work (possibly with some performance penalty) and 

COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER)

only work, i.e. import gobbledygook, in the case where every t entry
in foo.csv happened to be castable to NUMERIC?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Alvaro Herrera

Excerpts from David Fetter's message of mié jun 20 12:43:31 -0400 2012:
 On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote:

  In your original proposal, I was rather wondering what should happen
  if the incoming file didn't have the same set of columns called out
  in the COPY command's column list.  (That is, while *rearranging*
  the columns might be thought non-astonishing, I'm less convinced
  about a copy operation that inserts or defaults columns differently
  from what the command said should happen.)  If we're just checking
  for a match, that question goes away.
 
 Let's imagine we have a table foo(id serial, t text, n numeric) and a
 CSV file foo.csv with headers (n, t).  Just to emphasize, the column
 ordering is different in the places where they match.

For the record, IIRC we had one person trying to do this in the Spanish
list not long ago.

Another related case: you have a file with headers and columns (n, t, x, y, z)
but your table only has n and t.  How would you tell COPY to discard the
junk columns?  Currently it just complains that they are there.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
Hi,

On Wednesday, June 20, 2012 05:44:09 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  We're not the only ones here that are performing scope creep though... I
  think about all people who have posted in the whole thread except maybe
  Tom and Marko are guilty of doing so.
  
  I still think its rather sensible to focus on exactly duplicated schemas
  in a very first version just because that leaves out some of the
  complexity while paving the road for other nice things.
 
 Well, I guess what I want to know is: what does focusing on exactly
 duplicated schemas mean?  If it means we'll disable DDL for tables
 when we turn on replication, that's basically the Slony approach: when
 you want to make a DDL change, you have to quiesce replication, do it,
 and then resume replication.  I would possibly be OK with that
 approach.  If it means that we'll hope that the schemas are duplicated
 and start spewing garbage data when they're not, then I'm not
 definitely not OK with that approach.  If it means using event
 triggers to keep the catalogs synchronized, then I don't think I don't
 think that's adequately robust.  The user could add more event
 triggers that run before or after the ones the replication system
 adds, and then you are back to garbage decoding (or crashes). 
I would prefer the event trigger way because that seems to be the most 
extensible/reusable. It would allow fully replicated databases and catalog-only 
instances.
I think we need to design event triggers in a way you cannot simply circumvent 
them. We already have the case that if users try to screw around with system 
triggers we give back wrong answers with the planner relying on foreign keys 
btw.
If the problem is having user triggers run after system triggers: Let's make that 
impossible. Forbidding DDL on the other instances once we have that isn't that 
hard.

Perhaps all that will get simpler if we can make reading the catalog via 
custom built snapshots work as you proposed elsewhere in this thread. That 
would make checking errors much easier even if you just want to apply to 
a database with exactly the same schema. That's the next thing I plan to work 
on.

 They could also modify the catalogs directly, although it's possible we
 don't care quite as much about that case (but on the other hand people
 do sometimes need to do it to solve real problems).
With that you already can crash the database perfectly fine today. I think 
trying to care for that is a waste of time.

 Although I am
 100% OK with pairing back the initial feature set - indeed, I strongly
 recommend it - I think that robustness is not a feature which can be
 left out in v1 and added in later.  All the robustness has to be
 designed in at the start, or we will never have it.
I definitely don't intend to cut down on robustness.

 On the whole, I think we're spending far too much time talking about
 code and far too little time talking about what the overall design
 should look like.
Agreed.

 We are having a discussion about whether or not MMR
 should be supported by sticking a 16-bit node ID into every WAL record
 without having first decided whether we should support MMR, whether
 that requires node IDs, whether they should be integers, whether those
 integers should be 16 bits in size, whether they should be present in
 WAL, and whether or not the record header is the right place to put
 them.  There's a right order in which to resolve those questions, and
 this isn't it.  More generally, I think there is a ton of complexity
 that we're probably overlooking here in focusing in on specific coding
 details.  I think the most interesting comment made to date is Steve
 Singer's observation that very little of Slony is concerned with
 changeset extraction or apply.  Now, on the flip side, all of these
 patches seem to be concerned with changeset extraction and apply.
 That suggests that we're missing some pretty significant pieces
 somewhere in this design.  I think those pieces are things like error
 recovery, fault tolerance, user interface design, and control logic.
 Slony has spent years trying to get those things right.  Whether or
 not they actually have gotten them right is of course an arguable
 point, but we're unlikely to do better by ignoring all of those issues
 and implementing whatever is most technically expedient.
I agree that the focus isn't 100% optimal and that there are *loads* of issues 
we haven't even started to look at. But you need a point to start and 
extraction & apply seems to be a good one because you can actually test it 
without the other issues solved, which is not really the case the other way 
round.
Also it's possible to plug in the newly built changeset extraction into 
existing solutions to make them more efficient while retaining most of their 
respective framework.

So I disagree that that's the wrong part to start with.

  You've got four people objecting to this patch now, 

Re: [HACKERS] foreign key locks

2012-06-20 Thread Tom Lane
Jaime Casanova ja...@2ndquadrant.com writes:
 On Thu, Jun 14, 2012 at 11:41 AM, Alvaro Herrera
 alvhe...@alvh.no-ip.org wrote:
 This is v12 of the foreign key locks patch.

 Just noticed that this patch needs a rebase because of the refactoring
 Tom did in ri_triggers.c

Hold on a bit before you work on that code --- I've got one more bit of
hacking I want to try before walking away from it.  I did some oprofile
work on Dean's example from
caezatcwm8m00ra814o4dc2cd_aj44gqlb0tddxmha312qg7...@mail.gmail.com
and noticed that it looks like ri_FetchConstraintInfo is eating a
noticeable fraction of the runtime, which is happening because it is
called to deconstruct the relevant pg_constraint row for each tuple
we consider firing the trigger for (and then again, if we do fire the
trigger).  I'm thinking it'd be worth maintaining a backend-local cache
for the deconstructed data, and am going to go try that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Andrew Dunstan



On 06/20/2012 12:18 PM, Tom Lane wrote:

David Fetterda...@fetter.org  writes:

OK, new proposal:
COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
an option which requires that HEADER be enabled and mandates that the
column names in the header match the columns coming in.
Has someone got a better name for this option than
KEEP_HEADER_COLUMN_NAMES?

Well, if it's just checking that the list matches, then maybe
CHECK_HEADER would do.

In your original proposal, I was rather wondering what should happen if
the incoming file didn't have the same set of columns called out in the
COPY command's column list.  (That is, while *rearranging* the columns
might be thought non-astonishing, I'm less convinced about a copy
operation that inserts or defaults columns differently from what the
command said should happen.)  If we're just checking for a match, that
question goes away.





In the past people have asked me to have copy use the CSV header line in 
place of supplying a column list in the COPY command. I can certainly 
see some utility in that, and I think it might achieve what David wants. 
Using that scenario it would be an error to supply an explicit column 
list and also use the header line. But then I don't think CHECK_HEADER 
would be the right name for the option. In any case, specifying a name 
before we settle on an exact behaviour seems wrong :-)


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Andrew Dunstan



On 06/20/2012 12:50 PM, Alvaro Herrera wrote:
Another related case: you have a file with headers and columns (n, t, 
x, y, z) but your table only has n and t. How would you tell COPY to 
discard the junk columns? Currently it just complains that they are 
there. 



That's one of the main use cases for file_text_array_fdw.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 23:34, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Simon Riggs si...@2ndquadrant.com wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Heikki Linnakangas  wrote:

 I don't like the idea of adding the origin id to the record
 header. It's only required in some occasions, and on some record
 types.

 Right.

 Wrong, as explained.

 The point is not wrong; you are simply not responding to what is
 being said.

Heikki said that the origin ID was not required for all MMR
configs/scenarios. IMHO that is wrong, with explanation given.

By agreeing with him, I assumed you were sharing that assertion,
rather than saying something else.


 You have not explained why an origin ID is required when there is no
 replication, or if there is master/slave logical replication, or
...

You're right; I never claimed it was needed. Origin Id is only needed
for multi-master replication and that is the only context in which I've
discussed it.


 This is not transaction metadata, it is WAL record metadata
 required for multi-master replication, see later point.

 We need to add information to every WAL record that is used as the
 source for generating LCRs.

 If the origin ID of a transaction doesn't count as transaction
 metadata (i.e., data about the transaction), what does?  It may be a
 metadata element about which you have special concerns, but it is
 transaction metadata.  You don't plan on supporting individual WAL
 records within a transaction containing different values for origin
 ID, do you?  If not, why is it something to store in every WAL
 record rather than once per transaction?  That's not intended to be
 a rhetorical question.  I think it's because you're still thinking
 of the WAL stream as *the medium* for logical replication data
 rather than *the source* of logical replication data.

 As long as the WAL stream is the medium, options are very
 constrained.  You can code a very fast engine to handle a single
 type of configuration that way, and perhaps that should be a
 supported feature, but it's not a configuration I've needed yet.
 (Well, on reflection, if it had been available and easy to use, I
 can think of *one* time I *might* have used it for a pair of nodes.)
 It seems to me that you are so focused on this one use case that you
 are not considering how design choices which facilitate fast
 development of that use case paint us into a corner in terms of
 expanding to other use cases.

 I think removing origin ID from this patch and submitting a
 separate patch for a generalized transaction metadata system is
 the sensible way to go.

 We already have a very flexible WAL system for recording data of
 interest to various resource managers. If you wish to annotate a
 transaction, you can either generate a new kind of WAL record or
 you can enhance a commit record.

 Right.  Like many of us are suggesting should be done for origin ID.

 XLOG_NOOP records can already be generated by your application if
 you wish to inject additional metadata to the WAL stream. So no
 changes are required for you to implement the generalised
 transaction metadata scheme you say you require.

 I'm glad it's that easy.  Are there SQL functions for that yet?


Yes, another possible design is to generate a new kind of WAL record
for the origin id.

Doing it that way will slow down multi-master by a measurable amount,
and slightly bloat the WAL stream.

The proposed way uses space that is currently wasted and likely to
remain so. Only 2 bytes of 6 bytes available are proposed for use,
with a flag design that allows future extension if required. When MMR
is not in use, the WAL records would look completely identical to the
way they look now, in size, settings and speed of writing them.

Putting the origin id onto each WAL record allows very fast and simple
stateless filtering. I suggest using it because those bytes have been
sitting there unused for close to 10 years now and no better use
springs to mind.
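
For illustration only -- this is not the real XLogRecord definition, and the
field names are invented -- the space argument looks roughly like this: with
MAXALIGN(8), a fixed header of this shape has a 2-byte alignment hole after
the one-byte fields, so a 16-bit origin id placed there does not grow the
record, and it simply stays zero when multi-master is not in use.

#include <stdint.h>

typedef struct IllustrativeRecordHeader
{
    uint32_t    tot_len;        /* total record length */
    uint32_t    xid;            /* transaction id */
    uint32_t    rmgr_data_len;  /* length of resource-manager data */
    uint8_t     info;           /* flag bits */
    uint8_t     rmid;           /* resource manager id */
    uint16_t    origin_id;      /* proposed: occupies former padding, 0 if unused */
    uint64_t    prev;           /* pointer to previous record */
    uint32_t    crc;            /* checksum */
} IllustrativeRecordHeader;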

The proposed design is the fastest way of implementing MMR, without
any loss for non-users.

As I noted before, slowing down MMR by a small amount causes geometric
losses in performance across the whole cluster.


 Not sure how or why that relates to requirements for multi-master.

 That depends on whether you want to leave the door open to other
 logical replication than the one use case on which you are currently
 focused.  I even consider some of those other cases multi-master,
 especially when multiple databases are replicating to a single table
 on another server.  I'm not clear on your definition -- it seems to
 be rather more narrow.  Maybe we need to define some terms somewhere
 to facilitate discussion.  Is there a Wiki page where that would
 make sense?

The project is called BiDirectional Replication to ensure that people
understood this is not just multi-master. But that doesn't mean that
multi-master can't have its own specific requirements.

Adding originid is also useful for the use case 

Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Josh Berkus

 In the past people have asked me to have copy use the CSV header line in
 place of supplying a column list in the COPY command. I can certainly
 see some utility in that, and I think it might achieve what David wants.
 Using that scenario it would be an error to supply an explicit column
 list and also use the header line. But then I don't think CHECK_HEADER
 would be the right name for the option. In any case, specifying a name
 before we settle on an exact behaviour seems wrong :-)

Actually, I can see three valid and valuable-to-users behaviors here:

1) current behavior, header line is ignored.

2) CHECK HEADER: a column list is supplied, but a check header flag is
set.  If the column list and header list don't match *exactly*, you get
an error.

3) USE HEADER: no column list is supplied, but a use header flag is
set.  A column list is created to match the columns from the CSV header.
 Of necessity, this will consist of all or some of the columns in the
table.  If columns are supplied which are not in the table, then you get
an error (as well as if columns are missing which are NOT NULL, as you
get at present).

(2) is more valuable to people who want to check data integrity
rigorously, and test for unexpected API changes.  (3) is more valuable
for regular users who want CSV import to just work.  (1) is valuable
for backwards compatibility, and for cases where the CSV header was
generated by another program (Excel?) so the column names don't match.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Christopher Browne
On Wed, Jun 20, 2012 at 11:50 AM, Andres Freund and...@2ndquadrant.com wrote:
 On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
 Simon Riggs si...@2ndquadrant.com wrote:
  This is not transaction metadata, it is WAL record metadata
  required for multi-master replication, see later point.

  We need to add information to every WAL record that is used as the
  source for generating LCRs.
 If the origin ID of a transaction doesn't count as transaction
 metadata (i.e., data about the transaction), what does?  It may be a
 metadata element about which you have special concerns, but it is
 transaction metadata.  You don't plan on supporting individual WAL
 records within a transaction containing different values for origin
 ID, do you?  If not, why is it something to store in every WAL
 record rather than once per transaction?  That's not intended to be
 a rhetorical question.
 It's definitely possible to store it per transaction (see the discussion around
 http://archives.postgresql.org/message-
 id/201206201605.43634.and...@2ndquadrant.com) it just makes the filtering via
 the originating node a considerably more complex thing. With our proposal you
 can do it without any complexity involved, on a low level. Storing it per
 transaction means you can only stream out the data to other nodes *after*
 fully reassembling the transaction. That's a pity, especially if we go for a
 design where the decoding happens in a proxy instance.

I guess I'm not seeing the purpose of having the origin node id in the
WAL stream either.

We have it in the Slony sl_log_* stream, however there is a crucial
difference, in that sl_log_* is expressly a shared structure.  In
contrast, WAL isn't directly sharable; you don't mix together multiple
WAL streams.

It seems as though the point in time at which you need to know the
origin ID is the moment at which you're deciding to read data from the
WAL files, and knowing which stream you are reading from is an
assertion that might be satisfied by looking at configuration that
doesn't need to be in the WAL stream itself.  It might be *nice* for
the WAL stream to be self-identifying, but that doesn't seem to be
forcibly necessary.

The case where it *would* be needful is if you are in the process of
assembling together updates coming in from multiple masters, and need
to know:
   - This INSERT was replicated from node #1, so should be ignored downstream
   - That INSERT was replicated from node #2, so should be ignored downstream
   - This UPDATE came from the local node, so needs to be passed to
downstream users

Or perhaps something else is behind the node id being deeply embedded
into the stream that I'm not seeing altogether.

 Other metadata will not be needed on such a low level.

 I also have to admit that I am very hesitant to start developing some generic
 transaction metadata framework atm. That seems to be a good way to spend a
 good deal of time in discussion and disagreement. Imo that's something for
 later.

Well, I see there being a use in there being at least 3 sorts of LCR records:
a) Capturing literal SQL that is to replayed downstream.  This
parallels two use cases existing in existing replication systems:
   i) In pre-2.2 versions of Slony, statements are replayed literally.
 So there's a stream of INSERT/UPDATE/DELETE statements.
   ii) DDL capture and replay.  In existing replication systems, DDL
isn't captured implicitly, the way Dimitri's Event Triggers are to do,
but rather is captured explicitly.
   There should be a function to allow injecting such SQL explicitly;
that is sure to be a useful sort of thing to be able to do.

b) Capturing tuple updates in a binary form that can be turned readily
into heap updates on a replica.
Unfortunately, this form is likely not to play well when
replicating across platforms or Postgres versions, so I suspect that
this performance optimization should be implemented as a *last*
resort, rather than first.  Michael Jackson had some rules of
optimization that said "don't do it", and, for the expert, "don't do
it YET"...

c) Capturing tuple data in some reasonably portable and readily
re-writable form.
Slony 2.2 changes from SQL fragments (of a) i) above) to storing
updates as an array of text values indicating:
 - relation name
 - attribute names
 - attribute values, serialized into strings
I don't know that this provably represents the *BEST*
representation, but it definitely will be portable where b) would not
be, and lends itself to being able to reuse query plans, where a)
requires extraordinary amounts of parsing work, today.  So I'm pretty
sure it's better than a) and b) for a sizable set of cases.
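
Spelled out as a purely hypothetical C structure (not Slony or PostgreSQL
code), option c) would carry something like:

typedef struct PortableChange
{
    char         action;        /* 'I'nsert, 'U'pdate or 'D'elete */
    const char  *relation;      /* schema-qualified relation name */
    int          natts;         /* number of attributes carried */
    const char **att_names;     /* attribute names */
    const char **att_values;    /* values serialized to text; NULL for SQL NULL */
} PortableChange;
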
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund and...@2ndquadrant.com wrote:
 I would prefer the event trigger way because that seems to be the most
 extensible/reusable. It would allow a fully replicated databases and catalog
 only instances.
 I think we need to design event triggers in a way you cannot simply circumvent
 them. We already have the case that if users try to screw around system
 triggers we give back wrong answers with the planner relying on foreign keys
 btw.
 If the problem is having user trigger after system triggers: Lets make that
 impossible. Forbidding DDL on the other instances once we have that isn't that
 hard.

So, this is interesting.  I think something like this could solve the
problem, but then why not just make it built-in code that runs from
the same place as the event trigger rather than using the trigger
mechanism per se?  Presumably the trigger code's purpose is going to
be to inject additional data into the WAL stream (am I wrong?) which
is not something you're going to be able to do from PL/pgsql anyway,
so you don't really need a trigger, just a call to some C function -
which not only has the advantage of being not bypassable, but is also
faster.

 Perhaps all that will get simpler if we can make reading the catalog via
 custom built snapshots work as you proposed elsewhere in this thread. That
 would make checking errors much easier even if you just want to apply to
 a database with exactly the same schema. That's the next thing I plan to work
 on.

I realized a problem with that idea this morning: it might work for
reading things, but if anyone attempts to write data you've got big
problems.  Maybe we could get away with forbidding that, not sure.
Would be nice to get some input from other hackers on this.

 They could also modify the catalogs directly, although it's possible we
 don't care quite as much about that case (but on the other hand people
 do sometimes need to do it to solve real problems).
 With that you already can crash the database perfectly fine today. I think
 trying to care for that is a waste of time.

You're probably right.

 I agree that the focus isn't 100% optimal and that there are *loads* of issues
 we haven't even started to look at. But you need a point to start and
 extraction & apply seems to be a good one because you can actually test it
 without the other issues solved, which is not really the case the other way
 round.
 Also its possible to plug in the newly built changeset extraction into
 existing solutions to make them more efficient while retaining most of their
 respective framework.

 So I disagree that that's the wrong part to start with.

I think extraction is a very sensible place to start; actually, I
think it's the best possible place to start.  But this particular
thread is about adding origin_ids to WAL, which I think is definitely
not the best place to start.

 I definitely do want to provide code that generates a textual representation
 of the changes. As you say, even if it's not used for anything, it's needed for
 debugging. Not sure if it should be SQL or maybe the new Slony representation.
 If that's provided and reusable it should make sure that on top of it other
 solutions can be built.

Oh, yeah.  If we can get that, I will throw a party.

 I find your supposition that I/we just want to get MMR without regard for
 anything else a bit offensive. I wrote at least three times in this thread
 that I do think it's likely that we will not get more than the minimal basis
 for implementing MMR into 9.3. I wrote multiple times that I want to provide
 the basis for multiple solutions. The prototype - while obviously being
 incomplete - tried hard to be modular.
 You cannot blame us for wanting the work we do to be *also* usable for
 one of our major aims, can you?
 What can I do to convince you/others that I am not planning to do something
 evil but that I am trying to reach as many goals at once as possible?

Sorry.  I don't think you're planning to do something evil, but I thought
you said earlier that you did NOT want to write the code to extract
changes as text or something similar.  I think that would be a really
bad thing to skip for all kinds of reasons.  I think we need that as a
foundational technology before we do much else.  Now, once we have
that, if we can safely detect cases where it's OK to bypass decoding
to text and skip it in just those cases, I think that's great
(although possibly difficult to implement correctly).  I basically
feel that without decode-to-text, this can't possibly be a basis for
multiple solutions; it will be a basis only for itself, and extremely
difficult to debug, too.  No other replication solution can even
theoretically have any use for the raw on-disk tuple, at least not
without horrible kludgery.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] sortsupport for text

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 12:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The fact is that this is likely to be a fairly significant
 performance win, because strxfrm() is quite simply the way you're
 supposed to do collation-aware sorting, and is documented as such. For
 that reason, C standard library implementations should not be expected
 to emphasize its performance - they assume that you're using strxfrm()
 + their highly optimised strcmp()

 Have you got any evidence in support of this claim, or is it just
 wishful thinking about what's likely to be inside libc?  I'd also note
 that any comparisons you may have seen about this are certainly not
 accounting for the effects of data bloat from strxfrm (ie, possible
 spill to disk, more merge passes, etc).

 In any case, if you have to redefine the meaning of equality in order
 to justify a performance patch, I'm prepared to walk away at the start.
 The range of likely performance costs/benefits across different locales
 and different implementations is so wide that if you can't show it to be
 a win even with the strcmp tiebreaker, it's not likely to be a reliable
 win without that.

On the testing I've done, the strcmp() tie-breaking rarely gets run
anyway.  Unless you're sorting data with only a few distinct values,
most comparisons are between values that are distinct under any choice
of collation, and therefore strcoll() returns 0 very rarely, and
therefore the additional runtime it consumes does not matter very
much.  Also, it's quite a bit faster than strcoll() anyway, so even
when it does run it doesn't add much to the total time.

I think the elephant in the room here is that we're relying on the OS
to do everything for us, and the OS API we use (strcoll) requires an
extra memcpy and is also dreadfully slow.  If we could solve that
problem, it would save us a lot more than worrying about the extra
strcmp().  Of course, solving that problem is hard: we either have to
get the glibc and FreeBSD libc folks to provide a better API (that
takes lengths for each input instead of relying on trailing nul
bytes), or reimplement locales within PG, or store trailing NUL bytes
that we don't really need in the index so we can apply strcoll
directly, none of which are very appealing.
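
To make the memcpy point concrete, here is roughly the dance the comparison
path has to do today (a simplified sketch of what varstr_cmp() does; the
function name here is made up, the real code reuses static buffers and has
Windows and error-handling variants):

#include <stdlib.h>
#include <string.h>

/*
 * Compare two non-NUL-terminated strings with strcoll().  The copies exist
 * only to add the trailing NULs the API demands; allocation failure
 * handling is omitted.
 */
static int
collate_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    char   *a0 = malloc(alen + 1);
    char   *b0 = malloc(blen + 1);
    int     result;

    memcpy(a0, a, alen);
    a0[alen] = '\0';
    memcpy(b0, b, blen);
    b0[blen] = '\0';

    result = strcoll(a0, b0);

    /*
     * Tiebreak with strcmp(): some locales report "equal" for byte-wise
     * different strings, and we want a deterministic total order.
     */
    if (result == 0)
        result = strcmp(a0, b0);

    free(a0);
    free(b0);
    return result;
}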

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Marc Mamin
  -Original Message-
  From: pgsql-hackers-ow...@postgresql.org on behalf of Josh Berkus
  Sent: Wed 6/20/2012 7:06
  To: pgsql-hackers@postgresql.org
  Subject: Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

...

   (1) is valuable
  for backwards compatibility, and for cases where the CSV header was
  generated by another program (Excel?) so the column names don't match.


4) MAP_HEADER ('column 1' -> 'col_1', 'Date' -> 'input_date' ...)

to cover the case when column names do not match.

my 2 pence,

Marc Mamin


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 20 June 2012 17:41, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Geoghegan pe...@2ndquadrant.com writes:
 No, I'm suggesting it would probably be at least a bit of a win here
 to cache the constant, and only have to do a strxfrm() + strcmp() per
 comparison.

 Um, have you got any hard evidence to support that notion?  The
 traditional advice is that strcoll is faster than using strxfrm unless
 the same strings are to be compared repeatedly.  I'm not convinced that
 saving the strxfrm on just one side will swing the balance.

Fair enough, but I'd suggest that that traditional advice assumes
that strcoll()'s space-efficiency and general avoidance of dynamic
allocation are important factors, which, for us, they clearly aren't,
since we can re-use a single buffer in the manner of Robert's text
sortsupport patch for each and every per-tuple strxfrm() blob (when
traversing an index). This is a direct quote from glibc's strcoll_l.c:

 Perform the first pass over the string and while doing this find
 and store the weights for each character.  Since we want this to
 be as fast as possible we are using `alloca' to store the temporary
 values.  But since there is no limit on the length of the string
 we have to use `malloc' if the string is too long.  We should be
 very conservative here.

Here, alloca is used to allocate space in a stack frame. I believe
that this is an entirely inappropriate trade-off for Postgres to be
making. strxfrm(), in contrast, leaves buffer sizing and management
up to the caller. That has to be a big part of the problem here.
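
To be concrete, the kind of thing I have in mind looks roughly like this
(hand-wavy sketch; function and variable names are invented, buffer growth
and error handling omitted):

#include <stdlib.h>
#include <string.h>

/* Transformed image of the comparison constant, computed once per sort/scan. */
static char *const_blob;

static void
cache_constant(const char *const_val)
{
    /* strxfrm(NULL, s, 0) reports the space the transformed form needs. */
    size_t  need = strxfrm(NULL, const_val, 0) + 1;

    const_blob = malloc(need);
    strxfrm(const_blob, const_val, need);
}

/*
 * Per tuple: one strxfrm() into a caller-managed, reusable buffer (assumed
 * large enough here), then a cheap strcmp() against the cached blob.
 */
static int
cmp_against_constant(const char *tuple_val, char *workbuf, size_t workbuflen)
{
    strxfrm(workbuf, tuple_val, workbuflen);
    return strcmp(workbuf, const_blob);
}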

 The fact is that this is likely to be a fairly significant
 performance win, because strxfrm() is quite simply the way you're
 supposed to do collation-aware sorting, and is documented as such. For
 that reason, C standard library implementations should not be expected
 to emphasize its performance - they assume that you're using strxfrm()
 + their highly optimised strcmp()

 Have you got any evidence in support of this claim, or is it just
 wishful thinking about what's likely to be inside libc?

According to the Single UNIX Specification's strcoll() documentation,
"The strxfrm() and strcmp() functions should be used for sorting large
lists." If that isn't convincing enough for you, there is the fact
that glibc's strcmp() is clearly highly optimised for each and every
architecture, and that we are currently throwing away an extra
strcmp() in the event of strcoll() equality.

 I'd also note that any comparisons you may have seen about this are certainly not
 accounting for the effects of data bloat from strxfrm (ie, possible
 spill to disk, more merge passes, etc).

What about the fact that strcoll() may be repeatedly allocating and
freeing memory per comparison? The blobs really aren't that much
larger than the strings to be sorted, which are typically quite short.

 In any case, if you have to redefine the meaning of equality in order
 to justify a performance patch, I'm prepared to walk away at the start.

The advantage of my proposed implementation is precisely that I won't
have to redefine the meaning of equality, and that only the text
datatype will have to care about equivalency, so you can just skip
over an explanation of equivalency for most audiences.

If you feel that strongly about it, and I have no possible hope of
getting this accepted, I'm glad that I know now rather than after
completing a significant amount of work on this. I would like to hear
other people's opinions before I drop it though.

 The range of likely performance costs/benefits across different locales
 and different implementations is so wide that if you can't show it to be
 a win even with the strcmp tiebreaker, it's not likely to be a reliable
 win without that.

glibc is the implementation that really matters. My test-case used
en_US.UTF-8 as its locale, which has to be one of the least stressful
to strcoll() - I probably could have shown a larger improvement just
by selecting a locale that was known to have to make more passes,
like, say, hu_HU.UTF-8.

I expect to have access to a 16 core server next week, which Bull have
made available to me. Maybe I'll get some interesting performance
numbers from it.

The reason that I don't want to use the "blob with original string" hack
is that it's ugly, space-inefficient, unnecessary and objectively
incorrect, since it forces us to violate the conformance requirement
C9 of Unicode 3.0, marginal though that may be.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Josh Berkus

 
 4) MAP_HEADER ('column 1' -> 'col_1', 'Date' -> 'input_date' ...)
 
 to cover the case when column names do not match.

Personally, I think that's going way beyond what we want to do with
COPY.  At that point, check out the CSV-array FDW.

Of course, if someone writes a WIP patch which implements the above, we
can evaluate it then.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 07:17:57 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  I would prefer the event trigger way because that seems to be the most
  extensible/reusable. It would allow fully replicated databases and
  catalog-only instances.
  I think we need to design event triggers in a way you cannot simply
  circumvent them. We already have the case that if users try to screw
  around with system triggers we give back wrong answers, with the planner
  relying on foreign keys btw.
  If the problem is having user triggers after system triggers: Let's make
  that impossible. Forbidding DDL on the other instances once we have that
  isn't that hard.
 
 So, this is interesting.  I think something like this could solve the
 problem, but then why not just make it built-in code that runs from
 the same place as the event trigger rather than using the trigger
 mechanism per se?  Presumably the trigger code's purpose is going to
 be to inject additional data into the WAL stream (am I wrong?) which
 is not something you're going to be able to do from PL/pgsql anyway,
 so you don't really need a trigger, just a call to some C function -
 which not only has the advantage of being not bypassable, but is also
 faster.
I would be totally fine with that. As long as event triggers provide the
infrastructure, that shouldn't be a big problem.

  Perhaps all that will get simpler if we can make reading the catalog via
  custom-built snapshots work as you proposed elsewhere in this thread.
  That would make checking errors much easier even if you just want to
  apply to a database with exactly the same schema. That's the next thing I
  plan to work on.
 I realized a problem with that idea this morning: it might work for
 reading things, but if anyone attempts to write data you've got big
 problems.  Maybe we could get away with forbidding that, not sure.
Hm, why is writing a problem? You mean I/O conversion routines writing data?
Yes, that will be a problem. I am fine with simply forbidding that; we should
be able to catch it and provide a sensible error message, since we have had
the support for that since SSI went in.

 Would be nice to get some input from other hackers on this.
Oh, yes!


  I agree that the focus isn't 100% optimal and that there are *loads* of
  issues we haven't even started to look at. But you need a point to
  start, and extraction & apply seems to be a good one because you can
  actually test it without the other issues solved, which is not really the
  case the other way round.
  Also it's possible to plug the newly built changeset extraction into
  existing solutions to make them more efficient while retaining most of
  their respective frameworks.
  
  So I disagree that that's the wrong part to start with.
 
 I think extraction is a very sensible place to start; actually, I
 think it's the best possible place to start.  But this particular
 thread is about adding origin_ids to WAL, which I think is definitely
 not the best place to start.
Yep. I think the reason everyone started with it is that the patch was actually
really simple ;).
Note that the WAL enrichment & decoding patches were before the origin_id
patch in the patch series ;)

  I definitely do want to provide code that generates a textual
  representation of the changes. As you say, even if it's not used for
  anything, it's needed for debugging. Not sure if it should be SQL or maybe
  the new Slony representation. If that's provided and reusable it should
  make sure that on top of it other solutions can be built.
 Oh, yeah.  If we can get that, I will throw a party.
Good ;)

  I find your supposition that I/we just want to get MMR without regard for
  anything else a bit offensive. I wrote at least three times in this
  thread that I do think it's likely that we will not get more than the
  minimal basis for implementing MMR into 9.3. I wrote multiple times that
  I want to provide the basis for multiple solutions. The prototype -
  while obviously being incomplete - tried hard to be modular.
  You cannot blame us for wanting the work we do to be *also* usable for
  one of our major aims, can you?
  What can I do to convince you/others that I am not planning to do
  something evil but that I am trying to reach as many goals at once as
  possible?
 Sorry.  I don't think you're planning to do something evil, but I thought
 you said earlier that you did NOT want to write the code to extract
 changes as text or something similar.
Hm. I might have been a bit ambiguous when saying that I do not want to 
provide everything for that use-case.
Once we have a callpoint that has a correct catalog snapshot for exactly the 
tuple in question, text conversion is damn near trivial. The point where you
get passed all that information (action, tuple, table, snapshot) is the one I 
think the patch should mainly provide.
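
Concretely I am thinking of something with roughly this shape (pure invention
to illustrate the idea, not actual patch code; all names are made up):

#include "postgres.h"
#include "access/htup.h"
#include "utils/rel.h"
#include "utils/snapshot.h"

/* Hypothetical shape of the output/apply callpoint. */
typedef enum ApplyAction
{
    APPLY_INSERT,
    APPLY_UPDATE,
    APPLY_DELETE
} ApplyAction;

typedef void (*ApplyChangeCB) (ApplyAction action,
                               Relation rel,        /* target table */
                               HeapTuple oldtuple,  /* NULL for INSERT */
                               HeapTuple newtuple,  /* NULL for DELETE */
                               Snapshot snapshot);  /* catalog snapshot valid
                                                     * for exactly this tuple */

A text-conversion consumer would then just fetch the tuple descriptor via rel
and run the output functions under that snapshot.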

 I think that would be a really
 bad thing to skip for all kinds of reasons.  I think we need that as a

Re: [HACKERS] pgbench--new transaction type

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 3:48 AM, Simon Riggs si...@2ndquadrant.com wrote:
 I'm sure Jeff submitted this because of the need for a standard test,
 rather than the wish to actually modify pgbench itself.

 Can I suggest that we include a list of standard scripts with pgbench
 for this purpose? These can then be copied alongside the binary when
 we do an install.

I was thinking along similar lines myself.  At the least, I think we
can't continue to add a short option for every new test type.
Instead, maybe we could have --test-type=WHATEVER, and perhaps that
then reads whatever.sql from some compiled-in directory.  That would
allow us to sanely support a moderately large number of tests.
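
The lookup itself would be trivial; a hypothetical sketch, with SCRIPTS_DIR
and the function name standing in for whatever compiled-in location and
plumbing we settle on:

#include <stdio.h>

#define SCRIPTS_DIR "/usr/local/pgsql/share/pgbench"    /* placeholder path */

/* Map --test-type=NAME to SCRIPTS_DIR/NAME.sql. */
static void
test_type_to_path(const char *test_type, char *path, size_t pathlen)
{
    snprintf(path, pathlen, "%s/%s.sql", SCRIPTS_DIR, test_type);
}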

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format changes

2012-06-20 Thread Fujii Masao
On Wed, Jun 20, 2012 at 8:19 PM, Magnus Hagander mag...@hagander.net wrote:
 On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
 a uint64, on top of my other WAL format patches. I think we should go ahead
 with this.

 +1.

 The LSNs on pages are still stored in the old format, to avoid changing the
 on-disk format and breaking pg_upgrade. The XLogRecPtrs stored in the control
 file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
 is required.

 Seems fine.

 Should we keep the old representation in the replication protocol messages?
 That would make it simpler to write a client that works with different
 server versions (like pg_receivexlog). Or, while we're at it, perhaps we
 should mandate network-byte order for all the integer and XLogRecPtr fields
 in the replication protocol. That would make it easier to write a client
 that works across different architectures, in >= 9.3. The contents of the
 WAL would of course be architecture-dependent, but it would be nice if
 pg_receivexlog and similar tools could nevertheless be
 architecture-independent.

 I share Andres' question about how we're doing this already.  I think
 if we're going to break this, I'd rather do it in 9.3 than 5 years
 from now.  At this point it's just a minor annoyance, but it'll
 probably get worse as people write more tools that understand WAL.

 If we are looking at breaking it, and we are especially concerned
 about something like pg_receivexlog... Is it something we could/should
 change in the protocol *now* for 9.2, to make it non-broken in any
 released version? As in, can we extract just the protocol change and
 backpatch that to 9.2beta?

pg_receivexlog in 9.2 cannot correctly handle the WAL location FF
(which was skipped in 9.2 and before). For example, pg_receivexlog calls
XLByteAdvance(), which always skips FF. So even if we change the protocol,
ISTM pg_receivexlog in 9.2 cannot work well with a 9.3 server, which
might send FF. No?
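
For anyone not following the format change closely, the difference is roughly
this (simplified sketch; the types and macros are renamed with Old/New
suffixes purely for contrast with each other):

#include "postgres.h"

/*
 * 9.2 and earlier: a WAL position is a (logical file, offset) pair, and the
 * last 16MB segment of each nominally-4GB logical file (the "FF" segment)
 * is never used, so tools have to know to skip over it.
 */
typedef struct XLogRecPtrOld
{
    uint32      xlogid;         /* log file #, each nominally 4GB */
    uint32      xrecoff;        /* byte offset within that file */
} XLogRecPtrOld;

/* Proposed for 9.3: one flat 64-bit byte position, with no hole to skip. */
typedef uint64 XLogRecPtrNew;

/* Comparisons collapse from two-field logic to a plain integer compare. */
#define XLByteLT_OLD(a, b) \
    ((a).xlogid < (b).xlogid || \
     ((a).xlogid == (b).xlogid && (a).xrecoff < (b).xrecoff))
#define XLByteLT_NEW(a, b)  ((a) < (b))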

Regards,

-- 
Fujii Masao

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

