Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-20 Thread Amit Kapila
> I think you're missing the point.  There is no possible way to guarantee
> database consistency after applying pg_resetxlog, unless the database had
> been cleanly shut down beforehand.  The reset will lose the xlog information
> that was needed to restore consistency.  So arguing from examples that
> demonstrate this is rather pointless.  Rather, the value of pg_resetxlog is
> to be able to start the database at all so that info can be extracted from
> it.  What we are looking for is not perfection, because that's impossible,
> but just to not make a bad situation worse.

I take the point that we cannot reconstruct a consistent database on which further
operations can be allowed.

> The reason I'm concerned about selecting a next-LSN that's certainly beyond
> every LSN in the database is that not doing so could result in introducing
> further corruption, which would be entirely avoidable with more care in
> choosing the next-LSN.

Further corruption is only possible when we replay the wrong WAL because we
selected the wrong LSN.
Do you mean that if the next-LSN is selected from the pages, it will be a better
position from which to start replay?
On the other hand, if we don't have to replay any WAL and just take a dump, how
does the choice of next-LSN matter?

>> Pg_resetxlog -
>> It will generate the next-LSN point as 109, which when used for recovery will
>> generate an inconsistent database.
>> However, if we had relied on WAL, it would have got the next-LSN as 107.

> Umm ... the entire point of pg_resetxlog is to throw away WAL.  Not to
> rely on it.

Yes, that is right, but the solution I proposed in my first mail was not to reset
the WAL; it was to find a consistent checkpoint and generate the control file
values from it.
However, I now understand that the problem and solution definition should
consider that some WAL files may be missing.

> It's conceivable that there would be some use in a tool that searches
> the available WAL files for the latest checkpoint record and recreates a
> pg_control file pointing at that checkpoint, without zapping the WAL
> files.  This would be much different in purpose and usage from
> pg_resetxlog, though.

I will collect from the existing mails all the requirements that can be addressed
in this area, and think about it further to come up with a concrete set of
requirements that can help users overcome such situations.

Requirements or suggestions from everyone are most welcome. They will help me to
do some useful work for PostgreSQL.

With Regards,
Amit Kapila.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Backport of fsync queue compaction

2012-06-20 Thread Greg Smith
I don't want to take a bunch of time away from the active CF talking 
about this, just wanted to pass along some notes:


-A back branch release just happened a few weeks ago.  Happy to have this
dropped until the CF is over.


-Attached is a working backport of this to 8.4, with standard git 
comments in the header.  I think this one will backport happily with git 
cherry-pick.  Example provided mainly to prove that; not intended to be a 
patch submission.


-It is still possible to get extremely long running sync times with the 
improvement applied.


Since I had an 8.4 server manifesting this problem, where I could just try 
this one change, I did that.  Before, we had this:



2012-06-17 14:48:13 EDT LOG: checkpoint complete: wrote 90 buffers
(0.1%); 0 transaction log file(s) added, 0 removed, 14 recycled;
write=26.531 s, sync=4371.513 s, total=4461.058 s


After the compaction code was working (also backported the extra logging 
here) I got this instead:


2012-06-20 23:10:36 EDT LOG:  checkpoint complete: wrote 188 buffers 
(0.1%); 0 transaction log file(s) added, 0 removed, 7 recycled; 
write=31.975 s, sync=3064.270 s, total=3096.263 s; sync files=308, 
longest=482.200 s, average=9.948 s


So the background writer still took a long time due to starvation from 
clients.  But the backend-side latency impact wasn't nearly as bad: 
the peak load average didn't jump into the hundreds, it only 
got 10 to 20 clients behind on things.


Anyway, larger discussion around this and related OS tuning is a better 
topic for pgsql-performance, will raise this there when I've sorted that 
out a bit more clearly.
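
For anyone skimming this without reading the patch: the backported commit
(message below) compacts the shared fsync request queue when a backend finds it
full, dropping duplicate requests so the backend usually doesn't have to fall
back to doing its own fsync.  Here is a rough standalone sketch of that dedup
pass, with made-up structure and function names; the real code does this under
BgWriterCommLock and uses a hash table rather than a quadratic scan:

#include <stdio.h>
#include <stdbool.h>

/* Illustrative stand-in for a backend fsync request (not the real struct). */
typedef struct FsyncRequest
{
	unsigned int relfile;	/* which relation file */
	unsigned int segno;		/* which 1GB segment of it */
} FsyncRequest;

/*
 * Compact a full request queue in place by dropping duplicate entries,
 * preserving the order of first occurrences.  Returns the new length.
 */
static int
compact_request_queue(FsyncRequest *queue, int len)
{
	int		keep = 0;

	for (int i = 0; i < len; i++)
	{
		bool	dup = false;

		for (int j = 0; j < keep; j++)
		{
			if (queue[j].relfile == queue[i].relfile &&
				queue[j].segno == queue[i].segno)
			{
				dup = true;
				break;
			}
		}
		if (!dup)
			queue[keep++] = queue[i];
	}
	return keep;
}

int
main(void)
{
	FsyncRequest q[] = {{1, 0}, {1, 0}, {2, 3}, {1, 0}, {2, 3}, {7, 1}};
	int			n = compact_request_queue(q, 6);

	printf("queue compacted from 6 to %d entries\n", n);	/* prints 3 */
	return 0;
}

If no duplicates remain after such a pass, the queue really is full and the
backend still ends up doing its own fsync, which is the remaining case the
commit message mentions.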


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
From 0c251231cca932b50376f0ca8b088b71b9eb5f84 Mon Sep 17 00:00:00 2001
From: Greg Smith 
Date: Sun, 24 Apr 2011 20:54:08 -0400
Subject: [PATCH 3/3] Try to avoid running with a full fsync request queue.

When we need to insert a new entry and the queue is full, compact the
entire queue in the hopes of making room for the new entry.  Doing this
on every insertion might worsen contention on BgWriterCommLock, but
when the queue is full, it's far better than allowing the backend to
perform its own fsync, per testing by Greg Smith as reported in
http://archives.postgresql.org/pgsql-hackers/2011-01/msg02665.php

Original idea from Greg Smith.  Patch by Robert Haas.  Review by Chris
Browne and Greg Smith

Conflicts:

	src/backend/postmaster/bgwriter.c
---
 src/backend/postmaster/bgwriter.c |  139 ++---
 src/backend/storage/smgr/md.c |8 ++-
 2 files changed, 137 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 831ea94..8a6abff 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -179,6 +179,7 @@ static void CheckArchiveTimeout(void);
 static void BgWriterNap(void);
 static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
+static bool CompactBgwriterRequestQueue(void);
 
 /* Signal handlers */
 
@@ -1058,14 +1059,15 @@ RequestCheckpoint(int flags)
  * use high values for special flags; that's all internal to md.c, which
  * see for details.)
  *
- * If we are unable to pass over the request (at present, this can happen
- * if the shared memory queue is full), we return false.  That forces
- * the backend to do its own fsync.  We hope that will be even more seldom.
- *
- * Note: we presently make no attempt to eliminate duplicate requests
- * in the requests[] queue.  The bgwriter will have to eliminate dups
- * internally anyway, so we may as well avoid holding the lock longer
- * than we have to here.
+ * To avoid holding the lock for longer than necessary, we normally write
+ * to the requests[] queue without checking for duplicates.  The bgwriter
+ * will have to eliminate dups internally anyway.  However, if we discover
+ * that the queue is full, we make a pass over the entire queue to compact
+ * it.  This is somewhat expensive, but the alternative is for the backend
+ * to perform its own fsync, which is far more expensive in practice.  It
+ * is theoretically possible a backend fsync might still be necessary, if
+ * the queue is full and contains no duplicate entries.  In that case, we
+ * let the backend know by returning false.
  */
 bool
 ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
@@ -1083,8 +1085,15 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 	/* we count non-bgwriter writes even when the request queue overflows */
 	BgWriterShmem->num_backend_writes++;
 
+	/*
+	 * If the background writer isn't running or the request queue is full,
+	 * the backend will have to perform its own fsync request.  But before
+	 * forcing that to happen, we can try to compact the background writer
+	 * request queue.
+

Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-20 Thread Amit Kapila
> Just so I understand correctly, the aim of this is to "fix" the
> situation where out of the thousands of files and 100s of GB of data
> in my pg directory, the *only* corruption is that a single file
> pg_control file is missing?

This is just an example I used to explain what the best way to generate the
next-LSN should be.
The overall aim of this feature is to start the database with as much accuracy
as possible after database corruption has occurred due to missing files or a
hardware crash.
However, it is not possible to start with full consistency and accuracy after
such a situation.

-Original Message-
From: Aidan Van Dyk [mailto:ai...@highrise.ca] 
Sent: Wednesday, June 20, 2012 7:13 PM
To: Amit Kapila
Cc: Pg Hackers
Subject: Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

On Wed, Jun 20, 2012 at 9:21 AM, Amit Kapila  wrote:

> Example Scenario -

> Now assume we have Data files and WAL files intact and only control file is 
> lost.


Just so I understand correctly, the aim of this is to "fix" the
situation where out of the thousands of files and 100s of GB of data
in my pg directory, the *only* corruption is that a single file
pg_control file is missing?

a.

-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a slave.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-20 Thread Amit Kapila
>> AFAIK PITR can be used in a scenario where there is a base back-up and we
>> have archived the WAL files after that; now it can use WAL files to apply
>> on the base-backup.

> Yes. If you want to recover the database from the media crash like the
> corruption of pg_control file, you basically should take a base backup
> and set up continuous archiving.

>> In this scenario we don't know a point from where to start the next replay.
>> So I believe it will be difficult to use PITR in this scenario.

> You can find out the point from the complete pg_control file which was
> restored from the backup.

Yes, it can work the way you have explained, or even by using replication
solutions where the user can recreate the database from a standby or another copy.
But tools like pg_resetxlog are provided to handle situations where the user
has not taken enough care to be protected from corruption.

With Regards,
Amit Kapila.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 21 June 2012 01:22, Peter Geoghegan  wrote:
> I've written a very small C++ program, which I've attached, that
> basically proves that this can still make a fairly large difference -
> I hope it's okay that it's C++, but that allowed me to write the
> program quickly, with no dependencies for anyone who cares to try this
> out, other than a C++ compiler and the standard library.

So, it turns out that with a relatively expensive-to-sort collation
like hu_HU.UTF-8, the difference here is huge:

[peter@peterlaptop strxfrm_test]$ ./a.out
locale set: hu_HU.UTF-8
Time elapsed with strcoll: 11.122695 seconds
Time elapsed with strxfrm optimization: 2.328365 seconds
Time elapsed with strcoll: 11.404160 seconds
Time elapsed with strxfrm optimization: 2.355684 seconds
Time elapsed with strcoll: 11.411680 seconds
Time elapsed with strxfrm optimization: 2.342553 seconds
Time elapsed with strcoll: 11.415693 seconds
Time elapsed with strxfrm optimization: 2.343593 seconds
Time elapsed with strcoll: 11.428686 seconds
Time elapsed with strxfrm optimization: 2.345204 seconds
*strcoll average: 11.356583 strxfrm average: 2.343080*

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 20 June 2012 17:41, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> No, I'm suggesting it would probably be at least a bit of a win here
>> to cache the constant, and only have to do a strxfrm() + strcmp() per
>> comparison.
>
> Um, have you got any hard evidence to support that notion?  The
> traditional advice is that strcoll is faster than using strxfrm unless
> the same strings are to be compared repeatedly.  I'm not convinced that
> saving the strxfrm on just one side will swing the balance.

I've written a very small C++ program, which I've attached, that
basically proves that this can still make a fairly large difference -
I hope it's okay that it's C++, but that allowed me to write the
program quickly, with no dependencies for anyone who cares to try this
out, other than a C++ compiler and the standard library.

It just inserts 500,000 elements (the same succession of pseudo-random
strings again and again) into a std::set, which is implemented using a
red-black tree on my machine. In one case, we just use strcmp() (there
is actually no tie-breaker). In the other case, the comparator
allocates and fills memory for a strxfrm blob when needed, caches it,
and uses strxfrm() to generate blobs for other elements into a
dedicated buffer which persists across comparisons.
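
If you'd rather see the trick in a handful of lines than read the attachment,
here is a minimal C sketch of the same idea (my own illustration, not code from
the attachment or the proposed patch): transform the constant once with
strxfrm(), keep the blob, and then compare each probe string by transforming it
into a reusable buffer and running plain strcmp() on the blobs, which by
definition orders the same way strcoll() would.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>

/*
 * Cache the strxfrm() blob of one "constant" string and compare probe
 * strings against it with strcmp() on transformed blobs, avoiding a
 * strcoll() call per comparison.  Sketch only: the fixed scratch buffer
 * would need overflow handling in real code.
 */
typedef struct XfrmCmp
{
	char	   *constant_blob;	/* strxfrm() of the constant, computed once */
	char		scratch[512];	/* reused for every probe string */
} XfrmCmp;

static void
xfrm_cmp_init(XfrmCmp *c, const char *constant)
{
	size_t		need = strxfrm(NULL, constant, 0) + 1;

	c->constant_blob = malloc(need);
	strxfrm(c->constant_blob, constant, need);
}

/* Same sign as strcoll(probe, constant), but no strcoll() per call. */
static int
xfrm_cmp(XfrmCmp *c, const char *probe)
{
	strxfrm(c->scratch, probe, sizeof(c->scratch));
	return strcmp(c->scratch, c->constant_blob);
}

int
main(void)
{
	XfrmCmp		cmp;

	setlocale(LC_COLLATE, "");		/* collate according to the environment */
	xfrm_cmp_init(&cmp, "melon");
	printf("apple vs melon: %d\n", xfrm_cmp(&cmp, "apple"));
	printf("zebra vs melon: %d\n", xfrm_cmp(&cmp, "zebra"));
	free(cmp.constant_blob);
	return 0;
}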

It prints the duration of each run, and a running average for each
case until terminated. Here is what the output looks like on my
machine:

[peter@peterlaptop strxfrm_test]$ g++ -O2 set_test.cpp
[peter@peterlaptop strxfrm_test]$ ./a.out
Time elapsed with strcoll: 1.485290 seconds
Time elapsed with strxfrm optimization: 1.070211 seconds
Time elapsed with strcoll: 1.813725 seconds
Time elapsed with strxfrm optimization: 1.097950 seconds
Time elapsed with strcoll: 1.813381 seconds
Time elapsed with strxfrm optimization: 1.102769 seconds
Time elapsed with strcoll: 1.826708 seconds
Time elapsed with strxfrm optimization: 1.105093 seconds
Time elapsed with strcoll: 1.842156 seconds
Time elapsed with strxfrm optimization: 1.103198 seconds
*strcoll average: 1.756252 strxfrm average: 1.095844*
Time elapsed with strcoll: 1.834785 seconds
Time elapsed with strxfrm optimization: 1.105369 seconds
Time elapsed with strcoll: 1.817449 seconds
Time elapsed with strxfrm optimization: 1.110386 seconds
Time elapsed with strcoll: 1.812446 seconds
Time elapsed with strxfrm optimization: 1.101266 seconds
Time elapsed with strcoll: 1.813595 seconds
Time elapsed with strxfrm optimization: 1.099444 seconds
Time elapsed with strcoll: 1.814584 seconds
Time elapsed with strxfrm optimization: 1.099542 seconds
*strcoll average: 1.787412 strxfrm average: 1.099523*
Time elapsed with strcoll: 1.820218 seconds
Time elapsed with strxfrm optimization: 1.102984 seconds
Time elapsed with strcoll: 1.817549 seconds
Time elapsed with strxfrm optimization: 1.100526 seconds
Time elapsed with strcoll: 1.817020 seconds
Time elapsed with strxfrm optimization: 1.099273 seconds
Time elapsed with strcoll: 1.820983 seconds
Time elapsed with strxfrm optimization: 1.100501 seconds
Time elapsed with strcoll: 1.813270 seconds
Time elapsed with strxfrm optimization: 1.098904 seconds
*strcoll average: 1.797544 strxfrm average: 1.099828*
Time elapsed with strcoll: 1.815952 seconds
Time elapsed with strxfrm optimization: 1.102583 seconds

If you'd like to print the contents of the set, there are a couple of
lines you can uncomment. It should be straightforward to understand
the program, even if you don't know any C++.

Note that I still think that the compelling case for strxfrm() caching
is sorting tuples, not btree index traversal. All that I ask is that
you allow that on the whole, this will pay for the cost of verifying
equivalency within _bt_checkkeys() once per index-scan.

I am aware that a red-black tree isn't generally an excellent proxy
for how a btree will perform, but I don't think that that really
matters here.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
/*
 * set_test.cpp:
 *
 * Measure the performance characteristics of using strcoll() as
 * a string comparator rather than caching strxfrm() blobs.
 *
 * Author: Peter Geoghegan
 *
 * A std::set is an ascending container of unique elements. The implementation
 * that I used for any numbers published was the one from Gnu libstdc++, on
 * Fedora 16. That particular implementation uses a red-black tree, but the
 * standard's requirement that common operations (search, insert, delete) take
 * logarithmic time effectively requires the use of some kind of self-balancing
 * binary search tree, which isn't a bad proxy for a btree for my purposes.
 */

#include <set>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <clocale>
#include <sys/time.h>

#define ORIG_STRING_SIZE	32
#define BLOB_STRING_SIZE	128

double
timeval_subtract(struct timeval *x, struct timeval *y)
{
	struct timeval result;
	/* Compute the time remaining to wait. tv_usec is certainly positive. */
	result.tv_s

Re: [HACKERS] [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes

2012-06-20 Thread Steve Singer

On 12-06-13 07:28 AM, Andres Freund wrote:

From: Andres Freund

The individual changes need to be identified by an xid. The xid can be a
subtransaction or a toplevel one, at commit those can be reintegrated by doing
a k-way mergesort between the individual transaction.

Callbacks for apply_begin, apply_change and apply_commit are provided to
retrieve complete transactions.

Missing:
- spill-to-disk
- correct subtransaction merge, current behaviour is simple/wrong
- DDL handling (?)
- resource usage controls

Here is an initial review of the ApplyCache patch.

This patch provides a module that takes actions from the WAL stream,
groups the actions by transaction, and then passes these change records to
a set of plugin functions.


For each transaction it encounters, it keeps a list of the actions in
that transaction. The ilist included in an earlier patch is used;
changes resulting from that patch review would affect the code here, but
not in a way that changes the design.  When the module sees a commit for
a transaction, it calls the apply_change callback for each change.
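
To make the shape of that mechanism concrete for readers who haven't opened the
patch, here is a deliberately tiny, self-contained sketch of the idea (the names
and callback signatures are mine, not the ones in Andres's patch): decoded
changes are appended to a per-xid list as they are encountered, and only when
that xid's commit arrives are the plugin callbacks invoked, in order, for the
whole transaction.

#include <stdio.h>
#include <stdlib.h>

#define MAX_XACTS 16

typedef struct Change
{
	const char *description;	/* stand-in for a decoded tuple change */
	struct Change *next;
} Change;

typedef struct Xact
{
	unsigned int xid;
	Change	   *head, *tail;	/* changes in WAL order */
} Xact;

/* Plugin callbacks, analogous in spirit to apply_begin/apply_change/apply_commit. */
typedef struct ApplyCallbacks
{
	void		(*begin) (unsigned int xid);
	void		(*change) (unsigned int xid, const char *description);
	void		(*commit) (unsigned int xid);
} ApplyCallbacks;

static Xact xacts[MAX_XACTS];

static Xact *
lookup_xact(unsigned int xid)
{
	Xact	   *x = &xacts[xid % MAX_XACTS];	/* toy lookup; real code hashes properly */

	x->xid = xid;
	return x;
}

/* Called for every decoded change: just queue it under its xid. */
static void
cache_add_change(unsigned int xid, const char *description)
{
	Xact	   *x = lookup_xact(xid);
	Change	   *c = malloc(sizeof(Change));

	c->description = description;
	c->next = NULL;
	if (x->tail)
		x->tail->next = c;
	else
		x->head = c;
	x->tail = c;
}

/* Called when the commit record for xid is seen: replay the whole transaction. */
static void
cache_commit(unsigned int xid, const ApplyCallbacks *cb)
{
	Xact	   *x = lookup_xact(xid);

	cb->begin(xid);
	for (Change *c = x->head; c; c = c->next)
		cb->change(xid, c->description);
	cb->commit(xid);
}

static void my_begin(unsigned int xid) { printf("BEGIN %u\n", xid); }
static void my_change(unsigned int xid, const char *d) { printf("  %s\n", d); }
static void my_commit(unsigned int xid) { printf("COMMIT %u\n", xid); }

int
main(void)
{
	ApplyCallbacks cb = {my_begin, my_change, my_commit};

	/* interleaved changes from two concurrent transactions... */
	cache_add_change(7, "INSERT INTO a ...");
	cache_add_change(9, "UPDATE b ...");
	cache_add_change(7, "INSERT INTO a ...");
	/* ...but each is replayed as one unit, in commit order */
	cache_commit(9, &cb);
	cache_commit(7, &cb);
	return 0;
}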


I can think of three ways that a replication system like this could try 
to apply transactions.


1) Each time it sees a new transaction it could open up a new 
transaction on the replica and make that change.  It leaves the
transaction open and goes on applying the next change (which might be 
for the current transaction or might be for another one).
When it comes across a commit record it would then commit the 
transaction.  If 100 concurrent transactions were open on the origin, 
then 100 concurrent transactions would be open on the replica.


2) Determine the commit order of the transactions, group all the changes 
for a particular transaction together, apply them in that order for 
the transaction that committed first, commit that transaction, and then 
move on to the transaction that committed second.


3) Group the transactions in a way that you move the replica from one 
consistent snapshot to another.  This is what Slony and Londiste do 
because they don't have the commit order or commit timestamps. Built-in 
replication can do better.


This patch implements option (2).  If we had a way of implementing 
option (1) efficiently, would we be better off?


Option (2) requires us to put unparsed WAL data (HeapTuples) in the 
apply cache.  You can't translate this into an independent LCR until you 
call the apply_change callback (which happens once the commit is 
encountered).  The reason is that some of the changes might 
be DDL (or things generated by a DDL trigger) that will change the 
translation catalog, so you can't translate the heap data to LCRs until 
you're at a stage where you can update the translation catalog.  In both 
cases you might need to see later WAL records before you can convert an 
earlier one into an LCR (e.g. TOAST).


Some of my concerns with the apply cache are

Big transactions (bulk loads, mass updates) will be cached in the apply 
cache until the commit comes along.  One issue Slony has with bulk 
operations is that the replicas can't start processing the bulk INSERT 
until after it has committed.  If it takes 10 hours to load the data on 
the master, it will take another 10 hours (at best) to load the data into 
the replica (20 hours after you start the process).  With binary 
streaming replication your replica is done processing the bulk update 
shortly after the master is.


Long-running transactions can sit in the cache for a long time.  When 
you spill to disk we would want the long-running but inactive ones 
spilled to disk first.  This is solvable but adds to the complexity of 
this module; how were you planning on managing which items of the list 
get spilled to disk?


The idea that we can safely reorder the commands into transactional 
groupings works (as far as I know) today because DDL commands get big 
heavy locks that are held until the end of the transaction.  I think 
Robert mentioned earlier in the parent thread that maybe some of that 
will be changed one day.


The downsides of (1) that I see are:

We would want a single backend to keep open multiple transactions at 
once. How hard would that be to implement? Would subtransactions be good 
enough here?


Applying (or even translating WAL to LCR's) the changes in parallel 
across transactions might complicate the catalog structure because each 
concurrent transaction might need its own version of the catalog (or can 
you depend on the locking at the master for this? I think you can today)


With approach (1) changes that are part of a rolled-back transaction 
would have more overhead because you would call apply_change on them.


With approach (1) a later component could still group the LCR's by 
transaction before applying by running the LCR's through a data 
structure very similar to the ApplyCache.



I think I need more convincing that approach (2), which this patch 
implements, is the best way of doing things, compared to (1). I 

Re: [HACKERS] libpq compression

2012-06-20 Thread Florian Pflug
On Jun20, 2012, at 22:40 , Marko Kreen wrote:
> On Wed, Jun 20, 2012 at 10:05 PM, Florian Pflug  wrote:
>> I'm starting to think that relying on SSL/TLS for compression of
>> unencrypted connections might not be such a good idea after all. We'd
>> be using the protocol in a way it quite clearly never was intended to
>> be used...
> 
> Maybe, but what is the argument that we should avoid doing
> encryption+compression at the same time?

Mostly orthogonality, I think. I personally could live with supporting
compression only together with encryption. We *shouldn't* require people
to create an RSA certificate just to use compression, though. I guess we
could do that by using one of the cipher suites that use anonymous
Diffie-Hellman key exchange (TLS_DH_anon_WITH_AES_128_CBC_SHA for example),
but that again requires fiddling with the allowed ciphers list. Or we could
auto-generate a self-signed RSA certificate if none is available. Which
is doable as long as we're going to stick with OpenSSL or GnuTLS, but if
we eventually want to support more SSL libraries server-side, this could
be a roadblock...
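
For what it's worth, the cipher-list fiddling in question is small; a
hypothetical client-side sketch with OpenSSL (illustration only, not something
libpq actually does) might look like this:

#include <openssl/ssl.h>
#include <openssl/err.h>

/*
 * Hypothetical setup for "compression without certificates": allow only an
 * anonymous-DH cipher suite (so no RSA certificate is needed) and make sure
 * TLS-level compression stays enabled.  NB: anonymous DH provides no
 * authentication, so this trades away man-in-the-middle protection.
 */
static SSL_CTX *
make_anon_compressed_ctx(void)
{
	SSL_CTX    *ctx;

	SSL_library_init();
	SSL_load_error_strings();

	ctx = SSL_CTX_new(SSLv23_client_method());
	if (ctx == NULL)
		return NULL;

	/* TLS_DH_anon_WITH_AES_128_CBC_SHA, in OpenSSL's naming */
	if (!SSL_CTX_set_cipher_list(ctx, "ADH-AES128-SHA"))
	{
		SSL_CTX_free(ctx);
		return NULL;
	}

	/* Compression is negotiated by default when OpenSSL has zlib support;
	 * just make sure nothing has turned it off. */
	SSL_CTX_clear_options(ctx, SSL_OP_NO_COMPRESSION);

	return ctx;
}

The server side would additionally need DH parameters set up, which is part of
why this starts to feel messier than it first appears.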

> AES is quite lightweight compared to compression, so should
> be no problem in situations where you care about compression.

Hm, yeah, that's especially true since DEFLATE (zlib) seems to be the
only universally supported compression method for TLS, which is rather
slow anyway (At least compared to alternatives such as LZO or Google's
Snappy).

> So what exactly is the situation we need to solve
> with postgres-specific protocol compression?

All we really want IMHO is a way to enable compression which requires
no more than specifying compress=on or the like in the connection
string. The million dollar question is, what is the easiest way to
get there? I initially thought that relying on TLS for compression should
be relatively straightforward, but it's starting to look rather messy…

best regards,
Florian Pflug




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Testing 9.2 in ~production environment

2012-06-20 Thread Kevin Grittner
James Cloos  wrote:
 
>   create index mb_name_own_idx on mb ( lower(name), ownerid );
 
>   WHERE lower(name) = lower('foo@bar') AND ownerid=7;
 
If you expect to be using an equality test on ownerid, you should
put that first in the index.
 
BTW, this is starting to sound more like something for the
pgsql-performance list than the pgsql-hackers list.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Testing 9.2 in ~production environment

2012-06-20 Thread James Cloos
As a followup, I find that I can avoid the seq scan by adding an index
to that table as:

   create index mb_name_own_idx on mb ( lower(name), ownerid );

and changing the query from using the idiom:

  WHERE name ILIKE 'foo@bar' AND ownerid=7;

to using:

  WHERE lower(name) = lower('foo@bar') AND ownerid=7;

which saves 20+ ms on each of the 30+ k such selects in a full run.

I haven't tested how fast it would be with that change and a utf8 ctype.

Because of how the middleware achieves its portability between pg, mysql et al.,
changing it to use lower() and = will require significant surgery.

Is there any way to specify the index such that the ILIKE query will use
said index?

-JimC
-- 
James Cloos  OpenPGP: 1024D/ED7DAEA6

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_prewarm

2012-06-20 Thread Cédric Villemain
Le mercredi 20 juin 2012 21:53:43, Peter Eisentraut a écrit :
> On tis, 2012-04-10 at 13:29 +0200, Cédric Villemain wrote:
> > I have no problem deprecating overlapping features from pgfincore as
> > soon as I can do a «depend:pg_prewarm[os_warm]»  :)  ...It would have
> > been better to split pgfincore analyze and warming parts times
> > ago, anyway.
> 
> So pg_prewarm is now pending in the commitfest, so let's see what the
> situation is.

I have refused to propose pgfincore so far because BSD didn't support 
posix_fadvise (but already supported mincore(2)).
Now things have changed, and pgfincore should work on Linux, BSD, HP-UX, ... (but not 
on Windows).
I'll be happy to propose it if people want.

> I'm worried about the overlap with pgfincore.  It's pretty well
> established in this space, and has about 73% feature overlap with
> pg_prewarm, while having more features all together.  So it would seem
> to me that it would be better to add whatever is missing to pgfincore
> instead.  Or split pgfincore, as suggested above, but that would upset
> existing users.  But adding severely overlapping stuff for no technical
> reasons (or there any?) doesn't sound like a good idea.

And I am also worried with the overlap.

> Also, Robert has accurately described this as "mechanism, not policy".
> That's fine, that's what all of our stuff is.  Replication and most of
> postgresql.conf suffers from that.  Eventually someone will want to
> create a way to save and restore various caches across server restarts,
> as discussed before.  Would that mean there will be a third way to do
> all this, or could we at least align things a bit so that such a
> facility could use most of the proposed prewarming stuff?  (Patches for
> the cache restoring exist, so it should be possible to predict this a
> little bit.)

It has been some time since I last looked at the PostgreSQL source code around 
ReadBuffer and friends. AFAIR pgfincore needed access to file descriptors 
in a way that was not possible with PostgreSQL core functions.

I can have a look, as this is close to the stuff I'll work on next (posix_fadvise 
random/sequential/normal applied directly by PostgreSQL, instead of relying 
on hacks around read-the-first-block to start readahead).
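
For reference, the mechanism being talked about is just posix_fadvise(2); a
minimal sketch of prewarming a single file with it (illustrative only, nothing
to do with pgfincore's actual code):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Ask the kernel to start reading the whole file into the OS page cache.
 * The call returns immediately; the readahead happens in the background.
 */
static int
prewarm_file(const char *path)
{
	int			fd = open(path, O_RDONLY);
	struct stat	st;
	int			rc;

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0)
	{
		close(fd);
		return -1;
	}
	rc = posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
	close(fd);
	return rc;					/* 0 on success, an errno value otherwise */
}

int
main(int argc, char **argv)
{
	/* e.g. prewarm one relation segment file under $PGDATA/base/... */
	if (argc > 1 && prewarm_file(argv[1]) == 0)
		printf("prefetch requested for %s\n", argv[1]);
	return 0;
}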

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


Re: [HACKERS] Remove support in ri_triggers.c for zero-column foreign keys?

2012-06-20 Thread Kevin Grittner
Andres Freund  wrote:
> On Wednesday, June 20, 2012 11:15:44 PM Tom Lane wrote:
>> There is a nontrivial amount of code in ri_triggers.c that is
>> concerned with supporting foreign key constraints having zero
>> columns.  ...
>> 
>> I think we should rip all that code out and instead put one test
>> into ri_FetchConstraintInfo insisting that the number of keys be
>> > 0.
> +1. I really find that a strange concept and I have a hard time
> finding useful use-cases.
 
+1
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove support in ri_triggers.c for zero-column foreign keys?

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 11:15:44 PM Tom Lane wrote:
> There is a nontrivial amount of code in ri_triggers.c that is concerned
> with supporting foreign key constraints having zero columns.  ...
> 
> I think we should rip all that code out and instead put one test into
> ri_FetchConstraintInfo insisting that the number of keys be > 0.
+1. I really find that a strange concept and I have a hard time finding useful 
use-cases.

Thanks for improving that code, btw; I tried to read/patch it before and it's 
not exactly accessible...

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Remove support in ri_triggers.c for zero-column foreign keys?

2012-06-20 Thread Tom Lane
There is a nontrivial amount of code in ri_triggers.c that is concerned
with supporting foreign key constraints having zero columns.  There is
of course no support for this concept in the standard; the semantics
given to the case in the code are made up out of whole cloth.
(Basically, it says that such a constraint is satisfied if there's at
least one row in the PK table; contents don't matter.)  It is also not
possible to create such a constraint in Postgres, of course, which means
that the code in question is utterly untested.  I have no faith that all
the corner cases (most of which are just "do nothing if zero keys") are
correct even granting that the proposed semantics are reasonable.  In
particular, the corner case in ri_restrict_del that makes it do nothing
is certainly wrong, since it would need to disallow deleting the last
row in the PK table unless the FK table is empty.  I don't see exactly
how SET NULL or SET DEFAULT cases ought to behave either, except that
do-nothing is probably not it.

I think we should rip all that code out and instead put one test into
ri_FetchConstraintInfo insisting that the number of keys be > 0.
Alternatively, if we think this is actually a useful corner case,
somebody needs to put significant work into making the feature
accessible and then fixing it to work correctly.

Thoughts?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_prewarm

2012-06-20 Thread Cédric Villemain
> The biggest problem with pgfincore from my point of view is that it
> only works under Linux, whereas I use a MacOS X machine for my
> development, and there is also Windows to think about.  Even if that
> were fixed, though, I feel we ought to have something in the core
> distribution.  This patch got more +1s than 95% of what gets proposed
> on hackers.

pgfincore also works on BSD kernels. Can you try it on your MacOS X? (I don't 
have one here.)
As of FreeBSD 8.3 there is support for the posix_fadvise call, so both PostgreSQL 
core and pgfincore now support preloading on that platform (I've not 
tested it recently, but it should work).

All pgfincore features should now work in most places, except Windows.

> > Also, Robert has accurately described this as "mechanism, not policy".
> > That's fine, that's what all of our stuff is.  Replication and most of
> > postgresql.conf suffers from that.  Eventually someone will want to
> > create a way to save and restore various caches across server restarts,
> > as discussed before.  Would that mean there will be a third way to do
> > all this, or could we at least align things a bit so that such a
> > facility could use most of the proposed prewarming stuff?  (Patches for
> > the cache restoring exist, so it should be possible to predict this a
> > little bit.)
> 
> Well, pg_buffercache + pg_prewarm is enough to save and restore shared
> buffers.  Not the OS cache, but we don't have portable code to query
> the OS cache yet anyway.

Plus pgfincore, and the OS cache part is done as well.

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


Re: [HACKERS] Event Triggers reduced, v1

2012-06-20 Thread Dimitri Fontaine
Robert Haas  writes:
>> Can't you just put it alone in a test group in the parallel_schedule? Several
>> other tests do that?
>
> Yeah, that seems like it should work.  If not, I'd say the tests
> themselves are broken.

I completely missed that we could do that. I don't feel bright. Of
course it just works.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE FOREGIN TABLE LACUNA

2012-06-20 Thread Alvaro Herrera


The patch uses literals such as 'r' to identify the relkind values.
This should be using RELKIND_RELATION et al instead -- see
src/include/catalog/pg_class.h.

Other than that, it seems simple enough ...

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-20 Thread Marko Kreen
On Wed, Jun 20, 2012 at 10:05 PM, Florian Pflug  wrote:
> I'm starting to think that relying on SSL/TLS for compression of
> unencrypted connections might not be such a good idea after all. We'd
> be using the protocol in a way it quite clearly never was intended to
> be used...

Maybe, but what is the argument that we should avoid doing
encryption+compression at the same time?

AES is quite lightweight compared to compression, so should
be no problem in situations where you care about compression.

RSA is noticeable, but only for short connections.
Thus easily solvable with connection pooling.

And for really special compression needs you can always
create a UDF that does custom compression for you.

So what exactly is the situation we need to solve
with postgres-specific protocol compression?

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers reduced, v1

2012-06-20 Thread Dimitri Fontaine
Hi,

Robert Haas  writes:
> 1. I still think we ought to get rid of the notion of BEFORE or AFTER
> (i.e. pg_event_trigger.evttype) and just make that detail part of the
> event name (e.g. pg_event_trigger.evtevent).  Many easily foreseeable
> event types will be more like "during" rather than "before" or
> "after", and for those that do have a notion of before and after, we
> can have two different event names and include the word "before" or
> "after" there.  I am otherwise satisfied with the schema you've
> chosen.

It's not before/after anymore, but rather addon/replace if you will. I
kept the INSTEAD OF keyword for the replace semantics, that you've been
asking me to keep IIRC, with security policy plugins as a use case.

Now we can of course keep those semantics and embed them in the event
name we provide to users; I thought that maybe a documentation matrix of
which events support which "mode" would be cleaner to document. We might
as well find a clean way to implement both modes for most of the
commands, I don't know yet.

So, are you sure you want to embed that part of the event trigger
semantics in the event name itself?

> 2. I think it's important to be able to add new types of event
> triggers without creating excessive parser bloat.  I think it's

I've been trying to do that, yes, as you can see with the event_name and
event_trigger_variable rules. I've been re-using as many existing
keywords as I could because I believe that's not causing any measurable
bloat; I'll kindly reconsider if necessary, even if sadly.

> important to use some kind of generic syntax here which will be able
> to apply to all types of triggers we may want to add, now or in the
> future.  The easiest way to do that is to use literal syntax for the
> list of command tags, rather than writing them out as key words: e.g.
> 'ALTER TABLE' rather than ALTER TABLE.  It's not quite as pretty, but
> the savings in parser bloat and future code change seem well worth it.
>  Or, alternatively, we could use identifier style, e.g. alter_table,
> as I previously suggested.

Whatever the solution here, we still need to be able to ERROR out very
early if the user entered an unsupported command here, such as GRANT
for the time being. I'm not sure a manual list of strcmp() would be more
effective than having bison basically produce the same thing for us. I
think using the grammar here makes for easier maintenance.

> 3. The event trigger cache seems to be a few bricks shy of a load.

I wouldn't be that surprised, mind you. I didn't have nearly as much
time as I wanted to work on that project.

> First, event_trigger_cache_is_stalled is mis-named; I think you mean
> "stale", not "stalled".  Second, instead of setting that flag and then

Stale. Right. Edited.

> rebuilding the cache when you see the flag set, how about just blowing
> away the cache contents whenever you would have set the flag?  That

I did that at first, but it meant several full rebuilds in a
row in the regression tests, which add new event triggers and then
use them. I thought lazily maintaining the cache would be better.

> seems a whole lot simpler and cleaner, and removes the need for a
> force_rebuild flag on BuildEventTriggerCache().  Third, ISTM that this
> isn't going to work correctly if backend A performs an event after
> backend B has built its cache.  To fix this, I think you need to rip
> out all the places where you force a rebuild locally and instead use
> something like CacheRegisterSyscacheCallback() to blow away the cache
> whenever something changes; you might find it helpful to look at
> attoptcache.c.

Ok, looking at that for next revision of the patch, which I should be
able to post early next week.
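
For the record, the attoptcache.c pattern being pointed at boils down to
registering one invalidation callback per backend and treating the local cache
as stale whenever it fires; roughly like this (the syscache id and function
names below are placeholders, not what the event trigger patch will actually
use):

#include "postgres.h"
#include "utils/inval.h"
#include "utils/syscache.h"

/* Hypothetical validity flag for the backend-local event trigger cache. */
static bool EventTriggerCacheValid = false;

/*
 * Syscache invalidation callback: any change to the watched catalog, made by
 * this backend or any other, simply marks the local cache stale.
 * (The signature is what CacheRegisterSyscacheCallback() expects.)
 */
static void
InvalidateEventTriggerCacheCallback(Datum arg, int cacheid, uint32 hashvalue)
{
	EventTriggerCacheValid = false;
}

/* Run once per backend, e.g. the first time the cache is built. */
static void
InitializeEventTriggerCacheCallbacks(void)
{
	CacheRegisterSyscacheCallback(EVENTTRIGGERNAME,	/* hypothetical syscache id */
								  InvalidateEventTriggerCacheCallback,
								  (Datum) 0);
}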

> 4. The documentation doesn't build.

Sorry about that, will get fixed too.

> 5. In terms of a psql command, I think that \dev is both not very
> mnemonic and, as you mentioned in the comment, possibly confusable
> with SQL/MED.  If we're going to have a \d command for this, I suggest
> something like \dy, which is not very mnemonic either but at least
> seems unlikely to be confused with anything else.  Other things that
> are unused include \dh, \dj, \dk, \dq, \dw, and \dz, if any of those
> grab you, or a somewhat broader range of things (but still nothing
> very memorable) if we include capital letters.  Or we could branch out
> into punctuation, like \d& -- or things that don't begin with the
> letter d, but that doesn't seem like a particularly good idea.

I'm not that fond of psql commands, but I don't think it's going to fly
not to have one for event triggers. I could buy \dy.

> 6. In objectaddress.c, I think that get_object_address_event_trigger
> should be eliminated in favor of an additional case in
> get_object_address_unqualified.

Sure. It used to be more complex than that when the identifier was a
composite with the command name, it makes no sense to separate it away
now. Done.

> 7. There are many references to command triggers 

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 10:12:46 PM Aidan Van Dyk wrote:
> On Wed, Jun 20, 2012 at 3:49 PM, Andres Freund  
wrote:
> > On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
> >> On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund 
> > 
> > wrote:
> >> >> OK, so in this case, I still don't see how the "origin_id" is even
> >> >> enough.
> >> >> 
> >> >> C applies the change originally from A (routed through B, because
> >> >> it's faster).  But when it get's the change directly from A, how
> >> >> does it know to *not* apply it again?
> >> > 
> >> > The lsn of the change.
> >> 
> >> So why isn't the LSN good enough for when C propagates the change back
> >> to A?
> > 
> > Because every node has individual progress in the wal so the lsn doesn't
> > mean anything unless you know from which node it originally is.
> > 
> >> Why does A need more information than C?
> > 
> > Didn't I explain that two mails up?
> 
> Probably, but that didn't mean I understood it... I'm trying to keep up
> here ;-)
Heh. Yes. This developed into a huge mess already ;)

> So the origin_id isn't strictly for the origin node to know to filter an
> LCR it has applied already; it is also to correlate the LSNs,
> because the LSNs of the re-generated LCRs are meant to contain the
> originating node's LSN, and every node applying LCRs needs to
> be able to know where it is in every node's LSN progress.
There are multiple use-cases for it, this is one of them.

> I had assumed any LCRs generated on a node would be relative to the
> LSN sequencing of that node.
We have the original LSN in the commit record. That's needed to support crash 
recovery, because you need to know where to restart applying changes 
from.

> > Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in
> > north america.
> > #1 wants changes from #3 and #4 when talking to #3 but not those from #2
> >  and itself (because that would be cheaper to get locally).
> 
> Right, but if the link between #1 and #2 ever "slows down", changes
> from #3 and #4 may very well already have #2's changes, and even
> require them.
> 
> #1 has to apply them, or is it going to stop applying LCRs from #3
> when it sees LCRs from #3 coming in that originate on #2 and have
> LSNs greater than what it has so far received from #2?
We will see ;). There are several solutions possible, possibly even depending 
on the use-case. This patch just wants to provide the very basic information 
required to implement such solutions...

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Aidan Van Dyk
On Wed, Jun 20, 2012 at 3:49 PM, Andres Freund  wrote:
> On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
>> On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund 
> wrote:
>> >> OK, so in this case, I still don't see how the "origin_id" is even
>> >> enough.
>> >>
>> >> C applies the change originally from A (routed through B, because it's
>> >> faster).  But when it get's the change directly from A, how does it
>> >> know to *not* apply it again?
>> >
>> > The lsn of the change.
>>
>> So why isn't the LSN good enough for when C propagates the change back to
>> A?
> Because every node has individual progress in the wal so the lsn doesn't mean
> anything unless you know from which node it originally is.
>
>> Why does A need more information than C?
> Didn't I explain that two mails up?

Probably, but that didn't mean I understood it... I'm trying to keep up here ;-)

So the origin_id isn't strictly for the origin node to know to filter an
LCR it has applied already; it is also to correlate the LSNs,
because the LSNs of the re-generated LCRs are meant to contain the
originating node's LSN, and every node applying LCRs needs to
be able to know where it is in every node's LSN progress.

I had assumed any LCRs generated on a node would be relative to the
LSN sequencing of that node.

> Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in north
> america.
> #1 wants changes from #3 and #4 when talking to #3 but not those from #2  and
> itself (because that would be cheaper to get locally).

Right, but if the link between #1 and #2 ever "slows down", changes
from #3 and #4 may very well already have #2's changes, and even
require them.

#1 has to apply them, or is it going to stop applying LCRs from #3
when it sees LCRs from #3 coming in that originate on #2 and have
LSNs greater than what it has so far received from #2?


-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench--new transaction type

2012-06-20 Thread Greg Smith

On 06/20/2012 03:22 PM, Peter Geoghegan wrote:

The situation would be made a lot better if we could just find a way
to generalise pgbench a little bit more. I'm thinking about a facility
for specifying new tables in scripts, with a moderate degree of
flexibility as to their definition, data, and the distribution of that
data.


That's probably excess scope creep for this CF though.  And converting 
this new script to use a longer name like --test would be a useful 
forward step for such a larger plan anyway.


I think there's some useful value to breaking out the existing test 
scripts into files too, rather than all of them being embedded in the 
source code.  While there's not any pressure to reclaim existing 
switches like -S and -N, having those just be shortcuts for a 
"--test=pgbench-select-only.sql" form would make sure that external 
script facility worked right.  I think there's some value to making it 
easier for people to copy the built-in scripts and hack on them too.


The main sticky point I see here that really needs to be improved now, 
to turn any of the built-in ones into external scripts, is flagging how to 
handle the database scale.  Right now the built-in scripts set :scale 
based on counting pgbench_branches, while external ones run with "-f" do 
not.  What things like pgbench-tools end up doing is manually doing that 
count and then passing it in with "-s <scale>".  That's annoying, and if 
this is getting refactored I'd like to improve that too.  Getting that 
logic right while staying backward compatible is a bit complicated; I'm 
going to think about that for a bit.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_prewarm

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 3:53 PM, Peter Eisentraut  wrote:
> I'm worried about the overlap with pgfincore.  It's pretty well
> established in this space, and has about 73% feature overlap with
> pg_prewarm, while having more features all together.  So it would seem
> to me that it would be better to add whatever is missing to pgfincore
> instead.  Or split pgfincore, as suggested above, but that would upset
> existing users.  But adding severely overlapping stuff for no technical
> reasons (or there any?) doesn't sound like a good idea.

73%?  I think it's got about 15% overlap.

The biggest problem with pgfincore from my point of view is that it
only works under Linux, whereas I use a MacOS X machine for my
development, and there is also Windows to think about.  Even if that
were fixed, though, I feel we ought to have something in the core
distribution.  This patch got more +1s than 95% of what gets proposed
on hackers.

> Also, Robert has accurately described this as "mechanism, not policy".
> That's fine, that's what all of our stuff is.  Replication and most of
> postgresql.conf suffers from that.  Eventually someone will want to
> create a way to save and restore various caches across server restarts,
> as discussed before.  Would that mean there will be a third way to do
> all this, or could we at least align things a bit so that such a
> facility could use most of the proposed prewarming stuff?  (Patches for
> the cache restoring exist, so it should be possible to predict this a
> little bit.)

Well, pg_buffercache + pg_prewarm is enough to save and restore shared
buffers.  Not the OS cache, but we don't have portable code to query
the OS cache yet anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench--new transaction type

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 2:57 PM, Heikki Linnakangas
 wrote:
> On 20.06.2012 21:41, Peter Geoghegan wrote:
>> On 20 June 2012 18:42, Robert Haas  wrote:
>>> On Wed, Jun 20, 2012 at 3:48 AM, Simon Riggs
>>>  wrote:
 I'm sure Jeff submitted this because of the need for a standard test,
 rather than the wish to actually modify pgbench itself.

 Can I suggest that we include a list of standard scripts with pgbench
 for this purpose? These can then be copied alongside the binary when
 we do an install.
>>>
>>> I was thinking along similar lines myself.  At the least, I think we
>>> can't continue to add a short option for every new test type.
>>> Instead, maybe we could have --test-type=WHATEVER, and perhaps that
>>> then reads whatever.sql from some compiled-in directory.  That would
>>> allow us to sanely support a moderately large number of tests.
>
> We could call the --test-type option -f, and the "compiled-in directory"
> could be the current directory ;-).

Well, that sounds a lot like "let's reject the patch".  Which would be
OK with me, I guess, but if the goal is to make it easy for all
developers to run that particular test, I'm not sure that's getting us
there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_prewarm

2012-06-20 Thread Peter Eisentraut
On tis, 2012-04-10 at 13:29 +0200, Cédric Villemain wrote:
> I have no problem deprecating overlapping features from pgfincore as
> soon as I can do a «depend:pg_prewarm[os_warm]»  :)  ...It would have
> been better to split pgfincore analyze and warming parts times 
> ago, anyway. 

So pg_prewarm is now pending in the commitfest, so let's see what the
situation is.

I'm worried about the overlap with pgfincore.  It's pretty well
established in this space, and has about 73% feature overlap with
pg_prewarm, while having more features all together.  So it would seem
to me that it would be better to add whatever is missing to pgfincore
instead.  Or split pgfincore, as suggested above, but that would upset
existing users.  But adding severely overlapping stuff for no technical
reasons (or there any?) doesn't sound like a good idea.

Also, Robert has accurately described this as "mechanism, not policy".
That's fine, that's what all of our stuff is.  Replication and most of
postgresql.conf suffers from that.  Eventually someone will want to
create a way to save and restore various caches across server restarts,
as discussed before.  Would that mean there will be a third way to do
all this, or could we at least align things a bit so that such a
facility could use most of the proposed prewarming stuff?  (Patches for
the cache restoring exist, so it should be possible to predict this a
little bit.)



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers reduced, v1

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 3:39 PM, Andres Freund  wrote:
> On Wednesday, June 20, 2012 09:33:26 PM Dimitri Fontaine wrote:
>> > Hmm, I don't like the idea of a test that only runs in serial mode.
>> > Maybe we can find some way to make it work in parallel mode as well.
>>
>> I don't see how we can manage to do that, as adding an event trigger to
>> some command will impact tests running in parallel.
> Can't you just put it alone in a test group in the parallel_schedule? Several
> other tests do that?

Yeah, that seems like it should work.  If not, I'd say the tests
themselves are broken.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
> On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund  
wrote:
> >> OK, so in this case, I still don't see how the "origin_id" is even
> >> enough.
> >> 
> >> C applies the change originally from A (routed through B, because it's
> >> faster).  But when it get's the change directly from A, how does it
> >> know to *not* apply it again?
> > 
> > The lsn of the change.
> 
> So why isn't the LSN good enough for when C propagates the change back to
> A?
Because every node has individual progress in the wal so the lsn doesn't mean 
anything unless you know from which node it originally is.

> Why does A need more information than C?
Didn't I explain that two mails up?

Perhaps Chris' phrasing explains the basic idea better:

On Wednesday, June 20, 2012 07:06:28 PM Christopher Browne wrote:
> The case where it would be needful is if you are in the process of
> assembling together updates coming in from multiple masters, and need
> to know:
> - This INSERT was replicated from node #1, so should be ignored downstream
> - That INSERT was replicated from node #2, so should be ignored downstream
> - This UPDATE came from the local node, so needs to be passed to downstream users

Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in north 
america.
#1 wants changes from #3 and #4 when talking to #3 but not those from #2  and 
itself (because that would be cheaper to get locally).
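
To put the same thing in code form, the decision each node ends up making once
it has both pieces of information looks roughly like this (purely illustrative C,
not from the patch, and ignoring persistence of the progress tracking):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_NODES 64

typedef uint64_t XLogRecPtr;	/* an LSN, as a plain 64-bit position */

/* Per-origin replay progress, as tracked by the receiving node. */
static XLogRecPtr applied_upto[MAX_NODES];
static int		  local_node_id = 1;

/*
 * Decide whether an incoming change should be applied, given the node it
 * originated on and its LSN *on that node*.  The same change may arrive over
 * several paths (directly, or relayed via another node), so both pieces of
 * information are needed: the LSN alone means nothing without knowing whose
 * WAL it refers to.  (Assumes changes from one origin arrive in LSN order.)
 */
static bool
should_apply(int origin_id, XLogRecPtr origin_lsn)
{
	if (origin_id == local_node_id)
		return false;			/* our own change coming back around */
	if (origin_lsn <= applied_upto[origin_id])
		return false;			/* already applied via another path */

	applied_upto[origin_id] = origin_lsn;
	return true;
}

int
main(void)
{
	/* a change from node #2 at its LSN 5000, seen via a relay and then directly */
	printf("%d\n", should_apply(2, 5000));	/* 1: apply it */
	printf("%d\n", should_apply(2, 5000));	/* 0: duplicate, skip */
	return 0;
}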

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format changes

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 20:43, Fujii Masao wrote:

On Wed, Jun 20, 2012 at 8:19 PM, Magnus Hagander  wrote:

On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas  wrote:

On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
  wrote:

Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
a uint64, on top of my other WAL format patches. I think we should go ahead
with this.


+1.


The LSNs on pages are still stored in the old format, to avoid changing the
on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
is required.


Seems fine.


Should we keep the old representation in the replication protocol messages?
That would make it simpler to write a client that works with different
server versions (like pg_receivexlog). Or, while we're at it, perhaps we
should mandate network-byte order for all the integer and XLogRecPtr fields
in the replication protocol. That would make it easier to write a client
that works across different architectures, in>= 9.3. The contents of the
WAL would of course be architecture-dependent, but it would be nice if
pg_receivexlog and similar tools could nevertheless be
architecture-independent.


I share Andres' question about how we're doing this already.  I think
if we're going to break this, I'd rather do it in 9.3 than 5 years
from now.  At this point it's just a minor annoyance, but it'll
probably get worse as people write more tools that understand WAL.


If we are looking at breaking it, and we are especially concerned
about something like pg_receivexlog... Is it something we could/should
change in the protocl *now* for 9.2, to make it non-broken in any
released version? As in, can we extract just the protocol change and
backpatch that to 9.2beta?


pg_receivexlog in 9.2 cannot correctly handle the WAL location "FF"
(which was skipped in 9.2 and before). For example, pg_receivexlog calls
XLByteAdvance(), which always skips "FF". So even if we change the protocol,
ISTM pg_receivexlog in 9.2 cannot work well with a 9.3 server, which
might send "FF". No?


Yeah, you can't use pg_receivexlog from 9.2 against a 9.3 server. We 
can't really promise compatibility when using an older client against a 
newer server, but we can try to be backwards-compatible in the other 
direction. I'm thinking of using a 9.3 pg_receivexlog against a 9.2 server.


But I guess Robert is right and we shouldn't worry about 
backwards-compatibility at this point. Instead, let's try to get the 
protocol right, so that we can more easily provide 
backwards-compatibility in the future. Like, using a 9.4 pg_receivexlog 
against a 9.3 server.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Aidan Van Dyk
On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund  wrote:

>> OK, so in this case, I still don't see how the "origin_id" is even enough.
>>
>> C applies the change originally from A (routed through B, because it's
>> faster).  But when it gets the change directly from A, how does it
>> know to *not* apply it again?
> The lsn of the change.

So why isn't the LSN good enough for when C propagates the change back to A?

Why does A need more information than C?

a.


-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers reduced, v1

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 09:33:26 PM Dimitri Fontaine wrote:
> > Hmm, I don't like the idea of a test that only runs in serial mode.
> > Maybe we can find some way to make it work in parallel mode as well.
> 
> I don't see how we can manage to do that, as adding an event trigger to
> some command will impact tests running in parallel.
Can't you just put it alone in a test group in the parallel_schedule? Several 
other tests do that?

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers reduced, v1

2012-06-20 Thread Dimitri Fontaine
Alvaro Herrera  writes:
> Did you try REASSIGN OWNED and DROP OWNED with a role that has defined
> some event triggers?

Not yet. Will add to regression tests, good catch.
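
For concreteness, a minimal sketch of the case being asked about (the role
names are made up, and the event trigger creation step is elided since its
syntax is exactly what this patch defines):

    CREATE ROLE regress_evt_owner;
    -- ... as regress_evt_owner, create an event trigger per the patch ...
    REASSIGN OWNED BY regress_evt_owner TO postgres;
    -- and, separately, the other path:
    DROP OWNED BY regress_evt_owner;
    DROP ROLE regress_evt_owner;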

> Hmm, I don't like the idea of a test that only runs in serial mode.
> Maybe we can find some way to make it work in parallel mode as well.

I don't see how we can manage to do that, as adding an event trigger to
some command will impact tests running in parallel.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 09:23:34 PM Heikki Linnakangas wrote:
> > And then we just put the originid on each heap record for MMR, in some
> > manner, discussed later.
> 
> I reserve the right to object to that, too :-). Others raised the 
> concern that a 16-bit integer is not a very intuitive identifier. Also, 
> as discussed, for more complex scenarios just the originid is not 
> sufficient. ISTM that we need more flexibility.
I think the '16bit integer is unintuitive' argument isn't that interesting. 
As pointed out by multiple people in the thread, the origin_id can be local 
and mapped to something more complex in the communication between the different 
nodes and in the configuration.
Before applying changes from another node you look up their "complex id" to get 
the locally mapped 16bit origin_id, which then gets written into the WAL 
stream. When decoding the WAL stream into the LCR stream it's mapped the other 
way.
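
To make that concrete, the mapping could be nothing more than a small local
catalog along these lines (purely illustrative; none of these names are from
the patch):

    CREATE TABLE node_origin_map (
        origin_id  smallint PRIMARY KEY,  -- the 16bit value written into WAL
        node_name  text NOT NULL UNIQUE   -- the "complex id" used between nodes
    );

    -- before applying changes received from 'node-B', look up its local id
    SELECT origin_id FROM node_origin_map WHERE node_name = 'node-B';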

We might need more information than that at a later point, but it probably 
won't be needed during low-level filtering of WAL before reassembling it into 
transactions...

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 09:24:29 PM Aidan Van Dyk wrote:
> On Wed, Jun 20, 2012 at 3:15 PM, Andres Freund  
wrote:
> > To recap why we think origin_id is a sensible design choice:
> > 
> > There are many sensible replication topologies where it does make sense
> > that you want to receive changes (on node C) from one node (say B) that
> > originated from some other node (say A).
> > Reasons include:
> > * the order of applying changes should be as similar as possible on all
> > nodes. That means when applying a change on C that originated on B and
> > if changes replicated faster from A->B than from A->C you want to be at
> > least as far with the replication from A as B was. Otherwise the
> > conflict ratio will increase. If you can recreate the stream from the
> > wal of every node and still detect where an individual change
> > originated, thats easy.
> 
> OK, so in this case, I still don't see how the "origin_id" is even enough.
> 
> C applies the change originally from A (routed through B, because it's
> faster).  But when it gets the change directly from A, how does it
> know to *not* apply it again?
The lsn of the change.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 21 June 2012 03:23, Heikki Linnakangas
 wrote:

>> And then we just put the originid on each heap record for MMR, in some
>> manner, discussed later.
>
>
> I reserve the right to object to that, too :-).

OK. But that would be only for MMR, using special record types.

> Others raised the concern
> that a 16-bit integer is not a very intuitive identifier.

Of course

> Also, as
> discussed, for more complex scenarios just the originid is not sufficient.
> ISTM that we need more flexibility.

Of course

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 21 June 2012 03:13, Heikki Linnakangas
 wrote:
> On 20.06.2012 21:51, Simon Riggs wrote:
>>
>> On 21 June 2012 02:32, Heikki Linnakangas
>>   wrote:
>>>
>>> I'm not saying that we need to implement all possible conflict resolution
>>>
>>> algorithms right now - on the contrary I think conflict resolution
>>> belongs
>>> outside core
>>
>>
>> It's a pretty standard requirement to have user defined conflict
>> handling, if that's what you mean.
>>
>> I'm OK with conflict handling being outside of core as a module, if
>> that's what people think. We just need a commit hook. That is more
>> likely to allow a workable solution with 9.3 as well, ISTM. It's
>> likely to be contentious...
>
>
> Hmm, what do you need to happen at commit time?
>
> We already have RegisterXactCallback(), if that's enough...

Let me refer back to my notes and discuss this later, I'm just going
from vague memory here, which isn't helpful.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Aidan Van Dyk
On Wed, Jun 20, 2012 at 3:15 PM, Andres Freund  wrote:

> To recap why we think origin_id is a sensible design choice:
>
> There are many sensible replication topologies where it does make sense that
> you want to receive changes (on node C) from one node (say B) that originated
> from some other node (say A).
> Reasons include:
> * the order of applying changes should be as similar as possible on all nodes.
> That means when applying a change on C that originated on B and if changes
> replicated faster from A->B than from A->C you want to be at least as far with
> the replication from A as B was. Otherwise the conflict ratio will increase.
> If you can recreate the stream from the wal of every node and still detect
> where an individual change originated, thats easy.

OK, so in this case, I still don't see how the "origin_id" is even enough.

C applies the change originally from A (routed through B, because it's
faster).  But when it gets the change directly from A, how does it
know to *not* apply it again?




> * the interconnects between some nodes may be more expensive than from others
> * an interconnect between two nodes may fail but others dont
>
> Because of that we think its sensible to be able generate the full LCR stream
> with all changes, local and remote ones, on each individual node. If you then
> can filter on individual origin_id's you can build complex replication
> topologies without much additional complexity.
>
>> I'm not saying that we need to implement all possible conflict
>> resolution algorithms right now - on the contrary I think conflict
>> resolution belongs outside core - but if we're going to change the WAL
>> record format to support such conflict resolution, we better make sure
>> the foundation we provide for it is solid.
> I think this already provides a lot. At some point we probably want to have
> support for looking on which node a certain local xid originated and when that
> was originally executed. While querying that efficiently requires additional
> support we already have all the information for that.
>
> There are some more complexities with consistently determining conflicts on
> changes that happened in a very small time window on different nodes, but that's
> something for another day.
>
>> BTW, one way to work around the lack of origin id in the WAL record
>> header is to just add an origin-id column to the table, indicating the
>> last node that updated the row. That would be a kludge, but I thought
>> I'd mention it..
> Yuck. The aim is to improve on what's done today ;)
>
> --
>  Andres Freund                     http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 22:11, Simon Riggs wrote:

On 21 June 2012 02:56, Simon Riggs  wrote:


I think allowing rmgrs to redefine the wasted bytes in the header is
the best idea.


Hmm, I think the best idea is to save 2 bytes off the WAL header for
all records, so there are no wasted bytes on 64bit or 32bit.

That way the potential for use goes away and there's benefit for all,
plus no argument about how to use those bytes in rarer cases.

I'll work on that.


I don't think that's actually necessary; the WAL bloat isn't *that* bad 
that we need to start shaving bytes from there. I was just trying to 
make a point.



And then we just put the originid on each heap record for MMR, in some
manner, discussed later.


I reserve the right to object to that, too :-). Others raised the 
concern that a 16-bit integer is not a very intuitive identifier. Also, 
as discussed, for more complex scenarios just the originid is not 
sufficient. ISTM that we need more flexibility.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Tom Lane
Josh Berkus  writes:
>> 
>> 4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...)
>> 
>> to cover the case when column names do not match.

> Personally, I think that's going way beyond what we want to do with
> COPY.

I agree --- if you know that much about the incoming data, you might as
well just specify the correct column list in the COPY command.  This
would only be useful if you knew somebody used weird column names, but
not the order they put the columns in, which seems like a thin use-case.
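
I.e. something like this (file and column names invented for illustration)
already does the job:

    COPY target_table (col_1, input_date, amount)
        FROM '/tmp/incoming.csv' WITH (FORMAT csv, HEADER true);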

The three options Josh enumerated seem reasonably sane to me.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench--new transaction type

2012-06-20 Thread Peter Geoghegan
On 20 June 2012 19:57, Heikki Linnakangas
 wrote:
> Yeah, this sounds like a good approach. A library of standard workload
> scripts seems very useful. I've been using custom scripts to benchmark WAL
> insertion scalability lately, that also seems like a kind of a thing to put
> in such a library. I don't know if we should ship the library of scripts in
> contrib, or just put them up on a web site, but something like that...

The situation would be made a lot better if we could just find a way
to generalise pgbench a little bit more. I'm thinking about a facility
for specifying new tables in scripts, with a moderate degree of
flexibility as to their definition, data, and the distribution of that
data.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench--new transaction type

2012-06-20 Thread Simon Riggs
On 21 June 2012 02:57, Heikki Linnakangas
 wrote:
> On 20.06.2012 21:41, Peter Geoghegan wrote:
>>
>> On 20 June 2012 18:42, Robert Haas  wrote:
>>>
>>> On Wed, Jun 20, 2012 at 3:48 AM, Simon Riggs
>>>  wrote:

 I'm sure Jeff submitted this because of the need for a standard test,
 rather than the wish to actually modify pgbench itself.

 Can I suggest that we include a list of standard scripts with pgbench
 for this purpose? These can then be copied alongside the binary when
 we do an install.
>>>
>>>
>>> I was thinking along similar lines myself.  At the least, I think we
>>> can't continue to add a short option for every new test type.
>>> Instead, maybe we could have --test-type=WHATEVER, and perhaps that
>>> then reads whatever.sql from some compiled-in directory.  That would
>>> allow us to sanely support a moderately large number of tests.
>
>
> We could call the --test-type option -f, and the "compiled-in directory"
> could be the current directory ;-).

;-)

Yeah, -f is good

We should read $PATH, but if not found, it should search the share dir
for a script of that name.
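
As an example of the kind of script such a library would carry, the
equivalent of the built-in SELECT-only test, written as a file you can
already run with -f, looks roughly like this (9.2-era \setrandom syntax):

    \set naccounts 100000 * :scale
    \setrandom aid 1 :naccounts
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;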


>> +1. As long as pgbench is considered to be the standard benchmarking
>> tool (and I think that it is a general problem that it is), we ought
>> to make an effort to give people more options.
>
>
> Yeah, this sounds like a good approach. A library of standard workload
> scripts seems very useful. I've been using custom scripts to benchmark WAL
> insertion scalability lately, that also seems like a kind of a thing to put
> in such a library. I don't know if we should ship the library of scripts in
> contrib, or just put them up on a web site, but something like that...

Mix of both, I guess. We just add to contrib any new scripts that
prove useful.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
Hi,

On Wednesday, June 20, 2012 08:32:53 PM Heikki Linnakangas wrote:
> On 20.06.2012 17:35, Simon Riggs wrote:
> > On 20 June 2012 16:23, Heikki Linnakangas
> > 
> >   wrote:
> >> On 20.06.2012 11:17, Simon Riggs wrote:
> >>> On 20 June 2012 15:45, Heikki Linnakangas
> >>> 
> >>> wrote:
>  So, if the origin id is not sufficient for some conflict resolution
>  mechanisms, what extra information do you need for those, and where do
>  you put it?
> >>> 
> >>> As explained elsewhere, wal_level = logical (or similar) would be used
> >>> to provide any additional logical information required.
> >>> 
> >>> Update and Delete WAL records already need to be different in that
> >>> mode, so additional info would be placed there, if there were any.
> >>> 
> >>> In the case of reflexive updates you raised, a typical response in
> >>> other DBMS would be to represent the query
> >>> 
> >>>UPDATE SET counter = counter + 1
> >>> 
> >>> by sending just the "+1" part, not the current value of counter, as
> >>> would be the case with the non-reflexive update
> >>> 
> >>>UPDATE SET counter = 1
> >>> 
> >>> Handling such things in Postgres would require some subtlety, which
> >>> would not be resolved in first release but is pretty certain not to
> >>> require any changes to the WAL record header as a way of resolving it.
> >>> Having already thought about it, I'd estimate that is a very long
> >>> discussion and not relevant to the OT, but if you wish to have it
> >>> here, I won't stop you.
> >> 
> >> Yeah, I'd like to hear briefly how you would handle that without any
> >> further changes to the WAL record header.
> > 
> > I already did:
> >>> Update and Delete WAL records already need to be different in that
> >>> mode, so additional info would be placed there, if there were any.
> > 
> > The case you mentioned relates to UPDATEs only, so I would suggest
> > that we add that information to a new form of update record only.
> > 
> > That has nothing to do with the WAL record header.
> 
> Hmm, so you need the origin id in the WAL record header to do filtering.
> Except when that's not enough, you add some more fields to heap update
> and delete records.
Imo the whole +1 stuff doesn't have anything to do with the origin_id proposal 
and should be ignored for quite a while. We might go to something like it 
sometime in the future, but it's nothing we are working on (as far as I know ;)).

wal_level=logical (in patch 07) currently only changes the following things 
about the wal stream:

For HEAP_(INSERT|(HOT_)?UPDATE|DELETE)
* prevent full page writes from removing the row data (could be optimized at 
some point to just store the tuple slot)

For HEAP_DELETE
* add the primary key of the changed row

HEAP_MULTI_INSERT obviously needs to get the same treatment in future.

The only real addition that I foresee in the near future is logging the old 
primary key when the primary key changes in HEAP_UPDATE.
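
A trivial illustration of why the old key is needed (table and values made
up):

    -- the apply side must find the row by its *old* key (1), not the new
    -- one (2), so the old primary key has to travel with the change
    UPDATE t SET id = 2 WHERE id = 1;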

Kevin wants an option for full pre-images of rows in HEAP_(UPDATE|DELETE)

> Don't you think it would be simpler to only add the extra fields to heap
> insert, update and delete records, and leave the WAL record header
> alone? Do you ever need extra information on other record types?
It's needed in some more locations: HEAP_HOT_UPDATE, HEAP2_MULTI_INSERT, 
HEAP_NEWPAGE, HEAP_XACT_(ASSIGN, COMMIT, COMMIT_PREPARED, COMMIT_COMPACT, 
ABORT, ABORT_PREPARED) and probably some I don't remember right now.

Sure, we can add it to all those, but then you need to have individual 
knowledge about *all* of them because the location where it's stored will be 
different for each of them.

To recap why we think origin_id is a sensible design choice:

There are many sensible replication topologies where it does make sense that 
you want to receive changes (on node C) from one node (say B) that originated 
from some other node (say A).
Reasons include:
* the order of applying changes should be as similar as possible on all nodes. 
That means that when applying a change on C that originated on B, and changes 
replicated faster from A->B than from A->C, you want to be at least as far along 
with the replication from A as B was. Otherwise the conflict ratio will increase. 
If you can recreate the stream from the WAL of every node and still detect 
where an individual change originated, that's easy.
* the interconnects between some nodes may be more expensive than from others
* an interconnect between two nodes may fail but others dont

Because of that we think it's sensible to be able to generate the full LCR stream 
with all changes, local and remote ones, on each individual node. If you can then 
filter on individual origin_ids, you can build complex replication 
topologies without much additional complexity.

> I'm not saying that we need to implement all possible conflict
> resolution algorithms right now - on the contrary I think conflict
> resolution belongs outside core - but if we're going to change the WAL
> record format to support such conflict resolution, we better make sure
> the foundation we provide for it is solid.

Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases

2012-06-20 Thread Alvaro Herrera
Excerpts from Alex Hunsaker's message of Fri Feb 10 16:53:05 -0300 2012:

> Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
> and SvPVUTF8() when turning a perl string into a cstring.

Right.

So I played a bit with this patch, and touched it a bit mainly just to
add some more comments; and while at it I noticed that some of the
functions in Util.xs might leak some memory, so I made an attempt to
plug them, as in the attached patch (which supersedes yours).

Now, with my version of the patch applied and using a SQL_ASCII database
to test the problem in the original report, I notice that we now have a
regression failure:

*** /pgsql/source/HEAD/src/pl/plperl/expected/plperl.out	2012-05-16 13:38:02.495647259 -0400
--- /home/alvherre/Code/pgsql/build/HEAD/src/pl/plperl/results/plperl.out	2012-06-20 15:09:19.869778824 -0400
***
*** 658,664 
return "abcd\0efg";
  $$ LANGUAGE plperl;
  SELECT perl_zerob();
! ERROR:  invalid byte sequence for encoding "UTF8": 0x00
  CONTEXT:  PL/Perl function "perl_zerob"
  -- make sure functions marked as VOID without an explicit return work
  CREATE OR REPLACE FUNCTION myfuncs() RETURNS void AS $$
--- 658,664 
return "abcd\0efg";
  $$ LANGUAGE plperl;
  SELECT perl_zerob();
! ERROR:  invalid byte sequence for encoding "SQL_ASCII": 0x00
  CONTEXT:  PL/Perl function "perl_zerob"
  -- make sure functions marked as VOID without an explicit return work
  CREATE OR REPLACE FUNCTION myfuncs() RETURNS void AS $$

==


I'm not really sure what to do here -- maybe having a second expected file
for that test is a good enough answer?  Or should I just take the test
out?  Opinions please.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


plperl_sql_ascii-2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 21:51, Simon Riggs wrote:

On 21 June 2012 02:32, Heikki Linnakangas
  wrote:

I'm not saying that we need to implement all possible conflict resolution
algorithms right now - on the contrary I think conflict resolution belongs
outside core


It's a pretty standard requirement to have user defined conflict
handling, if that's what you mean.

I'm OK with conflict handling being outside of core as a module, if
that's what people think. We just need a commit hook. That is more
likely to allow a workable solution with 9.3 as well, ISTM. It's
likely to be contentious...


Hmm, what do you need to happen at commit time?

We already have RegisterXactCallback(), if that's enough...


BTW, one way to work around the lack of origin id in the WAL record header
is to just add an origin-id column to the table, indicating the last node
that updated the row. That would be a kludge, but I thought I'd mention it..


err, I hope you mean that to be funny. (It wouldn't actually work either)


No, I wasn't serious that we should implement it that way. But now you 
made me curious; why would it not work? If there's an origin-id column 
in a table, it's included in every heap insert/delete/update WAL record. 
Just set it to the current node's id on a local modification, and to the 
origin's id when replaying changes from another node, and you have the 
exact same information as you would with the extra field in WAL record 
header.
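
Spelled out, the kludge would look something like this (names invented):

    ALTER TABLE t ADD COLUMN origin_id smallint DEFAULT 1;  -- 1 = this node

    -- local writes keep the default; the apply process overrides it when
    -- replaying a change that originated on, say, node 3
    UPDATE t SET val = 'replayed change', origin_id = 3 WHERE id = 42;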


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 21 June 2012 02:56, Simon Riggs  wrote:

> I think allowing rmgrs to redefine the wasted bytes in the header is
> the best idea.

Hmm, I think the best idea is to save 2 bytes off the WAL header for
all records, so there are no wasted bytes on 64bit or 32bit.

That way the potential for use goes away and there's benefit for all,
plus no argument about how to use those bytes in rarer cases.

I'll work on that.

And then we just put the originid on each heap record for MMR, in some
manner, discussed later.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Testing 9.2 in ~production environment

2012-06-20 Thread James Cloos
Updating pg_database to set datctype='C' did solve the speed issues with
the two large dbs.
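
Concretely, that was just (database name substituted):

    UPDATE pg_database SET datctype = 'C' WHERE datname = 'mydb';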

Presumably, since LC_CTYPE=en_US.UTF-8 was in the env when I ran pg_restore,
it overrode the ctype setting in the dump files.

Some of the slow selects do use ilike; even w/ datctype='C' the indices
are skipped for at least this query:

# explain analyze SELECT mb_id FROM mb WHERE name ILIKE 'foo@bar' AND ownerid=7;

                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Seq Scan on mb  (cost=0.00..570.96 rows=3 width=4) (actual time=9.443..25.039 rows=1 loops=1)
   Filter: ((name ~~* 'foo@bar'::text) AND (ownerid = 7))
   Rows Removed by Filter: 34827
 Total runtime: 25.071 ms
(4 rows)

The mb table has several indices, including separate ones on name and ownerid.
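
For what it's worth, a plain btree index on name can't help an ILIKE
predicate anyway; the standard way to make it indexable (an aside, not
something from this thread) would be a trigram index:

    CREATE EXTENSION pg_trgm;
    CREATE INDEX mb_name_trgm_idx ON mb USING gin (name gin_trgm_ops);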

(not my design, btw.  And I really do need to re-write the middleware)

Whether it is strcoll(3) (even though LC_COLLATE is explicitly C) or
LIKE, it does make a significant difference for those two apps.

-JimC
-- 
James Cloos  OpenPGP: 1024D/ED7DAEA6

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-20 Thread Florian Pflug
On Jun20, 2012, at 18:42 , Magnus Hagander wrote:
> That is a very good point. Before we design *another* feature that
> relies on it, we should verify if the syntax is compatible in the
> other libraries that would be interesting (gnutls and NSS primarily),
> and if it's not that at least the *functionality* exists ina
> compatible way. So we don't put ourselves in a position where we can't
> proceed.

Hm, here's another problem with relying on SSL/TLS for compression.
RFC2246, which defines TLS 1.0, explicitly states that

   "TLS_NULL_WITH_NULL_NULL is specified and is the initial state of a
TLS connection during the first handshake on that channel, but MUST
NOT be negotiated, as it provides no more protection than an
unsecured connection." [RFC2246, A.5. The Cipher Suite]

and that paragraph is still present in RFC5246 (TLS 1.2). The other
cipher suits without actual encryption seem to be

  TLS_RSA_WITH_NULL_MD5
  TLS_RSA_WITH_NULL_SHA
  TLS_RSA_WITH_NULL_SHA256 (TLS 1.2)

Unless I'm missing something, that leaves us with no way of skipping the
initial RSA handshake, and also (more importantly) no way of avoiding an MD5
or SHA digest of every packet sent.

I'm starting to think that relying on SSL/TLS for compression of
unencrypted connections might not be such a good idea after all. We'd
be using the protocol in a way it quite clearly never was intended to
be used...

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench--new transaction type

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 21:41, Peter Geoghegan wrote:

On 20 June 2012 18:42, Robert Haas  wrote:

On Wed, Jun 20, 2012 at 3:48 AM, Simon Riggs  wrote:

I'm sure Jeff submitted this because of the need for a standard test,
rather than the wish to actually modify pgbench itself.

Can I suggest that we include a list of standard scripts with pgbench
for this purpose? These can then be copied alongside the binary when
we do an install.


I was thinking along similar lines myself.  At the least, I think we
can't continue to add a short option for every new test type.
Instead, maybe we could have --test-type=WHATEVER, and perhaps that
then reads whatever.sql from some compiled-in directory.  That would
allow us to sanely support a moderately large number of tests.


We could call the --test-type option -f, and the "compiled-in directory" 
could be the current directory ;-).



+1. As long as pgbench is considered to be the standard benchmarking
tool (and I think that it is a general problem that it is), we ought
to make an effort to give people more options.


Yeah, this sounds like a good approach. A library of standard workload 
scripts seems very useful. I've been using custom scripts to benchmark 
WAL insertion scalability lately, that also seems like a kind of a thing 
to put in such a library. I don't know if we should ship the library of 
scripts in contrib, or just put them up on a web site, but something 
like that...


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 21 June 2012 02:45, Heikki Linnakangas
 wrote:
> On 20.06.2012 16:46, Simon Riggs wrote:
>>
>> The proposal now includes flag bits that would allow the addition of a
>> variable length header, should that ever become necessary. So the
>> unused space in the fixed header is not being "used up" as you say. In
>> any case, the fixed header still has 4 wasted bytes on 64bit systems
>> even after the patch is applied. So this claim of short sightedness is
>> just plain wrong.
>>
>> ...
>
>>
>>
>> We need to add information to every WAL record that is used as the
>> source for generating LCRs. It is also possible to add this to HEAP
>> and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
>> using the *currently wasted* bytes on a WAL record header does *not*
>> bloat the WAL stream.
>

Wonderful ideas, these look good.

> Or, we could provide a mechanism for resource managers to use those padding
> bytes for whatever data the wish to use.

Sounds better to me.

> Or modify the record format so that
> the last 4 bytes of the data in the WAL record are always automatically
> stored in those padding bytes, thus making all WAL records 4 bytes shorter.
> That would make the WAL even more compact, with only a couple of extra CPU
> instructions in the critical path.

Sounds cool, but a little weird, even for me.

> My point is that it's wrong to think that it's free to use those bytes, just
> because they're currently unused. If we use them for one thing, we can't use
> them for other things anymore. If we're so concerned about WAL bloat that we
> can't afford to add any more bytes to the WAL record header or heap WAL
> records, then it would be equally fruitful to look at ways to use those
> padding bytes to save that precious WAL space.

Agreed. Thanks for sharing those ideas. Exactly why I like the list (really...)

> I don't think we're *that* concerned about the WAL bloat, however. So let's
> see what is the most sensible place to add whatever extra information we
> need in the WAL, from the point of view of maintainability, flexibility,
> readability etc. Then we can decide where to put it.

Removing FPWs is still the most important aspect there.

I think allowing rmgrs to redefine the wasted bytes in the header is
the best idea.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 21 June 2012 02:32, Heikki Linnakangas
 wrote:
> On 20.06.2012 17:35, Simon Riggs wrote:
>>
>> On 20 June 2012 16:23, Heikki Linnakangas
>>   wrote:
>>>
>>> On 20.06.2012 11:17, Simon Riggs wrote:


 On 20 June 2012 15:45, Heikki Linnakangas
     wrote:
>
>
> So, if the origin id is not sufficient for some conflict resolution
> mechanisms, what extra information do you need for those, and where do
> you put it?


 As explained elsewhere, wal_level = logical (or similar) would be used
 to provide any additional logical information required.

 Update and Delete WAL records already need to be different in that
 mode, so additional info would be placed there, if there were any.

 In the case of reflexive updates you raised, a typical response in
 other DBMS would be to represent the query
   UPDATE SET counter = counter + 1
 by sending just the "+1" part, not the current value of counter, as
 would be the case with the non-reflexive update
   UPDATE SET counter = 1

 Handling such things in Postgres would require some subtlety, which
 would not be resolved in first release but is pretty certain not to
 require any changes to the WAL record header as a way of resolving it.
 Having already thought about it, I'd estimate that is a very long
 discussion and not relevant to the OT, but if you wish to have it
 here, I won't stop you.
>>>
>>>
>>>
>>> Yeah, I'd like to hear briefly how you would handle that without any
>>> further
>>> changes to the WAL record header.
>>
>>
>> I already did:
>>
 Update and Delete WAL records already need to be different in that
 mode, so additional info would be placed there, if there were any.
>>
>>
>> The case you mentioned relates to UPDATEs only, so I would suggest
>> that we add that information to a new form of update record only.
>>
>> That has nothing to do with the WAL record header.
>
>
> Hmm, so you need the origin id in the WAL record header to do filtering.
> Except when that's not enough, you add some more fields to heap update and
> delete records.

Yes

> Don't you think it would be simpler to only add the extra fields to heap
> insert, update and delete records, and leave the WAL record header alone? Do
> you ever need extra information on other record types?

No extra info on other record types, in general at least. Doing it
that way puts it in the most logical place, just not the most efficient one.

> The other question is, *what* information do you need to put there? We don't
> necessarily need to have a detailed design of that right now, but it seems
> quite clear to me that we need more flexibility there than this patch
> provides, to support more complicated conflict resolution.

Another patch already covers that for the common non-reflexive case.
More on conflict resolution soon, I guess. I was going to begin
working on that in July. Starting with the design discussed on list,
of course. This patch has thrown up stuff I thought was
compartmentalised, but I was wrong.

> I'm not saying that we need to implement all possible conflict resolution
> algorithms right now - on the contrary I think conflict resolution belongs
> outside core

It's a pretty standard requirement to have user defined conflict
handling, if that's what you mean.

I'm OK with conflict handling being outside of core as a module, if
that's what people think. We just need a commit hook. That is more
likely to allow a workable solution with 9.3 as well, ISTM. It's
likely to be contentious...

> - but if we're going to change the WAL record format to support
> such conflict resolution, we better make sure the foundation we provide for
> it is solid.

Agreed

> BTW, one way to work around the lack of origin id in the WAL record header
> is to just add an origin-id column to the table, indicating the last node
> that updated the row. That would be a kludge, but I thought I'd mention it..

err, I hope you mean that to be funny. (It wouldn't actually work either)

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 16:46, Simon Riggs wrote:

The proposal now includes flag bits that would allow the addition of a
variable length header, should that ever become necessary. So the
unused space in the fixed header is not being "used up" as you say. In
any case, the fixed header still has 4 wasted bytes on 64bit systems
even after the patch is applied. So this claim of short sightedness is
just plain wrong.


> ...
>

We need to add information to every WAL record that is used as the
source for generating LCRs. It is also possible to add this to HEAP
and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
using the *currently wasted* bytes on a WAL record header does *not*
bloat the WAL stream.


Or, we could provide a mechanism for resource managers to use those 
padding bytes for whatever data the wish to use. Or modify the record 
format so that the last 4 bytes of the data in the WAL record are always 
automatically stored in those padding bytes, thus making all WAL records 
4 bytes shorter. That would make the WAL even more compact, with only a 
couple of extra CPU instructions in the critical path.


My point is that it's wrong to think that it's free to use those bytes, 
just because they're currently unused. If we use them for one thing, we 
can't use them for other things anymore. If we're so concerned about WAL 
bloat that we can't afford to add any more bytes to the WAL record 
header or heap WAL records, then it would be equally fruitful to look at 
ways to use those padding bytes to save that precious WAL space.


I don't think we're *that* concerned about the WAL bloat, however. So 
let's see what is the most sensible place to add whatever extra 
information we need in the WAL, from the point of view of 
maintainability, flexibility, readability etc. Then we can decide where 
to put it.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 07:50:37 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 1:40 PM, Andres Freund  
wrote:
> >> I realized a problem with that idea this morning: it might work for
> >> reading things, but if anyone attempts to write data you've got big
> >> problems.  Maybe we could get away with forbidding that, not sure.
> > 
> > Hm, why is writing a problem? You mean io conversion routines writing
> > data? Yes, that will be a problem. I am fine with simply forbidding
> > that, we should be able to catch that and provide a sensible error
> > message, since SSI we have the support for that.
> I think we could do something a little more vigorous than that, like
> maybe error out if anyone tries to do anything that would write WAL or
> acquire an XID.
I would go for all of them ;). The read only transaction warnings will 
probably result in the best error messages.

> Of course, then the question becomes: then what?  We
> probably need to think about what happens after that - we don't want
> an error replicating one row in one table to permanently break
> replication for the entire system.
Interesting problem, yes. My first reaction was just to warn, log, and 
throw the transaction away... But that's not likely to work very well on the 
apply side...

I don't think it's a particularly likely problem though. An I/O function doing 
that would probably fail in lots of other situations (HS, read only 
transactions, permission problems).

> >> Sorry.  I don't think you're planning to do something evil, but before
> >> I thought you said you did NOT want to write the code to extract
> >> changes as text or something similar.
> > Hm. I might have been a bit ambiguous when saying that I do not want to
> > provide everything for that use-case.
> > Once we have a callpoint that has a correct catalog snapshot for exactly
> > the tuple in question text conversion is damn near trivial. The point
> > where you get passed all that information (action, tuple, table,
> > snapshot) is the one I think the patch should mainly provide.
> This is actually a very interesting list.  We could rephrase the
> high-level question about the design of this feature as "what is the
> best way to make sure that you have these things available?".  Action
> and tuple are trivial to get, and table isn't too hard either.  It's
> really the snapshot - and all the downstream information that can only
> be obtained via using that snapshot - that is the hard part.
For others, a sensible entry point into this discussion before switching 
subthreads probably is in http://archives.postgresql.org/message-
id/201206192023.20589.and...@2ndquadrant.com

The table isn't as easy as you might think, as the WAL record only contains the 
relfilenode. Unless you want to log more, it's basically solved by solving the 
snapshot issue.
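
To illustrate the lookup involved (the relfilenode value is made up): mapping
it back to a table means consulting the catalogs of the right database -- and,
for the decoding case, at the right point in time, which is exactly the
snapshot problem:

    SELECT n.nspname, c.relname
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE pg_relation_filenode(c.oid) = 16385;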

And yes, agreeing on how to do that is the thing we need to solve next. Patch 
07 (add enough information to wal) and this are the first ones that should get 
committed imo. And be ready for that obviously.

Just noticed that I failed to properly add Patch 07 to the commitfest. I have 
done so now, hope that's ok.


Greetings,

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench--new transaction type

2012-06-20 Thread Peter Geoghegan
On 20 June 2012 18:42, Robert Haas  wrote:
> On Wed, Jun 20, 2012 at 3:48 AM, Simon Riggs  wrote:
>> I'm sure Jeff submitted this because of the need for a standard test,
>> rather than the wish to actually modify pgbench itself.
>>
>> Can I suggest that we include a list of standard scripts with pgbench
>> for this purpose? These can then be copied alongside the binary when
>> we do an install.
>
> I was thinking along similar lines myself.  At the least, I think we
> can't continue to add a short option for every new test type.
> Instead, maybe we could have --test-type=WHATEVER, and perhaps that
> then reads whatever.sql from some compiled-in directory.  That would
> allow us to sanely support a moderately large number of tests.

+1. As long as pgbench is considered to be the standard benchmarking
tool (and I think that it is a general problem that it is), we ought
to make an effort to give people more options.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Heikki Linnakangas

On 20.06.2012 17:35, Simon Riggs wrote:

On 20 June 2012 16:23, Heikki Linnakangas
  wrote:

On 20.06.2012 11:17, Simon Riggs wrote:


On 20 June 2012 15:45, Heikki Linnakangas
wrote:


So, if the origin id is not sufficient for some conflict resolution
mechanisms, what extra information do you need for those, and where do
you put it?


As explained elsewhere, wal_level = logical (or similar) would be used
to provide any additional logical information required.

Update and Delete WAL records already need to be different in that
mode, so additional info would be placed there, if there were any.

In the case of reflexive updates you raised, a typical response in
other DBMS would be to represent the query
   UPDATE SET counter = counter + 1
by sending just the "+1" part, not the current value of counter, as
would be the case with the non-reflexive update
   UPDATE SET counter = 1

Handling such things in Postgres would require some subtlety, which
would not be resolved in first release but is pretty certain not to
require any changes to the WAL record header as a way of resolving it.
Having already thought about it, I'd estimate that is a very long
discussion and not relevant to the OT, but if you wish to have it
here, I won't stop you.



Yeah, I'd like to hear briefly how you would handle that without any further
changes to the WAL record header.


I already did:


Update and Delete WAL records already need to be different in that
mode, so additional info would be placed there, if there were any.


The case you mentioned relates to UPDATEs only, so I would suggest
that we add that information to a new form of update record only.

That has nothing to do with the WAL record header.


Hmm, so you need the origin id in the WAL record header to do filtering. 
Except when that's not enough, you add some more fields to heap update 
and delete records.


Don't you think it would be simpler to only add the extra fields to heap 
insert, update and delete records, and leave the WAL record header 
alone? Do you ever need extra information on other record types?


The other question is, *what* information do you need to put there? We 
don't necessarily need to have a detailed design of that right now, but 
it seems quite clear to me that we need more flexibility there than this 
patch provides, to support more complicated conflict resolution.


I'm not saying that we need to implement all possible conflict 
resolution algorithms right now - on the contrary I think conflict 
resolution belongs outside core - but if we're going to change the WAL 
record format to support such conflict resolution, we better make sure 
the foundation we provide for it is solid.


BTW, one way to work around the lack of origin id in the WAL record 
header is to just add an origin-id column to the table, indicating the 
last node that updated the row. That would be a kludge, but I thought 
I'd mention it..


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal and patch : support INSERT INTO...RETURNING with partitioned table using rule

2012-06-20 Thread Kevin Grittner
John Lumby  wrote:
 
> I attach patch based on clone of postgresql.git as of yesterday 
 
Please read about the CommitFest process:
 
http://wiki.postgresql.org/wiki/CommitFest
 
and add your patch to the open CF:
 
https://commitfest.postgresql.org/action/commitfest_view/open
 
This will ensure that the patch doesn't get lost before the next review
cycle starts.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Too frequent message of pgbench -i?

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 4:04 AM, Tatsuo Ishii  wrote:
> Currently pgbench -i prints the following message every 10k tuples created.
>
>                        fprintf(stderr, "%d tuples done.\n", j);
>
> I think it was long ago that this message frequency seemed appropriate;
> computers are getting so fast these days that a message every 10k tuples
> seems too often to me. Can we change the frequency from 10k to 100k?

+1 for doing it that way.  I have an old patch lying around for that.
It's one line, which seems to be about as much work as the problem
justifies.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgbench-quiet.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Alvaro Herrera

Excerpts from Andrew Dunstan's message of Wed Jun 20 12:56:52 -0400 2012:
> 
> On 06/20/2012 12:50 PM, Alvaro Herrera wrote:
> > Another related case: you have a file with headers and columns (n, t, 
> > x, y, z) but your table only has n and t. How would you tell COPY to 
> > discard the junk columns? Currently it just complains that they are 
> > there. 
> 
> That's one of the main use cases for file_text_array_fdw.

Ah, great, thanks.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_basebackup blocking all queries with horrible performance

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 7:18 AM, Magnus Hagander  wrote:
>>>> You agreed to add something like NOSYNC option into START_REPLICATION 
>>>> command?
>>>
>>> I'm on the fence. I was hoping somebody else would chime in with an
>>> opinion as well.
>>
>> +1
>
> Nobody else with any opinion on this? :(

I don't think we really need a NOSYNC flag at this point.  Just not
setting the flush location in clients that don't make a point of flushing in
a timely fashion seems fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pg default's verbosity?

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 11:25 AM, Peter Eisentraut  wrote:
> I don't like these messages any more than the next guy, but why drop
> only those, and not any of the other NOTICE-level messages?  The meaning
> of NOTICE is pretty much, if this is the first time you're using
> PostgreSQL, let me tell you a little bit about how we're doing things
> here.  If you've run your SQL script more than 3 times, you won't need
> them anymore.  So set your client_min_messages to WARNING then.  That
> should be pretty much standard for running SQL scripts, in addition to
> all the other stuff listed here:
> http://petereisentraut.blogspot.fi/2010/03/running-sql-scripts-with-psql.html

Well, let's look at some of the other places where we use NOTICE:

ereport(NOTICE,
(errcode(ERRCODE_NO_ACTIVE_SQL_TRANSACTION),
 errmsg("there is no transaction in progress")));

ereport(NOTICE,
(errmsg("pg_stop_backup cleanup done, waiting
for required WAL segments to be archived")));

ereport(msglevel,
/* translator: %d always has a value larger than 1 */
(errmsg_plural("drop cascades to %d other object",
   "drop cascades to %d other objects",
   numReportedClient + numNotReportedClient,
   numReportedClient + numNotReportedClient),
 errdetail("%s", clientdetail.data),
 errdetail_log("%s", logdetail.data)));

ereport(NOTICE,
   (errmsg("merging constraint \"%s\" with inherited definition",
   ccname)));

ereport(NOTICE,
(errmsg("database \"%s\" does not exist, skipping",
dbname)));

ereport(NOTICE,
   (errmsg("version \"%s\" of extension \"%s\" is already installed",
   versionName, stmt->extname)));

ereport(NOTICE,
        (errcode(ERRCODE_WRONG_OBJECT_TYPE),
         errmsg("argument type %s is only a shell",
                TypeNameToString(t))));

It seems to me that at least some of those have quite a bit more value
than the messages under discussion, and they're not all just tips for
novices.  Maybe you'd want to suppress those when running a script and
maybe you wouldn't, but I think that most people don't want to see the
messages under discussion even in interactive mode, unless perhaps
they are debugging some unexpected behavior.
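
For reference, the workaround Peter describes is just

    SET client_min_messages = warning;

which silences every NOTICE -- including ones like the "does not exist,
skipping" case above that arguably do carry information.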

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 21 June 2012 01:06, Christopher Browne  wrote:

> I guess I'm not seeing the purpose to having the origin node id in the
> WAL stream either.
>
> We have it in the Slony sl_log_* stream, however there is a crucial
> difference, in that sl_log_* is expressly a shared structure.  In
> contrast, WAL isn't directly sharable; you don't mix together multiple
> WAL streams.

Unfortunately you do. That's really the core of how this differs from
current Slony.

Every change we make creates WAL records, whether that is changes
originating on the current node or changes originating on upstream
nodes that need to be applied on the current node.

The WAL stream is then read and filtered for changes to pass onto
other nodes. So we want to be able to filter out the applied changes
to avoid passing them back to the original nodes.

Having each record know the origin makes the filtering much simpler,
so if it's possible to do it efficiently then it's the best design. It
is also the best way to do this that we know of so far. There are other
designs, however, as noted. In all cases we need the origin id in the
WAL.
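
(Purely as illustration of that filtering, here is a minimal C sketch;
RepOriginId, xl_origin_id and should_forward_to are invented stand-ins for
the proposal under discussion, not code that exists in PostgreSQL today.)

#include <stdbool.h>

typedef unsigned short RepOriginId;     /* 2 bytes, as proposed */

typedef struct SketchRecordHeader
{
    unsigned int xl_tot_len;            /* other header fields elided */
    RepOriginId  xl_origin_id;          /* proposed: node that generated it */
} SketchRecordHeader;

/* Forward a record to a peer only if the change did not originate there. */
static bool
should_forward_to(const SketchRecordHeader *rec, RepOriginId peer_node)
{
    return rec->xl_origin_id != peer_node;
}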

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
Hi Chris!

On Wednesday, June 20, 2012 07:06:28 PM Christopher Browne wrote:
> On Wed, Jun 20, 2012 at 11:50 AM, Andres Freund  
wrote:
> > On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
> >> Simon Riggs  wrote:
> >> > This is not transaction metadata, it is WAL record metadata
> >> > required for multi-master replication, see later point.
> >> > 
> >> > We need to add information to every WAL record that is used as the
> >> > source for generating LCRs.
> >> 
> >> If the origin ID of a transaction doesn't count as transaction
> >> metadata (i.e., data about the transaction), what does?  It may be a
> >> metadata element about which you have special concerns, but it is
> >> transaction metadata.  You don't plan on supporting individual WAL
> >> records within a transaction containing different values for origin
> >> ID, do you?  If not, why is it something to store in every WAL
> >> record rather than once per transaction?  That's not intended to be
> >> a rhetorical question.
> > 
> > It's definitely possible to store it per transaction (see the discussion
> > around http://archives.postgresql.org/message-
> > id/201206201605.43634.and...@2ndquadrant.com) it just makes the filtering
> > via the originating node a considerably more complex thing. With our
> > proposal you can do it without any complexity involved, on a low level.
> > Storing it per transaction means you can only stream out the data to
> > other nodes *after* fully reassembling the transaction. That's a pity,
> > especially if we go for a design where the decoding happens in a proxy
> > instance.
> 
> I guess I'm not seeing the purpose to having the origin node id in the
> WAL stream either.
> 
> We have it in the Slony sl_log_* stream, however there is a crucial
> difference, in that sl_log_* is expressly a shared structure.  In
> contrast, WAL isn't directly sharable; you don't mix together multiple
> WAL streams.
> 
> It seems as though the point in time at which you need to know the
> origin ID is the moment at which you're deciding to read data from the
> WAL files, and knowing which stream you are reading from is an
> assertion that might be satisfied by looking at configuration that
> doesn't need to be in the WAL stream itself.  It might be *nice* for
> the WAL stream to be self-identifying, but that doesn't seem to be
> forcibly necessary.
> 
> The case where it *would* be needful is if you are in the process of
> assembling together updates coming in from multiple masters, and need
> to know:
>- This INSERT was replicated from node #1, so should be ignored
> downstream - That INSERT was replicated from node #2, so should be ignored
> downstream - This UPDATE came from the local node, so needs to be passed
> to downstream users
Exactly that is the point. And you want to do that in an efficient manner
without too much logic, which is why something simple like the record header is
so appealing.

> > I also have to admit that I am very hesitant to start developing some
> > generic "transaction metadata" framework atm. That seems to be a good
> > way to spend a good part of time in discussion and disagreeing. Imo
> > that's something for later.

> Well, I see there being a use in there being at least 3 sorts of LCR
> records:
> a) Capturing literal SQL that is to be replayed downstream
> b) Capturing tuple updates in a binary form that can be turned readily
> into heap updates on a replica.
> c) Capturing tuple data in some reasonably portable and readily
> re-writable form
I think we should provide the utilities to do all of those. a) is a 
consequence of being able to do c). 

That doesn't really have anything to do with this subthread though? The part
you quoted above was my response to the suggestion to add some generic
framework to attach metadata to individual transactions on the generating
side. We quite possibly will end up needing that, but I personally don't think
we should be designing that part atm.

> b) Capturing tuple updates in a binary form that can be turned readily
> into heap updates on a replica.
> Unfortunately, this form is likely not to play well when
> replicating across platforms or Postgres versions, so I suspect that
> this performance optimization should be implemented as a *last*
> resort, rather than first.  Michael Jackson had some "rules of
> optimization" that said "don't do it", and, for the expert, "don't do
> it YET..."
Well, apply is a bottleneck. Besides field experience, I/we have benchmarked it
and it's rather plausible that it is. And I don't think we can magically make
that faster in pg in general, so my plan is to remove the biggest cost factor I
can see.
And yes, it will have restrictions...

Regards,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 21 June 2012 01:40, Andres Freund  wrote:

>> I think extraction is a very sensible place to start; actually, I
>> think it's the best possible place to start.  But this particular
>> thread is about adding origin_ids to WAL, which I think is definitely
>> not the best place to start.

> Yep. I think the reason everyone started with it is that the patch was actually
> really simple ;).
> Note that the WAL enrichment & decoding patches were before the origin_id
> patch in the patch series ;)

Actually, I thought it would be non-contentious...

A format change, so early in the release cycle. Already openly discussed
with no comment. Using wasted space for efficiency, for information
already in use by Slony for similar reasons.

Oh well.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 1:40 PM, Andres Freund  wrote:
>> I realized a problem with that idea this morning: it might work for
>> reading things, but if anyone attempts to write data you've got big
>> problems.  Maybe we could get away with forbidding that, not sure.
> Hm, why is writing a problem? You mean I/O conversion routines writing data?
> Yes, that will be a problem. I am fine with simply forbidding that; we should
> be able to catch that and provide a sensible error message, since SSI went in
> we have the support for that.

I think we could do something a little more vigorous than that, like
maybe error out if anyone tries to do anything that would write WAL or
acquire an XID.  Of course, then the question becomes: then what?  We
probably need to think about what happens after that - we don't want
an error replicating one row in one table to permanently break
replication for the entire system.
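
(A minimal sketch of such a guard, for illustration only; the flag and
function names below are invented and nothing with these names exists in
PostgreSQL.)

#include "postgres.h"

/* Assumed backend-local flag set while decoding changes; purely illustrative. */
static bool decoding_in_progress = false;

/*
 * Sketch of a check that could sit wherever an XID would be assigned or
 * WAL would be written, erroring out instead of letting the write proceed.
 */
static void
forbid_writes_while_decoding(void)
{
    if (decoding_in_progress)
        ereport(ERROR,
                (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
                 errmsg("cannot write data while decoding WAL changes")));
}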

>> Sorry.  I don't think you're planning to do something evil, but before
>> I thought you said you did NOT want to write the code to extract
>> changes as text or something similar.
> Hm. I might have been a bit ambiguous when saying that I do not want to
> provide everything for that use-case.
> Once we have a callpoint that has a correct catalog snapshot for exactly the
> tuple in question text conversion is damn near trivial. The point where you
> get passed all that information (action, tuple, table, snapshot) is the one I
> think the patch should mainly provide.

This is actually a very interesting list.  We could rephrase the
high-level question about the design of this feature as "what is the
best way to make sure that you have these things available?".  Action
and tuple are trivial to get, and table isn't too hard either.  It's
really the snapshot - and all the downstream information that can only
be obtained via using that snapshot - that is the hard part.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 23:56, Robert Haas  wrote:
> On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs  wrote:
> But I think getting even
> single-master logical replication working well in a single release
> cycle is going to be a job and a half.

 OK, so your estimate is 1.5 people to do that. And if we have more
 people, should they sit around doing nothing?
>>>
>>> Oh, give me a break.  You're willfully missing my point.  And to quote
>>> Fred Brooks, nine women can't make a baby in one month.
>>
>> No, I'm not. The question is not how quickly can N people achieve a
>> single thing, but how long will it take a few skilled people working
>> on carefully selected tasks that have few dependencies between them to
>> achieve something.
>
> The bottleneck is getting the design right, not writing the code.
> Selecting tasks for people to work on without an agreement on the
> design will not advance the process, unless we just accept whatever
> code you choose to write based on whatever design you happen to pick.

Right. Which is why there are multiple people doing design on
different aspects and much work going into prototyping to give some
useful information into debates that would otherwise be bikeshedding.


>> How exactly did you arrive at your conclusion? Why is yours right and
>> mine wrong?
>
> I estimated the amount of work that would be required to do this right
> and compared it to other large projects that have been successfully
> done in the past.  I think you are looking at something on the order
> of magnitude of the Windows port, which took about four releases to
> become stable, or the SE-Linux project, which still isn't
> feature-complete.  Even if it's only a HS-sized project, that took two
> releases, as did streaming replication.  SSI got committed within one
> release cycle, but there were several years of design and advocacy
> work before any code was written, so that, too, was really a
> multi-year project.

The current BDR project uses concepts that have been in discussion for
many years, something like 6-7 years. I first wrote multi-master for
Postgres in userspace in 2007 and the current designs build upon that.
My detailed design for WAL-based logical replication was produced in
2009. I'm not really breaking new ground in the design, and we're
reusing significant parts of existing code. You don't know that because
you didn't think to ask, which is a little surprising.

On top of that, Andres has refined many of the basic ideas and what he
presents here is a level above my initial design work. I have every
confidence that he'll work independently and produce useful code while
other aspects are considered.


> I'll confine my comments on the second part of the question to the
> observation that it is a bit early to know who is right and who is
> wrong, but the question could just as easily be turned on its head.
>
>> No, I have four people who had initial objections and who have not
>> commented on the fact that the points made are regrettably incorrect.
>
> I think Kevin addressed this point better than I can.  Asserting
> something doesn't make it true, and you haven't offered any rational
> argument against the points that have been made, probably because
> there isn't one.  We *cannot* justify steeling 100% of the available
> bit space for a feature that many people won't use and may not be
> enough to address the real requirement anyway.


You've some assertions above that aren't correct, so it's good that you
recognise assertions may not be true, on both sides. I ask that you
listen a little before casting judgements. We won't get anywhere
otherwise.

The current wasted space on 64-bit systems is 6 bytes per record. The
current suggestion is to use 2 of those, in a flexible manner that
allows future expansion if it is required, and only if it is
required. That flexibility covers anything else we need. Asserting
that "100% of the available space" is being used is wrong, and to
continue to assert that even when told otherwise multiple times is
something else.


>> Since at least 3 of the people making such comments did not attend the
>> full briefing meeting in Ottawa, I am not particularly surprised.
>> However, I do expect people that didn't come to the meeting to
>> recognise that they are likely to be missing information and to listen
>> closely, as I listen to them.
>
> Participation in the community development process is not contingent
> on having flown to Ottawa in May, or on having decided to spend that
> evening at your briefing meeting.  Attributing to ignorance what is
> adequately explained by honest disagreement is impolite.

I don't think people need to have attended that briefing. But if they
did not, then they do need to be aware they missed hours of community
discussion and explanation, built on top of months of careful
investigation, all of which was aimed at helping everybody understand.

Jumping straight into a discussion is what the community is about

Re: [HACKERS] WAL format changes

2012-06-20 Thread Fujii Masao
On Wed, Jun 20, 2012 at 8:19 PM, Magnus Hagander  wrote:
> On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas  wrote:
>> On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
>>  wrote:
>>> Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
>>> a uint64, on top of my other WAL format patches. I think we should go ahead
>>> with this.
>>
>> +1.
>>
>>> The LSNs on pages are still stored in the old format, to avoid changing the
>>> on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
>>> file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
>>> is required.
>>
>> Seems fine.
>>
>>> Should we keep the old representation in the replication protocol messages?
>>> That would make it simpler to write a client that works with different
>>> server versions (like pg_receivexlog). Or, while we're at it, perhaps we
>>> should mandate network-byte order for all the integer and XLogRecPtr fields
>>> in the replication protocol. That would make it easier to write a client
>>> that works across different architectures, in >= 9.3. The contents of the
>>> WAL would of course be architecture-dependent, but it would be nice if
>>> pg_receivexlog and similar tools could nevertheless be
>>> architecture-independent.
>>
>> I share Andres' question about how we're doing this already.  I think
>> if we're going to break this, I'd rather do it in 9.3 than 5 years
>> from now.  At this point it's just a minor annoyance, but it'll
>> probably get worse as people write more tools that understand WAL.
>
> If we are looking at breaking it, and we are especially concerned
> about something like pg_receivexlog... Is it something we could/should
> change in the protocl *now* for 9.2, to make it non-broken in any
> released version? As in, can we extract just the protocol change and
> backpatch that to 9.2beta?

pg_receivexlog in 9.2 cannot correctly handle the WAL location "FF"
(which was skipped in 9.2 and before). For example, pg_receivexlog calls
XLByteAdvance() which always skips "FF". So even if we change the protocol,
ISTM pg_receivexlog in 9.2 cannot work well with the server in 9.3 which
might send "FF". No?

Regards,

-- 
Fujii Masao

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench--new transaction type

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 3:48 AM, Simon Riggs  wrote:
> I'm sure Jeff submitted this because of the need for a standard test,
> rather than the wish to actually modify pgbench itself.
>
> Can I suggest that we include a list of standard scripts with pgbench
> for this purpose? These can then be copied alongside the binary when
> we do an install.

I was thinking along similar lines myself.  At the least, I think we
can't continue to add a short option for every new test type.
Instead, maybe we could have --test-type=WHATEVER, and perhaps that
then reads whatever.sql from some compiled-in directory.  That would
allow us to sanely support a moderately large number of tests.
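
(Rough sketch only of how that lookup could work; the option,
PGBENCH_SCRIPTDIR and the helper below are assumptions for illustration,
not anything pgbench provides today.)

#include <stdio.h>

/* Assumed compile-time location for bundled test scripts. */
#ifndef PGBENCH_SCRIPTDIR
#define PGBENCH_SCRIPTDIR "/usr/local/pgsql/share/pgbench"
#endif

/* Resolve a hypothetical --test-type=NAME to a bundled script path. */
static void
resolve_test_script(const char *test_type, char *path, size_t pathlen)
{
    snprintf(path, pathlen, "%s/%s.sql", PGBENCH_SCRIPTDIR, test_type);
}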

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 07:17:57 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund  
wrote:
> > I would prefer the event trigger way because that seems to be the most
> > extensible/reusable. It would allow fully replicated databases and
> > catalog-only instances.
> > I think we need to design event triggers in a way you cannot simply
> > circumvent them. We already have the case that if users try to screw
> > around with system triggers we give back wrong answers with the planner
> > relying on foreign keys btw.
> > If the problem is having user triggers after system triggers: Let's make
> > that impossible. Forbidding DDL on the other instances once we have that
> > isn't that hard.
> 
> So, this is interesting.  I think something like this could solve the
> problem, but then why not just make it built-in code that runs from
> the same place as the event trigger rather than using the trigger
> mechanism per se?  Presumably the "trigger" code's purpose is going to
> be to inject additional data into the WAL stream (am I wrong?) which
> is not something you're going to be able to do from PL/pgsql anyway,
> so you don't really need a trigger, just a call to some C function -
> which not only has the advantage of being not bypassable, but is also
> faster.
I would be totally fine with that. As long as event triggers provide the 
infrastructure that shouldn't be a big problem.

> > Perhaps all that will get simpler if we can make reading the catalog via
> > custom-built snapshots work as you proposed elsewhere in this thread.
> > That would make checking errors much easier even if you just want to
> > apply to a database with exactly the same schema. That's the next thing I
> > plan to work on.
> I realized a problem with that idea this morning: it might work for
> reading things, but if anyone attempts to write data you've got big
> problems.  Maybe we could get away with forbidding that, not sure.
Hm, why is writing a problem? You mean I/O conversion routines writing data?
Yes, that will be a problem. I am fine with simply forbidding that; we should
be able to catch that and provide a sensible error message, since SSI went in
we have the support for that.

> Would be nice to get some input from other hackers on this.
Oh, yes!


> > I agree that the focus isn't 100% optimal and that there are *loads* of
> > issues we haven't even started to look at. But you need a point to
> > start and extraction & apply seems to be a good one because you can
> > actually test it without the other issues solved which is not really the
> > case the other way round.
> > Also it's possible to plug in the newly built changeset extraction into
> > existing solutions to make them more efficient while retaining most of
> > their respective framework.
> > 
> > So I disagree that that's the wrong part to start with.
> 
> I think extraction is a very sensible place to start; actually, I
> think it's the best possible place to start.  But this particular
> thread is about adding origin_ids to WAL, which I think is definitely
> not the best place to start.
Yep. I think the reason everyone started with it is that the patch was actually
really simple ;).
Note that the WAL enrichment & decoding patches were before the origin_id
patch in the patch series ;)

> > I definitely do want to provide code that generates a textual
> > representation of the changes. As you say, even if it's not used for
> > anything it's needed for debugging. Not sure if it should be SQL or maybe
> > the new Slony representation. If that's provided and reusable it should
> > make sure that on top of that other solutions can be built.
> Oh, yeah.  If we can get that, I will throw a party.
Good ;)

> > I find your supposition that I/we just want to get MMR without regard for
> > anything else a bit offensive. I wrote at least three times in this
> > thread that I do think its likely that we will not get more than the
> > minimal basis for implementing MMR into 9.3. I wrote multiple times that
> > I want to provide the basis for multiple solutions. The prototype -
> > while obviously being incomplete - tried hard to be modular.
> > You cannot blame us for wanting the work we do to *also* be usable for
> > one of our major aims?
> > What can I do to convince you/others that I am not planning to do
> > something "evil" but that I try to reach as many goals at once as
> > possible?
> Sorry.  I don't think you're planning to do something evil, but before
> I thought you said you did NOT want to write the code to extract
> changes as text or something similar.
Hm. I might have been a bit ambiguous when saying that I do not want to 
provide everything for that use-case.
Once we have a callpoint that has a correct catalog snapshot for exactly the 
tuple in question text conversion is damn near trivial. The point where you 
get passed all that information (action, tuple, table, snapshot) is the one I 
think the patch should mainly provide.

> I think that

Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Josh Berkus

> 
> 4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...)
> 
> to cover the case when column names do not match.

Personally, I think that's going way beyond what we want to do with
COPY.  At that point, check out the CSV-array FDW.

Of course, if someone writes a WIP patch which implements the above, we
can evaluate it then.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 20 June 2012 17:41, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> No, I'm suggesting it would probably be at least a bit of a win here
>> to cache the constant, and only have to do a strxfrm() + strcmp() per
>> comparison.
>
> Um, have you got any hard evidence to support that notion?  The
> traditional advice is that strcoll is faster than using strxfrm unless
> the same strings are to be compared repeatedly.  I'm not convinced that
> saving the strxfrm on just one side will swing the balance.

Fair enough, but I'd suggest that that traditional advice assumes
that strcoll()'s space-efficiency and general avoidance of dynamic
allocation is an important factor, which, for us, it clearly isn't,
since we can re-use a single buffer in the manner of Robert's text
sortsupport patch for each and every per-tuple strxfrm() blob (when
traversing an index). This is a direct quote from glibc's strcoll_l.c:

 Perform the first pass over the string and while doing this find
 and store the weights for each character.  Since we want this to
 be as fast as possible we are using `alloca' to store the temporary
 values.  But since there is no limit on the length of the string
 we have to use `malloc' if the string is too long.  We should be
 very conservative here.

Here, alloca is used to allocate space in a stack frame. I believe
that this is an entirely inappropriate trade-off for Postgres to be
making. strxfrm(), in contrast, leaves buffer sizing and management
up to the caller. That has to be a big part of the problem here.

>> The fact is that this is likely to be a fairly significant
>> performance win, because strxfrm() is quite simply the way you're
>> supposed to do collation-aware sorting, and is documented as such. For
>> that reason, C standard library implementations should not be expected
>> to emphasize its performance - they assume that you're using strxfrm()
>> + their highly optimised strcmp()
>
> Have you got any evidence in support of this claim, or is it just
> wishful thinking about what's likely to be inside libc?

According to the single-unix specification's strcoll() documentation,
"The strxfrm() and strcmp() functions should be used for sorting large
lists". If that isn't convincing enough for you, there is the fact
that glibc's strcmp() is clearly highly optimised for each and every
architecture, and that we are currently throwing away an extra
strcmp() in the event of strcoll() equality.
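
(A minimal standalone sketch of that caching, for illustration: transform
the constant once, then do one strxfrm() plus strcmp() per comparison. The
strings are arbitrary, and a real version would reuse a single buffer
instead of calling malloc per candidate.)

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    const char *constant = "pear";
    const char *candidates[] = { "apple", "pear", "pearl" };
    char       *cached;
    size_t      clen;
    int         i;

    setlocale(LC_COLLATE, "");

    /* transform the constant once and cache the resulting blob */
    clen = strxfrm(NULL, constant, 0) + 1;
    cached = malloc(clen);
    strxfrm(cached, constant, clen);

    for (i = 0; i < 3; i++)
    {
        size_t  blen = strxfrm(NULL, candidates[i], 0) + 1;
        char   *blob = malloc(blen);    /* a real version would reuse one buffer */
        int     cmp;

        strxfrm(blob, candidates[i], blen);
        cmp = strcmp(cached, blob);     /* one strxfrm + strcmp per comparison */
        printf("'%s' vs '%s': %d\n", constant, candidates[i], cmp);
        free(blob);
    }
    free(cached);
    return 0;
}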

> I'd also note that any comparisons you may have seen about this are certainly 
> not
> accounting for the effects of data bloat from strxfrm (ie, possible
> spill to disk, more merge passes, etc).

What about the fact that strcoll() may be repeatedly allocating and
freeing memory per comparison? The blobs really aren't that much
larger than the strings to be sorted, which are typically quite short.

> In any case, if you have to redefine the meaning of equality in order
> to justify a performance patch, I'm prepared to walk away at the start.

The advantage of my proposed implementation is precisely that I won't
have to redefine the meaning of equality, and that only the text
datatype will have to care about equivalency, so you can just skip
over an explanation of equivalency for most audiences.

If you feel that strongly about it, and I have no possible hope of
getting this accepted, I'm glad that I know now rather than after
completing a significant amount of work on this. I would like to hear
other people's opinions before I drop it though.

> The range of likely performance costs/benefits across different locales
> and different implementations is so wide that if you can't show it to be
> a win even with the strcmp tiebreaker, it's not likely to be a reliable
> win without that.

glibc is the implementation that really matters. My test-case used
en_US.UTF-8 as its locale, which has to be one of the least stressful
to strcoll() - I probably could have shown a larger improvement just
by selecting a locale that was known to have to make more passes,
like, say, hu_HU.UTF-8.

I expect to have access to a 16 core server next week, which Bull have
made available to me. Maybe I'll get some interesting performance
numbers from it.

The reason that I don't want to use the blob with original string hack
is because it's ugly, space-inefficient, unnecessary and objectively
incorrect, since it forces us to violate the conformance requirement
C9 of Unicode 3.0, marginal though that may be.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Marc Mamin
>  -----Original Message-----
>  From: pgsql-hackers-ow...@postgresql.org on behalf of Josh Berkus
>  Sent: Wed 6/20/2012 7:06
>  To: pgsql-hackers@postgresql.org
>  Subject: Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

...

>   (1) is valuable
>  for backwards compatibility, and for cases where the CSV header was
>  generated by another program (Excel?) so the column names don't match.


4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...)

to cover the case when column names do not match.

my 2 pences,

Marc Mamin


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 12:41 PM, Tom Lane  wrote:
>> The fact is that this is likely to be a fairly significant
>> performance win, because strxfrm() is quite simply the way you're
>> supposed to do collation-aware sorting, and is documented as such. For
>> that reason, C standard library implementations should not be expected
>> to emphasize its performance - they assume that you're using strxfrm()
>> + their highly optimised strcmp()
>
> Have you got any evidence in support of this claim, or is it just
> wishful thinking about what's likely to be inside libc?  I'd also note
> that any comparisons you may have seen about this are certainly not
> accounting for the effects of data bloat from strxfrm (ie, possible
> spill to disk, more merge passes, etc).
>
> In any case, if you have to redefine the meaning of equality in order
> to justify a performance patch, I'm prepared to walk away at the start.
> The range of likely performance costs/benefits across different locales
> and different implementations is so wide that if you can't show it to be
> a win even with the strcmp tiebreaker, it's not likely to be a reliable
> win without that.

On the testing I've done, the strcmp() tie-breaking rarely gets run
anyway.  Unless you're sorting data with only a few distinct values,
most comparisons are between values that are distinct under any choice
of collation, and therefore strcoll() returns 0 very rarely, and
therefore the additional runtime it consumes does not matter very
much.  Also, it's quite a bit faster than strcoll() anyway, so even
when it does run it doesn't add much to the total time.

I think the elephant in the room here is that we're relying on the OS
to do everything for us, and the OS API we use (strcoll) requires an
extra memcpy and is also dreadfully slow.  If we could solve that
problem, it would save us a lot more than worrying about the extra
strcmp().  Of course, solving that problem is hard: we either have to
get the glibc and FreeBSD libc folks to provide a better API (that
takes lengths for each input instead of relying on trailing nul
bytes), or reimplement locales within PG, or store trailing NUL bytes
that we don't really need in the index so we can apply strcoll
directly, none of which are very appealing.
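
(For illustration, the extra copying forced by strcoll()'s NUL-terminated
API looks roughly like the sketch below; the function name is invented and
the actual varstr_cmp code manages its buffers differently.)

#include <stdlib.h>
#include <string.h>

/*
 * Sketch: compare two counted (non-NUL-terminated) strings with the
 * locale-aware strcoll(), which only accepts NUL-terminated input, so
 * both values must first be copied just to append the terminator.
 */
static int
collate_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    char   *a2 = malloc(alen + 1);
    char   *b2 = malloc(blen + 1);
    int     result;

    memcpy(a2, a, alen);
    a2[alen] = '\0';
    memcpy(b2, b, blen);
    b2[blen] = '\0';

    result = strcoll(a2, b2);

    free(a2);
    free(b2);
    return result;
}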

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund  wrote:
> I would prefer the event trigger way because that seems to be the most
> extensible/reusable. It would allow fully replicated databases and catalog-only
> instances.
> I think we need to design event triggers in a way you cannot simply circumvent
> them. We already have the case that if users try to screw around with system
> triggers we give back wrong answers with the planner relying on foreign keys
> btw.
> If the problem is having user triggers after system triggers: Let's make that
> impossible. Forbidding DDL on the other instances once we have that isn't that
> hard.

So, this is interesting.  I think something like this could solve the
problem, but then why not just make it built-in code that runs from
the same place as the event trigger rather than using the trigger
mechanism per se?  Presumably the "trigger" code's purpose is going to
be to inject additional data into the WAL stream (am I wrong?) which
is not something you're going to be able to do from PL/pgsql anyway,
so you don't really need a trigger, just a call to some C function -
which not only has the advantage of being not bypassable, but is also
faster.

> Perhaps all that will get simpler if we can make reading the catalog via
> custom-built snapshots work as you proposed elsewhere in this thread. That
> would make checking errors much easier even if you just want to apply to
> a database with exactly the same schema. That's the next thing I plan to work
> on.

I realized a problem with that idea this morning: it might work for
reading things, but if anyone attempts to write data you've got big
problems.  Maybe we could get away with forbidding that, not sure.
Would be nice to get some input from other hackers on this.

>> They could also modify the catalogs directly, although it's possible we
>> don't care quite as much about that case (but on the other hand people
>> do sometimes need to do it to solve real problems).
> With that you already can crash the database perfectly fine today. I think
> trying to care for that is a waste of time.

You're probably right.

> I agree that the focus isn't 100% optimal and that there are *loads* of issues
> we haven't even started to look at. But you need a point to start and
> extraction & apply seems to be a good one because you can actually test it
> without the other issues solved which is not really the case the other way
> round.
> Also it's possible to plug in the newly built changeset extraction into
> existing solutions to make them more efficient while retaining most of their
> respective framework.
>
> So I disagree that that's the wrong part to start with.

I think extraction is a very sensible place to start; actually, I
think it's the best possible place to start.  But this particular
thread is about adding origin_ids to WAL, which I think is definitely
not the best place to start.

> I definitely do want to provide code that generates a textual representation
> of the changes. As you say, even if it's not used for anything it's needed for
> debugging. Not sure if it should be SQL or maybe the new Slony representation.
> If that's provided and reusable it should make sure that on top of that other
> solutions can be built.

Oh, yeah.  If we can get that, I will throw a party.

> I find your supposition that I/we just want to get MMR without regard for
> anything else a bit offensive. I wrote at least three times in this thread
> that I do think its likely that we will not get more than the minimal basis
> for implementing MMR into 9.3. I wrote multiple times that I want to provide
> the basis for multiple solutions. The prototype - while obviously being
> incomplete - tried hard to be modular.
> You cannot blame us for wanting the work we do to *also* be usable for
> one of our major aims?
> What can I do to convince you/others that I am not planning to do something
> "evil" but that I try to reach as many goals at once as possible?

Sorry.  I don't think you're planning to do something evil, but before
I thought you said you did NOT want to write the code to extract
changes as text or something similar.  I think that would be a really
bad thing to skip for all kinds of reasons.  I think we need that as a
foundational technology before we do much else.  Now, once we have
that, if we can safely detect cases where it's OK to bypass decoding
to text and skip it in just those cases, I think that's great
(although possibly difficult to implement correctly).  I basically
feel that without decode-to-text, this can't possibly be a basis for
multiple solutions; it will be a basis only for itself, and extremely
difficult to debug, too.  No other replication solution can even
theoretically have any use for the raw on-disk tuple, at least not
without horrible kludgery.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Christopher Browne
On Wed, Jun 20, 2012 at 11:50 AM, Andres Freund  wrote:
> On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
>> Simon Riggs  wrote:
>> > This is not transaction metadata, it is WAL record metadata
>> > required for multi-master replication, see later point.
>
>> > We need to add information to every WAL record that is used as the
>> > source for generating LCRs.
>> If the origin ID of a transaction doesn't count as transaction
>> metadata (i.e., data about the transaction), what does?  It may be a
>> metadata element about which you have special concerns, but it is
>> transaction metadata.  You don't plan on supporting individual WAL
>> records within a transaction containing different values for origin
>> ID, do you?  If not, why is it something to store in every WAL
>> record rather than once per transaction?  That's not intended to be
>> a rhetorical question.
> It's definitely possible to store it per transaction (see the discussion around
> http://archives.postgresql.org/message-
> id/201206201605.43634.and...@2ndquadrant.com) it just makes the filtering via
> the originating node a considerably more complex thing. With our proposal you
> can do it without any complexity involved, on a low level. Storing it per
> transaction means you can only stream out the data to other nodes *after*
> fully reassembling the transaction. That's a pity, especially if we go for a
> design where the decoding happens in a proxy instance.

I guess I'm not seeing the purpose to having the origin node id in the
WAL stream either.

We have it in the Slony sl_log_* stream, however there is a crucial
difference, in that sl_log_* is expressly a shared structure.  In
contrast, WAL isn't directly sharable; you don't mix together multiple
WAL streams.

It seems as though the point in time at which you need to know the
origin ID is the moment at which you're deciding to read data from the
WAL files, and knowing which stream you are reading from is an
assertion that might be satisfied by looking at configuration that
doesn't need to be in the WAL stream itself.  It might be *nice* for
the WAL stream to be self-identifying, but that doesn't seem to be
forcibly necessary.

The case where it *would* be needful is if you are in the process of
assembling together updates coming in from multiple masters, and need
to know:
   - This INSERT was replicated from node #1, so should be ignored downstream
   - That INSERT was replicated from node #2, so should be ignored downstream
   - This UPDATE came from the local node, so needs to be passed to
downstream users

Or perhaps something else is behind the node id being deeply embedded
into the stream that I'm not seeing altogether.

> Other metadata will not be needed on such a low level.
>
> I also have to admit that I am very hesitant to start developing some generic
> "transaction metadata" framework atm. That seems to be a good way to spend a
> good part of time in discussion and disagreeing. Imo that's something for
> later.

Well, I see there being a use in there being at least 3 sorts of LCR records:
a) Capturing literal SQL that is to be replayed downstream.  This
parallels two use cases in existing replication systems:
   i) In pre-2.2 versions of Slony, statements are replayed literally.
 So there's a stream of INSERT/UPDATE/DELETE statements.
   ii) DDL capture and replay.  In existing replication systems, DDL
isn't captured implicitly, the way Dimitri's Event Triggers are intended to do,
but rather is captured explicitly.
   There should be a function to allow injecting such SQL explicitly;
that is sure to be a useful sort of thing to be able to do.

b) Capturing tuple updates in a binary form that can be turned readily
into heap updates on a replica.
Unfortunately, this form is likely not to play well when
replicating across platforms or Postgres versions, so I suspect that
this performance optimization should be implemented as a *last*
resort, rather than first.  Michael Jackson had some "rules of
optimization" that said "don't do it", and, for the expert, "don't do
it YET..."

c) Capturing tuple data in some reasonably portable and readily
re-writable form.
Slony 2.2 changes from "SQL fragments" (of a) i) above) to storing
updates as an array of text values indicating:
 - relation name
 - attribute names
 - attribute values, serialized into strings
I don't know that this provably represents the *BEST*
representation, but it definitely will be portable where b) would not
be, and lends itself to being able to reuse query plans, where a)
requires extraordinary amounts of parsing work, today.  So I'm pretty
sure it's better than a) and b) for a sizable set of cases.
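
(Purely illustrative: one way such a portable change could be represented
as a C structure; the names below are invented for this sketch, they are
neither Slony's nor any patch's.)

/* Illustrative only: one change, serialized in a portable, text-based form. */
typedef struct PortableChange
{
    char         action;        /* 'I'nsert, 'U'pdate or 'D'elete */
    const char  *relname;       /* fully qualified relation name */
    int          natts;         /* number of columns captured */
    const char **attnames;      /* column names */
    const char **attvalues;     /* values serialized as text; NULL for SQL NULL */
} PortableChange;
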
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Josh Berkus

> In the past people have asked me to have copy use the CSV header line in
> place of supplying a column list in the COPY command. I can certainly
> see some utility in that, and I think it might achieve what David wants.
> Using that scenario it would be an error to supply an explicit column
> list and also use the header line. But then I don't think CHECK_HEADER
> would be the right name for the option. In any case, specifying a name
> before we settle on an exact behaviour seems wrong :-)

Actually, I can see three valid and valuable-to-users behaviors here:

1) current behavior, header line is ignored.

2) CHECK HEADER: a column list is supplied, but a "check header" flag is
set.  If the column list and header list don't match *exactly*, you get
an error.

3) USE HEADER: no column list is supplied, but a "use header" flag is
set.  A column list is created to match the columns from the CSV header.
 Of necessity, this will consist of all or some of the columns in the
table.  If columns are supplied which are not in the table, then you get
an error (as well as if columns are missing which are NOT NULL, as you
get at present).

(2) is more valuable to people who want to check data integrity
rigorously, and test for unexpected API changes.  (3) is more valuable
for regular users who want CSV import to "just work".  (1) is valuable
for backwards compatibility, and for cases where the CSV header was
generated by another program (Excel?) so the column names don't match.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Simon Riggs
On 20 June 2012 23:34, Kevin Grittner  wrote:
> Simon Riggs  wrote:
>> Kevin Grittner  wrote:
>>> Heikki Linnakangas  wrote:
>>>
>>>> I don't like the idea of adding the origin id to the record
>>>> header. It's only required in some occasions, and on some record
>>>> types.
>>>
>>> Right.
>>
>> Wrong, as explained.
>
> The point is not wrong; you are simply not responding to what is
> being said.

Heikki said that the origin ID was not required for all MMR
configs/scenarios. IMHO that is wrong, with explanation given.

By agreeing with him, I assumed you were sharing that assertion,
rather than saying something else.


> You have not explained why an origin ID is required when there is no
> replication, or if there is master/slave logical replication, or
...

You're right; I never claimed it was needed. Origin Id is only needed
for multi-master replication, and that is the only context in which I've
discussed it.


>> This is not transaction metadata, it is WAL record metadata
>> required for multi-master replication, see later point.
>>
>> We need to add information to every WAL record that is used as the
>> source for generating LCRs.
>
> If the origin ID of a transaction doesn't count as transaction
> metadata (i.e., data about the transaction), what does?  It may be a
> metadata element about which you have special concerns, but it is
> transaction metadata.  You don't plan on supporting individual WAL
> records within a transaction containing different values for origin
> ID, do you?  If not, why is it something to store in every WAL
> record rather than once per transaction?  That's not intended to be
> a rhetorical question.  I think it's because you're still thinking
> of the WAL stream as *the medium* for logical replication data
> rather than *the source* of logical replication data.

> As long as the WAL stream is the medium, options are very
> constrained.  You can code a very fast engine to handle a single
> type of configuration that way, and perhaps that should be a
> supported feature, but it's not a configuration I've needed yet.
> (Well, on reflection, if it had been available and easy to use, I
> can think of *one* time I *might* have used it for a pair of nodes.)
> It seems to me that you are so focused on this one use case that you
> are not considering how design choices which facilitate fast
> development of that use case paint us into a corner in terms of
> expanding to other use cases.

>>> I think removing origin ID from this patch and submitting a
>>> separate patch for a generalized transaction metadata system is
>>> the sensible way to go.
>>
>> We already have a very flexible WAL system for recording data of
>> interest to various resource managers. If you wish to annotate a
>> transaction, you can either generate a new kind of WAL record or
>> you can enhance a commit record.
>
> Right.  Like many of us are suggesting should be done for origin ID.
>
>> XLOG_NOOP records can already be generated by your application if
>> you wish to inject additional metadata to the WAL stream. So no
>> changes are required for you to implement the generalised
>> transaction metadata scheme you say you require.
>
> I'm glad it's that easy.  Are there SQL functions to for that yet?


Yes, another possible design is to generate a new kind of WAL record
for the origin id.

Doing it that way will slow down multi-master by a measurable amount,
and slightly bloat the WAL stream.

The proposed way uses space that is currently wasted and likely to
remain so. Only 2 bytes of 6 bytes available are proposed for use,
with a flag design that allows future extension if required. When MMR
is not in use, the WAL records would look completely identical to the
way they look now, in size, settings and speed of writing them.

Putting the origin id onto each WAL record allows very fast and simple
stateless filtering. I suggest using it because those bytes have been
sitting there unused for close to 10 years now and no better use
springs to mind.
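
(Rough sketch of the layout being referred to; this is a simplified
stand-in, not the actual XLogRecord definition, and field order and types
are approximate.)

#include <stdint.h>

typedef struct SketchXLogRecord
{
    uint32_t    xl_tot_len;     /* total record length */
    uint32_t    xl_xid;         /* transaction id */
    uint32_t    xl_len;         /* length of rmgr data */
    uint8_t     xl_info;        /* flag bits */
    uint8_t     xl_rmid;        /* resource manager */
    uint16_t    xl_origin_id;   /* proposed: reuse 2 of the padding bytes */
    uint64_t    xl_prev;        /* previous record pointer */
    uint32_t    xl_crc;         /* record CRC */
    /* on 64-bit builds the struct is padded out to a MAXALIGN(8) multiple,
     * so a few trailing bytes still remain unused even with the field above */
} SketchXLogRecord;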

The proposed design is the fastest way of implementing MMR, without
any loss for non-users.

As I noted before, slowing down MMR by a small amount causes geometric
losses in performance across the whole cluster.


>> Not sure how or why that relates to requirements for multi-master.
>
> That depends on whether you want to leave the door open to other
> logical replication than the one use case on which you are currently
> focused.  I even consider some of those other cases multi-master,
> especially when multiple databases are replicating to a single table
> on another server.  I'm not clear on your definition -- it seems to
> be rather more narrow.  Maybe we need to define some terms somewhere
> to facilitate discussion.  Is there a Wiki page where that would
> make sense?

The project is called BiDirectional Replication to ensure that people
understood this is not just multi-master. But that doesn't mean that
multi-master can't have its own specific requirements.

Adding origin

Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Andrew Dunstan



On 06/20/2012 12:50 PM, Alvaro Herrera wrote:
> Another related case: you have a file with headers and columns (n, t,
> x, y, z) but your table only has n and t. How would you tell COPY to
> discard the junk columns? Currently it just complains that they are
> there.



That's one of the main use cases for file_text_array_fdw.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Andrew Dunstan



On 06/20/2012 12:18 PM, Tom Lane wrote:

> David Fetter  writes:
>> OK, new proposal:
>> COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
>> an option which requires that HEADER be enabled and mandates that the
>> column names in the header match the columns coming in.
>> Has someone got a better name for this option than
>> KEEP_HEADER_COLUMN_NAMES?
>
> Well, if it's just checking that the list matches, then maybe
> CHECK_HEADER would do.
>
> In your original proposal, I was rather wondering what should happen if
> the incoming file didn't have the same set of columns called out in the
> COPY command's column list.  (That is, while *rearranging* the columns
> might be thought non-astonishing, I'm less convinced about a copy
> operation that inserts or defaults columns differently from what the
> command said should happen.)  If we're just checking for a match, that
> question goes away.





In the past people have asked me to have copy use the CSV header line in 
place of supplying a column list in the COPY command. I can certainly 
see some utility in that, and I think it might achieve what David wants. 
Using that scenario it would be an error to supply an explicit column 
list and also use the header line. But then I don't think CHECK_HEADER 
would be the right name for the option. In any case, specifying a name 
before we settle on an exact behaviour seems wrong :-)


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] foreign key locks

2012-06-20 Thread Tom Lane
Jaime Casanova  writes:
> On Thu, Jun 14, 2012 at 11:41 AM, Alvaro Herrera
>  wrote:
>> This is v12 of the foreign key locks patch.

> Just noticed that this patch needs a rebase because of the refactoring
> Tom did in ri_triggers.c

Hold on a bit before you work on that code --- I've got one more bit of
hacking I want to try before walking away from it.  I did some oprofile
work on Dean's example from

and noticed that it looks like ri_FetchConstraintInfo is eating a
noticeable fraction of the runtime, which is happening because it is
called to deconstruct the relevant pg_constraint row for each tuple
we consider firing the trigger for (and then again, if we do fire the
trigger).  I'm thinking it'd be worth maintaining a backend-local cache
for the deconstructed data, and am going to go try that.
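
(A toy sketch of the shape such a cache could take; a real version would
use a dynahash and hook into catalog cache invalidation, both omitted here,
and RI_ConstraintInfo below merely stands in for the struct that
ri_FetchConstraintInfo fills in.)

typedef unsigned int Oid;

typedef struct RI_ConstraintInfo
{
    Oid         constraint_id;  /* pg_constraint OID this entry describes */
    int         nkeys;          /* ... other deconstructed fields elided ... */
} RI_ConstraintInfo;

#define RI_CACHE_SIZE 64

static RI_ConstraintInfo ri_cache[RI_CACHE_SIZE];
static int  ri_cache_used = 0;

/* Return a cached entry, or NULL if the caller must deconstruct the row. */
static const RI_ConstraintInfo *
ri_lookup_cached(Oid constraint_id)
{
    int     i;

    for (i = 0; i < ri_cache_used; i++)
        if (ri_cache[i].constraint_id == constraint_id)
            return &ri_cache[i];
    return NULL;
}

/* Remember a freshly deconstructed entry for later lookups. */
static void
ri_cache_store(const RI_ConstraintInfo *info)
{
    if (ri_cache_used < RI_CACHE_SIZE)
        ri_cache[ri_cache_used++] = *info;
}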

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
Hi,

On Wednesday, June 20, 2012 05:44:09 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund  
wrote:
> > We're not the only ones here performing scope creep though... I
> > think just about everyone who has posted in the whole thread except maybe
> > Tom and Marko is guilty of doing so.
> > 
> > I still think it's rather sensible to focus on exactly duplicated schemas
> > in a very first version just because that leaves out some of the
> > complexity while paving the road for other nice things.
> 
> Well, I guess what I want to know is: what does focusing on exactly
> duplicated schemas mean?  If it means we'll disable DDL for tables
> when we turn on replication, that's basically the Slony approach: when
> you want to make a DDL change, you have to quiesce replication, do it,
> and then resume replication.  I would possibly be OK with that
> approach.  If it means that we'll hope that the schemas are duplicated
> and start spewing garbage data when they're not, then I'm not
> definitely not OK with that approach.  If it means using event
> triggers to keep the catalogs synchronized, then I don't think I don't
> think that's adequately robust.  The user could add more event
> triggers that run before or after the ones the replication system
> adds, and then you are back to garbage decoding (or crashes). 
I would prefer the event trigger way because that seems to be the most 
extensible/reusable. It would allow fully replicated databases and catalog-only
instances.
I think we need to design event triggers in a way you cannot simply circumvent
them. We already have the case that if users try to screw around with system
triggers we give back wrong answers with the planner relying on foreign keys
btw.
If the problem is having user triggers after system triggers: Let's make that
impossible. Forbidding DDL on the other instances once we have that isn't that 
hard.

Perhaps all that will get simpler if we can make reading the catalog via
custom-built snapshots work as you proposed elsewhere in this thread. That
would make checking errors much easier even if you just want to apply to
a database with exactly the same schema. That's the next thing I plan to work
on.

> They could also modify the catalogs directly, although it's possible we
> don't care quite as much about that case (but on the other hand people
> do sometimes need to do it to solve real problems).
With that you already can crash the database perfectly fine today. I think 
trying to care for that is a waste of time.

> Although I am
> 100% OK with paring back the initial feature set - indeed, I strongly
> recommend it - I think that robustness is not a feature which can be
> left out in v1 and added in later.  All the robustness has to be
> designed in at the start, or we will never have it.
I definitely don't intend to cut down on robustness.

> On the whole, I think we're spending far too much time talking about
> code and far too little time talking about what the overall design
> should look like.
Agreed.

> We are having a discussion about whether or not MMR
> should be supported by sticking a 16-bit node ID into every WAL record
> without having first decided whether we should support MMR, whether
> that requires node IDs, whether they should be integers, whether those
> integers should be 16 bits in size, whether they should be present in
> WAL, and whether or not the record header is the right place to put
> them.  There's a right order in which to resolve those questions, and
> this isn't it.  More generally, I think there is a ton of complexity
> that we're probably overlooking here in focusing in on specific coding
> details.  I think the most interesting comment made to date is Steve
> Singer's observation that very little of Slony is concerned with
> changeset extraction or apply.  Now, on the flip side, all of these
> patches seem to be concerned with changeset extraction and apply.
> That suggests that we're missing some pretty significant pieces
> somewhere in this design.  I think those pieces are things like error
> recovery, fault tolerance, user interface design, and control logic.
> Slony has spent years trying to get those things right.  Whether or
> not they actually have gotten them right is of course an arguable
> point, but we're unlikely to do better by ignoring all of those issues
> and implementing whatever is most technically expedient.
I agree that the focus isn't 100% optimal and that there are *loads* of issues 
we haven't even started to look at. But you need a point to start and
extraction & apply seems to be a good one because you can actually test it 
without the other issues solved which is not really the case the other way 
round.
Also it's possible to plug in the newly built changeset extraction into
existing solutions to make them more efficient while retaining most of their 
respective framework.

So I disagree that thats the wrong part to start with.

> >> You've got f

Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Alvaro Herrera

Excerpts from David Fetter's message of mié jun 20 12:43:31 -0400 2012:
> On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote:

> > In your original proposal, I was rather wondering what should happen
> > if the incoming file didn't have the same set of columns called out
> > in the COPY command's column list.  (That is, while *rearranging*
> > the columns might be thought non-astonishing, I'm less convinced
> > about a copy operation that inserts or defaults columns differently
> > from what the command said should happen.)  If we're just checking
> > for a match, that question goes away.
> 
> Let's imagine we have a table foo(id serial, t text, n numeric) and a
> CSV file foo.csv with headers (n, t).  Just to emphasize, the column
> ordering is different in the places where they match.

For the record, IIRC we had one person trying to do this on the Spanish
list not long ago.

Another related case: you have a file with headers and columns (n, t, x, y, z)
but your table only has n and t.  How would you tell COPY to discard the
junk columns?  Currently it just complains that they are there.
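
Today the closest workaround I can think of is to stage the whole file first
and pick the wanted columns afterwards; a rough sketch (the table and column
names here are just assumed for illustration):

    CREATE TEMP TABLE staging (n numeric, t text, x text, y text, z text);
    COPY staging FROM '/path/to/file.csv' WITH (FORMAT csv, HEADER);
    INSERT INTO target (n, t) SELECT n, t FROM staging;

That works, but it's hardly convenient when the file is wide.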

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread David Fetter
On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote:
> David Fetter  writes:
> > OK, new proposal:
> 
> > COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
> > an option which requires that HEADER be enabled and mandates that the
> > column names in the header match the columns coming in.
> 
> > Has someone got a better name for this option than
> > KEEP_HEADER_COLUMN_NAMES?
> 
> Well, if it's just checking that the list matches, then maybe
> CHECK_HEADER would do.

OK

> In your original proposal, I was rather wondering what should happen
> if the incoming file didn't have the same set of columns called out
> in the COPY command's column list.  (That is, while *rearranging*
> the columns might be thought non-astonishing, I'm less convinced
> about a copy operation that inserts or defaults columns differently
> from what the command said should happen.)  If we're just checking
> for a match, that question goes away.

Let's imagine we have a table foo(id serial, t text, n numeric) and a
CSV file foo.csv with headers (n, t).  Just to emphasize, the column
ordering is different in the places where they match.

Would

COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER, CHECK_HEADER)

now "just work" (possibly with some performance penalty) and 

COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER)

only "work," i.e. import gobbledygook, in the case where every t entry
in foo.csv happened to be castable to NUMERIC?

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-20 Thread Magnus Hagander
On Wed, Jun 20, 2012 at 12:35 PM, Florian Pflug  wrote:
> On Jun19, 2012, at 17:36 , Robert Haas wrote:
>> On Mon, Jun 18, 2012 at 1:42 PM, Martijn van Oosterhout
>>  wrote:
>>> On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote:
 The fly in the ointment with any of these ideas is that the "configure
 list" is not a list of exact cipher names, as per Magnus' comment that
 the current default includes tests like "!aNULL".  I am not sure that
 we know how to evaluate such conditions if we are applying an
 after-the-fact check on the selected cipher.  Does OpenSSL expose any
 API for evaluating whether a selected cipher meets such a test?
>>>
>>> I'm not sure whether there's an API for it, but you can certainly check
>>> manually with "openssl ciphers -v", for example:
>>>
>>> $ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP'
>>> NULL-SHA                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=SHA1
>>> NULL-MD5                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=MD5
>>>
>>> ...etc...
>>>
>>> So unless the openssl includes the code twice there must be a way to
>>> extract the list from the library.
>>
>> There doubtless is, but I'd being willing to wager that you won't be
>> able to figure out the exact method without reading the source code
>> for 'opennssl ciphers' to see how it was done there, and most likely
>> you'll find that at least one of the functions they use has no man
>> page.  Documentation isn't their strong point.
>
> Yes, unfortunately.
>
> I wonder though if shouldn't restrict the allowed ciphers list to being
> a simple list of supported ciphers. If our goal is to support multiple
> SSL libraries transparently then surely having openssl-specific syntax
> in the config file isn't exactly great anyway...

That is a very good point. Before we design *another* feature that
relies on it, we should verify that the syntax is compatible in the
other libraries that would be interesting (GnuTLS and NSS primarily),
and, if it's not, that at least the *functionality* exists in a
compatible way, so we don't put ourselves in a position where we can't
proceed.

And yes, that's vapourware so far. But I know at least Claes (added to
CC) has said he wants to work on it during this summer, and I've
promised to help him out and review as well, if/when he gets that far.
But even without that, we should try to keep the door to these other
library implementations as open as possible.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Tom Lane
Peter Geoghegan  writes:
> No, I'm suggesting it would probably be at least a bit of a win here
> to cache the constant, and only have to do a strxfrm() + strcmp() per
> comparison.

Um, have you got any hard evidence to support that notion?  The
traditional advice is that strcoll is faster than using strxfrm unless
the same strings are to be compared repeatedly.  I'm not convinced that
saving the strxfrm on just one side will swing the balance.

> The fact is that this is likely to be a fairly significant
> performance win, because strxfrm() is quite simply the way you're
> supposed to do collation-aware sorting, and is documented as such. For
> that reason, C standard library implementations should not be expected
> to emphasize its performance - they assume that you're using strxfrm()
> + their highly optimised strcmp()

Have you got any evidence in support of this claim, or is it just
wishful thinking about what's likely to be inside libc?  I'd also note
that any comparisons you may have seen about this are certainly not
accounting for the effects of data bloat from strxfrm (ie, possible
spill to disk, more merge passes, etc).

In any case, if you have to redefine the meaning of equality in order
to justify a performance patch, I'm prepared to walk away at the start.
The range of likely performance costs/benefits across different locales
and different implementations is so wide that if you can't show it to be
a win even with the strcmp tiebreaker, it's not likely to be a reliable
win without that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-20 Thread Alvaro Herrera
Excerpts from Tom Lane's message of mié jun 20 11:49:51 -0400 2012:
> 
> Alvaro Herrera  writes:
> > I looked at the code (apps/ciphers.c) and it looks pretty easy to obtain
> > the list of ciphers starting from the stringified configuration
> > parameter and iterate on them.
> 
> Do you mean that it will produce an expansion of the set of ciphers
> meeting criteria like "!aNULL"?

Attached is a simple program that does that.  You pass 'ALL:!aNULL' as
its first arg and it produces such a list.

> If so, I think we are set; we can
> easily check to see if the active cipher is in that list, no?

Great.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <openssl/ssl.h>
#include <openssl/err.h>

int main(int argc, char *argv[])
{
    const SSL_METHOD *method = TLSv1_client_method();
    SSL_CTX    *ctx;
    SSL        *ssl = NULL;
    char       *ciphers;
    int         i;

    if (argc < 2)
    {
        fprintf(stderr, "ciphers not specified\n");
        exit(1);
    }

    ciphers = argv[1];

    SSL_library_init();

    ctx = SSL_CTX_new(method);
    if (!ctx)
    {
        fprintf(stderr, "something went wrong\n");
        exit(1);
    }

    if (!SSL_CTX_set_cipher_list(ctx, ciphers))
    {
        fprintf(stderr, "unable to set cipher list\n");
        exit(1);
    }

    ssl = SSL_new(ctx);
    if (!ssl)
    {
        fprintf(stderr, "unable to create the SSL object\n");
        exit(1);
    }

    for (i = 0;; i++)
    {
        const char *cipher;

        cipher = SSL_get_cipher_list(ssl, i);
        if (cipher == NULL)
        {
            fprintf(stderr, "end of cipher list?\n");
            break;
        }
        printf("cipher: %s\n", cipher);
    }

    SSL_CTX_free(ctx);
    SSL_free(ssl);

    return 0;
}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] proposal and patch : support INSERT INTO...RETURNING with partitioned table using rule

2012-06-20 Thread John Lumby

---
Problem I'm trying to solve:
    For partitioned tables,  make it possible to use RETURNING clause on INSERT 
INTO
   together with DO INSTEAD rule

[  Note  -  wherever I say INSERT I also mean UPDATE and DELETE ]

---
Current behaviour :

    An INSERT which has a RETURNING clause and which is to be rewritten based on
    a rule will be accepted if the rule is an "unconditional DO INSTEAD".
    In general I believe "unconditional" means "no WHERE clause", but in 
practice
    if the rule is of the form 
   CREATE RULE insert_part_history as ON INSERT to history \
 DO INSTEAD SELECT history_insert_partitioned(NEW) returning NEW.id
    this is treated as conditional and the query is rejected.    

    Testcase:
    A table T is partitioned and has two or more columns,  one of which 
    is an id column declared as
 id bigint DEFAULT nextval('history_id_seq'::regclass) NOT NULL
    and the application issues
  "INSERT into history  (column-list which excludes id)  values () 
RETURNING id"

    I can get the redirection of the INSERT *without* RETURNING to work using
    either a trigger or a rule in which the trigger/rule invokes a procedure, but
    whichever way I do it, I could not get this RETURNING clause to work.

    For a trigger,
  the INSERT ... RETURNING was accepted but returned no rows (as I would
  expect), and for the RULE, the INSERT ... RETURNING was rejected with:
 
  ERROR:  cannot perform INSERT RETURNING on relation "history"
  HINT:  You need an unconditional ON INSERT DO INSTEAD rule with a 
RETURNING clause.

  but this hint was not much help, since:

   For a rule,
 CREATE RULE insert_part_history as ON INSERT to history \
 DO INSTEAD SELECT history_insert_partitioned(NEW) returning NEW.id
    ERROR:  syntax error at or near "returning"
    LINE 1: ...DO INSTEAD SELECT history_insert_partitioned(NEW) returning ...

    Here the function history_insert_partitioned is something like
    CREATE FUNCTION history_insert_partitioned( NEW public.history) RETURNS 
BIGINT AS $$
    DECLARE
   ...
    BEGIN
   ...
 <access NEW fields, e.g. timestamp>
 <construct partitioned table name>
  <EXECUTE 'INSERT INTO ' || partitioned table name ...>
   ...
    RETURN history_id;
    END;
    $$
    LANGUAGE plpgsql

---
Some references to discussion of this requirement :
  .  http://wiki.postgresql.org/wiki/Todo
 item "Make it possible to use RETURNING together with conditional DO 
INSTEAD rules,
   such as for partitioning setups"
  .  http://archives.postgresql.org/pgsql-general/2012-06/msg00377.php
  .  http://archives.postgresql.org/pgsql-general/2010-12/msg00542.php
  .  
http://acodapella.blogspot.it/2011/06/hibernate-postgresql-table-partitioning.html

---
Proposal:
  .  I propose to extend the rule system to recognize cases where the INSERT 
query specifies
 RETURNING and the rule promises to return a row,  and to then permit this 
query to run
 and return the expected row.   In effect,  to widen the definition of 
"unconditional"
 to handle cases such as my testcase.
  .  One comment is that all the infrastructure for returning one row from the 
re-written query
 is already present in the code,  and the non-trivial question is how to 
ensure the
 new design is safe in preventing any rewrite that actually would not 
return a row.
  .  In this patch,   I have chosen to make use of the LIMIT clause  -
 I add a side-effect implication to a LIMIT clause when it occurs in a 
rewrite of an INSERT
 to mean "this rule will return a row".
 So,  with my patch, same testcase,  same function 
history_insert_partitioned and new rule
    CREATE RULE insert_part_history as ON INSERT to history \
   DO INSTEAD SELECT history_insert_partitioned(NEW) LIMIT 1
 the INSERT is accepted and returns the id.
 This use of LIMIT clause is probably contentious but I wished to avoid 
introducing new
 SQL syntax,  and the LIMIT clause does have a connotation of returning 
rows.

---
I attach patch based on clone of postgresql.git as of yesterday (120619-145751 
EST)
I have tested the patch with INSERT and UPDATE  (not tested with DELETE but 
should work).
The patch is not expected to be final but just to show how I did it.

John
  --- src/backend/optimizer/plan/planner.c.orig	2012-06-19 14:59:21.264574275 -0400
+++ src/backend/optimizer/plan/planner.c	2012-06-19 15:10:54.776590758 -0400
@@ -226,7 +226,8 @@ standard_planner(Query *parse, int curso
 
 	result->commandType = parse->commandType;
 	result->queryId = parse->queryId;
-	result->hasReturning = (

Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Tom Lane
David Fetter  writes:
> OK, new proposal:

> COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
> an option which requires that HEADER be enabled and mandates that the
> column names in the header match the columns coming in.

> Has someone got a better name for this option than
> KEEP_HEADER_COLUMN_NAMES?

Well, if it's just checking that the list matches, then maybe
CHECK_HEADER would do.

In your original proposal, I was rather wondering what should happen if
the incoming file didn't have the same set of columns called out in the
COPY command's column list.  (That is, while *rearranging* the columns
might be thought non-astonishing, I'm less convinced about a copy
operation that inserts or defaults columns differently from what the
command said should happen.)  If we're just checking for a match, that
question goes away.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread David Fetter
On Wed, Jun 20, 2012 at 11:47:14AM -0400, Tom Lane wrote:
> David Fetter  writes:
> > Rather than being totally ignored in the COPY OUT (CSV HEADER)
> > case, the header line in should be parsed to establish which
> > columns are where and rearranging the output if needed.
> 
> This is not "fix a POLA violation".  This is a
> non-backwards-compatible new feature, which would have to have a
> switch to turn it on.

OK, new proposal:

COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
an option which requires that HEADER be enabled and mandates that the
column names in the header match the columns coming in.

Has someone got a better name for this option than
KEEP_HEADER_COLUMN_NAMES?

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-20 Thread Fujii Masao
On Wed, Jun 20, 2012 at 12:40 PM, Amit Kapila  wrote:
>>> I believe if WAL files are proper as mentioned in Alvaro's mail, the
> purposed logic should generate
>>> correct values.
>>> Do you see any problem in logic purposed in my original mail.
>>> Can I resume my work on this feature?
>
>> Maybe I'm missing your point, but... why don't you just use PITR to
>> recover from the corruption of pg_control?
>
> AFAIK PITR can be used in a scenario where there is a base back-up and we
> have archived
> the WAL files after that, now it can use WAL files to apply on base-backup.

Yes. If you want to be able to recover the database from a media crash such as
corruption of the pg_control file, you basically should take a base backup
and set up continuous archiving beforehand.
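
For example, once archive_command is configured, a low-level base backup can
be driven from SQL like this (the label and the copy step are only
illustrative):

    SELECT pg_start_backup('nightly', true);
    -- copy the cluster's data directory to backup storage at the OS level here
    SELECT pg_stop_backup();

After a crash you restore that copy and replay the archived WAL over it.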

> In this scenario we don't know a point from where to start the next replay.
> So I believe it will be difficult to use PITR in this scenario.

You can find out the point from the complete pg_control file which was
restored from the backup.

If pg_control is corrupted, we can easily imagine that other database files
would also be corrupted. I wonder how many cases there are where only the
pg_control file gets corrupted. In that case, pg_resetxlog is not helpful at
all. You need to use PITR instead.

Regards,

-- 
Fujii Masao

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-20 Thread Peter Geoghegan
On 20 June 2012 15:55, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> I think that this change may have made the difference between the
>> Hungarians getting away with it and not getting away with it. Might it
>> have been that for text, they were using some operator that wasn't '='
>> (perhaps one which has no fastpath, and thus correctly made a
>> representation about equivalency) rather than texteq prior to this
>> commit?
>
> Uh, no.  There aren't any magic variants of equality now, and there were
> not back then either.  I'm inclined to think that the "Hungarian
> problem" did exist long before we fixed it.

I was suggesting that an operator other than equality may have been
used for some reason. It probably isn't a significant issue though.

> Um ... are you proposing to replace text btree index entries with
> strxfrm values?  Because if you aren't, I don't see that this is likely
> to win anything.  And if you are, it loses on the grounds of (a) index
> bloat and (b) loss of ability to do index-only scans.

No, I'm suggesting it would probably be at least a bit of a win here
to cache the constant, and only have to do a strxfrm() + strcmp() per
comparison. Not enough to justify all the added complexity of course,
but I wouldn't seek to justify this on the basis of improvements to
the speed of btree traversal.  It would obviously be much more
valuable for tuple sorting.

> Personally I think this is not a direction we want to go in.  I think
> that it's going to end up a significant performance loss in many cases,
> break backwards compatibility in numerous ways, and provide a useful
> behavioral improvement to only a small minority of users.

I may have over-emphasized the improvement in correctness that this
would bring, which I personally consider to be very much a secondary
benefit. The fact is that this is likely to be a fairly significant
performance win, because strxfrm() is quite simply the way you're
supposed to do collation-aware sorting, and is documented as such. For
that reason, C standard library implementations should not be expected
to emphasize strcoll()'s performance - they assume that you're using
strxfrm() + their highly optimised strcmp() (as I've said, the glibc
strcmp() implementation is written in ASM, and makes use of hand-written
SSSE3 instructions on x86_64, for example).

The only performance downside that I can think of is the added check
for equivalence for each tuple within _bt_checkkeys() - perhaps you
were thinking of something else that hasn't occurred to me though.
Were you?

The added overhead is only going to be paid once per index scan,
and not once per tuple returned by an index scan. Since equality
implies equivalence for us, there is no need to check equivalence
until something is unequal, and when that happens, and equivalence
doesn't hold, we're done. Meanwhile, that entire index scan just got
measurably faster from caching the constant's strxfrm() blob.

Now, maybe it is a bit funky that this equivalence test has to happen
even though the vast majority of locales don't care. If the only
alternative is to bloat up the strxfrm() blobs with the original
string, I'd judge that the funkiness is well worth it.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs  wrote:
 But I think getting even
 single-master logical replication working well in a single release
 cycle is going to be a job and a half.
>>>
>>> OK, so your estimate is 1.5 people to do that. And if we have more
>>> people, should they sit around doing nothing?
>>
>> Oh, give me a break.  You're willfully missing my point.  And to quote
>> Fred Brooks, nine women can't make a baby in one month.
>
> No, I'm not. The question is not how quickly can N people achieve a
> single thing, but how long will it take a few skilled people working
> on carefully selected tasks that have few dependencies between them to
> achieve something.

The bottleneck is getting the design right, not writing the code.
Selecting tasks for people to work on without an agreement on the
design will not advance the process, unless we just accept whatever
code you choose to write based on whatever design you happen to pick.

> How exactly did you arrive at your conclusion? Why is yours right and
> mine wrong?

I estimated the amount of work that would be required to do this right
and compared it to other large projects that have been successfully
done in the past.  I think you are looking at something on the order
of magnitude of the Windows port, which took about four releases to
become stable, or the SE-Linux project, which still isn't
feature-complete.  Even if it's only a HS-sized project, that took two
releases, as did streaming replication.  SSI got committed within one
release cycle, but there were several years of design and advocacy
work before any code was written, so that, too, was really a
multi-year project.

I'll confine my comments on the second part of the question to the
observation that it is a bit early to know who is right and who is
wrong, but the question could just as easily be turned on its head.

> No, I have four people who had initial objections and who have not
> commented on the fact that the points made are regrettably incorrect.

I think Kevin addressed this point better than I can.  Asserting
something doesn't make it true, and you haven't offered any rational
argument against the points that have been made, probably because
there isn't one.  We *cannot* justify stealing 100% of the available
bit space for a feature that many people won't use and may not be
enough to address the real requirement anyway.

> Since at least 3 of the people making such comments did not attend the
> full briefing meeting in Ottawa, I am not particularly surprised.
> However, I do expect people that didn't come to the meeting to
> recognise that they are likely to be missing information and to listen
> closely, as I listen to them.

Participation in the community development process is not contingent
on having flown to Ottawa in May, or on having decided to spend that
evening at your briefing meeting.  Attributing to ignorance what is
adequately explained by honest disagreement is impolite.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Andres Freund
On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
> Simon Riggs  wrote:
> > This is not transaction metadata, it is WAL record metadata
> > required for multi-master replication, see later point.

> > We need to add information to every WAL record that is used as the
> > source for generating LCRs.
> If the origin ID of a transaction doesn't count as transaction
> metadata (i.e., data about the transaction), what does?  It may be a
> metadata element about which you have special concerns, but it is
> transaction metadata.  You don't plan on supporting individual WAL
> records within a transaction containing different values for origin
> ID, do you?  If not, why is it something to store in every WAL
> record rather than once per transaction?  That's not intended to be
> a rhetorical question.
It's definitely possible to store it per transaction (see the discussion around 
http://archives.postgresql.org/message-id/201206201605.43634.and...@2ndquadrant.com); 
it just makes the filtering via the originating node a considerably more complex 
thing. With our proposal you can do it without any complexity involved, at a low 
level. Storing it per transaction means you can only stream the data out to 
other nodes *after* fully reassembling the transaction. That's a pity, 
especially if we go for a design where the decoding happens in a proxy instance.

Other metadata will not be needed on such a low level.

I also have to admit that I am very hesitant to start developing some generic 
"transaction metadata" framework at the moment. That seems to be a good way to 
spend a lot of time discussing and disagreeing. IMO that's something for 
later.

> I think it's because you're still thinking
> of the WAL stream as *the medium* for logical replication data
> rather than *the source* of logical replication data.
I don't think that's true. See the above-referenced subthread for the reasons 
why I think the origin id is important.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-20 Thread Tom Lane
Alvaro Herrera  writes:
> I looked at the code (apps/ciphers.c) and it looks pretty easy to obtain
> the list of ciphers starting from the stringified configuration
> parameter and iterate on them.

Do you mean that it will produce an expansion of the set of ciphers
meeting criteria like "!aNULL"?  If so, I think we are set; we can
easily check to see if the active cipher is in that list, no?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Andrew Dunstan



On 06/20/2012 11:02 AM, David Fetter wrote:

Folks,

A co-worker filed a bug against file_fdw where the columns in a
FOREIGN TABLE were scrambled on SELECT.  It turned out that this comes
from the (yes, it's documented, but since it's documented in a place
not obviously linked to the bug, it's pretty useless) "feature" of
COPY CSV HEADER whereby the header line is totally ignored in COPY
OUT.

Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
the header line in should be parsed to establish which columns are
where and rearranging the output if needed.

I'm proposing to make the code change here:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436

and a suitable doc change that talks about reading the header only for
the purpose of matching column names to columns, and throwing away the
output as before.

What say?



First you are talking about COPY IN, not COPY OUT, surely.

This is not a bug, it is documented in exactly the place that all other 
COPY options are documented. The file_fdw page refers the reader to the 
COPY docs for details. Unless you want us to duplicate the entire COPY 
docs in the file_fdw page this seems entirely reasonable.
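
For reference, the kind of foreign-table definition at issue looks roughly
like this (names invented); the declared column order has to match the file's
column order, precisely because the CSV header line is ignored:

    CREATE EXTENSION file_fdw;
    CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
    CREATE FOREIGN TABLE foo_csv (n numeric, t text)
        SERVER csv_files
        OPTIONS (filename '/path/to/foo.csv', format 'csv', header 'true');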


The current behaviour was discussed at some length back when we 
implemented the HEADER feature, IIRC, and is quite intentional. I don't 
think we should alter the current behaviour, as plenty of people rely on 
it, some to my certain knowledge. I do see a reasonable case for adding 
a new behaviour which takes notice of the header line, although it's 
likely to have plenty of wrinkles.


Reordering columns like you suggest might well have a significant impact 
on COPY performance, BTW. Also note that I created the file_text_array 
FDW precisely for people who want to be able to cherry pick and reorder 
columns. See 



cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

2012-06-20 Thread Tom Lane
David Fetter  writes:
> Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
> the header line in should be parsed to establish which columns are
> where and rearranging the output if needed.

This is not "fix a POLA violation".  This is a non-backwards-compatible
new feature, which would have to have a switch to turn it on.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-20 Thread Florian Pflug
On Jun20, 2012, at 17:34 , Tom Lane wrote:
> Florian Pflug  writes:
>> I wonder though if shouldn't restrict the allowed ciphers list to being
>> a simple list of supported ciphers. If our goal is to support multiple
>> SSL libraries transparently then surely having openssl-specific syntax
>> in the config file isn't exactly great anyway...
> 
> No, we don't want to go there, because then we'd have to worry about
> keeping the default list in sync with what's supported by the particular
> version of the particular library we chance to be using.  That's about
> as far from transparent as you can get.  A notation like "DEFAULT"
> is really quite ideal for our purposes in that respect.

No argument with that, but does that mean we have to allow the full
syntax supported by OpenSSL (i.e., those +,-,! prefixes)? Maybe we could
map an empty list to DEFAULT and otherwise interpret it as a list of 
ciphers?

It'd make the whole NULL-cipher business easy, because once we know that
the cipher list specified doesn't contain !NULL (which removes NULL *permanently*),
we can simply append NULL to allow "all these ciphers plus NULL".

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Robert Haas
On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund  wrote:
> Were not the only ones here that are performing scope creep though... I think
> about all people who have posted in the whole thread except maybe Tom and
> Marko are guilty of doing so.
>
> I still think its rather sensible to focus on exactly duplicated schemas in a
> very first version just because that leaves out some of the complexity while
> paving the road for other nice things.

Well, I guess what I want to know is: what does focusing on exactly
duplicated schemas mean?  If it means we'll disable DDL for tables
when we turn on replication, that's basically the Slony approach: when
you want to make a DDL change, you have to quiesce replication, do it,
and then resume replication.  I would possibly be OK with that
approach.  If it means that we'll hope that the schemas are duplicated
and start spewing garbage data when they're not, then I'm definitely
not OK with that approach.  If it means using event triggers to keep
the catalogs synchronized, then I don't think that's adequately
robust.  The user could add more event triggers that run before or
after the ones the replication system adds, and then you are back to
garbage decoding (or crashes).  They could also modify the catalogs
directly, although it's possible we don't care quite as much about
that case (but on the other hand people do sometimes need to do it to
solve real problems).  Although I am 100% OK with paring back the
initial feature set - indeed, I strongly
recommend it - I think that robustness is not a feature which can be
left out in v1 and added in later.  All the robustness has to be
designed in at the start, or we will never have it.

On the whole, I think we're spending far too much time talking about
code and far too little time talking about what the overall design
should look like.  We are having a discussion about whether or not MMR
should be supported by sticking a 16-bit node ID into every WAL record
without having first decided whether we should support MMR, whether
that requires node IDs, whether they should be integers, whether those
integers should be 16 bits in size, whether they should be present in
WAL, and whether or not the record header is the right place to put
them.  There's a right order in which to resolve those questions, and
this isn't it.  More generally, I think there is a ton of complexity
that we're probably overlooking here in focusing in on specific coding
details.  I think the most interesting comment made to date is Steve
Singer's observation that very little of Slony is concerned with
changeset extraction or apply.  Now, on the flip side, all of these
patches seem to be concerned with changeset extraction and apply.
That suggests that we're missing some pretty significant pieces
somewhere in this design.  I think those pieces are things like error
recovery, fault tolerance, user interface design, and control logic.
Slony has spent years trying to get those things right.  Whether or
not they actually have gotten them right is of course an arguable
point, but we're unlikely to do better by ignoring all of those issues
and implementing whatever is most technically expedient.

>> You've got four people objecting to this patch now, all of whom happen
>> to be committers.  Whether or not MMR goes into core, who knows, but
>> it doesn't seem that this patch is going to fly.
> I find that a bit too early to say. Sure it won't fly exactly as proposed, but
> hell, who cares? What I want to get in is a solution to the specific problem
> the patch targets. At least you have, not sure about others, accepted that the
> problem needs a solution.
> We do not agree yet how that solution looks should like but thats not exactly
> surprising as we started discussing the problem only a good day ago.

Oh, no argument with any of that.  I strongly object to the idea of
shoving this patch through as-is, but I don't object to solving the
problem in some other, more appropriate way.  I think that won't look
much like this patch, though; it will be some new patch.

> If people agree that your proposed way of just one flag bit is the way to go
> we will have to live with that. But thats different from saying the whole
> thing is dead.

I think you've convinced me that a single flag-bit is not enough, but
I don't think you've convinced anyone that it belongs in the record
header.

>> As I would rather see this project
>> succeed, I recommend that you don't do that.  Both you and Andres seem
>> to believe that MMR is a reasonable first target to shoot at, but I
>> don't think anyone else - including the Slony developers who have
>> commented on this issue - endorses that position.
> I don't think we get full MMR into 9.3. What I am proposing is that we build
> in the few pieces that are required to implement MMR *ontop* of whats
> hopefully in 9.3.
> And I think thats a realistic goal.

I can't quite follow that sentence, but my general sense is that,
wh

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Kevin Grittner
Simon Riggs  wrote:
> Kevin Grittner  wrote:
>>> Heikki Linnakangas  wrote:
>>
>>> I don't like the idea of adding the origin id to the record
>>> header. It's only required in some occasions, and on some record
>>> types.
>>
>> Right.
> 
> Wrong, as explained.
 
The point is not wrong; you are simply not responding to what is
being said.
 
You have not explained why an origin ID is required when there is no
replication, or if there is master/slave logical replication, or
there are multiple masters with non-overlapping primary keys
replicating to a single table in a consolidated database, or each
master replicates to all other masters directly, or any of various
other scenarios raised on this thread.  You've only explained why
it's necessary for certain configurations of multi-master
replication where all rows in a table can be updated on any of the
masters.  I understand that this is the configuration you find most
interesting, at least for initial implementation.  That does not
mean that the other situations don't exist as use cases or should not
be considered in the overall design.
 
I don't think there is anyone here who would not love to see this
effort succeed, all the way to multi-master replication in the
configuration you are emphasizing.  What is happening is that people
are expressing concerns about parts of the design which they feel
are problematic, and brainstorming about possible alternatives.  As
I'm sure you know, fixing a design problem at this stage in
development is a lot less expensive than letting the problem slide
and trying to deal with it later.
 
> It isn't true that this is needed only for some configurations of
> multi-master, per discussion.
 
I didn't get that out of the discussion; I saw a lot of cases
mentioned as not needing it to which you simply did not respond.
 
> This is not transaction metadata, it is WAL record metadata
> required for multi-master replication, see later point.
> 
> We need to add information to every WAL record that is used as the
> source for generating LCRs.
 
If the origin ID of a transaction doesn't count as transaction
metadata (i.e., data about the transaction), what does?  It may be a
metadata element about which you have special concerns, but it is
transaction metadata.  You don't plan on supporting individual WAL
records within a transaction containing different values for origin
ID, do you?  If not, why is it something to store in every WAL
record rather than once per transaction?  That's not intended to be
a rhetorical question.  I think it's because you're still thinking
of the WAL stream as *the medium* for logical replication data
rather than *the source* of logical replication data.
 
As long as the WAL stream is the medium, options are very
constrained.  You can code a very fast engine to handle a single
type of configuration that way, and perhaps that should be a
supported feature, but it's not a configuration I've needed yet. 
(Well, on reflection, if it had been available and easy to use, I
can think of *one* time I *might* have used it for a pair of nodes.)
It seems to me that you are so focused on this one use case that you
are not considering how design choices which facilitate fast
development of that use case paint us into a corner in terms of
expanding to other use cases.
 
>> I think removing origin ID from this patch and submitting a
>> separate patch for a generalized transaction metadata system is
>> the sensible way to go.
> 
> We already have a very flexible WAL system for recording data of
> interest to various resource managers. If you wish to annotate a
> transaction, you can either generate a new kind of WAL record or
> you can enhance a commit record.
 
Right.  Like many of us are suggesting should be done for origin ID.
 
> XLOG_NOOP records can already be generated by your application if
> you wish to inject additional metadata to the WAL stream. So no
> changes are required for you to implement the generalised
> transaction metadata scheme you say you require.
 
I'm glad it's that easy.  Are there SQL functions for that yet?
 
> Not sure how or why that relates to requirements for multi-master.
 
That depends on whether you want to leave the door open to other
logical replication than the one use case on which you are currently
focused.  I even consider some of those other cases multi-master,
especially when multiple databases are replicating to a single table
on another server.  I'm not clear on your definition -- it seems to
be rather more narrow.  Maybe we need to define some terms somewhere
to facilitate discussion.  Is there a Wiki page where that would
make sense?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-20 Thread Tom Lane
Florian Pflug  writes:
> I wonder though if shouldn't restrict the allowed ciphers list to being
> a simple list of supported ciphers. If our goal is to support multiple
> SSL libraries transparently then surely having openssl-specific syntax
> in the config file isn't exactly great anyway...

No, we don't want to go there, because then we'd have to worry about
keeping the default list in sync with what's supported by the particular
version of the particular library we chance to be using.  That's about
as far from transparent as you can get.  A notation like "DEFAULT"
is really quite ideal for our purposes in that respect.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pg default's verbosity?

2012-06-20 Thread Peter Eisentraut
On tis, 2012-06-19 at 02:15 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > There might be something to the idea of demoting a few of the things
> > we've traditionally had as NOTICEs, though.  IME, the following two
> > messages account for a huge percentage of the chatter:
> 
> > NOTICE:  CREATE TABLE will create implicit sequence "foo_a_seq" for
> > serial column "foo.a"
> > NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
> > "foo_pkey" for table "foo"
> 
> Personally, I'd have no problem with flat-out dropping (not demoting)
> both of those two specific messages.  I seem to recall that Bruce has
> lobbied for them heavily in the past, though.

I don't like these messages any more than the next guy, but why drop
only those, and not any of the other NOTICE-level messages?  The meaning
of NOTICE is pretty much, if this is the first time you're using
PostgreSQL, let me tell you a little bit about how we're doing things
here.  If you've run your SQL script more than 3 times, you won't need
them anymore.  So set your client_min_messages to WARNING then.  That
should be pretty much standard for running SQL scripts, in addition to
all the other stuff listed here:
http://petereisentraut.blogspot.fi/2010/03/running-sql-scripts-with-psql.html
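
For example, a script could simply start with

    SET client_min_messages TO warning;

or you could set it once per user or database with ALTER ROLE ... SET or
ALTER DATABASE ... SET.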


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Release versioning inconsistency

2012-06-20 Thread Tom Lane
Peter Eisentraut  writes:
> On ons, 2012-06-20 at 13:26 +0200, Magnus Hagander wrote:
>> That might actually be a good idea. We can't really change the way we
>> named the betas, but it's not too late to consider naming the actual
>> release as 9.2.0...

> The final release was always going to be called 9.2.0, but naming the
> beta 9.2.0betaX is wrong.  There was a previous discussion about that
> particular point.

Yes.  There is no reason to change the naming scheme we have been using
for years now (at least since version_stamp.pl was invented for 7.4).
The only problem is that somebody got the name of the directory wrong on
the FTP server.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-20 Thread Tom Lane
Amit Kapila  writes:
>> I'm almost inclined to suggest that we not get next-LSN from WAL, but
>> by scanning all the pages in the main data store and computing the max
>> observed LSN.  This is clearly not very attractive from a performance
>> standpoint, but it would avoid the obvious failure mode where you lost
>> some recent WAL segments along with pg_control.

> According to my analysis, this will have some problem. 

I think you're missing the point.  There is no possible way to guarantee
database consistency after applying pg_resetxlog, unless the database
had been cleanly shut down beforehand.  The reset will lose the xlog
information that was needed to restore consistency.  So arguing from
examples that demonstrate this is rather pointless.  Rather, the value
of pg_resetxlog is to be able to start the database at all so that info
can be extracted from it.  What we are looking for is not perfection,
because that's impossible, but just to not make a bad situation worse.
The reason I'm concerned about selecting a next-LSN that's certainly
beyond every LSN in the database is that not doing so could result in
introducing further corruption, which would be entirely avoidable with
more care in choosing the next-LSN.

> Pg_resetxlog -
> It will generate the next-LSN point as 109 which when used for recovery will 
> generate inconsistent database.
> However if we would have relied on WAL, it would have got next-LSN as 107.

Umm ... the entire point of pg_resetxlog is to throw away WAL.  Not to
rely on it.

It's conceivable that there would be some use in a tool that searches
the available WAL files for the latest checkpoint record and recreates a
pg_control file pointing at that checkpoint, without zapping the WAL
files.  This would be much different in purpose and usage from
pg_resetxlog, though.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

