Re: [HACKERS] [PATCH] binary heap implementation

2012-11-14 Thread Andres Freund
On 2012-11-14 18:41:12 +0530, Abhijit Menon-Sen wrote:
 There are two or three places in the Postgres source that implement heap
 sort (e.g. tuplesort.c, nodeMergeAppend.c), and it's also needed by the
 BDR code. It seemed reasonable to factor out the functionality.

This replaces the simpleheap.[ch] I had in the last submitted series.

 I've attached a patch (binaryheap.diff) that contains a straightforward
 implementation of a binary heap (originally written by Andres, with a
 few bugfixes and cleanups by me).

I want to emphasise that the only good thing about my submitted version
was that it actually compiled. So quite a bit of the work here is from
Abhijit...

Greetings,

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] S_ISLNK

2012-11-14 Thread Nicholas White
Hi - I'm cross-compiling the master branch (cygwin/mingw) and have found a
reference to S_ISLNK that isn't guarded by #ifndef WIN32 like the ones in
basebackup.c are. Could you merge the attached fix? Thanks -

Nick


exec-symlink-ifdef.patch
Description: Binary data



Re: [HACKERS] S_ISLNK

2012-11-14 Thread Tom Lane
Nicholas White n.j.wh...@gmail.com writes:
 Hi - I'm cross-compiling the master branch (cygwin/mingw) and have found a
 reference to S_ISLNK that isn't guarded by #ifndef WIN32 like the ones in
 basebackup.c are.

That whole function is guarded by HAVE_READLINK, so I'm not seeing the
problem (and neither are the Windows members of the buildfarm).  What
environment are you in that has readlink() and not S_ISLNK?
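
(For illustration, the guard pattern being described: the symlink-handling
branch is compiled only when configure has defined HAVE_READLINK, so S_ISLNK
is never referenced on platforms that lack it.  This is a sketch, not the
actual code in basebackup.c or exec.c:)

#include <sys/stat.h>
#include <unistd.h>

static int
path_is_symlink(const char *path)
{
#ifdef HAVE_READLINK
    struct stat st;

    if (lstat(path, &st) == 0 && S_ISLNK(st.st_mode))
        return 1;
    return 0;
#else
    /* no readlink(): symlinks aren't followed on this platform */
    return 0;
#endif
}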

regards, tom lane




Re: [HACKERS] Why does delete from table where not exists (select 1 from ... LIMIT 1) perform badly?

2012-11-14 Thread Tom Lane
Palle Girgensohn gir...@pingpong.net writes:
 How come the planner treats the
 delete from table where not exists(select 1 from table2 where ... LIMIT 1)

 so differently, and usually badly, when the LIMIT 1 is there.

Because it can't optimize it into an antijoin.

 In older
 version of postgresql, I remember that the effect was the opposite, a
 limit 1 would actually perform substantially better. Hence we have old
 code (and old habits), where the LIMIT 1 is still used.

Well, you're basically forcing it into the same type of plan you would
have gotten before antijoins were implemented (circa 8.4), so I don't
see that this is a regression.  But I'd get rid of the LIMIT 1 if I were
you.  It's been a *very* long time since that was a net benefit in an
EXISTS subquery, if indeed it ever was --- AFAIR, even the earliest PG
versions that understood about optimizing for fast-start plans would do
so in an EXISTS subquery, with or without any LIMIT.
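
(To make the contrast concrete, with hypothetical tables t1/t2 joined on id;
the table and column names are illustrative only:

    DELETE FROM t1 WHERE NOT EXISTS
        (SELECT 1 FROM t2 WHERE t2.id = t1.id);           -- planner can use an anti join

    DELETE FROM t1 WHERE NOT EXISTS
        (SELECT 1 FROM t2 WHERE t2.id = t1.id LIMIT 1);   -- the LIMIT blocks that rewrite
)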

regards, tom lane




Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-14 Thread Bruce Momjian
On Wed, Nov 14, 2012 at 06:11:27AM +0200, Ants Aasma wrote:
 On Wed, Nov 14, 2012 at 2:03 AM, Bruce Momjian br...@momjian.us wrote:
  At 64k I see pg_upgrade taking 12% of the duration time, if I subtract
  out the dump/restore times.
 
 My percentage numbers only included CPU time and I used SSD storage.
 For the most part there was no IO wait to speak of, but it's
 completely expected that thousands of link calls are not free.

Agreed.  I was looking at wall clock time so I could see the total
impact of everything pg_upgrade does.

  Postgres time itself breaks down with 10% for shutdown checkpoint and
  90% for regular running, consisting of 16% parsing, 13% analyze, 20%
  plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network.
 
  That SVG graph was quite impressive.
 
 I used perf and Gprof2Dot for this. I will probably do a blog post on
 how to generate these graphs. It's much more useful for me than a
 plain flat profile as I don't know by heart which functions are called
 by which.

Yes, please share that information.

  It looks to me that most benefit could be had from introducing more
  parallelism. Are there any large roadblocks to pipelining the dump and
  restore to have them happen in parallel?
 
  I talked to Andrew Dunstan about parallelization in pg_restore.  First,
  we currently use pg_dumpall, which isn't in the custom format required
  for parallel restore, but if we changed to custom format, create table
  isn't done in parallel, only create index/check constraints, and trigger
  creation, etc.  Not sure if it's worth pursuing this just for pg_upgrade.
 
 I agree that parallel restore for schemas is a hard problem. But I
 didn't mean parallelism within the restore, I meant that we could
 start both postmasters and pipe the output from dump directly to
 restore. This way the times for dumping and restoring can overlap.

Wow, that is a very creative idea.  The current code doesn't do that,
but this has the potential of doubling pg_upgrade's speed, without
adding a lot of complexity.  Here are the challenges of this approach:

*  I would need to log the output of pg_dumpall as it is passed to psql
so users can debug problems

*  pg_upgrade never runs the old and new clusters at the same time for
fear that it will run out of resources, e.g. shared memory, or if they
are using the same port number.  We can make this optional and force
different port numbers.

Let me work up a prototype in the next few days and see how it performs.
Thanks for the great idea.
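
(In shell terms the overlap idea amounts to roughly the following; the ports,
database name, and options are illustrative only:

    pg_dumpall --schema-only --binary-upgrade -p 50432 | psql -p 50433 postgres

i.e. the restore starts consuming the dump as soon as the first objects are
emitted, instead of waiting for a complete dump file to be written first.)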

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Fujii Masao
On Wed, Nov 14, 2012 at 8:30 PM, Etsuro Fujita
fujita.ets...@lab.ntt.co.jp wrote:
 I wrote:
  From: Tom Lane [mailto:t...@sss.pgh.pa.us]

  I wrote:
   Etsuro Fujita fujita.ets...@lab.ntt.co.jp writes:
   I have a question.  I think it would be also better to extend the syntax
   for the SQL COPY command in the same way, ie,
   COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with
   format 'csv'
 
   Yeah, sure --- that case is already superuser-only, so why not give it
   the option of being a popen instead of just fopen.
 
  BTW, one thought that comes to mind is that such an operation is
  extremely likely to fail under environments such as SELinux.  That's
  not necessarily a reason not to do it, but we should be wary of
  promising that it will work everywhere.  Probably a documentation note
  about this would be enough.

 OK  I'll revise the patch.

 I've revised the patch.  In this version a user can specify hooks for pre- and
 post-processor executables for COPY and \copy in the following way:

 $ echo '/bin/gunzip -c $1' > decompress.sh
 $ chmod +x decompress.sh

 In the case of the COPY command,

 postgres=# COPY foo FROM '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz'
 WITH (format 'csv');

 Also, in the case of the \copy instruction,

 postgres=# \copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz'
 with (format 'csv')

 As shown in the example above, I've assumed that the syntax for this option
 for, e.g., the COPY command is:

 COPY table_name FROM 'progname filename' WITH ...
 COPY table_name TO 'progname filename' WITH ...

 Here, progname for COPY IN is the user-supplied program that takes filename as
 its argument and that writes on standard output.

What about further extending the COPY IN syntax to the following?

COPY table_name FROM 'progname [ option, ... ]' WITH ...

I'd just like to execute

COPY vmstat_table FROM 'vmstat' WITH ...

 Also, progname for COPY OUT is the user-supplied program that reads from
 standard input and writes to filename, taken as its argument.  This simplifies
 the identification and verification of progname and filename.

 Todo:
 * Documentation including documentation note about the limitation for
 environments such as SELinux mentioned by Tom.
 * More tests

 Any comments and suggestions are welcomed.

Isn't it dangerous to allow a user to execute an external program on the
server side via SQL?

Regards,

-- 
Fujii Masao




Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL

2012-11-14 Thread Amit kapila
On Tuesday, November 13, 2012 9:29 AM Amit kapila wrote:
On Monday, November 12, 2012 12:07 PM Greg Smith wrote:
On 11/9/12 11:59 PM, Amit kapila wrote:

 1) Change to add scanning a .conf directory in the default configuration
 using include-dir.  This is a quick fix.  I predict most of the
 headaches around it will end up being for packagers rather than the core
 code to deal with.

 You could submit this as a small thing to be evaluated on its own.  How
 it's done is going to be controversial.  Might as well get that fighting
 focused against a sample implementation as soon as possible.

 As per my understanding:
 a. during initdb, a new conf directory can be created, and a .auto file created
 in it.
 b. use include_dir at the end of postgresql.conf to include the directory
 created in the above step.
 c. server start/SIGHUP will take care of the above include_dir.

A WIP patch to address the above point is attached.
It needs cleanup with respect to moving the absolute-path function to a common
place where both initdb and the server code can use it.


 2) Get familiar with navigating the GUC data and figuring out what,
 exactly, needs to be written out.  This should include something that
 navigates like things appear after a RESET, ignoring per-user or
 per-session changes when figuring out what goes there.  It seems
 inevitable that some amount of validating against the source
 information--what pg_settings labels source, sourcefile, and sourceline
 will be needed.  An example is the suggestion Magnus made for confirming
 that the include-dir is still active before writing something there.

 Okay, what I will do for this is explain in the next mail how to do it by
 navigating the GUCs.

One basic idea for how the SQL command would operate on the GUCs is described below:

1. Take a global-level lock so that no other session can run the ALTER
SYSTEM/built-in command to change config parameters.
2. Navigate through the GUC variables and remember the reset_val of every GUC
in the .auto file.
3. Compare the config parameter to be changed with the parameters stored in
step 2; if it exists, replace its value, else add the new variable-value pair.
4. Flush the file with all parameters computed in step 3.
5. Signal all other backends to update this value in their GUC's reset_val, so
that all backends always have a recent copy.
6. When all backends have updated, change the corresponding reset_val in the
current session as well.
7. Release the global-level lock.
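
(As a purely hypothetical example of what steps 3 and 4 would produce, a
command along the lines being discussed, say

    ALTER SYSTEM SET checkpoint_timeout = '15min';

would rewrite the .auto file so that it contains the line
checkpoint_timeout = '15min', either replacing an existing entry for that
parameter or appending a new one.)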

Some problems with the above approach:
a. When the signal is sent in step 5, if another backend is also waiting on the
global lock, signal handling can get a little tricky; special handling may be
needed for this situation.
b. If a rollback happens after step 5 or 6, it might be difficult to roll back.
In general, the same issue can arise if this command executes in a transaction
block.
c. Updating reset_val for parameters which cannot be reset without a restart is
wrong. For such parameters, I think we can give a warning to users.

I think of this as an initial idea, to check whether I am thinking along the
lines you have in mind.

Comments/Suggestions?

With Regards,
Amit Kapila.

change_pgconf_for_ config_dir.patch
Description: change_pgconf_for_ config_dir.patch



Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-14 Thread Fujii Masao
On Wed, Nov 14, 2012 at 5:53 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 13, 2012 at 11:46 AM, Fujii Masao masao.fu...@gmail.com wrote:
 Without this utility, it's difficult to calculate the maximum LSN of
 data page, so
 basically we needed to take a backup when starting the standby. In the 
 future,
 thanks to this utility, we can calculate the maximum LSN, and can skip a 
 backup
 if that LSN is less than the master (i.e., last applied LSN, IOW,
 timeline switch LSN).

 Doesn't the minimum recovery point give us that?

Yes, but only in the standby. The master doesn't record the minimum recovery
point at all. So, when we start the old master as a new standby after failover,
we need this utility to know that LSN. Or we need to change the master so that
it records the minimum recovery point like the standby does.

BTW, it might be useful to introduce a new replication option that makes the data
page flush wait for its corresponding WAL to be replicated. By using this option,
we can ensure that any data page in the master always precedes the standby.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-14 Thread Andrew Dunstan


On 11/14/2012 10:08 AM, Bruce Momjian wrote:

On Wed, Nov 14, 2012 at 06:11:27AM +0200, Ants Aasma wrote:


I agree that parallel restore for schemas is a hard problem. But I
didn't mean parallelism within the restore, I meant that we could
start both postmasters and pipe the output from dump directly to
restore. This way the times for dumping and restoring can overlap.

Wow, that is a very creative idea.  The current code doesn't do that,
but this has the potential of doubling pg_upgrade's speed, without
adding a lot of complexity.  Here are the challenges of this approach:

*  I would need to log the output of pg_dumpall as it is passed to psql
so users can debug problems



Instead of piping it directly, have pg_upgrade work as a tee, pumping 
bytes both to psql and a file. This doesn't seem terribly hard.
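
(A rough sketch of the tee arrangement, with made-up port numbers and log
file name:

    pg_dumpall --schema-only --binary-upgrade -p 50432 \
        | tee pg_upgrade_dump.sql \
        | psql -p 50433 postgres

so the full script is preserved for debugging while still streaming into the
new cluster.)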




*  pg_upgrade never runs the old and new clusters at the same time for
fear that it will run out of resources, e.g. shared memory, or if they
are using the same port number.  We can make this optional and force
different port numbers.



Right.

cheers

andrew





Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Simon Riggs
On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote:

  Here, progname for COPY IN is the user-supplied program that takes filename as
  its argument and that writes on standard output.

 What about further extending the COPY IN syntax to the following?

 COPY table_name FROM 'progname [ option, ... ]' WITH ...

 I'd just like to execute

 COPY vmstat_table FROM 'vmstat' WITH ...

I think we should be using FDWs/SRFs here, not inventing new
syntax/architectures for executing external code, so -1 from me.

We can already do
INSERT INTO table SELECT * FROM fdw;
with any logic for generating data living inside an FDW or SRF.

If we want it in COPY we can have syntax like this...
COPY table FROM (SELECT * FROM fdw)
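
(For example, the FDW route is already possible today with file_fdw; the names
here are illustrative:

    CREATE EXTENSION file_fdw;
    CREATE SERVER files FOREIGN DATA WRAPPER file_fdw;
    CREATE FOREIGN TABLE foo_src (a int, b text)
        SERVER files OPTIONS (filename '/home/pgsql/foo.csv', format 'csv');
    INSERT INTO foo SELECT * FROM foo_src;

though, as noted downthread, that doesn't yet cover running an arbitrary
external program without writing C.)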

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-14 Thread Bruce Momjian
On Wed, Nov 14, 2012 at 10:25:24AM -0500, Andrew Dunstan wrote:
 
 On 11/14/2012 10:08 AM, Bruce Momjian wrote:
 On Wed, Nov 14, 2012 at 06:11:27AM +0200, Ants Aasma wrote:
 
 I agree that parallel restore for schemas is a hard problem. But I
 didn't mean parallelism within the restore, I meant that we could
 start both postmasters and pipe the output from dump directly to
 restore. This way the times for dumping and restoring can overlap.
 Wow, that is a very creative idea.  The current code doesn't do that,
 but this has the potential of doubling pg_upgrade's speed, without
 adding a lot of complexity.  Here are the challenges of this approach:
 
 *  I would need to log the output of pg_dumpall as it is passed to psql
 so users can debug problems
 
 
 Instead of piping it directly, have pg_upgrade work as a tee,
 pumping bytes both to psql and a file. This doesn't seem terribly
 hard.

Right.  It isn't hard.

 *  pg_upgrade never runs the old and new clusters at the same time for
 fear that it will run out of resources, e.g. shared memory, or if they
 are using the same port number.  We can make this optional and force
 different port numbers.
 
 
 Right.

OK.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-14 Thread Fujii Masao
On Wed, Nov 14, 2012 at 5:35 PM, Amit Kapila amit.kap...@huawei.com wrote:
 On Tuesday, November 13, 2012 10:17 PM Fujii Masao wrote:
 On Tue, Nov 13, 2012 at 1:23 PM, Amit kapila amit.kap...@huawei.com
 wrote:
  On Monday, November 12, 2012 9:56 PM Alvaro Herrera wrote:
  Robert Haas escribió:
  On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com
 wrote:
 
    I think I can see all of those things being potentially useful.  There
    are a couple of pending patches that will revise the WAL format

   I wonder if we shouldn't make this a separate utility, rather than
   something that is part of pg_resetxlog.  Anyone have a thought on that
   topic?
 
  That thought did cross my mind too.
 
 We might be able to use this utility to decide whether we need to take
 a fresh backup from the master onto the standby, to start old master
 as new standby after failover.

 When starting new standby after failover, any data page in the standby must
 not precede the master. Otherwise, the standby cannot catch up with the master
 consistently. But, the master might write the data page corresponding to
 the WAL which has not been replicated to the standby yet. So, if failover
 happens before that WAL has been replicated, the data page in old master would
 precede new master (i.e., old standby), and in this case the backup is
 required. OTOH, if maximum LSN in data page in the standby is less than the
 master, the backup is not required.

 When new standby will start the replication (RequestXLogStreaming()), it will
 send the startpoint, so won't in the above scenario that startpoint be ahead
 of new master (or new master won't have that LSN), and replication will not be
 established?

The startpoint is the heading LSN of the WAL file including the latest
checkpoint record. Yes, there can be the case where the startpoint is
ahead of new master. In this case, replication would fail to be established
because of lack of requested WAL file. OTOH, there can be the case
where new master has already been ahead of the startpoint.

 So now user may not be able to decide whether he needs to do incremental or
 full backup from new master,
 is this the case you are trying to point?

Sorry, I could not parse this comment. Could you elaborate your concern again?

Regards,

-- 
Fujii Masao




Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Fujii Masao
On Thu, Nov 15, 2012 at 12:31 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote:

 Here, progname for COPY IN is the user-supplied program that takes filename as
 its argument and that writes on standard output.

 What about further extending the COPY IN syntax to the following?

 COPY table_name FROM 'progname [ option, ... ]' WITH ...

 I'd just like to execute

 COPY vmstat_table FROM 'vmstat' WITH ...

 I think we should be using FDWs/SRFs here, not inventing new
 syntax/architectures for executing external code, so -1 from me.

 We can already do
 INSERT table SELECT * FROM fdw;
 with any logic for generating data lives inside an FDW or SRF.

 If we want it in COPY we can have syntax like this...
 COPY table FROM (SELECT * FROM fdw)

New syntax looks attractive to me because it's easy to use that.
It's not easy to implement the FDW for the external program which
a user wants to execute.

Of course if someone implements something like any_external_program_fdw,
I would change my mind..

Regards,

-- 
Fujii Masao




Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Alvaro Herrera
Simon Riggs escribió:
 On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote:
 
  Here, progname for COPY IN is the user-supplied program that takes filename as
  its argument and that writes on standard output.
 
  What about further extending the COPY IN syntax to the following?
 
  COPY table_name FROM 'progname [ option, ... ]' WITH ...
 
  I'd just like to execute
 
  COPY vmstat_table FROM 'vmstat' WITH ...
 
 I think we should be using FDWs/SRFs here, not inventing new
 syntax/architectures for executing external code, so -1 from me.

Hmm, but then you are forced to write C code, whereas the external
program proposal could have you writing only a shell script instead.
So there is some merit to this idea ... though we could have a
pipe_fdw that could let you specify an arbitrary program to run.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-14 Thread Amit Kapila
On Wednesday, November 14, 2012 9:12 PM Fujii Masao wrote:
 On Wed, Nov 14, 2012 at 5:35 PM, Amit Kapila amit.kap...@huawei.com wrote:
  On Tuesday, November 13, 2012 10:17 PM Fujii Masao wrote:
  On Tue, Nov 13, 2012 at 1:23 PM, Amit kapila amit.kap...@huawei.com wrote:
   On Monday, November 12, 2012 9:56 PM Alvaro Herrera wrote:
   Robert Haas escribió:
   On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com wrote:

    I think I can see all of those things being potentially useful.  There
    are a couple of pending patches that will revise the WAL format

   I wonder if we shouldn't make this a separate utility, rather than
   something that is part of pg_resetxlog.  Anyone have a thought on that
   topic?

   That thought did cross my mind too.

  We might be able to use this utility to decide whether we need to take
  a fresh backup from the master onto the standby, to start old master
  as new standby after failover.

  When starting new standby after failover, any data page in the standby must
  not precede the master. Otherwise, the standby cannot catch up with the
  master consistently. But, the master might write the data page corresponding
  to the WAL which has not been replicated to the standby yet. So, if failover
  happens before that WAL has been replicated, the data page in old master
  would precede new master (i.e., old standby), and in this case the backup is
  required. OTOH, if maximum LSN in data page in the standby is less than the
  master, the backup is not required.

  When new standby will start the replication (RequestXLogStreaming()), it
  will send the startpoint, so won't in the above scenario that startpoint be
  ahead of new master (or new master won't have that LSN), and replication
  will not be established?

 The startpoint is the heading LSN of the WAL file including the latest
 checkpoint record. Yes, there can be the case where the startpoint is
 ahead of new master. In this case, replication would fail to be established
 because of lack of requested WAL file.

Now the user can use this utility to decide whether new-standby's max LSN is
greater than new-master's; if so, he needs to take a full backup on new-standby.
Is my understanding right?

OTOH, there can be the case where new master has already been ahead of the
startpoint.


 But in this case, there is no need for this utility. Right?
 
  So now user may not be able to decide whether he needs to do
 incremental or
  full backup from new master,
  is this the case you are trying to point?
 
 Sorry, I could not parse this comment. Could you elaborate your concern
 again?

I wanted to understand the use case you mentioned for this utility.
As far as I can understand, it will be used to decide whether, on
new-standby (old-master), a full backup is needed from new-master (old-standby).
And that situation can occur when new-standby has a startpoint LSN greater
than new-master's?

With Regards,
Amit Kapila.





Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Simon Riggs escribió:
 On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote:
 Here, progname for COPY IN is the user-supplied program that takes filename as
 its argument and that writes on standard output.

 I think we should be using FDWs/SRFs here, not inventing new
 syntax/architectures for executing external code, so -1 from me.

 Hmm, but then you are forced to write C code, whereas the external
 program proposal could have you writing a only shell script instead.

I disagree with Simon's objection also, because neither reading from
nor writing to an external program is likely to fit the model of
reading/updating a table very well.  For instance, there's no good
reason to suppose that reading twice will give the same results.  So
force-fitting this usage into the FDW model is not going to work well.

Nor do I really see the argument why a pipe_fdw module is cleaner
than a COPY TO/FROM pipe feature.

regards, tom lane




Re: [HACKERS] foreign key locks

2012-11-14 Thread Alvaro Herrera
Alvaro Herrera wrote:

  * In heap_lock_tuple's  XMAX_IS_MULTI case
  
  [snip]
  
  why is it membermode  mode and not membermode = mode?
 
 Uh, that's a bug.  Fixed.  As noticed in the comment above that snippet,
 there was a deadlock possible here.  Maybe I should add a test to ensure
 this doesn't happen.

Done:
https://github.com/alvherre/postgres/commit/df2847e38198e99f57e52490e1e9391ebb70d770

(I don't think this is worth a v24 submission).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] S_ISLNK

2012-11-14 Thread Nicholas White
Ah - OK. It turns out I'd run ./configure in the postgresql directory
before running it in my build directory, so I had two (different)
pg_config.hs! The below builds for me from a clean source tree now.
Thanks -

Nick

./path/to/build/configure CC=x86_64-w64-mingw32-gcc
--target=x86_64-w64-mingw32 --build=x86_64-w64-mingw32
--host=x86_64-w64-mingw32 --without-readline --without-zlib
--disable-thread-safety

On 14 November 2012 06:25, Tom Lane t...@sss.pgh.pa.us wrote:
 Nicholas White n.j.wh...@gmail.com writes:
 Hi - I'm cross-compiling the master branch (cygwin/mingw) and have found a
 reference to S_ISLNK that isn't guarded by #ifndef WIN32 like the ones in
 basebackup.c are.

 That whole function is guarded by HAVE_READLINK, so I'm not seeing the
 problem (and neither are the Windows members of the buildfarm).  What
 environment are you in that has readlink() and not S_ISLNK?

 regards, tom lane




Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Andrew Dunstan


On 11/14/2012 11:20 AM, Tom Lane wrote:

I disagree with Simon's objection also, because neither reading from
nor writing to an external program is likely to fit the model of
reading/updating a table very well.  For instance, there's no good
reason to suppose that reading twice will give the same results.  So
force-fitting this usage into the FDW model is not going to work well.

Nor do I really see the argument why a pipe_fdw module is cleaner
than a COPY TO/FROM pipe feature.





Yeah, I agree, although the syntax looks a bit unclean.

Maybe something like

COPY foo FROM wherever WITH (FILTER '/path/to/program')

might work better. You'd hook up the source to the filter as its stdin 
and read its stdout. Not sure what we'd do for \copy though.



cheers

andrew





Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Simon Riggs
On 14 November 2012 16:20, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Simon Riggs escribió:
 On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote:
  Here, progname for COPY IN is the user-supplied program that takes filename as
  its argument and that writes on standard output.

 I think we should be using FDWs/SRFs here, not inventing new
 syntax/architectures for executing external code, so -1 from me.

 Hmm, but then you are forced to write C code, whereas the external
 program proposal could have you writing a only shell script instead.

 I disagree with Simon's objection also, because neither reading from
 nor writing to an external program is likely to fit the model of
 reading/updating a table very well.  For instance, there's no good
 reason to suppose that reading twice will give the same results.  So
 force-fitting this usage into the FDW model is not going to work well.

 Nor do I really see the argument why a pipe_fdw module is cleaner
 than a COPY TO/FROM pipe feature.

Perhaps not cleaner, but we do need

COPY table FROM (SELECT * FROM foo)

So we will then have both ways.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Simon Riggs
On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote:

 What about further extending the COPY IN syntax to the following?

 COPY table_name FROM 'progname [ option, ... ]' WITH ...

 I'd just like to execute

 COPY vmstat_table FROM 'vmstat' WITH ...

If we go ahead with this, I think it needs an additional keyword to
indicate that we will execute the file rather than read from it. I
don't think we should rely on whether the file is executable or not to
determine how we should treat it.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 Yeah, I agree, although the syntax looks a bit unclean.

Oh, I had not looked at the syntax closely.  I agree, that basically
sucks: it's overcomplicated and under-featured, because you can't
control the actual program command line very conveniently.  Nor do I see
a reason to force this into the model of program filtering a specific
file.  What happened to the previous proposal of treating the COPY
target as a pipe specification, ie

COPY table FROM 'some command line |';
COPY table TO '| some command line';

 Not sure what we'd do for \copy though.

Adding a pipe symbol to the target works exactly the same for \copy.
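
(With that spelling, the gunzip example from upthread would read, for both the
server-side and psql cases:

    COPY foo FROM '/bin/gunzip -c /home/pgsql/foo.csv.gz |' WITH (format 'csv');
    \copy foo from '/bin/gunzip -c /home/pgsql/foo.csv.gz |' with (format 'csv')

assuming the pipe-suffix syntax were the one adopted.)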

regards, tom lane




Re: [HACKERS] Enabling Checksums

2012-11-14 Thread Robert Haas
On Tue, Nov 13, 2012 at 4:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 What happens when you get an I/O failure on the checksum fork?  Assuming
 you're using 8K pages there, that would mean you can no longer verify
 the integrity of between one and four thousand pages of data.

True... but you'll have succeeded in your central aim of determining
whether your hardware has crapped out.  Answer: yes.

The existing code doesn't have any problem reporting back to the user
those hardware failures which are reported to it by the OS.  The only
reason for the feature is for the database to be able to detect
hardware failures in situations where the OS claims that everything is
working just fine.

 Not to mention the race condition problems associated with trying to be
 sure the checksum updates hit the disk at the same time as the data-page
 updates.

 I think you really have to store the checksums *with* the data they're
 supposedly protecting.

If torn pages didn't exist, I'd agree with you, but they do.  Any
checksum feature is going to need to cope with the fact that, prior to
reaching consistency, there will be blocks on disk with checksums that
don't match, because 8kB writes are not atomic.  We fix that by
unconditionally overwriting the possibly-torn pages with full-page
images, and we could simply update the checksum fork at the same time.
 We don't have to do anything special to make sure that the next
checkpoint cycle successfully flushes both pages to disk before
declaring the checkpoint a success and moving the redo pointer; that
logic already exists.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote:
 I'd just like to execute
 COPY vmstat_table FROM 'vmstat' WITH ...

 If we go ahead with this, I think it needs additional keyword to
 indicate that we will execute the file rather than read from it. I
 don't think we should rely on whether the file is executable or not to
 determine how we should treat it.

Agreed, and there's also the question of passing switches etc to the
program, so the string can't be a bare file name anyway.  I proposed
pipe symbols (|) in the string previously, but if you find that too
Unix-centric I suppose we could do

COPY TABLE FROM PROGRAM 'command line';
COPY TABLE TO PROGRAM 'command line';
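
(Under that alternative, the gunzip example from upthread would become

    COPY foo FROM PROGRAM 'gunzip -c /home/pgsql/foo.csv.gz' WITH (format 'csv');

again assuming this proposed spelling were the one adopted.)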

regards, tom lane




Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Andrew Dunstan


On 11/14/2012 11:39 AM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

Yeah, I agree, although the syntax looks a bit unclean.

Oh, I had not looked at the syntax closely.  I agree, that basically
sucks: it's overcomplicated and under-featured, because you can't
control the actual program command line very conveniently.  Nor do I see
a reason to force this into the model of program filtering a specific
file.  What happened to the previous proposal of treating the COPY
target as a pipe specification, ie

COPY table FROM 'some command line |';
COPY table TO '| some command line';




I'd like to be able to filter STDIN if possible - with this syntax how 
is COPY going to know to hook up STDIN to the program?


cheers

andrew







Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 11/14/2012 11:39 AM, Tom Lane wrote:
 What happened to the previous proposal of treating the COPY
 target as a pipe specification, ie

 I'd like to be able to filter STDIN if possible - with this syntax how 
 is COPY going to know to hook up STDIN to the program?

Huh?  That's fairly nonsensical for the backend-side case; there's no
way that stdin (or stdout) of a backend is going to connect anywhere
useful for this purpose.  As for doing it on the psql side (\copy),
I think it would be more or less automatic.  If you do say

foo | psql -c \copy tab from 'bar |' dbname

then bar is going to inherit psql's stdin, which is coming from foo.

regards, tom lane




Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Andrew Dunstan


On 11/14/2012 11:56 AM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 11/14/2012 11:39 AM, Tom Lane wrote:

What happened to the previous proposal of treating the COPY
target as a pipe specification, ie

I'd like to be able to filter STDIN if possible - with this syntax how
is COPY going to know to hook up STDIN to the program?

Huh?  That's fairly nonsensical for the backend-side case; there's no
way that stdin (or stdout) of a backend is going to connect anywhere
useful for this purpose.  As for doing it on the psql side (\copy),
I think it would be more or less automatic.  If you do say

foo | psql -c \copy tab from 'bar |' dbname

then bar is going to inherit psql's stdin, which is coming from foo.





Why does it make less sense on the backend than COPY foo FROM STDIN?
Why shouldn't I want to be able to filter or transform the input? I have 
a client with a pretty complex backend-driven ETL tool. One of the 
annoying things about it is that we have to transfer the file to the 
backend before we can process it. I can imagine this leading to a 
similar annoyance.


cheers

andrew





Re: [HACKERS] Enabling Checksums

2012-11-14 Thread Bruce Momjian
On Mon, Nov 12, 2012 at 04:42:57PM -0800, Josh Berkus wrote:
 Jeff,
 
  OK, so here's my proposal for a first patch (changes from Simon's
  patch):
  
* Add a flag to the postgres executable indicating that it should use
  checksums on everything. This would only be valid if bootstrap mode is
  also specified.
* Add a multi-state checksums flag in pg_control, that would have
  three states: OFF, ENABLING, and ON. It would only be set to ON during
  bootstrap, and in this first patch, it would not be possible to set
  ENABLING.
* Remove GUC and use this checksums flag everywhere.
* Use the TLI field rather than the version field of the page header.
* Incorporate page number into checksum calculation (already done).

  Does this satisfy the requirements for a first step? Does it interfere
  with potential future work?
 
 So the idea of this implementation is that checksums is something you
 set at initdb time, and if you want checksums on an existing database,
 it's a migration process (e.g. dump and reload)?
 
 I think that's valid as a first cut at this.

pg_upgrade will need to check for the checksum flag and throw an error
if it is present in the new cluster but not the old one.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-14 Thread Jeff Janes
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeff Janes jeff.ja...@gmail.com writes:
 Are sure the server you are dumping out of is head?

 I experimented a bit with dumping/restoring 16000 tables matching
 Bruce's test case (ie, one serial column apiece).  The pg_dump profile
 seems fairly flat, without any easy optimization targets.  But
 restoring the dump script shows a rather interesting backend profile:

  samples  %        image name   symbol name
  30861    39.6289  postgres     AtEOXact_RelationCache
  9911     12.7268  postgres     hash_seq_search
...

 The hash_seq_search time is probably mostly associated with
 AtEOXact_RelationCache, which is run during transaction commit and scans
 the relcache hashtable looking for tables created in the current
 transaction.  So that's about 50% of the runtime going into that one
 activity.

 There are at least three ways we could whack that mole:

 * Run the psql script in --single-transaction mode, as I was mumbling
 about the other day.  If we were doing AtEOXact_RelationCache only once,
 rather than once per CREATE TABLE statement, it wouldn't be a problem.
 Easy but has only a narrow scope of applicability.

 * Keep a separate list (or data structure of your choice) so that
 relcache entries created in the current xact could be found directly
 rather than having to scan the whole relcache.  That'd add complexity
 though, and could perhaps be a net loss for cases where the relcache
 isn't so bloated.

Maybe a static list that can overflow, like the ResourceOwner/Lock
table one recently added.  The overhead of that should be very low.

Are the three places where need_eoxact_work = true; the only places
where things need to be added to the new structure?  It seems like
there is no need to remove things from the list, because the things
done in AtEOXact_RelationCache are idempotent.
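
(A minimal sketch of that overflowable-list idea, with made-up names and size:

    #define MAX_EOXACT_LIST 32

    static Oid  eoxact_list[MAX_EOXACT_LIST];
    static int  eoxact_list_len = 0;
    static bool eoxact_list_overflowed = false;

    /* Remember a relcache entry that will need end-of-xact work. */
    static void
    EOXactListAdd(Oid relid)
    {
        if (eoxact_list_len < MAX_EOXACT_LIST)
            eoxact_list[eoxact_list_len++] = relid;
        else
            eoxact_list_overflowed = true;  /* fall back to scanning the whole hash */
    }

AtEOXact_RelationCache would then walk just this list in the common case and
only fall back to the full hash_seq_search when the list has overflowed.)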


 * Limit the size of the relcache (eg by aging out
 not-recently-referenced entries) so that we aren't incurring O(N^2)
 costs for scripts touching N tables.  Again, this adds complexity and
 could be counterproductive in some scenarios.

I made the crude hack of just dumping the relcache whenever it was >
1000 at eox.  The time to load 100,000 tables went from 62 minutes
without the patch to 12 minutes with it.  (loading with -1 -f took
23 minutes).

The next quadratic behavior is in init_sequence.

Cheers,

Jeff

diff --git a/src/backend/utils/cache/relcache.c
b/src/backend/utils/cache/relcache.c
index 8c9ebe0..3941c98 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2260,6 +2260,8 @@ AtEOXact_RelationCache(bool isCommit)
)
return;

+if (hash_get_num_entries(RelationIdCache) > 1000)
{RelationCacheInvalidate();}




Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-14 Thread Fujii Masao
On Thu, Nov 15, 2012 at 12:55 AM, Amit Kapila amit.kap...@huawei.com wrote:
 Now user can use this utility to decide if new-standby has max LSN greater
 than max LSN of new-master he needs to use fullback-up on new-standby. Is my
 understanding right?

No. The maximum LSN of data pages in new-standby should be compared with
the last replayed LSN (IOW, the last valid LSN of previous timeline)
of new-master.

OTOH, there can be the case
 where new master has already been ahead of the startpoint.


  But in this case, there is no need for this utility. Right?

No. The above comparison is required in this case.


  So now user may not be able to decide whether he needs to do
 incremental or
  full backup from new master,
  is this the case you are trying to point?

 Sorry, I could not parse this comment. Could you elaborate your concern
 again?

 I wanted to understand the usecase mentioned by you for this utility.
 As far as I can understand is that it will be used to decide that on
 new-standby (old-master) whether a full backup is needed from
 New-master(old-standby).

Yes.

 And that situation can occur when new-standby has startpoint LSN greater
 than new-master?

Whether the backup is required has nothing to do with the startpoint.
The backup is required when the data page in old-master precedes
the last applied LSN in old-standby (i.e., new-master) at the moment
of the failover. Without the backup, there is no way to revert the data
which is ahead of new-master.

Regards,

-- 
Fujii Masao




Re: [HACKERS] WIP checksums patch

2012-11-14 Thread Bruce Momjian
On Mon, Nov 12, 2012 at 12:39:17AM -0500, Greg Smith wrote:
 On 11/9/12 6:14 PM, Jeff Davis wrote:
 On Mon, 2012-11-05 at 12:19 -0500, Robert Haas wrote:
 Yeah.  I definitely think that we could shed an enormous amount of
 complexity by deciding that this is, for now, an option that can only
 be selected at initdb time.  That would remove approximately 85% of
 everything I've ever disliked about this patch - without, I think,
 precluding the possibility of improving things later.
 
 That's certainly true, but it introduces one large problem: upgrading
 would not work, which (in the past few releases) we've treated as a
 major showstopper for many features.
 
 If you have pages that were written with one datetime storage
 format, and you create a cluster using the other one, that will
 fail.  If checksums are done as an initdb time option, I see
 checksummed as being a block change on that level, and the
 precedent for not supporting it defensible.  pg_upgrade will need to
 check for a mismatch--new cluster has checksums turned on, old one
 doesn't--and abort out if that happens.  That can be lumped in with
 the other pg_controldata tests easily enough.

Yes, pg_upgrade can do that easily.

 What I think this argues for, though, is being precise about naming
 what goes into pg_controldata.  Let's say the initial commit target
 is an initdb time switch, but later finer grained ones are expected
 to be available.  I think the output and source code labels on the
 initdb controldata part should refer to this as something like
 checksums available then.  The word enabled could be misleading
 when there's finer grained enabling coming later.  We don't want
 people to run pg_controldata, see checksums:  enabled/on, and
 later discover they're not fully operational in that cluster yet.
 Saying checksums: available suggests there's somewhere else that
 should be checked, to tell if they're available on a given
 database/table or not.
 
 The sort of text I'm thinking of for the manual and/or warning is:
 
 You can't use pg_upgrade to migrate from a cluster where checksums
 are not available to one where they are.  This limitation may be
 removed by a future version.

available sounds like it is compiled in, but in this case, it means it
is active.  I think we are just going to need to rename it as we make it
finer grained.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Enabling Checksums

2012-11-14 Thread Peter Eisentraut
On 11/11/12 6:59 PM, Andrew Dunstan wrote:
 I haven't followed this too closely, but I did wonder several days ago
 why this wasn't being made an initdb-time decision.

One problem I see with this is that it would make regression testing
much more cumbersome.  Basically, to do a proper job, you'd have to run
all the tests twice, once against each initdb setting.  Either we
automate this, which would mean everyone's tests are now running almost
twice as long, or we don't, which would mean that some critical piece of
low-level code would likely not get wide testing.




Re: [HACKERS] Enabling Checksums

2012-11-14 Thread Alvaro Herrera
Peter Eisentraut escribió:
 On 11/11/12 6:59 PM, Andrew Dunstan wrote:
  I haven't followed this too closely, but I did wonder several days ago
  why this wasn't being made an initdb-time decision.
 
 One problem I see with this is that it would make regression testing
 much more cumbersome.  Basically, to do a proper job, you'd have to run
 all the tests twice, once against each initdb setting.  Either we
 automate this, which would mean everyone's tests are now running almost
 twice as long, or we don't, which would mean that some critical piece of
 low-level code would likely not get wide testing.

We already have that problem with the isolation tests regarding
transaction isolation levels: the tests are only run with whatever is
the default_transaction_isolation setting, which is read committed in
all buildfarm installs; so repeatable read and serializable are only
tested when someone gets around to tweaking an installation manually.  A
proposal has been floated to fix that, but it needs someone to actually
implement it.

I wonder if something similar could be used to handle this case as well.
I also wonder, though, if the existing test frameworks are really the
best mechanisms to verify block layer functionality.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Peter Eisentraut
On 11/14/12 11:50 AM, Andrew Dunstan wrote:
 COPY table FROM 'some command line |';
 COPY table TO '| some command line';

 
 
 I'd like to be able to filter STDIN if possible - with this syntax how
 is COPY going to know to hook up STDIN to the program?

Why don't you filter the data before it gets to stdin?  Some program is
feeding the data to stdin on the client side.  Why doesn't that do the
filtering?  I don't see a large advantage in having the data be sent
unfiltered to the server and having the server do the filtering.




Re: [HACKERS] Enabling Checksums

2012-11-14 Thread Andrew Dunstan


On 11/14/2012 02:01 PM, Alvaro Herrera wrote:

Peter Eisentraut escribió:

On 11/11/12 6:59 PM, Andrew Dunstan wrote:

I haven't followed this too closely, but I did wonder several days ago
why this wasn't being made an initdb-time decision.

One problem I see with this is that it would make regression testing
much more cumbersome.  Basically, to do a proper job, you'd have to run
all the tests twice, once against each initdb setting.  Either we
automate this, which would mean everyone's tests are now running almost
twice as long, or we don't, which would mean that some critical piece of
low-level code would likely not get wide testing.

We already have that problem with the isolation tests regarding
transaction isolation levels: the tests are only run with whatever is
the default_transaction_isolation setting, which is read committed in
all buildfarm installs; so repeatable read and serializable are only
tested when someone gets around to tweaking an installation manually.  A
proposal has been floated to fix that, but it needs someone to actually
implement it.

I wonder if something similar could be used to handle this case as well.
I also wonder, though, if the existing test frameworks are really the
best mechanisms to verify block layer functionality.



There is nothing to prevent a buildfarm owner from using different 
settings - there is a stanza in the config file that provides for them 
to do so in fact.


Maybe a saner thing to do though would be to run the isolation tests two 
or three times with different PGOPTIONS settings. Maybe we need two or 
three targets in the isolation test Makefile for that.


Regarding checksums, I can add an option for the initdb that the 
buildfarm script runs. We already run different tests for different 
encodings. Of course, constant expansion like this won't scale, so we 
need to pick the options we want to exercise carefully.


cheers

andrew





Re: [HACKERS] WIP patch for hint bit i/o mitigation

2012-11-14 Thread Atri Sharma
On Wed, Nov 7, 2012 at 9:47 PM, Merlin Moncure mmonc...@gmail.com wrote:

 On Wed, Nov 7, 2012 at 6:01 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
    Following the sig is a first cut at a patch (written by Atri) that
    attempts to mitigate hint bit i/o penalty when many pages worth of
    tuples are sequentially written out with the same transaction id.
    There have been other attempts to deal with this problem that fit
    niche cases (especially those that create the table in the same
    transaction as the one inserting) that work but don't really solve
    the problem generally.

  As we are now dealing with only the last xid (please refer to the details
  of the patch attached), the invalidation issues are not significant any
  more.

  I think you are right, but please check below the scenario I have in mind
  due to which I got confused:

  Session-1
  1. Let's say for tup-1 on a page, cachedVisibilityXid is not set, and it goes
  inside SetHintBits and sets it to the xid of the tuple, which is, let's say, 708.
  2. Now for all consecutive tuples which have the same xmin (708), it can
  directly refer to the cached id and cached status, even if the hint bit is not set.

  Other sessions
  3. Now from other sessions, large operations happened on all other tables
  except this table.
  4. The situation can reach where xid can wrap around.

  Session-1
  5. It again tries to query the same table; at this point it will compare
  the xid in the tuple with cachedVisibilityXid.

  Now if all tuples' xids for that particular table are frozen in all cases
  then it seems to be okay, otherwise it might be a problem.
  I am not fully aware about this wrap around and frozen xid concept, that's
  why I had doubted that it might create a problem; on further thought, it
  appears that I was wrong.

 Well there's that.  But more to the point for the cache to fail you'd
 have to have a scenario where a table didn't scan any records for 1
 billion+ transactions.   See note [3] above for reasoning why this is
 implausible.  Also we're already relying on this effect in transam.c.

 merlin


PFA below the sig the updated patch for the same.It maintains a cache
cachedVisibilityXid which holds the results of clog visibility check.
cachedVisibilityXidStatus holds the commit status of the XID in
cachedVisibilityXid.

In each visibility function (except HeapTupleSatisfiesVacuum()), an
additional check has been added to see whether the commit status of the Xmin
or Xmax of a tuple can be retrieved from the cache.

So, in place of

  if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))

the condition is now

if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED)
    && !(cachedVisibilityXid == HeapTupleHeaderGetXmin(tuple)
         && cachedVisibilityXidStatus == HEAP_XMIN_COMMITTED))

This checks if the commit status can be known from the cache or not before
proceeding.
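
To see the idea in isolation, here is a minimal standalone sketch of such a
single-entry cache (simplified, with hypothetical names; the real patch below
integrates this into the existing visibility routines):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Single-entry cache: the last XID whose clog status we established. */
static TransactionId cachedXid = 0;
static bool cachedXidCommitted = false;

/* Record that a clog lookup found xid to be committed. */
static void
remember_committed(TransactionId xid)
{
	cachedXid = xid;
	cachedXidCommitted = true;
}

/*
 * True if we already know this xmin committed, so the caller can skip
 * both the hint-bit test and the clog lookup (and the hint-bit write).
 */
static bool
known_committed(TransactionId xmin)
{
	return xmin == cachedXid && cachedXidCommitted;
}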

I will be posting the patch to commit fest.

Thoughts/Feedback?

Atri

-- 
Regards,

Atri
*l'apprenant


*patch:

*** tqual--unchanged.c2012-09-20 03:17:58.0 +0530
--- tqual.c2012-11-14 23:27:30.470499857 +0530
***
*** 75,80 
--- 75,88 
  /* local functions */
  static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);

+ /*
+  * Single-item cache for results of clog visibility check.  It's worth
having
+  * such a cache to help reduce the amount of hint bit traffic when
+  * many sequentially touched tuples have the same XID.
+  */
+ static TransactionId cachedVisibilityXid;
+ /* Visibility status cache stores the commit status of the XID in
cachedVisibilityXid */
+ static uint16 cachedVisibilityXidStatus;

  /*
   * SetHintBits()
***
*** 117,122 
--- 125,133 

  if (XLogNeedsFlush(commitLSN)  BufferIsPermanent(buffer))
  return;/* not flushed yet, so don't set hint
*/
+
+ cachedVisibilityXid = xid;
+ cachedVisibilityXidStatus = infomask;
  }

  tuple-t_infomask |= infomask;
***
*** 164,170 
  bool
  HeapTupleSatisfiesSelf(HeapTupleHeader tuple, Snapshot snapshot, Buffer
buffer)
  {
! if (!(tuple-t_infomask  HEAP_XMIN_COMMITTED))
  {
  if (tuple-t_infomask  HEAP_XMIN_INVALID)
  return false;
--- 175,183 
  bool
  HeapTupleSatisfiesSelf(HeapTupleHeader tuple, Snapshot snapshot, Buffer
buffer)
  {
! if (!(tuple-t_infomask  HEAP_XMIN_COMMITTED)
!  !(cachedVisibilityXid == HeapTupleHeaderGetXmin(tuple)
!  cachedVisibilityXidStatus == HEAP_XMIN_COMMITTED))
  {
  if (tuple-t_infomask  HEAP_XMIN_INVALID)
  return false;
***
*** 247,253 
  if (tuple-t_infomask  HEAP_XMAX_INVALID)/* xid invalid or
aborted */
  return true;

! if (tuple-t_infomask  HEAP_XMAX_COMMITTED)
  {
  if (tuple-t_infomask  HEAP_IS_LOCKED)
  return true;
--- 260,268 
  if (tuple-t_infomask  HEAP_XMAX_INVALID)/* xid invalid or
aborted 

Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Andrew Dunstan


On 11/14/2012 02:05 PM, Peter Eisentraut wrote:

On 11/14/12 11:50 AM, Andrew Dunstan wrote:

 COPY table FROM 'some command line |';
 COPY table TO '| some command line';



I'd like to be able to filter STDIN if possible - with this syntax how
is COPY going to know to hook up STDIN to the program?

Why don't you filter the data before it gets to stdin?  Some program is
feeding the data to stdin on the client side.  Why doesn't that do the
filtering?  I don't see a large advantage in having the data be sent
unfiltered to the server and having the server do the filtering.



Centralization of processing would be one obvious reason. I don't really 
see why the same reasoning doesn't apply on the backend. You could just 
preprocess the input before calling COPY (via a plperlu function for 
example). If we're going to have filtering functionality then it should 
be as general as possible, ISTM. But I seem to be alone in this, so I 
won't push it.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 11/14/2012 02:05 PM, Peter Eisentraut wrote:
 Why don't you filter the data before it gets to stdin?  Some program is
 feeding the data to stdin on the client side.  Why doesn't that do the
 filtering?  I don't see a large advantage in having the data be sent
 unfiltered to the server and having the server do the filtering.

 Centralization of processing would be one obvious reason.

If I understand correctly, what you're imagining is that the client
sources data to a COPY FROM STDIN type of command, then the backend
pipes that out to stdin of some filtering program, which it then reads
the stdout of to get the data it processes and stores.

We could in principle make that work, but there are some pretty serious
implementation problems: popen doesn't do this so we'd have to cons up
our own fork and pipe setup code, and we would have to write a bunch of
asynchronous processing logic to account for the possibility that the
filter program doesn't return data in similar-size chunk to what it
reads.  (IOW, it will never be clear when to try to read data from the
filter and when to try to write data to it.)

I think it's way too complicated for the amount of functionality you'd
get.  As Peter says, there's no strong reason not to do such processing
on the client side.  In fact there are pretty strong reasons to prefer
to do it there, like not needing database superuser privilege to invoke
the filter program.

What I'm imagining is a very very simple addition to COPY that just
allows it to execute popen() instead of fopen() to read or write the
data source/sink.  What you suggest would require hundreds of lines and
create many opportunities for new bugs.
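
For illustration only, the kind of change I have in mind is roughly the
following (a sketch with made-up names, not actual COPY code):

#include <stdio.h>
#include <stdbool.h>

/*
 * Open the COPY data source either as a plain file or, when is_program is
 * set, as a pipe reading the stdout of a shell command via popen().  All of
 * COPY's real error handling and state management is omitted here.
 */
static FILE *
copy_open_source(const char *filename_or_command, bool is_program)
{
	if (is_program)
		return popen(filename_or_command, "r");
	return fopen(filename_or_command, "r");
}

static int
copy_close_source(FILE *fp, bool is_program)
{
	/* pclose() additionally reports the command's exit status */
	return is_program ? pclose(fp) : fclose(fp);
}

The COPY TO direction would do the same thing with "w".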

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-14 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 There are at least three ways we could whack that mole: ...

 * Keep a separate list (or data structure of your choice) so that
 relcache entries created in the current xact could be found directly
 rather than having to scan the whole relcache.  That'd add complexity
 though, and could perhaps be a net loss for cases where the relcache
 isn't so bloated.

 Maybe a static list that can overflow, like the ResourceOwner/Lock
 table one recently added.  The overhead of that should be very low.

 Are the three places where need_eoxact_work = true; the only places
 where things need to be added to the new structure?

Yeah.  The problem is not so much the number of places that do that,
as that places that flush entries from the relcache would need to know
to remove them from the separate list, else you'd have dangling
pointers.  It's certainly not impossible, I was just unsure how much
of a pain in the rear it might be.
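
A sketch of the overflowable-list idea (invented names, not actual relcache
code); remembering OIDs rather than cache pointers would sidestep the
dangling-pointer problem, since a flushed entry simply won't be found again:

#include <stdbool.h>

typedef unsigned int Oid;

#define MAX_EOXACT_LIST 32

/* OIDs of relcache entries needing end-of-xact work, plus an overflow flag */
static Oid	eoxact_list[MAX_EOXACT_LIST];
static int	eoxact_list_len = 0;
static bool eoxact_list_overflowed = false;

static void
eoxact_list_add(Oid relid)
{
	if (eoxact_list_len < MAX_EOXACT_LIST)
		eoxact_list[eoxact_list_len++] = relid;
	else
		eoxact_list_overflowed = true;	/* fall back to scanning everything */
}

At end of transaction, an overflowed list means doing the full relcache scan
as today; otherwise only the listed OIDs need to be looked up, and any that
were flushed in the meantime are simply skipped.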

 The next quadratic behavior is in init_sequence.

Yeah, that's another place that is using a linear list that perhaps
should be a hashtable.  OTOH, probably most sessions don't touch enough
different sequences for that to be a win.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-14 Thread Alvaro Herrera
Tom Lane escribió:
 Jeff Janes jeff.ja...@gmail.com writes:

  The next quadratic behavior is in init_sequence.
 
 Yeah, that's another place that is using a linear list that perhaps
 should be a hashtable.  OTOH, probably most sessions don't touch enough
 different sequences for that to be a win.

Could we use some adaptive mechanism here?  Say we use a list for the
first ten entries, and if an eleventh one comes in, we create a hash
table for that one and all subsequent ones.  All future calls would
have to examine both the list for the first few and then the hash table.
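
Roughly like this, as a sketch (invented names; the real per-sequence state
and the existing linear list are of course more involved):

typedef unsigned int Oid;

#define SEQ_LIST_MAX 10

typedef struct SeqCacheEntry
{
	Oid			relid;
	/* ... per-sequence cached state would live here ... */
} SeqCacheEntry;

static SeqCacheEntry seq_list[SEQ_LIST_MAX];	/* the first few sequences */
static int	seq_list_len = 0;

static SeqCacheEntry *
seq_lookup(Oid relid)
{
	int			i;

	/* the short list covers the common case of a handful of sequences */
	for (i = 0; i < seq_list_len; i++)
	{
		if (seq_list[i].relid == relid)
			return &seq_list[i];
	}

	/*
	 * Once an eleventh sequence is touched, entries beyond the list would
	 * go into a hash table keyed by relid and be probed here; omitted to
	 * keep the sketch short.
	 */
	return NULL;
}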

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-14 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 Regarding checksums, I can add an option for the initdb that the 
 buildfarm script runs. We already run different tests for different 
 encodings. Of course, constant expanding like this won't scale, so we 
 need to pick the options we want to exercise carefully.

I thought the whole point of the buildfarm was to provide a scalable way
of exercising different combinations of options that individual
developers couldn't practically test.  We might need a little more
coordination among buildfarm owners to ensure we get full coverage,
of course.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Andrew Dunstan


On 11/14/2012 02:37 PM, Tom Lane wrote:


What I'm imagining is a very very simple addition to COPY that just
allows it to execute popen() instead of fopen() to read or write the
data source/sink.  What you suggest would require hundreds of lines and
create many opportunities for new bugs.




That's certainly a better answer than any I've had. I accept the reasoning.

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-14 Thread Andrew Dunstan


On 11/14/2012 03:06 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

Regarding checksums, I can add an option for the initdb that the
buildfarm script runs. We already run different tests for different
encodings. Of course, constant expanding like this won't scale, so we
need to pick the options we want to exercise carefully.

I thought the whole point of the buildfarm was to provide a scalable way
of exercising different combinations of options that individual
developers couldn't practically test.  We might need a little more
coordination among buildfarm owners to ensure we get full coverage,
of course.




Yes, true. So let's wait and see how the checksums thing works out and 
then we can tackle the buildfarm end. At any rate, I don't think the 
buildfarm is a reason not to have this as an initdb setting.



cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch for hint bit i/o mitigation

2012-11-14 Thread Atri Sharma
On Thu, Nov 15, 2012 at 12:42 AM, Atri Sharma atri.j...@gmail.com wrote:




 On Wed, Nov 7, 2012 at 9:47 PM, Merlin Moncure mmonc...@gmail.com wrote:

 On Wed, Nov 7, 2012 at 6:01 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
   Following the sig is a first cut at a patch (written by Atri) that
   attempts to mitigate hint bit i/o penalty when many pages worth of
   tuples are sequentially written out with the same transaction id.
   There have been other attempts to deal with this problem that fit
   niche cases (especially those that create the table in the same
   transaction as the one inserting) that work but don't really solve
  the
   problem generally.
 
  As we are now dealing with only the last xid(please refer to the
 details
  of the patch attached), the invalidation issues are not significant any
  more.
 
  I think you are right, but please check below the scenario I have in
 mind
  due to which I got confused:
 
  Session -1
  1. let's say for tup-1 on page, cachedVisibilityXid is not set, and it
 go
 inside SetHinbits and set it to xid of tuple which is let's say = 708
  2. now for all consecutive tuples which have same xmin (708), it can
  directly refer
 cached id and cached status, even if hint bit is not set.
 
  Other Sessions
  3. now from other sessions, large operations happened on all other
 tables
  except this table.
  4. The situation can reach where xid can wrap around.
 
  Session -1
  5. It again tries to query the same table, at this point it will compare
 the xid in tuple with cachedVisibilityXid.
 
  Now if all tuple's xid's for that particular table are frozen in all
 cases
  then it seems to be okay, otherwise it might be problem.
  I am not fully aware about this wrap around and frozed xid concept,
 thats
  why I had doubted
  that it might create problem, on further thought, it appears that I was
  wrong.

 Well there's that.  But more to the point for the cache to fail you'd
 have to have a scenario where a table didn't scan any records for 1
 billion+ transactions.   See note [3] above for reasoning why this is
 implausible.  Also we're already relying on this effect in transam.c.

 merlin


 PFA below the sig the updated patch for the same.It maintains a cache
 cachedVisibilityXid which holds the results of clog visibility check.
 cachedVisibilityXidStatus holds the commit status of the XID in
 cachedVisibilityXid.

 In each visibility function (except HeapTupleSatisfiesVacuum() ), an
 addition check has been added to check if the commit status of Xmin or Xmax
 of a tuple can be retrieved from the cache.

 So, in place of

   if (!(tuple-t_infomask  HEAP_XMIN_COMMITTED))

 the condition is now


 if (!(tuple-t_infomask  HEAP_XMIN_COMMITTED)
  !(cachedVisibilityXid == HeapTupleHeaderGetXmin(tuple)
  cachedVisibilityXidStatus == HEAP_XMIN_COMMITTED))

 This checks if the commit status can be known from the cache or not before
 proceeding.

 I will be posting the patch to commit fest.

 Thoughts/Feedback?

 Atri

 --
 Regards,

 Atri
 *l'apprenant


 *patch:

 *** tqual--unchanged.c2012-09-20 03:17:58.0 +0530
 --- tqual.c2012-11-14 23:27:30.470499857 +0530
 ***
 *** 75,80 
 --- 75,88 

   /* local functions */
   static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);

 + /*
 +  * Single-item cache for results of clog visibility check.  It's worth
 having
 +  * such a cache to help reduce the amount of hint bit traffic when
 +  * many sequentially touched tuples have the same XID.
 +  */
 + static TransactionId cachedVisibilityXid;
 + /* Visibility status cache stores the commit status of the XID in
 cachedVisibilityXid */
 + static uint16 cachedVisibilityXidStatus;

   /*
* SetHintBits()
 ***
 *** 117,122 
 --- 125,133 


   if (XLogNeedsFlush(commitLSN)  BufferIsPermanent(buffer))
   return;/* not flushed yet, so don't set hint
 */
 +
 + cachedVisibilityXid = xid;
 + cachedVisibilityXidStatus = infomask;
   }

   tuple-t_infomask |= infomask;
 ***
 *** 164,170 
   bool
   HeapTupleSatisfiesSelf(HeapTupleHeader tuple, Snapshot snapshot, Buffer
 buffer)

   {
 ! if (!(tuple-t_infomask  HEAP_XMIN_COMMITTED))
   {
   if (tuple-t_infomask  HEAP_XMIN_INVALID)
   return false;
 --- 175,183 
   bool
   HeapTupleSatisfiesSelf(HeapTupleHeader tuple, Snapshot snapshot, Buffer
 buffer)

   {
 ! if (!(tuple-t_infomask  HEAP_XMIN_COMMITTED)
 !  !(cachedVisibilityXid == HeapTupleHeaderGetXmin(tuple)
 !  cachedVisibilityXidStatus == HEAP_XMIN_COMMITTED))
   {
   if (tuple-t_infomask  HEAP_XMIN_INVALID)
   return false;
 ***
 *** 247,253 
   if (tuple-t_infomask  HEAP_XMAX_INVALID)/* xid invalid or
 aborted */
   return true;

 ! if (tuple-t_infomask  HEAP_XMAX_COMMITTED)
   {
 

Re: [HACKERS] Doc patch, further describe and-mask nature of the permission system

2012-11-14 Thread Karl O. Pinc
On 11/13/2012 08:50:55 PM, Peter Eisentraut wrote:
 On Sat, 2012-09-29 at 01:16 -0500, Karl O. Pinc wrote:
  This patch makes some sweeping statements.
 
 Unfortunately, they are wrong.

I will see if anything can be salvaged.

Regards,

Karl k...@meme.com
Free Software:  You don't pay back, you pay forward.
 -- Robert A. Heinlein



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] Fix INT_MIN % -1 overflow in int8mod().

2012-11-14 Thread Xi Wang
Return 0 for INT_MIN % -1 (64-bit) instead of throwing an exception.
This patch complements commit f9ac414c that fixed int4mod().
---
 src/backend/utils/adt/int8.c |4 
 src/include/c.h  |7 +++
 2 files changed, 11 insertions(+)

diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 0e59956..9da651b 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -649,6 +649,10 @@ int8mod(PG_FUNCTION_ARGS)
PG_RETURN_NULL();
}
 
+   /* SELECT ((-9223372036854775808)::int8) % (-1); causes a floating 
point exception */
+	if (arg1 == INT64_MIN && arg2 == -1)
+   PG_RETURN_INT64(0);
+
/* No overflow is possible */
 
PG_RETURN_INT64(arg1 % arg2);
diff --git a/src/include/c.h b/src/include/c.h
index a6c0e6e..d20ba8c 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -294,6 +294,13 @@ typedef unsigned long long int uint64;
 #define UINT64CONST(x) ((uint64) x)
 #endif
 
+#ifndef INT64_MAX
+#define INT64_MAX INT64CONST(9223372036854775807)
+#endif
+
+#ifndef INT64_MIN
+#define INT64_MIN (-INT64_MAX-1)
+#endif
 
 /* Select timestamp representation (float8 or int64) */
 #ifdef USE_INTEGER_DATETIMES
-- 
1.7.10.4



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Doc patch making firm recommendation for setting the value of commit_delay

2012-11-14 Thread Peter Geoghegan
Some of you will be aware that I've tried to formulate a good general
recommendation for setting the value of commit_delay, in light of the
improvements made in 9.3. I previously asked for help finding a
methodology for optimising commit_delay [1]. The documentation related
to commit_delay still says that we don't know what might work well,
though I don't think that's true any more.

I found the time to do some benchmarks of my own - Greg Smith made
available a server that he frequently uses for benchmarks. It was
previously my observation that half of raw single-page sync time as
reported by pg_test_fsync for your wal_sync_method worked best for
commit_delay. I went so far as to modify pg_test_fsync to output
average time per op in microseconds for each operation with
commit_delay in mind, which is a patch that has already been committed
[2]. This general approach worked really well on my laptop, which has
a slowish fsync of about 8 milliseconds (8,000 microseconds), for
which a commit_delay setting of 4,000 (microseconds) seemed to clearly
work best, on both insert and tpc-b benchmarks [3].
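
Put as a formula, the rule of thumb I'm proposing is simply half of the
per-op figure that pg_test_fsync now reports (a sketch, with a made-up
helper name):

/* half the raw single-page sync time, both values in microseconds */
static int
suggested_commit_delay(int single_page_sync_us)
{
	return single_page_sync_us / 2;		/* e.g. 8000 us fsync -> 4000 */
}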

This server has an Intel 320 SSD series. Greg has already written
quite a bit about this drive [4] for unrelated reasons. For the SSD,
results of an insert-based pgbench-tools run are shown, with
commit_delay at 0, 20 and 59 (determined by following my general
advise above):

http://commit-delay-results-ssd-insert.staticloud.com

I also looked at a 3-disk RAID0 of 7200RPM drives connected through a
512MB battery-backed write cache (Areca controller), again using
insert.sql:

http://commit-delay-stripe-insert.staticloud.com

In addition to a tpc-b benchmark with the same data directory on the
same 3-disk stripe:

http://commit-delay-results-stripe-tpcb.staticloud.com

I used the same postgresql.conf in all cases, which you can see for
each report, and did the usual median-of-three thing (though each run
lasted 120 seconds rather than the default 60). Settings used for one
particular pgbench run can be viewed
here (though they're all exactly the same anyway):

http://commit-delay-results-ssd-insert.staticloud.com/19/pg_settings.txt

Attached is a doc-patch that makes recommendations that are consistent
with my observations about what works best here. I'd like to see us
making *some* recommendation - for sympathetic cases, setting
commit_delay appropriately can make a very large difference to
transaction throughput. Such sympathetic cases - many small write
transactions - are something that tends to be seen relatively
frequently with web applications, that disproportionately use cloud
hosting. It isn't at all uncommon for these cases to be highly bound
by their commit rate, and so it is compelling to try to amortize the
cost of a flush as effectively as possible there. It would be
unfortunate if no one was even aware that commit_delay is now useful
for these cases, since the setting allows cloud hosting providers to
help these cases quite a bit, without having to do something like
compromise durability, which in general isn't acceptable.

The point of all these benchmarks isn't to show how effective
commit_delay now is, or can be - we already had that discussion months
ago, before the alteration to its behaviour was committed. The point
is to put my proposed doc changes in the appropriate context, so that
I can build confidence that they're balanced and helpful, by showing
cases that are not so sympathetic.

Thoughts?

[1] http://archives.postgresql.org/pgsql-performance/2012-08/msg3.php

[2] 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=82e429794b348cd80c1d1b011e21ffac98bc6e88

[3] 
http://pgeoghegan.blogspot.com/2012/06/towards-14000-write-transactions-on-my.html

[4] http://blog.2ndquadrant.com/intel_ssds_lifetime_and_the_32/

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


commit_delay_doc.2012_11_14.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Fix INT_MIN % -1 overflow in int8mod().

2012-11-14 Thread Tom Lane
Xi Wang xi.w...@gmail.com writes:
 Return 0 for INT_MIN % -1 (64-bit) instead of throwing an exception.
 This patch complements commit f9ac414c that fixed int4mod().

Meh.  I didn't care for the explicit dependency on INT_MIN in the
previous patch, and I like introducing INT64_MIN even less.  I think
we should be able to reduce the test to just

if (arg2 == -1)
return 0;

since zero is the correct result for any value of arg1, not only
INT_MIN.
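
For anyone who wants to see the failure in isolation, a standalone
illustration of the trap and the guard (plain C, independent of the backend):

#include <stdint.h>
#include <stdio.h>

/*
 * INT64_MIN % -1 is mathematically 0, but the underlying division
 * INT64_MIN / -1 overflows, and on x86 the idiv instruction traps with
 * SIGFPE even though only the remainder is wanted -- hence the guard.
 */
static int64_t
mod64(int64_t a, int64_t b)
{
	if (b == -1)
		return 0;				/* a % -1 is always 0 */
	return a % b;				/* caller must still reject b == 0 */
}

int
main(void)
{
	printf("%lld\n", (long long) mod64(INT64_MIN, -1));	/* prints 0 */
	return 0;
}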

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Fix INT_MIN % -1 overflow in int8mod().

2012-11-14 Thread Xi Wang
On 11/14/12 4:46 PM, Tom Lane wrote:
 Meh.  I didn't care for the explicit dependency on INT_MIN in the
 previous patch, and I like introducing INT64_MIN even less.  I think
 we should be able to reduce the test to just
 
   if (arg2 == -1)
   return 0;
 
 since zero is the correct result for any value of arg1, not only
 INT_MIN.

I agree.  Will send a v2.  Thanks. :)

- xi


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Fix INT_MIN % -1 overflow in int8mod().

2012-11-14 Thread Tom Lane
Xi Wang xi.w...@gmail.com writes:
 I agree.  Will send a v2.  Thanks. :)

No need, I'm already patching it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH 1/2] Fix INT_MIN % -1 overflow in int8mod().

2012-11-14 Thread Xi Wang
Return 0 for x % -1 instead of throwing an exception (e.g., when x
is INT_MIN).

Suggested by Tom Lane.
---
 src/backend/utils/adt/int8.c |4 
 1 file changed, 4 insertions(+)

diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 0e59956..a30ab36 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -649,6 +649,10 @@ int8mod(PG_FUNCTION_ARGS)
PG_RETURN_NULL();
}
 
+   /* SELECT ((-9223372036854775808)::int8) % (-1); causes a floating 
point exception */
+   if (arg2 == -1)
+   PG_RETURN_INT64(0);
+
/* No overflow is possible */
 
PG_RETURN_INT64(arg1 % arg2);
-- 
1.7.10.4



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH 2/2] Clean up INT_MIN % -1 overflow in int4mod().

2012-11-14 Thread Xi Wang
Since x % -1 returns 0 for any x, we don't need the check x == INT_MIN.

Suggested by Tom Lane.
---
 src/backend/utils/adt/int.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/int.c b/src/backend/utils/adt/int.c
index 4be3901..3e423fe 100644
--- a/src/backend/utils/adt/int.c
+++ b/src/backend/utils/adt/int.c
@@ -1096,7 +1096,7 @@ int4mod(PG_FUNCTION_ARGS)
}
 
/* SELECT ((-2147483648)::int4) % (-1); causes a floating point 
exception */
-	if (arg1 == INT_MIN && arg2 == -1)
+   if (arg2 == -1)
PG_RETURN_INT32(0);
 
/* No overflow is possible */
-- 
1.7.10.4



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Fix INT_MIN % -1 overflow in int8mod().

2012-11-14 Thread Xi Wang
On 11/14/12 5:03 PM, Tom Lane wrote:
 No need, I'm already patching it.

Oops.  Sorry.  Ignore my patches.

- xi


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP checksums patch

2012-11-14 Thread Merlin Moncure
On Tue, Oct 9, 2012 at 1:51 AM, Amit Kapila amit.kap...@huawei.com wrote:
 On Monday, October 01, 2012 11:11 PM Jeff Davis wrote:
 On Mon, 2012-10-01 at 18:14 +0100, Simon Riggs wrote:
  You are missing large parts of the previous thread, giving various
  opinions on what the UI should look like for enabling checksums.

 I read through all of the discussion that I could find. There was quite
 a lot, so perhaps I have forgotten pieces of it.

 But that one section in the docs does look out of date and/or confusing
 to me.

 I remember there was discussion about a way to ensure that checksums are
 set cluster-wide with some kind of special command (perhaps related to
 VACUUM) and a magic file to let recovery know whether checksums are set
 everywhere or not. That doesn't necessarily conflict with the GUC though
 (the GUC could be a way to write checksums lazily, while this new
 command could be a way to write checksums eagerly).

 If some consensus was reached on the exact user interface, can you
 please send me a link?

 AFAICT complete consensus has not been reached but one of the discussions can 
 be found on below link:
 http://archives.postgresql.org/pgsql-hackers/2012-03/msg00279.php
 Here Robert has given suggestions and then further there is more discussion 
 based on that points.

 According to me, the main points where more work for this patch is required 
 as per previous discussions is as follows:

 1. Performance impact of WAL log for hint-bits needs to be verified for 
 scenario's other than pg_bench (Such as bulk data load (which I
feel there is some way to optimize, but I don't know if that’s part of 
 this patch)).

Atri has a patch posted which (if it passes muster) would eliminate
the i/o impact of WAL logging hint bits following a bulk load or any
scenario where many pages worth of tuples were sequentially written
out with the same XID.  Atri's patch was written with the checksum
patch in mind.

http://archives.postgresql.org/message-id/CAOeZVid7zd9V-9WsEtf9wRCd0H4yw_1OgCLm7%3DCdvGPCxZcJJw%40mail.gmail.com

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-14 Thread Bruce Momjian

Patch applied to git head.  Thanks Ants Aasma for the analysis that led
to the patch.

---

On Tue, Nov 13, 2012 at 07:03:51PM -0500, Bruce Momjian wrote:
 On Tue, Nov 13, 2012 at 05:44:54AM +0200, Ants Aasma wrote:
  On Mon, Nov 12, 2012 at 10:59 PM, Bruce Momjian br...@momjian.us wrote:
   You can see a significant speedup with those loops removed.  The 16k
   case is improved, but still not linear.  The 16k dump/restore scale
   looks fine, so it must be something in pg_upgrade, or in the kernel.
  
  I can confirm the speedup. Profiling results for 9.3 to 9.3 upgrade
  for 8k and 64k tables are attached. pg_upgrade itself is now taking
  negligible time.
 
 I generated these timings from the attached test script.
 
 -- 9.3 --
 tables    ----- normal -----    - binary_upgrade -    --- pg_upgrade ---
              dmp       res        dmp       res          git      patch
      1      0.12      0.07       0.13      0.07        11.06      11.02
   1000      2.20      2.46       3.57      2.82        19.15      18.61
   2000      4.51      5.01       8.22      5.80        29.12      26.89
   4000      8.97     10.88      14.76     12.43        45.87      43.08
   8000     15.30     24.72      30.57     27.10       100.31      79.75
  16000     36.14     54.88      62.27     61.69       248.03     167.94
  32000     55.29    162.20     115.16    179.15       695.05     376.84
  64000    149.86    716.46     265.77    724.32      2323.73    1122.38
 
 You can see the speedup of the patch, particularly for a greater number
 of tables, e.g. 2x faster for 64k tables.
 
  The 64k profile shows the AtEOXact_RelationCache scaling problem. For
  the 8k profile nothing really pops out as a clear bottleneck. CPU time
  distributes 83.1% to postgres, 4.9% to pg_dump, 7.4% to psql and 0.7%
  to pg_upgrade.
 
 At 64k I see pg_upgrade taking 12% of the duration time, if I subtract
 out the dump/restore times.
 
 I am attaching an updated pg_upgrade patch, which I believe is ready for
 application for 9.3.
 
  Postgres time itself breaks down with 10% for shutdown checkpoint and
  90% for regular running, consisting of 16% parsing, 13% analyze, 20%
  plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network.
 
 That SVG graph was quite impressive.
 
  It looks to me that most benefit could be had from introducing more
  parallelism. Are there any large roadblocks to pipelining the dump and
  restore to have them happen in parallel?
 
 I talked to Andrew Dunstan about parallelization in pg_restore.  First,
 we currently use pg_dumpall, which isn't in the custom format required
 for parallel restore, but if we changed to custom format, create table
 isn't done in parallel, only create index/check constraints, and trigger
 creation, etc.  Not sure if it's worth pursuing this just for pg_upgrade.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] My first patch! (to \df output)

2012-11-14 Thread Robert Haas
On Fri, Nov 9, 2012 at 3:22 PM, Jon Erdman
postgre...@thewickedtribe.net wrote:
 Oops! Here it is in the proper diff format. I didn't have my env set up 
 correctly :(

Better add it here so it doesn't get lost:

https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add contrib module functions to docs' function index

2012-11-14 Thread Robert Haas
On Tue, Nov 13, 2012 at 7:10 PM, Craig Ringer cr...@2ndquadrant.com wrote:
 I'm talking about making sure that contrib module functions (and settings)
 appear in the documentation index (
 http://www.postgresql.org/docs/current/static/bookindex.html) so it's easy
 to find a function by name whether it's in core or contrib. This is what I
 want to add to TODO.

+1.

 Separately, it might also be nice to add the contrib functions to the
 section 9 tables with an extra column showing their origin, but that's less
 clearly a good thing. Even if there's a column saying intarray for
 intarray functions in the array functions list, people will still try to use
 them without loading the extension and get confused when they're not found.
 It'll also bloat the listings of core functions. Rather than do that, I'd
 probably prefer to add a note to relevant sections. For example, in array
 functions I'd want to add Additional functions that operate only on arrays
 of integers are available in the a href=...intarray extension/a.

The second approach seems better, and maybe only in cases where it's
particularly relevant.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-14 Thread Jeff Davis
 Hmm... what if we took this a step further and actually stored the
 checksums in a separate relation fork?  That would make it pretty
 simple to support enabling/disabling checksums for particular
 relations.  It would also allow us to have a wider checksum, like 32
 or 64 bits rather than 16.  I'm not scoffing at a 16-bit checksum,
 because even that's enough to catch a very high percentage of errors,
 but it wouldn't be terrible to be able to support a wider one, either.

I don't remember exactly why this idea was sidelined before, but I don't
think there were any showstoppers. It does have some desirable
properties; most notably the ability to add checksums without a huge
effort, so perhaps the idea can be revived.

But there are some practical issues, as Tom points out. Another one is
that it's harder for external utilities (like pg_basebackup) to verify
checksums.

And I just had another thought: these pages of checksums would be data
pages, with an LSN. But as you clean ordinary data pages, you need to
constantly bump the LSN of the very same checksum page (because it
represents 1000 ordinary data pages); making it harder to actually clean
the checksum page and finish a checkpoint. Is this a practical concern
or am I borrowing trouble?
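
For scale, a back-of-the-envelope sketch (assuming the default 8 kB BLCKSZ
and densely packed checksums, which is my assumption rather than anything the
patch specifies) of how many data pages one checksum page would represent:

#include <stdio.h>

int
main(void)
{
	const int	blcksz = 8192;			/* assumed default BLCKSZ */
	const int	widths[] = {2, 4, 8};	/* 16-, 32-, 64-bit checksums */
	int			i;

	for (i = 0; i < 3; i++)
	{
		int			per_page = blcksz / widths[i];

		/* e.g. 64-bit checksums: 1024 data pages (8 MB) per checksum page */
		printf("%2d-bit: %4d data pages (%3d MB) per checksum page\n",
			   widths[i] * 8, per_page, per_page * blcksz / (1024 * 1024));
	}
	return 0;
}

So each checksum page stands in for on the order of a thousand data pages,
which is what makes the LSN question above interesting.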

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pg_upgrade speed for many tables

2012-11-14 Thread Bruce Momjian
On Mon, Nov 12, 2012 at 10:29:39AM -0800, Jeff Janes wrote:
 On Mon, Nov 5, 2012 at 12:08 PM, Bruce Momjian br...@momjian.us wrote:
  Magnus reported that a customer with a million tables was finding
  pg_upgrade slow.  I had never considered many tables to be a problem, but
  decided to test it.  I created a database with 2k tables like this:
 
  CREATE TABLE test1990 (x SERIAL);
 
  Running the git version of pg_upgrade on that took 203 seconds.  Using
  synchronous_commit=off dropped the time to 78 seconds.  This was tested
  on magnetic disks with a write-through cache.  (No change on an SSD with
  a super-capacitor.)
 
  I don't see anything unsafe about having pg_upgrade use
  synchronous_commit=off.  I could set it just for the pg_dump reload, but
  it seems safe to just use it always.  We don't write to the old cluster,
  and if pg_upgrade fails, you have to re-initdb the new cluster anyway.
 
  Patch attached.  I think it should be applied to 9.2 as well.
 
 Is turning off synchronous_commit enough?  What about turning off fsync?

I did some testing with the attached patch (which turns off fsync) on a
magnetic disk with no BBU; I got these results:

 tables   sync_com=off   fsync=off
      1          15.90       13.51
   1000          26.09       24.56
   2000          33.41       31.20
   4000          57.39       57.74
   8000         102.84      116.28
  16000         189.43      207.84

It shows fsync=off faster for < 4k tables, and slower for > 4k.  Not sure
why that is, but perhaps the buffering of the fsync is actually faster
than doing a no-op fsync.

I don't think fsync=off makes sense, except for testing;  let me know if
I should test something else.

 When I'm doing a pg_upgrade with thousands of tables, the shutdown
 checkpoint after restoring the dump to the new cluster takes a very
 long time, as the writer drains its operation table by opening and
 individually fsync-ing thousands of files.  This takes about 40 ms per
 file, which I assume is a combination of slow lap-top disk drive, and
 a strange deal with ext4 which makes fsyncing a recently created file
 very slow.   But even with faster hdd, this would still be a problem
 if it works the same way, with every file needing 4 rotations to be
 fsynced and this happens in serial.

Is this with the current code that does synchronous_commit=off?  If not,
can you test to see if this is still a problem?

 Worse, the shutdown only waits for the default of 60 seconds for the
 shutdown to take place before it throws an error and the entire
 pg_upgrade gives up.  It seems to me that either the -t setting should
 be increased, or should be an option to pg_upgrade.
 
 My work around was to invoke a system-wide sync a couple seconds
 after the 'pg_ctl stop' is initiated.  Flushing the files wholesale
 seems to work to make the checkpoint writer rapidly find it has
 nothing to do when it tries to flush them retail.
 
 Anyway, the reason I think turning fsync off might be reasonable is
 that as soon as the new cluster is shut down, pg_upgrade starts
 overwriting most of those just-fsynced file with other files from the
 old cluster, and AFAICT makes no effort to fsync them.  So until there
 is a system-wide sync after the pg_upgrade finishes, your new cluster
 is already in mortal danger anyway.

pg_upgrade does a cluster shutdown before overwriting those files.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/server.c b/contrib/pg_upgrade/server.c
new file mode 100644
index 49d4c8f..01e0dd3
*** a/contrib/pg_upgrade/server.c
--- b/contrib/pg_upgrade/server.c
*** start_postmaster(ClusterInfo *cluster)
*** 219,225 
  			 (cluster-controldata.cat_ver =
  			  BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ?  -b :
  			  -c autovacuum=off -c autovacuum_freeze_max_age=20,
! 			 (cluster == new_cluster) ?  -c synchronous_commit=off : ,
  			 cluster-pgopts ? cluster-pgopts : , socket_string);
  
  	/*
--- 219,225 
  			 (cluster-controldata.cat_ver =
  			  BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ?  -b :
  			  -c autovacuum=off -c autovacuum_freeze_max_age=20,
! 			 (cluster == new_cluster) ?  -c fsync=off : ,
  			 cluster-pgopts ? cluster-pgopts : , socket_string);
  
  	/*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] logical changeset generation v3

2012-11-14 Thread Andres Freund
Hi,

In response to this you will soon find the 14 patches that currently
implement $subject. I'll go over each one after showing off for a bit:

Start a postgres instance (with pg_hba.conf allowing replication connections):

$ postgres -D ~/tmp/pgdev-lcr \
  -c wal_level=logical \
  -c max_wal_senders=10 \
  -c max_logical_slots=10 \
  -c wal_keep_segments=100 \
  -c log_line_prefix=[%p %x] 

Start the changelog receiver:
$ pg_receivellog -h /tmp -f /dev/stderr -d postgres -v

Generate changes:
$ psql -h /tmp postgres EOF

DROP TABLE IF EXISTS replication_example;

CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text 
varchar(120));

-- plain insert
INSERT INTO replication_example(somedata, text) VALUES (1, 1);

-- plain update
UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT 
currval('replication_example_id_seq'));

-- plain delete
DELETE FROM replication_example WHERE id = (SELECT 
currval('replication_example_id_seq'));

-- wrapped in a transaction
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT 
currval('replication_example_id_seq'));
DELETE FROM replication_example WHERE id = (SELECT 
currval('replication_example_id_seq'));
COMMIT;

-- dont write out aborted data
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (2, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT 
currval('replication_example_id_seq'));
DELETE FROM replication_example WHERE id = (SELECT 
currval('replication_example_id_seq'));
ROLLBACK;

-- add a column
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);
ALTER TABLE replication_example ADD COLUMN bar int;
INSERT INTO replication_example(somedata, text, bar) VALUES (3, 1, 1);
COMMIT;

-- once more outside
INSERT INTO replication_example(somedata, text, bar) VALUES (4, 1, 1);


-- DDL with table rewrite
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (5, 1);
ALTER TABLE replication_example RENAME COLUMN text TO somenum;
INSERT INTO replication_example(somedata, somenum) VALUES (5, 2);
ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING 
(somenum::int4);
INSERT INTO replication_example(somedata, somenum) VALUES (5, 3);
COMMIT;

EOF

And the results printed by llog:

BEGIN 16556826
COMMIT 16556826
BEGIN 16556827
table replication_example_id_seq: INSERT: 
sequence_name[name]:replication_example_id_seq last_value[int8]:1 
start_value[int8]:1 increment_by[int8]:1 max_value[int8]:9223372036854775807 
min_value[int8]:1 cache_value[int8]:1 log_cnt[int8]:0 is_cycled[bool]:f 
is_called[bool]:f
COMMIT 16556827
BEGIN 16556828
table replication_example: INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1
COMMIT 16556828
BEGIN 16556829
table replication_example: UPDATE: id[int4]:1 somedata[int4]:-1 
text[varchar]:1
COMMIT 16556829
BEGIN 16556830
table replication_example: DELETE (pkey): id[int4]:1
COMMIT 16556830
BEGIN 16556833
table replication_example: INSERT: id[int4]:4 somedata[int4]:3 text[varchar]:1
table replication_example: INSERT: id[int4]:5 somedata[int4]:3 
text[varchar]:1 bar[int4]:1
COMMIT 16556833
BEGIN 16556834
table replication_example: INSERT: id[int4]:6 somedata[int4]:4 
text[varchar]:1 bar[int4]:1
COMMIT 16556834
BEGIN 16556835
table replication_example: INSERT: id[int4]:7 somedata[int4]:5 
text[varchar]:1 bar[int4]:(null)
table replication_example: INSERT: id[int4]:8 somedata[int4]:5 
somenum[varchar]:2 bar[int4]:(null)
table pg_temp_74943: INSERT: id[int4]:4 somedata[int4]:3 somenum[int4]:1 
bar[int4]:(null)
table pg_temp_74943: INSERT: id[int4]:5 somedata[int4]:3 somenum[int4]:1 
bar[int4]:1
table pg_temp_74943: INSERT: id[int4]:6 somedata[int4]:4 somenum[int4]:1 
bar[int4]:1
table pg_temp_74943: INSERT: id[int4]:7 somedata[int4]:5 somenum[int4]:1 
bar[int4]:(null)
table pg_temp_74943: INSERT: id[int4]:8 somedata[int4]:5 somenum[int4]:2 
bar[int4]:(null)
table replication_example: INSERT: id[int4]:9 somedata[int4]:5 
somenum[int4]:3 bar[int4]:(null)
COMMIT 16556835

As you can see above, we can decode WAL in the presence of nearly all
forms of DDL. The plugin that produced these changes is intended to be
added to contrib and is fairly small and uncomplicated.

An interesting piece of information: in the very preliminary
benchmarking I have done on this, even the textual decoding could keep
up with a full-tilt pgbench -c16 -j16 -M prepared on my (somewhat
larger) workstation. The WAL space overhead was less than 1% between
two freshly initdb'ed clusters, comparing wal_level=hot_standby with
=logical.
With a custom pgbench script I can saturate the decoding to the point
that it lags a second or so, but once I write out the data in a binary
format it can keep up again.
The biggest overhead is currently the more slowly increasing
Global/RecentXmin, but that can be greatly improved by logging

[HACKERS] [PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode

2012-11-14 Thread Andres Freund
---
 src/backend/utils/cache/relmapper.c | 53 +
 src/include/catalog/indexing.h  |  4 +--
 src/include/utils/relmapper.h   |  2 ++
 3 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 6f21495..771f34d 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -180,6 +180,59 @@ RelationMapOidToFilenode(Oid relationId, bool shared)
 	return InvalidOid;
 }
 
+/* RelationMapFilenodeToOid
+ *
+ * Do the reverse of the normal direction of mapping done in
+ * RelationMapOidToFilenode.
+ *
+ * This is not supposed to be used during normal running but rather for
+ * information purposes when looking at the filesystem or the xlog.
+ *
+ * Returns InvalidOid if the OID is not known, which can easily happen if the
+ * filenode is not of a relation that is nailed or shared or if it simply
+ * doesn't exist anywhere.
+ */
+Oid
+RelationMapFilenodeToOid(Oid filenode, bool shared)
+{
+	const RelMapFile *map;
+	int32		i;
+
+	/* If there are active updates, believe those over the main maps */
+	if (shared)
+	{
+		map = active_shared_updates;
+		for (i = 0; i  map-num_mappings; i++)
+		{
+			if (filenode == map-mappings[i].mapfilenode)
+return map-mappings[i].mapoid;
+		}
+		map = shared_map;
+		for (i = 0; i  map-num_mappings; i++)
+		{
+			if (filenode == map-mappings[i].mapfilenode)
+return map-mappings[i].mapoid;
+		}
+	}
+	else
+	{
+		map = active_local_updates;
+		for (i = 0; i  map-num_mappings; i++)
+		{
+			if (filenode == map-mappings[i].mapfilenode)
+return map-mappings[i].mapoid;
+		}
+		map = local_map;
+		for (i = 0; i  map-num_mappings; i++)
+		{
+			if (filenode == map-mappings[i].mapfilenode)
+return map-mappings[i].mapoid;
+		}
+	}
+
+	return InvalidOid;
+}
+
 /*
  * RelationMapUpdateMap
  *
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index c3db3ff..81811f1 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -106,8 +106,8 @@ DECLARE_UNIQUE_INDEX(pg_class_oid_index, 2662, on pg_class using btree(oid oid_o
 #define ClassOidIndexId  2662
 DECLARE_UNIQUE_INDEX(pg_class_relname_nsp_index, 2663, on pg_class using btree(relname name_ops, relnamespace oid_ops));
 #define ClassNameNspIndexId  2663
-DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3171, on pg_class using btree(reltablespace oid_ops, relfilenode oid_ops));
-#define ClassTblspcRelfilenodeIndexId  3171
+DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3455, on pg_class using btree(reltablespace oid_ops, relfilenode oid_ops));
+#define ClassTblspcRelfilenodeIndexId  3455
 
 DECLARE_UNIQUE_INDEX(pg_collation_name_enc_nsp_index, 3164, on pg_collation using btree(collname name_ops, collencoding int4_ops, collnamespace oid_ops));
 #define CollationNameEncNspIndexId 3164
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 111a05c..4e56508 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -36,6 +36,8 @@ typedef struct xl_relmap_update
 
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
+extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 	 bool immediate);
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH 01/14] Add minimal binary heap implementation

2012-11-14 Thread Andres Freund

Will be replaced by the binaryheap.[ch] from Abhijit once it's been reviewed.
---
 src/backend/lib/Makefile |   3 +-
 src/backend/lib/simpleheap.c | 255 +++
 src/include/lib/simpleheap.h |  91 +++
 3 files changed, 348 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/simpleheap.c
 create mode 100644 src/include/lib/simpleheap.h

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 98ce3d7..c2bc5ba 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,6 +12,7 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = ilist.o stringinfo.o
+
+OBJS = ilist.o simpleheap.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/simpleheap.c b/src/backend/lib/simpleheap.c
new file mode 100644
index 000..825d0a8
--- /dev/null
+++ b/src/backend/lib/simpleheap.c
@@ -0,0 +1,255 @@
+/*-
+ *
+ * simpleheap.c
+ *	  A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/simpleheap.c
+ *
+ *-
+ */
+
+#include postgres.h
+
+#include math.h
+
+#include lib/simpleheap.h
+
+static inline int
+simpleheap_left_off(size_t i)
+{
+	return 2 * i + 1;
+}
+
+static inline int
+simpleheap_right_off(size_t i)
+{
+	return 2 * i + 2;
+}
+
+static inline int
+simpleheap_parent_off(size_t i)
+{
+	return floor((i - 1) / 2);
+}
+
+/* sift up */
+static void
+simpleheap_sift_up(simpleheap *heap, size_t node_off);
+
+/* sift down */
+static void
+simpleheap_sift_down(simpleheap *heap, size_t node_off);
+
+static inline void
+simpleheap_swap(simpleheap *heap, size_t a, size_t b)
+{
+	simpleheap_kv swap;
+	swap.value = heap-values[a].value;
+	swap.key = heap-values[a].key;
+
+	heap-values[a].value = heap-values[b].value;
+	heap-values[a].key = heap-values[b].key;
+
+	heap-values[b].key = swap.key;
+	heap-values[b].value = swap.value;
+}
+
+/* sift down */
+static void
+simpleheap_sift_down(simpleheap *heap, size_t node_off)
+{
+	/* manually unrolled tail recursion */
+	while (true)
+	{
+		size_t left_off = simpleheap_left_off(node_off);
+		size_t right_off = simpleheap_right_off(node_off);
+		size_t swap_off = 0;
+
+		/* only one child can violate the heap property after a change */
+
+		/* check left child */
+		if (left_off  heap-size 
+		heap-compare(heap-values[left_off],
+		  heap-values[node_off])  0)
+		{
+			/* heap condition violated */
+			swap_off = left_off;
+		}
+
+		/* check right child */
+		if (right_off  heap-size 
+		heap-compare(heap-values[right_off],
+		  heap-values[node_off])  0)
+		{
+			/* heap condition violated */
+
+			/* swap with the smaller child */
+			if (!swap_off ||
+			heap-compare(heap-values[right_off],
+			  heap-values[left_off])  0)
+			{
+swap_off = right_off;
+			}
+		}
+
+		if (!swap_off)
+		{
+			/* heap condition fulfilled, abort */
+			break;
+		}
+
+		/* swap node with the child violating the property */
+		simpleheap_swap(heap, swap_off, node_off);
+
+		/* recurse, check child subtree */
+		node_off = swap_off;
+	}
+}
+
+/* sift up */
+static void
+simpleheap_sift_up(simpleheap *heap, size_t node_off)
+{
+	/* manually unrolled tail recursion */
+	while (true)
+	{
+		size_t parent_off = simpleheap_parent_off(node_off);
+
+		if (heap-compare(heap-values[parent_off],
+		  heap-values[node_off])  0)
+		{
+			/* heap property violated */
+			simpleheap_swap(heap, node_off, parent_off);
+
+			/* recurse */
+			node_off = parent_off;
+		}
+		else
+			break;
+	}
+}
+
+simpleheap*
+simpleheap_allocate(size_t allocate)
+{
+	simpleheap* heap = palloc(sizeof(simpleheap));
+	heap-values = palloc(sizeof(simpleheap_kv) * allocate);
+	heap-size = 0;
+	heap-space = allocate;
+	return heap;
+}
+
+void
+simpleheap_free(simpleheap* heap)
+{
+	pfree(heap-values);
+	pfree(heap);
+}
+
+/* initial building of a heap */
+void
+simpleheap_build(simpleheap *heap)
+{
+	int i;
+
+	for (i = simpleheap_parent_off(heap-size - 1); i = 0; i--)
+	{
+		simpleheap_sift_down(heap, i);
+	}
+}
+
+/*
+ * Change the key of the top (root) element and restore the heap property.
+ */
+void
+simpleheap_change_key(simpleheap *heap, void* key)
+{
+	size_t next_off = 0;
+	int ret;
+	simpleheap_kv* kv;
+
+	heap-values[0].key = key;
+
+	/* no need to do anything if there is only one element */
+	if (heap-size == 1)
+	{
+		return;
+	}
+	else if (heap-size == 2)
+	{
+		next_off = 1;
+	}
+	else
+	{
+		ret = heap-compare(
+			heap-values[simpleheap_left_off(0)],
+			heap-values[simpleheap_right_off(0)]);
+
+		if (ret == -1)
+			next_off = simpleheap_left_off(0);
+		else
+			next_off = simpleheap_right_off(0);
+	}
+
+	/*
+	 * compare with the next key. If we're still smaller we can skip
+	 * 

[HACKERS] [PATCH 06/14] Add a new function pg_relation_by_filenode to lookup up a relation given the tablespace and the filenode OIDs

2012-11-14 Thread Andres Freund

This requires the previously added RELFILENODE syscache and the added
RelationMapFilenodeToOid function added in previous commits.
---
 doc/src/sgml/func.sgml | 23 +++-
 src/backend/utils/adt/dbsize.c | 79 ++
 src/include/catalog/pg_proc.h  |  2 ++
 src/include/utils/builtins.h   |  1 +
 4 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f8f63d8..708da35 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15170,7 +15170,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
 
para
 The functions shown in xref linkend=functions-admin-dblocation assist
-in identifying the specific disk files associated with database objects.
+in identifying the specific disk files associated with database objects or doing the reverse.
/para
 
indexterm
@@ -15179,6 +15179,9 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
indexterm
 primarypg_relation_filepath/primary
/indexterm
+   indexterm
+primarypg_relation_by_filenode/primary
+   /indexterm
 
table id=functions-admin-dblocation
 titleDatabase Object Location Functions/title
@@ -15207,6 +15210,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
 File path name of the specified relation
/entry
   /row
+  row
+   entry
+literalfunctionpg_relation_by_filenode(parametertablespace/parameter typeoid/type, parameterfilenode/parameter typeoid/type)/function/literal
+/entry
+   entrytyperegclass/type/entry
+   entry
+Find the associated relation of a filenode
+   /entry
+  /row
  /tbody
 /tgroup
/table
@@ -15230,6 +15242,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
 the relation.
/para
 
+   para
+functionpg_relation_by_filenode/ is the reverse of
+functionpg_relation_filenode/. Given a quotetablespace/ OID and
+a quotefilenode/ it returns the associated relation. The default
+tablespace for user tables can be replaced with 0. Check the
+documentation of functionpg_relation_filenode/ for an explanation of why
+this cannot always easily be answered by querying structnamepg_class/.
+   /para
+
   /sect2
 
   sect2 id=functions-admin-genfile
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index cd23334..ec26291 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -741,6 +741,85 @@ pg_relation_filenode(PG_FUNCTION_ARGS)
 }
 
 /*
+ * Get the relation via (reltablespace, relfilenode)
+ *
+ * This is expected to be used when somebody wants to match an individual file
+ * on the filesystem back to its table. Thats not trivially possible via
+ * pg_class because that doesn't contain the relfilenodes of shared and nailed
+ * tables.
+ *
+ * We don't fail but return NULL if we cannot find a mapping.
+ *
+ * Instead of knowing DEFAULTTABLESPACE_OID you can pass 0.
+ */
+Datum
+pg_relation_by_filenode(PG_FUNCTION_ARGS)
+{
+	Oid			reltablespace = PG_GETARG_OID(0);
+	Oid			relfilenode = PG_GETARG_OID(1);
+	Oid			lookup_tablespace = reltablespace;
+	Oid result = InvalidOid;
+	HeapTuple	tuple;
+
+	if (reltablespace == 0)
+		reltablespace = DEFAULTTABLESPACE_OID;
+
+	/* pg_class stores 0 instead of DEFAULTTABLESPACE_OID */
+	if (reltablespace == DEFAULTTABLESPACE_OID)
+		lookup_tablespace = 0;
+
+	tuple = SearchSysCache2(RELFILENODE,
+			lookup_tablespace,
+			relfilenode);
+
+	/* found it in the system catalog, so it cannot be a shared/nailed table */
+	if (HeapTupleIsValid(tuple))
+	{
+		result = HeapTupleHeaderGetOid(tuple-t_data);
+		ReleaseSysCache(tuple);
+	}
+	else
+	{
+		if (reltablespace == GLOBALTABLESPACE_OID)
+		{
+			result = RelationMapFilenodeToOid(relfilenode, true);
+		}
+		else
+		{
+			Form_pg_class relform;
+
+			result = RelationMapFilenodeToOid(relfilenode, false);
+
+			if (result != InvalidOid)
+			{
+/* check that we found the correct relation */
+tuple = SearchSysCache1(RELOID,
+	result);
+
+if (!HeapTupleIsValid(tuple))
+{
+	elog(ERROR, Couldn't refind previously looked up relation with oid %u,
+		 result);
+}
+
+relform = (Form_pg_class) GETSTRUCT(tuple);
+
+if (relform-reltablespace != reltablespace 
+	relform-reltablespace != lookup_tablespace)
+	result = InvalidOid;
+
+ReleaseSysCache(tuple);
+			}
+		}
+	}
+
+	if (!OidIsValid(result))
+		PG_RETURN_NULL();
+	else
+		PG_RETURN_OID(result);
+}
+
+/*
  * Get the pathname (relative to $PGDATA) of a relation
  *
  * See comments for pg_relation_filenode.
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 16033c7..d28db63 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3436,6 +3436,8 @@ DATA(insert OID = 2998 ( pg_indexes_size		PGNSP PGUID 12 1 0 0 0 f f f f 

[HACKERS] [PATCH 04/14] Add a new RELFILENODE syscache to fetch a pg_class entry via (reltablespace, relfilenode)

2012-11-14 Thread Andres Freund

This cache is somewhat problematic because, formally, indexes used by syscaches
need to be unique, and this one is not. That is only because of the 0/InvalidOid
values stored in pg_class.relfilenode for nailed/shared catalog relations.  The
syscache will never be queried for InvalidOid relfilenodes, however, so it seems
safe even though it bends the rules somewhat.

It might be nicer to add infrastructure to do this properly, for example by
using a partial index, but it is not clear what the best way to do that is.

Needs a CATVERSION bump.
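
For illustration, the lookup pattern the new cache is meant for looks roughly
like the sketch below (the helper itself is hypothetical, not part of the
patch, and the include list is approximate); note the guard against ever
probing the cache with an invalid relfilenode:

#include "postgres.h"
#include "access/htup_details.h"
#include "utils/syscache.h"

/* Hypothetical caller sketch: map (reltablespace, relfilenode) to pg_class. */
static Oid
lookup_relation_by_filenode(Oid reltablespace, Oid relfilenode)
{
	HeapTuple	tuple;
	Oid			result = InvalidOid;

	/* never query the cache for InvalidOid relfilenodes, see note above */
	Assert(OidIsValid(relfilenode));

	tuple = SearchSysCache2(RELFILENODE,
							ObjectIdGetDatum(reltablespace),
							ObjectIdGetDatum(relfilenode));
	if (HeapTupleIsValid(tuple))
	{
		result = HeapTupleHeaderGetOid(tuple->t_data);
		ReleaseSysCache(tuple);
	}

	return result;		/* InvalidOid if there is no mapping in pg_class */
}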
---
 src/backend/utils/cache/syscache.c | 11 +++
 src/include/catalog/indexing.h |  2 ++
 src/include/catalog/pg_proc.h  |  1 +
 src/include/utils/syscache.h   |  1 +
 4 files changed, 15 insertions(+)

diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ca22efd..9d2f6b7 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -613,6 +613,17 @@ static const struct cachedesc cacheinfo[] = {
 		},
 		1024
 	},
+	{RelationRelationId,		/* RELFILENODE */
+		ClassTblspcRelfilenodeIndexId,
+		2,
+		{
+			Anum_pg_class_reltablespace,
+			Anum_pg_class_relfilenode,
+			0,
+			0
+		},
+		1024
+	},
 	{RewriteRelationId,			/* RULERELNAME */
 		RewriteRelRulenameIndexId,
 		2,
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index 238fe58..c3db3ff 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -106,6 +106,8 @@ DECLARE_UNIQUE_INDEX(pg_class_oid_index, 2662, on pg_class using btree(oid oid_o
 #define ClassOidIndexId  2662
 DECLARE_UNIQUE_INDEX(pg_class_relname_nsp_index, 2663, on pg_class using btree(relname name_ops, relnamespace oid_ops));
 #define ClassNameNspIndexId  2663
+DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3171, on pg_class using btree(reltablespace oid_ops, relfilenode oid_ops));
+#define ClassTblspcRelfilenodeIndexId  3171
 
 DECLARE_UNIQUE_INDEX(pg_collation_name_enc_nsp_index, 3164, on pg_collation using btree(collname name_ops, collencoding int4_ops, collnamespace oid_ops));
 #define CollationNameEncNspIndexId 3164
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f935eb1..16033c7 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4673,6 +4673,7 @@ DATA(insert OID = 3473 (  spg_range_quad_leaf_consistent	PGNSP PGUID 12 1 0 0 0
 DESCR(SP-GiST support for quad tree over range);
 
 
+
 /*
  * Symbolic values for provolatile column: these indicate whether the result
  * of a function is dependent *only* on the values of its explicit arguments,
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index d1a9855..9a39077 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -77,6 +77,7 @@ enum SysCacheIdentifier
 	RANGETYPE,
 	RELNAMENSP,
 	RELOID,
+	RELFILENODE,
 	RULERELNAME,
 	STATRELATTINH,
 	TABLESPACEOID,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids

2012-11-14 Thread Andres Freund

To avoid complicating the logic we store both the toplevel xids and the subxids
in ->xip: first the ->xcnt toplevel ones, then the ->subxcnt subxids.
Also skip logging any subxids if the snapshot is suboverflowed; they aren't
useful in that case anyway.

This makes some operations cheaper and allows faster startup for the future
logical decoding feature, because that doesn't care about
subtransactions/suboverflowedness.
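
For illustration, a consumer of such a record would walk the array roughly as
below (a sketch only, assuming the subxcnt field this patch adds; the two
process_* callbacks are hypothetical):

#include "postgres.h"
#include "storage/standby.h"

extern void process_toplevel_xid(TransactionId xid);	/* hypothetical */
extern void process_subxid(TransactionId xid);			/* hypothetical */

static void
walk_running_xacts(xl_running_xacts *running)
{
	int			i;

	/* the first xcnt entries are toplevel xids ... */
	for (i = 0; i < running->xcnt; i++)
		process_toplevel_xid(running->xids[i]);

	/* ... the subxids (zero of them if suboverflowed) follow directly */
	for (i = running->xcnt; i < running->xcnt + running->subxcnt; i++)
		process_subxid(running->xids[i]);
}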
---
 src/backend/access/transam/xlog.c   |  2 ++
 src/backend/storage/ipc/procarray.c | 65 -
 src/backend/storage/ipc/standby.c   |  8 +++--
 src/include/storage/standby.h   |  2 ++
 4 files changed, 52 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1faf666..1749f46 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5629,6 +5629,7 @@ StartupXLOG(void)
  * subxids are listed with their parent prepared transactions.
  */
 running.xcnt = nxids;
+running.subxcnt = 0;
 running.subxid_overflow = false;
 running.nextXid = checkPoint.nextXid;
 running.oldestRunningXid = oldestActiveXID;
@@ -7813,6 +7814,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
 			 * with their parent prepared transactions.
 			 */
 			running.xcnt = nxids;
+			running.subxcnt = 0;
 			running.subxid_overflow = false;
 			running.nextXid = checkPoint.nextXid;
 			running.oldestRunningXid = oldestActiveXID;
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 8c0d7b0..a98358d 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -501,6 +501,13 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	 * Remove stale transactions, if any.
 	 */
 	ExpireOldKnownAssignedTransactionIds(running-oldestRunningXid);
+
+	/*
+	 * Remove stale locks, if any.
+	 *
+	 * Locks are always assigned to the toplevel xid so we don't need to care
+	 * about subxcnt/subxids (and by extension not about ->suboverflowed).
+	 */
 	StandbyReleaseOldLocks(running-xcnt, running-xids);
 
 	/*
@@ -581,13 +588,13 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	 * Allocate a temporary array to avoid modifying the array passed as
 	 * argument.
 	 */
-	xids = palloc(sizeof(TransactionId) * running->xcnt);
+	xids = palloc(sizeof(TransactionId) * (running->xcnt + running->subxcnt));
 
 	/*
 	 * Add to the temp array any xids which have not already completed.
 	 */
 	nxids = 0;
-	for (i = 0; i < running->xcnt; i++)
+	for (i = 0; i < running->xcnt + running->subxcnt; i++)
 	{
 		TransactionId xid = running->xids[i];
 
@@ -1627,15 +1634,13 @@ GetRunningTransactionData(void)
 	oldestRunningXid = ShmemVariableCache-nextXid;
 
 	/*
-	 * Spin over procArray collecting all xids and subxids.
+	 * Spin over procArray collecting all xids
 	 */
 	for (index = 0; index  arrayP-numProcs; index++)
 	{
 		int			pgprocno = arrayP-pgprocnos[index];
-		volatile PGPROC *proc = allProcs[pgprocno];
 		volatile PGXACT *pgxact = allPgXact[pgprocno];
 		TransactionId xid;
-		int			nxids;
 
 		/* Fetch xid just once - see GetNewTransactionId */
 		xid = pgxact-xid;
@@ -1652,30 +1657,46 @@ GetRunningTransactionData(void)
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
-		/*
-		 * Save subtransaction XIDs. Other backends can't add or remove
-		 * entries while we're holding XidGenLock.
-		 */
-		nxids = pgxact-nxids;
-		if (nxids  0)
-		{
-			memcpy(xids[count], (void *) proc-subxids.xids,
-   nxids * sizeof(TransactionId));
-			count += nxids;
-			subcount += nxids;
+		if (pgxact-overflowed)
+			suboverflowed = true;
+	}
 
-			if (pgxact-overflowed)
-suboverflowed = true;
+	/*
+	 * Spin over procArray collecting all subxids, but only if there hasn't
+	 * been a suboverflow.
+	 */
+	if (!suboverflowed)
+	{
+		for (index = 0; index < arrayP->numProcs; index++)
+		{
+			int			pgprocno = arrayP->pgprocnos[index];
+			volatile PGPROC *proc = allProcs[pgprocno];
+			volatile PGXACT *pgxact = allPgXact[pgprocno];
+			int			nxids;
 
 			/*
-			 * Top-level XID of a transaction is always less than any of its
-			 * subxids, so we don't need to check if any of the subxids are
-			 * smaller than oldestRunningXid
+			 * Save subtransaction XIDs. Other backends can't add or remove
+			 * entries while we're holding XidGenLock.
 			 */
+			nxids = pgxact->nxids;
+			if (nxids > 0)
+			{
+				memcpy(&xids[count], (void *) proc->subxids.xids,
+					   nxids * sizeof(TransactionId));
+				count += nxids;
+				subcount += nxids;
+
+/*
+ * Top-level XID of a transaction is always less than any of
+ * its subxids, so we don't need to check if any of the subxids
+ * are smaller than oldestRunningXid
+ */
+			}
 		}
 	}
 
-	CurrentRunningXacts->xcnt = count;
+	CurrentRunningXacts->xcnt = count - subcount;
+	CurrentRunningXacts->subxcnt = subcount;
 	CurrentRunningXacts->subxid_overflow = 

[HACKERS] [PATCH 10/14] Allow walsender's to connect to a specific database

2012-11-14 Thread Andres Freund

Currently the decision whether to connect to a database or not is made by
checking whether the passed dbname parameter is "replication". Unfortunately
this makes it impossible to connect to a database named "replication"...

This is useful for future walsender commands which need database interaction.
---
 src/backend/postmaster/postmaster.c|  7 --
 .../libpqwalreceiver/libpqwalreceiver.c|  4 ++--
 src/backend/replication/walsender.c| 27 ++
 src/backend/utils/init/postinit.c  |  5 
 src/bin/pg_basebackup/pg_basebackup.c  |  4 ++--
 src/bin/pg_basebackup/pg_receivexlog.c |  4 ++--
 src/bin/pg_basebackup/receivelog.c |  4 ++--
 7 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b223fee..05048bc 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1806,10 +1806,13 @@ retry1:
 	if (strlen(port-user_name) = NAMEDATALEN)
 		port-user_name[NAMEDATALEN - 1] = '\0';
 
-	/* Walsender is not related to a particular database */
-	if (am_walsender)
+	/* Generic Walsender is not related to a particular database */
+	if (am_walsender && strcmp(port->database_name, "replication") == 0)
 		port-database_name[0] = '\0';
 
+	if (am_walsender)
+		elog(WARNING, "connecting to %s", port->database_name);
+
 	/*
 	 * Done putting stuff in TopMemoryContext.
 	 */
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index bfaebea..c39062b 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -114,7 +114,7 @@ libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
 		the primary server: %s,
 		PQerrorMessage(streamConn;
 	}
-	if (PQnfields(res) != 3 || PQntuples(res) != 1)
+	if (PQnfields(res) != 4 || PQntuples(res) != 1)
 	{
 		int			ntuples = PQntuples(res);
 		int			nfields = PQnfields(res);
@@ -122,7 +122,7 @@ libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
 		PQclear(res);
 		ereport(ERROR,
 (errmsg(invalid response from primary server),
- errdetail("Expected 1 tuple with 3 fields, got %d tuples with %d fields.",
+ errdetail("Expected 1 tuple with 4 fields, got %d tuples with %d fields.",
 		   ntuples, nfields)));
 	}
 	primary_sysid = PQgetvalue(res, 0, 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 8774d7e..6452c34 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -40,6 +40,7 @@
 #include access/transam.h
 #include access/xlog_internal.h
 #include catalog/pg_type.h
+#include commands/dbcommands.h
 #include funcapi.h
 #include libpq/libpq.h
 #include libpq/pqformat.h
@@ -202,10 +203,12 @@ IdentifySystem(void)
 	char		tli[11];
 	char		xpos[MAXFNAMELEN];
 	XLogRecPtr	logptr;
+	char*dbname = NULL;
 
 	/*
-	 * Reply with a result set with one row, three columns. First col is
-	 * system ID, second is timeline ID, and third is current xlog location.
+	 * Reply with a result set with one row, four columns. First col is system
+	 * ID, second is timeline ID, third is current xlog location and the fourth
+	 * contains the database name if we are connected to one.
 	 */
 
 	snprintf(sysid, sizeof(sysid), UINT64_FORMAT,
@@ -216,9 +219,14 @@ IdentifySystem(void)
 
 	snprintf(xpos, sizeof(xpos), %X/%X, (uint32) (logptr  32), (uint32) logptr);
 
+	if (MyDatabaseId != InvalidOid)
+		dbname = get_database_name(MyDatabaseId);
+	else
+		dbname = "(none)";
+
 	/* Send a RowDescription message */
 	pq_beginmessage(buf, 'T');
-	pq_sendint(buf, 3, 2);		/* 3 fields */
+	pq_sendint(buf, 4, 2);		/* 4 fields */
 
 	/* first field */
 	pq_sendstring(buf, systemid);	/* col name */
@@ -246,17 +254,28 @@ IdentifySystem(void)
 	pq_sendint(buf, -1, 2);
 	pq_sendint(buf, 0, 4);
 	pq_sendint(buf, 0, 2);
+
+	/* fourth field */
+	pq_sendstring(buf, dbname);
+	pq_sendint(buf, 0, 4);
+	pq_sendint(buf, 0, 2);
+	pq_sendint(buf, TEXTOID, 4);
+	pq_sendint(buf, -1, 2);
+	pq_sendint(buf, 0, 4);
+	pq_sendint(buf, 0, 2);
 	pq_endmessage(buf);
 
 	/* Send a DataRow message */
 	pq_beginmessage(buf, 'D');
-	pq_sendint(buf, 3, 2);		/* # of columns */
+	pq_sendint(buf, 4, 2);		/* # of columns */
 	pq_sendint(buf, strlen(sysid), 4); /* col1 len */
 	pq_sendbytes(buf, (char *) sysid, strlen(sysid));
 	pq_sendint(buf, strlen(tli), 4);	/* col2 len */
 	pq_sendbytes(buf, (char *) tli, strlen(tli));
 	pq_sendint(buf, strlen(xpos), 4);	/* col3 len */
 	pq_sendbytes(buf, (char *) xpos, strlen(xpos));
+	pq_sendint(buf, strlen(dbname), 4);	/* col4 len */
+	pq_sendbytes(buf, (char *) dbname, strlen(dbname));
 
 	pq_endmessage(buf);
 }
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 

[HACKERS] [PATCH 09/14] Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader

2012-11-14 Thread Andres Freund

For the regular satisfies routines this is needed in preparation for logical
decoding. I changed the non-regular ones for consistency as well.

The naming of htup, tuple and similar variables is rather confused; I could not
find any consistent naming anywhere.

This is preparatory work for the logical decoding feature, which needs to be
able to get at a valid relfilenode when checking the visibility of a
tuple.
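
To illustrate the new calling convention: a caller now sets up the full
HeapTupleData, including t_tableOid, before calling a visibility routine,
roughly as in the sketch below (the helper is hypothetical and the include
list approximate; it mirrors the hunks further down):

#include "postgres.h"
#include "storage/bufpage.h"
#include "utils/rel.h"
#include "utils/tqual.h"

static bool
tuple_visible_sketch(Relation relation, Page page, ItemId lpp,
					 BlockNumber blkno, OffsetNumber offnum,
					 Snapshot snapshot, Buffer buffer)
{
	HeapTupleData loctup;

	/* the whole tuple, including t_tableOid, is filled in by the caller */
	loctup.t_tableOid = RelationGetRelid(relation);
	loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
	loctup.t_len = ItemIdGetLength(lpp);
	ItemPointerSet(&loctup.t_self, blkno, offnum);

	/* the *Satisfies routines now take the HeapTuple, not the header */
	return HeapTupleSatisfiesMVCC(&loctup, snapshot, buffer);
}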
---
 contrib/pgrowlocks/pgrowlocks.c  |  2 +-
 src/backend/access/heap/heapam.c | 13 ++
 src/backend/access/heap/pruneheap.c  | 16 ++--
 src/backend/catalog/index.c  |  2 +-
 src/backend/commands/analyze.c   |  3 ++-
 src/backend/commands/cluster.c   |  2 +-
 src/backend/commands/vacuumlazy.c|  3 ++-
 src/backend/storage/lmgr/predicate.c |  2 +-
 src/backend/utils/time/tqual.c   | 50 +---
 src/include/utils/snapshot.h |  4 +--
 src/include/utils/tqual.h| 20 +++
 11 files changed, 83 insertions(+), 34 deletions(-)

diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c
index 20beed2..8f9db55 100644
--- a/contrib/pgrowlocks/pgrowlocks.c
+++ b/contrib/pgrowlocks/pgrowlocks.c
@@ -120,7 +120,7 @@ pgrowlocks(PG_FUNCTION_ARGS)
 		/* must hold a buffer lock to call HeapTupleSatisfiesUpdate */
 		LockBuffer(scan-rs_cbuf, BUFFER_LOCK_SHARE);
 
-		if (HeapTupleSatisfiesUpdate(tuple-t_data,
+		if (HeapTupleSatisfiesUpdate(tuple,
 	 GetCurrentCommandId(false),
 	 scan-rs_cbuf) == HeapTupleBeingUpdated)
 		{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 64aecf2..d025ff7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -276,6 +276,7 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
 			HeapTupleData loctup;
 			bool		valid;
 
+			loctup.t_tableOid = RelationGetRelid(scan-rs_rd);
 			loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
 			loctup.t_len = ItemIdGetLength(lpp);
 			ItemPointerSet((loctup.t_self), page, lineoff);
@@ -1590,7 +1591,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 
 		heapTuple-t_data = (HeapTupleHeader) PageGetItem(dp, lp);
 		heapTuple-t_len = ItemIdGetLength(lp);
-		heapTuple-t_tableOid = relation-rd_id;
+		heapTuple-t_tableOid = RelationGetRelid(relation);
 		heapTuple-t_self = *tid;
 
 		/*
@@ -1638,7 +1639,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 		 * transactions.
 		 */
 		if (all_dead && *all_dead &&
-			!HeapTupleIsSurelyDead(heapTuple-t_data, RecentGlobalXmin))
+			!HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin))
 			*all_dead = false;
 
 		/*
@@ -2418,12 +2419,13 @@ heap_delete(Relation relation, ItemPointer tid,
 	lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
 	Assert(ItemIdIsNormal(lp));
 
+	tp.t_tableOid = RelationGetRelid(relation);
 	tp.t_data = (HeapTupleHeader) PageGetItem(page, lp);
 	tp.t_len = ItemIdGetLength(lp);
 	tp.t_self = *tid;
 
 l1:
-	result = HeapTupleSatisfiesUpdate(tp.t_data, cid, buffer);
+	result = HeapTupleSatisfiesUpdate(&tp, cid, buffer);
 
 	if (result == HeapTupleInvisible)
 	{
@@ -2788,6 +2790,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 	lp = PageGetItemId(page, ItemPointerGetOffsetNumber(otid));
 	Assert(ItemIdIsNormal(lp));
 
+	oldtup.t_tableOid = RelationGetRelid(relation);
 	oldtup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
 	oldtup.t_len = ItemIdGetLength(lp);
 	oldtup.t_self = *otid;
@@ -2800,7 +2803,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 	 */
 
 l2:
-	result = HeapTupleSatisfiesUpdate(oldtup.t_data, cid, buffer);
+	result = HeapTupleSatisfiesUpdate(&oldtup, cid, buffer);
 
 	if (result == HeapTupleInvisible)
 	{
@@ -3502,7 +3505,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
 	tuple-t_tableOid = RelationGetRelid(relation);
 
 l3:
-	result = HeapTupleSatisfiesUpdate(tuple-t_data, cid, *buffer);
+	result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
 
 	if (result == HeapTupleInvisible)
 	{
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 97a2868..edb3a09 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -340,6 +340,9 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 	OffsetNumber chainitems[MaxHeapTuplesPerPage];
 	int			nchain = 0,
 i;
+	HeapTupleData tup;
+
+	tup.t_tableOid = RelationGetRelid(relation);
 
 	rootlp = PageGetItemId(dp, rootoffnum);
 
@@ -349,6 +352,11 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 	if (ItemIdIsNormal(rootlp))
 	{
 		htup = (HeapTupleHeader) PageGetItem(dp, rootlp);
+
+		tup.t_data = htup;
+		tup.t_len = ItemIdGetLength(rootlp);
+		ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), rootoffnum);
+
 		if (HeapTupleHeaderIsHeapOnly(htup))
 		{
 			/*

[HACKERS] [PATCH 12/14] Add a simple decoding module in contrib named 'test_decoding'

2012-11-14 Thread Andres Freund
---
 contrib/Makefile  |   1 +
 contrib/test_decoding/Makefile|  16 +++
 contrib/test_decoding/test_decoding.c | 192 ++
 3 files changed, 209 insertions(+)
 create mode 100644 contrib/test_decoding/Makefile
 create mode 100644 contrib/test_decoding/test_decoding.c

diff --git a/contrib/Makefile b/contrib/Makefile
index d230451..8709992 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -48,6 +48,7 @@ SUBDIRS = \
 		tablefunc	\
 		tcn		\
 		test_parser	\
+		test_decoding	\
 		tsearch2	\
 		unaccent	\
 		vacuumlo
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
new file mode 100644
index 000..2ac9653
--- /dev/null
+++ b/contrib/test_decoding/Makefile
@@ -0,0 +1,16 @@
+# contrib/test_decoding/Makefile
+
+MODULE_big = test_decoding
+OBJS = test_decoding.o
+
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_decoding
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
new file mode 100644
index 000..f3d90e3
--- /dev/null
+++ b/contrib/test_decoding/test_decoding.c
@@ -0,0 +1,192 @@
+/*-
+ *
+ * test_decoding.c
+ *		  example output plugin for the logical replication functionality
+ *
+ * Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		  contrib/test_decoding/test_decoding.c
+ *
+ *-
+ */
+#include postgres.h
+
+#include catalog/pg_class.h
+#include catalog/pg_type.h
+#include catalog/index.h
+
+#include replication/output_plugin.h
+#include replication/snapbuild.h
+
+#include utils/lsyscache.h
+#include utils/memutils.h
+#include utils/rel.h
+#include utils/relcache.h
+#include utils/syscache.h
+#include utils/typcache.h
+
+
+PG_MODULE_MAGIC;
+
+void _PG_init(void);
+
+void WalSndWriteData(XLogRecPtr lsn, const char *data, Size len);
+
+extern void pg_decode_init(void **private);
+
+extern bool pg_decode_begin_txn(void *private, StringInfo out, ReorderBufferTXN* txn);
+extern bool pg_decode_commit_txn(void *private, StringInfo out, ReorderBufferTXN* txn, XLogRecPtr commit_lsn);
+extern bool pg_decode_change(void *private, StringInfo out, ReorderBufferTXN* txn, Oid tableoid, ReorderBufferChange *change);
+
+
+void
+_PG_init(void)
+{
+}
+
+void
+pg_decode_init(void **private)
+{
+	AssertVariableIsOfType(pg_decode_init, LogicalDecodeInitCB);
+	*private = AllocSetContextCreate(TopMemoryContext,
+	 "text conversion context",
+	 ALLOCSET_DEFAULT_MINSIZE,
+	 ALLOCSET_DEFAULT_INITSIZE,
+	 ALLOCSET_DEFAULT_MAXSIZE);
+}
+
+bool
+pg_decode_begin_txn(void *private, StringInfo out, ReorderBufferTXN* txn)
+{
+	AssertVariableIsOfType(pg_decode_begin_txn, LogicalDecodeBeginCB);
+
+	appendStringInfo(out, "BEGIN %d", txn->xid);
+	return true;
+}
+
+bool
+pg_decode_commit_txn(void *private, StringInfo out, ReorderBufferTXN* txn, XLogRecPtr commit_lsn)
+{
+	AssertVariableIsOfType(pg_decode_commit_txn, LogicalDecodeCommitCB);
+
+	appendStringInfo(out, "COMMIT %d", txn->xid);
+	return true;
+}
+
+static void
+tuple_to_stringinfo(StringInfo s, TupleDesc tupdesc, HeapTuple tuple)
+{
+	int			i;
+	HeapTuple	typeTuple;
+	Form_pg_type pt;
+
+	for (i = 0; i  tupdesc-natts; i++)
+	{
+		Oid			typid, typoutput;
+		bool		typisvarlena;
+		Datum		origval, val;
+		char*outputstr;
+		boolisnull;
+		if (tupdesc-attrs[i]-attisdropped)
+			continue;
+		if (tupdesc-attrs[i]-attnum  0)
+			continue;
+
+		typid = tupdesc-attrs[i]-atttypid;
+
+		typeTuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typid));
+		if (!HeapTupleIsValid(typeTuple))
+			elog(ERROR, "cache lookup failed for type %u", typid);
+		pt = (Form_pg_type) GETSTRUCT(typeTuple);
+
+		appendStringInfoChar(s, ' ');
+		appendStringInfoString(s, NameStr(tupdesc-attrs[i]-attname));
+		appendStringInfoChar(s, '[');
+		appendStringInfoString(s, NameStr(pt-typname));
+		appendStringInfoChar(s, ']');
+
+		getTypeOutputInfo(typid,
+		  typoutput, typisvarlena);
+
+		ReleaseSysCache(typeTuple);
+
+		origval = fastgetattr(tuple, i + 1, tupdesc, &isnull);
+
+		if (typisvarlena  !isnull)
+			val = PointerGetDatum(PG_DETOAST_DATUM(origval));
+		else
+			val = origval;
+
+		if (isnull)
+			outputstr = "(null)";
+		else
+			outputstr = OidOutputFunctionCall(typoutput, val);
+
+		appendStringInfoChar(s, ':');
+		appendStringInfoString(s, outputstr);
+	}
+}
+
+/* This is just for demonstration, don't ever use this code for anything real! */
+bool
+pg_decode_change(void *private, StringInfo out, ReorderBufferTXN* txn,
+ Oid tableoid, ReorderBufferChange *change)
+{
+	Relation relation = RelationIdGetRelation(tableoid);
+	Form_pg_class class_form = 

[HACKERS] [PATCH 13/14] Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes

2012-11-14 Thread Andres Freund
---
 src/bin/pg_basebackup/Makefile |   7 +-
 src/bin/pg_basebackup/pg_receivellog.c | 717 +
 src/bin/pg_basebackup/streamutil.c |   3 +-
 src/bin/pg_basebackup/streamutil.h |   1 +
 4 files changed, 725 insertions(+), 3 deletions(-)
 create mode 100644 src/bin/pg_basebackup/pg_receivellog.c

diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index 5a2a46a..3775c44 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -20,7 +20,7 @@ override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
 
 OBJS=receivelog.o streamutil.o $(WIN32RES)
 
-all: pg_basebackup pg_receivexlog
+all: pg_basebackup pg_receivexlog pg_receivellog
 
 pg_basebackup: pg_basebackup.o $(OBJS) | submake-libpq submake-libpgport
 	$(CC) $(CFLAGS) pg_basebackup.o $(OBJS) $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
@@ -28,6 +28,9 @@ pg_basebackup: pg_basebackup.o $(OBJS) | submake-libpq submake-libpgport
 pg_receivexlog: pg_receivexlog.o $(OBJS) | submake-libpq submake-libpgport
 	$(CC) $(CFLAGS) pg_receivexlog.o $(OBJS) $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
 
+pg_receivellog: pg_receivellog.o $(OBJS) | submake-libpq submake-libpgport
+	$(CC) $(CFLAGS) pg_receivellog.o $(OBJS) $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
 install: all installdirs
 	$(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
 	$(INSTALL_PROGRAM) pg_receivexlog$(X) '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
@@ -40,4 +43,4 @@ uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
 
 clean distclean maintainer-clean:
-	rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o
+	rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o
diff --git a/src/bin/pg_basebackup/pg_receivellog.c b/src/bin/pg_basebackup/pg_receivellog.c
new file mode 100644
index 000..1a95991
--- /dev/null
+++ b/src/bin/pg_basebackup/pg_receivellog.c
@@ -0,0 +1,717 @@
+/*-
+ *
+ * pg_receivellog.c - receive streaming logical log data and write it
+ *	  to a local file.
+ *
+ * Author: Magnus Hagander mag...@hagander.net
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		  src/bin/pg_basebackup/pg_receivellog.c
+ *-
+ */
+
+/*
+ * We have to use postgres.h not postgres_fe.h here, because there's so much
+ * backend-only stuff in the XLOG include files we need.  But we need a
+ * frontend-ish environment otherwise.	Hence this ugly hack.
+ */
+#define FRONTEND 1
+#include postgres.h
+#include libpq-fe.h
+#include libpq/pqsignal.h
+#include access/xlog_internal.h
+#include utils/datetime.h
+#include utils/timestamp.h
+
+#include receivelog.h
+#include streamutil.h
+
+#include dirent.h
+#include sys/stat.h
+#include sys/time.h
+#include sys/types.h
+#include unistd.h
+
+#include getopt_long.h
+
+/* Time to sleep between reconnection attempts */
+#define RECONNECT_SLEEP_TIME 5
+
+/* Global options */
+char	   *outfile = NULL;
+int	outfd = -1;
+int			verbose = 0;
+int			noloop = 0;
+int			standby_message_timeout = 10 * 1000;		/* 10 sec = default */
+volatile bool time_to_abort = false;
+
+
+static void usage(void);
+static void StreamLog();
+
+static void
+usage(void)
+{
+	printf(_(%s receives PostgreSQL streaming transaction logs.\n\n),
+		   progname);
+	printf(_(Usage:\n));
+	printf(_(  %s [OPTION]...\n), progname);
+	printf(_(\nOptions:\n));
+	printf(_(  -f, --file=FILEreceive log into this file\n));
+	printf(_(  -n, --no-loop  do not loop on connection lost\n));
+	printf(_(  -v, --verbose  output verbose messages\n));
+	printf(_(  -V, --version  output version information, then exit\n));
+	printf(_(  -?, --help show this help, then exit\n));
+	printf(_(\nConnection options:\n));
+	printf(_(  -h, --host=HOSTNAMEdatabase server host or socket directory\n));
+	printf(_(  -p, --port=PORTdatabase server port number\n));
+	printf(_(  -d, --database=DBNAME  database to connect to\n));
+	printf(_(  -s, --status-interval=INTERVAL\n
+			  time between status packets sent to server (in seconds)\n));
+	printf(_(  -U, --username=NAMEconnect as specified database user\n));
+	printf(_(  -w, --no-password  never prompt for password\n));
+	printf(_(  -W, --password force password prompt (should happen automatically)\n));
+	printf(_(\nReport bugs to pgsql-b...@postgresql.org.\n));
+}
+
+
+/*
+ * Local version of GetCurrentTimestamp(), since we are not linked with
+ * backend code. The protocol always uses integer timestamps, regardless of
+ * server setting.
+ */
+static int64
+localGetCurrentTimestamp(void)
+{
+	int64 result;
+	struct timeval tp;
+
+	gettimeofday(tp, 

[HACKERS] [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

2012-11-14 Thread Andres Freund

Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard, and would require
changes to rather integral parts of the recovery code, which doesn't seem to be
a good idea.

Missing:
- compressing the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
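
For orientation, setting a reader up looks roughly like the sketch below (the
helper is hypothetical; the callback signatures and the main read loop are
documented in xlogreader.h and not repeated here):

#include "postgres.h"
#include "access/xlogreader.h"

static XLogReaderState *
setup_reader_sketch(void *my_private_state)
{
	XLogReaderState *state = XLogReaderAllocate();

	/*
	 * Hook in the callbacks defined in xlogreader.h: a record filter, a
	 * "record is complete" handler and (optionally) a writeout function.
	 * NULL simply means "none" in this sketch.
	 */
	state->is_record_interesting = NULL;
	state->finished_record = NULL;
	state->writeout_data = NULL;
	state->private_data = my_private_state;

	/*
	 * ... then point the reader at a WAL range and feed it data as
	 * described in the header's documentation (omitted here).
	 */
	return state;
}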
---
 src/backend/access/transam/Makefile |2 +-
 src/backend/access/transam/xlogreader.c | 1032 +++
 src/include/access/xlogreader.h |  264 
 3 files changed, 1297 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogreader.c
 create mode 100644 src/include/access/xlogreader.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 700cfd8..eb6cfc5 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -14,7 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
 	timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
-	xlogutils.o
+	xlogreader.o xlogutils.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
new file mode 100644
index 000..71e7d52
--- /dev/null
+++ b/src/backend/access/transam/xlogreader.c
@@ -0,0 +1,1032 @@
+/*-
+ *
+ * xlogreader.c
+ *		Generic xlog reading facility
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogreader.c
+ *
+ * NOTES
+ *		Documentation about how to use this interface can be found in
+ *		xlogreader.h, more specifically in the definition of the
+ *		XLogReaderState struct where all parameters are documented.
+ *
+ * TODO:
+ * * more extensive validation of read records
+ * * separation of reader/writer
+ * * customizable error response
+ * * usable without backend code around
+ *-
+ */
+
+#include postgres.h
+
+#include access/xlog_internal.h
+#include access/transam.h
+#include catalog/pg_control.h
+#include access/xlogreader.h
+
+/* If (very) verbose debugging is needed:
+ * #define VERBOSE_DEBUG
+ */
+
+XLogReaderState*
+XLogReaderAllocate(void)
+{
+	XLogReaderState* state = (XLogReaderState*)malloc(sizeof(XLogReaderState));
+	int i;
+
+	if (!state)
+		goto oom;
+
+	memset(state-buf.record, 0, sizeof(XLogRecord));
+	state-buf.record_data_size = XLOG_BLCKSZ*8;
+	state-buf.record_data =
+			malloc(state-buf.record_data_size);
+
+	if (!state-buf.record_data)
+		goto oom;
+
+	memset(state-buf.record_data, 0, state-buf.record_data_size);
+	state-buf.origptr = InvalidXLogRecPtr;
+
+	for (i = 0; i  XLR_MAX_BKP_BLOCKS; i++)
+	{
+		state-buf.bkp_block_data[i] =
+			malloc(BLCKSZ);
+
+		if (!state-buf.bkp_block_data[i])
+			goto oom;
+	}
+
+	state-is_record_interesting = NULL;
+	state-writeout_data = NULL;
+	state-finished_record = NULL;
+	state-private_data = NULL;
+	state-output_buffer_size = 0;
+
+	XLogReaderReset(state);
+	return state;
+
+oom:
+	ereport(ERROR,
+	(errcode(ERRCODE_OUT_OF_MEMORY),
+	 errmsg("out of memory"),
+	 errdetail("failed while allocating an XLogReader")));
+	return NULL;
+}
+
+void
+XLogReaderFree(XLogReaderState* state)
+{
+	int i;
+
+	for (i = 0; i  XLR_MAX_BKP_BLOCKS; i++)
+	{
+		free(state-buf.bkp_block_data[i]);
+	}
+
+	free(state-buf.record_data);
+
+	free(state);
+}
+
+void
+XLogReaderReset(XLogReaderState* state)
+{
+	state->in_record = false;
+	state-in_record_header = false;
+	state-do_reassemble_record = false;
+	state-in_bkp_blocks = 0;
+	state-in_bkp_block_header = false;
+	state-in_skip = false;
+	state-remaining_size = 0;
+	state-already_written_size = 0;
+	state-incomplete = false;
+	state-initialized = false;
+	state-needs_input = false;
+	state-needs_output = false;
+	state-stop_at_record_boundary = false;
+}
+
+static inline bool
+XLogReaderHasInput(XLogReaderState* state, Size size)
+{
+	XLogRecPtr tmp = state-curptr;
+	XLByteAdvance(tmp, size);
+	if (XLByteLE(state-endptr, tmp))
+		return false;
+	return true;
+}
+
+static inline bool
+XLogReaderHasOutput(XLogReaderState* state, Size size)
+{
+	/* if we don't do output or have no limits in the output size */
+	if (state-writeout_data == NULL || state-output_buffer_size == 0)
+		return true;
+
+	if (state-already_written_size + size  state-output_buffer_size)
+		return false;
+
+	return true;
+}
+
+static inline bool
+XLogReaderHasSpace(XLogReaderState* state, Size size)
+{
+	if (!XLogReaderHasInput(state, size))
+		return false;
+
+	if (!XLogReaderHasOutput(state, size))
+		return false;
+
+	return true;
+}
+
+/* 

Re: [HACKERS] logical changeset generation v3 - git repository

2012-11-14 Thread Andres Freund
On 2012-11-15 01:27:46 +0100, Andres Freund wrote:
 In response to this you will soon find the 14 patches that currently
 implement $subject.

As it's not very wieldy to send around that many/big patches all the
time, until the next major version I will just update the git tree at:

Web:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf3

Git:
git clone git://git.postgresql.org/git/users/andresfreund/postgres.git 
xlog-decoding-rebasing-cf3

Greetings,

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-14 Thread Jeff Davis
On Tue, 2012-11-13 at 15:27 -0500, Robert Haas wrote:
 A small
 patch that gets committed is better than a big one that doesn't.

Here's a small patch (two, actually, because the TLI one is
uninteresting and noisy). It's based on Simon's patch, but with some
significant changes:

  * I ripped out all of the handling for a mix of some checksummed and
some non-checksummed pages. No more control bits or page version stuff.
  * I moved the checksum to the pd_tli field, and renamed it
pd_checksum.
  * vm/fsm_extend were not setting the verification information for some
reason. I'm not sure why, but since it's now on/off for the entire
system, they need to do the same thing.
  * Added a flag to pg_control called data_checksums. It is set by
initdb when the -k/--data-checksums option is specified (open for
discussion).
  * Added a function in xlog.c that is a simple reader of the control
file flag.
  * Got rid of page_checksums GUC.
  * Incorporated the page number into the checksum calculation, to
detect pages that are transposed (a rough sketch of the idea follows below).
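
Here is a toy sketch of that last point. It is not the checksum algorithm the
patch actually uses; it only shows how folding the block number in makes
otherwise-identical pages at different block numbers checksum differently
(the real code also has to skip the pd_checksum field itself, which is
omitted here):

#include "postgres.h"
#include "storage/block.h"

static uint16
page_checksum_sketch(const char *page, BlockNumber blkno)
{
	uint32		sum = (uint32) blkno;	/* seed with the block number */
	int			i;

	for (i = 0; i < BLCKSZ; i++)
		sum = sum * 31 + (uint8) page[i];	/* toy hash, not the real one */

	return (uint16) (sum ^ (sum >> 16));
}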

I'll do another pass to make sure I update all of the comments, and try
to self review it. So, slightly rough in some places.

Regards,
Jeff Davis



checksums.patch.gz
Description: GNU Zip compressed data


replace-tli-with-checksums.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] lcr - walsender integration

2012-11-14 Thread Andres Freund
Hi,

The current logical walsender integration looks like the following:

=# INIT_LOGICAL_REPLICATION 'text';
WARNING:  Initiating logical rep
WARNING:  reached consistent point, stopping!
 replication_id | consistent_point | snapshot_name | plugin
+--+---+
 id-2   | 3/CACBDF98   | 0xDEADBEEF| text
(1 row)

=# START_LOGICAL_REPLICATION 'id-2' 3/CACBDF98;
...


So the current protocol is:
INIT_LOGICAL_REPLICATION '$plugin';
returns
* slot
* first consistent point
* snapshot id

START_LOGICAL_REPLICATION '$slot' $last_received_lsn;

streams changes, each wrapped in a 'w' message with (start, end) set to
the same value. The content of the data is completely free-format and
only depends on the output plugin.

Feedback is provided from the client via the normal 'r' messages.
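
For concreteness, a client consuming that stream over libpq might look roughly
like the sketch below. The conninfo, the exact 'w' header layout (assumed here
to mirror the physical streaming protocol: type byte plus two LSNs plus a
timestamp) and the missing 'r' feedback are all simplifications, not a
finished API:

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

#define W_HDR_LEN	(1 + 8 + 8 + 8)		/* assumed 'w' message header size */

static void
stream_slot_sketch(const char *conninfo, const char *slot, const char *startpos)
{
	PGconn	   *conn = PQconnectdb(conninfo);
	PGresult   *res;
	char		cmd[256];
	char	   *copybuf;
	int			len;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		exit(1);
	}

	snprintf(cmd, sizeof(cmd), "START_LOGICAL_REPLICATION '%s' %s",
			 slot, startpos);
	res = PQexec(conn, cmd);
	if (PQresultStatus(res) != PGRES_COPY_BOTH)
	{
		fprintf(stderr, "could not start streaming: %s", PQerrorMessage(conn));
		exit(1);
	}
	PQclear(res);

	/* each 'w' message carries one plugin-formatted change */
	while ((len = PQgetCopyData(conn, &copybuf, 0)) > 0)
	{
		if (copybuf[0] == 'w' && len > W_HDR_LEN)
			fwrite(copybuf + W_HDR_LEN, 1, len - W_HDR_LEN, stdout);
		/* a real client would also send 'r' feedback messages here */
		PQfreemem(copybuf);
	}

	PQfinish(conn);
}

(Under the walsender patch earlier in this series the conninfo would presumably
be something like "dbname=mydb replication=true", but that detail is still
open.)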

I think that's not a bad start, but we probably can improve it a bit:

INIT_LOGICAL_REPLICATION '$slot' '$plugin' ($value = $key, ...);
START_LOGICAL_REPLICATION '$slot' $last_received_lsn;
STOP_LOGICAL_REPLICATION '$slot';

The options to INIT_LOGICAL_REPLICATION would then get passed to the
'pg_decode_init' output plugin function (i.e. a function of that name
would get dlsym()'ed using the pg infrastructure for that).

Does that look good to you? Any suggestions?

Greetings,

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Craig Ringer
On 11/15/2012 12:46 AM, Tom Lane wrote:

 Agreed, and there's also the question of passing switches etc to the
 program, so the string can't be a bare file name anyway.  I proposed
 pipe symbols (|) in the string previously, but if you find that too
 Unix-centric I suppose we could do

   COPY TABLE FROM PROGRAM 'command line';
   COPY TABLE TO PROGRAM 'command line';

I'd strongly prefer that from a security standpoint. I intensely dislike
the idea that COPY will change from a command that can at worst expose
data to a command that can execute programs. It means existing security
decisions in applications that use it must be re-evaluated ... and most
won't be. Also, it isn't too hard to check the command string for pipe
chars, but experience has taught over and over with SQL injection, shell
metacharacter exploits, XSS, etc that magic characters that you must
check for are a bad idea, and it's much better to have separate syntax
(like parameterized statements) rather than magic strings.

Additionally, the pipe design appears to presume the presence of a shell
and the desirability of using it. I don't think either assumption is
sensible.

Windows has a shell of sorts (cmd.exe) but its behaviour is different
with regards to quoting and it can be tricky to produce commands that
work under both a UNIX shell and cmd.exe .

More importantly, the shell provides fun opportunities for unexpected
side-effects via metacharacters, leading to undesired behaviour or even
exploits. It's IMO strongly preferable to use an argument vector and
direct execution, so the shell never gets involved.
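
To make that concrete, the argument-vector style of invocation I have in mind
is roughly the following (a POSIX-only sketch, purely illustrative, not
proposed backend code):

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

static FILE *
open_program_for_read(char *const argv[])
{
	int			pipefd[2];
	pid_t		pid;

	if (pipe(pipefd) != 0)
		return NULL;

	pid = fork();
	if (pid < 0)
	{
		close(pipefd[0]);
		close(pipefd[1]);
		return NULL;
	}
	if (pid == 0)
	{
		/* child: wire stdout to the pipe and exec the argument vector */
		close(pipefd[0]);
		dup2(pipefd[1], STDOUT_FILENO);
		execvp(argv[0], argv);	/* no shell, so no metacharacter expansion */
		_exit(127);
	}

	/* parent: read the program's output; caller must also reap the child */
	close(pipefd[1]);
	return fdopen(pipefd[0], "r");
}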

How about:

COPY ... FROM PROGRAM '/bin/my_program', '$notavariable', '$(rm -rf
$HOME)';

or:

COPY ... FROM (PROGRAM '/bin/my_program', ARGUMENTS
('$notavariable', '$(rm -rf $HOME)') );

?

Something extensible would be good, as somebody is *inevitably* going to
ask so how do I set environment variables before I call the command
and how do I control which return values are considered success.


--
Craig Ringer



   regards, tom lane




-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v3

2012-11-14 Thread Josh Berkus
On 11/14/12 4:27 PM, Andres Freund wrote:
 Hi,
 
 In response to this you will soon find the 14 patches that currently
 implement $subject. I'll go over each one after showing off for a bit:

Lemme be the first to say, wow.  Impressive work.

Now the debugging starts ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Tom Lane
Craig Ringer cr...@2ndquadrant.com writes:
 On 11/15/2012 12:46 AM, Tom Lane wrote:
 Agreed, and there's also the question of passing switches etc to the
 program, so the string can't be a bare file name anyway.  I proposed
 pipe symbols (|) in the string previously, but if you find that too
 Unix-centric I suppose we could do
 
 COPY TABLE FROM PROGRAM 'command line';
 COPY TABLE TO PROGRAM 'command line';

 I'd strongly prefer that from a security standpoint.

That's a reasonable concern.

 Additionally, the pipe design appears to presume the presence of a shell
 and the desirability of using it. I don't think either assumption is
 sensible.

I disagree very very strongly with that.  If we prevent use of shell
syntax, we will lose a lot of functionality, for instance

copy ... from program 'foo somefile'
copy ... from program 'foo | bar'

unless you're imagining that we will reimplement a whole lot of that
same shell syntax for ourselves.  (And no, I don't care whether the
Windows syntax is exactly the same or not.  The program name/path is
already likely to vary across systems, so it's pointless to suppose that
use of the feature would be 100% portable if only we lobotomized it.)

 More importantly, the shell provides fun opportunities for unexpected
 side-effects via metacharacters, leading to undesired behaviour or even
 exploits.

So?  You're already handing the keys to the kingdom to anybody who can
control the contents of that command line, even if it's only to point at
the wrong program.  And one man's unexpected side-effect is another
man's essential feature, as in my examples above.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-14 Thread Robert Haas
On Wed, Nov 14, 2012 at 6:24 PM, Jeff Davis pg...@j-davis.com wrote:
 Hmm... what if we took this a step further and actually stored the
 checksums in a separate relation fork?  That would make it pretty
 simple to support enabling/disabling checksums for particular
 relations.  It would also allow us to have a wider checksum, like 32
 or 64 bits rather than 16.  I'm not scoffing at a 16-bit checksum,
 because even that's enough to catch a very high percentage of errors,
 but it wouldn't be terrible to be able to support a wider one, either.

 I don't remember exactly why this idea was sidelined before, but I don't
 think there were any showstoppers. It does have some desirable
 properties; most notably the ability to add checksums without a huge
 effort, so perhaps the idea can be revived.

 But there are some practical issues, as Tom points out. Another one is
 that it's harder for external utilities (like pg_basebackup) to verify
 checksums.

 And I just had another thought: these pages of checksums would be data
 pages, with an LSN. But as you clean ordinary data pages, you need to
 constantly bump the LSN of the very same checksum page (because it
 represents 1000 ordinary data pages); making it harder to actually clean
 the checksum page and finish a checkpoint. Is this a practical concern
 or am I borrowing trouble?

Well, I think the invariant we'd need to maintain is as follows: every
page for which the checksum fork might be wrong must have an FPI
following the redo pointer.  So, at the time we advance the redo
pointer, we need the checksum fork to be up-to-date for all pages for
which a WAL record was written after the old redo pointer except for
those for which a WAL record has again been written after the new redo
pointer.  In other words, the checksum pages we write out don't need
to be completely accurate; the checksums for any blocks we know will
get clobbered anyway during replay don't really matter.

However, reading your comments, I do see one sticking point.  If we
don't update the checksum page until a buffer is written out, which of
course makes a lot of sense, then during a checkpoint, we'd have to
flush all of the regular pages first and then all the checksum pages
afterward.  Otherwise, the checksum pages wouldn't be sufficiently
up-to-date at the time we write them.  There's no way to make that
happen just by fiddling with the LSN; rather, we'd need some kind of
two-pass algorithm over the buffer pool.  That doesn't seem
unmanageable, but it's more complicated than what we do now.
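
In pseudo-C the ordering constraint would look something like this; every
helper named here is hypothetical, and the sketch only shows the two passes:

#include <stdbool.h>

extern int	NBuffers;
extern bool buffer_is_dirty(int buf);				/* hypothetical */
extern bool buffer_is_checksum_fork_page(int buf);	/* hypothetical */
extern void update_checksum_fork_entry(int buf);	/* hypothetical */
extern void flush_buffer(int buf);					/* hypothetical */

static void
checkpoint_flush_with_checksum_fork(void)
{
	int			buf;

	/* pass 1: flush ordinary pages, keeping their checksum pages current */
	for (buf = 0; buf < NBuffers; buf++)
	{
		if (buffer_is_dirty(buf) && !buffer_is_checksum_fork_page(buf))
		{
			update_checksum_fork_entry(buf);	/* dirties its checksum page */
			flush_buffer(buf);
		}
	}

	/* pass 2: only now are the checksum pages up to date, so write them */
	for (buf = 0; buf < NBuffers; buf++)
	{
		if (buffer_is_dirty(buf) && buffer_is_checksum_fork_page(buf))
			flush_buffer(buf);
	}
}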

I'm not sure we'd actually bother setting the LSN on the checksum
pages, because the action that prompts an update of a checksum page is
the decision to write out a non-checksum page, and that's not a
WAL-loggable action, so there's no obvious LSN to apply, and no
obvious need to apply one at all.

I'm also not quite sure what happens with full_page_writes=off.  I
don't really see how to make this scheme work at all in that
environment.  Keeping the checksum in the page seems to dodge quite a
few problems in that case ... as long as you assume that 8kB writes
really are atomic.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doc patch making firm recommendation for setting the value of commit_delay

2012-11-14 Thread David Rowley

 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-
 ow...@postgresql.org] On Behalf Of Peter Geoghegan
 Sent: 15 November 2012 09:44
 To: PG Hackers
 Subject: [HACKERS] Doc patch making firm recommendation for setting the
 value of commit_delay
 
 Some of you will be aware that I've tried to formulate a good general
 recommendation for setting the value of commit_delay, in light of the
 improvements made in 9.3. I previously asked for help finding a
methodology
 for optimising commit_delay [1]. The documentation related to
 commit_delay still says that we don't know what might work well, though I
 don't think that's true any more.
 
 I found the time to do some benchmarks of my own - Greg Smith made
 available a server that he frequently uses for benchmarks. It was
previously
 my observation that half of raw single-page sync time as reported by
pg_test_fsync for your wal_sync_method worked best for commit_delay. I
 went so far as to modify pg_test_fsync to output average time per op in
 microseconds for each operation with commit_delay in mind, which is a
patch
 that has already been committed [2]. This general approach worked really
 well on my laptop, which has a slowish fsync of about 8 milliseconds
(8,000
 microseconds), for which a commit_delay setting of 4,000 (microseconds)
 seemed to clearly work best, on both insert and tpc-b benchmarks [3].

snip
 
 
 Thoughts?
 

I think for sure, since the GUC maintained its original name, that the docs
need to be updated to let people know the background behaviour has changed
and may now be far more useful.
I've read through the patch. Only thing I see out of place is a small typo:

!values of varnamecommit_siblings/varname should be used is such
cases,

Should probably read

!values of varnamecommit_siblings/varname should be used *in* such
cases,

Thanks for doing all the benchmarks too. Good to see such a range of
different hardware tested.

Regards

David Rowley





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH 03/14] Add simple xlogdump tool

2012-11-14 Thread Andres Freund
---
 src/bin/Makefile|   2 +-
 src/bin/xlogdump/Makefile   |  25 +++
 src/bin/xlogdump/xlogdump.c | 468 
 3 files changed, 494 insertions(+), 1 deletion(-)
 create mode 100644 src/bin/xlogdump/Makefile
 create mode 100644 src/bin/xlogdump/xlogdump.c

diff --git a/src/bin/Makefile b/src/bin/Makefile
index b4dfdba..9992f7a 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -14,7 +14,7 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS = initdb pg_ctl pg_dump \
-	psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup
+	psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup xlogdump
 
 ifeq ($(PORTNAME), win32)
 SUBDIRS += pgevent
diff --git a/src/bin/xlogdump/Makefile b/src/bin/xlogdump/Makefile
new file mode 100644
index 000..d54640a
--- /dev/null
+++ b/src/bin/xlogdump/Makefile
@@ -0,0 +1,25 @@
+#-
+#
+# Makefile for src/bin/xlogdump
+#
+# Copyright (c) 1998-2012, PostgreSQL Global Development Group
+#
+# src/bin/xlogdump/Makefile
+#
+#-
+
+PGFILEDESC = xlogdump
+PGAPPICON=win32
+
+subdir = src/bin/xlogdump
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS= xlogdump.o \
+	 $(WIN32RES)
+
+all: xlogdump
+
+
+xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s   \012|grep -v /main.o|sed 's/^/..\/..\/..\//')
+	$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
diff --git a/src/bin/xlogdump/xlogdump.c b/src/bin/xlogdump/xlogdump.c
new file mode 100644
index 000..0f984e4
--- /dev/null
+++ b/src/bin/xlogdump/xlogdump.c
@@ -0,0 +1,468 @@
+#include postgres.h
+
+#include unistd.h
+
+#include access/xlogreader.h
+#include access/rmgr.h
+#include miscadmin.h
+#include storage/ipc.h
+#include utils/memutils.h
+#include utils/guc.h
+
+#include getopt_long.h
+
+/*
+ * needs to be declared because otherwise its defined in main.c which we cannot
+ * link from here.
+ */
+const char *progname = "xlogdump";
+
+typedef struct XLogDumpPrivateData {
+	TimeLineID timeline;
+	char* outpath;
+	char* inpath;
+} XLogDumpPrivateData;
+
+static void
+XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
+ XLogRecPtr startptr, char *buf, Size count);
+
+static void
+XLogDumpXLogWrite(const char *directory, TimeLineID timeline_id,
+  XLogRecPtr startptr, const char *buf, Size count);
+
+#define XLogFilePathWrite(path, base, tli, logSegNo)			\
+	snprintf(path, MAXPGPATH, "%s/%08X%08X%08X", base, tli,		\
+			 (uint32) ((logSegNo) / XLogSegmentsPerXLogId),		\
+			 (uint32) ((logSegNo) % XLogSegmentsPerXLogId))
+
+static void
+XLogDumpXLogWrite(const char *directory, TimeLineID timeline_id,
+  XLogRecPtr startptr, const char *buf, Size count)
+{
+	const char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	static int	sendFile = -1;
+	static XLogSegNo sendSegNo = 0;
+	static uint32 sendOff = 0;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes  0)
+	{
+		uint32		startoff;
+		int			segbytes;
+		int			writebytes;
+
+		startoff = recptr % XLogSegSize;
+
+		if (sendFile  0 || !XLByteInSeg(recptr, sendSegNo))
+		{
+			char		path[MAXPGPATH];
+
+			/* Switch to another logfile segment */
+			if (sendFile = 0)
+close(sendFile);
+
+			XLByteToSeg(recptr, sendSegNo);
+			XLogFilePathWrite(path, directory, timeline_id, sendSegNo);
+
+			sendFile = open(path, O_WRONLY|O_CREAT, S_IRUSR | S_IWUSR);
+			if (sendFile  0)
+			{
+ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+path)));
+			}
+			sendOff = 0;
+		}
+
+		/* Need to seek in the file? */
+		if (sendOff != startoff)
+		{
+			if (lseek(sendFile, (off_t) startoff, SEEK_SET)  0){
+char fname[MAXPGPATH];
+XLogFileName(fname, timeline_id, sendSegNo);
+
+ereport(ERROR,
+		(errcode_for_file_access(),
+		 errmsg("could not seek in log segment %s to offset %u: %m",
+		fname,
+startoff)));
+			}
+			sendOff = startoff;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes  (XLogSegSize - startoff))
+			segbytes = XLogSegSize - startoff;
+		else
+			segbytes = nbytes;
+
+		writebytes = write(sendFile, p, segbytes);
+		if (writebytes = 0)
+		{
+			char fname[MAXPGPATH];
+			XLogFileName(fname, timeline_id, sendSegNo);
+
+			ereport(ERROR,
+	(errcode_for_file_access(),
+			errmsg("could not write to log segment %s, offset %u, length %lu: %m",
+   fname,
+   sendOff, (unsigned long) segbytes)));
+		}
+
+		/* Update state for read */
+		XLByteAdvance(recptr, writebytes);
+
+		sendOff += writebytes;
+		nbytes -= writebytes;
+		p += writebytes;
+	}
+}
+
+/* this should probably be put in a general implementation */
+static void
+XLogDumpXLogRead(const 

[HACKERS] [PATCH 07/14] Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement

2012-11-14 Thread Andres Freund

This is useful to be able to represent a CommandId that's invalid. There was no
such value before.

This decreases the possible number of commands in a transaction by one, which
seems unproblematic. It's also not a problem for pg_upgrade because cmin/cmax
are never looked at outside the context of their own transaction (except for
timetravel access, but that's new anyway).
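
As a tiny illustration of the kind of use this enables (the struct here is
hypothetical, just a sketch of a sentinel value in practice):

#include "postgres.h"

typedef struct ChangeSketch
{
	TransactionId xid;
	CommandId	cmin;			/* InvalidCommandId if not (yet) known */
} ChangeSketch;

static bool
change_has_cmin(const ChangeSketch *change)
{
	return change->cmin != InvalidCommandId;
}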
---
 src/backend/access/transam/xact.c | 4 ++--
 src/include/c.h   | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 10386da..f28b4c8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -745,12 +745,12 @@ CommandCounterIncrement(void)
 	if (currentCommandIdUsed)
 	{
 		currentCommandId += 1;
-		if (currentCommandId == FirstCommandId) /* check for overflow */
+		if (currentCommandId == InvalidCommandId)
 		{
 			currentCommandId -= 1;
 			ereport(ERROR,
 	(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-	 errmsg("cannot have more than 2^32-1 commands in a transaction")));
+	 errmsg("cannot have more than 2^32-2 commands in a transaction")));
 		}
 		currentCommandIdUsed = false;
 
diff --git a/src/include/c.h b/src/include/c.h
index a6c0e6e..e52af3b 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -367,6 +367,7 @@ typedef uint32 MultiXactOffset;
 typedef uint32 CommandId;
 
 #define FirstCommandId	((CommandId) 0)
+#define InvalidCommandId	(~(CommandId)0)
 
 /*
  * Array indexing support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Craig Ringer
On 11/15/2012 10:19 AM, Tom Lane wrote:

 I disagree very very strongly with that.  If we prevent use of shell
 syntax, we will lose a lot of functionality, for instance

   copy ... from program 'foo somefile'
   copy ... from program 'foo | bar'

 unless you're imagining that we will reimplement a whole lot of that
 same shell syntax for ourselves.  (And no, I don't care whether the
 Windows syntax is exactly the same or not.  The program name/path is
 already likely to vary across systems, so it's pointless to suppose that
 use of the feature would be 100% portable if only we lobotomized it.)

That's reasonable - and it isn't worth making people jump through hoops
with ('bash','-c','/some/command < infile').

 So?  You're already handing the keys to the kingdom to anybody who can
 control the contents of that command line, even if it's only to point at
 the wrong program.  And one man's unexpected side-effect is another
 man's essential feature, as in my examples above.

That's true if they're controlling the whole command, not so much if
they just provide a file name. I'm just worried that people will use it
without thinking deeply about the consequences, just like they do with
untrusted client input in SQL injection attacks.

I take you point about wanting more than just the execve()-style
invocation. I'd still like to see a way to invoke the command without
having the shell involved, though; APIs to invoke external programs seem
to start out with a version that launches via the shell then quickly
grow more controlled argument-vector versions.

There's certainly room for a quick'n'easy COPY ... FROM PROGRAM ('cmd1 |
cmd2 | tee /tmp/log') . At this point all I think is really vital is to
make copy-with-exec *syntactically different* to plain COPY, and to
leave room for extending the syntax for environment, separate args
vector, etc when they're called for. Like VACUUM, where VACUUM VERBOSE
ANALYZE became VACUUM (VERBOSE, ANALYZE) to make room for (BUFFERS), etc.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH 14/14] design document v2.3 and snapshot building design doc v0.2

2012-11-14 Thread Andres Freund
---
 src/backend/replication/logical/DESIGN.txt | 603 +
 src/backend/replication/logical/Makefile   |   6 +
 .../replication/logical/README.SNAPBUILD.txt   | 298 ++
 3 files changed, 907 insertions(+)
 create mode 100644 src/backend/replication/logical/DESIGN.txt
 create mode 100644 src/backend/replication/logical/README.SNAPBUILD.txt

diff --git a/src/backend/replication/logical/DESIGN.txt b/src/backend/replication/logical/DESIGN.txt
new file mode 100644
index 000..8383602
--- /dev/null
+++ b/src/backend/replication/logical/DESIGN.txt
@@ -0,0 +1,603 @@
+//-*- mode: adoc -*-
+= High Level Design for Logical Replication in Postgres =
+:copyright: PostgreSQL Global Development Group 2012
+:author: Andres Freund, 2ndQuadrant Ltd.
+:email: and...@2ndquadrant.com
+
+== Introduction ==
+
+This document aims to first explain why we think postgres needs another
+replication solution and what that solution needs to offer in our opinion. Then
+it sketches out our proposed implementation.
+
+In contrast to an earlier version of the design document which talked about the
+implementation of four parts of replication solutions:
+
+1. Source data generation
+1. Transportation of that data
+1. Applying the changes
+1. Conflict resolution
+
+this version only plans to talk about the first part in detail as it is an
+independent and complex part usable for a wide range of use cases which we want
+to get included into postgres in a first step.
+
+=== Previous discussions ===
+
+There are two rather large threads discussing several parts of the initial
+prototype and proposed architecture:
+
+- http://archives.postgresql.org/message-id/201206131327.24092.and...@2ndquadrant.com[Logical Replication/BDR prototype and architecture]
+- http://archives.postgresql.org/message-id/201206211341.25322.and...@2ndquadrant.com[Catalog/Metadata consistency during changeset extraction from WAL]
+
+Those discussions led to some fundamental design changes, which are presented in this document.
+
+=== Changes from v1 ===
+* At least a partial decoding step required/possible on the source system
+* No intermediate (schema only) instances required
+* DDL handling, without event triggers
+* A very simple text conversion is provided for debugging/demo purposes
+* Smaller scope
+
+== Existing approaches to replication in Postgres ==
+
+If any currently used approach to replication can be made to support every
+use-case/feature we need, it likely is not a good idea to implement something
+different. Currently three basic approaches are in use in/around postgres
+today:
+
+. Trigger based
+. Recovery based/Physical footnote:[Often referred to by terms like Hot Standby, Streaming Replication, Point In Time Recovery]
+. Statement based
+
+Statement based replication has obvious and known problems with consistency and
+correctness making it hard to use in the general case so we will not further
+discuss it here.
+
+Let's have a look at the advantages/disadvantages of the other approaches:
+
+=== Trigger based Replication ===
+
+This variant has a multitude of significant advantages:
+
+* implementable in userspace
+* easy to customize
+* just about everything can be made configurable
+* cross version support
+* cross architecture support
+* can feed into systems other than postgres
+* no overhead from writes to non-replicated tables
+* writable standbys
+* mature solutions
+* multimaster implementations possible & existing
+
+But also a number of disadvantages, some of them very hard to solve:
+
+* essentially duplicates the amount of writes (or even more!)
+* synchronous replication hard or impossible to implement
+* noticeable CPU overhead
+** trigger functions
+** text conversion of data
+* complex parts implemented in several solutions
+* not in core
+
+Especially the higher amount of writes might seem easy to solve at a first
+glance but a solution not using a normal transactional table for its log/queue
+has to solve a lot of problems. The major ones are:
+
+* crash safety, restartability & spilling to disk
+* consistency with the commit status of transactions
+* only a minimal amount of synchronous work should be done inside individual
+transactions
+
+In our opinion those problems are restricting progress/wider distribution of
+this class of solutions. It is our aim though that existing solutions in this
+space - most prominently slony and londiste - can benefit from the work we are
+doing & planning to do by incorporating at least parts of the changeset
+generation infrastructure.
+
+=== Recovery based Replication ===
+
+This type of solution, being built into postgres and of increasing popularity,
+has and will have its use cases, and we do not aim to replace it but to
+complement it. We plan to reuse some of the infrastructure and to make it
+possible to mix both modes of replication.
+
+Advantages:
+
+* builtin
+* built on existing infrastructure from crash recovery
+* efficient
+** minimal CPU, 

Re: [HACKERS] recursive view syntax

2012-11-14 Thread Peter Eisentraut
On Tue, 2012-11-13 at 23:44 -0500, Tom Lane wrote:
 Peter Eisentraut pete...@gmx.net writes:
  I noticed we don't implement the recursive view syntax, even though it's
  part of the standard SQL feature set for recursive queries.  Here is a
  patch to add that.
 
 Can't you simplify that by using SELECT * FROM name?

You mean in the expansion?

Maybe, but SELECT * is perhaps something best avoided because of unclear
side effects.
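
To make the question concrete, the new syntax and one possible expansion (the
final SELECT could name the columns explicitly, as below, or use SELECT * as
you suggest):

CREATE RECURSIVE VIEW nums (n) AS
    VALUES (1)
    UNION ALL
    SELECT n + 1 FROM nums WHERE n < 100;

-- one possible expansion:
CREATE VIEW nums AS
    WITH RECURSIVE nums (n) AS (
        VALUES (1)
        UNION ALL
        SELECT n + 1 FROM nums WHERE n < 100
    )
    SELECT n FROM nums;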



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v3

2012-11-14 Thread Michael Paquier
Looks like cool stuff @-@
I might be interested in looking at that a bit, as I think I will hopefully
be able to grab some time in the next couple of weeks.
Are some of those patches already submitted to a CF?
-- 
Michael Paquier
http://michael.otacoo.com


Re: [HACKERS] Materialized views WIP patch

2012-11-14 Thread Alvaro Herrera
Kevin Grittner wrote:

Interesting stuff.

   /*
 +  * SetRelationIsValid
 +  *  Set the value of the relation's relisvalid field in pg_class.
 +  *
 +  * NOTE: caller must be holding an appropriate lock on the relation.
 +  * ShareUpdateExclusiveLock is sufficient.
 +  *
 +  * NOTE: an important side-effect of this operation is that an SI 
 invalidation
 +  * message is sent out to all backends --- including me --- causing plans
 +  * referencing the relation to be rebuilt with the new list of children.
 +  * This must happen even if we find that no change is needed in the pg_class
 +  * row.
 +  */
 + void
 + SetRelationIsValid(Oid relationId, bool relisvalid)
 + {

It's not clear to me that it's right to do this by doing regular heap
updates here instead of heap_inplace_update.  Also, I think this might
end up causing a lot of pg_class tuple churn (at least for matviews that
delete rows at xact end), which would be nice to avoid.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doc patch, put commas in the right place in pg_restore docs

2012-11-14 Thread Peter Eisentraut
On Tue, 2012-10-16 at 21:50 -0500, Karl O. Pinc wrote:
 Simple punctuation change to pg_restore docs.

committed




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-14 Thread Amit Kapila
On Wednesday, November 14, 2012 10:19 PM Fujii Masao wrote:
 On Thu, Nov 15, 2012 at 12:55 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
  Now user can use this utility to decide if new-standby has max LSN
 greater
 
  And that situation can occur when new-standby has startpoint LSN
 greater
  than new-master?
 
 Whether the backup is required has nothing to do with the startpoint.
 The backup is required when the data page in old-master precedes
 the last applied LSN in old-standby (i.e., new-master) at the moment
 of the failover. Without the backup, there is no way to revert the data
 which is ahead of new-master.

Okay. 
So, as Robert and Alvaro suggested, having it as a separate utility rather
than adding options to pg_resetxlog to print the max LSN seems quite
appropriate.
I am planning to update the patch to make it a separate utility,
pg_computemaxlsn, with the same options as I had proposed for pg_resetxlog
to print the max LSN.
Considering it as a separate utility, there are two options:
a. have the utility in contrib.
b. have the utility in bin, similar to pg_resetxlog.

What is the best place to have it?

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move postgresql_fdw_validator into dblink

2012-11-14 Thread Shigeru Hanada
Sorry for long absence.

On Sat, Oct 20, 2012 at 4:24 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 IIRC, the reason why postgresql_fdw instead of pgsql_fdw was
 no other fdw module has shorten naming such as ora_fdw for
 Oracle.
 However, I doubt whether it is enough strong reason to force to
 solve the technical difficulty; naming conflicts with existing user
 visible features.
 Isn't it worth to consider to back to the pgsql_fdw_validator
 naming again?

AFAIR, in the discussion about naming of the new FDW, another
name postgres_fdw was suggested as well as postgresql_fdw, and I
chose the one more familiar to me at that time.  I think that only a few
people feel that postgres is a shortened name of
postgresql.

How about using postgres_fdw for PG-FDW?

Regards,
--
Shigeru HANADA


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] add -Wlogical-op to standard compiler options?

2012-11-14 Thread Peter Eisentraut
I think it might be worth adding -Wlogical-op to the standard warning
options (for supported compilers, determined by configure test).

`-Wlogical-op'
 Warn about suspicious uses of logical operators in expressions.
 This includes using logical operators in contexts where a bit-wise
 operator is likely to be expected.

In addition to what it says there, it appears to warn about illogical
combinations of < and >.  I have been using it locally for a while
without problems.  We have already found a couple of bugs this way.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] feature proposal - triggers by semantics

2012-11-14 Thread Darren Duncan
I have a feature request, which at one level should require little code change, 
but at another level may require more.


Since Postgres 9.3 is going to be doing some significant feature additions for 
triggers, I'd like to see some more.


As they currently exist, triggers always fire based on certain SQL syntax used, 
rather than on the semantics of what is actually going on.


I would like to see a new class of triggers that fire when particular database 
operations happen regardless of what SQL syntax was used.


As a simple example, I'd like to be able to define a trigger like AFTER DELETE 
ON foo FOR EACH ROW and have that trigger be invoked not only by a DELETE on 
foo but also by a TRUNCATE on foo.  So I would like to do some auditing action 
when a row of foo is deleted, no matter how it happens.


The reason this particular example in particular is important is that TRUNCATE 
is documented as a data-manipulation action semantically equivalent to an 
unqualified DELETE in its effects, primarily.  As such, I would expect the same 
triggers to fire as would for an unqualified DELETE.


The reason I propose it be a new kind of trigger is so that then we also retain 
the ability to declare triggers that fire on DELETE and not on TRUNCATE.


Less important, but also nice for the sake of being less verbose, is that said 
trigger could optionally also run when an UPDATE happens, since an UPDATE can be 
considered semantically a DELETE+INSERT.  But adding the TRUNCATE support is most 
important because it simply doesn't exist now, while the UPDATE case you can 
already get just by adding OR UPDATE.


I suggest that the simplest way to add this feature is to just extend the 
existing syntax for defining a FOR EACH ROW so that TRUNCATE is also an option, 
besides INSERT/UPDATE/DELETE.
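

To be concrete, something along these lines (purely a sketch of the proposed 
syntax; row-level TRUNCATE triggers don't exist today, and audit_foo_row_removed() 
is just a stand-in for whatever auditing function one wants):

CREATE TRIGGER audit_foo_removed
    AFTER DELETE OR TRUNCATE ON foo
    FOR EACH ROW
    EXECUTE PROCEDURE audit_foo_row_removed();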


In that case, the semantics of the TRUNCATE statement could be altered as 
follows:  Iff TRUNCATE foo is invoked and foo has a TRUNCATE FOR EACH ROW 
trigger defined on it, then an unqualified DELETE FROM foo will be performed 
instead with its usual semantics.  If such a trigger is not defined on foo, then 
the old TRUNCATE semantics happen.


As such, this case of the feature can be added without breaking anything legacy.

So, I'm partly proposing a specific narrow new feature, TRUNCATE FOR EACH ROW, 
but I'm also proposing the ability to generally define triggers based not on the 
syntax used but the actual action requested.


A tangential feature request is to provide a runtime config option that can 
cause TRUNCATE to always behave as unqualified DELETE FROM regardless of any 
triggers, as if it were just a syntactic shorthand.  Or alternately/also provide 
extra syntax to TRUNCATE itself where one can specify which behavior to have, 
and both options can be given explicitly to override any config option.


-- Darren Duncan



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-14 Thread Amit Kapila
On Tuesday, November 13, 2012 10:17 PM Fujii Masao wrote:
 On Tue, Nov 13, 2012 at 1:23 PM, Amit kapila amit.kap...@huawei.com
 wrote:
  On Monday, November 12, 2012 9:56 PM Alvaro Herrera wrote:
  Robert Haas escribió:
  On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com
 wrote:
 
   I think I can see all of those things being potentially useful.
 There
   are a couple of pending patches that will revise the WAL format

  I wonder if we shouldn't make this a separate utility, rather than
  something that is part of pg_resetxlog.  Anyone have a thought on
 that
  topic?
 
  That thought did cross my mind too.
 
 We might be able to use this utility to decide whether we need to take
 a fresh backup from the master onto the standby, to start old master
 as new standby after failover.
 
 When starting new standby after failover, any data page in the standby
 must
 not precede the master. Otherwise, the standby cannot catch up with the
 master
 consistently. But, the master might write the data page corresponding to
 the WAL which has not been replicated to the standby yet. So, if
 failover happens
 before that WAL has been replicated, the data page in old master would
 precede
 new master (i.e., old standby), and in this case the backup is required.
 OTOH,
 if maximum LSN in data page in the standby is less than the master, the
 backup
 is not required.

When the new standby starts replication (RequestXLogStreaming()), it will
send the startpoint, so in the above scenario won't that startpoint be ahead
of the new master (or the new master won't have that LSN), and replication
will not be established?

So the user may not be able to decide whether an incremental or a full backup
from the new master is needed; is this the case you are trying to point out?

 Without this utility, it's difficult to calculate the maximum LSN of
 data page, so
 basically we needed to take a backup when starting the standby. In the
 future,
 thanks to this utility, we can calculate the maximum LSN, and can skip a
 backup
 if that LSN is less than the master (i.e., last applied LSN, IOW,
 timeline switch LSN).

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Why does delete from table where not exists (select 1 from ... LIMIT 1) perform badly?

2012-11-14 Thread Palle Girgensohn
Hi,

I've read about the reason for this before, but cannot find a reference
to it now.

How come the planner treats

delete from table where not exists(select 1 from table2 where ... LIMIT 1)

so differently, and usually badly, when the LIMIT 1 is there? In older
versions of PostgreSQL, I remember that the effect was the opposite: a
LIMIT 1 would actually perform substantially better. Hence we have old
code (and old habits) where the LIMIT 1 is still used.

Shouldn't the planner really understand that the intention is the same
in these two queries?



-- bad:
DELETE FROM iup_locked_gradings ilg
WHERE NOT EXISTS (
  SELECT 1 FROM iup_locked_subject ils
WHERE ils.locked_gradings_id = ilg.locked_gradings_id
  LIMIT 1
  )
;

-- good:
DELETE FROM iup_locked_gradings ilg
WHERE NOT EXISTS (
  SELECT 1 FROM iup_locked_subject ils
WHERE ils.locked_gradings_id = ilg.locked_gradings_id
  )
;


pp=# begin; explain DELETE FROM iup_locked_gradings ilg WHERE NOT EXISTS
(SELECT 1 FROM iup_locked_subject ils WHERE ils.locked_gradings_id =
ilg.locked_gradings_id LIMIT 1);
BEGIN
  QUERY PLAN

---
 Delete  (cost=0.00..523542963.48 rows=291737 width=6)
   ->  Seq Scan on iup_locked_gradings ilg  (cost=0.00..523542963.48
rows=291737 width=6)
 Filter: (NOT (SubPlan 1))
 SubPlan 1
   ->  Limit  (cost=0.00..897.27 rows=1 width=0)
 ->  Seq Scan on iup_locked_subject ils
(cost=0.00..18842.76 rows=21 width=0)
   Filter: (locked_gradings_id = $0)
(7 rows)


pp=# begin; explain DELETE FROM iup_locked_gradings ilg WHERE NOT EXISTS
(SELECT 1 FROM iup_locked_subject ils WHERE ils.locked_gradings_id =
ilg.locked_gradings_id );
BEGIN
QUERY PLAN

---
 Delete  (cost=31705.39..47934.47 rows=553737 width=12)
   ->  Hash Anti Join  (cost=31705.39..47934.47 rows=553737 width=12)
 Hash Cond: (ilg.locked_gradings_id = ils.locked_gradings_id)
 ->  Seq Scan on iup_locked_gradings ilg  (cost=0.00..6677.44
rows=583474 width=10)
 ->  Hash  (cost=15776.83..15776.83 rows=1226373 width=10)
   ->  Seq Scan on iup_locked_subject ils
(cost=0.00..15776.83 rows=1226373 width=10)
(6 rows)

pp=#










chalmers=# \d iup_locked_gradings
Table public.iup_locked_gradings
Column|  Type   |
Modifiers
--+-+--
 locked_gradings_id   | integer | not null default
nextval('iup_locked_gradings_locked_gradings_id_seq'::regclass)
 type | integer |
 description  | text|
 name | text|
 original_gradings_id | integer |
Indexes:
iup_locked_gradings_pkey PRIMARY KEY, btree (locked_gradings_id),
tablespace opt
Referenced by:
TABLE iup_locked_subject CONSTRAINT
iup_locked_subject_locked_gradings_id_fkey FOREIGN KEY
(locked_gradings_id) REFERENCES iup_locked_gradings(locked_gradings_id)
ON UPDATE CASCADE ON DELETE SET NULL
Tablespace: opt




chalmers=# \d iup_locked_subject
   Table public.iup_locked_subject
   Column|  Type   |
Modifiers
-+-+
 locked_subject_id   | integer | not null default
nextval('iup_locked_subject_locked_subject_id_seq'::regclass)
 name| text| not null
 link_url| text|
 description | text|
 use_measures| boolean | not null default true
 locked_gradings_id  | integer |
 original_subject_id | integer |
 use_fail_warning| boolean | not null default false
Indexes:
iup_locked_subject_pkey PRIMARY KEY, btree (locked_subject_id),
tablespace opt
Foreign-key constraints:
iup_locked_subject_locked_gradings_id_fkey FOREIGN KEY
(locked_gradings_id) REFERENCES iup_locked_gradings(locked_gradings_id)
ON UPDATE CASCADE ON DELETE SET NULL
Referenced by:
Tablespace: opt







signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Etsuro Fujita
I wrote:
  From: Tom Lane [mailto:t...@sss.pgh.pa.us]
 
  I wrote:
   Etsuro Fujita fujita.ets...@lab.ntt.co.jp writes:
   I have a question.  I think it would be also better to extend the syntax
   for the SQL COPY command in the same way, ie,
   COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with
   format 'csv'
 
   Yeah, sure --- that case is already superuser-only, so why not give it
   the option of being a popen instead of just fopen.
 
  BTW, one thought that comes to mind is that such an operation is
  extremely likely to fail under environments such as SELinux.  That's
  not necessarily a reason not to do it, but we should be wary of
  promising that it will work everywhere.  Probably a documentation note
  about this would be enough.
 
 OK  I'll revise the patch.

I've revised the patch.  In this version a user can specify hooks for pre- and
post-processor executables for COPY and \copy in the following way:

$ echo '/bin/gunzip -c $1' > decompress.sh
$ chmod +x decompress.sh

In the case of the COPY command,

postgres=# COPY foo FROM '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz'
WITH (format 'csv');

Also, in the case of the \copy instruction,

postgres=# \copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz'
with (format 'csv')

As shown in the example above, I've assumed that the syntax for this option for
e.g., the COPY command is:

COPY table_name FROM 'progname filename' WITH ...
COPY table_name TO 'progname filename' WITH ...

Here, progname for COPY IN is the user-supplied program that takes filename as
its argument and writes to standard output.  Also, progname for COPY OUT is
the user-supplied program that reads standard input and writes to filename,
taken as its argument.  This keeps the identification and verification of
progname and filename simple.
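
For the TO direction, the invocation would be symmetric; for example, given a
compress.sh wrapper analogous to decompress.sh above, something like:

postgres=# COPY foo TO '/home/pgsql/compress.sh /home/pgsql/foo.csv.gz'
WITH (format 'csv');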

Todo:
* Documentation, including a note about the limitation for environments such
as SELinux mentioned by Tom.
* More tests

Any comments and suggestions are welcomed.

Thanks,

Best regards,
Etsuro Fujita


copy-popen-20121114.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012-11-14 Thread Simon Riggs
On 13 September 2012 10:13, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote:
 I'd like to add the following options to the SQL COPY command and the psql 
 \copy
 instruction:

 * PREPROCESSOR: Specifies the user-supplied program for COPY IN.  The data
 from an input file is preprocessed by the program before the data is loaded 
 into
 a postgres table.
 * POSTPROCESSOR: Specifies the user-supplied program for COPY OUT.  The 
 data
 from a postgres table is postprocessed by the program before the data is 
 stored
 in an output file.

 These options can be specified only when an input or output file is specified.

 These options allow to move data between postgres tables and e.g., compressed
 files or files on a distributed file system such as Hadoop HDFS.

These options look pretty strange to me and I'm not sure they are a good idea.

If we want to read other/complex data, we have Foreign Data Wrappers.

What I think we need is COPY FROM (SELECT). COPY (query) TO
already exists, so this is just the same thing in the other direction.
Once we have a SELECT statement in both directions we can add any user
defined transforms we wish implemented as database functions.

At present we only support INSERT SELECT ... FROM FDW
which means all the optimisations we've put into COPY are useless with
FDWs. So we need a way to speed up loads from other data sources.
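
To sketch what I mean (the FROM-query form below is hypothetical syntax and
the names are made up, just to illustrate the idea):

-- already possible today:
COPY (SELECT id, payload FROM local_table) TO '/tmp/out.csv' WITH (FORMAT csv);

-- the proposed mirror image, loading through the COPY fast path:
COPY target_table FROM (SELECT id, payload FROM some_fdw_table) WITH (FORMAT csv);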

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] binary heap implementation

2012-11-14 Thread Abhijit Menon-Sen
Hi.

There are two or three places in the Postgres source that implement heap
sort (e.g. tuplesort.c, nodeMergeAppend.c), and it's also needed by the
BDR code. It seemed reasonable to factor out the functionality.

I've attached a patch (binaryheap.diff) that contains a straightforward
implementation of a binary heap (originally written by Andres, with a
few bugfixes and cleanups by me).

There doesn't seem to be any place to put unit tests for code like this,
so at Alvaro's suggestion, I have attached a small standalone program I
used for testing (testbinheap.c) and a Makefile. If you copy them both
to src/backend/lib and run make -f Makefile.binheap, it'll build the
test program. It's nothing exciting, just exercises the functions in
various ways.

I've also attached a patch (nodeMergeAppend.diff) that converts the code
in nodeMergeAppend.c to use binaryheap. It passes make check, and also
the following test (which is planned as a merge append):

CREATE OR REPLACE VIEW gen AS SELECT * FROM
generate_series(1, 10) g(i) ORDER BY g.i OFFSET 0;

SELECT * FROM (
SELECT * FROM gen UNION ALL SELECT * FROM gen UNION ALL
SELECT * FROM gen UNION ALL SELECT * FROM gen UNION ALL
SELECT * FROM gen UNION ALL SELECT * FROM gen) s
ORDER BY i OFFSET 100;

Initially there was a slight performance degradation with my patch, but
I sped things up and now my code is at least on par with, and maybe
slightly faster than, the old code. On my laptop, both consistently
take ~2.4s on average to execute the above SELECT.

Comments? Suggestions?

-- Abhijit
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 98ce3d7..327a1bc 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = ilist.o stringinfo.o
+OBJS = ilist.o binaryheap.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk

diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
new file mode 100644
index 000..aebf08f
--- /dev/null
+++ b/src/include/lib/binaryheap.h
@@ -0,0 +1,138 @@
+/*
+ * binaryheap.h
+ *
+ * A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * src/include/lib/binaryheap.h
+ */
+
+#ifndef BINARYHEAP_H
+#define BINARYHEAP_H
+
+/*
+ * This structure represents a single node in a binaryheap, and just
+ * holds two pointers. The heap management code doesn't care what is
+ * stored in a node (in particular, the key or value may be NULL),
+ * only that the comparator function can compare any two nodes.
+ */
+
+typedef struct binaryheap_node
+{
+	void *key;
+	void *value;
+} binaryheap_node;
+
+/*
+ * For a max-heap, the comparator must return:
+ * -1 iff a < b
+ * 0 iff a == b
+ * +1 iff a > b
+ * For a min-heap, the conditions are reversed.
+ */
+typedef int (*binaryheap_comparator)(binaryheap_node *a, binaryheap_node *b);
+
+/*
+ * binaryheap
+ *
+ *		size			how many nodes are currently in nodes
+ *		space			how many nodes can be stored in nodes
+ *		comparator		comparator to define the heap property
+ *		nodes			the first of a list of space nodes
+ */
+
+typedef struct binaryheap
+{
+	size_t size;
+	size_t space;
+	binaryheap_comparator compare;
+	binaryheap_node nodes[1];
+} binaryheap;
+
+/*
+ * binaryheap_allocate
+ *
+ * Returns a pointer to a newly-allocated heap that has the capacity to
+ * store the given number of nodes, with the heap property defined by
+ * the given comparator function.
+ */
+
+binaryheap *
+binaryheap_allocate(size_t capacity, binaryheap_comparator compare);
+
+/*
+ * binaryheap_free
+ *
+ * Releases memory used by the given binaryheap.
+ */
+
+void
+binaryheap_free(binaryheap *heap);
+
+/*
+ * binaryheap_add_unordered
+ *
+ * Adds the given key and value to the end of the heap's list of nodes
+ * in O(1) without preserving the heap property. This is a convenience
+ * to add elements quickly to a new heap. To obtain a valid heap, one
+ * must call binaryheap_build() afterwards.
+ */
+
+void
+binaryheap_add_unordered(binaryheap *heap, void *key, void *value);
+
+/*
+ * binaryheap_build
+ *
+ * Assembles a valid heap in O(n) from the nodes added by
+ * binaryheap_add_unordered(). Not needed otherwise.
+ */
+
+void
+binaryheap_build(binaryheap *heap);
+
+/*
+ * binaryheap_add
+ *
+ * Adds the given key and value to the heap in O(log n), while
+ * preserving the heap property.
+ */
+
+void
+binaryheap_add(binaryheap *heap, void *key, void *value);
+
+/*
+ * binaryheap_first
+ *
+ * Returns a pointer to the first (root, topmost) node in the heap
+ * without modifying the heap. Returns NULL if the heap is empty.
+ * Always O(1).
+ */
+
+binaryheap_node *
+binaryheap_first(binaryheap *heap);
+
+/*
+ * binaryheap_remove_first
+ *
+ * Removes the first (root, topmost) node in the heap and returns a
+ * pointer to it after rebalancing the heap.