date:20110908

On Thu, Sep 8, 2011 at 10:03 PM, Robert Haas  wrote:
> On Thu, Sep 8, 2011 at 6:14 AM, Fujii Masao  wrote:
>> OTOH, new function enables users to monitor the delay as a timestamp.
>> For users, a timestamp is obviously easier to handle than LSN, and the delay
>> as a timestamp is more intuitive. So, I think that it's worth adding
>> something like pg_last_xact_insert_timestamp into core for improvement
>> of user-friendness.
>
> It seems very nice from a usability point of view, but I have to agree
> with Simon's concern about performance.  Actually, as of today,
> WALInsertLock is such a gigantic bottleneck that I suspect the
> overhead of this additional bookkeeping would be completely
> unnoticeable.

The patch I've posted adds one acquisition and release of spinlock per
a commit or abort. But it's done outside of WALInsertLock. So I don't
think that the patch degrades a performance.

> But I'm still reluctant to add more centralized
> spinlocks that everyone has to fight over, having recently put a lot
> of effort into getting rid of some of the ones we've traditionally
> had.

You are against adding new acquisition and release of spinlock itself
even if it has nothing to do with WALInsertLock? The patch uses
XLogCtl->info_lck spinlock to save the last insert timestamp in the
shmem. XLogCtl->info_lck already protects many shmem variables
related to XLOG. So using XLogCtl->info_lck additionally might make
its contention heavier and degrade a performance. If the patch
defines new spinlock and uses it to save the timestamp to avoid
such a contention, you feel satisfied with the patch?

Another idea to avoid spinlock contention is save the timestamp in
PgBackendStatus (which contains information for pg_stat_activity).
This enables us to write and read the timestamp without spinlock.
Comments?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Large C files

On Thu, Sep 8, 2011 at 4:45 PM, Peter Geoghegan  wrote:
> On 8 September 2011 15:43, Robert Haas  wrote:
>> I wouldn't be too enthusiastic about
>> starting a project like this in January, but now seems fine.  A bigger
>> problem is that I don't hear anyone volunteering to do the work.
>
> You seem to have a fairly strong opinion on the xlog.c code. It would
> be useful to hear any preliminary thoughts that you might have on what
> we'd end up with when this refactoring work is finished. If I'm not
> mistaken, you think that it is a good candidate for being refactored
> not so much because of its size, but for other reasons -  could you
> please elaborate on those? In particular, I'd like to know what
> boundaries it is envisaged that the code should be refactored along to
> increase its conceptual integrity, or to better separate concerns. I
> assume that that's the idea, since each new .c file would have to have
> a discrete purpose.

I'm not sure how strong my opinions are, but one thing that I think
could be improved is that StartupXLOG() is really, really long, and it
does a lot of different stuff:

- Read the control file and sanity check it.
- Read the backup label if there is one or otherwise inspect the
control file's checkpoint information.
- Set up for REDO (including Hot Standby) if required.
- Main REDO loop.
- Check whether we reached consistency and/or whether we need a new TLI.
- Prepare to write WAL.
- Post-recovery cleanup.
- Initialize for normal running.

It seems to me that we could improve the readability of that function
by separating out some of the larger chunks of functionality into
their own static functions.  That wouldn't make xlog.c any shorter,
but it would make that function easier to understand.

Another gripe I have is that recoveryStopsHere() has non-obvious side
effects.  Not sure what to do about that, exactly, but I found it
extremely surprising when first picking my way through this code.

One pretty easy thing to pull out of this file wold be all the
user-callable functions.  pg_xlog_recovery_replay_{pause,resume},
pg_is_xlog_recovery_paused, pg_last_xact_replay_timestamp,
pg_is_in_recovery, pg_{start,stop}_backup(), pg_switch_xlog(),
pg_create_restore_point(), pg_current_xlog_{insert_,}location,
pg_last_xlog_{receive,replay}_location(), pg_xlogfile_name_offset(),
pg_xlogfile_name().

Another chunk that seems like it would be pretty simple to separate
out is the checkpoint-related stuff: LogCheckpointStart(),
LogCheckpointEnd(), CreateCheckPoint(), CheckPointGuts(),
RecoveryRestartPoint(), CreateRestartPoint(), KeepLogSeg().  Not a lot
of code, maybe, but it seems clearly distinguishable from what the
rest of the file is about.

I'm sure there are other good ways to do it, too...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-08 Thread Josh Berkus

George,

I'm quoting you here because in the version of your email which got
posted to the list your whole explanation got put below the patch text,
making it hard to find the justification for the patch.  Follows:

> I run a number of postgresql installations on NFS and on the whole I find 
> this to be very reliable.  I have however run into a few issues when there is 
> concurrent writes from multiple processes.
> 
> I see errors such as the following:
> 
> 2011-07-31 22:13:35 EST postgres postgres [local] LOG:  connection 
> authorized: user=postgres database=postgres
> 2011-07-31 22:13:35 ESTERROR:  could not write block 1 of relation 
> global/2671: wrote only 4096 of 8192 bytes
> 2011-07-31 22:13:35 ESTHINT:  Check free disk space.
> 2011-07-31 22:13:35 ESTCONTEXT:  writing block 1 of relation global/2671
> 2011-07-31 22:13:35 EST [unknown] [unknown]  LOG:  connection received: 
> host=[local]
> 
> I have also seen similar errors coming out of the WAL writer, however they 
> occur at the level PANIC, which is a little more drastic.
> 
> After spending some time with debug logging turned on and even more time 
> staring at strace, I believe this occurs when one process was writing to a 
> data file and it received a SIGINT from another process, eg:
> (These logs are from another similar run)
> 
> [pid  1804] <... fsync resumed> )   = 0
> [pid 10198] kill(1804, SIGINT 
> [pid  1804] lseek(3, 4915200, SEEK_SET) = 4915200
> [pid  1804] write(3, 
> "c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0"..., 32768 
> 
> [pid 10198] <... kill resumed> )= 0
> [pid  1804] <... write resumed> )   = 4096
> [pid  1804] --- SIGINT (Interrupt) @ 0 (0) ---
> [pid  1804] rt_sigreturn(0x2)   = 4096
> [pid  1804] write(2, "\0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999"..., 260) = 
> 260
> [pid  1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  
> [pid  1802] <... select resumed> )  = 1 (in [5], left {0, 999000})
> [pid  1804] <... rt_sigprocmask resumed> NULL, 8) = 0
> [pid  1804] tgkill(1804, 1804, SIGABRT) = 0
> [pid  1802] read(5,  
> [pid  1804] --- SIGABRT (Aborted) @ 0 (0) ---
> Process 1804 detached
> 
> After finding this, I came up with the following test case which easily 
> replicated our issue:
> 
> #!/bin/bash
> 
> name=$1
> number=1
> while true; do 
>   /usr/bin/psql -c "CREATE USER \"$name$number\" WITH NOSUPERUSER INHERIT 
> NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';"
>   /usr/bin/createdb -E UNICODE -O $name$number $name$number
>   if `grep -q PANIC /data/postgresql/data/pg_log/*`; then
> exit
>   fi
>   let number=$number+1
> done
> 
> When I run a single copy of this script, I have no issues, however when I 
> start up a few more copies to simultaneously hit the DB, it crashes quiet 
> quickly - usually within 20 or 30 seconds.
> 
> After looking through the code I found that when postgres calls write() it 
> doesn't retry.  In order to address the issue with the PANIC in the WAL 
> writer I set the sync method to o_sync which solved the issue in that part of 
> the code, however I was still seeing failures in other areas of the code 
> (such as the FileWrite function).  Following this, I spoke to an NFS guru who 
> pointed out that writes under linux are not guaranteed to complete unless you 
> open up O_SYNC or similar on the file handle.  I had a look in the libc docs 
> and found this:
> 
> http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html
> 
> 
> The write function writes up to size bytes from buffer to the file with 
> descriptor filedes. The data in buffer is not necessarily a character string 
> and a null character is output like any other character.
> 
> The return value is the number of bytes actually written. This may be size, 
> but can always be smaller. Your program should always call write in a loop, 
> iterating until all the data is written.
> 
> 
> After finding this, I checked a number of other pieces of software that we 
> see no issues with on NFS (such as the JVM) for their usage of write().  I 
> confirmed they write in a while loop and set about patching the postgres 
> source.
> 
> I have made this patch against 8.4.8 and confirmed that it fixes the issue we 
> see on our systems.  I have also checked that make check still passes. 
> 
> As my C is terrible, I would welcome any comments on the implementation of 
> this patch.
> 
> Best regards,
> 
> George


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-08 Thread George Barnett

Hi Hackers,


postgresql-writeall.patch
Description: Binary data

I run a number of postgresql installations on NFS and on the whole I find this 
to be very reliable.  I have however run into a few issues when there is 
concurrent writes from multiple processes.

I see errors such as the following:

2011-07-31 22:13:35 EST postgres postgres [local] LOG:  connection authorized: 
user=postgres database=postgres
2011-07-31 22:13:35 ESTERROR:  could not write block 1 of relation 
global/2671: wrote only 4096 of 8192 bytes
2011-07-31 22:13:35 ESTHINT:  Check free disk space.
2011-07-31 22:13:35 ESTCONTEXT:  writing block 1 of relation global/2671
2011-07-31 22:13:35 EST [unknown] [unknown]  LOG:  connection received: 
host=[local]

I have also seen similar errors coming out of the WAL writer, however they 
occur at the level PANIC, which is a little more drastic.

After spending some time with debug logging turned on and even more time 
staring at strace, I believe this occurs when one process was writing to a data 
file and it received a SIGINT from another process, eg:
(These logs are from another similar run)

[pid  1804] <... fsync resumed> )   = 0
[pid 10198] kill(1804, SIGINT 
[pid  1804] lseek(3, 4915200, SEEK_SET) = 4915200
[pid  1804] write(3, 
"c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0"..., 32768 

[pid 10198] <... kill resumed> )= 0
[pid  1804] <... write resumed> )   = 4096
[pid  1804] --- SIGINT (Interrupt) @ 0 (0) ---
[pid  1804] rt_sigreturn(0x2)   = 4096
[pid  1804] write(2, "\0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999"..., 260) = 260
[pid  1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  
[pid  1802] <... select resumed> )  = 1 (in [5], left {0, 999000})
[pid  1804] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid  1804] tgkill(1804, 1804, SIGABRT) = 0
[pid  1802] read(5,  
[pid  1804] --- SIGABRT (Aborted) @ 0 (0) ---
Process 1804 detached

After finding this, I came up with the following test case which easily 
replicated our issue:

#!/bin/bash

name=$1
number=1
while true; do 
  /usr/bin/psql -c "CREATE USER \"$name$number\" WITH NOSUPERUSER INHERIT 
NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';"
  /usr/bin/createdb -E UNICODE -O $name$number $name$number
  if `grep -q PANIC /data/postgresql/data/pg_log/*`; then
exit
  fi
  let number=$number+1
done

When I run a single copy of this script, I have no issues, however when I start 
up a few more copies to simultaneously hit the DB, it crashes quiet quickly - 
usually within 20 or 30 seconds.

After looking through the code I found that when postgres calls write() it 
doesn't retry.  In order to address the issue with the PANIC in the WAL writer 
I set the sync method to o_sync which solved the issue in that part of the 
code, however I was still seeing failures in other areas of the code (such as 
the FileWrite function).  Following this, I spoke to an NFS guru who pointed 
out that writes under linux are not guaranteed to complete unless you open up 
O_SYNC or similar on the file handle.  I had a look in the libc docs and found 
this:

http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html


The write function writes up to size bytes from buffer to the file with 
descriptor filedes. The data in buffer is not necessarily a character string 
and a null character is output like any other character.

The return value is the number of bytes actually written. This may be size, but 
can always be smaller. Your program should always call write in a loop, 
iterating until all the data is written.


After finding this, I checked a number of other pieces of software that we see 
no issues with on NFS (such as the JVM) for their usage of write().  I 
confirmed they write in a while loop and set about patching the postgres source.

I have made this patch against 8.4.8 and confirmed that it fixes the issue we 
see on our systems.  I have also checked that make check still passes. 

As my C is terrible, I would welcome any comments on the implementation of this 
patch.

Best regards,

George





 
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_last_xact_insert_timestamp

2011-09-08 Thread Chris Redekop

Thanks for all the feedback guys.  Just to throw another monkey wrench in
here - I've been playing with Simon's proposed solution of returning 0 when
the WAL positions match, and I've come to the realizatiion that even if
using pg_last_xact_insert_timestamp, although it would help, we still
wouldn't be able to get a 100% accurate "how far behind?" counternot
that this is a big deal, but I know my ops team is going to bitch to me
about it :).take this situation: there's a lull of 30 seconds where
there are no transactions committed on the serverthe slave is totally
caught up, WAL positions match, I'm reporting 0, everything is happy.  Then
a transaction is committed on the masterbefore the slave gets it my
query hits it and sees that we're 30 seconds behind (when in reality we're
<1sec behind).Because of this affect my graph is a little spikey...I
mean it's not a huge deal or anything - I can put some sanity checking in my
number reporting ("if 1 second ago you were 0 seconds behind, you can't be
more than 1 second behind now" sorta thing).  But if we wanted to go for
super-ideal solution there would be a way to get the timestamp of
pg_stat_replication.replay_location+1 (the first transaction that the slave
does not have).

On Thu, Sep 8, 2011 at 7:03 AM, Robert Haas  wrote:

> On Thu, Sep 8, 2011 at 6:14 AM, Fujii Masao  wrote:
> > OTOH, new function enables users to monitor the delay as a timestamp.
> > For users, a timestamp is obviously easier to handle than LSN, and the
> delay
> > as a timestamp is more intuitive. So, I think that it's worth adding
> > something like pg_last_xact_insert_timestamp into core for improvement
> > of user-friendness.
>
> It seems very nice from a usability point of view, but I have to agree
> with Simon's concern about performance.  Actually, as of today,
> WALInsertLock is such a gigantic bottleneck that I suspect the
> overhead of this additional bookkeeping would be completely
> unnoticeable.  But I'm still reluctant to add more centralized
> spinlocks that everyone has to fight over, having recently put a lot
> of effort into getting rid of some of the ones we've traditionally
> had.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

Re: [HACKERS] Large C files

2011-09-08 Thread Josh Berkus

Simon, Robert, Bruce, Tom,

 >>> Say what?  When else would you have us do it?
>> >
>>> >> When else would you have us develop?
>> >
>> > In my eyes that sort of activity *is* development.  I find the
>> > distinction you are drawing entirely artificial, and more calculated to
>> > make sure refactoring never happens than to add any safety.  Any
>> > significant development change carries a risk of breakage.
> You clearly have the bit between your teeth on this.
> 
> That doesn't make it worthwhile or sensible though.

The above really seems like a non-argument over trivial wording of
stuff.  Can we please give each other the benefit of the doubt and not
read objectionable content into offhand comments?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] concurrent snapshots

2011-09-08 Thread Ants Aasma

On Thu, Sep 8, 2011 at 6:46 PM, Robert Haas  wrote:
> I'm not convinced it's anywhere near that easy.  For one thing, on at
> least one big server I'm playing with, memory latency on shared memory
> is vastly higher (like >10x!) than on backend-local memory due to NUMA
> effects.

I wonder if both the shared mem and non-local memory issue can be
circumvented by using a slru like mechanism as a side channel to
publish taken snapshots and make concurrent xids available with a
sinval/hasmessages like per proc flag in shared memory to notify of
migrated snapshots.

I'll have to think through the space, locking and performance
considerations. That might take a small while though, I just managed
to contract the flu and can't really think straight.

Sorry to waste your time if this whole approach is completely untenable.
It seemed like a interesting topic to sink my teeth in, but in hindsight
seems a bit too much for a first try.

--
Ants Aasma

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Protecting against multiple instances per cluster

2011-09-08 Thread Magnus Hagander

On Thu, Sep 8, 2011 at 20:40, Thom Brown  wrote:
> Hi all,
>
> I've come across a PostgreSQL set up where there are 2 servers, each
> with the same version of PostgreSQL on, both mounting the same SAN
> onto their respective file systems.  It was intended that only 1 of
> the servers would be running an instance of PostgreSQL at a time as
> they both point to the same pgdata.  This was dubbed a "high

> Would there be a way to prevent this abhorrent scenario from coming
> into existence?  One idea is to have a configuration option to be
> strict about the presence of a pid file in the data directory, and
> force manual intervention, but I'm not sure this would solve the
> problem in most cases where this problem exists as someone would have
> had to specifically sought out the option and set it.  It might also
> encourage some to just delete the pid file thinking that would make
> the nasty errors go away.

There are plenty of clustering products out there that are really
designed for one thing pimarily, and that's dealing with this kind of
fencing. The proper solution is to use one of those. There's no way we
can do what's required from inside postgresql, and I see no reason why
we should re-invent generic clustering software. (for example, if you
do this, we can't prevent the two kernels from corrupting the
filesystem on the shared storage, which we rely on working..)

Such software is often marketed as an "easy way to set up high
availability". It's easy to set up. It requires some actual expertise
to set up *right*. But once you've done that, it works well, and it
prevents this kind of situation to happen.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] memory-related bugs

Daniel Farina  writes:
> On Tue, Sep 6, 2011 at 12:00 PM, Tom Lane  wrote:
>> I'm still of the opinion that there's no real need to avoid memcpy with
>> identical source and destination, so I didn't apply this first patch.

> I am leery of memcpy with overlapping regions.  I know that it can
> cause an infinite loop on ssse3 architectures, having to do with some
> backwards-iteration it does:
> https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/609290

The linked page offers no support for your claim of an infinite loop,
and in any case it's discussing a case involving *overlapping* regions,
not *identical* regions.

The reason why this concern is irrelevant for identical regions is that
no matter what order the memcpy routine copies the bytes in, it's
necessarily storing the exact same data into each byte that was there
before.  The only way that strange results could be produced is if the
routine transiently set some byte to a value other than its final value,
which would mean that it must be intending to store to that location
more than once, which would be silly and inefficient.

So I'm not going to believe that there's a problem here without a lot
more rigorous evidence than anyone has offered.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Large C files

2011-09-08 Thread Peter Geoghegan

On 8 September 2011 15:43, Robert Haas  wrote:
> I wouldn't be too enthusiastic about
> starting a project like this in January, but now seems fine.  A bigger
> problem is that I don't hear anyone volunteering to do the work.

You seem to have a fairly strong opinion on the xlog.c code. It would
be useful to hear any preliminary thoughts that you might have on what
we'd end up with when this refactoring work is finished. If I'm not
mistaken, you think that it is a good candidate for being refactored
not so much because of its size, but for other reasons -  could you
please elaborate on those? In particular, I'd like to know what
boundaries it is envisaged that the code should be refactored along to
increase its conceptual integrity, or to better separate concerns. I
assume that that's the idea, since each new .c file would have to have
a discrete purpose.

On 7 September 2011 19:12, Tom Lane  wrote:
> The pgrminclude-induced bug you just fixed shows a concrete way in which
> moving code from one file to another might silently break it, ie, it
> still compiles despite lack of definition of some symbol it's intended
> to see.

Of course, the big unknown here is the C preprocessor. There is
nothing in principle that stops a header file from slightly or utterly
altering the semantics of any file it is included in. That said, it
wouldn't surprise me if the nm-diff shell script I proposed (or a
slight variant intended to catch pgrminclude type bugs) caught many of
those problems in practice.

Attached is simple POC nm-diff.sh, intended to detect
pgrminclude-induced bugs (split c file variant may follow if there is
interest). I tested this on object files compiled before and after the
bug fix to catalog.h in this commit:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=f81fb4f690355bc88fee69624103956fb4576fe5

or rather, I tested things after that commit with and without the
supposedly unneeded header; I removed catalog version differences,
applying Ockham's razor, so there was exactly one difference. That
didn't change anything; I just saw the same thing as before, which
was:

[peter@localhost commands]$ nm-diff.sh tablespace.o tablespace.old.o
Symbol table differences:
50c50
< 05e7 r __func__.15690
---
> 05ef r __func__.15690
WARNING: Symbol tables don't match!

I decided to add a bunch of obviously superfluous headers (a bunch of
xlog stuff) to the same file to verify that they *didn't* change
anything, figuring that it was very unlikely that adding these headers
could *really* change something. However, they did, but I don't think
that that proves that the script is fundamentally flawed (they didn't
*really* change anything):

[peter@localhost commands]$ nm-diff.sh tablespace.o tablespace.old.o
Symbol table differences:
41,50c41,50
< 06f0 r __func__.15719 <-- symbol name differs, offset does not
< 06d0 r __func__.15730
< 06be r __func__.15750
< 06a0 r __func__.15759
< 0680 r __func__.15771
< 0660 r __func__.15793
< 0640 r __func__.15803
< 0620 r __func__.15825
< 0600 r __func__.15886
< 05e7 r __func__.15903  <-- value/local offset the same as before
---
> 06f0 r __func__.15506
> 06d0 r __func__.15517
> 06be r __func__.15537
> 06a0 r __func__.15546
> 0680 r __func__.15558
> 0660 r __func__.15580
> 0640 r __func__.15590
> 0620 r __func__.15612
> 0600 r __func__.15673
> 05ef r __func__.15690  <-- Also, value/local offset the same as 
> before (same inconsistency too)
WARNING: Symbol tables don't match!

My analysis is that the fact that the arbitrary symbol names assigned
within this read-only data section differ likely has no significance
(it looks like the compiler assigns them at an early optimisation
stage like Postgres assigns sequence values, before some are
eliminated at a later stage - I read ELF docs, but couldn't confirm
this, but maybe should have read GCC docs), so the next revision of
this script should simply ignore them using a regex if they look like
this; only the internal offsets/values matter, and they have been
observed to differ based on whether or not the header is included per
Bruce's revision (importantly, the difference in offset/values of the
last __func__ remains just the same regardless of whether the
superfluous xlog headers are included).

Does that seem reasonable? Is there interest in developing this idea further?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

nm-diff.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] memory-related bugs

2011-09-08 Thread Daniel Farina

On Tue, Sep 6, 2011 at 12:00 PM, Tom Lane  wrote:
> [ Sorry for letting this slip through the cracks ... I think I got
>  distracted by collation bugs :-( ]
>
> Noah Misch  writes:
>> On Sat, Mar 12, 2011 at 12:44:29PM -0500, Tom Lane wrote:
>>> Noah Misch  writes:
 A suitably-instrumented run of "make installcheck-world" under valgrind 
 turned
 up a handful of memory-related bugs:
>
>>> Hmm, interesting work, but I don't think I believe in the necessity for
>>> this kluge:
>>>
>> +     else if (attributeName != &(att->attname))
>> +             namestrcpy(&(att->attname), attributeName);
>
> I'm still of the opinion that there's no real need to avoid memcpy with
> identical source and destination, so I didn't apply this first patch.

I am leery of memcpy with overlapping regions.  I know that it can
cause an infinite loop on ssse3 architectures, having to do with some
backwards-iteration it does:

https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/609290

I have spotted this in the wild in PostgreSQL (which is how I happened
to produce this bug report link so readily), yet very rarely; I mailed
a more detailed report to the security list.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [COMMITTERS] pgsql: Add missing format argument to ecpg_log() call

2011-09-08 Thread Peter Eisentraut

On tor, 2011-09-08 at 15:32 -0400, Tom Lane wrote:
> Peter Eisentraut  writes:
> > Add missing format argument to ecpg_log() call
> 
> Oh, fun.  So why isn't there an __attribute__((format...)) decoration
> on ecpg_log()?

I have a larger patch for that.  I just wanted to get the one piece of
obvious breakage backpatched.

I'm also proposing to add -Wmissing-format-attribute to the standard
options set.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-09-08 Thread Alexander Korotkov

Thanks for congratulations!
Tnanks to Heikki for mentoring and his work on patch!

--
With best regards,
Alexander Korotkov.

[HACKERS] pg_dump.c

2011-09-08 Thread Andrew Dunstan



In the "refactoring Large C files" discussion one of the biggest files 
Bruce mentioned is pg_dump.c. There has been discussion in the past of 
turning lots of the knowledge currently embedded in this file into a 
library, which would make it available to other clients (e.g. psql). I'm 
not sure what a reasonable API for that would look like, though. Does 
anyone have any ideas?


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Large C files

Simon Riggs  writes:
> You clearly have the bit between your teeth on this.

Personally, I'm neither intending to break up xlog.c right now, nor
asking you to do it.  I'm just objecting to your claim that there
should be some project-policy restriction on when refactoring gets done.
I do have other refactoring-ish plans in mind, eg
http://archives.postgresql.org/message-id/9232.1315441...@sss.pgh.pa.us
and am not going to be happy if you claim I can't do that now.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Protecting against multiple instances per cluster

Thom Brown  writes:
> I've come across a PostgreSQL set up where there are 2 servers, each
> with the same version of PostgreSQL on, both mounting the same SAN
> onto their respective file systems.  It was intended that only 1 of
> the servers would be running an instance of PostgreSQL at a time as
> they both point to the same pgdata.  This was dubbed a "high
> availability" set up, where if one server went down, they could start
> PostgreSQL on the other.  (yes, I know what you're thinking)

Multiply by about ten and you might have an idea what I'm thinking.

> Now
> normally there is protection against 2 instances running only if the
> instances on the same server as it would reference shared memory.  But
> in this case, neither server has access to the other's shared memory,
> so it has to rely on the pid file.  But the pid file isn't enough by
> itself.

The pid file is not merely "not enough", it's entirely useless, since
the two machines aren't sharing a process ID space.  If somebody starts
a postmaster on machine 2, it will almost certainly see the pid in the
pidfile as not running (on machine 2).  So you have no safety interlock
whatsoever in this configuration.

It is possible to build configurations of this type safely, but you need
some external dead-man-switch or STONITH arrangement that forcibly kills
machine 1 (or at least disconnects it from the SAN) before machine 2 is
allowed to start.  Postgres can't do it on its own, and as you've
undoubtedly already found out, human DBAs can't be trusted to get it
right either.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Protecting against multiple instances per cluster

2011-09-08 Thread Thom Brown

Hi all,

I've come across a PostgreSQL set up where there are 2 servers, each
with the same version of PostgreSQL on, both mounting the same SAN
onto their respective file systems.  It was intended that only 1 of
the servers would be running an instance of PostgreSQL at a time as
they both point to the same pgdata.  This was dubbed a "high
availability" set up, where if one server went down, they could start
PostgreSQL on the other.  (yes, I know what you're thinking)  Now
normally there is protection against 2 instances running only if the
instances on the same server as it would reference shared memory.  But
in this case, neither server has access to the other's shared memory,
so it has to rely on the pid file.  But the pid file isn't enough by
itself.  In this set up, if someone were to inadvertently start up a
Postgres instance on the 2nd server whilst the 1st was still running,
it would do very bad things.

For example, when I set up the same scenario on my own network, it
indeed let me start up the 2nd instance.  I then tried setting up a
table and generating lots of data for it, then... KABOOM:

postgres=# create table things (id serial, things int);
NOTICE:  CREATE TABLE will create implicit sequence "things_id_seq"
for serial column "things.id"
CREATE TABLE
postgres=# insert into things (things) select x from
generate_series(1,50) a(x);
LOG:  could not link file "pg_xlog/xlogtemp.28426" to
"pg_xlog/00010002" (initialization of log file 0,
segment 2): Operation not supported
STATEMENT:  insert into things (things) select x from
generate_series(1,50) a(x);
PANIC:  could not open file "pg_xlog/00010002" (log
file 0, segment 2): No such file or directory
STATEMENT:  insert into things (things) select x from
generate_series(1,50) a(x);
PANIC:  could not open file "pg_xlog/00010002" (log
file 0, segment 2): No such file or directory
PANIC:  could not open file "pg_xlog/00010002" (log
file 0, segment 2): No such file or directory
The connection to the server was lost. Attempting reset: LOG:  server
process (PID 28426) was terminated by signal 6: Abort trap
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
Failed.
!> LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2011-09-08 19:04:47 BST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  consistent recovery state reached at 0/1755268
LOG:  redo starts at 0/1755268
LOG:  could not open file "pg_xlog/00010002" (log file
0, segment 2): No such file or directory
LOG:  redo done at 0/1A8
LOG:  last completed transaction was at log time 2011-09-08 19:05:14.098496+01
LOG:  could not link file "pg_xlog/xlogtemp.28429" to
"pg_xlog/00010002" (initialization of log file 0,
segment 2): Operation not supported
PANIC:  could not open file "pg_xlog/00010002" (log
file 0, segment 2): No such file or directory
LOG:  startup process (PID 28429) was terminated by signal 6: Abort trap
LOG:  aborting startup due to startup process failure

Now obviously no-one should ever set up their system in such a
fashion, but some have, as I've witnessed it.  I suspect this is
potentially the cause of their continued database corruption and
outages, where expected WAL files aren't in the pg_xlog directory, so
recovery can't finish, and clog files going missing etc.

While I appreciate that this isn't a bug, and that no-one should
actually be setting things up in this way, it does introduce the
ability to hose one's own cluster without realising (until it's
possibly too late).

Would there be a way to prevent this abhorrent scenario from coming
into existence?  One idea is to have a configuration option to be
strict about the presence of a pid file in the data directory, and
force manual intervention, but I'm not sure this would solve the
problem in most cases where this problem exists as someone would have
had to specifically sought out the option and set it.  It might also
encourage some to just delete the pid file thinking that would make
the nasty errors go away.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

Re: [HACKERS] WIP: Fast GiST index build

2011-09-08 Thread Oleg Bartunov


My congratulations too, Alexander ! Hope to work on SP-GiST together !

Oleg

On Thu, 8 Sep 2011, Heikki Linnakangas wrote:


On 06.09.2011 01:18, Alexander Korotkov wrote:

Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
doesn't exceed maximal offset number.


I've committed the patch now, including that fix. Thanks for a great GSoC 
project!





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Large C files

2011-09-08 Thread Bruce Momjian

Simon Riggs wrote:
> On Thu, Sep 8, 2011 at 3:25 PM, Tom Lane  wrote:
> > Simon Riggs  writes:
> >> On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane  wrote:
> >>> Simon Riggs  writes:
>  Please lets not waste effort on refactoring efforts in mid dev cycle.
> >
> >>> Say what? ?When else would you have us do it?
> >
> >> When else would you have us develop?
> >
> > In my eyes that sort of activity *is* development. ?I find the
> > distinction you are drawing entirely artificial, and more calculated to
> > make sure refactoring never happens than to add any safety. ?Any
> > significant development change carries a risk of breakage.
> 
> You clearly have the bit between your teeth on this.

So I guess Tom, Robert Haas, and I all have bits then.

> That doesn't make it worthwhile or sensible though.

We are not saying do it now.  We are disputing your suggestion that now
is not a good time --- there is a difference.

> I've offered to do it slowly and carefully over time, but that seems
> not enough for some reason. What is the real reason for this?

Oh, if we told you, then, well, you would know, and the big secret would
be out.  (Hint:  when three people are telling you the same thing, odds
are there is not some secret motive.)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Large C files

2011-09-08 Thread Simon Riggs

On Thu, Sep 8, 2011 at 3:25 PM, Tom Lane  wrote:
> Simon Riggs  writes:
>> On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane  wrote:
>>> Simon Riggs  writes:
 Please lets not waste effort on refactoring efforts in mid dev cycle.
>
>>> Say what?  When else would you have us do it?
>
>> When else would you have us develop?
>
> In my eyes that sort of activity *is* development.  I find the
> distinction you are drawing entirely artificial, and more calculated to
> make sure refactoring never happens than to add any safety.  Any
> significant development change carries a risk of breakage.

You clearly have the bit between your teeth on this.

That doesn't make it worthwhile or sensible though.

I've offered to do it slowly and carefully over time, but that seems
not enough for some reason. What is the real reason for this?

I assume whoever does it will be spending significant time on testing
and bug fixing afterwards. I foresee lots of "while I'm there, I
thought I'd just mend X", so we'll spend lots of time fighting to keep
functionality that's already there. Look at the discussion around
archive_command for an example of that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Fast GiST index build - further improvements

2011-09-08 Thread Heikki Linnakangas

Now that the main GiST index build patch has been committed, there's a 
few further improvements that could make it much faster still:


Better management of the buffer pages on disk. At the moment, the 
temporary file is used as a heap of pages belonging to all the buffers 
in random order. I think we could speed up the algorithm considerably by 
reading/writing the buffer pages sequentially. For example, when an 
internal page is split, and all the tuples in its buffer are relocated, 
that would be a great chance to write the new pages of the new buffers 
in sequential order, instead of writing them back to the pages freed up 
by the original buffer, which can be spread all around the temp file. I 
wonder if we could use a separate file for each buffer? Or at least, a 
separate file for all buffers that are larger than, say 100 MB in size.


Better management of in-memory buffer pages. When we start emptying a 
buffer, we currently flush all the buffer pages in memory to the 
temporary file, to make room for new buffer pages. But that's a waste of 
time, if some of the pages we had in memory belonged to the buffer we're 
about to empty next, or that we empty tuples to. Also, if while emptying 
a buffer, all the tuples go to just one or two lower level buffers, it 
would be beneficial to keep more than one page in-memory for those buffers.


Buffering leaf pages. This I posted on a separate thread already: 
http://archives.postgresql.org/message-id/4e5350db.3060...@enterprisedb.com



Also, at the moment there is one issue with the algorithm that we have 
glossed over this far: For each buffer, we keep some information in 
memory, in the hash table, and in the auxiliary lists. That means that 
the amount of memory needed for the build scales with the size of the 
index. If you're dealing with very large indexes, hopefully you also 
have a lot of RAM in your system, so I don't think this is a problem in 
practice. Still, it would be nice to do something about that. A 
straightforward idea would be to swap some of the information to disk. 
Another idea that, simpler to implement, would be to completely destroy 
a buffer, freeing all the memory it uses, when it becomes completely 
empty. Then, if you're about to run out of memory (as defined by 
maintenance_work_mem), you can empty some low level buffers to disk to 
free up some.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] force_not_null option support for file_fdw

2011-09-08 Thread Kohei Kaigai

Hi Hanada-san.

> ISTM that your results are reasonable for each collation setting.
> Former ordering is same as C locale, and in latter case alphabetical order 
> has priority over case
> distinctions.  Do you mean that ordering used in file_fdw is affected from 
> something unexpected, without
> collation or locale setting?
> 
> BTW, I found a thread which is related to this issue.
>   http://archives.postgresql.org/pgsql-hackers/2011-09/msg00130.php
> 
> I changed the test data so that it uses only upper case alphabets, because 
> case distinction is not
> important for that test.  I also removed digits to avoid test failure in some 
> locales which sort
> alphabets before digits.
>
OK, Thanks for this clarification. This change of regression test case seems to 
me reasonable
to avoid unnecessary false-positive.

I found one other point to be fixed:
On get_force_not_null(), it makes a list of attribute names with force_not_null 
option.

+   foreach (cell, options)
+   {
+   DefElem*def = (DefElem *) lfirst(cell);
+   const char *value = defGetString(def);
+
+   if (strcmp(def->defname, "force_not_null") == 0 &&
+   strcmp(value, "true") == 0)
+   {
+   columns = lappend(columns, makeString(NameStr(attr->attname)));
+   elog(DEBUG1, "%s: force_not_null", NameStr(attr->attname));
+   }
+
+   }

makeString() does not copy the supplied string itself, so it is not preferable 
to reference
NameStr(attr->attname) across ReleaseSysCache().
I'd like to suggest to supply a copied attname using pstrdup for makeString

Thanks,
--
NEC Europe Ltd, SAP Global Competence Center
KaiGai Kohei 


> -Original Message-
> From: Shigeru Hanada [mailto:shigeru.han...@gmail.com]
> Sent: 8. September 2011 06:19
> To: Kohei Kaigai
> Cc: Kohei KaiGai; PostgreSQL-development
> Subject: Re: [HACKERS] force_not_null option support for file_fdw
> 
> (2011/09/05 22:05), Kohei Kaigai wrote:
> >> In my usual environment that test passed, but finally I've reproduced
> >> the failure with setting $LC_COLLATE to "es_ES.UTF-8".  Do you have set 
> >> any $LC_COLLATE in your
> test environment?
> >>
> > It is not set in my environment.
> >
> > I checked the behavior of ORDER BY when we set a collation on the regular 
> > relation, not a foreign
> table.
> > Do we hit something other unexpected bug in collation here?
> >
> > postgres=# CREATE TABLE t1 (word1 text); CREATE TABLE postgres=#
> > INSERT INTO t1 VALUES ('ABC'),('abc'),('123'),('NULL'); INSERT 0 4
> > postgres=# ALTER TABLE t1 ALTER word1 TYPE text COLLATE "ja_JP.utf8";
> > ALTER TABLE postgres=# SELECT * FROM t1 ORDER BY word1;
> >   word1
> > ---
> >   123
> >   ABC
> >   NULL
> >   abc
> > (4 rows)
> >
> > postgres=# ALTER TABLE t1 ALTER word1 TYPE text COLLATE "en_US.utf8";
> > ALTER TABLE postgres=# SELECT * FROM t1 ORDER BY word1;
> >   word1
> > ---
> >   123
> >   abc
> >   ABC
> >   NULL
> > (4 rows)
> 
> Thanks for the checking.  FYI, I mainly use Fedora 15 box with Japanese 
> environment for my development.
> 
> ISTM that your results are reasonable for each collation setting.
> Former ordering is same as C locale, and in latter case alphabetical order 
> has priority over case
> distinctions.  Do you mean that ordering used in file_fdw is affected from 
> something unexpected, without
> collation or locale setting?
> 
> BTW, I found a thread which is related to this issue.
>   http://archives.postgresql.org/pgsql-hackers/2011-09/msg00130.php
> 
> I changed the test data so that it uses only upper case alphabets, because 
> case distinction is not
> important for that test.  I also removed digits to avoid test failure in some 
> locales which sort
> alphabets before digits.
> 
> Regards,
> --
> Shigeru Hanada
> 
> 
>  Click
> https://www.mailcontrol.com/sr/fB6Wmr8zmMzTndxI!oX7Uo9cpkuWnNqkEgc!P6cHvBhGb4EkL1te5Ky38yYzoE4Mra
> 3ljAyIpUlPbv5+FCDqIw==  to report this email as spam.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] concurrent snapshots

Ants Aasma  writes:
> On Thu, Sep 8, 2011 at 5:28 PM, Robert Haas  wrote:
>> 1. A backend can have lots of snapshots, potentially requiring an
>> unbounded amount of shared memory.  We can't accommodate that.

> If PostgreSQL gets POSIX shared memory capability at some point in the
> future, would it be enough to accommodate snapshots in shared memory?

Don't hold your breath.  Even if we moved over to using mmap() in place
of shmat(), it's a *very* long way from there to being able to
dynamically expand shared memory and have all backends able to access
the additional space.  Doing that in a way that's cheap enough and
transparent enough that we'd actually accept it is even more challenging
--- for instance, it seems hard to impossible to guarantee that all
backends would be able to map the added space at the same addresses.

Most of the interest in mmap is not about any dynamic-expansion ideas,
it's just about getting out from under the ridiculously small limits on
SysV shmem space that are still default on many platforms.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] concurrent snapshots

On Thu, Sep 8, 2011 at 11:33 AM, Ants Aasma  wrote:
> On Thu, Sep 8, 2011 at 5:28 PM, Robert Haas  wrote:
>> On Thu, Sep 8, 2011 at 9:26 AM, Ants Aasma  wrote:
>>> When go try to find the new csnmin
>>> and discover that a backend has a csnmin that is too old, we go through
>>> the snapshots of that backend and convert every snapshot under the
>>> desired csnmin to a traditional snapshot.
>>
>> I thought about something along these lines (though I didn't flesh out
>> the details as much as you have here), but rejected it because the
>> step described above would require all snapshots to be stored in
>> shared memory.  That's problematic for several reasons:
>>
>> 1. A backend can have lots of snapshots, potentially requiring an
>> unbounded amount of shared memory.  We can't accommodate that.
>
> If PostgreSQL gets POSIX shared memory capability at some point in the
> future, would it be enough to accommodate snapshots in shared memory?

By itself, that particular change would not help with this problem.

>> 2. You'd need to protect all of those snapshots with spinlocks or
>> something, which would be wicked expensive, because the owning process
>> would need to take and release that spinlock every time it touched the
>> snapshot.
>
> It seems to me that converting a transactions type can be done
> lock-free. The process that does the converting needs to ensure that
> the new transaction type flag is visible before releasing any xids.
> For visibility checking, the additional cost is a read barrier, two
> volatile reads (recheck snapshot type and dense horizon) and occasional
> retry after getting a visibility result.
>
> Maybe I'm missing something. I'll take a deeper look at the snapshot
> handling code tonight to see if anything else might have any
> implications.

I'm not convinced it's anywhere near that easy.  For one thing, on at
least one big server I'm playing with, memory latency on shared memory
is vastly higher (like >10x!) than on backend-local memory due to NUMA
effects.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] concurrent snapshots

2011-09-08 Thread Ants Aasma

On Thu, Sep 8, 2011 at 5:28 PM, Robert Haas  wrote:
> On Thu, Sep 8, 2011 at 9:26 AM, Ants Aasma  wrote:
>> When go try to find the new csnmin
>> and discover that a backend has a csnmin that is too old, we go through
>> the snapshots of that backend and convert every snapshot under the
>> desired csnmin to a traditional snapshot.
>
> I thought about something along these lines (though I didn't flesh out
> the details as much as you have here), but rejected it because the
> step described above would require all snapshots to be stored in
> shared memory.  That's problematic for several reasons:
>
> 1. A backend can have lots of snapshots, potentially requiring an
> unbounded amount of shared memory.  We can't accommodate that.

If PostgreSQL gets POSIX shared memory capability at some point in the
future, would it be enough to accommodate snapshots in shared memory?

> 2. You'd need to protect all of those snapshots with spinlocks or
> something, which would be wicked expensive, because the owning process
> would need to take and release that spinlock every time it touched the
> snapshot.

It seems to me that converting a transactions type can be done
lock-free. The process that does the converting needs to ensure that
the new transaction type flag is visible before releasing any xids.
For visibility checking, the additional cost is a read barrier, two
volatile reads (recheck snapshot type and dense horizon) and occasional
retry after getting a visibility result.

Maybe I'm missing something. I'll take a deeper look at the snapshot
handling code tonight to see if anything else might have any
implications.

--
Ants Aasma

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

daveg  writes:
> On Wed, Sep 07, 2011 at 09:02:04PM -0400, Tom Lane wrote:
>> daveg  writes:
>>> The first version we saw it on was 8.4.7.

>> Yeah, you said that.  I was wondering what you'd last run before 8.4.7.

> Sorry, misunderstood. We were previously running 8.4.4, but have been on 8.4.7
> since shortly after it was released. Prior to that we have had all the major
> and most of the minor releases since 7.1.

Well, I groveled through the commit logs from 8.4.4 to 8.4.7, and
I can't find anything that looks like it could possibly be related.
So at this point I'm inclined to think that the bug is older than
that, but your usage patterns changed so that you started to tickle it.

Can you think of any changes in your usage that might date to around
that time, and would somehow be connected to backend startup/shutdown?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

On Thu, Sep 8, 2011 at 10:59 AM, Heikki Linnakangas
 wrote:
> On 06.09.2011 01:18, Alexander Korotkov wrote:
>>
>> Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
>> doesn't exceed maximal offset number.
>
> I've committed the patch now, including that fix. Thanks for a great GSoC
> project!

Wow, major congrats, Alexander!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-09-08 Thread Heikki Linnakangas


On 06.09.2011 01:18, Alexander Korotkov wrote:

Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
doesn't exceed maximal offset number.


I've committed the patch now, including that fix. Thanks for a great 
GSoC project!


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Large C files

On Thu, Sep 8, 2011 at 10:29 AM, Bruce Momjian  wrote:
> Tom Lane wrote:
>> Simon Riggs  writes:
>> > On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane  wrote:
>> >> Simon Riggs  writes:
>> >>> Please lets not waste effort on refactoring efforts in mid dev cycle.
>>
>> >> Say what?  When else would you have us do it?
>>
>> > When else would you have us develop?
>>
>> In my eyes that sort of activity *is* development.  I find the
>> distinction you are drawing entirely artificial, and more calculated to
>> make sure refactoring never happens than to add any safety.  Any
>> significant development change carries a risk of breakage.
>
> I ran pgrminclude a week ago and that is certainly a larger change than
> this.  Early in the development cycle people are merging in their saved
> patches, so now is a fine time to do refactoring.

+1.

I'd feel more comfortable refactoring it if we had some automated
testing of those code paths, but I don't see anything wrong with doing
it now from a timing perspective.  We still have 4 months until the
start of the final CommitFest.  I wouldn't be too enthusiastic about
starting a project like this in January, but now seems fine.  A bigger
problem is that I don't hear anyone volunteering to do the work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Large C files

2011-09-08 Thread Bruce Momjian

Tom Lane wrote:
> Simon Riggs  writes:
> > On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane  wrote:
> >> Simon Riggs  writes:
> >>> Please lets not waste effort on refactoring efforts in mid dev cycle.
> 
> >> Say what? ?When else would you have us do it?
> 
> > When else would you have us develop?
> 
> In my eyes that sort of activity *is* development.  I find the
> distinction you are drawing entirely artificial, and more calculated to
> make sure refactoring never happens than to add any safety.  Any
> significant development change carries a risk of breakage.

I ran pgrminclude a week ago and that is certainly a larger change than
this.  Early in the development cycle people are merging in their saved
patches, so now is a fine time to do refactoring.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] concurrent snapshots

On Thu, Sep 8, 2011 at 9:26 AM, Ants Aasma  wrote:
> When go try to find the new csnmin
> and discover that a backend has a csnmin that is too old, we go through
> the snapshots of that backend and convert every snapshot under the
> desired csnmin to a traditional snapshot.

I thought about something along these lines (though I didn't flesh out
the details as much as you have here), but rejected it because the
step described above would require all snapshots to be stored in
shared memory.  That's problematic for several reasons:

1. A backend can have lots of snapshots, potentially requiring an
unbounded amount of shared memory.  We can't accommodate that.

2. You'd need to protect all of those snapshots with spinlocks or
something, which would be wicked expensive, because the owning process
would need to take and release that spinlock every time it touched the
snapshot.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Large C files

Simon Riggs  writes:
> On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane  wrote:
>> Simon Riggs  writes:
>>> Please lets not waste effort on refactoring efforts in mid dev cycle.

>> Say what?  When else would you have us do it?

> When else would you have us develop?

In my eyes that sort of activity *is* development.  I find the
distinction you are drawing entirely artificial, and more calculated to
make sure refactoring never happens than to add any safety.  Any
significant development change carries a risk of breakage.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] cheaper snapshots redux

On Tue, Sep 6, 2011 at 11:06 PM, Amit Kapila  wrote:
> 1. With the above, you want to reduce/remove the concurrency issue between
> the GetSnapshotData() [used at begining of sql command execution] and
> ProcArrayEndTransaction() [used at end transaction]. The concurrency issue
> is mainly ProcArrayLock which is taken by GetSnapshotData() in Shared mode
> and by ProcArrayEndTransaction() in X mode.
> There may be other instances for similar thing, but this the main thing
> which you want to resolve.

Yep.

> 2. You want to resolve it by using ring buffer such that readers don't need
> to take any lock.

Yep.  Actually, they're still going to need some spinlocks at least in
the first go round, to protect the pointers.  I'm hoping those can
eventually be eliminated on machines with 8-byte atomic reads using
appropriate memory barrier primitives.

> 1. 2 Writers; Won't 2 different sessions who try to commit at same time will
> get the same write pointer.
>    I assume it will be protected as even indicated in one of your replies
> as I understood?

Yes, commits have to be serialized.  No way around that.  The best
we'll ever be able to do is shorten the critical section.

> 2. 1 Reader, 1 Writter; It might be case that some body has written a new
> snapshot and advanced the stop pointer and at that point of time one reader
> came and read between start pointer and stop pointer. Now the reader will
> see as follows:
>   snapshot, few XIDs, snapshot
>
>    So will it handle this situation such that it will only read latest
> snapshot?

In my prototype implementation that can't happen because the start and
stop pointers are protected by a single spinlock and are moved
simultaneously.  But I think we can get rid on machines with 8-byte
atomic writes of that and just move the stop pointer first and then
the start pointer.  If you get more than one snapshot in the middle
you just ignore the first part of the data you read and start with the
beginning of the last snapshot.

> 3. How will you detect overwrite.

If the write pointer is greater than the start pointer by more than
the ring size, you've wrapped.

> 4. Won't it effect if we don't update xmin everytime and just noting the
> committed XIDs. The reason I am asking is that it is used in tuple
> visibility check
>    so with new idea in some cases instead of just returning from begining
> by checking xmin it has to go through the committed XID list.
>    I understand that there may be less cases or the improvement by your
> idea can supesede this minimal effect. However some cases can be defeated.

The snapshot xmin has to be up to date.  I'm not planning to break
that because it would be wrong.

RecentGlobalXmin doesn't need to be completely up to date, and in fact
recomputing it on every snapshot becomes prohibitively expensive with
this approach.  I'm still struggling with the best way to handle that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] postgresql.conf archive_command example

2011-09-08 Thread Kevin Grittner

Aidan Van Dyk  wrote:
 
> If you're copying a file into the archive, and making it appear
> non-atomically in your archive, your doing something wrong.
> 
> Period.
> 
> No excuses.
 
+1
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] concurrent snapshots

2011-09-08 Thread Ants Aasma

Hi,

I have been thinking about how to handle long running transactions with
Robert’s commit sequence number (CSN) idea.
http://archives.postgresql.org/message-id/CA%2BTgmoaAjiq%3Dd%3DkYt3qNj%2BUvi%2BMB-aRovCwr75Ca9egx-Ks9Ag%40mail.gmail.com

I just started to go through transaction management code, so I would
appreciate if anyone could point out any gaping holes in my ideas.

There are two reasons why XID-to-CSN values need to be kept around. One
is that long running transactions need a place to publish their CSN for
CSN based snapshots. The other is that CSN based snapshots need to know
which transactions between xmin and xmax are visible and which are
invisible.

My idea is to use hybrid storage for the XID-to-CSN storage. For recent
transactions shared memory contains a dense XID indexed ring buffer of
CSN values. Head of the buffer moves with nextXid. The tail is also
moved on xid assignment, but done in larger chunks to reduce lock
contention. When moving the tail up, any XID that might still be
invisible to a snapshot is moved to a sparse buffer. To keep the size of
the sparse buffer within limits, too old snapshots are converted into
list-of-running-xids snapshots. This allows us to move up the visible to
all csn-based snapshots horizon (csnmin) and release xids below that
horizon from the sparse buffer. CSNs themselves can be uint32s, this
means that no transaction can be older than 2 billion writing
transactions, which is already enforced by xids being 32 bit values.

Taking a snapshot using this idea would consist of a Xmin, Xmax and CSN.
Snapshot Xmin guarantees that no xid below Xmin can have a CSN bigger
than the snapshots. One way to get Xmin would be to maintain a global
value when doing the tail update of the dense array. Snapshot Xmax
guarantees that no xid above Xmax can have a CSN less than the
snapshots. Xmax can be easily obtained from nextXid. CSN is simply the
current value of the CSN counter. After the snapshot is taken, if
necessary the csnmin of the proc needs to be updated and snapshot added
to the list of acquired snapshots.

All of the guarantees of taking a snapshot can be maintained without
global locking if we have atomic reads/writes of CSN and XID values. We
need to obtain them in the order Xmin -> CSN -> Xmax, inserting read
barriers between the loads. Write side ordering is guaranteed by the CSN
lock. Updating the proc entry can be done under the per proc lock.

To check visibility of a xid between snapshots xmin and xmax, we first
check if its above the dense xid array tail. If it is in dense, just
read the CSN value and compare it to the snapshot value. If xid falls
under the ringbuffer tail, go through the sparse buffer. If the xid is
found in the sparse buffer, just read the CSN value and compare. If it
is not found, then it is guaranteed to be completed before the snapshot,
just read the clog status for the transaction. This is all done without
locking. When looking something up from the dense array, we have to
issue a read barrier and recheck the tail location to ensure that it
hasn’t moved out from under us. If it has, retry.

To commit a transaction we take a lock on the CSN, stamp all our XIDs
with the value, update clog, a write barrier to ensure that lock free
readers see the changes, increment the csn and relase the lock. For each
of our XIDs we need to check if it is above or below the dense
ringbuffer tail and update the respective value.

Moving of the dense ringbuffer tail should be done when assigning XIDs.
This allows us to enforce a hard limit on the size of the ringbuffer. To
reduce contention on the CSN lock, this should be done in large
increments. 1/8 of the ringbuffer at a time seems like a good compromise
between frequency and buffer space efficiency.

To move the tail, we first find the global minimum value of snapshot
CSNs (csnmin). If every proc maintains its own csnmin under per proc
lock, we can just get the current csn and go through all the procs, take
the per proc lock and fetch the csnmin to get the global value. Then we
go through XIDs between old tail and new tail and find all that are
still running or not visible to csnmin. We insert the XIDs into the
sparse buffer, issue a write barrier so they are visible everyone and
then update the tail location. Because we are trawling through the
procarray here anyway, its also a good time to update global xmin.

Because we want to avoid unbounded growth of the sparse buffer, we need
to get rid of old CSN based snapshots. Tying this into XID assignment is
a good way to get some provable limits, if subtransactions can be
overflowed somehow. For instance, we can keep track of the CSN during
last few tail updates, and convert any too old snapshots - this puts the
cap on the sparse buffer at O(num_backends*non-overflowed-sub-txs +
n_steps*tail-update-step) entries. When go try to find the new csnmin
and discover that a backend has a csnmin that is too old, we go through
the snapshots of that backen

Re: [HACKERS] postgresql.conf archive_command example

2011-09-08 Thread Aidan Van Dyk

On Thu, Sep 8, 2011 at 2:05 AM, Fujii Masao  wrote:

> That's an option. But I don't think that finding an existing file is so 
> serious
> problem. The most common cases which cause a partially-filled archived
> file are;
>
> 1. The server crashes while WAL file is being archived, and then the server
>    restarts. In this case, the restarted server would find partially-filled
>    archived file.
>
> 2. In replication environment, the master crashes while WAL file is being
>    archived, and then a failover happens. In this case, new master would
>    find partially-filled archived file.
>
> In these cases, I don't think it's so unsafe to overwrite an existing file.

Personally, I think both of these show examples of why PG should be
looking hard at either providing a simple robust local directory based
archive_command, or very seriously pointing users at properly written
tools like omniptr, or ptrtools, walmgr, etc...

Neither of those cases should ever happen.  If you're copying a file
into the archive, and making it appear non-atomically in your archive,
your doing something wrong.

Period.

No excuses.

a.
-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_last_xact_insert_timestamp

On Thu, Sep 8, 2011 at 6:14 AM, Fujii Masao  wrote:
> OTOH, new function enables users to monitor the delay as a timestamp.
> For users, a timestamp is obviously easier to handle than LSN, and the delay
> as a timestamp is more intuitive. So, I think that it's worth adding
> something like pg_last_xact_insert_timestamp into core for improvement
> of user-friendness.

It seems very nice from a usability point of view, but I have to agree
with Simon's concern about performance.  Actually, as of today,
WALInsertLock is such a gigantic bottleneck that I suspect the
overhead of this additional bookkeeping would be completely
unnoticeable.  But I'm still reluctant to add more centralized
spinlocks that everyone has to fight over, having recently put a lot
of effort into getting rid of some of the ones we've traditionally
had.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH] Don't truncate integer part in to_char for 'FM99.'

2011-09-08 Thread Marti Raudsepp

On Wed, Sep 7, 2011 at 23:48, Tom Lane  wrote:
> Also, the way yours is set up, I'm dubious that it
> does the right thing when the last '0' specifier is to the left of the
> decimal point.

When the last '0' is left of the decimal point, Num->zero_end is set
to 0, so the branch dealing with that is never executed.

> I'm currently testing this patch:

Looks good to me. It's potentially less efficient, but I'm not worried
about that.

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] EXPLAIN and nfiltered, take two

2011-09-08 Thread Marko Tiikkaja


Hi,

Here's a patch for $SUBJECT based on the feedback here:
http://archives.postgresql.org/message-id/9053.1295888...@sss.pgh.pa.us

I intentionally decided to omit the information for Join Filter, since 
the information can already be deduced from EXPLAIN ANALYZE output, and 
for Left Joins and Anti Joins "Rows Removed by Join Filter" didn't 
really make much sense.


The "Rows Removed by .." information is always shown by default (when 
there is a Filter or Recheck Cond, of course), and I didn't feel like it 
was worth it to add a new option for EXPLAIN to turn that information off.


As for documentation..  I'm really at a loss here.  I tried numerous 
different things for doc/src/sgml/perform.sgml, but I didn't feel like 
any of them added anything.  The EXPLAIN ANALYZE output seems quite 
self-explanatory after all.


Attached are also the tests I used, and an example output.


--
Marko Tiikkajahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
BEGIN
CREATE TABLE
CREATE TABLE
CREATE TABLE
INSERT 0 20
INSERT 0 10
INSERT 0 10
CREATE INDEX
CREATE SERVER
CREATE FOREIGN TABLE
CREATE FUNCTION
SET
SET
SET
SET
SET
SET
SET
SET
SET
   QUERY PLAN   
 
-
 Index Scan using foo_a_index on big_table  (cost=0.00..2588.27 rows=333 
width=8) (actual time=1.488..47.456 rows=65000 loops=1)
   Index Cond: (a = 2)
   Filter: (b > 5000)
   Rows Removed by Filter: 1667
 Total runtime: 57.127 ms
(5 rows)

SET
SET
SET
  QUERY PLAN
  
--
 Bitmap Heap Scan on big_table  (cost=19.86..961.93 rows=333 width=8) (actual 
time=13.673..67.979 rows=65000 loops=1)
   Recheck Cond: (a = 2)
   Rows Removed by Recheck Cond: 62526
   Filter: (b > 5000)
   Rows Removed by Filter: 1667
   ->  Bitmap Index Scan on foo_a_index  (cost=0.00..19.77 rows=1000 width=0) 
(actual time=12.309..12.309 rows=7 loops=1)
 Index Cond: (a = 2)
 Total runtime: 78.757 ms
(8 rows)

RESET
SET
SET
   QUERY PLAN   


 Seq Scan on foo  (cost=0.00..36.75 rows=713 width=8) (actual time=0.008..0.010 
rows=5 loops=1)
   Filter: (a > 5)
   Rows Removed by Filter: 5
 Total runtime: 0.031 ms
(4 rows)

  QUERY PLAN
   
---
 Values Scan on "*VALUES*"  (cost=0.00..0.15 rows=3 width=4) (actual 
time=0.021..0.080 rows=5 loops=1)
   Filter: (column1 > 5)
   Rows Removed by Filter: 5
 Total runtime: 0.189 ms
(4 rows)

QUERY PLAN  
  
--
 CTE Scan on t  (cost=3279.93..4052.35 rows=11443 width=4) (actual 
time=0.020..0.033 rows=5 loops=1)
   Filter: (a > 5)
   Rows Removed by Filter: 5
   CTE t
 ->  Recursive Union  (cost=0.00..3279.93 rows=34330 width=4) (actual 
time=0.009..0.025 rows=10 loops=1)
   ->  Function Scan on generate_series i  (cost=0.00..10.00 rows=1000 
width=4) (actual time=0.009..0.009 rows=5 loops=1)
   ->  WorkTable Scan on t  (cost=0.00..258.33 rows= width=4) 
(actual time=0.002..0.003 rows=2 loops=2)
 Filter: ((a + 5) <= 10)
 Rows Removed by Filter: 2
 Total runtime: 0.080 ms
(10 rows)

  QUERY PLAN
  
--
 Subquery Scan on ss  (cost=0.00..58.15 rows=713 width=8) (actual 
time=0.008..0.010 rows=5 loops=1)
   Filter: (ss.i > 5)
   Rows Removed by Filter: 5
   ->  Limit  (cost=0.00..31.40 rows=2140 width=8) (actual time=0.004..0.004 
rows=10 loops=1)
 ->  Seq Scan on foo  (cost=0.00..31.40 rows=2140 width=8) (actual 
time=0.003..0.003 rows=10 loops=1)
 Total runtime: 0.056 ms
(6 rows)

QUERY PLAN  
   
---
 Function Scan on generate_series i  (cost=0.00..12.50 ro

Re: [HACKERS] Back branch update releases this week; beta postponed

2011-09-08 Thread Simon Riggs

On Wed, Sep 7, 2011 at 2:43 AM, Bruce Momjian  wrote:
> Tom Lane wrote:
>> Bernd Helmle  writes:
>> > --On 10. April 2011 13:53:52 -0400 Tom Lane  wrote:
>> >> The core team has therefore decided to wrap back-branch
>> >> update releases this Thursday for release Monday 4/18.
>>
>> > Hmm, I would like to see the patch for
>> > 
>> > going in for 8.4.8.
>>
>> Simon, was there a reason you only back-patched that to 9.0?
>
> Was this addressed?

Fix applied, following positive feedback in production testing.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_last_xact_insert_timestamp

On Thu, Sep 8, 2011 at 5:55 PM, Simon Riggs  wrote:
> On Thu, Sep 8, 2011 at 9:36 AM, Fujii Masao  wrote:
>> The above has been posted in pgsql-general. The reason why Chris thinks
>> a counterpart of pg_last_xact_replay_timestamp() is required sounds
>> reasonable to me. So I'd like to propose new function
>> "pg_last_xact_insert_timestamp" as the counterpart, which returns the
>> timestamp of the last inserted commit or abort WAL record. I'll add the
>> attached patch into the next CF.
>
> For reasons stated on the original thread, I don't think we need this.
>
> The OP can calculate what he wants without this.
>
> The code already exists in recovery and has some purpose there. Adding
> similar code to the mainline just so somebody can run this function
> occasionally is not good. Based on this I will be looking to see if we
> can optimise the recovery path some more.

Okay. Let me explain another use case of pg_last_xact_insert_timestamp().
The existing functions might be enough for checking whether the standby
has already caught up with the master. But I think that new function would be
very useful to calculate *how much* the standby is behind from the master.

Of course, we can do that by using the existing functions. But a problem is
that they return LSN and the calculated delay is represented as the size of
WAL. For users, it's not easy to handle LSN (e.g., there is no function to do
calculation of LSN), and the delay in WAL size is not intuitive. I've sometimes
received such a complaint.

OTOH, new function enables users to monitor the delay as a timestamp.
For users, a timestamp is obviously easier to handle than LSN, and the delay
as a timestamp is more intuitive. So, I think that it's worth adding
something like pg_last_xact_insert_timestamp into core for improvement
of user-friendness.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_last_xact_insert_timestamp

2011-09-08 Thread Simon Riggs

On Thu, Sep 8, 2011 at 9:36 AM, Fujii Masao  wrote:
> On Thu, Sep 8, 2011 at 7:06 AM, Chris Redekop  wrote:
>> Is there anything available to get the last time a transaction
>> occurred?like say "pg_last_xact_timestamp"?  In order to accurately
>> calculate how far behind my slave is I need to do something like
>> master::pg_last_xact_timestamp() -
>> slave::pg_last_xact_replay_timestamp()currently I'm using now() instead
>> of the pg_last_xact_timestamp() call, but then when the master is not busy
>> the slave appears to lag behind.  I'm considering writing a C module to get
>> the last modified file time of the xlog, but I'm hoping there is a better
>> alternative that I haven't found yet
>
> The above has been posted in pgsql-general. The reason why Chris thinks
> a counterpart of pg_last_xact_replay_timestamp() is required sounds
> reasonable to me. So I'd like to propose new function
> "pg_last_xact_insert_timestamp" as the counterpart, which returns the
> timestamp of the last inserted commit or abort WAL record. I'll add the
> attached patch into the next CF.

For reasons stated on the original thread, I don't think we need this.

The OP can calculate what he wants without this.

The code already exists in recovery and has some purpose there. Adding
similar code to the mainline just so somebody can run this function
occasionally is not good. Based on this I will be looking to see if we
can optimise the recovery path some more.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] pg_last_xact_insert_timestamp

On Thu, Sep 8, 2011 at 7:06 AM, Chris Redekop  wrote:
> Is there anything available to get the last time a transaction
> occurred?like say "pg_last_xact_timestamp"?  In order to accurately
> calculate how far behind my slave is I need to do something like
> master::pg_last_xact_timestamp() -
> slave::pg_last_xact_replay_timestamp()currently I'm using now() instead
> of the pg_last_xact_timestamp() call, but then when the master is not busy
> the slave appears to lag behind.  I'm considering writing a C module to get
> the last modified file time of the xlog, but I'm hoping there is a better
> alternative that I haven't found yet

The above has been posted in pgsql-general. The reason why Chris thinks
a counterpart of pg_last_xact_replay_timestamp() is required sounds
reasonable to me. So I'd like to propose new function
"pg_last_xact_insert_timestamp" as the counterpart, which returns the
timestamp of the last inserted commit or abort WAL record. I'll add the
attached patch into the next CF.

Comments?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***
*** 13996,14001  SELECT set_config('log_statement_stats', 'off', false);
--- 13996,14004 
  pg_current_xlog_location
 
 
+ pg_last_xact_insert_timestamp
+
+
  pg_start_backup
 
 
***
*** 14049,14054  SELECT set_config('log_statement_stats', 'off', false);
--- 14052,14064 


 
+ pg_last_xact_insert_timestamp()
+ 
+timestamp with time zone
+Get current transaction log insert time stamp
+   
+   
+
  pg_start_backup(label text , fast boolean )
  
 text
***
*** 14175,14180  postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
--- 14185,14199 
 
  
 
+ pg_last_xact_insert_timestamp displays the time stamp of last inserted
+ transaction. This is the time at which the commit or abort WAL record for that transaction.
+ This is initialized with the time stamp of last transaction replayed during recovery (i.e.,
+ the return value of pg_last_xact_replay_timestamp). If no transactions
+ have been replayed during recovery, pg_last_xact_insert_timestamp
+ returns NULL until at least one commit or abort WAL record has been inserted.
+
+ 
+
  For details about proper usage of these functions, see
  .
 
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***
*** 867,872  primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 867,881 
   ps command (see  for details).
  
  
+  You can also calculate the lag in time stamp by comparing the last
+  WAL insert time stamp on the primary with the last WAL replay
+  time stamp on the standby. They can be retrieved using
+  pg_last_xact_insert_timestamp on the primary and
+  the pg_last_xact_replay_timestamp on the standby,
+  respectively (see  and
+   for details).
+ 
+ 
   You can retrieve a list of WAL sender processes via the
   
   pg_stat_replication view. Large differences between
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***
*** 1041,1046  RecordTransactionCommit(void)
--- 1041,1049 
  			rdata[lastrdata].next = NULL;
  
  			(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata);
+ 
+ 			/* Save timestamp of latest transaction commit record */
+ 			SetLastInsertXTime(xlrec.xact_time);
  		}
  		else
  		{
***
*** 1064,1069  RecordTransactionCommit(void)
--- 1067,1075 
  			rdata[lastrdata].next = NULL;
  
  			(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_COMPACT, rdata);
+ 
+ 			/* Save timestamp of latest transaction commit record */
+ 			SetLastInsertXTime(xlrec.xact_time);
  		}
  	}
  
***
*** 1433,1438  RecordTransactionAbort(bool isSubXact)
--- 1439,1447 
  
  	(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT, rdata);
  
+ 	/* Save timestamp of latest transaction abort record */
+ 	SetLastInsertXTime(xlrec.xact_time);
+ 
  	/*
  	 * Report the latest async abort LSN, so that the WAL writer knows to
  	 * flush this abort. There's nothing to be gained by delaying this, since
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 391,396  typedef struct XLogCtlData
--- 391,397 
  	XLogRecPtr	asyncXactLSN;	/* LSN of newest async commit/abort */
  	uint32		lastRemovedLog; /* latest removed/recycled XLOG segment */
  	uint32		lastRemovedSeg;
+ 	TimestampTz lastInsertXTime;	/* timestamp of last COMMIT/ABORT record inserted */
  
  	/* Protected by WALWriteLock: */
  	XLogCtlWrite Write;
***
*** 608,613  static bool recoveryStopsHere(XLogRecord *record

Re: [HACKERS] postgresql.conf archive_command example