Re: [HACKERS] Unreproducible bug in snapshot import code

2011-10-25 Thread Fujii Masao
On Tue, Oct 25, 2011 at 2:41 AM, Gurjeet Singh singh.gurj...@gmail.com wrote:
 On Mon, Oct 24, 2011 at 1:08 PM, Alvaro Herrera alvhe...@commandprompt.com
 wrote:

 Excerpts from Gurjeet Singh's message of Mon Oct 24 13:55:44 -0300 2011:

      Starting from line 89 of terminal1.txt we see this snippet:
 
  postgres=# rollback;
  ROLLBACK
  postgres=# begin TRANSACTION ;
  BEGIN
  postgres=# set transaction snapshot '02C8-1';
  ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
  postgres=# rollback;
  ROLLBACK
 
 
      As you can see, the SET TRANSACTION SNAPSHOT was the first statement in
  that transaction, and yet the ERROR message says that it is not.

 Maybe the tab-completion feature issued a query before the set
 transaction command.

 I have tried reproducing the bug starting from one and two transactions
 before the one shown in the snippet, and I used tab-completion to get the
 same screen output as terminal1.txt, and yet it's not reproducible.

I could reproduce it when I typed TAB just after typing "set" in "set
transaction snapshot". As Tom and Alvaro pointed out, tab completion issues
a query, which then prevents the SET TRANSACTION SNAPSHOT command.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-25 Thread Wolfgang Wilhelm
Hello,

My experience is that as soon as index-only scans are available they are
used, sometimes just because of the simple logic that a user thinks they are
faster. Even when the index is made ridiculously wide just to have all the
needed info in the index...
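
To illustrate the kind of index meant, a small sketch with hypothetical
table and column names; the index is widened solely so that every column a
query touches is present in it:

CREATE TABLE orders (
    id          int PRIMARY KEY,
    customer_id int,
    status      text,
    total       numeric
);
-- wide index whose only purpose is to let queries avoid the heap entirely
CREATE INDEX orders_wide_idx ON orders (customer_id, status, total, id);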

Regards
Wolfgang Wilhelm




From: Tom Lane t...@sss.pgh.pa.us
To: Robert Haas robertmh...@gmail.com
Cc: Kevin Grittner kevin.gritt...@wicourts.gov; pgsql-hackers@postgresql.org
Sent: Monday, 24 October 2011, 21:35
Subject: Re: [HACKERS] So, is COUNT(*) fast now? 

Robert Haas robertmh...@gmail.com writes:
 But even though Tom's statement that most indexes are one column might
 be a slight exaggeration, I suspect it probably is true that the
 optimizations he's talking about for large numbers of columns won't
 produce any material benefit even for a 3 or 4 column index.  Which
 makes me think maybe we should focus our efforts elsewhere.

Right.  If we thought the average was something like ten, it might be
worth pursuing optimizations similar to slot_getallattrs.  If it's
around two or three, almost certainly not.
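
For what it's worth, the actual distribution is easy to check on a given
database by querying pg_index.indnatts (the number of key columns in each
index); a minimal sketch:

SELECT indnatts, count(*)
FROM pg_index
GROUP BY indnatts
ORDER BY indnatts;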

Your point about people trying to create wider indexes to exploit
index-only scans is an interesting one, but I think it's premature to
optimize on the basis of hypotheses about what people might do in
future.

Not sure about your other idea of returning multiple tuples per
amgettuple call.  The trouble with that is that it will add complexity
(and hence cycles) at the nodeIndexscan level, because now nodeIndexscan
will have to buffer those tuples, keep track of whether it's fetching
forward or backward, etc etc.  Plus another layer of the same in
indexam.c (index_getnext etc).  I'm not at all convinced that it's
likely to be a net win.

I wonder how trustworthy the measure of the visibilitymap_test call site
as a consumer of cycles really is.  I've frequently noticed that
oprofile blames remarkably large fractions of the runtime on individual
statements that appear to be quite trivial.  I'm not sure if that
represents real hardware-level effects such as cache line switching,
or whether it's just measurement artifacts.  Keep in mind that
sampling-based measurements are always subject to sampling artifacts.

            regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Heikki Linnakangas

On 25.10.2011 08:12, Fujii Masao wrote:

On Tue, Oct 25, 2011 at 12:24 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 24.10.2011 15:29, Fujii Masao wrote:


+<listitem>
+<para>
+  Copy the pg_control file from the cluster directory to the global
+  sub-directory of the backup. For example:
+<programlisting>
+ cp $PGDATA/global/pg_control /mnt/server/backupdir/global
+</programlisting>
+</para>
+</listitem>


Why is this step required? The control file is overwritten by information
from the backup_label anyway, no?


Yes, when recovery starts, the control file is overwritten. But before that,
we retrieve the minimum recovery point from the control file. Then it's used
as the backup end location.

During recovery, pg_stop_backup() cannot write an end-of-backup record.
So, in a standby-only backup, another way to retrieve the backup end location
(instead of an end-of-backup record) is required. Ishiduka-san used the
control file for that, following your suggestion ;)
http://archives.postgresql.org/pgsql-hackers/2011-05/msg01405.php


Oh :-)


+<para>
+  Again connect to the database as a superuser, and execute
+  <function>pg_stop_backup</>. This terminates the backup mode but,
+  unlike during normal processing, does not perform a switch to the
+  next WAL segment, create a backup history file, or wait for all
+  required WAL segments to be archived.
+</para>
+</listitem>


How do you ensure that all the required WAL segments have been archived,
then?


The patch doesn't provide any capability to ensure that; IOW, it assumes
that's the user's responsibility. If a user wants to ensure it, he/she needs
to calculate the backup start and end WAL files from the results of
pg_start_backup() and pg_stop_backup() respectively, and wait until those
files have appeared in the archive. Also, if a required WAL file has not been
archived yet, the user might need to execute pg_switch_xlog() on the master.
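
To make that concrete, a sketch of the manual calculation using existing
functions (the WAL location is illustrative; pg_xlogfile_name() cannot be
executed during recovery, so the conversion would have to be done on the
master):

-- note the locations returned when starting and stopping the backup
SELECT pg_start_backup('standby_backup');
-- ... copy the data directory ...
SELECT pg_stop_backup();

-- on the master: convert a returned location into the WAL file name
-- that must appear in the archive before the backup is usable
SELECT pg_xlogfile_name('2/CF81A478');
-- if that segment has not been archived yet, force a segment switch
SELECT pg_switch_xlog();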


Frankly, I think this whole thing is too fragile. The procedure is 
superficially similar to what you do on master: run pg_start_backup(), 
rsync data directory, run pg_stop_backup(), but is actually subtly 
different and more complicated. If you don't know that, and don't follow 
the full procedure, you get a corrupt backup. And the backup might look 
ok, and might even sometimes work, which means that you won't notice in 
quick testing. That's a *huge* foot-gun.


I think we need to step back and find a way to make this:
a) less complicated, or at least
b) more robust, so that if you don't follow the procedure, you get an error.

With pg_basebackup, we have a fighting chance of getting this right, 
because we have more control over how the backup is made. For example, 
we can co-operate with the buffer manager to avoid torn-pages, 
eliminating the need for full_page_writes=on, and we can include a 
control file with the correct end-of-backup location automatically, 
without requiring user intervention. pg_basebackup is less flexible than 
the pg_start/stop_backup method, and unfortunately you're more likely to 
need the flexibility in a more complicated setup with a hot standby 
server and all, but making the generic pg_start/stop_backup method work 
seems infeasible at the moment.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Simon Riggs
On Mon, Oct 24, 2011 at 7:13 AM, Florian Pflug f...@phlo.org wrote:

 I think Simon's theory that we're starting recovery from the wrong place,
 i.e. should start with an earlier WAL location, is probably correct. The
 question is, why?

Err, that's not what I said and I don't mean that. Having said that,
what I said about pg_control being invalid would imply that, so it is
wrong as well.

We are starting recovery at the right place, but we are initialising
the clog and subtrans incorrectly. Precisely, the oldestActiveXid is
being derived later than it should be, which can cause problems if
this then means that whole pages are uninitialised in subtrans. The bug
only shows up if you do enough transactions (2048 is always enough) to
move to the next subtrans page between the redo pointer and the
checkpoint record, while at the same time there is no long-running
transaction that spans those two points. That's just enough to happen
reasonably frequently on busy systems and yet just enough to have
slipped through testing.

We must either:

1. during CreateCheckpoint(), derive oldestActiveXid before we derive
the redo location, or

2. change the way subtrans pages are initialized during recovery so we
don't rely on oldestActiveXid.

I need to think some more before settling this in my own mind, but I
lean towards doing (1) as a longer-term fix and (2) as a short-term
fix for existing releases. I expect to have a fix later today.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] debug query execution

2011-10-25 Thread vadym nikolaiev
Hi.

I would like to ask which source files are responsible for executing queries
in PostgreSQL?
(I would like to run a simple query and debug its execution on the PostgreSQL
server to understand how PostgreSQL does query processing internally.)

regards,
Vadym


Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Fujii Masao
On Tue, Oct 25, 2011 at 3:44 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 +<para>
 +      Again connect to the database as a superuser, and execute
 +      <function>pg_stop_backup</>. This terminates the backup mode but,
 +      unlike during normal processing, does not perform a switch to the
 +      next WAL segment, create a backup history file, or wait for all
 +      required WAL segments to be archived.
 +</para>
 +</listitem>

 How do you ensure that all the required WAL segments have been archived,
 then?

 The patch doesn't provide any capability to ensure that; IOW, it assumes
 that's the user's responsibility. If a user wants to ensure it, he/she
 needs to calculate the backup start and end WAL files from the results of
 pg_start_backup() and pg_stop_backup() respectively, and wait until those
 files have appeared in the archive. Also, if a required WAL file has not
 been archived yet, the user might need to execute pg_switch_xlog() on the
 master.

 Frankly, I think this whole thing is too fragile. The procedure is
 superficially similar to what you do on master: run pg_start_backup(), rsync
 data directory, run pg_stop_backup(), but is actually subtly different and
 more complicated. If you don't know that, and don't follow the full
 procedure, you get a corrupt backup. And the backup might look ok, and might
 even sometimes work, which means that you won't notice in quick testing.
 That's a *huge* foot-gun.

 I think we need to step back and find a way to make this:
 a) less complicated, or at least
 b) more robust, so that if you don't follow the procedure, you get an error.

One idea to make the procedure more robust is to change PostgreSQL so that
it writes buffer pages to a temporary space instead of to the database
files during a backup. This means that there are no torn pages in the
database files of the backup. After the backup, the data blocks are written
back to the database files over time. When recovery starts from that backup
(i.e., backup_label is found), it first clears the temporary space in the
backup and then continues recovery using the database files, which contain
no torn pages. OTOH, in crash recovery (i.e., backup_label is not found),
recovery is performed using both the database files and the temporary
space. This whole approach would make the standby-only backup available
even if FPW is disabled on the master, and the order in which the control
file is backed up would no longer matter.

But this idea looks like overkill. It seems very complicated to implement,
and likely to invite other bugs. I don't have any other good and simple
idea for now.

 With pg_basebackup, we have a fighting chance of getting this right, because
 we have more control over how the backup is made. For example, we can
 co-operate with the buffer manager to avoid torn-pages, eliminating the need
 for full_page_writes=on, and we can include a control file with the correct
 end-of-backup location automatically, without requiring user intervention.
 pg_basebackup is less flexible than the pg_start/stop_backup method, and
 unfortunately you're more likely to need the flexibility in a more
 complicated setup with a hot standby server and all, but making the generic
 pg_start/stop_backup method work seems infeasible at the moment.

So should we give up supporting the manual procedure, and extend
pg_basebackup for the standby-only backup first? I can live with this.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Simon Riggs
On Tue, Oct 25, 2011 at 8:03 AM, Simon Riggs si...@2ndquadrant.com wrote:

 We are starting recovery at the right place, but we are initialising
 the clog and subtrans incorrectly. Precisely, the oldestActiveXid is
 being derived later than it should be, which can cause problems if
 this then means that whole pages are uninitialised in subtrans. The bug
 only shows up if you do enough transactions (2048 is always enough) to
 move to the next subtrans page between the redo pointer and the
 checkpoint record, while at the same time there is no long-running
 transaction that spans those two points. That's just enough to happen
 reasonably frequently on busy systems and yet just enough to have
 slipped through testing.

 We must either:

 1. during CreateCheckpoint(), derive oldestActiveXid before we derive
 the redo location, or

 2. change the way subtrans pages are initialized during recovery so we
 don't rely on oldestActiveXid.

 I need to think some more before settling this in my own mind, but I
 lean towards doing (1) as a longer-term fix and (2) as a short-term
 fix for existing releases. I expect to have a fix later today.

(1) looks the best way forwards in all cases.

Patch attached. Will be backpatched to 9.0

I think it is possible to avoid taking XidGenLock during
GetRunningTransactions() now, but I haven't included that change in
this patch.

Any other comments before commit?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


oldestActiveXid_fixed.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2011-10-25 Thread Magnus Hagander
2011/10/25 Shigeru Hanada shigeru.han...@gmail.com:
 I'd like to propose pgsql_fdw, FDW for external PostgreSQL server, as a
 contrib module.  I think that this module would be the basis of further
 SQL/MED development for core, e.g. join-push-down and ANALYZE support.

I have not looked at the code itself, but I wonder if we shouldn't
consider making this a part of core-proper, not just a contrib module.
The fact that it isn't *already* available in core surprises a lot of
people...
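
For context, a hedged sketch of how such a module would be used through the
existing SQL/MED commands (the option names shown are illustrative, not
necessarily pgsql_fdw's actual ones):

CREATE EXTENSION pgsql_fdw;

CREATE SERVER remote_pg FOREIGN DATA WRAPPER pgsql_fdw
    OPTIONS (host 'remote.example.com', dbname 'postgres');

CREATE USER MAPPING FOR CURRENT_USER SERVER remote_pg
    OPTIONS (user 'postgres');

CREATE FOREIGN TABLE remote_accounts (aid int, abalance int)
    SERVER remote_pg OPTIONS (table_name 'accounts');

-- rows are fetched from the remote server at query time
SELECT count(*) FROM remote_accounts;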

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Magnus Hagander
On Tue, Oct 25, 2011 at 10:50, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Oct 25, 2011 at 3:44 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 +<para>
 +      Again connect to the database as a superuser, and execute
 +      <function>pg_stop_backup</>. This terminates the backup mode but,
 +      unlike during normal processing, does not perform a switch to the
 +      next WAL segment, create a backup history file, or wait for all
 +      required WAL segments to be archived.
 +</para>
 +</listitem>

 How do you ensure that all the required WAL segments have been archived,
 then?

 The patch doesn't provide any capability to ensure that; IOW, it assumes
 that's the user's responsibility. If a user wants to ensure it, he/she
 needs to calculate the backup start and end WAL files from the results of
 pg_start_backup() and pg_stop_backup() respectively, and wait until those
 files have appeared in the archive. Also, if a required WAL file has not
 been archived yet, the user might need to execute pg_switch_xlog() on the
 master.

 Frankly, I think this whole thing is too fragile. The procedure is
 superficially similar to what you do on master: run pg_start_backup(), rsync
 data directory, run pg_stop_backup(), but is actually subtly different and
 more complicated. If you don't know that, and don't follow the full
 procedure, you get a corrupt backup. And the backup might look ok, and might
 even sometimes work, which means that you won't notice in quick testing.
 That's a *huge* foot-gun.

 I think we need to step back and find a way to make this:
 a) less complicated, or at least
 b) more robust, so that if you don't follow the procedure, you get an error.

 One idea to make the procedure more robust is to change PostgreSQL so that
 it writes buffer pages to a temporary space instead of to the database
 files during a backup. This means that there are no torn pages in the
 database files of the backup. After the backup, the data blocks are
 written back to the database files over time. When recovery starts from
 that backup (i.e., backup_label is found), it first clears the temporary
 space in the backup and then continues recovery using the database files,
 which contain no torn pages. OTOH, in crash recovery (i.e., backup_label
 is not found), recovery is performed using both the database files and the
 temporary space. This whole approach would make the standby-only backup
 available even if FPW is disabled on the master, and the order in which
 the control file is backed up would no longer matter.

 But this idea looks like overkill. It seems very complicated to implement,
 and likely to invite other bugs. I don't have any other good and simple
 idea for now.

 With pg_basebackup, we have a fighting chance of getting this right, because
 we have more control over how the backup is made. For example, we can
 co-operate with the buffer manager to avoid torn-pages, eliminating the need
 for full_page_writes=on, and we can include a control file with the correct
 end-of-backup location automatically, without requiring user intervention.
 pg_basebackup is less flexible than the pg_start/stop_backup method, and
 unfortunately you're more likely to need the flexibility in a more
 complicated setup with a hot standby server and all, but making the generic
 pg_start/stop_backup method work seems infeasible at the moment.

 So should we give up supporting the manual procedure, and extend
 pg_basebackup for the standby-only backup first? I can live with this.

I don't think we should necessarily give up completely. But doing it the
pg_basebackup way *first* seems reasonable - because it's going to be
the easiest one to get right, given that we have more control there.
Doesn't mean we shouldn't extend it in the future...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Florian Pflug
On Oct25, 2011, at 11:13 , Simon Riggs wrote:
 On Tue, Oct 25, 2011 at 8:03 AM, Simon Riggs si...@2ndquadrant.com wrote:
 We are starting recovery at the right place, but we are initialising
 the clog and subtrans incorrectly. Precisely, the oldestActiveXid is
 being derived later than it should be, which can cause problems if
 this then means that whole pages are uninitialised in subtrans. The bug
 only shows up if you do enough transactions (2048 is always enough) to
 move to the next subtrans page between the redo pointer and the
 checkpoint record, while at the same time there is no long-running
 transaction that spans those two points. That's just enough to happen
 reasonably frequently on busy systems and yet just enough to have
 slipped through testing.
 
 We must either:
 
 1. during CreateCheckpoint(), derive oldestActiveXid before we derive
 the redo location, or

 (1) looks the best way forwards in all cases.

Let me see if I understand this.

The problem seems to be that we currently derive oldestActiveXid at the end
of the checkpoint, just before writing the checkpoint record. Since we use
oldestActiveXid to initialize SUBTRANS, this is wrong. Records written before
that checkpoint record (but after the REDO location, of course) may very well
contain XIDs earlier than that wrongly derived oldestActiveXID, and if we
attempt to touch those XIDs' SUBTRANS state, we error out.

Your patch seems sensible, because the checkpoint logically occurs at the
REDO location, not at the checkpoint record's location, so we ought to log
an oldestActiveXID corresponding to that location.

What I don't understand is how this affects the CLOG. How does oldestActiveXID
factor into CLOG initialization?

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Idea: Always consistent in-database cache using SSI mechanisms

2011-10-25 Thread Magnus Hagander
On Tue, Oct 25, 2011 at 00:00, Alexander Korotkov aekorot...@gmail.com wrote:
 On Tue, Oct 25, 2011 at 1:46 AM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:

 Alexander Korotkov aekorot...@gmail.com wrote:

  Couldn't the predicate locking implementation in SSI be used for
  in-database cache invalidation?

 It would not necessarily be limited to *in-database* caches.  The
 main thing would be to design a good API to the predicate locking
 portion of SSI, which I think is about 80% of the SSI code.  Dan and
 I both have an interest in such further use, and there have been
 others who have talked about potential uses for the non-blocking
 predicate locking.  I think the API would need to be based around a
 listen/notify model.

IIRC, I discussed this with Dan Ports at pgcon, as it was one of the
things he had been looking into as well. You might want to talk to him
about it.
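
For reference, the listen/notify model mentioned above already exists at the
SQL level; a minimal sketch of the pattern (channel name and payload are
illustrative):

-- session maintaining the cache
LISTEN cache_invalidation;

-- session that detects a conflicting write (in the proposal, this would be
-- driven by the predicate-locking machinery rather than issued by hand)
SELECT pg_notify('cache_invalidation', 'accounts');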

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Fujii Masao
On Tue, Oct 25, 2011 at 7:19 PM, Magnus Hagander mag...@hagander.net wrote:
 I don't think we should necessarily give up completely. But doing it the
 pg_basebackup way *first* seems reasonable - because it's going to be
 the easiest one to get right, given that we have more control there.
 Doesn't mean we shouldn't extend it in the future...

Agreed. The question is: how far should we change pg_basebackup to get
this right? I think it's not difficult to change it so that it backs up
the control file at the end. But eliminating the need for
full_page_writes=on seems not easy, no? So I'm not inclined to do that in
at least the first commit. Otherwise, I'm afraid the patch would become huge.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Magnus Hagander
On Tue, Oct 25, 2011 at 13:54, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Oct 25, 2011 at 7:19 PM, Magnus Hagander mag...@hagander.net wrote:
 I don't think we should necessarily give up completely. But doing it the
 pg_basebackup way *first* seems reasonable - because it's going to be
 the easiest one to get right, given that we have more control there.
 Doesn't mean we shouldn't extend it in the future...

 Agreed. The question is: how far should we change pg_basebackup to get
 this right? I think it's not difficult to change it so that it backs up
 the control file at the end. But eliminating the need for
 full_page_writes=on seems not easy, no? So I'm not inclined to do that in
 at least the first commit. Otherwise, I'm afraid the patch would become huge.

It's more the server side of base backups than the actual pg_basebackup
tool, of course, but I'm sure that's what we're all referring to here.

Personally, I'd see the fpw stuff as part of the infrastructure
needed. Meaning that the fpw stuff should go in *first*, and the
pg_basebackup stuff later.

If we want something to go in early, that could be as simple as a
version of pg_basebackup that runs against the slave, but only if
full_page_writes=on on the master. If it's not, it throws an error.
Then we can improve upon that by adding handling of fpw=off, first by
infrastructure, then by tool.

Doing it piece by piece like that is probably a good idea, since, as
you say, all at once would be pretty huge.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2011-10-25 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 2011/10/25 Shigeru Hanada shigeru.han...@gmail.com:
 I'd like to propose pgsql_fdw, FDW for external PostgreSQL server, as a
 contrib module.  I think that this module would be the basis of further
 SQL/MED development for core, e.g. join-push-down and ANALYZE support.

 I have not looked at the code itself, but I wonder if we shouldn't
 consider making this a part of core-proper, not just a contrib module.
 The fact that it isn't *already* available in core surprises a lot of
 people...

We've just spent a whole lot of blood and sweat on making the extension
mechanism work nicely.  I don't understand this urge to not use it.

ATM I'm not sure it's even a good idea to push pgsql_fdw into contrib.
Once we do that its release schedule will get locked to core's ---
wouldn't it be better to keep flexibility for now, while it's in such
active development?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2011-10-25 Thread Magnus Hagander
On Tue, Oct 25, 2011 at 14:08, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 2011/10/25 Shigeru Hanada shigeru.han...@gmail.com:
 I'd like to propose pgsql_fdw, FDW for external PostgreSQL server, as a
 contrib module.  I think that this module would be the basis of further
 SQL/MED development for core, e.g. join-push-down and ANALYZE support.

 I have not looked at the code itself, but I wonder if we shouldn't
 consider making this a part of core-proper, not just a contrib module.
 The fact that it isn't *already* available in core surprises a lot of
 people...

 We've just spent a whole lot of blood and sweat on making the extension
 mechanism work nicely.  I don't understand this urge to not use it.

We're back to the old discussion, I guess... I'm happy to see it as an
extension, but I think it should be included with the standard
installation, like we do with, for example, pl/pgsql (which I realize
has a dependency on the backend anyway, so it can't easily be done
another way) and pl/perl (which doesn't, AFAIK, so it's a better
example).


 ATM I'm not sure it's even a good idea to push pgsql_fdw into contrib.
 Once we do that its release schedule will get locked to core's ---
 wouldn't it be better to keep flexibility for now, while it's in such
 active development?

I would be happy to keep it outside, and integrate it in the final CF
for example :)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2011-10-25 Thread Kohei KaiGai
 ATM I'm not sure it's even a good idea to push pgsql_fdw into contrib.
 Once we do that its release schedule will get locked to core's ---
 wouldn't it be better to keep flexibility for now, while it's in such
 active development?

 I would be happy to keep it outside, and integrate it in the final CF
 for example :)

Right now, file_fdw is the only FDW module that we have in core;
however, it is inadequate for proving the new FDW concepts that
target an external RDBMS, such as join push-down of foreign tables.

I think the pgsql_fdw module should also be included in the core
distribution as a basis for future enhancements, unless we think no
working module is needed when an enhancement of FDW is proposed.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2011-10-25 Thread Stephen Frost
* Kohei KaiGai (kai...@kaigai.gr.jp) wrote:
 Right now, file_fdw is the only FDW module that we have in core,

Erm, guess I'm a bit confused why we've got that in core while not
putting pgsql_fdw in core.  This all gets back to previous discussions
around 'recommended' contrib modules (which should really be installed
by default on the filesystem through the distros, ala Debian's
recommends: approach) and 'other' contrib modules.

I'm in favor of making that distinction.  I would still have pgsql_fdw,
file_fdw, etc., be packaged more-or-less the same way and still use the
CREATE EXTENSION framework, of course.

It would be nice if we didn't have to lock the release schedule of those
recommended modules to the core release schedule, or even to each other,
but that's a separate issue, imv.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Simon Riggs
On Tue, Oct 25, 2011 at 12:39 PM, Florian Pflug f...@phlo.org wrote:

 What I don't understand is how this affects the CLOG. How does oldestActiveXID
 factor into CLOG initialization?

It is an entirely different error.

Chris' clog error was caused by a file read error. The file was
opened, we did a seek within the file, and then the call to read()
failed to return a complete page from the file.

The xid shown is 22811359, which is the nextxid in the control file.

So StartupCLOG() must have failed trying to read the clog page from disk.

That isn't a Hot Standby problem or a recovery problem, nor is it
certain that it's a PostgreSQL problem.

OTOH, SlruPhysicalReadPage() does cope gracefully with missing clog
files during recovery, so maybe we can think of a way to make recovery
cope gracefully with an SLRU_READ_FAILED error as well. Any ideas?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Steve Singer

On 11-10-25 02:44 AM, Heikki Linnakangas wrote:
With pg_basebackup, we have a fighting chance of getting this right, 
because we have more control over how the backup is made. For example, 
we can co-operate with the buffer manager to avoid torn-pages, 
eliminating the need for full_page_writes=on, and we can include a 
control file with the correct end-of-backup location automatically, 
without requiring user intervention. pg_basebackup is less flexible 
than the pg_start/stop_backup method, and unfortunately you're more 
likely to need the flexibility in a more complicated setup with a hot 
standby server and all, but making the generic pg_start/stop_backup 
method work seems infeasible at the moment.


Would pg_basebackup be able to work with the buffer manager on the slave 
to avoid full_page_writes=on needing to be set on the master?  (The point 
of this is to be able to take the base backup without having the backup 
program contact the master.)  If so, could pg_start_backup() not just put 
the buffer manager into the same state?





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Heikki Linnakangas

On 25.10.2011 15:56, Steve Singer wrote:

On 11-10-25 02:44 AM, Heikki Linnakangas wrote:

With pg_basebackup, we have a fighting chance of getting this right,
because we have more control over how the backup is made. For example,
we can co-operate with the buffer manager to avoid torn-pages,
eliminating the need for full_page_writes=on, and we can include a
control file with the correct end-of-backup location automatically,
without requiring user intervention. pg_basebackup is less flexible
than the pg_start/stop_backup method, and unfortunately you're more
likely to need the flexibility in a more complicated setup with a hot
standby server and all, but making the generic pg_start/stop_backup
method work seems infeasible at the moment.


Would pg_basebackup be able to work with the buffer manager on the slave
to avoid full_page_writes=on needing to be set on the master? (The point
of this is to be able to take the base backup without having the backup
program contact the master.)


In theory, yes. I'm not sure how difficult it would be in practice. 
Currently, the walsender process just scans and copies everything in the 
data directory, at the filesystem level. It would have to go through the 
buffer manager instead, to avoid reading a page at the same time that 
the buffer manager is writing it out.



If so, could pg_start_backup() not just put the buffer manager into the
same state?


No. The trick that pg_basebackup (= walsender) can do is to co-operate 
with the buffer manager when reading each page. An external program 
cannot do that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2011-10-25 Thread Marko Kreen
On Tue, Oct 25, 2011 at 3:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 2011/10/25 Shigeru Hanada shigeru.han...@gmail.com:
 I'd like to propose pgsql_fdw, FDW for external PostgreSQL server, as a
 contrib module.  I think that this module would be the basis of further
 SQL/MED development for core, e.g. join-push-down and ANALYZE support.

 I have not looked at the code itself, but I wonder if we shouldn't
 consider making this a part of core-proper, not just a contrib module.
 The fact that it isn't *already* available in core surprises a lot of
 people...

 We've just spent a whole lot of blood and sweat on making the extension
 mechanism work nicely.  I don't understand this urge to not use it.

 ATM I'm not sure it's even a good idea to push pgsql_fdw into contrib.
 Once we do that its release schedule will get locked to core's ---
 wouldn't it be better to keep flexibility for now, while it's in such
 active development?

Simple question - do FDW internals need work?

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unreproducible bug in snapshot import code

2011-10-25 Thread Gurjeet Singh
On Tue, Oct 25, 2011 at 2:33 AM, Fujii Masao masao.fu...@gmail.com wrote:

 On Tue, Oct 25, 2011 at 2:41 AM, Gurjeet Singh singh.gurj...@gmail.com
 wrote:
  On Mon, Oct 24, 2011 at 1:08 PM, Alvaro Herrera 
 alvhe...@commandprompt.com
  wrote:
 
  Excerpts from Gurjeet Singh's message of Mon Oct 24 13:55:44 -0300 2011:
 
   Starting from line 89 of terminal1.txt we see this snippet:
  
   postgres=# rollback;
   ROLLBACK
   postgres=# begin TRANSACTION ;
   BEGIN
   postgres=# set transaction snapshot '02C8-1';
   ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
   postgres=# rollback;
   ROLLBACK
  
  
    As you can see, the SET TRANSACTION SNAPSHOT was the first statement in
    that transaction, and yet the ERROR message says that it is not.
 
  Maybe the tab-completion feature issued a query before the set
  transaction command.
 
  I have tried reproducing the bug starting from one and two transactions
  before the one shown in the snippet, and I used tab-completion to get the
  same screen output as terminal1.txt, and yet it's not reproducible.

 I could reproduce it when I typed TAB just after typing "set" in "set
 transaction snapshot". As Tom and Alvaro pointed out, tab completion
 issues a query, which then prevents the SET TRANSACTION SNAPSHOT command.


Great! That settles it then. Reproducible, but not a bug.

Thanks,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


[HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Sushant Sinha
I am currently using the prefix search feature in text search. I find
that the prefix characters are treated the same as a normal lexeme and
passed through stemming and stopword dictionaries. This seems like a bug
to me. 

db=# select to_tsquery('english', 's:*');
NOTICE:  text-search query contains only stop words or doesn't contain
lexemes, ignored
 to_tsquery 
------------
 
(1 row)

db=# select to_tsquery('simple', 's:*');
 to_tsquery 
------------
 's':*
(1 row)


I also think that this is a mistake. It should only be highlighting the 's'.
db=# select ts_headline('sushant', to_tsquery('simple', 's:*'));
  ts_headline   
----------------
 <b>sushant</b>
(1 row)


Thanks,
Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Florian Pflug
On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
 I am currently using the prefix search feature in text search. I find
 that the prefix characters are treated the same as a normal lexeme and
 passed through stemming and stopword dictionaries. This seems like a bug
 to me.

Hm, I don't think so. If they don't pass through stopword dictionaries,
then queries containing stopwords will fail to find any rows - which is
probably not what one would expect.

Here's an example:

Query for records containing the* and car*. The @@-operator returns true,
because the stopword is removed from both the tsvector and the tsquery
(the 'english' dictionary drops 'these' as a stopword and stems 'cars' to
'car'. Both the tsvector and the query end up being just 'car'.)

postgres=# select to_tsvector('english', 'these cars') @@ to_tsquery('english', 
'the:* & car:*');
 ?column? 
----------
 t
(1 row)

Here is what happens when stopwords aren't removed from the query
(now the tsvector ends up being 'car', but the query is 'the:* & car:*'):

postgres=# select to_tsvector('english', 'these cars') @@ to_tsquery('simple', 
'the:* & car:*');
 ?column? 
----------
 f
(1 row)

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] isolationtester's dry run mode

2011-10-25 Thread Alvaro Herrera
This patch adds a -n option to isolationtester.  With this option,
instead of running the tests, it prints the permutations that would be
run.  This is very useful for generating test specs which need to remove
permutations that are not valid (because they would run a command on a
blocked session).

Objections?

Sample run:

$ ./isolationtester -n < ~-/specs/classroom-scheduling.spec
permutation rx1 wy1 c1 ry2 wx2 c2
permutation rx1 wy1 ry2 c1 wx2 c2
permutation rx1 wy1 ry2 wx2 c1 c2
permutation rx1 wy1 ry2 wx2 c2 c1
permutation rx1 ry2 wy1 c1 wx2 c2
permutation rx1 ry2 wy1 wx2 c1 c2
permutation rx1 ry2 wy1 wx2 c2 c1
permutation rx1 ry2 wx2 wy1 c1 c2
permutation rx1 ry2 wx2 wy1 c2 c1
permutation rx1 ry2 wx2 c2 wy1 c1
permutation ry2 rx1 wy1 c1 wx2 c2
permutation ry2 rx1 wy1 wx2 c1 c2
permutation ry2 rx1 wy1 wx2 c2 c1
permutation ry2 rx1 wx2 wy1 c1 c2
permutation ry2 rx1 wx2 wy1 c2 c1
permutation ry2 rx1 wx2 c2 wy1 c1
permutation ry2 wx2 rx1 wy1 c1 c2
permutation ry2 wx2 rx1 wy1 c2 c1
permutation ry2 wx2 rx1 c2 wy1 c1
permutation ry2 wx2 c2 rx1 wy1 c1


-- 
Álvaro Herrera alvhe...@alvh.no-ip.org


isotester-dryrun.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] isolationtester's dry run mode

2011-10-25 Thread Alvaro Herrera

Excerpts from Alvaro Herrera's message of Tue Oct 25 13:32:31 -0300 2011:
 This patch adds a -n option to isolationtester.  With this option,
 instead of running the tests, it prints the permutations that would be
 run.  This is very useful for generating test specs which need to remove
 permutations that are not valid (because they would run a command on a
 blocked session).

Forgot to credit the author: this patch was developed by Alexander
Shulgin.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Range Types - typo + NULL string constructor

2011-10-25 Thread Jeff Davis
On Mon, 2011-10-24 at 13:15 +0300, Heikki Linnakangas wrote:
 Hmm, I don't think that's safe. After Oid wraparound, a range type oid 
 might get reused for some other range type, and the cache would return 
 stale values. Extremely unlikely to happen by accident, but could be 
 exploited by an attacker.
 

Any ideas on how to remedy that? I don't have another plan for making it
perform well. Plugging it into the cache invalidation mechanism seems
like overkill, but I suppose that would solve the problem.

Aren't there a few other cases like this floating around the code? I
know the single-xid cache is potentially vulnerable to xid wraparound
for the same reason.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Sushant Sinha
On Tue, 2011-10-25 at 18:05 +0200, Florian Pflug wrote:
 On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
  I am currently using the prefix search feature in text search. I find
  that the prefix characters are treated the same as a normal lexeme and
  passed through stemming and stopword dictionaries. This seems like a bug
  to me.
 
 Hm, I don't think so. If they don't pass through stopword dictionaries,
 then queries containing stopwords will fail to find any rows - which is
 probably not what one would expect.

I think what you are calling a feature is really a bug. I am fairly sure
that when someone says to_tsquery('english', 's:*') one is looking for
an entry that has a *non-stopword* word that starts with 's'. And
especially so in a text search configuration that eliminates stop words. 

Does it even make sense to stem, abbreviate, or apply synonyms to a few
letters? It would be so unpredictable.

-Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for distinguishing PG instances in event log v2

2011-10-25 Thread Magnus Hagander
2011/7/16 MauMau maumau...@gmail.com:
 Hello,

 The attached file is a revised patch which reflects all review comments by
 Magnus in:

 http://archives.postgresql.org/pgsql-hackers/2011-07/msg00839.php

 I made sure the previous tests (both custom and default PostgreSQL event
 source) succeeded.

 I'm submitting this to the currently open CommitFest 2011-09 shortly. Please
 review it again.

I have applied this patch after another round of rather extensive modifications.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)

2011-10-25 Thread Alexander Korotkov
On Mon, Oct 24, 2011 at 3:05 PM, Alexander Korotkov aekorot...@gmail.comwrote:

 If we allow the user to specify their own gist_penalty function, then such
 a function should deal with:
 1) GiST-specific data structures such as GISTENTRY.
 2) Decomposing ranges using range_deserialize.
 3) Infinities, which we could handle in general penalty functions.
 That's why I prefer to implement subtype_diff.

I forgot another argument for having subtype_diff:
4) In my picksplit algorithm it would be more natural to use subtype_diff
for measuring overlap than to use the penalty function.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] (PATCH) Adding CORRESPONDING to Set Operations

2011-10-25 Thread Erik Rijkers
On Tue, October 25, 2011 19:49, Kerem Kat wrote:
 On Mon, Oct 24, 2011 at 20:52, Erik Rijkers e...@xs4all.nl wrote:
 On Wed, October 19, 2011 15:01, Kerem Kat wrote:
 Adding CORRESPONDING to Set Operations
 I have corrected the ordering of the target lists of the subqueries, and
 added 12 regression tests for column-list ordering. Can you confirm that
 the order has changed for you?


Yes, this one is OK.

thanks,

Erik Rijkers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Firing order of RI triggers

2011-10-25 Thread Tom Lane
I've looked into the cause of bug #6268,
http://archives.postgresql.org/pgsql-bugs/2011-10/msg00223.php

It's fairly simple: we're firing RI triggers in the wrong order.

What's happening is that we update the tuple and queue RI_FKey_check_upd
and RI_FKey_cascade_upd events for the update action, *in that order*.
When RI_FKey_check_upd runs, it checks things and quite properly
complains that there's no matching PK, since the row that used to have
the matching PK is now obsolete.

Had RI_FKey_cascade_upd fired first, all would have been well,
because when RI_FKey_check_upd fired for this particular update, it
would've seen the new tuple is already obsolete and done nothing.

The reason they fire in the wrong order is that triggers for a single
event are fired in name order, and the names being used are things like
RI_ConstraintTrigger_53569.  Most of the time, the trigger with higher
OID is going to sort last ... and createForeignKeyTriggers creates the
check triggers before the action triggers.

You might wonder why this doesn't mean that all self-referential foreign
key situations are broken all the time.  Well, the answer is that the
problem is usually masked by the optimization that avoids firing a check
trigger at all if the referencing field didn't change --- see
AfterTriggerSaveEvent.  In the test case given in the bug, the first
UPDATE within the transaction doesn't see the problem because of this.
But in the second UPDATE of the same row, that optimization is disabled,
so the check trigger fires and fails.
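
A hedged reconstruction of the failure pattern described (not necessarily
the exact test case from the bug report; table and column names are
illustrative):

CREATE TABLE units (
    id        integer PRIMARY KEY,
    parent_id integer REFERENCES units (id) ON UPDATE CASCADE
);
INSERT INTO units VALUES (1, 1);
BEGIN;
-- first update: parent_id is unchanged, so the check trigger is skipped
UPDATE units SET id = 2 WHERE id = 1;
-- second update of the same row: the optimization is disabled, and the
-- check trigger fires before the cascade, finding no matching PK
UPDATE units SET id = 3 WHERE id = 2;
ROLLBACK;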

As far as I can see, the only practical way to fix this is to change the
names given to RI triggers so that cascade actions will fire before
check triggers.  Just changing the order of creation would fix it 99.99%
of the time, but fail on the times when the first trigger had OID 9
and the second OID 100, for example.  And I definitely don't think
we want to mess with the general rule that triggers fire in name order.

I'm thinking we could do RI_ConstraintTrigger_a_<oid> for action
triggers and RI_ConstraintTrigger_c_<oid> for checking triggers,
and then the names would be guaranteed to sort correctly.

I'm not sure if this is something we can back-patch --- I don't see any
dependencies in our own code on what names RI triggers have, but I'm
afraid there is client-side code out there that knows it.  In any case,
changing the name assignments would not fix things for existing
triggers; but if we did back-patch then any affected users could just
drop and re-create the problematic FK constraint.  Or maybe we could
back-patch a change in creation order and rely on that usually working.
Given the lack of prior complaints that might be good enough.

Comments?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Florian Pflug
On Oct25, 2011, at 18:47 , Sushant Sinha wrote:
 On Tue, 2011-10-25 at 18:05 +0200, Florian Pflug wrote:
 On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
 I am currently using the prefix search feature in text search. I find
 that the prefix characters are treated the same as a normal lexeme and
 passed through stemming and stopword dictionaries. This seems like a bug
 to me.
 
 Hm, I don't think so. If they don't pass through stopword dictionaries,
 then queries containing stopwords will fail to find any rows - which is
 probably not what one would expect.
 
  I think what you are calling a feature is really a bug. I am fairly sure
  that when someone says to_tsquery('english', 's:*') one is looking for
  an entry that has a *non-stopword* word that starts with 's'. And
  especially so in a text search configuration that eliminates stop words.

But the whole idea of removing stopwords from the query is that users
*don't* need to be aware of the precise list of stopwords. The way I see
it, stopwords are simply an optimization that helps reduce the size of
your fulltext index.

Assume, for example, that the postgres mailing list archive search used
tsearch (which I think it does, but I'm not sure). It'd then probably make
sense to add postgres to the list of stopwords, because it's bound to 
appear in nearly every mail. But would you want searches which include
'postgres*' to turn up empty? Quite certainly not.

 Does it even make sense to stem, abbreviate, or apply synonyms to a few
 letters? It would be so unpredictable.

That depends on the language. In German (my native tongue), one can
concatenate nouns to form new nouns. It's thus not entirely unreasonable
that one would want the prefix to be stemmed to its singular form before
being matched.

Also, suppose you're using a dictionary which corrects common typos. Who
says you wouldn't want that to be applied to prefix queries?

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] (PATCH) Adding CORRESPONDING to Set Operations

2011-10-25 Thread Kerem Kat
On Mon, Oct 24, 2011 at 20:52, Erik Rijkers e...@xs4all.nl wrote:
 On Wed, October 19, 2011 15:01, Kerem Kat wrote:
 Adding CORRESPONDING to Set Operations
 Initial patch, filename: corresponding_clause_v2.patch

 I had a quick look at the behaviour of this patch.

 Btw, the examples in your email were typoed (one select is missing):

 SELECT 1 a, 2 b, 3 c UNION CORRESPONDING 4 b, 5 d, 6 c, 7 f;
 should be:
  SELECT 1 a, 2 b, 3 c UNION CORRESPONDING select 4 b, 5 d, 6 c, 7 f;

 and

 SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(b) 4 b, 5 d, 6 c, 7 f;
 should be:
  SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(b) select 4 b, 5 d, 6 c, 7 f;


Yes you are correct, mea culpa.




 But there is also a small bug, I think: the order in the CORRESPONDING BY 
 list should be followed, according to the standard (Foundation, p. 408):

 2) If <corresponding column list> is specified, then let SL be a <select
 list> of those <column name>s explicitly appearing in the <corresponding
 column list> in the order that these <column name>s appear in the
 <corresponding column list>. Every <column name> in the <corresponding
 column list> shall be a column name of both T1 and T2.

 That would make this wrong, I think:

 SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(c,b) select 5 d, 6 c, 7 f, 4 b ;

  b | c
 ---+---
  2 | 3
  4 | 6
 (2 rows)

 i.e., I think it should show columns in the order c, b (and not b, c); the 
 order of the
 CORRESPONDING BY phrase.

 (but maybe I'm misreading the text of the standard; I find it often difficult 
 to follow)


It wasn't a misread; I checked the draft, and in my version the same
explanation is at p. 410.
I have corrected the ordering of the target lists of the subqueries, and
added 12 regression tests for column-list ordering. Can you confirm that
the order has changed for you?



 Thanks,


 Erik Rijkers



Regards,

Kerem KAT
*** a/doc/src/sgml/queries.sgml
--- b/doc/src/sgml/queries.sgml
***************
*** 1225,1230 ****
--- 1225,1233 ----
     <primary>EXCEPT</primary>
    </indexterm>
    <indexterm zone="queries-union">
+    <primary>CORRESPONDING</primary>
+   </indexterm>
+   <indexterm zone="queries-union">
     <primary>set union</primary>
    </indexterm>
    <indexterm zone="queries-union">
***************
*** 1241,1249 ****
     The results of two queries can be combined using the set operations
     union, intersection, and difference.  The syntax is
  <synopsis>
! <replaceable>query1</replaceable> UNION <optional>ALL</optional> <replaceable>query2</replaceable>
! <replaceable>query1</replaceable> INTERSECT <optional>ALL</optional> <replaceable>query2</replaceable>
! <replaceable>query1</replaceable> EXCEPT <optional>ALL</optional> <replaceable>query2</replaceable>
  </synopsis>
     <replaceable>query1</replaceable> and
     <replaceable>query2</replaceable> are queries that can use any of
--- 1244,1252 ----
     The results of two queries can be combined using the set operations
     union, intersection, and difference.  The syntax is
  <synopsis>
! <replaceable>query1</replaceable> UNION <optional>ALL</optional> <optional>CORRESPONDING <optional>BY (<replaceable>select_list</replaceable>)</optional></optional> <replaceable>query2</replaceable>
! <replaceable>query1</replaceable> INTERSECT <optional>ALL</optional> <optional>CORRESPONDING <optional>BY (<replaceable>select_list</replaceable>)</optional></optional> <replaceable>query2</replaceable>
! <replaceable>query1</replaceable> EXCEPT <optional>ALL</optional> <optional>CORRESPONDING <optional>BY (<replaceable>select_list</replaceable>)</optional></optional> <replaceable>query2</replaceable>
  </synopsis>
     <replaceable>query1</replaceable> and
     <replaceable>query2</replaceable> are queries that can use any of
***************
*** 1283,1288 ****
--- 1286,1299 ----
    </para>
  
    <para>
+    <literal>CORRESPONDING</> returns all columns that are in both <replaceable>query1</> and <replaceable>query2</> with the same name.
+   </para>
+ 
+   <para>
+    <literal>CORRESPONDING BY</> returns all columns in the column list that are also in both <replaceable>query1</> and <replaceable>query2</> with the same name.
+   </para>
+ 
+   <para>
     In order to calculate the union, intersection, or difference of two
     queries, the two queries must be <quote>union compatible</quote>,
     which means that they return the same number of columns and
*** a/doc/src/sgml/sql.sgml
--- b/doc/src/sgml/sql.sgml
***************
*** 859,865 ****
  [ WHERE <replaceable class="PARAMETER">condition</replaceable> ]
  [ GROUP BY <replaceable class="PARAMETER">expression</replaceable> [, ...] ]
  [ HAVING <replaceable class="PARAMETER">condition</replaceable> [, ...] ]
! [ { UNION | INTERSECT | EXCEPT } [ ALL ] <replaceable class="PARAMETER">select</replaceable> ]
  [ ORDER BY <replaceable class="parameter">expression</replaceable> [ ASC | DESC | USING <replaceable class="parameter">operator</replaceable> ] [ NULLS { FIRST | LAST } ] [, ...] ]
  [ LIMIT { <replaceable class="PARAMETER">count</replaceable> | ALL } ]
  [ OFFSET <replaceable class="PARAMETER">start</replaceable> ]
--- 859,865 ----
  [ WHERE <replaceable class="PARAMETER">condition</replaceable> ]
  [ GROUP BY 

Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Chris Redekop
 Chris, can you rearrange the backup so you copy the pg_control file as
 the first act after the pg_start_backup?

I tried this and it doesn't seem to make any difference.  I also tried the
patch, and I can no longer reproduce the subtrans error; however, now it
starts up but never gets to the point where it'll accept connections.  It
starts up, but if I try to do anything I always get "FATAL: the database
system is starting up"...even if the load is removed from the primary, the
standby still never finishes starting up.  Attached below is a log of one
of these startup attempts.  In my testing with the patch applied, approx
3 in 10 attempts start up successfully, and 7 in 10 attempts go into the
"db is starting up" state...the pg_clog error is still there, but it seems
much harder to reproduce now...I've seen it only once since applying the
patch (out of probably 50 or 60 under-load startup attempts).  It does
seem to be moody like that though...it will be very difficult to reproduce
for a while, and then it will happen damn-near every time for a
while...weirdness

On a bit of a side note, I've been thinking of changing my scripts so that
they perform an initial rsync prior to doing the
startbackup-rsync-stopbackup, just so that the second rsync will be
faster...so that the backup is in progress for a shorter period of time,
since while it is running it will stop other standbys from starting
up...this shouldn't cause any issues, eh?
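
For concreteness, the two-pass scheme I have in mind looks roughly like
this (a sketch only -- the paths, the backup label, and the rsync flags
are made up; the rsync steps are shown as comments around the SQL calls,
since only pg_start_backup()/pg_stop_backup() run inside the database):

-- pass 1: bulk copy taken before the backup window opens
--   rsync -a /var/lib/pgsql/data/ standby:/var/lib/pgsql/data/
SELECT pg_start_backup('standby_clone');
-- pass 2: copies only what changed since pass 1, so the window stays short
--   rsync -a --delete /var/lib/pgsql/data/ standby:/var/lib/pgsql/data/
SELECT pg_stop_backup();

The first pass is purely an optimization; correctness should still depend
only on what gets copied between the start and stop calls.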


2011-10-25 13:43:24.035 MDT [15072]: [1-1] LOG:  database system was
interrupted; last known up at 2011-10-25 13:43:11 MDT
2011-10-25 13:43:24.035 MDT [15072]: [2-1] LOG:  creating missing WAL
directory pg_xlog/archive_status
2011-10-25 13:43:24.037 MDT [15072]: [3-1] LOG:  entering standby mode
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: START_REPLICATION 2/CF00
2011-10-25 13:43:24.041 MDT [15073]: [1-1] LOG:  streaming replication
successfully connected to primary
2011-10-25 13:43:24.177 MDT [15092]: [1-1] FATAL:  the database system is
starting up
2011-10-25 13:43:24.781 MDT [15072]: [4-1] DEBUG:  checkpoint record is at
2/CF81A478
2011-10-25 13:43:24.781 MDT [15072]: [5-1] DEBUG:  redo record is at
2/CF20; shutdown FALSE
2011-10-25 13:43:24.781 MDT [15072]: [6-1] DEBUG:  next transaction ID:
0/4634700; next OID: 1188228
2011-10-25 13:43:24.781 MDT [15072]: [7-1] DEBUG:  next MultiXactId: 839;
next MultiXactOffset: 1686
2011-10-25 13:43:24.781 MDT [15072]: [8-1] DEBUG:  oldest unfrozen
transaction ID: 1669, in database 1
2011-10-25 13:43:24.781 MDT [15072]: [9-1] DEBUG:  transaction ID wrap limit
is 2147485316, limited by database with OID 1
2011-10-25 13:43:24.783 MDT [15072]: [10-1] DEBUG:  resetting unlogged
relations: cleanup 1 init 0
2011-10-25 13:43:24.791 MDT [15072]: [11-1] DEBUG:  initializing for hot
standby
2011-10-25 13:43:24.791 MDT [15072]: [12-1] LOG:  consistent recovery state
reached at 2/CF81A4D0
2011-10-25 13:43:24.791 MDT [15072]: [13-1] LOG:  redo starts at 2/CF20
2011-10-25 13:43:25.019 MDT [15072]: [14-1] LOG:  consistent state delayed
because recovery snapshot incomplete
2011-10-25 13:43:25.019 MDT [15072]: [15-1] CONTEXT:  xlog redo  running
xacts:
nextXid 4634700 latestCompletedXid 4634698 oldestRunningXid 4634336; 130
xacts:
4634336 4634337 4634338 4634339 4634340 4634341 4634342 4634343 4634344
4634345
4634346 4634347 4634348 4634349 4634350 4634351 4634352 4634353 4634354
4634355
4634356 4634357 4634358 4634359 4634360 4634361 4634362 4634363 4634364
4634365
4634366 4634367 4634368 4634369 4634370 4634371 4634515 4634516 4634517
4634518
4634519 4634520 4634521 4634522 4634523 4634524 4634525 4634526 4634527
4634528
4634529 4634530 4634531 4634532 4634533 4634534 4634535 4634536 4634537
4634538
4634539 4634540 4634541 4634542 4634543 4634385 4634386 4634387 4634388
4634389
4634390 4634391 4634392 4634393 4634394 4634395 4634396 4634397 4634398
4634399
4634400 4634401 4634402 4634403 4634404 4634405 4634406 4634407 4634408
4634409
4634410 4634411 4634412 4634413 4634414 4634415 4634416 4634417 4634418
4634419
4634420 4634579 4634580 4634581 4634582 4634583 4634584 4634585 4634586
4634587
4634588 4634589 4634590 4634591 4634592 4634593 4634594 4634595 4634596
4634597
4634598 4634599 4634600 4634601 4634602 4634603 4634604 4634605 4634606
4634607;
 subxid ovf
2011-10-25 13:43:25.240 MDT [15130]: [1-1] FATAL:  the database system is
starting up
DEBUG:  standby sync_rep_test has now caught up with primary
2011-10-25 13:43:26.304 MDT [15167]: [1-1] FATAL:  the database system is
starting up
2011-10-25 13:43:27.366 MDT [15204]: [1-1] FATAL:  the database system is
starting up
2011-10-25 13:43:28.426 MDT [15241]: [1-1] FATAL:  the database system is
starting up
2011-10-25 13:43:29.461 MDT [15275]: [1-1] FATAL:  the database system is
starting up
and so on...


On Tue, Oct 25, 2011 at 6:51 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On Tue, Oct 25, 2011 at 12:39 PM, Florian Pflug f...@phlo.org 

Re: [HACKERS] Firing order of RI triggers

2011-10-25 Thread Alvaro Herrera

Excerpts from Tom Lane's message of mar oct 25 14:57:43 -0300 2011:

 I'm not sure if this is something we can back-patch --- I don't see any
 dependencies in our own code on what names RI triggers have, but I'm
 afraid there is client-side code out there that knows it.

Yeah, sounds possible.

 In any case,
 changing the name assignments would not fix things for existing
 triggers; but if we did back-patch then any affected users could just
 drop and re-create the problematic FK constraint.  Or maybe we could
 back-patch a change in creation order and rely on that usually working.
 Given the lack of prior complaints that might be good enough.

The latter looks reasonable ... particularly if the symptoms of a
botched order would be immediately visible -- the user could just drop
and reload the constraints to fix the order in the very unlikely case
that they are reversed.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Firing order of RI triggers

2011-10-25 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of mar oct 25 14:57:43 -0300 2011:
 ...  Or maybe we could
 back-patch a change in creation order and rely on that usually working.
 Given the lack of prior complaints that might be good enough.

 The latter looks reasonable ... particularly if the symptoms of a
 botched order would be immediately visible -- the user could just drop
 and reload the constraints to fix the order in the very unlikely case
 that they are reversed.

Well, the symptoms would probably be just like in the bug report: you'd
get unexpected failures from double updates of a self-referential row in
a single transaction.  That's a sufficiently weird corner case that most
people probably wouldn't exercise it right away.  But given that this
problem has been there from day one and nobody noticed before, I'm not
too concerned about the intersection of people who have an issue and
people who are unlucky enough to get an end-of-decade trigger OID.
I think 100% solution in HEAD and 99.99% solution in back branches
should be good enough.
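
For anyone wanting to poke at this, the corner case has roughly this
shape (a hypothetical sketch, not the schema from the original report,
and not a guaranteed reproducer):

CREATE TABLE selfref (
    id int PRIMARY KEY,
    parent int REFERENCES selfref (id) DEFERRABLE INITIALLY DEFERRED
);
BEGIN;
INSERT INTO selfref VALUES (1, 1);            -- self-referential row
UPDATE selfref SET parent = 1 WHERE id = 1;   -- first update
UPDATE selfref SET parent = 1 WHERE id = 1;   -- second update, same xact
COMMIT;  -- could fail if the queued RI triggers fire in the wrong order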

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Sushant Sinha
On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:

 Assume, for example, that the postgres mailing list archive search used
 tsearch (which I think it does, but I'm not sure). It'd then probably make
 sense to add postgres to the list of stopwords, because it's bound to 
 appear in nearly every mail. But wouldn't you want searches which include
 'postgres*' to turn up empty? Quite certainly not.

That improves recall for a postgres:* query and certainly doesn't help
other queries like post:*. But more importantly, it affects precision
for all queries like a:*, an:*, and:*, s:*, t:*, the:*, etc.
(When that is the only search term, it also affects recall, as no row
matches an empty tsquery.) Since stopwords tend to be short, prefix
search on just a few characters becomes meaningless. And I would argue
that is exactly when prefix search matters most -- when you only know a
few characters.
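
To illustrate with the built-in english configuration, where 'a' is a
stopword (behavior as I understand it; worth double-checking):

SELECT to_tsvector('english', 'a sample text') @@ to_tsquery('english', 'a:*');
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
 ?column?
----------
 f
(1 row)

The stopword is dropped during tsquery normalization even with the
prefix marker attached, so the query matches nothing at all.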


-Sushant.






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Update on documentation builds on OSX w/ macports

2011-10-25 Thread Florian Pflug
On Oct20, 2011, at 05:07 , Dan Ports wrote:
 On Thu, Oct 20, 2011 at 02:02:09AM +0200, Florian Pflug wrote:
 I've patched the ports for openjade, iso8879 and docbook-dsssl,
 and added a new port for docbook-sgml-4.2. These patches are sitting
 in the macports trac now, waiting to be applied.
 
 I'll try to take a look at them in the next couple days (with my
 MacPorts hat on), unless someone beats me to it.

And an effective hat that was! Thanks for cleaning up and applying
all of those patches, Dan!

Should someone want to try this out, beware that clang (at least
the version shipped with XCode 4.2) doesn't like openjade. Also,
universal builds of openjade seem to be broken. On my machine, I
thus need to say

  port install openjade -universal configure.compiler=llvm-gcc-4.2

instead of just

  port install openjade

I've submitted a patch for that, so hopefully that problem too will
go away in the near future.

Anyway, after having installed openjade, a simple

  port install docbook-sgml-4.2

should be enough to make our documentation build work.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-25 Thread Tom Lane
I believe I have reproduced the behavior described by Andrew Hammond in
http://archives.postgresql.org/pgsql-general/2011-10/msg00928.php

This is using the regression database:

1. In session 1, do
set default_statistics_target TO 10000;
analyze tenk1;
(We need the large stats target to ensure that tenk1's pg_statistic
entries require toasting.)

2. Attach to session 1 with a debugger and set a breakpoint at
CommitTransaction's call to CallXactCallbacks (or anyplace after
ProcArrayEndTransaction and before AtEOXact_Inval).

3. In session 2, do

select count(*) from tenk1 where fivethous < 2500;

(This loads up session 2's syscaches with toasted pg_statistic entries.)

4. In session 1, again do

analyze tenk1;

and wait for it to stop at the breakpoint.

5. In session 3 (or you can use session 2 for this), do
 vacuum verbose pg_statistic;
You should see it removing toast entries that were generated in step 1
and obsoleted in step 4.

6. In session 2, again do

select count(*) from tenk1 where fivethous < 2500;

and voila:

ERROR:  missing chunk number 0 for toast value 53668 in pg_toast_2619

What has happened here is that the second ANALYZE has marked itself
committed in pg_clog and no longer running in the ProcArray, so VACUUM
feels entitled to remove toast tuples that the ANALYZE deleted.  However,
the ANALYZE has not yet sent out the sinval messages that would inform
session 2 that its syscache entries are obsolete.  In Andrew's report,
presumably the machine was under enough load to slow down ANALYZE at
just this point, and there was a concurrent autovacuum that would have
done the rest of the deed.  The problem could only be seen for a short
interval, which squares with his report, and with a similar one from
Tim Uckun back in September.

Ordinarily, sending out sinval messages post-commit is okay because we
don't release locks until after that, and we suppose that our locks
prevent any other transactions from getting to the point of using
syscache entries that might have been invalidated by our transaction.
However, *we have carefully hacked on ANALYZE until it doesn't take any
locks that would block concurrent queries on the analyzed table.*  So
the normal protection against stale syscache entries simply doesn't
work for pg_statistic fetches.

I'm not sure about a good way to fix this.  When we last dealt with a
similar failure, Heikki suggested that we forcibly detoast all fields in
a tuple that we're putting into the syscaches:
http://archives.postgresql.org/pgsql-hackers/2011-08/msg00661.php
I don't much like that, though, as it seems expensive, and I'm worried
about possible circularity of needing to know about all toastable fields
while making a syscache entry, and anyway it's papering over a symptom
rather than solving the actual problem that we're relying on a stale
syscache entry.

We could fix it by not using a syscache anymore for pg_statistic
entries, but that's probably not acceptable from a performance
standpoint.

A clean fix would be to add locking that blocks would-be users of
pg_statistic entries when an ANALYZE is about to commit.  This isn't
much fun from a performance standpoint either, but at least it should be
relatively cheap most of the time.

Thoughts?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] isolationtester and invalid permutations

2011-10-25 Thread Alvaro Herrera
Instead of simply aborting a spec that specifies running commands on
blocked sessions (what we call an invalid permutation), it seems more
useful to report the problem, clean up the sessions, and continue with
the next permutation.

This, in conjunction with the dry-run patch I submitted earlier, makes
it easier to determine a working spec: dry-run the spec; copy the
so-generated permutation lines into the spec; run the spec normally,
which reports the invalid permutations; comment out the invalid
permutations from the spec; done.

The attached patch, again from Alexander Shulgin (with some tweaks from
me) does that.

Comments?

-- 
Álvaro Herrera alvhe...@alvh.no-ip.org


isotester-allow-invalid-perms.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Chris Redekop

 That isn't a Hot Standby problem, a recovery problem nor is it certain
 its a PostgreSQL problem.

Do you have any theories on this that I could help investigate?  It happens
even when using pg_basebackup, and it persists until another sync is
performed, so the files must be in some state that it can't recover
from...without understanding the internals, just viewing from an
outside perspective, I don't really see how this could not be a PostgreSQL
problem


Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Fujii Masao
On Tue, Oct 25, 2011 at 9:03 PM, Magnus Hagander mag...@hagander.net wrote:
 On Tue, Oct 25, 2011 at 13:54, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Oct 25, 2011 at 7:19 PM, Magnus Hagander mag...@hagander.net wrote:
 I don't think we should necessarily give up completely. But doing a
 pg_basebackup way *first* seems reasonable - because it's going to be
 the easiest one to get right, given that we have more control there.
 Doesn't mean we shouldn't extend it in the future...

 Agreed. The question is -- how far should we change pg_basebackup to
 get it right? I think it's not difficult to change it so that it backs up
 the control file at the end. But eliminating the need for full_page_writes=on
 seems not easy, no? So I'm not inclined to do that in at least the first
 commit. Otherwise, I'm afraid the patch would become huge.

 It's more server side of base backups than the actual pg_basebackup
 tool of course, but I'm sure that's what we're all referring to here.

 Personally, I'd see the fpw stuff as part of the infrastructure
 needed. Meaning that the fpw stuff should go in *first*, and the
 pg_basebackup stuff later.

Agreed. I'll extract the FPW stuff from the patch that I submitted and
revise it as the infrastructure patch.

The changes to pg_start_backup() etc. that Ishiduka-san made are also
server-side infrastructure. I will extract them as another infrastructure patch.

Ishiduka-san, if you have time, feel free to try the above, barring objection.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-25 Thread Robert Haas
On Tue, Oct 25, 2011 at 9:06 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 What has happened here is that the second ANALYZE has marked itself
 committed in pg_clog and no longer running in the ProcArray, so VACUUM
 feels entitled to remove toast tuples that the ANALYZE deleted.  However,
 the ANALYZE has not yet sent out the sinval messages that would inform
 session 2 that its syscache entries are obsolete.  In Andrew's report,
 presumably the machine was under enough load to slow down ANALYZE at
 just this point, and there was a concurrent autovacuum that would have
 done the rest of the deed.  The problem could only be seen for a short
 interval, which squares with his report, and with a similar one from
 Tim Uckun back in September.

 Ordinarily, sending out sinval messages post-commit is okay because we
 don't release locks until after that, and we suppose that our locks
 prevent any other transactions from getting to the point of using
 syscache entries that might have been invalidated by our transaction.
 However, *we have carefully hacked on ANALYZE until it doesn't take any
 locks that would block concurrent queries on the analyzed table.*  So
 the normal protection against stale syscache entries simply doesn't
 work for pg_statistic fetches.

This is very similar to one of the issues that reared its ugly head in
regards to Simon's now-reverted patch to lower DDL locking strength.
You identified some other issues there as well, but *one* of the
issues was that, as in this case, the sinval mechanism fails to
provide the necessary synchronization guarantees unless the lock
required to reread the updated data conflicts with the lock required
to change the data.  In that case, the data meant the pg_class
entry or the pg_attribute entry whereas here it means the
pg_statistic entry, but I believe the principal is the same.  And
there as here, (1) there is a fundamental conflict between what the
sinval mechanism requires for correctness and what is actually
desirable in terms of lock levels from a user experience point of view
and (2) it is relatively easy to write code that looks superficially
safe but which actually contains subtle race conditions.  IIRC, you
never thought Simon's patch looked safe, but I'm guessing that this
pg_statistic bug has been around for a long time.

So I'm wondering if we ought to rethink our position that users of the
sinval machinery must provide their own external synchronization
through heavyweight locking, and instead build the synchronization
into the sinval mechanism itself.  One idea I had was to include the
sending transaction's XID in every sinval message,
and to force clients receiving a message to do XactLockTableWait() for
each such XID.  That would force the backend reloading its cache to
wait until the committing transaction reaches the lock-release phase.
If we sent out sinval messages just before removing ourselves from the
ProcArray, I think that would more-or-less fix this bug (although
maybe I'm missing some reason why it's not practical to send them that
early) except that I don't see any way to handle the sinval-reset
case, which seems to more or less kill this idea in its tracks.

But maybe there's some other mechanism whereby we could combine
sending the sinval messages slightly earlier (before
ProcArrayEndTransaction) with blocking anyone who processes those
messages until after the committing backend finishes
ProcArrayEndTransaction.  For example, you could add an additional
LWLock, which has to be held in exclusive mode by a committing
transaction that sends any sinval messages.  It must be acquired
before sending the sinval messages and can't be released until after
ProcArrayEndTransaction() is complete.  Anyone processing a sinval
message must acquire and release the lock in shared mode before
reloading their caches, so that we guarantee that at the time you
reread the catalogs, any transactions involved in sending those
messages are visible.

That's actually a bit coarse-grained; there's probably a better
mechanism, but I'm just throwing this out to see if the basic idea has
any legs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-25 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Oct 25, 2011 at 9:06 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Ordinarily, sending out sinval messages post-commit is okay because we
 don't release locks until after that, and we suppose that our locks
 prevent any other transactions from getting to the point of using
 syscache entries that might have been invalidated by our transaction.
 However, *we have carefully hacked on ANALYZE until it doesn't take any
 locks that would block concurrent queries on the analyzed table.*  So
 the normal protection against stale syscache entries simply doesn't
 work for pg_statistic fetches.

 This is very similar to one of the issues that reared its ugly head in
 regards to Simon's now-reverted patch to lower DDL locking strength.
 You identified some other issues there as well, but *one* of the
 issues was that, as in this case, the sinval mechanism fails to
 provide the necessary synchronization guarantees unless the lock
 required to reread the updated data conflicts with the lock required
 to change the data.

Right.  We may take as little as AccessShareLock on a relation before
examining its pg_statistic entries, and ANALYZE isn't taking anything
that would block that.
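
A quick way to see the non-conflict (illustrative only; tenk1 stands in
for any table):

BEGIN;
SELECT count(*) FROM tenk1;   -- planning reads pg_statistic; only
                              -- AccessShareLock is held on tenk1
-- meanwhile, in another session:
--   ANALYZE tenk1;           -- takes ShareUpdateExclusiveLock, which
--                            -- does not conflict with AccessShareLock
COMMIT;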

 So I'm wondering if we ought to rethink our position that users of the
 sinval machinery must provide their own external synchronization
 through heavyweight locking, and instead build the synchronization
 into the sinval mechanism itself.

Yeah, it's starting to feel like we need a basic redesign of sinval
... although I'd not care to back-patch that, so we also need to think
of a sane solution for the back branches.

 If we sent out sinval messages just before removing ourselves from the
 ProcArray, I think that would more-or-less fix this bug (although
 maybe I'm missing some reason why it's not practical to send them that
 early) except that I don't see any way to handle the sinval-reset
 case, which seems to more or less kill this idea in its tracks.

The other reason that doesn't work is there's a race condition: someone
might load their cache entry immediately after the sinval message went
past, but before the updating transaction commits.

 But maybe there's some other mechanism whereby we could combine
 sending the sinval messages slightly earlier (before
 ProcArrayEndTransaction) with blocking anyone who processes those
 messages until after the committing backend finishes
 ProcArrayEndTransaction.  For example, you could add an additional
 LWLock, which has to be held in exclusive mode by a committing
 transaction that sends any sinval messages.

Doesn't sound very scalable :-(.

Even given your recent changes to reduce the overhead of checking for
sinval messages, I'm not sure that it'd be practical to move the sinval
message processing to just-before-we-look-up-a-cache-entry.  Right now,
we do AcceptInvalidationMessages basically once per table per query
(or maybe it's twice or so, but anyway a very small multiplier on that).
If we try to do it every time through SearchSysCache, we are probably
talking two to three orders of magnitude more checks, which ISTM is
certain to push the sinval queue back up to the top of the heap for
contention.

But in any case, this isn't the core of the problem.  The real point
here is that we need a guarantee that a syscache entry we're going to
use is/was valid as of some suitable time point later than the start of
our current transaction.  (Once we have taken a snapshot, VACUUM will
know that it can't remove any tuples that were deleted after the time of
that snapshot; so even for SnapshotNow fetches, it's important to have
an MVCC snapshot to protect toast-table dereferences.)  Perhaps rather
than tying the problem into SearchSysCache, we should attach the
overhead to GetTransactionSnapshot, which is called appealingly few
times per query.  But right offhand it seems like that only protects us
against the toast-tuple-deletion problem, not against the more general
one of getting a stale view of the status of some relation.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers