[PATCHES] default database creation with initdb

2005-06-18 Thread Andreas Pflug
As per discussion on -hackers the attached patch creates the 'default' 
database at initdb time as a default target for initial connections to 
keep template1 free from connections and available as template source.


I consider this DB a system object, so it's created before 
make_template0 sets the last_system_oid (wondering why template0 isn't 
considered a system db too)


Regards,
Andreas
Index: src/bin/initdb/initdb.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/bin/initdb/initdb.c,v
retrieving revision 1.83
diff -u -r1.83 initdb.c
--- src/bin/initdb/initdb.c 30 Apr 2005 08:08:51 -  1.83
+++ src/bin/initdb/initdb.c 18 Jun 2005 08:37:16 -
@@ -177,6 +177,7 @@
 static void set_info_version(void);
 static void setup_schema(void);
 static void vacuum_db(void);
+static void make_default(void);
 static void make_template0(void);
 static void trapsig(int signum);
 static void check_ok(void);
@@ -1828,6 +1829,38 @@
 }
 
 /*
+ * copy template1 to pg_system
+ */
+static void
+make_default(void)
+{
+   PG_CMD_DECL;
+   char  **line;
+   static char *pg_system_setup[] = {
+   "CREATE DATABASE \"default\";\n",
+   "REVOKE CREATE,TEMPORARY ON DATABASE \"default\" FROM public;\n",
+   NULL
+   };
+
+   fputs(_("copying template1 to default ... "), stdout);
+   fflush(stdout);
+
+   snprintf(cmd, sizeof(cmd),
+"\"%s\" %s template1 >%s",
+backend_exec, backend_options,
+DEVNULL);
+
+   PG_CMD_OPEN;
+
+   for (line = pg_system_setup; *line; line++)
+   PG_CMD_PUTS(*line);
+
+   PG_CMD_CLOSE;
+
+   check_ok();
+}
+
+/*
  * copy template1 to template0
  */
 static void
@@ -2606,6 +2639,8 @@
 
vacuum_db();
 
+   make_default();
+
make_template0();
 
if (authwarning != NULL)



Re: [PATCHES] default database creation with initdb

2005-06-18 Thread Magnus Hagander
Umm. Tiny item, but your comment still refers to the database as
pg_system ;-) 

//Magnus 

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Andreas Pflug
 Sent: Saturday, June 18, 2005 10:42 AM
 To: PostgreSQL-patches
 Subject: [PATCHES] default database creation with initdb
 
 As per discussion on -hackers the attached patch creates the 
 'default' 
 database at initdb time as a default target for initial 
 connections to keep template1 free from connections and 
 available as template source.
 
 I consider this DB a system object, so it's created before 
 make_template0 sets the last_system_oid (wondering why 
 template0 isn't considered a system db too)
 
 Regards,
 Andreas
 



Re: [PATCHES] default database creation with initdb

2005-06-18 Thread Andreas Pflug

Magnus Hagander wrote:

Umm. Tiny item, but your comment still refers to the database as
pg_system ;-) 


:-)

Regards,
Andreas
Index: src/bin/initdb/initdb.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/bin/initdb/initdb.c,v
retrieving revision 1.83
diff -u -r1.83 initdb.c
--- src/bin/initdb/initdb.c 30 Apr 2005 08:08:51 -  1.83
+++ src/bin/initdb/initdb.c 18 Jun 2005 08:54:07 -
@@ -177,6 +177,7 @@
 static void set_info_version(void);
 static void setup_schema(void);
 static void vacuum_db(void);
+static void make_default(void);
 static void make_template0(void);
 static void trapsig(int signum);
 static void check_ok(void);
@@ -1828,6 +1829,38 @@
 }
 
 /*
+ * copy template1 to default
+ */
+static void
+make_default(void)
+{
+   PG_CMD_DECL;
+   char  **line;
+   static char *default_setup[] = {
+   "CREATE DATABASE \"default\";\n",
+   "REVOKE CREATE,TEMPORARY ON DATABASE \"default\" FROM public;\n",
+   NULL
+   };
+
+   fputs(_("copying template1 to default ... "), stdout);
+   fflush(stdout);
+
+   snprintf(cmd, sizeof(cmd),
+"\"%s\" %s template1 >%s",
+backend_exec, backend_options,
+DEVNULL);
+
+   PG_CMD_OPEN;
+
+   for (line = default_setup; *line; line++)
+   PG_CMD_PUTS(*line);
+
+   PG_CMD_CLOSE;
+
+   check_ok();
+}
+
+/*
  * copy template1 to template0
  */
 static void
@@ -2606,6 +2639,8 @@
 
vacuum_db();
 
+   make_default();
+
make_template0();
 
if (authwarning != NULL)



Re: [PATCHES] default database creation with initdb

2005-06-18 Thread Tom Lane
Andreas Pflug [EMAIL PROTECTED] writes:
 + "CREATE DATABASE \"default\";\n",
 + "REVOKE CREATE,TEMPORARY ON DATABASE \"default\" FROM public;\n",

Uh, why the rights revocation?  That makes the thing essentially useless
... except to superusers, who will not be affected anyway.  I don't
think this is a sane default.

In any case, fixing initdb for this is trivial.  I count something north
of 160 other references to template1 in the current sources, scattered
across 60+ files.  Every one of those has to be looked at and possibly
changed.

regards, tom lane



Re: [PATCHES] default database creation with initdb

2005-06-18 Thread Bruno Wolff III
On Sat, Jun 18, 2005 at 09:27:49 -0400,
  Robert Treat [EMAIL PROTECTED] wrote:
 On Saturday 18 June 2005 04:55, Andreas Pflug wrote:
  Magnus Hagander wrote:
   Umm. Tiny item, but your comment still refers to the database as
   pg_system ;-)
  
 
 What is the purpose of this database? A generalized, shared resource for tool 
 makers and add-on packages to store information in PostgreSQL, or a working 
 database that is usable (and to be used) out of the box for new users?  I 
 really don't think we want the latter... I can see users connecting via psql 
 and then playing around with different add/create type statements.  It is all 
 too common a question from newbies... "does postgresql have a default 
 database to get started with?" They'll see this database and begin creating 
 schema and using this as their main database, and I think we ought to avoid 
 that. If people don't like pg_system, pg_addons seems like a much safer name 
 to go with imho. 

I believe the intention is that things that need to connect to some
database to do their work (e.g. psql -l, createuser) will connect to that
database. createdb will still connect to template1, but now will be less
likely to have a conflict with another user being connected to template1
at the same time. I didn't check the patch to see if the behavior of
psql -l and createuser was changed or if just the initdb behavior was
changed.

I don't think it will be a big deal if people put stuff in that database.
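
For illustration, a client-side sketch of that fallback behavior (not
part of the patch; PQsetdbLogin is the real libpq call, everything else
here is illustrative):

#include <stdio.h>
#include <libpq-fe.h>

/* Prefer the new "default" database; fall back to template1 when
 * talking to an older server that doesn't have it. */
static PGconn *
connect_maintenance_db(void)
{
	PGconn *conn = PQsetdbLogin(NULL, NULL, NULL, NULL,
								"default", NULL, NULL);

	if (PQstatus(conn) != CONNECTION_OK)
	{
		PQfinish(conn);
		conn = PQsetdbLogin(NULL, NULL, NULL, NULL,
							"template1", NULL, NULL);
	}
	return conn;
}

int
main(void)
{
	PGconn *conn = connect_maintenance_db();

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}
	PQfinish(conn);
	return 0;
}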



Re: [PATCHES] TODO Item - Return compressed length of TOAST datatypes (WIP)

2005-06-18 Thread Tom Lane
Mark Kirkwood [EMAIL PROTECTED] writes:
 I thought I would have a look at:
 (Datatypes) Add function to return compressed length of TOAST data values.

My recollection of that discussion is that we just wanted something
that would return the actual VARSIZE() of the datum.  You're building
something way too confusing ...

A more interesting point is that the way you've declared the function,
it will only work on text values, which is pretty limiting.  Ideally
we'd define this thing as pg_datum_length(any-varlena-type) returns int
but there is no pseudotype corresponding to any varlena type.

I was thinking about this the other day in connection with my proposal
to make something that could return the TOAST value OID of an
out-of-line datum.  I think the only non-restrictive way to do it would
be to declare the function as taking "any", and then add a runtime check
using the get_fn_expr_argtype() mechanism to test that what you've been
given is in fact a varlena datatype.
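
For concreteness, a minimal sketch of that runtime check (illustrative
only, not a committed function; get_fn_expr_argtype() and
get_typlenbyval() are the real APIs):

#include "postgres.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

PG_FUNCTION_INFO_V1(pg_datum_length_sketch);

Datum
pg_datum_length_sketch(PG_FUNCTION_ARGS)
{
	Oid			argtype = get_fn_expr_argtype(fcinfo->flinfo, 0);
	int16		typlen;
	bool		typbyval;
	struct varlena *datum;

	if (!OidIsValid(argtype))
		elog(ERROR, "could not determine argument type");

	get_typlenbyval(argtype, &typlen, &typbyval);
	if (typlen != -1)			/* -1 marks a varlena type */
		elog(ERROR, "argument is not a variable-length type");

	/* Take the raw datum without detoasting: VARSIZE then reflects the
	 * stored (possibly compressed) size for inline values; for
	 * out-of-line values it is just the size of the TOAST pointer. */
	datum = (struct varlena *) PG_GETARG_RAW_VARLENA_P(0);
	PG_RETURN_INT32(VARSIZE(datum));
}

The SQL-level declaration would then take "any" and return int.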

regards, tom lane



Re: [PATCHES] Post-mortem: final 2PC patch

2005-06-18 Thread Stefan Kaltenbrunner
Tom Lane wrote:
 I've attached the 2PC patch as applied in case you want to have a look.
 I did some fairly significant hacking on it, and I thought it'd be a
 good idea to enumerate some of the changes:


FYI: this commit seems to cause problems/crashes during make check on my
OpenBSD/Sparc64 buildfarm client:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2005-06-17%2023:50:04

I just checked manually with a fresh checkout - and I got similar failures.


Stefan



Re: [PATCHES] Post-mortem: final 2PC patch

2005-06-18 Thread Tom Lane
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 FYI: this commit seems to cause problems/crashes during make check on my
 OpenBSD/Sparc64 buildfarm client:

 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2005-06-17%2023:50:04

 I just checked manually with a fresh checkout - and I got similar failures.

Can you get a stack trace from the core dump?

As best I can tell from the buildfarm report, one of the regression
tests is causing a sig11 --- but it's *not* the prepared_xacts test,
so I'm not sure the failure is related to this patch.

regards, tom lane



Re: [PATCHES] Post-mortem: final 2PC patch

2005-06-18 Thread Stefan Kaltenbrunner
Tom Lane wrote:
 Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 
FYI: this commit seems to cause problems/crashes during make check on my
OpenBSD/Sparc64 buildfarm client:
 
 
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2005-06-17%2023:50:04
 
 
I just checked manually with a fresh checkout - and I got similar failures.
 
 
 Can you get a stack trace from the core dump?

(gdb) bt
#0  0x50377ba4 in memcpy () from /usr/lib/libc.so.34.2
#1  0x00326efc in hash_search ()
#2  0x00343430 in pg_tzset ()
#3  0x001fbcf0 in assign_timezone ()
#4  0x00330e88 in set_config_option ()
#5  0x0029b5b0 in PostgresMain ()
#6  0x0026b140 in BackendRun ()
#7  0x0026a9f0 in BackendStartup ()
#8  0x002685bc in ServerLoop ()
#9  0x00267b88 in PostmasterMain ()
#10 0x00220cc8 in main ()

rebuilding with --enable-debug right now ...



 
 As best I can tell from the buildfarm report, one of the regression
 tests is causing a sig11 --- but it's *not* the prepared_xacts test,
 so I'm not sure the failure is related to this patch.

hmm - maybe one of the timezone patches ?


Stefan



Re: [PATCHES] Post-mortem: final 2PC patch

2005-06-18 Thread Tom Lane
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 Can you get a stack trace from the core dump?

 (gdb) bt
 #0  0x50377ba4 in memcpy () from /usr/lib/libc.so.34.2
 #1  0x00326efc in hash_search ()
 #2  0x00343430 in pg_tzset ()
 #3  0x001fbcf0 in assign_timezone ()
 #4  0x00330e88 in set_config_option ()
 #5  0x0029b5b0 in PostgresMain ()
 #6  0x0026b140 in BackendRun ()
 #7  0x0026a9f0 in BackendStartup ()
 #8  0x002685bc in ServerLoop ()
 #9  0x00267b88 in PostmasterMain ()
 #10 0x00220cc8 in main ()

 hmm - maybe one of the timezone patches ?

Looks suspicious, doesn't it.  How long since you last tested on that
machine?

regards, tom lane



Re: [PATCHES] Post-mortem: final 2PC patch

2005-06-18 Thread Stefan Kaltenbrunner
Tom Lane wrote:
 Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 
Can you get a stack trace from the core dump?
 
 
(gdb) bt
#0  0x50377ba4 in memcpy () from /usr/lib/libc.so.34.2
#1  0x00326efc in hash_search ()
#2  0x00343430 in pg_tzset ()
#3  0x001fbcf0 in assign_timezone ()
#4  0x00330e88 in set_config_option ()
#5  0x0029b5b0 in PostgresMain ()
#6  0x0026b140 in BackendRun ()
#7  0x0026a9f0 in BackendStartup ()
#8  0x002685bc in ServerLoop ()
#9  0x00267b88 in PostmasterMain ()
#10 0x00220cc8 in main ()
 
 
hmm - maybe one of the timezone patches ?
 
 
 Looks suspicious, doesn't it.  How long since you last tested on that
 machine?

*argl* - it's not 2PC ...

the machine had some issues a week ago or so - but it looks like the
problem occurred first here:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2005-06-15%2023:50:04

and in that changeset we have some timezone-patches ...


Stefan



Re: [PATCHES] Post-mortem: final 2PC patch

2005-06-18 Thread Tom Lane
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 the machine had some issues a week ago or so - but it looks like the
 problem occurred first here:

Hmm, what kind of issues, and are you sure they are fixed?

The stack trace looks to me like it is trying to apply the PGTZ setting
that's coming in from the client during startup.  Now I could believe a
platform-specific breakage there from Magnus' recent hacking, but the
problem is that *every* backend launched during the regression tests
is going to be doing this, and doing it in the same way with the same
value.  It's pretty tough to see why some backends would fail at this
spot and some not ... unless what it is is an intermittent hardware
problem.  Is there a version of memtest86 that works on your machine?

regards, tom lane



Re: [PATCHES] Post-mortem: final 2PC patch

2005-06-18 Thread Stefan Kaltenbrunner
Tom Lane wrote:
 Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 
the machine had some issues a week ago or so - but it looks like the
problem occured first here:
 
 
 Hmm, what kind of issues, and are you sure they are fixed?

admin error :-).
I had a second postgresql instance running causing the buildfarm runs to
report a failure because of shared memory limits.
Had nothing to do with hardware problems though ...

 
 The stack trace looks to me like it is trying to apply the PGTZ setting
 that's coming in from the client during startup.  Now I could believe a
 platform-specific breakage there from Magnus' recent hacking, but the
 problem is that *every* backend launched during the regression tests
 is going to be doing this, and doing it in the same way with the same
 value.  It's pretty tough to see why some backends would fail at this
 spot and some not ... unless what it is is an intermittent hardware
 problem.  Is there a version of memtest86 that works on your machine?

memtest86 works on x86 CPUs only afaik - no?

anyway - here is the promised backtrace:

#0  0x4489fba4 in memcpy () from /usr/lib/libc.so.34.2
#1  0x00326f9c in hash_search (hashp=0xa6e030, keyPtr=0xa5ff90,
action=HASH_ENTER, foundPtr=0x0) at dynahash.c:653
#2  0x003434f0 in pg_tzset (name=0xa5ff90 "PST8PDT") at pgtz.c:1039
#3  0x001fbcf0 in assign_timezone (value=0xa5ff90 "PST8PDT",
doit=1 '\001', source=PGC_S_CLIENT) at variable.c:351
#4  0x00330f28 in set_config_option (name=0xa52830 "timezone",
value=0xa52870 "PST8PDT", context=10645504, source=PGC_S_CLIENT,
isLocal=0 '\0',
changeVal=1 '\001') at guc.c:3748
#5  0x0029b5f0 in PostgresMain (argc=4, argv=0xa52928,
username=0xa52740 "mastermind") at postgres.c:2759
#6  0x0026b180 in BackendRun (port=0xa77800) at postmaster.c:2800
#7  0x0026aa30 in BackendStartup (port=0xa77800) at
postmaster.c:2440
#8  0x002685fc in ServerLoop () at postmaster.c:1221
#9  0x00267bc8 in PostmasterMain (argc=0,
argv=0x57d8) at postmaster.c:930
#10 0x00220cc8 in main (argc=6, argv=0x57d8) at
main.c:268


Stefan



Re: [PATCHES] Post-mortem: final 2PC patch

2005-06-18 Thread Tom Lane
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 anyway - here is the promised backtrace:

 #0  0x4489fba4 in memcpy () from /usr/lib/libc.so.34.2
 #1  0x00326f9c in hash_search (hashp=0xa6e030, keyPtr=0xa5ff90,
 action=HASH_ENTER, foundPtr=0x0) at dynahash.c:653
 #2  0x003434f0 in pg_tzset (name=0xa5ff90 "PST8PDT") at pgtz.c:1039
 #3  0x001fbcf0 in assign_timezone (value=0xa5ff90 "PST8PDT",
 doit=1 '\001', source=PGC_S_CLIENT) at variable.c:351
 #4  0x00330f28 in set_config_option (name=0xa52830 "timezone",
 value=0xa52870 "PST8PDT", context=10645504, source=PGC_S_CLIENT,
 isLocal=0 '\0',
 changeVal=1 '\001') at guc.c:3748
 #5  0x0029b5f0 in PostgresMain (argc=4, argv=0xa52928,
 username=0xa52740 "mastermind") at postgres.c:2759

Yeah, that's exactly what I thought it was doing: setting timezone
PST8PDT due to the client PGTZ environment variable.  So the question
now is why this is an intermittent failure.  Why doesn't every single
regression test crash in the same place?

regards, tom lane



[PATCHES] Typo in pl_funcs.c comment

2005-06-18 Thread Michael Fuhr
pl_funcs.c has a comment that refers to "functins", which I'm
assuming should be "functions".  The attached patch corrects
the spelling.

-- 
Michael Fuhr
http://www.fuhr.org/~mfuhr/
Index: src/pl/plpgsql/src/pl_funcs.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/pl/plpgsql/src/pl_funcs.c,v
retrieving revision 1.43
diff -c -r1.43 pl_funcs.c
*** src/pl/plpgsql/src/pl_funcs.c   14 Jun 2005 06:43:14 -  1.43
--- src/pl/plpgsql/src/pl_funcs.c   18 Jun 2005 19:54:57 -
***************
*** 1,5 ****
  /**
!  * pl_funcs.c - Misc functins for the PL/pgSQL
   *  procedural language
   *
   * IDENTIFICATION
--- 1,5 ----
  /**
!  * pl_funcs.c - Misc functions for the PL/pgSQL
   *  procedural language
   *
   * IDENTIFICATION



Re: [PATCHES] Post-mortem: final 2PC patch

2005-06-18 Thread Heikki Linnakangas

On Sat, 18 Jun 2005, Tom Lane wrote:


I've attached the 2PC patch as applied in case you want to have a look.
I did some fairly significant hacking on it, and I thought it'd be a
good idea to enumerate some of the changes:


I modified the in-memory data structure so that there is a separate
dummy PGPROC for each prepared transaction (GXACT), and we control
the visibility of the transaction to TransactionIdIsInProgress by
inserting or removing the PGPROC in the global ProcArray.  This costs
a little space but it has a lot of benefits:

(..list of benefits..)


Yep, that looks much better. In general, a prepared transaction now looks 
and behaves more like normal active transactions. That leaves less room 
for error.



Another area that I changed was fsyncing of prepared transaction status
files.  I really didn't like expecting checkpoint to do this --- for one
thing there are race conditions associated with fsyncing a status file
that someone else is busy deleting.  (You could possibly avoid that by
having the checkpointer hold the TwoPhaseStateLock while it's fsyncing,
but the performance implications of that are unpleasant.)  The
checkpoint coding dealt with the race condition by ignoring *all*
open() failures, which is hardly what I'd call a robust solution.

The way it's done in the committed patch, the prepare sequence is:
1. Write the status file, append a deliberately bad CRC,
   and fsync it.
2. Write the WAL record and fsync it.
3. Overwrite the bad CRC with a good one and fsync.
It's annoying to have 3 fsyncs here; 2 would be nicer; but from a safety
perspective I don't think there's much choice.  Steps 2 and 3 form a
critical section: if we complete 2 but fail at 3 we have to PANIC
because on-disk state is not consistent with what it says in WAL.  So
I felt we had to do the full pushup in step 1 to be as certain as we
can possibly be that step 3 won't fail.
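
Schematically, the sequence looks like this (the helpers are hypothetical
stand-ins, not the actual twophase.c routines; START_CRIT_SECTION,
END_CRIT_SECTION, and XLogFlush are real):

#include "postgres.h"
#include "access/xlog.h"

/* hypothetical helpers standing in for the real state-file code */
extern void write_state_file_with_bad_crc(const char *path);
extern void fsync_state_file(const char *path);
extern void overwrite_crc_with_good_value(const char *path);
extern XLogRecPtr write_prepare_wal_record(void);

static void
prepare_sequence_sketch(const char *path)
{
	/* 1. Write the full status file with a deliberately bad CRC
	 * appended, and fsync it. */
	write_state_file_with_bad_crc(path);
	fsync_state_file(path);

	/* Steps 2 and 3 form a critical section: completing 2 but failing
	 * at 3 leaves on-disk state inconsistent with WAL, hence PANIC. */
	START_CRIT_SECTION();

	/* 2. Write the PREPARE WAL record and fsync it. */
	XLogFlush(write_prepare_wal_record());

	/* 3. Overwrite the bad CRC with a good one and fsync again. */
	overwrite_crc_with_good_value(path);
	fsync_state_file(path);

	END_CRIT_SECTION();
}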

I'm not totally satisfied with this --- it's OK from a correctness
standpoint but the performance leaves something to be desired.


Ouch, that really hurts performance.

In typical 2PC use, the state files live for a very short period of time, 
just long enough for the transaction manager to prepare all the resource 
managers participating in the global transaction, and then commit them. 
We're talking < 1 s. If we let the checkpoint fsync the state files, we 
would only have to fsync those state files that happen to be alive when 
the checkpoint comes. And if we fsync the state files at the end of the 
checkpoint, after all the usual heap pages etc, it's very likely that 
even those rare state files that were alive when the checkpoint began, 
have already been deleted.


So we're really talking about 1 fsync vs. 3 fsyncs.

Can we figure out another way to solve the race condition? Would it 
in fact be ok for the checkpointer to hold the TwoPhaseStateLock,
considering that it usually wouldn't be held for long, since usually the 
checkpoint would have very little work to do?


Were there other reasons besides the race condition why you did it this way?


I also fooled around a bit with locking of GXACTs.  In the original
patch you prevent two backends from trying to commit/rollback the same
GXACT by removing it from the pool at the start of the process.
That's not workable anymore, so what I did was to add a field
locking_xid showing the XID of the transaction currently working on
the GXACT.  As long as this XID is active according to ProcArray
(ignoring the prepared xact itself of course) the GXACT cannot be
touched by anyone else.  This provides a convenient way to interlock
creation of a GXACT as well as its eventual destruction.


Ok.


I also changed the handling of smgr pending deletions.  The original
patch made smgr a 2PC resource manager, but that does not work because
XLOG replay doesn't (and shouldn't I think) run the resource managers;
thus replay of a PREPARE/COMMIT sequence would fail to delete whatever
files should have been deleted.  The correct way to do this is that
COMMIT or ROLLBACK PREPARED must emit a WAL record that looks exactly
like a normal COMMIT or ABORT WAL record, including listing any files
to be deleted.  So I pulled the files-to-delete info into the 2PC file
header rather than keeping it in resource records.


Ok.

- Heikki



Re: [PATCHES] Post-mortem: final 2PC patch

2005-06-18 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 On Sat, 18 Jun 2005, Tom Lane wrote:
 I'm not totally satisfied with this --- it's OK from a correctness
 standpoint but the performance leaves something to be desired.

 Ouch, that really hurts performance.

 In typical 2PC use, the state files live for a very short period of time, 
 just long enough for the transaction manager to prepare all the resource 
 managers participating in the global transaction, and then commit them. 
 We're talking < 1 s. If we let the checkpoint fsync the state files, we 
 would only have to fsync those state files that happen to be alive when 
 the checkpoint comes.

That's a good point --- I was thinking this was basically 4 fsyncs per xact
(counting the additional WAL fsync needed for COMMIT PREPARED) versus 3,
but if the average lifetime of a state file is short then it's 4 vs 2,
and what's more the 2 are on WAL, which should be way cheaper than
fsyncing random files.

 And if we fsync the state files at the end of the 
 checkpoint, after all the usual heap pages etc, it's very likely that 
 even those rare state files that were alive when the checkpoint began, 
 have already been deleted.

That argument is bogus, at least with respect to the way you were doing
it in the original patch, because what you were fsyncing was whatever
existed when CheckPointTwoPhase() started.  It could however be
interesting if we actually implemented something that checked the age of
the prepared xact.

 Can we figure out another way to solve the race condition? Would it 
 in fact be ok for the checkpointer to hold the TwoPhaseStateLock,
 considering that it usually wouldn't be held for long, since usually the 
 checkpoint would have very little work to do?

If you're concerned about throughput of 2PC xacts then we can't sit on
the TwoPhaseStateLock while doing I/O; that will block both preparation
and committal of all 2PC xacts for a pretty long period in CPU terms.

Here's a sketch of an idea inspired by your comment above:

1. In each gxact in shared memory, store the WAL offset of the PREPARE
record, which we will know before we are ready to mark the gxact
valid.

2. When CheckPointTwoPhase runs (which we'll put near the end of the
checkpoint sequence), the only gxacts that need to be fsync'd are those
that are marked valid and have a PREPARE WAL location older than the
checkpoint's redo horizon (anything newer will be replayed from WAL on
crash, so it doesn't need fsync to complete the checkpoint).  If you're
right that the lifespan of a state file is often shorter than the time
needed for a checkpoint, this wins big.  In any case we'll never have to
fsync state files that disappear before the next checkpoint.

3. One way to handle CheckPointTwoPhase is:

* At start, take TwoPhaseStateLock (can be in shared mode) for just long
enough to scan the gxact list and make a list of the XIDs of things that
need fsync per the above rule.

* Without the lock, try to open and fsync each item in the list.
Success: remove from list
ENOENT failure on open: add to list of not-there failures
Any other failure: ereport(ERROR)

* If the failure list is not empty, again take TwoPhaseStateLock in
shared mode, and check that each of the failures is now gone (or at
least marked invalid); if so it's OK, otherwise ereport the ENOENT
error.
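
Put together, a sketch of that procedure (TwoPhaseStateLock and the
LWLock calls are real; the gxact_* helpers and MAX_GXACTS are
hypothetical):

#include "postgres.h"
#include "access/xlog.h"
#include "storage/lwlock.h"

#define MAX_GXACTS 64			/* hypothetical cap on prepared xacts */

extern int	gxact_collect_older_than(XLogRecPtr horizon, TransactionId *xids);
extern bool gxact_fsync_state_file(TransactionId xid);	/* false on ENOENT */
extern bool gxact_is_still_valid(TransactionId xid);

static void
CheckPointTwoPhase_sketch(XLogRecPtr redo_horizon)
{
	TransactionId to_fsync[MAX_GXACTS];
	TransactionId missing[MAX_GXACTS];
	int			nfsync,
				nmissing = 0,
				i;

	/* Step 1: under shared lock, just collect the XIDs of valid gxacts
	 * whose PREPARE record is older than the redo horizon. */
	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
	nfsync = gxact_collect_older_than(redo_horizon, to_fsync);
	LWLockRelease(TwoPhaseStateLock);

	/* Step 2: without the lock, open and fsync each state file.
	 * ENOENT goes on the failure list; any other error is reported
	 * inside the helper via ereport(ERROR). */
	for (i = 0; i < nfsync; i++)
	{
		if (!gxact_fsync_state_file(to_fsync[i]))
			missing[nmissing++] = to_fsync[i];
	}

	/* Step 3: an ENOENT is harmless only if the gxact is now gone or
	 * invalid, i.e. it was committed/aborted while we worked. */
	if (nmissing > 0)
	{
		LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
		for (i = 0; i < nmissing; i++)
		{
			if (gxact_is_still_valid(missing[i]))
				ereport(ERROR,
						(errmsg("two-phase state file for transaction %u is missing",
								missing[i])));
		}
		LWLockRelease(TwoPhaseStateLock);
	}
}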

Another possibility is to further extend the locking protocol for gxacts
so that the checkpointer can lock just the item it is fsyncing (which is
not possible at the moment because the checkpointer hasn't got an XID,
but probably we could think of another approach).  But that would
certainly delay attempts to commit the item being fsync'd, whereas the
above approach might not have to do so, depending on the filesystem
implementation.

Now there's a small problem with this approach, which is that we cannot
store the PREPARE WAL record location in the state files, since the
state file has to be completely computed before writing the WAL record.
However, we don't really need to do that: during recovery of a prepared
xact we know the thing has been fsynced (either originally, or when we
rewrote it during the WAL recovery sequence --- we can force an
immediate fsync in that one case).  So we can just put zero, or maybe
better the current end-of-WAL location, into the reconstructed gxact in
memory.

Thoughts?

regards, tom lane



Re: [PATCHES] Post-mortem: final 2PC patch

2005-06-18 Thread Heikki Linnakangas

On Sat, 18 Jun 2005, Tom Lane wrote:


Heikki Linnakangas [EMAIL PROTECTED] writes:

Can we figure out another way to solve the race condition? Would it
in fact be ok for the checkpointer to hold the TwoPhaseStateLock,
considering that it usually wouldn't be held for long, since usually the
checkpoint would have very little work to do?


If you're concerned about throughput of 2PC xacts then we can't sit on
the TwoPhaseStateLock while doing I/O; that will block both preparation
and committal of all 2PC xacts for a pretty long period in CPU terms.

Here's a sketch of an idea inspired by your comment above:

1. In each gxact in shared memory, store the WAL offset of the PREPARE
record, which we will know before we are ready to mark the gxact
valid.

2. When CheckPointTwoPhase runs (which we'll put near the end of the
checkpoint sequence), the only gxacts that need to be fsync'd are those
that are marked valid and have a PREPARE WAL location older than the
checkpoint's redo horizon (anything newer will be replayed from WAL on
crash, so it doesn't need fsync to complete the checkpoint).  If you're
right that the lifespan of a state file is often shorter than the time
needed for a checkpoint, this wins big.  In any case we'll never have to
fsync state files that disappear before the next checkpoint.

3. One way to handle CheckPointTwoPhase is:

* At start, take TwoPhaseStateLock (can be in shared mode) for just long
enough to scan the gxact list and make a list of the XIDs of things that
need fsync per the above rule.

* Without the lock, try to open and fsync each item in the list.
Success: remove from list
ENOENT failure on open: add to list of not-there failures
Any other failure: ereport(ERROR)

* If the failure list is not empty, again take TwoPhaseStateLock in
shared mode, and check that each of the failures is now gone (or at
least marked invalid); if so it's OK, otherwise ereport the ENOENT
error.


In step 3.1, is it safe to skip gxacts not marked as valid? The gxact is 
marked as valid after the prepare record is written to WAL. If a checkpoint 
runs after the WAL record is written but before the gxact is marked as 
valid, it doesn't get fsynced. Right?


Otherwise, looks good to me.


Another possibility is to further extend the locking protocol for gxacts
so that the checkpointer can lock just the item it is fsyncing (which is
not possible at the moment because the checkpointer hasn't got an XID,
but probably we could think of another approach).  But that would
certainly delay attempts to commit the item being fsync'd, whereas the
above approach might not have to do so, depending on the filesystem
implementation.


The above sketch is much better.


Now there's a small problem with this approach, which is that we cannot
store the PREPARE WAL record location in the state files, since the
state file has to be completely computed before writing the WAL record.
However, we don't really need to do that: during recovery of a prepared
xact we know the thing has been fsynced (either originally, or when we
rewrote it during the WAL recovery sequence --- we can force an
immediate fsync in that one case).  So we can just put zero, or maybe
better the current end-of-WAL location, into the reconstructed gxact in
memory.


This reminds me of something. What should we do about XID wraparounds and 
prepared transactions? Should we have some mechanism to freeze prepared 
transactions, like heap tuples? At the minimum, I think we should issue a 
warning if the xid counter approaches the oldest prepared transaction.


A transaction shouldn't live that long in normal use, but I can imagine an 
orphaned transaction sitting there for years if it doesn't hold any locks 
etc that bother other applications.


I don't think we should implement heuristic commit/rollback, though. 
That creates a whole new class of problems.


- Heikki



Re: [PATCHES] Post-mortem: final 2PC patch

2005-06-18 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 In step 3.1, is it safe to skip gxacts not marked as valid? The gxact is 
 marked as valid after the prepare record is written to WAL. If a checkpoint 
 runs after the WAL record is written but before the gxact is marked as 
 valid, it doesn't get fsynced. Right?

It's safe because we have the CheckpointStartLock: no checkpoint
occurring in that interval can be trying to set its redo pointer
after the PREPARE record.  This is essentially the same race condition
previously detected for COMMIT vs. writing clog, and we need the
interlock with either approach to fsyncing 2PC.
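
Schematically (CheckpointStartLock is the real lock; the helpers here
are hypothetical stand-ins for the twophase.c code):

	LWLockAcquire(CheckpointStartLock, LW_SHARED);
	XLogFlush(write_prepare_wal_record());	/* hypothetical helper */
	mark_gxact_valid(gxact);				/* hypothetical helper */
	LWLockRelease(CheckpointStartLock);

While the lock is held, no checkpoint can set its redo pointer past the
PREPARE record, so a gxact can never be invisible both to WAL replay and
to CheckPointTwoPhase.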

 This reminds me of something. What should we do about XID wraparounds and 
 prepared transactions?

I don't think we need any special protection.  The prepared xacts will
be blocking advancement of GlobalXmin, and the (new in 8.1) XID wrap
detection code will shut us down safely.

 A transaction shouldn't live that long in normal use, but I can imagine an 
 orphaned transaction sitting there for years if it doesn't hold any locks 
 etc that bother other applications.

No, because people will notice the effects on VACUUM if nothing else.

 I don't think we should implement heuristic commit/rollback, though. 
 That creates a whole new class of problems.

Agreed.  I did put in the time-of-prepare display in pg_prepared_xacts,
so anyone who really wants this can program the behavior they want from
outside the database.

regards, tom lane



Re: [PATCHES] TODO Item - Return compressed length of TOAST datatypes

2005-06-18 Thread Mark Kirkwood

Tom Lane wrote:

Mark Kirkwood [EMAIL PROTECTED] writes:


I thought I would have a look at:
(Datatypes) Add function to return compressed length of TOAST data values.



My recollection of that discussion is that we just wanted something
that would return the actual VARSIZE() of the datum.  You're building
something way too confusing ...



I was guessing a little about exactly what was wanted - for some reason
I couldn't access the mail archives for the last few days (however, I
can now).



A more interesting point is that the way you've declared the function,
it will only work on text values, which is pretty limiting.  Ideally
we'd define this thing as pg_datum_length(any-varlena-type) returns int
but there is no pseudotype corresponding to any varlena type.

I was thinking about this the other day in connection with my proposal
to make something that could return the TOAST value OID of an
out-of-line datum.  I think the only non-restrictive way to do it would
be to declare the function as taking "any", and then add a runtime check
using the get_fn_expr_argtype() mechanism to test that what you've been
given is in fact a varlena datatype.



Yes - was thinking I needed to check if the Datum was a varlena (and 
didn't know how to do it...), so thanks for the pointer!


With these thoughts in mind, I'll have another go :-)

Cheers

Mark




Re: [PATCHES] Typo in pl_funcs.c comment

2005-06-18 Thread Bruce Momjian

Thanks, applied.

---

Michael Fuhr wrote:
 pl_funcs.c has a comment that refers to "functins", which I'm
 assuming should be "functions".  The attached patch corrects
 the spelling.
 
 -- 
 Michael Fuhr
 http://www.fuhr.org/~mfuhr/

[ Attachment, skipping... ]

 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



[PATCHES] Python setof patch

2005-06-18 Thread Gerrit van Dyk

Hi,

This patch allows the PL/Python module to do set-returning (SRF) functions.

The patch was taken from the CVS version.

I have modified the plpython.c file and have added a test SQL script for 
testing the functionality. It is actually the script that was in the 
8.0.3 version but has since been removed.


In order to signal the end of a set, the called Python function must 
simply return plpy.EndOfSet, and the set will be returned.


Please feel free to email me if you experience any problems.

The next project I am working on is to get the PL/Python module to return 
records (i.e. structured data).


Regards
Gerrit van Dyk
AgileWorks (Pty) Ltd
diff -c -r -P orig/pgsql/src/pl/plpython/plpython.c 
new/pgsql/src/pl/plpython/plpython.c
*** orig/pgsql/src/pl/plpython/plpython.c   2005-06-15 10:55:02.0 
+0200
--- new/pgsql/src/pl/plpython/plpython.c2005-06-14 14:12:01.0 
+0200
***************
*** 286,291 ****
--- 286,294 ----
  static PyObject *PLy_exc_fatal = NULL;
  static PyObject *PLy_exc_spi_error = NULL;
  
+ /* End-of-set Indication */
+ static PyObject *PLy_endofset = NULL;
+ 
  /* some globals for the python module
   */
  static char PLy_plan_doc[] = {
***************
*** 770,775 ****
--- 773,788 ----
fcinfo->isnull = true;
rv = (Datum) NULL;
}
+   /* test for end-of-set condition */
+   else if (fcinfo->flinfo->fn_retset && plrv == PLy_endofset)
+   {
+  ReturnSetInfo *rsi;
+ 
+  fcinfo->isnull = true;
+  rv = (Datum) NULL;
+  rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+  rsi->isDone = ExprEndResult;
+   }
else
{
fcinfo->isnull = false;
***************
*** 2317,2325 ****
--- 2330,2340 ----
PLy_exc_error = PyErr_NewException("plpy.Error", NULL, NULL);
PLy_exc_fatal = PyErr_NewException("plpy.Fatal", NULL, NULL);
PLy_exc_spi_error = PyErr_NewException("plpy.SPIError", NULL, NULL);
+   PLy_endofset = PyErr_NewException("plpy.EndOfSet", NULL, NULL);
PyDict_SetItemString(plpy_dict, "Error", PLy_exc_error);
PyDict_SetItemString(plpy_dict, "Fatal", PLy_exc_fatal);
PyDict_SetItemString(plpy_dict, "SPIError", PLy_exc_spi_error);
+   PyDict_SetItemString(plpy_dict, "EndOfSet", PLy_endofset);
  
/*
 * initialize main module, and add plpy
diff -c -r -P orig/pgsql/src/pl/plpython/sql/plpython_setof.sql 
new/pgsql/src/pl/plpython/sql/plpython_setof.sql
*** orig/pgsql/src/pl/plpython/sql/plpython_setof.sql   1970-01-01 
02:00:00.0 +0200
--- new/pgsql/src/pl/plpython/sql/plpython_setof.sql2005-06-14 
14:12:01.0 +0200
***************
*** 0 ****
--- 1,12 ----
+ 
+ CREATE or replace FUNCTION test_setof() returns setof text
+   AS
+ 'if GD.has_key("calls"):
+   GD["calls"] = GD["calls"] + 1
+   if GD["calls"] > 2:
+   del GD["calls"]
+   return plpy.EndOfSet
+ else:
+   GD["calls"] = 1
+ return str(GD["calls"])'
+   LANGUAGE plpythonu;



Re: [PATCHES] Add PG version number to NLS files

2005-06-18 Thread Martin Pitt
Hi!

Bruce Momjian [2005-06-15 15:26 -0400]:
 Peter Eisentraut wrote:
  Bruce Momjian wrote:
   The 'bind' calls in the binaries are going to look for the proper
   version.  Does that help, or is libpq the only thing we need to
   handle?
  
  Shared libraries have their version number embedded in the file name for 
  the explicit purpose of installing more than one version side by side.  
  Therefore, the PO files also need to make arrangements for side by side 
  installation.  No such promises are made for non-library files.
 
 OK, seems we need examples of how our current setup fails, both for
 libpq and binaries.

Debian's and Ubuntu's packages currently look like this (binary ->
translation domain):

/usr/lib/libpq3.so: libpq3
/usr/lib/libpq4.so: libpq4
/usr/lib/postgresql/7.4/bin/postmaster: postgres-7.4
/usr/lib/postgresql/8.0/bin/postmaster: postgres-8.0
/usr/lib/postgresql/7.4/bin/psql: psql-7.4
/usr/lib/postgresql/8.0/bin/psql: psql-8.0
[similar for all other client binaries]

Without my domain patches, i. e. with upstream's scheme it would look
like:

/usr/lib/libpq3.so: libpq 
/usr/lib/libpq4.so: libpq
/usr/lib/postgresql/7.4/bin/postmaster: postgres
/usr/lib/postgresql/8.0/bin/postmaster: postgres

As you see, there is a conflict, so to be able to install both
packages at the same time, one translation version had to be picked
and stuffed into a separate -translations package, which both versions
depend on. That's ugly and would cause one version to have some
missing translations, therefore I patched the domains to be
version-specific.
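
Schematically, the difference is just the domain string each binary binds
(a minimal C sketch; the paths and domain names follow the Debian layout
above, not PostgreSQL's actual NLS code):

#include <libintl.h>
#include <locale.h>
#include <stdio.h>

int
main(void)
{
	setlocale(LC_ALL, "");

	/* Upstream scheme: 7.4 and 8.0 both bind "postgres", so their
	 * .mo files collide under /usr/share/locale. */
	bindtextdomain("postgres", "/usr/share/locale");

	/* Debian/Ubuntu scheme: the major version is part of the domain,
	 * so postgres-7.4.mo and postgres-8.0.mo can coexist. */
	bindtextdomain("postgres-8.0", "/usr/share/locale");
	textdomain("postgres-8.0");

	puts(gettext("some translatable message"));
	return 0;
}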

I do think that adopting my scheme at least for libpq really makes
sense. Adopting it for the binaries would not do any harm, but it
might not be conformant to your packaging policy, which I don't really
know. The clashing domains are the only thing that prevents the
parallel installation of different major versions, so the question
whether or not you want to adopt it mainly boils down to whether you
want to support parallel server installations.

I am grateful that you did the SONAME change upstream, that relieved
Debian and probably other vendors of much pain. OTOH, the Debian-
specific translation domain changes are really simple and don't impose
a significant maintenance overhead, so there is no urgency at all.

Thanks and have a nice day!

Martin

-- 
Martin Pitthttp://www.piware.de
Ubuntu Developer   http://www.ubuntu.com
Debian Developer   http://www.debian.org

