date:20110729

Robert Haas  writes:
> On Fri, Jul 29, 2011 at 5:25 PM, Tom Lane  wrote:
>> Portability, or rather lack of it.  What about using erand48, which we
>> already have a dependency on (and substitute code for)?

> Neither our implementation nor glibc's appears to be thread-safe,

I think you're confusing srand48 with erand48.  The latter relies on a
caller-supplied state value, so if it's not thread-safe the caller has
only itself to blame.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Fix for pg_update on win32

2011-07-29 Thread Bruce Momjian

Based on EnterpriseDB testing, I have applied the attached patch to fix
a pg_upgrade bug on Win32 on 9.1 and 9.2.  The problem is that on Win32
you can't stat() a directory with a trailing slash --- this is already
mentioned in our src/port/path.c.  The patch removes a lone trailing
slash.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/exec.c b/contrib/pg_upgrade/exec.c
new file mode 100644
index ef21899..b632584
*** a/contrib/pg_upgrade/exec.c
--- b/contrib/pg_upgrade/exec.c
*** check_data_dir(const char *pg_data)
*** 168,174 
  	{
  		struct stat statBuf;
  
! 		snprintf(subDirName, sizeof(subDirName), "%s/%s", pg_data,
   requiredSubdirs[subdirnum]);
  
  		if (stat(subDirName, &statBuf) != 0)
--- 168,176 
  	{
  		struct stat statBuf;
  
! 		snprintf(subDirName, sizeof(subDirName), "%s%s%s", pg_data,
! 			/* Win32 can't stat() a directory with a trailing slash. */
!  *requiredSubdirs[subdirnum] ? "/" : "",
   requiredSubdirs[subdirnum]);
  
  		if (stat(subDirName, &statBuf) != 0)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Check constraints on partition parents only?

>> Comments and further feedback, if any, appreciated.
>
> Did you look at how this conflicts with my patch to add not null
> rows to pg_constraint?
>
> https://commitfest.postgresql.org/action/patch_view?id=601
>

I was certainly not aware of this patch in the commitfest. Your patch
has a larger footprint with more functional changes in it. IMHO, it
will be easiest to queue this non-inheritable constraints patch behind
your patch in the commitfest. There will be certain bitrot, which I
can fix once your patch gets committed.

Regards,
Nikhils

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Incremental checkopints

2011-07-29 Thread jordani

> If you make writes go out more often, they will be less efficient

I think fsync is more important. But many writes + fsync is no good too.
Let suppose that 30 WAL segments are good for performance (to be written
at once). In incremental approach we can have 60 segments and we can write
30 at once. There is no checkpoint_timeout - more buffers will stay more
time.

I can not see any disadvantage.

Jordan Ivanov


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Reduced power consumption in autovacuum launcher process

2011-07-29 Thread Peter Geoghegan

Attached is revision of this patch that now treats the latch in
PGPROC, waitLatch, as the generic "process latch", rather than just
using it for sync rep; It is initialised appropriately as a shared
latch generically, within InitProcGlobal(), and ownership is
subsequently set within InitProcess(). We were doing so before, though
only for the benefit of sync rep.

The idea here is to have a per-process latch to guard against time-out
invalidation issues from within each generic signal handler, by
calling SetLatch() on our generic latch there, much as we already do
from within non-generic archiver process signal handlers on the
archiver's static, non-generic latch (the archiver has its MyProc
pointer set to NULL, and we allow for the possibility that generic
handlers may be registered within processes that have a NULL MyProc -
though latch timeout invalidation issues then become the
responsibility of the process exclusively, just as is currently the
case with the archiver. In other words, they better not register a
generic signal handler).

It doesn't really matter that the SetLatch() call will usually be
unnecessary, because, as Heikki once pointed out, redundantly setting
a latch that is already set is very cheap. We don't check if it's set
directly in advance of setting the latch, because the Latch struct is
"logically opaque" and there is no "public function" to check if it's
set, nor should there be; the checking simply happens in SetLatch().

On 18 July 2011 20:06, Heikki Linnakangas
 wrote:
> Right, we can easily change the timeout argument to be in milliseconds
> instead of microseconds.

I've done so in this latest revision as a precautionary measure. I
don't see much point in sub-millisecond granularity, and besides, the
Windows implementation will not provide that granularity anyway as
things stand.

Thoughts?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a6959f..b2cf973 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10210,7 +10210,7 @@ retry:
 	/*
 	 * Wait for more WAL to arrive, or timeout to be reached
 	 */
-	WaitLatch(&XLogCtl->recoveryWakeupLatch, WL_LATCH_SET | WL_TIMEOUT, 500L);
+	WaitLatch(&XLogCtl->recoveryWakeupLatch, WL_LATCH_SET | WL_TIMEOUT, 5000L);
 	ResetLatch(&XLogCtl->recoveryWakeupLatch);
 }
 else
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 9940a42..0663eb8 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -182,7 +182,7 @@ DisownLatch(volatile Latch *latch)
  * to wait for. If the latch is already set (and WL_LATCH_SET is given), the
  * function returns immediately.
  *
- * The 'timeout' is given in microseconds. It must be >= 0 if WL_TIMEOUT
+ * The 'timeout' is given in milliseconds. It must be >= 0 if WL_TIMEOUT
  * event is given, otherwise it is ignored. On some platforms, signals cause
  * the timeout to be restarted, so beware that the function can sleep for
  * several times longer than the specified timeout.
@@ -236,8 +236,9 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	if (wakeEvents & WL_TIMEOUT)
 	{
 		Assert(timeout >= 0);
-		tv.tv_sec = timeout / 100L;
-		tv.tv_usec = timeout % 100L;
+		tv.tv_sec = timeout / 1000L;
+		/* tv_usec should be in microseconds */
+		tv.tv_usec = (timeout % 1000L) * 1000L;
 		tvp = &tv;
 	}

@@ -329,6 +330,11 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 /*
  * Sets a latch and wakes up anyone waiting on it. Returns quickly if the
  * latch is already set.
+ *
+ * Note that there is a general requirement to call this function within
+ * signal handlers, to ensure that latch timeout is not invalidated. For
+ * generic signal handlers that may be registered for multiple processes,
+ * it will be called with the per-process latch in PGPROC as its argument.
  */
 void
 SetLatch(volatile Latch *latch)
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
index 8d59ad4..5ee4f40 100644
--- a/src/backend/port/win32_latch.c
+++ b/src/backend/port/win32_latch.c
@@ -99,23 +99,14 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, SOCKET sock,
 	int			numevents;
 	int			result = 0;
 	int			pmdeath_eventno = 0;
-	long		timeout_ms;

 	Assert(wakeEvents != 0);
+	Assert(timeout >= 0 || (wakeEvents & WL_TIMEOUT) == 0);

 	/* Ignore WL_SOCKET_* events if no valid socket is given */
 	if (sock == PGINVALID_SOCKET)
 		wakeEvents &= ~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);

-	/* Convert timeout to milliseconds for WaitForMultipleObjects() */
-	if (wakeEvents & WL_TIMEOUT)
-	{
-		Assert(timeout >= 0);
-		timeout_ms = timeout / 1000;
-	}
-	else
-		timeout_ms = INFINITE;
-
 	/* Construct an array of event handles for WaitforMultipleObjects() */
 	latch

Re: [HACKERS] cataloguing NOT NULL constraints

2011-07-29 Thread Alvaro Herrera

Excerpts from Robert Haas's message of sáb jul 23 07:40:12 -0400 2011:
> On Sat, Jul 23, 2011 at 4:37 AM, Dean Rasheed  
> wrote:
> > That looks wrong to me, because a NOT NULL constraint is a column
> > constraint not a table constraint. The CREATE TABLE syntax explicitly
> > distinguishes these 2 cases, and only allows NOT NULLs in column
> > constraints. So from a consistency point-of-view, I think that ALTER
> > TABLE should follow suit.
> >
> > So the new syntax could be:
> >
> > ALTER TABLE table_name ALTER [COLUMN] col_name ADD column_constraint
> >
> > where column_constraint is the same as in CREATE TABLE (i.e., allowing
> > all the other constraint types at the same time).
> >
> > It looks like that approach would probably lend itself to more
> > code-reusability too, especially once we start adding options to the
> > constraint.
> 
> So you'd end up with something like this?
> 
> ALTER TABLE foo ALTER COLUMN bar ADD CONSTRAINT somename NOT NULL
> 
> That works for me.  I think sticking the name of the constraint in
> there at the end of the line as Alvaro proposed would be terrible for
> future syntax extensibility - we'll be much less likely to paint
> ourselves into a corner with something like this.

Here's a patch that does things more or less in this way.  Note that
this is separate from the other patch, so while you can specify a
constraint name for the NOT NULL clause, it's not stored anywhere.

This is preliminary: there's no docs nor new tests.  Here's how it works
(you can also throw in PRIMARY KEY into the mix, but not EXCLUSION):

alvherre=# create table bar (a int);
CREATE TABLE
alvherre=# alter table bar alter column a add constraint foo_fk references foo 
initially deferred deferrable check (a <> 4) constraint a_uq unique constraint 
fnn not null;
NOTICE:  ALTER TABLE / ADD UNIQUE creará el índice implícito «a_uq» para la 
tabla «bar»
ALTER TABLE
alvherre=# \d bar
Tabla «public.bar»
 Columna |  Tipo   | Modificadores 
-+-+---
 a   | integer | not null
Índices:
"a_uq" UNIQUE CONSTRAINT, btree (a)
Restricciones CHECK:
"bar_a_check" CHECK (a <> 4)
Restricciones de llave foránea:
"foo_fk" FOREIGN KEY (a) REFERENCES foo(a) DEFERRABLE INITIALLY DEFERRED


The implementation is a bit dirty (at least IMO), but I don't see a way
around that, mainly because ALTER TABLE / ALTER COLUMN does not have a
proper ColumnDef to stick the Constraint nodes into; so while the other
constraints can do fine without that, it isn't very helpful for NOT NULL.
So it has to create a phony ColumnDef for transformConstraintItems to use.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


alter_add.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] RC1 / Beta4?

2011-07-29 Thread Joshua Berkus

All,

Where are we on RC1 or Beta4 for PostgreSQL 9.1?  

While I know we're doing going to do a final release in August due to the 
europeans, it would be nice to move things along before then.  There don't seem 
to be any blockers open.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
San Francisco

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] USECS_* constants undefined with float8 timestamps?

On Fri, Jul 29, 2011 at 11:18 AM, Johann 'Myrkraverk' Oskarsson
 wrote:
> Hi all,
>
> I just noticed that the USECS_* constants are not defined when the server
> is compiled without integer dates and timestamps.
>
> Explicitly, timestamp.h is
>
> #ifdef HAVE_INT64_TIMESTAMP
> #define USECS_PER_DAY   INT64CONST(864)
> #define USECS_PER_HOUR  INT64CONST(36)
> #define USECS_PER_MINUTE INT64CONST(6000)
> #define USECS_PER_SEC   INT64CONST(100)
> #endif
>
> Is there a particular reason for this?  Even with float8 timestamps
> there are uses for these constants in extensions.

I don't see any particular reason not define them unconditionally.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pgbench internal contention

On Fri, Jul 29, 2011 at 5:25 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On machines with lots of CPU cores, pgbench can start eating up a lot
>> of system time.  Investigation reveals that the problem is with
>> random(),
>
> Interesting.
>
>> I patched it to use random_r() - the patch is attached - and here are
>> the (rather gratifying) results of that test:
>> Since a client-limited benchmark isn't very interesting, I think this
>> change makes sense.  Thoughts?  Objections?
>
> Portability, or rather lack of it.  What about using erand48, which we
> already have a dependency on (and substitute code for)?

Neither our implementation nor glibc's appears to be thread-safe, and
erand48() is deprecated according to my Linux man page:

NOTES
   These functions are declared obsolete by  SVID  3,  which  states  that
   rand(3) should be used instead.

glibc provides erand48_r(), and I suppose we could kludge up something
similar out of what's already in src/port?

This is also not exactly the world's most sophisticated algorithm, but
perhaps for pgbench that doesn't matter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] error: could not find pg_class tuple for index 2662

2011-07-29 Thread daveg

On Fri, Jul 29, 2011 at 09:55:46AM -0400, Tom Lane wrote:
> The thing that was bizarre about the one instance in the buildfarm was
> that the error was persistent, ie, once a session had failed all its
> subsequent attempts to access pg_class failed too.  I gather from Dave's
> description that it's working that way for him too.  I can think of ways
> that there might be a transient race condition against a REINDEX, but
> it's very unclear why the failure would persist across multiple
> attempts.  The best idea I can come up with is that the session has
> somehow cached a wrong commit status for the reindexing transaction,
> causing it to believe that both old and new copies of the index's
> pg_class row are dead ... but how could that happen?  The underlying

It is definitely persistant. Once triggered the error occurs for any new
statement until the session exits.

> state in the catalog is not wrong, because no concurrent sessions are
> upset (at least not in the buildfarm case ... Dave, do you see more than
> one session doing this at a time?).

It looks like it happens to multiple sessions so far as one can tell from
the timestamps of the errors:

 timestampsessionid error
  - --
 03:05:37.434 4e26a861.4a6d could not find pg_class tuple for index 2662
 03:05:37.434 4e26a861.4a6f could not find pg_class tuple for index 2662

 03:06:12.041 4e26a731.438e could not find pg_class tuple for index 2662

 03:06:12.042 4e21b6a3.629b could not find pg_class tuple for index 2662
 03:06:12.042 4e26a723.42ec could not find pg_class tuple for index 2662 at 
character 13


-dg

-- 
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pgbench internal contention

Robert Haas  writes:
> On machines with lots of CPU cores, pgbench can start eating up a lot
> of system time.  Investigation reveals that the problem is with
> random(),

Interesting.

> I patched it to use random_r() - the patch is attached - and here are
> the (rather gratifying) results of that test:
> Since a client-limited benchmark isn't very interesting, I think this
> change makes sense.  Thoughts?  Objections?

Portability, or rather lack of it.  What about using erand48, which we
already have a dependency on (and substitute code for)?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] pgbench internal contention

On machines with lots of CPU cores, pgbench can start eating up a lot
of system time.  Investigation reveals that the problem is with
random(), which glibc implements like this:

long int
__random ()
{
  int32_t retval;
  __libc_lock_lock (lock);
  (void) __random_r (&unsafe_state, &retval);
  __libc_lock_unlock (lock);
  return retval;
}
weak_alias (__random, random)

Rather obviously, if you're running enough pgbench threads, you're
going to have a pretty ugly point of contention there.  On the 32-core
machine provided by Nate Boley, with my usual 5-minute SELECT-only
test, lazy-vxid and sinval-fastmessages applied, and scale factor 100,
"time" shows that pgbench uses almost as much system time as user
time:

$ time pgbench -n -S -T 300 -c 64 -j 64
transaction type: SELECT only
scaling factor: 100
query mode: simple
number of clients: 64
number of threads: 64
duration: 300 s
number of transactions actually processed: 55319555
tps = 184396.016257 (including connections establishing)
tps = 184410.926840 (excluding connections establishing)

real5m0.019s
user21m10.100s
sys 17m45.480s

I patched it to use random_r() - the patch is attached - and here are
the (rather gratifying) results of that test:

$ time ./pgbench -n -S -T 300 -c 64 -j 64
transaction type: SELECT only
scaling factor: 100
query mode: simple
number of clients: 64
number of threads: 64
duration: 300 s
number of transactions actually processed: 71851589
tps = 239503.585813 (including connections establishing)
tps = 239521.816698 (excluding connections establishing)

real5m0.016s
user20m40.880s
sys 9m25.930s

Since a client-limited benchmark isn't very interesting, I think this
change makes sense.  Thoughts?  Objections?  Coding style
improvements?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


random_r.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] SSI error messages

Alvaro Herrera  writes:
> Excerpts from Peter Eisentraut's message of vie jul 29 14:46:20 -0400 2011:
>> On lÃ¶r, 2011-07-16 at 21:55 +0300, Heikki Linnakangas wrote:
>> Do you have an idea how to address this case:

> Call sprintf to expand the %u before ereport()?

That sounds like way too much work for a wording "improvement" that is
exceedingly marginal in the first place.

IMO the only thing really wrong with those messages is that they aren't
complete sentences, which is a style guideline that we ought to try to
adhere to even if we're not bothering to translate them.  Perhaps
"Transaction canceled because of ..." would read better than "Canceled
on ..."?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] SSI error messages

2011-07-29 Thread Alvaro Herrera

Excerpts from Peter Eisentraut's message of vie jul 29 14:46:20 -0400 2011:
> On lör, 2011-07-16 at 21:55 +0300, Heikki Linnakangas wrote:
> > I think I would prefer something like this:
> > 
> > ERROR:  could not serialize access due to read/write dependencies
> > among 
> > transactions
> > DETAIL: Reason code: %s
> > HINT:  The transaction might succeed if retried.
> > 
> > Where %s gets the current detail field, untranslated, like:
> > 
> > Canceled on commit attempt with conflict in from prepared pivot.
> 
> Do you have an idea how to address this case:

Call sprintf to expand the %u before ereport()?

> @@ -3865,7 +3865,7 @@ CheckForSerializableConflictOut(bool visible, Relation 
> relation,
> ereport(ERROR,
> (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>  errmsg("could not serialize access due to read/write 
> dependencies among transactions"),
> -errdetail_internal("Canceled on conflict out to old 
> pivot %u.", xid),
> +errdetail("Reason code: %s.", "canceled on conflict 
> out to old pivot %u", xid),  // XXX bogus
>   errhint("The transaction might succeed if retried.")));

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Check constraints on partition parents only?

2011-07-29 Thread Alvaro Herrera

Excerpts from Nikhil Sontakke's message of vie jul 29 14:12:37 -0400 2011:
> Hi all,
> 
> PFA, patch which implements non-inheritable "ONLY" constraints. This
> has been achieved by introducing a new column "conisonly" in
> pg_constraint catalog. Specification of 'ONLY' in the ALTER TABLE ADD
> CONSTRAINT CHECK command is used to set this new column to true.
> Constraints which have this column set to true cannot be inherited by
> present and future children ever.
> 
> The psql and pg_dump binaries have been modified to account for such
> persistent non-inheritable check constraints. This patch also has
> documentation changes along with relevant changes to the test cases.
> The regression runs pass fine with this patch applied.
> 
> Comments and further feedback, if any, appreciated.

Did you look at how this conflicts with my patch to add not null
rows to pg_constraint?

https://commitfest.postgresql.org/action/patch_view?id=601

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Incremental checkopints

2011-07-29 Thread Greg Smith


On 07/29/2011 11:04 AM, jord...@go-link.net wrote:

I think that current implementation of checkpoints is not good for huge
shared buffer cache and for many WAL segments. If there is more buffers
and if buffers can be written rarely more updates of buffers can be
combined so total number of writes to disk will be significantly less. I
think that incremental checkpoints can achieve this goal (maybe more) and
price is additional memory (about 1/1000 of size of buffer cache).
   


The current code optimizes for buffers that are written frequently.  
Those will sit in shared_buffers and in the hoped for case, only be 
written once at checkpoint time.


There are two issues with adopting increment checkpoints instead, one 
fundamental, the other solvable but not started on yet:


1) Postponing writes as long as possible always improves the resulting 
throughput of those writes.  Any incremental checkpoint approach will 
detune throughput by some amount.  If you make writes go out more often, 
they will be less efficient; that's just how things work if you 
benchmark anything that allows write combining.  Any incremental 
checkpoint approach is likely to improve latency in some cases if it 
works well, while decreasing throughput in most cases.


2) The incremental checkpoint approach used by other databases, such as 
the MySQL implementation, works by tracking what transaction IDs were 
associated with a buffer update.  The current way PostgreSQL saves 
buffer sync information for the checkpoint to process things doesn't 
store enough information to do that.  As you say, the main price there 
is some additional memory.


From my perspective, the main problem with plans to tweak the 
checkpoint code is that we don't have a really good benchmark that 
tracks both throughput and latency to test proposed changes against.  
Mark Wong has been working to get his TCP-E clone DBT-5 running 
regularly for that purpose, and last I heard that was basically done at 
this point--he's running daily tests now.  There's already a small pile 
of patches that adjust checkpoint behavior around that were postponed 
from being included in 9.1 mainly because it was hard to prove they were 
useful given the benchmark used to test them, pgbench.  I have higher 
hopes for DBT-5 as being a test that gives informative data in this 
area.  I would want to go back and revisit the existing patches (sorted 
checkpoints, spread sync) before launching into this whole new area.  I 
don't think any of those has even been proven not to work, they just 
didn't help the slightly unrealistic pgbench write-heavy workload.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] include host names in hba error messages

On Fri, Jul 29, 2011 at 2:44 PM, Peter Eisentraut  wrote:
> On tis, 2011-07-19 at 14:17 -0400, Robert Haas wrote:
>> I think it would be less confusing to write the IP address as the main
>> piece of information, and put the hostname in parentheses only if we
>> accepted it as valid (i.e. we did both lookups, and everything
>> matched).
>>
>> ERROR: no pg_hba.conf entry for host 127.0.0.1 ("localhost"), user
>> "x", database "y"
>>
>> As for the case where we the forward lookup and reverse lookup don't
>> match, could we add that as a DETAIL?
>>
>> ERROR: no pg_hba.conf entry for host 127.0.0.1, user "x", database "y"
>> DETAIL: Forward and reverse DNS lookups do not match.
>
> On further reflection, the only way we would get a complete match host
> name is if there actually were a line in pg_hba.conf with that host
> name, but it didn't match because of other parameters.  So that would be
> quite rare, and so the error message would look one way or the other
> depending on obscure circumstances, which would be confusing.
>
> But picking up on your second suggestion, I propose instead that we put
> a note in the detail about the host name and what we know about it, if
> we know it, e.g.
>
> ERROR: no pg_hba.conf entry for host 127.0.0.1, user "x", database "y"
> DETAIL: Client IP address resolved to "localhost", forward lookup matches.
>
> I chose to use errdetail_log(), which only goes into the server log, so
> we don't expose too much about the server's DNS setup to the client.

Seems reasonable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] SSI error messages

2011-07-29 Thread Peter Eisentraut

On lör, 2011-07-16 at 21:55 +0300, Heikki Linnakangas wrote:
> I think I would prefer something like this:
> 
> ERROR:  could not serialize access due to read/write dependencies
> among 
> transactions
> DETAIL: Reason code: %s
> HINT:  The transaction might succeed if retried.
> 
> Where %s gets the current detail field, untranslated, like:
> 
> Canceled on commit attempt with conflict in from prepared pivot.

Do you have an idea how to address this case:

@@ -3865,7 +3865,7 @@ CheckForSerializableConflictOut(bool visible, Relation 
relation,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 errmsg("could not serialize access due to read/write 
dependencies among transactions"),
-errdetail_internal("Canceled on conflict out to old 
pivot %u.", xid),
+errdetail("Reason code: %s.", "canceled on conflict 
out to old pivot %u", xid),  // XXX bogus
  errhint("The transaction might succeed if retried.")));
 
if (SxactHasSummaryConflictIn(MySerializableXact)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] include host names in hba error messages

2011-07-29 Thread Peter Eisentraut

On tis, 2011-07-19 at 14:17 -0400, Robert Haas wrote:
> I think it would be less confusing to write the IP address as the main
> piece of information, and put the hostname in parentheses only if we
> accepted it as valid (i.e. we did both lookups, and everything
> matched).
> 
> ERROR: no pg_hba.conf entry for host 127.0.0.1 ("localhost"), user
> "x", database "y"
> 
> As for the case where we the forward lookup and reverse lookup don't
> match, could we add that as a DETAIL?
> 
> ERROR: no pg_hba.conf entry for host 127.0.0.1, user "x", database "y"
> DETAIL: Forward and reverse DNS lookups do not match.

On further reflection, the only way we would get a complete match host
name is if there actually were a line in pg_hba.conf with that host
name, but it didn't match because of other parameters.  So that would be
quite rare, and so the error message would look one way or the other
depending on obscure circumstances, which would be confusing.

But picking up on your second suggestion, I propose instead that we put
a note in the detail about the host name and what we know about it, if
we know it, e.g.

ERROR: no pg_hba.conf entry for host 127.0.0.1, user "x", database "y"
DETAIL: Client IP address resolved to "localhost", forward lookup matches.

I chose to use errdetail_log(), which only goes into the server log, so
we don't expose too much about the server's DNS setup to the client.

diff --git i/src/backend/libpq/auth.c w/src/backend/libpq/auth.c
index d153880..1b6399d 100644
--- i/src/backend/libpq/auth.c
+++ w/src/backend/libpq/auth.c
@@ -439,6 +439,17 @@ ClientAuthentication(Port *port)
    NULL, 0,
    NI_NUMERICHOST);
 
+#define HOSTNAME_LOOKUP_DETAIL(port) \
+(port->remote_hostname  \
+ ? (port->remote_hostname_resolv == +1	\
+	? errdetail_log("Client IP address resolved to \"%s\", forward lookup matches.", port->remote_hostname) \
+	: (port->remote_hostname_resolv == 0\
+	   ? errdetail_log("Client IP address resolved to \"%s\", forward lookup not checked.", port->remote_hostname) \
+	   : (port->remote_hostname_resolv == -1			\
+		  ? errdetail_log("Client IP address resolved to \"%s\", forward lookup does not match.", port->remote_hostname) \
+		  : 0)))		\
+ : 0)
+
 if (am_walsender)
 {
 #ifdef USE_SSL
@@ -446,12 +457,14 @@ ClientAuthentication(Port *port)
 	   (errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
 		errmsg("no pg_hba.conf entry for replication connection from host \"%s\", user \"%s\", %s",
 			   hostinfo, port->user_name,
-			   port->ssl ? _("SSL on") : _("SSL off";
+			   port->ssl ? _("SSL on") : _("SSL off")),
+		HOSTNAME_LOOKUP_DETAIL(port)));
 #else
 	ereport(FATAL,
 	   (errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
 		errmsg("no pg_hba.conf entry for replication connection from host \"%s\", user \"%s\"",
-			   hostinfo, port->user_name)));
+			   hostinfo, port->user_name),
+		HOSTNAME_LOOKUP_DETAIL(port)));
 #endif
 }
 else
@@ -462,13 +475,15 @@ ClientAuthentication(Port *port)
 		errmsg("no pg_hba.conf entry for host \"%s\", user \"%s\", database \"%s\", %s",
 			   hostinfo, port->user_name,
 			   port->database_name,
-			   port->ssl ? _("SSL on") : _("SSL off";
+			   port->ssl ? _("SSL on") : _("SSL off")),
+		HOSTNAME_LOOKUP_DETAIL(port)));
 #else
 	ereport(FATAL,
 	   (errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
 		errmsg("no pg_hba.conf entry for host \"%s\", user \"%s\", database \"%s\"",
 			   hostinfo, port->user_name,
-			   port->database_name)));
+			   port->database_name),
+		HOSTNAME_LOOKUP_DETAIL(port)));
 #endif
 }
 break;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [RFC] Common object property boards

On Fri, Jul 29, 2011 at 1:49 PM, Tom Lane  wrote:
> It would likely be better to not expose the struct type, just individual
> lookup functions.

I'm not sure about that...  I think that's just going to introduce a
lot of excess notation.

>> And, a translation from ObjectType to type name (e.g "table", "type", ...)
>> is helpful to generate error messages.
>
>>   const char *get_object_type_name(ObjectType objtype);
>
> Again, I think this is too low level because of message translation
> considerations.

On this point I agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Check constraints on partition parents only?

Hi all,

PFA, patch which implements non-inheritable "ONLY" constraints. This
has been achieved by introducing a new column "conisonly" in
pg_constraint catalog. Specification of 'ONLY' in the ALTER TABLE ADD
CONSTRAINT CHECK command is used to set this new column to true.
Constraints which have this column set to true cannot be inherited by
present and future children ever.

The psql and pg_dump binaries have been modified to account for such
persistent non-inheritable check constraints. This patch also has
documentation changes along with relevant changes to the test cases.
The regression runs pass fine with this patch applied.

Comments and further feedback, if any, appreciated.

Regards,
Nikhils
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5e5f8a7..683ad67 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1995,6 +1995,16 @@
  
 
  
+  conisonly
+  bool
+  
+  
+   This constraint is defined locally for the relation.  It is a
+  non-inheritable constraint.
+  
+ 
+
+ 
   conkey
   int2[]
   pg_attribute.attnum
diff --git a/doc/src/sgml/ref/alter_table.sgml 
b/doc/src/sgml/ref/alter_table.sgml
index 4c2a4cd..3ee3ec0 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -984,6 +984,14 @@ ALTER TABLE distributors ADD CONSTRAINT zipchk CHECK 
(char_length(zipcode) = 5);
   
 
   
+   To add a check constraint only to a table and not to its children:
+
+ALTER TABLE ONLY distributors ADD CONSTRAINT zipchk CHECK 
(char_length(zipcode) = 5);
+
+   (The check constraint will not be inherited by future children too.)
+  
+
+  
To remove a check constraint from a table and all its children:
 
 ALTER TABLE distributors DROP CONSTRAINT zipchk;
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 4399493..1b382b8 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -98,10 +98,10 @@ static Oid AddNewRelationType(const char *typeName,
   Oid new_array_type);
 static void RelationRemoveInheritance(Oid relid);
 static void StoreRelCheck(Relation rel, char *ccname, Node *expr,
- bool is_validated, bool is_local, int inhcount);
+ bool is_validated, bool is_local, int inhcount, bool 
is_only);
 static void StoreConstraints(Relation rel, List *cooked_constraints);
 static bool MergeWithExistingConstraint(Relation rel, char *ccname, Node *expr,
-   bool allow_merge, bool 
is_local);
+   bool allow_merge, bool 
is_local, bool is_only);
 static void SetRelationNumChecks(Relation rel, int numchecks);
 static Node *cookConstraint(ParseState *pstate,
   Node *raw_constraint,
@@ -1860,7 +1860,7 @@ StoreAttrDefault(Relation rel, AttrNumber attnum, Node 
*expr)
  */
 static void
 StoreRelCheck(Relation rel, char *ccname, Node *expr,
- bool is_validated, bool is_local, int inhcount)
+ bool is_validated, bool is_local, int inhcount, bool 
is_only)
 {
char   *ccbin;
char   *ccsrc;
@@ -1943,7 +1943,8 @@ StoreRelCheck(Relation rel, char *ccname, Node *expr,
  ccbin,/* Binary form 
of check constraint */
  ccsrc,/* Source form 
of check constraint */
  is_local, /* 
conislocal */
- inhcount);/* coninhcount 
*/
+ inhcount, /* 
coninhcount */
+ is_only); 
/* conisonly */
 
pfree(ccbin);
pfree(ccsrc);
@@ -1984,7 +1985,7 @@ StoreConstraints(Relation rel, List *cooked_constraints)
break;
case CONSTR_CHECK:
StoreRelCheck(rel, con->name, con->expr, 
!con->skip_validation,
- con->is_local, 
con->inhcount);
+ con->is_local, 
con->inhcount, con->is_only);
numchecks++;
break;
default:
@@ -2100,6 +2101,7 @@ AddRelationNewConstraints(Relation rel,
cooked->skip_validation = false;
cooked->is_local = is_local;
cooked->inhcount = is_local ? 0 : 1;
+   cooked->is_only = false;
cookedConstraints = lappend(cookedConstraints, cooked);
}
 
@@ -2167,7 +2169,7 @@ AddRelationNewConstraints(Relation rel,
 * what ATAddCheckConstraint wants.)

Re: [HACKERS] [RFC] Common object property boards

Kohei Kaigai  writes:
> In addition to this suggestion, I think the big static array also contains
> the following items:
> - Text form of the object type (e.g, "table", "function", ...)

What will you do with that that wouldn't be better done by calling
getObjectDescription?  The latter's output is somewhat localizable, but
individual words would be hard to translate.

> Does the main lookup function ought to return an entry of the big array?
> If so, the definition of structure should be declared in objectaddress.h,
> as follows:

It would likely be better to not expose the struct type, just individual
lookup functions.

> And, a translation from ObjectType to type name (e.g "table", "type", ...)
> is helpful to generate error messages.

>   const char *get_object_type_name(ObjectType objtype);

Again, I think this is too low level because of message translation
considerations.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: OT: OFF TOPIC: returning multiple result sets from a stored procedure

2011-07-29 Thread Matt Keranen

>I honestly do not mean any offence, just out of curiosity.
>If you guys care about money and time why would you spend the best years of 
>your life basically copying commercial products for free?

1) For the same reasons commercial vendors build competing products: different 
tools in the same category fit different scenarios

2) The right-tool-for-the-job concept: Sometimes a commercial solution is the 
right choice, sometime an open-source solution is. Budgets, applicability, and 
local expertise all should factor into platform selection decision.

3) Because sometimes the output of a platform is more important than the 
platform itself: If your role is to work with data, not to produce software, it 
can in the end be more efficient in the end to have a platform you can modify 
as needed.

[HACKERS] [RFC] Common object property boards

2011-07-29 Thread Kohei Kaigai

Robert Haas wrote:
| I think that get_object_namespace() needs to be rethought.  If you
| take a look at AlterObjectNamespace() and its callers, you'll notice
| that we already have, encoded in those call sites, the knowledge of
| which object types can be looked up in which system caches, and which
| columns within the resulting heap tuples will contain the schema OIDs.
|  Here, we're essentially duplicating that information in another
| place, which doesn't seem good.  I think perhaps we should create a
| big static array, each row of which would contain:
| 
| - ObjectType
| - system cache ID for OID lookups
| - system catalog table OID for scans
| - attribute number for the name attribute, where applicable (see
|   AlterObjectNamespace)
| - attribute number for the namespace attribute
| - attribute number for the owner attribute
| - ...and maybe some other properties
| 
| We could then create a function (in objectaddress.c, probably) that
| you call with an ObjectType argument, and it returns a pointer to the
| appropriate struct entry, or calls elog() if none is found.  This
| function could be used by the existing object_exists() functions in
| lieu of having its own giant switch statement, by
| AlterObjectNamespace(), and by this new consolidated DROP code.  If
| you agree with this approach, we should probably make it a separate
| patch.
| 

According to the suggestion from Robert, I'd like to see opinions about
this feature prior to its implementation.

In addition to this suggestion, I think the big static array also contains
the following items:
- Text form of the object type (e.g, "table", "function", ...)
- Function pointer of validator. It shall raise an error when relkind is
  not RELKIND_RELATION for ObjectType being OBJECT_TABLE, for example.

Does the main lookup function ought to return an entry of the big array?
If so, the definition of structure should be declared in objectaddress.h,
as follows:

  ObjectProperty get_object_property(ObjectType objtype);

In addition to the big array indexed by ObjectType, I'd like to have
a utility function that translate ObjectAddress to ObjectType.
(A part of them need syscache lookup, because pg_class contains several
 objcts types: table, view, ...)

  ObjectType get_object_type(const ObjectAddress *address);

If and when we have both of the function to reference ObjectProperty and
to translate ObjectAddress to ObjectType, it is quite easy to implement
service routines to lookup namespaceId, ownerId and others for the
supplied ObjectAddress, as follows:

  Oid get_object_namespace(const ObjectAddress *address);
  Oid get_object_ownership(const ObjectAddress *address);
  char *get_object_name(const ObjectAddress *address);
  HeapTuple get_object_tuple(const ObjectAddress *address);

And, a translation from ObjectType to type name (e.g "table", "type", ...)
is helpful to generate error messages.

  const char *get_object_type_name(ObjectType objtype);

Any comments please.

Thanks,
--
NEC Europe Ltd, SAP Global Competence Center
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] error: could not find pg_class tuple for index 2662

Robert Haas  writes:
> On Fri, Jul 29, 2011 at 11:27 AM, Tom Lane  wrote:
>> Well, no, because the ScanPgRelation call is not failing internally.
>> It's performing a seqscan of pg_class and not finding a matching tuple.

> SnapshotNow race?

That's what I would have guessed to start with, except that then it
wouldn't be a persistent failure.  You'd get one, but then the next
query would succeed.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] error: could not find pg_class tuple for index 2662

On Fri, Jul 29, 2011 at 11:27 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Fri, Jul 29, 2011 at 9:55 AM, Tom Lane  wrote:
>>> The thing that was bizarre about the one instance in the buildfarm was
>>> that the error was persistent, ie, once a session had failed all its
>>> subsequent attempts to access pg_class failed too.
>
>> I was thinking more along the lines of a failure while processing a
>> sinval message emitted by the REINDEX.  The sinval message doesn't get
>> fully processed and therefore we get confused about what the
>> relfilenode is for pg_class.  If that happened for any other relation,
>> we could recover by scanning pg_class.  But if it happens for pg_class
>> or pg_class_oid_index, we're toast, because we can't scan them without
>> knowing what relfilenode to open.
>
> Well, no, because the ScanPgRelation call is not failing internally.
> It's performing a seqscan of pg_class and not finding a matching tuple.

SnapshotNow race?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] error: could not find pg_class tuple for index 2662

Robert Haas  writes:
> On Fri, Jul 29, 2011 at 9:55 AM, Tom Lane  wrote:
>> The thing that was bizarre about the one instance in the buildfarm was
>> that the error was persistent, ie, once a session had failed all its
>> subsequent attempts to access pg_class failed too.

> I was thinking more along the lines of a failure while processing a
> sinval message emitted by the REINDEX.  The sinval message doesn't get
> fully processed and therefore we get confused about what the
> relfilenode is for pg_class.  If that happened for any other relation,
> we could recover by scanning pg_class.  But if it happens for pg_class
> or pg_class_oid_index, we're toast, because we can't scan them without
> knowing what relfilenode to open.

Well, no, because the ScanPgRelation call is not failing internally.
It's performing a seqscan of pg_class and not finding a matching tuple.

You could hypothesize about maybe an sinval message got missed leading
us to scan the old (pre-VAC-FULL) copy of pg_class, but that still
wouldn't explain how come we can't find a valid-looking entry for
pg_class_oid_index in it.

Tis a puzzlement.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] USECS_* constants undefined with float8 timestamps?

2011-07-29 Thread Johann 'Myrkraverk' Oskarsson

Hi all,

I just noticed that the USECS_* constants are not defined when the server
is compiled without integer dates and timestamps.

Explicitly, timestamp.h is

#ifdef HAVE_INT64_TIMESTAMP
#define USECS_PER_DAY   INT64CONST(864)
#define USECS_PER_HOUR  INT64CONST(36)
#define USECS_PER_MINUTE INT64CONST(6000)
#define USECS_PER_SEC   INT64CONST(100)
#endif

Is there a particular reason for this?  Even with float8 timestamps
there are uses for these constants in extensions.


-- 
   Johann Oskarssonhttp://www.2ndquadrant.com/|[]
   PostgreSQL Development, 24x7 Support, Training and Services  --+--
  |
   Blog: http://my.opera.com/myrkraverk/blog/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] error: could not find pg_class tuple for index 2662

On Fri, Jul 29, 2011 at 9:55 AM, Tom Lane  wrote:
> daveg  writes:
>> On Thu, Jul 28, 2011 at 07:45:01PM -0400, Robert Haas wrote:
>>> Ah, OK, sorry.  Well, in 9.0, VACUUM FULL is basically CLUSTER, which
>>> means that a REINDEX is happening as part of the same operation.  In
>>> 9.0, there's no point in doing VACUUM FULL immediately followed by
>>> REINDEX.  My guess is that this is happening either right around the
>>> time the VACUUM FULL commits or right around the time the REINDEX
>>> commits.  It'd be helpful to know which, if you can figure it out.
>
>> I'll update my vacuum script to skip reindexes after vacuum full for 9.0
>> servers and see if that makes the problem go away.
>
> The thing that was bizarre about the one instance in the buildfarm was
> that the error was persistent, ie, once a session had failed all its
> subsequent attempts to access pg_class failed too.  I gather from Dave's
> description that it's working that way for him too.  I can think of ways
> that there might be a transient race condition against a REINDEX, but
> it's very unclear why the failure would persist across multiple
> attempts.  The best idea I can come up with is that the session has
> somehow cached a wrong commit status for the reindexing transaction,
> causing it to believe that both old and new copies of the index's
> pg_class row are dead ... but how could that happen?  The underlying
> state in the catalog is not wrong, because no concurrent sessions are
> upset (at least not in the buildfarm case ... Dave, do you see more than
> one session doing this at a time?).

I was thinking more along the lines of a failure while processing a
sinval message emitted by the REINDEX.  The sinval message doesn't get
fully processed and therefore we get confused about what the
relfilenode is for pg_class.  If that happened for any other relation,
we could recover by scanning pg_class.  But if it happens for pg_class
or pg_class_oid_index, we're toast, because we can't scan them without
knowing what relfilenode to open.

Now that can't be quite right, because of course those are mapped
relations... and I'm pretty sure there are some other holes in my line
of reasoning too.  Just thinking out loud...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] cheaper snapshots

2011-07-29 Thread Hannu Krosing

On Fri, 2011-07-29 at 10:23 -0400, Robert Haas wrote:
> On Fri, Jul 29, 2011 at 10:20 AM, Hannu Krosing  wrote:
> >> An additional point to think about: if we were willing to insist on
> >> streaming replication, we could send the commit sequence numbers via a
> >> side channel rather than writing them to WAL, which would be a lot
> >> cheaper.
> >
> > Why do you think that side channel is cheaper than main WAL ?
> 
> You don't have to flush it to disk, 

You can probably write the "i became visible" WAL record without forcing
a flush and still get the same visibility order.

> and you can use some other lock
> that isn't as highly contended as WALInsertLock to synchronize it.

but you will need to synchronise it with WAL replay on slave anyway. It
seems easiest to just insert it in the WAL stream and be done with it. 

> >> That might even be a reasonable thing to do, because if
> >> you're doing log shipping, this is all going to be super-not-real-time
> >> anyway.
> >
> > But perhaps you still may want to preserve visibility order to be able
> > to do PITR to exact transaction "commit", no ?
> 
> Maybe.  In practice, I suspect most people won't be willing to pay the
> price a feature like this would exact.

Unless we find some really bad problems with different visibility orders
on master and slave(s) you are probably right.

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] cheaper snapshots

On Fri, Jul 29, 2011 at 10:20 AM, Hannu Krosing  wrote:
>> An additional point to think about: if we were willing to insist on
>> streaming replication, we could send the commit sequence numbers via a
>> side channel rather than writing them to WAL, which would be a lot
>> cheaper.
>
> Why do you think that side channel is cheaper than main WAL ?

You don't have to flush it to disk, and you can use some other lock
that isn't as highly contended as WALInsertLock to synchronize it.

>> That might even be a reasonable thing to do, because if
>> you're doing log shipping, this is all going to be super-not-real-time
>> anyway.
>
> But perhaps you still may want to preserve visibility order to be able
> to do PITR to exact transaction "commit", no ?

Maybe.  In practice, I suspect most people won't be willing to pay the
price a feature like this would exact.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] cheaper snapshots

2011-07-29 Thread Hannu Krosing

On Thu, 2011-07-28 at 20:14 -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 7:54 PM, Ants Aasma  wrote:
> > On Thu, Jul 28, 2011 at 11:54 PM, Kevin Grittner
> >  wrote:
> >> (4)  We communicate acceptable snapshots to the replica to make the
> >> order of visibility visibility match the master even when that
> >> doesn't match the order that transactions returned from commit.
> >
> > I wonder if some interpretation of 2 phase commit could make Robert's
> > original suggestion implement this.
> >
> > On the master the commit sequence would look something like:
> > 1. Insert commit record to the WAL
> > 2. Wait for replication
> > 3. Get a commit seq nr and mark XIDs visible
> > 4. WAL log the seq nr
> > 5. Return success to client
> >
> > When replaying:
> > * When replaying commit record, do everything but make
> >  the tx visible.
> > * When replaying the commit sequence number
> >if there is a gap between last visible commit and current:
> >  insert the commit sequence nr. to list of waiting commits.
> >else:
> >  mark current and all directly following waiting tx's visible
> >
> > This would give consistent visibility order on master and slave. Robert
> > is right that this would undesirably increase WAL traffic. Delaying this
> > traffic would undesirably increase replay lag between master and slave.
> > But it seems to me that this could be an optional WAL level on top of
> > hot_standby that would only be enabled if consistent visibility on
> > slaves is desired.
> 
> I think you nailed it.

Agreed, this would keep current semantics on master and same visibility
order on master and slave.

> An additional point to think about: if we were willing to insist on
> streaming replication, we could send the commit sequence numbers via a
> side channel rather than writing them to WAL, which would be a lot
> cheaper. 

Why do you think that side channel is cheaper than main WAL ?

How would you handle synchronising the two ?

> That might even be a reasonable thing to do, because if
> you're doing log shipping, this is all going to be super-not-real-time
> anyway. 

But perhaps you still may want to preserve visibility order to be able
to do PITR to exact transaction "commit", no ?

>  OTOH, I know we don't want to make WAL shipping anything less
> than a first class citizen, so maybe not.
> 
> At any rate, we may be getting a little sidetracked here from the
> original point of the thread, which was how to make snapshot-taking
> cheaper.  Maybe there's some tie-in to when transactions become
> visible, but I think it's pretty weak.  The existing system could be
> hacked up to avoid making transactions visible out of LSN order, and
> the system I proposed could make them visible either in LSN order or
> do the same thing we do now.  They are basically independent problems,
> AFAICS.

Agreed.


-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] per-column FDW options, v5

2011/7/29 Shigeru Hanada :
> Here is a rebased version of per-column FDW options patch.  I've
> proposed this patch in last CF, but it was marked as returned with
> feedback.  So I would like to propose in next CF 2011-09.  I already
> moved CF item into new topic "SQL/MED" of CF 2011-09.

I did a quick review of this patch and it looks good to me, modulo
some quibbles with the wording of the documentation and comments and a
couple of minor stylistic nitpicks.  Barring objections, I'll fix this
up a bit and commit it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Incremental checkopints

2011-07-29 Thread jordani

Hi,
I have read all information about checkpoints in PostgreSQL I have found.
I think that current implementation of checkpoints is not good for huge
shared buffer cache and for many WAL segments. If there is more buffers
and if buffers can be written rarely more updates of buffers can be
combined so total number of writes to disk will be significantly less. I
think that incremental checkpoints can achieve this goal (maybe more) and
price is additional memory (about 1/1000 of size of buffer cache).

My main source of information is
http://wiki.postgresql.org/wiki/User:Gsmith#How_do_checkpoints_happen_inside_the_PostgreSQL_backend.3F
I see that some data are required to be written into WAL in 3) and 6). I
will use CD to denote that data and P1, P2... to denote pages that are
dirty and has to be written to disk in 4).

In incremental checkpoint when WAL segment has written we will not start
writing but we will add to queue pages P1, P2 ... and CD. If meanwhile
background writer has to clean some page that page is removed from queue.
When checkpoint_segments are written in the transaction log we have in
queue:
P1, P2 ... CD, Pi ... CD, Pj ... CD ...
Here we have to make checkpoint in order to free first WAL segment. Only
pages before first CD have to be written and fsync’d.

I suppose that this task can be done in background writer. So first we can
make some number of writes per round both lru and checkpoint. There is no
deadline for each incremental checkpoint but if WAL is growing total
number of writes have to increase. Also it is not required to do
checkpoint for each WAL segment. It is possible to write N pages from
queue and to combine several potential checkpoint in one.

I hope I have explained the general idea. I am not C programmer so it is
hard to me to give more details.

Jordan Ivanov

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Check constraints on partition parents only?

> We could imagine doing something like CHECK ONLY (foo), but that seems
> quite non-orthogonal with (a) everything else in CREATE TABLE, and
> (b) ALTER TABLE ONLY.
>
>
Yeah, I thought about CHECK ONLY support as part of table definition, but as
you say - it appears to be too non-standard right now and we can always go
back to this later if the need be felt.

Regards,
Nikhils

Re: [HACKERS] error: could not find pg_class tuple for index 2662

daveg  writes:
> On Thu, Jul 28, 2011 at 07:45:01PM -0400, Robert Haas wrote:
>> Ah, OK, sorry.  Well, in 9.0, VACUUM FULL is basically CLUSTER, which
>> means that a REINDEX is happening as part of the same operation.  In
>> 9.0, there's no point in doing VACUUM FULL immediately followed by
>> REINDEX.  My guess is that this is happening either right around the
>> time the VACUUM FULL commits or right around the time the REINDEX
>> commits.  It'd be helpful to know which, if you can figure it out.

> I'll update my vacuum script to skip reindexes after vacuum full for 9.0
> servers and see if that makes the problem go away.

The thing that was bizarre about the one instance in the buildfarm was
that the error was persistent, ie, once a session had failed all its
subsequent attempts to access pg_class failed too.  I gather from Dave's
description that it's working that way for him too.  I can think of ways
that there might be a transient race condition against a REINDEX, but
it's very unclear why the failure would persist across multiple
attempts.  The best idea I can come up with is that the session has
somehow cached a wrong commit status for the reindexing transaction,
causing it to believe that both old and new copies of the index's
pg_class row are dead ... but how could that happen?  The underlying
state in the catalog is not wrong, because no concurrent sessions are
upset (at least not in the buildfarm case ... Dave, do you see more than
one session doing this at a time?).

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] cheaper snapshots

2011-07-29 Thread Kevin Grittner

Robert Haas  wrote:

>> (4)  We communicate acceptable snapshots to the replica to make
>> the order of visibility visibility match the master even when
>> that doesn't match the order that transactions returned from
>> commit.

>>  I (predictably) like (4) -- even though it's a lot of work
> 
> I think that (4), beyond being a lot of work, will also have
> pretty terrible performance.  You're basically talking about
> emitting two WAL records for every commit instead of one.

Well, I can think of a great many other ways this could be done,
each with its own trade-offs of various types of overhead against
how close the replica is to current.  At one extreme you could do
what you describe, at the other you could generate a new snapshot on
the replica once every few minutes.

Then there are more clever ways, in discussions a few months ago I
suggested that adding two new bit flags to the commit record would
suffice, and I don't remember anyone blowing holes in that idea.  Of
course, that was to achieve serializable behavior on the replica,
based on some assumption that the current hot standby already
supported repeatable read.  We might need another bit or two to
solve the problems with that which have surfaced on this thread.

-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Check constraints on partition parents only?

> > Yeah, I have already hacked it a bit. This constraint now needs to be
> > spit out later as an ALTER command with ONLY attached to it
> > appropriately. Earlier all CHECK constraints were generally emitted as
> > part of the table definition itself.
>
> IIRC, there's already support for splitting out a constraint that way,
> in order to deal with circular dependencies.  You just need to treat
> this as an additional reason for splitting.
>
>
Yeah, I have indeed followed the existing separate printing logic for "ONLY"
constraints. Had to make the table dependent on this constraint to print the
constraint *after* the table definition.

Regards,
Nikhils

Re: [HACKERS] Check constraints on partition parents only?

> > Yeah, I have already hacked it a bit. This constraint now needs to be
> > spit out later as an ALTER command with ONLY attached to it
> > appropriately. Earlier all CHECK constraints were generally emitted as
> > part of the table definition itself.
>
> Hrm.  That doesn't seem so good.  Maybe we've got the design wrong
> here.  It doesn't seem like we want to lose the ability to define
> arbitrary constraints at table-creation time.
>
>
Well the handling is different now for "ONLY" constraints only. The normal
constraints can still be attached at table-creation time.

Regards,
Nikhils

Re: [HACKERS] Check constraints on partition parents only?

Robert Haas  writes:
> On Fri, Jul 29, 2011 at 8:30 AM, Nikhil Sontakke  wrote:
>> Yeah, I have already hacked it a bit. This constraint now needs to be
>> spit out later as an ALTER command with ONLY attached to it
>> appropriately. Earlier all CHECK constraints were generally emitted as
>> part of the table definition itself.

> Hrm.  That doesn't seem so good.  Maybe we've got the design wrong
> here.  It doesn't seem like we want to lose the ability to define
> arbitrary constraints at table-creation time.

Well, you can't define arbitrary indexes within the CREATE TABLE syntax,
either.  This does not bother me a lot.

We could imagine doing something like CHECK ONLY (foo), but that seems
quite non-orthogonal with (a) everything else in CREATE TABLE, and
(b) ALTER TABLE ONLY.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Check constraints on partition parents only?

Nikhil Sontakke  writes:
>> (Also, don't forget you need to hack pg_dump, too.)

> Yeah, I have already hacked it a bit. This constraint now needs to be
> spit out later as an ALTER command with ONLY attached to it
> appropriately. Earlier all CHECK constraints were generally emitted as
> part of the table definition itself.

IIRC, there's already support for splitting out a constraint that way,
in order to deal with circular dependencies.  You just need to treat
this as an additional reason for splitting.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Check constraints on partition parents only?

On Fri, Jul 29, 2011 at 8:30 AM, Nikhil Sontakke  wrote:
> Yeah, I have already hacked it a bit. This constraint now needs to be
> spit out later as an ALTER command with ONLY attached to it
> appropriately. Earlier all CHECK constraints were generally emitted as
> part of the table definition itself.

Hrm.  That doesn't seem so good.  Maybe we've got the design wrong
here.  It doesn't seem like we want to lose the ability to define
arbitrary constraints at table-creation time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Check constraints on partition parents only?

> psql=# \d a
>   Table "public.a"
>  Column |  Type   | Modifiers
> +-+---
>  b  | integer |
> Check constraints:
>"achk" CHECK (false)
>"bchk" CHECK (b > 0)
>
> Is this acceptable? Or we need to put in work into psql to show ONLY
> somewhere in the description? If yes, ONLY CHECK sounds weird, maybe
> we should use LOCAL CHECK or some such mention:
>
> Check constraints:
>"achk" LOCAL CHECK (false)
>"bchk" CHECK (b > 0)

I think you need to stick with "ONLY".  Using two different words is
just going to create confusion. You could fool around with where
exactly you put it on the line, but switching to a different word
seems like not a good idea.

Ok, maybe something like:

"achk" (ONLY) CHECK (false)

>>(Also, don't forget you need to hack pg_dump, too.)

Yeah, I have already hacked it a bit. This constraint now needs to be
spit out later as an ALTER command with ONLY attached to it
appropriately. Earlier all CHECK constraints were generally emitted as
part of the table definition itself.

Regards,
Nikhils

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Check constraints on partition parents only?

On Fri, Jul 29, 2011 at 7:41 AM, Nikhil Sontakke  wrote:
> Hi,
>
>>>Any preferences for the name?
>>> connoinh
>>> conisonly
>>> constatic or confixed
>>
>> I'd probably pick conisonly from those choices.
>>
>
> The use of "\d" inside psql will show ONLY constraints without any
> embellishments similar to normal constraints. E.g.
>
>
> ALTER TABLE ONLY a ADD CONSTRAINT achk CHECK (FALSE);
>
> ALTER TABLE a ADD CONSTRAINT bchk CHECK (b > 0);
>
> psql=# \d a
>       Table "public.a"
>  Column |  Type   | Modifiers
> +-+---
>  b      | integer |
> Check constraints:
>    "achk" CHECK (false)
>    "bchk" CHECK (b > 0)
>
> Is this acceptable? Or we need to put in work into psql to show ONLY
> somewhere in the description? If yes, ONLY CHECK sounds weird, maybe
> we should use LOCAL CHECK or some such mention:
>
> Check constraints:
>    "achk" LOCAL CHECK (false)
>    "bchk" CHECK (b > 0)

I think you need to stick with "ONLY".  Using two different words is
just going to create confusion. You could fool around with where
exactly you put it on the line, but switching to a different word
seems like not a good idea.

(Also, don't forget you need to hack pg_dump, too.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Check constraints on partition parents only?