Re: [HACKERS] [PATCH] Fix float8 parsing of denormal values (on some platforms?)

2012-02-03 Thread Tom Lane
Marti Raudsepp ma...@juffo.org writes:
 We're already seeing first buildfarm failures, on system narwhal
 using an msys/mingw compiler.

Yeah.  After a full day's cycle, the answer seems to be that
denormalized input works fine everywhere except:

1. mingw on Windows (even though MSVC builds work)

2. some Gentoo builds fail (knowing that platform, the phase of the moon
is probably the most predictable factor determining this).

I'm inclined at this point to remove the regression test cases, because
we have no principled way to deal with case #2.  We could install
alternative expected files that accept the failure, but then what's
the point of having the test case at all?

It might be worth pressuring the mingw folk to get with the program,
but I'm not going to be the one to do that.

regards, tom lane



Re: [HACKERS] [PATCH] Fix float8 parsing of denormal values (on some platforms?)

2012-02-03 Thread Andrew Dunstan



On 02/03/2012 03:49 AM, Tom Lane wrote:

Marti Raudsepp ma...@juffo.org writes:

We're already seeing first buildfarm failures, on system narwhal
using an msys/mingw compiler.

Yeah.  After a full day's cycle, the answer seems to be that
denormalized input works fine everywhere except:

1. mingw on Windows (even though MSVC builds work)

2. some Gentoo builds fail (knowing that platform, the phase of the moon
is probably the most predictable factor determining this).

I'm inclined at this point to remove the regression test cases, because
we have no principled way to deal with case #2.  We could install
alternative expected files that accept the failure, but then what's
the point of having the test case at all?

It might be worth pressuring the mingw folk to get with the program,
but I'm not going to be the one to do that.




No doubt they would tell us to upgrade the compiler. Narwhal's is 
horribly old. Neither of my mingw-based members is failing.


cheers

andrew



Re: [HACKERS] Patch pg_is_in_backup()

2012-02-03 Thread Bernd Helmle



--On 3 February 2012 13:21:11 +0900 Fujii Masao masao.fu...@gmail.com wrote:


It seems to be more user-friendly to introduce a view like pg_stat_backup
rather than the function returning an array.


I like this idea. A use case I saw for monitoring backup_label files in the past
was mainly to discover a forgotten exclusive pg_stop_backup() (e.g. due to
broken backup scripts). If the view were able to distinguish between
exclusive and non-exclusive backups, that would be great.


--
Thanks

Bernd



Re: [HACKERS] Patch pg_is_in_backup()

2012-02-03 Thread Fujii Masao
On Fri, Feb 3, 2012 at 6:10 PM, Bernd Helmle maili...@oopsware.de wrote:


 --On 3. Februar 2012 13:21:11 +0900 Fujii Masao masao.fu...@gmail.com
 wrote:

 It seems to be more user-friendly to introduce a view like pg_stat_backup
 rather than the function returning an array.


 I like this idea. A use case i saw for monitoring backup_label's in the
 past, was mainly to discover a forgotten exclusive pg_stop_backup() (e.g.
 due to broken backup scripts). If the view would be able to distinguish
 both, exclusive and non-exclusive backups, this would be great.

Agreed. Monitoring an exclusive backup is very helpful. But I wonder
why we would want to monitor a non-exclusive backup. Is there any use case?
If we want to monitor non-exclusive backups, why not pg_dump backups as well?

If there is no use case, it seems sufficient to implement a function
that reports information only about exclusive backups.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Patch pg_is_in_backup()

2012-02-03 Thread Magnus Hagander
On Fri, Feb 3, 2012 at 10:47, Fujii Masao masao.fu...@gmail.com wrote:
 On Fri, Feb 3, 2012 at 6:10 PM, Bernd Helmle maili...@oopsware.de wrote:


 --On 3. Februar 2012 13:21:11 +0900 Fujii Masao masao.fu...@gmail.com
 wrote:

 It seems to be more user-friendly to introduce a view like pg_stat_backup
 rather than the function returning an array.


 I like this idea. A use case i saw for monitoring backup_label's in the
 past, was mainly to discover a forgotten exclusive pg_stop_backup() (e.g.
 due to broken backup scripts). If the view would be able to distinguish
 both, exclusive and non-exclusive backups, this would be great.

 Agreed. Monitoring an exclusive backup is very helpful. But I wonder
 why we want to monitor non-exclusive backup. Is there any use case?

Actually, we can already monitor much of the non-exclusive case through
pg_stat_replication, including the info on when it was started (in
almost every case, that will be more or less the backend_start time
for that one).

 If we want to monitor non-exclusive backup, why not pg_dump backup?

... which we can also monitor through pg_stat_activity by looking at
application_name (which can be faked, of course, but still).

 If there is no use case, it seems sufficient to implement the function
 which reports the information only about exclusive backup.

Yeah, thinking more about it, I think I agree. But the function should
then probably be named in such a way that it's clear that we're
talking about exclusive backups, e.g. not pg_is_in_backup() but
instead pg_is_in_exclusive_backup() (renamed if we change it to return
the timestamp instead, of course, but you get the idea)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Patch pg_is_in_backup()

2012-02-03 Thread Fujii Masao
On Fri, Feb 3, 2012 at 6:52 PM, Magnus Hagander mag...@hagander.net wrote:
 On Fri, Feb 3, 2012 at 10:47, Fujii Masao masao.fu...@gmail.com wrote:
 On Fri, Feb 3, 2012 at 6:10 PM, Bernd Helmle maili...@oopsware.de wrote:


 --On 3. Februar 2012 13:21:11 +0900 Fujii Masao masao.fu...@gmail.com
 wrote:

 It seems to be more user-friendly to introduce a view like pg_stat_backup
 rather than the function returning an array.


 I like this idea. A use case i saw for monitoring backup_label's in the
 past, was mainly to discover a forgotten exclusive pg_stop_backup() (e.g.
 due to broken backup scripts). If the view would be able to distinguish
 both, exclusive and non-exclusive backups, this would be great.

 Agreed. Monitoring an exclusive backup is very helpful. But I wonder
 why we want to monitor non-exclusive backup. Is there any use case?

 Actually, we can already monitor much of the non-exclusive one through
 pg_stat_replication. Including the info on when it was started (at
 least in almost every case, that will be more or less the
 backend_start time for that one)

Right.

 If we want to monitor non-exclusive backup, why not pg_dump backup?

 .. which we can also monitor though pg_stat_activity by looking at
 application_name (which can be faked of course, but still)

Yep.

 If there is no use case, it seems sufficient to implement the function
 which reports the information only about exclusive backup.

 Yeah, thinking more of it, i think I agree. But the function should
 then probably be named in such a way that it's clear that we're
 talking about exclusive backups, e.g. not pg_is_in_backup() but
 instead pg_is_in_exclusive_backup() (renamed if we change it to return
 the timestamp instead, of course, but you get the idea)

Agreed.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] ecpglib use PQconnectdbParams

2012-02-03 Thread Christian Ullrich

* Peter Eisentraut wrote:


I noticed ecpglib still uses PQconnectdb() with a craftily assembled
connection string.  Here is a patch to use PQconnectdbParams() instead.


+   const char *conn_keywords[6];
+   const char *conn_values[6];

Shouldn't these be [7]? You can have up to 6 items, plus the terminator.
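
For reference, here is a minimal sketch of the calling convention in question
(the keyword names and values are made up for illustration; the point is the
trailing NULL entry that PQconnectdbParams() uses to find the end of the
arrays):

    #include <libpq-fe.h>

    static PGconn *
    connect_example(void)
    {
        /* Three slots: two keyword/value pairs plus the NULL terminator. */
        const char *conn_keywords[3] = { "dbname", "port", NULL };
        const char *conn_values[3]   = { "ecpg_test", "5432", NULL };

        return PQconnectdbParams(conn_keywords, conn_values, 0);
    }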

--
Christian Ullrich





Re: [HACKERS] patch for parallel pg_dump

2012-02-03 Thread Robert Haas
On Thu, Feb 2, 2012 at 8:31 PM, Joachim Wieland j...@mcknight.de wrote:
 On Wed, Feb 1, 2012 at 12:24 PM, Robert Haas robertmh...@gmail.com wrote:
 I think we're more-or-less proposing to rename Archive to
 Connection, aren't we?

 And then ArchiveHandle can store all the things that aren't related to
 a specific connection.

 How about something like that:
 (Hopefully you'll come up with better names...)

 StateHandle {
   Connection
   Schema
   Archive
   Methods
   union {
      DumpOptions
      RestoreOptions
   }
 }

 Dumping would mean to do:

  Connection -> Schema -> Archive using DumpOptions through the
 specified Methods

 Restore:

   Archive -> Schema -> Connection using RestoreOptions through the
 specified Methods

 Dumping from one database and restoring into another one would be two
 StateHandles with different Connections, Archive == NULL, Schema
 pointing to the same Schema, Methods most likely also pointing to the
 same function pointer table and each with different Options in the
 union of course.

 Granted, you could say that above I've only grouped the elements of
 the ArchiveHandle, but I don't really see that breaking it up into
 several objects makes it any better or easier. There could be some
 accessor functions to hide the details of the whole object to the
 different consumers.

I'm not sure I understand what the various structures would contain.

My gut feeling for how to begin grinding through this is to go through
and do the following:

1. Rename Archive to Connection.
2. Add a PGconn object to it.
3. Change the declaration inside ArchiveHandle from "Archive public"
to "Connection source_connection".

I think that'd get us significantly closer to sanity and be pretty
simple to understand, and then we can take additional passes over it
until we're happy with what we've got.
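
Purely as an illustration of that plan (the field names below are invented,
not the actual pg_dump structures):

    #include <libpq-fe.h>

    /* Steps 1 and 2: what was "Archive" becomes "Connection" and carries
     * the libpq handle directly. */
    typedef struct Connection
    {
        PGconn     *conn;           /* the database connection itself */
        int         remoteVersion;  /* example of per-connection state */
    } Connection;

    /* Step 3: ArchiveHandle embeds it under a more descriptive name and
     * keeps everything that is not tied to a single connection. */
    typedef struct ArchiveHandle
    {
        Connection  source_connection;  /* was: "Archive public" */
        int         compression;        /* example of non-connection state */
    } ArchiveHandle;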

If you're OK with that much I'll go do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [v9.2] sepgsql's DROP Permission checks

2012-02-03 Thread Robert Haas
On Sat, Jan 28, 2012 at 1:53 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 2012/1/26 Robert Haas robertmh...@gmail.com:
 On Thu, Jan 26, 2012 at 7:27 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 It seems to me reasonable design.
 The attached patch is rebased one according to your perform-deletion patch.

 That looks pretty sensible.  But I don't think this is true any more:

 +    Please note that it shall not be checked on the objects removed by
 +    cascaded deletion according to the standard manner in SQL.

 I've been thinking more about the question of object access hooks with
 arguments as well.  In a couple of designs you've proposed, we've
 needed InvokeObjectAccessHook0..11 or whatever, and I don't like that
 design - it seems grotty and not type-safe.  However, looking at what
 you did in this patch, I have another idea.  Suppose that we add ONE
 additional argument to the object access hook function, as you've done
 here, and if a particular type of hook invocation needs multiple
 arguments, it can pass them with a struct.  In fact, let's use a
 struct regardless, for uniformity, and pass the value as a void *.
 That is:

 typedef struct ObjectAccessDrop {
    int dropflags;
 } ObjectAccessDrop;

 At the call site, we do this:

 if (object_access_hook)
 {
    ObjectAccessDrop arg;
    arg.dropflags = flags;
    InvokeObjectAccessHook(..., &arg);
 }

 If there's no argument, then we can just do:

 InvokeObjectAccessHook(..., NULL);

 The advantage of this is that if we change the structure definition,
 loadable modules compiled against a newer code base should either (1)
 still work or (2) fail to compile.  The way we have it right now, if
 we decide to pass a second argument (say, the DropBehavior) to the
 hook, we're potentially going to silently break sepgsql no matter how
 we do it.  But if we enforce use of a struct, then the only thing the
 client should ever be doing with the argument it gets is casting it to
 ObjectAccessDrop *.  Then, if we've added fields to the struct, the
 code will still do the right thing even if the field order has been
 changed or whatever.   If we've removed fields or changed their data
 types, things should blow up fairly obviously instead of seeming to
 work but actually failing.

 Thoughts?

 I also think your idea is good; flexible and reliable toward future 
 enhancement.

 If we have one point to improve this idea, is it needed to deliver
 access, classid,
 objectid and subid as separated argument?

 If we define a type to deliver information on object access hook as follows:

 typedef struct {
    ObjectAccessType    access;
    ObjectAddress          address;
    union {
        struct {
            int    flags;
        } drop;
    } arg;
 } ObjectAccessHookData;

 All the argument that object_access_hook takes should be a pointer of this
 structure only, and no need to type cast on the module side.

 One disadvantage is that it needs to set up this structure on caller
 side including
 ObjectAccessType and ObjectAddress information. However, it can be embedded
 within preprocessor macro to keep nums of lines as currently we do.

 example:
 #define InvokeDropAccessHook(classid, objectid, objsubid, flags) \
  do {
      if (object_access_hook)
      {
          ObjectAccessHookData  __hook_data;

          __hook_data.access = OAT_DROP;
          __hook_data.address.classId  = (classid);
          __hook_data.address.objectId = (objectid);
          __hook_data.address.objectSubid = (objsubid);
          __hook_data.arg.drop.flags = (flags);

          (*object_access_hook)(__hook_data);
     }
  } while (0)

 How about your opinion?

I don't see any real advantage of that.  One advantage of the current
design is that any hook types which *don't* require extra arguments
need not set up and pass a structure; they can just pass NULL.  So I
suggest we keep classid, objid, and subid as separate arguments, and
just add one new one which can be type-specific.
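
For illustration only, a sketch of what that could look like (the names and
exact macro shape are guesses, not committed code):

    /* Keep classId/objectId/subId as separate arguments and add one
     * opaque, type-specific pointer at the end. */
    typedef void (*object_access_hook_type) (ObjectAccessType access,
                                             Oid classId,
                                             Oid objectId,
                                             int subId,
                                             void *arg);

    extern object_access_hook_type object_access_hook;

    #define InvokeObjectAccessHookArg(access, classId, objectId, subId, arg) \
        do { \
            if (object_access_hook) \
                (*object_access_hook)((access), (classId), (objectId), \
                                      (subId), (void *) (arg)); \
        } while (0)

Hook types that need no extra information would simply pass NULL for arg,
while DROP would pass a pointer to an ObjectAccessDrop struct as described
above.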

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?

2012-02-03 Thread Robert Haas
On Wed, Feb 1, 2012 at 2:39 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Feb 1, 2012 at 7:31 PM, Peter Eisentraut pete...@gmx.net wrote:
 On sön, 2012-01-29 at 22:01 +, Simon Riggs wrote:
 Patch now locks index in AccessExclusiveLock in final stage of drop.

 Doesn't that defeat the point of doing the CONCURRENTLY business in the
 first place?

 That was my initial reaction.

 We lock the index in AccessExclusiveLock only once we are certain
 nobody else is looking at it any more.

 So its a Kansas City Shuffle, with safe locking in case of people
 doing strange low level things.

Yeah, I think this is much safer, and in this version that doesn't
seem to harm concurrency.

Given our previous experiences in this area, I wouldn't like to bet my
life savings on this having no remaining bugs - but if it does, I
can't find them.

I'll mark this Ready for Committer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] Assertion failure in AtCleanup_Portals

2012-02-03 Thread Pavan Deolasee
If I run the following sequence of commands, I get an assertion
failure in current HEAD.

postgres=# BEGIN;
BEGIN
postgres=# SELECT 1/0;
ERROR:  division by zero
postgres=# ROLLBACK TO A;
ERROR:  no such savepoint
postgres=# \q

The process fails when the session is closed and aborted transaction
gets cleaned at the proc_exit time. The stack trace is as below.

(gdb) bt
#0  0x00c16422 in __kernel_vsyscall ()
#1  0x00244651 in *__GI_raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0x00247a82 in *__GI_abort () at abort.c:92
#3  0x0844b800 in ExceptionalCondition (conditionName=0x86161e8
"!(portal->cleanup == ((void *)0))", errorType=0x8615e94
"FailedAssertion",
fileName=0x8615e88 "portalmem.c", lineNumber=793) at assert.c:57
#4  0x08472a5a in AtCleanup_Portals () at portalmem.c:793
#5  0x080f2535 in CleanupTransaction () at xact.c:2373
#6  0x080f40d7 in AbortOutOfAnyTransaction () at xact.c:3855
#7  0x0845cc5e in ShutdownPostgres (code=0, arg=0) at postinit.c:966
#8  0x08332ba7 in shmem_exit (code=0) at ipc.c:221
#9  0x08332ab3 in proc_exit_prepare (code=0) at ipc.c:181
#10 0x08332a15 in proc_exit (code=0) at ipc.c:96
#11 0x0835b07c in PostgresMain (argc=2, argv=0xa12fa64,
username=0xa12f958 "pavan") at postgres.c:4091

I analyzed this a bit. It seems that the first statement throws an
error, runs AbortCurrentTransaction and puts the transaction block in
TBLOCK_ABORT state. Any commands executed after this would usually
fail with the error "current transaction is aborted". But the ROLLBACK TO
gets past this check because IsTransactionExitStmt() returns success
(and rightly so). So we end up creating a portal to run the ROLLBACK
TO command. This command fails because the SAVEPOINT A does not exist
and we throw an error. But since the transaction block is already in
TBLOCK_ABORT state, the AbortCurrentTransaction doesn't do any work.
The cleanup routine for the portal remains unexecuted. Later when the
backend exits and cleanup routines are called, the assertion inside
AtCleanup_Portals fails.

I am not sure what the right fix for this is. I see Tom fixed a
similar complaint some time back.
http://git.postgresql.org/pg/commitdiff/6252c4f9e201f619e5eebda12fa867acd4e4200e

But maybe it's not enough to cover all code paths, as demonstrated by this example.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com



Re: [HACKERS] patch for parallel pg_dump

2012-02-03 Thread Joachim Wieland
On Fri, Feb 3, 2012 at 7:52 AM, Robert Haas robertmh...@gmail.com wrote:
 If you're OK with that much I'll go do it.

Sure, go ahead!



Re: [HACKERS] Review of: explain / allow collecting row counts without timing info

2012-02-03 Thread Robert Haas
On Sat, Jan 21, 2012 at 10:32 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Jan 13, 2012 at 3:07 PM, Tomas Vondra t...@fuzzy.cz wrote:

 Fixed. The default value of TIMING option did not work as intended, it
 was set to true even for plain EXPLAIN (without ANALYZE). In that case
 the EXPLAIN failed.


 I've applied this over the show Heap Fetches in EXPLAIN ANALYZE
 output for index-only scans patch.  It applied and built and passes
 installcheck.

 I have not done a full review of this, but I want to say that I want
 this very much.  Not only can I get the row counts, but also the Heap
 Fetches and the output of BUFFERS, without the overhead of timing.  I
 haven't looked at contrib aspects of it at all.

 Thank you for making this.

+1.

After looking at this a bit, I think we can make the interface a bit
more convenient.  I'd like to propose that we call the EXPLAIN option
ROWS rather than TIMING, and that we provide the row-count
information if *either* ROWS or ANALYZE is specified.

Then, if you want rows without timing, you can say this:

EXPLAIN (ROWS) query...

Rather than, as in the approach taken by the current patch:

EXPLAIN (ANALYZE, TIMING off) query...

...which is a little more long-winded.  This also has the nice
advantage of making the name of the EXPLAIN option match the name of
the INSTRUMENT flag.

Revised patch along those lines attached.  Barring objections, I'll
commit this version.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


explain-rows.patch
Description: Binary data



[HACKERS] xlog min recovery request ... is past current point ...

2012-02-03 Thread Christophe Pettus
PostgreSQL 9.0.4:

While bringing up a streaming replica, and while it is working its way through 
the WAL segments before connecting to the primary, I see a lot of messages of 
the form:

2012-02-01 21:26:13.978 PST,,,24448,,4f2a1e61.5f80,54,,2012-02-01 21:25:53 
PST,1/0,0,LOG,0,restored log file 00010DB40065 from 
archive,
2012-02-01 21:26:14.032 PST,,,24448,,4f2a1e61.5f80,55,,2012-02-01 21:25:53 
PST,1/0,0,WARNING,01000,xlog min recovery request DB5/42E15098 is past current 
point DB4/657FA490,writing block 5 of relation base/155650/156470_vm
xlog redo insert: rel 1663/155650/1658867; tid 9640/53
2012-02-01 21:26:14.526 PST,,,24448,,4f2a1e61.5f80,56,,2012-02-01 21:25:53 
PST,1/0,0,LOG,0,restored log file 00010DB40066 from 
archive,

All of these are on _vm relations.  The recovery completed successfully and the 
secondary connected to the primary without issue, so: Are these messages 
something to be concerned over?

--
-- Christophe Pettus
  x...@thebuild.com




Re: [HACKERS] Review of: explain / allow collecting row counts without timing info

2012-02-03 Thread Tomas Vondra
On 3 Únor 2012, 17:12, Robert Haas wrote:
 On Sat, Jan 21, 2012 at 10:32 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Jan 13, 2012 at 3:07 PM, Tomas Vondra t...@fuzzy.cz wrote:

 Fixed. The default value of TIMING option did not work as intended, it
 was set to true even for plain EXPLAIN (without ANALYZE). In that case
 the EXPLAIN failed.


 I've applied this over the show Heap Fetches in EXPLAIN ANALYZE
 output for index-only scans patch.  It applied and built and passes
 installcheck.

 I have not done a full review of this, but I want to say that I want
 this very much.  Not only can I get the row counts, but also the Heap
 Fetches and the output of BUFFERS, without the overhead of timing.  I
 haven't looked at contrib aspects of it at all.

 Thank you for making this.

 +1.

 After looking at this a bit, I think we can make the interface a bit
 more convenient.  I'd like to propose that we call the EXPLAIN option
 ROWS rather than TIMING, and that we provide the row-count
 information if *either* ROWS or ANALYZE is specified.

 Then, if you want rows without timing, you can say this:

 EXPLAIN (ROWS) query...

 Rather than, as in the approach taken by the current patch:

 EXPLAIN (ANALYZE, TIMING off) query...

 ...which is a little more long-winded.  This also has the nice
 advantage of making the name of the EXPLAIN option match the name of
 the INSTRUMENT flag.

 Revised patch along those lines attached.  Barring objections, I'll
 commit this version.

I don't think changing the EXPLAIN syntax this way is a good idea. I think
that one option should not silently enable/disable others, so ROWS should
not enable ANALYZE automatically. It should throw an error just like the
TIMING option does when invoked without ANALYZE (IIRC).

Which leads to syntax like this

  EXPLAIN (ANALYZE, ROWS)

and that's a bit strange because it mentions ROWS but actually manipulates
TIMING behind the curtains. You can't actually do

  EXPLAIN (ANALYZE, ROWS off)

because row counts are always collected. So I think the syntax I proposed
makes more sense.

kind regards
Tomas




Re: [HACKERS] Review of: explain / allow collecting row counts without timing info

2012-02-03 Thread Robert Haas
On Fri, Feb 3, 2012 at 12:08 PM, Tomas Vondra t...@fuzzy.cz wrote:
 I don't think changing the EXPLAIN syntax this way is a good idea. I think
 that one option should not silently enable/disable others, so ROWS should
 not enable ANALYZE automatically.

I didn't propose that.  The point is that the desired behavior
(however we name it) is a SUBSET of what ANALYZE does.

So we can either:

1. Have ANALYZE enable all the behavior, and have another option
(TIMING) that can be used to turn some of it back on again.

2. Have ANALYZE enable all the behavior, and have another option
(ROWS) that enables just the subset of it that we want.

I prefer #2 to #1, because I think it's a bit confusing to have an
option whose effect is to partially disable another option.  YMMV, of
course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Review of: explain / allow collecting row counts without timing info

2012-02-03 Thread Tomas Vondra
On 3.2.2012 18:28, Robert Haas wrote:
 On Fri, Feb 3, 2012 at 12:08 PM, Tomas Vondra t...@fuzzy.cz wrote:
 I don't think changing the EXPLAIN syntax this way is a good idea. I think
 that one option should not silently enable/disable others, so ROWS should
 not enable ANALYZE automatically.
 
 I didn't propose that.  The point is that the desired behavior
 (however we name it) is a SUBSET of what ANALYZE does.
 
 So we can either:
 
 1. Have ANALYZE enable all the behavior, and have another option
 (TIMING) that can be used to turn some of it back on again.
 
 2. Have ANALYZE enable all the behavior, and have another option
 (ROWS) that enables just the subset of it that we want.
 
 I prefer #2 to #1, because I think it's a bit confusing to have an
 option whose effect is to partially disable another option.  YMMV, of
 course.

OK, thanks for the explanation. I don't like the idea of subsets as it
IMHO makes it less obvious what options are enabled. For example this

   EXPLAIN (ROWS) query...

does not immediately show it's actually going to do ANALYZE.

I prefer to keep the current 'ANALYZE' definition, i.e. collecting both
row counts and timing data (which is what 99% of people wants anyway),
and an option to disable the timing.

And the BUFFERS option currently works exactly like that, so defining
ROWS the way you proposed would be inconsistent with the current behavior.

Sure, we could redefine BUFFERS as a subset, so you could do

  EXPLAIN (ROWS) ... instead of ... EXPLAIN (ANALYZE, TIMING off)
  EXPLAIN (BUFFERS) ... instead of ... EXPLAIN (ANALYZE, BUFFERS on)

but what if someone wants both at the same time? Maybe he could do

  EXPLAIN (ROWS, BUFFERS)

and treat that as a union of those subsets. I don't think it's worth it.

I can certainly live with either solution (mine or the one you proposed).


kind regards
Tomas



Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-03 Thread Dan Scales
Hi Robert,

Thanks for the feedback!  I think you make a good point about the small size of 
dirty data in the OS cache.  I think what you can say about this double-write 
patch is that it will not work well for configurations that have a small 
Postgres cache and a large OS cache, since every write from the Postgres cache 
requires double-writes and an fsync.  However, it should work much better for 
configurations with a much larger Postgres cache and relatively smaller OS cache 
(including the configurations that I've given performance results for).  In 
that case, there is a lot more capacity for dirty pages in the Postgres cache, 
and you won't have nearly as many dirty evictions.  The checkpointer is doing a 
good number of the writes, and this patch sorts the checkpointer's buffers so 
its IO is efficient.

Of course, I can also increase the size of the non-checkpointer ring buffer to 
be much larger, though I wouldn't want to make it too large, since it is 
consuming memory.  If I increase the size of the ring buffers significantly, I 
will probably need to add some data structures so that the ring buffer lookups 
in smgrread() and smgrwrite() are more efficient.

Can you let me know what the shared_buffers and RAM sizes were for your pgbench 
run?  I can try running the same workload.  If the size of shared_buffers is 
especially small compared to RAM, then we should increase the size of 
shared_buffers when using double_writes.

Thanks,

Dan 


- Original Message -
From: Robert Haas robertmh...@gmail.com
To: Dan Scales sca...@vmware.com
Cc: PG Hackers pgsql-hackers@postgresql.org
Sent: Thursday, February 2, 2012 7:19:47 AM
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

On Fri, Jan 27, 2012 at 5:31 PM, Dan Scales sca...@vmware.com wrote:
 I've been prototyping the double-write buffer idea that Heikki and Simon
 had proposed (as an alternative to a previous patch that only batched up
 writes by the checkpointer).  I think it is a good idea, and can help
 double-writes perform better in the case of lots of backend evictions.
 It also centralizes most of the code change in smgr.c.  However, it is
 trickier to reason about.

This doesn't compile on MacOS X, because there's no writev().

I don't understand how you can possibly get away with such small
buffers.  AIUI, you must retain every page in the double-write
buffer until it's been written and fsync'd to disk.  That means the
most dirty data you'll ever be able to have in the operating system
cache with this implementation is (128 + 64) * 8kB = 1.5MB.  Granted,
we currently have occasional problems with the OS caching too *much*
dirty data, but that seems like it's going way, way too far in the
opposite direction.  That's barely enough for the system to do any
write reordering at all.

I am particularly worried about what happens when a ring buffer is in
use.  I tried running pgbench -i -s 10 with this patch applied,
full_page_writes=off, double_writes=on.  It took 41.2 seconds to
complete.  The same test with the stock code takes 14.3 seconds; and
the actual situation is worse for double-writes than those numbers
might imply, because the index build time doesn't seem to be much
affected, while the COPY takes a small eternity with the patch
compared to the usual way of doing things.  I think the slowdown on
COPY once the double-write buffer fills is on the order of 10x.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] LWLockWaitUntilFree (was: Group commit, revised)

2012-02-03 Thread Robert Haas
On Fri, Jan 27, 2012 at 8:35 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I think you should break this off into a new function,
 LWLockWaitUntilFree(), rather than treating it as a new LWLockMode.
 Also, instead of adding lwWaitOnly, I would suggest that we generalize
 lwWaiting and lwExclusive into a field lwWaitRequest, which can be set
 to 1 for exclusive, 2 for shared, 3 for wait-for-free, etc.  That way
 we don't have to add another boolean every time someone invents a new
 type of wait - not that that should hopefully happen very often.  A
 side benefit of this is that you can simplify the test in
 LWLockRelease(): keep releasing waiters until you come to an exclusive
 waiter, then stop.  This possibly saves one shared memory fetch and
 subsequent test instruction, which might not be trivial - all of this
 code is extremely hot.

 Makes sense. Attached is a new version, doing exactly that.

I think I recommended a bad name for this function.  It's really
LWLockAcquireOrWaitUntilFree.  Can we rename that?

We might also want to consider what all the behaviors are that we
might want here.  It strikes me that there are two possible actions
when a lock is immediately available (acquire or noop) and three
possible actions when it isn't (wait-then-acquire, wait-then-noop, or
immediate-noop), for a total of six possible combinations:

acquire/wait-then-acquire = LWLockAcquire
acquire/wait-then-noop = what you just added
acquire/immediate-noop = LWLockConditionalAcquire
noop/wait-then-acquire
noop/wait-then-noop
noop/immediate-noop

The last one is clearly pointless, since you don't need a lock at all
to do nothing.  But is there a use case for either of the other two?
I can't immediately think of any reason why we'd want
noop/wait-then-acquire but noop/wait-then-noop seems potentially
useful (that's what I originally thought LWLockWaitUntilFree was going
to do).  For example, if for some reason you wanted all the WAL
flushes to happen in one process, you could have everyone else use
that primitive to sleep, and then recheck whether enough flushing had
happened (I don't actually think that's better; it's not very robust
against the WAL writer going away).  Or, for ProcArrayLock, we might (as
you proposed previously) mark our XID with a removal sequence
number in lieu of actually removing it, and then wait for
ProcArrayLock to be free before overwriting it with a new XID.  Since
we're out of time for 9.2 I don't think we probably want to devote too
much time to experimenting with this right now, but it might be
interesting to play around with it next cycle.
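
To make the noop/wait-then-noop idea concrete, a hypothetical sketch (the
wait-only primitive is used here in its literal "wait, don't acquire" sense,
and XLogAlreadyFlushed is an invented helper, not an existing call):

    /* Sleep until WALWriteLock is free without ever acquiring it, then
     * recheck whether someone else's flush already covered our record. */
    for (;;)
    {
        if (XLogAlreadyFlushed(myRecPtr))   /* invented helper */
            break;                          /* flushed far enough; done */
        LWLockWaitUntilFree(WALWriteLock);  /* wait, but do not acquire */
    }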

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-03 Thread Robert Haas
On Fri, Feb 3, 2012 at 3:14 PM, Dan Scales sca...@vmware.com wrote:
 Thanks for the feedback!  I think you make a good point about the small size 
 of dirty data in the OS cache.  I think what you can say about this 
 double-write patch is that it will work not work well for configurations that 
 have a small Postgres cache and a large OS cache, since every write from the 
 Postgres cache requires double-writes and an fsync.

The general guidance for setting shared_buffers these days is 25% of
RAM up to a maximum of 8GB, so the configuration that you're
describing as not optimal for this patch is the one normally used when
running PostgreSQL.  I've run across several cases where larger values
of shared_buffers are a huge win, because the entire working set can
then be accommodated in shared_buffers.  But it's certainly not the
case that all working sets fit.

And in this case, I think that's beside the point anyway.  I had
shared_buffers set to 8GB on a machine with much more memory than
that, but the database created by pgbench -i -s 10 is about 156 MB, so
the problem isn't that there is too little PostgreSQL cache available.
 The entire database fits in shared_buffers, with most of it left
over.  However, because of the BufferAccessStrategy stuff, pages start
to get forced out to the OS pretty quickly.  Of course, we could
disable the BufferAccessStrategy stuff when double_writes is in use,
but bear in mind that the reason we have it in the first place is to
prevent cache thrashing effects.  It would be imprudent of us to throw
that out the window without replacing it with something else that
would provide similar protection.  And even if we did, that would just
delay the day of reckoning.  You'd be able to blast through and dirty
the entirety of shared_buffers at top speed, but then as soon as you
started replacing pages performance would slow to an utter crawl, just
as it did here, only you'd need a bigger scale factor to trigger the
problem.

The more general point here is that there are MANY aspects of
PostgreSQL's design that assume that shared_buffers accounts for a
relatively small percentage of system memory.  Here's another one: we
assume that backends that need temporary memory for sorts and hashes
(i.e. work_mem) can just allocate it from the OS.  If we were to start
recommending setting shared_buffers to large percentages of the
available memory, we'd probably have to rethink that.  Most likely,
we'd need some kind of in-core mechanism for allocating temporary
memory from the shared memory segment.  And here's yet another one: we
assume that it is better to recycle old WAL files and overwrite the
contents rather than create new, empty ones, because we assume that
the pages from the old files may still be present in the OS cache.  We
also rely on the fact that an evicted CLOG page can be pulled back in
quickly without (in most cases) a disk access.  We also rely on
shared_buffers not being too large to avoid walloping the I/O
controller too hard at checkpoint time - which is forcing some people
to set shared_buffers much smaller than would otherwise be ideal.  In
other words, even if setting shared_buffers to most of the available
system memory would fix the problem I mentioned, it would create a
whole bunch of new ones, many of them non-trivial.  It may be a good
idea to think about what we'd need to do to work efficiently in that
sort of configuration, but there is going to be a very large amount of
thinking, testing, and engineering that has to be done to make it a
reality.

There's another issue here, too.  The idea that we're going to write
data to the double-write buffer only when we decide to evict the pages
strikes me as a bad one.  We ought to proactively start dumping pages
to the double-write area as soon as they're dirtied, and fsync them
after every N pages, so that by the time we need to evict some page
that requires a double-write, it's already durably on disk in the
double-write buffer, and we can do the real write without having to
wait.  It's likely that, to make this perform acceptably for bulk
loads, you'll need the writes to the double-write buffer and the
fsyncs of that buffer to be done by separate processes, so that one
backend (the background writer, perhaps) can continue spooling
additional pages to the double-write files while some other process (a
new auxiliary process?) fsyncs the ones that are already full.  Along
with that, the page replacement algorithm probably needs to be
adjusted to avoid evicting pages that need an as-yet-unfinished
double-write like the plague, even to the extent of allowing the
BufferAccessStrategy rings to grow if the double-writes can't be
finished before the ring wraps around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Review of: explain / allow collecting row counts without timing info

2012-02-03 Thread Robert Haas
On Fri, Feb 3, 2012 at 2:56 PM, Tomas Vondra t...@fuzzy.cz wrote:
 OK, thanks for the explanation. I don't like the idea of subsets as it
 IMHO makes it less obvious what options are enabled. For example this

   EXPLAIN (ROWS) query...

 does not immediately show it's actually going to do ANALYZE.

Well, it isn't, if ANALYZE means rows + timing...

 I prefer to keep the current 'ANALYZE' definition, i.e. collecting both
 row counts and timing data (which is what 99% of people wants anyway),
 and an option to disable the timing.

 And the BUFFERS option currently works exactly like that, so defining
 ROWS the way you proposed would be inconsistent with the current behavior.

 Sure, we could redefine BUFFERS as a subset, so you could do

  EXPLAIN (ROWS) ... instead of ... EXPLAIN (ANALYZE, TIMING off)
  EXPLAIN (BUFFERS) ... instead of ... EXPLAIN (ANALYZE, BUFFERS on)

 but what if someone wants both at the same time? Maybe he could do

  EXPLAIN (ROWS, BUFFERS)

 and treat that as a union of those subsets. I don't think it's worth it.

Yeah, I forgot that we'd have to allow that, though I don't think it
would be a big deal to fix that.

 I surely can live with both solutions (mine or the one you proposed).

Let's wait and see if anyone else has an opinion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Review of: explain / allow collecting row counts without timing info

2012-02-03 Thread Tomas Vondra
On 3.2.2012 22:51, Robert Haas wrote:
 On Fri, Feb 3, 2012 at 2:56 PM, Tomas Vondra t...@fuzzy.cz wrote:
 OK, thanks for the explanation. I don't like the idea of subsets as it
 IMHO makes it less obvious what options are enabled. For example this

   EXPLAIN (ROWS) query...

 does not immediately show it's actually going to do ANALYZE.
 
 Well, it isn't, if ANALYZE means rows + timing...

Yeah, sure. It depends on how you define ANALYZE. I see that as 'stats
collected at runtime' (in contrast to a query plan, which is just a ...
well, plan).

 but what if someone wants both at the same time? Maybe he could do

  EXPLAIN (ROWS, BUFFERS)

 and treat that as a union of those subsets. I don't think it's worth it.
 
 Yeah, I forgot that we'd have to allow that, though I don't think it
 would be a big deal to fix that.
 
 I surely can live with both solutions (mine or the one you proposed).
 
 Let's wait and see if anyone else has an opinion.

Sure. And thanks for your feedback!

Tomas



Re: [HACKERS] Hot standby fails if any backend crashes

2012-02-03 Thread Daniel Farina
On Thu, Feb 2, 2012 at 8:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 It's a bit disturbing that nobody has reported this from the field yet.
 Seems to imply that hot standby isn't being used much.

I have seen this, but didn't get to dig in, as I thought it could be a
problem from other things done outside Postgres (it also came up in
#6200, but I didn't mention it).

Consider it retroactively reported.

-- 
fdr



Re: [HACKERS] Should we add crc32 in libpgport?

2012-02-03 Thread Daniel Farina
On Tue, Jan 31, 2012 at 3:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Daniel Farina dan...@heroku.com writes:
 On Tue, Jan 17, 2012 at 2:14 AM, Daniel Farina dan...@heroku.com wrote:
 See the attached patch.  It has a detailed cover letter/comment at the
 top of the file.

 I have amended that description to be more accurate.

 I looked at this a bit, and it seems to go considerably further than
 I had in mind: unless I've missed something, this will instantiate a
 couple of kilobytes of static data in every .c file that includes
 pg_crc.h, directly or indirectly.  While that might be tolerable in an
 external project, there are going to be a LOT of copies of that table
 in the backend, many of them unused.  Did you check to see how much
 larger the backend executable got?  For that matter, aren't there a lot
 of build warnings about unused static variables?

Ah, yes, I think my optimizations were off when building, or
something.  I didn't get such verbosity at first, and then I remember
doing something slightly different and then getting a lot of output.
I didn't pay attention to the build size.  I will investigate.

 It occurs to me that rather than an #ifdef symbol, it might be
 appropriate to put the constant table into a separate .h file,
 say pg_crc_tables.h, so that users would control it by including
 that file or not rather than an #ifdef symbol.  This is pretty
 trivial but might look a bit nicer.

I agree; I was about to say "what about a preprocessor hack..." after
the last paragraph, but I see you beat me to the punch.  I'll have a look soon.

-- 
fdr



Re: [HACKERS] ecpglib use PQconnectdbParams

2012-02-03 Thread Michael Meskes
On Thu, Feb 02, 2012 at 08:01:48PM +0200, Peter Eisentraut wrote:
 I noticed ecpglib still uses PQconnectdb() with a craftily assembled
 connection string.  Here is a patch to use PQconnectdbParams() instead.

Thanks, committed. Will sync as soon as I'm online again.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at googlemail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL



[HACKERS] Re: Review of: explain / allow collecting row counts without timing info

2012-02-03 Thread Noah Misch
On Fri, Feb 03, 2012 at 11:12:33AM -0500, Robert Haas wrote:
 After looking at this a bit, I think we can make the interface a bit
 more convenient.  I'd like to propose that we call the EXPLAIN option
 ROWS rather than TIMING, and that we provide the row-count
 information if *either* ROWS or ANALYZE is specified.
 
 Then, if you want rows without timing, you can say this:
 
 EXPLAIN (ROWS) query...
 
 Rather than, as in the approach taken by the current patch:
 
 EXPLAIN (ANALYZE, TIMING off) query...
 
 ...which is a little more long-winded.

I favor the originally-submitted syntax.  I like having a single option,
ANALYZE, signal whether to actually run the statement.  The submitted syntax
also aligns with the fact that would motivate someone to use it: "Timing has
high overhead on my system." or "Timing makes the output unstable."

We have both estimated row counts and actual row counts.  It may someday prove
useful to have an invocation that plans the query and shows estimated row
counts, omitting only cost estimates.  Having a ROWS option that implies
running the query could complicate that effort.

Thanks,
nm



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-02-03 Thread Kevin Grittner
Robert Haas  wrote:
 
 A couple of things stand out at me from these graphs. First, some
 of these transactions had really long latency. Second, there are a
 remarkable number of seconds all of the test during which no
 transactions at all manage to complete, sometimes several seconds
 in a row. I'm not sure why. Third, all of the tests initially start
 of processing transactions very quickly, and get slammed down very
 hard, probably because the very high rate of transaction processing
 early on causes a checkpoint to occur around 200 s.
 
The amazing performance at the very start of all of these tests
suggests that there is a write-back cache (presumably battery-backed)
which is absorbing writes until the cache becomes full, at which
point actual disk writes become a bottleneck.  The problems you
mention here, where no transactions complete, sounds like the usual
problem that many people have complained about on the lists, where
the controller cache becomes so overwhelmed that activity seems to
cease while the controller catches up.  Greg, and to a lesser degree
myself, have written about this for years.
 
On the nofpw graph, I wonder whether the lower write rate just takes
that much longer to fill the controller cache.  I don't think it's
out of the question that it could take 700 seconds instead of 200
seconds depending on whether full pages are being fsynced to WAL. 
This effect is precisely why I think that on such machines the DW
feature may be a huge help.  If one small file is being written to
and fsynced repeatedly, it stays fresh enough not to actually be
written to the disk (it will stay in OS or controller cache), and the
disks are freed up to write everything else, helping to keep the
controller cache from being overwhelmed.  (Whether patches to date
are effective at achieving this is a separate question -- I'm
convinced the concept is sound for certain important workloads.)
 
 I didn't actually log when the checkpoints were occurring,
 
It would be good to have that information if you can get it for
future tests.
 
-Kevin



Re: [HACKERS] Review of: explain / allow collecting row counts without timing info

2012-02-03 Thread Jeff Janes
On Fri, Feb 3, 2012 at 8:12 AM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Jan 21, 2012 at 10:32 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Jan 13, 2012 at 3:07 PM, Tomas Vondra t...@fuzzy.cz wrote:

 Fixed. The default value of TIMING option did not work as intended, it
 was set to true even for plain EXPLAIN (without ANALYZE). In that case
 the EXPLAIN failed.


 I've applied this over the show Heap Fetches in EXPLAIN ANALYZE
 output for index-only scans patch.  It applied and built and passes
 installcheck.

 I have not done a full review of this, but I want to say that I want
 this very much.  Not only can I get the row counts, but also the Heap
 Fetches and the output of BUFFERS, without the overhead of timing.  I
 haven't looked at contrib aspects of it at all.

 Thank you for making this.

 +1.

 After looking at this a bit, I think we can make the interface a bit
 more convenient.  I'd like to propose that we call the EXPLAIN option
 ROWS rather than TIMING, and that we provide the row-count
 information if *either* ROWS or ANALYZE is specified.

If we are allowed to make big breaks with the past, I would get rid of
ANALYZE altogether.

analyze basically is a synonym of explain, or is contained within
it.  You can't explain anything until you have first analyzed it.

Without ANALYZE, you are explaining the analysis done by the planner.
With it, you are explaining an analysis done by the planner and the
executor.

So we could do away with ANALYZE and have TIMING, ROWS, and BUFFERS,
any of which causes the query to actually be executed, not just
planned.

I suspect we will be unwilling to make such a break with the past.  In
that case, I think I prefer the originally proposed semantics, even
though I agree they are somewhat less natural.  ANALYZE is a big flag
that means "This query will be executed, not just planned."  If we are
not going to make a major break, but only nibble around the edges,
then I don't think we should remove the property that the query will
be executed if and only if ANALYZE is specified.

Cheers,

Jeff



Re: [HACKERS] Checkpoint sync pause

2012-02-03 Thread Jeff Janes
On Mon, Jan 16, 2012 at 5:59 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 01/16/2012 11:00 AM, Robert Haas wrote:

 Also, I am still struggling with what the right benchmarking methodology
 even is to judge whether
 any patch in this area works.  Can you provide more details about
 your test setup?


 The test setup is a production server with a few hundred users at peak
 workload, reading and writing to the database.  Each RAID controller (couple
 of them with their own tablespaces) has either 512MG or 1GB of
 battery-backed write cache.  The setup that leads to the bad situation
 happens like this:

 -The steady stream of backend writes that happen between checkpoints have
 filled up most of the OS write cache.  A look at /proc/meminfo shows around
 2.5GB Dirty:

"backend writes" includes bgwriter writes, right?


 -Since we have shared_buffers set to 512MB to try and keep checkpoint storms
 from being too bad, there might be 300MB of dirty pages involved in the
 checkpoint.  The write phase dumps this all into Linux's cache.  There's now
 closer to 3GB of dirty data there.  @64GB of RAM, this is still only 4.7%
 though--just below the effective lower range for dirty_background_ratio.

Has using a newer kernel with dirty_background_bytes been tried, so it
could be set to a lower level?  If so, how did it do?  Or does it just
refuse to obey below the 5% level, as well?

  Linux is perfectly content to let it all sit there.

 -Sync phase begins.  Between absorption and the new checkpoint writes, there
 are 300 segments to sync waiting here.

If there is 3GB of dirty data spread over 300 segments, and each segment
is about full-sized (1GB), then on average 1% of each segment is
dirty?

If that analysis holds, then it seems like there is simply an awful lot
of data that has to be written randomly, no matter how clever the
re-ordering is.  In other words, it is not that a harried or panicked
kernel or RAID controller is failing to do good re-ordering when it has
opportunities to; it is just that you dirty your data too randomly for
substantial reordering to be possible even under ideal conditions.

Does the BBWC, once given an fsync command and reporting success,
write out those blocks forthwith, or does it lollygag around like the
kernel (under non-fsync) does?  If it is waiting around for
write-combining opportunities that will never actually materialize in
sufficient quantities to make up for the wait, how do we get it to stop?

Was the sorted checkpoint with an fsync after every file (real file,
not VFD) one of the changes you tried?

 -The first few syncs force data out of Linux's cache and into the BBWC.
  Some of these return almost instantly.  Others block for a moderate number
 of seconds.  That's not necessarily a showstopper, on XFS at least.  So long
 as the checkpointer is not being given all of the I/O in the system, the
 fact that it's stuck waiting for a sync doesn't mean the server is
 unresponsive to the needs of other backends.  Early data might look like
 this:

 DEBUG:  Sync #1 time=21.969000 gap=0.00 msec
 DEBUG:  Sync #2 time=40.378000 gap=0.00 msec
 DEBUG:  Sync #3 time=12574.224000 gap=3007.614000 msec
 DEBUG:  Sync #4 time=91.385000 gap=2433.719000 msec
 DEBUG:  Sync #5 time=2119.122000 gap=2836.741000 msec
 DEBUG:  Sync #6 time=67.134000 gap=2840.791000 msec
 DEBUG:  Sync #7 time=62.005000 gap=3004.823000 msec
 DEBUG:  Sync #8 time=0.004000 gap=2818.031000 msec
 DEBUG:  Sync #9 time=0.006000 gap=3012.026000 msec
 DEBUG:  Sync #10 time=302.75 gap=3003.958000 msec

Syncs 3 and 5 kind of surprise me.  It seems like the times should be
more bimodal.  If the cache is already full, why doesn't the system
promptly collapse, like it does later?  And if it is not, why would it
take 12 seconds, or even 2 seconds?  Or is this just evidence that the
gaps you are inserting are partially, but not completely, effective?


 [Here 'gap' is a precise measurement of how close the sync pause feature is
 working, with it set to 3 seconds.  This is from an earlier version of this
 patch.  All the timing issues I used to measure went away in the current
 implementation because it doesn't have to worry about doing background
 writer LRU work anymore, with the checkpointer split out]

 But after a few hundred of these, every downstream cache is filled up.  The
 result is seeing more really ugly sync times, like #164 here:

 DEBUG:  Sync #160 time=1147.386000 gap=2801.047000 msec
 DEBUG:  Sync #161 time=0.004000 gap=4075.115000 msec
 DEBUG:  Sync #162 time=0.005000 gap=2943.966000 msec
 DEBUG:  Sync #163 time=962.769000 gap=3003.906000 msec
 DEBUG:  Sync #164 time=45125.991000 gap=3033.228000 msec
 DEBUG:  Sync #165 time=4.031000 gap=2818.013000 msec
 DEBUG:  Sync #166 time=212.537000 gap=3039.979000 msec
 DEBUG:  Sync #167 time=0.005000 gap=2820.023000 msec
 ...
 DEBUG:  Sync #355 time=2.55 gap=2806.425000 msec
 LOG:  Sync 355 files longest=45125.991000 msec average=1276.177977 msec

 At the same time #164 is 

Re: [HACKERS] LWLockWaitUntilFree (was: Group commit, revised)

2012-02-03 Thread Jeff Janes
On Fri, Feb 3, 2012 at 12:57 PM, Robert Haas robertmh...@gmail.com wrote:

 I think I recommended a bad name for this function.  It's really
 LWLockAcquireOrWaitUntilFree.  Can we rename that?

 We might also want to consider what all the behaviors are that we
 might want here.  It strikes me that there are two possible actions
 when a lock is immediately available (acquire or noop) and three
 possible actions when it isn't (wait-then-acquire, wait-then-noop, or
 immediate-noop), for a total of six possible combinations:

 acquire/wait-then-acquire = LWLockAcquire
 acquire/wait-then-noop = what you just added
 acquire/immediate-noop = LWLockConditionalAcquire
 noop/wait-then-acquire
 noop/wait-then-noop
 noop/immediate-noop

 The last one is clearly pointless, since you don't need a lock at all
 to do nothing.  But is there a use case for either of the other two?

You've only considered combinations with the same lock mode on each
side of the /.  One thing that I've wanted is:
acquire-exclusive-immediate-and-return-true / wait-then-acquire-shared-and-return-false.

The idea was that there is something that I need to verify has been
done, but it doesn't matter who did it.  To do the work, I need the
lock in exclusive mode, but to verify that the work has already been
done I only need it in shared mode.  So, if I don't get the exclusive
lock immediately, then by the time I could get it the work has probably
been done by someone else, so I only need a shared lock to verify that.
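
Roughly, and approximated with the existing primitives rather than done
as a new primitive (the function name is made up, and a real version
would presumably make the decision atomically under the lock's spinlock
instead of issuing two separate calls):

#include "postgres.h"
#include "storage/lwlock.h"

/*
 * Hypothetical: take the lock exclusively if that can be done without
 * blocking; otherwise block and take it in shared mode instead.
 * Returns true if we hold it exclusive (go do the work), false if we
 * hold it shared (just verify that the work got done).
 */
static bool
LWLockAcquireExclusiveOrWaitShared(LWLockId lockid)
{
    if (LWLockConditionalAcquire(lockid, LW_EXCLUSIVE))
        return true;        /* nobody held it; we get to do the work */

    /*
     * Couldn't get it immediately; by the time we could have, someone
     * has probably done the work, so shared is enough to check that.
     */
    LWLockAcquire(lockid, LW_SHARED);
    return false;
}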

Come to think of it, the case I wanted that behavior for is the very
thing that inspired WaitUntilFree in the first place: improving group
commit.  But I see that in this case the verification is being done
under a spinlock rather than a shared LWLock.  I don't know whether
there is any use case left for my proposed behavior, or whether
spinlock-based verification would always be easy enough to arrange to
make it pointless.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-02-03 Thread Jeff Janes
On Mon, Jan 30, 2012 at 6:57 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 27.01.2012 15:38, Robert Haas wrote:

 On Fri, Jan 27, 2012 at 8:35 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 Yeah, we have to be careful with any overhead in there, it can be a hot
 spot. I wouldn't expect any measurable difference from the above, though.
 Could I ask you to rerun the pgbench tests you did recently with this
 patch?
 Or can you think of a better test for this?


 I can't do so immediately, because I'm waiting for Nate Boley to tell
 me I can go ahead and start testing on that machine again.  But I will
 do it once I get the word.


 I committed this. I ran pgbench test on an 8-core box and didn't see any
 slowdown. It would still be good if you get a chance to rerun the bigger
 test, but I feel confident that there's no measurable slowdown.

Is it safe to assume that, under #ifdef LWLOCK_STATS, a call to
LWLockAcquire will always precede any call to LWLockWaitUntilFree
when a new process is started, so that the stats arrays get calloc'd?

I guess it is right now, because the only user is WALWrite, which
would never be acquired before WALInsert has been acquired at least
once.  But this doesn't seem very future-proof.

I guess the same complaint could be lodged against LWLockConditionalAcquire.

Since people wouldn't be expected to define LWLOCK_STATS on production
builds, perhaps this issue is ignorable.
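
One way to make it order-independent would be to pull the per-backend
setup into a helper that every entry point (LWLockAcquire,
LWLockConditionalAcquire, LWLockWaitUntilFree) calls before touching
the counters.  A rough sketch (the helper and array names here are made
up, not the actual LWLOCK_STATS code):

#include "postgres.h"
#include "miscadmin.h"      /* for MyProcPid */

#ifdef LWLOCK_STATS
/* hypothetical per-backend counters; names are illustrative only */
static int  counts_for_pid = 0;
static int *acquire_counts = NULL;
static int *block_counts = NULL;

/*
 * Make sure this backend's stats arrays exist, no matter which LWLock
 * entry point happens to run first in the process.
 */
static void
lwlock_stats_init_if_needed(int num_locks)
{
    if (counts_for_pid != MyProcPid)
    {
        acquire_counts = calloc(num_locks, sizeof(int));
        block_counts = calloc(num_locks, sizeof(int));
        counts_for_pid = MyProcPid;
    }
}
#endif   /* LWLOCK_STATS */

Then whichever entry point runs first in a new process does the calloc,
and adding another entry point later can't reintroduce the ordering
assumption.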

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Caching for stable expressions with constant arguments v6

2012-02-03 Thread Jaime Casanova
On Mon, Jan 16, 2012 at 12:06 PM, Marti Raudsepp ma...@juffo.org wrote:

 Here's v6 of my expression caching patch.

a little review...

first, i notice a change of behaviour... i'm not sure if i can say
this is good or not.

with this function:

create or replace function cached_random() returns numeric as $$
begin
raise notice 'cached';
return random();
end;
$$ language plpgsql stable;

if you execute: select *, cached_random() from (select
generate_series(1, 10) ) i;

on head you get 10 different random numbers; with your patch you get
the same random number 10 times... which means your patch makes the
stable promise a hard one.
personally i think that's good, but i know there are people who,
mistakenly, use volatile functions inside stable ones

---

seems you are moving code around in simplify_function(); do you think
it would be useful to do that independently, at least if it provides
some clarity to the code?

---

benchmark. i ran a few tests on my laptop (which is not very performant, but...)
from what i see there is not much gain for the amount of complexity
added... i can see there should be cases with a lot more gain (for
example, if you use a function to hide a select and you use such a
function several times in the query... but i guess that should work the
same with a CTE)


configuration:

        name        | setting
--------------------+---------
 shared_buffers     | 4096
 synchronous_commit | off

filesystem: xfs

-- This is from the bench_cache.sh but with -T 150
select * from ts where ts between to_timestamp('2005-01-01',
'YYYY-MM-DD') and to_timestamp('2005-01-01', 'YYYY-MM-DD');
head 0.423855
cache 1.949242
select * from ts where tsnow();
head 2.420200
cache 2.580885

/*uncachable*/ select * from one where ts >=
to_date(clock_timestamp()::date::text, 'YYYY-MM-DD') and ts <
(to_date(clock_timestamp()::date::text, 'YYYY-MM-DD') + interval '1
year')
head 955.007129
cache 846.917163

/*cachable*/ select * from one where ts >= to_date(now()::date::text,
'YYYY-MM-DD') and ts < (to_date(now()::date::text, 'YYYY-MM-DD') +
interval '1 year')
head 827.484067
cache 801.743863

a benchmark with pgbench, scale 1 (average of 3 runs, -T 300,
clients = 1 except on the second run)
-scale 1
 == simple ==
head 261.833794
cache 250.22167
 == simple (10 clients) ==
head 244.075592
cache 233.815389
 == extended ==
head 194.676093
cache 202.778580
 == prepared ==
head 300.460328
cache 302.061739
 == select only ==
head 886.207252
cache 909.832986


a benchmark with pgbench, scale 20 (average of 3 runs, -T 300,
clients = 1 except on the second run)
-scale 20
 == simple ==
head 19.890278
cache 19.536342
 == simple (10 clients) ==
head 40.864455
cache 44.457357
 == extended ==
head 21.372751
cache 19.992955
 == prepared ==
head 19.543434
cache 20.226981
 == select only ==
head 31.780529
cache 36.410658

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers