Re: [HACKERS] pgindent README correction

2012-02-08 Thread Magnus Hagander
On Wednesday, February 8, 2012, Bruce Momjian wrote:

 On Mon, Jan 09, 2012 at 01:32:49PM -0500, Robert Haas wrote:
   The one other issue I ran into in following the latest pgindent
   instructions was that I had to add #include <stdlib.h> to the
   parse.c file (as included in the pg_bsd_indent-1.1.tar.gz file at
   ftp://ftp.postgresql.org/pub/dev ).  Without it I got this:
  
   parse.c: In function 'parse':
   parse.c:236:6: warning: implicit declaration of function 'exit'
   parse.c:236:6: warning: incompatible implicit declaration of built-in
   function 'exit'
  
   Can someone fix that and put up a 1.2 version?
 
  Sounds like a job for Mr. Momjian.

 What server do I log into to update the ftp pgindent tgz file?


The ftp master server is fornax.postgresql.org.

//Magnus



-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] controlling the location of server-side SSL files

2012-02-08 Thread Magnus Hagander
On Tuesday, February 7, 2012, Peter Eisentraut wrote:

 On tis, 2012-01-24 at 22:05 +0200, Peter Eisentraut wrote:
One thing that is perhaps worth thinking about:  Currently, we just
ignore missing root.crt and root.crl files.  With this patch, we
 still
do this, even if the user has given a specific nondefault location.
That seems a bit odd, but I can't think of a simple way to do it
 better.
  
   There's a review in the CF app for this finding only minor issues, so
   I'm marking this patch therein as Ready for Committer.
 
  OK, no one had any concerns about the missing file behavior I
  described above?  If not, then I'll commit it soon.

 I'm still worried about this.  If we ignore a missing root.crt, then the
 effect is that authentication and certificate verification might fail,
 which would be annoying, but you'd notice it soon enough.  But if we
 ignore a missing root.crl, we are creating a security hole.


Yes, ignoring a missing file in a security context is definitely not good.
It should throw an error.

We have a few bad defaults from the old days around SSL for this, but if it
requires breaking backwards compatibility to get it right, I think we
should still do it.


My best idea at the moment is that we should set these parameters to
 empty by default, and make users point them to existing files if they
 want to use that functionality.  Comments?


+1. Anybody who actually cares about setting up security is likely not
going to rely on defaults anyway - and is certainly going to review
whatever they are. So there should be no big problem there.

//Magnus



-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Memory usage during sorting

2012-02-08 Thread Hitoshi Harada
On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 The attached patch allows it to reuse that memory.  On my meager
 system it reduced the building of an index on an integer column in a
 skinny 200 million row totally randomly ordered table by about 3% from
 a baseline of 25 minutes.


Just to give a standard review, this patch is one line change and
applies cleanly, builds ok.

I'm not quite sure what exactly you're trying to accomplish, but it
seems to me that it's avoiding the first dumptuples cycle by forcing
availMem = 0 even if it's negative. I read your comments as saying it
avoids alternating between reading and writing back and forth with
scattered memory, which defeats the memory cache during the merge
phase; but actually it doesn't affect the merge phase, only the
init-dump phase, does it? If so, I'm not convinced your benchmark
gained 3% from this change. Correct me if I'm wrong.

Anyway, it would be nice to modify the comment just above the change,
since it's no longer accurate with it.

Thanks,
-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory usage during sorting

2012-02-08 Thread Hitoshi Harada
On Sat, Feb 4, 2012 at 10:06 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 The worst thing about the current memory usage is probably that big
 machines can't qsort more than 16,777,216 tuples no matter how much
 memory they have, because memtuples has to live entirely within a
 single allocation, which is currently limited to 1GB.  It is probably
 too late to fix this problem for 9.2. I wish I had started there
 rather than here.

 This 16,777,216 tuple limitation will get even more unfortunate if the
 specializations that speed up qsort but not external sort get
 accepted.


I think it's a fair ask to extend our palloc limitation of 1GB into
64-bit space. I see there are a lot of applications that want to
allocate more memory in a single palloc call in user-defined functions,
aggregates, etc. As you may notice, however, the areas of postgres that
would need changes to accomplish it have to be investigated deeply. I
don't know where it's safe to allow it and where it is not. varlena
types are unsafe, but it might be possible to extend the varlena header
to 64 bits in a major release somehow.

 Attached is a completely uncommitable proof of concept/work in
 progress patch to get around the limitation.  It shows a 2 fold
 improvement when indexing an integer column on a 50,000,000 row
 randomly ordered table.

In any case, we need a bird's-eye sketch to attack it, but I guess it's
worthwhile and at some point in the future we definitely must do it,
though I don't know whether that's the next release or the third
release from now.


Thanks,
-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] random_page_cost vs seq_page_cost

2012-02-08 Thread Benedikt Grundmann
On 07/02/12 19:58, Bruce Momjian wrote:
 On Tue, Feb 07, 2012 at 05:06:18PM -0500, Greg Smith wrote:
  On 02/07/2012 03:23 PM, Bruce Momjian wrote:
  Where did you see that there will be an improvement in the 9.2
  documentation?  I don't see an improvement.
  
  I commented that I'm hoping for an improvement in the documentation
  of how much timing overhead impacts attempts to measure this area
  better.  That's from the add timing of buffer I/O requests feature
  submission.  I'm not sure if Bene read too much into that or not; I
  didn't mean to imply that the docs around random_page_cost have
  gotten better.

I guess I did.  But I'm very glad that as a side effect Bruce and Greg 
have improved it ;-)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [pgsql-www] [HACKERS] pgindent README correction

2012-02-08 Thread Dave Page
On Wed, Feb 8, 2012 at 8:13 AM, Magnus Hagander mag...@hagander.net wrote:
 On Wednesday, February 8, 2012, Bruce Momjian wrote:

 On Mon, Jan 09, 2012 at 01:32:49PM -0500, Robert Haas wrote:
    The one other issue I ran into in following the latest pgindent
    instructions was that I had to add #include <stdlib.h> to the
    parse.c file (as included in the pg_bsd_indent-1.1.tar.gz file at
    ftp://ftp.postgresql.org/pub/dev ).  Without it I got this:
   
    parse.c: In function 'parse':
    parse.c:236:6: warning: implicit declaration of function 'exit'
    parse.c:236:6: warning: incompatible implicit declaration of built-in
    function 'exit'
  
   Can someone fix that and put up a 1.2 version?
 
  Sounds like a job for Mr. Momjian.

 What server do I log into to update the ftp pgindent tgz file?


 The ftp master server is fornax.postgresql.org.

Though, you should bookmark ftpmaster.postgresql.org which will always
point to the master FTP server even if it moves to a different host.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier

2012-02-08 Thread Marti Raudsepp
On Wed, Feb 8, 2012 at 06:21, Tom Lane t...@sss.pgh.pa.us wrote:
 Marti Raudsepp ma...@juffo.org writes:
 Case #1 uses the normal textcat(text, text) operator by automatically
 coercing 'x' as text.
 However, case #2 uses the anytextcat(anynonarray, text), which is
 marked as volatile and thus acts as an optimization barrier.

 Hmm ... since those operators were invented (in 8.3), we have adopted a
 policy that I/O functions are presumed to be no worse than stable:
 http://archives.postgresql.org/pgsql-committers/2010-07/msg00307.php
 ISTM that would justify relabeling anytextcat/textanycat as stable,
 which should fix this.

Yes, we should definitely take advantage of that.

I scanned through all of pg_proc, there are 4 functions like this that
can be changed: textanycat, anytextcat, quote_literal and
quote_nullable. All of these have SQL wrappers to cast their argument
to ::text.

 quote_literal  | select pg_catalog.quote_literal($1::pg_catalog.text)
 quote_nullable | select pg_catalog.quote_nullable($1::pg_catalog.text)
 textanycat | select $1 || $2::pg_catalog.text
 anytextcat | select $1::pg_catalog.text || $2

Patch attached (in git am format). Passes all regression tests (except
'json' which fails on my machine even on git master).

No documentation changes necessary AFAICT.

Regards,
Marti
From e1943868d21316ff9126283efec54146c14e00fc Mon Sep 17 00:00:00 2001
From: Marti Raudsepp ma...@juffo.org
Date: Wed, 8 Feb 2012 11:26:03 +0200
Subject: [PATCH] Mark textanycat/quote_literal/quote_nullable functions as
 stable

These are SQL functions that rely on immutable functions/operators, but
were previously marked volatile, since they relied on unknown ::text
casts. As of commit aab353a60b95aadc00f81da0c6d99bde696c4b75, all text
I/O functions are guaranteed to be stable or immutable.

Author: Marti Raudsepp ma...@juffo.org
---
 src/include/catalog/catversion.h |2 +-
 src/include/catalog/pg_proc.h|8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index ae4e5f5..1d92ee3 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*			mmddN */
-#define CATALOG_VERSION_NO	201202072
+#define CATALOG_VERSION_NO	201202081
 
 #endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f8d01fb..d4206f1 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2271,11 +2271,11 @@ DATA(insert OID =  1282 ( quote_ident	   PGNSP PGUID 12 1 0 0 0 f f f t f i 1 0
 DESCR(quote an identifier for usage in a querystring);
 DATA(insert OID =  1283 ( quote_literalPGNSP PGUID 12 1 0 0 0 f f f t f i 1 0 25 25 _null_ _null_ _null_ _null_ quote_literal _null_ _null_ _null_ ));
 DESCR(quote a literal for usage in a querystring);
-DATA(insert OID =  1285 ( quote_literalPGNSP PGUID 14 1 0 0 0 f f f t f v 1 0 25 2283 _null_ _null_ _null_ _null_ select pg_catalog.quote_literal($1::pg_catalog.text) _null_ _null_ _null_ ));
+DATA(insert OID =  1285 ( quote_literalPGNSP PGUID 14 1 0 0 0 f f f t f s 1 0 25 2283 _null_ _null_ _null_ _null_ select pg_catalog.quote_literal($1::pg_catalog.text) _null_ _null_ _null_ ));
 DESCR(quote a data value for usage in a querystring);
 DATA(insert OID =  1289 ( quote_nullable   PGNSP PGUID 12 1 0 0 0 f f f f f i 1 0 25 25 _null_ _null_ _null_ _null_ quote_nullable _null_ _null_ _null_ ));
 DESCR(quote a possibly-null literal for usage in a querystring);
-DATA(insert OID =  1290 ( quote_nullable   PGNSP PGUID 14 1 0 0 0 f f f f f v 1 0 25 2283 _null_ _null_ _null_ _null_ select pg_catalog.quote_nullable($1::pg_catalog.text) _null_ _null_ _null_ ));
+DATA(insert OID =  1290 ( quote_nullable   PGNSP PGUID 14 1 0 0 0 f f f f f s 1 0 25 2283 _null_ _null_ _null_ _null_ select pg_catalog.quote_nullable($1::pg_catalog.text) _null_ _null_ _null_ ));
 DESCR(quote a possibly-null data value for usage in a querystring);
 
 DATA(insert OID = 1798 (  oidin			   PGNSP PGUID 12 1 0 0 0 f f f t f i 1 0 26 2275 _null_ _null_ _null_ _null_ oidin _null_ _null_ _null_ ));
@@ -2747,8 +2747,8 @@ DESCR(adjust time precision);
 DATA(insert OID = 1969 (  timetz		   PGNSP PGUID 12 1 0 0 0 f f f t f i 2 0 1266 1266 23 _null_ _null_ _null_ _null_ timetz_scale _null_ _null_ _null_ ));
 DESCR(adjust time with time zone precision);
 
-DATA(insert OID = 2003 (  textanycat	   PGNSP PGUID 14 1 0 0 0 f f f t f v 2 0 25 25 2776 _null_ _null_ _null_ _null_ select $1 || $2::pg_catalog.text _null_ _null_ _null_ ));
-DATA(insert OID = 2004 (  anytextcat	   PGNSP PGUID 14 1 0 0 0 f f f t f v 2 0 25 2776 25 _null_ _null_ _null_ _null_ select $1::pg_catalog.text || $2 _null_ _null_ _null_ ));
+DATA(insert OID = 2003 (  textanycat	   PGNSP PGUID 14 1 0 0 0 f f f t f s 2 0 25 25 2776 _null_ _null_ _null_ _null_ select $1 || $2::pg_catalog.text _null_ _null_ _null_ ));
+DATA(insert OID = 2004 (  anytextcat	   PGNSP 

Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-08 Thread Simon Riggs
On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch n...@leadboat.com wrote:
 On Tue, Feb 07, 2012 at 08:58:59PM +, Simon Riggs wrote:
 On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote:
  On Wed, Jan 11, 2012 at 10:12:31PM +, Simon Riggs wrote:

  This patch uses FPIs to guard against torn hint writes, even when the
  checksums are disabled.  One could not simply condition them on the
  page_checksums setting, though.  Suppose page_checksums=off and we're hinting
  a page previously written with page_checksums=on.  If that write tears,
  leaving the checksum intact, that block will now fail verification.  A couple
  of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
  old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
  Whenever the cluster starts with checksums disabled, set the field to
  InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
  field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
  failure occurs in a page with LSN older than the stored one, emit either a
  softer warning or no message at all.

 We can only change page_checksums at restart (now) so the above seems moot.

 If we crash with FPWs enabled we repair any torn pages.

 There's no live bug, but that comes at a high cost: the patch has us emit
 full-page images for hint bit writes regardless of the page_checksums setting.

Sorry, I don't understand what you mean. I don't see any failure cases
that require that.

page_checksums can only change at a shutdown checkpoint,

The statement "If that write tears, leaving the checksum intact, that
block will now fail verification" cannot happen, ISTM.

If we write out a block we update the checksum if page_checksums is
set, or we clear it. If we experience a torn page at crash, the FPI
corrects that, so the checksum never does fail verification. We only
need to write a FPI when we write with checksums.

If that's wrong, please explain a failure case in detail.


  PageSetLSN() is not atomic, so the shared buffer content lock we'll be 
  holding
  is insufficient.

 Am serialising this by only writing PageLSN while holding buf hdr lock.

 That means also taking the buffer header spinlock in every PageGetLSN() caller
 holding only a shared buffer content lock.  Do you think that will pay off,
 versus the settled pattern of trading here your shared buffer content lock for
 an exclusive one?

Yes, I think it will pay off. This is the only code location where we
do that, and we are already taking the buffer header lock, so there is
effectively no overhead.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-02-08 Thread Marko Kreen
On Wed, Feb 08, 2012 at 02:44:13PM +0900, Shigeru Hanada wrote:
 (2012/02/07 23:44), Marko Kreen wrote:
  On Tue, Feb 07, 2012 at 07:25:14PM +0900, Shigeru Hanada wrote:
  - What is the right (or recommended) way to prevent from throwing
  exceptoin in row-processor callback function?  When author should use
  PG_TRY block to catch exception in the callback function?
  
  When it calls backend functions that can throw exceptions?
  As all handlers running in backend will want to convert data
  to Datums, that means always wrap handler code in PG_TRY?
 
 ISTM that telling a) what happens to PGresult and PGconn when row
 processor ends without return, and b) how to recover them would be
 necessary.  We can't assume much about caller because libpq is a client
 library.  IMHO, it's most important to tell authors of row processor
 clearly what should be done on error case.

Yes.

 In such case, must we call PQfinish, or is calling PQgetResult until it
 returns NULL enough to reuse the connection?  IIUC calling
 pqClearAsyncResult seems sufficient, but it's not exported for client...

Simply dropping ->result is not useful as there are still rows
coming in.  Now you cannot process them anymore.

The rule of "after an exception it's valid to close the connection with
PQfinish() or continue processing with PQgetResult()/PQisBusy()/PQconsumeInput()"
seems quite simple and clear.  Perhaps only a clarification of what's valid on a
sync connection and what's valid on an async connection would be useful.
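
To make the "continue processing" option concrete, here is a minimal sketch
(not taken from the patch; the function name and the keep_connection flag are
illustrative) using only calls that exist in stock libpq.  Whether this is
sufficient once a row processor has thrown is exactly the question above:

#include <libpq-fe.h>

/*
 * Sketch: after an error escapes result processing, either abandon the
 * connection entirely, or drain whatever the server still has queued so
 * the connection can be reused for the next query.
 */
void
recover_connection(PGconn *conn, int keep_connection)
{
    if (!keep_connection)
    {
        PQfinish(conn);         /* simplest, always-valid option */
        return;
    }

    /*
     * Drain remaining results; PQgetResult() returns NULL once the
     * current query is fully consumed.
     */
    for (;;)
    {
        PGresult *res = PQgetResult(conn);

        if (res == NULL)
            break;
        PQclear(res);
    }
}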

-- 
marko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-08 Thread Shigeru Hanada
(2012/02/02 18:24), Marko Kreen wrote:
 I think you want this instead:
 
https://commitfest.postgresql.org/action/patch_view?id=769

With modified version of pgsql_fdw which uses row processor to retrieve
result tuples, I found significant performance gain on simple read-only
pgbench, though scale factor was very small (-s 3).  Executed command
was "pgbench -S -n -c 5 -T 30".

Average tps (excluding connections establishing) of 3 times measurements
are:

pgsql_fdw with SQL-cursor:  622
pgsql_fdw with Row Processor : 1219 - 2.0x faster than SQL-cursor
w/o pgsql_fdw(direct access) : 7719 - 6.3x faster than Row Processor

Row processor was almost 2x faster than SQL-cursor!  I'm looking forward
to this feature.

In addition to performance gain, of course memory usage was kept at very
low level. :)

Regards,
-- 
Shigeru Hanada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] xlog location arithmetic

2012-02-08 Thread Fujii Masao
On Wed, Feb 8, 2012 at 2:29 AM, Euler Taveira de Oliveira
eu...@timbira.com wrote:
 On 26-01-2012 06:19, Fujii Masao wrote:

 Thanks for your review. Comments below.

 When I compiled the source with xlogdiff.patch, I got the following warnings.

 xlogfuncs.c:511:2: warning: format '%lX' expects argument of type
 'long unsigned int *', but argument 3 has type 'uint64 *' [-Wformat]

 What is your compiler? I'm using gcc 4.6.2. I refactored the patch so I'm now
 using XLogRecPtr and %X.

gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)

$ uname -a
Linux hermes 3.0.0-15-generic #26-Ubuntu SMP Fri Jan 20 15:59:53 UTC
2012 i686 i686 i386 GNU/Linux

In the updated version of the patch, I got no warnings at the compile time.
But initdb failed because the OID which you assigned to pg_xlog_location_diff()
has already been used for other function. So you need to update pg_proc.h.

 postgres=# SELECT pg_xlog_location_diff('0/274', '0/274');
 ERROR:  xrecoff 274 is out of valid range, 0..A4A534C

 Ugh? I can't reproduce that. It seems to be related to long int used by the
 prior version.

Maybe.

But another problem happened. When I changed pg_proc.h so that the unused
OID was assigned to pg_xlog_location_diff(), and executed the above again,
I encountered the segmentation fault:

LOG:  server process (PID 14384) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: SELECT
pg_xlog_location_diff('0/274', '0/274');
LOG:  terminating any other active server processes

ISTM that the cause is that int8_numeric() is executed for uint32 value. We
should use int4_numeric(), instead?

 While the output was int8 I could use
 pg_size_pretty but now I couldn't. I attached another patch that implements
 pg_size_pretty(numeric).

 I realized that it collides with the pg_size_pretty(int8) if we don't specify
 a type. Hence, I decided to drop the pg_size_pretty(int8) in favor of
 pg_size_pretty(numeric). It is slower than the former but it is not a
 performance critical function.

I'm OK with this.

-DATA(insert OID = 2288 ( pg_size_pretty	PGNSP PGUID 12 1 0 0 0 f f f t f v 1 0 25 20 _null_ _null_ _null_ _null_ pg_size_pretty _null_ _null_ _null_ ));
-DESCR(convert a long int to a human readable text using size units);
+DATA(insert OID = 3158 ( pg_size_pretty	PGNSP PGUID 12 1 0 0 0 f f f t f v 1 0 25 1700 _null_ _null_ _null_ _null_ pg_size_pretty _null_ _null_ _null_ ));
+DESCR(convert a numeric to a human readable text using size units);

Why OID needs to be reassigned?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ecpglib use PQconnectdbParams

2012-02-08 Thread Peter Eisentraut
On mån, 2012-02-06 at 21:11 +0100, Michael Meskes wrote:
 On Fri, Feb 03, 2012 at 01:15:30PM +0100, Christian Ullrich wrote:
  Shouldn't these be [7]? You can have up to 6 items, plus the terminator.
 
 I take it only keywords have to be [7], right? Committed, thanks for spotting 
 this.
 
 There seems to be one more problem that I haven't had time to tackle yet. With
 this patch the connection regression test (make checktcp) doesn't run through
 anymore because one test connection cannot be made. It spits out the error
 message:
 
 FATAL:  invalid command-line arguments for server process 
 HINT:  Try "postgres --help" for more information.
 
 This connection used to work without the patch and besides the error message 
 is
 not really telling in this context.

You can get more information on that with the attached patch, which
might be a useful addition in general.  The problem comes from this:

exec sql connect to unix:postgresql://localhost/connectdb?connect_timeout=14 
user connectuser;

and the updated error message is:

ECPGconnect: could not open database: FATAL:  invalid command-line arguments 
for server process: connect_timeout=14

Because connect_timeout is a separate libpq connection parameter, but
now it's stuck into options.  It might have worked more or less by
accident before.
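
As an illustration of that distinction, here is a sketch (not the ecpg fix
itself; the host value is an assumption, while the database, user and timeout
values are the ones from the failing test) of connect_timeout passed to
PQconnectdbParams() as its own keyword rather than folded into options:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /*
     * connect_timeout is given as a first-class keyword, parallel to
     * host/dbname/user, rather than appended to the generic "options"
     * string that is handed to the server process.
     */
    const char *keywords[] = {"host", "dbname", "user",
                              "connect_timeout", NULL};
    const char *values[]   = {"localhost", "connectdb", "connectuser",
                              "14", NULL};

    PGconn *conn = PQconnectdbParams(keywords, values, 0);

    if (PQstatus(conn) != CONNECTION_OK)
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
    else
        printf("connected\n");

    PQfinish(conn);
    return 0;
}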

It's not clear to me why this only appears on checktcp.  And why we
don't run those tests by default.  That should be clarified, because
there are also other failures in that test that show that it hasn't been
run in a while.

diff --git i/src/backend/tcop/postgres.c w/src/backend/tcop/postgres.c
index 49a3969..d5bd73f 100644
--- i/src/backend/tcop/postgres.c
+++ w/src/backend/tcop/postgres.c
@@ -3375,7 +3375,7 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx)
 		if (IsUnderPostmaster)
 			ereport(FATAL,
 	(errcode(ERRCODE_SYNTAX_ERROR),
-				 errmsg("invalid command-line arguments for server process"),
+				 errmsg("invalid command-line arguments for server process: %s", argv[optind]),
 			  errhint("Try \"%s --help\" for more information.", progname)));
 		else
 			ereport(FATAL,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Can PQstatus() be used by Application to check connection to postgres periodically?

2012-02-08 Thread Robert Haas
2012/2/8 Pavel Golub pa...@microolap.com:
 Hello, sujayr06.

 You wrote:

 s Hello All,

 s            My application has to do a real time data upload to PostgreSQL
 s server.

 s            Every time i have to do a real time upload, i do not wish to 
 open
 s new connection.
 s            I want to open a connection once [when my application comes up]
 s and periodically check if the connection is active.
 s            Can PQstatus() be used by application to check the status of the
 s connection already established?

 s            If PQstatus() cannot be used, does PostgreSQL provide alternate
 s interface to check the status of the                     connection.

 s           Note : I do not wish to open connection on every real time upload
 s as its an overhead for the application.

 s            Appreciate any help!

 You may use PQtransactionStatus for this purpose
 (http://www.postgresql.org/docs/9.1/static/libpq-status.html)

Only if you don't care about getting the right answer.  Neither
PQstatus() nor PQtransactionStatus() make any attempt to determine
whether the server is still there; they just return the cached status
of the connection, without any system calls whatsoever.
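
A minimal sketch of what a real liveness check has to do instead: issue a
trivial query and, if that fails, fall back to PQreset().  The function name,
conninfo string, and return-code convention here are illustrative assumptions,
not an established API:

#include <stdio.h>
#include <libpq-fe.h>

/*
 * Sketch: returns 1 if the server answered a trivial query, 0 if the
 * connection had to be (and could be) re-established, -1 if reconnection
 * failed.  PQstatus() alone would just report the cached state without
 * touching the socket.
 */
static int
connection_alive(PGconn *conn)
{
    PGresult *res = PQexec(conn, "SELECT 1");
    int       ok = (res != NULL &&
                    PQresultStatus(res) == PGRES_TUPLES_OK);

    PQclear(res);
    if (ok)
        return 1;

    /* The query failed; try to re-establish the same connection. */
    PQreset(conn);
    return (PQstatus(conn) == CONNECTION_OK) ? 0 : -1;
}

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=postgres");   /* illustrative conninfo */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    printf("alive check: %d\n", connection_alive(conn));
    PQfinish(conn);
    return 0;
}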

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier

2012-02-08 Thread Robert Haas
On Wed, Feb 8, 2012 at 4:53 AM, Marti Raudsepp ma...@juffo.org wrote:
 Patch attached (in git am format). Passes all regression tests (except
 'json' which fails on my machine even on git master).

Can you provide the diffs from the json test on your machine?  I don't
see any build-farm failures off-hand...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Peter Geoghegan
It doesn't necessarily matter if we increase the size of the postgres
binary by 10%, precisely because most of that is not going to be in
play from one instant to the next. I'm thinking, in particular, of
btree index specialisations, where it could make perfect sense to go
crazy. You cannot have a reasonable discussion about such costs
without considering that they will perhaps never be paid, given any
plausible workload. That's why the percentage that the postgres binary
size has been shown to increase by really isn't pertinent at all. At
best, it's a weak proxy for such costs, assuming you don't have a
demonscene-like preoccupation with reducing binary size, and I don't
believe that we should.

It would be difficult for me to measure such things objectively, but
I'd speculate that the proprietary databases have much larger binaries
than ours, while having far fewer features, precisely because they
started applying tricks like this a long time ago. You could counter
that their code bases probably look terrible, and you'd have a point,
but so do I.

On 8 February 2012 02:38, Robert Haas robertmh...@gmail.com wrote:
 I've been skeptical of this patch for a number of reasons, the major
 one of which is that most workloads spend only a very small amount of
 time in doing qucksorts, and most quicksorts are of very small amounts
 of data and therefore fast anyway.   It is easy to construct an
 artificial test case that does lots and lots of in-memory sorting, but
 in real life I think that's not the great part of what people use
 databases for.

Fair enough, but if that's true, then it's also true that the cost due
to cache marginalisation - the only cost that I think is worth
considering at all - is correspondingly a small fraction of that very
small amount of sorting.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-08 Thread Noah Misch
On Wed, Feb 08, 2012 at 11:40:34AM +, Simon Riggs wrote:
 On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch n...@leadboat.com wrote:
  On Tue, Feb 07, 2012 at 08:58:59PM +, Simon Riggs wrote:
  On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote:
   This patch uses FPIs to guard against torn hint writes, even when the
   checksums are disabled.  One could not simply condition them on the
   page_checksums setting, though.  Suppose page_checksums=off and we're hinting
   a page previously written with page_checksums=on.  If that write tears,
   leaving the checksum intact, that block will now fail verification.  A couple
   of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
   old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
   Whenever the cluster starts with checksums disabled, set the field to
   InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
   field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
   failure occurs in a page with LSN older than the stored one, emit either a
   softer warning or no message at all.
 
  We can only change page_checksums at restart (now) so the above seems moot.
 
  If we crash with FPWs enabled we repair any torn pages.
 
  There's no live bug, but that comes at a high cost: the patch has us emit
  full-page images for hint bit writes regardless of the page_checksums 
  setting.
 
 Sorry, I don't understand what you mean. I don't see any failure cases
 that require that.
 
 page_checksums can only change at a shutdown checkpoint,
 
 The statement "If that write tears, leaving the checksum intact, that
 block will now fail verification" cannot happen, ISTM.
 
 If we write out a block we update the checksum if page_checksums is
 set, or we clear it. If we experience a torn page at crash, the FPI
 corrects that, so the checksum never does fail verification. We only
 need to write a FPI when we write with checksums.
 
 If that's wrong, please explain a failure case in detail.

In the following description, I will treat a heap page as having two facts.
The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
A torn page write lacking the protection of a full-page image can update one
or both facts despite the transaction having logically updated both.  (This is
just the normal torn-page scenario.)  Given that, consider the following
sequence of events:

1. startup with page_checksums = on
2. update page P with facts CHKSUM, !HINT
3. clean write of P
4. clean shutdown

5. startup with page_checksums = off
6. update P with facts !CHKSUM, HINT.  no WAL since page_checksums=off
7. begin write of P
8. crash.  torn write of P only updates HINT.  disk state now CHKSUM, HINT

9. startup with page_checksums = off
10. crash recovery.  does not touch P, because step 6 generated no WAL
11. clean shutdown

12. startup with page_checksums = on
13. read P.  checksum failure

Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
because step 6 _does_ write WAL.  That ought to change for the sake of
page_checksums=off performance, at which point we must protect the above
scenario by other means.

   PageSetLSN() is not atomic, so the shared buffer content lock we'll be 
   holding
   is insufficient.
 
  Am serialising this by only writing PageLSN while holding buf hdr lock.
 
   That means also taking the buffer header spinlock in every PageGetLSN()
   caller holding only a shared buffer content lock.  Do you think that will
   pay off, versus the settled pattern of trading here your shared buffer
   content lock for an exclusive one?
 
 Yes, I think it will pay off. This is the only code location where we
 do that, and we are already taking the buffer header lock, so there is
 effectively no overhead.

The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
(via BufferGetLSN()) and XLogCheckBuffer().  We don't take the buffer header
spinlock at either of those locations.  If they remain safe under the new
rules, we'll need comments explaining why.  I think they may indeed be safe,
but it's far from obvious.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add protransform for numeric, varbit, and temporal types

2012-02-08 Thread Robert Haas
On Tue, Feb 7, 2012 at 12:43 PM, Robert Haas robertmh...@gmail.com wrote:
 I've committed the numeric and varbit patches and will look at the
 temporal one next.

Committed, after changing the OIDs so they don't conflict.

Here's one more case for you to ponder:

rhaas=# create table noah (i interval day);
CREATE TABLE
rhaas=# alter table noah alter column i set data type interval second(3);
DEBUG:  rewriting table noah
ALTER TABLE

Do we really need a rewrite in that case?  The code acts like the
interval range and precision are separate beasts, but is that really
true?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum rate limit in KBps

2012-02-08 Thread Alvaro Herrera

Excerpts from Bruce Momjian's message of mié feb 08 00:58:58 -0300 2012:

 As much as I hate to poo-poo a patch addition, I have to agree with
 Robert Haas on this one.  Renaming settings really isn't moving us
 forward.  It introduces a migration problem and really doesn't move us
 forward in solving the underlying problem.  Additional monitoring, while
 helpful, also is only a stop-gap.

I think that (part of) the underlying problem is that we have no clear
way to specify "how much I/O do you want autovacuum to use".  That's
what this patch is all about, AFAIU; it has nothing to do with
monitoring.  Right now, as has been said, the only way to tweak this is
to change vacuum_cost_delay; the problem with that setting is that
making the calculation is not straightforward.
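
For reference, the calculation in question, assuming the defaults of this era
(vacuum_cost_limit = 200, autovacuum_vacuum_cost_delay = 20 ms,
vacuum_cost_page_dirty = 20, 8 kB pages) and ignoring hit/miss charges, gives
the rough ceiling on write throughput:

$$
\frac{\text{vacuum\_cost\_limit}}{\text{vacuum\_cost\_page\_dirty}} \times \frac{8\,\text{kB}}{20\,\text{ms}}
= 10 \times 400\,\text{kB/s} \approx 3.9\,\text{MB/s}
$$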

(Now, I disagree that it's so complex that it cannot ever be explained to
a class; or that it's so obscure that the only way to make it work is to
leave it alone and never touch it.  It's complex, okay, but it's not
exactly rocket science either.)

If the only real downside to this patch is that some people have already
changed vacuum_cost_delay and they will want to migrate those settings
forward, maybe we shouldn't be looking at _replacing_ that one with a
new setting, but rather just add the new setting; and in the code for
each, make sure that only one of them is set, and throw an error if the
other one is.


 On Thu, Jan 19, 2012 at 09:42:52PM -0500, Robert Haas wrote:

  Another problem is that the vacuum algorithm itself could, I think, be
  made much smarter.  We could teach HOT to prune pages that contain no
  HOT chains but do contain dead tuples.  That would leave dead line
  pointers behind, but that's not nearly as bad as leaving the entire
  tuple behind.  We could, as Simon and others have suggested, have one
  threshold for vacuuming the heap (i.e. reclaiming dead tuples) and
  another for vacuuming the indexes (i.e. reclaiming dead line
  pointers).  That would open the door to partial vacuuming: just vacuum
  half a gigabyte or so of the heap, and then move on; the next vacuum
  can pick up where that one left off, at least up to the point where we
  decide we need to make an index pass; it would possibly also allow us
  to permit more than one vacuum on the same table at the same time,
  which is probably needed for very large tables.  We could have
  backends that see dead tuples on a page throw them over to the fence
  to the background writer for immediate pruning.  I blather, but I
  guess my point is that I really hope we're going to do something
  deeper here at some point in the near future, whatever becomes of the
  proposals now on the table.

This is all fine, but what does it have to do with the current patch?  I
mean, if we change vacuum to do some stuff differently, it's still going
to have to read and dirty pages and thus account for I/O.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 It doesn't necessarily matter if we increase the size of the postgres
 binary by 10%, precisely because most of that is not going to be in
 play from one instant to the next.

You've heard of swapping, no?  Basically what I'm hearing from you is
total denial that binary bloat costs anything, and that just does not
square with reality.  Even if the costs in performance terms are
negligible in many common situations (something that you've asserted
but without offering any proof), there are other costs; software
distribution for instance is definitely sensitive to size.  As a Red Hat
person I've had to spend time fighting to keep Postgres included in the
minimum "does it fit on a DVD" distribution of RHEL/Fedora.  That case
gets harder to make every year, and yet it's pretty critical to the
success of the project --- if you don't get distributed, you lose users.

IMO this patch is already well past the point of diminishing returns in
value-per-byte-added.  I'd like to see it trimmed back to provide a fast
path for just single-column int4/int8/float4/float8 sorts.  The other
cases aren't going to offer enough of a win to justify the code space.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum rate limit in KBps

2012-02-08 Thread Robert Haas
On Wed, Feb 8, 2012 at 9:38 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 I think that (part of) the underlying problem is that we have no clear
 way to specify how much I/O do you want autovacuum to use.  That's
 what this patch is all about, AFAIU; it has nothing to do with
 monitoring.  Right now, as has been said, the only way to tweak this is
 to change vacuum_cost_delay; the problem with that setting is that
 making the calculation is not straightforward.

I agree.

 (Now, I disagree that it's so complex that it cannot ever be explain to
 a class; or that it's so obscure that the only way to make it work is to
 leave it alone and never touch it.  It's complex, okay, but it's not
 exactly rocket science either.)

I emphatically agree.

 If the only real downside to this patch is that some people have already
 changed vacuum_cost_delay and they will want to migrate those settings
 forward, maybe we shouldn't be looking at _replacing_ that one with a
 new setting, but rather just add the new setting; and in the code for
 each, make sure that only one of them is set, and throw an error if the
 other one is.

I think that part of the confusion here is that the current settings
aren't strictly trying to regulate vacuum's I/O rate so much as the
total amount of work it can do per unit time.  Even if autovac dirties
no pages and has no misses, the pages it hits will eventually fill up
the cost limit and it will sleep; but the work in that case is all CPU
utilization, not I/O.  Similarly, we don't have separate control knobs
for read I/O (misses) and write I/O (dirty pages).  The sleeps happen
due to a blend of factors (CPU, read I/O, write I/O) which are
combined to produce an estimate of the total impact of vacuum on the
rest of the system.

Now, an argument could be made that we ought not to care about that.
We could say: trying to use this blended algorithm to figure out when
vacuum is getting too expensive is a loser.  We just want to limit the
amount of dirty data that's being generated, because that's the only
thing that matters to us.  In that case, I could see adding a new
setting, sitting alongside the existing settings: amount of dirty data
that can be generated per second.  But that's not what Greg is
proposing.  He is proposing, essentially, that we keep the blended
algorithm, but then stack another calculation on top that backs into
the blended cost limit based on the user's tolerance for dirtying
data.  So there will still be limits on the amount of read I/O and CPU
usage, but they'll be derived from the allowable rate of data dirtying
and the ratios between the different cost parameters.  The current
settings aren't exactly intuitive, but I really don't want to have to
explain to a customer that setting the dirty data limit to 8MB/s will
actually limit it to just 5.3MB/s if all the pages are not in shared
buffers (because the page miss costs will account for a third of the
budget) and to 7.8MB/s if all the pages are in shared buffers (because
the page miss costs will account for a twenty-first of the budget) and
somewhere in between if only some of the pages are resident; and that,
further, by setting the dirty data rate to 8MB/s, they've implicitly
set the max read rate from disk at 16MB/s, again because of the 2:1
dirty:miss cost ratio.  Yikes!
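
To spell out the first of those figures (the second follows the same way):
with the default weights of 20 for a dirtied page and 10 for a miss, a page
that is both read and dirtied charges 30 units, of which dirtying is two
thirds, so

$$
8\,\text{MB/s} \times \frac{20}{20 + 10} \approx 5.3\,\text{MB/s}.
$$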

Honestly, I think the best place for this work is in something like
pgtune.  It's totally useful to have a calculator for this stuff (for
many of the same reasons that it's useful to have explain.depesz.com)
but the abstraction being proposed is leaky enough that I think it's
bound to cause confusion.

 This is all fine, but what does it have to do with the current patch?  I
 mean, if we change vacuum to do some stuff differently, it's still going
 to have to read and dirty pages and thus account for I/O.

Yeah, I drifted off topic there a bit.  I think the only relevant
point in all that is that even if we all agreed that this is an
improvement, I'd be reluctant to slap a band-aid on something that I
think needs surgery.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Robert Haas
On Wed, Feb 8, 2012 at 8:33 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 It doesn't necessarily matter if we increase the size of the postgres
 binary by 10%, precisely because most of that is not going to be in
 play from one instant to the next.

As Tom says, that doesn't jive with my experience.  If you add on
enough binary bloat, you will have more page faults.  It's true (as I
think you pointed out upthread) that packing all the copies of
quicksort into the binary one after the other minimizes the effect on
other things, but it doesn't eliminate it entirely.  If you've got
this:

<random other stuff> ... <a zillion copies of quicksort> <more other stuff>

...then a branch from the "random other stuff" section of the binary
to the "more other stuff" section of the binary may cost more.  For
example, suppose the OS does X amount of readahead.  By stuffing all
those copies of quicksort into the middle there, you increase the
chances that the page you need was beyond the readahead window.  Or,
if it wasn't going to be in the readahead window either way, then you
increase the distance that the disk head needs to move to find the
required block.

These costs are very small, no question about it.  They are almost
impossible to measure individually, in much the same way that the cost
of pollution or greenhouse gas emissions is difficult to measure.  But
it's an error to assume that because the costs are individually small
that they will never add up to anything material.  As long as you keep
insisting on that, it's hard to have a rational conversation.  We can
reasonably debate the magnitude of the costs, but to assert that they
don't exist gets us nowhere.  Suppose we could get a 10% speedup on
sin(numeric) by adding 40GB to the binary size.  Would you be in favor
of that?  Do you think it would hurt performance on any other
workloads?  Would our packagers complain at all?  Surely your same
argument would apply to that case in spades: anyone who is not using
the gigantic hard-coded lookup table will not pay any portion of the
cost of it.

 It would be difficult for me to measure such things objectively, but
 I'd speculate that the proprietary databases have much larger binaries
 than ours, while having far fewer features, precisely because they
 started applying tricks like this a long time ago. You could counter
 that their code bases probably look terrible, and you'd have a point,
 but so do I.

That might be true; I have no idea.  There are probably lots of
reasons why their code bases look terrible, including a long history
of backward compatibility with now-defunct versions, a variety of
commercial pressures, and the fact that they don't have to take flak
in the public square for what their code looks like.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Robert Haas
On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 IMO this patch is already well past the point of diminishing returns in
 value-per-byte-added.  I'd like to see it trimmed back to provide a fast
 path for just single-column int4/int8/float4/float8 sorts.  The other
 cases aren't going to offer enough of a win to justify the code space.

I'm curious about how much we're gaining from the single-column
specializations vs. the type-specific specializations.  I think I'm
going to go try to characterize that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ecpglib use PQconnectdbParams

2012-02-08 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
   if (IsUnderPostmaster)
   ereport(FATAL,
   (errcode(ERRCODE_SYNTAX_ERROR),
 -  errmsg("invalid command-line arguments for server process"),
 +  errmsg("invalid command-line arguments for server process: %s", argv[optind]),
    errhint("Try \"%s --help\" for more information.", progname)));
   else
   ereport(FATAL,

+1 for that change, but please s/arguments/argument/ in the text.  Also,
what about the other branch of the if that's just out of sight here?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Bruce Momjian
On Wed, Feb 08, 2012 at 01:33:30PM +, Peter Geoghegan wrote:
 It doesn't necessarily matter if we increase the size of the postgres
 binary by 10%, precisely because most of that is not going to be in
 play from one instant to the next. I'm thinking, in particular, of
 btree index specialisations, where it could make perfect sense to go
 crazy. You cannot have a reasonable discussion about such costs
 without considering that they will perhaps never be paid, given any
 plausible workload. That's why the percentage that the postgres binary
 size has been shown to increase by really isn't pertinent at all. At
 best, it's a weak proxy for such costs, assuming you don't have a
 demonscene-like preoccupation with reducing binary size, and I don't
 believe that we should.

When you start a binary, your executable is mapped to a file system
binary, and you page-fault in the pages you need.  Now, if your
optimization was alone in its own 4k (x86) virtual pages, and you never
called the functions, odds are you would not pay a penalty, aside from
distribution penalty, and perhaps a small penalty if useful code was
before and after your block.

The sort code expansion, however, is done in-line, in the middle of the
sort code; you are clearly filling in 64-byte (x86) CPU cache lines
with type-specific expansion code for every sort case, whether we use
the code or not.  Now, I don't think it is a big problem, and I think
the speedup is worth it for common data types, but we can't say the cost
is zero.  

Saying it another way, having a binary in your file system that you
never call is not overhead except for storage, but in this case, the
sort code expansion is inside critical functions we are already calling.

Frankly, it is this cost that has kept us away from using such
optimizations for a long time.  I recently posted that the zero-cost
optimizations are mostly completed for sort and COPY, and we have to
start considering non-zero-cost optimizations --- sad, but true.
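
For readers following along, here is a toy standalone illustration (mine, not
the patch's code; insertion sort stands in for the real quicksort) of what a
type-specific specialization trades: the generic version pays an indirect call
per comparison, while the specialized version inlines the comparison but adds
another copy of the sort loop to the binary:

#include <stdio.h>
#include <stddef.h>

/* generic: comparator supplied at run time, like qsort(3) */
static void
sort_generic(int *a, size_t n, int (*cmp) (int, int))
{
    for (size_t i = 1; i < n; i++)
    {
        int     key = a[i];
        size_t  j = i;

        while (j > 0 && cmp(a[j - 1], key) > 0)
        {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

/* specialized: the int comparison is hard-wired and can be inlined */
static void
sort_int_specialized(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        int     key = a[i];
        size_t  j = i;

        while (j > 0 && a[j - 1] > key)
        {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

static int
cmp_int(int x, int y)
{
    return (x < y) ? -1 : (x > y) ? 1 : 0;
}

int
main(void)
{
    int a[] = {5, 3, 9, 1, 4};
    int b[] = {5, 3, 9, 1, 4};

    sort_generic(a, 5, cmp_int);
    sort_int_specialized(b, 5);
    for (int i = 0; i < 5; i++)
        printf("%d %d\n", a[i], b[i]);
    return 0;
}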

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Bruce Momjian
On Wed, Feb 08, 2012 at 10:17:36AM -0500, Robert Haas wrote:
 On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  IMO this patch is already well past the point of diminishing returns in
  value-per-byte-added.  I'd like to see it trimmed back to provide a fast
  path for just single-column int4/int8/float4/float8 sorts.  The other
  cases aren't going to offer enough of a win to justify the code space.
 
 I'm curious about how much we're gaining from the single-column
 specializations vs. the type-specific specializations.  I think I'm
 going to go try to characterize that.

Yes, please.  That would be a big help.   Is there no optimization for
strings?  I assume they are sorted a lot.  

We can alway add more data types later.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgstat documentation tables

2012-02-08 Thread Bruce Momjian
On Tue, Feb 07, 2012 at 11:29:12PM -0500, Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  How do people feel about pulling text out of the SGML docs and loading
  it into the database as table and column comments?
 
 I'm not thrilled by that proposal.  The length limits on comments are
 very much shorter than what is sensible to use in catalogs.sgml (at
 least if you want the comments to look passable in \d displays).  And
 I don't want people trying to serve two different use-cases when they
 write those.

I was thinking of just grabbing the first X characters from the SGML
comments.

 Perhaps it'd work to pull column comments from the C header files
 --- any same-line comment in catalog/pg_*.h is probably short enough to
 serve this purpose usefully.

Yes, that would be simpler.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier

2012-02-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Feb 8, 2012 at 4:53 AM, Marti Raudsepp ma...@juffo.org wrote:
 Patch attached (in git am format). Passes all regression tests (except
 'json' which fails on my machine even on git master).

 Can you provide the diffs from the json test on your machine?  I don't
 see any build-farm failures off-hand...

I'm seeing diffs too after applying Marti's patch: instead of "z", "b",
etc, the field labels in the json values look like "f1", "f2", etc in
the output of queries such as

SELECT row_to_json(q) 
FROM (SELECT $$a$$ || x AS b, 
 y AS c, 
 ARRAY[ROW(x.*,ARRAY[1,2,3]),
   ROW(y.*,ARRAY[4,5,6])] AS z 
  FROM generate_series(1,2) x, 
   generate_series(4,5) y) q;

I believe what is happening is that now that the planner can flatten the
sub-select, the RowExprs are getting expanded differently, and that ties
into the "when do we lose column names" business that Andrew has been
on about.

However, I was not seeing that before applying the patch, so maybe Marti
has another issue too.

I am going to go ahead and commit the patch with the altered json
results, because IMO it is mere accident that these regression cases
were coming out with nice field labels anyway.  When and if Andrew
gets the RowExpr cases fixed properly, that will show up as these cases
going back to nicer-looking output.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum rate limit in KBps

2012-02-08 Thread Bruce Momjian
On Wed, Feb 08, 2012 at 09:56:17AM -0500, Robert Haas wrote:
  This is all fine, but what does it have to do with the current patch?  I
  mean, if we change vacuum to do some stuff differently, it's still going
  to have to read and dirty pages and thus account for I/O.
 
 Yeah, I drifted off topic there a bit.  I think the only relevant
 point in all that is that even if we all agreed that this is an
 improvement, I'd be reluctant to slap a band-aid on something that I
 think needs surgery.

Agreed.  I see this only as short-term fix that will be changed in the
next release if we really decide to tackle the problem.  What percentage
of our userbase are going to ever adjust these settings --- 1%, for
sure.

What we have now just isn't cutting it for 99% of our users, and we need
to address that if we are going to ever make any real headway here.

Why can't vacuum handle things automatically like checkpoint smoothing? 
Why can't it detect when it is falling behind and speed up?  Why can't
it see a busy background writer and slow down?   Unless we answer
these questions, we are not solving the problem for 99% of our users.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 Yes, please.  That would be a big help.   Is there no optimization for
 strings?  I assume they are sorted a lot.  

It seems unlikely that it'd be worth including strings, especially if
your locale is not C.  This whole thing only makes sense for datatypes
that are comparable in approximately 1 machine instruction.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Bruce Momjian
On Wed, Feb 08, 2012 at 11:35:46AM -0500, Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  Yes, please.  That would be a big help.   Is there no optimization for
  strings?  I assume they are sorted a lot.  
 
 It seems unlikely that it'd be worth including strings, especially if
 your locale is not C.  This whole thing only makes sense for datatypes
 that are comparable in approximately 1 machine instruction.

Ah, OK, interesting.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier

2012-02-08 Thread Marti Raudsepp
On Wed, Feb 8, 2012 at 18:20, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Feb 8, 2012 at 4:53 AM, Marti Raudsepp ma...@juffo.org wrote:
 Patch attached (in git am format). Passes all regression tests (except
 'json' which fails on my machine even on git master).

 Can you provide the diffs from the json test on your machine?  I don't
 see any build-farm failures off-hand...

 I'm seeing diffs too after applying Marti's patch: instead of z, b,
 etc, the field labels in the json values look like f1, f2, etc in
 the output of queries such as

Sorry, my bad. I guess I got the two versions (patched and unpatched)
mixed up. Indeed this difference disappears when I remove my patch.
I'll have to be more careful when checking regression diffs in the
future. :)

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Peter Geoghegan
On 8 February 2012 15:17, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 IMO this patch is already well past the point of diminishing returns in
 value-per-byte-added.  I'd like to see it trimmed back to provide a fast
 path for just single-column int4/int8/float4/float8 sorts.  The other
 cases aren't going to offer enough of a win to justify the code space.

 I'm curious about how much we're gaining from the single-column
 specializations vs. the type-specific specializations.  I think I'm
 going to go try to characterize that.

I think it might make more sense to lose the type-specific
specialisations for the multi-key case while adding a generic
multi-key specialisation, than to lose all multi-key specialisations,
though I have not considered that question at length, and would think
that we'd still want to keep an int4 version in that case. Note that I
*did not* include a generic multi-key specialisation, though only
because I saw little point, having already covered by far the most
common cases.

While you're at it, I'd like to suggest that you perform a benchmark
on a multi-key specialisation, so we can see just what we're throwing
away before we do so. Better to have those numbers come from you.

I continue to maintain that the most appropriate course of action is
to provisionally commit all specialisations. If it's hard to know what
effect this is going to have on real workloads, let's defer to beta
testers, who presumably try the new release out with their
application. It's a question you could squarely put to them, without
gradually rolling back from that initial position being much of a
problem.

The mysql-server package is 45 MB on Fedora 16. That 1% of Postgres
binary figure is for my earlier patch with btree specialisations,
right? I'm not asking you to look at that right now. I also don't
think that "where do we eventually draw the line with specialisations
like this in Postgres generally?" is a question that you should expect
me to answer, though I will say that we should look at each case on
its merits.

I have not totally denied binary bloat costs. I have attempted to
quantify them, while acknowledging that such a task is difficult, as
was evident from the fact that Robert wasn't surprised that I could
not demonstrate any regression. Granted, my definition of a regression
is that there is very clearly no net loss in performance at some
reasonable granularity, which is a very practical definition. You can
quite easily contrive a case that HOT handles really badly. Some
people did, I believe, but HOT won out because it was clearly very
useful in the real world.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier

2012-02-08 Thread Andrew Dunstan



On 02/08/2012 11:20 AM, Tom Lane wrote:
I am going to go ahead and commit the patch with the altered json 
results, because IMO it is mere accident that these regression cases 
were coming out with nice field labels anyway. When and if Andrew 
gets the RowExpr cases fixed properly, that will show up as these 
cases going back to nicer-looking output.


I take it this is your way of making me hurry up with that work :-)

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Enable min/max optimization for bool_and/bool_or/every

2012-02-08 Thread Tom Lane
Marti Raudsepp ma...@juffo.org writes:
 On Thu, Dec 22, 2011 at 18:41, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Dec 19, 2011 at 5:16 AM, Marti Raudsepp ma...@juffo.org wrote:
 PS: It seems that the min/max optimization isn't documented in the
 manual (apart from release notes), so I didn't include any doc changes
 in this patch.

 I don't see a patch attached to this email, so either you forgot to
 attach it, or the list ate it somehow.

 I forgot to attach it, sorry. Here it is.

I applied this patch, since I was busy applying catalog changes from you
anyway ;-).

I did think of a possible reason to reject the patch: with this change,
the planner will take longer to plan queries involving bool_and() et al,
since planagg.c will spend time looking (usually fruitlessly) for an
index-based plan.  I tried this simple test case:

create table t (f1 bool);
\timing
explain select bool_and(f1) from t;

Best-case timings for the EXPLAIN were about 0.480 ms without the patch
and 0.500 ms with, so about a 4% penalty.  On more complicated queries
I think the fractional cost would be less.  This seemed acceptable to
me, so I went ahead and applied the change, but if anyone wants to
argue about it now's the time.
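
(For context, the optimization treats bool_and() like MIN and bool_or()
like MAX over booleans, since false sorts before true: given a suitable
index on f1, the planner can use a plan roughly equivalent to
SELECT f1 FROM t WHERE f1 IS NOT NULL ORDER BY f1 LIMIT 1.  Without such
an index, the search planagg.c performs is exactly the wasted effort
being measured here.)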

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier

2012-02-08 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 02/08/2012 11:20 AM, Tom Lane wrote:
 I am going to go ahead and commit the patch with the altered json 
 results, because IMO it is mere accident that these regression cases 
 were coming out with nice field labels anyway. When and if Andrew 
 gets the RowExpr cases fixed properly, that will show up as these 
 cases going back to nicer-looking output.

 I take it this is your way of making me hurry up with that work :-)

Well, right now the behavior is more consistent than it was before;
we would surely have gotten questions about it as it was.  Whether
you find time to improve it or not isn't my concern at the moment.

BTW, the order of the items in the json output doesn't match the column
order of the sub-select, with or without the patch.  This seems ... odd.
Is it intentional?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Robert Haas
On Wed, Feb 8, 2012 at 10:59 AM, Bruce Momjian br...@momjian.us wrote:
 On Wed, Feb 08, 2012 at 10:17:36AM -0500, Robert Haas wrote:
 On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  IMO this patch is already well past the point of diminishing returns in
  value-per-byte-added.  I'd like to see it trimmed back to provide a fast
  path for just single-column int4/int8/float4/float8 sorts.  The other
  cases aren't going to offer enough of a win to justify the code space.

 I'm curious about how much we're gaining from the single-column
 specializations vs. the type-specific specializations.  I think I'm
 going to go try to characterize that.

 Yes, please.  That would be a big help.   Is there no optimization for
 strings?  I assume they are sorted a lot.

They are, but the value of the optimization is expected to diminish
for more complex comparators.

I did a quick test of this using the same setup I mentioned upthread:

create table nodups (g) as select (g%10)*1000+g/10 from
generate_series(1,1) g;

This was the actual test query:

select * from nodups order by g offset 10001;

I did a ten-second warmup after starting each postmaster, followed
by three one-minute test runs:

master:
tps = 298.824321 (including connections establishing)
tps = 301.741619 (including connections establishing)
tps = 302.238016 (including connections establishing)

Peter's patch, but with the type-specific optimizations disabled, so
using just the single-sortkey optimizations:
tps = 357.553690 (including connections establishing)
tps = 358.765875 (including connections establishing)
tps = 358.825051 (including connections establishing)

Peter's patch, intact:
tps = 394.566994 (including connections establishing)
tps = 394.228667 (including connections establishing)
tps = 394.492677 (including connections establishing)

So that's about a 19% speedup for just the single sort-key
optimizations, and about a 30% speedup in total.  Compared to a
baseline that includes the single sort-key optimizations, the
additional type-specific optimization is buying about 10% on this
test.

I tried changing the test query to return the data, like this:

select * from nodups order by g;

master:
tps = 181.301129 (including connections establishing)
tps = 180.348333 (excluding connections establishing)
tps = 179.600825 (including connections establishing)

Peter's patch, but with the type-specific optimizations disabled, so
using just the single-sortkey optimizations:
tps = 201.728269 (including connections establishing)
tps = 198.022527 (including connections establishing)
tps = 199.663082 (including connections establishing)

Peter's patch, intact:
tps = 214.037387 (including connections establishing)
tps = 212.895145 (including connections establishing)
tps = 211.596558 (including connections establishing)

So, with the overhead of actually returning the sorted data, we get
10% from the generic single-sortkey optimizations and 18% from the two
combined.  Compared to a baseline that includes the single-sortkey
optimizations, the incremental improvement from adding the
type-specific optimizations is about 6.6%.  It would be less on a
wider table, presumably.

I also tested a two-column sort:

create table twocol (a, b) as select g%101, g%17 from
generate_series(1,1) g;
select * from twocol order by a, b offset 1;

master:
tps = 270.218604 (including connections establishing)
tps = 265.634964 (including connections establishing)
tps = 268.520805 (including connections establishing)

Peter's patch, intact:
tps = 285.528626 (including connections establishing)
tps = 285.140345 (including connections establishing)
tps = 282.388457 (including connections establishing)

So here the type-specific optimizations are buying us even less: only
about 6%, and that's with no client overhead.

And I tested float8:

create table f8 (g) as select random() from generate_series(1,1);
select * from f8 order by g offset 1;

master:
tps = 228.373239 (including connections establishing)
tps = 225.117704 (including connections establishing)
tps = 225.196197 (including connections establishing)

Peter's patch, but with the type-specific optimizations disabled, so
using just the single-sortkey optimizations:
tps = 263.060820 (including connections establishing)
tps = 262.788926 (including connections establishing)
tps = 263.041186 (including connections establishing)

Peter's patch, intact:
tps = 283.274797 (including connections establishing)
tps = 280.597965 (including connections establishing)
tps = 280.151349 (including connections establishing)

That's +17% for the single-sortkey stuff, +25% for everything, and the
incremental improvement of the type-specific optimizations over the
generic single-key optimizations is about 6.7%.

It seems clear that the single sort-key optimizations are a much
better value per byte of code than the type-specific optimizations.
Ignoring client overhead, we get between half and two-thirds of the
benefit, and it costs us just one extra copy of the quicksort logic
for all data types.

Re: [HACKERS] Vacuum rate limit in KBps

2012-02-08 Thread Robert Haas
On Wed, Feb 8, 2012 at 11:22 AM, Bruce Momjian br...@momjian.us wrote:
 What we have now just isn't cutting it for 99% of our users, and we need
 to address that if we are going to ever make any real headway here.

 Why can't vacuum handle things automatically like checkpoint smoothing?
 Why can't it detect when it is falling behind and speed up?  Why can't
 it see a busy background writer and slow down?   Unless we answer
 these questions, we are not solving the problem for 99% of our users.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 [ lots of numbers ]

 ... I just can't get excited about that.  However, I
 find the single-key optimizations much more compelling, for the
 reasons stated above, and feel we ought to include those.

This conclusion seems sound to me, for the reasons you stated and one
more: optimizations for a single sort key are going to be applicable
to a very wide variety of queries, whereas all the other cases are
necessarily less widely applicable.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for parallel pg_dump

2012-02-08 Thread Robert Haas
On Tue, Feb 7, 2012 at 10:21 PM, Joachim Wieland j...@mcknight.de wrote:
 On Tue, Feb 7, 2012 at 4:59 PM, Robert Haas robertmh...@gmail.com wrote:
 It turns out that (as you anticipated) there are some problems with my
 previous proposal.

 I assume you're talking to Tom, as my powers of anticipation are
 actually quite limited... :-)

No, I was talking to you, actually.  You earlier suggested that an
ArchiveHandle was only meant to contain a single PGconn, and it seems
you're right.  We can refactor that assumption away, but not
trivially.

 In my patch I dealt with exactly the same problem for the error
 handler by creating a separate function that has a static variable (a
 pointer to the ParallelState). The value is set and retrieved through
 the same function, so yes, it's kinda global but then again it can
 only be accessed from this function, which is only called from the
 error handler.

How did you make this thread-safe?

 I'm starting to think it might make sense to press on with this
 refactoring just a bit further and eliminate the distinction between
 Archive and ArchiveHandle.

 How about doing more refactoring after applying the patch, you'd then
 see what is really needed and then we'd also have an actual use case
 for more than one connection (You might have already guessed that this
 proposal is heavily influenced by my self-interest of avoiding too
 much work to make my patch match your refactoring)...

Well, I don't think your patch is going to be too heavily affected
either way, because most of your changes were not in pg_dump.c, and
the need for any changes at all in pg_dump.c should rapidly be going
away.  At any rate, the problem with stopping here is that g_conn is
still floating around, and one way or the other we've got to get rid
of it if you want to have more than one ArchiveHandle floating around
at a time.  So we at least need to press on far enough to get to that
point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-08 Thread Jeff Janes
On Wed, Feb 1, 2012 at 11:46 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 31.01.2012 17:35, Fujii Masao wrote:

 On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 On 20.01.2012 15:32, Robert Haas wrote:


 On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com    wrote:


 Here's another version of the patch to make XLogInsert less of a
 bottleneck
 on multi-CPU systems. The basic idea is the same as before, but several
 bugs
 have been fixed, and lots of misc. clean up has been done.



 This seems to need a rebase.



 Here you go.


 The patch seems to need a rebase again.


 Here you go again. It conflicted with the group commit patch, and the patch
 to WAL-log and track changes to full_page_writes setting.


After applying this patch and then forcing crashes, upon recovery the
database is not correct.

If I make a table with 10,000 rows and then after that intensively
update it using a unique key:

update foo set count=count+1 where foobar=?

Then after the crash there are less than 10,000 visible rows:

select count(*) from foo

This not a subtle thing, it happens every time.  I get counts of
between 1973 and 8827.  Without this patch I always get exactly
10,000.

I don't really know where to start on tracking this down.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


pg_receivexlog and sync rep Re: [HACKERS] Updated version of pg_receivexlog

2012-02-08 Thread Fujii Masao
On Tue, Oct 25, 2011 at 7:37 PM, Magnus Hagander mag...@hagander.net wrote:
 On Mon, Oct 24, 2011 at 14:40, Magnus Hagander mag...@hagander.net wrote:
 On Mon, Oct 24, 2011 at 13:46, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 How does this interact with synchronous replication? If a base backup that
 streams WAL is in progress, and you have synchronous_standby_names set to
 '*', I believe the in-progress backup will count as a standby for that
 purpose. That might give a false sense of security.

 Ah yes. Did not think of that. Yes, it will have this problem.

 Actually, thinking more, per other mail, it won't. Because it will
 never report that the data is synced to disk, so it will not be
 considered for sync standby.

Now, a new replication mode (synchronous_commit = write) is supported.
In this mode, the in-progress backup will be considered a sync
standby because its periodic status report includes a valid write position.
We should change the report so that it includes only invalid positions.
Patch attached.

While I agree that the backup should not behave as a sync standby, ISTM
that pg_receivexlog should, which would be very useful. If pg_receivexlog
does so, we can write WAL synchronously both locally and remotely, which
would increase the durability of the system. Thus, to allow pg_receivexlog
to behave as a sync standby, should we change it so that its status report
includes the write and flush positions?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


invalid_write_pos_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Peter Geoghegan
On 8 February 2012 17:58, Robert Haas robertmh...@gmail.com wrote:
 It seems clear that the single sort-key optimizations are a much
 better value per byte of code than the type-specific optimizations.
 Ignoring client overhead, we get between half and two-thirds of the
 benefit, and it costs us just one extra copy of the quicksort logic
 for all data types.

That was clear from an early stage, and is something that I
acknowledged way back in September - it's pretty obvious that chasing
diminishing returns is the nature of this sort of thing, which is
fine, provided that they are not chased beyond some point that doesn't
make sense. However, I do not see why we should not at least include
full integer specialisations as well (for single-keys at least), given
that we get the additional, not inconsiderable improvements. I do not
accept the premise that we need to find the optimal bytes added to
performance increase ratio, because, as I've already stressed, adding
bytes to the application binary does not have a linear cost.

Here's what might be a good compromise:

singlekey generic, singlekey int4, singlekey int8, multikey generic.

That is a total number of 4 contiguous copies of the qsort, because
we've obsoleted the original qsort_arg (except for btree index
sorting, which likely matters far less). We get the most benefit for 5
very common types - int4, int8, date, timestamp and timestamptz. A 30%
improvement has to be better than the 19% speedup for just the single
sort-key optimisations, considering that in practice these types are
very commonly used.

I think that there may be additional benefits from making the
qsort_arg specialisation look less like a c stdlib one, like refining
the swap logic to have compile-time knowledge of the type it is
sorting. I'm thinking that we could usefully trim quite a bit from
this:

#define SWAPINIT(a, es) swaptype = ((char *)(a) - (char *)0) % sizeof(long) || \
	(es) % sizeof(long) ? 2 : (es) == sizeof(long) ? 0 : 1;

#define thistype_vecswap(a, b, n)					\
	if ((n) > 0) inl_swapfunc((a), (b), (size_t)(n), swaptype)

#define thistype_swap(a, b)						\
	if (swaptype == 0) {						\
		long t = *(long *)(void *)(a);				\
		*(long *)(void *)(a) = *(long *)(void *)(b);		\
		*(long *)(void *)(b) = t;				\
	} else								\
		inl_swapfunc(a, b, es, swaptype)

inline static void
inl_swapfunc(char *a, char *b, size_t n, int swaptype)
{
	if (swaptype <= 1)
		swapcode(long, a, b, n);
	else
		swapcode(char, a, b, n);
}
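
For instance (an illustrative sketch only, not something from the patch
as posted), a specialisation that knows it is always swapping fixed-size
SortTuple elements could drop SWAPINIT and the swaptype dispatch
entirely and swap by plain assignment; the macro name is made up:

/* swap two SortTuple elements directly; no runtime swaptype dispatch */
#define SORTTUPLE_SWAP(a, b)			\
	do {					\
		SortTuple _t = *(a);		\
		*(a) = *(b);			\
		*(b) = _t;			\
	} while (0)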


-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-08 Thread Tom Lane
Jay Levitt jay.lev...@gmail.com writes:
 [Posted at Andres's request]
 TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in 
 various builds.

Not sure yet about most of these, but I know the reason for this one:

 2. In both 9.1 and 9.2, there is a long delay before CREATE INDEX realizes 
 it can't work on an unlogged table

That error is thrown in gistbuildempty, which is not called until after
we have finished building the main-fork index.  This is a tad unfriendly
when the table already contains lots of data.

ISTM there are two ways we could fix this:

1. Introduce a duplicative test at the start of gistbuild().

2. Rearrange the order of operations in index_build() so that the init
fork is made first.

Both of these are kinda ugly, but #2 puts the ugliness into someplace
that shouldn't have to know about it, and furthermore someplace that's
unlikely to get reverted if/when gist is fixed to not have this problem.
So I think I favor #1.  Other opinions anyone?
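
For concreteness, #1 would amount to something like this near the top of
gistbuild() -- just a sketch, with the check being my guess at the
duplicative test and the error presumably mirroring the one
gistbuildempty throws today:

	/* refuse early, before we spend time building the main fork */
	if (index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("unlogged GiST indexes are not supported")));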

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-08 Thread Robert Haas
On Wed, Feb 8, 2012 at 2:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jay Levitt jay.lev...@gmail.com writes:
 [Posted at Andres's request]
 TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in
 various builds.

 Not sure yet about most of these, but I know the reason for this one:

 2. In both 9.1 and 9.2, there is a long delay before CREATE INDEX realizes
 it can't work on an unlogged table

 That error is thrown in gistbuildempty, which is not called until after
 we have finished building the main-fork index.  This is a tad unfriendly
 when the table already contains lots of data.

 ISTM there are two ways we could fix this:

 1. Introduce a duplicative test at the start of gistbuild().

 2. Rearrange the order of operations in index_build() so that the init
 fork is made first.

 Both of these are kinda ugly, but #2 puts the ugliness into someplace
 that shouldn't have to know about it, and furthermore someplace that's
 unlikely to get reverted if/when gist is fixed to not have this problem.
 So I think I favor #1.  Other opinions anyone?

I don't think I understand your object to #2.  It appears to be a
trivial rearrangement?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Feb 8, 2012 at 2:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ISTM there are two ways we could fix this:
 
 1. Introduce a duplicative test at the start of gistbuild().
 
 2. Rearrange the order of operations in index_build() so that the init
 fork is made first.
 
 Both of these are kinda ugly, but #2 puts the ugliness into someplace
 that shouldn't have to know about it, and furthermore someplace that's
 unlikely to get reverted if/when gist is fixed to not have this problem.
 So I think I favor #1.  Other opinions anyone?

 I don't think I understand your object to #2.  It appears to be a
 trivial rearrangement?

Yeah, but then we are wiring into index_build the idea that ambuildempty
is more important, or more likely to throw an error, or something, than
ambuild is.  It seems weird.  And fragile, since somebody could decide
to re-order those two steps again for reasons unrelated to gist.
Basically, I think this problem is gist's to deal with and so that's
where the fix should be.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Peter Geoghegan
On 8 February 2012 18:48, Peter Geoghegan pe...@2ndquadrant.com wrote:
 I think that there may be additional benefits from making the
 qsort_arg specialisation look less like a c stdlib one, like refining
 the swap logic to have compile-time knowledge of the type it is
 sorting. I'm thinking that we could usefully trim quite a bit from
 this:

It seems that while we cannot get any better performance, no doubt
because the compiler is already applying all manner of optimisations
here, perhaps including constant propagation, we can simplify this part
of the code quite a bit. That seems fairly incidental, though.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-08 Thread Robert Haas
On Wed, Feb 8, 2012 at 2:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Feb 8, 2012 at 2:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ISTM there are two ways we could fix this:

 1. Introduce a duplicative test at the start of gistbuild().

 2. Rearrange the order of operations in index_build() so that the init
 fork is made first.

 Both of these are kinda ugly, but #2 puts the ugliness into someplace
 that shouldn't have to know about it, and furthermore someplace that's
 unlikely to get reverted if/when gist is fixed to not have this problem.
 So I think I favor #1.  Other opinions anyone?

 I don't think I understand your object to #2.  It appears to be a
 trivial rearrangement?

 Yeah, but then we are wiring into index_build the idea that ambuildempty
 is more important, or more likely to throw an error, or something, than
 ambuild is.  It seems weird.  And fragile, since somebody could decide
 to re-order those two steps again for reasons unrelated to gist.

I guess.  I think the compelling reason to do ambuildempty first is
that it's fast.  So might as well.  I think you're just going to end up
hard-wiring the assumption that ambuild happens before ambuildempty,
which doesn't seem any better than the other way around, but I don't
care enough to argue about if you feel strongly about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-08 Thread Alexander Korotkov
On Tue, Feb 7, 2012 at 11:26 PM, Jay Levitt jay.lev...@gmail.com wrote:

 [*] psql:slowcube.sql:20: ERROR:  node buffer of page being split (121550)
 does not exist

This looks like a bug in buffering GiST index build I've implemented during
my GSoC 2011 project. It looks especially strange with following setting:

 effective_cache_size = 3734MB

because buffering GiST index build just shouldn't turn on in this case, when
the index fits in cache. I'm going to take a detailed look at this.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] [GENERAL] pg_dump -s dumps data?!

2012-02-08 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 OK, in this version we simply suppress creation of the TableDataInfo 
 object if it's not wanted.

I applied this with some further adjustments.

 I actually removed the code from 
 makeTableDataInfo - there are only two places it gets called and doing 
 it in those two spots seemed a bit cleaner.

After study it seemed to me that this was the wrong direction to take.
It was already the case that the config-table path was skipping the
filters in getTableData(), none of which seem to be good for it to skip
other than the one on the table's schema-dump flag.  So I've refactored
the code to put all those filters into makeTableDataInfo where they
will be applied to both the normal and config-table cases.

Now, back to the original subject of this thread: both HEAD and 9.1 are
now operating as designed, in that they will dump the (user-provided
portion of the) contents of an extension config table whenever that
extension is dumped, even if --schema is specified.  Do we want to
change that?  I'm not convinced that it's something to mess with
lightly.  I could possibly support a definition that says that we omit
such data in --schema mode, except that I'm not sure what is sensible
for rows that were modified (not inserted) by user actions.  Also, we
probably ought to get some input from the postgis people, because after
all this entire feature was designed for their schema auxiliary tables.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I guess.  I think the compelling reason to do ambuildempty first is
 that it's fast.  So might as well.  I think you're just going to end up
 hard-wiring the assumption that ambuild happens before ambuildempty,

Well, no, because I'm proposing that both functions throw this error.

 which doesn't seem any better than the other way around, but I don't
 care enough to argue about if you feel strongly about it.

What's ugly about this solution is the duplicative ereport calls.  But
at least the ugliness is confined to its source, ie gist.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Enable min/max optimization for bool_and/bool_or/every

2012-02-08 Thread Marti Raudsepp
On Wed, Feb 8, 2012 at 19:48, Tom Lane t...@sss.pgh.pa.us wrote:
 I applied this patch, since I was busy applying catalog changes from you
 anyway ;-).

Thanks :)

 I did think of a possible reason to reject the patch: with this change,
 the planner will take longer to plan queries involving bool_and() et al,
 since planagg.c will spend time looking (usually fruitlessly) for an
 index-based plan.

Good point, I should have done those measurements up front. Anyway,
since I've often noticed \timing to be unreliable for short queries, I
decided to retry your test with pgbench.

Long story short, I measured 27% overhead in the un-indexed column
case and 33% overhead for an indexed column. That's a lot more than I
expected. I even rebuilt and retried a few times to make sure I hadn't
botched something. The benchmark script is attached.

UNPATCHED
select bool_and(b) from unindexed;
tps = 13787.023113 (excluding connections establishing)
tps = 13880.484788 (excluding connections establishing)
tps = 13784.654542 (excluding connections establishing)
select bool_and(b) from indexed;
tps = 12536.650703 (excluding connections establishing)
tps = 12647.767993 (excluding connections establishing)
tps = 12500.956407 (excluding connections establishing)

PATCHED
select bool_and(b) from unindexed;
tps = 10096.834678 (excluding connections establishing)
tps = 10110.182425 (excluding connections establishing)
tps = 10103.904500 (excluding connections establishing)
select bool_and(b) from indexed;
tps = 8373.631407 (excluding connections establishing)
tps = 8659.917173 (excluding connections establishing)
tps = 8473.899896 (excluding connections establishing)

Regards,
Marti


bench_bool_and.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace

2012-02-08 Thread Christian Nicolaisen

Hi

We have some tables with documents and their metadata (filename, 
filetype, etc).
Most of the time we just want to get the metadata (filename, filetype, 
etc) and not the document itself.
In this case it would be nice to have the metadata (which wouldn't end 
up on the TOAST) on a fast array (SSD-based),
and put the filedata on a slow array (HDD-based). It's probably possible 
to have two tables, one for the metadata and one for the extra file,
but this is a workaround.

--
Mvh
Christian Nicolaisen
Deltasoft AS
Tlf  +47 938 38 596


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Core Extensions relocation

2012-02-08 Thread Josh Berkus
Greg,

 This is currently awaiting a check by gsmith that the 7 named extensions
 do not add any new dependancies.

Are you going to investigate this?  If not, I'll give it a try this weekend.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-08 Thread Tom Lane
Jay Levitt jay.lev...@gmail.com writes:
 [Posted at Andres's request]
 TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in 
 various builds.

 1. In 9.1.2, inserting 10x rows takes 19x the time.
 - 9.1-HEAD and 9.2 fix this; it now slows down linearly
 - but: 10s > 8s > 5s!
 - but: comparing Ubuntu binary w/vanilla source build on virtual disks, 
 might not be significant

FWIW, I find it really hard to believe that there is any real difference
between 9.1.2 and 9.1 branch tip on this.  There have been no
significant changes in either the gist or contrib/cube code in that
branch.  I suspect you have a measurement issue there.

On my not-at-all-virtual Fedora 16 workstation, with 9.1 tip, your test
case shows index build times of
10 rows       3650 ms
100 rows      48400 ms
1000 rows     1917800 ms
which confirms the nonlinear scaling in 9.1, though I'm not sure it's
not just running out of RAM and having to do a lot of I/O in the last
case.  (This is an entirely untuned debug build, which probably doesn't
help.)  It's hard to guess how much available RAM you were working with
on your box -- mine's got 4GB.

 2. In both 9.1 and 9.2, there is a long delay before CREATE INDEX realizes 
 it can't work on an unlogged table

Fixed.

 3. In 9.2, creating the 10-million-row index always fails

As Alexander noted, this is probably a bug in his recent patch.  We'll
look at it.  (I duplicated it here, so it's plenty real.)

 4. 9.1-HEAD never successfully indexes 10 million rows (never = at least 
 20 minutes on two runs; I will follow up in a few hours)

Works for me (see above), though it's slower than you might've expected.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgstat documentation tables

2012-02-08 Thread Magnus Hagander
On Wed, Feb 8, 2012 at 17:13, Bruce Momjian br...@momjian.us wrote:
 On Tue, Feb 07, 2012 at 11:29:12PM -0500, Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  How do people feel about pulling text out of the SGML docs and loading
  it into the database as table and column comments?

 I'm not thrilled by that proposal.  The length limits on comments are
 very much shorter than what is sensible to use in catalogs.sgml (at
 least if you want the comments to look passable in \d displays).  And
 I don't want people trying to serve two different use-cases when they
 write those.

 I was thinking of just grabbing the first X characters from the SGML
 comments.

So that it wouldn't necessarily even make for a complete sentence?
That seems like a really bad idea...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLOG contention, part 2

2012-02-08 Thread Robert Haas
On Sun, Jan 29, 2012 at 6:04 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sun, Jan 29, 2012 at 9:41 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 If I cast to a int, then I see advancement:

 I'll initialise it as 0, rather than -1 and then we don't have a
 problem in any circumstance.


 I've specifically designed the pgbench changes required to simulate
 conditions of clog contention to help in the evaluation of this patch.

 Yep, I've used that one for the testing.

 Most of the current patch is just bookkeeping to keep track of the
 point when we can look at history in read only manner.

 I've isolated the code better to allow you to explore various
 implementation options. I don't see any performance difference between
 any of them really, but you're welcome to look.

 Please everybody note that the clog history doesn't even become active
 until the first checkpoint, so this is dead code until we've hit the
 first checkpoint cycle and completed a million transactions since
 startup. So its designed to tune for real world situations, and is not
 easy to benchmark. (Maybe we could start earlier, but having extra
 code just for first few minutes seems waste of energy, especially
 since we must hit million xids also).

I find that this version does not compile:

clog.c: In function ‘TransactionIdGetStatus’:
clog.c:431: error: ‘clog’ undeclared (first use in this function)
clog.c:431: error: (Each undeclared identifier is reported only once
clog.c:431: error: for each function it appears in.)

Given that, I obviously cannot test this at this point, but let me go
ahead and theorize about how well it's likely to work.  What Tom
suggested before (and after some reflection I think I believe it) is
that the frequency of access will be highest for the newest CLOG page
and then drop off for each page further back you go.  Clearly, if that
drop-off is fast - e.g. each buffer further backward is half as likely
to be accessed as the next newer one - then the fraction of accesses
that will hit pages that are far enough back to benefit from this
optimization will be infinitesmal; 1023 out of every 1024 accesses
will hit the first ten pages, and on a high-velocity system those all
figure to have been populated since the last checkpoint.  The best
case for this patch should be an access pattern that involves a very
long tail; actually, pgbench is a pretty good fit for that, assuming
the scale factor is large enough.  For example, at scale factor 100,
we've got 10,000,000 tuples: choosing one at random, we're almost
exactly 90% likely to find one that hasn't been chosen in the last
1,024,576 tuples (i.e. 32 CLOG pages @ 32K txns/page).  In terms of
reducing contention on the main CLOG SLRU, that sounds pretty
promising, but depends somewhat on the rate at which transactions are
processed relative to the frequency of checkpoints, since that will
affect how many pages back you have to go to use the history path.
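
(As a back-of-the-envelope check of that 90% figure: with uniformly
random picks over N = 10,000,000 tuples, the chance that a given tuple
was untouched by the last k picks is (1 - 1/N)^k, roughly e^(-k/N);
with k a little over a million, that's e^(-0.10), or about 0.90.)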

However, there is a potential fly in the ointment: in other cases in
which we've reduced contention at the LWLock layer, we've ended up
with very nasty contention at the spinlock layer that can sometimes
eat more CPU time than the LWLock contention did.   In that light, it
strikes me that it would be nice to be able to partition the
contention N ways rather than just 2 ways.  I think we could do that
as follows.  Instead of having one control lock per SLRU, have N
locks, where N is probably a power of 2.  Divide the buffer pool for
the SLRU N ways, and decree that each slice of the buffer pool is
controlled by one of the N locks.  Route all requests for a page P to
slice P mod N.  Unlike this approach, that wouldn't completely
eliminate contention at the LWLock level, but it would reduce it
proportional to the number of partitions, and it would reduce spinlock
contention according to the number of partitions as well.  A down side
is that you'll need more buffers to get the same hit rate, but this
proposal has the same problem: it doubles the amount of memory
allocated for CLOG.  Of course, this approach is all vaporware right
now, so it's anybody's guess whether it would be better than this if
we had code for it.  I'm just throwing it out there.
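
To make the routing concrete, the mapping I have in mind is roughly the
following -- purely a sketch with made-up names, since none of this
exists yet:

#define SLRU_NUM_PARTITIONS 8		/* N, some power of 2 */

/*
 * Route page P to slice P mod N; each slice would have its own control
 * lock and its own share of the SLRU buffer pool.
 */
static inline int
SlruPartitionFor(int pageno)
{
	return pageno & (SLRU_NUM_PARTITIONS - 1);
}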

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Robert Haas
On Wed, Feb 8, 2012 at 1:48 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 That was clear from an early stage, and is something that I
 acknowledged way back in September

OK, so why didn't/don't we do and commit that part first, and then
proceed to argue about the remainder once it's in?

 I think that there may be additional benefits from making the
 qsort_arg specialisation look less like a c stdlib one, like refining
 the swap logic to have compile-time knowledge of the type it is
 sorting. I'm thinking that we could usefully trim quite a bit from
 this:

That's an interesting idea, which seems worth pursuing, though
possibly not for 9.2.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-08 Thread Dan Scales
 Is there any problem if the double-write happens only by bgwriter or 
 checkpoint. 
 Something like whenever backend process has to evict the buffer, it will do 
 same as you have described that write in a double-write buffer, but  
 bgwriter  will check this double-buffer and flush from it.
 Also whenever any backend will see that the double buffer is more than 2/3rd 
 or some threshhold value full it will tell bgwriter to flush from  
 double-write buffer.
 This can ensure very less I/O by any backend.

Yes, I think this is a good idea.  I could make changes so that the backends 
hand off the responsibility to flush batches of the double-write buffer to the 
bgwriter whenever possible.  This would avoid some long IO waits in the 
backends, though the backends may of course eventually wait anyways for the 
bgwriter if IO is not fast enough.  I did write the code so that any process 
can write a completed batch if the batch is not currently being flushed (so as 
to deal with crashes by backends).  Having the backends flush the batches as 
they fill them up was just simpler for a first prototype.

Dan

- Original Message -
From: Amit Kapila amit.kap...@huawei.com
To: Dan Scales sca...@vmware.com, PG Hackers 
pgsql-hackers@postgresql.org
Sent: Tuesday, February 7, 2012 1:08:49 AM
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

 I think it is a good idea, and can help double-writes perform better in the 
 case of lots of backend evictions.
   I don't understand this point, because from the data in your mail, it 
appears that when shared buffers are less means when more evictions can happen, 
the performance is less.

ISTM that the performance is less incase shared buffers size is less because 
I/O might happen by the backend process
which can degrade performance. 
Is there any problem if the double-write happens only by bgwriter or 
checkpoint. 
Something like whenever backend process has to evict the buffer, it will do 
same as you have described that write in a double-write buffer, but bgwriter  
will check this double-buffer and flush from it.
Also whenever any backend will see that the double buffer is more than 2/3rd or 
some threshhold value full it will tell bgwriter to flush from double-write 
buffer.
This can ensure very less I/O by any backend.


-Original Message-
From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Dan Scales
Sent: Saturday, January 28, 2012 4:02 AM
To: PG Hackers
Subject: [HACKERS] double writes using double-write buffer approach [WIP]

I've been prototyping the double-write buffer idea that Heikki and Simon had 
proposed (as an alternative to a previous patch that only batched up writes by 
the checkpointer).  I think it is a good idea, and can help double-writes 
perform better in the case of lots of backend evictions.
It also centralizes most of the code change in smgr.c.  However, it is trickier 
to reason about.

The idea is that all page writes generally are copied to a double-write buffer, 
rather than being immediately written.  Note that a full copy of the page is 
required, but can folded in with a checksum calculation.
Periodically (e.g. every time a certain-size batch of writes have been added), 
some writes are pushed out using double writes -- the pages are first written 
and fsynced to a double-write file, then written to the data files, which are 
then fsynced.  Then double writes allow for fixing torn pages, so 
full_page_writes can be turned off (thus greatly reducing the size of the WAL 
log).

The key changes are conceptually simple:

1.  In smgrwrite(), copy the page to the double-write buffer.  If a big
enough batch has accumulated, then flush the batch using double
writes.  [I don't think I need to intercept calls to smgrextend(),
but I am not totally sure.]

2.  In smgrread(), always look first in the double-write buffer for a
particular page, before going to disk.

3.  At the end of a checkpoint and on shutdown, always make sure that the
current contents of the double-write buffer are flushed.

4.  Pass flags around in some cases to indicate whether a page buffer
needs a double write or not.  (I think eventually this would be an
attribute of the buffer, set when the page is WAL-logged, rather than
a flag passed around.)

5.  Deal with duplicates in the double-write buffer appropriately (very
rarely happens).
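
To make (1) and (3) above a bit more concrete, the buffering boils down
to something like this toy sketch -- the dw_* names are made up, it
shows a single batch rather than the two circular buffers described
below, and comments stand in for the real I/O calls:

#define DW_BATCH_PAGES 64

static char	dw_pages[DW_BATCH_PAGES][BLCKSZ];	/* copies of dirty pages */
static int	dw_count = 0;

static void
dw_add_page(const char *page)
{
	/* copy the page into the double-write buffer instead of writing it now */
	memcpy(dw_pages[dw_count++], page, BLCKSZ);

	if (dw_count == DW_BATCH_PAGES)
	{
		/* write the whole batch to the double-write file and fsync it ... */
		/* ... then write each page to its real data file and fsync those */
		dw_count = 0;
	}
}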

To get good performance, I needed to have two double-write buffers, one for the 
checkpointer and one for all other processes.  The double-write buffers are 
circular buffers.  The checkpointer double-write buffer is just a single batch 
of 64 pages; the non-checkpointer double-write buffer is 128 pages, 2 batches 
of 64 pages each.  Each batch goes to a different double-write file, so that 
they can be issued independently as soon as each batch is completed.  Also, I 
need to sort the buffers being checkpointed by 

Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Peter Geoghegan
On 8 February 2012 23:33, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Feb 8, 2012 at 1:48 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 That was clear from an early stage, and is something that I
 acknowledged way back in September

 OK, so why didn't/don't we do and commit that part first, and then
 proceed to argue about the remainder once it's in?

I have no objection to that. I'm not sure that it's going to be
possible to agree to anything beyond what has already been agreed. We
don't seem to be covering new ground.

What would it take to convince you to be more inclusive, and have at
least a few full specialisations, in particular, int4 and int8 single
key full specialisations? I'm sure that you'll agree that they're the
second most compelling case. There doesn't seem to be a standard that
I can meet that you can even /describe/ to have those additional
specialisations committed, which is a shame, since they appear to be
pretty decent wins above and beyond the generic single key
specialisation, all things considered.

I suppose that the standard refrain here is "it's your patch; you must
prove the case for committing it". That would be fine by me, but this
isn't just about me, and it seems to be a practical impossibility in
this case. It may be quite impossible even with far greater resources
than I can afford to apply here, since it seems like you're hinting at
some kind of rigorous proof that this cannot cause a regression for a
single client, even though in order to see that regression multiple
other clients would have to be benefiting. I cannot provide such a
proof, since it would probably have to consider all commercially
available CPUs and all possible workloads - I doubt that anyone can.
That being the case, I'm eager not to waste any more time on this. I
bear no ill will; that just seems to be the situation that we find
ourselves in.

That said, if you can describe the standard, I'll try and meet it -
you seem to be suggesting that months and months of benchmarks
wouldn't really change things, since this is the first step on the
road to binary bloat ruin. The only fundamentally new argument that I
can offer is that by applying the standard that the community does
here, it severely limits the number of performance improvements that
can be added, and the aggregate effect is that we're quite certainly
worse off. It's a pity this isn't a really easy decision for you to
make, but that's the nature of these things, and we should expect that
to increasingly be the case. Is it really so unacceptable for us to
risk doing a little worse under some rare circumstances in order to do
better under some common circumstances? Why should it be an easy
decision - when are important decisions ever easy?

 I think that there may be additional benefits from making the
 qsort_arg specialisation look less like a c stdlib one, like refining
 the swap logic to have compile-time knowledge of the type it is
 sorting. I'm thinking that we could usefully trim quite a bit from
 this:

 That's an interesting idea, which seems worth pursuing, though
 possibly not for 9.2.

Well, it's really just a code clean-up. I suspect that qsort_arg is a
minimally modified version of the NetBSD one that wasn't necessarily
thoroughly understood when it was initially added (not that I'm
suggesting that it ought to have been). Then again, you might prefer
to keep it as consistent as possible with qsort (the other copy of
that sort function that we already have).

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for parallel pg_dump

2012-02-08 Thread Joachim Wieland
On Wed, Feb 8, 2012 at 1:21 PM, Robert Haas robertmh...@gmail.com wrote:
 In my patch I dealt with exactly the same problem for the error
 handler by creating a separate function that has a static variable (a
 pointer to the ParallelState). The value is set and retrieved through
 the same function, so yes, it's kinda global but then again it can
 only be accessed from this function, which is only called from the
 error handler.

 How did you make this thread-safe?

The ParallelState has a ParallelSlot for each worker process which
contains that worker process's thread id. So a worker process just
walks through the slots until it finds its own thread id and then goes
from there.
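
In rough outline, the lookup is no more than this (the field and
function names here are illustrative only, not the ones actually used
in my patch, and GetMyThreadId() is a hypothetical helper):

/* each worker finds the slot whose recorded thread id matches its own */
static ParallelSlot *
GetMyParallelSlot(ParallelState *pstate)
{
	int			i;

	for (i = 0; i < pstate->numWorkers; i++)
	{
		if (pstate->slot[i].threadId == GetMyThreadId())
			return &pstate->slot[i];
	}
	return NULL;
}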

In particular, it only gets the file descriptors to and from the
parent from this structure, to communicate the error it encountered,
but it doesn't get the PGconn. This is because that error handler is
only called when a worker calls die_horribly(AH, ...) in which case
the connection is already known through AH.

Termination via a signal just sets a variable that is checked in the
I/O routines and there we also have AH to shut down the connection
(actually to call die_horribly()).


 So we at least need to press on far enough to get to that point.

Sure, let me know if I can help you with something.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum rate limit in KBps

2012-02-08 Thread Greg Smith

On 02/08/2012 11:22 AM, Bruce Momjian wrote:

Why can't vacuum handle things automatically like checkpoint smoothing?
Why can't it detect when it is falling behind and speed up?  Why can't
it see a busy background writer and slow down?   Unless we answer
these questions, we are not solving the problem for 99% of our users.


Scheduling VACUUM is a much harder problem than checkpoint smoothing.  
Checkpoints can define "ahead" and "behind" simply; defining what those 
terms mean for vacuum scheduling across a set of tables is tricky.  A 
possible downside to not vacuuming hard enough is that you'll have 
tables reach their thresholds faster than you're cleaning them up.  
That's a future bad event possible if you don't do the right thing now.  
Predicting when that will happen is a whole different class of problem 
than tracking whether checkpoints are running on schedule.


I have a design sketched out for something that adjusts the vacuum rate 
based on background writer activity.  I'm not optimistic that will ever 
be automatic to the level you'd like here either.  That looks like it 
will be prone to feedback loops where autovacuum can starve forever.  If 
it ever lags enough that commonly accessed tables have become 
inefficiently stored, they will then generate more buffer I/O to access, 
and thus continue to block future vacuum work.  That's the exact 
situation where vacuum is most needed, even though it seems too 
expensive to do.  That's one trap people already fall into when doing 
this by hand.  Everyone who's ever said autovacuum uses too much I/O 
when it pops up unexpectedly during the day, I'm going to turn that off 
has seen how that works out.  Hilarity ensues when the inevitable and 
gigantic wraparound vacuums happen in the middle of the day instead.


Just trying to set the expectations bar realistically here.  I don't 
consider the easier problem of checkpoint smoothing a solved one, 
either.  Saying autovacuum needs to reach even that level of automation 
to be a useful improvement over now is a slippery goal.  Regardless, the 
simple idea I submitted to this CF is clearly dead for now.  I'll take 
the feedback of "this level of change can live in a user-side tuning 
tool" and do that instead.  Since I was already thinking in the 
direction of background activity monitoring, I have a good idea how I'd 
need to approach this next, to be more likely to gain community buy-in 
as an automated improvement.  That's a longer term project though, which 
I'll hopefully be able to revisit for 9.3.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql NUL record and field separator

2012-02-08 Thread Abhijit Menon-Sen
At 2012-02-07 13:20:43 +0200, pete...@gmx.net wrote:

 Should we rename the options and/or add that to the documentation, or is
 the new behavior obvious and any new terminology would be too confusing?

I agree there is potential for confusion either way. I tried to come up
with a complete and not-confusing wording for all four options, but did
not manage to improve on the current state of affairs significantly. I
think it can stay the way it is. The reference to xargs -0 is probably
enough to set the right expectations about how it works.

We can always add a sentence later to clarify the special-case behaviour
of -0 if anyone is actually confused (and the best wording will be more
clear in that situation too).

-- Abhijit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers