Re: [HACKERS] basic pgbench runs with various performance-related patches

2012-02-05 Thread Robert Haas
On Sat, Feb 4, 2012 at 11:59 AM, Greg Smith g...@2ndquadrant.com wrote:
 On 01/24/2012 08:58 AM, Robert Haas wrote:

 One somewhat odd thing about these numbers is that, on permanent
 tables, all of the patches seemed to show regressions vs. master in
 single-client throughput.  That's a slightly difficult result to
 believe, though, so it's probably a testing artifact of some kind.

 It looks like you may have run the ones against master first, then the ones
 applying various patches.  The one test artifact I have to be very careful
 to avoid in that situation is that later files on the physical disk are
 slower than earlier ones.  There's a 30% difference between the fastest
 part of a regular hard drive, the logical beginning, and its end.  Multiple
 test runs tend to creep forward onto later sections of the disk, biasing the
 comparison toward the earlier run.  To eliminate that bias when it gets
 bad, I normally either a) run each test 3 times, interleaved, or b) rebuild
 the filesystem in between each initdb.

 I'm not sure that's the problem you're running into, but it's the only one
 I've been hit by that matches the suspicious part of your results.

I don't think that's it, because tests on various branches were
interleaved; moreover, I don't believe master was the first one in the
rotation.  I think I had them in alphabetical order by branch name,
actually.
actually.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-02-05 Thread Jeff Davis
On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
 If we add fsync calls to the initdb process, they should cover the entire data
 directory tree.  This patch syncs files that initdb.c writes, but we ought to
 also sync files that bootstrap-mode backends had written.

It doesn't make sense for initdb to take responsibility to sync files
created by the backend. If there are important files that the backend
creates, it should be the backend's responsibility to fsync them (and
their parent directory, if needed). And if they are unimportant to the
backend, then there is no reason for initdb to fsync them.
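For illustration only, here is roughly what "fsync the file and its parent
directory" means at the system-call level; this is a generic POSIX sketch
with a made-up helper name, not the actual initdb or backend code:

#include <fcntl.h>
#include <libgen.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: make "path" durable by fsyncing the file itself and
 * then its parent directory, so the directory entry survives a crash too.
 */
static int
fsync_file_and_parent(const char *path)
{
	char	dirbuf[1024];
	int		fd;

	/* sync the file's contents */
	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	close(fd);

	/* sync the parent directory; dirname() may modify its argument */
	strncpy(dirbuf, path, sizeof(dirbuf) - 1);
	dirbuf[sizeof(dirbuf) - 1] = '\0';
	fd = open(dirname(dirbuf), O_RDONLY);
	if (fd < 0)
		return -1;
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}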

 An optimization
 like the pg_flush_data() call in copy_file() may reduce the speed penalty.

That worked pretty well. It took it down about 100ms on my machine,
which closes the gap significantly.

 initdb should do these syncs by default and offer an option to disable them.

For test frameworks that run initdb often, that makes sense.

But for developers, it doesn't make sense to spend 0.5s typing an option
that saves you 0.3s. So, we'd need some more convenient way to choose
the no-fsync option, like an environment variable that developers can
set. Or maybe developers don't care about 0.3s?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pl/python long-lived allocations in datum-dict transformation

2012-02-05 Thread Jan Urbański
Consider this:

create table arrays as select array[random(), random(), random(),
random(), random(), random()] as a from generate_series(1, 100);

create or replace function plpython_outputfunc() returns void as $$
c = plpy.cursor('select a from arrays')
for row in c:
    pass
$$ language plpythonu;

When running the function, every datum will get transformed into a
Python dict, which includes calling the type's output function,
resulting in a memory allocation. The memory is allocated in the SPI
context, so it accumulates until the function is finished.

This is annoying for functions that plough through large tables, doing
some calculation. Attached is a patch that does the conversion of
PostgreSQL Datums into Python dict objects in a scratch memory context
that gets reset every time.
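For readers who don't want to dig through the diff, the pattern boils down
to something like this (identifiers here are illustrative; the attached
patch wires an equivalent context into PLyProcedure):

#include "postgres.h"
#include "utils/memutils.h"

/*
 * Sketch of the scratch-context pattern: do the per-row conversion in a
 * short-lived context and reset it afterwards.
 */
static void
convert_row_in_scratch_context(MemoryContext tmp_ctx)
{
	MemoryContext oldcontext = MemoryContextSwitchTo(tmp_ctx);

	/*
	 * ... call the datums' output functions and build the Python dict here;
	 * every palloc() now lands in tmp_ctx rather than the SPI context ...
	 */

	MemoryContextSwitchTo(oldcontext);
	MemoryContextReset(tmp_ctx);	/* discard all per-row allocations */
}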

Cheers,
Jan
diff --git a/src/pl/plpython/plpy_main.c b/src/pl/plpython/plpy_main.c
index ae9d87e..79d7784 100644
*** a/src/pl/plpython/plpy_main.c
--- b/src/pl/plpython/plpy_main.c
***
*** 12,17 
--- 12,18 
  #include "executor/spi.h"
  #include "miscadmin.h"
  #include "utils/guc.h"
+ #include "utils/memutils.h"
  #include "utils/syscache.h"
  
  #include "plpython.h"
*** plpython_inline_handler(PG_FUNCTION_ARGS
*** 268,273 
--- 269,279 
  
  	MemSet(&proc, 0, sizeof(PLyProcedure));
  	proc.pyname = PLy_strdup("__plpython_inline_block");
+ 	proc.tmp_ctx = AllocSetContextCreate(TopMemoryContext,
+ 		 "PL/Python temporary ctx",
+ 		 ALLOCSET_DEFAULT_MINSIZE,
+ 		 ALLOCSET_DEFAULT_INITSIZE,
+ 		 ALLOCSET_DEFAULT_MAXSIZE);
  	proc.result.out.d.typoid = VOIDOID;
  
  	PG_TRY();
diff --git a/src/pl/plpython/plpy_procedure.c b/src/pl/plpython/plpy_procedure.c
index 229966a..f539cec 100644
*** a/src/pl/plpython/plpy_procedure.c
--- b/src/pl/plpython/plpy_procedure.c
***
*** 12,17 
--- 12,18 
  #include "catalog/pg_type.h"
  #include "utils/builtins.h"
  #include "utils/hsearch.h"
+ #include "utils/memutils.h"
  #include "utils/syscache.h"
  
  #include "plpython.h"
*** PLy_procedure_create(HeapTuple procTup,
*** 169,174 
--- 170,180 
  	proc->setof = NULL;
  	proc->src = NULL;
  	proc->argnames = NULL;
+ 	proc->tmp_ctx = AllocSetContextCreate(TopMemoryContext,
+ 		  "PL/Python temporary ctx",
+ 		  ALLOCSET_DEFAULT_MINSIZE,
+ 		  ALLOCSET_DEFAULT_INITSIZE,
+ 		  ALLOCSET_DEFAULT_MAXSIZE);
  
  	PG_TRY();
  	{
*** PLy_procedure_delete(PLyProcedure *proc)
*** 411,416 
--- 417,424 
  		PLy_free(proc->src);
  	if (proc->argnames)
  		PLy_free(proc->argnames);
+ 	if (proc->tmp_ctx)
+ 		MemoryContextDelete(proc->tmp_ctx);
  }
  
  /*
diff --git a/src/pl/plpython/plpy_procedure.h b/src/pl/plpython/plpy_procedure.h
index e986c7e..5ee73fa 100644
*** a/src/pl/plpython/plpy_procedure.h
--- b/src/pl/plpython/plpy_procedure.h
*** typedef struct PLyProcedure
*** 30,35 
--- 30,36 
  	PyObject   *code;			/* compiled procedure code */
  	PyObject   *statics;		/* data saved across calls, local scope */
  	PyObject   *globals;		/* data saved across calls, global scope */
+ 	MemoryContext tmp_ctx;
  } PLyProcedure;
  
  /* the procedure cache entry */
diff --git a/src/pl/plpython/plpy_typeio.c b/src/pl/plpython/plpy_typeio.c
index d5cac9f..6f0eb46 100644
*** a/src/pl/plpython/plpy_typeio.c
--- b/src/pl/plpython/plpy_typeio.c
***
*** 23,28 
--- 23,29 
  #include "plpy_typeio.h"
  
  #include "plpy_elog.h"
+ #include "plpy_procedure.h"
  
  
  /* I/O function caching */
*** PLy_output_record_funcs(PLyTypeInfo *arg
*** 258,268 
  	Assert(arg->is_rowtype == 1);
  }
  
  PyObject *
  PLyDict_FromTuple(PLyTypeInfo *info, HeapTuple tuple, TupleDesc desc)
  {
! 	PyObject   *volatile dict;
! 	int			i;
  
  	if (info->is_rowtype != 1)
  		elog(ERROR, "PLyTypeInfo structure describes a datum");
--- 259,274 
  	Assert(arg->is_rowtype == 1);
  }
  
+ /*
+  * Transform a tuple into a Python dict object. The transformation can result
+  * in memory allocation, so it's done in a scratch memory context.
+  */
  PyObject *
  PLyDict_FromTuple(PLyTypeInfo *info, HeapTuple tuple, TupleDesc desc)
  {
! 	PyObject		*volatile dict;
! 	MemoryContext	oldcontext;
! 	int				i;
  
  	if (info->is_rowtype != 1)
  		elog(ERROR, "PLyTypeInfo structure describes a datum");
*** PLyDict_FromTuple(PLyTypeInfo *info, Hea
*** 271,276 
--- 277,284 
  	if (dict == NULL)
  		PLy_elog(ERROR, "could not create new dictionary");
  
+ 	oldcontext = MemoryContextSwitchTo(PLy_curr_procedure->tmp_ctx);
+ 
  	PG_TRY();
  	{
  		for (i = 0; i < info->in.r.natts; i++)
*** PLyDict_FromTuple(PLyTypeInfo *info, Hea
*** 298,308 
--- 306,322 
  	}
  	PG_CATCH();
  	{
+ 		MemoryContextSwitchTo(oldcontext);
+ 		MemoryContextReset(PLy_curr_procedure->tmp_ctx);
+ 
  		Py_DECREF(dict);
  		PG_RE_THROW();
  	}
  	PG_END_TRY();
  
+ 	MemoryContextSwitchTo(oldcontext);
+ 	

[HACKERS] plpgsql leaking memory when stringifying datums

2012-02-05 Thread Jan Urbański
While chasing a PL/Python memory leak, I did a few tests with PL/pgSQL
and I think there are places where memory is not freed sufficiently early.

Attached are two functions that when run will make the backend's memory
consumption increase until they finish. With both, the cause is
convert_value_to_string that calls a datum's output function, which for
toasted data results in an allocation.

The proposed patch changes convert_value_to_string to call the output
function in the per-tuple memory context and then copy the result string
back to the original context.
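The idea, roughly sketched (this is not the patch itself; the helper below
is made up and only shows the shape of the change):

#include "postgres.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

/*
 * Run the type's output function in a short-lived per-tuple context, then
 * copy only the resulting string back into the caller's context.  The
 * per-tuple context is reset elsewhere (e.g. by exec_eval_cleanup).
 */
static char *
stringify_in_tmp_context(MemoryContext per_tuple_ctx, Datum value, Oid valtype)
{
	MemoryContext	oldcontext;
	char		   *tmp;
	Oid				typoutput;
	bool			typisvarlena;

	getTypeOutputInfo(valtype, &typoutput, &typisvarlena);

	oldcontext = MemoryContextSwitchTo(per_tuple_ctx);
	tmp = OidOutputFunctionCall(typoutput, value);	/* may detoast/allocate */
	MemoryContextSwitchTo(oldcontext);

	return pstrdup(tmp);		/* copy back into the caller's context */
}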

The comment in that function says that callers generally pfree its
result, but that wasn't the case with exec_stmt_raise, so I added a
pfree() there as well.

With that I was still left with a leak in the typecast() test function
and it turns out that sticking an exec_eval_cleanup into exec_move_row
fixed it. The regression tests pass, but I'm not 100% sure if it's
actually safe.

Since convert_value_to_string needed access to the PL/pgSQL execution
state to get its hands on the per-tuple context, I needed to pass it to
every caller that didn't have it already, which means exec_cast_value
and exec_simple_cast_value. Does anyone have a better idea?

The initial diagnosis and proposed solution are by Andres Freund - thanks!

Cheers,
Jan
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index bf952b6..a691905 100644
*** a/src/pl/plpgsql/src/pl_exec.c
--- b/src/pl/plpgsql/src/pl_exec.c
*** static void exec_move_row(PLpgSQL_execst
*** 188,201 
  static HeapTuple make_tuple_from_row(PLpgSQL_execstate *estate,
  	PLpgSQL_row *row,
  	TupleDesc tupdesc);
! static char *convert_value_to_string(Datum value, Oid valtype);
! static Datum exec_cast_value(Datum value, Oid valtype,
  Oid reqtype,
  FmgrInfo *reqinput,
  Oid reqtypioparam,
  int32 reqtypmod,
  bool isnull);
! static Datum exec_simple_cast_value(Datum value, Oid valtype,
  	   Oid reqtype, int32 reqtypmod,
  	   bool isnull);
  static void exec_init_tuple_store(PLpgSQL_execstate *estate);
--- 188,204 
  static HeapTuple make_tuple_from_row(PLpgSQL_execstate *estate,
  	PLpgSQL_row *row,
  	TupleDesc tupdesc);
! static char *convert_value_to_string(PLpgSQL_execstate *estate,
! 	Datum value, Oid valtype);
! static Datum exec_cast_value(PLpgSQL_execstate *estate,
! Datum value, Oid valtype,
  Oid reqtype,
  FmgrInfo *reqinput,
  Oid reqtypioparam,
  int32 reqtypmod,
  bool isnull);
! static Datum exec_simple_cast_value(PLpgSQL_execstate *estate,
! 	   Datum value, Oid valtype,
  	   Oid reqtype, int32 reqtypmod,
  	   bool isnull);
  static void exec_init_tuple_store(PLpgSQL_execstate *estate);
*** plpgsql_exec_function(PLpgSQL_function *
*** 430,436 
  		else
  		{
  			/* Cast value to proper type */
! 			estate.retval = exec_cast_value(estate.retval, estate.rettype,
  			func->fn_rettype,
  			&(func->fn_retinput),
  			func->fn_rettypioparam,
--- 433,440 
  		else
  		{
  			/* Cast value to proper type */
! 			estate.retval = exec_cast_value(&estate, estate.retval,
! 			estate.rettype,
  			func->fn_rettype,
  			&(func->fn_retinput),
  			func->fn_rettypioparam,
*** exec_stmt_fori(PLpgSQL_execstate *estate
*** 1757,1763 
  	 * Get the value of the lower bound
  	 */
  	value = exec_eval_expr(estate, stmt->lower, &isnull, &valtype);
! 	value = exec_cast_value(value, valtype, var->datatype->typoid,
  			&(var->datatype->typinput),
  			var->datatype->typioparam,
  			var->datatype->atttypmod, isnull);
--- 1761,1767 
  	 * Get the value of the lower bound
  	 */
  	value = exec_eval_expr(estate, stmt->lower, &isnull, &valtype);
! 	value = exec_cast_value(estate, value, valtype, var->datatype->typoid,
  			&(var->datatype->typinput),
  			var->datatype->typioparam,
  			var->datatype->atttypmod, isnull);
*** exec_stmt_fori(PLpgSQL_execstate *estate
*** 1772,1778 
  	 * Get the value of the upper bound
  	 */
  	value = exec_eval_expr(estate, stmt->upper, &isnull, &valtype);
! 	value = exec_cast_value(value, valtype, var->datatype->typoid,
  			&(var->datatype->typinput),
  			var->datatype->typioparam,
  			var->datatype->atttypmod, isnull);
--- 1776,1782 
  	 * Get the value of the upper bound
  	 */
  	value = exec_eval_expr(estate, stmt->upper, &isnull, &valtype);
! 	value = exec_cast_value(estate, value, valtype, var->datatype->typoid,
  			&(var->datatype->typinput),
  			var->datatype->typioparam,
  			var->datatype->atttypmod, isnull);
*** exec_stmt_fori(PLpgSQL_execstate *estate
*** 1789,1795 
  	if (stmt-step)
  	{
  		value = exec_eval_expr(estate, stmt->step, &isnull, &valtype);
! 		value = exec_cast_value(value, valtype, var->datatype->typoid,
  &(var->datatype->typinput),
  var->datatype->typioparam,
  var->datatype->atttypmod, 

[HACKERS] Report: race conditions in WAL replay routines

2012-02-05 Thread Tom Lane
Pursuant to the recent discussion about bugs 6200 and 6245, I went
trawling through all the WAL redo routines looking for bugs such as
inadequate locking or transiently trashing shared buffers.  Here's
what I found:

* As we already knew, RestoreBkpBlocks is broken because it transiently
trashes a shared buffer, which another process could be accessing while
holding only a pin.

* seq_redo() has the same disease, since we allow SELECT * FROM
sequences.

* Everything else seems to be free of that specific issue; in particular
the index-related replay routines are at fairly low risk since we don't
have any coding rules allowing index pages to be examined without
holding a buffer lock.

* There are assorted replay routines that assume they can whack fields
of ShmemVariableCache around without any lock.  However, it's pretty
inconsistent; about half do it like that, while the other half assume
that they can read ShmemVariableCache without lock but should acquire
lock to modify it.  I think the latter coding rule is a whole lot safer
in the presence of Hot Standby and should be adopted throughout.
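To spell out the rule I mean, an illustrative sketch (not code from any
existing redo routine; the function name is made up):

#include "postgres.h"
#include "access/transam.h"
#include "storage/lwlock.h"

/*
 * During replay, reading ShmemVariableCache without a lock is tolerated,
 * but modifications take the lock in case hot-standby backends are looking.
 */
static void
advance_nextXid_during_replay(TransactionId newNextXid)
{
	/* unlocked read is deemed OK here */
	if (TransactionIdFollowsOrEquals(newNextXid, ShmemVariableCache->nextXid))
	{
		/* ... but take the lock to modify */
		LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
		ShmemVariableCache->nextXid = newNextXid;
		LWLockRelease(XidGenLock);
	}
}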

* Same goes for MultiXactSetNextMXact and MultiXactAdvanceNextMXact.
It's not entirely clear to me that no read-only transaction can ever
examine the shared-memory variables they change.  In any case, if there
is in fact no other process examining those variables, there can be no
contention for the lock so it should be cheap to get.

* Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
if it fails to clean out tablespace directories.  This seems to me to be
the height of folly, especially when the failure is more or less an
expected case.  If the error occurs the database is dead in the water,
because that error is actually a PANIC and will recur on subsequent
restart attempts.  Therefore there is no way to recover short of manual
intervention to clean out the non-empty directory.  And why are we
pulling the fire alarm like this?  Well, uh, it's because we might fail
to recover some disk space in the dropped tablespace.  Seems to me to be
a lot better to just elog(LOG) and move on.  This is quite analogous to
the case of failing to unlink a file after commit --- wasting disk space
might be bad, but it's very much the lesser evil compared to this.

Barring objections I'm going to fix all this stuff and back-patch as
far as 9.0 where hot standby was added.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON output functions.

2012-02-05 Thread Stefan Keller
Hi Andrew

Nice work!

Just for completeness: Did you also think of including geometry types
in JSON output functions in later releases? There's a nice extension
of JSON called GeoJSON for a starting point.

Yours, Stefan


2012/2/3 Andrew Dunstan and...@dunslane.net:


 On 02/02/2012 12:20 PM, Pavel Stehule wrote:

 2012/2/2 Andrew Dunstan and...@dunslane.net:


 On 02/02/2012 04:35 AM, Abhijit Menon-Sen wrote:

 At 2012-02-01 18:48:28 -0500, andrew.duns...@pgexperts.com wrote:

 For now I'm inclined not to proceed with that, and leave it as an
 optimization to be considered later if necessary. Thoughts?

 I agree, there doesn't seem to be a pressing need to do it now.


 OK, here's my final version of the patch for constructor functions. If
 there's no further comment I'll go with this.

 These functions are super, thank you.

 Do you plan to fix an issue with row attribute names in 9.2?




 Yeah. Tom did some initial work which he published here:
 http://archives.postgresql.org/message-id/28413.1321500388%40sss.pgh.pa.us,
 noting:

   It's not really ideal with respect to
   the ValuesScan case, because what you get seems to always be the
   hard-wired columnN names for VALUES columns, even if you try to
   override that with an alias
   ...
   Curiously, it works just fine if the VALUES can be folded

 and later he said:

   Upon further review, this patch would need some more work even for the
   RowExpr case, because there are several places that build RowExprs
   without bothering to build a valid colnames list.  It's clearly soluble
   if anyone cares to put in the work, but I'm not personally excited
   enough to pursue it ..

 I'm going to look at that issue first, since the unfolded VALUES clause
 seems like something of an obscure corner case. Feel free to chime in if you
 can.


 cheers


 andrew


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-02-05 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
 If we add fsync calls to the initdb process, they should cover the entire 
 data
 directory tree.  This patch syncs files that initdb.c writes, but we ought to
 also sync files that bootstrap-mode backends had written.

 It doesn't make sense for initdb to take responsibility to sync files
 created by the backend.

No, but the more interesting question is whether bootstrap mode troubles
to fsync its writes.  I'm not too sure about that either way ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON output functions.

2012-02-05 Thread Andrew Dunstan



On 02/05/2012 02:31 PM, Stefan Keller wrote:

Hi Andrew

Nice work!

Just for completeness: Did you also think of including geometry types
in JSON output functions in later releases? There's a nice extension
of JSON called GeoJSON for a starting point.



[side note: please don't top-reply on -hackers. See 
http://idallen.com/topposting.html]


Currently, in array_to_json and row_to_json the only special cases are:

 * record types are output as JSON records
 * array types are output as JSON arrays
 * numeric types are output without quoting
 * boolean types are output as unquoted true or false
 * NULLs are output as NULL


Everything else is output as its text representation, suitably quoted 
and escaped.


If you want to change how those operate, now rather than later would be 
the best time. But later we could certainly add various other 
foo_to_json functions for things like geometry types and hstores.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-05 Thread Kohei KaiGai
2012/2/1 12:15 Shigeru Hanada shigeru.han...@gmail.com:
 (2012/01/30 4:39), Kohei KaiGai wrote:
 I checked the fdw_helper_funcs_v3.patch, pgsql_fdw_v5.patch and
 pgsql_fdw_pushdown_v1.patch. My comments are below.

 Thanks for the review!

 [BUG]
 Even though pgsql_fdw tries to push down qualifiers that are executable
 on the remote side at deparseSql(), it does not remove the pushed-down
 qualifiers from baserel->baserestrictinfo, thus these qualifiers are
 eventually executed twice.

 See the result of this EXPLAIN.
postgres=# EXPLAIN SELECT * FROM ft1 WHERE a  2 AND f_leak(b);
  QUERY PLAN

 --
 Foreign Scan on ft1  (cost=107.43..122.55 rows=410 width=36)
   Filter: (f_leak(b) AND (a  2))
   Remote SQL: DECLARE pgsql_fdw_cursor_0 SCROLL CURSOR FOR SELECT
 a, b FROM public.t1 WHERE (a  2)
(3 rows)

 My expectation is that (a  2) is executed on the remote side and f_leak(b)
 is executed on the local side. But the filter of the foreign scan on ft1 has
 both qualifiers. A RestrictInfo has to be removed once it gets pushed down.

 It's intentional that pgsql_fdw keeps pushed-down qualifiers in
 baserestrictinfo, because I saw some undesirable behavior when I
 implemented that optimization and a plan was reused, but the details
 aren't clear to me now.  I'll try to recall what I saw...

 BTW, I think evaluating pushed-down qualifiers again on the local side is
 safe and has no semantic problem, though we must pay a little for the
 overhead.  Do you have concerns about the performance?

Yes. In my opinion, one significant benefit of pgsql_fdw is executing
qualifiers on the distributed nodes; that lets us utilize multiple
CPU resources efficiently.
Duplicate checks are a reliable way to keep invisible tuples filtered
out, indeed, but they degrade one of the competitive characteristics of
pgsql_fdw.

https://github.com/kaigai/pg_strom/blob/master/plan.c#L693
In my module, qualifiers that are executable on the device side are detached
from baserel->baserestrictinfo, and the remaining qualifiers are chained
back into the list.
The is_device_executable_qual() is equivalent to is_foreign_expr() in
the pgsql_fdw module.
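A sketch of the equivalent in pgsql_fdw terms (illustrative only; the
splitting helper and its callback parameter are made up, loosely modeled on
the pg_strom code above):

#include "postgres.h"
#include "nodes/pg_list.h"
#include "nodes/relation.h"

/*
 * Keep only locally-checked quals in baserestrictinfo and collect the
 * push-down-able ones separately for deparsing into the remote SQL.
 */
static List *
split_pushdown_quals(PlannerInfo *root, RelOptInfo *baserel,
					 bool (*remote_ok) (PlannerInfo *, Expr *))
{
	List	   *remote_quals = NIL;
	List	   *local_quals = NIL;
	ListCell   *lc;

	foreach(lc, baserel->baserestrictinfo)
	{
		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

		if (remote_ok(root, rinfo->clause))
			remote_quals = lappend(remote_quals, rinfo);
		else
			local_quals = lappend(local_quals, rinfo);
	}
	baserel->baserestrictinfo = local_quals;	/* no duplicate local check */
	return remote_quals;						/* deparse these remotely */
}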

Of course, it is your decision, and I might miss something.

BTW, what was the undesirable behavior in your previous implementation?

 [Design comment]
 I'm not sure why store_result() uses MessageContext to save
 the Tuplestorestate within PgsqlFdwExecutionState.
 The source code comment says it is used to avoid memory leaks in error
 cases. I had a similar experience implementing my own FDW module,
 so I understand that the per-scan context has already been cleared by the
 time the resource-release callback runs, thus handles to external resources
 have to be saved in a separate memory context.
 In my personal opinion, the right design is to declare a memory context for
 pgsql_fdw itself, instead of reusing an existing memory context.
 (An even better design is to define a sub-memory-context for each foreign
 scan, then remove the sub-memory-context after releasing the handles.)

 I simply chose a built-in context with a long enough lifespan, but now I
 think that using MessageContext directly is not the recommended way.  As you
 say, creating a new context as a child of MessageContext for each scan in
 BeginForeignScan (or the first IterateForeignScan) would be better.  Please
 see the attached patch.
 
 One other option is getting rid of the tuplestore by holding result rows as
 a PGresult, and tracking it for error cases which might happen.
 A ResourceOwner callback can be used to release the PGresult on error,
 similar to PGconn.

If we could keep the set of results in a per-query memory context (thus
needing no explicit release on the error path), that would be a more ideal
design. Is it possible to implement that on top of the libpq APIs?

Please note that the per-query memory context has already been released by
the time the ResourceOwner callback is launched, so that approach won't work
if libpq requires us to release some resource.

 [Design comment]
 When should BEGIN be issued on the remote side?
 connect_pg_server() is the only chance to issue a BEGIN command
 on the remote side, when the connection is opened. However, the connection
 is kept unless an error is raised. Thus, the remote side will
 continue to work within a single transaction block, even if the local side
 iterates through pairs of BEGIN and COMMIT.
 I'd like to suggest closing the transaction block at the end of either the
 scan, the transaction, or the sub-transaction.

 Indeed, remote transactions should be terminated at some point.
 Terminating at the end of a scan seems troublesome because a connection
 might be shared by multiple scans in a query.  I'd prefer aborting the
 remote transaction at the end of the local query.  Please see
 abort_remote_tx in the attached patch.

It seems to me abort_remote_tx in ReleaseConnection() is reasonable.

Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-05 Thread Simon Riggs
On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian br...@momjian.us wrote:
 On Sat, Dec 24, 2011 at 03:56:58PM +, Simon Riggs wrote:
  Also, as far as I can see this patch usurps the page version field,
  which I find unacceptably short-sighted.  Do you really think this is
  the last page layout change we'll ever make?

 No, I don't. I hope and expect the next page layout change to
 reintroduce such a field.

 But since we're agreed now that upgrading is important, changing page
 format isn't likely to be happening until we get an online upgrade
 process. So future changes are much less likely. If they do happen, we
 have some flag bits spare that can be used to indicate later versions.
 It's not the prettiest thing in the world, but it's a small ugliness
 in return for an important feature. If there was a way without that, I
 would have chosen it.

 Have you considered the CRC might match a valid page version number?
 Is that safe?

In the proposed scheme there are two flag bits set on the page to
indicate whether the field is used as a checksum rather than a version
number. So it's possible the checksum could look like a valid page
version number, but we'd still be able to tell the difference.
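To illustrate the scheme (the flag names and bit values below are made up
for the example, not taken from the patch):

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Two spare pd_flags bits mark the shared field as holding a checksum, so
 * a checksum that happens to look like a page version number is still
 * distinguishable from a real version number.
 */
#define PD_CHECKSUM_BIT1	0x0008		/* hypothetical spare flag bits */
#define PD_CHECKSUM_BIT2	0x0010

static bool
page_field_holds_checksum(Page page)
{
	uint16		flags = ((PageHeader) page)->pd_flags;

	return (flags & (PD_CHECKSUM_BIT1 | PD_CHECKSUM_BIT2)) ==
		(PD_CHECKSUM_BIT1 | PD_CHECKSUM_BIT2);
}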

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6425: Bus error in slot_deform_tuple

2012-02-05 Thread Simon Riggs
On Sat, Feb 4, 2012 at 6:49 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 The cause here is data changing underneath the user. Your patch solves
 the most obvious error, but it still allows other problems if applying
 the backup block changes data. If the backup block doesn't do anything
 at all then we don't need to apply it either.

 This is nonsense.  What applying the backup block does is to apply the
 change that the WAL record would otherwise have applied, except we
 decided to make it store a full-page image instead.

Yep, you're right, my bad.

Got a head cold, so will lay off a few days from too much thinking.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-05 Thread Simon Riggs
On Sun, Feb 5, 2012 at 7:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Pursuant to the recent discussion about bugs 6200 and 6245, I went
 trawling through all the WAL redo routines looking for bugs such as
 inadequate locking or transiently trashing shared buffers.  Here's
 what I found:

 * As we already knew, RestoreBkpBlocks is broken because it transiently
 trashes a shared buffer, which another process could be accessing while
 holding only a pin.

Agreed

 * seq_redo() has the same disease, since we allow SELECT * FROM
 sequences.

Why do we do that?

 * Everything else seems to be free of that specific issue; in particular
 the index-related replay routines are at fairly low risk since we don't
 have any coding rules allowing index pages to be examined without
 holding a buffer lock.

Yep

 * There are assorted replay routines that assume they can whack fields
 of ShmemVariableCache around without any lock.  However, it's pretty
 inconsistent; about half do it like that, while the other half assume
 that they can read ShmemVariableCache without lock but should acquire
 lock to modify it.  I think the latter coding rule is a whole lot safer
 in the presence of Hot Standby and should be adopted throughout.

Agreed

 * Same goes for MultiXactSetNextMXact and MultiXactAdvanceNextMXact.
 It's not entirely clear to me that no read-only transaction can ever
 examine the shared-memory variables they change.  In any case, if there
 is in fact no other process examining those variables, there can be no
 contention for the lock so it should be cheap to get.

Row locking requires a WAL record to be written, so that whole path is
dead during HS.

 * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
 if it fails to clean out tablespace directories.  This seems to me to be
 the height of folly, especially when the failure is more or less an
 expected case.  If the error occurs the database is dead in the water,
 because that error is actually a PANIC and will recur on subsequent
 restart attempts.  Therefore there is no way to recover short of manual
 intervention to clean out the non-empty directory.  And why are we
 pulling the fire alarm like this?  Well, uh, it's because we might fail
 to recover some disk space in the dropped tablespace.  Seems to me to be
 a lot better to just elog(LOG) and move on.  This is quite analogous to
 the case of failing to unlink a file after commit --- wasting disk space
 might be bad, but it's very much the lesser evil compared to this.

If the sysadmin is managing the db properly then this shouldn't ever
happen - the only cause is if the tablespace being dropped is being
used as a temp tablespace on the standby.

The ERROR is appropriate because we first try to remove the files. If
they won't go we then raise an all-session conflict and then try
again. Only when we fail the second time does it ERROR, which seems
OK.

If you just LOG, when exactly would we get rid of the tablespace?

 Barring objections I'm going to fix all this stuff and back-patch as
 far as 9.0 where hot standby was added.

Please post the patch rather than fixing directly. There's some subtle
stuff there and it would be best to discuss first.

Thanks

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-05 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Sun, Feb 5, 2012 at 7:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 * seq_redo() has the same disease, since we allow SELECT * FROM
 sequences.

 Why do we do that?

It's the only existing way to obtain the parameters of a sequence.
Even if we felt like inventing a different API for doing that, it'd take
years to change every client that knows about this.

 * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
 if it fails to clean out tablespace directories.  This seems to me to be
 the height of folly, especially when the failure is more or less an
 expected case.  If the error occurs the database is dead in the water,
 because that error is actually a PANIC and will recur on subsequent
 restart attempts.  Therefore there is no way to recover short of manual
 intervention to clean out the non-empty directory.  And why are we
 pulling the fire alarm like this?  Well, uh, it's because we might fail
 to recover some disk space in the dropped tablespace.  Seems to me to be
 a lot better to just elog(LOG) and move on.  This is quite analogous to
 the case of failing to unlink a file after commit --- wasting disk space
 might be bad, but it's very much the lesser evil compared to this.

 If the sysadmin is managing the db properly then this shouldn't ever
 happen - the only cause is if the tablespace being dropped is being
 used as a temp tablespace on the standby.

Right, but that is an expected/foreseeable situation.  It should not
lead to a dead-and-unrestartable database.

 If you just LOG, when exactly would we get rid of the tablespace?

The tablespace *is* gone, or at least its catalog entries are.  All we
are trying to do here is release some underlying disk space.  It's
exactly analogous to the case where we drop a table and then find (post
commit) that unlinking the disk file fails for some weird reason.
We've done what we can to clean the disk space and should just let it
go --- there is no risk to database integrity in leaving some files
behind, so killing the server is a huge overreaction.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-05 Thread Dan Scales
Thanks for the detailed followup.  I do see how Postgres is tuned for
having a bunch of memory available that is not in shared_buffers, both
for the OS buffer cache and other memory allocations.  However, Postgres
seems to run fine in many large shared_memory configurations that I
gave performance numbers for, including 5G shared_buffers for an 8G
machine, 3G shared_buffers for a 6G machine, etc.  There just has to be
sufficient extra memory beyond the shared_buffers cache.

I think the pgbench run is pointing out a problem that this double_writes
implementation has with BULK_WRITEs.  As you point out, the
BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions.
I'm not sure if there is a great solution that always works for that
issue.  However, I do notice that BULK_WRITE data isn't WAL-logged unless
archiving/replication is happening.  As I understand it, if the
BULK_WRITE data isn't being WAL-logged, then it doesn't have to be
double-written either.  The BULK_WRITE data is not officially synced and
committed until it is all written, so there doesn't have to be any
torn-page protection for that data, which is why the WAL logging can be
omitted.  The double-write implementation can be improved by marking each
buffer if it doesn't need torn-page protection.  These buffers would be
those new pages that are explicitly not WAL-logged, even when
full_page_writes is enabled.  When such a buffer is eventually synced
(perhaps because of an eviction), it would not be double-written.  This
would often avoid double-writes for BULK_WRITE, etc., especially since
the administrator is often not archiving or doing replication when doing
bulk loads.
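As a very rough sketch of that refinement (the flag bit, the helper and its
name below are all hypothetical, not part of the posted patch):

#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/smgr.h"

#define BM_NO_TORN_PAGE_RISK	(1 << 15)	/* imagine a spare flag bit */

extern void write_through_double_write_buffer(SMgrRelation reln,
								   ForkNumber forknum,
								   BlockNumber blocknum,
								   char *bufToWrite);

static void
write_evicted_buffer(BufferDesc *bufHdr, SMgrRelation reln,
					 ForkNumber forknum, BlockNumber blocknum,
					 char *bufToWrite)
{
	if (bufHdr->flags & BM_NO_TORN_PAGE_RISK)
		/* page was filled without WAL logging; a torn write is harmless */
		smgrwrite(reln, forknum, blocknum, bufToWrite, false);
	else
		write_through_double_write_buffer(reln, forknum, blocknum, bufToWrite);
}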

However, overall, I think the idea is that double writes are an optional
optimization.  The user would only turn it on in existing configurations
where it helps or only slightly hurts performance, and where greatly
reducing the size of the WAL logs is beneficial.  It might also be
especially beneficial when there is a small amount of FLASH or other
kind of fast storage that the double-write files can be stored on.

Thanks,

Dan


- Original Message -
From: Robert Haas robertmh...@gmail.com
To: Dan Scales sca...@vmware.com
Cc: PG Hackers pgsql-hackers@postgresql.org
Sent: Friday, February 3, 2012 1:48:54 PM
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

On Fri, Feb 3, 2012 at 3:14 PM, Dan Scales sca...@vmware.com wrote:
 Thanks for the feedback!  I think you make a good point about the small size 
 of dirty data in the OS cache.  I think what you can say about this 
 double-write patch is that it will not work well for configurations that 
 have a small Postgres cache and a large OS cache, since every write from the 
 Postgres cache requires double-writes and an fsync.

The general guidance for setting shared_buffers these days is 25% of
RAM up to a maximum of 8GB, so the configuration that you're
describing as not optimal for this patch is the one normally used when
running PostgreSQL.  I've run across several cases where larger values
of shared_buffers are a huge win, because the entire working set can
then be accommodated in shared_buffers.  But it's certainly not the
case that all working sets fit.

And in this case, I think that's beside the point anyway.  I had
shared_buffers set to 8GB on a machine with much more memory than
that, but the database created by pgbench -i -s 10 is about 156 MB, so
the problem isn't that there is too little PostgreSQL cache available.
 The entire database fits in shared_buffers, with most of it left
over.  However, because of the BufferAccessStrategy stuff, pages start
to get forced out to the OS pretty quickly.  Of course, we could
disable the BufferAccessStrategy stuff when double_writes is in use,
but bear in mind that the reason we have it in the first place is to
prevent cache trashing effects.  It would be imprudent of us to throw
that out the window without replacing it with something else that
would provide similar protection.  And even if we did, that would just
delay the day of reckoning.  You'd be able to blast through and dirty
the entirety of shared_buffers at top speed, but then as soon as you
started replacing pages performance would slow to an utter crawl, just
as it did here, only you'd need a bigger scale factor to trigger the
problem.

The more general point here is that there are MANY aspects of
PostgreSQL's design that assume that shared_buffers accounts for a
relatively small percentage of system memory.  Here's another one: we
assume that backends that need temporary memory for sorts and hashes
(i.e. work_mem) can just allocate it from the OS.  If we were to start
recommending setting shared_buffers to large percentages of the
available memory, we'd probably have to rethink that.  Most likely,
we'd need some kind of in-core mechanism for allocating temporary
memory from the shared memory segment.  And here's yet another one: we
assume that it is better to 

Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-05 Thread Simon Riggs
On Sun, Feb 5, 2012 at 9:03 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
 if it fails to clean out tablespace directories.  This seems to me to be
 the height of folly, especially when the failure is more or less an
 expected case.  If the error occurs the database is dead in the water,
 because that error is actually a PANIC and will recur on subsequent
 restart attempts.  Therefore there is no way to recover short of manual
 intervention to clean out the non-empty directory.  And why are we
 pulling the fire alarm like this?  Well, uh, it's because we might fail
 to recover some disk space in the dropped tablespace.  Seems to me to be
 a lot better to just elog(LOG) and move on.  This is quite analogous to
 the case of failing to unlink a file after commit --- wasting disk space
 might be bad, but it's very much the lesser evil compared to this.

 If the sysadmin is managing the db properly then this shouldn't ever
 happen - the only cause is if the tablespace being dropped is being
 used as a temp tablespace on the standby.

 Right, but that is an expected/foreseeable situation.  It should not
 lead to a dead-and-unrestartable database.

 If you just LOG, when exactly would we get rid of the tablespace?

 The tablespace *is* gone, or at least its catalog entries are.  All we
 are trying to do here is release some underlying disk space.  It's
 exactly analogous to the case where we drop a table and then find (post
 commit) that unlinking the disk file fails for some weird reason.
 We've done what we can to clean the disk space and should just let it
 go --- there is no risk to database integrity in leaving some files
 behind, so killing the server is a huge overreaction.

I agree the tablespace entries are gone, but that won't stop existing
users from continuing.

If we're not sure of the reason why tablespace removal fails it
doesn't seem safe to continue to me.

But since this is a rare corner case, and we already try to remove
users, then LOG seems OK.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-05 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 Please post the patch rather than fixing directly. There's some subtle
 stuff there and it would be best to discuss first.

Here's a proposed patch for the issues around unlocked updates of
shared-memory state.  After going through this I believe that there is
little risk of any real problems in the current state of the code; this
is more in the nature of future-proofing against foreseeable changes.
(One such is that we'd discussed fixing the age() function to work
during Hot Standby.)  So I suggest applying this to HEAD but not
back-patching.

Except for one thing.  I realized while looking at the NEXTOID replay
code that it is completely broken: it only advances
ShmemVariableCache->nextOid when that's less than the value in the WAL
record.  So that comparison fails if the OID counter wraps around during
replay.  I've fixed this in the attached patch by just forcibly
assigning the new value instead of trying to be smart, and I think
probably that aspect of it needs to be back-patched.
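In other words, the NEXTOID replay case ends up looking roughly like this
(a simplified sketch, not necessarily the exact committed form):

#include "postgres.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "storage/lwlock.h"

/*
 * Forcibly install the logged value instead of only advancing when it is
 * larger, so wraparound of the OID counter during replay is handled.
 */
static void
redo_nextoid(XLogRecord *record)
{
	Oid			nextOid;

	memcpy(&nextOid, XLogRecGetData(record), sizeof(Oid));
	LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
	ShmemVariableCache->nextOid = nextOid;
	ShmemVariableCache->oidCount = 0;	/* force a fresh OID allocation batch */
	LWLockRelease(OidGenLock);
}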

regards, tom lane

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 454ca310f339db34a4cc352899fea1d663abf93b..0f4cea124d7436380c730203e6cfd1518bc5d3b2 100644
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
*** CheckPointMultiXact(void)
*** 1655,1662 
   * Set the next-to-be-assigned MultiXactId and offset
   *
   * This is used when we can determine the correct next ID/offset exactly
!  * from a checkpoint record.  We need no locking since it is only called
!  * during bootstrap and XLog replay.
   */
  void
  MultiXactSetNextMXact(MultiXactId nextMulti,
--- 1655,1663 
   * Set the next-to-be-assigned MultiXactId and offset
   *
   * This is used when we can determine the correct next ID/offset exactly
!  * from a checkpoint record.  Although this is only called during bootstrap
!  * and XLog replay, we take the lock in case any hot-standby backends are
!  * examining the values.
   */
  void
  MultiXactSetNextMXact(MultiXactId nextMulti,
*** MultiXactSetNextMXact(MultiXactId nextMu
*** 1664,1671 
--- 1665,1674 
  {
  	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
  nextMulti, nextMultiOffset);
+ 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
  	MultiXactState->nextMXact = nextMulti;
  	MultiXactState->nextOffset = nextMultiOffset;
+ 	LWLockRelease(MultiXactGenLock);
  }
  
  /*
*** MultiXactSetNextMXact(MultiXactId nextMu
*** 1674,1685 
   *
   * This is used when we can determine minimum safe values from an XLog
   * record (either an on-line checkpoint or an mxact creation log entry).
!  * We need no locking since it is only called during XLog replay.
   */
  void
  MultiXactAdvanceNextMXact(MultiXactId minMulti,
  		  MultiXactOffset minMultiOffset)
  {
  	if (MultiXactIdPrecedes(MultiXactState->nextMXact, minMulti))
  	{
  		debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
--- 1677,1690 
   *
   * This is used when we can determine minimum safe values from an XLog
   * record (either an on-line checkpoint or an mxact creation log entry).
!  * Although this is only called during XLog replay, we take the lock in case
!  * any hot-standby backends are examining the values.
   */
  void
  MultiXactAdvanceNextMXact(MultiXactId minMulti,
  		  MultiXactOffset minMultiOffset)
  {
+ 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
  	if (MultiXactIdPrecedes(MultiXactState->nextMXact, minMulti))
  	{
  		debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
*** MultiXactAdvanceNextMXact(MultiXactId mi
*** 1691,1696 
--- 1696,1702 
  	minMultiOffset);
  		MultiXactState->nextOffset = minMultiOffset;
  	}
+ 	LWLockRelease(MultiXactGenLock);
  }
  
  /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6e84cd0a21671486693e7f94d5fda8efdedf4bb4..3d08e92d3a747cc9426206dc0623ff390f18b09c 100644
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
*** PrescanPreparedTransactions(TransactionI
*** 1715,1720 
--- 1715,1724 
  			/*
  			 * Examine subtransaction XIDs ... they should all follow main
  			 * XID, and they may force us to advance nextXid.
+ 			 *
+ 			 * We don't expect anyone else to modify nextXid, hence we don't
+ 			 * need to hold a lock while examining it.  We still acquire the
+ 			 * lock to modify it, though.
  			 */
  			subxids = (TransactionId *)
  (buf + MAXALIGN(sizeof(TwoPhaseFileHeader)));
*** PrescanPreparedTransactions(TransactionI
*** 1726,1733 
--- 1730,1739 
  if (TransactionIdFollowsOrEquals(subxid,
   ShmemVariableCache->nextXid))
  {
+ 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
  	ShmemVariableCache->nextXid = subxid;
  	

Re: [HACKERS] initdb and fsync

2012-02-05 Thread Noah Misch
On Sun, Feb 05, 2012 at 10:53:20AM -0800, Jeff Davis wrote:
 On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
  If we add fsync calls to the initdb process, they should cover the entire 
  data
  directory tree.  This patch syncs files that initdb.c writes, but we ought 
  to
  also sync files that bootstrap-mode backends had written.
 
 It doesn't make sense for initdb to take responsibility to sync files
 created by the backend. If there are important files that the backend
 creates, it should be the backend's responsibility to fsync them (and
 their parent directory, if needed). And if they are unimportant to the
 backend, then there is no reason for initdb to fsync them.

I meant primarily to illustrate the need to be comprehensive, not comment on
which executable should fsync a particular file.  Bootstrap-mode backends do
not sync anything during an initdb run on my system.  With your patch, we'll
fsync a small handful of files and leave nearly everything else vulnerable.

That being said, having each backend fsync its own writes will mean syncing
certain files several times within a single initdb run.  If the penalty from
that proves high enough, we may do well to instead have initdb.c sync
everything just once.
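The "initdb syncs everything once" alternative is basically a recursive
walk over the data directory.  A rough POSIX sketch (not the eventual
patch; error handling is minimal and symlinks are skipped):

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Recursively fsync every regular file and directory under "path". */
static void
fsync_recursively(const char *path)
{
	DIR		   *dir;
	struct dirent *de;
	struct stat st;
	char		sub[4096];
	int			fd;

	if (lstat(path, &st) != 0)
		return;

	if (S_ISDIR(st.st_mode))
	{
		if ((dir = opendir(path)) != NULL)
		{
			while ((de = readdir(dir)) != NULL)
			{
				if (strcmp(de->d_name, ".") == 0 ||
					strcmp(de->d_name, "..") == 0)
					continue;
				snprintf(sub, sizeof(sub), "%s/%s", path, de->d_name);
				fsync_recursively(sub);
			}
			closedir(dir);
		}
	}
	else if (!S_ISREG(st.st_mode))
		return;					/* skip symlinks, sockets, etc. */

	/* some platforms dislike fsync on read-only file descriptors */
	fd = open(path, S_ISDIR(st.st_mode) ? O_RDONLY : O_RDWR);
	if (fd >= 0)
	{
		(void) fsync(fd);
		close(fd);
	}
}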

  initdb should do these syncs by default and offer an option to disable them.
 
 For test frameworks that run initdb often, that makes sense.
 
 But for developers, it doesn't make sense to spend 0.5s typing an option
 that saves you 0.3s. So, we'd need some more convenient way to choose
 the no-fsync option, like an environment variable that developers can
 set. Or maybe developers don't care about 0.3s?

Developers have shell aliases/functions/scripts and command history.  I
wouldn't object to having an environment variable control it, but I would not
personally find that more convenient than a command-line switch.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-05 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 Please post the patch rather than fixing directly. There's some subtle
 stuff there and it would be best to discuss first.

And here's a proposed patch for not throwing ERROR during replay of DROP
TABLESPACE.  I had originally thought this would be a one-liner
s/ERROR/LOG/, but on inspection destroy_tablespace_directories() really
needs to be changed too, so that it doesn't throw error for unremovable
directories.

regards, tom lane

diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 8ed6d06bb2370a54349f2a86ee13e08a381eacf2..ed0c9ea3ee5a372c6926f41fc0fc93ed7d80d6cd 100644
*** a/src/backend/commands/tablespace.c
--- b/src/backend/commands/tablespace.c
*** create_tablespace_directories(const char
*** 626,634 
  /*
   * destroy_tablespace_directories
   *
!  * Attempt to remove filesystem infrastructure
   *
!  * 'redo' indicates we are redoing a drop from XLOG; okay if nothing there
   *
   * Returns TRUE if successful, FALSE if some subdirectory is not empty
   */
--- 626,638 
  /*
   * destroy_tablespace_directories
   *
!  * Attempt to remove filesystem infrastructure for the tablespace.
   *
!  * 'redo' indicates we are redoing a drop from XLOG; in that case we should
!  * not throw an ERROR for problems, just LOG them.  The worst consequence of
!  * not removing files here would be failure to release some disk space, which
!  * does not justify throwing an error that would require manual intervention
!  * to get the database running again.
   *
   * Returns TRUE if successful, FALSE if some subdirectory is not empty
   */
*** destroy_tablespace_directories(Oid table
*** 681,686 
--- 685,700 
  			pfree(linkloc_with_version_dir);
  			return true;
  		}
+ 		else if (redo)
+ 		{
+ 			/* in redo, just log other types of error */
+ 			ereport(LOG,
+ 	(errcode_for_file_access(),
+ 	 errmsg("could not open directory \"%s\": %m",
+ 			linkloc_with_version_dir)));
+ 			pfree(linkloc_with_version_dir);
+ 			return false;
+ 		}
  		/* else let ReadDir report the error */
  	}
  
*** destroy_tablespace_directories(Oid table
*** 704,710 
  
  		/* remove empty directory */
  		if (rmdir(subfile) < 0)
! 			ereport(ERROR,
  	(errcode_for_file_access(),
  	 errmsg("could not remove directory \"%s\": %m",
  			subfile)));
--- 718,724 
  
  		/* remove empty directory */
  		if (rmdir(subfile) < 0)
! 			ereport(redo ? LOG : ERROR,
  	(errcode_for_file_access(),
  	 errmsg("could not remove directory \"%s\": %m",
  			subfile)));
*** destroy_tablespace_directories(Oid table
*** 716,738 
  
  	/* remove version directory */
  	if (rmdir(linkloc_with_version_dir) < 0)
! 		ereport(ERROR,
  (errcode_for_file_access(),
   errmsg("could not remove directory \"%s\": %m",
  		linkloc_with_version_dir)));
  
  	/*
  	 * Try to remove the symlink.  We must however deal with the possibility
  	 * that it's a directory instead of a symlink --- this could happen during
  	 * WAL replay (see TablespaceCreateDbspace), and it is also the case on
  	 * Windows where junction points lstat() as directories.
  	 */
  	linkloc = pstrdup(linkloc_with_version_dir);
  	get_parent_directory(linkloc);
  	if (lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))
  	{
  		if (rmdir(linkloc) < 0)
! 			ereport(ERROR,
  	(errcode_for_file_access(),
  	 errmsg("could not remove directory \"%s\": %m",
  			linkloc)));
--- 730,759 
  
  	/* remove version directory */
  	if (rmdir(linkloc_with_version_dir) < 0)
! 	{
! 		ereport(redo ? LOG : ERROR,
  (errcode_for_file_access(),
   errmsg("could not remove directory \"%s\": %m",
  		linkloc_with_version_dir)));
+ 		pfree(linkloc_with_version_dir);
+ 		return false;
+ 	}
  
  	/*
  	 * Try to remove the symlink.  We must however deal with the possibility
  	 * that it's a directory instead of a symlink --- this could happen during
  	 * WAL replay (see TablespaceCreateDbspace), and it is also the case on
  	 * Windows where junction points lstat() as directories.
+ 	 *
+ 	 * Note: in the redo case, we'll return true if this final step fails;
+ 	 * there's no point in retrying it.
  	 */
  	linkloc = pstrdup(linkloc_with_version_dir);
  	get_parent_directory(linkloc);
  	if (lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))
  	{
  		if (rmdir(linkloc) < 0)
! 			ereport(redo ? LOG : ERROR,
  	(errcode_for_file_access(),
  	 errmsg("could not remove directory \"%s\": %m",
  			linkloc)));
*** destroy_tablespace_directories(Oid table
*** 740,746 
  	else
  	{
  		if (unlink(linkloc) < 0)
! 			ereport(ERROR,
  	(errcode_for_file_access(),
  	 errmsg("could not remove symbolic link \"%s\": %m",
  			linkloc)));
--- 761,767 
  	else
  	{
  		if (unlink(linkloc) < 0)
! 			ereport(redo ? LOG : ERROR,
  	(errcode_for_file_access(),
  	 errmsg(could not remove 

Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-05 Thread Shigeru Hanada

2012/2/2 Marko Kreen mark...@gmail.com:

I think you want this instead:

 https://commitfest.postgresql.org/action/patch_view?id=769


Somehow I've missed this cool feature.  Thanks for the suggestion!

--
Shigeru Hanada


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] disable prompting by default in createuser

2012-02-05 Thread Josh Kupershmidt
On Wed, Feb 1, 2012 at 1:13 PM, Peter Eisentraut pete...@gmx.net wrote:
 On sön, 2012-01-15 at 18:14 -0500, Josh Kupershmidt wrote:
 I see this patch includes a small change to dropuser, to make the
 'username' argument mandatory if --interactive is not set, for
 symmetry with createuser's new behavior. That's dandy, though IMO we
 shouldn't have -i be shorthand for --interactive with dropuser,
 and something different with createuser (i.e. we should just get rid
 of the i alias for dropuser).

 Well, all the other tools also support -i for prompting.

Taking a look at the current ./src/bin/scripts executables, I see only
2 out of 9 (`dropdb` and `dropuser`) which have -i mean
--interactive, and `reindexdb` has another meaning for -i
entirely. So I'm not sure there's such a clear precedent for having
-i mean --interactive within our scripts, at least.

 I'd rather get
 rid of -i for --inherit, but I fear that will break things as well.  I'm
 not sure what to do.

I think breaking backwards compatibility probably won't fly (and
should probably be handled by another patch, anyway). I guess it's OK
to keep the patch's current behavior, given we are already
inconsistent about what -i means.

 i.e. createuser tries taking either $PGUSER or the current username as
 a default user to create, while dropuser just bails out. Personally, I
 prefer just bailing out if no create/drop user is specified, but
 either way I think they should be consistent.

 That is intentional long-standing behavior.  createdb/dropdb work the
 same way.

OK.

Josh

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-05 Thread Bruce Momjian
On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote:
 On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian br...@momjian.us wrote:
  On Sat, Dec 24, 2011 at 03:56:58PM +, Simon Riggs wrote:
   Also, as far as I can see this patch usurps the page version field,
   which I find unacceptably short-sighted.  Do you really think this is
   the last page layout change we'll ever make?
 
  No, I don't. I hope and expect the next page layout change to
  reintroduce such a field.
 
  But since we're agreed now that upgrading is important, changing page
  format isn't likely to be happening until we get an online upgrade
  process. So future changes are much less likely. If they do happen, we
  have some flag bits spare that can be used to indicate later versions.
  It's not the prettiest thing in the world, but it's a small ugliness
  in return for an important feature. If there was a way without that, I
  would have chosen it.
 
  Have you considered the CRC might match a valid page version number?
  Is that safe?
 
 In the proposed scheme there are two flag bits set on the page to
 indicate whether the field is used as a checksum rather than a version
  number. So it's possible the checksum could look like a valid page
 version number, but we'd still be able to tell the difference.

Thanks.  Clearly we don't need 16 bits to represent our page version
number because we rarely change it.  The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.  

I wonder if we should just dedicate 3 page header bits, call that the
page version number, set this new version number to 1, and treat all
previous versions as zero, so that readers check the old page version
location whenever the new version number reads zero.  I am basically
thinking of how we can plan ahead to move the version number to a new
location and have a defined way of finding the page version number under
both the old and new schemes.
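Something along these lines, purely to illustrate the idea (none of the
names or bit positions below exist; they are made up for the sketch):

#include "postgres.h"
#include "storage/bufpage.h"

#define PD_VERSION_MASK		0x0038		/* three hypothetical pd_flags bits */
#define PD_VERSION_SHIFT	3

static uint16
page_layout_version(Page page)
{
	PageHeader	p = (PageHeader) page;
	uint16		newver = (p->pd_flags & PD_VERSION_MASK) >> PD_VERSION_SHIFT;

	if (newver != 0)
		return newver;			/* page written under the new scheme */

	/* old scheme: version lives in the low bits of pd_pagesize_version */
	return p->pd_pagesize_version & 0x00FF;
}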

I don't want to get to a point where we need a bit per page version
number.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-05 Thread Robert Haas
On Sun, Feb 5, 2012 at 4:17 PM, Dan Scales sca...@vmware.com wrote:
 Thanks for the detailed followup.  I do see how Postgres is tuned for
 having a bunch of memory available that is not in shared_buffers, both
 for the OS buffer cache and other memory allocations.  However, Postgres
 seems to run fine in many large shared_memory configurations that I
 gave performance numbers for, including 5G shared_buffers for an 8G
 machine, 3G shared_buffers for a 6G machine, etc.  There just has to be
 sufficient extra memory beyond the shared_buffers cache.

I agree that you could probably set shared_buffers to 3GB on a 6GB
machine and get decent performance - but would it be the optimal
performance, and for what workload?  To really figure out whether this
patch is a win, you need to get the system optimally tuned for the
unpatched sources (which we can't tell whether you've done, since you
haven't posted the configuration settings or any comparative figures
for different settings, or any details on which commit you tested
against) and then get the system optimally tuned for the patched
sources with double_writes=on, and then see whether there's a gain.

 I think the pgbench run is pointing out a problem that this double_writes
 implementation has with BULK_WRITEs.  As you point out, the
 BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions.

Bulk reads will have the same problem.  Consider loading a bunch of
data into a new table with COPY, and then scanning the table.  The
table scan will be a bulk read, and every page will be dirtied as hint
bits are set.  Another thing to worry about is vacuum, which also
uses a BufferAccessStrategy.  Greg Smith has done some previous
benchmarking showing that when the kernel is too aggressive about
flushing dirty data to disk, vacuum becomes painfully slow.  I suspect
this patch is going to have that problem in spades (but it would be
good to test that).  Checkpoints might be a problem, too, since they
flush a lot of dirty data, and that's going to require a lot of extra
fsyncing with this implementation.  It certainly seems that unless you
have pg_xlog and the data directory on separate devices, each with a
battery-backed write cache, checkpoints might be really slow.  I'm not entirely
convinced they'll be fast even if you have all that (but it would be
good to test that, too).

 I'm not sure if there is a great solution that always works for that
 issue.  However, I do notice that BULK_WRITE data isn't WAL-logged unless
 archiving/replication is happening.  As I understand it, if the
 BULK_WRITE data isn't being WAL-logged, then it doesn't have to be
 double-written either.  The BULK_WRITE data is not officially synced and
 committed until it is all written, so there doesn't have to be any
 torn-page protection for that data, which is why the WAL logging can be
 omitted.  The double-write implementation can be improved by marking each
 buffer if it doesn't need torn-page protection.  These buffers would be
 those new pages that are explicitly not WAL-logged, even when
 full_page_writes is enabled.  When such a buffer is eventually synced
 (perhaps because of an eviction), it would not be double-written.  This
 would often avoid double-writes for BULK_WRITE, etc., especially since
 the administrator is often not archiving or doing replication when doing
 bulk loads.

I agree - this optimization seems like a must.  I'm not sure that it's
sufficient, but it certainly seems necessary.  It's not going to help
with VACUUM, though, so I think that case needs some careful looking
at to determine how bad the regression is and what can be done to
mitigate it.  In particular, I note that I suggested an idea that
might help in the final paragraph of my last email.
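
Roughly, and with every name here invented rather than taken from the
patch, the eviction path for that optimization might look like:

#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192                 /* PostgreSQL's default block size */

static bool double_writes = true;   /* stands in for the patch's GUC */

typedef struct BufferSketch
{
    char    page[BLCKSZ];
    bool    needs_torn_page_protection; /* false for pages written without
                                         * WAL, e.g. a BULK_WRITE into a
                                         * newly created relation */
} BufferSketch;

static void
stage_in_double_write_buffer(const char *page, int fd, off_t offset)
{
    /* append the page to the double-write area and fsync it before the
     * normal in-place write; details omitted in this sketch */
    (void) page;
    (void) fd;
    (void) offset;
}

/* On eviction or checkpoint, only protected pages pay the double-write
 * cost; unprotected bulk-load pages are simply written in place. */
static void
flush_buffer_sketch(BufferSketch *buf, int fd, off_t offset)
{
    if (double_writes && buf->needs_torn_page_protection)
        stage_in_double_write_buffer(buf->page, fd, offset);

    (void) pwrite(fd, buf->page, BLCKSZ, offset);
}

The interesting question is where needs_torn_page_protection gets
cleared; presumably that would happen wherever the decision to skip
WAL-logging for the bulk operation is made.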

My general feeling about this patch is that it needs a lot more work
before we should consider committing it.  Your tests so far overlook
quite a few important problem cases (bulk loads, SELECT on large
unhinted tables, vacuum speed, checkpoint duration, and others) and
still mostly show it losing to full_page_writes, sometimes by large
margins.  Even in the one case where you got an 8% speedup, it's not
really clear that the same speedup (or an even bigger one) couldn't
have been gotten by some other kind of tuning.  I think you really
need to spend some more time thinking about how to blunt the negative
impact on the cases where it hurts, and increase the benefit in the
cases where it helps.  The approach seems to have potential, but it
seems way too immature to think about shipping it at this point.  (You
may have been thinking along similar lines since I note that the patch
is marked WIP.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-05 Thread Heikki Linnakangas

On 06.02.2012 05:48, Bruce Momjian wrote:

On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote:

In the proposed scheme there are two flag bits set on the page to
indicate whether the field is used as a checksum rather than a version
number. So it's possible the checksum could look like a valid page
version number, but we'd still be able to tell the difference.


Thanks.  Clearly we don't need 16 bits to represent our page version
number because we rarely change it. The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.


There's this idea that I'd like to see implemented: 
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php


In any case, I think it's a very bad idea to remove the page version 
field. How are we supposed to ever be able to change the page format 
again if we don't even have a version number on the page? I strongly 
oppose removing it.


I'm also not very comfortable with the idea of having flags on the page
indicating whether it has a checksum or not. It's not hard to imagine a
software or firmware bug, or a hardware failure, that would cause the
pd_flags field to be zeroed out altogether. It would be more robust if
the expected bit-pattern were 1-0 or 0-1 rather than 0-0, but then you
need to deal with that at upgrade somehow. And it still feels a bit
whacky anyway.
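
To make that concrete, a decode step along these lines (bit positions
and names invented here) would at least notice a wiped flags field
instead of silently reading it as "no checksum":

#include <stdbool.h>
#include <stdint.h>

/* Invented assignments for the two indicator bits in pd_flags. */
#define PD_CSUM_BIT_A  0x0008
#define PD_CSUM_BIT_B  0x0010

typedef enum
{
    FIELD_IS_CHECKSUM,      /* 1-0: the 16-bit field holds a checksum */
    FIELD_IS_VERSION,       /* 0-1: the field holds a page version */
    FLAGS_LOOK_CORRUPT      /* 0-0 or 1-1: pd_flags itself looks damaged */
} ChecksumFlagState;

static ChecksumFlagState
decode_checksum_flags(uint16_t pd_flags)
{
    bool        a = (pd_flags & PD_CSUM_BIT_A) != 0;
    bool        b = (pd_flags & PD_CSUM_BIT_B) != 0;

    if (a && !b)
        return FIELD_IS_CHECKSUM;
    if (!a && b)
        return FIELD_IS_VERSION;
    return FLAGS_LOOK_CORRUPT;
}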



I wonder if we should just dedicate three page header bits to the page
version number, set this new version number to 1, treat all previous
versions as zero, and have readers look in the old page version
location whenever the new version number is zero.  I am basically
thinking of how we can plan ahead to move the version number to a new
location while keeping a defined way of finding it under both the old
and new schemes.


Three bits seems short-sighted, but yeah, something like 6-8 bits should
be enough. On the whole, though, I think we should bite the bullet and
invent a way to extend the page header at upgrade.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers