Re: [HACKERS] basic pgbench runs with various performance-related patches
On Sat, Feb 4, 2012 at 11:59 AM, Greg Smith <g...@2ndquadrant.com> wrote:
> On 01/24/2012 08:58 AM, Robert Haas wrote:
>> One somewhat odd thing about these numbers is that, on permanent tables,
>> all of the patches seemed to show regressions vs. master in single-client
>> throughput. That's a slightly difficult result to believe, though, so
>> it's probably a testing artifact of some kind.
>
> It looks like you may have run the ones against master first, then the
> ones applying various patches. The one test artifact I have to be very
> careful to avoid in that situation is that later files on the physical
> disk are slower than earlier ones. There's a 30% difference between the
> fastest part of a regular hard drive, the logical beginning, and its end.
> Multiple test runs tend to creep forward onto later sections of the disk,
> and so be biased toward the earlier run in that case. To eliminate that
> bias when it gets bad, I normally either a) run each test 3 times,
> interleaved, or b) rebuild the filesystem between each initdb. I'm not
> sure that's the problem you're running into, but it's the only one I've
> been hit by that matches the suspicious part of your results.

I don't think that's it, because tests on the various branches were interleaved; moreover, I don't believe master was the first one in the rotation. I think I had them in alphabetical order by branch name, actually.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] initdb and fsync
On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
> If we add fsync calls to the initdb process, they should cover the entire
> data directory tree. This patch syncs files that initdb.c writes, but we
> ought to also sync files that bootstrap-mode backends had written.

It doesn't make sense for initdb to take responsibility for syncing files created by the backend. If there are important files that the backend creates, it should be the backend's responsibility to fsync them (and their parent directory, if needed). And if they are unimportant to the backend, then there is no reason for initdb to fsync them.

> An optimization like the pg_flush_data() call in copy_file() may reduce
> the speed penalty.

That worked pretty well. It took the time down by about 100ms on my machine, which closes the gap significantly.

> initdb should do these syncs by default and offer an option to disable
> them.

For test frameworks that run initdb often, that makes sense. But for developers, it doesn't make sense to spend 0.5s typing an option that saves you 0.3s. So we'd need some more convenient way to choose the no-fsync option, like an environment variable that developers can set. Or maybe developers don't care about 0.3s?

Regards,
Jeff Davis
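[For context: making a newly created file durable generally requires syncing both the file and its parent directory, so that the directory entry itself survives a crash. A minimal POSIX sketch of that pattern, purely illustrative and not code from the patch:]

    #include <fcntl.h>
    #include <libgen.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * Hypothetical helper: fsync a newly created file and its parent
     * directory.  Without the second fsync, the file's directory entry
     * may not be durable.  Returns 0 on success, -1 on error.
     */
    static int
    fsync_file_and_dir(const char *path)
    {
        char    parent[1024];
        int     fd;

        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        if (fsync(fd) < 0)
        {
            close(fd);
            return -1;
        }
        close(fd);

        /* dirname() may scribble on its argument, so use a copy */
        strncpy(parent, path, sizeof(parent) - 1);
        parent[sizeof(parent) - 1] = '\0';

        fd = open(dirname(parent), O_RDONLY);
        if (fd < 0)
            return -1;
        if (fsync(fd) < 0)
        {
            close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }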
[HACKERS] pl/python long-lived allocations in datum-dict transformation
Consider this:

    create table arrays as
      select array[random(), random(), random(),
                   random(), random(), random()] as a
      from generate_series(1, 100);

    create or replace function plpython_outputfunc() returns void as $$
    c = plpy.cursor('select a from arrays')
    for row in c:
        pass
    $$ language plpythonu;

When running the function, every datum will get transformed into a Python dict, which includes calling the type's output function, resulting in a memory allocation. The memory is allocated in the SPI context, so it accumulates until the function is finished. This is annoying for functions that plough through large tables, doing some calculation.

Attached is a patch that does the conversion of PostgreSQL Datums into Python dict objects in a scratch memory context that gets reset every time.

Cheers,
Jan

diff --git a/src/pl/plpython/plpy_main.c b/src/pl/plpython/plpy_main.c
index ae9d87e..79d7784 100644
*** a/src/pl/plpython/plpy_main.c
--- b/src/pl/plpython/plpy_main.c
***************
*** 12,17 ****
--- 12,18 ----
  #include "executor/spi.h"
  #include "miscadmin.h"
  #include "utils/guc.h"
+ #include "utils/memutils.h"
  #include "utils/syscache.h"

  #include "plpython.h"
*************** plpython_inline_handler(PG_FUNCTION_ARGS
*** 268,273 ****
--- 269,279 ----
  	MemSet(&proc, 0, sizeof(PLyProcedure));
  	proc.pyname = PLy_strdup("__plpython_inline_block");
+ 	proc.tmp_ctx = AllocSetContextCreate(TopMemoryContext,
+ 										 "PL/Python temporary ctx",
+ 										 ALLOCSET_DEFAULT_MINSIZE,
+ 										 ALLOCSET_DEFAULT_INITSIZE,
+ 										 ALLOCSET_DEFAULT_MAXSIZE);
  	proc.result.out.d.typoid = VOIDOID;

  	PG_TRY();
diff --git a/src/pl/plpython/plpy_procedure.c b/src/pl/plpython/plpy_procedure.c
index 229966a..f539cec 100644
*** a/src/pl/plpython/plpy_procedure.c
--- b/src/pl/plpython/plpy_procedure.c
***************
*** 12,17 ****
--- 12,18 ----
  #include "catalog/pg_type.h"
  #include "utils/builtins.h"
  #include "utils/hsearch.h"
+ #include "utils/memutils.h"
  #include "utils/syscache.h"

  #include "plpython.h"
*************** PLy_procedure_create(HeapTuple procTup,
*** 169,174 ****
--- 170,180 ----
  	proc->setof = NULL;
  	proc->src = NULL;
  	proc->argnames = NULL;
+ 	proc->tmp_ctx = AllocSetContextCreate(TopMemoryContext,
+ 										  "PL/Python temporary ctx",
+ 										  ALLOCSET_DEFAULT_MINSIZE,
+ 										  ALLOCSET_DEFAULT_INITSIZE,
+ 										  ALLOCSET_DEFAULT_MAXSIZE);

  	PG_TRY();
  	{
*************** PLy_procedure_delete(PLyProcedure *proc)
*** 411,416 ****
--- 417,424 ----
  		PLy_free(proc->src);
  	if (proc->argnames)
  		PLy_free(proc->argnames);
+ 	if (proc->tmp_ctx)
+ 		MemoryContextDelete(proc->tmp_ctx);
  }

  /*
diff --git a/src/pl/plpython/plpy_procedure.h b/src/pl/plpython/plpy_procedure.h
index e986c7e..5ee73fa 100644
*** a/src/pl/plpython/plpy_procedure.h
--- b/src/pl/plpython/plpy_procedure.h
*************** typedef struct PLyProcedure
*** 30,35 ****
--- 30,36 ----
  	PyObject   *code;			/* compiled procedure code */
  	PyObject   *statics;		/* data saved across calls, local scope */
  	PyObject   *globals;		/* data saved across calls, global scope */
+ 	MemoryContext tmp_ctx;
  } PLyProcedure;

  /* the procedure cache entry */
diff --git a/src/pl/plpython/plpy_typeio.c b/src/pl/plpython/plpy_typeio.c
index d5cac9f..6f0eb46 100644
*** a/src/pl/plpython/plpy_typeio.c
--- b/src/pl/plpython/plpy_typeio.c
***************
*** 23,28 ****
--- 23,29 ----
  #include "plpy_typeio.h"

  #include "plpy_elog.h"
+ #include "plpy_procedure.h"

  /* I/O function caching */
*************** PLy_output_record_funcs(PLyTypeInfo *arg
*** 258,268 ****
  	Assert(arg->is_rowtype == 1);
  }

  PyObject *
  PLyDict_FromTuple(PLyTypeInfo *info, HeapTuple tuple, TupleDesc desc)
  {
! 	PyObject   *volatile dict;
! 	int			i;

  	if (info->is_rowtype != 1)
  		elog(ERROR, "PLyTypeInfo structure describes a datum");
--- 259,274 ----
  	Assert(arg->is_rowtype == 1);
  }

+ /*
+  * Transform a tuple into a Python dict object.  The transformation can result
+  * in memory allocation, so it's done in a scratch memory context.
+  */
  PyObject *
  PLyDict_FromTuple(PLyTypeInfo *info, HeapTuple tuple, TupleDesc desc)
  {
! 	PyObject   *volatile dict;
! 	MemoryContext oldcontext;
! 	int			i;

  	if (info->is_rowtype != 1)
  		elog(ERROR, "PLyTypeInfo structure describes a datum");
*************** PLyDict_FromTuple(PLyTypeInfo *info, Hea
*** 271,276 ****
--- 277,284 ----
  	if (dict == NULL)
  		PLy_elog(ERROR, "could not create new dictionary");

+ 	oldcontext = MemoryContextSwitchTo(PLy_curr_procedure->tmp_ctx);
+
  	PG_TRY();
  	{
  		for (i = 0; i < info->in.r.natts; i++)
*************** PLyDict_FromTuple(PLyTypeInfo *info, Hea
*** 298,308 ****
--- 306,322 ----
  	}
  	PG_CATCH();
  	{
+ 		MemoryContextSwitchTo(oldcontext);
+ 		MemoryContextReset(PLy_curr_procedure->tmp_ctx);
+
  		Py_DECREF(dict);
  		PG_RE_THROW();
  	}
  	PG_END_TRY();

+ 	MemoryContextSwitchTo(oldcontext);
+
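[Distilled from the patch above, the scratch-context lifecycle looks roughly like this; a simplified sketch of the same idea, not the literal patch text:]

    /* once, at procedure setup: a long-lived scratch context */
    proc->tmp_ctx = AllocSetContextCreate(TopMemoryContext,
                                          "PL/Python temporary ctx",
                                          ALLOCSET_DEFAULT_MINSIZE,
                                          ALLOCSET_DEFAULT_INITSIZE,
                                          ALLOCSET_DEFAULT_MAXSIZE);

    /* per row: do the allocation-heavy conversion in the scratch context */
    oldcontext = MemoryContextSwitchTo(proc->tmp_ctx);
    PG_TRY();
    {
        /* ... call output functions, build the Python dict ... */
    }
    PG_CATCH();
    {
        MemoryContextSwitchTo(oldcontext);
        MemoryContextReset(proc->tmp_ctx);
        PG_RE_THROW();
    }
    PG_END_TRY();
    MemoryContextSwitchTo(oldcontext);
    MemoryContextReset(proc->tmp_ctx);  /* discard per-row garbage */

    /* once, at procedure teardown */
    MemoryContextDelete(proc->tmp_ctx);

This way per-row allocations never accumulate in the SPI context, regardless of how many rows the function iterates over.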
[HACKERS] plpgsql leaking memory when stringifying datums
While chasing a PL/Python memory leak, I did a few tests with PL/pgSQL, and I think there are places where memory is not freed sufficiently early. Attached are two functions that, when run, will make the backend's memory consumption increase until they finish.

With both, the cause is convert_value_to_string, which calls a datum's output function; for toasted data, that results in an allocation. The proposed patch changes convert_value_to_string to call the output function in the per-tuple memory context and then copy the result string back to the original context. The comment in that function says that callers generally pfree its result, but that wasn't the case with exec_stmt_raise, so I added a pfree() there as well.

With that I was still left with a leak in the typecast() test function, and it turns out that sticking an exec_eval_cleanup into exec_move_row fixed it. The regression tests pass, but I'm not 100% sure it's actually safe.

Since convert_value_to_string needs access to PL/pgSQL's execution state to get its hands on the per-tuple context, I needed to pass it to every caller that didn't have it already, which means exec_cast_value and exec_simple_cast_value. Does anyone have a better idea?

The initial diagnosis and proposed solution are by Andres Freund - thanks!

Cheers,
Jan

diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index bf952b6..a691905 100644
*** a/src/pl/plpgsql/src/pl_exec.c
--- b/src/pl/plpgsql/src/pl_exec.c
*************** static void exec_move_row(PLpgSQL_execst
*** 188,201 ****
  static HeapTuple make_tuple_from_row(PLpgSQL_execstate *estate,
  					PLpgSQL_row *row,
  					TupleDesc tupdesc);
! static char *convert_value_to_string(Datum value, Oid valtype);
! static Datum exec_cast_value(Datum value, Oid valtype,
  				Oid reqtype,
  				FmgrInfo *reqinput,
  				Oid reqtypioparam,
  				int32 reqtypmod,
  				bool isnull);
! static Datum exec_simple_cast_value(Datum value, Oid valtype,
  					   Oid reqtype, int32 reqtypmod,
  					   bool isnull);
  static void exec_init_tuple_store(PLpgSQL_execstate *estate);
--- 188,204 ----
  static HeapTuple make_tuple_from_row(PLpgSQL_execstate *estate,
  					PLpgSQL_row *row,
  					TupleDesc tupdesc);
! static char *convert_value_to_string(PLpgSQL_execstate *estate,
! 					Datum value, Oid valtype);
! static Datum exec_cast_value(PLpgSQL_execstate *estate,
! 				Datum value, Oid valtype,
  				Oid reqtype,
  				FmgrInfo *reqinput,
  				Oid reqtypioparam,
  				int32 reqtypmod,
  				bool isnull);
! static Datum exec_simple_cast_value(PLpgSQL_execstate *estate,
! 					   Datum value, Oid valtype,
  					   Oid reqtype, int32 reqtypmod,
  					   bool isnull);
  static void exec_init_tuple_store(PLpgSQL_execstate *estate);
*************** plpgsql_exec_function(PLpgSQL_function *
*** 430,436 ****
  		else
  		{
  			/* Cast value to proper type */
! 			estate.retval = exec_cast_value(estate.retval, estate.rettype,
  											func->fn_rettype,
  											&(func->fn_retinput),
  											func->fn_rettypioparam,
--- 433,440 ----
  		else
  		{
  			/* Cast value to proper type */
! 			estate.retval = exec_cast_value(&estate, estate.retval,
! 											estate.rettype, func->fn_rettype,
  											&(func->fn_retinput),
  											func->fn_rettypioparam,
*************** exec_stmt_fori(PLpgSQL_execstate *estate
*** 1757,1763 ****
  	 * Get the value of the lower bound
  	 */
  	value = exec_eval_expr(estate, stmt->lower, &isnull, &valtype);
! 	value = exec_cast_value(value, valtype, var->datatype->typoid,
  							&(var->datatype->typinput),
  							var->datatype->typioparam,
  							var->datatype->atttypmod, isnull);
--- 1761,1767 ----
  	 * Get the value of the lower bound
  	 */
  	value = exec_eval_expr(estate, stmt->lower, &isnull, &valtype);
! 	value = exec_cast_value(estate, value, valtype, var->datatype->typoid,
  							&(var->datatype->typinput),
  							var->datatype->typioparam,
  							var->datatype->atttypmod, isnull);
*************** exec_stmt_fori(PLpgSQL_execstate *estate
*** 1772,1778 ****
  	 * Get the value of the upper bound
  	 */
  	value = exec_eval_expr(estate, stmt->upper, &isnull, &valtype);
! 	value = exec_cast_value(value, valtype, var->datatype->typoid,
  							&(var->datatype->typinput),
  							var->datatype->typioparam,
  							var->datatype->atttypmod, isnull);
--- 1776,1782 ----
  	 * Get the value of the upper bound
  	 */
  	value = exec_eval_expr(estate, stmt->upper, &isnull, &valtype);
! 	value = exec_cast_value(estate, value, valtype, var->datatype->typoid,
  							&(var->datatype->typinput),
  							var->datatype->typioparam,
  							var->datatype->atttypmod, isnull);
*************** exec_stmt_fori(PLpgSQL_execstate *estate
*** 1789,1795 ****
  	if (stmt->step)
  	{
  		value = exec_eval_expr(estate, stmt->step, &isnull, &valtype);
! 		value = exec_cast_value(estate, value, valtype, var->datatype->typoid,
  								&(var->datatype->typinput),
  								var->datatype->typioparam,
  								var->datatype->atttypmod,
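[Stepping back from the patch text: the essence of the proposed change can be sketched as follows. This is a simplified illustration of the approach Jan describes, not the literal patch; the real function also deals with the caller-supplied FmgrInfo path:]

    static char *
    convert_value_to_string(PLpgSQL_execstate *estate, Datum value, Oid valtype)
    {
        char       *result;
        MemoryContext oldcontext;
        Oid         typoutput;
        bool        typIsVarlena;

        /*
         * Run the output function (and any detoasting it triggers) in the
         * per-tuple context, so the garbage is reclaimed by the next
         * exec_eval_cleanup() rather than accumulating for the whole call.
         */
        oldcontext = MemoryContextSwitchTo(
            estate->eval_econtext->ecxt_per_tuple_memory);
        getTypeOutputInfo(valtype, &typoutput, &typIsVarlena);
        result = OidOutputFunctionCall(typoutput, value);
        MemoryContextSwitchTo(oldcontext);

        /* copy the result string back into the caller's context */
        return pstrdup(result);
    }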
[HACKERS] Report: race conditions in WAL replay routines
Pursuant to the recent discussion about bugs 6200 and 6245, I went trawling through all the WAL redo routines looking for bugs such as inadequate locking or transiently trashing shared buffers. Here's what I found:

* As we already knew, RestoreBkpBlocks is broken because it transiently trashes a shared buffer, which another process could be accessing while holding only a pin.

* seq_redo() has the same disease, since we allow SELECT * FROM sequences.

* Everything else seems to be free of that specific issue; in particular the index-related replay routines are at fairly low risk since we don't have any coding rules allowing index pages to be examined without holding a buffer lock.

* There are assorted replay routines that assume they can whack fields of ShmemVariableCache around without any lock. However, it's pretty inconsistent; about half do it like that, while the other half assume that they can read ShmemVariableCache without lock but should acquire lock to modify it. I think the latter coding rule is a whole lot safer in the presence of Hot Standby and should be adopted throughout.

* Same goes for MultiXactSetNextMXact and MultiXactAdvanceNextMXact. It's not entirely clear to me that no read-only transaction can ever examine the shared-memory variables they change. In any case, if there is in fact no other process examining those variables, there can be no contention for the lock, so it should be cheap to get.

* Not exactly a race condition, but: tblspc_redo does ereport(ERROR) if it fails to clean out tablespace directories. This seems to me to be the height of folly, especially when the failure is more or less an expected case. If the error occurs, the database is dead in the water, because that error is actually a PANIC and will recur on subsequent restart attempts. Therefore there is no way to recover short of manual intervention to clean out the non-empty directory. And why are we pulling the fire alarm like this? Well, uh, it's because we might fail to recover some disk space in the dropped tablespace. Seems to me to be a lot better to just elog(LOG) and move on. This is quite analogous to the case of failing to unlink a file after commit --- wasting disk space might be bad, but it's very much the lesser evil compared to this.

Barring objections I'm going to fix all this stuff and back-patch as far as 9.0, where hot standby was added.

regards, tom lane
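[The coding rule advocated here - read shared state without a lock, but take the lock to modify it - looks roughly like this for the nextXid case. This is an illustrative sketch (the variable name newXid is invented); the actual patch appears later in the thread:]

    if (TransactionIdFollowsOrEquals(newXid, ShmemVariableCache->nextXid))
    {
        /* the unlocked read above is deemed OK; modification takes the lock */
        LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
        ShmemVariableCache->nextXid = newXid;
        TransactionIdAdvance(ShmemVariableCache->nextXid);
        LWLockRelease(XidGenLock);
    }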
Re: [HACKERS] JSON output functions.
Hi Andrew

Nice work! Just for completeness: did you also think of including geometry types in the JSON output functions in later releases? There's a nice extension of JSON called GeoJSON as a starting point.

Yours, Stefan

2012/2/3 Andrew Dunstan <and...@dunslane.net>:
> On 02/02/2012 12:20 PM, Pavel Stehule wrote:
>> 2012/2/2 Andrew Dunstan <and...@dunslane.net>:
>>> On 02/02/2012 04:35 AM, Abhijit Menon-Sen wrote:
>>>> At 2012-02-01 18:48:28 -0500, andrew.duns...@pgexperts.com wrote:
>>>>> For now I'm inclined not to proceed with that, and leave it as an
>>>>> optimization to be considered later if necessary. Thoughts?
>>>>
>>>> I agree, there doesn't seem to be a pressing need to do it now.
>>>
>>> OK, here's my final version of the patch for constructor functions.
>>> If there's no further comment I'll go with this.
>>
>> These functions are super, thank you.
>>
>> Do you plan to fix the issue with row attribute names in 9.2?
>
> Yeah. Tom did some initial work which he published here:
> http://archives.postgresql.org/message-id/28413.1321500388%40sss.pgh.pa.us,
> noting:
>
>   It's not really ideal with respect to the ValuesScan case, because
>   what you get seems to always be the hard-wired "columnN" names for
>   VALUES columns, even if you try to override that with an alias ...
>   Curiously, it works just fine if the VALUES can be folded
>
> and later he said:
>
>   Upon further review, this patch would need some more work even for
>   the RowExpr case, because there are several places that build
>   RowExprs without bothering to build a valid colnames list. It's
>   clearly soluble if anyone cares to put in the work, but I'm not
>   personally excited enough to pursue it ...
>
> I'm going to look at that issue first, since the unfolded VALUES clause
> seems like something of an obscure corner case. Feel free to chime in
> if you can.
>
> cheers
> andrew
Re: [HACKERS] initdb and fsync
Jeff Davis <pg...@j-davis.com> writes:
> On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
>> If we add fsync calls to the initdb process, they should cover the
>> entire data directory tree. This patch syncs files that initdb.c
>> writes, but we ought to also sync files that bootstrap-mode backends
>> had written.
>
> It doesn't make sense for initdb to take responsibility to sync files
> created by the backend.

No, but the more interesting question is whether bootstrap mode troubles to fsync its writes. I'm not too sure about that either way ...

regards, tom lane
Re: [HACKERS] JSON output functions.
On 02/05/2012 02:31 PM, Stefan Keller wrote:
> Hi Andrew
>
> Nice work! Just for completeness: did you also think of including
> geometry types in the JSON output functions in later releases? There's
> a nice extension of JSON called GeoJSON as a starting point.

[side note: please don't top-reply on -hackers. See http://idallen.com/topposting.html]

Currently, in array_to_json and row_to_json the only special cases are:

* record types are output as JSON records
* array types are output as JSON arrays
* numeric types are output without quoting
* boolean types are output as unquoted true or false
* NULLs are output as NULL

Everything else is output as its text representation, suitably quoted and escaped.

If you want to change how those operate, now rather than later would be the best time. But later we could certainly add various other foo_to_json functions for things like geometry types and hstores.

cheers
andrew
Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
2012/2/1 12:15 Shigeru Hanada <shigeru.han...@gmail.com>:
> (2012/01/30 4:39), Kohei KaiGai wrote:
>> I checked the fdw_helper_funcs_v3.patch, pgsql_fdw_v5.patch and
>> pgsql_fdw_pushdown_v1.patch. My comments are below.
>
> Thanks for the review!
>
>> [BUG]
>> Even though pgsql_fdw tries to push down qualifiers executable on the
>> remote side at deparseSql(), it does not remove the pushed-down
>> qualifiers from baserel->baserestrictinfo, so these qualifiers are
>> eventually executed twice. See the result of this EXPLAIN:
>>
>> postgres=# EXPLAIN SELECT * FROM ft1 WHERE a > 2 AND f_leak(b);
>>                            QUERY PLAN
>> --------------------------------------------------------------
>>  Foreign Scan on ft1  (cost=107.43..122.55 rows=410 width=36)
>>    Filter: (f_leak(b) AND (a > 2))
>>    Remote SQL: DECLARE pgsql_fdw_cursor_0 SCROLL CURSOR FOR
>>                SELECT a, b FROM public.t1 WHERE (a > 2)
>> (3 rows)
>>
>> My expectation is (a > 2) being executed on the remote side and
>> f_leak(b) being executed on the local side. But the filter of the
>> foreign scan on ft1 has both of the qualifiers. A RestrictInfo has to
>> be removed if it gets pushed down.
>
> It's intentional that pgsql_fdw keeps pushed-down qualifiers in
> baserestrictinfo, because I saw some undesirable behavior when I
> implemented it with that optimization and a plan was reused, but it's
> not clear to me now. I'll try to recall what I saw...
>
> BTW, I think evaluating pushed-down qualifiers again on the local side
> is safe and has no semantic problem, though we must pay a little for
> such overhead. Do you have a concern about performance?

Yes. In my opinion, one significant benefit of pgsql_fdw is to execute qualifiers on the distributed nodes; that enables us to utilize multiple CPU resources efficiently. Duplicate checks are a reliable way to keep invisible tuples filtered out, indeed, but they degrade one competitive characteristic of pgsql_fdw.

https://github.com/kaigai/pg_strom/blob/master/plan.c#L693

In my module, qualifiers executable on the device side are detached from baserel->baserestrictinfo, and the remaining qualifiers are chained to the list. The is_device_executable_qual() is equivalent to is_foreign_expr() in the pgsql_fdw module.

Of course, it is your decision, and I might be missing something. BTW, what was the undesirable behavior in your previous implementation?

>> [Design comment]
>> I'm not sure of the reason why store_result() uses MessageContext to
>> save the Tuplestorestate within PgsqlFdwExecutionState. The source
>> code comment says it is used to avoid memory leaks in error cases. I
>> have a similar experience from implementing my own fdw module, so I
>> can understand that the per-scan context is already cleared by the
>> time the resource-release callback runs; thus, handles to external
>> resources have to be saved in a separate memory context. In my
>> personal opinion, the right design is to declare a memory context for
>> pgsql_fdw itself, instead of abusing an existing memory context. (A
>> wiser design is to define a sub-memory-context for each foreign scan,
>> then remove the sub-memory-context after releasing the handles.)
>
> I simply chose a built-in context with enough lifespan, but now I think
> that using MessageContext directly is not a recommended way. As you
> say, creating a new context as a child of MessageContext for each scan
> in BeginForeignScan (or the first IterateForeignScan) would be better.
> Please see the attached patch.
>
> One other option is getting rid of the tuplestore by holding result
> rows as a PGresult, and tracking it for error cases which might happen.
> A ResourceOwner callback can be used to release the PGresult on error,
> similar to PGconn.

If we could have the set of results in the per-query memory context (thus no need for explicit release on the error path), that would be a more ideal design. Is it possible to implement that on top of the libpq APIs? Please note that the per-query memory context has already been released by the time the ResourceOwner callback is launched, so this is not workable if libpq requires us to release some resource at that point.

>> [Design comment]
>> When should BEGIN be issued on the remote side? The
>> connect_pg_server() is the only chance to issue a BEGIN command on the
>> remote side, when the connection is being opened. However, the
>> connection is kept open unless an error is raised. Thus, the remote
>> side will continue to work within a single transaction block, even if
>> the local side iterates a pair of BEGIN and COMMIT. I'd like to
>> suggest closing the transaction block at the end of the scan,
>> transaction, or sub-transaction.
>
> Indeed, remote transactions should be terminated at some point.
> Terminating at the end of a scan seems troublesome because a connection
> might be shared by multiple scans in a query. I'd prefer aborting the
> remote transaction at the end of the local query. Please see
> abort_remote_tx in the attached patch.

It seems to me abort_remote_tx in ReleaseConnection() is reasonable.
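[For illustration, detaching pushable qualifiers the way KaiGai describes for pg_strom might look like this. A sketch only, not pgsql_fdw code; is_foreign_expr() stands in for whatever shippability test applies:]

    List       *local_quals = NIL;
    List       *remote_quals = NIL;
    ListCell   *lc;

    foreach(lc, baserel->baserestrictinfo)
    {
        RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

        if (is_foreign_expr(root, baserel, rinfo->clause))
            remote_quals = lappend(remote_quals, rinfo);  /* deparse into remote WHERE */
        else
            local_quals = lappend(local_quals, rinfo);    /* keep for the local Filter */
    }

    /* only the locally checked qualifiers remain in baserestrictinfo */
    baserel->baserestrictinfo = local_quals;

The trade-off under discussion is exactly this last assignment: keeping everything in baserestrictinfo re-checks pushed-down quals locally (safe but redundant work), while removing them relies on the remote side alone.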
Re: [HACKERS] 16-bit page checksums for 9.2
On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <br...@momjian.us> wrote:
> On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
>>> Also, as far as I can see this patch usurps the page version field,
>>> which I find unacceptably short-sighted. Do you really think this is
>>> the last page layout change we'll ever make?
>>
>> No, I don't. I hope and expect the next page layout change to
>> reintroduce such a field.
>>
>> But since we're agreed now that upgrading is important, changing the
>> page format isn't likely to happen until we get an online upgrade
>> process. So future changes are much less likely. If they do happen, we
>> have some flag bits spare that can be used to indicate later versions.
>> It's not the prettiest thing in the world, but it's a small ugliness
>> in return for an important feature. If there was a way without that, I
>> would have chosen it.
>
> Have you considered that the CRC might match a valid page version
> number? Is that safe?

In the proposed scheme there are two flag bits set on the page to indicate whether the field is used as a checksum rather than a version number. So while it's possible the checksum could look like a valid page version number, we'd still be able to tell the difference.

-- 
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
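[To make the scheme concrete, a test for the two flag bits might look like this. The bit names and positions are invented for illustration; the actual patch may differ:]

    #define PD_PAGE_CHECKSUM1   0x0008      /* invented values, illustrative */
    #define PD_PAGE_CHECKSUM2   0x0010

    static inline bool
    PageHasChecksum(PageHeader ph)
    {
        /* both bits must be set, so the field is only interpreted as a
         * checksum when the flags say so */
        return (ph->pd_flags & (PD_PAGE_CHECKSUM1 | PD_PAGE_CHECKSUM2)) ==
               (PD_PAGE_CHECKSUM1 | PD_PAGE_CHECKSUM2);
    }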
Re: [HACKERS] [BUGS] BUG #6425: Bus error in slot_deform_tuple
On Sat, Feb 4, 2012 at 6:49 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Simon Riggs <si...@2ndquadrant.com> writes:
>> The cause here is data changing underneath the user. Your patch solves
>> the most obvious error, but it still allows other problems if applying
>> the backup block changes data. If the backup block doesn't do anything
>> at all then we don't need to apply it either.
>
> This is nonsense. What applying the backup block does is to apply the
> change that the WAL record would otherwise have applied, except we
> decided to make it store a full-page image instead.

Yep, you're right, my bad. Got a head cold, so will lay off a few days from too much thinking.

-- 
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Report: race conditions in WAL replay routines
On Sun, Feb 5, 2012 at 7:18 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Pursuant to the recent discussion about bugs 6200 and 6245, I went
> trawling through all the WAL redo routines looking for bugs such as
> inadequate locking or transiently trashing shared buffers. Here's what
> I found:
>
> * As we already knew, RestoreBkpBlocks is broken because it transiently
> trashes a shared buffer, which another process could be accessing while
> holding only a pin.

Agreed.

> * seq_redo() has the same disease, since we allow SELECT * FROM
> sequences.

Why do we do that?

> * Everything else seems to be free of that specific issue; in
> particular the index-related replay routines are at fairly low risk
> since we don't have any coding rules allowing index pages to be
> examined without holding a buffer lock.

Yep.

> * There are assorted replay routines that assume they can whack fields
> of ShmemVariableCache around without any lock. However, it's pretty
> inconsistent; about half do it like that, while the other half assume
> that they can read ShmemVariableCache without lock but should acquire
> lock to modify it. I think the latter coding rule is a whole lot safer
> in the presence of Hot Standby and should be adopted throughout.

Agreed.

> * Same goes for MultiXactSetNextMXact and MultiXactAdvanceNextMXact.
> It's not entirely clear to me that no read-only transaction can ever
> examine the shared-memory variables they change. In any case, if there
> is in fact no other process examining those variables, there can be no
> contention for the lock, so it should be cheap to get.

Row locking requires a WAL record to be written, so that whole path is dead during HS.

> * Not exactly a race condition, but: tblspc_redo does ereport(ERROR) if
> it fails to clean out tablespace directories. This seems to me to be
> the height of folly, especially when the failure is more or less an
> expected case. If the error occurs, the database is dead in the water,
> because that error is actually a PANIC and will recur on subsequent
> restart attempts. Therefore there is no way to recover short of manual
> intervention to clean out the non-empty directory. And why are we
> pulling the fire alarm like this? Well, uh, it's because we might fail
> to recover some disk space in the dropped tablespace. Seems to me to be
> a lot better to just elog(LOG) and move on.

If the sysadmin is managing the db properly then this shouldn't ever happen - the only cause is if the tablespace being dropped is being used as a temp tablespace on the standby.

The ERROR is appropriate because we first try to remove the files. If they won't go, we then raise an all-session conflict and try again. Only when we fail the second time does it ERROR, which seems OK. If you just LOG, when exactly would we get rid of the tablespace?

> Barring objections I'm going to fix all this stuff and back-patch as
> far as 9.0, where hot standby was added.

Please post the patch rather than fixing directly. There's some subtle stuff there and it would be best to discuss first.

Thanks

-- 
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Report: race conditions in WAL replay routines
Simon Riggs <si...@2ndquadrant.com> writes:
> On Sun, Feb 5, 2012 at 7:18 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> * seq_redo() has the same disease, since we allow SELECT * FROM
>> sequences.
>
> Why do we do that?

It's the only existing way to obtain the parameters of a sequence. Even if we felt like inventing a different API for doing that, it'd take years to change every client that knows about this.

>> * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
>> if it fails to clean out tablespace directories. This seems to me to
>> be the height of folly, especially when the failure is more or less an
>> expected case. If the error occurs, the database is dead in the water,
>> because that error is actually a PANIC and will recur on subsequent
>> restart attempts. Therefore there is no way to recover short of manual
>> intervention to clean out the non-empty directory. And why are we
>> pulling the fire alarm like this? Well, uh, it's because we might fail
>> to recover some disk space in the dropped tablespace. Seems to me to
>> be a lot better to just elog(LOG) and move on. This is quite analogous
>> to the case of failing to unlink a file after commit --- wasting disk
>> space might be bad, but it's very much the lesser evil compared to
>> this.
>
> If the sysadmin is managing the db properly then this shouldn't ever
> happen - the only cause is if the tablespace being dropped is being
> used as a temp tablespace on the standby.

Right, but that is an expected/foreseeable situation. It should not lead to a dead-and-unrestartable database.

> If you just LOG, when exactly would we get rid of the tablespace?

The tablespace *is* gone, or at least its catalog entries are. All we are trying to do here is release some underlying disk space. It's exactly analogous to the case where we drop a table and then find (post commit) that unlinking the disk file fails for some weird reason. We've done what we can to clean the disk space and should just let it go --- there is no risk to database integrity in leaving some files behind, so killing the server is a huge overreaction.

regards, tom lane
Re: [HACKERS] double writes using double-write buffer approach [WIP]
Thanks for the detailed followup. I do see how Postgres is tuned for having a bunch of memory available that is not in shared_buffers, both for the OS buffer cache and other memory allocations. However, Postgres seems to run fine in many of the large shared_memory configurations that I gave performance numbers for, including 5G shared_buffers for an 8G machine, 3G shared_buffers for a 6G machine, etc. There just has to be sufficient extra memory beyond the shared_buffers cache.

I think the pgbench run is pointing out a problem that this double_writes implementation has with BULK_WRITEs. As you point out, the BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions. I'm not sure if there is a great solution that always works for that issue. However, I do notice that BULK_WRITE data isn't WAL-logged unless archiving/replication is happening. As I understand it, if the BULK_WRITE data isn't being WAL-logged, then it doesn't have to be double-written either. The BULK_WRITE data is not officially synced and committed until it is all written, so there doesn't have to be any torn-page protection for that data, which is why the WAL logging can be omitted.

The double-write implementation can be improved by marking each buffer that doesn't need torn-page protection. These buffers would be those new pages that are explicitly not WAL-logged, even when full_page_writes is enabled. When such a buffer is eventually synced (perhaps because of an eviction), it would not be double-written. This would often avoid double-writes for BULK_WRITE, etc., especially since the administrator is often not archiving or doing replication when doing bulk loads.

However, overall, I think the idea is that double writes are an optional optimization. The user would only turn it on in existing configurations where it helps or only slightly hurts performance, and where greatly reducing the size of the WAL logs is beneficial. It might also be especially beneficial when there is a small amount of FLASH or other kind of fast storage that the double-write files can be stored on.

Thanks,

Dan

----- Original Message -----
From: Robert Haas <robertmh...@gmail.com>
To: Dan Scales <sca...@vmware.com>
Cc: PG Hackers <pgsql-hackers@postgresql.org>
Sent: Friday, February 3, 2012 1:48:54 PM
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

On Fri, Feb 3, 2012 at 3:14 PM, Dan Scales <sca...@vmware.com> wrote:
> Thanks for the feedback! I think you make a good point about the small
> size of dirty data in the OS cache. I think what you can say about this
> double-write patch is that it will not work well for configurations
> that have a small Postgres cache and a large OS cache, since every
> write from the Postgres cache requires double-writes and an fsync.

The general guidance for setting shared_buffers these days is 25% of RAM up to a maximum of 8GB, so the configuration that you're describing as not optimal for this patch is the one normally used when running PostgreSQL. I've run across several cases where larger values of shared_buffers are a huge win, because the entire working set can then be accommodated in shared_buffers. But it's certainly not the case that all working sets fit.

And in this case, I think that's beside the point anyway. I had shared_buffers set to 8GB on a machine with much more memory than that, but the database created by pgbench -i -s 10 is about 156 MB, so the problem isn't that there is too little PostgreSQL cache available. The entire database fits in shared_buffers, with most of it left over. However, because of the BufferAccessStrategy stuff, pages start to get forced out to the OS pretty quickly. Of course, we could disable the BufferAccessStrategy stuff when double_writes is in use, but bear in mind that the reason we have it in the first place is to prevent cache trashing effects. It would be imprudent of us to throw that out the window without replacing it with something else that would provide similar protection. And even if we did, that would just delay the day of reckoning. You'd be able to blast through and dirty the entirety of shared_buffers at top speed, but then as soon as you started replacing pages, performance would slow to an utter crawl, just as it did here, only you'd need a bigger scale factor to trigger the problem.

The more general point here is that there are MANY aspects of PostgreSQL's design that assume that shared_buffers accounts for a relatively small percentage of system memory. Here's another one: we assume that backends that need temporary memory for sorts and hashes (i.e. work_mem) can just allocate it from the OS. If we were to start recommending setting shared_buffers to large percentages of the available memory, we'd probably have to rethink that. Most likely, we'd need some kind of in-core mechanism for allocating temporary memory from the shared memory segment. And here's yet another one: we assume that it is better to
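[Dan's proposed optimization - skip the double-write for pages whose initial writes were never WAL-logged - might be sketched like this. The flag name and helpers are invented for illustration; this is not code from the patch:]

    /* When a bulk-written page is not WAL-logged (wal_level permits
     * skipping it), mark the buffer as not needing torn-page protection: */
    if (!XLogIsNeeded())
        bufHdr->flags |= BM_NO_TORN_PAGE_PROTECTION;   /* hypothetical flag */

    /* Later, when the buffer is flushed: */
    if (bufHdr->flags & BM_NO_TORN_PAGE_PROTECTION)
        smgrwrite(reln, forkNum, blockNum, bufBlock, false); /* plain write */
    else
        double_write(reln, forkNum, blockNum, bufBlock);     /* hypothetical */

The rationale, per Dan: a page whose contents are only committed once the whole bulk operation completes needs no torn-page protection, so neither full-page WAL images nor double-writes are required for it.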
Re: [HACKERS] Report: race conditions in WAL replay routines
On Sun, Feb 5, 2012 at 9:03 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
>>> if it fails to clean out tablespace directories. This seems to me to
>>> be the height of folly, especially when the failure is more or less
>>> an expected case. If the error occurs, the database is dead in the
>>> water, because that error is actually a PANIC and will recur on
>>> subsequent restart attempts. Therefore there is no way to recover
>>> short of manual intervention to clean out the non-empty directory.
>>> And why are we pulling the fire alarm like this? Well, uh, it's
>>> because we might fail to recover some disk space in the dropped
>>> tablespace. Seems to me to be a lot better to just elog(LOG) and move
>>> on. This is quite analogous to the case of failing to unlink a file
>>> after commit --- wasting disk space might be bad, but it's very much
>>> the lesser evil compared to this.
>>
>> If the sysadmin is managing the db properly then this shouldn't ever
>> happen - the only cause is if the tablespace being dropped is being
>> used as a temp tablespace on the standby.
>
> Right, but that is an expected/foreseeable situation. It should not
> lead to a dead-and-unrestartable database.
>
>> If you just LOG, when exactly would we get rid of the tablespace?
>
> The tablespace *is* gone, or at least its catalog entries are. All we
> are trying to do here is release some underlying disk space. It's
> exactly analogous to the case where we drop a table and then find (post
> commit) that unlinking the disk file fails for some weird reason. We've
> done what we can to clean the disk space and should just let it go ---
> there is no risk to database integrity in leaving some files behind, so
> killing the server is a huge overreaction.

I agree the tablespace entries are gone, but that won't stop existing users from continuing. If we're not sure of the reason why tablespace removal fails, it doesn't seem safe to continue to me. But since this is a rare corner case, and we already try to remove users, then LOG seems OK.

-- 
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Report: race conditions in WAL replay routines
Simon Riggs <si...@2ndquadrant.com> writes:
> Please post the patch rather than fixing directly. There's some subtle
> stuff there and it would be best to discuss first.

Here's a proposed patch for the issues around unlocked updates of shared-memory state. After going through this I believe that there is little risk of any real problems in the current state of the code; this is more in the nature of future-proofing against foreseeable changes. (One such is that we'd discussed fixing the age() function to work during Hot Standby.) So I suggest applying this to HEAD but not back-patching.

Except for one thing. I realized while looking at the NEXTOID replay code that it is completely broken: it only advances ShmemVariableCache->nextOid when that's less than the value in the WAL record. So that comparison fails if the OID counter wraps around during replay. I've fixed this in the attached patch by just forcibly assigning the new value instead of trying to be smart, and I think that aspect of it probably needs to be back-patched.

regards, tom lane

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 454ca310f339db34a4cc352899fea1d663abf93b..0f4cea124d7436380c730203e6cfd1518bc5d3b2 100644
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
*************** CheckPointMultiXact(void)
*** 1655,1662 ****
   * Set the next-to-be-assigned MultiXactId and offset
   *
   * This is used when we can determine the correct next ID/offset exactly
!  * from a checkpoint record.  We need no locking since it is only called
!  * during bootstrap and XLog replay.
   */
  void
  MultiXactSetNextMXact(MultiXactId nextMulti,
--- 1655,1663 ----
   * Set the next-to-be-assigned MultiXactId and offset
   *
   * This is used when we can determine the correct next ID/offset exactly
!  * from a checkpoint record.  Although this is only called during bootstrap
!  * and XLog replay, we take the lock in case any hot-standby backends are
!  * examining the values.
   */
  void
  MultiXactSetNextMXact(MultiXactId nextMulti,
*************** MultiXactSetNextMXact(MultiXactId nextMu
*** 1664,1671 ****
--- 1665,1674 ----
  {
  	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
  				nextMulti, nextMultiOffset);
+ 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
  	MultiXactState->nextMXact = nextMulti;
  	MultiXactState->nextOffset = nextMultiOffset;
+ 	LWLockRelease(MultiXactGenLock);
  }

  /*
*************** MultiXactSetNextMXact(MultiXactId nextMu
*** 1674,1685 ****
   *
   * This is used when we can determine minimum safe values from an XLog
   * record (either an on-line checkpoint or an mxact creation log entry).
!  * We need no locking since it is only called during XLog replay.
   */
  void
  MultiXactAdvanceNextMXact(MultiXactId minMulti,
  						  MultiXactOffset minMultiOffset)
  {
  	if (MultiXactIdPrecedes(MultiXactState->nextMXact, minMulti))
  	{
  		debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
--- 1677,1690 ----
   *
   * This is used when we can determine minimum safe values from an XLog
   * record (either an on-line checkpoint or an mxact creation log entry).
!  * Although this is only called during XLog replay, we take the lock in case
!  * any hot-standby backends are examining the values.
   */
  void
  MultiXactAdvanceNextMXact(MultiXactId minMulti,
  						  MultiXactOffset minMultiOffset)
  {
+ 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
  	if (MultiXactIdPrecedes(MultiXactState->nextMXact, minMulti))
  	{
  		debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
*************** MultiXactAdvanceNextMXact(MultiXactId mi
*** 1691,1696 ****
--- 1696,1702 ----
  					minMultiOffset);
  		MultiXactState->nextOffset = minMultiOffset;
  	}
+ 	LWLockRelease(MultiXactGenLock);
  }

  /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6e84cd0a21671486693e7f94d5fda8efdedf4bb4..3d08e92d3a747cc9426206dc0623ff390f18b09c 100644
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
*************** PrescanPreparedTransactions(TransactionI
*** 1715,1720 ****
--- 1715,1724 ----
  			/*
  			 * Examine subtransaction XIDs ... they should all follow main
  			 * XID, and they may force us to advance nextXid.
+ 			 *
+ 			 * We don't expect anyone else to modify nextXid, hence we don't
+ 			 * need to hold a lock while examining it.  We still acquire the
+ 			 * lock to modify it, though.
  			 */
  			subxids = (TransactionId *)
  				(buf + MAXALIGN(sizeof(TwoPhaseFileHeader)));
*************** PrescanPreparedTransactions(TransactionI
*** 1726,1733 ****
--- 1730,1739 ----
  				if (TransactionIdFollowsOrEquals(subxid,
  												 ShmemVariableCache->nextXid))
  				{
+ 					LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
  					ShmemVariableCache->nextXid = subxid;
Re: [HACKERS] initdb and fsync
On Sun, Feb 05, 2012 at 10:53:20AM -0800, Jeff Davis wrote:
> On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
>> If we add fsync calls to the initdb process, they should cover the
>> entire data directory tree. This patch syncs files that initdb.c
>> writes, but we ought to also sync files that bootstrap-mode backends
>> had written.
>
> It doesn't make sense for initdb to take responsibility to sync files
> created by the backend. If there are important files that the backend
> creates, it should be the backend's responsibility to fsync them (and
> their parent directory, if needed). And if they are unimportant to the
> backend, then there is no reason for initdb to fsync them.

I meant primarily to illustrate the need to be comprehensive, not to comment on which executable should fsync a particular file. Bootstrap-mode backends do not sync anything during an initdb run on my system. With your patch, we'll fsync a small handful of files and leave nearly everything else vulnerable.

That being said, having each backend fsync its own writes will mean syncing certain files several times within a single initdb run. If the penalty from that proves high enough, we may do well to instead have initdb.c sync everything just once.

>> initdb should do these syncs by default and offer an option to disable
>> them.
>
> For test frameworks that run initdb often, that makes sense. But for
> developers, it doesn't make sense to spend 0.5s typing an option that
> saves you 0.3s. So, we'd need some more convenient way to choose the
> no-fsync option, like an environment variable that developers can set.
> Or maybe developers don't care about 0.3s?

Developers have shell aliases/functions/scripts and command history. I wouldn't object to having an environment variable control it, but I would not personally find that more convenient than a command-line switch.

Thanks,
nm
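[The "sync everything just once" alternative would amount to a post-order walk over the data directory. A rough POSIX sketch, with error reporting and symlink chasing (e.g. for tablespaces) omitted; not code from any posted patch:]

    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Hypothetical helper: fsync every regular file and directory under
     * 'path' exactly once, syncing each directory after its contents. */
    static void
    fsync_recursively(const char *path)
    {
        DIR        *dir;
        struct dirent *de;
        char        sub[4096];
        struct stat st;
        int         fd;

        dir = opendir(path);
        if (dir == NULL)
            return;
        while ((de = readdir(dir)) != NULL)
        {
            if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
                continue;
            snprintf(sub, sizeof(sub), "%s/%s", path, de->d_name);
            if (lstat(sub, &st) == 0)
            {
                if (S_ISDIR(st.st_mode))
                    fsync_recursively(sub);
                else if (S_ISREG(st.st_mode))
                {
                    fd = open(sub, O_RDONLY);
                    if (fd >= 0)
                    {
                        (void) fsync(fd);
                        close(fd);
                    }
                }
            }
        }
        closedir(dir);

        /* sync the directory itself so its entries are durable */
        fd = open(path, O_RDONLY);
        if (fd >= 0)
        {
            (void) fsync(fd);
            close(fd);
        }
    }

This touches each file once regardless of which process wrote it, which is the property Noah points out per-backend fsyncs would lack.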
Re: [HACKERS] Report: race conditions in WAL replay routines
Simon Riggs <si...@2ndquadrant.com> writes:
> Please post the patch rather than fixing directly. There's some subtle
> stuff there and it would be best to discuss first.

And here's a proposed patch for not throwing ERROR during replay of DROP TABLESPACE. I had originally thought this would be a one-liner s/ERROR/LOG/, but on inspection destroy_tablespace_directories() really needs to be changed too, so that it doesn't throw error for unremovable directories.

regards, tom lane

diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 8ed6d06bb2370a54349f2a86ee13e08a381eacf2..ed0c9ea3ee5a372c6926f41fc0fc93ed7d80d6cd 100644
*** a/src/backend/commands/tablespace.c
--- b/src/backend/commands/tablespace.c
*************** create_tablespace_directories(const char
*** 626,634 ****
  /*
   * destroy_tablespace_directories
   *
!  * Attempt to remove filesystem infrastructure
   *
!  * 'redo' indicates we are redoing a drop from XLOG; okay if nothing there
   *
   * Returns TRUE if successful, FALSE if some subdirectory is not empty
   */
--- 626,638 ----
  /*
   * destroy_tablespace_directories
   *
!  * Attempt to remove filesystem infrastructure for the tablespace.
   *
!  * 'redo' indicates we are redoing a drop from XLOG; in that case we should
!  * not throw an ERROR for problems, just LOG them.  The worst consequence of
!  * not removing files here would be failure to release some disk space, which
!  * does not justify throwing an error that would require manual intervention
!  * to get the database running again.
   *
   * Returns TRUE if successful, FALSE if some subdirectory is not empty
   */
*************** destroy_tablespace_directories(Oid table
*** 681,686 ****
--- 685,700 ----
  			pfree(linkloc_with_version_dir);
  			return true;
  		}
+ 		else if (redo)
+ 		{
+ 			/* in redo, just log other types of error */
+ 			ereport(LOG,
+ 					(errcode_for_file_access(),
+ 					 errmsg("could not open directory \"%s\": %m",
+ 							linkloc_with_version_dir)));
+ 			pfree(linkloc_with_version_dir);
+ 			return false;
+ 		}
  		/* else let ReadDir report the error */
  	}
*************** destroy_tablespace_directories(Oid table
*** 704,710 ****

  		/* remove empty directory */
  		if (rmdir(subfile) < 0)
! 			ereport(ERROR,
  					(errcode_for_file_access(),
  					 errmsg("could not remove directory \"%s\": %m",
  							subfile)));
--- 718,724 ----

  		/* remove empty directory */
  		if (rmdir(subfile) < 0)
! 			ereport(redo ? LOG : ERROR,
  					(errcode_for_file_access(),
  					 errmsg("could not remove directory \"%s\": %m",
  							subfile)));
*************** destroy_tablespace_directories(Oid table
*** 716,738 ****

  	/* remove version directory */
  	if (rmdir(linkloc_with_version_dir) < 0)
! 		ereport(ERROR,
  				(errcode_for_file_access(),
  				 errmsg("could not remove directory \"%s\": %m",
  						linkloc_with_version_dir)));

  	/*
  	 * Try to remove the symlink.  We must however deal with the possibility
  	 * that it's a directory instead of a symlink --- this could happen during
  	 * WAL replay (see TablespaceCreateDbspace), and it is also the case on
  	 * Windows where junction points lstat() as directories.
  	 */
  	linkloc = pstrdup(linkloc_with_version_dir);
  	get_parent_directory(linkloc);
  	if (lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))
  	{
  		if (rmdir(linkloc) < 0)
! 			ereport(ERROR,
  					(errcode_for_file_access(),
  					 errmsg("could not remove directory \"%s\": %m",
  							linkloc)));
--- 730,759 ----

  	/* remove version directory */
  	if (rmdir(linkloc_with_version_dir) < 0)
! 	{
! 		ereport(redo ? LOG : ERROR,
  				(errcode_for_file_access(),
  				 errmsg("could not remove directory \"%s\": %m",
  						linkloc_with_version_dir)));
+ 		pfree(linkloc_with_version_dir);
+ 		return false;
+ 	}

  	/*
  	 * Try to remove the symlink.  We must however deal with the possibility
  	 * that it's a directory instead of a symlink --- this could happen during
  	 * WAL replay (see TablespaceCreateDbspace), and it is also the case on
  	 * Windows where junction points lstat() as directories.
+ 	 *
+ 	 * Note: in the redo case, we'll return true if this final step fails;
+ 	 * there's no point in retrying it.
  	 */
  	linkloc = pstrdup(linkloc_with_version_dir);
  	get_parent_directory(linkloc);
  	if (lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))
  	{
  		if (rmdir(linkloc) < 0)
! 			ereport(redo ? LOG : ERROR,
  					(errcode_for_file_access(),
  					 errmsg("could not remove directory \"%s\": %m",
  							linkloc)));
*************** destroy_tablespace_directories(Oid table
*** 740,746 ****
  	else
  	{
  		if (unlink(linkloc) < 0)
! 			ereport(ERROR,
  					(errcode_for_file_access(),
  					 errmsg("could not remove symbolic link \"%s\": %m",
  							linkloc)));
--- 761,767 ----
  	else
  	{
  		if (unlink(linkloc) < 0)
! 			ereport(redo ? LOG : ERROR,
  					(errcode_for_file_access(),
  					 errmsg("could not remove
Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
2012/2/2 Marko Kreen <mark...@gmail.com>:
> I think you want this instead:
>   https://commitfest.postgresql.org/action/patch_view?id=769

Somehow I've missed this cool feature. Thanks for the suggestion!

-- 
Shigeru Hanada
Re: [HACKERS] disable prompting by default in createuser
On Wed, Feb 1, 2012 at 1:13 PM, Peter Eisentraut <pete...@gmx.net> wrote:
> On Sun, 2012-01-15 at 18:14 -0500, Josh Kupershmidt wrote:
>> I see this patch includes a small change to dropuser, to make the
>> 'username' argument mandatory if --interactive is not set, for symmetry
>> with createuser's new behavior. That's dandy, though IMO we shouldn't
>> have -i be shorthand for --interactive with dropuser, and something
>> different with createuser (i.e. we should just get rid of the -i alias
>> for dropuser).
>
> Well, all the other tools also support -i for prompting.

Taking a look at the current ./src/bin/scripts executables, I see only 2 out of 9 (`dropdb` and `dropuser`) which have -i mean --interactive, and `reindexdb` has another meaning for -i entirely. So I'm not sure there's such a clear precedent for having -i mean --interactive within our scripts, at least.

> I'd rather get rid of -i for --inherit, but I fear that will break
> things as well. I'm not sure what to do.

I think breaking backwards compatibility probably won't fly (and should probably be handled by another patch, anyway). I guess it's OK to keep the patch's current behavior, given we are already inconsistent about what -i means.

>> i.e. createuser tries taking either $PGUSER or the current username as
>> a default user to create, while dropuser just bails out. Personally, I
>> prefer just bailing out if no create/drop user is specified, but either
>> way I think they should be consistent.
>
> That is intentional long-standing behavior. createdb/dropdb work the
> same way.

OK.

Josh
Re: [HACKERS] 16-bit page checksums for 9.2
On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
> On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <br...@momjian.us> wrote:
>> On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
>>>> Also, as far as I can see this patch usurps the page version field,
>>>> which I find unacceptably short-sighted. Do you really think this is
>>>> the last page layout change we'll ever make?
>>>
>>> No, I don't. I hope and expect the next page layout change to
>>> reintroduce such a field.
>>>
>>> But since we're agreed now that upgrading is important, changing the
>>> page format isn't likely to happen until we get an online upgrade
>>> process. So future changes are much less likely. If they do happen,
>>> we have some flag bits spare that can be used to indicate later
>>> versions. It's not the prettiest thing in the world, but it's a small
>>> ugliness in return for an important feature. If there was a way
>>> without that, I would have chosen it.
>>
>> Have you considered that the CRC might match a valid page version
>> number? Is that safe?
>
> In the proposed scheme there are two flag bits set on the page to
> indicate whether the field is used as a checksum rather than a version
> number. So while it's possible the checksum could look like a valid
> page version number, we'd still be able to tell the difference.

Thanks. Clearly we don't need 16 bits to represent our page version number, because we rarely change it. The other good thing is I don't remember anyone wanting additional per-page storage in the past few years except for a checksum.

I wonder if we should just dedicate 3 page header bits, call that the page version number, set this new version number to 1, assume all previous versions were zero, and have readers look in the old page version location if the new version number is zero. I am basically thinking of how we can plan ahead to move the version number to a new location and have a defined way of finding the page version number using both old and new schemes. I don't want to get to a point where we need a bit per page version.

-- 
Bruce Momjian  <br...@momjian.us>  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
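[To sketch how Bruce's split-version scheme could be read back; the masks and shifts are invented for illustration, though the low-byte-of-pd_pagesize_version convention for the old location is the existing one:]

    #define PD_VERSION_MASK    0x7000   /* invented: three pd_flags bits */
    #define PD_VERSION_SHIFT   12

    static inline int
    PageLayoutVersion(PageHeader ph)
    {
        int     v = (ph->pd_flags & PD_VERSION_MASK) >> PD_VERSION_SHIFT;

        if (v != 0)
            return v;       /* new scheme: version lives in the flag bits */

        /* old scheme: version is the low byte of pd_pagesize_version */
        return ph->pd_pagesize_version & 0x00FF;
    }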
Re: [HACKERS] double writes using double-write buffer approach [WIP]
On Sun, Feb 5, 2012 at 4:17 PM, Dan Scales <sca...@vmware.com> wrote:
> Thanks for the detailed followup. I do see how Postgres is tuned for
> having a bunch of memory available that is not in shared_buffers, both
> for the OS buffer cache and other memory allocations. However, Postgres
> seems to run fine in many of the large shared_memory configurations
> that I gave performance numbers for, including 5G shared_buffers for an
> 8G machine, 3G shared_buffers for a 6G machine, etc. There just has to
> be sufficient extra memory beyond the shared_buffers cache.

I agree that you could probably set shared_buffers to 3GB on a 6GB machine and get decent performance - but would it be the optimal performance, and for what workload? To really figure out whether this patch is a win, you need to get the system optimally tuned for the unpatched sources (which we can't tell whether you've done, since you haven't posted the configuration settings, or any comparative figures for different settings, or any details on which commit you tested against) and then get the system optimally tuned for the patched sources with double_writes=on, and then see whether there's a gain.

> I think the pgbench run is pointing out a problem that this
> double_writes implementation has with BULK_WRITEs. As you point out,
> the BufferAccessStrategy for BULK_WRITEs will cause lots of dirty
> evictions.

Bulk reads will have the same problem. Consider loading a bunch of data into a new table with COPY, and then scanning the table. The table scan will be a bulk read and every page will be dirtied setting hint bits. Another thing to worry about is vacuum, which also uses a BufferAccessStrategy. Greg Smith has done some previous benchmarking showing that when the kernel is too aggressive about flushing dirty data to disk, vacuum becomes painfully slow. I suspect this patch is going to have that problem in spades (but it would be good to test that). Checkpoints might be a problem, too, since they flush a lot of dirty data, and that's going to require a lot of extra fsyncing with this implementation. It certainly seems that unless you have pg_xlog and the data separated, and a battery-backed write cache for each, checkpoints might be really slow. I'm not entirely convinced they'll be fast even if you have all that (but it would be good to test that, too).

> I'm not sure if there is a great solution that always works for that
> issue. However, I do notice that BULK_WRITE data isn't WAL-logged
> unless archiving/replication is happening. As I understand it, if the
> BULK_WRITE data isn't being WAL-logged, then it doesn't have to be
> double-written either. The BULK_WRITE data is not officially synced and
> committed until it is all written, so there doesn't have to be any
> torn-page protection for that data, which is why the WAL logging can be
> omitted.
>
> The double-write implementation can be improved by marking each buffer
> that doesn't need torn-page protection. These buffers would be those
> new pages that are explicitly not WAL-logged, even when
> full_page_writes is enabled. When such a buffer is eventually synced
> (perhaps because of an eviction), it would not be double-written. This
> would often avoid double-writes for BULK_WRITE, etc., especially since
> the administrator is often not archiving or doing replication when
> doing bulk loads.

I agree - this optimization seems like a must. I'm not sure that it's sufficient, but it certainly seems necessary. It's not going to help with VACUUM, though, so I think that case needs some careful looking at to determine how bad the regression is and what can be done to mitigate it. In particular, I note that I suggested an idea that might help in the final paragraph of my last email.

My general feeling about this patch is that it needs a lot more work before we should consider committing it. Your tests so far overlook quite a few important problem cases (bulk loads, SELECT on large unhinted tables, vacuum speed, checkpoint duration, and others) and still mostly show it losing to full_page_writes, sometimes by large margins. Even in the one case where you got an 8% speedup, it's not really clear that the same speedup (or an even bigger one) couldn't have been gotten by some other kind of tuning. I think you really need to spend some more time thinking about how to blunt the negative impact on the cases where it hurts, and increase the benefit in the cases where it helps. The approach seems to have potential, but it seems way too immature to think about shipping it at this point. (You may have been thinking along similar lines since I note that the patch is marked WIP.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] 16-bit page checksums for 9.2
On 06.02.2012 05:48, Bruce Momjian wrote:
> On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
>> In the proposed scheme there are two flag bits set on the page to
>> indicate whether the field is used as a checksum rather than a version
>> number. So while it's possible the checksum could look like a valid
>> page version number, we'd still be able to tell the difference.
>
> Thanks. Clearly we don't need 16 bits to represent our page version
> number, because we rarely change it. The other good thing is I don't
> remember anyone wanting additional per-page storage in the past few
> years except for a checksum.

There's this idea that I'd like to see implemented: http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php

In any case, I think it's a very bad idea to remove the page version field. How are we supposed to ever be able to change the page format again if we don't even have a version number on the page? I strongly oppose removing it.

I'm also not very comfortable with the idea of having flags on the page indicating whether it has a checksum or not. It's not hard to imagine a software or firmware bug, or a hardware failure, that would cause the pd_flags field to be zeroed out altogether. It would be more robust if the expected bit pattern were not 0-0 but 1-0 or 0-1, but then you need to deal with that at upgrade somehow. And it still feels a bit whacky anyway.

> I wonder if we should just dedicate 3 page header bits, call that the
> page version number, set this new version number to 1, assume all
> previous versions were zero, and have readers look in the old page
> version location if the new version number is zero. I am basically
> thinking of how we can plan ahead to move the version number to a new
> location and have a defined way of finding the page version number
> using both old and new schemes.

Three bits seems short-sighted, but yeah, something like 6-8 bits should be enough. On the whole, though, I think we should bite the bullet and invent a way to extend the page header at upgrade.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com