[HACKERS] Review of SQLDA support for ECPG
I took a look at 2-pg85-sqlda-10-ctxdiff.patch.

Starting from CVS HEAD of roughly 2009-10-03 05:00 UTC, prerequisite patches
1a-1h applied cleanly.  2-pg85-sqlda hit a trivial whitespace reject in
ecpg.trailer along with a more substantive reject at ecpg.addons:407
(FetchStmtMOVEfetch_args).  Fixing it up by hand leads to my first question -
why did the transition from `opt_ecpg_into' to `opt_ecpg_fetch_into' affect
FETCH FORWARD and not FETCH BACKWARD?

The main test suite acquired no regressions, but I get failures in two tests
of the ecpg test suite (make -C src/interfaces/ecpg/test check).  I attach
regression.{out,diff} and postmaster.log from the test run.  The abort in
sqlda.pgc looks like the interesting failure, but I exhausted time with which
to dig into it further.  preproc/cursor.pgc creates (and ideally drops) the
same table `t1' as compat_informix/{sqlda,cursor}.pgc, so a crash in either
of the others makes it fail too.  Could they all use temp tables, use
different table names, or `DROP TABLE IF EXISTS t1' first?

Do those logs suggest the cause of the sqlda.pgc failure?  If not, I will
look into it further.  Otherwise, I'm happy to review a future iteration of
the patch.

As a side note, with patch 1* but not patch 2, test_informix entered an
infinite loop with this error:

ERROR:  syntax error at or near "c" at character 15
STATEMENT:  fetch forward c

Thank you,
nm

test compat_informix/dec_test ... ok
test compat_informix/charfuncs ... ok
test compat_informix/rfmtdate ... ok
test compat_informix/rfmtlong ... ok
test compat_informix/rnull ... ok
test compat_informix/cursor ... ok
test compat_informix/sqlda ... FAILED (test process was terminated by signal 6: Aborted)
test compat_informix/test_informix ... ok
test compat_informix/test_informix2 ... ok
test connect/test2 ... ok
test connect/test3 ... ok
test connect/test4 ... ok
test connect/test5 ... ok
test pgtypeslib/dt_test ... ok
test pgtypeslib/dt_test2 ... ok
test pgtypeslib/num_test ... ok
test pgtypeslib/num_test2 ... ok
test preproc/array_of_struct ... ok
test preproc/autoprep ... ok
test preproc/comment ... ok
test preproc/cursor ... FAILED (test process exited with exit code 1)
test preproc/define ... ok
test preproc/init ... ok
test preproc/strings ... ok
test preproc/type ... ok
test preproc/variable ... ok
test preproc/whenever ... ok
test sql/array ... ok
test sql/binary ... ok
test sql/code100 ... ok
test sql/copystdout ... ok
test sql/define ... ok
test sql/desc ... ok
test sql/dynalloc ... ok
test sql/dynalloc2 ... ok
test sql/dyntest ... ok
test sql/execute ... ok
test sql/fetch ... ok
test sql/func ... ok
test sql/indicators ... ok
test sql/oldexec ... ok
test sql/quote ... ok
test sql/show ... ok
test sql/insupd ... ok
test sql/parser ... ok
test thread/thread ... ok
test thread/thread_implicit ... ok
test thread/prep ... ok
test thread/alloc ... ok
test thread/descriptor ... ok

*** /home/nm/src/pg/bd-sqlda/src/interfaces/ecpg/test/expected/compat_informix-sqlda.stdout	2009-10-03 04:23:59.0 -0400
--- /home/nm/src/pg/bd-sqlda/src/interfaces/ecpg/test/results/compat_informix-sqlda.stdout	2009-10-04 01:52:52.0 -0400
***************
*** 1,60 ****
- FETCH RECORD 1
- name sqlda descriptor: 'id' value 1
- name sqlda descriptor: 't' value 'a'
- name sqlda descriptor: 'd1' value DECIMAL '1.0'
- name sqlda descriptor: 'd2' value 1.00
- name sqlda descriptor: 'c' value 'a '
- FETCH RECORD 2
- name sqlda descriptor: 'id' value 2
- name sqlda descriptor: 't' value NULL'
- name sqlda descriptor: 'd1' value NULL'
- name sqlda descriptor: 'd2' value NULL'
- name sqlda descriptor: 'c' value NULL'
- FETCH RECORD 3
- name sqlda descriptor: 'id' value 3
- name sqlda descriptor: 't' value 'c'
- name sqlda descriptor: 'd1' value DECIMAL '-3'
- name sqlda descriptor: 'd2' value nan
- name sqlda descriptor: 'c' value 'c '
- FETCH RECORD 4
- name sqlda descriptor: 'id' value 4
- name sqlda descriptor: 't' value 'd'
- name sqlda descriptor: 'd1' value DECIMAL '4.0'
- name sqlda descriptor: 'd2' value 4.00
- name sqlda descriptor: 'c' value 'd '
- FETCH RECORD 1
- name sqlda descriptor: 'id' value 1
- name sqlda descriptor: 't' value 'a'
- name sqlda descriptor: 'd1' value DECIMAL '1.0'
- name sqlda descriptor: 'd2' value 1.00
- name sqlda descriptor: 'c' value 'a '
- FETCH RECORD 2
- name sqlda descriptor: 'id' value 2
- name sqlda descriptor: 't' value NULL'
- name sqlda descriptor: 'd1' value NULL'
- name sqlda descriptor: 'd2' value NULL'
- name sqlda descriptor: 'c' value NULL'
- FETCH RECORD 3
- name sqlda descriptor: 'id' value 3
- name sqlda descriptor: 't' value 'c'
- name sqlda descriptor:
Re: [HACKERS] Review of SQLDA support for ECPG
On Mon, Oct 05, 2009 at 02:23:57PM +0200, Boszormenyi Zoltan wrote:
> Noah Misch írta:
> I will post a new patch for SQLDA and for all others that need updating.

Thanks; that one does apply cleanly.

> > The main test suite acquired no regressions, but I get failures in two
> > tests of the ecpg test suite (make -C src/interfaces/ecpg/test check).
> No, this is not a real error.  I have run into this as well, which is
> quickly solved if you execute "make install" before running "make check"
> under the ecpg directory.  I guess you already have an installed PG tree
> at the same prefix as where you have pointed the new one with the SQLDA
> patch applied.  Another solution may be to use a different prefix for the
> SQLDA source tree.

This was exactly the problem; with an unoccupied prefix, all the tests do
pass.

> > As a side note, with patch 1* but not patch 2, test_informix entered an
> > infinite loop with this error:
> > ERROR:  syntax error at or near "c" at character 15
> > STATEMENT:  fetch forward c
> Do you mean that you applied all the split-up patches posted for the
> dynamic cursorname extension?  I didn't get this error.  What did you do
> exactly?

Having started over from a clean base tree, I can no longer reproduce this.
Another operator error, no doubt.  All tests now pass here with 1a-1h of
"Dynamic cursor support for ECPG" alone, with "SQLDA support for ECPG" also
applied, and with "DESCRIBE [OUTPUT] support for ECPG" additionally.

I will report on the sqlda patch in more detail on 2009-10-10.  The one
concern that's clear now is a lack of documentation update.

Thank you,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] *_collapse_limit, geqo_threshold
On Tue, Jul 07, 2009 at 09:31:14AM -0500, Kevin Grittner wrote:
> I don't remember any clear resolution to the wild variations in plan time
> mentioned here:
> http://archives.postgresql.org/pgsql-hackers/2009-06/msg00743.php
> I think it would be prudent to try to figure out why small changes in the
> query caused the large changes in the plan times Andres was seeing.  Has
> anyone else ever seen such behavior?  Can we get examples?  (It should be
> enough to get the statistics and the schema, since this is about planning
> time, not run time.)

With joins between statistically indistinguishable columns, I see planning
times change by a factor of ~4 for each join added or removed (postgres 8.3).
Varying join_collapse_limit in the neighborhood of the actual number of
joins has a similar effect.  See attachment with annotated timings.  The
example uses a single table joined to itself, but using distinct tables with
identical contents yields the same figures.

The exponential factor seems smaller for real queries.  I have a query of
sixteen joins that takes 71s to plan deterministically; it looks like this:

SELECT 1 FROM fact
  JOIN dim0 ... JOIN dim6
  JOIN t t0 ON fact.key = t.key AND t.x = MCV0
  LEFT JOIN t t1 ON fact.key = t.key AND t.x = MCV1
  JOIN t t2 ON fact.key = t.key AND t.x = MCV2
  LEFT JOIN t t3 ON fact.key = t.key AND t.x = NON-MCV0
  LEFT JOIN t t4 ON fact.key = t.key AND t.x = NON-MCV1
  LEFT JOIN t t5 ON fact.key = t.key AND t.x = NON-MCV2
  LEFT JOIN t t6 ON fact.key = t.key AND t.x = NON-MCV3
  LEFT JOIN t t7 ON fact.key = t.key AND t.x = NON-MCV4

For the real query, removing one join drops plan time to 26s, and removing
two drops the time to 11s.  I don't have a good theory for the multiplier
changing from 4 for the trivial demonstration to ~2.5 for this real query.
Re-enabling geqo drops plan time to .5s.  These tests used
default_statistics_target = 1000, but dropping that to 100 does not change
anything dramatically.

> I guess the question is whether there is anyone who has had a contrary
> experience.  (There must have been some benchmarks to justify adding geqo
> at some point?)

I have queries with a few more joins (19-21), and I cancelled attempts to
plan them deterministically after 600+ seconds and 10+ GiB of memory usage.
Even with geqo_effort = 10, they plan within 5-15s with good results.

All that being said, I've never encountered a situation where a value other
than 1 or inf for *_collapse_limit appeared optimal.

nm

SET geqo = off;
SET join_collapse_limit = 100;
CREATE TEMP TABLE t AS SELECT * FROM generate_series(1, 1000) f(n);
ANALYZE t;

--- Vary join count

-- 242.4s
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10 NATURAL JOIN t t11 NATURAL JOIN t t12;
-- 31.2s
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10 NATURAL JOIN t t11;
-- 8.1s
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10;
-- 2.0s
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09;
-- 0.5s
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08;

--- Vary join_collapse_limit

-- 8.1s
SET join_collapse_limit = 100;
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10;
-- 8.0s
SET join_collapse_limit = 11;
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10;
-- 2.2s
SET join_collapse_limit = 10;
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10;
-- 0.5s
SET join_collapse_limit = 9;
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10;
-- 0.1s
SET join_collapse_limit = 8;
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t
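As a cross-check on the "factor of ~4 per join" claim, the timings quoted
above can be run through a quick sketch (plain Python; it uses only the
figures reported in this message, nothing else):

```python
# Planning times (seconds) quoted above for 9 through 13 relations,
# with geqo off and join_collapse_limit = 100.
times = [0.5, 2.0, 8.1, 31.2, 242.4]

# Ratio of plan time between consecutive join counts.
ratios = [b / a for a, b in zip(times, times[1:])]
print(ratios)

# The first three steps sit right at 4x; the largest case grows faster
# (about 7.8x), hinting the exponent steepens as memory pressure mounts.
assert all(3.5 < r < 8 for r in ratios)
```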
Re: [HACKERS] *_collapse_limit, geqo_threshold
On Wed, Jul 08, 2009 at 05:23:16PM -0400, Tom Lane wrote:
> Noah Misch <n...@leadboat.com> writes:
> > The exponential factor seems smaller for real queries.  I have a query
> > of sixteen joins that takes 71s to plan deterministically; it looks like
> > this:
> > SELECT 1 FROM fact
> >   JOIN dim0 ... JOIN dim6
> >   JOIN t t0 ON fact.key = t.key AND t.x = MCV0
> >   LEFT JOIN t t1 ON fact.key = t.key AND t.x = MCV1
> >   JOIN t t2 ON fact.key = t.key AND t.x = MCV2
> >   LEFT JOIN t t3 ON fact.key = t.key AND t.x = NON-MCV0
> >   LEFT JOIN t t4 ON fact.key = t.key AND t.x = NON-MCV1
> >   LEFT JOIN t t5 ON fact.key = t.key AND t.x = NON-MCV2
> >   LEFT JOIN t t6 ON fact.key = t.key AND t.x = NON-MCV3
> >   LEFT JOIN t t7 ON fact.key = t.key AND t.x = NON-MCV4
> 
> I'm confused here --- I think you must have over-anonymized your query.
> Surely the ON conditions for the left joins should be referencing t3, t4,
> etc?

Yes.  Aside from that error, I picked the wrong features to emphasize and
gloss over.  Only two relations (amidst `dim0 ... dim6') resemble
dimensions, having an implied relative position on the plan tree.  The other
fourteen relations share a join key.  That explains the distinctive plan
cost; this query resembles the utterly unconstrained artificial query to a
larger degree than most.

> > For the real query, removing one join drops plan time to 26s, and
> > removing two drops the time to 11s.  I don't have a good theory for the
> > multiplier changing from 4 for the trivial demonstration to ~2.5 for
> > this real query.
> 
> The rule of thumb that says that an n-way join requires 2^n work is only
> true if we consider every single combination of possible joins, which
> normally we don't.  The planner prefers join paths that follow join
> clauses, and will only consider clauseless joins when it has no other
> choice.  I believe that real queries tend to be pretty non-flat in this
> space and so the number of join paths to consider is a lot less than 2^n.
> Your synthesized query, on the other hand, allows any relation to be
> joined to any other --- it might not look that way, but after creating
> derived equalities there will be a potential join clause linking every
> relation to every other one.  So I think you were testing the worst case,
> and I'm not surprised that more-typical queries would show a slower growth
> curve.

Describing it in those terms illuminates much.  While the concepts do
suggest 2^N worst-case planning cost, my artificial test case showed a rigid
4^N pattern; what could explain that?

Thanks,
nm
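For what it's worth, part of the exponent can be approached combinatorially.
The sketch below (plain Python, a toy count rather than PostgreSQL's actual
join search, which builds and prunes paths per pair) counts the
dynamic-programming subproblems for a fully-connected join graph:

```python
from math import comb

def dp_pairs(n):
    """Count the (left, right) sub-join pairs a dynamic-programming join
    search examines when every relation is joinable to every other: for
    each subset S of the n relations with |S| >= 2, every unordered split
    of S into two nonempty halves."""
    return sum(comb(n, k) * (2 ** k - 2) // 2 for k in range(2, n + 1))

# The subproblem count grows by roughly 3x per added relation (it is
# dominated by a 3^n term); per-pair path construction and costing would
# then have to supply the rest of an observed 4^N growth rate.
for n in range(8, 14):
    print(n, dp_pairs(n))
```

With two relations there is exactly one pair, with three there are six, and
the ratio between consecutive counts tends toward 3 as n grows.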
Re: [HACKERS] *_collapse_limit, geqo_threshold
On Wed, Jul 08, 2009 at 10:28:02AM -0500, Robert Haas wrote:
> On Jul 8, 2009, at 8:43 AM, Noah Misch <n...@leadboat.com> wrote:
> > The exponential factor seems smaller for real queries.  I have a query
> > of sixteen joins that takes 71s to plan deterministically; it looks like
> > this:
> > SELECT 1 FROM fact
> >   JOIN dim0 ... JOIN dim6
> >   JOIN t t0 ON fact.key = t.key AND t.x = MCV0
> >   LEFT JOIN t t1 ON fact.key = t.key AND t.x = MCV1
> >   JOIN t t2 ON fact.key = t.key AND t.x = MCV2
> >   LEFT JOIN t t3 ON fact.key = t.key AND t.x = NON-MCV0
> >   LEFT JOIN t t4 ON fact.key = t.key AND t.x = NON-MCV1
> >   LEFT JOIN t t5 ON fact.key = t.key AND t.x = NON-MCV2
> >   LEFT JOIN t t6 ON fact.key = t.key AND t.x = NON-MCV3
> >   LEFT JOIN t t7 ON fact.key = t.key AND t.x = NON-MCV4
> 
> Very interesting!  I am guessing that the problem here is that all the
> inner joins commute, as do all the left joins, so there are a lot of
> possibilities to explore.  I also suspect that the actual join order
> doesn't matter much, so it's a good candidate for GEQO.  Even if you had
> some restriction clauses against your dimension/t tables, that would
> probably just mean that you want to do those joins first, and the rest
> afterwards, which still seems like it ought to be OK for GEQO.  But, in
> many queries this size, the join order is more constrained.  Some of the
> later joins depend on the tables added by the earlier ones, rather than
> all referring back to the base table.
> 
> Is there some way we could estimate the number of join offerings we'll
> have to consider relatively cheaply?  That might be a better metric than
> # of tables.

Observing the pathological case taking plan time proportional to 4^N, apply
geqo_threshold as "use GEQO when comparing more than geqo_threshold * 4^N
join order possibilities"?  I'm not sure whether that figure is available
(early enough) to factor into the decision.  Such a change would probably
imply a lower default geqo_threshold, around 9 to 11.  The number of typical
queries subject to GEQO would nonetheless decrease.

nm
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Mon, Dec 19, 2011 at 11:13:57PM -0500, Tom Lane wrote:
> Noah Misch <n...@leadboat.com> writes:
> > I created a function that does this in a loop:
> > 	HeapTuple	t;
> > 	CatalogCacheFlushCatalog(ProcedureRelationId);
> > 	t = SearchSysCache1(PROCOID, ObjectIdGetDatum(42) /* int4in */);
> > 	if (!HeapTupleIsValid(t))
> > 		elog(ERROR, "cache lookup failed for function 42");
> > 	ReleaseSysCache(t);
> 
> ... but this performance test seems to me to be entirely misguided,
> because it's testing a situation that isn't going to occur much in the
> field, precisely because the syscache should prevent constant reloads of
> the same syscache entry.
> [ideas for more-realistic tests]

Granted, but I don't hope to reliably measure a change in a macro-benchmark
after seeing a rickety 2% change in a micro-benchmark.
Re: [HACKERS] RangeVarGetRelid()
On Mon, Dec 19, 2011 at 11:52:54PM -0500, Robert Haas wrote:
> After staring at this for quite a while longer, it seemed to me that the
> logic for renaming a relation was similar enough to the logic for changing
> a schema that the two callbacks could reasonably be combined using a bit
> of conditional logic; and that, further, the same callback could be used,
> with a small amount of additional modification, for ALTER TABLE.  Here's a
> patch to do that.

Nice.

> I also notice that cluster() - which doesn't have a callback - has exactly
> the same needs as ReindexRelation() - which does.  So that case can
> certainly share code; though I'm not quite sure what to call the shared
> callback, or which file to put it in.  RangeVarCallbackForStorageRewrite?

I'd put it in tablecmds.c and name it RangeVarCallbackOwnsTable.

A few things on the patch:

--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2560,90 +2500,26 @@ CheckTableNotInUse(Relation rel, const char *stmt)
  * Thanks to the magic of MVCC, an error anywhere along the way rolls back
  * the whole operation; we don't have to do anything special to clean up.
  *
- * We lock the table as the first action, with an appropriate lock level
+ * The caller must lock the relation, with an appropriate lock level
  * for the subcommands requested.  Any subcommand that needs to rewrite
  * tuples in the table forces the whole command to be executed with
- * AccessExclusiveLock.  If all subcommands do not require rewrite table
- * then we may be able to use lower lock levels.  We pass the lock level down
+ * AccessExclusiveLock (actually, that is currently required always, but
+ * we hope to relax it at some point).  We pass the lock level down
  * so that we can apply it recursively to inherited tables.  Note that the
- * lock level we want as we recurse may well be higher than required for
+ * lock level we want as we recurse might well be higher than required for
  * that specific subcommand.  So we pass down the overall lock requirement,
  * rather than reassess it at lower levels.
  */
 void
-AlterTable(AlterTableStmt *stmt)
+AlterTable(Oid relid, LOCKMODE lockmode, AlterTableStmt *stmt)
 {
 	Relation	rel;
-	LOCKMODE	lockmode = AlterTableGetLockLevel(stmt->cmds);
 
-	/*
-	 * Acquire same level of lock as already acquired during parsing.
-	 */
-	rel = relation_openrv(stmt->relation, lockmode);
+	/* Caller is required to provide an adequate lock. */
+	rel = relation_open(relid, NoLock);
 
 	CheckTableNotInUse(rel, "ALTER TABLE");
 
-	/* Check relation type against type specified in the ALTER command */
-	switch (stmt->relkind)
-	{
-		case OBJECT_TABLE:
-
-			/*
-			 * For mostly-historical reasons, we allow ALTER TABLE to apply to
-			 * almost all relation types.
-			 */
-			if (rel->rd_rel->relkind == RELKIND_COMPOSITE_TYPE
-				|| rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
-				ereport(ERROR,
-						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-						 errmsg("\"%s\" is not a table",
-								RelationGetRelationName(rel))));

RangeVarCallbackForAlterRelation() does not preserve ALTER TABLE's refusal
to operate on foreign tables.

-			break;
-
-		case OBJECT_INDEX:
-			if (rel->rd_rel->relkind != RELKIND_INDEX)
-				ereport(ERROR,
-						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-						 errmsg("\"%s\" is not an index",
-								RelationGetRelationName(rel))));
-			break;
-
-		case OBJECT_SEQUENCE:
-			if (rel->rd_rel->relkind != RELKIND_SEQUENCE)
-				ereport(ERROR,
-						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-						 errmsg("\"%s\" is not a sequence",
-								RelationGetRelationName(rel))));
-			break;
-
-		case OBJECT_TYPE:
-			if (rel->rd_rel->relkind != RELKIND_COMPOSITE_TYPE)
-				ereport(ERROR,
-						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-						 errmsg("\"%s\" is not a composite type",
-								RelationGetRelationName(rel))));
-			break;
-
-		case OBJECT_VIEW:
-			if (rel->rd_rel->relkind
Re: [HACKERS] RangeVarGetRelid()
On Wed, Dec 21, 2011 at 03:16:39PM -0500, Robert Haas wrote:
> On Tue, Dec 20, 2011 at 8:14 PM, Noah Misch <n...@leadboat.com> wrote:
> > RangeVarCallbackForAlterRelation() does not preserve ALTER TABLE's
> > refusal to operate on foreign tables.
> 
> I should probably fix that, but I'm wondering if I ought to fix it by
> disallowing the use of ALTER TABLE on FOREIGN TABLEs for the other
> commands that share the callback as well.  Allowing ALTER TABLE to apply
> to any relation type is mostly a legacy thing, I think, and any code
> that's new enough to know about foreign tables isn't old enough to know
> about the time when you had to use ALTER TABLE to rename views.

Maybe.  Now that we have a release with these semantics, I'd lean toward
preserving the wart and being more careful next time.  It's certainly a
borderline case, though.

> > RangeVarCallbackForAlterRelation() does not preserve the check for
> > unexpected object types.
> 
> I don't feel a strong need to retain that.

Okay.

> > utility.c doesn't take locks for any other command; parse analysis
> > usually does that.  To preserve that modularity, you could add a bool
> > toplevel argument to transformAlterTableStmt().  Pass true here, false
> > in ATPostAlterTypeParse().  If true, use AlterTableLookupRelation() to
> > get full security checks.  Otherwise, just call relation_openrv() as
> > now.  Would that be an improvement?
> 
> Not sure.  I feel that it's unwise to pass relation names all over the
> backend and assume that nothing will change meanwhile; no locking we do
> will prevent that, at least in the case of search path interposition.
> Ultimately I think this ought to be restructured somehow so that we look
> up each name ONCE and ever-after refer only to the resulting OID (except
> for error message text).  But I'm not sure how to do that, and thought it
> might make sense to commit this much independently of such a refactoring.

I agree with all that, though my suggestion would not have increased the
number of by-name lookups.
Re: [HACKERS] Collect frequency statistics for arrays
On Tue, Dec 20, 2011 at 04:37:37PM +0400, Alexander Korotkov wrote:
> On Wed, Nov 16, 2011 at 1:43 AM, Nathan Boley <npbo...@gmail.com> wrote:
> > FYI, I've added myself as the reviewer for the current commitfest.
> 
> How is going review now?

I will examine this patch within the week.
Re: [HACKERS] 16-bit page checksums for 9.2
On Thu, Dec 29, 2011 at 11:08:43AM -0600, Kevin Grittner wrote:
> Heikki Linnakangas wrote:
> > Simon Riggs wrote:
> > > OK, then we are talking at cross purposes.  Double write buffers, in
> > > the way you explain them allow us to remove full page writes.  They
> > > clearly don't do anything to check page validity on read.  Torn pages
> > > are not the only fault we wish to correct against... and the double
> > > writes idea is orthogonal to the idea of checksums.
> > 
> > The reason we're talking about double write buffers in this thread is
> > that double write buffers can be used to solve the problem with hint
> > bits and checksums.
> 
> Of course, it will be a big plus if we can roll this out in 9.2 in
> conjunction with a double-write feature.  Not only will double-write
> probably be a bit faster than full_page_writes in the WAL log, but it
> will allow protection against torn pages on hint-bit-only writes without
> adding those writes to the WAL or doing any major rearrangement of where
> they sit that would break pg_upgrade.

[Thanks for your recent thread summaries.]

A double-write buffer, like a WAL-logged full-page image, is a technique for
performing atomic writes wider than those automatically provided by
components further down the storage stack.  The two strategies have
different performance characteristics, and we're told that a double-write
buffer would better serve us overall.  However, its benefits would not be
*greater* for hint-only writes than for any other write.  For that reason, I
think we should consider these changes independently.  With page checksums
enabled, remove the hazard of torn hint-only writes by ensuring that a WAL
FPI has flushed since the last checkpoint.  When necessary, emit an FPI-only
record.  Separately, optimize first-since-checkpoint writes by replacing
FPIs with double-write buffers.  The double-write patch will reduce the
added WAL of the checksum/safe-hint-updates patch to zero.  If the
double-writes patch founders, we'll just have more-costly, yet equally
reliable, page checksums.
nm
[HACKERS] get_fn_expr_argtype() vs. internal calls
We document that a polymorphic C-language function may identify the concrete
data type of each argument using calls to get_fn_expr_argtype().  That
relies on FmgrInfo.fn_expr, which only the executor sets.  Calls of internal
origin, by way of {Direct,,Oid}FunctionCall*(), don't cons up an fn_expr, so
get_fn_expr_argtype() just returns InvalidOid every time.  (Indeed, we
couldn't easily do better in many cases.)  To what extent is it safe to rely
on this situation remaining as it is?

I ask on account of some second thoughts I had about CheckIndexCompatible().
When writing it, I did not explicitly consider operator classes having
polymorphic opcintype.  If get_fn_expr_argtype() were to work in a function
called from the btree search code, CheckIndexCompatible() should impose
stricter checks on indexes having opclasses of polymorphic opcintype.  If
that's not too likely to happen, I might just add a comment instead.

Thanks,
nm
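The mechanism can be restated with a toy model (plain Python, not backend
code; the dict standing in for fn_expr and the FmgrInfo class here are
invented stand-ins, though InvalidOid = 0 and OID 23 = int4 match the real
catalogs): get_fn_expr_argtype() can only report what fn_expr carries, and
internal call sites never supply one.

```python
InvalidOid = 0  # PostgreSQL's invalid OID really is zero

class FmgrInfo:
    """Toy stand-in for the fmgr call-info struct."""
    def __init__(self, fn_expr=None):
        # The executor fills fn_expr from the parse tree; internal
        # {Direct,,Oid}FunctionCall*() paths leave it unset.
        self.fn_expr = fn_expr

def get_fn_expr_argtype(flinfo, argnum):
    """Report the concrete argument type, if the call site recorded one."""
    if flinfo.fn_expr is None:
        return InvalidOid
    return flinfo.fn_expr["argtypes"][argnum]

# Executor-originated call: the argument's concrete type is recoverable.
executor_call = FmgrInfo(fn_expr={"argtypes": [23]})
assert get_fn_expr_argtype(executor_call, 0) == 23

# Internal call: no fn_expr, so only InvalidOid comes back.
internal_call = FmgrInfo()
assert get_fn_expr_argtype(internal_call, 0) == InvalidOid
```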
[HACKERS] Add protransform for numeric, varbit, and temporal types
Building on commit 8f9fe6edce358f7904e0db119416b4d1080a83aa, this adds
protransform functions to the length coercions for numeric, varbit,
timestamp, timestamptz, time, timetz and interval.  This mostly serves to
make more ALTER TABLE ALTER TYPE operations avoid a rewrite, including
numeric(10,2) -> numeric(12,2), varbit(4) -> varbit(8) and timestamptz(2) ->
timestamptz(4).

The rules for varbit are exactly the same as for varchar.  Numeric is
slightly more complex:

 * Flatten calls to our length coercion function that solely represent
 * increases in allowable precision.  Scale changes mutate every datum, so
 * they are unoptimizable.  Some values, e.g. 1E-1001, can only fit into an
 * unconstrained numeric, so a change from an unconstrained numeric to any
 * constrained numeric is also unoptimizable.

time{,stamp}{,tz} are similar to varchar for these purposes, except that,
for example, plain timestamptz is equivalent to timestamptz(6).  interval
has a vastly different typmod format, but the principles applicable to
length coercion remain the same.

Under --disable-integer-datetimes, I'm not positive that timestamp_scale()
is always a no-op when one would logically expect as much.  Does there exist
a timestamp such that v::timestamp(2) differs from
v::timestamp(2)::timestamp(4) due to floating point rounding?  Even if so,
I'm fairly comfortable calling it a feature rather than a bug to avoid
perturbing values that way.

After these patches, the only core length coercion casts not having
protransform functions are those for bpchar and bit.  For those, we could
only optimize trivial cases of no length change.  I'm not planning to do so.
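The numeric rule above can be sketched as a small decision model (plain
Python, a hedged illustration only - not the patch's C code; the typmod
packing shown does follow numeric's actual ((precision << 16) | scale) +
VARHDRSZ convention, with a negative typmod meaning unconstrained):

```python
VARHDRSZ = 4  # numeric typmods carry this offset, as in the real encoding

def numeric_typmod(precision, scale):
    """Pack (precision, scale) the way numeric's typmod does."""
    return ((precision << 16) | scale) + VARHDRSZ

def unpack(typmod):
    t = typmod - VARHDRSZ
    return t >> 16, t & 0xFFFF

def coercion_is_noop(old_typmod, new_typmod):
    """Can the length-coercion call be flattened (no table rewrite)?"""
    if new_typmod < 0:      # target unconstrained: any numeric fits
        return True
    if old_typmod < 0:      # unconstrained source may hold e.g. 1E-1001
        return False
    old_p, old_s = unpack(old_typmod)
    new_p, new_s = unpack(new_typmod)
    # Scale changes mutate every datum; only pure precision increases
    # are safe to skip.
    return new_s == old_s and new_p >= old_p

# numeric(10,2) -> numeric(12,2): avoids a rewrite
assert coercion_is_noop(numeric_typmod(10, 2), numeric_typmod(12, 2))
# numeric(12,2) -> numeric(12,3): scale change forces a rewrite
assert not coercion_is_noop(numeric_typmod(12, 2), numeric_typmod(12, 3))
# unconstrained numeric -> numeric(12,2): forces a rewrite
assert not coercion_is_noop(-1, numeric_typmod(12, 2))
```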
Thanks,
nm

diff --git a/src/backend/utils/adt/varbit.c b/src/backend/utils/adt/varbit.c
index 3fa8117..e5a03b9 100644
*** a/src/backend/utils/adt/varbit.c
--- b/src/backend/utils/adt/varbit.c
***************
*** 18,23 ****
--- 18,25 ----
  #include "access/htup.h"
  #include "libpq/pqformat.h"
+ #include "nodes/nodeFuncs.h"
+ #include "parser/parse_clause.h"
  #include "utils/array.h"
  #include "utils/varbit.h"
***************
*** 646,651 **** varbit_send(PG_FUNCTION_ARGS)
--- 648,686 ----
  }
  
  /*
+  * varbit_transform()
+  * Flatten calls to our length coercion function that leave the new maximum
+  * length >= the previous maximum length.  We ignore the isExplicit
+  * argument, which only affects truncation.
+  */
+ Datum
+ varbit_transform(PG_FUNCTION_ARGS)
+ {
+ 	FuncExpr   *expr = (FuncExpr *) PG_GETARG_POINTER(0);
+ 	Node	   *typmod;
+ 	Node	   *ret = NULL;
+ 
+ 	if (!IsA(expr, FuncExpr))
+ 		PG_RETURN_POINTER(ret);
+ 
+ 	Assert(list_length(expr->args) == 3);
+ 	typmod = lsecond(expr->args);
+ 
+ 	if (IsA(typmod, Const))
+ 	{
+ 		Node	   *source = linitial(expr->args);
+ 		int32		new_typmod = DatumGetInt32(((Const *) typmod)->constvalue);
+ 		int32		old_max = exprTypmod(source);
+ 		int32		new_max = new_typmod;
+ 
+ 		if (new_max <= 0 || (old_max >= 0 && old_max <= new_max))
+ 			ret = relabel_to_typmod(source, new_typmod);
+ 	}
+ 
+ 	PG_RETURN_POINTER(ret);
+ }
+ 
+ /*
  * varbit()
  * Converts a varbit() type to a specific internal length.
  * len is the maximum bitlength specified in the column definition.
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index c893c3a..2a71f82 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 2005,2011 ****
  DESCR("convert bitstring to int4");
  DATA(insert OID = 1685 (  bit			PGNSP PGUID 12 1 0 0 0 f f f t f i 3 0 1560 "1560 23 16" _null_ _null_ _null_ _null_ bit _null_ _null_ _null_ ));
  DESCR("adjust bit() to typmod length");
! DATA(insert OID = 1687 (  varbit		PGNSP PGUID 12 1 0 0 0 f f f t f i 3 0 1562 "1562 23 16" _null_ _null_ _null_ _null_ varbit _null_ _null_ _null_ ));
  DESCR("adjust varbit() to typmod length");
  DATA(insert OID = 1698 ( position		PGNSP PGUID 12 1 0 0 0 f f f t f i 2 0 23 "1560 1560" _null_ _null_ _null_ _null_ bitposition _null_ _null_ _null_ ));
--- 2005,2013 ----
  DATA(insert OID = 1685 (  bit			PGNSP PGUID 12 1 0 0 0 f f f t f i 3 0 1560 "1560 23 16" _null_ _null_ _null_ _null_ bit _null_ _null_ _null_ ));
  DESCR("adjust bit() to typmod length");
! DATA(insert OID = 3145 (  varbit_transform	PGNSP PGUID 12 1 0 0 0 f f f t f i 1 0 2281 "2281" _null_ _null_ _null_ _null_ varbit_transform _null_ _null_ _null_ ));
! DESCR("transform a varbit length coercion");
! DATA(insert OID = 1687 (  varbit		PGNSP PGUID 12 1 0 0 3145 f f f t f i 3 0 1562 "1562 23 16" _null_ _null_ _null_ _null_ varbit _null_ _null_ _null_ ));
  DESCR("adjust varbit() to typmod length");
  DATA(insert OID = 1698 ( position		PGNSP PGUID 12 1 0 0 0 f f f t f i 2 0 23 "1560 1560" _null_ _null_ _null_ _null_ bitposition _null_ _null_ _null_ ));
Re: [HACKERS] pg_upgrade and relkind filtering
On Mon, Dec 05, 2011 at 05:06:37PM -0500, Bruce Momjian wrote:
> Pg_upgrade has the following check to make sure the cluster is safe for
> upgrading:
> 
> 	res = executeQueryOrDie(conn,
> 			SELECT n.nspname, c.relname, a.attname
> 			FROM	pg_catalog.pg_class c,
> 					pg_catalog.pg_namespace n,
> 					pg_catalog.pg_attribute a
> 			WHERE	c.oid = a.attrelid AND
> 					NOT a.attisdropped AND
> 					a.atttypid IN (
> 						'pg_catalog.regproc'::pg_catalog.regtype,
> 						'pg_catalog.regprocedure'::pg_catalog.regtype,
> 						'pg_catalog.regoper'::pg_catalog.regtype,
> 						'pg_catalog.regoperator'::pg_catalog.regtype,
> 		/* regclass.oid is preserved, so 'regclass' is OK */
> 		/* regtype.oid is preserved, so 'regtype' is OK */
> 						'pg_catalog.regconfig'::pg_catalog.regtype,
> 						'pg_catalog.regdictionary'::pg_catalog.regtype) AND
> 					c.relnamespace = n.oid AND
> 					n.nspname != 'pg_catalog' AND
> 					n.nspname != 'information_schema');
> 
> Based on a report from EnterpriseDB, I noticed that we check all pg_class
> entries, while there are cases where this is unnecessary because there is
> no data behind the entry, e.g. views.  Here are the relkinds supported:
> 
> #define RELKIND_RELATION		'r'	/* ordinary table */
> #define RELKIND_INDEX			'i'	/* secondary index */
> #define RELKIND_SEQUENCE		'S'	/* sequence object */
> #define RELKIND_TOASTVALUE		't'	/* for out-of-line values */
> #define RELKIND_VIEW			'v'	/* view */
> #define RELKIND_COMPOSITE_TYPE	'c'	/* composite type */
> #define RELKIND_FOREIGN_TABLE	'f'	/* foreign table */
> #define RELKIND_UNCATALOGED	'u'	/* not yet cataloged */
> 
> What types, other than views, can we skip in this query?

RELKIND_UNCATALOGED should never appear on disk, and RELKIND_SEQUENCE and
RELKIND_TOASTVALUE do not allow adding columns or changing column types.  We
might as well keep validating them.  RELKIND_RELATION and RELKIND_INDEX have
storage, so we must check those.

The remaining three relkinds (RELKIND_VIEW, RELKIND_COMPOSITE_TYPE,
RELKIND_FOREIGN_TABLE) have no storage, but all are usable as column types
in other relations that do have storage.  You could skip them iff they're
unused that way, per a check like find_composite_type_dependencies().
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Mon, Jan 02, 2012 at 05:09:16PM +, Simon Riggs wrote: Attached patch makes SnapshotNow into an MVCC snapshot, initialised at the start of each scan iff SnapshotNow is passed as the scan's snapshot. It's fairly brief but seems to do the trick.

That's a neat trick. However, if you start a new SnapshotNow scan while one is ongoing, the primordial scan's snapshot will change mid-stream.
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Mon, Jan 02, 2012 at 06:41:31PM +, Simon Riggs wrote: On Mon, Jan 2, 2012 at 6:06 PM, Noah Misch n...@leadboat.com wrote: On Mon, Jan 02, 2012 at 05:09:16PM +, Simon Riggs wrote: Attached patch makes SnapshotNow into an MVCC snapshot, initialised at the start of each scan iff SnapshotNow is passed as the scan's snapshot. It's fairly brief but seems to do the trick. That's a neat trick. However, if you start a new SnapshotNow scan while one is ongoing, the primordial scan's snapshot will change mid-stream. Do we ever do that? (and if so, Why?!? or perhaps just Where?)

I hacked up your patch a bit, as attached, to emit a WARNING for any nested use of SnapshotNow. This made 97/127 test files fail. As one example, RelationBuildRuleLock() does TextDatumGetCString() for every tuple of its SnapshotNow scan. That may need a detoast, which itself runs a scan.

We can use more complex code if required, but we'll be adding complexity and code into the main path that I'd like to avoid.

Agreed.

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*** 1201,1206 heap_beginscan_internal(Relation relation, Snapshot snapshot,
--- 1201,1208 ----
   */
  scan->rs_pageatatime = IsMVCCSnapshot(snapshot);

+ InitSnapshotNowIfNeeded(snapshot);
+
  /*
   * For a seqscan in a serializable transaction, acquire a predicate lock
   * on the entire relation. This is required not only to lock all the
*** 1270,1275 heap_endscan(HeapScanDesc scan)
--- 1272,1279 ----
  if (BufferIsValid(scan->rs_cbuf))
      ReleaseBuffer(scan->rs_cbuf);

+ FinishSnapshotNow(scan->rs_snapshot);
+
  /*
   * decrement relation reference count and free scan descriptor storage
   */
*** 1680,1685 heap_get_latest_tid(Relation relation,
--- 1684,1691 ----
  if (!ItemPointerIsValid(tid))
      return;

+ InitSnapshotNowIfNeeded(snapshot);
+
  /*
   * Since this can be called with user-supplied TID, don't trust the input
   * too much. (RelationGetNumberOfBlocks is an expensive check, so we
*** 1775,1780 heap_get_latest_tid(Relation relation,
--- 1781,1788 ----
  priorXmax = HeapTupleHeaderGetXmax(tp.t_data);
  UnlockReleaseBuffer(buffer);
  } /* end of loop */
+
+ FinishSnapshotNow(snapshot);
  }

*** a/src/backend/access/index/indexam.c
--- b/src/backend/access/index/indexam.c
*** 292,297 index_beginscan_internal(Relation indexRelation,
--- 292,299 ----
   */
  RelationIncrementReferenceCount(indexRelation);

+ InitSnapshotNowIfNeeded(snapshot);
+
  /*
   * Tell the AM to open a scan.
   */
*** 370,375 index_endscan(IndexScanDesc scan)
--- 372,379 ----
  /* End the AM's scan */
  FunctionCall1(procedure, PointerGetDatum(scan));

+ FinishSnapshotNow(scan->xs_snapshot);
+
  /* Release index refcount acquired by index_beginscan */
  RelationDecrementReferenceCount(scan->indexRelation);

*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
*** 205,210 GetLatestSnapshot(void)
--- 205,235 ----
  return SecondarySnapshot;
  }

+ static bool SnapshotNowActive;
+
+ /*
+  * InitSnapshotNowIfNeeded
+  *	If passed snapshot is SnapshotNow then refresh it with latest info.
+  */
+ void
+ InitSnapshotNowIfNeeded(Snapshot snap)
+ {
+ 	if (!IsSnapshotNow(snap))
+ 		return;
+
+ 	if (SnapshotNowActive)
+ 		elog(WARNING, "SnapshotNow used concurrently");
+
+ 	snap = GetSnapshotData(&SnapshotNowData);
+ 	SnapshotNowActive = true;
+ }
+
+ void
+ FinishSnapshotNow(Snapshot snap)
+ {
+ 	if (IsSnapshotNow(snap))
+ 		SnapshotNowActive = false;
+ }
+
  /*
   * SnapshotSetCommandId
   *	Propagate CommandCounterIncrement into the static snapshots, if set

*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
*** 67,73 ****
  /* Static variables representing various special snapshot semantics */
! SnapshotData SnapshotNowData = {HeapTupleSatisfiesNow};
  SnapshotData SnapshotSelfData = {HeapTupleSatisfiesSelf};
  SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
  SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
--- 67,73 ----
  /* Static variables representing various special snapshot semantics */
! SnapshotData SnapshotNowData = {HeapTupleSatisfiesMVCC};
  SnapshotData SnapshotSelfData = {HeapTupleSatisfiesSelf};
  SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
  SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
*** 293,298 HeapTupleSatisfiesSelf(HeapTupleHeader tuple
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Mon, Jan 02, 2012 at 04:33:28PM -0300, Alvaro Herrera wrote: Excerpts from Noah Misch's message of lun ene 02 16:25:25 -0300 2012: On Mon, Jan 02, 2012 at 06:41:31PM +, Simon Riggs wrote: On Mon, Jan 2, 2012 at 6:06 PM, Noah Misch n...@leadboat.com wrote: On Mon, Jan 02, 2012 at 05:09:16PM +, Simon Riggs wrote: Attached patch makes SnapshotNow into an MVCC snapshot, initialised at the start of each scan iff SnapshotNow is passed as the scan's snapshot. It's fairly brief but seems to do the trick. That's a neat trick. However, if you start a new SnapshotNow scan while one is ongoing, the primordial scan's snapshot will change mid-stream. Do we ever do that? (and if so, Why?!? or perhaps just Where?) I hacked up your patch a bit, as attached, to emit a WARNING for any nested use of SnapshotNow. This made 97/127 test files fail. As one example, RelationBuildRuleLock() does TextDatumGetCString() for every tuple of its SnapshotNow scan. That may need a detoast, which itself runs a scan.

Uh, I thought detoasting had its own visibility test function .. I mean, otherwise, what is HeapTupleSatisfiesToast for?

The SnapshotNow scan was actually to build the relcache entry for the toast table before scanning the toast table itself.
Stack trace:

#0  InitSnapshotNowIfNeeded (snap=0xb7e8c0) at snapmgr.c:216
#1  0x0048f773 in index_beginscan_internal (indexRelation=0x7f9a091e5cc8, nkeys=2, norderbys=0, snapshot=0xb7e8c0) at indexam.c:295
#2  0x0048f7e9 in index_beginscan (heapRelation=0x7f9a091a6c60, indexRelation=0xb7e8c0, snapshot=0xb7e8c0, nkeys=152984776, norderbys=0) at indexam.c:238
#3  0x0048e1de in systable_beginscan (heapRelation=0x7f9a091a6c60, indexId=<value optimized out>, indexOK=<value optimized out>, snapshot=0xb7e8c0, nkeys=2, key=0x7fffe211b610) at genam.c:289
#4  0x007287db in RelationBuildTupleDesc (targetRelId=<value optimized out>, insertIt=1 '\001') at relcache.c:455
#5  RelationBuildDesc (targetRelId=<value optimized out>, insertIt=1 '\001') at relcache.c:880
#6  0x0072a050 in RelationIdGetRelation (relationId=2838) at relcache.c:1567
#7  0x004872c0 in relation_open (relationId=2838, lockmode=1) at heapam.c:910
#8  0x00487328 in heap_open (relationId=12052672, lockmode=152984776) at heapam.c:1052
#9  0x0048a737 in toast_fetch_datum (attr=<value optimized out>) at tuptoaster.c:1586
#10 0x0048bf74 in heap_tuple_untoast_attr (attr=0xb7e8c0) at tuptoaster.c:135
#11 0x00737d77 in pg_detoast_datum_packed (datum=0xb7e8c0) at fmgr.c:2266
#12 0x006f0fb0 in text_to_cstring (t=0xb7e8c0) at varlena.c:135
#13 0x00727083 in RelationBuildRuleLock (relation=0x7f9a091b3818) at relcache.c:673
#14 0x0072909e in RelationBuildDesc (targetRelId=<value optimized out>, insertIt=1 '\001') at relcache.c:886
#15 0x0072a050 in RelationIdGetRelation (relationId=9) at relcache.c:1567
#16 0x004872c0 in relation_open (relationId=9, lockmode=0) at heapam.c:910
#17 0x00487483 in relation_openrv_extended (relation=<value optimized out>, lockmode=<value optimized out>, missing_ok=<value optimized out>) at heapam.c:1011
#18 0x0048749c in heap_openrv_extended (relation=0xb7e8c0, lockmode=152984776, missing_ok=5 '\005') at heapam.c:1110
#19 0x0053570c in parserOpenTable (pstate=0xc885f8, relation=0xc88258, lockmode=1) at parse_relation.c:835
#20 0x0053594d in addRangeTableEntry (pstate=0xc885f8, relation=0xc88258, alias=0x0, inh=1 '\001', inFromCl=1 '\001') at parse_relation.c:901
#21 0x00524dbf in transformTableEntry (pstate=0xc885f8, n=0xc88258, top_rte=0x7fffe211bd78, top_rti=0x7fffe211bd84, relnamespace=0x7fffe211bd70, containedRels=0x7fffe211bd68) at parse_clause.c:445
#22 transformFromClauseItem (pstate=0xc885f8, n=0xc88258, top_rte=0x7fffe211bd78, top_rti=0x7fffe211bd84, relnamespace=0x7fffe211bd70, containedRels=0x7fffe211bd68) at parse_clause.c:683
#23 0x00526124 in transformFromClause (pstate=0xc885f8, frmList=<value optimized out>) at parse_clause.c:129
#24 0x0050c321 in transformSelectStmt (pstate=0xb7e8c0, parseTree=0xc88340) at analyze.c:896
#25 transformStmt (pstate=0xb7e8c0, parseTree=0xc88340) at analyze.c:186
#26 0x0050d583 in parse_analyze (parseTree=0xc88340, sourceText=0xc87828 "table pg_stat_user_tables;", paramTypes=0x0, numParams=0) at analyze.c:94
#27 0x006789c5 in pg_analyze_and_rewrite (parsetree=0xc88340, query_string=0xc87828 "table pg_stat_user_tables;", paramTypes=0x0, numParams=0) at postgres.c:583
#28 0x00678c77 in exec_simple_query (query_string=0xc87828 "table pg_stat_user_tables;") at postgres.c:941
#29 0x00679997 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=0xbe03c0 "nm") at postgres.c:3881
#30 0x00633d41 in BackendRun () at postmaster.c:3587
#31 BackendStartup
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Tue, Jan 03, 2012 at 01:18:41AM +, Simon Riggs wrote: Just for the record, yes we do run multiple catalog scans in some parts of the code. So I can see how we might trigger 4 nested scans, using cache replacement while scanning, so best assume more, with no guarantee of them being neatly stacked for pop/push type access.

Yeah, I wouldn't want to commit to a nesting limit. However, I _would_ have expected that a stack would suffice; PushActiveSnapshot()/PopActiveSnapshot() is adequate for a great deal of the backend, after all. In what sort of situation do catalog scans not strictly nest?

Thanks,
nm
Re: [HACKERS] Patch to allow users to kill their own queries
On Tue, Dec 20, 2011 at 02:30:08PM +0100, Magnus Hagander wrote: That said - can someone who knows the translation stuff better than me comment on if this is actually going to be translatable, or if it violates too many translation rules?

+pg_signal_backend(int pid, int sig, bool allow_same_role, const char *actionstr, const char *hint)

+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser to %s other server processes", actionstr),
+				 errhint("%s", hint)));

+	PG_RETURN_BOOL(pg_signal_backend(PG_GETARG_INT32(0), SIGTERM, false,
+				   gettext_noop("terminate"),
+				   gettext_noop("You can cancel your own processes with pg_cancel_backend().")));
 }

You need errhint("%s", _(hint)) or errhint(hint) to substitute the translation at runtime; only the printf-pattern string gets an automatic message catalog lookup.

Regarding the other message, avoid composing a translated message from independently-translated parts. The translator sees this:

#: utils/adt/misc.c:110
#, c-format
msgid "must be superuser to %s other server processes"
msgstr ""

#: utils/adt/misc.c:166
msgid "terminate"
msgstr ""

#: utils/adt/misc.c:167
msgid "You can cancel your own processes with pg_cancel_backend()."
msgstr ""

He'll probably need to read the code to see how the strings go together. If we add an alternative to "terminate", not all languages will necessarily have a translation of the outer message that works for both inner fragments.
Re: [HACKERS] Collect frequency statistics for arrays
On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote: Thanks for your great work on reviewing this patch. Now I'm trying to find the memory corruption bug. Unfortunately it doesn't appear on my system. Can you check if this bug remains in the attached version of the patch? If so, please provide me information about the system you're running (processor, OS etc.).

I get the same diagnostic from this version. Opteron processor, operating system is Ubuntu 8.04 (64-bit). You're using --enable-cassert, right?
[HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE
Git master can already avoid rewriting the table for column type changes like varchar(10) -> varchar(20). However, if the column in question is on either side of a FK relationship, we always revalidate the foreign key. Concretely, I wanted these no-rewrite type changes to also assume FK validity:

- Any typmod-only change
- text -> varchar
- domain -> base type

To achieve that, this patch skips the revalidation when two conditions hold:

(a) Old and new pg_constraint.conpfeqop match exactly. This is actually stronger than needed; we could loosen things by way of operator families. However, no core type would benefit, and my head exploded when I tried to define the more-generous test correctly.

(b) The functions, if any, implementing a cast from the foreign type to the primary opcintype are the same. For this purpose, we can consider a binary coercion equivalent to an exact type match. When the opcintype is polymorphic, require that the old and new foreign types match exactly. (Since ri_triggers.c does use the executor, the stronger check for polymorphic types is no mere future-proofing. However, no core type exercises its necessity.)

These follow from the rules used to decide when to rebuild an index. I further justify them in source comments. To implement this, I have ATPostAlterTypeParse() stash the content of the old pg_constraint.conpfeqop in the Constraint node. ATAddForeignKeyConstraint() notices that and evaluates the above rules. If they both pass, it omits the validation step just as though skip_validation had been given.
Thanks,
nm

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3b52415..0cde503 100644
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** 276,281 static Oid transformFkeyCheckAttrs(Relation pkrel,
--- 276,282 ----
  			int numattrs, int16 *attnums,
  			Oid *opclasses);
  static void checkFkeyPermissions(Relation rel, int16 *attnums, int natts);
+ static CoercionPathType findFkeyCast(Oid targetTypeId, Oid sourceTypeId, Oid *funcid);
  static void validateCheckConstraint(Relation rel, HeapTuple constrtup);
  static void validateForeignKeyConstraint(char *conname, Relation rel, Relation pkrel,
*** 357,362 static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMOD
--- 358,364 ----
  static void ATPostAlterTypeParse(Oid oldId, char *cmd, List **wqueue,
  			LOCKMODE lockmode, bool rewrite);
  static void TryReuseIndex(Oid oldId, IndexStmt *stmt);
+ static void TryReuseForeignKey(Oid oldId, Constraint *con);
  static void change_owner_fix_column_acls(Oid relationOid,
  			Oid oldOwnerId, Oid newOwnerId);
  static void change_owner_recurse_to_sequences(Oid relationOid,
*** 5696,5701 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5698,5705 ----
  			numpks;
  	Oid		indexOid;
  	Oid		constrOid;
+ 	bool		old_check_ok;
+ 	ListCell   *old_pfeqop_item = list_head(fkconstraint->old_conpfeqop);

  	/*
  	 * Grab an exclusive lock on the pk table, so that someone doesn't delete
*** 5812,5817 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5816,5828 ----
  			(errcode(ERRCODE_INVALID_FOREIGN_KEY),
  			 errmsg("number of referencing and referenced columns for foreign key disagree")));

+ 	/*
+ 	 * On the strength of a previous constraint, we might avoid scanning
+ 	 * tables to validate this one.  See below.
+ 	 */
+ 	old_check_ok = (fkconstraint->old_conpfeqop != NIL);
+ 	Assert(!old_check_ok || numfks == list_length(fkconstraint->old_conpfeqop));
+
  	for (i = 0; i < numpks; i++)
  	{
  		Oid		pktype = pktypoid[i];
*** 5826,5831 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5837,5843 ----
  		Oid		ppeqop;
  		Oid		ffeqop;
  		int16		eqstrategy;
+ 		Oid		pfeqop_right;

  		/* We need several fields out of the pg_opclass entry */
  		cla_ht = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclasses[i]));
*** 5868,5877 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
  		pfeqop = get_opfamily_member(opfamily, opcintype, fktyped,
  					eqstrategy);
  		if (OidIsValid(pfeqop))
  			ffeqop =
Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE
I neglected to commit after revising the text of a few comments; use this version instead. Thanks.

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3b52415..9eba8e8 100644
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** 276,281 static Oid transformFkeyCheckAttrs(Relation pkrel,
--- 276,282 ----
  			int numattrs, int16 *attnums,
  			Oid *opclasses);
  static void checkFkeyPermissions(Relation rel, int16 *attnums, int natts);
+ static CoercionPathType findFkeyCast(Oid targetTypeId, Oid sourceTypeId, Oid *funcid);
  static void validateCheckConstraint(Relation rel, HeapTuple constrtup);
  static void validateForeignKeyConstraint(char *conname, Relation rel, Relation pkrel,
*** 357,362 static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMOD
--- 358,364 ----
  static void ATPostAlterTypeParse(Oid oldId, char *cmd, List **wqueue,
  			LOCKMODE lockmode, bool rewrite);
  static void TryReuseIndex(Oid oldId, IndexStmt *stmt);
+ static void TryReuseForeignKey(Oid oldId, Constraint *con);
  static void change_owner_fix_column_acls(Oid relationOid,
  			Oid oldOwnerId, Oid newOwnerId);
  static void change_owner_recurse_to_sequences(Oid relationOid,
*** 5696,5701 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5698,5705 ----
  			numpks;
  	Oid		indexOid;
  	Oid		constrOid;
+ 	bool		old_check_ok;
+ 	ListCell   *old_pfeqop_item = list_head(fkconstraint->old_conpfeqop);

  	/*
  	 * Grab an exclusive lock on the pk table, so that someone doesn't delete
*** 5812,5817 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5816,5828 ----
  			(errcode(ERRCODE_INVALID_FOREIGN_KEY),
  			 errmsg("number of referencing and referenced columns for foreign key disagree")));

+ 	/*
+ 	 * On the strength of a previous constraint, we might avoid scanning
+ 	 * tables to validate this one.  See below.
+ 	 */
+ 	old_check_ok = (fkconstraint->old_conpfeqop != NIL);
+ 	Assert(!old_check_ok || numfks == list_length(fkconstraint->old_conpfeqop));
+
  	for (i = 0; i < numpks; i++)
  	{
  		Oid		pktype = pktypoid[i];
*** 5826,5831 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5837,5843 ----
  		Oid		ppeqop;
  		Oid		ffeqop;
  		int16		eqstrategy;
+ 		Oid		pfeqop_right;

  		/* We need several fields out of the pg_opclass entry */
  		cla_ht = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclasses[i]));
*** 5868,5877 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
  		pfeqop = get_opfamily_member(opfamily, opcintype, fktyped,
  					eqstrategy);
  		if (OidIsValid(pfeqop))
  			ffeqop = get_opfamily_member(opfamily, fktyped, fktyped,
  						eqstrategy);
  		else
! 			ffeqop = InvalidOid;	/* keep compiler quiet */

  		if (!(OidIsValid(pfeqop) && OidIsValid(ffeqop)))
  		{
--- 5880,5896 ----
  		pfeqop = get_opfamily_member(opfamily, opcintype, fktyped,
  					eqstrategy);
  		if (OidIsValid(pfeqop))
+ 		{
+ 			pfeqop_right = fktyped;
  			ffeqop = get_opfamily_member(opfamily, fktyped, fktyped,
  						eqstrategy);
+ 		}
  		else
! 		{
! 			/* keep compiler quiet */
! 			pfeqop_right = InvalidOid;
! 			ffeqop = InvalidOid;
! 		}

  		if (!(OidIsValid(pfeqop) && OidIsValid(ffeqop)))
  		{
*** 5893,5899 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5912,5921 ----
  		target_typeids[1] = opcintype;
  		if (can_coerce_type(2, input_typeids, target_typeids,
  					COERCION_IMPLICIT))
+ 		{
Re: [HACKERS] Collect frequency statistics for arrays
Corrections:

On Thu, Dec 29, 2011 at 11:35:00AM -0500, Noah Misch wrote: On Wed, Nov 09, 2011 at 08:49:35PM +0400, Alexander Korotkov wrote:

+ *	We set s to be the estimated frequency of the K'th element in a natural
+ *	language's frequency table, where K is the target number of entries in
+ *	the MCELEM array.  We assume that the distribution of element frequencies
+ *	follows Zipf's law with an exponent of 1.
+ *
+ *	Assuming Zipfian distribution, the frequency of the K'th element is equal
+ *	to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
+ *	elements in the language.  Putting W as one million, we get roughly
+ *	0.07/K.  This gives s = 0.07/K.  We set epsilon = s/10, which gives bucket
+ *	width w = K/0.007 and maximum expected hashtable size of about 1000 * K.

These last two paragraphs, adapted from ts_typanalyze.c, assume natural language documents. To what extent do these parameter choices remain sensible for arbitrary data such as users may place in arrays? In any event, we need a different justification, even if it's just a hand-wavy justification.

If I'm following this correctly, this choice of s makes the algorithm guaranteed to find only elements constituting >= 7% of the input elements. Incidentally, isn't that far too high for natural language documents? If the English word "the" only constitutes 7% of typical documents, then this s value would permit us to discard practically every word; we'd be left with words read while filling the last bucket and words that happened to repeat exceedingly often in the column. I haven't tried to make a test case to observe this problem; am I missing something? (This question is largely orthogonal to your patch.)

No, we'll find elements of frequency at least 0.07/(default_statistics_target * 10) -- in the default configuration, 0.007%. Also, ts_typanalyze() counts the number of documents that contain one or more instances of each lexeme, ignoring the number of appearances within each document. The word "the" may constitute 7% of a typical document, but it will appear at least once in nearly 100% of documents. Therefore, this s value is adequate even for the pathological case of each document containing just one lexeme.

+ *
+ *	Note: in the above discussion, s, epsilon, and f/N are in terms of an
+ *	element's frequency as a fraction of all elements seen in the input.
+ *	However, what we actually want to store in the finished pg_statistic
+ *	entry is each element's frequency as a fraction of all rows that it occurs
+ *	in.  Elements might be repeated in the same array.  Since operators
+ *	<@, &&, @> take care only about element occurrence itself and not about
+ *	occurrence count, the function takes additional care about uniqueness of
+ *	counting.  Also we need to change the divisor from N to nonnull_cnt to get
+ *	the number we want.

On the same tangent, why does ts_typanalyze() not deduplicate the same way? The @@ operator has the same property. Answer: to_tsvector() will have already done so.

+ 	/*
+ 	 * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the
+ 	 * comment above.
+ 	 */
+ 	bucket_width = num_mcelem * 1000 / 7;

The addend mentioned is not present in the code or discussed in the comment above. (I see the comment is copied verbatim from ts_typanalyze(), where the addend *is* present, though again the preceding comment says nothing of it.) The addend rationale in fact does appear in the ts_typanalyze() comment.

Thanks,
nm
[HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
In 367bc426a1c22b9f6badb06cd41fc438fd034639, I introduced a CheckIndexCompatible() that approves btree and hash indexes having changed to a different operator class within the same operator family. To make that valid, I also tightened the operator family contracts for those access methods to address casts. However, I've found two retained formal hazards. They're boring and perhaps unlikely to harm real users, but I'd rather nail them down just in case.

First, opclasses like array_ops have polymorphic opcintype. Members of such operator classes could, in general, change behavior arbitrarily in response to get_fn_expr_argtype(). No core polymorphic operator family member does this. Nor could they, for no core index access method sets fn_expr. In the absence of responses to my previous inquiry[1] on the topic, I'm taking the position that the lack of fn_expr in these calls is an isolated implementation detail that could change at any time. Therefore, we should only preserve an index of a polymorphic operator class when the old and new opcintype match exactly. This patch's test suite addition illustrates one ALTER TABLE ALTER TYPE affected by this new restriction: a conversion from an array type to a domain over that array type will now require an index rebuild.

Second, as a thought experiment, consider a database with these three types:

1. int4, the core type.
2. int4neg, which stores to disk as the negation of its logical value. Includes a complete set of operators in the integer_ops btree family, but defines no casts to other integer_ops-represented types.
3. int4other, which stores to disk like int4. No operators. Has an implicit binary coercion cast to int4 and an explicit binary coercion cast to int4neg.

Suppose a table in this database has a column of type int4other with an index of the default operator class. By virtue of the implicit binary coercion and lack of other candidates, the index will use int4_ops. Now, change the type of that column from int4other to int4neg. The operator class changes to int4neg_ops, still within the integer_ops family, and we do not rebuild the index. However, the logical sign of each value just flipped, making the index invalid.

Where did CheckIndexCompatible() miss? An operator family only promises cast-compatibility when there exists an implicit or binary coercion cast between the types in question. There's no int4 -> int4neg cast here, not even a multiple-step pathway. CheckIndexCompatible() assumes that our ability to perform a no-rewrite ALTER TABLE ALTER TYPE implies the existence of such a cast. It does imply that for the before-and-after _table column types_, but it's the before-and-after _opcintype_ that matters here.

I think we could close this hazard by having CheckIndexCompatible() test for an actual implicit or binary coercion cast between the opcintype of the old and new operator classes. However, I'm not confident to say that no similar problem would remain. Therefore, I propose ceasing to optimize intra-family index operator transitions and simply requiring exact matches of operator classes and exclusion operators. This does not remove any actual optimization for changes among core types, and it will be simpler to validate. (I designed the current rules under the misapprehension that varchar_ops was the default operator class for varchar columns. However, varchar_ops is a copy of text_ops apart from the name and being non-default. We ship no operator with a varchar operand, relying instead on binary coercion to text.)

This patch does not, however, remove the new terms from the operator family contracts. While core PostgreSQL will no longer depend on them, they may again prove handy for future optimizations like this. I remain of the opinion that they're already widely (perhaps even universally) obeyed.

This patch conflicts trivially with my patch "Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE" in that they both add test cases to the same location in alter_table.sql.
They're otherwise independent, albeit reflecting parallel principles.

Thanks,
nm

[1] http://archives.postgresql.org/message-id/20111229211711.ga8...@tornado.leadboat.com

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 712b0b0..1bf1de5 100644
*** a/src/backend/commands/indexcmds.c
--- b/src/backend/commands/indexcmds.c
*** 23,28 ****
--- 23,29 ----
  #include "catalog/pg_opclass.h"
  #include "catalog/pg_opfamily.h"
  #include "catalog/pg_tablespace.h"
+ #include "catalog/pg_type.h"
  #include "commands/dbcommands.h"
  #include "commands/defrem.h"
  #include "commands/tablecmds.h"
*** 52,57 ****
--- 53,59 ----
  /* non-export function prototypes */
  static void CheckPredicate(Expr *predicate);
  static void ComputeIndexAttrs(IndexInfo *indexInfo,
+ 			  Oid *typeOidP,
  			  Oid *collationOidP,
  			  Oid *classOidP,
Re: [HACKERS] Disabled features on Hot Standby
On Fri, Jan 13, 2012 at 12:08:05PM -0500, Robert Haas wrote: On Fri, Jan 13, 2012 at 11:13 AM, Simon Riggs si...@2ndquadrant.com wrote: Many WAL records have latestRemovedXid on them. We can use the same idea with XLOG_HEAP2_VISIBLE records, so we add a field to send the latest vacrelstats->latestRemovedXid. That then creates a recovery snapshot conflict that would cancel any query that might then see a page of the vis map that was written when the xmin was later than on the standby. If replication disconnects briefly and a vismap bit is updated that would cause a problem, just as the same situation would cause a problem because of other record types.

That could create a lot of recovery conflicts when hot_standby_feedback=off, I think, but it might work when hot_standby_feedback=on. I don't fully understand the latestRemovedXid machinery, but I guess the idea would be to kill any standby transaction whose proc->xmin precedes the oldest committed xmin or xmax on the page. If hot_standby_feedback=on then there shouldn't be any, except in the case where it's just been enabled or the SR connection is bouncing.

FWIW, Tom aired the same idea here: http://archives.postgresql.org/message-id/27743.1291135...@sss.pgh.pa.us

While reviewing the crash-safe visibility map patch, I echoed the idea and explained why the extra conflicts would be immaterial: http://archives.postgresql.org/message-id/20110618133703.ga1...@tornado.leadboat.com

Also, what happens if an all-visible bit gets set on the standby through some other mechanism - e.g. restored from an FPI or XLOG_HEAP_NEWPAGE? I'm not sure whether we ever do an FPI of the visibility map page itself, but we certainly do it for the heap pages. So it might be that this infrastructure would (somewhat bizarrely) trust the visibility map bits but not the PD_ALL_VISIBLE bits.

Simon spoke to the FPI side of the question. For heap pages, the XLOG_HEAP_NEWPAGE consumers are CLUSTER, VACUUM FULL and ALTER TABLE SET TABLESPACE.
For the last, we will have already logged any PD_ALL_VISIBLE bits through normal channels. CLUSTER and VACUUM FULL never set PD_ALL_VISIBLE or visibility map bits. When, someday, they do, we might emit a separate WAL record to force the recovery conflict. However, CLUSTER/VACUUM FULL already remove tuples still-visible to standby snapshots without provoking a recovery conflict. (Again only with hot_standby_feedback=off.)

Thanks,
nm
Re: [HACKERS] Disabled features on Hot Standby
On Sat, Jan 14, 2012 at 12:42:02AM -0500, Robert Haas wrote: On Fri, Jan 13, 2012 at 8:02 PM, Noah Misch n...@leadboat.com wrote: Simon spoke to the FPI side of the question. For heap pages, the XLOG_HEAP_NEWPAGE consumers are CLUSTER, VACUUM FULL and ALTER TABLE SET TABLESPACE. For the last, we will have already logged any PD_ALL_VISIBLE bits through normal channels. CLUSTER and VACUUM FULL never set PD_ALL_VISIBLE or visibility map bits. When, someday, they do, we might emit a separate WAL record to force the recovery conflict. However, CLUSTER/VACUUM FULL already remove tuples still-visible to standby snapshots without provoking a recovery conflict. (Again only with hot_standby_feedback=off.)

Is the bit about CLUSTER/VACUUM FULL a preexisting bug? If not, why not?

I suspect it goes back to 9.0, yes. I'm on the fence regarding whether to call it a bug or an unimplemented feature. In any case, +1 for improving it.

Other than that, it seems like we might be converging on a workable solution: if hot_standby_feedback=off, disable index-only scans for snapshots taken during recovery; if hot_standby_feedback=on, generate recovery conflicts when a snapshot's xmin precedes the youngest xmin on a page marked all-visible, but allow the use of index-only scans, and allow sequential scans to trust PD_ALL_VISIBLE. Off the top of my head, I don't see a hole in that logic...

I wouldn't check hot_standby_feedback. Rather, mirror what we do for XLOG_HEAP2_CLEAN. Unconditionally add an xid to xl_heap_visible bearing the youngest xmin on the page (alternately, some convenient upper bound thereof). Have heap_xlog_visible() call ResolveRecoveryConflictWithSnapshot() on that xid. Now, unconditionally trust PD_ALL_VISIBLE and permit index-only scans. The user's settings of hot_standby_feedback, vacuum_defer_cleanup_age, max_standby_streaming_delay and max_standby_archive_delay will drive the consequential trade-off: nothing, query cancellation, or recovery delay.
nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Disabled features on Hot Standby
On Sat, Jan 14, 2012 at 08:08:29AM +, Simon Riggs wrote: On Sat, Jan 14, 2012 at 1:02 AM, Noah Misch n...@leadboat.com wrote: However, CLUSTER/VACUUM FULL already remove tuples still-visible to standby snapshots without provoking a recovery conflict. (Again only with hot_standby_feedback=off.) If that were the case it would be a bug. CLUSTER/VACUUM FULL emit an AccessExclusiveLock record that would conflict with any current lock holders, so should be fine on that. I speak of this sequence (M = master connection, S = standby connection):
M: CREATE TABLE t AS SELECT * FROM generate_series(1,1000) t(n);
S: BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 0;
M: DELETE FROM t WHERE n <= 10;
M: VACUUM FULL t;
S: SELECT count(*) FROM t; -- 990, should be 1000
Re: [HACKERS] Measuring relation free space
On Sat, Jan 14, 2012 at 04:41:57AM -0500, Jaime Casanova wrote: pgstattuple and relation_free_space are very close in all the numbers except for 2 indexes pgbench_branches_pkey and pgbench_tellers_pkey; after a VACUUM FULL and a REINDEX (and the difference persists) i checked pgbench_tellers_pkey contents (it has only one page besides the metapage) and the numbers that i looked at were: page size: 8192, free size: 4148, which in good romance means 50% of free space... so, answering Noah's question: if that difference has some meaning i can't see it but looking at the evidence the measure relation_free_space() is giving is the good one so, tomorrow (or ...looking at the clock... later today) i will update the relation_free_space() patch to accept toast tables and other kinds of indexes and add it to the commitfest unless someone says that my math is wrong and somehow there is a more accurate way of getting the free space (which is entirely possible) Did you see this followup[1]? To summarize:
- pgstattuple() and relation_free_space() should emit the same number, even if that means improving pgstattuple() at the same time.
- relation_free_space() belongs in the pgstattuple extension, because its role is cheaper access to a single pgstattuple() metric.
Thanks, nm [1] http://archives.postgresql.org/message-id/20111218165625.gb6...@tornado.leadboat.com
[HACKERS] psql filename completion: quoting
Occasionally, I have a SQL file destined for psql's \i command whose name contains a space. Less often, I'll have a .csv destined for \copy with the same problem. psql's filename completion does not handle these well. It completes on the literal name, but the commands will only recognize quoted names. For example, given a file "foo bar", \i f<TAB> will complete to \i foo bar, which will not execute. If I type \i 'f<TAB>, completion will not help at all. The attached patch wraps rl_filename_completion_function() to dequote on input and enquote on output. Now, \i f<TAB> and \i 'f<TAB> will both complete to \i 'foo bar', which executes as expected. The new code handles embedded whitespace, quotes, and backslashes. tab-complete.c works in terms of whitespace-separated words. As such, \i 'foo b<TAB> does not complete, because tab-complete.c has no notion of quotes affecting token boundaries. It thinks 'foo is one token and b is another. I'm sure we could fix this (Bash gets it right). It seemed rather independent code-wise, so I did not attempt that for this patch. Thanks, nm diff --git a/src/bin/psql/stringutils.c b/src/bin/psql/stringutils.c index 3b5ce1b..77387dc 100644 *** a/src/bin/psql/stringutils.c --- b/src/bin/psql/stringutils.c *** *** 272,274 strip_quotes(char *source, char quote, char escape, int encoding) --- 272,343 *dst = '\0'; } + + + /* + * quote_if_needed + * + * Opposite of strip_quotes(). If "source" denotes itself literally without + * quoting or escaping, returns NULL. Otherwise, returns a malloc'd copy with + * quoting and escaping applied: + * + * source - string to parse + * entails_quote - any of these present? need outer quotes + * quote - doubled within string, affixed to both ends + * escape - doubled within string + * encoding - the active character-set encoding + * + * Do not use this as a substitute for PQescapeStringConn(). Use it for + * strings to be parsed by strtokx() or psql_scan_slash_option().
+ */ + char * + quote_if_needed(const char *source, const char *entails_quote, + char quote, char escape, int encoding) + { + const char *src; + char *ret; + char *dst; + bool need_quotes = false; + + psql_assert(source); + psql_assert(quote); + + src = source; + dst = ret = pg_malloc(2 * strlen(src) + 3); /* excess */ + + *dst++ = quote; + + while (*src) + { + char c = *src; + int i; + + if (c == quote) + { + need_quotes = true; + *dst++ = quote; + } + else if (c == escape) + { + need_quotes = true; + *dst++ = escape; + } + else if (strchr(entails_quote, c)) + need_quotes = true; + + i = PQmblen(src, encoding); + while (i--) + *dst++ = *src++; + } + + *dst++ = quote; + *dst = '\0'; + + if (!need_quotes) + { + free(ret); + ret = NULL; + } + + return ret; + } diff --git a/src/bin/psql/stringutils.h b/src/bin/psql/stringutils.h index c7c5f38..c64fc58 100644 *** a/src/bin/psql/stringutils.h --- b/src/bin/psql/stringutils.h *** *** 19,22 extern char *strtokx(const char *s, --- 19,25 bool del_quotes, int encoding); + extern char *quote_if_needed(const char *source, const char *entails_quote, + char quote, char escape, int encoding); + #endif /* STRINGUTILS_H */ diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index a27ef69..d226106 100644 *** a/src/bin/psql/tab-complete.c --- b/src/bin/psql/tab-complete.c *** *** 670,675 static char *complete_from_list(const char *text, int state); --- 670,676 static char *complete_from_const(const char *text, int state); static char **complete_from_variables(char *text, const char *prefix, const char *suffix); + static char *complete_from_files(const char *text, int state); static PGresult *exec_query(const char *query); *** *** 1619,1625 psql_completion(char *text, int start, int end) pg_strcasecmp(prev3_wd, "BINARY") == 0) && (pg_strcasecmp(prev_wd, "FROM") == 0 || pg_strcasecmp(prev_wd, "TO") == 0)) !
matches = completion_matches(text, filename_completion_function); /* Handle COPY|BINARY <sth> FROM|TO filename */ else if ((pg_strcasecmp(prev4_wd, "COPY") == 0 || --- 1620,1629 pg_strcasecmp(prev3_wd, "BINARY") == 0)
[HACKERS] psql COPY vs. ON_ERROR_ROLLBACK, multi-command strings
It has bothered me that psql's \copy ignores the ON_ERROR_ROLLBACK setting. Only SendQuery() takes note of ON_ERROR_ROLLBACK, and \copy, like all backslash commands, does not route through SendQuery(). Looking into this turned up several other weaknesses in psql's handling of COPY. For example, SendQuery() does not release the savepoint after an ordinary COPY: [local] test=# set log_statement = 'all'; set client_min_messages = 'log'; copy (select 0) to stdout; SET SET LOG: statement: RELEASE pg_psql_temporary_savepoint LOG: statement: SAVEPOINT pg_psql_temporary_savepoint LOG: statement: copy (select 0) to stdout; 0 When psql sends a command string containing more than one command, at least one of which is a COPY, we stop processing results at the first COPY: [local] test=# set log_statement = 'all'; set client_min_messages = 'log'; copy (select 0) to stdout\; select 1/0; select 1; SET SET LOG: statement: RELEASE pg_psql_temporary_savepoint LOG: statement: SAVEPOINT pg_psql_temporary_savepoint LOG: statement: copy (select 0) to stdout; select 1/0; 0 LOG: statement: select 1; ERROR: current transaction is aborted, commands ignored until end of transaction block To make the above work normally, this patch improves SendQuery()-based COPY command handling to process the entire queue of results whenever we've processed a COPY. It also brings sensible handling in the face of multiple COPY commands in a single command string. See the included test cases for some examples. We must prepare for COPY to fail for a local reason, like client-side malloc() failure or network I/O problems. The server will still have the connection in a COPY mode, and we must get it out of that mode. The \copy command was prepared for the COPY FROM case with this code block: /* * Make sure we have pumped libpq dry of results; else it may still be in * ASYNC_BUSY state, leading to false readings in, eg, get_prompt(). 
*/ while ((result = PQgetResult(pset.db)) != NULL) { success = false; psql_error("\\copy: unexpected response (%d)\n", PQresultStatus(result)); /* if still in COPY IN state, try to get out of it */ if (PQresultStatus(result) == PGRES_COPY_IN) PQputCopyEnd(pset.db, _("trying to exit copy mode")); PQclear(result); } It arose from these threads: http://archives.postgresql.org/pgsql-hackers/2006-11/msg00694.php http://archives.postgresql.org/pgsql-general/2009-08/msg00268.php However, psql still enters an infinite loop when COPY TO STDOUT encounters a client-side failure, such as malloc() failure. I've merged the above treatment into lower-level routines, granting plain COPY commands similar benefits, and fixed the COPY TO handling. To help you test the corner cases involved here, I've attached a gdb script that will inject client-side failures into both kinds of COPY commands. Attach gdb to your psql process, source the script, and compare the behavior of commands like these with and without the patch:
\copy (select 0) to pstdout
create table t (c int);
\copy t from stdin
1
\.
With plain COPY handled thoroughly, I fixed \copy by having it override the source or destination stream, then call SendQuery(). This gets us support for ON_ERROR_ROLLBACK, \timing, \set ECHO and \set SINGLESTEP for free. I found it reasonable to treat \copy's SQL command more like a user-entered command than an internal command, because \copy is such a thin wrapper around COPY FROM STDIN/COPY TO STDOUT. This patch makes superfluous the second argument of PSQLexec(), but I have not removed that argument. Incidentally, psql's common.c had several switches on PQresultStatus(res) that did not include a case for PGRES_COPY_BOTH and also silently assumed any unlisted status was some kind of error. I tightened these to distinguish all known statuses and give a diagnostic upon receiving an unknown status.
Thanks, nm diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index 69fac83..c7d024a 100644 *** a/src/bin/psql/command.c --- b/src/bin/psql/command.c *** *** 320,344 exec_command(const char *cmd, /* \copy */ else if (pg_strcasecmp(cmd, "copy") == 0) { - /* Default fetch-it-all-and-print mode */ - instr_time before, - after; - char *opt = psql_scan_slash_option(scan_state, OT_WHOLE_LINE, NULL, false); - if (pset.timing) - INSTR_TIME_SET_CURRENT(before); - success = do_copy(opt); - - if (pset.timing && success) - { - INSTR_TIME_SET_CURRENT(after); -
Re: [HACKERS] Measuring relation free space
On Sat, Jan 14, 2012 at 02:40:46PM -0500, Jaime Casanova wrote: On Sat, Jan 14, 2012 at 6:26 AM, Noah Misch n...@leadboat.com wrote: - pgstattuple() and relation_free_space() should emit the same number, even if that means improving pgstattuple() at the same time. yes, i just wanted to understand which one was more accurate and why... and give the opportunity for anyone to point my error if any pgstattuple()'s decision to treat half-dead pages like deleted pages is better. That transient state can only end in the page's deletion. I don't know about counting non-leaf pages, but I personally wouldn't revisit pgstattuple()'s decision there. In the indexes I've briefly surveyed, the ratio of leaf pages to non-leaf pages is 100:1 or better. No amount of bloat in that 1% will matter. Feel free to make the argument if you think otherwise, though; I've only taken a brief look at the topic. - relation_free_space() belongs in the pgstattuple extension, because its role is cheaper access to a single pgstattuple() metric. oh! right! so, what about just fixing the free_percent that pgstattuple is providing If pgstattuple() meets your needs, you might indeed no longer need any patch.
Re: [HACKERS] foreign key locks, 2nd attempt
On Sun, Jan 15, 2012 at 01:49:54AM -0300, Alvaro Herrera wrote: - I'm not sure that the multixact truncation code is sane on checkpoints. It might be that I need to tweak more the pg_control info we keep about truncation. The whole truncation thing needs more testing, too. My largest outstanding concern involves the possibility of MultiXactId wraparound. From my last review: This raises a notable formal hazard: it's possible to burn through the MultiXactId space faster than the regular TransactionId space. We could get into a situation where pg_clog is covering 2B xids, and yet we need 4B MultiXactId to cover that period. We had better at least notice this and halt, if not have autovacuum actively prevent it. (That should have been 2B rather than 4B, since MultiXactId uses the same 2B-in-past, 2B-in-future behavior as regular xids.) Are we willing to guess that this will never happen and make recovery minimally possible? If so, we could have GetNewMultiXactId() grow defenses similar to GetNewTransactionId() and leave it at that. If not, we need to involve autovacuum. The other remaining high-level thing is to have key_attrs contain only columns actually referenced by FKs. - Go over Noah's two reviews again and see if I missed anything; also make sure I haven't missed anything from other reviewers. There are some, yes. *** a/src/backend/access/heap/heapam.c --- b/src/backend/access/heap/heapam.c *** *** 2773,2783 l2: } else if (result == HeapTupleBeingUpdated && wait) { ! TransactionId xwait; uint16 infomask; /* must copy state data before unlocking buffer */ ! xwait = HeapTupleHeaderGetXmax(oldtup.t_data); infomask = oldtup.t_data->t_infomask; LockBuffer(buffer, BUFFER_LOCK_UNLOCK); --- 2848,2871 } else if (result == HeapTupleBeingUpdated && wait) { ! TransactionId xwait; uint16 infomask; + bool none_remain = false; Nothing can ever set this variable to anything different. It seems that keep_xact == InvalidTransactionId substitutes well enough, though.
/* * We may overwrite if previous xmax aborted, or if it committed but ! * only locked the tuple without updating it, or if we are going to ! * keep it around in Xmax. */ The second possibility is just a subset of the third. ! if (TransactionIdIsValid(keep_xmax) || ! none_remain || ! (oldtup.t_data->t_infomask & HEAP_XMAX_INVALID)) result = HeapTupleMayBeUpdated; else result = HeapTupleUpdated; *** *** 3314,3323 heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum, */ static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs, ! HeapTuple oldtup, HeapTuple newtup) { int attrnum; while ((attrnum = bms_first_member(hot_attrs)) >= 0) { /* Adjust for system attributes */ --- 3537,3549 */ static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs, ! HeapTuple oldtup, HeapTuple newtup, bool empty_okay) { int attrnum; + if (!empty_okay && bms_is_empty(hot_attrs)) + return false; When a table contains no key attributes, it seems arbitrary whether we call the key revoked or not. What is the motivation for this change? ! /* ! * If we're requesting KeyShare, and there's no update present, we ! * don't need to wait. Even if there is an update, we can still ! * continue if the key hasn't been modified. ! * ! * However, if there are updates, we need to walk the update chain ! * to mark future versions of the row as locked, too. That way, if ! * somebody deletes that future version, we're protected against the ! * key going away. This locking of future versions could block ! * momentarily, if a concurrent transaction is deleting a key; or it ! * could return a value to the effect that the transaction deleting the ! * key has already committed. So we do this before re-locking the ! * buffer; otherwise this would be prone to deadlocks. Note that the TID ! * we're locking was grabbed before we unlocked the buffer. For it to ! * change while we're not looking, the other properties we're testing !
* for below after re-locking the buffer would also change, in which !
Re: [HACKERS] foreign key locks, 2nd attempt
On Mon, Jan 16, 2012 at 04:52:36PM -0300, Alvaro Herrera wrote: Excerpts from Heikki Linnakangas's message of lun ene 16 16:17:42 -0300 2012: On 15.01.2012 06:49, Alvaro Herrera wrote: --- 164,178 #define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */ #define HEAP_HASEXTERNAL 0x0004 /* has external stored attribute(s) */ #define HEAP_HASOID 0x0008 /* has an object-id field */ ! #define HEAP_XMAX_KEYSHR_LOCK 0x0010 /* xmax is a key-shared locker */ #define HEAP_COMBOCID 0x0020 /* t_cid is a combo cid */ #define HEAP_XMAX_EXCL_LOCK 0x0040 /* xmax is exclusive locker */ ! #define HEAP_XMAX_IS_NOT_UPDATE 0x0080 /* xmax, if valid, is only a locker. ! * Note this is not set unless ! * XMAX_IS_MULTI is also set. */ ! ! #define HEAP_LOCK_BITS (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_IS_NOT_UPDATE | \ ! HEAP_XMAX_KEYSHR_LOCK) #define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */ #define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */ #define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */ HEAP_XMAX_IS_NOT_UPDATE is a pretty opaque name for that. From the name, I'd say that a DELETE would set that, but the comment says it has to do with locking. After going through all the combinations in my mind, I think I finally understood it: HEAP_XMAX_IS_NOT_UPDATE is set if xmax is a multi-xact that represents only locking xids, no updates. How about calling it HEAP_XMAX_LOCK_ONLY, and setting it whenever xmax is only a locker, whether or not xmax is a multi-xid? Hm, sounds like a good idea. Will do. +1 Why are you renaming HeapTupleHeaderGetXmax() into HeapTupleHeaderGetRawXmax()? Any current callers of HeapTupleHeaderGetXmax() should already check that HEAP_XMAX_IS_MULTI is not set. I had this vague impression that it'd be better to break existing callers so that they ensure they decide between HeapTupleHeaderGetRawXmax and HeapTupleHeaderGetUpdateXid. Noah suggested changing the macro name, too. It's up to each caller to decide what's the semantics they want.
Most want the latter; and callers outside core are more likely to want that one. If we kept the old name, they would get the wrong value. My previous suggestion was to have both macros: #define HeapTupleHeaderGetXmax(tup) \ ( \ AssertMacro(!((tup)->t_infomask & HEAP_XMAX_IS_MULTI)), \ HeapTupleHeaderGetRawXmax(tup) \ ) #define HeapTupleHeaderGetRawXmax(tup) \ ( \ (tup)->t_choice.t_heap.t_xmax \ ) No strong preference, though.
Re: [HACKERS] Collect frequency statistics for arrays
On Tue, Jan 17, 2012 at 12:04:06PM +0400, Alexander Korotkov wrote: Thanks for your fixes to the patch. They look correct to me. I did some fixes in the patch. The proof of some concepts is still needed. I'm going to provide it in a few days. Your further fixes look good. Could you also answer my question about the header comment of mcelem_array_contained_selec()? /* * Estimate selectivity of "column <@ const" based on most common element * statistics. Independent element occurrence would imply a particular * distribution of distinct element counts among matching rows. Real data * usually falsifies that assumption. For example, in a set of 1-element * integer arrays having elements in the range [0;10], element occurrences are * not independent. If they were, a sufficiently-large set would include all * distinct element counts 0 through 11. We correct for this using the * histogram of distinct element counts. * * In the "column @> const" and "column && const" cases, we usually have * const with low summary frequency of elements (otherwise we have * selectivity close to 0 or 1 correspondingly). That's why the effect of * dependence related to distinct element counts distribution is negligible * there. In the "column <@ const" case, summary frequency of elements is * high (otherwise we have selectivity close to 0). That's why we should do * correction due to array distinct element counts distribution. */ By "summary frequency of elements", do you mean literally P_0 + P_1 ... + P_N? If so, I can follow the above argument for "column && const" and "column <@ const", but not for "column @> const". For "column @> const", selectivity cannot exceed the smallest frequency among const elements. A number of high-frequency elements will drive up the sum of the frequencies without changing the true selectivity much at all. Thanks, nm
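To make the objection concrete, here is a toy calculation with made-up element frequencies (the names and numbers are illustrative only, not from the statistics code) contrasting the summed frequency with the hard cap that containment imposes:

```python
# Hypothetical most-common-element frequencies.  For "column @> const",
# a matching row must contain *every* element of const, so the true
# selectivity can never exceed the frequency of the rarest const element.
freqs = {'common_a': 0.9, 'common_b': 0.8, 'rare': 0.001}

# The "summary frequency" discussed above: P_0 + P_1 + ... + P_N.
summary_frequency = sum(freqs.values())

# Hard cap on "column @> const" selectivity: the smallest frequency.
containment_cap = min(freqs.values())

# Even a naive independence-based product respects that cap.
independent_estimate = 1.0
for p in freqs.values():
    independent_estimate *= p

print(round(summary_frequency, 3))  # 1.701 -- "high", yet uninformative here
print(containment_cap)              # 0.001 -- the real ceiling
```

Adding more common elements to const would push the sum still higher while leaving the 0.001 ceiling untouched, which is the point of the last sentence above.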
Re: [HACKERS] psql \timing vs failed statements
On Tue, Jan 17, 2012 at 04:01:23PM +0100, Magnus Hagander wrote: Thus - if I were to change psql to output timing on failed queries as well, will anybody object? ;) +1
Re: [HACKERS] Measuring relation free space
On Wed, Jan 18, 2012 at 09:46:20AM -0500, Jaime Casanova wrote: On Mon, Jan 16, 2012 at 5:09 AM, Noah Misch n...@leadboat.com wrote: pgstattuple()'s decision to treat half-dead pages like deleted pages is better. That transient state can only end in the page's deletion. the only page in that index has 200 records (all live, 0 dead) using half the page size (which is a leaf page and is not half dead, btw). so, how do you justify that pgstattuple says we have just 25% of free space?
postgres=# SELECT * from bt_page_stats('pgbench_tellers_pkey', 1);
-[ RECORD 1 ]-+-----
blkno         | 1
type          | l
live_items    | 200
dead_items    | 0
avg_item_size | 16
page_size     | 8192
free_size     | 4148
btpo_prev     | 0
btpo_next     | 0
btpo          | 0
btpo_flags    | 3
I don't know about counting non-leaf pages ignoring all non-leaf pages still gives a considerable difference between pgstattuple and relation_free_space() pgstattuple() counts the single B-tree meta page as always-full, while relation_free_space() skips it for all purposes. For tiny indexes, that can shift the percentage dramatically.
Re: [HACKERS] foreign key locks, 2nd attempt
On Wed, Jan 18, 2012 at 05:18:31PM -0300, Alvaro Herrera wrote: Excerpts from Heikki Linnakangas's message of mar ene 17 03:21:28 -0300 2012: On 16.01.2012 21:52, Alvaro Herrera wrote: I was initially thinking that pg_multixact should return the empty set if asked for the members of a multi that preceded the freeze point. That way, I thought, we would just never try to access a page originated in the older version (assuming the freeze point is set to current whenever pg_upgrade runs). However, as things currently stand, accessing an old multi raises an error. So maybe we need a scheme a bit more complex to handle this. Hmm, could we create new multixact files filled with zeros, covering the range that was valid in the old cluster? Hm, we could do something like that I guess. I'm not sure that just zeroes is the right pattern, but it should be something simple. PostgreSQL 9.1 can have all ~4B MultiXactIds on disk at any given time. We could silently ignore the lookup miss when HEAP_XMAX_LOCK_ONLY is also set. That makes existing data files acceptable while still catching data loss scenarios going forward. (It's tempting to be stricter when we know the cluster data files originated in PostgreSQL 9.2+, but I'm not sure whether that's worth its weight.)
[HACKERS] Re: Add minor version to v3 protocol to allow changes without breaking backwards compatibility
Hi Mikko, First, for everyone else: this patch provides a more-compact binary output format for arrays. When the array contains no NULL elements and has a fixed-length element type, the new format saves four bytes per array element. We could do more. We could add support for arrays containing NULLs by way of a nulls bitmap. We could reduce the per-array overhead, currently 20 bytes for a 1-D array, in addition to the per-element overhead. Does anyone care about these broader cases? If so, speak now. On Thu, Dec 01, 2011 at 04:42:43PM +0200, Mikko Tiihonen wrote: On 11/30/2011 06:52 PM, Merlin Moncure wrote: On Mon, Nov 28, 2011 at 9:18 AM, Mikko Tiihonen mikko.tiiho...@nitorcreations.com wrote: Hi, As discussed a few days ago, it would be beneficial if we could change the v3 backend-client protocol without breaking backwards compatibility. Here is a working patch that exports a supported_binary_minor constant and a binary_minor session variable that can be used by clients to enable newer features. I also added an example usage where the array encoding for constant-size elements is optimized if the binary_minor version is new enough. I have coded the client-side support of binary_minor for the jdbc driver and tested that it worked. But let's first discuss if this is an acceptable path forward. I think all references to 'protocol' should be removed; binary wire formats != protocol: the protocol could bump to v4 but the wire formats (by happenstance) could all still continue to work -- therefore the version isn't minor (it's not dependent on protocol version but only on itself). Therefore, I don't much like the name 'supported_binary_minor'. How about binary_format_version? I was thinking that it would be possible to use the new minor version to introduce also new protocol messages such as streaming of large data into the database without knowing its size beforehand.
I agree with Merlin; the frontend/backend protocol is logically distinct from the binary send/recv formats of data types. For one key point, the latter is not exclusively core-defined; third-party extensions change their send/recv formats on a different schedule. They can add myext.binary_format_version GUCs of their own to cope in a similar way. Client interfaces that do not interpret individual result column data, such as libpq, require no update for send/recv format changes. In contrast, all client interfaces would need changes for a new protocol message. Most clients doing so may as well then advertise their support unconditionally. In contrast, send/recv format is something individual _users_ of the same client library may set differently. It's reasonable to have an application that still reads its data in send/recv format v0 even after upgrading to a version of libpq that talks frontend/backend protocol v4. I do think this is a great way to handle send/recv format changes. There needs to be a decision on official policy about breaking backwards compatibility of postgresql clients. Is it:
A) Every x.y postgres release is free to break any parts of the
   * protocol
   * text encoding
   * binary protocol
as long as it is documented - all client libraries should be coded so that they refuse to support version x.y+1 or newer (they might have an option to override this but there are no guarantees that it would work)
Good: no random bugs when using old client library
Bad: initial complaints from users before they understand that this is the best option for everyone
The ability to use old client libraries with new servers is more valuable than the combined benefits of all currently-contemplated protocol improvements. D) My proposed compromise where there is one minor_version setting and code in the server to support all different minor versions. The client requests the minor version, so all releases default to the backwards-compatible version. This is workable.
When the combinations start to create too much maintenance overhead, a new clean v4 version of the protocol is specified.
Good: Keeps full backwards compatibility
Good: Allows introducing changes at any time
Bad: Accumulates conditional code in server and clients until a new version of the protocol is released
We would retain support for protocol V3 for years following the first release to support protocol V4, so think of the conditional code as lasting forever. The patch works as advertised. After "set binary_minor = 1", the output of this command shrinks from 102 MiB to 63 MiB: \copy (select array[0,1,2,3,4,5,6,7,8,9] from generate_series(1,100)) to mynums with binary --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -477,6 +477,7 @@ static int segment_size; static int wal_block_size; static int wal_segment_size; static bool integer_datetimes; +static int supported_binary_minor; static int
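The reported shrink is consistent with back-of-envelope arithmetic on the two formats. The sketch below assumes the \copy ran over roughly one million rows (the generate_series bounds appear garbled in the archived message) and uses the per-array and per-element overheads quoted earlier:

```python
# Size of one 10-element int4 array under each wire format.
NELEMS = 10

# Per-array overhead cited above for a 1-D array: ndim, has-nulls flag,
# and element type OID (4 bytes each), plus one (dimension, lower bound)
# pair (8 bytes) = 20 bytes.
ARRAY_HEADER = 20

# v0 format: every element carries a 4-byte length word plus 4 data bytes.
old_array = ARRAY_HEADER + NELEMS * (4 + 4)

# Patched format: fixed-length elements drop the per-element length word,
# saving the advertised 4 bytes per element.
new_array = ARRAY_HEADER + NELEMS * 4

# COPY BINARY adds a 2-byte field count and a 4-byte field length per row.
ROW_OVERHEAD = 2 + 4

rows = 1_000_000  # assumed row count, see lead-in
old_mib = rows * (old_array + ROW_OVERHEAD) / 2**20
new_mib = rows * (new_array + ROW_OVERHEAD) / 2**20
print(round(old_mib), round(new_mib))  # 101 63 -- close to the observed 102/63 MiB
```

The small gap on the "before" side would come from the COPY file header and any rounding in the reported size; the per-row savings of 40 bytes is exactly 4 bytes times 10 elements.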
Re: [HACKERS] Re: Add minor version to v3 protocol to allow changes without breaking backwards compatibility
On Thu, Jan 19, 2012 at 02:00:20PM -0500, Robert Haas wrote: On Thu, Jan 19, 2012 at 10:37 AM, Noah Misch n...@leadboat.com wrote: I agree with Merlin; the frontend/backend protocol is logically distinct from the binary send/recv formats of data types. For one key point, the latter is not exclusively core-defined; third-party extensions change their send/recv formats on a different schedule. They can add myext.binary_format_version GUCs of their own to cope in a similar way. I agree. It occurs to me that we recently changed the default *text* output format for bytea for reasons not dissimilar to those contemplated here. Presumably, that's a much more disruptive change, and yet we've had minimal complaints because anyone who gets bitten can easily set bytea_output='escape' and the problem goes away. The same thing seems like it would work here, only the number of people needing to change the parameter will probably be even smaller, because fewer people use binary than text. Having said that, if we're to follow the precedent set by bytea_output, maybe we ought to just add binary_array_format={huge,ittybitty} and be done with it, rather than inventing a minor protocol version GUC for something that isn't really a protocol version change at all. We could introduce a differently-named general mechanism, but I guess I'm not seeing the point of that either. Just because someone has a backward-compatibility issue with one change of this type doesn't mean they have a similar issue with all of them. So I think adding a special-purpose GUC is more logical and more parallel to what we've done in the past, and it doesn't saddle us with having to be certain that we've designed the mechanism generally enough to handle all the cases that may come later. That makes sense. An attraction of a single binary format version was avoiding the "Is this worth a GUC?" conversation for each change. However, adding a GUC should be no more notable than bumping a binary format version.
Re: [HACKERS] Measuring relation free space
On Fri, Jan 20, 2012 at 07:03:22PM -0500, Jaime Casanova wrote: On Wed, Jan 18, 2012 at 7:01 PM, Noah Misch n...@leadboat.com wrote: On Wed, Jan 18, 2012 at 09:46:20AM -0500, Jaime Casanova wrote: ignoring all non-leaf pages still gives a considerable difference between pgstattuple and relation_free_space() pgstattuple() counts the single B-tree meta page as always-full, while relation_free_space() skips it for all purposes. For tiny indexes, that can shift the percentage dramatically. ok, i will reformulate the question. why is it fine to ignore non-leaf pages but not fine to ignore the meta page? pgstattuple() figures the free_percent by adding up all space available to hold tuples and dividing that by the simple size of the relation. Non-leaf pages and the meta page get identical treatment: both never hold tuples, so they do not contribute to the free space.
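A quick sketch of both accounting rules, using the pgbench_tellers_pkey numbers quoted earlier in the thread (one 8192-byte leaf page with 4148 bytes free, plus the metapage), shows where the competing 25% and 50% figures come from:

```python
PAGE = 8192        # page_size reported by bt_page_stats
LEAF_FREE = 4148   # free_size on the single leaf page

# pgstattuple(): free space summed over tuple-holding pages, divided by
# the whole relation size.  The metapage holds no tuples, so it adds
# nothing to the numerator but still counts in the denominator.
pgstattuple_pct = 100.0 * LEAF_FREE / (2 * PAGE)

# relation_free_space() (as described in this thread): the metapage is
# skipped for all purposes, so only the leaf page appears at all.
relation_free_space_pct = 100.0 * LEAF_FREE / PAGE

print(round(pgstattuple_pct, 1), round(relation_free_space_pct, 1))  # 25.3 50.6
```

On a two-page index the denominators differ by a factor of two, which is exactly the "25% vs 50%" discrepancy; on a large index the metapage would be lost in the noise.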
Re: [HACKERS] [PATCH] Support for foreign keys with arrays
On Sat, Jan 14, 2012 at 08:18:48PM +0100, Marco Nenciarini wrote: This is our latest version of the patch. Gabriele, Gianni and I have discussed a lot and decided to send an initial patch which uses EACH REFERENCES instead of ARRAY REFERENCES. The reason behind this is that ARRAY REFERENCES generates an ambiguous grammar, and we all agreed that EACH REFERENCES makes sense (and at the same time does not introduce any new keyword). This is however open for discussion. I greatly like that name; it would still make sense for other aggregate types, should we ever expand its use. Please complete the name change: the documentation, catalog entries, etc should all call them something like "each foreign key constraints" (I don't particularly like that exact wording). You currently forbid multi-column EACH FKs. I agree that we should allow only one array column per FK; with more, the set of required PK rows would be something like the Cartesian product of the elements of the array columns. However, there are no definitional problems, at least for NO ACTION, around a FK constraint having one array column and N scalar columns. Whether or not you implement that now, let's choose a table_constraint syntax leaving that opportunity open. How about: FOREIGN KEY(col_a, EACH col_b, col_c) REFERENCES pktable (a, b, c) You've identified that we cannot generally implement the ON DELETE ARRAY CASCADE action on multidimensional arrays. This patch chooses to downgrade to ON DELETE ARRAY SET NULL in that case. My initial reaction is that it would be better to forbid multidimensional arrays in the column when the delete action is ON DELETE ARRAY SET NULL. That's not satisfying, either, because it makes the definition of conforming data depend on the ON DELETE action. Do we have other options?
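To make the multidimensional hazard concrete, here is a sketch using the patch's proposed (not committed) syntax; table names are illustrative:

```sql
CREATE TABLE pktable (a int PRIMARY KEY);
CREATE TABLE fktable (
    a int[] EACH REFERENCES pktable ON DELETE ARRAY CASCADE  -- proposed syntax
);
INSERT INTO pktable VALUES (1), (2), (3), (4);
INSERT INTO fktable VALUES ('{{1,2},{3,4}}');
-- DELETE FROM pktable WHERE a = 2;
-- ARRAY CASCADE would have to remove the element 2, but removing one
-- element of '{{1,2},{3,4}}' leaves no valid rectangular array, so no
-- well-defined result exists; the patch downgrades to ARRAY SET NULL.
```

For one-dimensional arrays the element can simply be deleted from the array, so the action is well defined there.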
----------------+---------+---------+
                | ON      | ON      |
 Action         | DELETE  | UPDATE  |
----------------+---------+---------+
 CASCADE        | Row     | Error   |
 SET NULL       | Row     | Row     |
 SET DEFAULT    | Row     | Row     |
 ARRAY CASCADE  | Element | Element |
 ARRAY SET NULL | Element | Element |
 NO ACTION      |    -    |    -    |
 RESTRICT       |    -    |    -    |
----------------+---------+---------+

To complete the ARRAY -> EACH transition, I would suggest names like CASCADE EACH/SET EACH NULL.

I like the extensive test cases you have included. There's one more thing they should do: leave tables having EACH REFERENCES relationships in the regression database. This way, pg_dump tests of the regression database will exercise pg_dump with respect to this feature.

The patch emits several warnings:

heap.c: In function `StoreRelCheck':
heap.c:1947: warning: passing argument 17 of `CreateConstraintEntry' makes integer from pointer without a cast
index.c: In function `index_constraint_create':
index.c:1160: warning: passing argument 17 of `CreateConstraintEntry' makes integer from pointer without a cast
In file included from gram.y:13051:
scan.c: In function `yy_try_NUL_trans':
scan.c:16243: warning: unused variable `yyg'
trigger.c: In function `CreateTrigger':
trigger.c:454: warning: passing argument 17 of `CreateConstraintEntry' makes integer from pointer without a cast
typecmds.c: In function `domainAddConstraint':
typecmds.c:2960: warning: passing argument 17 of `CreateConstraintEntry' makes integer from pointer without a cast
arrayfuncs.c: In function `array_remove':
arrayfuncs.c:5197: warning: unused variable `dimresult'
ri_triggers.c: In function `RI_FKey_check':
ri_triggers.c:484: warning: too many arguments for format

This test case, copied from my previous review except for updating the syntax, still fails:

BEGIN;
CREATE TABLE parent (c int PRIMARY KEY);
CREATE TABLE child (c int[]);
INSERT INTO parent VALUES (1);
INSERT INTO child VALUES ('{3,1,2}');
ALTER TABLE child ADD FOREIGN KEY (c) EACH REFERENCES parent; -- should error
INSERT INTO child VALUES ('{3,1,2}'); -- does error, as expected
ROLLBACK;

Most of my code comments
concern minor matters:

*** a/doc/src/sgml/ddl.sgml
--- b/doc/src/sgml/ddl.sgml
*** CREATE TABLE order_items (
*** 852,857 ****
--- 882,931 ----
      </para>

      <para>
+      When working with foreign key arrays, you have two more
+      options that can be used with your
+      <literal>EACH REFERENCES</literal> definition:
+      <literal>ARRAY CASCADE</literal> and
+      <literal>ARRAY SET NULL</literal>. Depending on
+      the triggering action (<command>DELETE</command> or
+      <command>UPDATE</command>) on the referenced table,
+      every element in the referencing array will be either
+      deleted/updated or set to NULL.
+      For more detailed information on foreign key arrays
+      options and special cases, please refer to the
+      documentation for <xref linkend="sql-createtable">.
+     </para>
+
+     <para>
+      For instance, in the example below, a <command>DELETE</command>
+      from the <literal>countries</literal> table
Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE
On Sat, Jan 21, 2012 at 08:04:20PM -0800, Jeff Janes wrote: This is failing make check for me. *** /tmp/foo/src/test/regress/expected/alter_table.out Sat Jan 21 19:51:46 2012 --- /tmp/foo/src/test/regress/results/alter_table.out Sat Jan 21 19:54:18 2012 *** *** 1662,1667 --- 1662,1668 NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index norewrite1_parent_pkey for table norewrite1_parent DEBUG: building index norewrite1_parent_pkey on table norewrite1_parent CREATE TABLE norewrite1_child(c text REFERENCES norewrite1_parent); + DEBUG: building index pg_toast_41491_index on table pg_toast_41491 ALTER TABLE norewrite1_child ALTER c TYPE varchar; -- recheck FK DEBUG: validating foreign key constraint norewrite1_child_c_fkey CREATE DOMAIN other_int AS int; Thanks. I've attached a new version fixing this problem. diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index cc210f0..bb2cf62 100644 *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** *** 276,281 static Oid transformFkeyCheckAttrs(Relation pkrel, --- 276,282 int numattrs, int16 *attnums, Oid *opclasses); static void checkFkeyPermissions(Relation rel, int16 *attnums, int natts); + static CoercionPathType findFkeyCast(Oid targetTypeId, Oid sourceTypeId, Oid *funcid); static void validateCheckConstraint(Relation rel, HeapTuple constrtup); static void validateForeignKeyConstraint(char *conname, Relation rel, Relation pkrel, *** *** 357,362 static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMOD --- 358,364 static void ATPostAlterTypeParse(Oid oldId, char *cmd, List **wqueue, LOCKMODE lockmode, bool rewrite); static void TryReuseIndex(Oid oldId, IndexStmt *stmt); + static void TryReuseForeignKey(Oid oldId, Constraint *con); static void change_owner_fix_column_acls(Oid relationOid, Oid oldOwnerId, Oid newOwnerId); static void change_owner_recurse_to_sequences(Oid relationOid, *** *** 5574,5579 
ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, --- 5576,5583 numpks; Oid indexOid; Oid constrOid; + bool old_check_ok; + ListCell *old_pfeqop_item = list_head(fkconstraint->old_conpfeqop); /* * Grab an exclusive lock on the pk table, so that someone doesn't delete *** *** 5690,5695 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, --- 5694,5706 (errcode(ERRCODE_INVALID_FOREIGN_KEY), errmsg("number of referencing and referenced columns for foreign key disagree"))); + /* + * On the strength of a previous constraint, we might avoid scanning + * tables to validate this one. See below. + */ + old_check_ok = (fkconstraint->old_conpfeqop != NIL); + Assert(!old_check_ok || numfks == list_length(fkconstraint->old_conpfeqop)); + for (i = 0; i < numpks; i++) { Oid pktype = pktypoid[i]; *** *** 5704,5709 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, --- 5715,5721 Oid ppeqop; Oid ffeqop; int16 eqstrategy; + Oid pfeqop_right; /* We need several fields out of the pg_opclass entry */ cla_ht = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclasses[i])); *** *** 5746,5755 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, pfeqop = get_opfamily_member(opfamily, opcintype, fktyped, eqstrategy); if (OidIsValid(pfeqop)) ffeqop = get_opfamily_member(opfamily, fktyped, fktyped, eqstrategy); else ! ffeqop = InvalidOid; /* keep compiler quiet */ if (!(OidIsValid(pfeqop) && OidIsValid(ffeqop))) { --- 5758,5774 pfeqop = get_opfamily_member(opfamily, opcintype, fktyped, eqstrategy); if (OidIsValid(pfeqop)) + { + pfeqop_right = fktyped; ffeqop = get_opfamily_member(opfamily, fktyped, fktyped,
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
New version that repairs a defective test case. diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index 712b0b0..1bf1de5 100644 *** a/src/backend/commands/indexcmds.c --- b/src/backend/commands/indexcmds.c *** *** 23,28 --- 23,29 #include catalog/pg_opclass.h #include catalog/pg_opfamily.h #include catalog/pg_tablespace.h + #include catalog/pg_type.h #include commands/dbcommands.h #include commands/defrem.h #include commands/tablecmds.h *** *** 52,57 --- 53,59 /* non-export function prototypes */ static void CheckPredicate(Expr *predicate); static void ComputeIndexAttrs(IndexInfo *indexInfo, + Oid *typeOidP, Oid *collationOidP, Oid *classOidP, int16 *colOptionP, *** *** 87,104 static void RangeVarCallbackForReindexIndex(const RangeVar *relation, * of columns and that if one has an expression column or predicate, both do. * Errors arising from the attribute list still apply. * ! * Most column type changes that can skip a table rewrite will not invalidate ! * indexes. For btree and hash indexes, we assume continued validity when ! * each column of an index would have the same operator family before and ! * after the change. Since we do not document a contract for GIN or GiST ! * operator families, we require an exact operator class match for them and ! * for any other access methods. ! * ! * DefineIndex always verifies that each exclusion operator shares an operator ! * family with its corresponding index operator class. For access methods ! * having no operator family contract, confirm that the old and new indexes ! * use the exact same exclusion operator. For btree and hash, there's nothing ! * more to check. * * We do not yet implement a test to verify compatibility of expression * columns or predicates, so assume any such index is incompatible. --- 89,105 * of columns and that if one has an expression column or predicate, both do. * Errors arising from the attribute list still apply. * ! 
* Most column type changes that can skip a table rewrite do not invalidate ! * indexes. We acknowledge this when all operator classes, collations and ! * exclusion operators match. Though we could further permit intra-opfamily ! * changes for btree and hash indexes, that adds subtle complexity with no ! * concrete benefit for core types. ! * ! * When a comparison or exclusion operator has a polymorphic input type, the ! * actual input types must also match. This defends against the possibility ! * that operators could vary behavior in response to get_fn_expr_argtype(). ! * At present, this hazard is theoretical: check_exclusion_constraint() and ! * all core index access methods decline to set fn_expr for such calls. * * We do not yet implement a test to verify compatibility of expression * columns or predicates, so assume any such index is incompatible. *** *** 111,116 CheckIndexCompatible(Oid oldId, --- 112,118 List *exclusionOpNames) { bool isconstraint; + Oid *typeObjectId; Oid *collationObjectId; Oid *classObjectId; Oid accessMethodId; *** *** 123,132 CheckIndexCompatible(Oid oldId, int numberOfAttributes; int old_natts; bool isnull; - bool family_am; bool ret = true; oidvector *old_indclass; oidvector *old_indcollation; int i; Datum d; --- 125,134 int numberOfAttributes; int old_natts; bool isnull; bool ret = true; oidvector *old_indclass; oidvector *old_indcollation; + Relation irel; int i; Datum d; *** *** 168,177 CheckIndexCompatible(Oid oldId, indexInfo->ii_ExclusionOps = NULL; indexInfo->ii_ExclusionProcs = NULL; indexInfo->ii_ExclusionStrats = NULL; collationObjectId = (Oid *) palloc(numberOfAttributes * sizeof(Oid)); classObjectId = (Oid *) palloc(numberOfAttributes * sizeof(Oid)); coloptions = (int16 *) palloc(numberOfAttributes * sizeof(int16)); ! ComputeIndexAttrs(indexInfo, collationObjectId, classObjectId, coloptions, attributeList, exclusionOpNames, relationId, accessMethodName, accessMethodId, --- 170,181 indexInfo->ii_ExclusionOps = NULL;
Re: [HACKERS] [PATCH] Support for foreign keys with arrays
On Sun, Jan 22, 2012 at 09:06:49PM +, Simon Riggs wrote: On Sat, Jan 21, 2012 at 8:42 PM, Noah Misch n...@leadboat.com wrote: You currently forbid multi-column EACH FKs. I agree that we should allow only one array column per FK; with more, the set of required PK rows would be something like the Cartesian product of the elements of array columns. However, there are no definitional problems, at least for NO ACTION, around a FK constraint having one array column and N scalar columns. Whether or not you implement that now, let's choose a table_constraint syntax leaving that opportunity open. How about: FOREIGN KEY(col_a, EACH col_b, col_c) REFERENCES pktable (a, b, c) I don't think we should be trying to cover every possible combination of arrays, non-arrays and all the various options. The number of combinations is making this patch larger than it needs to be and as a result endangers its being committed in this release just on committer time to cope with the complexity. We have a matter of weeks to get this rock solid. Yes, lets keep syntax open for future additions, but lets please focus/edit this down to a solid, useful patch for 9.2. +1. Let's change the syntax to leave that door open but not use the flexibility at this time.
Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Sun, Jan 22, 2012 at 11:47:06PM +0200, Mikko Tiihonen wrote: On 01/20/2012 04:45 AM, Noah Misch wrote: On Thu, Jan 19, 2012 at 02:00:20PM -0500, Robert Haas wrote: Having said that, if we're to follow the precedent set by bytea_format, maybe we ought to just add binary_array_format={huge,ittybitty} and be done with it, rather than inventing a minor protocol version GUC for something that isn't really a protocol version change at all. We could introduce a differently-named general mechanism, but I guess I'm not seeing the point of that either. Just because someone has a backward-compatibility issue with one change of this type doesn't mean they have a similar issue with all of them. So I think adding a special-purpose GUC is more logical and more parallel to what we've done in the past, and it doesn't saddle us with having to be certain that we've designed the mechanism generally enough to handle all the cases that may come later. That makes sense. An attraction of a single binary format version was avoiding the "Is this worth a GUC?" conversation for each change. However, adding a GUC should be no more notable than bumping a binary format version. I see the main difference between a GUC per feature vs. a minor version being that with versioned changes old clients keep working, because they have to explicitly request a specific version, whereas with separate GUC variables each feature will be enabled by default and users have to either keep up with new client versions or figure out how to explicitly disable the changes. No, we can decide that anew for each GUC. If you'd propose that array_output = full be the default, that works for me. Here is a second version of the patch. The binary array encoding changes stay the same but all code around was rewritten.
Changes from previous versions based on received comments:
* removed the minor protocol version concept
* introduced a new GUC variable array_output copying the current bytea_output type, with values full (old value) and smallfixed (new default)

How about the name array_binary_output?

* added documentation for the new GUC variable
* used constants for the array flags variable values

*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*** COPY postgres_log FROM '/full/path/to/lo
*** 4888,4893 ****
--- 4888,4910 ----
       </listitem>
      </varlistentry>

+     <varlistentry id="guc-array-output" xreflabel="array_output">
+      <term><varname>array_output</varname> (<type>enum</type>)</term>
+      <indexterm>
+       <primary><varname>array_output</varname> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Sets the output format for binary encoding of values of
+        type <type>array</type>. Valid values are
+        <literal>smallfixed</literal> (the default)
+        and <literal>full</literal> (the traditional PostgreSQL
+        format). The <type>array</type> type always

It's not "The <type>array</type> type" but "Array types", a class.

+        accepts both formats on input, regardless of this setting.
+       </para>
+      </listitem>
+     </varlistentry>

The section "Array Input and Output Syntax" should reference this GUC.

*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*** *** 64,69 --- 64,70
  #include "storage/predicate.h"
  #include "tcop/tcopprot.h"
  #include "tsearch/ts_cache.h"
+ #include "utils/array.h"
  #include "utils/builtins.h"
  #include "utils/bytea.h"
  #include "utils/guc_tables.h"
*** static const struct config_enum_entry by *** 225,230 --- 226,243
  };

  /*
+  * Options for enum values defined in this module.
+  *
+  * NOTE! Option values may not contain double quotes!
+  */

Don't replicate this comment.
+
+ static const struct config_enum_entry array_output_options[] = {
+ 	{"full", ARRAY_OUTPUT_FULL, false},
+ 	{"smallfixed", ARRAY_OUTPUT_SMALLFIXED, false},
+ 	{NULL, 0, false}
+ };
+
+ /* * We have different sets for client and server message level options because * they sort slightly different (see log level) */ *** static struct config_enum ConfigureNames *** 3047,3052 --- 3060,3075 NULL, NULL, NULL }, + { + {"array_output", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,

You've put the GUC in COMPAT_OPTIONS_PREVIOUS, but you've put its documentation in the section for CLIENT_CONN_STATEMENT. I don't have a strong opinion on which one to use, but be consistent.

+ gettext_noop("Sets the binary output format for arrays."), + NULL + }, + bytea_output, array_output + ARRAY_OUTPUT_SMALLFIXED, array_output_options, + NULL, NULL, NULL + }, + { {"client_min_messages", PGC_USERSET, LOGGING_WHEN, gettext_noop(Sets the message levels
Re: [HACKERS] Collect frequency statistics for arrays
On Mon, Jan 23, 2012 at 01:21:20AM +0400, Alexander Korotkov wrote: Updated patch is attached. I've updated the comment of mcelem_array_contained_selec with a more detailed description of the probability distribution assumption. Also, I found that the "rest" behavior is better described by a Poisson distribution; relevant changes were made. Thanks. That makes more of the math clear to me. I do not follow all of it, but I feel that the comments now have enough information that I could go about doing so.

+ /* Take care about events with low probabilities. */
+ if (rest > DEFAULT_CONTAIN_SEL)
+ {

Why the change from "rest > 0" to this in the latest version?

+ /* emit some statistics for debug purposes */
+ elog(DEBUG3, "array: target # mces = %d, bucket width = %d, "
+      "# elements = %llu, hashtable size = %d, usable entries = %d",
+      num_mcelem, bucket_width, element_no, i, track_len);

That should be UINT64_FMT. (I introduced that error in v0.10.) I've attached a new version that includes the UINT64_FMT fix, some edits of your newest comments, and a rerun of pgindent on the new files. I see no other issues precluding commit, so I am marking the patch Ready for Committer. If I made any of the comments worse, please post another update.

Thanks, nm

arrayanalyze-0.13.patch.gz Description: application/gunzip
Re: [HACKERS] Re: Add minor version to v3 protocol to allow changes without breaking backwards compatibility
On Mon, Jan 23, 2012 at 10:34:12AM -0500, Robert Haas wrote: On Mon, Jan 23, 2012 at 9:36 AM, Florian Weimer fwei...@bfk.de wrote: * Robert Haas: In this particular case, I knew that the change was coming and could push updated Java and Perl client libraries well before the server-side change hit our internal repository, but I really don't want to have to pay attention to such details. But if we *don't* turn this on by default, then chances are very good that it will get much less use. That doesn't seem good either. If it's important enough to do it at all, then IMHO it's important enough for it to be turned on by default. We have never made any guarantee that the binary format won't change from release to release. The problem here are libpq-style drivers which expose the binary format to the application. The Java driver doesn't do that, but the Perl driver does. (However, both transparently decode BYTEA values received in text format, which led to the compatibility issue.) I can see where that could cause some headaches... but it seems to me that if we take that concern seriously, it brings us right back to square one. If breaking the binary format causes too much pain to actually go do it, then we shouldn't change it until we're breaking everything else anyway (i.e. new protocol version, as Tom suggested). As I said upthread, and you appeared to agree, the protocol is independent of individual data type send/recv formats. Even if we were already adding protocol v4 to PostgreSQL 9.2, having array_send() change its behavior in response to the active protocol version would be wrong.
Re: [HACKERS] Measuring relation free space
On Mon, Jan 23, 2012 at 04:56:24PM -0300, Alvaro Herrera wrote: Excerpts from Noah Misch's message of vie ene 20 22:33:30 -0300 2012: pgstattuple() figures the free_percent by adding up all space available to hold tuples and dividing that by the simple size of the relation. Non-leaf pages and the meta page get identical treatment: both never hold tuples, so they do not contribute to the free space. Hm. Leaf pages hold as many tuples as non-leaf pages, no? I mean, for each page element there's a value and a CTID. In non-leaf pages those CTIDs point to other index pages, one level down the tree; in leaf pages they point to the heap. That distinction seemed important when I sent my last message, but now I agree that it's largely irrelevant for free space purposes. If someone feels like doing it, +1 for making pgstattuple() count non-leaf free space. Thanks, nm
Re: [HACKERS] Measuring relation free space
On Tue, Jan 24, 2012 at 11:24:08AM -0500, Jaime Casanova wrote: On Mon, Jan 23, 2012 at 7:18 PM, Noah Misch n...@leadboat.com wrote: If someone feels like doing it, +1 for making pgstattuple() count non-leaf free space. actually i agreed that non-leaf pages are irrelevant... i just confirmed that in a production system with 300GB, none of the indexes, whether in an 84M-row table or in a heavily updated one, has more than 1 root page; all the rest are deleted, half_dead or leaf. so the possibility of bloat coming from non-leaf pages seems very odd FWIW, the number to look at is internal_pages from pgstatindex():

[local] test=# create table t4 (c) as select * from generate_series(1,100);
SELECT 100
[local] test=# alter table t4 add primary key(c);
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index t4_pkey for table t4
ALTER TABLE
[local] test=# select * from pgstatindex('t4_pkey');
-[ RECORD 1 ]------+---------
version            | 2
tree_level         | 2
index_size         | 22478848
root_block_no      | 290
internal_pages     | 10
leaf_pages         | 2733
empty_pages        | 0
deleted_pages      | 0
avg_leaf_density   | 90.06
leaf_fragmentation | 0

So, 0.4% of this index. They appear in proportion to the logarithm of the total index size. I agree that bloat centered on them is unlikely. Counting them would be justified, but that is a question of formal accuracy rather than practical importance. but the possibility of bloat coming from the meta page doesn't exist, AFAIUI at least we need the most accurate value about usable free space, because the idea is to add a sampler mode to the function so we don't scan the whole relation. that's why we still need the function. I doubt we'd add this function solely on the basis that a future improvement will make it useful. For the patch to go in now, it needs to be useful now. (This is not a universal principle, but it mostly holds for low-complexity patches like this one.) All my comments below would also apply to such a broader patch. btw...
pgstattuple also has the problem that it's not using a ring buffer attached are two patches: - v5: is the same original patch but only track space in leaf, deleted and half_dead pages - v5.1: adds the same for all kind of indexes (problem is that this is inconsistent with the fact that pageinspect only manages btree indexes for everything else) Let's take a step back. Again, what you're proposing is essentially a faster implementation of SELECT free_percent FROM pgstattuple(rel). If this code belongs in core at all, it belongs in the pgstattuple module. Share as much infrastructure as is reasonable between the user-visible functions of that module. For example, I'm suspecting that the pgstat_index() call tree should be shared, with pgstat_index_page() checking a flag to decide whether to gather per-tuple stats. Next, compare the bits of code that differ between pgstattuple() and relation_free_space(), convincing yourself that the differences are justified. Each difference will yield one of the following conclusions: 1. Your code contains an innovation that would apply to both functions. Where not too difficult, merge these improvements into pgstattuple(). In order for a demonstration of your new code's better performance to be interesting, we must fix the same low-hanging fruit in its competitor. One example is the use of the bulk read strategy. Another is the support for SP-GiST. 2. Your code is missing an essential behavior of pgstattuple(). Add it to your code. One example is the presence of CHECK_FOR_INTERRUPTS() calls. 3. Your code behaves differently from pgstattuple() due to a fundamental difference in their tasks. These are the only functional differences that ought to remain in your finished patch; please point them out in comments. For example, pgstat_heap() visits every tuple in the heap. You'll have no reason to do that; pgstattuple() only needs it to calculate statistics other than free_percent. 
In particular, I call your attention to the fact that pgstattuple() takes shared buffer content locks before examining pages. Your proposed patch does not do so. I do not know with certainty whether that falls under #1 or #2. The broad convention is to take such locks, because we elsewhere want an exact answer. These functions are already inexact; they make no effort to observe a consistent snapshot of the table. If you convince yourself that the error arising from not locking buffers is reasonably bounded, we can lose the locks (in both functions -- conclusion #1). Comments would then be in order. With all that done, run some quick benchmarks: see how SELECT free_percent FROM pgstattuple(rel) fares compared to SELECT relation_free_space(rel) for a large heap and for a large B-tree index. If the timing difference is too small to be interesting to you, remove relation_free_space() and submit your pgstattuple() improvements alone. Otherwise
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
On Wed, Jan 25, 2012 at 03:32:49PM -0500, Robert Haas wrote: On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote: New version that repairs a defective test case. Committed. I don't find this to be particularly good style: Thanks.

+ for (i = 0; i < old_natts && ret; i++)
+     ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
+            irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);

...but I am not sure whether we have any formal policy against it, so I just committed it as-is for now. I would have surrounded the loop with an if (ret) block and written the body of the loop as if (condition) { ret = false; break; }. I value the savings in vertical space more than the lost idiomaticness. This decision is 90+% subjective, so I cannot blame you for concluding otherwise. I do know the feeling of looking at PostgreSQL source code and wishing the author had not attempted to conserve every line.
Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE
On Wed, Jan 25, 2012 at 11:17:27AM -0300, Alvaro Herrera wrote: Excerpts from Noah Misch's message of dom ene 22 02:05:31 -0300 2012: Thanks. I've attached a new version fixing this problem. Looks good to me. Can you please clarify this bit? * Since we elsewhere require that all collations share the same * notion of equality, don't compare collation here. Since I'm not familiar with this code, it's difficult for me to figure out where this "elsewhere" is, and what "same notion of equality" means. Good questions. See the comment in front of ri_GenerateQualCollation(), the comment in process_ordered_aggregate_single() at nodeAgg.c:605, the larger comment in add_sort_column(), and the two XXX comments in relation_has_unique_index_for(). We should probably document that assumption in xindex.sgml to keep type implementors apprised. "Same notion of equality" just means that "a COLLATE x = b COLLATE y" has the same value for every x, y. In any event, the patch needed a rebase, so I've attached it rebased and with that comment edited to reference ri_GenerateQualCollation(), that being the most-relevant source for the assumption in question.
Thanks for reviewing, nm diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index cb8ac67..ba20950 100644 *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** *** 276,281 static Oid transformFkeyCheckAttrs(Relation pkrel, --- 276,282 int numattrs, int16 *attnums, Oid *opclasses); static void checkFkeyPermissions(Relation rel, int16 *attnums, int natts); + static CoercionPathType findFkeyCast(Oid targetTypeId, Oid sourceTypeId, Oid *funcid); static void validateCheckConstraint(Relation rel, HeapTuple constrtup); static void validateForeignKeyConstraint(char *conname, Relation rel, Relation pkrel, *** *** 357,362 static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMOD --- 358,364 static void ATPostAlterTypeParse(Oid oldId, char *cmd, List **wqueue, LOCKMODE lockmode, bool rewrite); static void TryReuseIndex(Oid oldId, IndexStmt *stmt); + static void TryReuseForeignKey(Oid oldId, Constraint *con); static void change_owner_fix_column_acls(Oid relationOid, Oid oldOwnerId, Oid newOwnerId); static void change_owner_recurse_to_sequences(Oid relationOid, *** *** 5591,5596 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, --- 5593,5600 numpks; Oid indexOid; Oid constrOid; + bool old_check_ok; + ListCell *old_pfeqop_item = list_head(fkconstraint->old_conpfeqop); /* * Grab an exclusive lock on the pk table, so that someone doesn't delete *** *** 5707,5712 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, --- 5711,5723 (errcode(ERRCODE_INVALID_FOREIGN_KEY), errmsg("number of referencing and referenced columns for foreign key disagree"))); + /* + * On the strength of a previous constraint, we might avoid scanning + * tables to validate this one. See below.
+ */ + old_check_ok = (fkconstraint->old_conpfeqop != NIL); + Assert(!old_check_ok || numfks == list_length(fkconstraint->old_conpfeqop)); + for (i = 0; i < numpks; i++) { Oid pktype = pktypoid[i]; *** *** 5721,5726 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, --- 5732,5738 Oid ppeqop; Oid ffeqop; int16 eqstrategy; + Oid pfeqop_right; /* We need several fields out of the pg_opclass entry */ cla_ht = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclasses[i])); *** *** 5763,5772 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, pfeqop = get_opfamily_member(opfamily, opcintype, fktyped, eqstrategy); if (OidIsValid(pfeqop)) ffeqop = get_opfamily_member(opfamily, fktyped, fktyped, eqstrategy); else ! ffeqop = InvalidOid; /* keep compiler quiet */ if (!(OidIsValid(pfeqop) && OidIsValid(ffeqop))) { --- 5775,5791
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
On Wed, Jan 25, 2012 at 11:53:10PM -0500, Tom Lane wrote:
> Not only is that code spectacularly unreadable, but has nobody noticed
> that this commit broke the buildfarm?

Thanks for reporting the problem.  This arose because the new test case
temporarily sets client_min_messages=DEBUG1.  The default buildfarm
configuration sets log_statement=all in its postgresql.conf, so the client
gets those log_statement lines.  I would fix this as attached, resetting the
optional logging to defaults during the test cases in question.  Not
delightful, but that's what we have to work with.

nm

*** a/src/test/regress/expected/alter_table.out
--- b/src/test/regress/expected/alter_table.out
***************
*** 1665,1671 **** where oid = 'test_storage'::regclass;
--- 1665,1679 ----
   t
  (1 row)
  
+ --
  -- SET DATA TYPE without a rewrite
+ --
+ -- Temporarily disable optional logging that make installcheck testers
+ -- reasonably might enable.
+ SET log_duration = off;
+ SET log_lock_waits = off;
+ SET log_statement = none;
+ SET log_temp_files = -1;
  CREATE DOMAIN other_textarr AS text[];
  CREATE TABLE norewrite_array(c text[] PRIMARY KEY);
  NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "norewrite_array_pkey" for table "norewrite_array"
***************
*** 1674,1679 **** ALTER TABLE norewrite_array ALTER c TYPE text[]; -- no work
--- 1682,1691 ----
  ALTER TABLE norewrite_array ALTER c TYPE other_textarr; -- rebuild index
  DEBUG:  building index "norewrite_array_pkey" on table "norewrite_array"
  RESET client_min_messages;
+ RESET log_duration;
+ RESET log_lock_waits;
+ RESET log_statement;
+ RESET log_temp_files;
  --
  -- lock levels
  --
*** a/src/test/regress/sql/alter_table.sql
--- b/src/test/regress/sql/alter_table.sql
***************
*** 1197,1203 **** select reltoastrelid <> 0 as has_toast_table
--- 1197,1213 ----
  from pg_class
  where oid = 'test_storage'::regclass;
  
+ --
  -- SET DATA TYPE without a rewrite
+ --
+ 
+ -- Temporarily disable optional logging that make installcheck testers
+ -- reasonably might enable.
+ SET log_duration = off;
+ SET log_lock_waits = off;
+ SET log_statement = none;
+ SET log_temp_files = -1;
+ 
  CREATE DOMAIN other_textarr AS text[];
  CREATE TABLE norewrite_array(c text[] PRIMARY KEY);
  SET client_min_messages = debug1;
***************
*** 1205,1210 **** ALTER TABLE norewrite_array ALTER c TYPE text[]; -- no work
--- 1215,1225 ----
  ALTER TABLE norewrite_array ALTER c TYPE other_textarr; -- rebuild index
  RESET client_min_messages;
+ RESET log_duration;
+ RESET log_lock_waits;
+ RESET log_statement;
+ RESET log_temp_files;
+ 
  --
  -- lock levels
  --

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE
On Wed, Jan 25, 2012 at 10:39:56PM -0500, Noah Misch wrote:
> In any event, the patch needed a rebase, so I've attached it rebased and
> with that comment edited to reference ri_GenerateQualCollation(), that
> being the most-relevant source for the assumption in question.

Commit 9d35116611e6a1fc10f2298944fbf0e4e1a826be invalidated the test case
hunks again.  We'll need to either remove the test cases, as Robert chose to
do for that other patch, or bolster them per
http://archives.postgresql.org/message-id/20120126115536.GD15670%40tornado.leadboat.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Wed, Jan 11, 2012 at 10:12:31PM +, Simon Riggs wrote:
> v7

This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled.  One could not simply condition them on the
page_checksums setting, though.  Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on.  If that write tears,
leaving the checksum intact, that block will now fail verification.  A couple
of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and
the field is InvalidXLogRecPtr, set the field to the next LSN.  When a
checksum failure occurs in a page with LSN older than the stored one, emit
either a softer warning or no message at all.

Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
the buffer.  Here are other sites I noticed that do MarkBufferDirty() without
bumping the page LSN:

	heap_xlog_visible()
	visibilitymap_clear()
	visibilitymap_truncate()
	lazy_scan_heap()
	XLogRecordPageWithFreeSpace()
	FreeSpaceMapTruncateRel()
	fsm_set_and_search()
	fsm_vacuum_page()
	fsm_search_avail()

Though I have not investigated each of these in detail, I suspect most or all
represent continued torn-page hazards.  Since the FSM is just a hint, we do
not WAL-log it at all; it would be nicest to always skip checksums for it,
too.  The visibility map shortcuts are more nuanced.

When page_checksums=off and we read a page last written by page_checksums=on,
we still verify its checksum.  If a mostly-insert/read site uses checksums
for some time and eventually wishes to shed the overhead, disabling the
feature will not stop the costs for reads of old data.  They would need a
dump/reload to get the performance of a never-checksummed database.
Let's either unconditionally skip checksum validation under
page_checksums=off or add a new state like page_checksums=ignore for that
even-weaker condition.

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml

+       <para>
+        Turning this parameter off speeds normal operation, but
+        might allow data corruption to go unnoticed. The checksum uses
+        16-bit checksums, using the fast Fletcher 16 algorithm. With this
+        parameter enabled there is still a non-zero probability that an error
+        could go undetected, as well as a non-zero probability of false
+        positives.
+       </para>

What sources of false positives do you intend to retain?

--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -20,6 +20,7 @@
 #include "postgres.h"
 
 #include "access/visibilitymap.h"
+#include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
@@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
 /* XLOG gives us high 4 bits */
 #define XLOG_SMGR_CREATE	0x10
 #define XLOG_SMGR_TRUNCATE	0x20
+#define XLOG_SMGR_HINT		0x40
 
 typedef struct xl_smgr_create
 {
@@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
 	smgrDoPendingDeletes(false);
 }
 
+/*
+ * Write a backup block if needed when we are setting a hint.
+ *
+ * Deciding the "if needed" bit is delicate and requires us to either
+ * grab WALInsertLock or check the info_lck spinlock. If we check the
+ * spinlock and it says Yes then we will need to get WALInsertLock as well,
+ * so the design choice here is to just go straight for the WALInsertLock
+ * and trust that calls to this function are minimised elsewhere.
+ *
+ * Callable while holding share lock on the buffer content.
+ *
+ * Possible that multiple concurrent backends could attempt to write
+ * WAL records. In that case, more than one backup block may be recorded
+ * though that isn't important to the outcome and the backup blocks are
+ * likely to be identical anyway.
+ */
+#define SMGR_HINT_WATERMARK 13579
+void
+smgr_buffer_hint(Buffer buffer)
+{
+	/*
+	 * Make an XLOG entry reporting the hint
+	 */
+	XLogRecPtr	lsn;
+	XLogRecData rdata[2];
+	int			watermark = SMGR_HINT_WATERMARK;
+
+	/*
+	 * Not allowed to have zero-length records, so use a small watermark
+	 */
+	rdata[0].data = (char *) &(watermark);
+	rdata[0].len = sizeof(int);
+	rdata[0].buffer = InvalidBuffer;
+	rdata[0].buffer_std = false;
+	rdata[0].next = &(rdata[1]);
+
+	rdata[1].data = NULL;
+	rdata[1].len = 0;
+	rdata[1].buffer = buffer;
+	rdata[1].buffer_std = true;
+	rdata[1].next = NULL;
+
+	lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
+
+	/*
+	 * Set the page LSN if we wrote a backup block.
+
Re: [HACKERS] initdb and fsync
On Fri, Jan 27, 2012 at 04:19:41PM -0800, Jeff Davis wrote:
> It looks like initdb doesn't fsync all the files it creates, e.g. the
> PG_VERSION file.  While it's unlikely that it would cause any real data
> loss, it can be inconvenient in some testing scenarios involving VMs.
>
> Thoughts?  Would a patch to add a few fsync calls to initdb be accepted?

+1.  If I'm piloting "strace -f" right, initdb currently issues *no* syncs.
We'd probably, then, want a way to re-disable the fsyncs for hacker benefit.
Is a platform-independent fsync available at initdb time?  Not sure.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
On Thu, Jan 26, 2012 at 08:23:34AM -0500, Robert Haas wrote:
> On Thu, Jan 26, 2012 at 6:55 AM, Noah Misch n...@leadboat.com wrote:
> > On Wed, Jan 25, 2012 at 11:53:10PM -0500, Tom Lane wrote:
> > > Not only is that code spectacularly unreadable, but has nobody
> > > noticed that this commit broke the buildfarm?
> >
> > Thanks for reporting the problem.  This arose because the new test case
> > temporarily sets client_min_messages=DEBUG1.  The default buildfarm
> > configuration sets log_statement=all in its postgresql.conf, so the
> > client gets those log_statement lines.  I would fix this as attached,
> > resetting the optional logging to defaults during the test cases in
> > question.  Not delightful, but that's what we have to work with.
>
> I'm just going to remove the test.  This is not very future-proof and

It would deserve an update whenever we add a new optional-logging GUC
pertaining to user backends.  Other tests require similarly-infrequent
refreshes in response to other changes.  Of course, buildfarm members would
not be setting those new GUCs the day they're available.  Calling for an
update to the test could reasonably fall to the first buildfarm member owner
who actually decides to use a new GUC in his configuration.

> an ugly pattern if it gets copied to other places.  We need to work on

I would rather folks introduce ugliness to the test suite than introduce
behaviors having no test coverage.

> a more sensible way for ALTER TABLE to report what it did, but a
> solution based on what GUCs the build-farm happens to set doesn't seem
> like it's justified for the narrowness of the case we're testing here.

It's not based on what GUCs the buildfarm happens to set.  I looked up all
GUCs that might create problems such as the one observed here.  They were
the four I included in the patch, plus debug_pretty_print,
debug_print_parse, debug_print_plan and debug_print_rewritten.
I concluded that the four debug_* ones were materially less likely to ever
get set in postgresql.conf for a make installcheck run, so I left them out
for brevity.  The setting changed *by default* for buildfarm clients is
log_statement.  Buildfarm member owners can do as they wish, though.

> Whether or not we allow this case to work without a rewrite is in some
> sense arbitrary.  There's no real reason it can't be done; rather, we're
> just exercising restraint to minimize the risk of future bugs.  So I
> don't want to go to great lengths to test it.

I used the same strategy in another ALTER TABLE patch this CF:
http://archives.postgresql.org/message-id/20120126033956.gc15...@tornado.leadboat.com
If we pay its costs for those tests, we then may as well let this test case
ride their coattails.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] foreign key locks, 2nd attempt
On Tue, Jan 24, 2012 at 03:47:16PM -0300, Alvaro Herrera wrote:
> The biggest item remaining is the point you raised about multixactid
> wraparound.  This is closely related to multixact truncation and the
> way checkpoints are to be handled.  If we think that MultiXactId
> wraparound is possible, and we need to involve autovacuum to keep it
> at bay, then I

To prove it possible, we need to prove there exists some sequence of
operations consuming N xids and M > N multixactids.  Have N transactions
key-lock N-1 rows apiece, then have each of them key-lock one of the rows
locked by each other transaction.  This consumes N xids and N(N-1)
multixactids.  I believe you could construct a workload with N! multixact
usage, too.

Existence proofs are one thing, real workloads another.  My unsubstantiated
guess is that multixactid use will not overtake xid use in bulk on a
workload not specifically tailored to do so.  So, I think it's enough to
notice it, refuse to assign a new multixactid, and tell the user to clear
the condition with a VACUUM FREEZE of all databases.  Other opinions would
indeed be useful.

> think the only way to make that work is to add another column to
> pg_class so that each table's oldest multixact is tracked, same as we
> do with relfrozenxid for Xids.  If we do that, I think we can do away
> with most of the MultiXactTruncate junk I added -- things would become
> a lot simpler.  The cost would be bloating pg_class a bit more.  Are
> we okay with paying that cost?  I asked this question some months ago
> and I decided that I would not add the column, but I am starting to
> lean the other way.  I would like some opinions on this.

That's not the only way; autovacuum could just accelerate normal freezing
to advance the multixactid horizon indirectly.  Your proposal could make it
far more efficient, though.

> You asked two questions about WAL-logging locks: one was about the
> level of detail we log for each lock we grab; the other was about
> heap_xlog_update logging the right info.
> AFAICS, the main thing that makes detailed WAL logging necessary is hot
> standbys.  That is, the standby must have all the locking info so that
> concurrent transactions are similarly locked as in the master ... or am
> I wrong in that?  (ISTM this means I need to fix heap_xlog_update so
> that it faithfully registers the lock info we're storing, not just the
> current Xid).

Standby servers do not need accurate tuple locks, because it's not possible
to wait on a tuple lock during recovery.  (By the way, the question about
log detail was just from my own curiosity.  We don't especially need an
answer to move forward with this patch, unless you want one.)

> * Columns that are part of the key
>
> Noah thinks the set of columns should only consider those actually
> referenced by keys, not those that *could* be referenced.

Well, do you disagree?  To me it's low-hanging fruit, because it isolates
the UPDATE-time overhead of this patch to FK-referenced tables rather than
all tables having a PK or PK-like index (often just all tables).  Also, in
a table without columns, are all columns part of the key, or is the key the
empty set?

> I changed HeapSatisfiesHOTUpdate but that seems arbitrary.

It does seem arbitrary.  What led you to switch in a later version?

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] show Heap Fetches in EXPLAIN for index-only scans
On Wed, Jan 25, 2012 at 08:42:05PM -0500, Robert Haas wrote:
> Only the first pass of vacuum knows how to mark pages all-visible.
> After the update, the first pass of the first vacuum sees a dead tuple
> on the old page and truncates it to a dead line pointer.  When it comes
> to the new page, it observes that the page is now all-visible and marks
> it so.  It then does index vacuuming and returns to the first page,
> marking the dead line pointer unused.  But during this second visit to
> the old page, there's no possibility of marking the page as
> all-visible, because the code doesn't know how to do that.  The next
> vacuum's first pass, however, can do so, because there are no longer
> any dead tuples on the page.
>
> We could fix this by modifying lazy_vacuum_page(): since we have to
> dirty the buffer anyway, we could recheck whether all the remaining
> tuples on the page are now all-visible, and if so set the visibility
> map bit.  This is probably desirable even apart from index-only scans,
> because it will frequently save the next vacuum the cost of reading,
> dirtying, and writing extra pages.  There will be some incremental CPU
> cost, but that seems likely to be more than repaid by the I/O savings.
>
> Thoughts?  Should we do this at all?  If so, should we squeeze it into
> 9.2 or leave it for 9.3?

Sounds like a good idea.  It has bothered me that two consecutive VACUUMs
of a table, with no intervening activity, can dirty a page twice.  Making
that less frequent is a good thing.  I'd hold the change for 9.3, but
that's probably an unusually-conservative viewpoint.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Review of: explain / allow collecting row counts without timing info
On Fri, Feb 03, 2012 at 11:12:33AM -0500, Robert Haas wrote:
> After looking at this a bit, I think we can make the interface a bit
> more convenient.  I'd like to propose that we call the EXPLAIN option
> ROWS rather than TIMING, and that we provide the row-count information
> if *either* ROWS or ANALYZE is specified.  Then, if you want rows
> without timing, you can say this:
>
> EXPLAIN (ROWS) query...
>
> Rather than, as in the approach taken by the current patch:
>
> EXPLAIN (ANALYZE, TIMING off) query...
>
> ...which is a little more long-winded.

I favor the originally-submitted syntax.  I like having a single option,
ANALYZE, signal whether to actually run the statement.  The submitted
syntax also aligns with the fact that would motivate someone to use it:
"Timing has high overhead on my system." or "Timing makes the output
unstable."

We have both estimated row counts and actual row counts.  It may someday
prove useful to have an invocation that plans the query and shows estimated
row counts, omitting only cost estimates.  Having a ROWS option that
implies running the query could complicate that effort.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb and fsync
On Sat, Feb 04, 2012 at 03:41:27PM -0800, Jeff Davis wrote:
> On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
> > Yeah.  Personally I would be sad if initdb got noticeably slower, and
> > I've never seen or heard of a failure that this would fix.
>
> I worked up a patch, and it looks like it does about 6 file fsync's and
> a 7th for the PGDATA directory.  That degrades the time from about 1.1s
> to 1.4s on my workstation.
>
> So, is it worth it?  Should we make it an option that can be specified?

If we add fsync calls to the initdb process, they should cover the entire
data directory tree.  This patch syncs files that initdb.c writes, but we
ought to also sync files that bootstrap-mode backends had written.  An
optimization like the pg_flush_data() call in copy_file() may reduce the
speed penalty.

initdb should do these syncs by default and offer an option to disable them.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb and fsync
On Sun, Feb 05, 2012 at 10:53:20AM -0800, Jeff Davis wrote:
> On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
> > If we add fsync calls to the initdb process, they should cover the
> > entire data directory tree.  This patch syncs files that initdb.c
> > writes, but we ought to also sync files that bootstrap-mode backends
> > had written.
>
> It doesn't make sense for initdb to take responsibility to sync files
> created by the backend.  If there are important files that the backend
> creates, it should be the backend's responsibility to fsync them (and
> their parent directory, if needed).  And if they are unimportant to the
> backend, then there is no reason for initdb to fsync them.

I meant primarily to illustrate the need to be comprehensive, not comment
on which executable should fsync a particular file.  Bootstrap-mode
backends do not sync anything during an initdb run on my system.  With
your patch, we'll fsync a small handful of files and leave nearly
everything else vulnerable.

That being said, having each backend fsync its own writes will mean syncing
certain files several times within a single initdb run.  If the penalty
from that proves high enough, we may do well to instead have initdb.c sync
everything just once.

> > initdb should do these syncs by default and offer an option to
> > disable them.
>
> For test frameworks that run initdb often, that makes sense.  But for
> developers, it doesn't make sense to spend 0.5s typing an option that
> saves you 0.3s.  So, we'd need some more convenient way to choose the
> no-fsync option, like an environment variable that developers can set.
> Or maybe developers don't care about 0.3s?

Developers have shell aliases/functions/scripts and command history.  I
wouldn't object to having an environment variable control it, but I would
not personally find that more convenient than a command-line switch.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add protransform for numeric, varbit, and temporal types
On Tue, Feb 07, 2012 at 12:43:11PM -0500, Robert Haas wrote:
> I've committed the numeric and varbit patches and will look at the
> temporal one next.

Thanks.  The comment you added to numeric_transform() has a few typos:
"constained" -> "constrained" and "nodes" -> "notes".

> I did notice one odd thing, though: sometimes we seem to get a rebuild
> on the toast index for no obvious reason:
>
> rhaas=# set client_min_messages=debug1;
> SET
> rhaas=# create table foo (a serial primary key, b varbit);
> NOTICE:  CREATE TABLE will create implicit sequence "foo_a_seq" for serial column "foo.a"
> DEBUG:  building index "pg_toast_16430_index" on table "pg_toast_16430"
> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo"
> DEBUG:  building index "foo_pkey" on table "foo"
> CREATE TABLE
> rhaas=# alter table foo alter column b set data type varbit(4);
> DEBUG:  rewriting table "foo"
> DEBUG:  building index "foo_pkey" on table "foo"
> ALTER TABLE

When we rebuild the table at this point, it has a small maximum tuple size.
Therefore, we omit the toast table entirely.

> rhaas=# alter table foo alter column b set data type varbit;
> DEBUG:  building index "pg_toast_16430_index" on table "pg_toast_16430"
> ALTER TABLE

This command makes the tuple size unbounded again, so we create a toast
table.

> Strangely, it doesn't happen if I add another column to the table:
>
> rhaas=# set client_min_messages=debug1;
> SET
> rhaas=# create table foo (a serial primary key, b varbit, c varbit);

With the extra unconstrained varbit column, the tuple size is continuously
unbounded and the table has a toast table at all stages.  So, the
difference in behavior is correct, albeit unintuitive.
> NOTICE:  CREATE TABLE will create implicit sequence "foo_a_seq" for serial column "foo.a"
> DEBUG:  building index "pg_toast_16481_index" on table "pg_toast_16481"
> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo"
> DEBUG:  building index "foo_pkey" on table "foo"
> CREATE TABLE
> rhaas=# alter table foo alter column b set data type varbit(4);
> DEBUG:  building index "pg_toast_16490_index" on table "pg_toast_16490"
> DEBUG:  rewriting table "foo"
> DEBUG:  building index "foo_pkey" on table "foo"
> ALTER TABLE
> rhaas=# alter table foo alter column b set data type varbit;
> ALTER TABLE

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Tue, Feb 07, 2012 at 08:58:59PM +, Simon Riggs wrote:
> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote:
> > On Wed, Jan 11, 2012 at 10:12:31PM +, Simon Riggs wrote:
> > > v7
> >
> > This patch uses FPIs to guard against torn hint writes, even when the
> > checksums are disabled.  One could not simply condition them on the
> > page_checksums setting, though.  Suppose page_checksums=off and we're
> > hinting a page previously written with page_checksums=on.  If that
> > write tears, leaving the checksum intact, that block will now fail
> > verification.  A couple of ideas come to mind.  (a) Read the on-disk
> > page and emit an FPI only if the old page had a checksum.  (b) Add a
> > checksumEnableLSN field to pg_control.  Whenever the cluster starts
> > with checksums disabled, set the field to InvalidXLogRecPtr.  Whenever
> > the cluster starts with checksums enabled and the field is
> > InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
> > failure occurs in a page with LSN older than the stored one, emit
> > either a softer warning or no message at all.
>
> We can only change page_checksums at restart (now) so the above seems
> moot.  If we crash with FPWs enabled we repair any torn pages.

There's no live bug, but that comes at a high cost: the patch has us emit
full-page images for hint bit writes regardless of the page_checksums
setting.

> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be
> > holding is insufficient.
>
> Am serialising this by only writing PageLSN while holding buf hdr lock.

That means also taking the buffer header spinlock in every PageGetLSN()
caller holding only a shared buffer content lock.  Do you think that will
pay off, versus the settled pattern of trading here your shared buffer
content lock for an exclusive one?

> > I can see value in an option to exclude local buffers, since
> > corruption there may be less exciting.  It doesn't seem important for
> > an initial patch, though.
>
> I'm continuing to exclude local buffers. Let me know if that should
> change.

Seems reasonable.
Thanks, nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Wed, Feb 08, 2012 at 11:40:34AM +, Simon Riggs wrote:
> On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch n...@leadboat.com wrote:
> > On Tue, Feb 07, 2012 at 08:58:59PM +, Simon Riggs wrote:
> > > On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote:
> > > > This patch uses FPIs to guard against torn hint writes, even when
> > > > the checksums are disabled.  One could not simply condition them
> > > > on the page_checksums setting, though.  Suppose page_checksums=off
> > > > and we're hinting a page previously written with page_checksums=on.
> > > > If that write tears, leaving the checksum intact, that block will
> > > > now fail verification.  A couple of ideas come to mind.  (a) Read
> > > > the on-disk page and emit an FPI only if the old page had a
> > > > checksum.  (b) Add a checksumEnableLSN field to pg_control.
> > > > Whenever the cluster starts with checksums disabled, set the field
> > > > to InvalidXLogRecPtr.  Whenever the cluster starts with checksums
> > > > enabled and the field is InvalidXLogRecPtr, set the field to the
> > > > next LSN.  When a checksum failure occurs in a page with LSN older
> > > > than the stored one, emit either a softer warning or no message at
> > > > all.
> > >
> > > We can only change page_checksums at restart (now) so the above
> > > seems moot.  If we crash with FPWs enabled we repair any torn pages.
> >
> > There's no live bug, but that comes at a high cost: the patch has us
> > emit full-page images for hint bit writes regardless of the
> > page_checksums setting.
>
> Sorry, I don't understand what you mean.  I don't see any failure cases
> that require that.
>
> page_checksums can only change at a shutdown checkpoint.  The statement
> "If that write tears, leaving the checksum intact, that block will now
> fail verification." cannot happen, ISTM.
>
> If we write out a block we update the checksum if page_checksums is set,
> or we clear it.  If we experience a torn page at crash, the FPI corrects
> that, so the checksum never does fail verification.  We only need to
> write a FPI when we write with checksums.
>
> If that's wrong, please explain a failure case in detail.
In the following description, I will treat a heap page as having two facts.
The first is either CHKSUM or !CHKSUM, and the second is either HINT or
!HINT.  A torn page write lacking the protection of a full-page image can
update one or both facts despite the transaction having logically updated
both.  (This is just the normal torn-page scenario.)  Given that, consider
the following sequence of events:

 1. startup with page_checksums = on
 2. update page P with facts CHKSUM, !HINT
 3. clean write of P
 4. clean shutdown

 5. startup with page_checksums = off
 6. update P with facts !CHKSUM, HINT.  no WAL since page_checksums=off
 7. begin write of P
 8. crash.  torn write of P only updates HINT.  disk state now CHKSUM, HINT

 9. startup with page_checksums = off
10. crash recovery.  does not touch P, because step 6 generated no WAL
11. clean shutdown

12. startup with page_checksums = on
13. read P.  checksum failure

Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
because step 6 _does_ write WAL.  That ought to change for the sake of
page_checksums=off performance, at which point we must protect the above
scenario by other means.

> > > > PageSetLSN() is not atomic, so the shared buffer content lock
> > > > we'll be holding is insufficient.
> > >
> > > Am serialising this by only writing PageLSN while holding buf hdr
> > > lock.
> >
> > That means also taking the buffer header spinlock in every
> > PageGetLSN() caller holding only a shared buffer content lock.  Do you
> > think that will pay off, versus the settled pattern of trading here
> > your shared buffer content lock for an exclusive one?
>
> Yes, I think it will pay off.  This is the only code location where we
> do that, and we are already taking the buffer header lock, so there is
> effectively no overhead.

The sites I had in mind are the calls to PageGetLSN() in FlushBuffer() (via
BufferGetLSN()) and XLogCheckBuffer().  We don't take the buffer header
spinlock at either of those locations.
If they remain safe under the new rules, we'll need comments explaining
why.  I think they may indeed be safe, but it's far from obvious.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add protransform for numeric, varbit, and temporal types
On Wed, Feb 08, 2012 at 09:37:01AM -0500, Robert Haas wrote:
> On Tue, Feb 7, 2012 at 12:43 PM, Robert Haas robertmh...@gmail.com wrote:
> > I've committed the numeric and varbit patches and will look at the
> > temporal one next.
>
> Committed, after changing the OIDs so they don't conflict.
>
> Here's one more case for you to ponder:
>
> rhaas=# create table noah (i interval day);
> CREATE TABLE
> rhaas=# alter table noah alter column i set data type interval second(3);
> DEBUG:  rewriting table "noah"
> ALTER TABLE
>
> Do we really need a rewrite in that case?  The code acts like the
> interval range and precision are separate beasts, but is that really
> true?

The code has a thinko; a given interval typmod ultimately implies a single
point from which we truncate rightward.  The precision only matters if the
range covers SECOND.  Thanks; the attached patch improves this.

*** a/src/backend/utils/adt/timestamp.c
--- b/src/backend/utils/adt/timestamp.c
***************
*** 958,963 **** interval_transform(PG_FUNCTION_ARGS)
--- 958,964 ----
  		int			new_range = INTERVAL_RANGE(new_typmod);
  		int			new_precis = INTERVAL_PRECISION(new_typmod);
  		int			new_range_fls;
+ 		int			old_range_fls;
  
  		if (old_typmod == -1)
  		{
***************
*** 974,985 **** interval_transform(PG_FUNCTION_ARGS)
  		 * Temporally-smaller fields occupy higher positions in the range
  		 * bitmap.  Since only the temporally-smallest bit matters for length
  		 * coercion purposes, we compare the last-set bits in the ranges.
  		 */
  		new_range_fls = fls(new_range);
  		if (new_typmod == -1 ||
  			((new_range_fls >= SECOND ||
! 			  new_range_fls >= fls(old_range)) &&
! 			 (new_precis >= MAX_INTERVAL_PRECISION ||
  			  new_precis >= old_precis)))
  			ret = relabel_to_typmod(source, new_typmod);
  	}
--- 975,990 ----
  		 * Temporally-smaller fields occupy higher positions in the range
  		 * bitmap.  Since only the temporally-smallest bit matters for length
  		 * coercion purposes, we compare the last-set bits in the ranges.
+ 		 * Precision, which is to say, sub-second precision, only affects
+ 		 * ranges that include SECOND.
*/ new_range_fls = fls(new_range); + old_range_fls = fls(old_range); if (new_typmod == -1 || ((new_range_fls = SECOND || ! new_range_fls = old_range_fls) !(old_range_fls SECOND || ! new_precis = MAX_INTERVAL_PRECISION || new_precis = old_precis))) ret = relabel_to_typmod(source, new_typmod); } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
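[Editor's note] The decision the patched condition encodes — relabel instead of rewrite whenever the new typmod cannot lose data, with sub-second precision consulted only when the old range reaches SECOND — can be modeled in isolation. This sketch uses illustrative field indices and a hand-rolled fls(); the defines are assumptions for the example, not PostgreSQL's real datetime.h values.

```c
#include <assert.h>

/*
 * Toy model of the patched interval_transform() test.  Field indices
 * echo datetime.h's ordering (temporally smaller fields get higher
 * numbers) but the concrete values are assumptions for this sketch.
 */
#define DAY_F       3
#define SECOND_F    12
#define MAX_PRECIS  6
#define FULL_PRECIS 0xFFFF          /* "no precision specified" */

/* last-set (most significant) bit position, 1-based; 0 when v == 0 */
static int fls_(unsigned v)
{
    int n = 0;

    while (v)
    {
        n++;
        v >>= 1;
    }
    return n;
}

/*
 * May the typmod coercion be a mere relabel (no table rewrite)?
 * Mirrors the patched condition: precision only matters when the
 * old range covers SECOND.
 */
static int relabel_ok(unsigned old_range, int old_precis,
                      unsigned new_range, int new_precis)
{
    int second_fls = fls_(1u << SECOND_F);
    int new_fls = fls_(new_range);
    int old_fls = fls_(old_range);

    return (new_fls >= second_fls || new_fls >= old_fls) &&
           (old_fls < second_fls ||
            new_precis >= MAX_PRECIS ||
            new_precis >= old_precis);
}
```

With this model, the `interval day` to `interval second(3)` case from the report comes out as a relabel, while narrowing an existing sub-second precision still forces a rewrite.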
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote: Second, there's a concern about binary bloat: duplicating lots of code with different comparators inlined generates, well, a lot of machine code. Of course, an 0.8% increase in the size of the resulting binary is very unlikely to produce any measurable performance regression on its own, but that's not really the issue. The issue is rather that we could easily go nuts applying this technique in lots of different places - or even just in one place to lots of different types. Let's say that we find that even for types with very complex comparators, we can get a 5% speedup on quick-sorting speed using this technique. Should we, then, apply the technique to every type we have? Perhaps the binary would grow by, I don't know, 10%. Maybe that's still not enough to start hurting performance (or making packagers complain), but surely it's getting there, and what happens when we discover that similar speedups are possible using the same trick in five other scenarios? I think it's a bit like going out to dinner and spending $50. It's nicer than eating at home, and many people can afford to do it occasionally, but if you do it every night, it gets expensive (and possibly fattening). So we need some principled way of deciding how much inlining is reasonable, because I am 100% certain this is not going to be the last time someone discovers that a massive exercise in inlining can yield a nifty performance benefit in some case or another: index builds and COPY have already been mentioned on this thread, and I expect that people will find other cases as well. I'm not really sure what the budget is - i.e. how much binary bloat we can afford to add - or how many cases there are that can benefit, but the first isn't infinite and the second is more than the first. Having such a metric would resolve this discussion, but formulating it will be all but impossible. 
We don't know how much time PostgreSQL installations now spend sorting or how much additional time they would spend on cache misses, TLB misses, page faults, etc. That data won't show up on this thread. You posed[1] the straw man of a sin(numeric) optimization requiring a 40GB lookup table. I would not feel bad about rejecting that, because it can live quite comfortably as an external module. Indeed, I think the best thing we can do to constrain long-term bloat in the postgres executable is to improve pluggability. Let a wider range of features live outside the postmaster binary. For example, if we had the infrastructure to let hash, GIN, GiST and SP-GiST index access methods live in their own DSOs (like PL/pgSQL does today), I would support doing that. Extensibility is a hallmark of PostgreSQL. It's a bad day when we reject a minority-interest feature or optimization that has no other way to exist. Better pluggability can't ease the decision process for Peter's patch, because its specializations essentially must live in the postgres executable to live at all. Nominally, we could let external modules inject sort specializations for core types, and Peter could publish an all_fast_sorts extension. That would be useless punting: if we can't make a principled decision on whether these accelerated sorts justify 85 KiB of binary, how will a DBA who discovers the module make that decision? This patch has gotten more than its fair share of attention for bloat, and I think that's mostly because there's a dial-a-bloat-level knob sticking out and begging to be frobbed. On my system, fastpath_sort_2012_01_19.patch adds 85 KiB to the postgres binary. A recent commit added 57 KiB and bloat never came up in discussions thereof. I appreciate your concern over a slippery slope of inlining proposals, but I also don't wish to see the day when every patch needs to declare and justify its binary byte count. Unless, of course, we discover that elusive metric for evaluating it fairly. 
If we'd like to start taking interest in binary bloat, how about having the buildfarm log capture an ls -l on the bin directory? We'll then have data to mine if we ever wish to really take action. All that being said, I'd want to see a 15-30% (depending on how contrived) improvement to a microbenchmark or a 5% improvement to a generic benchmark (pgbench, DBT-N) before adopting an optimization of this complexity. Based on your measurements[2], per-type inlining gives us a 7% microbenchmark improvement. I'd therefore excise the per-type inlining. (For the record, given a 20% improvement to the same benchmark, I'd vote yea on the submission and perhaps even suggest int2 and uuid support.) nm [1] http://archives.postgresql.org/message-id/ca+tgmozo1xsz+yiqz2mrokmcmqtb+jir0lz43cne6de7--q...@mail.gmail.com [2] http://archives.postgresql.org/message-id/ca+tgmobfz8sqtgtaaryerywuzfodgetprbuygbhsplr4+li...@mail.gmail.com
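[Editor's note] The specialization technique this thread debates can be sketched with a toy: one sort loop stamped out per element type, comparator inlined at compile time. This is an illustration of where the binary-size cost comes from, not Peter's actual patch (which specializes quicksort inside tuplesort); each macro expansion duplicates the entire loop in machine code.

```c
#include <assert.h>
#include <stddef.h>

/* Comparator inlined into each specialization at compile time. */
#define NUM_CMP(x, y) ((x) < (y) ? -1 : (x) > (y))

/*
 * Stamp out a type-specialized insertion sort.  Every expansion is a
 * fresh copy of the whole loop; that duplication is the "bloat".
 */
#define MAKE_SORT(name, type)                           \
    static void name(type *a, size_t n)                 \
    {                                                   \
        for (size_t i = 1; i < n; i++)                  \
        {                                               \
            type v = a[i];                              \
            size_t j = i;                               \
                                                        \
            while (j > 0 && NUM_CMP(a[j - 1], v) > 0)   \
            {                                           \
                a[j] = a[j - 1];                        \
                j--;                                    \
            }                                           \
            a[j] = v;                                   \
        }                                               \
    }

MAKE_SORT(sort_int32, int)      /* one copy of the machine code ... */
MAKE_SORT(sort_float8, double)  /* ... and a second, near-identical copy */
```

The speedup comes from eliminating the indirect comparator call; the cost is one loop body per supported type, which is exactly the dial-a-bloat trade-off under discussion.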
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Thu, Feb 09, 2012 at 11:11:16PM +0200, Marti Raudsepp wrote: I've always been a little wary of using the TRUNCATE command due to the warning in the documentation about it not being MVCC-safe: queries may silently give wrong results and it's hard to tell when they are affected. That got me thinking, why can't we handle this like a standby server does -- if some query has data removed from underneath it, it aborts with a serialization failure. Does this solution sound like a good idea? The attached patch is a lame attempt at implementing this. I added a new pg_class.relvalidxmin attribute which tracks the Xid of the last TRUNCATE (maybe it should be called reltruncatexid?). Whenever starting a relation scan with a snapshot older than relvalidxmin, an error is thrown. This seems to work out well since TRUNCATE updates pg_class anyway, and anyone who sees the new relfilenode automatically knows when it was truncated. I like the design you have chosen. It would find applications beyond TRUNCATE, so your use of non-specific naming is sound. For example, older snapshots will see an empty table t after CREATE TABLE t AS SELECT 1 commits; that's a comparable MVCC anomaly. Some of our non-MVCC-safe commands should perhaps just become MVCC-safe, but there will always be use cases for operations that shortcut MVCC. When one truly does want that, your proposal for keeping behavior consistent makes plenty of sense. Should I also add another counter to pg_stat_database_conflicts? Currently this table is only used on standby servers. ERROR: canceling statement due to conflict with TRUNCATE TABLE on foo DETAIL: Rows visible to this transaction have been removed. My initial reaction is not to portray this like a recovery conflict, since several aspects distinguish it from all recovery conflict types. (I have not read your actual patch.) 
Thanks, nm
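[Editor's note] The relvalidxmin mechanism described above can be reduced to a small model: TRUNCATE stamps the relation with its xid, and any later scan whose snapshot predates that xid fails instead of silently seeing an emptied table. Plain ordered integers stand in for TransactionIds (no wraparound), and the exact comparison is an assumption for the sketch; only the names come from the patch description.

```c
#include <assert.h>

typedef unsigned long Xid;      /* stand-in for TransactionId */

typedef struct Rel
{
    Xid relvalidxmin;           /* xid of the last TRUNCATE; 0 = never */
} Rel;

/* TRUNCATE records the truncating transaction's xid (via pg_class). */
static void rel_truncate(Rel *rel, Xid truncating_xid)
{
    rel->relvalidxmin = truncating_xid;
}

/*
 * Starting a scan: a snapshot older than the last TRUNCATE must not
 * silently see the emptied table, so it gets a serialization failure.
 * Returns 1 if the scan may proceed, 0 for the error case.
 */
static int scan_allowed(const Rel *rel, Xid snapshot_xmin)
{
    return snapshot_xmin > rel->relvalidxmin;
}
```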
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote: On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch n...@leadboat.com wrote: I like the design you have chosen. It would find applications beyond TRUNCATE, so your use of non-specific naming is sound. For example, older snapshots will see an empty table t after CREATE TABLE t AS SELECT 1 commits; that's a comparable MVCC anomaly. Some of our non-MVCC-safe commands should perhaps just become MVCC-safe, but there will always be use cases for operations that shortcut MVCC. When one truly does want that, your proposal for keeping behavior consistent makes plenty of sense. I guess I'm not particularly excited by the idea of trying to make TRUNCATE MVCC-safe. I notice that the example involves the REPEATABLE READ isolation level, which is already known to be busted in a variety of ways; that's why we now have SERIALIZABLE, and why most people use READ COMMITTED. Are there examples of this behavior at other isolation levels? I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and not at READ COMMITTED. They tend to be narrow race conditions at READ COMMITTED, yet easy to demonstrate at REPEATABLE READ. Related: http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php Incidentally, people use READ COMMITTED because they don't question the default, not because they know hazards of REPEATABLE READ. I don't know the bustedness you speak of; could we improve the documentation to inform folks? But I have to admit I'm intrigued by the idea of extending this to other cases, if there's a reasonable way to do that. For example, if we could fix things up so that we don't see a table at all if it was created after we took our snapshot, that would remove one of the obstacles to marking pages bulk-loaded into that table with FrozenXID and PD_ALL_VISIBLE from the get-go. I think a lot of people would be mighty happy about that. Exactly. But the necessary semantics seem somewhat different. 
For TRUNCATE, you want to throw a serialization error; but is that also what you want for CREATE TABLE? Or do you just want it to appear empty? I think an error helps just as much there. If you create a table and populate it with data in the same transaction, letting some snapshot see an empty table is an atomicity failure. Your comment illustrates a helpful point: this is just another kind of ordinary serialization failure, one that can happen at any isolation level. No serial transaction sequence can yield one of these errors. Thanks, nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Mon, Feb 20, 2012 at 08:57:08AM -0500, Robert Haas wrote: On Mon, Feb 20, 2012 at 4:18 AM, Simon Riggs si...@2ndquadrant.com wrote: What straightforward implementation is that?? This *is* the straightforward one. God knows what else we'd break if we drop the lock, reacquire as an exclusive, then drop it again and reacquire in shared mode. Code tends to assume that when you take a lock you hold it until you release; doing otherwise would allow all manner of race conditions to emerge. And do you really think that would be faster? I don't know, but neither do you, because you apparently haven't tried it. Games where we drop the shared lock and get an exclusive lock are used in numerous places in the source code; see, for example, heap_update(), so I don't believe that there is any reason to reject that a priori. Indeed, I can think of at least one pretty good reason to do it exactly that way: it's the way we've handled all previous problems of this type, and in general it's better to make new code look like existing code than to invent something new, especially when you haven't made any effort to quantify the problem or the extent to which the proposed solution mitigates it. We do, in numerous places, drop a shared buffer content lock and reacquire it in exclusive mode. However, the existing users of that pattern tend to trade the lock, complete subsequent work, and unlock the buffer all within the same function. So they must, because several of them recheck facts that can change during the period of holding no lock. SetBufferCommitInfoNeedsSave() callers do not adhere to that pattern; they can be quite distant from the original lock acquisition and eventual lock release. Take the prototypical case of SetHintBits(). Our shared lock originated in a place like heapgettup(), which expects to hold it continually until finished. 
We would need to somehow push up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer lock, then have heapgettup() reorient itself as needed. Every caller of code that can reach SetBufferCommitInfoNeedsSave() would need to do the same. Perhaps there's a different way to apply the lock-trade technique that avoids this mess, but I'm not seeing it. Consequently, I find the idea of requiring a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE to be a sensible one. Within that umbrella, some details need attention:

- Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c. I note gistScanPage() and XLogCheckBuffer()[1]. (Perhaps we'll only require the spinlock for heap buffers, letting gistScanPage() off the hook.) We need a public API, perhaps LockBufferForLSN().

- The use of some spinlock need not imply using the buffer header spinlock. We could add a dedicated pd_lsn_tli_lock to BufferDesc. That has the usual trade-off of splitting a lock: less contention at the cost of more acquisitions. I have no intuition on which approach would perform better.

I agree that src/backend/storage/buffer/README is the place to document the new locking rules. I do share your general unease about adding new twists to the buffer access rules. Some of our hairiest code is also the code manipulating buffer locks most extensively, and I would not wish to see that code get even more difficult to understand. However, I'm not seeing a credible alternative that retains the same high-level behaviors. A possible compromise is to leave the page clean after setting a hint bit, much like the patch already has us do under hot standby. Then there's no new WAL and no new rules around pd_lsn. Wasn't that one of the things Merlin benchmarked when he was looking at hint bits? Does anyone recall the result? Thanks, nm

[1] Patch v10 adds a comment to XLogCheckBuffer() saying otherwise. 
The comment is true today, but the same patch makes it false by having XLogSaveBufferForHint() call XLogInsert() under a share lock. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
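[Editor's note] The dedicated pd_lsn_tli_lock option mentioned above can be sketched as follows. C11's atomic_flag stands in for PostgreSQL's s_lock, and the accessor names (in the spirit of a LockBufferForLSN() API) are hypothetical; the point is only that every pd_lsn access under a share content lock funnels through one tiny critical section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* stand-in for the real type */

/*
 * Model of a buffer header with a per-buffer spinlock guarding only
 * pd_lsn/pd_tli, separate from the buffer header spinlock.
 */
typedef struct BufDesc
{
    atomic_flag pd_lsn_tli_lock;
    XLogRecPtr  pd_lsn;
    uint32_t    pd_tli;
} BufDesc;

static void lsn_lock(BufDesc *buf)
{
    while (atomic_flag_test_and_set_explicit(&buf->pd_lsn_tli_lock,
                                             memory_order_acquire))
        ;                           /* spin */
}

static void lsn_unlock(BufDesc *buf)
{
    atomic_flag_clear_explicit(&buf->pd_lsn_tli_lock,
                               memory_order_release);
}

/* Readers holding only a share content lock would go through these. */
static XLogRecPtr buffer_get_lsn(BufDesc *buf)
{
    XLogRecPtr lsn;

    lsn_lock(buf);
    lsn = buf->pd_lsn;
    lsn_unlock(buf);
    return lsn;
}

static void buffer_set_lsn(BufDesc *buf, XLogRecPtr lsn)
{
    lsn_lock(buf);
    buf->pd_lsn = lsn;
    lsn_unlock(buf);
}
```

Splitting the lock this way trades fewer collisions with header-spinlock users for an extra acquisition on every LSN touch, which is exactly the contention-versus-acquisition-count trade-off noted above.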
Re: [HACKERS] Measuring relation free space
On Tue, Feb 14, 2012 at 02:04:26AM -0500, Jaime Casanova wrote: On Wed, Jan 25, 2012 at 9:47 PM, Noah Misch n...@leadboat.com wrote: With all that done, run some quick benchmarks: see how SELECT free_percent FROM pgstattuple(rel) fares compared to SELECT relation_free_space(rel) for a large heap and for a large B-tree index. If the timing difference is too small to be interesting to you, remove relation_free_space() and submit your pgstattuple() improvements alone. Otherwise, submit as written. Ok. I split this in three patches. 1) pgstattuple-gin_spgist.patch This first patch adds gin and spgist support to pgstattuple, also makes pgstattuple use a ring buffer when reading tables or indexes. The buffer access strategy usage bits look fine to commit. The gin and spgist support has problems, detailed below. 2) pgstattuple-relation_free_space.patch This patch adds the relation_free_space function to pgstattuple. The function relation_free_space() is faster than pgstattuple(); to test that I initialized pgbench with a scale of 40. In that context pgstattuple() took 1.4s to process the pgbench_accounts table and relation_free_space() took 730ms (half the time!). In the index the difference is less pronounced: 170ms for the former and 150ms for the latter. Benchmarks lasting on the order of one second are far too short. I tried the first two patches on this 6914 MiB table and 4284 MiB index:

create table t(n) as select * from generate_series(1,2);
create index on t(n);

This machine has about 1 GiB of memory available for disk cache, and I used shared_buffers = 128MB. I used a regular assert-enabled build with debug_assertions = off. Timings:

pgstattuple.free_percent, heap:  runtime 166.2s; answer 0.34
pgstattuple.free_percent, index: runtime 110.9s; answer 9.83
relation_free_space, heap:       runtime 165.1s; answer 0.00838721
relation_free_space, index:      runtime 108.7s; answer 2.23692

Note the disagreement in answers and the nonsensical answer from the last test. 
The numbers do line up for smaller tables and indexes that I tried. During the pgstattuple() runs on the heap, CPU usage divided evenly between user and iowait time. With relation_free_space(), iowait kept above 80%. For the index, pgstattuple() managed 60-80% iowait and relation_free_space() again kept above 80%. Even so, that did not produce any significant change in runtime. I'm guessing that readahead was highly effective here, so the I/O bound dictated elapsed time. Bottom line, this patch can probably achieve 50% speedups on already-in-memory relations. It can reduce the contribution to CPU load, but not the elapsed runtime, for relations we largely pull from disk. Do those benefits justify the additional user-visible interface? I suppose the sort of installation that would benefit most is one just short of the tipping point of the database size exceeding memory size. Larger databases could not afford either function, and smaller databases don't need to watch bloat so closely. Offhand, I think that the I/O savings of sampling will be the real win, and it's not worth an extra user-visible function to get the CPU usage savings this patch offers. Other opinions welcome. 3) pgstattuple-stats_target.patch This patch adds a stats_target parameter to the relation_free_space() function, it mimics the way analyze choose the blocks to read and is faster than plain relation_free_space() but of course could be inexact if the pages that we don't read are the ones with more free space This part is a fresh submission. It is simple enough that I have reviewed it. It gives the expected speedup. 
However, the results are wrong:

3 runs of pgstattuple('t', 5):  0.000171412, 0.000171876, 0.000169326
3 runs of pgstattuple('t', 10): 0.000336525, 0.000344404, 0.000341314

I can get an apparent infinite loop:

create table t0(n) as select * from generate_series(1,400);
create index on t0(n);

[local] test=# select relation_free_space('t0_n_idx', 100);
 relation_free_space
---------------------
           0.0982675
Time: 133.788 ms
[local] test=# select relation_free_space('t0_n_idx', 5);
Cancel request sent
ERROR: canceling statement due to user request
Time: 24655.481 ms

As a way forward, I suggest abandoning relation_free_space() entirely and adding a sampling capability to pgstattuple(). There are clear gains to be had from that method. The gains of splitting out the free percent calculation from the other pgstattuple() calculations seem meager. If you would like, submit the buffer strategy bits as a separate patch with its own CF entry, noting that it arose from here. That part can be Ready for Committer. I'm marking the original CF entry Returned with Feedback.

Patch 1:

*** a/contrib/pgstattuple/pgstatindex.c
--- b/contrib/pgstattuple/pgstatindex.c
+ /*
+  * Buffer access strategy for reading relations, it's simpler to keep
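[Editor's note] Block selection for a sampling pgstattuple() could work like this sketch: choose exactly `want` of `nblocks` block numbers, each with equal probability, in a single ordered pass (Knuth's Algorithm S, the same family ANALYZE's sampling builds on). This is illustrative code, not the submitted patch's.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Fill chosen[] with exactly `want` distinct, ascending block numbers
 * drawn uniformly from 0..nblocks-1.  Returns the count selected.
 */
static int sample_blocks(int nblocks, int want, int *chosen)
{
    int picked = 0;

    for (int blk = 0; blk < nblocks && picked < want; blk++)
    {
        int remaining = nblocks - blk;
        int needed = want - picked;

        /* keep this block with probability needed / remaining */
        if (rand() % remaining < needed)
            chosen[picked++] = blk;
    }
    return picked;
}
```

Because the acceptance probability rises to 1 as unread blocks run out, the pass always returns exactly `want` blocks, and each block is equally likely to be chosen — avoiding the bias concern of reading, say, only the first N pages.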
Re: [HACKERS] 16-bit page checksums for 9.2
On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote: On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs si...@2ndquadrant.com wrote: So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it doesn't have a checksum, PD_CHECKSUM2 is not set? What good does that do? As explained in detailed comments, the purpose of this is to implement Heikki's suggestion that we have a bit set to zero so we can detect failures that cause a run of 1s. I think it's nonsensical to pretend that there's anything special about that particular bit. If we want to validate the page header before trusting the lack of a checksum bit, we can do that far more thoroughly than just checking that one bit. There are a whole bunch of bits that ought to always be zero, and there are other things we can validate as well (e.g. LSN not in future). If we're concerned about the checksum-enabled bit getting flipped (and I agree that we should be), we can check some combination of that stuff in the hope of catching it, and that'll be a far better guard than just checking one arbitrarily selected bit. PageHeaderIsValid() (being renamed to PageIsVerified()) already checks (page->pd_flags & ~PD_VALID_FLAG_BITS) == 0. Explicitly naming another bit and keeping it unset is redundant with that existing check. It would cease to be redundant if we ever allocate all the flag bits, but then we also wouldn't have a bit to spare as PD_CHECKSUM2. I agree with you on this point. That having been said, I don't feel very good about the idea of relying on the contents of the page to tell us whether or not the page has a checksum. There's no guarantee that an error that flips the has-checksum bit will flip any other bit on the page, or that it won't flip everything else we're relying on as a backstop in exactly the manner that foils whatever algorithm we put in place. Random corruption is, perhaps, unlikely to do that, but somehow I feel like it'll happen more often than random chance suggests. 
Things never fail the way you want them to. Another disadvantage of the current scheme is that there's no particularly easy way to know that your whole cluster has checksums. No matter how we implement checksums, you'll have to rewrite every table in the cluster in order to get them fully turned on. But with the current design, there's no easy way to know how much of the cluster is actually checksummed. If you shut checksums off, they'll linger until those pages are rewritten, and there's no easy way to find the relations from which they need to be removed, either. I'm not seeing value in rewriting pages to remove checksums, as opposed to just ignoring those checksums going forward. Did you have a particular scenario in mind? I'm tempted to suggest a relation-level switch: when you want checksums, you use ALTER TABLE to turn them on, and when you don't want them any more you use ALTER TABLE to shut them off again, in each case rewriting the table. That way, there's never any ambiguity about what's in the data pages in a given relation: they're either all checksummed, or none of them are. This moves the decision about whether checksums are enabled or disabled quite a bit further away from the data itself, and also allows you to determine (by catalog inspection) which parts of the cluster do in fact have checksums. It might be kind of a pain to implement, though: you'd have to pass the information about how any given relation was configured down to the place where we validate page sanity. I'm not sure whether that's practical. This patch implies future opportunities to flesh out its UI, and I don't see it locking us out of implementing the above. We'll want a weak-lock command that adds checksums to pages lacking them. We'll want to expose whether a given relation has full checksum coverage. With that, we could produce an error when an apparently-non-checksummed page appears in a relation previously known to have full checksum coverage. 
Even supposing an ALTER TABLE t SET {WITH | WITHOUT} CHECKSUMS as the only tool for enabling or disabling them, it's helpful to have each page declare whether it has a checksum. That way, the rewrite can take as little as an AccessShareLock, and any failure need not lose work already done. If you rely on anticipating the presence of a checksum based on a pg_class column, that ALTER needs to perform an atomic rewrite. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
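[Editor's note] The existing validity test cited above — every undefined pd_flags bit must be zero — can be sketched to show why a dedicated always-zero bit adds nothing until the flag space fills up. The bit values here are illustrative assumptions, not bufpage.h's real layout; PD_CHECKSUM models the proposed has-checksum bit.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag layout (assumed values, not the real defines). */
#define PD_HAS_FREE_LINES   0x0001
#define PD_PAGE_FULL        0x0002
#define PD_ALL_VISIBLE      0x0004
#define PD_CHECKSUM         0x0008      /* hypothetical has-checksum bit */
#define PD_VALID_FLAG_BITS  0x000F      /* OR of all defined bits */

/* PageHeaderIsValid()-style test: any undefined set bit is corruption. */
static int flags_valid(uint16_t pd_flags)
{
    return (pd_flags & ~PD_VALID_FLAG_BITS) == 0;
}
```

A run of 1s — the failure mode the extra zero bit was meant to catch — already sets some undefined bit and fails this check, which is the redundancy argument in the message above.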
Re: [HACKERS] foreign key locks, 2nd attempt
On Mon, Feb 13, 2012 at 07:16:58PM -0300, Alvaro Herrera wrote: Okay, so this patch fixes the truncation and wraparound issues through a mechanism much like pg_clog's: it keeps track of the oldest possibly existing multis on each and every table, and then during tuple freezing those are removed. I also took the liberty to make the code remove multis altogether (i.e. resetting the IS_MULTI hint bit) when only the update remains and lockers are all gone. I also cleaned up the code in heapam so that there's a couple of tables mapping MultiXactStatus to LockTupleMode and back, and to heavyweight lock modes (the older patches used functions to do this, which was pretty ugly). I had to add a little helper function to lock.c to make this work. I made a rather large bunch of other minor changes to close minor bugs here and there. Docs have been added, as have new tests for the isolation harness, which I've ensured pass in both read committed and serializable modes; WAL logging for locking updated versions of a tuple, when an old one is locked due to an old snapshot, was also added; there's plenty of room for growth in the MultiXact flag bits; the bit that made tables with no keys lock the entire row all the time was removed; multiple places in code comments were cleaned up that referred to this feature as FOR KEY LOCK and ensured that it also mentions FOR KEY UPDATE; the pg_rowlocks, pageinspect, pg_controldata, pg_resetxlog utilities have been updated. All of the above sounds great. I especially like the growing test coverage. All in all, I think this is in pretty much final shape. Only pg_upgrade bits are still missing. If sharp eyes could give this a critical look and knuckle-cracking testers could give it a spin, that would be helpful. Lack of pg_upgrade support leaves this version incomplete, because that omission would constitute a blocker for beta 2. 
This version changes as much code compared to the version I reviewed at the beginning of the CommitFest as that version changed overall. In that light, it's time to close the books on this patch for the purpose of this CommitFest; I'm marking it Returned with Feedback. Thanks for your efforts thus far. On Mon, Jan 30, 2012 at 06:48:47PM -0500, Noah Misch wrote: On Tue, Jan 24, 2012 at 03:47:16PM -0300, Alvaro Herrera wrote: * Columns that are part of the key Noah thinks the set of columns should only consider those actually referenced by keys, not those that *could* be referenced. Well, do you disagree? To me it's low-hanging fruit, because it isolates the UPDATE-time overhead of this patch to FK-referenced tables rather than all tables having a PK or PK-like index (often just all tables). You have not answered my question above. nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] foreign key locks, 2nd attempt
On Thu, Feb 23, 2012 at 02:08:28PM +0100, Jeroen Vermeulen wrote: On 2012-02-23 10:18, Simon Riggs wrote: However, review of such a large patch should not be simply pass or fail. We should be looking back at the original problem and ask ourselves whether some subset of the patch could solve a useful subset of the problem. For me, that seems quite likely and this is very definitely an important patch. Even if we can't solve some part of the problem we can at least commit some useful parts of infrastructure to allow later work to happen more smoothly and quickly. So please let's not focus on the 100%, lets focus on 80/20. The suggested immutable-column constraint was meant as a potential 80/20 workaround. Definitely not a full solution, helpful to some, probably easier to do. I don't know if an immutable key would actually be enough to elide foreign-key locks though. That alone would not simplify the patch much. INSERT/UPDATE/DELETE on the foreign side would still need to take some kind of tuple lock on the primary side to prevent primary-side DELETE. You then still face the complicated case of a tuple that's both locked and updated (non-key/immutable columns only). Updates that change keys are relatively straightforward, following what we already do today. It's the non-key updates that complicate things. If you had both an immutable column constraint and a never-deleted table constraint, that combination would be sufficient to simplify the picture. (Directly or indirectly, it would not actually be a never-deleted constraint so much as a you must take AccessExclusiveLock to DELETE constraint.) Foreign-side DML would then take an AccessShareLock on the parent table with no tuple lock at all. By then, though, that change would share little or no code with the current patch. It may have its own value, but it's not a means for carving a subset from the current patch. 
Thanks, nm
Re: [HACKERS] foreign key locks, 2nd attempt
On Thu, Feb 23, 2012 at 06:36:42PM -0300, Alvaro Herrera wrote: Excerpts from Noah Misch's message of mié feb 22 14:00:07 -0300 2012: On Mon, Feb 13, 2012 at 07:16:58PM -0300, Alvaro Herrera wrote: On Mon, Jan 30, 2012 at 06:48:47PM -0500, Noah Misch wrote: On Tue, Jan 24, 2012 at 03:47:16PM -0300, Alvaro Herrera wrote: * Columns that are part of the key Noah thinks the set of columns should only consider those actually referenced by keys, not those that *could* be referenced. Well, do you disagree? To me it's low-hanging fruit, because it isolates the UPDATE-time overhead of this patch to FK-referenced tables rather than all tables having a PK or PK-like index (often just all tables). You have not answered my question above. Sorry. The reason I didn't research this is that at the very start of the discussion it was said that having heapam.c figure out whether columns are being used as FK destinations or not would be more of a modularity violation than indexed columns already are for HOT support (this was a contentious issue for HOT, so I don't take it lightly. I don't think I need any more reasons for Tom to object to this patch, or more bulk into it. Both are already serious issues.) That's fair. In any case, with the way we've defined FOR KEY SHARE locks (in the docs it's explicitly said that the set of columns considered could vary in the future), it's a relatively easy patch to add on top of what I've submitted. Just as the ALTER TABLE bits to add columns to the set of columns considered, it could be left for a second pass on the issue. Agreed. Let's have that debate another day, as a follow-on patch. Thanks for shedding this light. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] foreign key locks, 2nd attempt
On Thu, Feb 23, 2012 at 12:43:11PM -0300, Alvaro Herrera wrote: Excerpts from Simon Riggs's message of jue feb 23 06:18:57 -0300 2012: On Wed, Feb 22, 2012 at 5:00 PM, Noah Misch n...@leadboat.com wrote: All in all, I think this is in pretty much final shape. Only pg_upgrade bits are still missing. If sharp eyes could give this a critical look and knuckle-cracking testers could give it a spin, that would be helpful. Lack of pg_upgrade support leaves this version incomplete, because that omission would constitute a blocker for beta 2. This version changes as much code compared to the version I reviewed at the beginning of the CommitFest as that version changed overall. In that light, it's time to close the books on this patch for the purpose of this CommitFest; I'm marking it Returned with Feedback. Thanks for your efforts thus far. Now this is an interesting turn of events. I must thank you for your extensive review effort in the current version of the patch, and also thank you and credit you for the idea that initially kicked this patch from the older, smaller, simpler version I wrote during the 9.1 timeline (which you also reviewed exhaustively). Without your and Simon's brilliant ideas, this patch wouldn't exist at all. I completely understand that you don't want to review this latest version of the patch; it's a lot of effort and I wouldn't inflict it on anybody who hasn't volunteered. However, it doesn't seem to me that this is reason to boot the patch from the commitfest. I think the thing to do would be to remove yourself from the reviewers column and set it back to needs review, so that other reviewers can pick it up. It would indeed be wrong to change any patch from Needs Review to Returned with Feedback on account of a personal distaste for reviewing the patch. I hope I did not harbor such a motive here. 
Rather, this CommitFest has given your patch its fair shake, and I and other reviewers would better serve the CF's needs by reviewing, say, ECPG FETCH readahead instead of your latest submission. Likewise, you would better serve the CF by evaluating one of the four non-committer patches that have been Ready for Committer since January. That's not to imply that the goals of the CF align with my goals, your goals, or broader PGDG goals. The patch status on commitfest.postgresql.org exists solely for the advancement of the CF, and I have set it accordingly. As for the late code churn, it mostly happened as a result of your own feedback; I would have left most of it in the original state, but as I went ahead it seemed much better to refactor things. This is mostly in heapam.c. As for multixact.c, it also had a lot of churn, but that was mostly to restore it to the state it has in the master branch, dropping much of the code I had written to handle multixact truncation. The new code there and in the vacuum code path (relminmxid and so on) is a lot smaller than that other code was, and it's closely based on relfrozenxid, which is a known piece of technology. I appreciate that. However, review of such a large patch should not be simply pass or fail. We should be looking back at the original problem and asking ourselves whether some subset of the patch could solve a useful subset of the problem. For me, that seems quite likely, and this is very definitely an important patch. Incidentally, I find it harmful to think of Returned with Feedback as fail. For large patches, it's healthier to think of a CF as a bimonthly project status meeting with stakeholders. When the project is done, wonderful! When there's work left, that's no great surprise. Even if we can't solve some part of the problem, we can at least commit some useful parts of infrastructure to allow later work to happen more smoothly and quickly. So please let's not focus on the 100%; let's focus on the 80/20. 
Well, we have the patch I originally posted in the 9.1 timeframe. That's a lot smaller and simpler. However, that solves only part of the blocking problem, and in particular it doesn't fix the initial deadlock reports from Joel Jacobson at Glue Finance (now renamed Trustly, in case you wonder about his change of email address) that started this effort in the first place. I don't think we can cut down to that and still satisfy the users that requested this; and Glue was just the first one, because after I started blogging about this, some more people started asking for it. I don't think there's any useful middle point between that one and the current one, but maybe I'm wrong. Nothing additional comes to my mind, either. This patch is monolithic. Thanks, nm
Re: [HACKERS] [PATCH] Support for foreign keys with arrays
Hi Marco, This version fixes everything I noted in my last review. Apart from corner cases I note, the patch is technically sound. I pose a few policy-type questions; comments from a broader audience would help those. On Mon, Feb 06, 2012 at 07:04:42PM +0100, Marco Nenciarini wrote: Please find attached version 3 of our patch. We thoroughly followed your suggestions and were able to implement EACH foreign key constraints with multi-column syntax as well. [actual review based on v3b] * support for EACH foreign keys in multi-column foreign key table constraints - e.g. FOREIGN KEY (c1, EACH c2) REFERENCES t1 (u1, u2) You support, and have test cases for, constraints with multiple EACH columns. The documentation and comments do not explain their semantics. On reviewing the behavior and code, you have implemented it in terms of a Cartesian product. That's logical, but when is such a constraint actually useful? If someone can think of an application, great; let's document his example. Otherwise, let's reject such constraints at DDL time. As previously said, we preferred to keep this patch simple for 9.2 and forbid EACH CASCADE and EACH SET NULL on multi-column foreign keys. After all, majority of use cases is represented by EACH foreign key column constraints (single-column foreign key arrays), and more complicated use cases can be discussed for 9.3 - should this patch make it. :) Good call. We can use multi-dimensional arrays as well as referencing columns. In that case though, ON DELETE EACH CASCADE will behave like ON DELETE EACH SET NULL. This is a safe way of implementing the action. We have some ideas on how to implement this, but we feel it is better to limit the behaviour for now. This still feels awfully unprincipled to me. How about just throwing an error when we need to remove an element from a multidimensional array? Essentially, have ON DELETE EACH CASCADE downgrade to ON DELETE RESTRICT when it encounters a multidimensional array. 
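The suggested downgrade can be modeled in a few lines. This is a purely illustrative sketch (plain Python, with nested lists standing in for multidimensional arrays; the function name is invented for this email), not the patch's actual code:

```python
# Illustrative model: ON DELETE EACH CASCADE removes the referenced key's
# elements from a one-dimensional referencing array, but behaves like
# RESTRICT (raises an error) when it encounters a multidimensional array.
def each_cascade_remove(array, key):
    if any(isinstance(elem, list) for elem in array):
        raise ValueError("cannot cascade into a multidimensional array")
    return [elem for elem in array if elem != key]

assert each_cascade_remove([1, 2, 3], 2) == [1, 3]
```

Erroring here keeps the one-dimensional behavior intact while leaving the multidimensional case undefined for now.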
That's strictly less code than what you have now, and it keeps our options open. We can always change from error to set-null later, but we can't so readily change from set-null to anything else. As I pondered this patch again, I came upon formal hazards around non-default operator classes. Today, ATAddForeignKeyConstraint() always chooses pfeqop, ffeqop and ppeqop in the same operator family. If it neglected this, we would get wrong behavior when one of the operators is sensitive to a change to which another is insensitive. For EACH FK constraints, this patch sets ffeqop = ARRAY_EQ_OP unconditionally. array_eq() uses the TYPECACHE_EQ_OPR (usually from the default B-tree operator class). That operator may not exist at all, let alone share an operator family with pfeqop. Shall we deal with this by retrieving TYPECACHE_EQ_OPR in ATAddForeignKeyConstraint() and rejecting the constraint creation if it does not share an operator family with pfeqop? The same hazard affects ri_triggers.c use of array_remove() and array_replace(), and the same change mitigates it. Let me say again how much I like the test coverage from this patch. It's great to think of something worth verifying and find existing coverage for it in the submitted test cases.

*** a/doc/src/sgml/catalogs.sgml
--- b/doc/src/sgml/catalogs.sgml
***************
*** 2102,2107 ****
--- 2106,2118 ----
       <entry></entry>
       <entry>If a check constraint, a human-readable representation of the expression</entry>
      </row>
+
+     <row>
+      <entry><structfield>confisarray</structfield></entry>

Name is now confiseach. pg_constraint.confeach needs documentation.

+      <entry><type>bool</type></entry>
+      <entry></entry>
+      <entry>If a foreign key, true if an EACH REFERENCE foreign key</entry>
+     </row>
    </tbody>
   </tgroup>
  </table>

*** a/doc/src/sgml/ddl.sgml
--- b/doc/src/sgml/ddl.sgml
+    <para>
+     Another option you have with foreign keys is to use a
+     referencing column which is an array of elements with
+     the same type (or a compatible one) as the referenced
+     column in the related table. This feature, commonly known
+     as <firstterm>foreign key arrays</firstterm>, is implemented

Is it indeed commonly known by that name? My web searching did not turn up any independent use of the term. In any event, this should be the only place where we mention multiple names for the feature. Pick one preferred term and use it for all other mentions. The documentation, comments and messages currently have a mix of "each foreign key", "EACH foreign key" and "foreign key array".

+    <para>
+     Even though the most common use case for foreign key arrays
+     is on a single column key, you can define an <quote>each foreign
+     key constraint</quote> on a group of columns. As the following
+     contrived example shows, it needs to be written in table constraint form:
+ <programlisting>
+ CREATE TABLE t1 (
+     a integer PRIMARY KEY,
Re: [HACKERS] leakproof
On Wed, Feb 22, 2012 at 06:30:37PM -0500, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 02/22/2012 04:29 PM, Marc Munro wrote: As the developer of veil I feel marginally qualified to bikeshed here: how about silent? A silent function being one that will not blab. I also made this suggestion later in the day. SILENT isn't a bad idea. It seems to lead the mind in the right direction, or at least not encourage people to guess the wrong meaning. +1 for SILENT. I also liked Kevin's suggestion of DISCREET; it comes closer than SILENT to denoting the property in question, but it also carries a social undertone that feels out of place in a software function declaration.
Re: [HACKERS] psql COPY vs. ON_ERROR_ROLLBACK, multi-command strings
On Wed, Jan 25, 2012 at 06:25:46PM -0300, Alvaro Herrera wrote: Excerpts from Noah Misch's message of sáb ene 14 12:40:02 -0300 2012: It has bothered me that psql's \copy ignores the ON_ERROR_ROLLBACK setting. Only SendQuery() takes note of ON_ERROR_ROLLBACK, and \copy, like all backslash commands, does not route through SendQuery(). Looking into this turned up several other weaknesses in psql's handling of COPY. Interesting. Committed, thanks. Thanks. While testing a crashing function, I noticed that my above patch added some noise to psql output when the server crashes:

[local] test=# select crashme();
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
unexpected transaction status (4)
Time: 6.681 ms
! \q

Status 4 is PQTRANS_UNKNOWN, which is expected when the connection is not CONNECTION_OK. The double message arrives because ProcessResult() now calls CheckConnection() at least twice, for the benefit of COPY. (Incidentally, the reconnect fails because the server has not yet finished recovering; that part is nothing new.) The attached small patch has SendQuery() keep quiet about PQTRANS_UNKNOWN when the connection is down. It makes ProcessResult() skip the second CheckConnection() when the command string had no COPY results. This restores the pre-08146775acd8bfe0fcc509c71857abb928697171 psql output:

[local] test=# select crashme();
The connection to the server was lost. Attempting reset: Failed.
Time: 4.798 ms
! \q

Thanks, nm

diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 5c9bd96..715e231 100644
*** a/src/bin/psql/common.c
--- b/src/bin/psql/common.c
***************
*** 740,746 ****
  	} while (next_result);

  	/* may need this to recover from conn loss during COPY */
! 	if (!CheckConnection())
  		return false;

  	return success;
--- 740,746 ----
  	} while (next_result);

  	/* may need this to recover from conn loss during COPY */
! 	if (!first_cycle && !CheckConnection())
  		return false;

  	return success;
***************
*** 1015,1022 ****
  		case PQTRANS_UNKNOWN:
  		default:
  			OK = false;
! 			psql_error("unexpected transaction status (%d)\n",
! 					   transaction_status);
  			break;
  	}
--- 1015,1024 ----
  		case PQTRANS_UNKNOWN:
  		default:
  			OK = false;
! 			/* PQTRANS_UNKNOWN is expected given a broken connection. */
! 			if (transaction_status != PQTRANS_UNKNOWN || ConnectionUp())
! 				psql_error("unexpected transaction status (%d)\n",
! 						   transaction_status);
  			break;
  	}
Re: [HACKERS] foreign key locks, 2nd attempt
On Mon, Feb 27, 2012 at 02:13:32PM +0200, Heikki Linnakangas wrote: On 23.02.2012 18:01, Alvaro Herrera wrote: As far as complexity, yeah, it's a lot more complex now -- no question about that. How about assigning a new, real, transaction id, to represent the group of transaction ids. The new transaction id would be treated as a subtransaction of the updater, and the xids of the lockers would be stored in the multixact-members slru. That way the multixact structures wouldn't need to survive a crash; you don't care about the shared lockers after a crash, and the xid of the updater would be safely stored as is in the xmax field. That way you wouldn't need to handle multixact wraparound, because we already handle xid wraparound, and you wouldn't need to make multixact slrus crash-safe. Not sure what the performance implications would be. You would use up xids more quickly, which would require more frequent anti-wraparound vacuuming. And if we just start using real xids as the key to multixact-offsets slru, we would need to extend that a lot more often. But I feel it would probably be acceptable. When a key locker arrives after the updater and creates this implicit subtransaction of the updater, how might you arrange for the xid's clog status to eventually get updated in accordance with the updater's outcome? Thanks, nm
Re: [HACKERS] ECPG FETCH readahead
On Thu, Dec 29, 2011 at 10:46:23AM +0100, Boszormenyi Zoltan wrote: 2011-11-16 20:51 keltezéssel, Boszormenyi Zoltan írta: 2010-10-14 11:56 keltezéssel, Boszormenyi Zoltan írta: On Thu, Jun 24, 2010 at 8:19 AM, Michael Meskes mes...@postgresql.org wrote: On Thu, Jun 24, 2010 at 12:04:30PM +0300, Heikki Linnakangas wrote: Is there a reason not to enable it by default? I'm a bit worried that it will receive no testing if it's not always on. Yes, there is a reason, namely that you don't need it in native mode at all. ECPG can read as many records as you want in one go, something ESQL/C apparently cannot do. This patch is a workaround for that restriction. I still do not really see that this feature gives us an advantage in native mode though. We yet lack a consensus on whether native ECPG apps should have access to the feature. My 2c is to make it available, because it's useful syntactic sugar. If your program independently processes each row of an arbitrary-length result set, current facilities force you to add an extra outer loop to batch the FETCHes for every such code site. Applications could define macros to abstract that pattern, but this seems common enough to justify bespoke handling. Besides, minimalists already use libpq directly. I suggest enabling the feature by default but drastically reducing the default readahead chunk size from 256 to, say, 5. That still reduces the FETCH round trip overhead by 80%, but it's small enough not to attract pathological behavior on a workload where each row is a 10 MiB document. I would not offer an ecpg-time option to disable the feature per se. Instead, let the user set the default chunk size at ecpg time. A setting of 1 effectively disables the feature, though one could later re-enable it with ECPGFETCHSZ. The ASAP took a little long. 
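For the record, the round-trip arithmetic behind the chunk-size suggestion quoted above, as a quick sketch (this helper is mine, not part of the patch):

```python
def fetch_round_trips(nrows, chunk):
    """Round trips needed to drain an nrows-row result set,
    fetching `chunk` rows per FETCH (ceiling division)."""
    return -(-nrows // chunk)

assert fetch_round_trips(1000, 1) == 1000   # no readahead
assert fetch_round_trips(1000, 5) == 200    # 80% fewer round trips
assert fetch_round_trips(1000, 8) == 125    # 87.5% fewer
```

Most of the latency win comes from the first few rows of chunk size; beyond that, larger chunks mainly add memory pressure.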
The attached patch is in git diff format, because (1) the regression test intentionally doesn't do ECPGdebug() so the patch isn't dominated by a 2MB stderr file, so this file is empty and (2) regular diff cannot cope with empty new files. Avoid the empty file with fprintf(stderr, "Dummy non-empty error output\n"); - *NEW FEATURE* Readahead can be individually enabled or disabled by ECPG-side grammar: DECLARE curname [ [ NO ] READAHEAD ] CURSOR FOR ... Without [ NO ] READAHEAD, the default behaviour is used for cursors. Likewise, this may as well take a chunk size rather than a yes/no. The patch adds warnings: error.c: In function `ecpg_raise': error.c:281: warning: format not a string literal and no format arguments error.c:281: warning: format not a string literal and no format arguments The patch adds few comments and no larger comments explaining its higher-level ideas. That makes it much harder to review. In this regard it follows the tradition of the ECPG code, but let's depart from that tradition for new work. I mention a few cases below where the need for commentary is acute. I tested full reads of various SELECT * FROM generate_series(1, $1) commands over a 50ms link, and the patch gives a sound performance improvement. With no readahead, a mere N=100 takes 5s to drain. With readahead enabled, N=1 takes only 3s. Performance was quite similar to that of implementing my own batching with FETCH 256 FROM cur. When I kicked up ECPGFETCHSZ (after fixing its implementation -- see below) past the result set size, performance was comparable to that of simply passing the query through psql.

--- a/doc/src/sgml/ecpg.sgml
+++ b/doc/src/sgml/ecpg.sgml
@@ -5289,6 +5315,17 @@ while (1)
     </varlistentry>

     <varlistentry>
+     <term>-231 (<symbol>ECPG_INVALID_CURSOR</symbol>)</term>
+     <listitem>
+      <para>
+       The cursor you are trying to use with readahead has not been opened yet (SQLSTATE 34000),
+       invalid values were passed to libecpg (SQLSTATE 42P11) hor not in prerequisite state, i.e.

Typo. 
--- /dev/null
+++ b/src/interfaces/ecpg/ecpglib/cursor.c
@@ -0,0 +1,730 @@

cursor.c contains various over-78-column lines. pgindent has limited ability to improve those, so please split them.

+static struct cursor_descriptor *
+add_cursor(int lineno, struct connection *con, const char *name, bool scrollable, int64 n_tuples, bool *existing)
+{
+	struct cursor_descriptor *desc,
+			   *ptr, *prev = NULL;
+	bool		found = false;
+
+	if (!name || name[0] == '\0')
+	{
+		if (existing)
+			*existing = false;
+		return NULL;
+	}
+
+	ptr = con->cursor_desc;
+	while (ptr)
+	{
+		int		ret = strcasecmp(ptr->name, name);
+
+		if (ret == 0)
+		{
+			found = true;
+			break;
+		}
+		if (ret > 0)
+			break;
+
+		prev = ptr;
+		ptr = ptr->next;
+	}

Any reason not to use find_cursor() here?
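For reference, the quoted add_cursor() walks a name-sorted list and stops early once it passes the slot where the name would sort. A sketch of that lookup pattern in Python (my illustration, not the patch's code):

```python
def find_sorted(names, target):
    """Search a case-insensitively sorted list; stop early once past
    the would-be position of `target` (like the strcasecmp loop above)."""
    key = target.lower()
    for i, name in enumerate(names):
        current = name.lower()
        if current == key:
            return i          # found
        if current > key:
            break             # passed the insertion point: not present
    return -1

assert find_sorted(["alpha", "beta", "delta"], "BETA") == 1
assert find_sorted(["alpha", "beta", "delta"], "gamma") == -1
```

The early break is what makes keeping the list sorted pay off; a shared helper would let the insert and lookup paths use one copy of it.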
Re: [HACKERS] index-only quals vs. security_barrier views
On Thu, Feb 09, 2012 at 12:02:29PM -0500, Robert Haas wrote: When Heikki worked up his original index-only scan patches (which didn't end up looking much like what eventually got committed), he had the notion of an index-only qual. That is, given a query like this: select sum(1) from foo where substring(a,1,3) = 'abc'; We could evaluate the substring qual before performing the heap fetch, and fetch the tuple from the heap only if the qual passes. Now, there's a fly in the ointment here, which is that applying arbitrary user-defined functions to tuples that might not be visible doesn't sound very safe. The user-defined function in question might perform some action based on those invisible tuples that has side effects, which would be bad, because now we're violating MVCC semantics. Or it might throw an error, killing the scan dead on the basis of the contents of some tuple that the scan shouldn't even see. However, there is certainly a class of functions for which this type of optimization would be safe, and it's an awful lot like the set of functions that can be safely pushed down through a security_barrier view - namely, things that don't have side effects or throw errors. So it's possible that the proleakproof flag KaiGai is proposing to add to pg_proc could do double duty, serving also to identify when it's safe to apply a qual to an index tuple when the corresponding heap tuple might be invisible. However, I have some reservations about assuming that the two concepts are exactly the same. For one thing, people are inevitably going to want to cheat a little bit more here than is appropriate for security views, and encouraging people to mark things LEAKPROOF when they're really not is a security disaster waiting to happen. The similarity is indeed tempting, but I find the concepts sufficiently distinct to not make one device serve both. Adding to the reservations you offer, LEAKPROOF is superuser-only. This other marker would not entail any special privilege. 
For another thing, there are some important cases that this doesn't cover, like: select * from foo where substring(a,1,3) like '%def%'; The like operator doesn't seem to be leakproof in the security sense, because it can throw an error if the pattern is something like a single backslash (ERROR: LIKE pattern must not end with escape character) and indeed it doesn't seem like it would be safe here either if the pattern were stored in the table. But if the pattern were constant, it'd be OK, or almost OK: there's still the edge case where the table contains invisible rows but no visible ones - whether or not we complain about the pattern there ought to be the same as whether or not we complain about it on a completely empty table. If we got to that point, then we might as well consider the qual leakproof for security purposes under the same set of circumstances we'd consider it OK to apply to possibly-invisible tuples. This sort of thing implicates substring(), too, when you call it as substring(a, 1, b); b < 0 produces an error. To handle these, I think we'd need a facility along the lines of protransform. Have a function inspecting call nodes for a particular other function and determining whether each is ok-for-index-only-quals. You could even force protransform itself into that role. Create an additional pg_proc entry identical to the ordinary substring() but for a different name and having the ok-for-index-only-quals flag. Add a protransform to the main pg_proc entry that inspects the argument nodes and, when they're safe, replaces the call with a call to that errorfree_substring() at plan time.
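To make the protransform idea concrete, here is a toy sketch of the planner-side check. The node shapes and the errorfree_substring name are hypothetical, invented for illustration, not PostgreSQL's actual node representation:

```python
def transform_substring(call):
    """Swap in an assumed error-free variant only when the length
    argument is a constant that cannot raise at runtime."""
    length = call["args"][2]
    if length.get("kind") == "const" and length["value"] >= 0:
        return dict(call, func="errorfree_substring")
    # Variable or negative length: may raise, keep the ordinary call.
    return call

const_call = {"func": "substring",
              "args": [{"kind": "var", "name": "a"},
                       {"kind": "const", "value": 1},
                       {"kind": "const", "value": 3}]}
assert transform_substring(const_call)["func"] == "errorfree_substring"
```

A call with a variable third argument would pass through unchanged, so only provably error-free calls become candidates for index-only quals.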
Re: [HACKERS] COPY with hints, rebirth
On Fri, Mar 02, 2012 at 08:46:45AM +, Simon Riggs wrote: On Thu, Mar 1, 2012 at 8:49 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: It's still broken: [BEGIN;TRUNCATE;SAVEPOINT;COPY;ROLLBACK TO] So this approach isn't the one... The COPY FREEZE patch provides a way for the user to say explicitly that they don't really care about these MVCC corner cases and as a result allows us to avoid touching XidInMVCCSnapshot() at all. So there is still a patch on the table. You can salvage the optimization by tightening its prerequisite: use it when the current subtransaction or a child thereof created or truncated the table. A parent subtransaction having done so is acceptable for the WAL avoidance optimization but not for this. A COPY FREEZE ignoring that stronger restriction would be an interesting feature in its own right, but it brings up other problems. For example, suppose you write a tuple, then fail while writing its index entries. The tuple is already frozen and visible, but it lacks a full set of live index entries. The subtransaction aborts, but the top transaction commits and makes the situation permanent. Incidentally, I contend that we should write frozen tuples to new/truncated tables unconditionally. The current behavior of making old snapshots see the table as empty violates atomicity at least as badly as letting those snapshots see the future-snapshot contents. But Marti has a sound proposal that would interact with your efforts here to avoid violating atomicity at all: http://archives.postgresql.org/message-id/cabrt9rbrmdsoz8kxgehfb4lg-ev9u67-6dlqvoiibpkkhtl...@mail.gmail.com Thanks, nm
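A purely illustrative model of the tightened prerequisite (names invented here; subtransaction state is reduced to a stack of ids, and a committed child's work is treated as folded into its parent):

```python
def wal_skip_ok(creating_subxact, stack):
    # WAL-avoidance rule: the table was created or truncated by the
    # current subxact or any parent still on the stack.
    return creating_subxact in stack

def freeze_ok(creating_subxact, stack):
    # Tightened rule for writing frozen tuples: only the current
    # subxact (including committed children folded into it) qualifies.
    return creating_subxact == stack[-1]

stack = [1, 4, 9]   # top-level xact 1, subxact 4, current subxact 9
assert wal_skip_ok(4, stack) and not freeze_ok(4, stack)
assert wal_skip_ok(9, stack) and freeze_ok(9, stack)
```

The gap between the two predicates is exactly the BEGIN;TRUNCATE;SAVEPOINT;COPY;ROLLBACK TO case: the truncating subxact is a parent of the copying one, so WAL avoidance is fine but freezing is not.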
Re: [HACKERS] ECPG FETCH readahead
On Sun, Mar 04, 2012 at 05:16:06PM +0100, Michael Meskes wrote: On Fri, Mar 02, 2012 at 11:41:05AM -0500, Noah Misch wrote: I suggest enabling the feature by default but drastically reducing the default readahead chunk size from 256 to, say, 5. That still reduces the FETCH round trip overhead by 80%, but it's small enough not to attract pathological behavior on a workload where each row is a 10 MiB document. I would not offer an ecpg-time option to disable the feature per se. Instead, let the user set the default chunk size at ecpg time. A setting of 1 effectively disables the feature, though one could later re-enable it with ECPGFETCHSZ. Using 1 to effectively disable the feature is fine with me, but I strongly object to any default enabling this feature. It's fairly easy to create cases with pathological behaviour, and this feature is not standard by any means. I figure a normal programmer would expect only one row being transferred when fetching one. On further reflection, I agree with you here. The prospect for queries that call volatile functions changed my mind; they would exhibit different functional behavior under readahead. We mustn't silently give affected programs different semantics. Thanks, nm
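The volatile-function concern is easy to see with a toy model (plain Python; the counter stands in for a volatile function the query evaluates once per row):

```python
import itertools

calls = 0

def volatile():
    global calls
    calls += 1
    return calls

def result_rows():
    while True:
        yield volatile()   # one volatile call per row produced

# Without readahead, FETCH 1 produces one row: exactly one volatile call.
next(result_rows())
assert calls == 1

# With a readahead chunk of 5, the first FETCH drains five rows at once:
# five volatile calls happen even though the program consumed one row.
calls = 0
buf = list(itertools.islice(result_rows(), 5))
assert calls == 5
assert buf == [1, 2, 3, 4, 5]
```

A program that, say, aborts after the first row would observe one side effect without readahead and five with it, which is why silent default enablement is off the table.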
Re: [HACKERS] ECPG FETCH readahead
On Sun, Mar 04, 2012 at 04:33:32PM +0100, Boszormenyi Zoltan wrote: 2012-03-02 17:41 keltezéssel, Noah Misch írta: On Thu, Dec 29, 2011 at 10:46:23AM +0100, Boszormenyi Zoltan wrote: I suggest enabling the feature by default but drastically reducing the default readahead chunk size from 256 to, say, 5. That still reduces the FETCH round trip overhead by 80%, but it's small enough not to attract pathological behavior on a workload where each row is a 10 MiB document. I see. How about 8? Nice round power of 2 value, still small and avoids 87.5% of overhead. Having pondered the matter further, I now agree with Michael that the feature should stay disabled by default. See my response to him for rationale. Assuming that conclusion holds, we can recommend a higher value to users who enable the feature at all. Your former proposal of 256 seems fine. BTW, the default disabled behaviour was to avoid make check breakage, see below. I would not offer an ecpg-time option to disable the feature per se. Instead, let the user set the default chunk size at ecpg time. A setting of 1 effectively disables the feature, though one could later re-enable it with ECPGFETCHSZ. This means all code previously going through ECPGdo() would go through ECPGopen()/ECPGfetch()/ECPGclose(). This is more intrusive and all regression tests that were only testing certain features would also test the readahead feature, too. It's a good sort of intrusiveness, reducing the likelihood of introducing bugs basically unrelated to readahead that happen to afflict only ECPGdo() or only the cursor.c interfaces. Let's indeed not have any preexisting test cases use readahead per se, but having them use the cursor.c interfaces anyway will build confidence in the new code. The churn in expected debug output isn't ideal, but I don't prefer the alternative of segmenting the implementation for the sake of the test cases. 
Also, the test for WHERE CURRENT OF at ecpg time would have to be done at runtime, possibly making previously working code fail if ECPGFETCHSZ is enabled. Good point. How about still allowing NO READAHEAD cursors that compile into plain ECPGdo()? This way, ECPGFETCHSZ doesn't interfere with WHERE CURRENT OF. But this would mean code changes everywhere WHERE CURRENT OF is used. ECPGFETCHSZ should only affect cursors that make no explicit mention of READAHEAD. I'm not sure whether that should mean actually routing READAHEAD 1 cursors through ECPGdo() or simply making sure that cursor.c achieves the same outcome; see later for a possible reason to still do the latter. Or how about a new feature in the backend, so ECPG can do UPDATE/DELETE ... WHERE OFFSET N OF cursor, with the offset computed from the actual cursor position and the position known by the application? This way an app can do readahead and do work on rows collected by the cursor with WHERE CURRENT OF, which gets converted to WHERE OFFSET OF behind the scenes. That's a neat idea, but I would expect obstacles threatening our ability to use it automatically for readahead. You would have to make the cursor a SCROLL cursor. We'll often pass a negative offset, making the operation fail if the cursor query used FOR UPDATE. Volatile functions in the query will get more calls. That's assuming the operation will map internally to something like MOVE N; UPDATE ... WHERE CURRENT OF; MOVE -N. You might come up with innovations to mitigate those obstacles, but those innovations would probably also apply to MOVE/FETCH. In any event, this would constitute a substantive patch in its own right. One way out of trouble here is to make WHERE CURRENT OF imply READAHEAD 1/NO READAHEAD (incidentally, perhaps those two should be synonyms) on the affected cursor. If the cursor has some other readahead quantity declared explicitly, throw an error during preprocessing. 
Failing a reasonable resolution, I'm prepared to withdraw my suggestion of making ECPGFETCHSZ always-usable. It's nice to have, not critical.

+bool
+ECPGopen(const int lineno, const int compat, const int force_indicator,
+		 const char *connection_name, const bool questionmarks,
+		 const char *curname, const int st, const char *query, ...)
+{
+	va_list		args;
+	bool		ret, scrollable;
+	char	   *new_query, *ptr, *whold, *noscroll, *scroll, *dollar0;
+	struct sqlca_t *sqlca = ECPGget_sqlca();
+
+	if (!query)
+	{
+		ecpg_raise(lineno, ECPG_EMPTY, ECPG_SQLSTATE_ECPG_INTERNAL_ERROR, NULL);
+		return false;
+	}
+	ptr = strstr(query, " for ");
+	if (!ptr)
+	{
+		ecpg_raise(lineno, ECPG_INVALID_STMT, ECPG_SQLSTATE_ECPG_INTERNAL_ERROR, NULL);
+		return false;
+	}
+	whold = strstr(query, " with hold ");
+	dollar0 = strstr(query, "$0");
+
+	noscroll = strstr(query, " no scroll ");
+	scroll = strstr(query, " scroll
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Mon, Feb 13, 2012 at 09:29:56AM -0500, Robert Haas wrote: On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch n...@leadboat.com wrote: I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and not at READ COMMITTED. They tend to be narrow race conditions at READ COMMITTED, yet easy to demonstrate at REPEATABLE READ. Related: http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php Yeah. Well, that's actually an interesting example, because it illustrates how general this problem is. We could potentially get ourselves into a situation where just about every system catalog table needs an xmin field to store the point at which the object came into existence - or for that matter, was updated. I can see this strategy applying to many relation-pertinent system catalogs. Do you foresee applications to non-relation catalogs? In any event, I think a pg_class.relvalidxmin is the right starting point. One might imagine a family of relvalidxmin, convalidxmin, indcheckxmin (already exists), inhvalidxmin, and attvalidxmin. relvalidxmin is like the AccessExclusiveLock of that family; it necessarily blocks everything that might impugn the others. The value in extending this to more catalogs is the ability to narrow the impact of failing the check. A failed indcheckxmin comparison merely excludes plans involving the index. A failed inhvalidxmin check might just skip recursion to the table in question. Those are further refinements, much like using weaker heavyweight lock types. But it's not quite the same as the xmin of the row itself, because some updates might be judged not to matter. There could also be intermediate cases where updates are invalidating for some purposes but not others. I think we'd better get our hands around more of the problem space before we start trying to engineer solutions. I'm not seeing that problem. 
Any operation that would update some xmin horizon should set it to the greater of its current value and the value the operation needs for its own correctness. If you have something in mind that needs more, could you elaborate? Incidentally, people use READ COMMITTED because they don't question the default, not because they know the hazards of REPEATABLE READ. I don't know the bustedness you speak of; could we improve the documentation to inform folks? The example that I remember was related to SELECT FOR UPDATE/SELECT FOR SHARE. The idea of those statements is that you want to prevent the row from being updated or deleted until some other concurrent action is complete; for example, in the case of a foreign key, we'd like to prevent the referenced row from being deleted or updated in the relevant columns until the inserting transaction is committed. But it doesn't work, because when the updating or deleting process gets done with the lock wait, they are still using the same snapshot as before, and merrily do exactly the thing that the lock-wait was supposed to prevent. If an actual UPDATE is used, it's safe (I think): anyone who was going to UPDATE or DELETE the row will fail with some kind of serialization error. But a SELECT FOR UPDATE that commits is treated more like an UPDATE that rolls back: it's as if the lock never existed. Someone (Florian?) proposed a patch to change this, but it seemed problematic for reasons I no longer exactly remember. Thanks. I vaguely remember that thread. nm
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Sun, Mar 04, 2012 at 01:02:57PM +, Simon Riggs wrote: More detailed thoughts show that the test in heap_beginscan_internal() is not right enough, i.e. wrong. We need a specific XidInMVCCSnapshot test on the relvalidxid, so it needs to be a specific xid, not an xmin, because otherwise we can get concurrent transactions failing, not just older transactions.

Good point; I agree. indcheckxmin's level of pessimism isn't appropriate for this new check.

If we're going to freeze tuples on load this needs to be watertight, so some minor rework needed. Of course, if we only have a valid xid on the class we might get new columns added when we do repeated SELECT * statements using the same snapshot while concurrent DDL occurs. That is impractical, so if we define this as applying to rows it can work; if we want it to apply to everything it's getting more difficult.

Excess columns seem less grave to me than excess or missing rows. I'm having difficulty thinking up an explanation for that opinion.
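As a sketch of the check being discussed (all names here are assumptions drawn from the thread, not committed PostgreSQL code: relvalidxid is the proposed pg_class field, and XidInMVCCSnapshot is currently a private routine in tqual.c that would need exporting):

```c
/*
 * Sketch only: reject the scan when the relation's last TRUNCATE (or its
 * creation) is invisible to, yet concurrent with, our MVCC snapshot.
 * "relvalidxid" is the field proposed in this thread, not an existing one.
 */
static void
TestForConcurrentTruncate(Relation rel, Snapshot snapshot)
{
    TransactionId relvalidxid = rel->rd_rel->relvalidxid;   /* hypothetical */

    if (TransactionIdIsValid(relvalidxid) &&
        IsMVCCSnapshot(snapshot) &&
        XidInMVCCSnapshot(relvalidxid, snapshot))   /* assumes it is exported */
        ereport(ERROR,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("canceling statement due to concurrent TRUNCATE")));
}
```

Storing a specific xid rather than an xmin horizon is what lets XidInMVCCSnapshot distinguish a concurrent truncating transaction from a merely older one.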
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Mon, Mar 05, 2012 at 03:46:16PM -0500, Robert Haas wrote: On Mon, Mar 5, 2012 at 2:22 PM, Noah Misch n...@leadboat.com wrote: I can see this strategy applying to many relation-pertinent system catalogs. Do you foresee applications to non-relation catalogs?

Well, in theory, we have similar issues if, say, a query uses a function that didn't exist at the time the snapshot was taken; the actual results the user sees may not be consistent with any serial execution schedule. And the same could be true for any other SQL object. It's unclear that those cases are as compelling as this one, but then again it's unclear that no one will ever want to fix them, either. For example, suppose we have a view v over a table t that calls a function f. Somebody alters f to give different results and, in the same transaction, modifies the contents of t (but no DDL). This doesn't strike me as a terribly unlikely scenario; the change to t could well be envisioned as a compensating transaction. But now if somebody uses the new definition of f against the old contents of t, the user may fail to get what they were hoping for out of bundling those changes together in one transaction.

Good example.

Now, maybe we're never going to fix those kinds of anomalies anyway, but if we go with this architecture, then I think the chances of it ever being palatable to try are pretty low.

Why?

But it's not quite the same as the xmin of the row itself, because some updates might be judged not to matter. There could also be intermediate cases where updates are invalidating for some purposes but not others. I think we'd better get our hands around more of the problem space before we start trying to engineer solutions.

I'm not seeing that problem. Any operation that would update some xmin horizon should set it to the greater of its current value and the value the operation needs for its own correctness. If you have something in mind that needs more, could you elaborate?

Simon's point about xmin vs.
xid probably leads to an example. One value is fine for TRUNCATE, because only the most recent TRUNCATE matters. Not all DDL is so simple.

Well, consider something like CLUSTER. It's perfectly OK for CLUSTER to operate on a table that has been truncated since CLUSTER's snapshot was taken, and no serialization anomaly is created that would not have already existed as a result of the non-MVCC-safe TRUNCATE. On the other hand, if CLUSTER operates on a table that was created since CLUSTER's snapshot was taken, then you have a bona fide serialization anomaly.

Core CLUSTER does not use any MVCC snapshot. We do push one for the benefit of functions called during the reindex phase, but it does not appear that you speak of that snapshot. Could you elaborate this example?

Maybe not a very important one, but does that prove that there's no significant problem of this type in general, or just nobody's thought through all the cases yet? After all, the issues with CREATE TABLE/TRUNCATE vs. a concurrent SELECT have been around for a very long time, and we're only just getting around to looking at them, so I don't have much confidence that there aren't other cases floating around out there.

Granted.

Thanks, nm
Re: [HACKERS] ECPG FETCH readahead
On Tue, Mar 06, 2012 at 07:07:41AM +0100, Boszormenyi Zoltan wrote: On 2012-03-05 19:56, Noah Misch wrote: Or how about a new feature in the backend, so ECPG can do UPDATE/DELETE ... WHERE OFFSET N OF cursor, with the offset computed from the actual cursor position and the position known by the application? This way an app can do readahead and do work on rows collected by the cursor with WHERE CURRENT OF, which gets converted to WHERE OFFSET OF behind the scenes.

That's a neat idea, but I would expect obstacles threatening our ability to use it automatically for readahead. You would have to make the cursor a SCROLL cursor. We'll often pass a negative offset, making the operation fail if the cursor query used FOR UPDATE. Volatile functions in the query will get more calls. That's assuming the operation will map internally to something like MOVE N; UPDATE ... WHERE CURRENT OF; MOVE -N. You might come up with innovations to mitigate those obstacles, but those innovations would probably also apply to MOVE/FETCH. In any event, this would constitute a substantive patch in its own right.

I was thinking along the lines of a Portal keeping the ItemPointerData for each tuple in the last FETCH statement. The WHERE OFFSET N OF cursor would treat the offset value relative to the tuple order returned by FETCH. So, OFFSET 0 OF == CURRENT OF and other values of N are negative. This way, it doesn't matter if the cursor is SCROLL, NO SCROLL or has the default behaviour with SCROLL in some cases. Then ECPGopen() doesn't have to play games with the DECLARE statement. Only ECPGfetch() needs to play with MOVE statements, passing different offsets to the backend, not what the application passed.

That broad approach sounds promising. The main other consideration that comes to mind is a plan to limit resource usage for a cursor that reads, say, 1B rows.
However, I think attempting to implement this now will significantly decrease the chance of getting the core patch features committed now. One way out of trouble here is to make WHERE CURRENT OF imply READAHEAD 1/READAHEAD 0 (incidentally, perhaps those two should be synonyms) on the affected cursor. If the cursor has some other readahead quantity declared explicitly, throw an error during preprocessing.

I played with this idea a while ago, from a different point of view. If the ECPG code had the DECLARE mycur, DML ... WHERE CURRENT OF mycur and OPEN mycur in exactly this order, i.e. WHERE CURRENT OF appears in a standalone function between DECLARE and the first OPEN for the cursor, then ECPG disabled readahead automatically for that cursor and for that cursor only. But this requires effort from the user of ECPG and can be very fragile. Code cleanup with reordering functions can break previously working code.

Don't the same challenges apply to accurately reporting an error when the user specifies WHERE CURRENT OF for a readahead cursor?
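The readahead bookkeeping under discussion can be modeled in a few lines (a toy sketch, not ECPG code; class and field names are mine): the library issues one large backend FETCH per cache miss and serves the application's row-at-a-time fetches from the cached window, tracking where the backend cursor sits relative to the application's notion of position.

```python
class ReadaheadCursor:
    """Toy model: cache `window` rows per backend FETCH; serve
    application-level fetches from the cache when possible."""

    def __init__(self, rows, window=4):
        self.rows, self.window = rows, window
        self.cache = []          # rows from the last backend FETCH
        self.base = 0            # index of cache[0] within the result set
        self.pos = -1            # application cursor position

    def fetch(self):
        self.pos += 1
        if not (self.base <= self.pos < self.base + len(self.cache)):
            # cache miss: one backend FETCH refills the window
            self.base = self.pos
            self.cache = self.rows[self.base:self.base + self.window]
        return self.cache[self.pos - self.base]

cur = ReadaheadCursor(list(range(10)), window=4)
print([cur.fetch() for _ in range(6)])   # -> [0, 1, 2, 3, 4, 5]
```

Six application fetches cost only two backend round trips here; the gap between `pos` and the backend cursor position is exactly what a WHERE OFFSET N OF translation, or a corrective MOVE, would have to account for.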
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Tue, Mar 06, 2012 at 08:36:05AM -0500, Robert Haas wrote: On Tue, Mar 6, 2012 at 5:43 AM, Noah Misch n...@leadboat.com wrote: Well, consider something like CLUSTER. It's perfectly OK for CLUSTER to operate on a table that has been truncated since CLUSTER's snapshot was taken, and no serialization anomaly is created that would not have already existed as a result of the non-MVCC-safe TRUNCATE. On the other hand, if CLUSTER operates on a table that was created since CLUSTER's snapshot was taken, then you have a bona fide serialization anomaly.

Core CLUSTER does not use any MVCC snapshot. We do push one for the benefit of functions called during the reindex phase, but it does not appear that you speak of that snapshot. Could you elaborate this example?

Imagine this:
- Transaction #1 acquires a snapshot.
- Transaction #2 creates table A, inserts a row into table B, and then commits.
- Transaction #1 tries to CLUSTER A and then select from B.

The only serial execution schedules are T1 T2, in which case the transaction fails, or T2 T1, in which case the row is seen. But what actually happens is that the row isn't seen and yet the transaction doesn't fail.

For the purpose of contemplating this anomaly, one could just as well replace CLUSTER with GRANT, COMMENT ON TABLE, or any other command that operates on a table, correct? I agree this test case is good to keep in mind while designing, but we could well conclude not to bother improving it.
Re: [HACKERS] psql COPY vs. ON_ERROR_ROLLBACK, multi-command strings
On Wed, Mar 07, 2012 at 04:57:12PM -0500, Robert Haas wrote: On Sat, Feb 25, 2012 at 12:57 AM, Noah Misch n...@leadboat.com wrote: Thanks. While testing a crashing function, I noticed that my above patch added some noise to psql output when the server crashes:

[local] test=# select crashme();
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
unexpected transaction status (4)
Time: 6.681 ms
! \q

Status 4 is PQTRANS_UNKNOWN, which is expected when the connection is not CONNECTION_OK. The double message arrives because ProcessResult() now calls CheckConnection() at least twice, for the benefit of COPY. (Incidentally, the reconnect fails because the server has not yet finished recovering; that part is nothing new.) The attached small patch has SendQuery() keep quiet about PQTRANS_UNKNOWN when the connection is down. It makes ProcessResult() skip the second CheckConnection() when the command string had no COPY results. This restores the pre-08146775acd8bfe0fcc509c71857abb928697171 psql output:

[local] test=# select crashme();
The connection to the server was lost. Attempting reset: Failed.
Time: 4.798 ms
! \q

Committed, but... why do we EVER need to call CheckConnection() at the bottom of ProcessResult(), after exiting that function's main loop? I don't see any way to get out of that loop without first calling AcceptResult on every PGresult we've seen, and that function already calls CheckConnection() if any failure is observed.

You can reproduce a case where it helps by sending SIGKILL to a backend serving a long-running COPY TO, such as this:

copy (select n, pg_sleep(case when n > 1000 then 1 else 0 end) from generate_series(1,1030) t(n)) to stdout;

The connection dies during handleCopyOut(). By the time control returns to ProcessResult(), there's no remaining PGresult to check. Only that last-ditch CheckConnection() notices the lost connection.
[I notice more examples of poor error reporting cosmetics in some of these COPY corner cases, so I anticipate another patch someday.]

As a side note, the documentation for PQexec() is misleading about what will happen if COPY is present in a multi-command string. It says: Note however that the returned PGresult structure describes only the result of the last command executed from the string. Should one of the commands fail, processing of the string stops with it and the returned PGresult describes the error condition. It does not explain that it also stops if it hits a COPY. I had to read the source code for libpq to understand why this psql logic was coded the way it is.

Agreed; I went through a similar process. A while after reading the code, I found the behavior documented in the section Functions Associated with the COPY Command: If a COPY command is issued via PQexec in a string that could contain additional commands, the application must continue fetching results via PQgetResult after completing the COPY sequence. Only when PQgetResult returns NULL is it certain that the PQexec command string is done and it is safe to issue more commands.

I'm not quite sure what revision would help most here -- a cross reference, moving that content, duplicating that content. Offhand, I'm inclined to move it to the PQexec() documentation with some kind of reference back from its original location. Thoughts?

Thanks, nm
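The behavior quoted from the COPY section of the docs amounts to a drain loop like the following (a sketch, assuming an established PGconn *conn and a table t; error handling elided):

```c
#include <libpq-fe.h>

/*
 * Drain every result of a multi-command string, including COPY OUT data.
 * PQexec returns as soon as it reaches the COPY; PQgetResult must then be
 * called until it yields NULL before the connection is usable again.
 */
static void
run_multi(PGconn *conn)
{
    PGresult *res = PQexec(conn, "SELECT 1; COPY t TO STDOUT; SELECT 2;");

    while (res != NULL)
    {
        if (PQresultStatus(res) == PGRES_COPY_OUT)
        {
            char *buf;

            while (PQgetCopyData(conn, &buf, 0) > 0)
                PQfreemem(buf);     /* consume and discard each data row */
        }
        PQclear(res);
        res = PQgetResult(conn);    /* NULL only when the string is done */
    }
}
```

This is essentially the shape of the psql logic discussed above: the surprise is that PQexec's return is the COPY result, not the last command's, and the remaining results arrive only through PQgetResult.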
Re: [HACKERS] sortsupport for text
On Fri, Mar 02, 2012 at 03:45:38PM -0500, Robert Haas wrote: SELECT SUM(1) FROM (SELECT * FROM randomtext ORDER BY t) x;

On unpatched master, this takes about 416 ms (plus or minus a few). With the attached patch, it takes about 389 ms (plus or minus a very few), a speedup of about 7%. I repeated the experiment using the C locale, like this:

SELECT SUM(1) FROM (SELECT * FROM randomtext ORDER BY t COLLATE "C") x;

Here, it takes about 202 ms with the patch, and about 231 ms on unpatched master, a savings of about 13%. [oprofile report, further discussion]

Thanks for looking into this. Your patch is also a nice demonstration of sortsupport's ability to help with more than just fmgr overhead.

Considering all that, I had hoped for more like a 15-20% gain from this approach, but it didn't happen, I suppose because some of the instructions saved just resulted in more processor stalls. All the same, I'm inclined to think it's still worth doing.

This is a border case, but I suggest that a 13% speedup on a narrowly-tailored benchmark, degrading to 7% in common configurations, is too meager to justify adopting this patch.

nm
Re: [HACKERS] Collect frequency statistics for arrays
On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote: Alexander Korotkov aekorot...@gmail.com writes: True. If (max count - min count + 1) is small, enumerating of frequencies is both more compact and more precise representation. Simultaneously, if (max count - min count + 1) is large, we can run out of statistics_target with such representation. We can use same representation of count distribution as for scalar column value: MCV and HISTOGRAM, but it would require additional statkind and statistics slot. Probably, you've better ideas?

I wasn't thinking of introducing two different representations, but just trimming the histogram length when it's larger than necessary. On reflection my idea above is wrong; for example assume that we have a column with 900 arrays of length 1 and 100 arrays of length 2. Going by what I said, we'd reduce the histogram to {1,2}, which might accurately capture the set of lengths present but fails to show that 1 is much more common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries) would capture the situation perfectly in one-tenth the space that the current logic does.

Granted. When the next sample finds 899/101 instead, though, the optimization vanishes. You save 90% of the space, perhaps 10% of the time. If you want to materially narrow typical statistics, Alexander's proposal looks like the way to go. I'd guess array columns always having DEC <= default_statistics_target are common enough to make that representation the dominant representation, if not the only necessary representation.
Re: [HACKERS] Collect frequency statistics for arrays
On Thu, Mar 08, 2012 at 11:30:52AM -0500, Tom Lane wrote: Noah Misch n...@leadboat.com writes: On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote: On reflection my idea above is wrong; for example assume that we have a column with 900 arrays of length 1 and 100 arrays of length 2. Going by what I said, we'd reduce the histogram to {1,2}, which might accurately capture the set of lengths present but fails to show that 1 is much more common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries) would capture the situation perfectly in one-tenth the space that the current logic does.

Granted. When the next sample finds 899/101 instead, though, the optimization vanishes.

No, you missed my next point. That example shows that sometimes a smaller histogram can represent the situation with zero error, but in all cases a smaller histogram can represent the situation with perhaps more error than a larger one. Since we already have a defined error tolerance, we should try to generate a histogram that is as small as possible while still not exceeding the error tolerance. Now, it might be that doing that is computationally impractical, or too complicated to be reasonable. But it seems to me to be worth looking into.

Yes, I did miss your point. One characteristic favoring this approach is its equal applicability to both STATISTIC_KIND_HISTOGRAM and STATISTIC_KIND_DECHIST.

If you want to materially narrow typical statistics, Alexander's proposal looks like the way to go. I'd guess array columns always having DEC <= default_statistics_target are common enough to make that representation the dominant representation, if not the only necessary representation.

Well, I don't want to have two representations; I don't think it's worth the complexity. But certainly we could consider switching to a different representation if it seems likely to usually be smaller.
Perhaps some heavy array users could provide input: what are some typical length ranges among arrays in your applications on which you use arr && const, arr @> const or arr <@ const searches?
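The shrunken-histogram idea from the 900/100 example above can be sketched as an equi-depth bound computation (illustrative code, not the ANALYZE implementation; function name is mine):

```python
def length_histogram(lengths, nbounds=10):
    """Equi-depth histogram bounds over observed array lengths: each
    adjacent pair of bounds spans roughly the same number of samples."""
    ordered = sorted(lengths)
    last = len(ordered) - 1
    # pick nbounds evenly spaced order statistics as the bounds
    return [ordered[last * i // (nbounds - 1)] for i in range(nbounds)]

# 900 arrays of length 1 and 100 of length 2: ten bounds capture the skew
print(length_histogram([1] * 900 + [2] * 100))
# -> [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]
```

The repeated 1s are what encode "length 1 is nine times as common as length 2"; the collapsed form {1,2} would lose exactly that information, which is the point Tom makes above.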
Re: [HACKERS] foreign key locks, 2nd attempt
On Mon, Mar 12, 2012 at 01:28:11PM -0400, Robert Haas wrote: I spent some time thinking about this over the weekend, and I have an observation, and an idea. Here's the observation: I believe that locking a tuple whose xmin is uncommitted is always a noop, because if it's ever possible for a transaction to wait for an XID that is part of its own transaction (exact same XID, or sub-XIDs of the same top XID), then a transaction could deadlock against itself. I believe that this is not possible: if a transaction were to wait for an XID assigned to that same backend, then the lock manager would observe that an ExclusiveLock on the xid is already held, so the request for a ShareLock would be granted immediately. I also don't believe there's any situation in which the existence of an uncommitted tuple fails to block another backend, but a lock on that same uncommitted tuple would have caused another backend to block. If any of that sounds wrong, you can stop reading here (but please tell me why it sounds wrong).

When we lock an update-in-progress row, we walk the t_ctid chain and lock all descendant tuples. They may all have uncommitted xmins. This is essential to ensure that the final outcome of the updating transaction does not affect whether the locking transaction has its KEY SHARE lock. Similarly, when we update a previously-locked tuple, we copy any locks (always KEY SHARE locks) to the new version. That new tuple is both uncommitted and has locks, and we cannot easily sacrifice either property. Do you see a way to extend your scheme to cover these needs?
Re: [HACKERS] Measuring relation free space
On Fri, Mar 09, 2012 at 02:18:02AM -0500, Jaime Casanova wrote: On Wed, Feb 22, 2012 at 12:27 AM, Noah Misch n...@leadboat.com wrote: On Tue, Feb 14, 2012 at 02:04:26AM -0500, Jaime Casanova wrote: 1) pgstattuple-gin_spgist.patch This first patch adds gin and spgist support to pgstattuple, also makes pgstattuple use a ring buffer when reading tables or indexes.

The buffer access strategy usage bits look fine to commit.

ok. i extracted that part. which basically makes pgstattuple usable in production (i mean, by not bloating shared buffers when using the function)

I created a CF entry for this and marked it Ready for Committer. You left the bstrategy variable non-static, but that didn't seem important enough to justify another round trip.
Re: [HACKERS] foreign key locks, 2nd attempt
On Tue, Mar 13, 2012 at 01:46:24PM -0400, Robert Haas wrote: On Mon, Mar 12, 2012 at 3:28 PM, Simon Riggs si...@2ndquadrant.com wrote: I agree with you that some worst case performance tests should be done. Could you please say what you think the worst cases would be, so those can be tested? That would avoid wasting time or getting anything backwards.

I've thought about this some and here's what I've come up with so far:

1. SELECT FOR SHARE on a large table on a system with no write cache.

Easy enough that we may as well check it. Share-locking an entire large table is impractical in a real application, so I would not worry if this shows a substantial regression.

2. A small parent table (say 30 rows or so) and a larger child table with a many-to-one FK relationship to the parent (say 100 child rows per parent row), with heavy update activity on the child table, on a system where fsyncs are very slow. This should generate lots of mxid consumption, and every 1600 or so mxids (I think) we've got to fsync; does that generate a noticeable performance hit?

More often than that; each 2-member mxid takes 4 bytes in an offsets file and 10 bytes in a members file. So, more like one fsync per ~580 mxids. Note that we already fsync the multixact SLRUs today, so any increase will arise from the widening of member entries from 4 bytes to 5. The realism of this test is attractive. Nearly-static parent tables are plenty common, and this test will illustrate the impact on those designs.

3. It would be nice to test the impact of increased mxid lookups in the parent, but I've realized that the visibility map will probably mask a good chunk of that effect, which is a good thing.
Still, maybe something like this: a fairly large parent table, say a million rows, but narrow rows, so that many of them fit on a page, with frequent reads and occasional updates (if there are only reads, autovacuum might end with all the visibility map bits set); plus a child table with one or a few rows per parent which is heavily updated. In theory this ought to be good for the patch, since the more fine-grained locking will avoid blocking, but in this case the parent table is large enough that you shouldn't get much blocking anyway, yet you'll still pay the cost of mxid lookups because the occasional updates on the parent will clear VM bits. This might not be exactly the right workload to measure this effect, but if it's not maybe someone can devote a little time to thinking about what would be.

You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid lookups, so I think something more sophisticated is needed to exercise that cost. Not sure what.

4. A plain old pgbench run or two, to see whether there's any regression when none of this matters at all...

Might as well.

This isn't exactly a test case, but from Noah's previous comments I gather that there is a theoretical risk of mxid consumption running ahead of xid consumption. We should try to think about whether there are any realistic workloads where that might actually happen. I'm willing to believe that there aren't, but not just because somebody asserts it. The reason I'm concerned about this is because, if it should happen, the result will be more frequent anti-wraparound vacuums on every table in the cluster. Those are already quite painful for some users.

Yes. Pre-release, what can we really do here other than have more people thinking about ways it might happen in practice? Post-release, we could suggest monitoring methods or perhaps have VACUUM emit a WARNING when a table is using more mxid space than xid space.
Also consider a benchmark that does plenty of non-key updates on a parent table with no activity on the child table. We'll pay the overhead of determining that the key column(s) have not changed, but it will never pay off by preventing a lock wait. Granted, this is barely representative of application behavior. Perhaps, too, we already have a good sense of this cost from the HOT benchmarking efforts and have no cause to revisit it.

Thanks, nm
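The "one fsync per ~580 mxids" figure above follows from simple per-entry arithmetic (a back-of-envelope check, assuming page-granular SLRU writes):

```python
# A 2-member multixact costs 4 bytes in the offsets SLRU plus
# 2 * 5 = 10 bytes in the members SLRU (5-byte member entries per
# the patch, widened from 4).
SLRU_PAGE = 8192                  # bytes per SLRU page
bytes_per_mxid = 4 + 2 * 5        # offsets entry + two member entries

print(SLRU_PAGE // bytes_per_mxid)   # -> 585
```

So one page's worth of combined SLRU data fills after roughly 585 two-member multixacts, versus the "every 1600 or so" figure quoted earlier in the thread.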
Re: [HACKERS] foreign key locks, 2nd attempt
On Wed, Mar 14, 2012 at 01:23:14PM -0400, Robert Haas wrote: On Tue, Mar 13, 2012 at 11:42 PM, Noah Misch n...@leadboat.com wrote: More often than that; each 2-member mxid takes 4 bytes in an offsets file and 10 bytes in a members file. So, more like one fsync per ~580 mxids. Note that we already fsync the multixact SLRUs today, so any increase will arise from the widening of member entries from 4 bytes to 5. The realism of this test is attractive. Nearly-static parent tables are plenty common, and this test will illustrate the impact on those designs.

Agreed. But speaking of that, why exactly do we fsync the multixact SLRU today?

Good question. So far, I can't think of a reason. nextMulti is critical, but we already fsync it with pg_control. We could delete the other multixact state data at every startup and set OldestVisibleMXactId accordingly.

You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid lookups, so I think something more sophisticated is needed to exercise that cost. Not sure what.

I don't think HEAP_XMAX_COMMITTED is much help, because committed != all-visible. HEAP_XMAX_INVALID will obviously help, when it happens.

True. The patch (see ResetMultiHintBit()) also replaces a multixact xmax with the updater xid when all transactions of the multixact have ended. You would need a test workload with long-running multixacts that delay such replacement. However, the workloads that come to mind are the very workloads for which this patch eliminates lock waits; they wouldn't illustrate a worst-case.

This isn't exactly a test case, but from Noah's previous comments I gather that there is a theoretical risk of mxid consumption running ahead of xid consumption. We should try to think about whether there are any realistic workloads where that might actually happen. I'm willing to believe that there aren't, but not just because somebody asserts it.
The reason I'm concerned about this is because, if it should happen, the result will be more frequent anti-wraparound vacuums on every table in the cluster. Those are already quite painful for some users.

Yes. Pre-release, what can we really do here other than have more people thinking about ways it might happen in practice? Post-release, we could suggest monitoring methods or perhaps have VACUUM emit a WARNING when a table is using more mxid space than xid space.

Well, post-release, the cat is out of the bag: we'll be stuck with this whether the performance characteristics are acceptable or not. That's why we'd better be as sure as possible before committing to this implementation that there's nothing we can't live with. It's not like there's any reasonable way to turn this off if you don't like it.

I disagree; we're only carving in stone the FOR KEY SHARE and FOR KEY UPDATE syntax additions. We could even avoid doing that by not documenting them. A later major release could implement them using a completely different mechanism or even reduce them to aliases, KEY SHARE = SHARE and KEY UPDATE = UPDATE. To be sure, let's still do a good job the first time.
Re: [HACKERS] foreign key locks, 2nd attempt
On Thu, Mar 15, 2012 at 08:37:36PM -0400, Robert Haas wrote: On Thu, Mar 15, 2012 at 5:07 PM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, Mar 14, 2012 at 5:23 PM, Robert Haas robertmh...@gmail.com wrote: You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid lookups, so I think something more sophisticated is needed to exercise that cost. Not sure what.

I don't think HEAP_XMAX_COMMITTED is much help, because committed != all-visible.

So because committed does not equal all visible there will be additional lookups on mxids? That's complete rubbish.

Noah seemed to be implying that once the updating transaction committed, HEAP_XMAX_COMMITTED would get set and save the mxid lookup. But I think that's not true, because anyone who looks at the tuple afterward will still need to know the exact xmax, to test it against their snapshot.

Yeah, my comment above was wrong. I agree that we'll need to retrieve the mxid members during every MVCC scan until we either mark the page all-visible or have occasion to simplify the mxid xmax to the updater xid.
Re: [HACKERS] pg_terminate_backend for same-role
On Thu, Mar 15, 2012 at 04:14:03PM -0700, Daniel Farina wrote: Parallel to pg_cancel_backend, it'd be nice to allow the user to just outright kill a backend that they own (politely, with a SIGTERM), aborting any transactions in progress, including the idle transaction, and closing the socket.

+1

I imagine the problem is a race condition whereby a pid might be reused by another process owned by another user (doesn't that also affect pg_cancel_backend?). Shall we just do everything using the MyCancelKey (which I think could just be called SessionKey, SessionSecret, or even just Session) as to ensure we have no case of mistaken identity? Or does that end up being problematic?

No, I think the hazard you identify here is orthogonal to the question of when to authorize pg_terminate_backend(). As you note downthread, protocol-level cancellations available in released versions already exhibit this hazard. I wouldn't mind a clean fix for this, but it's an independent subject.

Here I discussed a hazard specific to allowing pg_terminate_backend(): http://archives.postgresql.org/message-id/20110602045955.gc8...@tornado.gateway.2wire.net

To summarize, user code can trap SIGINT cancellations, but it cannot trap SIGTERM terminations. If a backend is executing a SECURITY DEFINER function when another backend of the same role calls pg_terminate_backend() thereon, the pg_terminate_backend() caller could achieve something he cannot achieve in PostgreSQL 9.1. I vote that this is an acceptable loss.

Thanks, nm
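To illustrate the trappable/untrappable distinction (the function name and body here are illustrative, not from the patch):

```sql
-- A cancel (SIGINT, e.g. from pg_cancel_backend) surfaces in PL/pgSQL as
-- the query_canceled condition, which the function can trap; a termination
-- (SIGTERM, from pg_terminate_backend) never reaches this handler.
CREATE FUNCTION guarded_work() RETURNS void
LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
    PERFORM pg_sleep(3600);    -- stand-in for privileged work
EXCEPTION WHEN query_canceled THEN
    RAISE NOTICE 'cancel trapped; SECURITY DEFINER code chooses its response';
END;
$$;
```

This is why same-role pg_terminate_backend() grants a capability that same-role cancellation does not: the SECURITY DEFINER code loses its chance to intercept the interruption.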
Re: [HACKERS] [PATCH] Support for foreign keys with arrays
[used followup EACH-foreign-key.v4b.patch.bz2 for review]

On Wed, Mar 14, 2012 at 07:03:08PM +0100, Marco Nenciarini wrote: please find attached v4 of the EACH Foreign Key patch (formerly known also as Foreign Key Array).

On Fri, Feb 24, 2012 at 09:01:35PM -0500, Noah Misch wrote: We can use multi-dimensional arrays as well as referencing columns. In that case though, ON DELETE EACH CASCADE will behave like ON DELETE EACH SET NULL. This is a safe way of implementing the action. We have some ideas on how to implement this, but we feel it is better to limit the behaviour for now.

This still feels awfully unprincipled to me. How about just throwing an error when we need to remove an element from a multidimensional array? Essentially, have ON DELETE EACH CASCADE downgrade to ON DELETE RESTRICT when it encounters a multidimensional array. That's strictly less code than what you have now, and it keeps our options open. We can always change from error to set-null later, but we can't so readily change from set-null to anything else.

That seems reasonable to me. Implemented and documented.

I like the semantics now. You implemented this by doing two scans per PK delete, the first to detect multidimensional dependants of the PK row and the second to fix 1-dimensional dependants. We don't require an index to support the scan, so this can mean two seq scans. Currently, the only benefit to doing it this way is a change of error message. Here is the current error message when we would need to remove a multidimensional array element:

ERROR: update or delete on table "pktableforarray" violates foreign key constraint "fktableforarray_ftest1_fkey" on table "fktableforarray"
DETAIL: Key (ptest1)=(2) is still referenced from table "fktableforarray".
If I just rip out the first scan, we get the same outcome with a different
error message:

ERROR:  removing elements from multidimensional arrays is not supported
CONTEXT:  SQL statement "UPDATE ONLY "public"."fktableforarray" SET "ftest1" = array_remove("ftest1", $1) WHERE $1 OPERATOR(pg_catalog.=) ANY ("ftest1")"

That has less polish, but I think it's actually more useful.  The first
message gives no indication that a multidimensional array foiled your ON
DELETE EACH CASCADE.  The second message hints at that cause.  I recommend
removing the new block of code in RI_FKey_eachcascade_del() and letting
array_remove() throw the error.  If you find a way to throw a nicer error
without an extra scan, by all means submit that to a future CF as an
improvement.  I don't think it's important enough to justify cycles at this
late hour of the current CF.

> > As I pondered this patch again, I came upon formal hazards around
> > non-default operator classes.  Today, ATAddForeignKeyConstraint()
> > always chooses pfeqop, ffeqop and ppeqop in the same operator family.
> > If it neglected this, we would get wrong behavior when one of the
> > operators is sensitive to a change to which another is insensitive.
> > For EACH FK constraints, this patch sets ffeqop = ARRAY_EQ_OP
> > unconditionally.  array_eq() uses the TYPECACHE_EQ_OPR (usually from
> > the default B-tree operator class).  That operator may not exist at
> > all, let alone share an operator family with pfeqop.  Shall we deal
> > with this by retrieving TYPECACHE_EQ_OPR in
> > ATAddForeignKeyConstraint() and rejecting the constraint creation if
> > it does not share an operator family with pfeqop?  The same hazard
> > affects ri_triggers.c use of array_remove() and array_replace(), and
> > the same change mitigates it.
> 
> Check added. As a consequence of this stricter policy, certain
> previously allowed cases are now unacceptable (e.g. pk FLOAT and fk
> INT[]). Regression tests have been added.

Ah, good.  Not so formal after all.

> > pg_constraint.confeach needs documentation.
> Most of pg_constraint columns, including all the boolean ones, are
> documented only in the description column of
> http://www.postgresql.org/docs/9.1/static/catalog-pg-constraint.html#AEN85760
> it seems that confiseach should not be an exception, since it just
> indicates whether the constraint is of a certain kind or not.

Your patch adds two columns to pg_constraint, confiseach and confeach, but
it only mentions confiseach in documentation.  Just add a similar minimal
mention of confeach.

*** 5736,5741 **** ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5759,5807 ----
  				(errcode(ERRCODE_INVALID_FOREIGN_KEY),
  				 errmsg("number of referencing and referenced columns for foreign key disagree")));
  
+ 	/* Enforce each foreign key restrictions */
+ 	if (fkconstraint->fk_is_each)
+ 	{
+ 		/*
+ 		 * ON UPDATE CASCADE action is not supported on EACH foreign keys
+ 		 */
+ 		if (fkconstraint->fk_upd_action
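Returning to the array_remove() question above: a quick sketch of the
behavior I'm recommending we lean on, using the patch's array_remove()
function (not in core as of this writing, so this is my understanding of its
semantics rather than tested output):

```sql
-- One-dimensional arrays: the element is removed.
SELECT array_remove(ARRAY[1,2,3], 2);
-- expected: {1,3}

-- Multidimensional arrays: the function raises the error quoted earlier,
-- which is how ON DELETE EACH CASCADE would degrade to RESTRICT-like
-- behavior without the extra detection scan.
SELECT array_remove(ARRAY[[1,2],[3,4]], 2);
-- ERROR:  removing elements from multidimensional arrays is not supported
```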
Re: [HACKERS] pg_terminate_backend for same-role
On Fri, Mar 16, 2012 at 04:42:07PM -0700, Daniel Farina wrote:
> On Fri, Mar 16, 2012 at 3:42 PM, Noah Misch n...@leadboat.com wrote:
> > On Thu, Mar 15, 2012 at 04:14:03PM -0700, Daniel Farina wrote:
> > > I imagine the problem is a race condition whereby a pid might be
> > > reused by another process owned by another user (doesn't that also
> > > affect pg_cancel_backend?).  Shall we just do everything using the
> > > MyCancelKey (which I think could just be called SessionKey,
> > > SessionSecret, or even just Session) as to ensure we have no case
> > > of mistaken identity? Or does that end up being problematic?
> > 
> > No, I think the hazard you identify here is orthogonal to the
> > question of when to authorize pg_terminate_backend().  As you note
> > downthread, protocol-level cancellations available in released
> > versions already exhibit this hazard.  I wouldn't mind a clean fix
> > for this, but it's an independent subject.
> 
> Hmm. Well, here's a patch that implements exactly that, I think,
> similar to the one posted to this thread, but not using BackendIds,
> but rather the newly-introduced SessionId. Would appreciate comments.
> Because an out-of-band signaling method has been integrated more
> complex behaviors -- such as closing the
> TERM-against-SECURITY-DEFINER-FUNCTION hazard -- can be addressed.
> For now I've only attempted to solve the problem of backend ambiguity,
> which basically necessitated out-of-line information transfer as per
> the usual means.

This patch still changes the policy for pg_terminate_backend(), and it does
not fix other SIGINT senders like processCancelRequest() and ProcSleep().
If you're concerned about PID-reuse races, audit all backend signalling.
Either fix all such problems or propose a plan to get there eventually.
Any further discussion of this topic needs a new subject line; mixing its
consideration with proposals to change the policy behind
pg_terminate_backend() reduces the chances of the right people commenting
on these distinct questions.
Currently, when pg_terminate_backend() follows a pg_cancel_backend() on
which the target has yet to act, the eventual outcome is a terminated
process.  With this patch, the pg_terminate_backend() becomes a no-op with
this warning:

! 	ereport(WARNING,
! 			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 			 (errmsg("process is busy responding to administrative "
! 					 "request")),
! 			 (errhint("This is temporary, and may be retried."))));

That's less useful than the current behavior.

That being said, I can't get too excited about closing PID-reuse races.
I've yet to see another program do so.  I've never seen a trouble report
around this race for any software.  Every OS I have used assigns PIDs so as
to maximize the reuse interval, which seems like an important POLA measure
given typical admin formulae like kill `ps | grep ...`.  This defense can
only matter in fork-bomb conditions, at which point a stray signal is
minor.  I do think it's worth keeping this idea in a back pocket for
achieving those more complex behaviors, should we ever desire them.

> > Here I discussed a hazard specific to allowing pg_terminate_backend():
> > http://archives.postgresql.org/message-id/20110602045955.gc8...@tornado.gateway.2wire.net
> > To summarize, user code can trap SIGINT cancellations, but it cannot
> > trap SIGTERM terminations.  If a backend is executing a SECURITY
> > DEFINER function when another backend of the same role calls
> > pg_terminate_backend() thereon, the pg_terminate_backend() caller
> > could achieve something he cannot achieve in PostgreSQL 9.1.  I vote
> > that this is an acceptable loss.
> 
> I'll throw out a patch that just lets this hazard exist and see what
> happens, although it is obsoleted/incompatible with the one already
> attached.

+1.  Has anyone actually said that the PID-reuse race or the thing I
mention above should block such a patch?  I poked back through the threads
I could remember and found nothing.
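To make the trap-the-cancel point concrete, here is a minimal sketch
(function name and notice text invented for illustration): PL/pgSQL can
catch the query_canceled condition that a SIGINT cancellation raises,
whereas a SIGTERM termination ends the backend without ever entering the
EXCEPTION block:

```sql
CREATE FUNCTION guarded_work() RETURNS void
LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
    PERFORM pg_sleep(600);   -- stand-in for privileged work
EXCEPTION
    WHEN query_canceled THEN
        -- Reached on pg_cancel_backend() (SIGINT); the function chooses
        -- how to respond.  pg_terminate_backend() (SIGTERM) never gets
        -- here, which is the behavior difference at issue.
        RAISE NOTICE 'cancel trapped';
END;
$$;
```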