[HACKERS] Review of SQLDA support for ECPG
I took a look at 2-pg85-sqlda-10-ctxdiff.patch.

Starting from CVS HEAD of roughly 2009-10-03 05:00 UTC, prerequisite patches
1a-1h applied cleanly.  2-pg85-sqlda hit a trivial whitespace reject in
ecpg.trailer along with a more substantive reject at ecpg.addons:407
(FetchStmtMOVEfetch_args).  Fixing it up by hand leads to my first question -
why did the transition from `opt_ecpg_into' to `opt_ecpg_fetch_into' affect
FETCH FORWARD and not FETCH BACKWARD?

The main test suite acquired no regressions, but I get failures in two tests
of the ecpg test suite (make -C src/interfaces/ecpg/test check).  I attach
regression.{out,diff} and postmaster.log from the test run.  The abort in
sqlda.pgc looks like the interesting failure, but I exhausted time with which
to dig into it further.  preproc/cursor.pgc creates (and ideally drops) the
same table `t1' as compat_informix/{sqlda,cursor}.pgc, so a crash in either
of the others makes it fail too.  Could they all use temp tables, use
different table names, or `DROP TABLE IF EXISTS t1' first?

Do those logs suggest the cause of the sqlda.pgc failure?  If not, I will
look into it further.  Otherwise, I'm happy to review a future iteration of
the patch.

As a side note, with patch 1* but not patch 2, test_informix entered an
infinite loop with this error:

ERROR:  syntax error at or near "c" at character 15
STATEMENT:  fetch forward c

Thank you,
nm

test compat_informix/dec_test ... ok
test compat_informix/charfuncs ... ok
test compat_informix/rfmtdate ... ok
test compat_informix/rfmtlong ... ok
test compat_informix/rnull ... ok
test compat_informix/cursor ... ok
test compat_informix/sqlda ... FAILED (test process was terminated by signal 6: Aborted)
test compat_informix/test_informix ... ok
test compat_informix/test_informix2 ... ok
test connect/test2 ... ok
test connect/test3 ... ok
test connect/test4 ... ok
test connect/test5 ... ok
test pgtypeslib/dt_test ... ok
test pgtypeslib/dt_test2 ... ok
test pgtypeslib/num_test ... ok
test pgtypeslib/num_test2 ... ok
test preproc/array_of_struct ... ok
test preproc/autoprep ... ok
test preproc/comment ... ok
test preproc/cursor ... FAILED (test process exited with exit code 1)
test preproc/define ... ok
test preproc/init ... ok
test preproc/strings ... ok
test preproc/type ... ok
test preproc/variable ... ok
test preproc/whenever ... ok
test sql/array ... ok
test sql/binary ... ok
test sql/code100 ... ok
test sql/copystdout ... ok
test sql/define ... ok
test sql/desc ... ok
test sql/dynalloc ... ok
test sql/dynalloc2 ... ok
test sql/dyntest ... ok
test sql/execute ... ok
test sql/fetch ... ok
test sql/func ... ok
test sql/indicators ... ok
test sql/oldexec ... ok
test sql/quote ... ok
test sql/show ... ok
test sql/insupd ... ok
test sql/parser ... ok
test thread/thread ... ok
test thread/thread_implicit ... ok
test thread/prep ... ok
test thread/alloc ... ok
test thread/descriptor ... ok

*** /home/nm/src/pg/bd-sqlda/src/interfaces/ecpg/test/expected/compat_informix-sqlda.stdout	2009-10-03 04:23:59.0 -0400
--- /home/nm/src/pg/bd-sqlda/src/interfaces/ecpg/test/results/compat_informix-sqlda.stdout	2009-10-04 01:52:52.0 -0400
***************
*** 1,60 ****
- FETCH RECORD 1
- name sqlda descriptor: 'id' value 1
- name sqlda descriptor: 't' value 'a'
- name sqlda descriptor: 'd1' value DECIMAL '1.0'
- name sqlda descriptor: 'd2' value 1.00
- name sqlda descriptor: 'c' value 'a '
- FETCH RECORD 2
- name sqlda descriptor: 'id' value 2
- name sqlda descriptor: 't' value NULL'
- name sqlda descriptor: 'd1' value NULL'
- name sqlda descriptor: 'd2' value NULL'
- name sqlda descriptor: 'c' value NULL'
- FETCH RECORD 3
- name sqlda descriptor: 'id' value 3
- name sqlda descriptor: 't' value 'c'
- name sqlda descriptor: 'd1' value DECIMAL '-3'
- name sqlda descriptor: 'd2' value nan
- name sqlda descriptor: 'c' value 'c '
- FETCH RECORD 4
- name sqlda descriptor: 'id' value 4
- name sqlda descriptor: 't' value 'd'
- name sqlda descriptor: 'd1' value DECIMAL '4.0'
- name sqlda descriptor: 'd2' value 4.00
- name sqlda descriptor: 'c' value 'd '
- FETCH RECORD 1
- name sqlda descriptor: 'id' value 1
- name sqlda descriptor: 't' value 'a'
- name sqlda descriptor: 'd1' value DECIMAL '1.0'
- name sqlda descriptor: 'd2' value 1.00
- name sqlda descriptor: 'c' value 'a '
- FETCH RECORD 2
- name sqlda descriptor: 'id' value 2
- name sqlda descriptor: 't' value NULL'
- name sqlda descriptor: 'd1' value NULL'
- name sqlda descriptor: 'd2' value NULL'
- name sqlda descriptor: 'c' value NULL'
- FETCH RECORD 3
- name sqlda descriptor: 'id' value 3
- name sqlda descriptor: 't' value 'c'
- name sqlda descriptor:
Re: [HACKERS] Review of SQLDA support for ECPG
On Mon, Oct 05, 2009 at 02:23:57PM +0200, Boszormenyi Zoltan wrote:
> Noah Misch írta:
> I will post a new patch for SQLDA and for all others that need updating.

Thanks; that one does apply cleanly.

> > The main test suite acquired no regressions, but I get failures in two
> > tests of the ecpg test suite (make -C src/interfaces/ecpg/test check).
> No, this is not a real error.  I have run into this as well, which is
> quickly solved if you execute "make install" before running "make check"
> under the ecpg directory.  I guess you already have an installed PG tree
> at the same prefix as where you have pointed the new one with the SQLDA
> patch applied.  Another solution may be to use a different prefix for the
> SQLDA source tree.

This was exactly the problem; with an unoccupied prefix, all the tests do
pass.

> > As a side note, with patch 1* but not patch 2, test_informix entered an
> > infinite loop with this error:
> > ERROR:  syntax error at or near "c" at character 15
> > STATEMENT:  fetch forward c
> Do you mean that you applied all the split-up patches posted for the
> dynamic cursorname extension?  I didn't get this error.  What did you do
> exactly?

Having started over from a clean base tree, I can no longer reproduce this.
Another operator error, no doubt.  All tests now pass here with 1a-1h of
"Dynamic cursor support for ECPG" alone, with "SQLDA support for ECPG" also
applied, and with "DESCRIBE [OUTPUT] support for ECPG" additionally.

I will report on the sqlda patch in more detail on 2009-10-10.  The one
concern that's clear now is a lack of documentation update.

Thank you,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] *_collapse_limit, geqo_threshold
On Tue, Jul 07, 2009 at 09:31:14AM -0500, Kevin Grittner wrote:
> I don't remember any clear resolution to the wild variations in plan time
> mentioned here:
> http://archives.postgresql.org/pgsql-hackers/2009-06/msg00743.php
> I think it would be prudent to try to figure out why small changes in the
> query caused the large changes in the plan times Andres was seeing.  Has
> anyone else ever seen such behavior?  Can we get examples?  (It should be
> enough to get the statistics and the schema, since this is about planning
> time, not run time.)

With joins between statistically indistinguishable columns, I see planning
times change by a factor of ~4 for each join added or removed (postgres 8.3).
Varying join_collapse_limit in the neighborhood of the actual number of
joins has a similar effect.  See attachment with annotated timings.  The
example uses a single table joined to itself, but using distinct tables with
identical contents yields the same figures.

The exponential factor seems smaller for real queries.  I have a query of
sixteen joins that takes 71s to plan deterministically; it looks like this:

SELECT 1 FROM fact
  JOIN dim0 ... JOIN dim6
  JOIN t t0 ON fact.key = t.key AND t.x = MCV0
  LEFT JOIN t t1 ON fact.key = t.key AND t.x = MCV1
  JOIN t t2 ON fact.key = t.key AND t.x = MCV2
  LEFT JOIN t t3 ON fact.key = t.key AND t.x = NON-MCV0
  LEFT JOIN t t4 ON fact.key = t.key AND t.x = NON-MCV1
  LEFT JOIN t t5 ON fact.key = t.key AND t.x = NON-MCV2
  LEFT JOIN t t6 ON fact.key = t.key AND t.x = NON-MCV3
  LEFT JOIN t t7 ON fact.key = t.key AND t.x = NON-MCV4

For the real query, removing one join drops plan time to 26s, and removing
two drops the time to 11s.  I don't have a good theory for the multiplier
changing from 4 for the trivial demonstration to ~2.5 for this real query.
Re-enabling geqo drops plan time to .5s.  These tests used
default_statistics_target = 1000, but dropping that to 100 does not change
anything dramatically.

> I guess the question is whether there is anyone who has had a contrary
> experience.  (There must have been some benchmarks to justify adding geqo
> at some point?)

I have queries with a few more joins (19-21), and I cancelled attempts to
plan them deterministically after 600+ seconds and 10+ GiB of memory usage.
Even with geqo_effort = 10, they plan within 5-15s with good results.

All that being said, I've never encountered a situation where a value other
than 1 or inf for *_collapse_limit appeared optimal.

nm

SET geqo = off;
SET join_collapse_limit = 100;
CREATE TEMP TABLE t AS SELECT * FROM generate_series(1, 1000) f(n);
ANALYZE t;

--- Vary join count

-- 242.4s
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10 NATURAL JOIN t t11 NATURAL JOIN t t12;
-- 31.2s
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10 NATURAL JOIN t t11;
-- 8.1s
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10;
-- 2.0s
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09;
-- 0.5s
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08;

--- Vary join_collapse_limit

-- 8.1s
SET join_collapse_limit = 100;
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10;
-- 8.0s
SET join_collapse_limit = 11;
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10;
-- 2.2s
SET join_collapse_limit = 10;
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10;
-- 0.5s
SET join_collapse_limit = 9;
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t t06
  NATURAL JOIN t t07 NATURAL JOIN t t08 NATURAL JOIN t t09
  NATURAL JOIN t t10;
-- 0.1s
SET join_collapse_limit = 8;
EXPLAIN SELECT 1 FROM t t00
  NATURAL JOIN t t01 NATURAL JOIN t t02 NATURAL JOIN t t03
  NATURAL JOIN t t04 NATURAL JOIN t t05 NATURAL JOIN t
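As a cross-check on the "factor of ~4 per join" claim, the timings quoted
above can be run through a quick sketch (plain Python; it uses only the
figures reported in this message, nothing else):

```python
# Planning times (seconds) quoted above for 9 through 13 relations,
# with geqo off and join_collapse_limit = 100.
times = [0.5, 2.0, 8.1, 31.2, 242.4]

# Ratio of plan time between consecutive join counts.
ratios = [b / a for a, b in zip(times, times[1:])]
print(ratios)

# The first three steps sit right at 4x; the largest case grows faster
# (about 7.8x), hinting the exponent steepens as memory pressure mounts.
assert all(3.5 < r < 8 for r in ratios)
```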
Re: [HACKERS] *_collapse_limit, geqo_threshold
On Wed, Jul 08, 2009 at 05:23:16PM -0400, Tom Lane wrote:
> Noah Misch <n...@leadboat.com> writes:
> > The exponential factor seems smaller for real queries.  I have a query
> > of sixteen joins that takes 71s to plan deterministically; it looks like
> > this:
> > SELECT 1 FROM fact
> >   JOIN dim0 ... JOIN dim6
> >   JOIN t t0 ON fact.key = t.key AND t.x = MCV0
> >   LEFT JOIN t t1 ON fact.key = t.key AND t.x = MCV1
> >   JOIN t t2 ON fact.key = t.key AND t.x = MCV2
> >   LEFT JOIN t t3 ON fact.key = t.key AND t.x = NON-MCV0
> >   LEFT JOIN t t4 ON fact.key = t.key AND t.x = NON-MCV1
> >   LEFT JOIN t t5 ON fact.key = t.key AND t.x = NON-MCV2
> >   LEFT JOIN t t6 ON fact.key = t.key AND t.x = NON-MCV3
> >   LEFT JOIN t t7 ON fact.key = t.key AND t.x = NON-MCV4
> 
> I'm confused here --- I think you must have over-anonymized your query.
> Surely the ON conditions for the left joins should be referencing t3, t4,
> etc?

Yes.  Aside from that error, I picked the wrong features to emphasize and
gloss over.  Only two relations (amidst `dim0 ... dim6') resemble
dimensions, having an implied relative position on the plan tree.  The other
fourteen relations share a join key.  That explains the distinctive plan
cost; this query resembles the utterly unconstrained artificial query to a
larger degree than most.

> > For the real query, removing one join drops plan time to 26s, and
> > removing two drops the time to 11s.  I don't have a good theory for the
> > multiplier changing from 4 for the trivial demonstration to ~2.5 for
> > this real query.
> 
> The rule of thumb that says that an n-way join requires 2^n work is only
> true if we consider every single combination of possible joins, which
> normally we don't.  The planner prefers join paths that follow join
> clauses, and will only consider clauseless joins when it has no other
> choice.  I believe that real queries tend to be pretty non-flat in this
> space and so the number of join paths to consider is a lot less than 2^n.
> Your synthesized query, on the other hand, allows any relation to be
> joined to any other --- it might not look that way, but after creating
> derived equalities there will be a potential join clause linking every
> relation to every other one.  So I think you were testing the worst case,
> and I'm not surprised that more-typical queries would show a slower growth
> curve.

Describing it in those terms illuminates much.  While the concepts do
suggest 2^N worst-case planning cost, my artificial test case showed a rigid
4^N pattern; what could explain that?

Thanks,
nm
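For what it's worth, part of the exponent can be approached combinatorially.
The sketch below (plain Python, a toy count rather than PostgreSQL's actual
join search, which builds and prunes paths per pair) counts the
dynamic-programming subproblems for a fully-connected join graph:

```python
from math import comb

def dp_pairs(n):
    """Count the (left, right) sub-join pairs a dynamic-programming join
    search examines when every relation is joinable to every other: for
    each subset S of the n relations with |S| >= 2, every unordered split
    of S into two nonempty halves."""
    return sum(comb(n, k) * (2 ** k - 2) // 2 for k in range(2, n + 1))

# The subproblem count grows by roughly 3x per added relation (it is
# dominated by a 3^n term); per-pair path construction and costing would
# then have to supply the rest of an observed 4^N growth rate.
for n in range(8, 14):
    print(n, dp_pairs(n))
```

With two relations there is exactly one pair, with three there are six, and
the ratio between consecutive counts tends toward 3 as n grows.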
Re: [HACKERS] *_collapse_limit, geqo_threshold
On Wed, Jul 08, 2009 at 10:28:02AM -0500, Robert Haas wrote:
> On Jul 8, 2009, at 8:43 AM, Noah Misch <n...@leadboat.com> wrote:
> > The exponential factor seems smaller for real queries.  I have a query
> > of sixteen joins that takes 71s to plan deterministically; it looks like
> > this:
> > SELECT 1 FROM fact
> >   JOIN dim0 ... JOIN dim6
> >   JOIN t t0 ON fact.key = t.key AND t.x = MCV0
> >   LEFT JOIN t t1 ON fact.key = t.key AND t.x = MCV1
> >   JOIN t t2 ON fact.key = t.key AND t.x = MCV2
> >   LEFT JOIN t t3 ON fact.key = t.key AND t.x = NON-MCV0
> >   LEFT JOIN t t4 ON fact.key = t.key AND t.x = NON-MCV1
> >   LEFT JOIN t t5 ON fact.key = t.key AND t.x = NON-MCV2
> >   LEFT JOIN t t6 ON fact.key = t.key AND t.x = NON-MCV3
> >   LEFT JOIN t t7 ON fact.key = t.key AND t.x = NON-MCV4
> 
> Very interesting!  I am guessing that the problem here is that all the
> inner joins commute, as do all the left joins, so there are a lot of
> possibilities to explore.  I also suspect that the actual join order
> doesn't matter much, so it's a good candidate for GEQO.  Even if you had
> some restriction clauses against your dimension/t tables, that would
> probably just mean that you want to do those joins first, and the rest
> afterwards, which still seems like it ought to be OK for GEQO.  But, in
> many queries this size, the join order is more constrained.  Some of the
> later joins depend on the tables added by the earlier ones, rather than
> all referring back to the base table.
> 
> Is there some way we could estimate the number of join offerings we'll
> have to consider relatively cheaply?  That might be a better metric than
> # of tables.

Observing the pathological case taking plan time proportional to 4^N, apply
geqo_threshold as "use GEQO when comparing more than geqo_threshold * 4^N
join order possibilities"?  I'm not sure whether that figure is available
(early enough) to factor into the decision.  Such a change would probably
imply a lower default geqo_threshold, around 9 to 11.  The number of typical
queries subject to GEQO would nonetheless decrease.

nm
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Mon, Dec 19, 2011 at 11:13:57PM -0500, Tom Lane wrote:
> Noah Misch <n...@leadboat.com> writes:
> > I created a function that does this in a loop:
> > 	HeapTuple	t;
> > 	CatalogCacheFlushCatalog(ProcedureRelationId);
> > 	t = SearchSysCache1(PROCOID, ObjectIdGetDatum(42) /* int4in */);
> > 	if (!HeapTupleIsValid(t))
> > 		elog(ERROR, "cache lookup failed for function 42");
> > 	ReleaseSysCache(t);
> 
> ... but this performance test seems to me to be entirely misguided,
> because it's testing a situation that isn't going to occur much in the
> field, precisely because the syscache should prevent constant reloads of
> the same syscache entry.
> [ideas for more-realistic tests]

Granted, but I don't hope to reliably measure a change in a macro-benchmark
after seeing a rickety 2% change in a micro-benchmark.
Re: [HACKERS] RangeVarGetRelid()
On Mon, Dec 19, 2011 at 11:52:54PM -0500, Robert Haas wrote:
> After staring at this for quite a while longer, it seemed to me that the
> logic for renaming a relation was similar enough to the logic for changing
> a schema that the two callbacks could reasonably be combined using a bit
> of conditional logic; and that, further, the same callback could be used,
> with a small amount of additional modification, for ALTER TABLE.  Here's a
> patch to do that.

Nice.

> I also notice that cluster() - which doesn't have a callback - has exactly
> the same needs as ReindexRelation() - which does.  So that case can
> certainly share code; though I'm not quite sure what to call the shared
> callback, or which file to put it in.  RangeVarCallbackForStorageRewrite?

I'd put it in tablecmds.c and name it RangeVarCallbackOwnsTable.

A few things on the patch:

--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2560,90 +2500,26 @@ CheckTableNotInUse(Relation rel, const char *stmt)
  * Thanks to the magic of MVCC, an error anywhere along the way rolls back
  * the whole operation; we don't have to do anything special to clean up.
  *
- * We lock the table as the first action, with an appropriate lock level
+ * The caller must lock the relation, with an appropriate lock level
  * for the subcommands requested.  Any subcommand that needs to rewrite
  * tuples in the table forces the whole command to be executed with
- * AccessExclusiveLock.  If all subcommands do not require rewrite table
- * then we may be able to use lower lock levels.  We pass the lock level down
+ * AccessExclusiveLock (actually, that is currently required always, but
+ * we hope to relax it at some point).  We pass the lock level down
  * so that we can apply it recursively to inherited tables.  Note that the
- * lock level we want as we recurse may well be higher than required for
+ * lock level we want as we recurse might well be higher than required for
  * that specific subcommand.  So we pass down the overall lock requirement,
  * rather than reassess it at lower levels.
  */
 void
-AlterTable(AlterTableStmt *stmt)
+AlterTable(Oid relid, LOCKMODE lockmode, AlterTableStmt *stmt)
 {
 	Relation	rel;
-	LOCKMODE	lockmode = AlterTableGetLockLevel(stmt->cmds);
 
-	/*
-	 * Acquire same level of lock as already acquired during parsing.
-	 */
-	rel = relation_openrv(stmt->relation, lockmode);
+	/* Caller is required to provide an adequate lock. */
+	rel = relation_open(relid, NoLock);
 
 	CheckTableNotInUse(rel, "ALTER TABLE");
 
-	/* Check relation type against type specified in the ALTER command */
-	switch (stmt->relkind)
-	{
-		case OBJECT_TABLE:
-
-			/*
-			 * For mostly-historical reasons, we allow ALTER TABLE to apply to
-			 * almost all relation types.
-			 */
-			if (rel->rd_rel->relkind == RELKIND_COMPOSITE_TYPE
-				|| rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
-				ereport(ERROR,
-						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-						 errmsg("\"%s\" is not a table",
-								RelationGetRelationName(rel))));

RangeVarCallbackForAlterRelation() does not preserve ALTER TABLE's refusal
to operate on foreign tables.

-			break;
-
-		case OBJECT_INDEX:
-			if (rel->rd_rel->relkind != RELKIND_INDEX)
-				ereport(ERROR,
-						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-						 errmsg("\"%s\" is not an index",
-								RelationGetRelationName(rel))));
-			break;
-
-		case OBJECT_SEQUENCE:
-			if (rel->rd_rel->relkind != RELKIND_SEQUENCE)
-				ereport(ERROR,
-						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-						 errmsg("\"%s\" is not a sequence",
-								RelationGetRelationName(rel))));
-			break;
-
-		case OBJECT_TYPE:
-			if (rel->rd_rel->relkind != RELKIND_COMPOSITE_TYPE)
-				ereport(ERROR,
-						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-						 errmsg("\"%s\" is not a composite type",
-								RelationGetRelationName(rel))));
-			break;
-
-		case OBJECT_VIEW:
-			if (rel->rd_rel->relkind
Re: [HACKERS] RangeVarGetRelid()
On Wed, Dec 21, 2011 at 03:16:39PM -0500, Robert Haas wrote:
> On Tue, Dec 20, 2011 at 8:14 PM, Noah Misch <n...@leadboat.com> wrote:
> > RangeVarCallbackForAlterRelation() does not preserve ALTER TABLE's
> > refusal to operate on foreign tables.
> 
> I should probably fix that, but I'm wondering if I ought to fix it by
> disallowing the use of ALTER TABLE on FOREIGN TABLEs for the other
> commands that share the callback as well.  Allowing ALTER TABLE to apply
> to any relation type is mostly a legacy thing, I think, and any code
> that's new enough to know about foreign tables isn't old enough to know
> about the time when you had to use ALTER TABLE to rename views.

Maybe.  Now that we have a release with these semantics, I'd lean toward
preserving the wart and being more careful next time.  It's certainly a
borderline case, though.

> > RangeVarCallbackForAlterRelation() does not preserve the check for
> > unexpected object types.
> 
> I don't feel a strong need to retain that.

Okay.

> > utility.c doesn't take locks for any other command; parse analysis
> > usually does that.  To preserve that modularity, you could add a bool
> > toplevel argument to transformAlterTableStmt().  Pass true here, false
> > in ATPostAlterTypeParse().  If true, use AlterTableLookupRelation() to
> > get full security checks.  Otherwise, just call relation_openrv() as
> > now.  Would that be an improvement?
> 
> Not sure.  I feel that it's unwise to pass relation names all over the
> backend and assume that nothing will change meanwhile; no locking we do
> will prevent that, at least in the case of search path interposition.
> Ultimately I think this ought to be restructured somehow so that we look
> up each name ONCE and ever-after refer only to the resulting OID (except
> for error message text).  But I'm not sure how to do that, and thought it
> might make sense to commit this much independently of such a refactoring.

I agree with all that, though my suggestion would not have increased the
number of by-name lookups.
Re: [HACKERS] Collect frequency statistics for arrays
On Tue, Dec 20, 2011 at 04:37:37PM +0400, Alexander Korotkov wrote:
> On Wed, Nov 16, 2011 at 1:43 AM, Nathan Boley <npbo...@gmail.com> wrote:
> > FYI, I've added myself as the reviewer for the current commitfest.
> 
> How is going review now?

I will examine this patch within the week.
Re: [HACKERS] 16-bit page checksums for 9.2
On Thu, Dec 29, 2011 at 11:08:43AM -0600, Kevin Grittner wrote:
> Heikki Linnakangas wrote:
> > Simon Riggs wrote:
> > > OK, then we are talking at cross purposes.  Double write buffers, in
> > > the way you explain them allow us to remove full page writes.  They
> > > clearly don't do anything to check page validity on read.  Torn pages
> > > are not the only fault we wish to correct against... and the double
> > > writes idea is orthogonal to the idea of checksums.
> > 
> > The reason we're talking about double write buffers in this thread is
> > that double write buffers can be used to solve the problem with hint
> > bits and checksums.
> 
> Of course, it will be a big plus if we can roll this out in 9.2 in
> conjunction with a double-write feature.  Not only will double-write
> probably be a bit faster than full_page_writes in the WAL log, but it
> will allow protection against torn pages on hint-bit-only writes without
> adding those writes to the WAL or doing any major rearrangement of where
> they sit that would break pg_upgrade.

[Thanks for your recent thread summaries.]

A double-write buffer, like a WAL-logged full-page image, is a technique for
performing atomic writes wider than those automatically provided by
components further down the storage stack.  The two strategies have
different performance characteristics, and we're told that a double-write
buffer would better serve us overall.  However, its benefits would not be
*greater* for hint-only writes than for any other write.  For that reason, I
think we should consider these changes independently.  With page checksums
enabled, remove the hazard of torn hint-only writes by ensuring that a WAL
FPI has flushed since the last checkpoint.  When necessary, emit an FPI-only
record.  Separately, optimize first-since-checkpoint writes by replacing
FPIs with double-write buffers.  The double-write patch will reduce the
added WAL of the checksum/safe-hint-updates patch to zero.  If the
double-writes patch founders, we'll just have more-costly, yet equally
reliable, page checksums.
nm
[HACKERS] get_fn_expr_argtype() vs. internal calls
We document that a polymorphic C-language function may identify the concrete
data type of each argument using calls to get_fn_expr_argtype().  That
relies on FmgrInfo.fn_expr, which only the executor sets.  Calls of internal
origin, by way of {Direct,,Oid}FunctionCall*(), don't cons up an fn_expr, so
get_fn_expr_argtype() just returns InvalidOid every time.  (Indeed, we
couldn't easily do better in many cases.)  To what extent is it safe to rely
on this situation remaining as it is?

I ask on account of some second thoughts I had about CheckIndexCompatible().
When writing it, I did not explicitly consider operator classes having
polymorphic opcintype.  If get_fn_expr_argtype() were to work in a function
called from the btree search code, CheckIndexCompatible() should impose
stricter checks on indexes having opclasses of polymorphic opcintype.  If
that's not too likely to happen, I might just add a comment instead.

Thanks,
nm
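The mechanism can be restated with a toy model (plain Python, not backend
code; the dict standing in for fn_expr and the FmgrInfo class here are
invented stand-ins, though InvalidOid = 0 and OID 23 = int4 match the real
catalogs): get_fn_expr_argtype() can only report what fn_expr carries, and
internal call sites never supply one.

```python
InvalidOid = 0  # PostgreSQL's invalid OID really is zero

class FmgrInfo:
    """Toy stand-in for the fmgr call-info struct."""
    def __init__(self, fn_expr=None):
        # The executor fills fn_expr from the parse tree; internal
        # {Direct,,Oid}FunctionCall*() paths leave it unset.
        self.fn_expr = fn_expr

def get_fn_expr_argtype(flinfo, argnum):
    """Report the concrete argument type, if the call site recorded one."""
    if flinfo.fn_expr is None:
        return InvalidOid
    return flinfo.fn_expr["argtypes"][argnum]

# Executor-originated call: the argument's concrete type is recoverable.
executor_call = FmgrInfo(fn_expr={"argtypes": [23]})
assert get_fn_expr_argtype(executor_call, 0) == 23

# Internal call: no fn_expr, so only InvalidOid comes back.
internal_call = FmgrInfo()
assert get_fn_expr_argtype(internal_call, 0) == InvalidOid
```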
[HACKERS] Add protransform for numeric, varbit, and temporal types
Building on commit 8f9fe6edce358f7904e0db119416b4d1080a83aa, this adds
protransform functions to the length coercions for numeric, varbit,
timestamp, timestamptz, time, timetz and interval.  This mostly serves to
make more ALTER TABLE ALTER TYPE operations avoid a rewrite, including
numeric(10,2) -> numeric(12,2), varbit(4) -> varbit(8) and timestamptz(2) ->
timestamptz(4).

The rules for varbit are exactly the same as for varchar.  Numeric is
slightly more complex:

 * Flatten calls to our length coercion function that solely represent
 * increases in allowable precision.  Scale changes mutate every datum, so
 * they are unoptimizable.  Some values, e.g. 1E-1001, can only fit into an
 * unconstrained numeric, so a change from an unconstrained numeric to any
 * constrained numeric is also unoptimizable.

time{,stamp}{,tz} are similar to varchar for these purposes, except that,
for example, plain timestamptz is equivalent to timestamptz(6).  interval
has a vastly different typmod format, but the principles applicable to
length coercion remain the same.

Under --disable-integer-datetimes, I'm not positive that timestamp_scale()
is always a no-op when one would logically expect as much.  Does there exist
a timestamp such that v::timestamp(2) differs from
v::timestamp(2)::timestamp(4) due to floating point rounding?  Even if so,
I'm fairly comfortable calling it a feature rather than a bug to avoid
perturbing values that way.

After these patches, the only core length coercion casts not having
protransform functions are those for bpchar and bit.  For those, we could
only optimize trivial cases of no length change.  I'm not planning to do so.
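The numeric rule above can be sketched as a small decision model (plain
Python, a hedged illustration only - not the patch's C code; the typmod
packing shown does follow numeric's actual ((precision << 16) | scale) +
VARHDRSZ convention, with a negative typmod meaning unconstrained):

```python
VARHDRSZ = 4  # numeric typmods carry this offset, as in the real encoding

def numeric_typmod(precision, scale):
    """Pack (precision, scale) the way numeric's typmod does."""
    return ((precision << 16) | scale) + VARHDRSZ

def unpack(typmod):
    t = typmod - VARHDRSZ
    return t >> 16, t & 0xFFFF

def coercion_is_noop(old_typmod, new_typmod):
    """Can the length-coercion call be flattened (no table rewrite)?"""
    if new_typmod < 0:      # target unconstrained: any numeric fits
        return True
    if old_typmod < 0:      # unconstrained source may hold e.g. 1E-1001
        return False
    old_p, old_s = unpack(old_typmod)
    new_p, new_s = unpack(new_typmod)
    # Scale changes mutate every datum; only pure precision increases
    # are safe to skip.
    return new_s == old_s and new_p >= old_p

# numeric(10,2) -> numeric(12,2): avoids a rewrite
assert coercion_is_noop(numeric_typmod(10, 2), numeric_typmod(12, 2))
# numeric(12,2) -> numeric(12,3): scale change forces a rewrite
assert not coercion_is_noop(numeric_typmod(12, 2), numeric_typmod(12, 3))
# unconstrained numeric -> numeric(12,2): forces a rewrite
assert not coercion_is_noop(-1, numeric_typmod(12, 2))
```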
Thanks,
nm

diff --git a/src/backend/utils/adt/varbit.c b/src/backend/utils/adt/varbit.c
index 3fa8117..e5a03b9 100644
*** a/src/backend/utils/adt/varbit.c
--- b/src/backend/utils/adt/varbit.c
***************
*** 18,23 ****
--- 18,25 ----
  #include "access/htup.h"
  #include "libpq/pqformat.h"
+ #include "nodes/nodeFuncs.h"
+ #include "parser/parse_clause.h"
  #include "utils/array.h"
  #include "utils/varbit.h"
***************
*** 646,651 **** varbit_send(PG_FUNCTION_ARGS)
--- 648,686 ----
  }
  
  /*
+  * varbit_transform()
+  * Flatten calls to our length coercion function that leave the new maximum
+  * length >= the previous maximum length.  We ignore the isExplicit
+  * argument, which only affects truncation.
+  */
+ Datum
+ varbit_transform(PG_FUNCTION_ARGS)
+ {
+ 	FuncExpr   *expr = (FuncExpr *) PG_GETARG_POINTER(0);
+ 	Node	   *typmod;
+ 	Node	   *ret = NULL;
+ 
+ 	if (!IsA(expr, FuncExpr))
+ 		PG_RETURN_POINTER(ret);
+ 
+ 	Assert(list_length(expr->args) == 3);
+ 	typmod = lsecond(expr->args);
+ 
+ 	if (IsA(typmod, Const))
+ 	{
+ 		Node	   *source = linitial(expr->args);
+ 		int32		new_typmod = DatumGetInt32(((Const *) typmod)->constvalue);
+ 		int32		old_max = exprTypmod(source);
+ 		int32		new_max = new_typmod;
+ 
+ 		if (new_max <= 0 || (old_max >= 0 && old_max <= new_max))
+ 			ret = relabel_to_typmod(source, new_typmod);
+ 	}
+ 
+ 	PG_RETURN_POINTER(ret);
+ }
+ 
+ /*
  * varbit()
  * Converts a varbit() type to a specific internal length.
  * len is the maximum bitlength specified in the column definition.
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index c893c3a..2a71f82 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 2005,2011 ****
  DESCR("convert bitstring to int4");
  DATA(insert OID = 1685 (  bit			PGNSP PGUID 12 1 0 0 0 f f f t f i 3 0 1560 "1560 23 16" _null_ _null_ _null_ _null_ bit _null_ _null_ _null_ ));
  DESCR("adjust bit() to typmod length");
! DATA(insert OID = 1687 (  varbit		PGNSP PGUID 12 1 0 0 0 f f f t f i 3 0 1562 "1562 23 16" _null_ _null_ _null_ _null_ varbit _null_ _null_ _null_ ));
  DESCR("adjust varbit() to typmod length");
  DATA(insert OID = 1698 ( position		PGNSP PGUID 12 1 0 0 0 f f f t f i 2 0 23 "1560 1560" _null_ _null_ _null_ _null_ bitposition _null_ _null_ _null_ ));
--- 2005,2013 ----
  DATA(insert OID = 1685 (  bit			PGNSP PGUID 12 1 0 0 0 f f f t f i 3 0 1560 "1560 23 16" _null_ _null_ _null_ _null_ bit _null_ _null_ _null_ ));
  DESCR("adjust bit() to typmod length");
! DATA(insert OID = 3145 (  varbit_transform	PGNSP PGUID 12 1 0 0 0 f f f t f i 1 0 2281 "2281" _null_ _null_ _null_ _null_ varbit_transform _null_ _null_ _null_ ));
! DESCR("transform a varbit length coercion");
! DATA(insert OID = 1687 (  varbit		PGNSP PGUID 12 1 0 0 3145 f f f t f i 3 0 1562 "1562 23 16" _null_ _null_ _null_ _null_ varbit _null_ _null_ _null_ ));
  DESCR("adjust varbit() to typmod length");
  DATA(insert OID = 1698 ( position		PGNSP PGUID 12 1 0 0 0 f f f t f i 2 0 23 "1560 1560" _null_ _null_ _null_ _null_ bitposition _null_ _null_ _null_ ));
Re: [HACKERS] pg_upgrade and relkind filtering
On Mon, Dec 05, 2011 at 05:06:37PM -0500, Bruce Momjian wrote:
> Pg_upgrade has the following check to make sure the cluster is safe for
> upgrading:
> 
> 	res = executeQueryOrDie(conn,
> 			SELECT n.nspname, c.relname, a.attname
> 			FROM	pg_catalog.pg_class c,
> 					pg_catalog.pg_namespace n,
> 					pg_catalog.pg_attribute a
> 			WHERE	c.oid = a.attrelid AND
> 					NOT a.attisdropped AND
> 					a.atttypid IN (
> 						'pg_catalog.regproc'::pg_catalog.regtype,
> 						'pg_catalog.regprocedure'::pg_catalog.regtype,
> 						'pg_catalog.regoper'::pg_catalog.regtype,
> 						'pg_catalog.regoperator'::pg_catalog.regtype,
> 		/* regclass.oid is preserved, so 'regclass' is OK */
> 		/* regtype.oid is preserved, so 'regtype' is OK */
> 						'pg_catalog.regconfig'::pg_catalog.regtype,
> 						'pg_catalog.regdictionary'::pg_catalog.regtype) AND
> 					c.relnamespace = n.oid AND
> 					n.nspname != 'pg_catalog' AND
> 					n.nspname != 'information_schema');
> 
> Based on a report from EnterpriseDB, I noticed that we check all pg_class
> entries, while there are cases where this is unnecessary because there is
> no data behind the entry, e.g. views.  Here are the relkinds supported:
> 
> #define RELKIND_RELATION		'r'	/* ordinary table */
> #define RELKIND_INDEX			'i'	/* secondary index */
> #define RELKIND_SEQUENCE		'S'	/* sequence object */
> #define RELKIND_TOASTVALUE		't'	/* for out-of-line values */
> #define RELKIND_VIEW			'v'	/* view */
> #define RELKIND_COMPOSITE_TYPE	'c'	/* composite type */
> #define RELKIND_FOREIGN_TABLE	'f'	/* foreign table */
> #define RELKIND_UNCATALOGED	'u'	/* not yet cataloged */
> 
> What types, other than views, can we skip in this query?

RELKIND_UNCATALOGED should never appear on disk, and RELKIND_SEQUENCE and
RELKIND_TOASTVALUE do not allow adding columns or changing column types.  We
might as well keep validating them.  RELKIND_RELATION and RELKIND_INDEX have
storage, so we must check those.

The remaining three relkinds (RELKIND_VIEW, RELKIND_COMPOSITE_TYPE,
RELKIND_FOREIGN_TABLE) have no storage, but all are usable as column types
in other relations that do have storage.  You could skip them iff they're
unused that way, per a check like find_composite_type_dependencies().
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Mon, Jan 02, 2012 at 05:09:16PM +, Simon Riggs wrote: Attached patch makes SnapshotNow into an MVCC snapshot, initialised at the start of each scan iff SnapshotNow is passed as the scan's snapshot. It's fairly brief but seems to do the trick.

That's a neat trick. However, if you start a new SnapshotNow scan while one is ongoing, the primordial scan's snapshot will change mid-stream.
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Mon, Jan 02, 2012 at 06:41:31PM +, Simon Riggs wrote: On Mon, Jan 2, 2012 at 6:06 PM, Noah Misch n...@leadboat.com wrote: On Mon, Jan 02, 2012 at 05:09:16PM +, Simon Riggs wrote: Attached patch makes SnapshotNow into an MVCC snapshot, initialised at the start of each scan iff SnapshotNow is passed as the scan's snapshot. It's fairly brief but seems to do the trick. That's a neat trick. However, if you start a new SnapshotNow scan while one is ongoing, the primordial scan's snapshot will change mid-stream. Do we ever do that? (and if so, Why?!? or perhaps just Where?)

I hacked up your patch a bit, as attached, to emit a WARNING for any nested use of SnapshotNow. This made 97/127 test files fail. As one example, RelationBuildRuleLock() does TextDatumGetCString() for every tuple of its SnapshotNow scan. That may need a detoast, which itself runs a scan.

We can use more complex code if required, but we'll be adding complexity and code into the main path that I'd like to avoid.

Agreed.

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*** 1201,1206 heap_beginscan_internal(Relation relation, Snapshot snapshot,
--- 1201,1208 ----
   */
  scan->rs_pageatatime = IsMVCCSnapshot(snapshot);

+ InitSnapshotNowIfNeeded(snapshot);
+
  /*
   * For a seqscan in a serializable transaction, acquire a predicate lock
   * on the entire relation. This is required not only to lock all the
*** 1270,1275 heap_endscan(HeapScanDesc scan)
--- 1272,1279 ----
  if (BufferIsValid(scan->rs_cbuf))
      ReleaseBuffer(scan->rs_cbuf);

+ FinishSnapshotNow(scan->rs_snapshot);
+
  /*
   * decrement relation reference count and free scan descriptor storage
   */
*** 1680,1685 heap_get_latest_tid(Relation relation,
--- 1684,1691 ----
  if (!ItemPointerIsValid(tid))
      return;

+ InitSnapshotNowIfNeeded(snapshot);
+
  /*
   * Since this can be called with user-supplied TID, don't trust the input
   * too much. (RelationGetNumberOfBlocks is an expensive check, so we
*** 1775,1780 heap_get_latest_tid(Relation relation,
--- 1781,1788 ----
  priorXmax = HeapTupleHeaderGetXmax(tp.t_data);
  UnlockReleaseBuffer(buffer);
  } /* end of loop */
+
+ FinishSnapshotNow(snapshot);
  }

*** a/src/backend/access/index/indexam.c
--- b/src/backend/access/index/indexam.c
*** 292,297 index_beginscan_internal(Relation indexRelation,
--- 292,299 ----
   */
  RelationIncrementReferenceCount(indexRelation);

+ InitSnapshotNowIfNeeded(snapshot);
+
  /*
   * Tell the AM to open a scan.
   */
*** 370,375 index_endscan(IndexScanDesc scan)
--- 372,379 ----
  /* End the AM's scan */
  FunctionCall1(procedure, PointerGetDatum(scan));

+ FinishSnapshotNow(scan->xs_snapshot);
+
  /* Release index refcount acquired by index_beginscan */
  RelationDecrementReferenceCount(scan->indexRelation);

*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
*** 205,210 GetLatestSnapshot(void)
--- 205,235 ----
  return SecondarySnapshot;
  }

+ static bool SnapshotNowActive;
+
+ /*
+  * InitSnapshotNowIfNeeded
+  *	If passed snapshot is SnapshotNow then refresh it with latest info.
+  */
+ void
+ InitSnapshotNowIfNeeded(Snapshot snap)
+ {
+ 	if (!IsSnapshotNow(snap))
+ 		return;
+
+ 	if (SnapshotNowActive)
+ 		elog(WARNING, "SnapshotNow used concurrently");
+
+ 	snap = GetSnapshotData(&SnapshotNowData);
+ 	SnapshotNowActive = true;
+ }
+
+ void
+ FinishSnapshotNow(Snapshot snap)
+ {
+ 	if (IsSnapshotNow(snap))
+ 		SnapshotNowActive = false;
+ }
+
  /*
   * SnapshotSetCommandId
   *	Propagate CommandCounterIncrement into the static snapshots, if set

*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
*** 67,73 ****
  /* Static variables representing various special snapshot semantics */
! SnapshotData SnapshotNowData = {HeapTupleSatisfiesNow};
  SnapshotData SnapshotSelfData = {HeapTupleSatisfiesSelf};
  SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
  SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
--- 67,73 ----
  /* Static variables representing various special snapshot semantics */
! SnapshotData SnapshotNowData = {HeapTupleSatisfiesMVCC};
  SnapshotData SnapshotSelfData = {HeapTupleSatisfiesSelf};
  SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
  SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
*** 293,298 HeapTupleSatisfiesSelf(HeapTupleHeader tuple
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Mon, Jan 02, 2012 at 04:33:28PM -0300, Alvaro Herrera wrote: Excerpts from Noah Misch's message of lun ene 02 16:25:25 -0300 2012: On Mon, Jan 02, 2012 at 06:41:31PM +, Simon Riggs wrote: On Mon, Jan 2, 2012 at 6:06 PM, Noah Misch n...@leadboat.com wrote: On Mon, Jan 02, 2012 at 05:09:16PM +, Simon Riggs wrote: Attached patch makes SnapshotNow into an MVCC snapshot, initialised at the start of each scan iff SnapshotNow is passed as the scan's snapshot. It's fairly brief but seems to do the trick. That's a neat trick. However, if you start a new SnapshotNow scan while one is ongoing, the primordial scan's snapshot will change mid-stream. Do we ever do that? (and if so, Why?!? or perhaps just Where?) I hacked up your patch a bit, as attached, to emit a WARNING for any nested use of SnapshotNow. This made 97/127 test files fail. As one example, RelationBuildRuleLock() does TextDatumGetCString() for every tuple of its SnapshotNow scan. That may need a detoast, which itself runs a scan.

Uh, I thought detoasting had its own visibility test function .. I mean, otherwise, what is HeapTupleSatisfiesToast for?

The SnapshotNow scan was actually to build the relcache entry for the toast table before scanning the toast table itself.
Stack trace:

#0  InitSnapshotNowIfNeeded (snap=0xb7e8c0) at snapmgr.c:216
#1  0x0048f773 in index_beginscan_internal (indexRelation=0x7f9a091e5cc8, nkeys=2, norderbys=0, snapshot=0xb7e8c0) at indexam.c:295
#2  0x0048f7e9 in index_beginscan (heapRelation=0x7f9a091a6c60, indexRelation=0xb7e8c0, snapshot=0xb7e8c0, nkeys=152984776, norderbys=0) at indexam.c:238
#3  0x0048e1de in systable_beginscan (heapRelation=0x7f9a091a6c60, indexId=<value optimized out>, indexOK=<value optimized out>, snapshot=0xb7e8c0, nkeys=2, key=0x7fffe211b610) at genam.c:289
#4  0x007287db in RelationBuildTupleDesc (targetRelId=<value optimized out>, insertIt=1 '\001') at relcache.c:455
#5  RelationBuildDesc (targetRelId=<value optimized out>, insertIt=1 '\001') at relcache.c:880
#6  0x0072a050 in RelationIdGetRelation (relationId=2838) at relcache.c:1567
#7  0x004872c0 in relation_open (relationId=2838, lockmode=1) at heapam.c:910
#8  0x00487328 in heap_open (relationId=12052672, lockmode=152984776) at heapam.c:1052
#9  0x0048a737 in toast_fetch_datum (attr=<value optimized out>) at tuptoaster.c:1586
#10 0x0048bf74 in heap_tuple_untoast_attr (attr=0xb7e8c0) at tuptoaster.c:135
#11 0x00737d77 in pg_detoast_datum_packed (datum=0xb7e8c0) at fmgr.c:2266
#12 0x006f0fb0 in text_to_cstring (t=0xb7e8c0) at varlena.c:135
#13 0x00727083 in RelationBuildRuleLock (relation=0x7f9a091b3818) at relcache.c:673
#14 0x0072909e in RelationBuildDesc (targetRelId=<value optimized out>, insertIt=1 '\001') at relcache.c:886
#15 0x0072a050 in RelationIdGetRelation (relationId=9) at relcache.c:1567
#16 0x004872c0 in relation_open (relationId=9, lockmode=0) at heapam.c:910
#17 0x00487483 in relation_openrv_extended (relation=<value optimized out>, lockmode=<value optimized out>, missing_ok=<value optimized out>) at heapam.c:1011
#18 0x0048749c in heap_openrv_extended (relation=0xb7e8c0, lockmode=152984776, missing_ok=5 '\005') at heapam.c:1110
#19 0x0053570c in parserOpenTable (pstate=0xc885f8, relation=0xc88258, lockmode=1) at parse_relation.c:835
#20 0x0053594d in addRangeTableEntry (pstate=0xc885f8, relation=0xc88258, alias=0x0, inh=1 '\001', inFromCl=1 '\001') at parse_relation.c:901
#21 0x00524dbf in transformTableEntry (pstate=0xc885f8, n=0xc88258, top_rte=0x7fffe211bd78, top_rti=0x7fffe211bd84, relnamespace=0x7fffe211bd70, containedRels=0x7fffe211bd68) at parse_clause.c:445
#22 transformFromClauseItem (pstate=0xc885f8, n=0xc88258, top_rte=0x7fffe211bd78, top_rti=0x7fffe211bd84, relnamespace=0x7fffe211bd70, containedRels=0x7fffe211bd68) at parse_clause.c:683
#23 0x00526124 in transformFromClause (pstate=0xc885f8, frmList=<value optimized out>) at parse_clause.c:129
#24 0x0050c321 in transformSelectStmt (pstate=0xb7e8c0, parseTree=0xc88340) at analyze.c:896
#25 transformStmt (pstate=0xb7e8c0, parseTree=0xc88340) at analyze.c:186
#26 0x0050d583 in parse_analyze (parseTree=0xc88340, sourceText=0xc87828 "table pg_stat_user_tables;", paramTypes=0x0, numParams=0) at analyze.c:94
#27 0x006789c5 in pg_analyze_and_rewrite (parsetree=0xc88340, query_string=0xc87828 "table pg_stat_user_tables;", paramTypes=0x0, numParams=0) at postgres.c:583
#28 0x00678c77 in exec_simple_query (query_string=0xc87828 "table pg_stat_user_tables;") at postgres.c:941
#29 0x00679997 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=0xbe03c0 "nm") at postgres.c:3881
#30 0x00633d41 in BackendRun () at postmaster.c:3587
#31 BackendStartup
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Tue, Jan 03, 2012 at 01:18:41AM +, Simon Riggs wrote: Just for the record, yes we do run multiple catalog scans in some parts of the code. So I can see how we might trigger 4 nested scans, using cache replacement while scanning, so best assume more, with no guarantee of them being neatly stacked for pop/push type access.

Yeah, I wouldn't want to commit to a nesting limit. However, I _would_ have expected that a stack would suffice; PushActiveSnapshot()/PopActiveSnapshot() is adequate for a great deal of the backend, after all. In what sort of situation do catalog scans not strictly nest?

Thanks,
nm
Re: [HACKERS] Patch to allow users to kill their own queries
On Tue, Dec 20, 2011 at 02:30:08PM +0100, Magnus Hagander wrote: That said - can someone who knows the translation stuff better than me comment on if this is actually going to be translatable, or if it violates too many translation rules?

+pg_signal_backend(int pid, int sig, bool allow_same_role, const char *actionstr, const char *hint)

+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser to %s other server processes", actionstr),
+				 errhint("%s", hint)));

+	PG_RETURN_BOOL(pg_signal_backend(PG_GETARG_INT32(0), SIGTERM, false,
+				   gettext_noop("terminate"),
+				   gettext_noop("You can cancel your own processes with pg_cancel_backend().")));
 }

You need errhint("%s", _(hint)) or errhint(hint) to substitute the translation at runtime; only the printf-pattern string gets an automatic message catalog lookup.

Regarding the other message, avoid composing a translated message from independently-translated parts. The translator sees this:

#: utils/adt/misc.c:110
#, c-format
msgid "must be superuser to %s other server processes"
msgstr ""

#: utils/adt/misc.c:166
msgid "terminate"
msgstr ""

#: utils/adt/misc.c:167
msgid "You can cancel your own processes with pg_cancel_backend()."
msgstr ""

He'll probably need to read the code to see how the strings go together. If we add an alternative to "terminate", not all languages will necessarily have a translation of the outer message that works for both inner fragments.
Re: [HACKERS] Collect frequency statistics for arrays
On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote: Thanks for your great work on reviewing this patch. Now I'm trying to find the memory corruption bug. Unfortunately it doesn't appear on my system. Can you check if this bug remains in the attached version of the patch? If so, please provide me information about the system you're running (processor, OS etc.).

I get the same diagnostic from this version. Opteron processor, operating system is Ubuntu 8.04 (64-bit). You're using --enable-cassert, right?
[HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE
Git master can already avoid rewriting the table for column type changes like varchar(10) -> varchar(20). However, if the column in question is on either side of a FK relationship, we always revalidate the foreign key. Concretely, I wanted these no-rewrite type changes to also assume FK validity:

- Any typmod-only change
- text -> varchar
- domain -> base type

To achieve that, this patch skips the revalidation when two conditions hold:

(a) Old and new pg_constraint.conpfeqop match exactly. This is actually stronger than needed; we could loosen things by way of operator families. However, no core type would benefit, and my head exploded when I tried to define the more-generous test correctly.

(b) The functions, if any, implementing a cast from the foreign type to the primary opcintype are the same. For this purpose, we can consider a binary coercion equivalent to an exact type match. When the opcintype is polymorphic, require that the old and new foreign types match exactly. (Since ri_triggers.c does use the executor, the stronger check for polymorphic types is no mere future-proofing. However, no core type exercises its necessity.)

These follow from the rules used to decide when to rebuild an index. I further justify them in source comments. To implement this, I have ATPostAlterTypeParse() stash the content of the old pg_constraint.conpfeqop in the Constraint node. ATAddForeignKeyConstraint() notices that and evaluates the above rules. If they both pass, it omits the validation step just as though skip_validation had been given.
Thanks,
nm

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3b52415..0cde503 100644
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** 276,281 static Oid transformFkeyCheckAttrs(Relation pkrel,
--- 276,282 ----
  			int numattrs, int16 *attnums,
  			Oid *opclasses);
  static void checkFkeyPermissions(Relation rel, int16 *attnums, int natts);
+ static CoercionPathType findFkeyCast(Oid targetTypeId, Oid sourceTypeId, Oid *funcid);
  static void validateCheckConstraint(Relation rel, HeapTuple constrtup);
  static void validateForeignKeyConstraint(char *conname, Relation rel, Relation pkrel,
*** 357,362 static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMOD
--- 358,364 ----
  static void ATPostAlterTypeParse(Oid oldId, char *cmd, List **wqueue,
  			LOCKMODE lockmode, bool rewrite);
  static void TryReuseIndex(Oid oldId, IndexStmt *stmt);
+ static void TryReuseForeignKey(Oid oldId, Constraint *con);
  static void change_owner_fix_column_acls(Oid relationOid,
  			Oid oldOwnerId, Oid newOwnerId);
  static void change_owner_recurse_to_sequences(Oid relationOid,
*** 5696,5701 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5698,5705 ----
  			numpks;
  	Oid		indexOid;
  	Oid		constrOid;
+ 	bool		old_check_ok;
+ 	ListCell   *old_pfeqop_item = list_head(fkconstraint->old_conpfeqop);

  	/*
  	 * Grab an exclusive lock on the pk table, so that someone doesn't delete
*** 5812,5817 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5816,5828 ----
  			(errcode(ERRCODE_INVALID_FOREIGN_KEY),
  			 errmsg("number of referencing and referenced columns for foreign key disagree")));

+ 	/*
+ 	 * On the strength of a previous constraint, we might avoid scanning
+ 	 * tables to validate this one.  See below.
+ 	 */
+ 	old_check_ok = (fkconstraint->old_conpfeqop != NIL);
+ 	Assert(!old_check_ok || numfks == list_length(fkconstraint->old_conpfeqop));
+
  	for (i = 0; i < numpks; i++)
  	{
  		Oid		pktype = pktypoid[i];
*** 5826,5831 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5837,5843 ----
  		Oid		ppeqop;
  		Oid		ffeqop;
  		int16		eqstrategy;
+ 		Oid		pfeqop_right;

  		/* We need several fields out of the pg_opclass entry */
  		cla_ht = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclasses[i]));
*** 5868,5877 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
  		pfeqop = get_opfamily_member(opfamily, opcintype, fktyped,
  					eqstrategy);
  		if (OidIsValid(pfeqop))
  			ffeqop =
Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE
I neglected to commit after revising the text of a few comments; use this version instead. Thanks.

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3b52415..9eba8e8 100644
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** 276,281 static Oid transformFkeyCheckAttrs(Relation pkrel,
--- 276,282 ----
  			int numattrs, int16 *attnums,
  			Oid *opclasses);
  static void checkFkeyPermissions(Relation rel, int16 *attnums, int natts);
+ static CoercionPathType findFkeyCast(Oid targetTypeId, Oid sourceTypeId, Oid *funcid);
  static void validateCheckConstraint(Relation rel, HeapTuple constrtup);
  static void validateForeignKeyConstraint(char *conname, Relation rel, Relation pkrel,
*** 357,362 static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMOD
--- 358,364 ----
  static void ATPostAlterTypeParse(Oid oldId, char *cmd, List **wqueue,
  			LOCKMODE lockmode, bool rewrite);
  static void TryReuseIndex(Oid oldId, IndexStmt *stmt);
+ static void TryReuseForeignKey(Oid oldId, Constraint *con);
  static void change_owner_fix_column_acls(Oid relationOid,
  			Oid oldOwnerId, Oid newOwnerId);
  static void change_owner_recurse_to_sequences(Oid relationOid,
*** 5696,5701 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5698,5705 ----
  			numpks;
  	Oid		indexOid;
  	Oid		constrOid;
+ 	bool		old_check_ok;
+ 	ListCell   *old_pfeqop_item = list_head(fkconstraint->old_conpfeqop);

  	/*
  	 * Grab an exclusive lock on the pk table, so that someone doesn't delete
*** 5812,5817 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5816,5828 ----
  			(errcode(ERRCODE_INVALID_FOREIGN_KEY),
  			 errmsg("number of referencing and referenced columns for foreign key disagree")));

+ 	/*
+ 	 * On the strength of a previous constraint, we might avoid scanning
+ 	 * tables to validate this one.  See below.
+ 	 */
+ 	old_check_ok = (fkconstraint->old_conpfeqop != NIL);
+ 	Assert(!old_check_ok || numfks == list_length(fkconstraint->old_conpfeqop));
+
  	for (i = 0; i < numpks; i++)
  	{
  		Oid		pktype = pktypoid[i];
*** 5826,5831 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5837,5843 ----
  		Oid		ppeqop;
  		Oid		ffeqop;
  		int16		eqstrategy;
+ 		Oid		pfeqop_right;

  		/* We need several fields out of the pg_opclass entry */
  		cla_ht = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclasses[i]));
*** 5868,5877 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
  		pfeqop = get_opfamily_member(opfamily, opcintype, fktyped,
  					eqstrategy);
  		if (OidIsValid(pfeqop))
  			ffeqop = get_opfamily_member(opfamily, fktyped, fktyped,
  						eqstrategy);
  		else
! 			ffeqop = InvalidOid;	/* keep compiler quiet */

  		if (!(OidIsValid(pfeqop) && OidIsValid(ffeqop)))
  		{
--- 5880,5896 ----
  		pfeqop = get_opfamily_member(opfamily, opcintype, fktyped,
  					eqstrategy);
  		if (OidIsValid(pfeqop))
+ 		{
+ 			pfeqop_right = fktyped;
  			ffeqop = get_opfamily_member(opfamily, fktyped, fktyped,
  						eqstrategy);
+ 		}
  		else
! 		{
! 			/* keep compiler quiet */
! 			pfeqop_right = InvalidOid;
! 			ffeqop = InvalidOid;
! 		}

  		if (!(OidIsValid(pfeqop) && OidIsValid(ffeqop)))
  		{
*** 5893,5899 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5912,5921 ----
  		target_typeids[1] = opcintype;
  		if (can_coerce_type(2, input_typeids, target_typeids,
  					COERCION_IMPLICIT))
+ 		{
Re: [HACKERS] Collect frequency statistics for arrays
Corrections:

On Thu, Dec 29, 2011 at 11:35:00AM -0500, Noah Misch wrote: On Wed, Nov 09, 2011 at 08:49:35PM +0400, Alexander Korotkov wrote:

+ *	We set s to be the estimated frequency of the K'th element in a natural
+ *	language's frequency table, where K is the target number of entries in
+ *	the MCELEM array.  We assume that the distribution of element frequencies
+ *	follows Zipf's law with an exponent of 1.
+ *
+ *	Assuming Zipfian distribution, the frequency of the K'th element is equal
+ *	to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
+ *	elements in the language.  Putting W as one million, we get roughly
+ *	0.07/K.  This gives s = 0.07/K.  We set epsilon = s/10, which gives bucket
+ *	width w = K/0.007 and maximum expected hashtable size of about 1000 * K.

These last two paragraphs, adapted from ts_typanalyze.c, assume natural language documents. To what extent do these parameter choices remain sensible for arbitrary data such as users may place in arrays? In any event, we need a different justification, even if it's just a hand-wavy justification.

If I'm following this correctly, this choice of s makes the algorithm guaranteed to find only elements constituting >= 7% of the input elements. Incidentally, isn't that far too high for natural language documents? If the English word "the" only constitutes 7% of typical documents, then this s value would permit us to discard practically every word; we'd be left with words read while filling the last bucket and words that happened to repeat exceedingly often in the column. I haven't tried to make a test case to observe this problem; am I missing something? (This question is largely orthogonal to your patch.)

No, we'll find elements of frequency at least 0.07/(default_statistics_target * 10) -- in the default configuration, 0.007%. Also, ts_typanalyze() counts the number of documents that contain one or more instances of each lexeme, ignoring the number of appearances within each document. The word "the" may constitute 7% of a typical document, but it will appear at least once in nearly 100% of documents. Therefore, this s value is adequate even for the pathological case of each document containing just one lexeme.

+ *
+ *	Note: in the above discussion, s, epsilon, and f/N are in terms of an
+ *	element's frequency as a fraction of all elements seen in the input.
+ *	However, what we actually want to store in the finished pg_statistic
+ *	entry is each element's frequency as a fraction of all rows that it occurs
+ *	in.  Elements might be repeated in the same array.  Since operators
+ *	<@, &&, @> take care only about element occurrence itself and not about
+ *	occurrence count, the function takes additional care about uniqueness of
+ *	counting.  Also we need to change the divisor from N to nonnull_cnt to get
+ *	the number we want.

On the same tangent, why does ts_typanalyze() not deduplicate the same way? The @@ operator has the same property. Answer: to_tsvector() will have already done so.

+ 	/*
+ 	 * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the
+ 	 * comment above.
+ 	 */
+ 	bucket_width = num_mcelem * 1000 / 7;

The addend mentioned is not present in the code or discussed in the comment above. (I see the comment is copied verbatim from ts_typanalyze(), where the addend *is* present, though again the preceding comment says nothing of it.) The addend rationale in fact does appear in the ts_typanalyze() comment.

Thanks,
nm
[HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
In 367bc426a1c22b9f6badb06cd41fc438fd034639, I introduced a CheckIndexCompatible() that approves btree and hash indexes having changed to a different operator class within the same operator family. To make that valid, I also tightened the operator family contracts for those access methods to address casts. However, I've found two retained formal hazards. They're boring and perhaps unlikely to harm real users, but I'd rather nail them down just in case.

First, opclasses like array_ops have polymorphic opcintype. Members of such operator classes could, in general, change behavior arbitrarily in response to get_fn_expr_argtype(). No core polymorphic operator family member does this. Nor could they, for no core index access method sets fn_expr. In the absence of responses to my previous inquiry[1] on the topic, I'm taking the position that the lack of fn_expr in these calls is an isolated implementation detail that could change at any time. Therefore, we should only preserve an index of a polymorphic operator class when the old and new opcintype match exactly. This patch's test suite addition illustrates one ALTER TABLE ALTER TYPE affected by this new restriction: a conversion from an array type to a domain over that array type will now require an index rebuild.

Second, as a thought experiment, consider a database with these three types:

1. int4, the core type.
2. int4neg, which stores to disk as the negation of its logical value. Includes a complete set of operators in the integer_ops btree family, but defines no casts to other integer_ops-represented types.
3. int4other, which stores to disk like int4. No operators. Has an implicit binary coercion cast to int4 and an explicit binary coercion cast to int4neg.

Suppose a table in this database has a column of type int4other with an index of the default operator class. By virtue of the implicit binary coercion and lack of other candidates, the index will use int4_ops. Now, change the type of that column from int4other to int4neg. The operator class changes to int4neg_ops, still within the integer_ops family, and we do not rebuild the index. However, the logical sign of each value just flipped, making the index invalid.

Where did CheckIndexCompatible() miss? An operator family only promises cast-compatibility when there exists an implicit or binary coercion cast between the types in question. There's no int4 -> int4neg cast here, not even a multiple-step pathway. CheckIndexCompatible() assumes that our ability to perform a no-rewrite ALTER TABLE ALTER TYPE implies the existence of such a cast. It does imply that for the before-and-after _table column types_, but it's the before-and-after _opcintype_ that matters here.

I think we could close this hazard by having CheckIndexCompatible() test for an actual implicit or binary coercion cast between the opcintype of the old and new operator classes. However, I'm not confident to say that no similar problem would remain. Therefore, I propose ceasing to optimize intra-family index operator transitions and simply requiring exact matches of operator classes and exclusion operators. This does not remove any actual optimization for changes among core types, and it will be simpler to validate. (I designed the current rules under the misapprehension that varchar_ops was the default operator class for varchar columns. However, varchar_ops is a copy of text_ops apart from the name and being non-default. We ship no operator with a varchar operand, relying instead on binary coercion to text.)

This patch does not, however, remove the new terms from the operator family contracts. While core PostgreSQL will no longer depend on them, they may again prove handy for future optimizations like this. I remain of the opinion that they're already widely (perhaps even universally) obeyed.

This patch conflicts trivially with my patch "Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE" in that they both add test cases to the same location in alter_table.sql.
They're otherwise independent, albeit reflecting parallel principles.

Thanks,
nm

[1] http://archives.postgresql.org/message-id/20111229211711.ga8...@tornado.leadboat.com

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 712b0b0..1bf1de5 100644
*** a/src/backend/commands/indexcmds.c
--- b/src/backend/commands/indexcmds.c
*** 23,28 ****
--- 23,29 ----
  #include "catalog/pg_opclass.h"
  #include "catalog/pg_opfamily.h"
  #include "catalog/pg_tablespace.h"
+ #include "catalog/pg_type.h"
  #include "commands/dbcommands.h"
  #include "commands/defrem.h"
  #include "commands/tablecmds.h"
*** 52,57 ****
--- 53,59 ----
  /* non-export function prototypes */
  static void CheckPredicate(Expr *predicate);
  static void ComputeIndexAttrs(IndexInfo *indexInfo,
+ 			  Oid *typeOidP,
  			  Oid *collationOidP,
  			  Oid *classOidP,
Re: [HACKERS] Disabled features on Hot Standby
On Fri, Jan 13, 2012 at 12:08:05PM -0500, Robert Haas wrote: On Fri, Jan 13, 2012 at 11:13 AM, Simon Riggs si...@2ndquadrant.com wrote: Many WAL records have latestRemovedXid on them. We can use the same idea with XLOG_HEAP2_VISIBLE records, so we add a field to send the latest vacrelstats->latestRemovedXid. That then creates a recovery snapshot conflict that would cancel any query that might then see a page of the vis map that was written when the xmin was later than on the standby. If replication disconnects briefly and a vismap bit is updated that would cause a problem, just as the same situation would cause a problem because of other record types.

That could create a lot of recovery conflicts when hot_standby_feedback=off, I think, but it might work when hot_standby_feedback=on. I don't fully understand the latestRemovedXid machinery, but I guess the idea would be to kill any standby transaction whose proc->xmin precedes the oldest committed xmin or xmax on the page. If hot_standby_feedback=on then there shouldn't be any, except in the case where it's just been enabled or the SR connection is bouncing.

FWIW, Tom aired the same idea here: http://archives.postgresql.org/message-id/27743.1291135...@sss.pgh.pa.us

While reviewing the crash-safe visibility map patch, I echoed the idea and explained why the extra conflicts would be immaterial: http://archives.postgresql.org/message-id/20110618133703.ga1...@tornado.leadboat.com

Also, what happens if an all-visible bit gets set on the standby through some other mechanism - e.g. restored from an FPI or XLOG_HEAP_NEWPAGE? I'm not sure whether we ever do an FPI of the visibility map page itself, but we certainly do it for the heap pages. So it might be that this infrastructure would (somewhat bizarrely) trust the visibility map bits but not the PD_ALL_VISIBLE bits.

Simon spoke to the FPI side of the question. For heap pages, the XLOG_HEAP_NEWPAGE consumers are CLUSTER, VACUUM FULL and ALTER TABLE SET TABLESPACE.
For the last, we will have already logged any PD_ALL_VISIBLE bits through normal channels. CLUSTER and VACUUM FULL never set PD_ALL_VISIBLE or visibility map bits. When, someday, they do, we might emit a separate WAL record to force the recovery conflict. However, CLUSTER/VACUUM FULL already remove tuples still-visible to standby snapshots without provoking a recovery conflict. (Again only with hot_standby_feedback=off.)

Thanks,
nm
Re: [HACKERS] Disabled features on Hot Standby
On Sat, Jan 14, 2012 at 12:42:02AM -0500, Robert Haas wrote: On Fri, Jan 13, 2012 at 8:02 PM, Noah Misch n...@leadboat.com wrote: Simon spoke to the FPI side of the question. For heap pages, the XLOG_HEAP_NEWPAGE consumers are CLUSTER, VACUUM FULL and ALTER TABLE SET TABLESPACE. For the last, we will have already logged any PD_ALL_VISIBLE bits through normal channels. CLUSTER and VACUUM FULL never set PD_ALL_VISIBLE or visibility map bits. When, someday, they do, we might emit a separate WAL record to force the recovery conflict. However, CLUSTER/VACUUM FULL already remove tuples still-visible to standby snapshots without provoking a recovery conflict. (Again only with hot_standby_feedback=off.)

Is the bit about CLUSTER/VACUUM FULL a preexisting bug? If not, why not?

I suspect it goes back to 9.0, yes. I'm on the fence regarding whether to call it a bug or an unimplemented feature. In any case, +1 for improving it.

Other than that, it seems like we might be converging on a workable solution: if hot_standby_feedback=off, disable index-only scans for snapshots taken during recovery; if hot_standby_feedback=on, generate recovery conflicts when a snapshot's xmin precedes the youngest xmin on a page marked all-visible, but allow the use of index-only scans, and allow sequential scans to trust PD_ALL_VISIBLE. Off the top of my head, I don't see a hole in that logic...

I wouldn't check hot_standby_feedback. Rather, mirror what we do for XLOG_HEAP2_CLEAN. Unconditionally add an xid to xl_heap_visible bearing the youngest xmin on the page (alternately, some convenient upper bound thereof). Have heap_xlog_visible() call ResolveRecoveryConflictWithSnapshot() on that xid. Now, unconditionally trust PD_ALL_VISIBLE and permit index-only scans. The user's settings of hot_standby_feedback, vacuum_defer_cleanup_age, max_standby_streaming_delay and max_standby_archive_delay will drive the consequential trade-off: nothing, query cancellation, or recovery delay.
nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Disabled features on Hot Standby
On Sat, Jan 14, 2012 at 08:08:29AM +, Simon Riggs wrote: On Sat, Jan 14, 2012 at 1:02 AM, Noah Misch n...@leadboat.com wrote: However, CLUSTER/VACUUM FULL already remove tuples still-visible to standby snapshots without provoking a recovery conflict. (Again only with hot_standby_feedback=off.) If that were the case it would be a bug. CLUSTER/VACUUM FULL emit an AccessExclusiveLock record that would conflict with any current lock holders, so should be fine on that. I speak of this sequence (M = master connection, S = standby connection):
M: CREATE TABLE t AS SELECT * FROM generate_series(1,1000) t(n);
S: BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 0;
M: DELETE FROM t WHERE n <= 10;
M: VACUUM FULL t;
S: SELECT count(*) FROM t; -- 990, should be 1000
Re: [HACKERS] Measuring relation free space
On Sat, Jan 14, 2012 at 04:41:57AM -0500, Jaime Casanova wrote: pgstattuple and relation_free_space are very close in all the numbers except for 2 indexes pgbench_branches_pkey and pgbench_tellers_pkey; after a VACUUM FULL and a REINDEX (and the difference persists) i checked pgbench_tellers_pkey contents (it has only one page besides the metapage) and the numbers that i looked at were: page size: 8192, free size: 4148, which in good romance means 50% of free space... so, answering Noah's question: if that difference has some meaning i can't see it but looking at the evidence the measure relation_free_space() is giving is the good one so, tomorrow (or ...looking at the clock... later today) i will update the relation_free_space() patch to accept toast tables and other kinds of indexes and add it to the commitfest unless someone says that my math is wrong and somehow there is a more accurate way of getting the free space (which is entirely possible) Did you see this followup[1]? To summarize:
- pgstattuple() and relation_free_space() should emit the same number, even if that means improving pgstattuple() at the same time.
- relation_free_space() belongs in the pgstattuple extension, because its role is cheaper access to a single pgstattuple() metric.
Thanks, nm [1] http://archives.postgresql.org/message-id/20111218165625.gb6...@tornado.leadboat.com
[HACKERS] psql filename completion: quoting
Occasionally, I have a SQL file destined for psql's \i command whose name contains a space. Less often, I'll have a .csv destined for \copy with the same problem. psql's filename completion does not handle these well. It completes on the literal name, but the commands will only recognize quoted names. For example, given a file "foo bar", \i f<TAB> will complete to \i foo bar, which will not execute. If I type \i 'f<TAB>, completion will not help at all. The attached patch wraps rl_filename_completion_function() to dequote on input and enquote on output. Now, \i f<TAB> and \i 'f<TAB> will both complete to \i 'foo bar', which executes as expected. The new code handles embedded whitespace, quotes, and backslashes. tab-complete.c works in terms of whitespace-separated words. As such, \i 'foo b<TAB> does not complete, because tab-complete.c has no notion of quotes affecting token boundaries. It thinks 'foo is one token and b is another. I'm sure we could fix this (Bash gets it right). It seemed rather independent code-wise, so I did not attempt that for this patch. Thanks, nm diff --git a/src/bin/psql/stringutils.c b/src/bin/psql/stringutils.c index 3b5ce1b..77387dc 100644 *** a/src/bin/psql/stringutils.c --- b/src/bin/psql/stringutils.c *** *** 272,274 strip_quotes(char *source, char quote, char escape, int encoding) --- 272,343 *dst = '\0'; } + + + /* + * quote_if_needed + * + * Opposite of strip_quotes(). If "source" denotes itself literally without + * quoting or escaping, returns NULL. Otherwise, returns a malloc'd copy with + * quoting and escaping applied: + * + * source - string to parse + * entails_quote - any of these present? need outer quotes + * quote - doubled within string, affixed to both ends + * escape - doubled within string + * encoding - the active character-set encoding + * + * Do not use this as a substitute for PQescapeStringConn(). Use it for + * strings to be parsed by strtokx() or psql_scan_slash_option().
+ */ + char * + quote_if_needed(const char *source, const char *entails_quote, + char quote, char escape, int encoding) + { + const char *src; + char *ret; + char *dst; + bool need_quotes = false; + + psql_assert(source); + psql_assert(quote); + + src = source; + dst = ret = pg_malloc(2 * strlen(src) + 3); /* excess */ + + *dst++ = quote; + + while (*src) + { + char c = *src; + int i; + + if (c == quote) + { + need_quotes = true; + *dst++ = quote; + } + else if (c == escape) + { + need_quotes = true; + *dst++ = escape; + } + else if (strchr(entails_quote, c)) + need_quotes = true; + + i = PQmblen(src, encoding); + while (i--) + *dst++ = *src++; + } + + *dst++ = quote; + *dst = '\0'; + + if (!need_quotes) + { + free(ret); + ret = NULL; + } + + return ret; + } diff --git a/src/bin/psql/stringutils.h b/src/bin/psql/stringutils.h index c7c5f38..c64fc58 100644 *** a/src/bin/psql/stringutils.h --- b/src/bin/psql/stringutils.h *** *** 19,22 extern char *strtokx(const char *s, --- 19,25 bool del_quotes, int encoding); + extern char *quote_if_needed(const char *source, const char *entails_quote, + char quote, char escape, int encoding); + #endif /* STRINGUTILS_H */ diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index a27ef69..d226106 100644 *** a/src/bin/psql/tab-complete.c --- b/src/bin/psql/tab-complete.c *** *** 670,675 static char *complete_from_list(const char *text, int state); --- 670,676 static char *complete_from_const(const char *text, int state); static char **complete_from_variables(char *text, const char *prefix, const char *suffix); + static char *complete_from_files(const char *text, int state); static PGresult *exec_query(const char *query); *** *** 1619,1625 psql_completion(char *text, int start, int end) pg_strcasecmp(prev3_wd, "BINARY") == 0) && (pg_strcasecmp(prev_wd, "FROM") == 0 || pg_strcasecmp(prev_wd, "TO") == 0)) !
matches = completion_matches(text, filename_completion_function); /* Handle COPY|BINARY <sth> FROM|TO filename */ else if ((pg_strcasecmp(prev4_wd, "COPY") == 0 || --- 1620,1629 pg_strcasecmp(prev3_wd, "BINARY") == 0)
[HACKERS] psql COPY vs. ON_ERROR_ROLLBACK, multi-command strings
It has bothered me that psql's \copy ignores the ON_ERROR_ROLLBACK setting. Only SendQuery() takes note of ON_ERROR_ROLLBACK, and \copy, like all backslash commands, does not route through SendQuery(). Looking into this turned up several other weaknesses in psql's handling of COPY. For example, SendQuery() does not release the savepoint after an ordinary COPY: [local] test=# set log_statement = 'all'; set client_min_messages = 'log'; copy (select 0) to stdout; SET SET LOG: statement: RELEASE pg_psql_temporary_savepoint LOG: statement: SAVEPOINT pg_psql_temporary_savepoint LOG: statement: copy (select 0) to stdout; 0 When psql sends a command string containing more than one command, at least one of which is a COPY, we stop processing results at the first COPY: [local] test=# set log_statement = 'all'; set client_min_messages = 'log'; copy (select 0) to stdout\; select 1/0; select 1; SET SET LOG: statement: RELEASE pg_psql_temporary_savepoint LOG: statement: SAVEPOINT pg_psql_temporary_savepoint LOG: statement: copy (select 0) to stdout; select 1/0; 0 LOG: statement: select 1; ERROR: current transaction is aborted, commands ignored until end of transaction block To make the above work normally, this patch improves SendQuery()-based COPY command handling to process the entire queue of results whenever we've processed a COPY. It also brings sensible handling in the face of multiple COPY commands in a single command string. See the included test cases for some examples. We must prepare for COPY to fail for a local reason, like client-side malloc() failure or network I/O problems. The server will still have the connection in a COPY mode, and we must get it out of that mode. The \copy command was prepared for the COPY FROM case with this code block: /* * Make sure we have pumped libpq dry of results; else it may still be in * ASYNC_BUSY state, leading to false readings in, eg, get_prompt(). 
*/ while ((result = PQgetResult(pset.db)) != NULL) { success = false; psql_error("\\copy: unexpected response (%d)\n", PQresultStatus(result)); /* if still in COPY IN state, try to get out of it */ if (PQresultStatus(result) == PGRES_COPY_IN) PQputCopyEnd(pset.db, _("trying to exit copy mode")); PQclear(result); } It arose from these threads: http://archives.postgresql.org/pgsql-hackers/2006-11/msg00694.php http://archives.postgresql.org/pgsql-general/2009-08/msg00268.php However, psql still enters an infinite loop when COPY TO STDOUT encounters a client-side failure, such as malloc() failure. I've merged the above treatment into lower-level routines, granting plain COPY commands similar benefits, and fixed the COPY TO handling. To help you test the corner cases involved here, I've attached a gdb script that will inject client-side failures into both kinds of COPY commands. Attach gdb to your psql process, source the script, and compare the behavior of commands like these with and without the patch:
\copy (select 0) to pstdout
create table t (c int);
\copy t from stdin
1
\.
With plain COPY handled thoroughly, I fixed \copy by having it override the source or destination stream, then call SendQuery(). This gets us support for ON_ERROR_ROLLBACK, \timing, \set ECHO and \set SINGLESTEP for free. I found it reasonable to treat \copy's SQL command more like a user-entered command than an internal command, because \copy is such a thin wrapper around COPY FROM STDIN/COPY TO STDOUT. This patch makes superfluous the second argument of PSQLexec(), but I have not removed that argument. Incidentally, psql's common.c had several switches on PQresultStatus(res) that did not include a case for PGRES_COPY_BOTH and also silently assumed any unlisted status was some kind of error. I tightened these to distinguish all known statuses and give a diagnostic upon receiving an unknown status.
Thanks, nm diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index 69fac83..c7d024a 100644 *** a/src/bin/psql/command.c --- b/src/bin/psql/command.c *** *** 320,344 exec_command(const char *cmd, /* \copy */ else if (pg_strcasecmp(cmd, "copy") == 0) { - /* Default fetch-it-all-and-print mode */ - instr_time before, - after; - char *opt = psql_scan_slash_option(scan_state, OT_WHOLE_LINE, NULL, false); - if (pset.timing) - INSTR_TIME_SET_CURRENT(before); - success = do_copy(opt); - - if (pset.timing && success) - { - INSTR_TIME_SET_CURRENT(after); -
Re: [HACKERS] Measuring relation free space
On Sat, Jan 14, 2012 at 02:40:46PM -0500, Jaime Casanova wrote: On Sat, Jan 14, 2012 at 6:26 AM, Noah Misch n...@leadboat.com wrote: - pgstattuple() and relation_free_space() should emit the same number, even if that means improving pgstattuple() at the same time. yes, i just wanted to understand which one was more accurate and why... and give the opportunity for anyone to point my error if any pgstattuple()'s decision to treat half-dead pages like deleted pages is better. That transient state can only end in the page's deletion. I don't know about counting non-leaf pages, but I personally wouldn't revisit pgstattuple()'s decision there. In the indexes I've briefly surveyed, the ratio of leaf pages to non-leaf pages is 100:1 or better. No amount of bloat in that 1% will matter. Feel free to make the argument if you think otherwise, though; I've only taken a brief look at the topic. - relation_free_space() belongs in the pgstattuple extension, because its role is cheaper access to a single pgstattuple() metric. oh! right! so, what about just fixing the free_percent that pgstattuple is providing If pgstattuple() meets your needs, you might indeed no longer need any patch.
Re: [HACKERS] foreign key locks, 2nd attempt
On Sun, Jan 15, 2012 at 01:49:54AM -0300, Alvaro Herrera wrote: - I'm not sure that the multixact truncation code is sane on checkpoints. It might be that I need to tweak more the pg_control info we keep about truncation. The whole truncation thing needs more testing, too. My largest outstanding concern involves the possibility of MultiXactId wraparound. From my last review: This raises a notable formal hazard: it's possible to burn through the MultiXactId space faster than the regular TransactionId space. We could get into a situation where pg_clog is covering 2B xids, and yet we need 4B MultiXactId to cover that period. We had better at least notice this and halt, if not have autovacuum actively prevent it. (That should have been 2B rather than 4B, since MultiXactId uses the same 2B-in-past, 2B-in-future behavior as regular xids.) Are we willing to guess that this will never happen and make recovery minimally possible? If so, we could have GetNewMultiXactId() grow defenses similar to GetNewTransactionId() and leave it at that. If not, we need to involve autovacuum. The other remaining high-level thing is to have key_attrs contain only columns actually referenced by FKs. - Go over Noah's two reviews again and see if I missed anything; also make sure I haven't missed anything from other reviewers. There are some, yes. *** a/src/backend/access/heap/heapam.c --- b/src/backend/access/heap/heapam.c *** *** 2773,2783 l2: } else if (result == HeapTupleBeingUpdated && wait) { ! TransactionId xwait; uint16 infomask; /* must copy state data before unlocking buffer */ ! xwait = HeapTupleHeaderGetXmax(oldtup.t_data); infomask = oldtup.t_data->t_infomask; LockBuffer(buffer, BUFFER_LOCK_UNLOCK); --- 2848,2871 } else if (result == HeapTupleBeingUpdated && wait) { ! TransactionId xwait; uint16 infomask; + bool none_remain = false; Nothing can ever set this variable to anything different. It seems that keep_xact == InvalidTransactionId substitutes well enough, though.
/* * We may overwrite if previous xmax aborted, or if it committed but ! * only locked the tuple without updating it, or if we are going to ! * keep it around in Xmax. */ The second possibility is just a subset of the third. ! if (TransactionIdIsValid(keep_xmax) || ! none_remain || ! (oldtup.t_data->t_infomask & HEAP_XMAX_INVALID)) result = HeapTupleMayBeUpdated; else result = HeapTupleUpdated; *** *** 3314,3323 heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum, */ static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs, ! HeapTuple oldtup, HeapTuple newtup) { int attrnum; while ((attrnum = bms_first_member(hot_attrs)) >= 0) { /* Adjust for system attributes */ --- 3537,3549 */ static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs, ! HeapTuple oldtup, HeapTuple newtup, bool empty_okay) { int attrnum; + if (!empty_okay && bms_is_empty(hot_attrs)) + return false; When a table contains no key attributes, it seems arbitrary whether we call the key revoked or not. What is the motivation for this change? ! /* ! * If we're requesting KeyShare, and there's no update present, we ! * don't need to wait. Even if there is an update, we can still ! * continue if the key hasn't been modified. ! * ! * However, if there are updates, we need to walk the update chain ! * to mark future versions of the row as locked, too. That way, if ! * somebody deletes that future version, we're protected against the ! * key going away. This locking of future versions could block ! * momentarily, if a concurrent transaction is deleting a key; or it ! * could return a value to the effect that the transaction deleting the ! * key has already committed. So we do this before re-locking the ! * buffer; otherwise this would be prone to deadlocks. Note that the TID ! * we're locking was grabbed before we unlocked the buffer. For it to ! * change while we're not looking, the other properties we're testing !
* for below after re-locking the buffer would also change, in which !
Re: [HACKERS] foreign key locks, 2nd attempt
On Mon, Jan 16, 2012 at 04:52:36PM -0300, Alvaro Herrera wrote: Excerpts from Heikki Linnakangas's message of lun ene 16 16:17:42 -0300 2012: On 15.01.2012 06:49, Alvaro Herrera wrote: --- 164,178 #define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */ #define HEAP_HASEXTERNAL 0x0004 /* has external stored attribute(s) */ #define HEAP_HASOID 0x0008 /* has an object-id field */ ! #define HEAP_XMAX_KEYSHR_LOCK 0x0010 /* xmax is a key-shared locker */ #define HEAP_COMBOCID 0x0020 /* t_cid is a combo cid */ #define HEAP_XMAX_EXCL_LOCK 0x0040 /* xmax is exclusive locker */ ! #define HEAP_XMAX_IS_NOT_UPDATE 0x0080 /* xmax, if valid, is only a locker. ! * Note this is not set unless ! * XMAX_IS_MULTI is also set. */ ! ! #define HEAP_LOCK_BITS (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_IS_NOT_UPDATE | \ ! HEAP_XMAX_KEYSHR_LOCK) #define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */ #define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */ #define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */ HEAP_XMAX_IS_NOT_UPDATE is a pretty opaque name for that. From the name, I'd say that a DELETE would set that, but the comment says it has to do with locking. After going through all the combinations in my mind, I think I finally understood it: HEAP_XMAX_IS_NOT_UPDATE is set if xmax is a multi-xact that represents only locking xids, no updates. How about calling it HEAP_XMAX_LOCK_ONLY, and setting it whenever xmax is only a locker, whether or not xmax is a multi-xid? Hm, sounds like a good idea. Will do. +1 Why are you renaming HeapTupleHeaderGetXmax() into HeapTupleHeaderGetRawXmax()? Any current callers of HeapTupleHeaderGetXmax() should already check that HEAP_XMAX_IS_MULTI is not set. I had this vague impression that it'd be better to break existing callers so that they ensure they decide between HeapTupleHeaderGetRawXmax and HeapTupleHeaderGetUpdateXid. Noah suggested changing the macro name, too. It's up to each caller to decide what's the semantics they want.
Most want the latter; and callers outside core are more likely to want that one. If we kept the old name, they would get the wrong value. My previous suggestion was to have both macros: #define HeapTupleHeaderGetXmax(tup) \ ( \ AssertMacro(!((tup)->t_infomask & HEAP_XMAX_IS_MULTI)), \ HeapTupleHeaderGetRawXmax(tup) \ ) #define HeapTupleHeaderGetRawXmax(tup) \ ( \ (tup)->t_choice.t_heap.t_xmax \ ) No strong preference, though.
Re: [HACKERS] Collect frequency statistics for arrays
On Tue, Jan 17, 2012 at 12:04:06PM +0400, Alexander Korotkov wrote: Thanks for your fixes to the patch. They look correct to me. I did some fixes in the patch. The proof of some concepts is still needed. I'm going to provide it in a few days. Your further fixes look good. Could you also answer my question about the header comment of mcelem_array_contained_selec()? /* * Estimate selectivity of "column <@ const" based on most common element * statistics. Independent element occurrence would imply a particular * distribution of distinct element counts among matching rows. Real data * usually falsifies that assumption. For example, in a set of 1-element * integer arrays having elements in the range [0;10], element occurrences are * not independent. If they were, a sufficiently-large set would include all * distinct element counts 0 through 11. We correct for this using the * histogram of distinct element counts. * * In the "column @> const" and "column && const" cases, we usually have * const with low summary frequency of elements (otherwise we have * selectivity close to 0 or 1 correspondingly). That's why the effect of * dependence related to distinct element counts distribution is negligible * there. In the "column <@ const" case, summary frequency of elements is * high (otherwise we have selectivity close to 0). That's why we should do * correction due to array distinct element counts distribution. */ By "summary frequency of elements", do you mean literally P_0 + P_1 ... + P_N? If so, I can follow the above argument for "column && const" and "column <@ const", but not for "column @> const". For "column @> const", selectivity cannot exceed the smallest frequency among const elements. A number of high-frequency elements will drive up the sum of the frequencies without changing the true selectivity much at all. Thanks, nm
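To make the objection concrete, here is a toy calculation with made-up element frequencies (the names and numbers are illustrative only, not from the statistics code) contrasting the summed frequency with the hard cap that containment imposes:

```python
# Hypothetical most-common-element frequencies.  For "column @> const",
# a matching row must contain *every* element of const, so the true
# selectivity can never exceed the frequency of the rarest const element.
freqs = {'common_a': 0.9, 'common_b': 0.8, 'rare': 0.001}

# The "summary frequency" discussed above: P_0 + P_1 + ... + P_N.
summary_frequency = sum(freqs.values())

# Hard cap on "column @> const" selectivity: the smallest frequency.
containment_cap = min(freqs.values())

# Even a naive independence-based product respects that cap.
independent_estimate = 1.0
for p in freqs.values():
    independent_estimate *= p

print(round(summary_frequency, 3))  # 1.701 -- "high", yet uninformative here
print(containment_cap)              # 0.001 -- the real ceiling
```

Adding more common elements to const would push the sum still higher while leaving the 0.001 ceiling untouched, which is the point of the last sentence above.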
Re: [HACKERS] psql \timing vs failed statements
On Tue, Jan 17, 2012 at 04:01:23PM +0100, Magnus Hagander wrote: Thus - if I were to change psql to output timing on failed queries as well, will anybody object? ;) +1
Re: [HACKERS] Measuring relation free space
On Wed, Jan 18, 2012 at 09:46:20AM -0500, Jaime Casanova wrote: On Mon, Jan 16, 2012 at 5:09 AM, Noah Misch n...@leadboat.com wrote: pgstattuple()'s decision to treat half-dead pages like deleted pages is better. That transient state can only end in the page's deletion. the only page in that index has 200 records (all live, 0 dead) using half the page size (which is a leaf page and is not half dead, btw). so, how do you justify that pgstattuple says we have just 25% of free space?
postgres=# SELECT * from bt_page_stats('pgbench_tellers_pkey', 1);
-[ RECORD 1 ]-+-----
blkno         | 1
type          | l
live_items    | 200
dead_items    | 0
avg_item_size | 16
page_size     | 8192
free_size     | 4148
btpo_prev     | 0
btpo_next     | 0
btpo          | 0
btpo_flags    | 3
I don't know about counting non-leaf pages ignoring all non-leaf pages still gives a considerable difference between pgstattuple and relation_free_space() pgstattuple() counts the single B-tree meta page as always-full, while relation_free_space() skips it for all purposes. For tiny indexes, that can shift the percentage dramatically.
Re: [HACKERS] foreign key locks, 2nd attempt
On Wed, Jan 18, 2012 at 05:18:31PM -0300, Alvaro Herrera wrote: Excerpts from Heikki Linnakangas's message of mar ene 17 03:21:28 -0300 2012: On 16.01.2012 21:52, Alvaro Herrera wrote: I was initially thinking that pg_multixact should return the empty set if asked for the members of a multi that preceded the freeze point. That way, I thought, we would just never try to access a page originated in the older version (assuming the freeze point is set to current whenever pg_upgrade runs). However, as things currently stand, accessing an old multi raises an error. So maybe we need a scheme a bit more complex to handle this. Hmm, could we create new multixact files filled with zeros, covering the range that was valid in the old cluster? Hm, we could do something like that I guess. I'm not sure that just zeroes is the right pattern, but it should be something simple. PostgreSQL 9.1 can have all ~4B MultiXactIds on disk at any given time. We could silently ignore the lookup miss when HEAP_XMAX_LOCK_ONLY is also set. That makes existing data files acceptable while still catching data loss scenarios going forward. (It's tempting to be stricter when we know the cluster data files originated in PostgreSQL 9.2+, but I'm not sure whether that's worth its weight.)
[HACKERS] Re: Add minor version to v3 protocol to allow changes without breaking backwards compatibility
Hi Mikko, First, for everyone else: this patch provides a more-compact binary output format for arrays. When the array contains no NULL elements and has a fixed-length element type, the new format saves four bytes per array element. We could do more. We could add support for arrays containing NULLs by way of a nulls bitmap. We could reduce the per-array overhead, currently 20 bytes for a 1-D array, in addition to the per-element overhead. Does anyone care about these broader cases? If so, speak now. On Thu, Dec 01, 2011 at 04:42:43PM +0200, Mikko Tiihonen wrote: On 11/30/2011 06:52 PM, Merlin Moncure wrote: On Mon, Nov 28, 2011 at 9:18 AM, Mikko Tiihonen mikko.tiiho...@nitorcreations.com wrote: Hi, As discussed a few days ago, it would be beneficial if we could change the v3 backend-client protocol without breaking backwards compatibility. Here is a working patch that exports a supported_binary_minor constant and a binary_minor session variable that can be used by clients to enable newer features. I also added an example usage where the array encoding for constant-size elements is optimized if the binary_minor version is new enough. I have coded the client-side support of binary_minor for the jdbc driver and tested that it worked. But let's first discuss if this is an acceptable path forward. I think all references to 'protocol' should be removed; binary wire formats != protocol: the protocol could bump to v4 but the wire formats (by happenstance) could all still continue to work -- therefore the version isn't minor (it's not dependent on protocol version but only on itself). Therefore, I don't much like the name 'supported_binary_minor'. How about binary_format_version? I was thinking that it would be possible to use the new minor version to introduce also new protocol messages such as streaming of large data into the database without knowing its size beforehand.
I agree with Merlin; the frontend/backend protocol is logically distinct from the binary send/recv formats of data types. For one key point, the latter is not exclusively core-defined; third-party extensions change their send/recv formats on a different schedule. They can add myext.binary_format_version GUCs of their own to cope in a similar way. Client interfaces that do not interpret individual result column data, such as libpq, require no update for send/recv format changes. In contrast, all client interfaces would need changes for a new protocol message. Most clients doing so may as well then advertise their support unconditionally. In contrast, send/recv format is something individual _users_ of the same client library may set differently. It's reasonable to have an application that still reads its data in send/recv format v0 even after upgrading to a version of libpq that talks frontend/backend protocol v4. I do think this is a great way to handle send/recv format changes. There needs to be a decision on official policy about breaking backwards compatibility of postgresql clients. Is it:
A) Every x.y postgres release is free to break any parts of the
   * protocol
   * text encoding
   * binary protocol
as long as it is documented - all client libraries should be coded so that they refuse to support version x.y+1 or newer (they might have an option to override this but there are no guarantees that it would work)
Good: no random bugs when using old client library
Bad: initial complaints from users before they understand that this is the best option for everyone
The ability to use old client libraries with new servers is more valuable than the combined benefits of all currently-contemplated protocol improvements. D) My proposed compromise where there is one minor_version setting and code in the server to support all different minor versions. The client requests the minor version, so all releases default to the backwards-compatible version. This is workable.
When the combinations start to create too much maintenance overhead, a new clean v4 version of the protocol is specified.
Good: Keeps full backwards compatibility
Good: Allows introducing changes at any time
Bad: Accumulates conditional code in server and clients until a new version of the protocol is released
We would retain support for protocol V3 for years following the first release to support protocol V4, so think of the conditional code as lasting forever. The patch works as advertised. After "set binary_minor = 1", the output of this command shrinks from 102 MiB to 63 MiB: \copy (select array[0,1,2,3,4,5,6,7,8,9] from generate_series(1,100)) to mynums with binary --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -477,6 +477,7 @@ static int segment_size; static int wal_block_size; static int wal_segment_size; static bool integer_datetimes; +static int supported_binary_minor; static int
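The reported shrink is consistent with back-of-envelope arithmetic on the two formats. The sketch below assumes the \copy ran over roughly one million rows (the generate_series bounds appear garbled in the archived message) and uses the per-array and per-element overheads quoted earlier:

```python
# Size of one 10-element int4 array under each wire format.
NELEMS = 10

# Per-array overhead cited above for a 1-D array: ndim, has-nulls flag,
# and element type OID (4 bytes each), plus one (dimension, lower bound)
# pair (8 bytes) = 20 bytes.
ARRAY_HEADER = 20

# v0 format: every element carries a 4-byte length word plus 4 data bytes.
old_array = ARRAY_HEADER + NELEMS * (4 + 4)

# Patched format: fixed-length elements drop the per-element length word,
# saving the advertised 4 bytes per element.
new_array = ARRAY_HEADER + NELEMS * 4

# COPY BINARY adds a 2-byte field count and a 4-byte field length per row.
ROW_OVERHEAD = 2 + 4

rows = 1_000_000  # assumed row count, see lead-in
old_mib = rows * (old_array + ROW_OVERHEAD) / 2**20
new_mib = rows * (new_array + ROW_OVERHEAD) / 2**20
print(round(old_mib), round(new_mib))  # 101 63 -- close to the observed 102/63 MiB
```

The small gap on the "before" side would come from the COPY file header and any rounding in the reported size; the per-row savings of 40 bytes is exactly 4 bytes times 10 elements.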
Re: [HACKERS] Re: Add minor version to v3 protocol to allow changes without breaking backwards compatibility
On Thu, Jan 19, 2012 at 02:00:20PM -0500, Robert Haas wrote: On Thu, Jan 19, 2012 at 10:37 AM, Noah Misch n...@leadboat.com wrote: I agree with Merlin; the frontend/backend protocol is logically distinct from the binary send/recv formats of data types. For one key point, the latter is not exclusively core-defined; third-party extensions change their send/recv formats on a different schedule. They can add myext.binary_format_version GUCs of their own to cope in a similar way. I agree. It occurs to me that we recently changed the default *text* output format for bytea for reasons not dissimilar to those contemplated here. Presumably, that's a much more disruptive change, and yet we've had minimal complaints because anyone who gets bitten can easily set bytea_output='escape' and the problem goes away. The same thing seems like it would work here, only the number of people needing to change the parameter will probably be even smaller, because fewer people use binary than text. Having said that, if we're to follow the precedent set by bytea_output, maybe we ought to just add binary_array_format={huge,ittybitty} and be done with it, rather than inventing a minor protocol version GUC for something that isn't really a protocol version change at all. We could introduce a differently-named general mechanism, but I guess I'm not seeing the point of that either. Just because someone has a backward-compatibility issue with one change of this type doesn't mean they have a similar issue with all of them. So I think adding a special-purpose GUC is more logical and more parallel to what we've done in the past, and it doesn't saddle us with having to be certain that we've designed the mechanism generally enough to handle all the cases that may come later. That makes sense. An attraction of a single binary format version was avoiding the "Is this worth a GUC?" conversation for each change. However, adding a GUC should be no more notable than bumping a binary format version.
Re: [HACKERS] Measuring relation free space
On Fri, Jan 20, 2012 at 07:03:22PM -0500, Jaime Casanova wrote: On Wed, Jan 18, 2012 at 7:01 PM, Noah Misch n...@leadboat.com wrote: On Wed, Jan 18, 2012 at 09:46:20AM -0500, Jaime Casanova wrote: ignoring all non-leaf pages still gives a considerable difference between pgstattuple and relation_free_space() pgstattuple() counts the single B-tree meta page as always-full, while relation_free_space() skips it for all purposes. For tiny indexes, that can shift the percentage dramatically. ok, i will reformulate the question. why is it fine to ignore non-leaf pages but not fine to ignore the meta page? pgstattuple() figures the free_percent by adding up all space available to hold tuples and dividing that by the simple size of the relation. Non-leaf pages and the meta page get identical treatment: both never hold tuples, so they do not contribute to the free space.
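A quick sketch of both accounting rules, using the pgbench_tellers_pkey numbers quoted earlier in the thread (one 8192-byte leaf page with 4148 bytes free, plus the metapage), shows where the competing 25% and 50% figures come from:

```python
PAGE = 8192        # page_size reported by bt_page_stats
LEAF_FREE = 4148   # free_size on the single leaf page

# pgstattuple(): free space summed over tuple-holding pages, divided by
# the whole relation size.  The metapage holds no tuples, so it adds
# nothing to the numerator but still counts in the denominator.
pgstattuple_pct = 100.0 * LEAF_FREE / (2 * PAGE)

# relation_free_space() (as described in this thread): the metapage is
# skipped for all purposes, so only the leaf page appears at all.
relation_free_space_pct = 100.0 * LEAF_FREE / PAGE

print(round(pgstattuple_pct, 1), round(relation_free_space_pct, 1))  # 25.3 50.6
```

On a two-page index the denominators differ by a factor of two, which is exactly the "25% vs 50%" discrepancy; on a large index the metapage would be lost in the noise.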
Re: [HACKERS] [PATCH] Support for foreign keys with arrays
On Sat, Jan 14, 2012 at 08:18:48PM +0100, Marco Nenciarini wrote: This is our latest version of the patch. Gabriele, Gianni and I have discussed a lot and decided to send an initial patch which uses EACH REFERENCES instead of ARRAY REFERENCES. The reason behind this is that ARRAY REFERENCES generates an ambiguous grammar, and we all agreed that EACH REFERENCES makes sense (and at the same time does not introduce any new keyword). This is however open for discussion. I greatly like that name; it would still make sense for other aggregate types, should we ever expand its use. Please complete the name change: the documentation, catalog entries, etc should all call them something like "each foreign key constraints" (I don't particularly like that exact wording). You currently forbid multi-column EACH FKs. I agree that we should allow only one array column per FK; with more, the set of required PK rows would be something like the Cartesian product of the elements of the array columns. However, there are no definitional problems, at least for NO ACTION, around a FK constraint having one array column and N scalar columns. Whether or not you implement that now, let's choose a table_constraint syntax leaving that opportunity open. How about: FOREIGN KEY(col_a, EACH col_b, col_c) REFERENCES pktable (a, b, c) You've identified that we cannot generally implement the ON DELETE ARRAY CASCADE action on multidimensional arrays. This patch chooses to downgrade to ON DELETE ARRAY SET NULL in that case. My initial reaction is that it would be better to forbid multidimensional arrays in the column when the delete action is ON DELETE ARRAY SET NULL. That's not satisfying, either, because it makes the definition of conforming data depend on the ON DELETE action. Do we have other options?
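To make the multidimensional hazard concrete, here is a sketch using the patch's proposed (not committed) syntax; table names are illustrative:

```sql
CREATE TABLE pktable (a int PRIMARY KEY);
CREATE TABLE fktable (
    a int[] EACH REFERENCES pktable ON DELETE ARRAY CASCADE  -- proposed syntax
);
INSERT INTO pktable VALUES (1), (2), (3), (4);
INSERT INTO fktable VALUES ('{{1,2},{3,4}}');
-- DELETE FROM pktable WHERE a = 2;
-- ARRAY CASCADE would have to remove the element 2, but removing one
-- element of '{{1,2},{3,4}}' leaves no valid rectangular array, so no
-- well-defined result exists; the patch downgrades to ARRAY SET NULL.
```

For one-dimensional arrays the element can simply be deleted from the array, so the action is well defined there.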
----------------+---------+---------+
                | ON      | ON      |
 Action         | DELETE  | UPDATE  |
----------------+---------+---------+
 CASCADE        | Row     | Error   |
 SET NULL       | Row     | Row     |
 SET DEFAULT    | Row     | Row     |
 ARRAY CASCADE  | Element | Element |
 ARRAY SET NULL | Element | Element |
 NO ACTION      |    -    |    -    |
 RESTRICT       |    -    |    -    |
----------------+---------+---------+

To complete the ARRAY -> EACH transition, I would suggest names like CASCADE EACH/SET EACH NULL.

I like the extensive test cases you have included. There's one more thing they should do: leave tables having EACH REFERENCES relationships in the regression database. This way, pg_dump tests of the regression database will exercise pg_dump with respect to this feature.

The patch emits several warnings:

heap.c: In function `StoreRelCheck':
heap.c:1947: warning: passing argument 17 of `CreateConstraintEntry' makes integer from pointer without a cast
index.c: In function `index_constraint_create':
index.c:1160: warning: passing argument 17 of `CreateConstraintEntry' makes integer from pointer without a cast
In file included from gram.y:13051:
scan.c: In function `yy_try_NUL_trans':
scan.c:16243: warning: unused variable `yyg'
trigger.c: In function `CreateTrigger':
trigger.c:454: warning: passing argument 17 of `CreateConstraintEntry' makes integer from pointer without a cast
typecmds.c: In function `domainAddConstraint':
typecmds.c:2960: warning: passing argument 17 of `CreateConstraintEntry' makes integer from pointer without a cast
arrayfuncs.c: In function `array_remove':
arrayfuncs.c:5197: warning: unused variable `dimresult'
ri_triggers.c: In function `RI_FKey_check':
ri_triggers.c:484: warning: too many arguments for format

This test case, copied from my previous review except for updating the syntax, still fails:

BEGIN;
CREATE TABLE parent (c int PRIMARY KEY);
CREATE TABLE child (c int[]);
INSERT INTO parent VALUES (1);
INSERT INTO child VALUES ('{3,1,2}');
ALTER TABLE child ADD FOREIGN KEY (c) EACH REFERENCES parent; -- should error
INSERT INTO child VALUES ('{3,1,2}'); -- does error, as expected
ROLLBACK;

Most of my code comments
concern minor matters:

*** a/doc/src/sgml/ddl.sgml
--- b/doc/src/sgml/ddl.sgml
*** CREATE TABLE order_items (
*** 852,857 ****
--- 882,931 ----
      </para>

      <para>
+      When working with foreign key arrays, you have two more
+      options that can be used with your
+      <literal>EACH REFERENCES</literal> definition:
+      <literal>ARRAY CASCADE</literal> and
+      <literal>ARRAY SET NULL</literal>. Depending on
+      the triggering action (<command>DELETE</command> or
+      <command>UPDATE</command>) on the referenced table,
+      every element in the referencing array will be either
+      deleted/updated or set to NULL.
+      For more detailed information on foreign key arrays
+      options and special cases, please refer to the
+      documentation for <xref linkend="sql-createtable">.
+     </para>
+
+     <para>
+      For instance, in the example below, a <command>DELETE</command>
+      from the <literal>countries</literal> table
Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE
On Sat, Jan 21, 2012 at 08:04:20PM -0800, Jeff Janes wrote: This is failing make check for me. *** /tmp/foo/src/test/regress/expected/alter_table.out Sat Jan 21 19:51:46 2012 --- /tmp/foo/src/test/regress/results/alter_table.out Sat Jan 21 19:54:18 2012 *** *** 1662,1667 --- 1662,1668 NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index norewrite1_parent_pkey for table norewrite1_parent DEBUG: building index norewrite1_parent_pkey on table norewrite1_parent CREATE TABLE norewrite1_child(c text REFERENCES norewrite1_parent); + DEBUG: building index pg_toast_41491_index on table pg_toast_41491 ALTER TABLE norewrite1_child ALTER c TYPE varchar; -- recheck FK DEBUG: validating foreign key constraint norewrite1_child_c_fkey CREATE DOMAIN other_int AS int; Thanks. I've attached a new version fixing this problem. diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index cc210f0..bb2cf62 100644 *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** *** 276,281 static Oid transformFkeyCheckAttrs(Relation pkrel, --- 276,282 int numattrs, int16 *attnums, Oid *opclasses); static void checkFkeyPermissions(Relation rel, int16 *attnums, int natts); + static CoercionPathType findFkeyCast(Oid targetTypeId, Oid sourceTypeId, Oid *funcid); static void validateCheckConstraint(Relation rel, HeapTuple constrtup); static void validateForeignKeyConstraint(char *conname, Relation rel, Relation pkrel, *** *** 357,362 static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMOD --- 358,364 static void ATPostAlterTypeParse(Oid oldId, char *cmd, List **wqueue, LOCKMODE lockmode, bool rewrite); static void TryReuseIndex(Oid oldId, IndexStmt *stmt); + static void TryReuseForeignKey(Oid oldId, Constraint *con); static void change_owner_fix_column_acls(Oid relationOid, Oid oldOwnerId, Oid newOwnerId); static void change_owner_recurse_to_sequences(Oid relationOid, *** *** 5574,5579 
ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, --- 5576,5583 numpks; Oid indexOid; Oid constrOid; + bool old_check_ok; + ListCell *old_pfeqop_item = list_head(fkconstraint->old_conpfeqop); /* * Grab an exclusive lock on the pk table, so that someone doesn't delete *** *** 5690,5695 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, --- 5694,5706 (errcode(ERRCODE_INVALID_FOREIGN_KEY), errmsg("number of referencing and referenced columns for foreign key disagree"))); + /* + * On the strength of a previous constraint, we might avoid scanning + * tables to validate this one. See below. + */ + old_check_ok = (fkconstraint->old_conpfeqop != NIL); + Assert(!old_check_ok || numfks == list_length(fkconstraint->old_conpfeqop)); + for (i = 0; i < numpks; i++) { Oid pktype = pktypoid[i]; *** *** 5704,5709 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, --- 5715,5721 Oid ppeqop; Oid ffeqop; int16 eqstrategy; + Oid pfeqop_right; /* We need several fields out of the pg_opclass entry */ cla_ht = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclasses[i])); *** *** 5746,5755 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, pfeqop = get_opfamily_member(opfamily, opcintype, fktyped, eqstrategy); if (OidIsValid(pfeqop)) ffeqop = get_opfamily_member(opfamily, fktyped, fktyped, eqstrategy); else ! ffeqop = InvalidOid; /* keep compiler quiet */ if (!(OidIsValid(pfeqop) && OidIsValid(ffeqop))) { --- 5758,5774 pfeqop = get_opfamily_member(opfamily, opcintype, fktyped, eqstrategy); if (OidIsValid(pfeqop)) + { + pfeqop_right = fktyped; ffeqop = get_opfamily_member(opfamily, fktyped, fktyped,
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
New version that repairs a defective test case. diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index 712b0b0..1bf1de5 100644 *** a/src/backend/commands/indexcmds.c --- b/src/backend/commands/indexcmds.c *** *** 23,28 --- 23,29 #include catalog/pg_opclass.h #include catalog/pg_opfamily.h #include catalog/pg_tablespace.h + #include catalog/pg_type.h #include commands/dbcommands.h #include commands/defrem.h #include commands/tablecmds.h *** *** 52,57 --- 53,59 /* non-export function prototypes */ static void CheckPredicate(Expr *predicate); static void ComputeIndexAttrs(IndexInfo *indexInfo, + Oid *typeOidP, Oid *collationOidP, Oid *classOidP, int16 *colOptionP, *** *** 87,104 static void RangeVarCallbackForReindexIndex(const RangeVar *relation, * of columns and that if one has an expression column or predicate, both do. * Errors arising from the attribute list still apply. * ! * Most column type changes that can skip a table rewrite will not invalidate ! * indexes. For btree and hash indexes, we assume continued validity when ! * each column of an index would have the same operator family before and ! * after the change. Since we do not document a contract for GIN or GiST ! * operator families, we require an exact operator class match for them and ! * for any other access methods. ! * ! * DefineIndex always verifies that each exclusion operator shares an operator ! * family with its corresponding index operator class. For access methods ! * having no operator family contract, confirm that the old and new indexes ! * use the exact same exclusion operator. For btree and hash, there's nothing ! * more to check. * * We do not yet implement a test to verify compatibility of expression * columns or predicates, so assume any such index is incompatible. --- 89,105 * of columns and that if one has an expression column or predicate, both do. * Errors arising from the attribute list still apply. * ! 
* Most column type changes that can skip a table rewrite do not invalidate ! * indexes. We acknowledge this when all operator classes, collations and ! * exclusion operators match. Though we could further permit intra-opfamily ! * changes for btree and hash indexes, that adds subtle complexity with no ! * concrete benefit for core types. ! * ! * When a comparison or exclusion operator has a polymorphic input type, the ! * actual input types must also match. This defends against the possibility ! * that operators could vary behavior in response to get_fn_expr_argtype(). ! * At present, this hazard is theoretical: check_exclusion_constraint() and ! * all core index access methods decline to set fn_expr for such calls. * * We do not yet implement a test to verify compatibility of expression * columns or predicates, so assume any such index is incompatible. *** *** 111,116 CheckIndexCompatible(Oid oldId, --- 112,118 List *exclusionOpNames) { bool isconstraint; + Oid *typeObjectId; Oid *collationObjectId; Oid *classObjectId; Oid accessMethodId; *** *** 123,132 CheckIndexCompatible(Oid oldId, int numberOfAttributes; int old_natts; bool isnull; - bool family_am; bool ret = true; oidvector *old_indclass; oidvector *old_indcollation; int i; Datum d; --- 125,134 int numberOfAttributes; int old_natts; bool isnull; bool ret = true; oidvector *old_indclass; oidvector *old_indcollation; + Relation irel; int i; Datum d; *** *** 168,177 CheckIndexCompatible(Oid oldId, indexInfo->ii_ExclusionOps = NULL; indexInfo->ii_ExclusionProcs = NULL; indexInfo->ii_ExclusionStrats = NULL; collationObjectId = (Oid *) palloc(numberOfAttributes * sizeof(Oid)); classObjectId = (Oid *) palloc(numberOfAttributes * sizeof(Oid)); coloptions = (int16 *) palloc(numberOfAttributes * sizeof(int16)); ! ComputeIndexAttrs(indexInfo, collationObjectId, classObjectId, coloptions, attributeList, exclusionOpNames, relationId, accessMethodName, accessMethodId, --- 170,181 indexInfo->ii_ExclusionOps = NULL;
Re: [HACKERS] [PATCH] Support for foreign keys with arrays
On Sun, Jan 22, 2012 at 09:06:49PM +, Simon Riggs wrote: On Sat, Jan 21, 2012 at 8:42 PM, Noah Misch n...@leadboat.com wrote: You currently forbid multi-column EACH FKs. I agree that we should allow only one array column per FK; with more, the set of required PK rows would be something like the Cartesian product of the elements of array columns. However, there are no definitional problems, at least for NO ACTION, around a FK constraint having one array column and N scalar columns. Whether or not you implement that now, let's choose a table_constraint syntax leaving that opportunity open. How about: FOREIGN KEY(col_a, EACH col_b, col_c) REFERENCES pktable (a, b, c) I don't think we should be trying to cover every possible combination of arrays, non-arrays and all the various options. The number of combinations is making this patch larger than it needs to be and as a result endangers its being committed in this release just on committer time to cope with the complexity. We have a matter of weeks to get this rock solid. Yes, lets keep syntax open for future additions, but lets please focus/edit this down to a solid, useful patch for 9.2. +1. Let's change the syntax to leave that door open but not use the flexibility at this time.
Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Sun, Jan 22, 2012 at 11:47:06PM +0200, Mikko Tiihonen wrote: On 01/20/2012 04:45 AM, Noah Misch wrote: On Thu, Jan 19, 2012 at 02:00:20PM -0500, Robert Haas wrote: Having said that, if we're to follow the precedent set by bytea_format, maybe we ought to just add binary_array_format={huge,ittybitty} and be done with it, rather than inventing a minor protocol version GUC for something that isn't really a protocol version change at all. We could introduce a differently-named general mechanism, but I guess I'm not seeing the point of that either. Just because someone has a backward-compatibility issue with one change of this type doesn't mean they have a similar issue with all of them. So I think adding a special-purpose GUC is more logical and more parallel to what we've done in the past, and it doesn't saddle us with having to be certain that we've designed the mechanism generally enough to handle all the cases that may come later. That makes sense. An attraction of a single binary format version was avoiding the "Is this worth a GUC?" conversation for each change. However, adding a GUC should be no more notable than bumping a binary format version. I see the main difference between a GUC per feature vs. a minor version being that with versioned changes old clients keep working, because they have to explicitly request a specific version, whereas with separate GUC variables each feature will be enabled by default and users have to either keep up with new client versions or figure out how to explicitly disable the changes. No, we can decide that anew for each GUC. If you'd propose that array_output = full be the default, that works for me. Here is a second version of the patch. The binary array encoding changes stay the same but all code around was rewritten.
Changes from previous versions based on received comments:
* removed the minor protocol version concept
* introduced a new GUC variable array_output copying the current bytea_output type, with values full (old value) and smallfixed (new default)

How about the name array_binary_output?

* added documentation for the new GUC variable
* used constants for the array flags variable values

*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*** COPY postgres_log FROM '/full/path/to/lo
*** 4888,4893 ****
--- 4888,4910 ----
       </listitem>
      </varlistentry>

+     <varlistentry id="guc-array-output" xreflabel="array_output">
+      <term><varname>array_output</varname> (<type>enum</type>)</term>
+      <indexterm>
+       <primary><varname>array_output</varname> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Sets the output format for binary encoding of values of
+        type <type>array</type>. Valid values are
+        <literal>smallfixed</literal> (the default)
+        and <literal>full</literal> (the traditional PostgreSQL
+        format). The <type>array</type> type always

It's not "The <type>array</type> type" but "Array types", a class.

+        accepts both formats on input, regardless of this setting.
+       </para>
+      </listitem>
+     </varlistentry>

The section "Array Input and Output Syntax" should reference this GUC.

*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*** *** 64,69 --- 64,70
  #include "storage/predicate.h"
  #include "tcop/tcopprot.h"
  #include "tsearch/ts_cache.h"
+ #include "utils/array.h"
  #include "utils/builtins.h"
  #include "utils/bytea.h"
  #include "utils/guc_tables.h"
*** static const struct config_enum_entry by *** 225,230 --- 226,243
  };

  /*
+  * Options for enum values defined in this module.
+  *
+  * NOTE! Option values may not contain double quotes!
+  */

Don't replicate this comment.
+
+ static const struct config_enum_entry array_output_options[] = {
+ 	{"full", ARRAY_OUTPUT_FULL, false},
+ 	{"smallfixed", ARRAY_OUTPUT_SMALLFIXED, false},
+ 	{NULL, 0, false}
+ };
+
+ /* * We have different sets for client and server message level options because * they sort slightly different (see log level) */ *** static struct config_enum ConfigureNames *** 3047,3052 --- 3060,3075 NULL, NULL, NULL }, + { + {"array_output", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,

You've put the GUC in COMPAT_OPTIONS_PREVIOUS, but you've put its documentation in the section for CLIENT_CONN_STATEMENT. I don't have a strong opinion on which one to use, but be consistent.

+ gettext_noop("Sets the binary output format for arrays."), + NULL + }, + bytea_output, array_output + ARRAY_OUTPUT_SMALLFIXED, array_output_options, + NULL, NULL, NULL + }, + { {"client_min_messages", PGC_USERSET, LOGGING_WHEN, gettext_noop(Sets the message levels
Re: [HACKERS] Collect frequency statistics for arrays
On Mon, Jan 23, 2012 at 01:21:20AM +0400, Alexander Korotkov wrote: Updated patch is attached. I've updated the comment of mcelem_array_contained_selec with a more detailed description of the probability distribution assumption. Also, I found that the "rest" behavior is better described by a Poisson distribution; relevant changes were made. Thanks. That makes more of the math clear to me. I do not follow all of it, but I feel that the comments now have enough information that I could go about doing so.

+ /* Take care about events with low probabilities. */
+ if (rest > DEFAULT_CONTAIN_SEL)
+ {

Why the change from "rest > 0" to this in the latest version?

+ /* emit some statistics for debug purposes */
+ elog(DEBUG3, "array: target # mces = %d, bucket width = %d, "
+      "# elements = %llu, hashtable size = %d, usable entries = %d",
+      num_mcelem, bucket_width, element_no, i, track_len);

That should be UINT64_FMT. (I introduced that error in v0.10.) I've attached a new version that includes the UINT64_FMT fix, some edits of your newest comments, and a rerun of pgindent on the new files. I see no other issues precluding commit, so I am marking the patch Ready for Committer. If I made any of the comments worse, please post another update.

Thanks, nm

arrayanalyze-0.13.patch.gz Description: application/gunzip
Re: [HACKERS] Re: Add minor version to v3 protocol to allow changes without breaking backwards compatibility
On Mon, Jan 23, 2012 at 10:34:12AM -0500, Robert Haas wrote: On Mon, Jan 23, 2012 at 9:36 AM, Florian Weimer fwei...@bfk.de wrote: * Robert Haas: In this particular case, I knew that the change was coming and could push updated Java and Perl client libraries well before the server-side change hit our internal repository, but I really don't want to have to pay attention to such details. But if we *don't* turn this on by default, then chances are very good that it will get much less use. That doesn't seem good either. If it's important enough to do it at all, then IMHO it's important enough for it to be turned on by default. We have never made any guarantee that the binary format won't change from release to release. The problem here are libpq-style drivers which expose the binary format to the application. The Java driver doesn't do that, but the Perl driver does. (However, both transparently decode BYTEA values received in text format, which led to the compatibility issue.) I can see where that could cause some headaches... but it seems to me that if we take that concern seriously, it brings us right back to square one. If breaking the binary format causes too much pain to actually go do it, then we shouldn't change it until we're breaking everything else anyway (i.e. new protocol version, as Tom suggested). As I said upthread, and you appeared to agree, the protocol is independent of individual data type send/recv formats. Even if we were already adding protocol v4 to PostgreSQL 9.2, having array_send() change its behavior in response to the active protocol version would be wrong.
Re: [HACKERS] Measuring relation free space
On Mon, Jan 23, 2012 at 04:56:24PM -0300, Alvaro Herrera wrote: Excerpts from Noah Misch's message of vie ene 20 22:33:30 -0300 2012: pgstattuple() figures the free_percent by adding up all space available to hold tuples and dividing that by the simple size of the relation. Non-leaf pages and the meta page get identical treatment: both never hold tuples, so they do not contribute to the free space. Hm. Leaf pages hold as many tuples as non-leaf pages, no? I mean, for each page element there's a value and a CTID. In non-leaf pages those CTIDs point to other index pages, one level down the tree; in leaf pages they point to the heap. That distinction seemed important when I sent my last message, but now I agree that it's largely irrelevant for free space purposes. If someone feels like doing it, +1 for making pgstattuple() count non-leaf free space. Thanks, nm
Re: [HACKERS] Measuring relation free space
On Tue, Jan 24, 2012 at 11:24:08AM -0500, Jaime Casanova wrote: On Mon, Jan 23, 2012 at 7:18 PM, Noah Misch n...@leadboat.com wrote: If someone feels like doing it, +1 for making pgstattuple() count non-leaf free space. actually i agreed that non-leaf pages are irrelevant... i just confirmed that in a production system with 300GB, none of the indexes, whether in an 84M-row table or in a heavily updated one, has more than 1 root page; all the rest are deleted, half_dead or leaf. so the possibility of bloat coming from non-leaf pages seems very odd FWIW, the number to look at is internal_pages from pgstatindex():

[local] test=# create table t4 (c) as select * from generate_series(1,100);
SELECT 100
[local] test=# alter table t4 add primary key(c);
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index t4_pkey for table t4
ALTER TABLE
[local] test=# select * from pgstatindex('t4_pkey');
-[ RECORD 1 ]------+---------
version            | 2
tree_level         | 2
index_size         | 22478848
root_block_no      | 290
internal_pages     | 10
leaf_pages         | 2733
empty_pages        | 0
deleted_pages      | 0
avg_leaf_density   | 90.06
leaf_fragmentation | 0

So, 0.4% of this index. They appear in proportion to the logarithm of the total index size. I agree that bloat centered on them is unlikely. Counting them would be justified, but that is a question of formal accuracy rather than practical importance. but the possibility of bloat coming from the meta page doesn't exist, AFAIUI at least we need the most accurate value about usable free space, because the idea is to add a sampler mode to the function so we don't scan the whole relation. that's why we still need the function. I doubt we'd add this function solely on the basis that a future improvement will make it useful. For the patch to go in now, it needs to be useful now. (This is not a universal principle, but it mostly holds for low-complexity patches like this one.) All my comments below would also apply to such a broader patch. btw...
pgstattuple also has the problem that it's not using a ring buffer attached are two patches: - v5: is the same original patch but only track space in leaf, deleted and half_dead pages - v5.1: adds the same for all kind of indexes (problem is that this is inconsistent with the fact that pageinspect only manages btree indexes for everything else) Let's take a step back. Again, what you're proposing is essentially a faster implementation of SELECT free_percent FROM pgstattuple(rel). If this code belongs in core at all, it belongs in the pgstattuple module. Share as much infrastructure as is reasonable between the user-visible functions of that module. For example, I'm suspecting that the pgstat_index() call tree should be shared, with pgstat_index_page() checking a flag to decide whether to gather per-tuple stats. Next, compare the bits of code that differ between pgstattuple() and relation_free_space(), convincing yourself that the differences are justified. Each difference will yield one of the following conclusions: 1. Your code contains an innovation that would apply to both functions. Where not too difficult, merge these improvements into pgstattuple(). In order for a demonstration of your new code's better performance to be interesting, we must fix the same low-hanging fruit in its competitor. One example is the use of the bulk read strategy. Another is the support for SP-GiST. 2. Your code is missing an essential behavior of pgstattuple(). Add it to your code. One example is the presence of CHECK_FOR_INTERRUPTS() calls. 3. Your code behaves differently from pgstattuple() due to a fundamental difference in their tasks. These are the only functional differences that ought to remain in your finished patch; please point them out in comments. For example, pgstat_heap() visits every tuple in the heap. You'll have no reason to do that; pgstattuple() only needs it to calculate statistics other than free_percent. 
In particular, I call your attention to the fact that pgstattuple() takes shared buffer content locks before examining pages. Your proposed patch does not do so. I do not know with certainty whether that falls under #1 or #2. The broad convention is to take such locks, because we elsewhere want an exact answer. These functions are already inexact; they make no effort to observe a consistent snapshot of the table. If you convince yourself that the error arising from not locking buffers is reasonably bounded, we can lose the locks (in both functions -- conclusion #1). Comments would then be in order. With all that done, run some quick benchmarks: see how SELECT free_percent FROM pgstattuple(rel) fares compared to SELECT relation_free_space(rel) for a large heap and for a large B-tree index. If the timing difference is too small to be interesting to you, remove relation_free_space() and submit your pgstattuple() improvements alone. Otherwise
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
On Wed, Jan 25, 2012 at 03:32:49PM -0500, Robert Haas wrote: On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote: New version that repairs a defective test case. Committed. I don't find this to be particularly good style: Thanks.

+ for (i = 0; i < old_natts && ret; i++)
+     ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
+            irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);

...but I am not sure whether we have any formal policy against it, so I just committed it as-is for now. I would have surrounded the loop with an if (ret) block and written the body of the loop as if (condition) { ret = false; break; }. I value the savings in vertical space more than the lost idiomaticness. This decision is 90+% subjective, so I cannot blame you for concluding otherwise. I do know the feeling of looking at PostgreSQL source code and wishing the author had not attempted to conserve every line.
Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE
On Wed, Jan 25, 2012 at 11:17:27AM -0300, Alvaro Herrera wrote: Excerpts from Noah Misch's message of dom ene 22 02:05:31 -0300 2012: Thanks. I've attached a new version fixing this problem. Looks good to me. Can you please clarify this bit? * Since we elsewhere require that all collations share the same * notion of equality, don't compare collation here. Since I'm not familiar with this code, it's difficult for me to figure out where this "elsewhere" is, and what "same notion of equality" means. Good questions. See the comment in front of ri_GenerateQualCollation(), the comment in process_ordered_aggregate_single() at nodeAgg.c:605, the larger comment in add_sort_column(), and the two XXX comments in relation_has_unique_index_for(). We should probably document that assumption in xindex.sgml to keep type implementors apprised. "Same notion of equality" just means that "a COLLATE x = b COLLATE y" has the same value for every x, y. In any event, the patch needed a rebase, so I've attached it rebased and with that comment edited to reference ri_GenerateQualCollation(), that being the most-relevant source for the assumption in question.
Thanks for reviewing, nm diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index cb8ac67..ba20950 100644 *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** *** 276,281 static Oid transformFkeyCheckAttrs(Relation pkrel, --- 276,282 int numattrs, int16 *attnums, Oid *opclasses); static void checkFkeyPermissions(Relation rel, int16 *attnums, int natts); + static CoercionPathType findFkeyCast(Oid targetTypeId, Oid sourceTypeId, Oid *funcid); static void validateCheckConstraint(Relation rel, HeapTuple constrtup); static void validateForeignKeyConstraint(char *conname, Relation rel, Relation pkrel, *** *** 357,362 static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMOD --- 358,364 static void ATPostAlterTypeParse(Oid oldId, char *cmd, List **wqueue, LOCKMODE lockmode, bool rewrite); static void TryReuseIndex(Oid oldId, IndexStmt *stmt); + static void TryReuseForeignKey(Oid oldId, Constraint *con); static void change_owner_fix_column_acls(Oid relationOid, Oid oldOwnerId, Oid newOwnerId); static void change_owner_recurse_to_sequences(Oid relationOid, *** *** 5591,5596 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, --- 5593,5600 numpks; Oid indexOid; Oid constrOid; + bool old_check_ok; + ListCell *old_pfeqop_item = list_head(fkconstraint->old_conpfeqop); /* * Grab an exclusive lock on the pk table, so that someone doesn't delete *** *** 5707,5712 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, --- 5711,5723 (errcode(ERRCODE_INVALID_FOREIGN_KEY), errmsg("number of referencing and referenced columns for foreign key disagree"))); + /* + * On the strength of a previous constraint, we might avoid scanning + * tables to validate this one. See below.
+ */ + old_check_ok = (fkconstraint->old_conpfeqop != NIL); + Assert(!old_check_ok || numfks == list_length(fkconstraint->old_conpfeqop)); + for (i = 0; i < numpks; i++) { Oid pktype = pktypoid[i]; *** *** 5721,5726 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, --- 5732,5738 Oid ppeqop; Oid ffeqop; int16 eqstrategy; + Oid pfeqop_right; /* We need several fields out of the pg_opclass entry */ cla_ht = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclasses[i])); *** *** 5763,5772 ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, pfeqop = get_opfamily_member(opfamily, opcintype, fktyped, eqstrategy); if (OidIsValid(pfeqop)) ffeqop = get_opfamily_member(opfamily, fktyped, fktyped, eqstrategy); else ! ffeqop = InvalidOid; /* keep compiler quiet */ if (!(OidIsValid(pfeqop) && OidIsValid(ffeqop))) { --- 5775,5791
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
On Wed, Jan 25, 2012 at 11:53:10PM -0500, Tom Lane wrote:
> Not only is that code spectacularly unreadable, but has nobody noticed
> that this commit broke the buildfarm?

Thanks for reporting the problem.  This arose because the new test case
temporarily sets client_min_messages=DEBUG1.  The default buildfarm
configuration sets log_statement=all in its postgresql.conf, so the client
gets those log_statement lines.  I would fix this as attached, resetting the
optional logging to defaults during the test cases in question.  Not
delightful, but that's what we have to work with.

nm

*** a/src/test/regress/expected/alter_table.out
--- b/src/test/regress/expected/alter_table.out
***************
*** 1665,1671 **** where oid = 'test_storage'::regclass;
--- 1665,1679 ----
   t
  (1 row)
  
+ --
  -- SET DATA TYPE without a rewrite
+ --
+ -- Temporarily disable optional logging that make installcheck testers
+ -- reasonably might enable.
+ SET log_duration = off;
+ SET log_lock_waits = off;
+ SET log_statement = none;
+ SET log_temp_files = -1;
  CREATE DOMAIN other_textarr AS text[];
  CREATE TABLE norewrite_array(c text[] PRIMARY KEY);
  NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "norewrite_array_pkey" for table "norewrite_array"
***************
*** 1674,1679 **** ALTER TABLE norewrite_array ALTER c TYPE text[]; -- no work
--- 1682,1691 ----
  ALTER TABLE norewrite_array ALTER c TYPE other_textarr; -- rebuild index
  DEBUG:  building index "norewrite_array_pkey" on table "norewrite_array"
  RESET client_min_messages;
+ RESET log_duration;
+ RESET log_lock_waits;
+ RESET log_statement;
+ RESET log_temp_files;
  --
  -- lock levels
  --
*** a/src/test/regress/sql/alter_table.sql
--- b/src/test/regress/sql/alter_table.sql
***************
*** 1197,1203 **** select reltoastrelid <> 0 as has_toast_table
--- 1197,1213 ----
  from pg_class
  where oid = 'test_storage'::regclass;
  
+ --
  -- SET DATA TYPE without a rewrite
+ --
+ 
+ -- Temporarily disable optional logging that make installcheck testers
+ -- reasonably might enable.
+ SET log_duration = off;
+ SET log_lock_waits = off;
+ SET log_statement = none;
+ SET log_temp_files = -1;
+ 
  CREATE DOMAIN other_textarr AS text[];
  CREATE TABLE norewrite_array(c text[] PRIMARY KEY);
  SET client_min_messages = debug1;
***************
*** 1205,1210 **** ALTER TABLE norewrite_array ALTER c TYPE text[]; -- no work
--- 1215,1225 ----
  ALTER TABLE norewrite_array ALTER c TYPE other_textarr; -- rebuild index
  RESET client_min_messages;
+ RESET log_duration;
+ RESET log_lock_waits;
+ RESET log_statement;
+ RESET log_temp_files;
+ 
  --
  -- lock levels
  --

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE
On Wed, Jan 25, 2012 at 10:39:56PM -0500, Noah Misch wrote:
> In any event, the patch needed a rebase, so I've attached it rebased and
> with that comment edited to reference ri_GenerateQualCollation(), that
> being the most-relevant source for the assumption in question.

Commit 9d35116611e6a1fc10f2298944fbf0e4e1a826be invalidated the test case
hunks again.  We'll need to either remove the test cases, as Robert chose to
do for that other patch, or bolster them per
http://archives.postgresql.org/message-id/20120126115536.GD15670%40tornado.leadboat.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Wed, Jan 11, 2012 at 10:12:31PM +, Simon Riggs wrote:
> v7

This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled.  One could not simply condition them on the
page_checksums setting, though.  Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on.  If that write tears,
leaving the checksum intact, that block will now fail verification.  A couple
of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and
the field is InvalidXLogRecPtr, set the field to the next LSN.  When a
checksum failure occurs in a page with LSN older than the stored one, emit
either a softer warning or no message at all.

Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
the buffer.  Here are other sites I noticed that do MarkBufferDirty() without
bumping the page LSN:

	heap_xlog_visible()
	visibilitymap_clear()
	visibilitymap_truncate()
	lazy_scan_heap()
	XLogRecordPageWithFreeSpace()
	FreeSpaceMapTruncateRel()
	fsm_set_and_search()
	fsm_vacuum_page()
	fsm_search_avail()

Though I have not investigated each of these in detail, I suspect most or all
represent continued torn-page hazards.  Since the FSM is just a hint, we do
not WAL-log it at all; it would be nicest to always skip checksums for it,
too.  The visibility map shortcuts are more nuanced.

When page_checksums=off and we read a page last written by page_checksums=on,
we still verify its checksum.  If a mostly-insert/read site uses checksums
for some time and eventually wishes to shed the overhead, disabling the
feature will not stop the costs for reads of old data.  They would need a
dump/reload to get the performance of a never-checksummed database.
Let's either unconditionally skip checksum validation under
page_checksums=off or add a new state like page_checksums=ignore for that
even-weaker condition.

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml

+       <para>
+        Turning this parameter off speeds normal operation, but
+        might allow data corruption to go unnoticed. The checksum uses
+        16-bit checksums, using the fast Fletcher 16 algorithm. With this
+        parameter enabled there is still a non-zero probability that an error
+        could go undetected, as well as a non-zero probability of false
+        positives.
+       </para>

What sources of false positives do you intend to retain?

--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -20,6 +20,7 @@
 #include "postgres.h"
 
 #include "access/visibilitymap.h"
+#include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
@@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
 /* XLOG gives us high 4 bits */
 #define XLOG_SMGR_CREATE	0x10
 #define XLOG_SMGR_TRUNCATE	0x20
+#define XLOG_SMGR_HINT		0x40
 
 typedef struct xl_smgr_create
 {
@@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
 	smgrDoPendingDeletes(false);
 }
 
+/*
+ * Write a backup block if needed when we are setting a hint.
+ *
+ * Deciding the "if needed" bit is delicate and requires us to either
+ * grab WALInsertLock or check the info_lck spinlock. If we check the
+ * spinlock and it says Yes then we will need to get WALInsertLock as well,
+ * so the design choice here is to just go straight for the WALInsertLock
+ * and trust that calls to this function are minimised elsewhere.
+ *
+ * Callable while holding share lock on the buffer content.
+ *
+ * Possible that multiple concurrent backends could attempt to write
+ * WAL records. In that case, more than one backup block may be recorded
+ * though that isn't important to the outcome and the backup blocks are
+ * likely to be identical anyway.
+ */
+#define SMGR_HINT_WATERMARK 13579
+void
+smgr_buffer_hint(Buffer buffer)
+{
+	/*
+	 * Make an XLOG entry reporting the hint
+	 */
+	XLogRecPtr	lsn;
+	XLogRecData rdata[2];
+	int			watermark = SMGR_HINT_WATERMARK;
+
+	/*
+	 * Not allowed to have zero-length records, so use a small watermark
+	 */
+	rdata[0].data = (char *) &(watermark);
+	rdata[0].len = sizeof(int);
+	rdata[0].buffer = InvalidBuffer;
+	rdata[0].buffer_std = false;
+	rdata[0].next = &(rdata[1]);
+
+	rdata[1].data = NULL;
+	rdata[1].len = 0;
+	rdata[1].buffer = buffer;
+	rdata[1].buffer_std = true;
+	rdata[1].next = NULL;
+
+	lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
+
+	/*
+	 * Set the page LSN if we wrote a backup block.
+
Re: [HACKERS] initdb and fsync
On Fri, Jan 27, 2012 at 04:19:41PM -0800, Jeff Davis wrote:
> It looks like initdb doesn't fsync all the files it creates, e.g. the
> PG_VERSION file.  While it's unlikely that it would cause any real data
> loss, it can be inconvenient in some testing scenarios involving VMs.
>
> Thoughts?  Would a patch to add a few fsync calls to initdb be accepted?

+1.  If I'm piloting "strace -f" right, initdb currently issues *no* syncs.
We'd probably, then, want a way to re-disable the fsyncs for hacker benefit.
Is a platform-independent fsync available at initdb time?  Not sure.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
On Thu, Jan 26, 2012 at 08:23:34AM -0500, Robert Haas wrote:
> On Thu, Jan 26, 2012 at 6:55 AM, Noah Misch n...@leadboat.com wrote:
> > On Wed, Jan 25, 2012 at 11:53:10PM -0500, Tom Lane wrote:
> > > Not only is that code spectacularly unreadable, but has nobody
> > > noticed that this commit broke the buildfarm?
> >
> > Thanks for reporting the problem.  This arose because the new test case
> > temporarily sets client_min_messages=DEBUG1.  The default buildfarm
> > configuration sets log_statement=all in its postgresql.conf, so the
> > client gets those log_statement lines.  I would fix this as attached,
> > resetting the optional logging to defaults during the test cases in
> > question.  Not delightful, but that's what we have to work with.
>
> I'm just going to remove the test.  This is not very future-proof and

It would deserve an update whenever we add a new optional-logging GUC
pertaining to user backends.  Other tests require similarly-infrequent
refreshes in response to other changes.  Of course, buildfarm members would
not be setting those new GUCs the day they're available.  Calling for an
update to the test could reasonably fall to the first buildfarm member owner
who actually decides to use a new GUC in his configuration.

> an ugly pattern if it gets copied to other places.  We need to work on

I would rather folks introduce ugliness to the test suite than introduce
behaviors having no test coverage.

> a more sensible way for ALTER TABLE to report what it did, but a
> solution based on what GUCs the build-farm happens to set doesn't seem
> like it's justified for the narrowness of the case we're testing here.

It's not based on what GUCs the buildfarm happens to set.  I looked up all
GUCs that might create problems such as the one observed here.  They were
the four I included in the patch, plus debug_pretty_print,
debug_print_parse, debug_print_plan and debug_print_rewritten.
I concluded that the four debug_* ones were materially less likely to ever
get set in postgresql.conf for a make installcheck run, so I left them out
for brevity.  The setting changed *by default* for buildfarm clients is
log_statement.  Buildfarm member owners can do as they wish, though.

> Whether or not we allow this case to work without a rewrite is in some
> sense arbitrary.  There's no real reason it can't be done; rather, we're
> just exercising restraint to minimize the risk of future bugs.  So I
> don't want to go to great lengths to test it.

I used the same strategy in another ALTER TABLE patch this CF:
http://archives.postgresql.org/message-id/20120126033956.gc15...@tornado.leadboat.com
If we pay its costs for those tests, we then may as well let this test case
ride their coattails.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] foreign key locks, 2nd attempt
On Tue, Jan 24, 2012 at 03:47:16PM -0300, Alvaro Herrera wrote:
> The biggest item remaining is the point you raised about multixactid
> wraparound.  This is closely related to multixact truncation and the
> way checkpoints are to be handled.  If we think that MultiXactId
> wraparound is possible, and we need to involve autovacuum to keep it
> at bay, then I

To prove it possible, we need to prove there exists some sequence of
operations consuming N xids and M > N multixactids.  Have N transactions
key-lock N-1 rows apiece, then have each of them key-lock one of the rows
locked by each other transaction.  This consumes N xids and N(N-1)
multixactids.  I believe you could construct a workload with N! multixact
usage, too.

Existence proofs are one thing, real workloads another.  My unsubstantiated
guess is that multixactid use will not overtake xid use in bulk on a
workload not specifically tailored to do so.  So, I think it's enough to
notice it, refuse to assign a new multixactid, and tell the user to clear
the condition with a VACUUM FREEZE of all databases.  Other opinions would
indeed be useful.

> think the only way to make that work is to add another column to
> pg_class so that each table's oldest multixact is tracked, same as we
> do with relfrozenxid for Xids.  If we do that, I think we can do away
> with most of the MultiXactTruncate junk I added -- things would become
> a lot simpler.  The cost would be bloating pg_class a bit more.  Are
> we okay with paying that cost?  I asked this question some months ago
> and I decided that I would not add the column, but I am starting to
> lean the other way.  I would like some opinions on this.

That's not the only way; autovacuum could just accelerate normal freezing
to advance the multixactid horizon indirectly.  Your proposal could make it
far more efficient, though.

> You asked two questions about WAL-logging locks: one was about the
> level of detail we log for each lock we grab; the other was about
> heap_xlog_update logging the right info.
> AFAICS, the main thing that makes detailed WAL logging necessary is hot
> standbys.  That is, the standby must have all the locking info so that
> concurrent transactions are similarly locked as in the master ... or am
> I wrong in that?  (ISTM this means I need to fix heap_xlog_update so
> that it faithfully registers the lock info we're storing, not just the
> current Xid).

Standby servers do not need accurate tuple locks, because it's not possible
to wait on a tuple lock during recovery.  (By the way, the question about
log detail was just from my own curiosity.  We don't especially need an
answer to move forward with this patch, unless you want one.)

> * Columns that are part of the key
>
> Noah thinks the set of columns should only consider those actually
> referenced by keys, not those that *could* be referenced.

Well, do you disagree?  To me it's low-hanging fruit, because it isolates
the UPDATE-time overhead of this patch to FK-referenced tables rather than
all tables having a PK or PK-like index (often just all tables).  Also, in
a table without columns, are all columns part of the key, or is the key the
empty set?

> I changed HeapSatisfiesHOTUpdate but that seems arbitrary.

It does seem arbitrary.  What led you to switch in a later version?

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] show Heap Fetches in EXPLAIN for index-only scans
On Wed, Jan 25, 2012 at 08:42:05PM -0500, Robert Haas wrote:
> Only the first pass of vacuum knows how to mark pages all-visible.
> After the update, the first pass of the first vacuum sees a dead tuple
> on the old page and truncates it to a dead line pointer.  When it comes
> to the new page, it observes that the page is now all-visible and marks
> it so.  It then does index vacuuming and returns to the first page,
> marking the dead line pointer unused.  But during this second visit to
> the old page, there's no possibility of marking the page as
> all-visible, because the code doesn't know how to do that.  The next
> vacuum's first pass, however, can do so, because there are no longer
> any dead tuples on the page.
>
> We could fix this by modifying lazy_vacuum_page(): since we have to
> dirty the buffer anyway, we could recheck whether all the remaining
> tuples on the page are now all-visible, and if so set the visibility
> map bit.  This is probably desirable even apart from index-only scans,
> because it will frequently save the next vacuum the cost of reading,
> dirtying, and writing extra pages.  There will be some incremental CPU
> cost, but that seems likely to be more than repaid by the I/O savings.
>
> Thoughts?  Should we do this at all?  If so, should we squeeze it into
> 9.2 or leave it for 9.3?

Sounds like a good idea.  It has bothered me that two consecutive VACUUMs
of a table, with no intervening activity, can dirty a page twice.  Making
that less frequent is a good thing.  I'd hold the change for 9.3, but
that's probably an unusually-conservative viewpoint.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Review of: explain / allow collecting row counts without timing info
On Fri, Feb 03, 2012 at 11:12:33AM -0500, Robert Haas wrote:
> After looking at this a bit, I think we can make the interface a bit
> more convenient.  I'd like to propose that we call the EXPLAIN option
> ROWS rather than TIMING, and that we provide the row-count information
> if *either* ROWS or ANALYZE is specified.  Then, if you want rows
> without timing, you can say this:
>
> EXPLAIN (ROWS) query...
>
> Rather than, as in the approach taken by the current patch:
>
> EXPLAIN (ANALYZE, TIMING off) query...
>
> ...which is a little more long-winded.

I favor the originally-submitted syntax.  I like having a single option,
ANALYZE, signal whether to actually run the statement.  The submitted
syntax also aligns with the fact that would motivate someone to use it:
"Timing has high overhead on my system." or "Timing makes the output
unstable."

We have both estimated row counts and actual row counts.  It may someday
prove useful to have an invocation that plans the query and shows estimated
row counts, omitting only cost estimates.  Having a ROWS option that
implies running the query could complicate that effort.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb and fsync
On Sat, Feb 04, 2012 at 03:41:27PM -0800, Jeff Davis wrote:
> On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
> > Yeah.  Personally I would be sad if initdb got noticeably slower, and
> > I've never seen or heard of a failure that this would fix.
>
> I worked up a patch, and it looks like it does about 6 file fsync's and
> a 7th for the PGDATA directory.  That degrades the time from about 1.1s
> to 1.4s on my workstation.
>
> So, is it worth it?  Should we make it an option that can be specified?

If we add fsync calls to the initdb process, they should cover the entire
data directory tree.  This patch syncs files that initdb.c writes, but we
ought to also sync files that bootstrap-mode backends had written.  An
optimization like the pg_flush_data() call in copy_file() may reduce the
speed penalty.

initdb should do these syncs by default and offer an option to disable them.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb and fsync
On Sun, Feb 05, 2012 at 10:53:20AM -0800, Jeff Davis wrote:
> On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
> > If we add fsync calls to the initdb process, they should cover the
> > entire data directory tree.  This patch syncs files that initdb.c
> > writes, but we ought to also sync files that bootstrap-mode backends
> > had written.
>
> It doesn't make sense for initdb to take responsibility to sync files
> created by the backend.  If there are important files that the backend
> creates, it should be the backend's responsibility to fsync them (and
> their parent directory, if needed).  And if they are unimportant to the
> backend, then there is no reason for initdb to fsync them.

I meant primarily to illustrate the need to be comprehensive, not comment
on which executable should fsync a particular file.  Bootstrap-mode
backends do not sync anything during an initdb run on my system.  With
your patch, we'll fsync a small handful of files and leave nearly
everything else vulnerable.

That being said, having each backend fsync its own writes will mean syncing
certain files several times within a single initdb run.  If the penalty
from that proves high enough, we may do well to instead have initdb.c sync
everything just once.

> > initdb should do these syncs by default and offer an option to
> > disable them.
>
> For test frameworks that run initdb often, that makes sense.  But for
> developers, it doesn't make sense to spend 0.5s typing an option that
> saves you 0.3s.  So, we'd need some more convenient way to choose the
> no-fsync option, like an environment variable that developers can set.
> Or maybe developers don't care about 0.3s?

Developers have shell aliases/functions/scripts and command history.  I
wouldn't object to having an environment variable control it, but I would
not personally find that more convenient than a command-line switch.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add protransform for numeric, varbit, and temporal types
On Tue, Feb 07, 2012 at 12:43:11PM -0500, Robert Haas wrote:
> I've committed the numeric and varbit patches and will look at the
> temporal one next.

Thanks.  The comment you added to numeric_transform() has a few typos:
"constained" -> "constrained" and "nodes" -> "notes".

> I did notice one odd thing, though: sometimes we seem to get a rebuild
> on the toast index for no obvious reason:
>
> rhaas=# set client_min_messages=debug1;
> SET
> rhaas=# create table foo (a serial primary key, b varbit);
> NOTICE:  CREATE TABLE will create implicit sequence "foo_a_seq" for serial column "foo.a"
> DEBUG:  building index "pg_toast_16430_index" on table "pg_toast_16430"
> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo"
> DEBUG:  building index "foo_pkey" on table "foo"
> CREATE TABLE
> rhaas=# alter table foo alter column b set data type varbit(4);
> DEBUG:  rewriting table "foo"
> DEBUG:  building index "foo_pkey" on table "foo"
> ALTER TABLE

When we rebuild the table at this point, it has a small maximum tuple size.
Therefore, we omit the toast table entirely.

> rhaas=# alter table foo alter column b set data type varbit;
> DEBUG:  building index "pg_toast_16430_index" on table "pg_toast_16430"
> ALTER TABLE

This command makes the tuple size unbounded again, so we create a toast
table.

> Strangely, it doesn't happen if I add another column to the table:
>
> rhaas=# set client_min_messages=debug1;
> SET
> rhaas=# create table foo (a serial primary key, b varbit, c varbit);

With the extra unconstrained varbit column, the tuple size is continuously
unbounded and the table has a toast table at all stages.  So, the
difference in behavior is correct, albeit unintuitive.
> NOTICE:  CREATE TABLE will create implicit sequence "foo_a_seq" for serial column "foo.a"
> DEBUG:  building index "pg_toast_16481_index" on table "pg_toast_16481"
> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo"
> DEBUG:  building index "foo_pkey" on table "foo"
> CREATE TABLE
> rhaas=# alter table foo alter column b set data type varbit(4);
> DEBUG:  building index "pg_toast_16490_index" on table "pg_toast_16490"
> DEBUG:  rewriting table "foo"
> DEBUG:  building index "foo_pkey" on table "foo"
> ALTER TABLE
> rhaas=# alter table foo alter column b set data type varbit;
> ALTER TABLE

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Tue, Feb 07, 2012 at 08:58:59PM +, Simon Riggs wrote:
> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote:
> > On Wed, Jan 11, 2012 at 10:12:31PM +, Simon Riggs wrote:
> > > v7
> >
> > This patch uses FPIs to guard against torn hint writes, even when the
> > checksums are disabled.  One could not simply condition them on the
> > page_checksums setting, though.  Suppose page_checksums=off and we're
> > hinting a page previously written with page_checksums=on.  If that
> > write tears, leaving the checksum intact, that block will now fail
> > verification.  A couple of ideas come to mind.  (a) Read the on-disk
> > page and emit an FPI only if the old page had a checksum.  (b) Add a
> > checksumEnableLSN field to pg_control.  Whenever the cluster starts
> > with checksums disabled, set the field to InvalidXLogRecPtr.  Whenever
> > the cluster starts with checksums enabled and the field is
> > InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
> > failure occurs in a page with LSN older than the stored one, emit
> > either a softer warning or no message at all.
>
> We can only change page_checksums at restart (now) so the above seems
> moot.  If we crash with FPWs enabled we repair any torn pages.

There's no live bug, but that comes at a high cost: the patch has us emit
full-page images for hint bit writes regardless of the page_checksums
setting.

> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be
> > holding is insufficient.
>
> Am serialising this by only writing PageLSN while holding buf hdr lock.

That means also taking the buffer header spinlock in every PageGetLSN()
caller holding only a shared buffer content lock.  Do you think that will
pay off, versus the settled pattern of trading here your shared buffer
content lock for an exclusive one?

> > I can see value in an option to exclude local buffers, since
> > corruption there may be less exciting.  It doesn't seem important for
> > an initial patch, though.
>
> I'm continuing to exclude local buffers. Let me know if that should
> change.

Seems reasonable.
Thanks, nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Wed, Feb 08, 2012 at 11:40:34AM +, Simon Riggs wrote:
> On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch n...@leadboat.com wrote:
> > On Tue, Feb 07, 2012 at 08:58:59PM +, Simon Riggs wrote:
> > > On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote:
> > > > This patch uses FPIs to guard against torn hint writes, even when
> > > > the checksums are disabled.  One could not simply condition them
> > > > on the page_checksums setting, though.  Suppose page_checksums=off
> > > > and we're hinting a page previously written with page_checksums=on.
> > > > If that write tears, leaving the checksum intact, that block will
> > > > now fail verification.  A couple of ideas come to mind.  (a) Read
> > > > the on-disk page and emit an FPI only if the old page had a
> > > > checksum.  (b) Add a checksumEnableLSN field to pg_control.
> > > > Whenever the cluster starts with checksums disabled, set the field
> > > > to InvalidXLogRecPtr.  Whenever the cluster starts with checksums
> > > > enabled and the field is InvalidXLogRecPtr, set the field to the
> > > > next LSN.  When a checksum failure occurs in a page with LSN older
> > > > than the stored one, emit either a softer warning or no message at
> > > > all.
> > >
> > > We can only change page_checksums at restart (now) so the above
> > > seems moot.  If we crash with FPWs enabled we repair any torn pages.
> >
> > There's no live bug, but that comes at a high cost: the patch has us
> > emit full-page images for hint bit writes regardless of the
> > page_checksums setting.
>
> Sorry, I don't understand what you mean.  I don't see any failure cases
> that require that.
>
> page_checksums can only change at a shutdown checkpoint.  The statement
> "If that write tears, leaving the checksum intact, that block will now
> fail verification." cannot happen, ISTM.
>
> If we write out a block we update the checksum if page_checksums is set,
> or we clear it.  If we experience a torn page at crash, the FPI corrects
> that, so the checksum never does fail verification.  We only need to
> write a FPI when we write with checksums.
>
> If that's wrong, please explain a failure case in detail.
In the following description, I will treat a heap page as having two facts.
The first is either CHKSUM or !CHKSUM, and the second is either HINT or
!HINT.  A torn page write lacking the protection of a full-page image can
update one or both facts despite the transaction having logically updated
both.  (This is just the normal torn-page scenario.)  Given that, consider
the following sequence of events:

 1. startup with page_checksums = on
 2. update page P with facts CHKSUM, !HINT
 3. clean write of P
 4. clean shutdown

 5. startup with page_checksums = off
 6. update P with facts !CHKSUM, HINT.  no WAL since page_checksums=off
 7. begin write of P
 8. crash.  torn write of P only updates HINT.  disk state now CHKSUM, HINT

 9. startup with page_checksums = off
10. crash recovery.  does not touch P, because step 6 generated no WAL
11. clean shutdown

12. startup with page_checksums = on
13. read P.  checksum failure

Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
because step 6 _does_ write WAL.  That ought to change for the sake of
page_checksums=off performance, at which point we must protect the above
scenario by other means.

> > > > PageSetLSN() is not atomic, so the shared buffer content lock
> > > > we'll be holding is insufficient.
> > >
> > > Am serialising this by only writing PageLSN while holding buf hdr
> > > lock.
> >
> > That means also taking the buffer header spinlock in every
> > PageGetLSN() caller holding only a shared buffer content lock.  Do you
> > think that will pay off, versus the settled pattern of trading here
> > your shared buffer content lock for an exclusive one?
>
> Yes, I think it will pay off.  This is the only code location where we
> do that, and we are already taking the buffer header lock, so there is
> effectively no overhead.

The sites I had in mind are the calls to PageGetLSN() in FlushBuffer() (via
BufferGetLSN()) and XLogCheckBuffer().  We don't take the buffer header
spinlock at either of those locations.
If they remain safe under the new rules, we'll need comments explaining
why.  I think they may indeed be safe, but it's far from obvious.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add protransform for numeric, varbit, and temporal types
On Wed, Feb 08, 2012 at 09:37:01AM -0500, Robert Haas wrote:
> On Tue, Feb 7, 2012 at 12:43 PM, Robert Haas robertmh...@gmail.com wrote:
> > I've committed the numeric and varbit patches and will look at the
> > temporal one next.
>
> Committed, after changing the OIDs so they don't conflict.
>
> Here's one more case for you to ponder:
>
> rhaas=# create table noah (i interval day);
> CREATE TABLE
> rhaas=# alter table noah alter column i set data type interval second(3);
> DEBUG:  rewriting table "noah"
> ALTER TABLE
>
> Do we really need a rewrite in that case?  The code acts like the
> interval range and precision are separate beasts, but is that really
> true?

The code has a thinko; a given interval typmod ultimately implies a single
point from which we truncate rightward.  The precision only matters if the
range covers SECOND.  Thanks; the attached patch improves this.

*** a/src/backend/utils/adt/timestamp.c
--- b/src/backend/utils/adt/timestamp.c
***************
*** 958,963 **** interval_transform(PG_FUNCTION_ARGS)
--- 958,964 ----
  		int			new_range = INTERVAL_RANGE(new_typmod);
  		int			new_precis = INTERVAL_PRECISION(new_typmod);
  		int			new_range_fls;
+ 		int			old_range_fls;
  
  		if (old_typmod == -1)
  		{
***************
*** 974,985 **** interval_transform(PG_FUNCTION_ARGS)
  		 * Temporally-smaller fields occupy higher positions in the range
  		 * bitmap.  Since only the temporally-smallest bit matters for length
  		 * coercion purposes, we compare the last-set bits in the ranges.
  		 */
  		new_range_fls = fls(new_range);
  		if (new_typmod == -1 ||
  			((new_range_fls >= SECOND ||
! 			  new_range_fls >= fls(old_range)) &&
! 			 (new_precis >= MAX_INTERVAL_PRECISION ||
  			  new_precis >= old_precis)))
  			ret = relabel_to_typmod(source, new_typmod);
  	}
--- 975,990 ----
  		 * Temporally-smaller fields occupy higher positions in the range
  		 * bitmap.  Since only the temporally-smallest bit matters for length
  		 * coercion purposes, we compare the last-set bits in the ranges.
+ 		 * Precision, which is to say, sub-second precision, only affects
+ 		 * ranges that include SECOND.
*/ new_range_fls = fls(new_range); + old_range_fls = fls(old_range); if (new_typmod == -1 || ((new_range_fls = SECOND || ! new_range_fls = old_range_fls) !(old_range_fls SECOND || ! new_precis = MAX_INTERVAL_PRECISION || new_precis = old_precis))) ret = relabel_to_typmod(source, new_typmod); } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
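[Editor's note] The decision the patched condition encodes — relabel instead of rewrite whenever the new typmod cannot lose data, with sub-second precision consulted only when the old range reaches SECOND — can be modeled in isolation. This sketch uses illustrative field indices and a hand-rolled fls(); the defines are assumptions for the example, not PostgreSQL's real datetime.h values.

```c
#include <assert.h>

/*
 * Toy model of the patched interval_transform() test.  Field indices
 * echo datetime.h's ordering (temporally smaller fields get higher
 * numbers) but the concrete values are assumptions for this sketch.
 */
#define DAY_F       3
#define SECOND_F    12
#define MAX_PRECIS  6
#define FULL_PRECIS 0xFFFF          /* "no precision specified" */

/* last-set (most significant) bit position, 1-based; 0 when v == 0 */
static int fls_(unsigned v)
{
    int n = 0;

    while (v)
    {
        n++;
        v >>= 1;
    }
    return n;
}

/*
 * May the typmod coercion be a mere relabel (no table rewrite)?
 * Mirrors the patched condition: precision only matters when the
 * old range covers SECOND.
 */
static int relabel_ok(unsigned old_range, int old_precis,
                      unsigned new_range, int new_precis)
{
    int second_fls = fls_(1u << SECOND_F);
    int new_fls = fls_(new_range);
    int old_fls = fls_(old_range);

    return (new_fls >= second_fls || new_fls >= old_fls) &&
           (old_fls < second_fls ||
            new_precis >= MAX_PRECIS ||
            new_precis >= old_precis);
}
```

With this model, the `interval day` to `interval second(3)` case from the report comes out as a relabel, while narrowing an existing sub-second precision still forces a rewrite.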
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote: Second, there's a concern about binary bloat: duplicating lots of code with different comparators inlined generates, well, a lot of machine code. Of course, an 0.8% increase in the size of the resulting binary is very unlikely to produce any measurable performance regression on its own, but that's not really the issue. The issue is rather that we could easily go nuts applying this technique in lots of different places - or even just in one place to lots of different types. Let's say that we find that even for types with very complex comparators, we can get a 5% speedup on quick-sorting speed using this technique. Should we, then, apply the technique to every type we have? Perhaps the binary would grow by, I don't know, 10%. Maybe that's still not enough to start hurting performance (or making packagers complain), but surely it's getting there, and what happens when we discover that similar speedups are possible using the same trick in five other scenarios? I think it's a bit like going out to dinner and spending $50. It's nicer than eating at home, and many people can afford to do it occasionally, but if you do it every night, it gets expensive (and possibly fattening). So we need some principled way of deciding how much inlining is reasonable, because I am 100% certain this is not going to be the last time someone discovers that a massive exercise in inlining can yield a nifty performance benefit in some case or another: index builds and COPY have already been mentioned on this thread, and I expect that people will find other cases as well. I'm not really sure what the budget is - i.e. how much binary bloat we can afford to add - or how many cases there are that can benefit, but the first isn't infinite and the second is more than the first. Having such a metric would resolve this discussion, but formulating it will be all but impossible. 
We don't know how much time PostgreSQL installations now spend sorting or how much additional time they would spend on cache misses, TLB misses, page faults, etc. That data won't show up on this thread. You posed[1] the straw man of a sin(numeric) optimization requiring a 40GB lookup table. I would not feel bad about rejecting that, because it can live quite comfortably as an external module. Indeed, I think the best thing we can do to constrain long-term bloat in the postgres executable is to improve pluggability. Let a wider range of features live outside the postmaster binary. For example, if we had the infrastructure to let hash, GIN, GiST and SP-GiST index access methods live in their own DSOs (like PL/pgSQL does today), I would support doing that. Extensibility is a hallmark of PostgreSQL. It's a bad day when we reject a minority-interest feature or optimization that has no other way to exist. Better pluggability can't ease the decision process for Peter's patch, because its specializations essentially must live in the postgres executable to live at all. Nominally, we could let external modules inject sort specializations for core types, and Peter could publish an all_fast_sorts extension. That would be useless punting: if we can't make a principled decision on whether these accelerated sorts justify 85 KiB of binary, how will a DBA who discovers the module make that decision? This patch has gotten more than its fair share of attention for bloat, and I think that's mostly because there's a dial-a-bloat-level knob sticking out and begging to be frobbed. On my system, fastpath_sort_2012_01_19.patch adds 85 KiB to the postgres binary. A recent commit added 57 KiB and bloat never came up in discussions thereof. I appreciate your concern over a slippery slope of inlining proposals, but I also don't wish to see the day when every patch needs to declare and justify its binary byte count. Unless, of course, we discover that elusive metric for evaluating it fairly. 
If we'd like to start taking interest in binary bloat, how about having the buildfarm log capture an ls -l on the bin directory? We'll then have data to mine if we ever wish to really take action. All that being said, I'd want to see a 15-30% (depending on how contrived) improvement to a microbenchmark or a 5% improvement to a generic benchmark (pgbench, DBT-N) before adopting an optimization of this complexity. Based on your measurements[2], per-type inlining gives us a 7% microbenchmark improvement. I'd therefore excise the per-type inlining. (For the record, given a 20% improvement to the same benchmark, I'd vote yea on the submission and perhaps even suggest int2 and uuid support.) nm [1] http://archives.postgresql.org/message-id/ca+tgmozo1xsz+yiqz2mrokmcmqtb+jir0lz43cne6de7--q...@mail.gmail.com [2] http://archives.postgresql.org/message-id/ca+tgmobfz8sqtgtaaryerywuzfodgetprbuygbhsplr4+li...@mail.gmail.com
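[Editor's note] The specialization technique this thread debates can be sketched with a toy: one sort loop stamped out per element type, comparator inlined at compile time. This is an illustration of where the binary-size cost comes from, not Peter's actual patch (which specializes quicksort inside tuplesort); each macro expansion duplicates the entire loop in machine code.

```c
#include <assert.h>
#include <stddef.h>

/* Comparator inlined into each specialization at compile time. */
#define NUM_CMP(x, y) ((x) < (y) ? -1 : (x) > (y))

/*
 * Stamp out a type-specialized insertion sort.  Every expansion is a
 * fresh copy of the whole loop; that duplication is the "bloat".
 */
#define MAKE_SORT(name, type)                           \
    static void name(type *a, size_t n)                 \
    {                                                   \
        for (size_t i = 1; i < n; i++)                  \
        {                                               \
            type v = a[i];                              \
            size_t j = i;                               \
                                                        \
            while (j > 0 && NUM_CMP(a[j - 1], v) > 0)   \
            {                                           \
                a[j] = a[j - 1];                        \
                j--;                                    \
            }                                           \
            a[j] = v;                                   \
        }                                               \
    }

MAKE_SORT(sort_int32, int)      /* one copy of the machine code ... */
MAKE_SORT(sort_float8, double)  /* ... and a second, near-identical copy */
```

The speedup comes from eliminating the indirect comparator call; the cost is one loop body per supported type, which is exactly the dial-a-bloat trade-off under discussion.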
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Thu, Feb 09, 2012 at 11:11:16PM +0200, Marti Raudsepp wrote: I've always been a little wary of using the TRUNCATE command due to the warning in the documentation about it not being MVCC-safe: queries may silently give wrong results and it's hard to tell when they are affected. That got me thinking, why can't we handle this like a standby server does -- if some query has data removed from underneath it, it aborts with a serialization failure. Does this solution sound like a good idea? The attached patch is a lame attempt at implementing this. I added a new pg_class.relvalidxmin attribute which tracks the Xid of the last TRUNCATE (maybe it should be called reltruncatexid?). Whenever starting a relation scan with a snapshot older than relvalidxmin, an error is thrown. This seems to work out well since TRUNCATE updates pg_class anyway, and anyone who sees the new relfilenode automatically knows when it was truncated. I like the design you have chosen. It would find applications beyond TRUNCATE, so your use of non-specific naming is sound. For example, older snapshots will see an empty table t after CREATE TABLE t AS SELECT 1 commits; that's a comparable MVCC anomaly. Some of our non-MVCC-safe commands should perhaps just become MVCC-safe, but there will always be use cases for operations that shortcut MVCC. When one truly does want that, your proposal for keeping behavior consistent makes plenty of sense. Should I also add another counter to pg_stat_database_conflicts? Currently this table is only used on standby servers. ERROR: canceling statement due to conflict with TRUNCATE TABLE on foo DETAIL: Rows visible to this transaction have been removed. My initial reaction is not to portray this like a recovery conflict, since several aspects distinguish it from all recovery conflict types. (I have not read your actual patch.) 
Thanks, nm
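[Editor's note] The relvalidxmin mechanism described above can be reduced to a small model: TRUNCATE stamps the relation with its xid, and any later scan whose snapshot predates that xid fails instead of silently seeing an emptied table. Plain ordered integers stand in for TransactionIds (no wraparound), and the exact comparison is an assumption for the sketch; only the names come from the patch description.

```c
#include <assert.h>

typedef unsigned long Xid;      /* stand-in for TransactionId */

typedef struct Rel
{
    Xid relvalidxmin;           /* xid of the last TRUNCATE; 0 = never */
} Rel;

/* TRUNCATE records the truncating transaction's xid (via pg_class). */
static void rel_truncate(Rel *rel, Xid truncating_xid)
{
    rel->relvalidxmin = truncating_xid;
}

/*
 * Starting a scan: a snapshot older than the last TRUNCATE must not
 * silently see the emptied table, so it gets a serialization failure.
 * Returns 1 if the scan may proceed, 0 for the error case.
 */
static int scan_allowed(const Rel *rel, Xid snapshot_xmin)
{
    return snapshot_xmin > rel->relvalidxmin;
}
```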
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote: On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch n...@leadboat.com wrote: I like the design you have chosen. It would find applications beyond TRUNCATE, so your use of non-specific naming is sound. For example, older snapshots will see an empty table t after CREATE TABLE t AS SELECT 1 commits; that's a comparable MVCC anomaly. Some of our non-MVCC-safe commands should perhaps just become MVCC-safe, but there will always be use cases for operations that shortcut MVCC. When one truly does want that, your proposal for keeping behavior consistent makes plenty of sense. I guess I'm not particularly excited by the idea of trying to make TRUNCATE MVCC-safe. I notice that the example involves the REPEATABLE READ isolation level, which is already known to be busted in a variety of ways; that's why we now have SERIALIZABLE, and why most people use READ COMMITTED. Are there examples of this behavior at other isolation levels? I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and not at READ COMMITTED. They tend to be narrow race conditions at READ COMMITTED, yet easy to demonstrate at REPEATABLE READ. Related: http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php Incidentally, people use READ COMMITTED because they don't question the default, not because they know hazards of REPEATABLE READ. I don't know the bustedness you speak of; could we improve the documentation to inform folks? But I have to admit I'm intrigued by the idea of extending this to other cases, if there's a reasonable way to do that. For example, if we could fix things up so that we don't see a table at all if it was created after we took our snapshot, that would remove one of the obstacles to marking pages bulk-loaded into that table with FrozenXID and PD_ALL_VISIBLE from the get-go. I think a lot of people would be mighty happy about that. Exactly. But the necessary semantics seem somewhat different. 
For TRUNCATE, you want to throw a serialization error; but is that also what you want for CREATE TABLE? Or do you just want it to appear empty? I think an error helps just as much there. If you create a table and populate it with data in the same transaction, letting some snapshot see an empty table is an atomicity failure. Your comment illustrates a helpful point: this is just another kind of ordinary serialization failure, one that can happen at any isolation level. No serial transaction sequence can yield one of these errors. Thanks, nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Mon, Feb 20, 2012 at 08:57:08AM -0500, Robert Haas wrote: On Mon, Feb 20, 2012 at 4:18 AM, Simon Riggs si...@2ndquadrant.com wrote: What straightforward implementation is that?? This *is* the straightforward one. God knows what else we'd break if we drop the lock, reacquire as an exclusive, then drop it again and reacquire in shared mode. Code tends to assume that when you take a lock you hold it until you release; doing otherwise would allow all manner of race conditions to emerge. And do you really think that would be faster? I don't know, but neither do you, because you apparently haven't tried it. Games where we drop the shared lock and get an exclusive lock are used in numerous places in the source code; see, for example, heap_update(), so I don't believe that there is any reason to reject that a priori. Indeed, I can think of at least one pretty good reason to do it exactly that way: it's the way we've handled all previous problems of this type, and in general it's better to make new code look like existing code than to invent something new, especially when you haven't made any effort to quantify the problem or the extent to which the proposed solution mitigates it. We do, in numerous places, drop a shared buffer content lock and reacquire it in exclusive mode. However, the existing users of that pattern tend to trade the lock, complete subsequent work, and unlock the buffer all within the same function. So they must, because several of them recheck facts that can change during the period of holding no lock. SetBufferCommitInfoNeedsSave() callers do not adhere to that pattern; they can be quite distant from the original lock acquisition and eventual lock release. Take the prototypical case of SetHintBits(). Our shared lock originated in a place like heapgettup(), which expects to hold it continually until finished. 
We would need to somehow push up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer lock, then have heapgettup() reorient itself as needed. Every caller of code that can reach SetBufferCommitInfoNeedsSave() would need to do the same. Perhaps there's a different way to apply the lock-trade technique that avoids this mess, but I'm not seeing it. Consequently, I find the idea of requiring a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE to be a sensible one. Within that umbrella, some details need attention:

- Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c. I note gistScanPage() and XLogCheckBuffer()[1]. (Perhaps we'll only require the spinlock for heap buffers, letting gistScanPage() off the hook.) We need a public API, perhaps LockBufferForLSN().

- The use of some spinlock need not imply using the buffer header spinlock. We could add a dedicated pd_lsn_tli_lock to BufferDesc. That has the usual trade-off of splitting a lock: less contention at the cost of more acquisitions. I have no intuition on which approach would perform better.

I agree that src/backend/storage/buffer/README is the place to document the new locking rules. I do share your general unease about adding new twists to the buffer access rules. Some of our hairiest code is also the code manipulating buffer locks most extensively, and I would not wish to see that code get even more difficult to understand. However, I'm not seeing a credible alternative that retains the same high-level behaviors. A possible compromise is to leave the page clean after setting a hint bit, much like the patch already has us do under hot standby. Then there's no new WAL and no new rules around pd_lsn. Wasn't that one of the things Merlin benchmarked when he was looking at hint bits? Does anyone recall the result? Thanks, nm

[1] Patch v10 adds a comment to XLogCheckBuffer() saying otherwise. 
The comment is true today, but the same patch makes it false by having XLogSaveBufferForHint() call XLogInsert() under a share lock. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
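[Editor's note] The dedicated pd_lsn_tli_lock option mentioned above can be sketched as follows. C11's atomic_flag stands in for PostgreSQL's s_lock, and the accessor names (in the spirit of a LockBufferForLSN() API) are hypothetical; the point is only that every pd_lsn access under a share content lock funnels through one tiny critical section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* stand-in for the real type */

/*
 * Model of a buffer header with a per-buffer spinlock guarding only
 * pd_lsn/pd_tli, separate from the buffer header spinlock.
 */
typedef struct BufDesc
{
    atomic_flag pd_lsn_tli_lock;
    XLogRecPtr  pd_lsn;
    uint32_t    pd_tli;
} BufDesc;

static void lsn_lock(BufDesc *buf)
{
    while (atomic_flag_test_and_set_explicit(&buf->pd_lsn_tli_lock,
                                             memory_order_acquire))
        ;                           /* spin */
}

static void lsn_unlock(BufDesc *buf)
{
    atomic_flag_clear_explicit(&buf->pd_lsn_tli_lock,
                               memory_order_release);
}

/* Readers holding only a share content lock would go through these. */
static XLogRecPtr buffer_get_lsn(BufDesc *buf)
{
    XLogRecPtr lsn;

    lsn_lock(buf);
    lsn = buf->pd_lsn;
    lsn_unlock(buf);
    return lsn;
}

static void buffer_set_lsn(BufDesc *buf, XLogRecPtr lsn)
{
    lsn_lock(buf);
    buf->pd_lsn = lsn;
    lsn_unlock(buf);
}
```

Splitting the lock this way trades fewer collisions with header-spinlock users for an extra acquisition on every LSN touch, which is exactly the contention-versus-acquisition-count trade-off noted above.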
Re: [HACKERS] Measuring relation free space
On Tue, Feb 14, 2012 at 02:04:26AM -0500, Jaime Casanova wrote: On Wed, Jan 25, 2012 at 9:47 PM, Noah Misch n...@leadboat.com wrote: With all that done, run some quick benchmarks: see how SELECT free_percent FROM pgstattuple(rel) fares compared to SELECT relation_free_space(rel) for a large heap and for a large B-tree index. If the timing difference is too small to be interesting to you, remove relation_free_space() and submit your pgstattuple() improvements alone. Otherwise, submit as written. Ok. I split this in three patches. 1) pgstattuple-gin_spgist.patch This first patch adds gin and spgist support to pgstattuple, also makes pgstattuple use a ring buffer when reading tables or indexes. The buffer access strategy usage bits look fine to commit. The gin and spgist support has problems, detailed below. 2) pgstattuple-relation_free_space.patch This patch adds the relation_free_space function to pgstattuple. The function relation_free_space() is faster than pgstattuple(); to test that I initialized pgbench with a scale of 40. In that context pgstattuple() took 1.4s to process the pgbench_accounts table and relation_free_space() took 730ms (half the time!). In the index the difference is less pronounced: 170ms for the former and 150ms for the latter. Benchmarks lasting on the order of one second are far too short. I tried the first two patches on this 6914 MiB table and 4284 MiB index:

create table t(n) as select * from generate_series(1,2);
create index on t(n);

This machine has about 1 GiB of memory available for disk cache, and I used shared_buffers = 128MB. I used a regular assert-enabled build with debug_assertions = off. Timings:

pgstattuple.free_percent, heap:  runtime 166.2s; answer 0.34
pgstattuple.free_percent, index: runtime 110.9s; answer 9.83
relation_free_space, heap:       runtime 165.1s; answer 0.00838721
relation_free_space, index:      runtime 108.7s; answer 2.23692

Note the disagreement in answers and the nonsensical answer from the last test. 
The numbers do line up for smaller tables and indexes that I tried. During the pgstattuple() runs on the heap, CPU usage divided evenly between user and iowait time. With relation_free_space(), iowait kept above 80%. For the index, pgstattuple() managed 60-80% iowait and relation_free_space() again kept above 80%. Even so, that did not produce any significant change in runtime. I'm guessing that readahead was highly effective here, so the I/O bound dictated elapsed time. Bottom line, this patch can probably achieve 50% speedups on already-in-memory relations. It can reduce the contribution to CPU load, but not the elapsed runtime, for relations we largely pull from disk. Do those benefits justify the additional user-visible interface? I suppose the sort of installation that would benefit most is one just short of the tipping point of the database size exceeding memory size. Larger databases could not afford either function, and smaller databases don't need to watch bloat so closely. Offhand, I think that the I/O savings of sampling will be the real win, and it's not worth an extra user-visible function to get the CPU usage savings this patch offers. Other opinions welcome. 3) pgstattuple-stats_target.patch This patch adds a stats_target parameter to the relation_free_space() function, it mimics the way analyze choose the blocks to read and is faster than plain relation_free_space() but of course could be inexact if the pages that we don't read are the ones with more free space This part is a fresh submission. It is simple enough that I have reviewed it. It gives the expected speedup. 
However, the results are wrong:

3 runs of pgstattuple('t', 5):  0.000171412, 0.000171876, 0.000169326
3 runs of pgstattuple('t', 10): 0.000336525, 0.000344404, 0.000341314

I can get an apparent infinite loop:

create table t0(n) as select * from generate_series(1,400);
create index on t0(n);

[local] test=# select relation_free_space('t0_n_idx', 100);
 relation_free_space
---------------------
           0.0982675
Time: 133.788 ms
[local] test=# select relation_free_space('t0_n_idx', 5);
Cancel request sent
ERROR: canceling statement due to user request
Time: 24655.481 ms

As a way forward, I suggest abandoning relation_free_space() entirely and adding a sampling capability to pgstattuple(). There are clear gains to be had from that method. The gains of splitting out the free percent calculation from the other pgstattuple() calculations seem meager. If you would like, submit the buffer strategy bits as a separate patch with its own CF entry, noting that it arose from here. That part can be Ready for Committer. I'm marking the original CF entry Returned with Feedback.

Patch 1:

*** a/contrib/pgstattuple/pgstatindex.c
--- b/contrib/pgstattuple/pgstatindex.c
+ /*
+  * Buffer access strategy for reading relations, it's simpler to keep
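[Editor's note] Block selection for a sampling pgstattuple() could work like this sketch: choose exactly `want` of `nblocks` block numbers, each with equal probability, in a single ordered pass (Knuth's Algorithm S, the same family ANALYZE's sampling builds on). This is illustrative code, not the submitted patch's.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Fill chosen[] with exactly `want` distinct, ascending block numbers
 * drawn uniformly from 0..nblocks-1.  Returns the count selected.
 */
static int sample_blocks(int nblocks, int want, int *chosen)
{
    int picked = 0;

    for (int blk = 0; blk < nblocks && picked < want; blk++)
    {
        int remaining = nblocks - blk;
        int needed = want - picked;

        /* keep this block with probability needed / remaining */
        if (rand() % remaining < needed)
            chosen[picked++] = blk;
    }
    return picked;
}
```

Because the acceptance probability rises to 1 as unread blocks run out, the pass always returns exactly `want` blocks, and each block is equally likely to be chosen — avoiding the bias concern of reading, say, only the first N pages.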
Re: [HACKERS] 16-bit page checksums for 9.2
On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote: On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs si...@2ndquadrant.com wrote: So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it doesn't have a checksum, PD_CHECKSUM2 is not set? What good does that do? As explained in detailed comments, the purpose of this is to implement Heikki's suggestion that we have a bit set to zero so we can detect failures that cause a run of 1s. I think it's nonsensical to pretend that there's anything special about that particular bit. If we want to validate the page header before trusting the lack of a checksum bit, we can do that far more thoroughly than just checking that one bit. There are a whole bunch of bits that ought to always be zero, and there are other things we can validate as well (e.g. LSN not in future). If we're concerned about the checksum-enabled bit getting flipped (and I agree that we should be), we can check some combination of that stuff in the hope of catching it, and that'll be a far better guard than just checking one arbitrarily selected bit. PageHeaderIsValid() (being renamed to PageIsVerified()) already checks (page->pd_flags & ~PD_VALID_FLAG_BITS) == 0. Explicitly naming another bit and keeping it unset is redundant with that existing check. It would cease to be redundant if we ever allocate all the flag bits, but then we also wouldn't have a bit to spare as PD_CHECKSUM2. I agree with you on this point. That having been said, I don't feel very good about the idea of relying on the contents of the page to tell us whether or not the page has a checksum. There's no guarantee that an error that flips the has-checksum bit will flip any other bit on the page, or that it won't flip everything else we're relying on as a backstop in exactly the manner that foils whatever algorithm we put in place. Random corruption is, perhaps, unlikely to do that, but somehow I feel like it'll happen more often than random chance suggests. 
Things never fail the way you want them to. Another disadvantage of the current scheme is that there's no particularly easy way to know that your whole cluster has checksums. No matter how we implement checksums, you'll have to rewrite every table in the cluster in order to get them fully turned on. But with the current design, there's no easy way to know how much of the cluster is actually checksummed. If you shut checksums off, they'll linger until those pages are rewritten, and there's no easy way to find the relations from which they need to be removed, either. I'm not seeing value in rewriting pages to remove checksums, as opposed to just ignoring those checksums going forward. Did you have a particular scenario in mind? I'm tempted to suggest a relation-level switch: when you want checksums, you use ALTER TABLE to turn them on, and when you don't want them any more you use ALTER TABLE to shut them off again, in each case rewriting the table. That way, there's never any ambiguity about what's in the data pages in a given relation: they're either all checksummed, or none of them are. This moves the decision about whether checksums are enabled or disabled quite a bit further away from the data itself, and also allows you to determine (by catalog inspection) which parts of the cluster do in fact have checksums. It might be kind of a pain to implement, though: you'd have to pass the information about how any given relation was configured down to the place where we validate page sanity. I'm not sure whether that's practical. This patch implies future opportunities to flesh out its UI, and I don't see it locking us out of implementing the above. We'll want a weak-lock command that adds checksums to pages lacking them. We'll want to expose whether a given relation has full checksum coverage. With that, we could produce an error when an apparently-non-checksummed page appears in a relation previously known to have full checksum coverage. 
Even supposing an ALTER TABLE t SET {WITH | WITHOUT} CHECKSUMS as the only tool for enabling or disabling them, it's helpful to have each page declare whether it has a checksum. That way, the rewrite can take as little as an AccessShareLock, and any failure need not lose work already done. If you rely on anticipating the presence of a checksum based on a pg_class column, that ALTER needs to perform an atomic rewrite. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
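[Editor's note] The existing validity test cited above — every undefined pd_flags bit must be zero — can be sketched to show why a dedicated always-zero bit adds nothing until the flag space fills up. The bit values here are illustrative assumptions, not bufpage.h's real layout; PD_CHECKSUM models the proposed has-checksum bit.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag layout (assumed values, not the real defines). */
#define PD_HAS_FREE_LINES   0x0001
#define PD_PAGE_FULL        0x0002
#define PD_ALL_VISIBLE      0x0004
#define PD_CHECKSUM         0x0008      /* hypothetical has-checksum bit */
#define PD_VALID_FLAG_BITS  0x000F      /* OR of all defined bits */

/* PageHeaderIsValid()-style test: any undefined set bit is corruption. */
static int flags_valid(uint16_t pd_flags)
{
    return (pd_flags & ~PD_VALID_FLAG_BITS) == 0;
}
```

A run of 1s — the failure mode the extra zero bit was meant to catch — already sets some undefined bit and fails this check, which is the redundancy argument in the message above.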
Re: [HACKERS] foreign key locks, 2nd attempt
On Mon, Feb 13, 2012 at 07:16:58PM -0300, Alvaro Herrera wrote: Okay, so this patch fixes the truncation and wraparound issues through a mechanism much like pg_clog's: it keeps track of the oldest possibly existing multis on each and every table, and then during tuple freezing those are removed. I also took the liberty to make the code remove multis altogether (i.e. resetting the IS_MULTI hint bit) when only the update remains and lockers are all gone. I also cleaned up the code in heapam so that there's a couple of tables mapping MultiXactStatus to LockTupleMode and back, and to heavyweight lock modes (the older patches used functions to do this, which was pretty ugly). I had to add a little helper function to lock.c to make this work. I made a rather large bunch of other minor changes to close minor bugs here and there. Docs have been added, as have new tests for the isolation harness, which I've ensured pass in both read committed and serializable modes; WAL logging for locking updated versions of a tuple, when an old one is locked due to an old snapshot, was also added; there's plenty of room for growth in the MultiXact flag bits; the bit that made tables with no keys lock the entire row all the time was removed; multiple places in code comments were cleaned up that referred to this feature as FOR KEY LOCK and ensured that it also mentions FOR KEY UPDATE; the pg_rowlocks, pageinspect, pg_controldata, pg_resetxlog utilities have been updated. All of the above sounds great. I especially like the growing test coverage. All in all, I think this is in pretty much final shape. Only pg_upgrade bits are still missing. If sharp eyes could give this a critical look and knuckle-cracking testers could give it a spin, that would be helpful. Lack of pg_upgrade support leaves this version incomplete, because that omission would constitute a blocker for beta 2. 
This version changes as much code compared to the version I reviewed at the beginning of the CommitFest as that version changed overall. In that light, it's time to close the books on this patch for the purpose of this CommitFest; I'm marking it Returned with Feedback. Thanks for your efforts thus far. On Mon, Jan 30, 2012 at 06:48:47PM -0500, Noah Misch wrote: On Tue, Jan 24, 2012 at 03:47:16PM -0300, Alvaro Herrera wrote: * Columns that are part of the key Noah thinks the set of columns should only consider those actually referenced by keys, not those that *could* be referenced. Well, do you disagree? To me it's low-hanging fruit, because it isolates the UPDATE-time overhead of this patch to FK-referenced tables rather than all tables having a PK or PK-like index (often just all tables). You have not answered my question above. nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] foreign key locks, 2nd attempt
On Thu, Feb 23, 2012 at 02:08:28PM +0100, Jeroen Vermeulen wrote: On 2012-02-23 10:18, Simon Riggs wrote: However, review of such a large patch should not be simply pass or fail. We should be looking back at the original problem and ask ourselves whether some subset of the patch could solve a useful subset of the problem. For me, that seems quite likely and this is very definitely an important patch. Even if we can't solve some part of the problem we can at least commit some useful parts of infrastructure to allow later work to happen more smoothly and quickly. So please let's not focus on the 100%, lets focus on 80/20. The suggested immutable-column constraint was meant as a potential 80/20 workaround. Definitely not a full solution, helpful to some, probably easier to do. I don't know if an immutable key would actually be enough to elide foreign-key locks though. That alone would not simplify the patch much. INSERT/UPDATE/DELETE on the foreign side would still need to take some kind of tuple lock on the primary side to prevent primary-side DELETE. You then still face the complicated case of a tuple that's both locked and updated (non-key/immutable columns only). Updates that change keys are relatively straightforward, following what we already do today. It's the non-key updates that complicate things. If you had both an immutable column constraint and a never-deleted table constraint, that combination would be sufficient to simplify the picture. (Directly or indirectly, it would not actually be a never-deleted constraint so much as a you must take AccessExclusiveLock to DELETE constraint.) Foreign-side DML would then take an AccessShareLock on the parent table with no tuple lock at all. By then, though, that change would share little or no code with the current patch. It may have its own value, but it's not a means for carving a subset from the current patch. 
Thanks, nm
Re: [HACKERS] foreign key locks, 2nd attempt
On Thu, Feb 23, 2012 at 06:36:42PM -0300, Alvaro Herrera wrote: Excerpts from Noah Misch's message of mié feb 22 14:00:07 -0300 2012: On Mon, Feb 13, 2012 at 07:16:58PM -0300, Alvaro Herrera wrote: On Mon, Jan 30, 2012 at 06:48:47PM -0500, Noah Misch wrote: On Tue, Jan 24, 2012 at 03:47:16PM -0300, Alvaro Herrera wrote: * Columns that are part of the key Noah thinks the set of columns should only consider those actually referenced by keys, not those that *could* be referenced. Well, do you disagree? To me it's low-hanging fruit, because it isolates the UPDATE-time overhead of this patch to FK-referenced tables rather than all tables having a PK or PK-like index (often just all tables). You have not answered my question above. Sorry. The reason I didn't research this is that at the very start of the discussion it was said that having heapam.c figure out whether columns are being used as FK destinations or not would be more of a modularity violation than indexed columns already are for HOT support (this was a contentious issue for HOT, so I don't take it lightly. I don't think I need any more reasons for Tom to object to this patch, or more bulk into it. Both are already serious issues.) That's fair. In any case, with the way we've defined FOR KEY SHARE locks (in the docs it's explicitly said that the set of columns considered could vary in the future), it's a relatively easy patch to add on top of what I've submitted. Just as the ALTER TABLE bits to add columns to the set of columns considered, it could be left for a second pass on the issue. Agreed. Let's have that debate another day, as a follow-on patch. Thanks for shedding this light. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] foreign key locks, 2nd attempt
On Thu, Feb 23, 2012 at 12:43:11PM -0300, Alvaro Herrera wrote: Excerpts from Simon Riggs's message of jue feb 23 06:18:57 -0300 2012: On Wed, Feb 22, 2012 at 5:00 PM, Noah Misch n...@leadboat.com wrote: All in all, I think this is in pretty much final shape. Only pg_upgrade bits are still missing. If sharp eyes could give this a critical look and knuckle-cracking testers could give it a spin, that would be helpful. Lack of pg_upgrade support leaves this version incomplete, because that omission would constitute a blocker for beta 2. This version changes as much code compared to the version I reviewed at the beginning of the CommitFest as that version changed overall. In that light, it's time to close the books on this patch for the purpose of this CommitFest; I'm marking it Returned with Feedback. Thanks for your efforts thus far. Now this is an interesting turn of events. I must thank you for your extensive review effort in the current version of the patch, and also thank you and credit you for the idea that initially kicked this patch from the older, smaller, simpler version I wrote during the 9.1 timeline (which you also reviewed exhaustively). Without your and Simon's brilliant ideas, this patch wouldn't exist at all. I completely understand that you don't want to review this latest version of the patch; it's a lot of effort and I wouldn't inflict it on anybody who hasn't volunteered. However, it doesn't seem to me that this is reason to boot the patch from the commitfest. I think the thing to do would be to remove yourself from the reviewers column and set it back to needs review, so that other reviewers can pick it up. It would indeed be wrong to change any patch from Needs Review to Returned with Feedback on account of a personal distaste for reviewing the patch. I hope I did not harbor such a motive here. 
Rather, this CommitFest has given your patch its fair shake, and I and other reviewers would better serve the CF's needs by reviewing, say, ECPG FETCH readahead instead of your latest submission. Likewise, you would better serve the CF by evaluating one of the four non-committer patches that have been Ready for Committer since January. That's not to imply that the goals of the CF align with my goals, your goals, or broader PGDG goals. The patch status on commitfest.postgresql.org exists solely for the advancement of the CF, and I have set it accordingly. As for the late code churn, it mostly happened as a result of your own feedback; I would have left most of it in the original state, but as I went ahead it seemed much better to refactor things. This is mostly in heapam.c. As for multixact.c, it also had a lot of churn, but that was mostly to restore it to the state it has in the master branch, dropping much of the code I had written to handle multixact truncation. The new code there and in the vacuum code path (relminmxid and so on) is a lot smaller than that other code was, and it's closely based on relfrozenxid, which is a known piece of technology. I appreciate that. However, review of such a large patch should not be simply pass or fail. We should be looking back at the original problem and asking ourselves whether some subset of the patch could solve a useful subset of the problem. For me, that seems quite likely, and this is very definitely an important patch. Incidentally, I find it harmful to think of Returned with Feedback as fail. For large patches, it's healthier to think of a CF as a bimonthly project status meeting with stakeholders. When the project is done, wonderful! When there's work left, that's no great surprise. Even if we can't solve some part of the problem, we can at least commit some useful parts of infrastructure to allow later work to happen more smoothly and quickly. So please let's not focus on the 100%; let's focus on the 80/20. 
Well, we have the patch I originally posted in the 9.1 timeframe. That's a lot smaller and simpler. However, that solves only part of the blocking problem, and in particular it doesn't fix the initial deadlock reports from Joel Jacobson at Glue Finance (now renamed Trustly, in case you wonder about his change of email address) that started this effort in the first place. I don't think we can cut down to that and still satisfy the users that requested this; and Glue was just the first one, because after I started blogging about this, some more people started asking for it. I don't think there's any useful middle point between that one and the current one, but maybe I'm wrong. Nothing additional comes to my mind, either. This patch is monolithic. Thanks, nm
Re: [HACKERS] [PATCH] Support for foreign keys with arrays
Hi Marco, This version fixes everything I noted in my last review. Apart from corner cases I note, the patch is technically sound. I pose a few policy-type questions; comments from a broader audience would help those. On Mon, Feb 06, 2012 at 07:04:42PM +0100, Marco Nenciarini wrote: Please find attached version 3 of our patch. We thoroughly followed your suggestions and were able to implement EACH foreign key constraints with multi-column syntax as well. [actual review based on v3b] * support for EACH foreign keys in multi-column foreign key table constraints - e.g. FOREIGN KEY (c1, EACH c2) REFERENCES t1 (u1, u2) You support, and have test cases for, constraints with multiple EACH columns. The documentation and comments do not explain their semantics. On reviewing the behavior and code, you have implemented it in terms of a Cartesian product. That's logical, but when is such a constraint actually useful? If someone can think of an application, great; let's document his example. Otherwise, let's reject such constraints at DDL time. As previously said, we preferred to keep this patch simple for 9.2 and forbid EACH CASCADE and EACH SET NULL on multi-column foreign keys. After all, majority of use cases is represented by EACH foreign key column constraints (single-column foreign key arrays), and more complicated use cases can be discussed for 9.3 - should this patch make it. :) Good call. We can use multi-dimensional arrays as well as referencing columns. In that case though, ON DELETE EACH CASCADE will behave like ON DELETE EACH SET NULL. This is a safe way of implementing the action. We have some ideas on how to implement this, but we feel it is better to limit the behaviour for now. This still feels awfully unprincipled to me. How about just throwing an error when we need to remove an element from a multidimensional array? Essentially, have ON DELETE EACH CASCADE downgrade to ON DELETE RESTRICT when it encounters a multidimensional array. 
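The suggested downgrade can be modeled in a few lines. This is a purely illustrative sketch (plain Python, with nested lists standing in for multidimensional arrays; the function name is invented for this email), not the patch's actual code:

```python
# Illustrative model: ON DELETE EACH CASCADE removes the referenced key's
# elements from a one-dimensional referencing array, but behaves like
# RESTRICT (raises an error) when it encounters a multidimensional array.
def each_cascade_remove(array, key):
    if any(isinstance(elem, list) for elem in array):
        raise ValueError("cannot cascade into a multidimensional array")
    return [elem for elem in array if elem != key]

assert each_cascade_remove([1, 2, 3], 2) == [1, 3]
```

Erroring here keeps the one-dimensional behavior intact while leaving the multidimensional case undefined for now.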
That's strictly less code than what you have now, and it keeps our options open. We can always change from error to set-null later, but we can't so readily change from set-null to anything else. As I pondered this patch again, I came upon formal hazards around non-default operator classes. Today, ATAddForeignKeyConstraint() always chooses pfeqop, ffeqop and ppeqop in the same operator family. If it neglected this, we would get wrong behavior when one of the operators is sensitive to a change to which another is insensitive. For EACH FK constraints, this patch sets ffeqop = ARRAY_EQ_OP unconditionally. array_eq() uses the TYPECACHE_EQ_OPR (usually from the default B-tree operator class). That operator may not exist at all, let alone share an operator family with pfeqop. Shall we deal with this by retrieving TYPECACHE_EQ_OPR in ATAddForeignKeyConstraint() and rejecting the constraint creation if it does not share an operator family with pfeqop? The same hazard affects ri_triggers.c use of array_remove() and array_replace(), and the same change mitigates it. Let me say again how much I like the test coverage from this patch. It's great to think of something worth verifying and find existing coverage for it in the submitted test cases.

*** a/doc/src/sgml/catalogs.sgml
--- b/doc/src/sgml/catalogs.sgml
***************
*** 2102,2107 ****
--- 2106,2118 ----
       <entry></entry>
       <entry>If a check constraint, a human-readable representation of the expression</entry>
      </row>
+
+     <row>
+      <entry><structfield>confisarray</structfield></entry>

Name is now confiseach. pg_constraint.confeach needs documentation.

+      <entry><type>bool</type></entry>
+      <entry></entry>
+      <entry>If a foreign key, true if an EACH REFERENCE foreign key</entry>
+     </row>
    </tbody>
   </tgroup>
  </table>

*** a/doc/src/sgml/ddl.sgml
--- b/doc/src/sgml/ddl.sgml
+    <para>
+     Another option you have with foreign keys is to use a
+     referencing column which is an array of elements with
+     the same type (or a compatible one) as the referenced
+     column in the related table. This feature, commonly known
+     as <firstterm>foreign key arrays</firstterm>, is implemented

Is it indeed commonly known by that name? My web searching did not turn up any independent use of the term. In any event, this should be the only place where we mention multiple names for the feature. Pick one preferred term and use it for all other mentions. The documentation, comments and messages currently have a mix of "each foreign key", "EACH foreign key" and "foreign key array".

+    <para>
+     Even though the most common use case for foreign key arrays
+     is on a single column key, you can define an <quote>each foreign
+     key constraint</quote> on a group of columns. As the following
+     contrived example shows, it needs to be written in table constraint form:
+ <programlisting>
+ CREATE TABLE t1 (
+     a integer PRIMARY KEY,
Re: [HACKERS] leakproof
On Wed, Feb 22, 2012 at 06:30:37PM -0500, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 02/22/2012 04:29 PM, Marc Munro wrote: As the developer of veil I feel marginally qualified to bikeshed here: how about silent? A silent function being one that will not blab. I also made this suggestion later in the day. SILENT isn't a bad idea. It seems to lead the mind in the right direction, or at least not encourage people to guess the wrong meaning. +1 for SILENT. I also liked Kevin's suggestion of DISCREET; it comes closer than SILENT to denoting the property in question, but it also carries a social undertone that feels out of place in a software function declaration.
Re: [HACKERS] psql COPY vs. ON_ERROR_ROLLBACK, multi-command strings
On Wed, Jan 25, 2012 at 06:25:46PM -0300, Alvaro Herrera wrote: Excerpts from Noah Misch's message of sáb ene 14 12:40:02 -0300 2012: It has bothered me that psql's \copy ignores the ON_ERROR_ROLLBACK setting. Only SendQuery() takes note of ON_ERROR_ROLLBACK, and \copy, like all backslash commands, does not route through SendQuery(). Looking into this turned up several other weaknesses in psql's handling of COPY. Interesting. Committed, thanks. Thanks. While testing a crashing function, I noticed that my above patch added some noise to psql output when the server crashes:

[local] test=# select crashme();
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
unexpected transaction status (4)
Time: 6.681 ms
! \q

Status 4 is PQTRANS_UNKNOWN, which is expected when the connection is not CONNECTION_OK. The double message arrives because ProcessResult() now calls CheckConnection() at least twice, for the benefit of COPY. (Incidentally, the reconnect fails because the server has not yet finished recovering; that part is nothing new.) The attached small patch has SendQuery() keep quiet about PQTRANS_UNKNOWN when the connection is down. It makes ProcessResult() skip the second CheckConnection() when the command string had no COPY results. This restores the pre-08146775acd8bfe0fcc509c71857abb928697171 psql output:

[local] test=# select crashme();
The connection to the server was lost. Attempting reset: Failed.
Time: 4.798 ms
! \q

Thanks, nm

diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 5c9bd96..715e231 100644
*** a/src/bin/psql/common.c
--- b/src/bin/psql/common.c
***************
*** 740,746 ****
  	} while (next_result);

  	/* may need this to recover from conn loss during COPY */
! 	if (!CheckConnection())
  		return false;

  	return success;
--- 740,746 ----
  	} while (next_result);

  	/* may need this to recover from conn loss during COPY */
! 	if (!first_cycle && !CheckConnection())
  		return false;

  	return success;
***************
*** 1015,1022 ****
  		case PQTRANS_UNKNOWN:
  		default:
  			OK = false;
! 			psql_error("unexpected transaction status (%d)\n",
! 					   transaction_status);
  			break;
  	}
--- 1015,1024 ----
  		case PQTRANS_UNKNOWN:
  		default:
  			OK = false;
! 			/* PQTRANS_UNKNOWN is expected given a broken connection. */
! 			if (transaction_status != PQTRANS_UNKNOWN || ConnectionUp())
! 				psql_error("unexpected transaction status (%d)\n",
! 						   transaction_status);
  			break;
  	}
Re: [HACKERS] foreign key locks, 2nd attempt
On Mon, Feb 27, 2012 at 02:13:32PM +0200, Heikki Linnakangas wrote: On 23.02.2012 18:01, Alvaro Herrera wrote: As far as complexity, yeah, it's a lot more complex now -- no question about that. How about assigning a new, real, transaction id, to represent the group of transaction ids. The new transaction id would be treated as a subtransaction of the updater, and the xids of the lockers would be stored in the multixact-members slru. That way the multixact structures wouldn't need to survive a crash; you don't care about the shared lockers after a crash, and the xid of the updater would be safely stored as is in the xmax field. That way you wouldn't need to handle multixact wraparound, because we already handle xid wraparound, and you wouldn't need to make multixact slrus crash-safe. Not sure what the performance implications would be. You would use up xids more quickly, which would require more frequent anti-wraparound vacuuming. And if we just start using real xids as the key to multixact-offsets slru, we would need to extend that a lot more often. But I feel it would probably be acceptable. When a key locker arrives after the updater and creates this implicit subtransaction of the updater, how might you arrange for the xid's clog status to eventually get updated in accordance with the updater's outcome? Thanks, nm
Re: [HACKERS] ECPG FETCH readahead
On Thu, Dec 29, 2011 at 10:46:23AM +0100, Boszormenyi Zoltan wrote: 2011-11-16 20:51 keltezéssel, Boszormenyi Zoltan írta: 2010-10-14 11:56 keltezéssel, Boszormenyi Zoltan írta: On Thu, Jun 24, 2010 at 8:19 AM, Michael Meskes mes...@postgresql.org wrote: On Thu, Jun 24, 2010 at 12:04:30PM +0300, Heikki Linnakangas wrote: Is there a reason not to enable it by default? I'm a bit worried that it will receive no testing if it's not always on. Yes, there is a reason, namely that you don't need it in native mode at all. ECPG can read as many records as you want in one go, something ESQL/C apparently cannot do. This patch is a workaround for that restriction. I still do not really see that this feature gives us an advantage in native mode though. We yet lack a consensus on whether native ECPG apps should have access to the feature. My 2c is to make it available, because it's useful syntactic sugar. If your program independently processes each row of an arbitrary-length result set, current facilities force you to add an extra outer loop to batch the FETCHes for every such code site. Applications could define macros to abstract that pattern, but this seems common enough to justify bespoke handling. Besides, minimalists already use libpq directly. I suggest enabling the feature by default but drastically reducing the default readahead chunk size from 256 to, say, 5. That still reduces the FETCH round trip overhead by 80%, but it's small enough not to attract pathological behavior on a workload where each row is a 10 MiB document. I would not offer an ecpg-time option to disable the feature per se. Instead, let the user set the default chunk size at ecpg time. A setting of 1 effectively disables the feature, though one could later re-enable it with ECPGFETCHSZ. The ASAP took a little long. 
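For the record, the round-trip arithmetic behind the chunk-size suggestion quoted above, as a quick sketch (this helper is mine, not part of the patch):

```python
def fetch_round_trips(nrows, chunk):
    """Round trips needed to drain an nrows-row result set,
    fetching `chunk` rows per FETCH (ceiling division)."""
    return -(-nrows // chunk)

assert fetch_round_trips(1000, 1) == 1000   # no readahead
assert fetch_round_trips(1000, 5) == 200    # 80% fewer round trips
assert fetch_round_trips(1000, 8) == 125    # 87.5% fewer
```

Most of the latency win comes from the first few rows of chunk size; beyond that, larger chunks mainly add memory pressure.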
The attached patch is in git diff format, because (1) the regression test intentionally doesn't do ECPGdebug() so the patch isn't dominated by a 2MB stderr file, so this file is empty and (2) regular diff cannot cope with empty new files. Avoid the empty file with fprintf(stderr, "Dummy non-empty error output\n"); - *NEW FEATURE* Readahead can be individually enabled or disabled by ECPG-side grammar: DECLARE curname [ [ NO ] READAHEAD ] CURSOR FOR ... Without [ NO ] READAHEAD, the default behaviour is used for cursors. Likewise, this may as well take a chunk size rather than a yes/no. The patch adds warnings: error.c: In function `ecpg_raise': error.c:281: warning: format not a string literal and no format arguments error.c:281: warning: format not a string literal and no format arguments The patch adds few comments and no larger comments explaining its higher-level ideas. That makes it much harder to review. In this regard it follows the tradition of the ECPG code, but let's depart from that tradition for new work. I mention a few cases below where the need for commentary is acute. I tested full reads of various SELECT * FROM generate_series(1, $1) commands over a 50ms link, and the patch gives a sound performance improvement. With no readahead, a mere N=100 takes 5s to drain. With readahead enabled, N=1 takes only 3s. Performance was quite similar to that of implementing my own batching with FETCH 256 FROM cur. When I kicked up ECPGFETCHSZ (after fixing its implementation -- see below) past the result set size, performance was comparable to that of simply passing the query through psql.

--- a/doc/src/sgml/ecpg.sgml
+++ b/doc/src/sgml/ecpg.sgml
@@ -5289,6 +5315,17 @@ while (1)
     </varlistentry>

     <varlistentry>
+     <term>-231 (<symbol>ECPG_INVALID_CURSOR</symbol>)</term>
+     <listitem>
+      <para>
+       The cursor you are trying to use with readahead has not been opened yet (SQLSTATE 34000),
+       invalid values were passed to libecpg (SQLSTATE 42P11) hor not in prerequisite state, i.e.

Typo. 
--- /dev/null
+++ b/src/interfaces/ecpg/ecpglib/cursor.c
@@ -0,0 +1,730 @@

cursor.c contains various over-78-column lines. pgindent has limited ability to improve those, so please split them.

+static struct cursor_descriptor *
+add_cursor(int lineno, struct connection *con, const char *name, bool scrollable, int64 n_tuples, bool *existing)
+{
+	struct cursor_descriptor *desc,
+			   *ptr, *prev = NULL;
+	bool		found = false;
+
+	if (!name || name[0] == '\0')
+	{
+		if (existing)
+			*existing = false;
+		return NULL;
+	}
+
+	ptr = con->cursor_desc;
+	while (ptr)
+	{
+		int		ret = strcasecmp(ptr->name, name);
+
+		if (ret == 0)
+		{
+			found = true;
+			break;
+		}
+		if (ret > 0)
+			break;
+
+		prev = ptr;
+		ptr = ptr->next;
+	}

Any reason not to use find_cursor() here?
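For reference, the quoted add_cursor() walks a name-sorted list and stops early once it passes the slot where the name would sort. A sketch of that lookup pattern in Python (my illustration, not the patch's code):

```python
def find_sorted(names, target):
    """Search a case-insensitively sorted list; stop early once past
    the would-be position of `target` (like the strcasecmp loop above)."""
    key = target.lower()
    for i, name in enumerate(names):
        current = name.lower()
        if current == key:
            return i          # found
        if current > key:
            break             # passed the insertion point: not present
    return -1

assert find_sorted(["alpha", "beta", "delta"], "BETA") == 1
assert find_sorted(["alpha", "beta", "delta"], "gamma") == -1
```

The early break is what makes keeping the list sorted pay off; a shared helper would let the insert and lookup paths use one copy of it.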
Re: [HACKERS] index-only quals vs. security_barrier views
On Thu, Feb 09, 2012 at 12:02:29PM -0500, Robert Haas wrote: When Heikki worked up his original index-only scan patches (which didn't end up looking much like what eventually got committed), he had the notion of an index-only qual. That is, given a query like this: select sum(1) from foo where substring(a,1,3) = 'abc'; We could evaluate the substring qual before performing the heap fetch, and fetch the tuple from the heap only if the qual passes. Now, there's a fly in the ointment here, which is that applying arbitrary user-defined functions to tuples that might not be visible doesn't sound very safe. The user-defined function in question might perform some action based on those invisible tuples that has side effects, which would be bad, because now we're violating MVCC semantics. Or it might throw an error, killing the scan dead on the basis of the contents of some tuple that the scan shouldn't even see. However, there is certainly a class of functions for which this type of optimization would be safe, and it's an awful lot like the set of functions that can be safely pushed down through a security_barrier view - namely, things that don't have side effects or throw errors. So it's possible that the proleakproof flag KaiGai is proposing to add to pg_proc could do double duty, serving also to identify when it's safe to apply a qual to an index tuple when the corresponding heap tuple might be invisible. However, I have some reservations about assuming that the two concepts are exactly the same. For one thing, people are inevitably going to want to cheat a little bit more here than is appropriate for security views, and encouraging people to mark things LEAKPROOF when they're really not is a security disaster waiting to happen. The similarity is indeed tempting, but I find the concepts sufficiently distinct to not make one device serve both. Adding to the reservations you offer, LEAKPROOF is superuser-only. This other marker would not entail any special privilege. 
For another thing, there are some important cases that this doesn't cover, like: select * from foo where substring(a,1,3) like '%def%'; The like operator doesn't seem to be leakproof in the security sense, because it can throw an error if the pattern is something like a single backslash (ERROR: LIKE pattern must not end with escape character) and indeed it doesn't seem like it would be safe here either if the pattern were stored in the table. But if the pattern were constant, it'd be OK, or almost OK: there's still the edge case where the table contains invisible rows but no visible ones - whether or not we complain about the pattern there ought to be the same as whether or not we complain about it on a completely empty table. If we got to that point, then we might as well consider the qual leakproof for security purposes under the same set of circumstances we'd consider it OK to apply to possibly-invisible tuples. This sort of thing implicates substring(), too, when you call it as substring(a, 1, b); b < 0 produces an error. To handle these, I think we'd need a facility along the lines of protransform. Have a function inspecting call nodes for a particular other function and determining whether each is ok-for-index-only-quals. You could even force protransform itself into that role. Create an additional pg_proc entry identical to the ordinary substring() but for a different name and having the ok-for-index-only-quals flag. Add a protransform to the main pg_proc entry that inspects the argument nodes and, when they're safe, replaces the call with a call to that errorfree_substring() at plan time.
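To make the protransform idea concrete, here is a toy sketch of the planner-side check. The node shapes and the errorfree_substring name are hypothetical, invented for illustration, not PostgreSQL's actual node representation:

```python
def transform_substring(call):
    """Swap in an assumed error-free variant only when the length
    argument is a constant that cannot raise at runtime."""
    length = call["args"][2]
    if length.get("kind") == "const" and length["value"] >= 0:
        return dict(call, func="errorfree_substring")
    # Variable or negative length: may raise, keep the ordinary call.
    return call

const_call = {"func": "substring",
              "args": [{"kind": "var", "name": "a"},
                       {"kind": "const", "value": 1},
                       {"kind": "const", "value": 3}]}
assert transform_substring(const_call)["func"] == "errorfree_substring"
```

A call with a variable third argument would pass through unchanged, so only provably error-free calls become candidates for index-only quals.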
Re: [HACKERS] COPY with hints, rebirth
On Fri, Mar 02, 2012 at 08:46:45AM +, Simon Riggs wrote: On Thu, Mar 1, 2012 at 8:49 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: It's still broken: [BEGIN;TRUNCATE;SAVEPOINT;COPY;ROLLBACK TO] So this approach isn't the one... The COPY FREEZE patch provides a way for the user to say explicitly that they don't really care about these MVCC corner cases and as a result allows us to avoid touching XidInMVCCSnapshot() at all. So there is still a patch on the table. You can salvage the optimization by tightening its prerequisite: use it when the current subtransaction or a child thereof created or truncated the table. A parent subtransaction having done so is acceptable for the WAL avoidance optimization but not for this. A COPY FREEZE ignoring that stronger restriction would be an interesting feature in its own right, but it brings up other problems. For example, suppose you write a tuple, then fail while writing its index entries. The tuple is already frozen and visible, but it lacks a full set of live index entries. The subtransaction aborts, but the top transaction commits and makes the situation permanent. Incidentally, I contend that we should write frozen tuples to new/truncated tables unconditionally. The current behavior of making old snapshots see the table as empty violates atomicity at least as badly as letting those snapshots see the future-snapshot contents. But Marti has a sound proposal that would interact with your efforts here to avoid violating atomicity at all: http://archives.postgresql.org/message-id/cabrt9rbrmdsoz8kxgehfb4lg-ev9u67-6dlqvoiibpkkhtl...@mail.gmail.com Thanks, nm
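A purely illustrative model of the tightened prerequisite (names invented here; subtransaction state is reduced to a stack of ids, and a committed child's work is treated as folded into its parent):

```python
def wal_skip_ok(creating_subxact, stack):
    # WAL-avoidance rule: the table was created or truncated by the
    # current subxact or any parent still on the stack.
    return creating_subxact in stack

def freeze_ok(creating_subxact, stack):
    # Tightened rule for writing frozen tuples: only the current
    # subxact (including committed children folded into it) qualifies.
    return creating_subxact == stack[-1]

stack = [1, 4, 9]   # top-level xact 1, subxact 4, current subxact 9
assert wal_skip_ok(4, stack) and not freeze_ok(4, stack)
assert wal_skip_ok(9, stack) and freeze_ok(9, stack)
```

The gap between the two predicates is exactly the BEGIN;TRUNCATE;SAVEPOINT;COPY;ROLLBACK TO case: the truncating subxact is a parent of the copying one, so WAL avoidance is fine but freezing is not.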
Re: [HACKERS] ECPG FETCH readahead
On Sun, Mar 04, 2012 at 05:16:06PM +0100, Michael Meskes wrote: On Fri, Mar 02, 2012 at 11:41:05AM -0500, Noah Misch wrote: I suggest enabling the feature by default but drastically reducing the default readahead chunk size from 256 to, say, 5. That still reduces the FETCH round trip overhead by 80%, but it's small enough not to attract pathological behavior on a workload where each row is a 10 MiB document. I would not offer an ecpg-time option to disable the feature per se. Instead, let the user set the default chunk size at ecpg time. A setting of 1 effectively disables the feature, though one could later re-enable it with ECPGFETCHSZ. Using 1 to effectively disable the feature is fine with me, but I strongly object to any default enabling this feature. It's fairly easy to create cases with pathological behaviour, and this feature is not standard by any means. I figure a normal programmer would expect only one row being transferred when fetching one. On further reflection, I agree with you here. The prospect for queries that call volatile functions changed my mind; they would exhibit different functional behavior under readahead. We mustn't silently give affected programs different semantics. Thanks, nm
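The volatile-function concern is easy to see with a toy model (plain Python; the counter stands in for a volatile function the query evaluates once per row):

```python
import itertools

calls = 0

def volatile():
    global calls
    calls += 1
    return calls

def result_rows():
    while True:
        yield volatile()   # one volatile call per row produced

# Without readahead, FETCH 1 produces one row: exactly one volatile call.
next(result_rows())
assert calls == 1

# With a readahead chunk of 5, the first FETCH drains five rows at once:
# five volatile calls happen even though the program consumed one row.
calls = 0
buf = list(itertools.islice(result_rows(), 5))
assert calls == 5
assert buf == [1, 2, 3, 4, 5]
```

A program that, say, aborts after the first row would observe one side effect without readahead and five with it, which is why silent default enablement is off the table.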
Re: [HACKERS] ECPG FETCH readahead
On Sun, Mar 04, 2012 at 04:33:32PM +0100, Boszormenyi Zoltan wrote: 2012-03-02 17:41 keltezéssel, Noah Misch írta: On Thu, Dec 29, 2011 at 10:46:23AM +0100, Boszormenyi Zoltan wrote: I suggest enabling the feature by default but drastically reducing the default readahead chunk size from 256 to, say, 5. That still reduces the FETCH round trip overhead by 80%, but it's small enough not to attract pathological behavior on a workload where each row is a 10 MiB document. I see. How about 8? Nice round power of 2 value, still small and avoids 87.5% of overhead. Having pondered the matter further, I now agree with Michael that the feature should stay disabled by default. See my response to him for rationale. Assuming that conclusion holds, we can recommend a higher value to users who enable the feature at all. Your former proposal of 256 seems fine. BTW, the default disabled behaviour was to avoid make check breakage, see below. I would not offer an ecpg-time option to disable the feature per se. Instead, let the user set the default chunk size at ecpg time. A setting of 1 effectively disables the feature, though one could later re-enable it with ECPGFETCHSZ. This means all code previously going through ECPGdo() would go through ECPGopen()/ECPGfetch()/ECPGclose(). This is more intrusive and all regression tests that were only testing certain features would also test the readahead feature, too. It's a good sort of intrusiveness, reducing the likelihood of introducing bugs basically unrelated to readahead that happen to afflict only ECPGdo() or only the cursor.c interfaces. Let's indeed not have any preexisting test cases use readahead per se, but having them use the cursor.c interfaces anyway will build confidence in the new code. The churn in expected debug output isn't ideal, but I don't prefer the alternative of segmenting the implementation for the sake of the test cases. 
Also, the test for WHERE CURRENT OF at ecpg time would have to be done at runtime, possibly making previously working code fail if ECPGFETCHSZ is enabled. Good point. How about still allowing NO READAHEAD cursors that compile into plain ECPGdo()? This way, ECPGFETCHSZ doesn't interfere with WHERE CURRENT OF. But this would mean code changes everywhere WHERE CURRENT OF is used. ECPGFETCHSZ should only affect cursors that make no explicit mention of READAHEAD. I'm not sure whether that should mean actually routing READAHEAD 1 cursors through ECPGdo() or simply making sure that cursor.c achieves the same outcome; see later for a possible reason to still do the latter. Or how about a new feature in the backend, so ECPG can do UPDATE/DELETE ... WHERE OFFSET N OF cursor, with the offset computed from the actual cursor position and the position known by the application? This way an app can do readahead and do work on rows collected by the cursor with WHERE CURRENT OF, which gets converted to WHERE OFFSET OF behind the scenes. That's a neat idea, but I would expect obstacles threatening our ability to use it automatically for readahead. You would have to make the cursor a SCROLL cursor. We'll often pass a negative offset, making the operation fail if the cursor query used FOR UPDATE. Volatile functions in the query will get more calls. That's assuming the operation will map internally to something like MOVE N; UPDATE ... WHERE CURRENT OF; MOVE -N. You might come up with innovations to mitigate those obstacles, but those innovations would probably also apply to MOVE/FETCH. In any event, this would constitute a substantive patch in its own right. One way out of trouble here is to make WHERE CURRENT OF imply READAHEAD 1/NO READAHEAD (incidentally, perhaps those two should be synonyms) on the affected cursor. If the cursor has some other readahead quantity declared explicitly, throw an error during preprocessing. 
Failing a reasonable resolution, I'm prepared to withdraw my suggestion of making ECPGFETCHSZ always-usable. It's nice to have, not critical.

+bool
+ECPGopen(const int lineno, const int compat, const int force_indicator,
+		 const char *connection_name, const bool questionmarks,
+		 const char *curname, const int st, const char *query, ...)
+{
+	va_list		args;
+	bool		ret, scrollable;
+	char	   *new_query, *ptr, *whold, *noscroll, *scroll, *dollar0;
+	struct sqlca_t *sqlca = ECPGget_sqlca();
+
+	if (!query)
+	{
+		ecpg_raise(lineno, ECPG_EMPTY, ECPG_SQLSTATE_ECPG_INTERNAL_ERROR, NULL);
+		return false;
+	}
+	ptr = strstr(query, " for ");
+	if (!ptr)
+	{
+		ecpg_raise(lineno, ECPG_INVALID_STMT, ECPG_SQLSTATE_ECPG_INTERNAL_ERROR, NULL);
+		return false;
+	}
+	whold = strstr(query, " with hold ");
+	dollar0 = strstr(query, "$0");
+
+	noscroll = strstr(query, " no scroll ");
+	scroll = strstr(query, " scroll
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Mon, Feb 13, 2012 at 09:29:56AM -0500, Robert Haas wrote: On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch n...@leadboat.com wrote: I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and not at READ COMMITTED. They tend to be narrow race conditions at READ COMMITTED, yet easy to demonstrate at REPEATABLE READ. Related: http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php Yeah. Well, that's actually an interesting example, because it illustrates how general this problem is. We could potentially get ourselves into a situation where just about every system catalog table needs an xmin field to store the point at which the object came into existence - or for that matter, was updated. I can see this strategy applying to many relation-pertinent system catalogs. Do you foresee applications to non-relation catalogs? In any event, I think a pg_class.relvalidxmin is the right starting point. One might imagine a family of relvalidxmin, convalidxmin, indcheckxmin (already exists), inhvalidxmin, and attvalidxmin. relvalidxmin is like the AccessExclusiveLock of that family; it necessarily blocks everything that might impugn the others. The value in extending this to more catalogs is the ability to narrow the impact of failing the check. A failed indcheckxmin comparison merely excludes plans involving the index. A failed inhvalidxmin check might just skip recursion to the table in question. Those are further refinements, much like using weaker heavyweight lock types. But it's not quite the same as the xmin of the row itself, because some updates might be judged not to matter. There could also be intermediate cases where updates are invalidating for some purposes but not others. I think we'd better get our hands around more of the problem space before we start trying to engineer solutions. I'm not seeing that problem. 
Any operation that would update some xmin horizon should set it to the greater of its current value and the value the operation needs for its own correctness. If you have something in mind that needs more, could you elaborate? Incidentally, people use READ COMMITTED because they don't question the default, not because they know the hazards of REPEATABLE READ. I don't know the bustedness you speak of; could we improve the documentation to inform folks? The example that I remember was related to SELECT FOR UPDATE/SELECT FOR SHARE. The idea of those statements is that you want to prevent the row from being updated or deleted until some other concurrent action is complete; for example, in the case of a foreign key, we'd like to prevent the referenced row from being deleted or updated in the relevant columns until the inserting transaction is committed. But it doesn't work, because when the updating or deleting process gets done with the lock wait, they are still using the same snapshot as before, and merrily do exactly the thing that the lock-wait was supposed to prevent. If an actual UPDATE is used, it's safe (I think): anyone who was going to UPDATE or DELETE the row will fail with some kind of serialization error. But a SELECT FOR UPDATE that commits is treated more like an UPDATE that rolls back: it's as if the lock never existed. Someone (Florian?) proposed a patch to change this, but it seemed problematic for reasons I no longer exactly remember. Thanks. I vaguely remember that thread. nm
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Sun, Mar 04, 2012 at 01:02:57PM +, Simon Riggs wrote: More detailed thoughts show that the test in heap_beginscan_internal() is not right enough, i.e. wrong. We need a specific XidInMVCCSnapshot test on the relvalidxid, so it needs to be a specific xid, not an xmin, because otherwise we can get concurrent transactions failing, not just older transactions.

Good point; I agree. indcheckxmin's level of pessimism isn't appropriate for this new check.

If we're going to freeze tuples on load this needs to be watertight, so some minor rework needed. Of course, if we only have a valid xid on the class we might get new columns added when we do repeated SELECT * statements using the same snapshot while concurrent DDL occurs. That is impractical, so if we define this as applying to rows it can work; if we want it to apply to everything it's getting more difficult.

Excess columns seem less grave to me than excess or missing rows. I'm having difficulty thinking up an explanation for that opinion.
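As a sketch of the check being discussed (all names here are assumptions drawn from the thread, not committed PostgreSQL code: relvalidxid is the proposed pg_class field, and XidInMVCCSnapshot is currently a private routine in tqual.c that would need exporting):

```c
/*
 * Sketch only: reject the scan when the relation's last TRUNCATE (or its
 * creation) is invisible to, yet concurrent with, our MVCC snapshot.
 * "relvalidxid" is the field proposed in this thread, not an existing one.
 */
static void
TestForConcurrentTruncate(Relation rel, Snapshot snapshot)
{
    TransactionId relvalidxid = rel->rd_rel->relvalidxid;   /* hypothetical */

    if (TransactionIdIsValid(relvalidxid) &&
        IsMVCCSnapshot(snapshot) &&
        XidInMVCCSnapshot(relvalidxid, snapshot))   /* assumes it is exported */
        ereport(ERROR,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("canceling statement due to concurrent TRUNCATE")));
}
```

Storing a specific xid rather than an xmin horizon is what lets XidInMVCCSnapshot distinguish a concurrent truncating transaction from a merely older one.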
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Mon, Mar 05, 2012 at 03:46:16PM -0500, Robert Haas wrote: On Mon, Mar 5, 2012 at 2:22 PM, Noah Misch n...@leadboat.com wrote: I can see this strategy applying to many relation-pertinent system catalogs. Do you foresee applications to non-relation catalogs?

Well, in theory, we have similar issues if, say, a query uses a function that didn't exist at the time the snapshot was taken; the actual results the user sees may not be consistent with any serial execution schedule. And the same could be true for any other SQL object. It's unclear that those cases are as compelling as this one, but then again it's unclear that no one will ever want to fix them, either. For example, suppose we have a view v over a table t that calls a function f. Somebody alters f to give different results and, in the same transaction, modifies the contents of t (but no DDL). This doesn't strike me as a terribly unlikely scenario; the change to t could well be envisioned as a compensating transaction. But now if somebody uses the new definition of f against the old contents of t, the user may fail to get what they were hoping for out of bundling those changes together in one transaction.

Good example.

Now, maybe we're never going to fix those kinds of anomalies anyway, but if we go with this architecture, then I think the chances of it ever being palatable to try are pretty low.

Why?

But it's not quite the same as the xmin of the row itself, because some updates might be judged not to matter. There could also be intermediate cases where updates are invalidating for some purposes but not others. I think we'd better get our hands around more of the problem space before we start trying to engineer solutions.

I'm not seeing that problem. Any operation that would update some xmin horizon should set it to the greater of its current value and the value the operation needs for its own correctness. If you have something in mind that needs more, could you elaborate?

Simon's point about xmin vs.
xid probably leads to an example. One value is fine for TRUNCATE, because only the most recent TRUNCATE matters. Not all DDL is so simple.

Well, consider something like CLUSTER. It's perfectly OK for CLUSTER to operate on a table that has been truncated since CLUSTER's snapshot was taken, and no serialization anomaly is created that would not have already existed as a result of the non-MVCC-safe TRUNCATE. On the other hand, if CLUSTER operates on a table that was created since CLUSTER's snapshot was taken, then you have a bona fide serialization anomaly.

Core CLUSTER does not use any MVCC snapshot. We do push one for the benefit of functions called during the reindex phase, but it does not appear that you speak of that snapshot. Could you elaborate this example?

Maybe not a very important one, but does that prove that there's no significant problem of this type in general, or just nobody's thought through all the cases yet? After all, the issues with CREATE TABLE/TRUNCATE vs. a concurrent SELECT have been around for a very long time, and we're only just getting around to looking at them, so I don't have much confidence that there aren't other cases floating around out there.

Granted.

Thanks, nm
Re: [HACKERS] ECPG FETCH readahead
On Tue, Mar 06, 2012 at 07:07:41AM +0100, Boszormenyi Zoltan wrote: On 2012-03-05 19:56, Noah Misch wrote: Or how about a new feature in the backend, so ECPG can do UPDATE/DELETE ... WHERE OFFSET N OF cursor, with the offset computed from the actual cursor position and the position known by the application? This way an app can do readahead and do work on rows collected by the cursor with WHERE CURRENT OF, which gets converted to WHERE OFFSET OF behind the scenes.

That's a neat idea, but I would expect obstacles threatening our ability to use it automatically for readahead. You would have to make the cursor a SCROLL cursor. We'll often pass a negative offset, making the operation fail if the cursor query used FOR UPDATE. Volatile functions in the query will get more calls. That's assuming the operation will map internally to something like MOVE N; UPDATE ... WHERE CURRENT OF; MOVE -N. You might come up with innovations to mitigate those obstacles, but those innovations would probably also apply to MOVE/FETCH. In any event, this would constitute a substantive patch in its own right.

I was thinking along the lines of a Portal keeping the ItemPointerData for each tuple in the last FETCH statement. The WHERE OFFSET N OF cursor would treat the offset value relative to the tuple order returned by FETCH. So, OFFSET 0 OF == CURRENT OF and other values of N are negative. This way, it doesn't matter if the cursor is SCROLL, NO SCROLL or has the default behaviour with SCROLL in some cases. Then ECPGopen() doesn't have to play games with the DECLARE statement. Only ECPGfetch() needs to play with MOVE statements, passing different offsets to the backend, not what the application passed.

That broad approach sounds promising. The main other consideration that comes to mind is a plan to limit resource usage for a cursor that reads, say, 1B rows.
However, I think attempting to implement this now will significantly decrease the chance of getting the core patch features committed now. One way out of trouble here is to make WHERE CURRENT OF imply READAHEAD 1/READAHEAD 0 (incidentally, perhaps those two should be synonyms) on the affected cursor. If the cursor has some other readahead quantity declared explicitly, throw an error during preprocessing.

I played with this idea a while ago, from a different point of view. If the ECPG code had the DECLARE mycur, DML ... WHERE CURRENT OF mycur and OPEN mycur in exactly this order, i.e. WHERE CURRENT OF appears in a standalone function between DECLARE and the first OPEN for the cursor, then ECPG disabled readahead automatically for that cursor and for that cursor only. But this requires effort from the user of ECPG and can be very fragile. Code cleanup with reordering functions can break previously working code.

Don't the same challenges apply to accurately reporting an error when the user specifies WHERE CURRENT OF for a readahead cursor?
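The readahead bookkeeping under discussion can be modeled in a few lines (a toy sketch, not ECPG code; class and field names are mine): the library issues one large backend FETCH per cache miss and serves the application's row-at-a-time fetches from the cached window, tracking where the backend cursor sits relative to the application's notion of position.

```python
class ReadaheadCursor:
    """Toy model: cache `window` rows per backend FETCH; serve
    application-level fetches from the cache when possible."""

    def __init__(self, rows, window=4):
        self.rows, self.window = rows, window
        self.cache = []          # rows from the last backend FETCH
        self.base = 0            # index of cache[0] within the result set
        self.pos = -1            # application cursor position

    def fetch(self):
        self.pos += 1
        if not (self.base <= self.pos < self.base + len(self.cache)):
            # cache miss: one backend FETCH refills the window
            self.base = self.pos
            self.cache = self.rows[self.base:self.base + self.window]
        return self.cache[self.pos - self.base]

cur = ReadaheadCursor(list(range(10)), window=4)
print([cur.fetch() for _ in range(6)])   # -> [0, 1, 2, 3, 4, 5]
```

Six application fetches cost only two backend round trips here; the gap between `pos` and the backend cursor position is exactly what a WHERE OFFSET N OF translation, or a corrective MOVE, would have to account for.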
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Tue, Mar 06, 2012 at 08:36:05AM -0500, Robert Haas wrote: On Tue, Mar 6, 2012 at 5:43 AM, Noah Misch n...@leadboat.com wrote: Well, consider something like CLUSTER. It's perfectly OK for CLUSTER to operate on a table that has been truncated since CLUSTER's snapshot was taken, and no serialization anomaly is created that would not have already existed as a result of the non-MVCC-safe TRUNCATE. On the other hand, if CLUSTER operates on a table that was created since CLUSTER's snapshot was taken, then you have a bona fide serialization anomaly.

Core CLUSTER does not use any MVCC snapshot. We do push one for the benefit of functions called during the reindex phase, but it does not appear that you speak of that snapshot. Could you elaborate this example?

Imagine this:
- Transaction #1 acquires a snapshot.
- Transaction #2 creates table A, inserts a row into table B, and then commits.
- Transaction #1 tries to CLUSTER A and then select from B.

The only serial execution schedules are T1 T2, in which case the transaction fails, or T2 T1, in which case the row is seen. But what actually happens is that the row isn't seen and yet the transaction doesn't fail.

For the purpose of contemplating this anomaly, one could just as well replace CLUSTER with GRANT, COMMENT ON TABLE, or any other command that operates on a table, correct? I agree this test case is good to keep in mind while designing, but we could well conclude not to bother improving it.
Re: [HACKERS] psql COPY vs. ON_ERROR_ROLLBACK, multi-command strings
On Wed, Mar 07, 2012 at 04:57:12PM -0500, Robert Haas wrote: On Sat, Feb 25, 2012 at 12:57 AM, Noah Misch n...@leadboat.com wrote: Thanks. While testing a crashing function, I noticed that my above patch added some noise to psql output when the server crashes:

[local] test=# select crashme();
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
unexpected transaction status (4)
Time: 6.681 ms
! \q

Status 4 is PQTRANS_UNKNOWN, which is expected when the connection is not CONNECTION_OK. The double message arrives because ProcessResult() now calls CheckConnection() at least twice, for the benefit of COPY. (Incidentally, the reconnect fails because the server has not yet finished recovering; that part is nothing new.) The attached small patch has SendQuery() keep quiet about PQTRANS_UNKNOWN when the connection is down. It makes ProcessResult() skip the second CheckConnection() when the command string had no COPY results. This restores the pre-08146775acd8bfe0fcc509c71857abb928697171 psql output:

[local] test=# select crashme();
The connection to the server was lost. Attempting reset: Failed.
Time: 4.798 ms
! \q

Committed, but... why do we EVER need to call CheckConnection() at the bottom of ProcessResult(), after exiting that function's main loop? I don't see any way to get out of that loop without first calling AcceptResult on every PGresult we've seen, and that function already calls CheckConnection() if any failure is observed.

You can reproduce a case where it helps by sending SIGKILL to a backend serving a long-running COPY TO, such as this:

copy (select n, pg_sleep(case when n > 1000 then 1 else 0 end) from generate_series(1,1030) t(n)) to stdout;

The connection dies during handleCopyOut(). By the time control returns to ProcessResult(), there's no remaining PGresult to check. Only that last-ditch CheckConnection() notices the lost connection.
[I notice more examples of poor error reporting cosmetics in some of these COPY corner cases, so I anticipate another patch someday.]

As a side note, the documentation for PQexec() is misleading about what will happen if COPY is present in a multi-command string. It says: Note however that the returned PGresult structure describes only the result of the last command executed from the string. Should one of the commands fail, processing of the string stops with it and the returned PGresult describes the error condition. It does not explain that it also stops if it hits a COPY. I had to read the source code for libpq to understand why this psql logic was coded the way it is.

Agreed; I went through a similar process. A while after reading the code, I found the behavior documented in the section Functions Associated with the COPY Command: If a COPY command is issued via PQexec in a string that could contain additional commands, the application must continue fetching results via PQgetResult after completing the COPY sequence. Only when PQgetResult returns NULL is it certain that the PQexec command string is done and it is safe to issue more commands.

I'm not quite sure what revision would help most here -- a cross reference, moving that content, duplicating that content. Offhand, I'm inclined to move it to the PQexec() documentation with some kind of reference back from its original location. Thoughts?

Thanks, nm
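The behavior quoted from the COPY section of the docs amounts to a drain loop like the following (a sketch, assuming an established PGconn *conn and a table t; error handling elided):

```c
#include <libpq-fe.h>

/*
 * Drain every result of a multi-command string, including COPY OUT data.
 * PQexec returns as soon as it reaches the COPY; PQgetResult must then be
 * called until it yields NULL before the connection is usable again.
 */
static void
run_multi(PGconn *conn)
{
    PGresult *res = PQexec(conn, "SELECT 1; COPY t TO STDOUT; SELECT 2;");

    while (res != NULL)
    {
        if (PQresultStatus(res) == PGRES_COPY_OUT)
        {
            char *buf;

            while (PQgetCopyData(conn, &buf, 0) > 0)
                PQfreemem(buf);     /* consume and discard each data row */
        }
        PQclear(res);
        res = PQgetResult(conn);    /* NULL only when the string is done */
    }
}
```

This is essentially the shape of the psql logic discussed above: the surprise is that PQexec's return is the COPY result, not the last command's, and the remaining results arrive only through PQgetResult.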
Re: [HACKERS] sortsupport for text
On Fri, Mar 02, 2012 at 03:45:38PM -0500, Robert Haas wrote: SELECT SUM(1) FROM (SELECT * FROM randomtext ORDER BY t) x;

On unpatched master, this takes about 416 ms (plus or minus a few). With the attached patch, it takes about 389 ms (plus or minus a very few), a speedup of about 7%. I repeated the experiment using the C locale, like this:

SELECT SUM(1) FROM (SELECT * FROM randomtext ORDER BY t COLLATE "C") x;

Here, it takes about 202 ms with the patch, and about 231 ms on unpatched master, a savings of about 13%. [oprofile report, further discussion]

Thanks for looking into this. Your patch is also a nice demonstration of sortsupport's ability to help with more than just fmgr overhead.

Considering all that, I had hoped for more like a 15-20% gain from this approach, but it didn't happen, I suppose because some of the instructions saved just resulted in more processor stalls. All the same, I'm inclined to think it's still worth doing.

This is a border case, but I suggest that a 13% speedup on a narrowly-tailored benchmark, degrading to 7% in common configurations, is too meager to justify adopting this patch.

nm
Re: [HACKERS] Collect frequency statistics for arrays
On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote: Alexander Korotkov aekorot...@gmail.com writes: True. If (max count - min count + 1) is small, enumerating of frequencies is both more compact and more precise representation. Simultaneously, if (max count - min count + 1) is large, we can run out of statistics_target with such representation. We can use same representation of count distribution as for scalar column value: MCV and HISTOGRAM, but it would require additional statkind and statistics slot. Probably, you've better ideas?

I wasn't thinking of introducing two different representations, but just trimming the histogram length when it's larger than necessary. On reflection my idea above is wrong; for example assume that we have a column with 900 arrays of length 1 and 100 arrays of length 2. Going by what I said, we'd reduce the histogram to {1,2}, which might accurately capture the set of lengths present but fails to show that 1 is much more common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries) would capture the situation perfectly in one-tenth the space that the current logic does.

Granted. When the next sample finds 899/101 instead, though, the optimization vanishes. You save 90% of the space, perhaps 10% of the time. If you want to materially narrow typical statistics, Alexander's proposal looks like the way to go. I'd guess array columns always having DEC <= default_statistics_target are common enough to make that representation the dominant representation, if not the only necessary representation.
Re: [HACKERS] Collect frequency statistics for arrays
On Thu, Mar 08, 2012 at 11:30:52AM -0500, Tom Lane wrote: Noah Misch n...@leadboat.com writes: On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote: On reflection my idea above is wrong; for example assume that we have a column with 900 arrays of length 1 and 100 arrays of length 2. Going by what I said, we'd reduce the histogram to {1,2}, which might accurately capture the set of lengths present but fails to show that 1 is much more common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries) would capture the situation perfectly in one-tenth the space that the current logic does.

Granted. When the next sample finds 899/101 instead, though, the optimization vanishes.

No, you missed my next point. That example shows that sometimes a smaller histogram can represent the situation with zero error, but in all cases a smaller histogram can represent the situation with perhaps more error than a larger one. Since we already have a defined error tolerance, we should try to generate a histogram that is as small as possible while still not exceeding the error tolerance. Now, it might be that doing that is computationally impractical, or too complicated to be reasonable. But it seems to me to be worth looking into.

Yes, I did miss your point. One characteristic favoring this approach is its equal applicability to both STATISTIC_KIND_HISTOGRAM and STATISTIC_KIND_DECHIST.

If you want to materially narrow typical statistics, Alexander's proposal looks like the way to go. I'd guess array columns always having DEC <= default_statistics_target are common enough to make that representation the dominant representation, if not the only necessary representation.

Well, I don't want to have two representations; I don't think it's worth the complexity. But certainly we could consider switching to a different representation if it seems likely to usually be smaller.
Perhaps some heavy array users could provide input: what are some typical length ranges among arrays in your applications on which you use arr && const, arr @> const or arr <@ const searches?
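The shrunken-histogram idea from the 900/100 example above can be sketched as an equi-depth bound computation (illustrative code, not the ANALYZE implementation; function name is mine):

```python
def length_histogram(lengths, nbounds=10):
    """Equi-depth histogram bounds over observed array lengths: each
    adjacent pair of bounds spans roughly the same number of samples."""
    ordered = sorted(lengths)
    last = len(ordered) - 1
    # pick nbounds evenly spaced order statistics as the bounds
    return [ordered[last * i // (nbounds - 1)] for i in range(nbounds)]

# 900 arrays of length 1 and 100 of length 2: ten bounds capture the skew
print(length_histogram([1] * 900 + [2] * 100))
# -> [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]
```

The repeated 1s are what encode "length 1 is nine times as common as length 2"; the collapsed form {1,2} would lose exactly that information, which is the point Tom makes above.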
Re: [HACKERS] foreign key locks, 2nd attempt
On Mon, Mar 12, 2012 at 01:28:11PM -0400, Robert Haas wrote: I spent some time thinking about this over the weekend, and I have an observation, and an idea. Here's the observation: I believe that locking a tuple whose xmin is uncommitted is always a noop, because if it's ever possible for a transaction to wait for an XID that is part of its own transaction (exact same XID, or sub-XIDs of the same top XID), then a transaction could deadlock against itself. I believe that this is not possible: if a transaction were to wait for an XID assigned to that same backend, then the lock manager would observe that an ExclusiveLock on the xid is already held, so the request for a ShareLock would be granted immediately. I also don't believe there's any situation in which the existence of an uncommitted tuple fails to block another backend, but a lock on that same uncommitted tuple would have caused another backend to block. If any of that sounds wrong, you can stop reading here (but please tell me why it sounds wrong).

When we lock an update-in-progress row, we walk the t_ctid chain and lock all descendant tuples. They may all have uncommitted xmins. This is essential to ensure that the final outcome of the updating transaction does not affect whether the locking transaction has its KEY SHARE lock. Similarly, when we update a previously-locked tuple, we copy any locks (always KEY SHARE locks) to the new version. That new tuple is both uncommitted and has locks, and we cannot easily sacrifice either property. Do you see a way to extend your scheme to cover these needs?
Re: [HACKERS] Measuring relation free space
On Fri, Mar 09, 2012 at 02:18:02AM -0500, Jaime Casanova wrote: On Wed, Feb 22, 2012 at 12:27 AM, Noah Misch n...@leadboat.com wrote: On Tue, Feb 14, 2012 at 02:04:26AM -0500, Jaime Casanova wrote: 1) pgstattuple-gin_spgist.patch This first patch adds gin and spgist support to pgstattuple, also makes pgstattuple use a ring buffer when reading tables or indexes.

The buffer access strategy usage bits look fine to commit.

ok. i extracted that part. which basically makes pgstattuple usable in production (i mean, by not bloating shared buffers when using the function)

I created a CF entry for this and marked it Ready for Committer. You left the bstrategy variable non-static, but that didn't seem important enough to justify another round trip.
Re: [HACKERS] foreign key locks, 2nd attempt
On Tue, Mar 13, 2012 at 01:46:24PM -0400, Robert Haas wrote: On Mon, Mar 12, 2012 at 3:28 PM, Simon Riggs si...@2ndquadrant.com wrote: I agree with you that some worst case performance tests should be done. Could you please say what you think the worst cases would be, so those can be tested? That would avoid wasting time or getting anything backwards.

I've thought about this some and here's what I've come up with so far:

1. SELECT FOR SHARE on a large table on a system with no write cache.

Easy enough that we may as well check it. Share-locking an entire large table is impractical in a real application, so I would not worry if this shows a substantial regression.

2. A small parent table (say 30 rows or so) and a larger child table with a many-to-one FK relationship to the parent (say 100 child rows per parent row), with heavy update activity on the child table, on a system where fsyncs are very slow. This should generate lots of mxid consumption, and every 1600 or so mxids (I think) we've got to fsync; does that generate a noticeable performance hit?

More often than that; each 2-member mxid takes 4 bytes in an offsets file and 10 bytes in a members file. So, more like one fsync per ~580 mxids. Note that we already fsync the multixact SLRUs today, so any increase will arise from the widening of member entries from 4 bytes to 5. The realism of this test is attractive. Nearly-static parent tables are plenty common, and this test will illustrate the impact on those designs.

3. It would be nice to test the impact of increased mxid lookups in the parent, but I've realized that the visibility map will probably mask a good chunk of that effect, which is a good thing.
Still, maybe something like this: a fairly large parent table, say a million rows, but narrow rows, so that many of them fit on a page, with frequent reads and occasional updates (if there are only reads, autovacuum might end with all the visibility map bits set); plus a child table with one or a few rows per parent which is heavily updated. In theory this ought to be good for the patch, since the more fine-grained locking will avoid blocking, but in this case the parent table is large enough that you shouldn't get much blocking anyway, yet you'll still pay the cost of mxid lookups because the occasional updates on the parent will clear VM bits. This might not be exactly the right workload to measure this effect, but if it's not maybe someone can devote a little time to thinking about what would be.

You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid lookups, so I think something more sophisticated is needed to exercise that cost. Not sure what.

4. A plain old pgbench run or two, to see whether there's any regression when none of this matters at all...

Might as well.

This isn't exactly a test case, but from Noah's previous comments I gather that there is a theoretical risk of mxid consumption running ahead of xid consumption. We should try to think about whether there are any realistic workloads where that might actually happen. I'm willing to believe that there aren't, but not just because somebody asserts it. The reason I'm concerned about this is because, if it should happen, the result will be more frequent anti-wraparound vacuums on every table in the cluster. Those are already quite painful for some users.

Yes. Pre-release, what can we really do here other than have more people thinking about ways it might happen in practice? Post-release, we could suggest monitoring methods or perhaps have VACUUM emit a WARNING when a table is using more mxid space than xid space.
Also consider a benchmark that does plenty of non-key updates on a parent table with no activity on the child table. We'll pay the overhead of determining that the key column(s) have not changed, but it will never pay off by preventing a lock wait. Granted, this is barely representative of application behavior. Perhaps, too, we already have a good sense of this cost from the HOT benchmarking efforts and have no cause to revisit it.

Thanks, nm
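The "one fsync per ~580 mxids" figure above follows from simple per-entry arithmetic (a back-of-envelope check, assuming page-granular SLRU writes):

```python
# A 2-member multixact costs 4 bytes in the offsets SLRU plus
# 2 * 5 = 10 bytes in the members SLRU (5-byte member entries per
# the patch, widened from 4).
SLRU_PAGE = 8192                  # bytes per SLRU page
bytes_per_mxid = 4 + 2 * 5        # offsets entry + two member entries

print(SLRU_PAGE // bytes_per_mxid)   # -> 585
```

So one page's worth of combined SLRU data fills after roughly 585 two-member multixacts, versus the "every 1600 or so" figure quoted earlier in the thread.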
Re: [HACKERS] foreign key locks, 2nd attempt
On Wed, Mar 14, 2012 at 01:23:14PM -0400, Robert Haas wrote: On Tue, Mar 13, 2012 at 11:42 PM, Noah Misch n...@leadboat.com wrote: More often than that; each 2-member mxid takes 4 bytes in an offsets file and 10 bytes in a members file. So, more like one fsync per ~580 mxids. Note that we already fsync the multixact SLRUs today, so any increase will arise from the widening of member entries from 4 bytes to 5. The realism of this test is attractive. Nearly-static parent tables are plenty common, and this test will illustrate the impact on those designs.

Agreed. But speaking of that, why exactly do we fsync the multixact SLRU today?

Good question. So far, I can't think of a reason. nextMulti is critical, but we already fsync it with pg_control. We could delete the other multixact state data at every startup and set OldestVisibleMXactId accordingly.

You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid lookups, so I think something more sophisticated is needed to exercise that cost. Not sure what.

I don't think HEAP_XMAX_COMMITTED is much help, because committed != all-visible. HEAP_XMAX_INVALID will obviously help, when it happens.

True. The patch (see ResetMultiHintBit()) also replaces a multixact xmax with the updater xid when all transactions of the multixact have ended. You would need a test workload with long-running multixacts that delay such replacement. However, the workloads that come to mind are the very workloads for which this patch eliminates lock waits; they wouldn't illustrate a worst-case.

This isn't exactly a test case, but from Noah's previous comments I gather that there is a theoretical risk of mxid consumption running ahead of xid consumption. We should try to think about whether there are any realistic workloads where that might actually happen. I'm willing to believe that there aren't, but not just because somebody asserts it.
The reason I'm concerned about this is because, if it should happen, the result will be more frequent anti-wraparound vacuums on every table in the cluster. Those are already quite painful for some users.

Yes. Pre-release, what can we really do here other than have more people thinking about ways it might happen in practice? Post-release, we could suggest monitoring methods or perhaps have VACUUM emit a WARNING when a table is using more mxid space than xid space.

Well, post-release, the cat is out of the bag: we'll be stuck with this whether the performance characteristics are acceptable or not. That's why we'd better be as sure as possible before committing to this implementation that there's nothing we can't live with. It's not like there's any reasonable way to turn this off if you don't like it.

I disagree; we're only carving in stone the FOR KEY SHARE and FOR KEY UPDATE syntax additions. We could even avoid doing that by not documenting them. A later major release could implement them using a completely different mechanism or even reduce them to aliases, KEY SHARE = SHARE and KEY UPDATE = UPDATE. To be sure, let's still do a good job the first time.
Re: [HACKERS] foreign key locks, 2nd attempt
On Thu, Mar 15, 2012 at 08:37:36PM -0400, Robert Haas wrote: On Thu, Mar 15, 2012 at 5:07 PM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, Mar 14, 2012 at 5:23 PM, Robert Haas robertmh...@gmail.com wrote: You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid lookups, so I think something more sophisticated is needed to exercise that cost. Not sure what.

I don't think HEAP_XMAX_COMMITTED is much help, because committed != all-visible.

So because committed does not equal all visible there will be additional lookups on mxids? That's complete rubbish.

Noah seemed to be implying that once the updating transaction committed, HEAP_XMAX_COMMITTED would get set and save the mxid lookup. But I think that's not true, because anyone who looks at the tuple afterward will still need to know the exact xmax, to test it against their snapshot.

Yeah, my comment above was wrong. I agree that we'll need to retrieve the mxid members during every MVCC scan until we either mark the page all-visible or have occasion to simplify the mxid xmax to the updater xid.
Re: [HACKERS] pg_terminate_backend for same-role
On Thu, Mar 15, 2012 at 04:14:03PM -0700, Daniel Farina wrote: Parallel to pg_cancel_backend, it'd be nice to allow the user to just outright kill a backend that they own (politely, with a SIGTERM), aborting any transactions in progress, including the idle transaction, and closing the socket.

+1

I imagine the problem is a race condition whereby a pid might be reused by another process owned by another user (doesn't that also affect pg_cancel_backend?). Shall we just do everything using the MyCancelKey (which I think could just be called SessionKey, SessionSecret, or even just Session) as to ensure we have no case of mistaken identity? Or does that end up being problematic?

No, I think the hazard you identify here is orthogonal to the question of when to authorize pg_terminate_backend(). As you note downthread, protocol-level cancellations available in released versions already exhibit this hazard. I wouldn't mind a clean fix for this, but it's an independent subject.

Here I discussed a hazard specific to allowing pg_terminate_backend(): http://archives.postgresql.org/message-id/20110602045955.gc8...@tornado.gateway.2wire.net

To summarize, user code can trap SIGINT cancellations, but it cannot trap SIGTERM terminations. If a backend is executing a SECURITY DEFINER function when another backend of the same role calls pg_terminate_backend() thereon, the pg_terminate_backend() caller could achieve something he cannot achieve in PostgreSQL 9.1. I vote that this is an acceptable loss.

Thanks, nm
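To illustrate the trappable/untrappable distinction (the function name and body here are illustrative, not from the patch):

```sql
-- A cancel (SIGINT, e.g. from pg_cancel_backend) surfaces in PL/pgSQL as
-- the query_canceled condition, which the function can trap; a termination
-- (SIGTERM, from pg_terminate_backend) never reaches this handler.
CREATE FUNCTION guarded_work() RETURNS void
LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
    PERFORM pg_sleep(3600);    -- stand-in for privileged work
EXCEPTION WHEN query_canceled THEN
    RAISE NOTICE 'cancel trapped; SECURITY DEFINER code chooses its response';
END;
$$;
```

This is why same-role pg_terminate_backend() grants a capability that same-role cancellation does not: the SECURITY DEFINER code loses its chance to intercept the interruption.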
Re: [HACKERS] [PATCH] Support for foreign keys with arrays
[used followup EACH-foreign-key.v4b.patch.bz2 for review]

On Wed, Mar 14, 2012 at 07:03:08PM +0100, Marco Nenciarini wrote: please find attached v4 of the EACH Foreign Key patch (formerly known also as Foreign Key Array).

On Fri, Feb 24, 2012 at 09:01:35PM -0500, Noah Misch wrote: We can use multi-dimensional arrays as well as referencing columns. In that case though, ON DELETE EACH CASCADE will behave like ON DELETE EACH SET NULL. This is a safe way of implementing the action. We have some ideas on how to implement this, but we feel it is better to limit the behaviour for now.

This still feels awfully unprincipled to me. How about just throwing an error when we need to remove an element from a multidimensional array? Essentially, have ON DELETE EACH CASCADE downgrade to ON DELETE RESTRICT when it encounters a multidimensional array. That's strictly less code than what you have now, and it keeps our options open. We can always change from error to set-null later, but we can't so readily change from set-null to anything else.

That seems reasonable to me. Implemented and documented.

I like the semantics now. You implemented this by doing two scans per PK delete, the first to detect multidimensional dependants of the PK row and the second to fix 1-dimensional dependants. We don't require an index to support the scan, so this can mean two seq scans. Currently, the only benefit to doing it this way is a change of error message. Here is the current error message when we would need to remove a multidimensional array element:

ERROR: update or delete on table "pktableforarray" violates foreign key constraint "fktableforarray_ftest1_fkey" on table "fktableforarray"
DETAIL: Key (ptest1)=(2) is still referenced from table "fktableforarray".
If I just rip out the first scan, we get the same outcome with a different
error message:

ERROR:  removing elements from multidimensional arrays is not supported
CONTEXT:  SQL statement "UPDATE ONLY "public"."fktableforarray" SET "ftest1" = array_remove("ftest1", $1) WHERE $1 OPERATOR(pg_catalog.=) ANY ("ftest1")"

That has less polish, but I think it's actually more useful.  The first
message gives no indication that a multidimensional array foiled your ON
DELETE EACH CASCADE.  The second message hints at that cause.  I recommend
removing the new block of code in RI_FKey_eachcascade_del() and letting
array_remove() throw the error.  If you find a way to throw a nicer error
without an extra scan, by all means submit that to a future CF as an
improvement.  I don't think it's important enough to justify cycles at this
late hour of the current CF.

> > As I pondered this patch again, I came upon formal hazards around
> > non-default operator classes.  Today, ATAddForeignKeyConstraint()
> > always chooses pfeqop, ffeqop and ppeqop in the same operator family.
> > If it neglected this, we would get wrong behavior when one of the
> > operators is sensitive to a change to which another is insensitive.
> > For EACH FK constraints, this patch sets ffeqop = ARRAY_EQ_OP
> > unconditionally.  array_eq() uses the TYPECACHE_EQ_OPR (usually from
> > the default B-tree operator class).  That operator may not exist at
> > all, let alone share an operator family with pfeqop.  Shall we deal
> > with this by retrieving TYPECACHE_EQ_OPR in
> > ATAddForeignKeyConstraint() and rejecting the constraint creation if
> > it does not share an operator family with pfeqop?  The same hazard
> > affects ri_triggers.c use of array_remove() and array_replace(), and
> > the same change mitigates it.
> 
> Check added. As a consequence of this stricter policy, certain
> previously allowed cases are now unacceptable (e.g. pk FLOAT and fk
> INT[]). Regression tests have been added.

Ah, good.  Not so formal after all.

> > pg_constraint.confeach needs documentation.
> Most of pg_constraint columns, including all the boolean ones, are
> documented only in the description column of
> http://www.postgresql.org/docs/9.1/static/catalog-pg-constraint.html#AEN85760
> it seems that confiseach should not be an exception, since it just
> indicates whether the constraint is of a certain kind or not.

Your patch adds two columns to pg_constraint, confiseach and confeach, but
it only mentions confiseach in documentation.  Just add a similar minimal
mention of confeach.

*** 5736,5741 **** ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5759,5807 ----
  				(errcode(ERRCODE_INVALID_FOREIGN_KEY),
  				 errmsg("number of referencing and referenced columns for foreign key disagree")));
  
+ 	/* Enforce each foreign key restrictions */
+ 	if (fkconstraint->fk_is_each)
+ 	{
+ 		/*
+ 		 * ON UPDATE CASCADE action is not supported on EACH foreign keys
+ 		 */
+ 		if (fkconstraint->fk_upd_action
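Returning to the array_remove() question above: a quick sketch of the
behavior I'm recommending we lean on, using the patch's array_remove()
function (not in core as of this writing, so this is my understanding of its
semantics rather than tested output):

```sql
-- One-dimensional arrays: the element is removed.
SELECT array_remove(ARRAY[1,2,3], 2);
-- expected: {1,3}

-- Multidimensional arrays: the function raises the error quoted earlier,
-- which is how ON DELETE EACH CASCADE would degrade to RESTRICT-like
-- behavior without the extra detection scan.
SELECT array_remove(ARRAY[[1,2],[3,4]], 2);
-- ERROR:  removing elements from multidimensional arrays is not supported
```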
Re: [HACKERS] pg_terminate_backend for same-role
On Fri, Mar 16, 2012 at 04:42:07PM -0700, Daniel Farina wrote:
> On Fri, Mar 16, 2012 at 3:42 PM, Noah Misch n...@leadboat.com wrote:
> > On Thu, Mar 15, 2012 at 04:14:03PM -0700, Daniel Farina wrote:
> > > I imagine the problem is a race condition whereby a pid might be
> > > reused by another process owned by another user (doesn't that also
> > > affect pg_cancel_backend?).  Shall we just do everything using the
> > > MyCancelKey (which I think could just be called SessionKey,
> > > SessionSecret, or even just Session) as to ensure we have no case
> > > of mistaken identity? Or does that end up being problematic?
> > 
> > No, I think the hazard you identify here is orthogonal to the
> > question of when to authorize pg_terminate_backend().  As you note
> > downthread, protocol-level cancellations available in released
> > versions already exhibit this hazard.  I wouldn't mind a clean fix
> > for this, but it's an independent subject.
> 
> Hmm. Well, here's a patch that implements exactly that, I think,
> similar to the one posted to this thread, but not using BackendIds,
> but rather the newly-introduced SessionId. Would appreciate comments.
> Because an out-of-band signaling method has been integrated more
> complex behaviors -- such as closing the
> TERM-against-SECURITY-DEFINER-FUNCTION hazard -- can be addressed.
> For now I've only attempted to solve the problem of backend ambiguity,
> which basically necessitated out-of-line information transfer as per
> the usual means.

This patch still changes the policy for pg_terminate_backend(), and it does
not fix other SIGINT senders like processCancelRequest() and ProcSleep().
If you're concerned about PID-reuse races, audit all backend signalling.
Either fix all such problems or propose a plan to get there eventually.
Any further discussion of this topic needs a new subject line; mixing its
consideration with proposals to change the policy behind
pg_terminate_backend() reduces the chances of the right people commenting
on these distinct questions.
Currently, when pg_terminate_backend() follows a pg_cancel_backend() on
which the target has yet to act, the eventual outcome is a terminated
process.  With this patch, the pg_terminate_backend() becomes a no-op with
this warning:

! 	ereport(WARNING,
! 			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 			 (errmsg("process is busy responding to administrative "
! 					 "request")),
! 			 (errhint("This is temporary, and may be retried."))));

That's less useful than the current behavior.

That being said, I can't get too excited about closing PID-reuse races.
I've yet to see another program do so.  I've never seen a trouble report
around this race for any software.  Every OS I have used assigns PIDs so as
to maximize the reuse interval, which seems like an important POLA measure
given typical admin formulae like kill `ps | grep ...`.  This defense can
only matter in fork-bomb conditions, at which point a stray signal is
minor.  I do think it's worth keeping this idea in a back pocket for
achieving those more complex behaviors, should we ever desire them.

> > Here I discussed a hazard specific to allowing pg_terminate_backend():
> > http://archives.postgresql.org/message-id/20110602045955.gc8...@tornado.gateway.2wire.net
> > To summarize, user code can trap SIGINT cancellations, but it cannot
> > trap SIGTERM terminations.  If a backend is executing a SECURITY
> > DEFINER function when another backend of the same role calls
> > pg_terminate_backend() thereon, the pg_terminate_backend() caller
> > could achieve something he cannot achieve in PostgreSQL 9.1.  I vote
> > that this is an acceptable loss.
> 
> I'll throw out a patch that just lets this hazard exist and see what
> happens, although it is obsoleted/incompatible with the one already
> attached.

+1.  Has anyone actually said that the PID-reuse race or the thing I
mention above should block such a patch?  I poked back through the threads
I could remember and found nothing.
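To make the trap-the-cancel point concrete, here is a minimal sketch
(function name and notice text invented for illustration): PL/pgSQL can
catch the query_canceled condition that a SIGINT cancellation raises,
whereas a SIGTERM termination ends the backend without ever entering the
EXCEPTION block:

```sql
CREATE FUNCTION guarded_work() RETURNS void
LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
    PERFORM pg_sleep(600);   -- stand-in for privileged work
EXCEPTION
    WHEN query_canceled THEN
        -- Reached on pg_cancel_backend() (SIGINT); the function chooses
        -- how to respond.  pg_terminate_backend() (SIGTERM) never gets
        -- here, which is the behavior difference at issue.
        RAISE NOTICE 'cancel trapped';
END;
$$;
```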