Re: [HACKERS] pg_stat_statements normalization: re-review

2012-02-24 Thread Daniel Farina
On Thu, Feb 23, 2012 at 3:09 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 These have all been changed in the usual manner to support one new
 field, the queryId, on the toplevel Plan and Query nodes.  The queryId
 is 64-bit field copied from queries to plans to tie together plans to
 be used by plugins (quoth the source) as they flow through the
 system.  pg_stat_statements fills them with the 64-bit hash of the
 query text -- a reasonable functionality to possess, I think, but the
 abstraction seems iffy: plugins cannot compose well if they all need
 to use the queryId for different reasons.  Somehow I get the feeling
 that this field can be avoided or its use case can be satisfied in a
 more satisfying way.

 Maybe the answer here is to have a simple mechanism for modules to
 claim the exclusive right to be able to set that flag?

As I see it, the queryId as the most contentious internals change in
the patch, as queryId is not really an opaque id tracking a query's
progression, but rather a bit of extra space for query jumbles, and
query jumbles only, as far as I can tell: thus, not really a generally
useful backend service, but rather specifically for pg_stat_statements
-- that may not take such a special field off the table (more
opinionated people will probably comment on that), but it's a pretty
awkward kind of inclusion.

I'm posting an experimental patch in having the backend define state
whose only guarantee is to be equal between a PlannedStmt and a Query
node that derived it, and its effect on pg_stat_statements.  (the
latter not included in this email, but available via git[0])  The
exact mechanism I use is pretty questionable, but my skimming of the
entry points of the code suggests it might work: I use the pointer to
the Query node from the PlannedStmt, which is only unique as long as
the Query (head) node remains allocated while the PlannedStmt is also
alive.  A big if, but able to be verified with assertions in debug
mode (does the pointer see poisoned memory?) and low overhead
presuming one had to allocate memory already.  However, the
abstraction is as such that if the key generation needs to change
compliant consumers should be fine.

At ExecutorFinish (already hookable) all NodeKeys remembered by an
extension should be invalidated, as that memory is free and ready to
be used again.

As Query node availability from the PlannedStmt is not part of the
node for a reason, it is rendered as an opaque uintptr_t, and hidden
behind a type that only is intended to define equality and
definedness (!= NULL is the implementation).

The clear penalty is that the extension must be able to map from the
arbitrary, backend-assigned queryId (which I have renamed nodeKey just
to make it easier to discuss) to its own internal value it cares
about: the query jumble.  I used a very simple, fixed-size association
array with a small free-space location optimization, taking advantage
of the property in pg_stat_statements can simply not process a query
if it doesn't want to (but should process most of them so one can get
a good sense of what is going on).

A credible(?) alternative I have thought of is to instead expose a
memory context for use by extensions on the interesting nodes; in
doing so extensions can request space and store whatever they need.
The part that I feel might be tricky is figuring out a reasonable and
appropriate time to free that 'extension-context' and what context
that context should live under.  But I'd be wiling to try it with
haste if it sounds a lot more credible or useful than what I've got.

-- 
fdr

[0]: https://github.com/fdr/postgres/tree/wrench-pgss
From c3ad77375062cfbeee7d4ce7e0fe274a5db76453 Mon Sep 17 00:00:00 2001
From: Daniel Farina dan...@heroku.com
Date: Fri, 24 Feb 2012 01:31:54 -0800
Subject: [PATCH] Introduce NodeKey as a service to extensions in the backend

Signed-off-by: Daniel Farina dan...@heroku.com
---
 src/backend/nodes/copyfuncs.c|4 ++--
 src/backend/nodes/outfuncs.c |6 --
 src/backend/nodes/readfuncs.c|7 +++
 src/backend/optimizer/plan/planner.c |2 +-
 src/backend/parser/analyze.c |2 +-
 src/include/nodes/parsenodes.h   |4 ++--
 src/include/nodes/plannodes.h|4 +++-
 src/include/nodes/primnodes.h|   14 ++
 8 files changed, 26 insertions(+), 17 deletions(-)

*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***
*** 91,97  _copyPlannedStmt(const PlannedStmt *from)
  	COPY_NODE_FIELD(relationOids);
  	COPY_NODE_FIELD(invalItems);
  	COPY_SCALAR_FIELD(nParamExec);
! 	COPY_SCALAR_FIELD(queryId);
  
  	return newnode;
  }
--- 91,97 
  	COPY_NODE_FIELD(relationOids);
  	COPY_NODE_FIELD(invalItems);
  	COPY_SCALAR_FIELD(nParamExec);
! 	COPY_SCALAR_FIELD(nodeKey);
  
  	return newnode;
  }
***
*** 2415,2421  _copyQuery(const Query *from)
  
  	COPY_SCALAR_FIELD(commandType);
  	

Re: [HACKERS] pg_stat_statements normalization: re-review

2012-02-24 Thread Daniel Farina
On Fri, Feb 24, 2012 at 3:14 AM, Daniel Farina dan...@heroku.com wrote:
 At ExecutorFinish (already hookable) all NodeKeys remembered by an
 extension should be invalidated, as that memory is free and ready to
 be used again.

I think this statement is false; I thought it was true because *not*
invalidating gives me correct-appearing but unsound results.

Firstly, it should be ExecutorEnd, not ExecutorFinish, but this
doesn't work because those run in the exec_simple_query path twice:
once in the planner and then again in PortalCleanup, it's important
that invalidation only occur at the end, and not twice.  So there's a
problem of a kind resulting from this experiment.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_stat_statements normalization: re-review

2012-02-23 Thread Peter Geoghegan
On 23 February 2012 09:58, Daniel Farina dan...@heroku.com wrote:
 What I'm going to do here is focus on the back-end source changes that
 are not part of the contrib.  The size of the changes are much
 reduced, but their semantics are also somewhat more complex...so, here
 goes.  Conclusion first:

 * The small changes to hashing are probably not strictly required,
 unless collisions are known to get terrible.

I imagine that collisions would be rather difficult to demonstrate at
all with a 32-bit value.

 * The addition of a field to scribble upon in the Query and Plan nodes
 is something I'd like to understand better, as these leave me with a
 bit of disquiet.

 What follows is in much more detail, and broken up by functionality
 and file grouping in the backend.

  src/include/access/hash.h            |    4 ++-
  src/backend/access/hash/hashfunc.c   |   21 ++-

 This adjusts the hash_any procedure to support returning two possible
 precisions (64 bit and 32 bit) by tacking on a Boolean flag to make
 the precision selectable.  The hash_any operator is never used
 directly, and instead via macro, and a second macro to support 64-bit
 precision has been added.  It seems a bit wonky to me to use a flag to
 select this rather than encapsulating the common logic in a procedure
 and just break this up into three symbols, though.  I think the longer
 precision is used to try to get fewer collisions with not too much
 slowdown.  Although it may meaningfully improve the quality of the
 hashing for pg_stat_statements or living without the extra hashing
 bits might lead to unusable results (?), per the usual semantics of
 hashing it is probably not *strictly* necessary.

It might be unnecessary. It's difficult to know where to draw the line.

 A smidgen of avoidable formatting niggles ( 80 columns) are in this section.

  src/backend/nodes/copyfuncs.c        |    5 
  src/backend/nodes/equalfuncs.c       |    4 +++
  src/backend/nodes/outfuncs.c         |   10 +
  src/backend/optimizer/plan/planner.c |    1 +
  src/backend/nodes/readfuncs.c        |   12 +++
  src/include/nodes/parsenodes.h       |    3 ++
  src/include/nodes/plannodes.h        |    2 +

I'll look into those.

 These have all been changed in the usual manner to support one new
 field, the queryId, on the toplevel Plan and Query nodes.  The queryId
 is 64-bit field copied from queries to plans to tie together plans to
 be used by plugins (quoth the source) as they flow through the
 system.  pg_stat_statements fills them with the 64-bit hash of the
 query text -- a reasonable functionality to possess, I think, but the
 abstraction seems iffy: plugins cannot compose well if they all need
 to use the queryId for different reasons.  Somehow I get the feeling
 that this field can be avoided or its use case can be satisfied in a
 more satisfying way.

Maybe the answer here is to have a simple mechanism for modules to
claim the exclusive right to be able to set that flag?

 Mysteriously, in the Query representation the symbol uses an
 underscored-notation (query_id) and in the planner it uses a camelcase
 one, queryId.  Overall, the other fields in those structs all use
 camelcase, so I'd recommend normalizing it.

Fair enough.

There is one other piece of infrastructure that I think is going to
need to go into core that was not in the most recent revision. I
believe that it can be justified independently of this work.
Basically, there are some very specific scenarios in which the
location of Const nodes is incorrect, because the core system makes
that location the same location as another coercion-related node. Now,
this actually doesn't matter to the positioning of the caret in most
error messages, as they usually ought to be the same anyway, as in
this scenario:

select 'foo'::integer;

Here, both the Const and coercion node's location start from the
SCONST token - no problem there.

However, consider this query:

select cast('foo' as integer);

and this query:

select integer 'foo';

Now, the location of the Const and Coercion nodes is *still* the same,
but starting from the location of the cast in the first example and
integer in the second.

I believe that this is a bug. They should have different locations
here, so that I can always rely on the location of a Const having a
single token that is one of only a handful of possible CONST tokens,
as I currently can in all other scenarios.

Of course, this didn't particularly matter before now, as the Const
location field only dictated caret position in messages generated from
outside the parser. Any error message should be blaming the Coercion
node if there's a problem with coercion, so there's no reason why this
needs to break any error context messages. If a message wants to blame
a Const on something, surely it should have a caret placed in the
right location - the location of the const token. If it wants to blame
the Coercion node on something, that won't change, 

Re: [HACKERS] pg_stat_statements normalization: re-review

2012-02-23 Thread Peter Geoghegan
On 23 February 2012 11:09, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 23 February 2012 09:58, Daniel Farina dan...@heroku.com wrote:
 * The small changes to hashing are probably not strictly required,
 unless collisions are known to get terrible.

 I imagine that collisions would be rather difficult to demonstrate at
 all with a 32-bit value.

Assuming that the hash function exhibits a perfectly uniform
distribution of values (FWIW, hash_any is said to exhibit the
avalanche effect), the birthday problem provides a mathematical basis
for estimating the probability of some 2 queries being alike. Assuming
a population of 1,000 queries are in play at any given time (i.e. the
default value of pg_stat_statements.max) and 2 ^ 32 days of the
year, that puts the probability of a collision at a vanishingly small
number. I cannot calculate the number with Gnome calculator. Let's
pretend that 2 ^ 32 is exactly 42 million, rather than approximately
4.2 *billion*. Even then, the probability of collision is a minuscule
0.01857 .

I'd have to agree that a uint32 hash is quite sufficient here.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers