Re: [HACKERS] pg_stat_statements normalization: re-review
On Fri, Feb 24, 2012 at 3:14 AM, Daniel Farina wrote:
> At ExecutorFinish (already hookable) all NodeKeys remembered by an
> extension should be invalidated, as that memory is free and ready to
> be used again.

I think this statement is false; I thought it was true because *not* invalidating gives me correct-appearing but unsound results. Firstly, it should be ExecutorEnd, not ExecutorFinish. But even that doesn't work, because in the exec_simple_query path the hook runs twice: once from the planner and then again in PortalCleanup. It's important that invalidation occur only at the end, and not twice. So this experiment has surfaced a real problem.

-- fdr

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
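[Editor's note] The "only once, at the end" requirement can be made concrete with a minimal sketch: arm invalidation at executor start so that a second ExecutorEnd-style call becomes a no-op. All names here are hypothetical illustrations, not the patch's actual hooks, and the sketch deliberately ignores nested executor invocations:

```c
#include <assert.h>

/* Illustrative guard: invalidate remembered NodeKeys at most once per
 * top-level statement, even if the end hook fires a second time (as it
 * does in the exec_simple_query path described above). */
int armed = 0;          /* set when a statement begins executing */
int invalidations = 0;  /* observable counter, for demonstration */

void sketch_executor_start(void)
{
    armed = 1;
}

void sketch_executor_end(void)
{
    if (armed)
    {
        invalidations++;    /* stand-in for "invalidate all NodeKeys" */
        armed = 0;          /* a repeated end call now does nothing */
    }
}
```

A second call to `sketch_executor_end` without an intervening start is simply ignored, which is the behavior the email argues for.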
Re: [HACKERS] pg_stat_statements normalization: re-review
On Thu, Feb 23, 2012 at 3:09 AM, Peter Geoghegan wrote:
>> These have all been changed in the usual manner to support one new
>> field, the queryId, on the toplevel Plan and Query nodes. The queryId
>> is a 64-bit field copied from queries to plans to tie together plans "to
>> be used by plugins" (quoth the source) as they flow through the
>> system. pg_stat_statements fills them with the 64-bit hash of the
>> query text -- a reasonable functionality to possess, I think, but the
>> abstraction seems iffy: plugins cannot compose well if they all need
>> to use the queryId for different reasons. Somehow I get the feeling
>> that this field can be avoided or its use case can be satisfied in a
>> more satisfying way.
>
> Maybe the answer here is to have a simple mechanism for modules to
> claim the exclusive right to be able to set that flag?

I see the queryId as the most contentious internals change in the patch: queryId is not really an opaque id tracking a query's progression, but rather a bit of extra space for query jumbles, and query jumbles only, as far as I can tell. It is thus not a generally useful backend service, but something specific to pg_stat_statements. That may not take such a special field off the table (more opinionated people will probably comment on that), but it's a pretty awkward kind of inclusion.

I'm posting an experimental patch that has the backend define state whose only guarantee is to be equal between a PlannedStmt and the Query node that derived it, along with its effect on pg_stat_statements (the latter not included in this email, but available via git[0]). The exact mechanism I use is pretty questionable, but my skimming of the entry points of the code suggests it might work: I use the pointer to the Query node from the PlannedStmt, which is only unique as long as the Query (head) node remains allocated while the PlannedStmt is also alive.
A big if, but one that can be verified with assertions in debug mode ("does the pointer see poisoned memory?") at low overhead, presuming one had to allocate the memory already. The abstraction is such that if the key generation needs to change, compliant consumers should be fine. At ExecutorFinish (already hookable) all NodeKeys remembered by an extension should be invalidated, as that memory is free and ready to be used again.

As Query node availability from the PlannedStmt is not part of the node for a reason, the key is rendered as an opaque uintptr_t and hidden behind a type that is only intended to define equality and "definedness" (!= NULL is the implementation). The clear penalty is that the extension must be able to map from the arbitrary, backend-assigned queryId (which I have renamed nodeKey just to make it easier to discuss) to the internal value it cares about: the query jumble. I used a very simple, fixed-size associative array with a small free-space location optimization, taking advantage of the fact that pg_stat_statements can simply decline to process a query if it doesn't want to (though it should process most of them, so one can get a good sense of what is going on).

A credible(?) alternative I have thought of is to instead expose a memory context for use by extensions on the interesting nodes; in doing so extensions could request space and store whatever they need. The part that I feel might be tricky is figuring out a reasonable and appropriate time to free that 'extension context', and which context it should live under. But I'd be willing to try it with haste if it sounds a lot more credible or useful than what I've got.
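[Editor's note] A minimal C sketch of the two ideas described above: an opaque NodeKey type defining only equality and definedness, plus a fixed-size associative array mapping keys to query jumbles. All names, the table size, and the probing scheme are illustrative assumptions, not the patch's actual API:

```c
#include <stdint.h>

/* Opaque key: the only defined operations are equality and
 * "definedness" (an uintptr_t behind the scenes, per the email). */
typedef uintptr_t NodeKey;

#define InvalidNodeKey ((NodeKey) 0)

static int NodeKeyIsValid(NodeKey k)          { return k != InvalidNodeKey; }
static int NodeKeyEquals(NodeKey a, NodeKey b) { return a == b; }

/* Fixed-size association from NodeKey to the extension's own value,
 * here a 64-bit query jumble.  Size chosen arbitrarily for the sketch. */
#define ASSOC_SIZE 8

typedef struct { NodeKey key; uint64_t jumble; } AssocEntry;

static AssocEntry assoc[ASSOC_SIZE];   /* zero-initialized: all invalid */
static int next_free = 0;              /* small free-space location hint */

int assoc_store(NodeKey key, uint64_t jumble)
{
    for (int i = 0; i < ASSOC_SIZE; i++)
    {
        int probe = (next_free + i) % ASSOC_SIZE;

        if (!NodeKeyIsValid(assoc[probe].key))
        {
            assoc[probe].key = key;
            assoc[probe].jumble = jumble;
            next_free = (probe + 1) % ASSOC_SIZE;
            return 1;
        }
    }
    return 0;   /* table full: the caller simply skips this query */
}

int assoc_lookup(NodeKey key, uint64_t *jumble)
{
    for (int i = 0; i < ASSOC_SIZE; i++)
        if (NodeKeyEquals(assoc[i].key, key))
        {
            *jumble = assoc[i].jumble;
            return 1;
        }
    return 0;
}

void assoc_invalidate(NodeKey key)
{
    for (int i = 0; i < ASSOC_SIZE; i++)
        if (NodeKeyEquals(assoc[i].key, key))
            assoc[i].key = InvalidNodeKey;
}
```

When the table is full, `assoc_store` simply fails, matching the observation that pg_stat_statements can decline to track a query without harm.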
-- fdr

[0]: https://github.com/fdr/postgres/tree/wrench-pgss

From c3ad77375062cfbeee7d4ce7e0fe274a5db76453 Mon Sep 17 00:00:00 2001
From: Daniel Farina
Date: Fri, 24 Feb 2012 01:31:54 -0800
Subject: [PATCH] Introduce NodeKey as a service to extensions in the backend

Signed-off-by: Daniel Farina
---
 src/backend/nodes/copyfuncs.c        |  4 ++--
 src/backend/nodes/outfuncs.c         |  6 --
 src/backend/nodes/readfuncs.c        |  7 +++
 src/backend/optimizer/plan/planner.c |  2 +-
 src/backend/parser/analyze.c         |  2 +-
 src/include/nodes/parsenodes.h       |  4 ++--
 src/include/nodes/plannodes.h        |  4 +++-
 src/include/nodes/primnodes.h        | 14 ++
 8 files changed, 26 insertions(+), 17 deletions(-)

*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 91,97 **** _copyPlannedStmt(const PlannedStmt *from)
  	COPY_NODE_FIELD(relationOids);
  	COPY_NODE_FIELD(invalItems);
  	COPY_SCALAR_FIELD(nParamExec);
! 	COPY_SCALAR_FIELD(queryId);
  	return newnode;
  }
--- 91,97 ----
  	COPY_NODE_FIELD(relationOids);
  	COPY_NODE_FIELD(invalItems);
  	COPY_SCALAR_FIELD(nParamExec);
! 	COPY_SCALAR_FIELD(nodeKey);
  	return newnode;
  }
***************
*** 2415,2421 **** _copyQuery(const Query *from)
  	COPY_SCALAR_FIELD(commandType);
  	COPY_SCALAR_FIELD(querySource);
! 	COPY_SCALAR_FIELD(quer
Re: [HACKERS] pg_stat_statements normalization: re-review
On 23 February 2012 11:09, Peter Geoghegan wrote:
> On 23 February 2012 09:58, Daniel Farina wrote:
>> * The small changes to hashing are probably not strictly required,
>> unless collisions are known to get terrible.
>
> I imagine that collisions would be rather difficult to demonstrate at
> all with a 32-bit value.

Assuming that the hash function exhibits a perfectly uniform distribution of values (FWIW, hash_any is said to exhibit the avalanche effect), the birthday problem provides a mathematical basis for estimating the probability of some 2 queries hashing alike. Assuming a population of 1,000 queries in play at any given time (i.e. the default value of pg_stat_statements.max) and 2^32 "days of the year", that puts the probability of a collision at a vanishingly small number; I cannot even calculate it with Gnome calculator. Let's pretend that 2^32 is exactly 42 million, rather than approximately 4.2 *billion*. Even then, the probability of collision is a minuscule value of roughly 0.012. I'd have to agree that a uint32 hash is quite sufficient here.

-- 
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] pg_stat_statements normalization: re-review
On 23 February 2012 09:58, Daniel Farina wrote:
> What I'm going to do here is focus on the back-end source changes that
> are not part of the contrib. The size of the changes is much reduced,
> but their semantics are also somewhat more complex...so, here goes.
> Conclusion first:
>
> * The small changes to hashing are probably not strictly required,
>   unless collisions are known to get terrible.

I imagine that collisions would be rather difficult to demonstrate at all with a 32-bit value.

> * The addition of a field to scribble upon in the Query and Plan nodes
>   is something I'd like to understand better, as these leave me with a
>   bit of disquiet.
>
> What follows is in much more detail, and broken up by functionality
> and file grouping in the backend.
>
>  src/include/access/hash.h          |  4 ++-
>  src/backend/access/hash/hashfunc.c | 21 ++-
>
> This adjusts the hash_any procedure to support returning two possible
> precisions (64-bit and 32-bit) by tacking on a Boolean flag to make
> the precision selectable. The hash_any operator is never used
> directly, but rather via a macro, and a second macro to support 64-bit
> precision has been added. It seems a bit wonky to me to use a flag to
> select this rather than encapsulating the common logic in a procedure
> and breaking this up into three symbols, though. I think the longer
> precision is used to try to get fewer collisions without too much
> slowdown. Although it may meaningfully improve the quality of the
> hashing for pg_stat_statements (living without the extra hashing bits
> might even lead to unusable results?), per the usual semantics of
> hashing it is probably not *strictly* necessary.

It might be unnecessary. It's difficult to know where to draw the line.

> A smidgen of avoidable formatting niggles (> 80 columns) are in this
> section.
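[Editor's note] The three-symbol encapsulation the review suggests might look like the sketch below: one internal routine computing the full 64 bits, with two thin wrappers instead of a Boolean flag. The mixing function here is FNV-1a, a stand-in for exposition only, not PostgreSQL's actual hash_any:

```c
#include <stdint.h>

/* Common logic: always compute 64 bits internally (FNV-1a stand-in). */
static uint64_t hash_bytes_internal(const unsigned char *k, int keylen)
{
    uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */

    for (int i = 0; i < keylen; i++)
    {
        h ^= k[i];
        h *= 1099511628211ULL;              /* FNV-1a prime */
    }
    return h;
}

/* Two public symbols, no precision-selecting flag. */
uint32_t hash_any32(const unsigned char *k, int keylen)
{
    return (uint32_t) hash_bytes_internal(k, keylen);
}

uint64_t hash_any64(const unsigned char *k, int keylen)
{
    return hash_bytes_internal(k, keylen);
}
```

Callers pick the precision by choosing a symbol, and the 32-bit result is simply a truncation of the 64-bit one.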
>  src/backend/nodes/copyfuncs.c        |  5 +
>  src/backend/nodes/equalfuncs.c       |  4 +++
>  src/backend/nodes/outfuncs.c         | 10 +
>  src/backend/optimizer/plan/planner.c |  1 +
>  src/backend/nodes/readfuncs.c        | 12 +++
>  src/include/nodes/parsenodes.h       |  3 ++
>  src/include/nodes/plannodes.h        |  2 +

I'll look into those.

> These have all been changed in the usual manner to support one new
> field, the queryId, on the toplevel Plan and Query nodes. The queryId
> is a 64-bit field copied from queries to plans to tie together plans
> "to be used by plugins" (quoth the source) as they flow through the
> system. pg_stat_statements fills them with the 64-bit hash of the
> query text -- a reasonable functionality to possess, I think, but the
> abstraction seems iffy: plugins cannot compose well if they all need
> to use the queryId for different reasons. Somehow I get the feeling
> that this field can be avoided or its use case can be satisfied in a
> more satisfying way.

Maybe the answer here is to have a simple mechanism for modules to claim the exclusive right to be able to set that flag?

> Mysteriously, in the Query representation the symbol uses an
> underscored notation (query_id) and in the planner it uses a camelcase
> one, queryId. Overall, the other fields in those structs all use
> camelcase, so I'd recommend normalizing it.

Fair enough.

There is one other piece of infrastructure that I think is going to need to go into core that was not in the most recent revision. I believe that it can be justified independently of this work. Basically, there are some very specific scenarios in which the location of Const nodes is incorrect, because the core system makes that location the same as that of another coercion-related node.
Now, this actually doesn't matter to the positioning of the caret in most error messages, as they usually ought to be the same anyway, as in this scenario:

select 'foo'::integer;

Here, both the Const and coercion node's location start from the SCONST token - no problem there. However, consider this query:

select cast('foo' as integer);

and this query:

select integer 'foo';

Now, the location of the Const and Coercion nodes is *still* the same, but starting from the location of "cast" in the first example and "integer" in the second. I believe that this is a bug. They should have different locations here, so that I can always rely on the location of a Const having a single token that is one of only a handful of possible CONST tokens, as I currently can in all other scenarios.

Of course, this didn't particularly matter before now, as the Const location field only dictated caret position in messages generated from outside the parser. Any error message should be blaming the Coercion node if there's a problem with coercion, so there's no reason why this needs to break any error context messages. If a message wants to blame a Const on something, surely it should have a caret placed in the right location - the location of the const token. If it wants to blame the Coerc
[HACKERS] pg_stat_statements normalization: re-review
Hello again,

I'm reviewing the revised version of pg_stat_statements again. In particular, this new version has done away with the mechanical but invasive change to the lexing that preserved *both* the position and length of constant values. (See http://archives.postgresql.org/message-id/CAEYLb_WGeFCT7MfJ8FXf-CR6BSE6Lbn+O1VX3+OGrc4Bscn4=a...@mail.gmail.com)

I did the review front-matter again (check against a recent version, make sure it does what it says it'll do, et al.) and did trigger an assertion failure that seems to be an oversight, already reported to Peter. I found that oversight by running the SQLAlchemy Python query-generation toolkit's tests, which are voluminous. The functionality is seemingly mostly unchanged.

What I'm going to do here is focus on the back-end source changes that are not part of the contrib. The size of the changes is much reduced, but their semantics are also somewhat more complex...so, here goes. Conclusion first:

* The small changes to hashing are probably not strictly required,
  unless collisions are known to get terrible.

* The hook to analyze is pretty straightforward and seems like the
  other hooks.

* The addition of a field to scribble upon in the Query and Plan nodes
  is something I'd like to understand better, as these leave me with a
  bit of disquiet.

What follows is in much more detail, and broken up by functionality and file grouping in the backend.

 src/include/access/hash.h          |  4 ++-
 src/backend/access/hash/hashfunc.c | 21 ++-

This adjusts the hash_any procedure to support returning two possible precisions (64-bit and 32-bit) by tacking on a Boolean flag to make the precision selectable. The hash_any operator is never used directly, but rather via a macro, and a second macro to support 64-bit precision has been added. It seems a bit wonky to me to use a flag to select this rather than encapsulating the common logic in a procedure and breaking this up into three symbols, though.
I think the longer precision is used to try to get fewer collisions without too much slowdown. Although it may meaningfully improve the quality of the hashing for pg_stat_statements (living without the extra hashing bits might even lead to unusable results?), per the usual semantics of hashing it is probably not *strictly* necessary. A smidgen of avoidable formatting niggles (> 80 columns) are in this section.

 src/backend/nodes/copyfuncs.c        |  5 +
 src/backend/nodes/equalfuncs.c       |  4 +++
 src/backend/nodes/outfuncs.c         | 10 +
 src/backend/optimizer/plan/planner.c |  1 +
 src/backend/nodes/readfuncs.c        | 12 +++
 src/include/nodes/parsenodes.h       |  3 ++
 src/include/nodes/plannodes.h        |  2 +

These have all been changed in the usual manner to support one new field, the queryId, on the toplevel Plan and Query nodes. The queryId is a 64-bit field copied from queries to plans to tie together plans "to be used by plugins" (quoth the source) as they flow through the system. pg_stat_statements fills them with the 64-bit hash of the query text -- a reasonable functionality to possess, I think, but the abstraction seems iffy: plugins cannot compose well if they all need to use the queryId for different reasons. Somehow I get the feeling that this field can be avoided or its use case can be satisfied in a more satisfying way.

Mysteriously, in the Query representation the symbol uses an underscored notation (query_id) and in the planner it uses a camelcase one, queryId. Overall, the other fields in those structs all use camelcase, so I'd recommend normalizing it.

 src/backend/parser/analyze.c | 37 ++---
 src/include/parser/analyze.h | 12 +++

These changes implement hooks for the once-non-hookable symbols parse_analyze_varparams and parse_analyze, in seemingly the same way other hooks are implemented in the system. These are neatly symmetrical with the planner hook.
I only ask if there is a way to have one hook and not two, but I suppose that would be a similar question to "why are there two symbols to enter into parsing, and not one".

-- fdr