Re: [HACKERS] Optimizing prepared statements

2006-09-04 Thread Martijn van Oosterhout
On Mon, Sep 04, 2006 at 11:12:13AM +0700, Jeroen T. Vermeulen wrote:
 As I've said before, all this falls down if there is a significant cost to
 keeping one or two extra plans per prepared statement.  You mentioned
 something about tracking plans.  I don't know what that means, but it
 sounded like it might impose a runtime cost on keeping plans around. 

I think what he meant is tracking plans during the planning process.
Currently at the end of each step you weed out all the plans that arn't
the best for each path-key. To track multiple results at that stage
would be expensive.

However, just running the planner over the same query multiple times
with different estimates shouldn't be too expensive to store.

However, you're discussing the process of replanning based on changes
in variables. At the moment we really need to work on replanning
generally, it isn't done at all currently...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Optimizing prepared statements

2006-09-04 Thread Tom Lane
Jeroen T. Vermeulen [EMAIL PROTECTED] writes:
 On Sun, September 3, 2006 23:52, Tom Lane wrote:
 What exactly do you mean by optimize away a parameter?  The way you
 described the mechanism, there are no parameters that are optimized
 away, you've merely adjusted selectivity predictions using some assumed
 values.

 I'm using optimized away as shorthand for replaced with a literal
 constant in the statement definition used to generate the plan.

Ah.  I think you're confusing the spectators by using predict when you
should say match.  You're looking for previously generated plans that
have assumed parameter values matching the current query --- saying that
the plan predicts a parameter value is just a really poor choice of
wording.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Optimizing prepared statements

2006-09-04 Thread Jeroen T. Vermeulen
On Mon, September 4, 2006 23:03, Tom Lane wrote:

 Ah.  I think you're confusing the spectators by using predict when you
 should say match.  You're looking for previously generated plans that
 have assumed parameter values matching the current query --- saying that
 the plan predicts a parameter value is just a really poor choice of
 wording.

I guess so...  It's been over a decade since I last busied myself with
database optimization, so I don't know any of the jargon.  Had to use
processor architecture jargon instead.

So with that out of the way, can anyone think of some good real-life
examples of prepared statement usage that I can test against?  Suggestions
I've had include TPC, DBT2 (based on TPC-C), and pgbench, but what I'm
really looking for is traces of invocations by real applications.

If we don't have a body of application traces available, can anyone think
of a convenient, nonintrusive way I could extract them out of applications
I'm running?  If there is, I could write a filter to eliminate irrelevant
information.  The filter would ignore everything other than prepared
statement invocations.  It could randomize statement names, strip out
their definitions, hide the ordering between different statements, and
replace parameter values with meaningless numbers; so it would be
relatively safe for others to volunteer traces of their own applications. 
None of those transformations would affect my simulation results.


Jeroen



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Optimizing prepared statements

2006-09-04 Thread Josh Berkus
Jeroen,

 So with that out of the way, can anyone think of some good real-life
 examples of prepared statement usage that I can test against?  Suggestions
 I've had include TPC, DBT2 (based on TPC-C), and pgbench, but what I'm
 really looking for is traces of invocations by real applications.

I've requested that the Sun SpecJApp benchmark engineer harvest a log for us.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Optimizing prepared statements

2006-09-03 Thread Gregory Stark
Jeroen T. Vermeulen [EMAIL PROTECTED] writes:

 For now, I'll summarize some results I got from randomized input data.  I
 used very simple traces, with 11 prepared statements, each taking a
 different number of parameters (0 through 10, inclusive).  All calls were
 uniformly randomized.  I used LRU replacement of cached plans, with up to
 4 retained plans per statement.  Confidence counters ran from 0 to 3
 inclusive, with the confidence threshold lying between 1 and 2.

I'm confused, what exactly are you trying to predict? Whether each parameter
will be some cached value? Or whether the cached plan was correct?

 So once again, does anyone know of any realistic logs that I can use for
 more useful simulations?

You might look at the DBT test suite, iirc the TPCC spec it implements
intentionally mixes random queries with predictable queries.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Optimizing prepared statements

2006-09-03 Thread Jeroen T. Vermeulen
On Sun, September 3, 2006 18:41, Gregory Stark wrote:

 I'm confused, what exactly are you trying to predict? Whether each
 parameter
 will be some cached value? Or whether the cached plan was correct?

That's described in more detail in a separate thread (prepared statements
considered harmful).  In a nutshell, the algorithm detects pseudoconstant
parameters to prepared statements, and keeps a small set of different
plans optimized for recurring combinations of constants.


 So once again, does anyone know of any realistic logs that I can use for
 more useful simulations?

 You might look at the DBT test suite, iirc the TPCC spec it implements
 intentionally mixes random queries with predictable queries.

I considered the TPC benchmarks, but they're still benchmarks.  When seen
from one angle they try to look like real applications, but I'm worried
that in testing this algorithm, I may be looking at them from a very
different angle.  I'd still need actual application logs to validate them!


Jeroen



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Optimizing prepared statements

2006-09-03 Thread Gregory Stark

Jeroen T. Vermeulen [EMAIL PROTECTED] writes:

 On Sun, September 3, 2006 18:41, Gregory Stark wrote:

 I'm confused, what exactly are you trying to predict? Whether each
 parameter
 will be some cached value? Or whether the cached plan was correct?

 That's described in more detail in a separate thread (prepared statements
 considered harmful).  In a nutshell, the algorithm detects pseudoconstant
 parameters to prepared statements, and keeps a small set of different
 plans optimized for recurring combinations of constants.

I read that but apparently I misunderstood it since it would not have been
doable the way I understood it. I thought you wanted the predictor bits to
correspond to particular plans. If a plan was wrong then you marked it as a
bad guess. I don't think that can be made to work though for the reasons I
stated then.

But if you have something working clearly that's not what you're doing. So
what are you doing? Storing up a list of arguments seen for each parameter
when executed and use the predictor bits to determine if any of those
arguments are constants? Storing up a list of selectivity estimates?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Optimizing prepared statements

2006-09-03 Thread Jeroen T. Vermeulen
On Sun, September 3, 2006 21:52, Gregory Stark wrote:

 I read that but apparently I misunderstood it since it would not have been
 doable the way I understood it. I thought you wanted the predictor bits to
 correspond to particular plans. If a plan was wrong then you marked it
 as a
 bad guess. I don't think that can be made to work though for the reasons I
 stated then.

Oh, sorry--I guess I haven't been too systematic about it.  In the
algorithm's current incarnation, the confidence counters don't mark
anything as bad ideas, as such.  Instead, whenever a new invocation has a
set of parameters that are (1) correctly and (2) confidently predicted by
the counters, the predictor decides that it's probably a good idea to plan
for calls with that particular set of parameters built in as constants.

Assuming we didn't already have a plan for the particular combination of
values at hand, the algorithm generates a new one.  The new plan is
optimized on the assumption that those predicted parameters are constants.

We keep a small cache of recently-used plans, possibly including the
original plan where all parameters are truly variable.  Every plan also
remembers a list of the predicted parameter values, so on any next call,
we can check whether a particular cached plan actually applies to the
call.  If it doesn't (because of a mismatch between the incoming
parameters and the plan's assumed pseudo-constants), we just pick another
plan.

If multiple cached plans can be applied to a given call, we prefer the one
that optimizes away the most parameters.  Age is used as a tiebreaker, on
the assumption that more recent planning information is likely to be more
accurate.

The tricky part is deciding when to generate a new, more specialized plan
when we already have a matching one that may not be optimal.  Without this
step we'd never get beyond that first, completely generic plan--it applies
to every call.  The way I've approached it is this: when the predictor's
current state correctly and confidently predicts more of the invocation's
parameter values than any of the cached plans did, then it's time to
generate a new plan.

So let's say your statement has a parameter x that's always the same
value, say x==0, and another parameter y that's a randomly alternating
Boolean value, and another one z that varies randomly between lots of
values.  What order they're in doesn't matter, and they needn't be the
only parameters.  You're probably going to see four plans generated for
this example:

1. Either on the first call or during definition, you get the generic
plan.  This is the same plan you'd get in the existing backend, with
placeholders for variable x, y, and z.

2. Pretty soon, the algorithm is going to detect that x is always zero. 
It will generate a new plan, substituting the constant value 0 for x, and
hopefully getting better optimization because of it.

3. Sooner or later (probably fairly soon) you'll see a run of consecutive
calls where y happens to be true.  A new plan is generated with the
assumption that y==true.  The new plan will also still assume that x==0.

4. The same is going to happen for y==false.  Yet another specialized plan
is generated.  If we keep up to 3 plans per statement, say, then this new
plan overflows the cache.  The least recently used plan is flushed to make
room for the new one--in this case, the generic one because we haven't
seen any cases where x!=0 recently.

More complex scenarios will also happen, of course, such as if y==true
then x will usually be 0, but otherwise x will be highly variable or if
y==true then x is pseudo-constant and z is highly variable, but if
y==false then it's the other way around or if y==false then z is usually
the empty string, and so on.  The predictor as I've simulated it is too
dumb to be intimidated by the complexity.  It should work reasonably well
for all those scenarios, assuming of course that its cache is large enough
to remember the most frequent patterns.

Right now I'm using the plans' time since last use as the only eviction
criterion when the cache overflows.  There may be a better policy; the one
Achilles heel of LRU is the big loop where every cache entry is used
once, then evicted shortly before it is needed again.  (If the loop is so
big that entries are flushed long before they're needed again, well, then
it's just a big job and you stop blaming LRU :-)


 But if you have something working clearly that's not what you're doing. So
 what are you doing? Storing up a list of arguments seen for each parameter
 when executed and use the predictor bits to determine if any of those
 arguments are constants? Storing up a list of selectivity estimates?

The former.  I'm keeping a single predictor with a single more or less
last-seen value per parameter; plus a checklist of pseudoconstants for
every cached plan.  It's pretty simple, really, with no cost functions or
spanning trees or other intelligent logic--and certainly nothing original.
 Which 

Re: [HACKERS] Optimizing prepared statements

2006-09-03 Thread Jeroen T. Vermeulen
On Sun, September 3, 2006 23:28, Jeroen T. Vermeulen wrote:
 On Sun, September 3, 2006 21:52, Gregory Stark wrote:

 I read that but apparently I misunderstood it since it would not have
 been
 doable the way I understood it. I thought you wanted the predictor bits
 to
 correspond to particular plans. If a plan was wrong then you marked it
 as a
 bad guess. I don't think that can be made to work though for the reasons
 I
 stated then.

Come to think of it, I still haven't answered your question very clearly...

I keep one predictor for any given statement.  The predictor keeps two
pieces of state *per statement parameter*:

1. The predictor value (usually the last-seen value for that parameter,
though it does take a few mismatches to make it drop a value that was
repeated a lot).

2. A saturating counter measuring confidence in the current predictor
value.  As long as this counter is below the confidence threshold, the
predictor value has no other meaning than let's see if this value comes
back.

No information flows between these pieces of state.  All logic in the
predictor itself is entirely local to individual parameters.  But any new
plan is generated based on a snapshot of all predictor values that (i)
match their respective parameter values in the ongoing call, and (ii) have
their counters above the confidence threshold.  It is that combination of
parameters, for that combination of values, that is taken as a set of
pseudo-constants.

The cache can hold any number of plans optimized for the same combinations
of parameters but with different values; or for different combinations of
parameters but the same values; subsets and supersets of parameters with
either the same or different values; completely disjoint sets; anything. 
The cache neither knows nor cares.  The only thing that *won't* happen is
two plans in the same cache covering the same set of parameters with the
same set of values--because there is never any need to generate that
second plan while the first is still in cache.


Jeroen



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Optimizing prepared statements

2006-09-03 Thread Tom Lane
Jeroen T. Vermeulen [EMAIL PROTECTED] writes:
 If multiple cached plans can be applied to a given call, we prefer the one
 that optimizes away the most parameters.

What exactly do you mean by optimize away a parameter?  The way you
described the mechanism, there are no parameters that are optimized
away, you've merely adjusted selectivity predictions using some assumed
values.  Actually converting a parameter to a constant is a whole
'nother matter --- it allows constant-folding for example.  But then you
*cannot* use the plan unless there's an exact match to the assumed
value.  These two approaches provide very different tradeoffs of plan
quality vs plan specificity, so it makes me uncomfortable that you're
failing to clarify what you mean.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Optimizing prepared statements

2006-09-03 Thread Gregory Stark
Jeroen T. Vermeulen [EMAIL PROTECTED] writes:

 Oh, sorry--I guess I haven't been too systematic about it.  In the
 algorithm's current incarnation, ...

Thanks, that cleared things up enormously. I'm trying to figure how it would
react to some of the queries I've written in the past. In particular I'm
thinking of queries like

WHERE (? OR category = ?)
  AND (? OR cost  ?)
  AND (? OR description like ?||'%')

Where I then pass flags in from the application to indicate which search
constraints to apply. If it notices that most searches are for a particular
set of constraints it would be able to cache plans with the unused constraints
removed.

It would not however be able to notice that the last parameter never contains
a % and therefore can use an index scan.

I'm also wondering how this interacts with plan stability. Obviously the
direct effect is to throw out any chance of it. But in the long run they may
be two complementary sides of the same thing.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Optimizing prepared statements

2006-09-03 Thread Jeroen T. Vermeulen
On Sun, September 3, 2006 23:52, Tom Lane wrote:

 What exactly do you mean by optimize away a parameter?  The way you
 described the mechanism, there are no parameters that are optimized
 away, you've merely adjusted selectivity predictions using some assumed
 values.

I'm using optimized away as shorthand for replaced with a literal
constant in the statement definition used to generate the plan.  So if a
parameter $n is found to be, say, always equal to the string 'foo', then
we might want to generate a specialized plan as if the statement's
definition contained the literal string 'foo' wherever it really says $n. 
I've been calling that optimized for $n=='foo' or $n is optimized away
in this plan.


  Actually converting a parameter to a constant is a whole
 'nother matter --- it allows constant-folding for example.  But then you
 *cannot* use the plan unless there's an exact match to the assumed
 value.  These two approaches provide very different tradeoffs of plan
 quality vs plan specificity, so it makes me uncomfortable that you're
 failing to clarify what you mean.

Right.  When I said optimized for a certain parameter value, I meant
actual substitution the whole time.  I'm sorry if I didn't make that
clear; it seemed so basic that I must have forgotten to mention it.  I
guess the principle would also work otherwise, but it's intended to allow
constant folding.

So for any given statement, there would be a cache of frequently-needed
plans for different sets of constant substitutions.  As you point out, a
call could only use a plan if the plan's substitutions were consistent
with the call's parameter values (but within that constraint, the more
substitutions the merrier).  That's why I talked so much about comparing
and matching: that part is for correctness, not optimization.

As I've said before, all this falls down if there is a significant cost to
keeping one or two extra plans per prepared statement.  You mentioned
something about tracking plans.  I don't know what that means, but it
sounded like it might impose a runtime cost on keeping plans around. 
Could you elaborate?


Jeroen



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Optimizing prepared statements

2006-09-03 Thread Jeroen T. Vermeulen
On Mon, September 4, 2006 03:56, Gregory Stark wrote:

 Thanks, that cleared things up enormously. I'm trying to figure how it
 would
 react to some of the queries I've written in the past. In particular I'm
 thinking of queries like

 WHERE (? OR category = ?)
   AND (? OR cost  ?)
   AND (? OR description like ?||'%')

 Where I then pass flags in from the application to indicate which search
 constraints to apply. If it notices that most searches are for a
 particular
 set of constraints it would be able to cache plans with the unused
 constraints
 removed.

Right.  That's pretty much the problem as Peter originally described it, I
think.


 It would not however be able to notice that the last parameter never
 contains a % and therefore can use an index scan.

If I understand you correctly, then no.  If the algorithm sees highly
variable values for that last parameter, it will never decide to assume
that that parameter will never contain '%'--and I'm not sure how that
could be done safely.

I do see two glimmers of hope, however:

1. If that last parameter is usually some specific value, then you'll
probably end up using specialized plans with that specific value in the
parameter's place.  If that value is a string without wildcards, you can
use your index on description (assuming you have one).  If it's '%' or
null, the optimizer can decide to ignore the like clause.  It's only
when the scheme finds that it cannot predict what the parameter's value is
going to be that you get the generic, poorly-performing code.

2. Once we have a predictor, and assuming it works, it could be tied in
with the planner a bit more.  As I believe Tom said, the planner can't
afford to chase down lots of scenarios just in case they ever happen.  But
when a parameter is used only for simple matches or inserts on non-indexed
columns, for example, the planner might find in the course of its normal
activities that there's nothing useful it can do with that parameter and
deliver this information with its plan, so that the predictor can ignore
the parameter.


 I'm also wondering how this interacts with plan stability. Obviously the
 direct effect is to throw out any chance of it. But in the long run they
 may be two complementary sides of the same thing.

Well, it'll cause some plans to be re-generated, surely.  But the
impression I've gotten from the discussion so far is some that plans were
getting too old anyway.


Jeroen



---(end of broadcast)---
TIP 6: explain analyze is your friend