Re: [HACKERS] An Idea for planner hints

2006-08-26 Thread Hayes
On Aug 17, 2006, at 1:41 PM, Peter Eisentraut wrote:

 But we need to work this from the other end anyway.  We need to
 determine first, what sort of statistics the planner could make use of.
 Then we can figure out the difficulties in collecting them.
 

There are still some things that the architect or DBA will know that 
the system could never deduce.

Any suggestions for what these statistics are?   Cross-column 
statistics have been mentioned previously.  Another that's come up 
before is how clustered a table is around its keys (think web log, 
where all the session records are going to be in the same page (or 
small set of pages)).  FK selectivity has been mentioned in this 
thread.

Anything else to throw into the ring?

-arturo

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] An Idea for planner hints

2006-08-24 Thread Jim C. Nasby
On Wed, Aug 23, 2006 at 08:42:10AM -0700, Mark Dilger wrote:
 Jim C. Nasby wrote:
 On Tue, Aug 22, 2006 at 11:56:17AM -0700, Mark Dilger wrote:
 I proposed something like this quite a bit up-thread.  I was hoping we 
 could have a mode in which the system would run the second, third, 
 fourth, ... best plans rather than just the best looking one, and then 
 determine from actual runtime statistics which was best.  (The proposal 
 also included the ability to output the best plan and read that in at a 
 later time in lieu of a SQL query, but that part of it can be ignored if 
 you like.)  The posting didn't generate much response, so I'm not sure 
 what people thought of it.  The only major problem I see is getting the 
 planner to keep track of alternate plans.  I don't know the internals of 
 it very well, but I think the genetic query optimizer doesn't have a 
 concept of runner-up #1, runner-up #2, etc., which it would need to 
 have.
 
 I think the biggest issue is that you'd have to account for varying load
 on the box. If we assume that the database is the only thing running on
 the box, we might be able to do that by looking at things like how much
 IO traffic we generated (though of course OS caching will screw with
 that).
 
 Actually, that's another issue... any plans run after the first one will
 show up as being artificially fast, since there will be a lot of extra
 cached data.
 
 Yes, caching issues prevent you from using wall-clock time.  We could 
 instrument the code to count the number of rows vs. the number predicted 
 for each internal join, from which new cost estimates could be generated.
 
But if you're only looking at the number of rows, I suspect there's no
need to actually run the other plans; you can just look at how many rows
you got in the plan you used. Worst-case, you may have to figure out the
correlation stats for the result-set, which could probably be done on
the fly without too much impact.

 Perhaps you can check my reasoning for me:  I'm imagining a query which 
 computes AxBxCxD, where A, B, C, and D are actual tables.  I'm also 
 imagining that the planner always chooses AxB first, then joins on C, then 
 joins on D.  (It does so because the single-table statistics suggest this 
 as the best course of action.) It might be that AxD is a really small 
  metatable, much smaller than would be estimated from the statistics for A 
 independent of the statistics for D, but AxB is pretty much what you would 
 expect given the independent statistics for A and B.  So we need some way 
 for the system to stumble upon that fact.  If we only ever calculate 
 cross-join statistics for plans that the system chooses, we will only 
 discover that AxB is about the size we expected it to be.  So, if the 
 actual size of AxB is nearly equal to the estimated size of AxB, the system 
 will continue to choose the same plan in future queries, totally ignorant 
 of the advantages of doing AxD first.

Is there actually evidence that there's a lot of problems with bad join
orders? ISTM that's one of the areas where the planner actually does a
pretty good job.

 That last paragraph is my reasoning for suggesting that the system have a 
 mode in which it runs the runner-up #1, runner-up #2, etc sorts of 
 plans.  Such a mode could force it down alternate paths where it might pick 
 up interesting statistics that it wouldn't find otherwise.
 
 This idea could be changed somewhat.  Rather than running the other plans, 
 we could just extract from them which alternate joins they include, and 
 consider also calculating those join statistics.
 
 mark
 

-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] An Idea for planner hints

2006-08-24 Thread Mark Dilger

Is there actually evidence that there's a lot of problems with bad join
orders? ISTM that's one of the areas where the planner actually does a
pretty good job.


I put together a quick demonstration using AxBxC where AxB is empty but AxC is 
not.  Sure enough, postgres chooses AxC first, then joins B, which results in extra 
work.  This is a contrived example, but it would be a pain to try to post a real 
example with all the data and analysis.  I think it is fair to say that if it is 
making the wrong choice in this example, it is sometimes making the wrong choice 
in practice.  Cross-table statistics are supposed to help avoid this, right? 
But I think it would only help if the system had the statistics for AxB.  I 
think I have been hearing other people propose systems which would track which 
joins the system is actually using and then recommend to the user that those 
statistics be gathered.  I think we need to go beyond that to recommending 
statistics (or automatically gathering statistics, or whatever) for joins that 
*might* be used given different plans than the one currently chosen by the planner.


test=# create table A (a integer);
CREATE TABLE
Time: 60.151 ms
test=# create table B (b integer);
CREATE TABLE
Time: 3.270 ms
test=# create table C (c integer);
CREATE TABLE
Time: 2.421 ms
test=# insert into A (a) (select * from generate_series(1,10000,2));
INSERT 0 5000
Time: 67.829 ms
test=# insert into B (b) (select * from generate_series(2,10000,2));
INSERT 0 5000
Time: 60.031 ms
test=# insert into C (c) (select * from generate_series(1,1000,2));
INSERT 0 500
Time: 6.303 ms
test=# analyze A;
ANALYZE
Time: 69.669 ms
test=# analyze B;
ANALYZE
Time: 24.548 ms
test=# analyze C;
ANALYZE
Time: 2.936 ms
test=# explain select * from A, B, C where A.a = B.b and A.a = C.c;
                                 QUERY PLAN
---------------------------------------------------------------------------
 Hash Join  (cost=113.50..216.50 rows=500 width=12)
   Hash Cond: ("outer".b = "inner".a)
   ->  Seq Scan on b  (cost=0.00..73.00 rows=5000 width=4)
   ->  Hash  (cost=112.25..112.25 rows=500 width=8)
         ->  Hash Join  (cost=9.25..112.25 rows=500 width=8)
               Hash Cond: ("outer".a = "inner".c)
               ->  Seq Scan on a  (cost=0.00..73.00 rows=5000 width=4)
               ->  Hash  (cost=8.00..8.00 rows=500 width=4)
                     ->  Seq Scan on c  (cost=0.00..8.00 rows=500 width=4)
(9 rows)

Time: 4.807 ms
test=# select * from A, B, C where A.a = B.b and A.a = C.c;
 a | b | c
---+---+---
(0 rows)

Time: 34.561 ms
test=# select count(*) from A, C where A.a = C.c;
 count
---
   500
(1 row)

Time: 8.450 ms
test=# select count(*) from A, B where A.a = B.b;
 count
---
 0
(1 row)

Time: 33.757 ms
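
For anyone wanting to push the demo one step further: on 8.x you can coerce the 
planner into the AxB-first order by writing the joins explicitly and lowering 
join_collapse_limit, then compare runtimes.  A sketch along those lines (not run 
here, so no timings claimed):

```sql
-- Sketch: force the A-join-B-first order for comparison.  With
-- join_collapse_limit = 1 the planner keeps the syntactic join order
-- of explicit JOIN clauses, so the empty AxB join runs first.
SET join_collapse_limit = 1;
EXPLAIN ANALYZE
SELECT * FROM A JOIN B ON A.a = B.b JOIN C ON A.a = C.c;
```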




Re: [HACKERS] An Idea for planner hints

2006-08-23 Thread Jim C. Nasby
On Tue, Aug 22, 2006 at 11:56:17AM -0700, Mark Dilger wrote:
 I proposed something like this quite a bit up-thread.  I was hoping we 
 could have a mode in which the system would run the second, third, fourth, 
 ... best plans rather than just the best looking one, and then determine 
 from actual runtime statistics which was best.  (The proposal also included 
 the ability to output the best plan and read that in at a later time in 
 lieu of a SQL query, but that part of it can be ignored if you like.)  The 
 posting didn't generate much response, so I'm not sure what people thought 
 of it.  The only major problem I see is getting the planner to keep track 
 of alternate plans.  I don't know the internals of it very well, but I 
 think the genetic query optimizer doesn't have a concept of runner-up #1, 
 runner-up #2, etc., which it would need to have.

I think the biggest issue is that you'd have to account for varying load
on the box. If we assume that the database is the only thing running on
the box, we might be able to do that by looking at things like how much
IO traffic we generated (though of course OS caching will screw with
that).

Actually, that's another issue... any plans run after the first one will
show up as being artificially fast, since there will be a lot of extra
cached data.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] An Idea for planner hints

2006-08-23 Thread Mark Dilger

Jim C. Nasby wrote:

On Tue, Aug 22, 2006 at 11:56:17AM -0700, Mark Dilger wrote:
I proposed something like this quite a bit up-thread.  I was hoping we 
could have a mode in which the system would run the second, third, fourth, 
... best plans rather than just the best looking one, and then determine 
from actual runtime statistics which was best.  (The proposal also included 
the ability to output the best plan and read that in at a later time in 
lieu of a SQL query, but that part of it can be ignored if you like.)  The 
posting didn't generate much response, so I'm not sure what people thought 
of it.  The only major problem I see is getting the planner to keep track 
of alternate plans.  I don't know the internals of it very well, but I 
think the genetic query optimizer doesn't have a concept of runner-up #1, 
runner-up #2, etc., which it would need to have.


I think the biggest issue is that you'd have to account for varying load
on the box. If we assume that the database is the only thing running on
the box, we might be able to do that by looking at things like how much
IO traffic we generated (though of course OS caching will screw with
that).

Actually, that's another issue... any plans run after the first one will
show up as being artificially fast, since there will be a lot of extra
cached data.


Yes, caching issues prevent you from using wall-clock time.  We could instrument 
the code to count the number of rows vs. the number predicted for each internal 
join, from which new cost estimates could be generated.


Perhaps you can check my reasoning for me:  I'm imagining a query which computes 
AxBxCxD, where A, B, C, and D are actual tables.  I'm also imagining that the 
planner always chooses AxB first, then joins on C, then joins on D.  (It does so 
because the single-table statistics suggest this as the best course of action.) 
 It might be that AxD is a really small metatable, much smaller than would be 
estimated from the statistics for A independent of the statistics for D, but AxB 
is pretty much what you would expect given the independent statistics for A and 
B.  So we need some way for the system to stumble upon that fact.  If we only 
ever calculate cross-join statistics for plans that the system chooses, we will 
only discover that AxB is about the size we expected it to be.  So, if the 
actual size of AxB is nearly equal to the estimated size of AxB, the system will 
continue to choose the same plan in future queries, totally ignorant of the 
advantages of doing AxD first.


That last paragraph is my reasoning for suggesting that the system have a mode 
in which it runs the runner-up #1, runner-up #2, etc sorts of plans.  Such a 
mode could force it down alternate paths where it might pick up interesting 
statistics that it wouldn't find otherwise.


This idea could be changed somewhat.  Rather than running the other plans, we 
could just extract from them which alternate joins they include, and consider 
also calculating those join statistics.


mark



Re: [HACKERS] An Idea for planner hints

2006-08-22 Thread Mark Dilger

Peter Eisentraut wrote:

Jim C. Nasby wrote:

Meet EXPLAIN ANALYZE.
Which does no good for apps that you don't control the code on. Even 
if you do control the code, you have to find a way to stick EXPLAIN

ANALYZE in front of every query, and figure out how to deal with
what's coming back.


It would not be hard to create an auto explain analyze mode that 
implicitly runs EXPLAIN ANALYZE along with every query and logs the 
result.  On its face, it sounds like an obviously great idea.  I just 
don't see how you would put that to actual use, unless you want to read 
server logs all day long.  Grepping for query duration and using the 
statistics views are much more manageable tuning methods.  In my view 
anyway.



Going back to the original discussion though, there's no reason this
needs to involve EXPLAIN ANALYZE. All we want to know is what columns
the planner is dealing with as a set rather than individually.


This would log a whole bunch of column groups, since every moderately 
interesting query uses a column in combination with some other column, 
but you still won't know which ones you want the planner to optimize.


To get that piece of information, you'd need to do something like 
principal component analysis over the column groups thus identified.  
Which might be a fun thing to do.  But for the moment I think it's 
better to stick to declaring the interesting pairs/groups manually.




If the system logs which cross-table join statistics it didn't have for 
cross-table joins that it actually performed, it won't log the really 
interesting stuff.


What is interesting are the plans that it didn't choose on account of guessing 
that they were too expensive, when in reality the cross-table statistics were 
such that they were not too expensive.  This case might not be the common case, 
but it is the interesting case.  We are trying to get the planner to notice 
cheap plans that don't look cheap unless you have the cross-table statistics. 
So you have a chicken-and-egg problem here unless the system attempts (or 
outputs without actually attempting) what appear to be sub-optimal plans in 
order to determine how bad they really are.


I proposed something like this quite a bit up-thread.  I was hoping we could 
have a mode in which the system would run the second, third, fourth, ... best 
plans rather than just the best looking one, and then determine from actual 
runtime statistics which was best.  (The proposal also included the ability to 
output the best plan and read that in at a later time in lieu of a SQL query, 
but that part of it can be ignored if you like.)  The posting didn't generate 
much response, so I'm not sure what people thought of it.  The only major 
problem I see is getting the planner to keep track of alternate plans.  I don't 
know the internals of it very well, but I think the genetic query optimizer 
doesn't have a concept of runner-up #1, runner-up #2, etc., which it would 
need to have.


mark



Re: [HACKERS] An Idea for planner hints

2006-08-17 Thread Peter Eisentraut
Gregory Stark wrote:
 I'm not sure it's worth throwing out the more user-friendly interface
 we have now but I think it's clear that a table is the obvious
 machine-readable format if you're already sitting in an SQL
 database... :)

Then again, a table might not be the optimal format for an inherently 
hierarchical structure.

But we're getting ahead of ourselves.  There are three parts to this:

1. determine what statistics to gather
2. gather those statistics
3. use those statistics

#1 can really be handled manually in the beginning, and you'd still have 
an exceedingly useful system if #2 and #3 are available.  Once that is 
done, we can gain experience with the system and maybe find a way to 
automate #1, but it really does not need to be the first step.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] An Idea for planner hints

2006-08-17 Thread Greg Stark

Peter Eisentraut [EMAIL PROTECTED] writes:

 Gregory Stark wrote:
  I'm not sure it's worth throwing out the more user-friendly interface
  we have now but I think it's clear that a table is the obvious
  machine-readable format if you're already sitting in an SQL
  database... :)
 
 Then again, a table might not be the optimal format for an inherently 
 hierarchical structure.

If it were up to me I would just promote ltree to a standard data type and use
that.

On an only tangentially related note, it seems like the bootstrap sequence could be
split into two steps. The table definitions and the data types, operators, and
operator classes necessary for those table definitions have to be done in some
kind of C bootstrap code as it is now. However much of the bootstrap code now
could be split off into a standard SQL script.

That would save us a ton of headaches in getting OIDs to line up across tables
and make it easier to add new data types with all their associated operators
and operator classes.  Worse still, today new access methods require defining new
operator classes for all the data types you want to support.

It's much easier to just copy paste the CREATE statements and let the SQL
engine assign all the ids and match up all the records.
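
To make that concrete, this is the sort of CREATE statement in question: an 
illustrative btree operator class, written as plain SQL so the system resolves 
every id and cross-reference itself instead of them being lined up by hand in 
the bootstrap data (names here are for illustration):

```sql
-- Illustrative only: a btree operator class as an ordinary SQL statement.
-- The engine assigns all OIDs and matches up the catalog records.
CREATE OPERATOR CLASS int4_example_ops
    FOR TYPE int4 USING btree AS
        OPERATOR 1 < ,
        OPERATOR 2 <= ,
        OPERATOR 3 = ,
        OPERATOR 4 >= ,
        OPERATOR 5 > ,
        FUNCTION 1 btint4cmp(int4, int4);
```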

-- 
greg




Re: [HACKERS] An Idea for planner hints

2006-08-17 Thread Arturo Pérez


On Aug 15, 2006, at 10:40 AM, Jim C. Nasby wrote:


On Mon, Aug 14, 2006 at 11:41:29PM +0300, Hannu Krosing wrote:

Ühel kenal päeval, E, 2006-08-14 kell 18:21, kirjutas Peter Eisentraut:

Perez wrote:

I thought, from watching the list for a while, that the planner
statistics needed were known but that how to gather the statistics
was not?

I think over the course of the discussion we have figured out that we
would like to have cross-column correlation statistics.  The precise
mathematical incarnation hasn't been determined yet, as far as I can
see.  Collecting the statistics thereafter isn't that hard, but there
needs to be a way to not collect an exponential volume of statistics on
all column combinations.

I understood that the proposal was to collect only the stats where
needed (determined by user/dba) and use some rule-of-thumb values if no
collected stats were available.

Yeah, unless someone comes up with some kind of 'magic', I think trying
to handle every cross-column possibility is a non-starter. IIRC, that
argument is what's stalled cross-column stats every time in the past. It
makes a lot more sense to allow defining what combinations of columns we
need stats for.

After that's done, it'd be easy to then write a script that will tell
the database to collect stats on all multi-column indexes, RI, etc. Down
the road, the planner could even be made to log (in a machine-readable
format) every time it needs cross-column stats, and that data could be
used to add stats that are needed.


If we're talking about my random neuron firing, then I think the
responses have gone off a bit.  My thought was to just tell the planner
the statistics that are of interest.

An example of what I'm thinking would probably be helpful.  Let's say
that the DBA knows, through whatever means at his/her disposal (heck!
the magic you mention) that column a and column b have some sort of
correlation that the planner can't determine on its own but could use
if it had it.  The DBA therefore pokes the right information into the
planner's statistical tables (or, perhaps, a more human-manageable one
that gets compiled into the planner's stats).

For this to work we'd have to
1.  Define the types of statistics that the planner could use in its
planning that it cannot currently (or ever) collect itself.
Cross-column correlations, join selectivity, anything that would be
useful to the planner.
2.  Create a table or other data structure to contain this planner
information.  Modify the planner to use this information.
3.  Document what these stats are and the influence they have, in a
format suitable for use by DBAs, along with how to add the stats to the
above table.  Mere mortals can tinker with this feature at their own
peril :-)
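
As a strawman for step 2, the table might look something like this (every name
below is made up; nothing like it exists in the catalogs today):

```sql
-- Strawman only: a DBA-maintained cross-column statistics table.
-- All names are hypothetical.
CREATE TABLE planner_cross_stats (
    relid1   oid,     -- first table
    attnum1  int2,    -- column within the first table
    relid2   oid,     -- second table
    attnum2  int2,    -- column within the second table
    joinsel  float4   -- DBA-supplied join selectivity estimate
);
```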

Now, when a DBA has information that could steer the planner in the
right direction, he/she has a mechanism to do so that does not involve
hinting the specific query.  My hope would be that this information
wouldn't go stale as fast as a query hint would.  Furthermore, the DBA
can improve an application's performance without having to go into
every query it executes.

The planner would look in that table and say "Ah! there's information
in here that says that when a is joined to b it's going to eliminate
90% of my I/O."

Seems to me that such a feature would be a cool knob and address
most/all of the need for query hints.

One other possibility for the above information would be just to have a
place for the planner to save information for itself when it finds a
plan to be either horribly over-optimistic or pessimistic.

Hope this blathering makes some kind of sense...
-arturo





Re: [HACKERS] An Idea for planner hints

2006-08-17 Thread Peter Eisentraut
Arturo Pérez wrote:
 The DBA therefore pokes the
 right information into
 the planner's statistical tables (or, perhaps, a more human-
 manageable one that gets
 compiled into the planner's stats).

I think we're perfectly capable of producing a system that can collect 
the statistics.  We just don't want to collect every possible 
statistic, but just those that someone declared to be interesting 
beforehand.  There need not be any manual poking.  Just manual 
declaring.

But we need to work this from the other end anyway.  We need to 
determine first, what sort of statistics the planner could make use of.  
Then we can figure out the difficulties in collecting them.

A certain other hacker would send us all to the university library now.  
Maybe we should listen.  I for one am going to do laundry now. :-)

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] An Idea for planner hints

2006-08-17 Thread Florian G. Pflug

Peter Eisentraut wrote:

Arturo Pérez wrote:

The DBA therefore pokes the right information into the planner's
statistical tables (or, perhaps, a more human-manageable one that gets
compiled into the planner's stats).


I think we're perfectly capable of producing a system that can collect 
the statistics.  We just don't want to collect every possible 
statistic, but just those that someone declared to be interesting 
beforehand.  There need not be any manual poking.  Just manual 
declaring.


But we need to work this from the other end anyway.  We need to 
determine first, what sort of statistics the planner could make use of.  
Then we can figure out the difficulties in collecting them.


I've been told that oracle has an interesting feature regarding
materialized views that gave me an idea how to declare what statistics
to gather. It seems as if oracle is able to figure out that it can
use a certain materialized view to speed up execution of a certain
query, even if the query doesn't use that view explicitly. So, e.g.
if you do

1) create materialized view v as select * from t1 join t2 on t1.t2_id = 
t2.id.

2) select * from t1 join t2 on t1.t2_id = t2.id join t3 on t3.t2_id = t2.id

then oracle seems to be able to use the already-joined tuples in v, and
only needs to join t3 to those, instead of having to rejoin t1 and t2.

That gave me the idea that something similar could be used to declare
what statistics to gather, in a very general way. Imagine that I could
do.

1) create statistics for select * from t1 join t2 on t1.t2_id = t2.id 
and t1.flag = TRUE.
2) select * from t1 join t2 on t1.t2_id = t2.id and t1.flag = TRUE 
join t3 on ...  join t4 on ...


The command 1) would basically gather the same statistics for the result
of the query as it would gather for a normal table with the same signature.
When planning 2), postgres would recognize that it can use those
statistics (similar to how oracle recognizes that it can use a certain
materialized view), and would thus know the selectivity of that
particular join very accurately.

I think there might even be a way to do (1) without actually executing 
the (whole) query. If every access-method in the query plan could be

told to deliver only say 10% of the rows it would deliver normally,
but the rest of the plan was executed normally, then the result should
have the same statistical properties as the complete result would have.

greetings, Florian Pflug




Re: [HACKERS] An Idea for planner hints

2006-08-16 Thread Gregory Stark
Jim C. Nasby [EMAIL PROTECTED] writes:

 On Tue, Aug 15, 2006 at 07:55:28PM +0200, Peter Eisentraut wrote:

  It would not be hard to create an auto explain analyze mode that 
  implicitly runs EXPLAIN ANALYZE along with every query and logs the 
  result.  On its face, it sounds like an obviously great idea.  I just 
  don't see how you would put that to actual use, unless you want to read 
  server logs all day long.  Grepping for query duration and using the 
  statistics views are much more manageable tuning methods.  In my view 
  anyway.
  
 Well, the output would really need to go into some machine-readable
 format, since you certainly aren't going to read it. That would also
 make it trivial to identify plans that diverged greatly from reality.

Oracle's EXPLAIN had a peculiar design feature that always seemed bizarre from
a user's point of view. But here's where it begins to become clear what they
were thinking.

It stuffs the EXPLAIN output into a table. It means you can then use SQL to
format the data for display, to generate aggregate reports of plans, or to
search for plans or plan nodes that meet certain criteria. They don't even
have to be plans generated by your session. You can have an application run
explain on its queries and then go and peek at the plans from a separate
session. And it doesn't interfere with the query outputting its normal output.
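
Roughly, the Oracle workflow looks like this (sketched from memory against the
standard PLAN_TABLE; table names in the query are invented, and column details
may vary by version):

```sql
-- Oracle-style: the plan lands in a table instead of on the console.
EXPLAIN PLAN FOR
    SELECT * FROM orders o JOIN customers c ON o.cust_id = c.id;

-- Any session can then slice the stored plans with ordinary SQL.
SELECT id, operation, options, object_name, cost
  FROM plan_table
 ORDER BY id;
```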

I'm not sure it's worth throwing out the more user-friendly interface we have
now but I think it's clear that a table is the obvious machine-readable
format if you're already sitting in an SQL database... :)

Also, incidentally, you guys are still thinking of applications that don't use
prepared queries and parameters extensively. If they do, they won't have reams
of plans: there will be just one plan per query per session, not one per
execution.


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] An Idea for planner hints

2006-08-16 Thread Jim C. Nasby
On Wed, Aug 16, 2006 at 06:48:09PM -0400, Gregory Stark wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
 
  On Tue, Aug 15, 2006 at 07:55:28PM +0200, Peter Eisentraut wrote:
 
   It would not be hard to create an auto explain analyze mode that 
   implicitly runs EXPLAIN ANALYZE along with every query and logs the 
   result.  On its face, it sounds like an obviously great idea.  I just 
   don't see how you would put that to actual use, unless you want to read 
   server logs all day long.  Grepping for query duration and using the 
   statistics views are much more manageable tuning methods.  In my view 
   anyway.
   
  Well, the output would really need to go into some machine-readable
  format, since you certainly aren't going to read it. That would also
  make it trivial to identify plans that diverged greatly from reality.
 
 Oracle's EXPLAIN had a peculiar design feature that always seemed bizarre from
 a user's point of view. But here's where it begins to become clear what they
 were thinking.
 
 It stuffs the EXPLAIN output into a table. It means you can then use SQL to
 format the data for display, to generate aggregate reports of plans, or to
 search for plans or plan nodes that meet certain criteria. They don't even
 have to be plans generated by your session. You can have an application run
 explain on its queries and then go and peek at the plans from a separate
 session. And it doesn't interfere with the query outputting its normal output.
 
 I'm not sure it's worth throwing out the more user-friendly interface we have
 now but I think it's clear that a table is the obvious machine-readable
 format if you're already sitting in an SQL database... :)
 
Actually, I had another idea, though I'm not sure how useful it will
ultimately be...

There's now a program to analyze generic PostgreSQL logs, someone else
just posted that they're working on an analyzer for VACUUM, and there's
a desire for machine-readable EXPLAIN output. What about providing a
secondary logging mechanism that produces machine-readable output for
different operations? The three I just mentioned are obvious choices,
but there could be more.

 Also, incidentally, you guys are still thinking of applications that don't use
 prepared queries and parameters extensively. If they do, they won't have reams
 of plans: there will be just one plan per query per session, not one per
 execution.

That behavior could presumably be changed if we added the ability to
analyze every statement a particular session was running.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] An Idea for planner hints

2006-08-15 Thread Jim C. Nasby
On Mon, Aug 14, 2006 at 11:41:29PM +0300, Hannu Krosing wrote:
 Ühel kenal päeval, E, 2006-08-14 kell 18:21, kirjutas Peter Eisentraut:
  Perez wrote:
   I thought, from watching the list for a while, that the planner
   statistics needed were known but that how to gather the statistics
   was not?
  
  I think over the course of the discussion we have figured out that we 
  would like to have cross-column correlation statistics.  The precise 
  mathematical incarnation hasn't been determined yet, as far as I can 
  see.  Collecting the statistics thereafter isn't that hard, but there 
  needs to be a way to not collect an exponential volume of statistics on 
  all column combinations.
 
 I understood that the proposal was to collect only the stats where
 needed (determined by user/dba) and use some rule-of-thumb values if no
 collected stats were available.

Yeah, unless someone comes up with some kind of 'magic', I think trying
to handle every cross-column possibility is a non-starter. IIRC, that
argument is what's stalled cross-column stats every time in the past. It
makes a lot more sense to allow defining what combinations of columns we
need stats for.

After that's done, it'd be easy to then write a script that will tell
the database to collect stats on all multi-column indexes, RI, etc. Down
the road, the planner could even be made to log (in a machine-readable
format) every time it needs cross-column stats, and that data could be
used to add stats that are needed.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] An Idea for planner hints

2006-08-15 Thread AgentM


On Aug 15, 2006, at 10:40 , Jim C. Nasby wrote:


Yeah, unless someone comes up with some kind of 'magic', I think trying
to handle every cross-column possibility is a non-starter. IIRC, that
argument is what's stalled cross-column stats every time in the past. It
makes a lot more sense to allow defining what combinations of columns we
need stats for.

After that's done, it'd be easy to then write a script that will tell
the database to collect stats on all multi-column indexes, RI, etc. Down
the road, the planner could even be made to log (in a machine-readable
format) every time it needs cross-column stats, and that data could be
used to add stats that are needed.


I've always found it odd that the database didn't determine which
statistics are the most interesting from the queries themselves. At
the very least, the database could make suggestions: "It looks like
this prepared query, which is used often, could benefit from an index
on x(a,b,c)." That would be better than me guessing.


-M

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] An Idea for planner hints

2006-08-15 Thread Peter Eisentraut
AgentM wrote:
 I've always found it odd that the database didn't determine which
 statistics are the most interesting from the queries themselves.

The overhead of doing that on the fly is probably prohibitive.  More 
explicit profiling support could be helpful, but that would seem a lot 
more complicated than, say, a compiler profiling tool that merely has 
to sort out the branch predictions.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] An Idea for planner hints

2006-08-15 Thread AgentM


On Aug 15, 2006, at 12:26 , Peter Eisentraut wrote:


AgentM wrote:

I've always found it odd that the database didn't determine which
statistics are the most interesting from the queries themselves.


The overhead of doing that on the fly is probably prohibitive.  More
explicit profiling support could be helpful, but that would seem a lot
more complicated than, say, a compiler profiling tool that merely has
to sort out the branch predictions.


Couldn't the session be explicitly transferred into a special  
analysis mode? Explain analyze could run on every query implicitly  
and point out time and row count discrepancies as HINTs. Multi-column  
joins, for example, could be pointed out and display whether or not  
there are related indexes.


Then, I imagine, I would regularly run all my app's prepared queries  
through this analysis mode to see what I could improve. Who knows the  
database better than itself?


-M

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] An Idea for planner hints

2006-08-15 Thread Peter Eisentraut
AgentM wrote:
 Couldn't the session be explicitly transferred into a special
 analysis mode? Explain analyze could run on every query implicitly
 and point out time and row count discrepancies as HINTs. Multi-column
 joins, for example, could be pointed out and display whether or not
 there are related indexes.

Meet EXPLAIN ANALYZE.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] An Idea for planner hints

2006-08-15 Thread Jim C. Nasby
On Tue, Aug 15, 2006 at 07:00:49PM +0200, Peter Eisentraut wrote:
 AgentM wrote:
  Couldn't the session be explicitly transferred into a special
  analysis mode? Explain analyze could run on every query implicitly
  and point out time and row count discrepancies as HINTs. Multi-column
  joins, for example, could be pointed out and display whether or not
  there are related indexes.
 
 Meet EXPLAIN ANALYZE.

Which does no good for apps that you don't control the code on. Even if
you do control the code, you have to find a way to stick EXPLAIN ANALYZE
in front of every query, and figure out how to deal with what's coming
back. There are definitely use cases where EXPLAIN ANALYZE isn't a very
good tool.

Going back to the original discussion though, there's no reason this
needs to involve EXPLAIN ANALYZE. All we want to know is what columns
the planner is dealing with as a set rather than individually. Logging
that information someplace need not be anywhere near as invasive as
EXPLAIN [ANALYZE]. One possibility is spewing out table/column
names/OIDs to a logfile in a tab-delimited format that can easily be
pulled back into the database and analyzed.
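A minimal sketch of what consuming such a log could look like. Note that the
tab-delimited format and the sample lines below are invented for illustration;
no such planner log exists:

```python
import csv
import io
from collections import Counter

# Hypothetical tab-delimited planner log: one line per clause, listing
# the table and the column group the planner had to estimate together.
sample_log = (
    "orders\tcustomer_id,status\n"
    "orders\tcustomer_id,status\n"
    "sessions\tuser_id,page\n"
)

def column_group_counts(log_text):
    """Count how often each (table, column group) pair appears."""
    counts = Counter()
    for table, cols in csv.reader(io.StringIO(log_text), delimiter="\t"):
        # Normalize column order so 'a,b' and 'b,a' count as one group.
        group = tuple(sorted(cols.split(",")))
        counts[(table, group)] += 1
    return counts

counts = column_group_counts(sample_log)
# The most frequent group is the first candidate for cross-column stats.
print(counts.most_common(1))
```

The normalization step matters: without a deterministic column order, the same
logical group would be counted under several keys.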
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] An Idea for planner hints

2006-08-15 Thread Peter Eisentraut
Jim C. Nasby wrote:
  Meet EXPLAIN ANALYZE.

 Which does no good for apps that you don't control the code on. Even
 if you do control the code, you have to find a way to stick EXPLAIN
 ANALYZE in front of every query, and figure out how to deal with
 what's coming back.

It would not be hard to create an auto explain analyze mode that 
implicitly runs EXPLAIN ANALYZE along with every query and logs the 
result.  On its face, it sounds like an obviously great idea.  I just 
don't see how you would put that to actual use, unless you want to read 
server logs all day long.  Grepping for query duration and using the 
statistics views are much more manageable tuning methods.  In my view 
anyway.

 Going back to the original discussion though, there's no reason this
 needs to involve EXPLAIN ANALYZE. All we want to know is what columns
 the planner is dealing with as a set rather than individually.

This would log a whole bunch of column groups, since every moderately 
interesting query uses a column in combination with some other column, 
but you still won't know which ones you want the planner to optimize.

To get that piece of information, you'd need to do something like 
principal component analysis over the column groups thus identified.  
Which might be a fun thing to do.  But for the moment I think it's 
better to stick to declaring the interesting pairs/groups manually.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] An Idea for planner hints

2006-08-15 Thread AgentM


On Aug 15, 2006, at 13:55 , Peter Eisentraut wrote:


Jim C. Nasby wrote:

Meet EXPLAIN ANALYZE.


Which does no good for apps that you don't control the code on. Even
if you do control the code, you have to find a way to stick EXPLAIN
ANALYZE in front of every query, and figure out how to deal with
what's coming back.


It would not be hard to create an auto explain analyze mode that
implicitly runs EXPLAIN ANALYZE along with every query and logs the
result.  On its face, it sounds like an obviously great idea.  I just
don't see how you would put that to actual use, unless you want to  
read

server logs all day long.  Grepping for query duration and using the
statistics views are much more manageable tuning methods.  In my view
anyway.


Also [and this has been brought up before], explain analyze doesn't  
make any effort to highlight the actual discrepancies from the plan.  
If it could drop an arrow or an asterisk where, for example, the  
numbers are off by an order of magnitude, it would make a big  
difference.


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] An Idea for planner hints

2006-08-15 Thread Jim C. Nasby
On Tue, Aug 15, 2006 at 07:55:28PM +0200, Peter Eisentraut wrote:
 Jim C. Nasby wrote:
   Meet EXPLAIN ANALYZE.
 
  Which does no good for apps that you don't control the code on. Even 
  if you do control the code, you have to find a way to stick EXPLAIN
  ANALYZE in front of every query, and figure out how to deal with
  what's coming back.
 
 It would not be hard to create an auto explain analyze mode that 
 implicitly runs EXPLAIN ANALYZE along with every query and logs the 
 result.  On its face, it sounds like an obviously great idea.  I just 
 don't see how you would put that to actual use, unless you want to read 
 server logs all day long.  Grepping for query duration and using the 
 statistics views are much more manageable tuning methods.  In my view 
 anyway.
 
Well, the output would really need to go into some machine-readable
format, since you certainly aren't going to read it. That would also
make it trivial to identify plans that diverged greatly from reality.

  Going back to the original discussion though, there's no reason this
  needs to involve EXPLAIN ANALYZE. All we want to know is what columns
  the planner is dealing with as a set rather than individually.
 
 This would log a whole bunch of column groups, since every moderately 
 interesting query uses a column in combination with some other column, 
 but you still won't know which ones you want the planner to optimize.
 
Well, I guess there are actually two kinds of stats that are
interesting...

groups of columns that are often referred to as a group, i.e.:
WHERE a='blah' and b='bleh' and c='blech'

columns that are joined to other columns (perhaps in a group)

 To get that piece of information, you'd need to do something like 
 principal component analysis over the column groups thus identified.  
 Which might be a fun thing to do.  But for the moment I think it's 
 better to stick to declaring the interesting pairs/groups manually.

Sure, but the idea is to make it easier to identify what those pairs
might be. If the grouping info was always in a deterministic order, then
simply doing

SELECT columns, count(*) ... GROUP BY columns ORDER BY count(*) DESC
LIMIT 10;

would be very useful. And given the data, if someone wanted to do a more
complex analysis they could.
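To make that concrete, here is the aggregation run against a toy table; SQLite
is used purely for illustration, and the planner_log table and its single
col_group column are invented:

```python
import sqlite3

# Illustrative only: pretend the planner's logged column groups were
# loaded into a table named planner_log(col_group TEXT).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE planner_log (col_group TEXT)")
conn.executemany(
    "INSERT INTO planner_log VALUES (?)",
    [("a,b",), ("a,b",), ("a,b",), ("c,d",)],
)

# The query from the mail: which column groups show up most often?
rows = conn.execute(
    "SELECT col_group, count(*) AS n FROM planner_log "
    "GROUP BY col_group ORDER BY n DESC LIMIT 10"
).fetchall()
print(rows)  # the 'a,b' group dominates
```

The top rows of that result are exactly the column groups a DBA would declare
cross-column statistics on first.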
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] An Idea for planner hints

2006-08-15 Thread Christopher Kings-Lynne

  see.  Collecting the statistics thereafter isn't that hard, but there
  needs to be a way to not collect an exponential volume of statistics on
  all column combinations.


You could collect them on all FK relationships - is that enough?

Chris

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] An Idea for planner hints

2006-08-15 Thread Tom Lane
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
 see.  Collecting the statistics thereafter isn't that hard, but there
 needs to be a way to not collect an exponential volume of statistics on
 all column combinations.

 You could collect them on all FK relationships - is that enough?

As somebody pointed out upthread, collecting these stats on FK
relationships is actually not very interesting: you already know
that the referenced side of the relationship is a unique column,
and so the selectivity stats of the referencing side are enough.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] An Idea for planner hints

2006-08-14 Thread Perez
In article [EMAIL PROTECTED],
 [EMAIL PROTECTED] (Jim C. Nasby) wrote:

 On Wed, Aug 09, 2006 at 08:31:42AM -0400, Perez wrote:
  Every once in a while people talk about collecting better statistics, 
  correlating multi-column correlations etc.  But there never seems to be 
  a way to collect that data/statistics.  
  
  Would it be possible to determine the additional statistics the planner 
  needs, modify the statistics table to have them and document how to 
  insert data there?  We wouldn't have a good automated way to determine 
  the information but a properly educated DBA could tweak things until 
  they are satisfied.
  
  At worst, if this new information is unpopulated, then things would be as 
  they are now.  But if a human can insert the right information then some 
  control over the planner would be possible.
  
  Is this a viable idea?  Would this satisfy those that need to control 
  the planner immediately without code changes?
 
 Sure, it's a Simple Matter of Code.
 
 The real issue is figuring out what to do with these stats. I think all
 the estimator functions could use improvement, but no one's taken that
 on yet.

I thought, from watching the list for a while, that the planner 
statistics needed were known but that how to gather the statistics was 
not?

For example,  there is the discussion around multi-column correlation.  
I got the impression that we (you all grin) knew what to do with the 
stats but that there was no reliable way to get them.

So, the situation is that we need better stats, but we don't know how to 
collect them AND we don't know what they are either?  If we did know 
what to do then my idea and SMC would prevail?

If that's the case then it sounds to me like we should figure out the 
statistics we wish we had that the planner could work with.  Something 
for the 8.5 timeframe I guess :-)

-arturo

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] An Idea for planner hints

2006-08-14 Thread Peter Eisentraut
Perez wrote:
 I thought, from watching the list for a while, that the planner
 statistics needed were known but that how to gather the statistics
 was not?

I think over the course of the discussion we have figured out that we 
would like to have cross-column correlation statistics.  The precise 
mathematical incarnation hasn't been determined yet, as far as I can 
see.  Collecting the statistics thereafter isn't that hard, but there 
needs to be a way to not collect an exponential volume of statistics on 
all column combinations.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] An Idea for planner hints

2006-08-14 Thread Hannu Krosing
On Mon, 2006-08-14 at 18:21, Peter Eisentraut wrote:
 Perez wrote:
  I thought, from watching the list for a while, that the planner
  statistics needed were known but that how to gather the statistics
  was not?
 
 I think over the course of the discussion we have figured out that we 
 would like to have cross-column correlation statistics.  The precise 
 mathematical incarnation hasn't been determined yet, as far as I can 
 see.  Collecting the statistics thereafter isn't that hard, but there 
 needs to be a way to not collect an exponential volume of statistics on 
 all column combinations.

I understood that the proposal was to collect only the stats where
needed (determined by user/dba) and use some rule-of-thumb values if no
collected stats were available.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] An Idea for planner hints

2006-08-13 Thread Perez
In article [EMAIL PROTECTED],
 Perez [EMAIL PROTECTED] wrote:

 In article [EMAIL PROTECTED],
  [EMAIL PROTECTED] (Tom Lane) wrote:
 
  Martijn van Oosterhout kleptog@svana.org writes:
   My main problem is that selectivity is the wrong measurement. What
   users really want to be able to communicate is:
  
   1. If you join tables a and b on x, the number of resulting rows will be
   the number of rows selected from b (since b.x is a foreign key
   referencing a.x).
  
  FWIW, I believe the planner already gets that case right, because a.x
  will be unique and it should know that.  (Maybe not if the FK is across
  a multi-column key, but in principle it should get it right.)
  
  I agree though that meta-knowledge like this is important, and that
  standard SQL frequently doesn't provide any adequate way to declare it.
  
  regards, tom lane
 
 
 Every once in a while people talk about collecting better statistics, 
 correlating multi-column correlations etc.  But there never seems to be 
 a way to collect that data/statistics.  
 
 Would it be possible to determine the additional statistics the planner 
 needs, modify the statistics table to have them and document how to 
 insert data there?  We wouldn't have a good automated way to determine 
 the information but a properly educated DBA could tweak things until 
 they are satisfied.
 
 At worst, if this new information is unpopulated, then things would be as 
 they are now.  But if a human can insert the right information then some 
 control over the planner would be possible.
 
 Is this a viable idea?  Would this satisfy those that need to control 
 the planner immediately without code changes?
 
 -arturo

I didn't see any response to this idea so I thought I'd try again with a 
real email.

-arturo

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] An Idea for planner hints

2006-08-13 Thread Jim C. Nasby
On Wed, Aug 09, 2006 at 08:31:42AM -0400, Perez wrote:
 Every once in a while people talk about collecting better statistics, 
 correlating multi-column correlations etc.  But there never seems to be 
 a way to collect that data/statistics.  
 
 Would it be possible to determine the additional statistics the planner 
 needs, modify the statistics table to have them and document how to 
 insert data there?  We wouldn't have a good automated way to determine 
 the information but a properly educated DBA could tweak things until 
 they are satisfied.
 
 At worst, if this new information is unpopulated, then things would be as 
 they are now.  But if a human can insert the right information then some 
 control over the planner would be possible.
 
 Is this a viable idea?  Would this satisfy those that need to control 
 the planner immediately without code changes?

Sure, it's a Simple Matter of Code.

The real issue is figuring out what to do with these stats. I think all
the estimator functions could use improvement, but no one's taken that
on yet.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Csaba Nagy
On Tue, 2006-08-08 at 22:14, Tom Lane wrote:
 So some kind of override for statistical guesses doesn't seem completely
 silly to me.  But it needs to be declarative information that's stored
 somewhere out of view of the actual SQL queries.  IMHO anyway.

The real problem is that sometimes there's no way to get a better plan
without some code change in the planner. And given the postgres release
policy, that might be as much as a year away for a normal user... of
course it's open source, you can patch, but would I trust a patch which
is not tested by the community? So mostly I can't wait for code changes,
and then a generic tool to fix, right now, the one bad query which
brings my system down would be nice. This is why hints would be nice, to
quick-fix immediate problems. Of course they can and would be abused,
like anything else.

On the planner improvements part, would it be possible to save
statistics about join criteria between tables? I'm not sure where that
would belong, but I guess it would be possible to have a special kind of
ANALYZE which analyzes multiple tables and their correlations... this
way the user would not need to hard-code the statistics hints, but the
system could generate them.
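One statistic such a multi-table ANALYZE could store is the average join
fanout: how many rows of one table match each row of the other. A toy sketch
(the function name and the idea of storing a single scalar are assumptions for
illustration, not anything the planner does today):

```python
from collections import Counter

def join_fanout(t1_keys, t2_keys):
    """Average number of t2 rows matched per t1 row for t1.x = t2.x.

    With this stored, a planner could estimate the join size as
    len(t1) * fanout instead of deriving it from per-column stats.
    """
    if not t1_keys:
        return 0.0
    t2_counts = Counter(t2_keys)
    return sum(t2_counts[k] for k in t1_keys) / len(t1_keys)

# Toy data: every t1 row matches exactly two t2 rows.
t1 = [1, 2, 3]
t2 = [1, 1, 2, 2, 3, 3]
print(join_fanout(t1, t2))  # 2.0
```

A single scalar like this is crude (it ignores skew), but it is the kind of
cross-table number that per-table statistics cannot express at all.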

Cheers,
Csaba.


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Florian G. Pflug

Tom Lane wrote:

Martijn van Oosterhout kleptog@svana.org writes:

ISTM that the easiest way would be to introduce a sort of predicate
like so:



SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);


The one saving grace of Florian's proposal was that you could go hack
the statistics *without* changing your queries.  This throws that away
again.

I think for this to be really effective, you'd actually need both - a
query-independent way specifying selectivities, and a way to influence
the estimates for a _single_ query.

Imagine a complex, autogenerated query which looks something like this:
select 
from t1
join t2 on ...
join t3 on ...
join t4 on ...
...
...
where
  big, complicated expression derived from some user input.

This big, complicated expression looks different for every query - and
currently, postgres often vastly overestimates the selectivity of this
expression. This leads to weird join orders, and generally very bad 
performance. Of course, *I* don't know the selectivity of this 
expression myself - but experience tells me that on average it's 
something like 50%, and not 1% as postgres believes. So, in that case,

being able to write

select ... join  where pg_selectivity(expression, 0.5)
would be a big win.


The thing I object to about the I want to decorate my queries with
planner hints mindset is that it's coming at it from the wrong
direction.  You should never be thinking in terms of fix this one
query, because that just leads back into the same dead end that your
fix doesn't work tomorrow.  What you *should* be thinking about is why
did the planner get this wrong, and how do I fix the generic problem?.
If you attack it that way then your fix is much more likely to work on
the next slightly-different query.


Fixing the generic problem is surely the best _if_ there is a fix for
the generic problem at all. But if your where-conditions involve fields
from 10 different tables, then IMHO there is no way to _ever_ guarantee
that postgres will get correct selectivity estimates. But since (at 
least for me) overestimating selectivity hurts far more than 
underestimating it, forcing postgres to just assume a certain 
selectivity could help.


I'm not in any way saying that there should _only_ be selectivity
annotations inside the query - a query-independent mechanism would
be a very nice thing to have. But a query-independent mechanism
wont be sufficient in all cases IMHO.

greetings, Florian Pflug

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Martijn van Oosterhout
On Wed, Aug 09, 2006 at 12:57:39PM +0200, Florian G. Pflug wrote:
 Fixing the generic problem is surely the best _if_ there is a fix for
 the generic problem at all. But if your where-conditions involve fields
 from 10 different tables, then IMHO there is no way to _ever_ guarantee
 that postgres will get correct selectivity estimates. But since (at 
 least for me) overestimating selectivity hurts far more than 
 underestimating it, forcing postgres to just assume a certain 
 selectivity could help.

I'm not sure if the problem is totally solvable, but we can certainly
do a lot better than we do now.

ISTM that what's really missing at the moment is some kind of
post-mortem analysis that looks at the EXPLAIN ANALYZE output, pulls it
apart and say: 'look, we went wrong here'. For leaf nodes trying to
estimate the selectivity on a single table it's easy. But working out
the selectivity of join nodes is harder.

Where we really fall down right now is that we do not recognise highly
correlated columns. If we have the expression WHERE a = 1 AND b = 2 we
assume the expressions are independent and multiply the selectivities
together. Often this is the wrong thing to do.
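The arithmetic behind that mis-estimate is easy to reproduce on toy data; the
Python below is purely illustrative:

```python
# Toy table where columns a and b are perfectly correlated.
rows = [("a1", "b1")] * 50 + [("a2", "b2")] * 50

def fraction(rows, pred):
    """Selectivity of a predicate: the fraction of rows it keeps."""
    return sum(1 for r in rows if pred(r)) / len(rows)

sel_a = fraction(rows, lambda r: r[0] == "a1")        # 0.5
sel_b = fraction(rows, lambda r: r[1] == "b1")        # 0.5
independent_estimate = sel_a * sel_b                  # 0.25, the independence assumption
actual = fraction(rows, lambda r: r == ("a1", "b1"))  # 0.5, the truth

print(independent_estimate, actual)
```

With perfect correlation the multiplied estimate is off by a factor of two
here; with more columns in the conjunction the error compounds, one factor per
extra clause.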

This is also a problem for columns in different tables that get joined on.
Currently we don't do anything special there either.

Perhaps the way to go would be to allow users to declare columns often
used together and have ANALYSE collect information on correlation which
can be used later...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Tom Lane
Florian G. Pflug [EMAIL PROTECTED] writes:
 Imagine a complex, autogenerated query which looks something like this:
 select 
 from t1
 join t2 on ...
 join t3 on ...
 join t4 on ...
 ...
 ...
 where
big, complicated expression derived from some user input.

 This big, complicated expression looks different for every query - and
 currently, postgres often vastly overestimates the selectivity of this
 expression.

This is a straw man.  There is no way that your application can throw in
a chosen-at-random selectivity value for a join condition that it
doesn't understand and have that be more likely to be right than the
planner's guess.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Kaare Rasmussen
 SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);

ISTM that you introduced the Oracle silliness again, putting the hint into the 
query.

My suggestion would be to tell about it separately. Something like

CREATE HINT FOR JOIN foo, bar ON foo.a=bar.b AS some hint;

This way hints can be added and removed without ever touching the existing 
queries.

-- 

Med venlig hilsen
Kaare Rasmussen, Jasonic

Jasonic Telefon: +45 3816 2582
Nordre Fasanvej 12
2000 Frederiksberg  Email: [EMAIL PROTECTED]

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

Imagine a complex, autogenerated query which looks something like this:
select 
from t1
join t2 on ...
join t3 on ...
join t4 on ...
...
...
where
   big, complicated expression derived from some user input.



This big, complicated expression looks different for every query - and
currently, postgres often vastly overestimates the selectivity of this
expression.


This is a straw man.  There is no way that your application can throw in
a chosen-at-random selectivity value for a join condition that it
doesn't understand and have that be more likely to be right than the
planner's guess.


No, my application probably won't get it right, _but_
.) I can at least _choose_ what selectivity to use. My experience is
that a selectivity that is too small (meaning that postgres
underestimates the number of records resulting from a join or where)
is usually much worse than an overly large selectivity (meaning that
postgres expects more records than it actually finds). Forcing a
high selectivity (thus letting postgres expect a lot of records)
should therefore lead to better plans than letting postgres
underestimate the selectivity.


.) Often, my application (or I) *can* guess better than postgres. My
application, for example, executes the same set of about 100 queries
every day to build cache tables. Since I _know_ how many records the
query returned yesterday, I can use that value to get a *very*
good approximation of the selectivity. This is something my app
can do easily, while postgres would have a really hard time figuring
that out.
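The feedback loop described here is just one division; the sketch below shows
it for a join, feeding yesterday's observed row count back as a selectivity. The
function name and the sample numbers are invented for illustration:

```python
def observed_join_selectivity(result_rows, left_rows, right_rows):
    """Selectivity implied by an earlier run of the same query:
    the fraction of the cross product that survived the join.
    A hint mechanism could accept this as pg_selectivity(expr, sel)."""
    return result_rows / (left_rows * right_rows)

# Hypothetical numbers from yesterday's run of one cached query:
sel = observed_join_selectivity(result_rows=5000,
                                left_rows=10000,
                                right_rows=1000)
print(sel)  # 0.0005
```

For a daily batch of stable queries, yesterday's count is usually a far better
selectivity source than any column-level statistic.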

greetings, Florian Pflug

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Jim C. Nasby
Been suggested before... the problem is actually doing something useful
with all that data that's collected, as well as how to collect it
without greatly impacting the system.

On Tue, Aug 08, 2006 at 08:23:05AM -0700, Mark Dilger wrote:
 If this feature I'm proposing already exists, sorry for the waste of 
 bandwidth, and could someone please point me to it? :)
 
 What if there were a mode that told postgres to do an exhaustive search (or 
 if not exhaustive, then much more extensive search) of all plans (or many 
 plans), trying each plan, reporting the performance of each, and discarding 
 the query results, much like explain analyze does.  Postgres could then 
 dump the best plan in machine readable (and semi-human readable) form which 
 the planner could parse and use at some later date in lieu of a SQL query.
 
 This would allow people with reasonably static table statistics (where the 
 best plan is not likely to change) to spend upfront cycles investigating 
 the best plan and then embed that plan in their business logic.  Since the 
 stored plan is both written-by and read-by postgres, it can get quite 
 complicated without putting a burden on humans to read and write such 
 complicated things.  It would also remove the risk that the planner will 
 occasionally (due to its nondeterministic workings) choose a really bad 
 plan and stall a production system.
 
 mark
 
 Florian G. Pflug wrote:
 Hi
 
 Since the discussion about how to force a specific plan has
 come up, I though I'd post an idea I had for this a while ago.
 It's not really well thought out yet, but anyway.
 
 When the topic of optimizer hints comes up, people often suggest
 that there should be a way to force postgres to use a certain
 index, or do joins in a certain order. AFAIK, this mimics what
 oracle does - you can put comments into your query that specify
 what index to use. This approach has two major drawbacks
 .) Plans that seem good now might not seem that good a few months
 later - your data might have changed, and other execution plans
 might fit better now
 .) You have to change all your queries to make use of features
 in new postgres versions, like bitmap scans.
 
 My experience with the postgres optimizer is that it usually performs
 great - and if it doesn't, that always boiled down to two problems
 (at least for me)
 .) The query is autogenerated, and includes complex and highly
 interdependent where (or join) conditions. This leads to wrong estimates
 of where selectivity, and thus to bad plans.
 .) There are correlations between columns and/or tables that postgres
 doesn't know about (and has no chance of knowing about). Again, this
 leads to vastly wrong estimates of row counts, and to bad plans.
 
 I think that those bad estimates of the selectivity of where-clauses
 (or on-clauses for joins) is where postgres could use hints.
 
 Imagine a query like select ... from t1 join t2 on t1.t2_id = t2.id and 
 expr. Let's say that expr is true for only 1% of the rows in t2 -
 but those are exactly the rows that have matching rows in t1.
 
 Postgres would probably guess that this join will produce about 1/100
 of the rows that t1 has - but I _know_ that it will produce 100 (!) 
 times more rows.
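 [To put numbers on Florian's example: a toy back-of-the-envelope
 calculation, not planner code; the row counts are invented.]

```python
# Planner-style arithmetic for Florian's scenario (invented numbers).
rows_t1 = 10_000
sel_expr = 0.01  # fraction of t2 rows where `expr` is true

# Independence-style estimate: expr filters the joined rows as if it
# were uncorrelated with the join, keeping ~1% of t1's matches.
estimated = rows_t1 * sel_expr

# Florian's point: the 1% of t2 rows satisfying expr are exactly the
# rows that have matches in t1, so no t1 rows are filtered at all.
actual = rows_t1

print(estimated, actual, actual / estimated)  # 100.0 10000 100.0
```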
 
 Now, I'd like to hand that information to postgres. I wouldn't want
 to force any particular access method or join order, but rather I'd
 just tell it "hey, this expression has selectivity 1 in this context,
 not 0.01 as you might think".
 
 Could that work?
 
 greetings, Florian Pflug
 
 ---(end of broadcast)---
 TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly
 
 
 ---(end of broadcast)---
 TIP 5: don't forget to increase your free space map settings
 

-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Jim C. Nasby
On Wed, Aug 09, 2006 at 02:02:10PM +0200, Martijn van Oosterhout wrote:
 On Wed, Aug 09, 2006 at 12:57:39PM +0200, Florian G. Pflug wrote:
  Fixing the generic problem is surely the best _if_ there is a fix for
  the generic problem at all. But if your where-conditions involves fields
  from 10 different tables, then IMHO there is no way to _ever_ guarantee
  that postgres will get correct selectivity estimates. But since (at 
 least for me) overestimating selectivity hurts far more than 
  underestimating it, forcing postgres to just assume a certain 
  selectivity could help.
 
 I'm not sure if the problem is totally solvable, but we can certainly
 do a lot better than we do now.
 
 ISTM that what's really missing at the moment is some kind of
 post-mortem analysis that looks at the EXPLAIN ANALYZE output, pulls it
 apart and says: 'look, we went wrong here'. For leaf nodes trying to
 estimate the selectivity on a single table it is easy. But working out the
 selectivity of join nodes is harder.
 
 Where we really fall down right now is that we do not recognise highly
 correlated columns. If we have the expression WHERE a = 1 AND b = 2 we
 assume the expressions are independent and multiply the selectivities
 together. Often this is the wrong thing to do.
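 [A toy simulation of that independence assumption going wrong, on
 invented data where column b is fully determined by column a:]

```python
import random

random.seed(0)
# Synthetic table where b = a + 1, i.e. the columns are perfectly
# correlated.  Multiplying the two selectivities (what the planner
# does absent correlation stats) underestimates the true fraction.
rows = [(a, a + 1) for a in (random.randrange(10) for _ in range(10_000))]

sel_a = sum(1 for a, b in rows if a == 1) / len(rows)
sel_b = sum(1 for a, b in rows if b == 2) / len(rows)
independent_est = sel_a * sel_b  # ~0.01: the independence guess

# Reality: b == 2 holds exactly when a == 1, so the combined
# predicate is as selective as either column alone (~0.1).
actual = sum(1 for a, b in rows if a == 1 and b == 2) / len(rows)
print(independent_est, actual)
```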
 
 This is also a problem for columns in different tables that get joined on.
 Currently we don't do anything special there either.
 
 Perhaps the way to go would be to allow users to declare columns often
 used together and have ANALYSE collect information on correlation which
 can be used later...

One thing that would help tremendously would be to collect stats on
multi-column indexes. That would probably hit a good chunk of our
problem areas.

Something this is related to is providing estimates for functions (which
has been discussed in the past). There were numerous proposals there,
but the one that stuck in my head was allowing users to define functions
that would provide appropriate stats based on some input. Granted,
that's a pretty low-level construct, but it's more than we have now, and
would allow for better schemes to be built on top of it.

As for query hints, I really wish we'd just bite the bullet and add
them. Yes, they're far from perfect, yes, we should just fix the
planner, yes, it's ugly that they're per-statement, but ultimately
sometimes you have to just flat-out tell the planner to do things a
certain way. I suspect enough time has been spent debating them since
7.2 that they could have been implemented by now.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 On Wed, Aug 09, 2006 at 02:02:10PM +0200, Martijn van Oosterhout wrote:
 Perhaps the way to go would be to allow users to declare columns often
 used together and have ANALYSE collect information on correlation which
 can be used later...

 One thing that would help tremendously would be to collect stats on
 multi-column indexes. That would probably hit a good chunk of our
 problem areas.

But it would specifically fail to cover join situations.  I kinda like
Martijn's thought of allowing users to specify combinations of columns
to collect correlation stats about.

(Not sure how we'd implement that, seeing that ANALYZE currently works
on one table at a time, but it's probably doable --- and it'd fix the
fundamental problem for correlation statistics, which is how not to try
to collect stats about an exponential number of combinations ...)
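[To see why "exponential" is not hyperbole, a quick count of the
candidate column groups; pure arithmetic, not anything ANALYZE does:]

```python
from math import comb

# If we tried to keep correlation stats for every column group in an
# n-column schema: pairs alone grow quadratically, and the number of
# multi-column subsets (size >= 2) is 2^n - n - 1.
for n_cols in (10, 50, 100):
    pairs = comb(n_cols, 2)
    all_subsets = 2 ** n_cols - n_cols - 1
    print(n_cols, pairs, all_subsets)
```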

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Mark Dilger

Jim C. Nasby wrote:

Been suggested before... the problem is actually doing something useful
with all that data that's collected, as well as how to collect it
without greatly impacting the system.


Identifying the best plan by means of actually running multiple plans and timing 
them is useful.  That would be the point.


As far as "without greatly impacting the system" goes, I don't think that is a real 
concern.  The whole idea is to greatly impact the system *once*, sometime when 
the DBA doesn't mind impacting the system (like before you go live on a 
production network, or between midnight and 3 AM, or whatever), and then store 
the best plan for future use.


The planner trades off the desire to find the best plan against the need to find a 
plan quickly.  It also chooses a plan based on statistics and not based on 
actual runtimes (because there is a chicken-and-egg problem: how do you know 
which plan has the smallest runtime without running it?), so the chosen plan 
that looks best based on statistics might not actually be best.


The idea I'm proposing circumvents the whole trade-off problem by explicitly 
choosing to do something that makes the planner run really slowly and take a 
really long time.  But it doesn't do it at runtime, in the sense that you 
don't do it for each query.  You just do it once up front and be done with it. 
Of course, this is only useful for people with reasonably static queries and 
reasonably static table statistics, so that a good plan found up-front continues 
to be a good plan as it is repeatedly used.


My personal motivation is that I have tables whose statistics are quite static. 
 The data itself changes, but the statistical distribution from which the data 
is pulled is unchanging, so the table statistics end up about the same even as 
the data itself is added and deleted.  On top of that, the planner keeps 
choosing the wrong plan, which I know to be true because I can make individual 
queries run faster by structuring them in ways that the planner can't see 
through and optimize away the particular plan that I am effectively giving it. 
 But this is a PITA for me, especially since I don't always know what the best 
plan might be and have to try them all until I find the right one.  (With the 
added complexity that I can't always figure out how to trick the planner into 
choosing a specific plan, and hence can't test it.)  It would be *so much 
easier* to have an option to tell the planner to try them all.
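[For what it's worth, the closest thing available today is steering
plans by hand with the enable_* settings; those GUCs are real, but the
compare-and-pick workflow below is manual, not the proposed exhaustive
mode:]

```sql
-- Hand-driven plan comparison: disable one node type per session and
-- compare the EXPLAIN ANALYZE timings yourself.  No automatic search
-- or plan-storage mode exists.
SET enable_nestloop = off;    -- also: enable_seqscan, enable_hashjoin, ...
EXPLAIN ANALYZE SELECT ...;   -- the query under test (elided)
RESET enable_nestloop;
```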


mark


On Tue, Aug 08, 2006 at 08:23:05AM -0700, Mark Dilger wrote:
If this feature I'm proposing already exists, sorry for the waste of 
bandwidth, and could someone please point me to it? :)


What if there were a mode that told postgres to do an exhaustive search (or 
if not exhaustive, then much more extensive search) of all plans (or many 
plans), trying each plan, reporting the performance of each, and discarding 
the query results, much like explain analyze does.  Postgres could then 
dump the best plan in machine readable (and semi-human readable) form which 
the planner could parse and use at some later date in lieu of a SQL query.


This would allow people with reasonably static table statistics (where the 
best plan is not likely to change) to spend upfront cycles investigating 
the best plan and then embed that plan in their business logic.  Since the 
stored plan is both written-by and read-by postgres, it can get quite 
complicated without putting a burden on humans to read and write such 
complicated things.  It would also remove the risk that the planner will 
occasionally (due to its nondeterministic workings) choose a really bad 
plan and stall a production system.


mark




Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Joshua Reich

(Not sure how we'd implement that, seeing that ANALYZE currently works
on one table at a time, but it's probably doable --- and it'd fix the
fundamental problem for correlation statistics, which is how not to try
to collect stats about an exponential number of combinations ...)



An exponential number of combinations? Is it possible to use FK 
relationships to determine what tables & columns are likely to be used 
in future joins?


Josh




Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Martijn van Oosterhout
On Wed, Aug 09, 2006 at 03:33:21PM -0400, Joshua Reich wrote:
 (Not sure how we'd implement that, seeing that ANALYZE currently works
 on one table at a time, but it's probably doable --- and it'd fix the
 fundamental problem for correlation statistics, which is how not to try
 to collect stats about an exponential number of combinations ...)
 
 An exponential number of combinations? Is it possible to use FK 
 relationships to determine what tables & columns are likely to be used 
 in future joins?

Sure, except FKs are the degenerate case. You know the target column is
unique and the source column only contains values in the target column,
so the ratio of number of rows is good.

No, the interesting stats are in other columns, which have no
explicitly declared relationship, except perhaps that they are both
foreign keys to another table.

Once you've got the basic infrastructure, you could make a tool that
would scan the queries and tables that look for column combinations
which are often joined and have an unusual correlation (unusually high
or unusually low).

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Perez
In article [EMAIL PROTECTED],
 [EMAIL PROTECTED] (Tom Lane) wrote:

 Martijn van Oosterhout kleptog@svana.org writes:
  My main problem is that selectivity is the wrong measurement. What
  users really want to be able to communicate is:
 
  1. If you join tables a and b on x, the number of resulting rows will be
  the number of rows selected from b (since b.x is a foreign key
  referencing a.x).
 
 FWIW, I believe the planner already gets that case right, because a.x
 will be unique and it should know that.  (Maybe not if the FK is across
 a multi-column key, but in principle it should get it right.)
 
 I agree though that meta-knowledge like this is important, and that
 standard SQL frequently doesn't provide any adequate way to declare it.
 
   regards, tom lane


Every once in a while people talk about collecting better statistics, 
such as multi-column correlations.  But there never seems to be 
a way to collect that data.

Would it be possible to determine the additional statistics the planner 
needs, modify the statistics table to have them and document how to 
insert data there?  We wouldn't have a good automated way to determine 
the information but a properly educated DBA could tweak things until 
they are satisfied.

At worst, if this new information is unpopulated then things would be as 
they are now.  But if a human can insert the right information then some 
control over the planner would be possible.

Is this a viable idea?  Would this satisfy those that need to control 
the planner immediately without code changes?

-arturo

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] An Idea for planner hints

2006-08-08 Thread Mark Dilger
If this feature I'm proposing already exists, sorry for the waste of bandwidth, 
and could someone please point me to it? :)


What if there were a mode that told postgres to do an exhaustive search (or if 
not exhaustive, then much more extensive search) of all plans (or many plans), 
trying each plan, reporting the performance of each, and discarding the query 
results, much like explain analyze does.  Postgres could then dump the best 
plan in machine readable (and semi-human readable) form which the planner could 
parse and use at some later date in lieu of a SQL query.


This would allow people with reasonably static table statistics (where the best 
plan is not likely to change) to spend upfront cycles investigating the best 
plan and then embed that plan in their business logic.  Since the stored plan is 
both written-by and read-by postgres, it can get quite complicated without 
putting a burden on humans to read and write such complicated things.  It would 
also remove the risk that the planner will occasionally (due to its 
nondeterministic workings) choose a really bad plan and stall a production system.


mark

Florian G. Pflug wrote:

Hi

Since the discussion about how to force a specific plan has
come up, I though I'd post an idea I had for this a while ago.
It's not really well thought out yet, but anyway.

When the topic of optimizer hints comes up, people often suggest
that there should be a way to force postgres to use a certain
index, or do joins in a certain order. AFAIK, this mimics what
oracle does - you can put comments into your query that specify
what index to use. This approach has two major drawbacks
.) Plans that seem good now might not seem that good a few months
later - your data might have changed, and other execution plans
might fit better now
.) You have to change all your queries to make use of features
in new postgres versions, like bitmap scans.

My experience with the postgres optimizer is that it usually performs
great - and if it doesn't, that always boiled down to two problems
(at least for me)
.) The query is autogenerated, and includes complex, highly interdependent
where (or join) conditions. This leads to wrong estimates of where-clause
selectivity, and thus to bad plans.
.) There are correlations between columns and/or tables that postgres
doesn't know about (and has no chance of knowing about). Again, this
leads to vastly wrong estimates of row counts, and to bad plans.

I think that those bad estimates of the selectivity of where-clauses
(or on-clauses for joins) is where postgres could use hints.

Imagine a query like select ... from t1 join t2 on t1.t2_id = t2.id and 
expr. Let's say that expr is true for only 1% of the rows in t2 -

but those are exactly the rows that have matching rows in t1.

Postgres would probably guess that this join will produce about 1/100
of the rows that t1 has - but I _know_ that it will produce 100 (!) 
times more rows.


Now, I'd like to hand that information to postgres. I wouldn't want
to force any particular access method or join order, but rather I'd
just tell it "hey, this expression has selectivity 1 in this context,
not 0.01 as you might think".

Could that work?

greetings, Florian Pflug






Re: [HACKERS] An Idea for planner hints

2006-08-08 Thread Martijn van Oosterhout
On Tue, Aug 08, 2006 at 01:55:35PM +0200, Florian G. Pflug wrote:
 Hi
 
 Since the discussion about how to force a specific plan has
 come up, I though I'd post an idea I had for this a while ago.
 It's not really well thought out yet, but anyway.

snip

 Imagine a query like select ... from t1 join t2 on t1.t2_id = t2.id and 
 expr. Let's say that expr is true for only 1% of the rows in t2 -
 but those are exactly the rows that have matching rows in t1.
 
 Postgres would probably guess that this join will produce about 1/100
 of the rows that t1 has - but I _know_ that it will produce 100 (!) 
 times more rows.

ISTM that the easiest way would be to introduce a sort of predicate
like so:

SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);

If you teach the optimiser that pg_selectivity always has the
selectivity of the second argument, you're done. Other than that you
just need to define pg_selectivity as a no-op.
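[The no-op half of that is easy to sketch in valid SQL; the hard part,
teaching the optimiser to honour the second argument, is entirely
hypothetical and does not exist:]

```sql
-- Hypothetical sketch: a boolean wrapper that just returns its
-- condition.  The planner would have to be taught to read the second
-- argument as the clause's selectivity; nothing does that today.
CREATE FUNCTION pg_selectivity(cond boolean, sel float8)
RETURNS boolean AS 'SELECT $1' LANGUAGE sql IMMUTABLE;
```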

One thing though: when people think of selectivity, they think "number
of rows in foo that have a match in bar", whereas selectivity for
postgres means "chance this expression will be true". They are related
but not the same thing. Converting from one to the other will have its
own pitfalls...

Hope this helps,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] An Idea for planner hints

2006-08-08 Thread Richard Huxton

Martijn van Oosterhout wrote:

On Tue, Aug 08, 2006 at 01:55:35PM +0200, Florian G. Pflug wrote:

Hi

Since the discussion about how to force a specific plan has
come up, I though I'd post an idea I had for this a while ago.
It's not really well thought out yet, but anyway.


snip

Imagine a query like select ... from t1 join t2 on t1.t2_id = t2.id and 
expr. Let's say that expr is true for only 1% of the rows in t2 -

but those are exactly the rows that have matching rows in t1.

Postgres would probably guess that this join will produce about 1/100
of the rows that t1 has - but I _know_ that it will produce 100 (!) 
times more rows.


ISTM that the easiest way would be to introduce a sort of predicate
like so:

SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);


Ideally, though it needs to be defined upon the table(s) in question, 
possibly with a WHERE clause as with indexes:


CREATE STATISTIC ...defn here...
ON invoices (cli_id), clients (id)
WHERE invoices.paid = false
WITH PRIORITY 100;

(I'm thinking of the priority so you can delete any rules with a low 
priority while keeping ones you think are vital)


--
  Richard Huxton
  Archonet Ltd


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] An Idea for planner hints

2006-08-08 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 ISTM theat the easiest way would be to introduce a sort of predicate
 like so:

 SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);

The one saving grace of Florian's proposal was that you could go hack
the statistics *without* changing your queries.  This throws that away
again.

The thing I object to about the "I want to decorate my queries with
planner hints" mindset is that it's coming at it from the wrong
direction.  You should never be thinking in terms of "fix this one
query", because that just leads back into the same dead end that your
fix doesn't work tomorrow.  What you *should* be thinking about is "why
did the planner get this wrong, and how do I fix the generic problem?".
If you attack it that way then your fix is much more likely to work on
the next slightly-different query.

So some kind of override for statistical guesses doesn't seem completely
silly to me.  But it needs to be declarative information that's stored
somewhere out of view of the actual SQL queries.  IMHO anyway.

regards, tom lane



Re: [HACKERS] An Idea for planner hints

2006-08-08 Thread Martijn van Oosterhout
On Tue, Aug 08, 2006 at 04:14:45PM -0400, Tom Lane wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  ISTM theat the easiest way would be to introduce a sort of predicate
  like so:
 
  SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);
 
 The one saving grace of Florian's proposal was that you could go hack
 the statistics *without* changing your queries.  This throws that away
 again.

Well, that true. I was thinking of the easy way.

To run with something suggested in this thread, do you think it would
be more reasonable to be able to provide statistics information for
joins, which currently we have no grip on at all. Something like:

CREATE STATISTIC foo
ON table1 a, table2 b
WHERE a.x = b.x
AS SELECTIVITY  0.1;

The idea being that if the planner sees those tables being joined on
those fields, it will make its guess at the number of rows, but cap
the selectivity to less than 0.1.

My main problem is that selectivity is the wrong measurement. What
users really want to be able to communicate is:

1. If you join tables a and b on x, the number of resulting rows will be
the number of rows selected from b (since b.x is a foreign key
referencing a.x).

2. That on average there is a N:1 ratio of results between a.x and b.x.
So if you take a value of a.x and look it up in b, on average you'll
get N results. This can be a valid measurement for any two columns, not
just ones related by a foreign key.

For either of those, selectivity is the wrong variable, but I'll be
damned if I can think of a better way of expressing it...
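[A sketch of that conversion for the N:1 case, with invented numbers;
the join_rows formula is the standard planner-style estimate for an
equi-join:]

```python
# Translating an average N:1 fan-out into planner-style selectivity.
# For a JOIN b ON a.x = b.x, the planner-style estimate is
#     join_rows = rows_a * rows_b * selectivity
# so "each a.x value matches ~N rows of b" corresponds to
#     selectivity = N / rows_b
# which is exactly the pitfall: the same fan-out N maps to a different
# selectivity whenever rows_b changes.
rows_a, rows_b, fanout_n = 1_000, 50_000, 5

selectivity = fanout_n / rows_b                # ~0.0001
join_rows = rows_a * rows_b * selectivity     # ~5000 rows
print(selectivity, join_rows)
```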

The interesting case would be joins across a number of tables and be
able to tell the planner information about that, but that's an even
harder problem.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] An Idea for planner hints

2006-08-08 Thread Mark Dilger

Tom Lane wrote:

The thing I object to about the "I want to decorate my queries with
planner hints" mindset is that it's coming at it from the wrong
direction.  You should never be thinking in terms of "fix this one
query", because that just leads back into the same dead end that your
fix doesn't work tomorrow.  What you *should* be thinking about is "why
did the planner get this wrong, and how do I fix the generic problem?".
If you attack it that way then your fix is much more likely to work on
the next slightly-different query.

So some kind of override for statistical guesses doesn't seem completely
silly to me.  But it needs to be declarative information that's stored
somewhere out of view of the actual SQL queries.  IMHO anyway.

regards, tom lane


Imagine a join between two tables:

select a.x, b.y from a, b where a.x = f(b.y);

I may know that, given the data I've put into the tables, only one value in b 
will ever match one value in a.  Or perhaps I know that no more than ten rows in 
b will match a given value in a.  But how can the statistics from ANALYZE ever 
see through arbitrary math functions to know this sort of thing?


The current analyze functionality, as I understand it, can store information 
about a given table, but not about the relationships between the data in several 
tables, which is the information the planner would need to choose the right 
plan.  Do all the requests from postgres users for giving hints to the planner 
involve this type of situation, where the hints are not about a single table, 
but rather about the relationship between two or more tables and specific joins 
between them?


Do I understand correctly?  Is this a reasonable analysis?

mark



Re: [HACKERS] An Idea for planner hints

2006-08-08 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 My main problem is that selectivity is the wrong measurement. What
 users really want to be able to communicate is:

 1. If you join tables a and b on x, the number of resulting rows will be
 the number of rows selected from b (since b.x is a foreign key
 referencing a.x).

FWIW, I believe the planner already gets that case right, because a.x
will be unique and it should know that.  (Maybe not if the FK is across
a multi-column key, but in principle it should get it right.)

I agree though that meta-knowledge like this is important, and that
standard SQL frequently doesn't provide any adequate way to declare it.

regards, tom lane
