Re: [Fwd: Re: [PERFORM] Functionscan estimates]

2005-04-18 Thread elein
Hmmm. My brain is being jostled and I'm confusing Illustra-Postgres,
Informix-Postgres, and PostgreSQL.  Some things had functions and
some had constants, and I don't remember which products had
what combination.  But how they are in PostgreSQL, post-Hellerstein,
is probably how I am remembering them.

I can find out for sure, given a little time, by querying old contacts.
It would be best if I had a clear question to ask, though.

--elein


On Thu, Apr 14, 2005 at 02:58:09PM -0400, Alvaro Herrera wrote:
> On Thu, Apr 14, 2005 at 10:39:03AM -0700, elein wrote:
> 
> > All functions could have a cost associated with them, set by the writer of
> > the function in order for the planner to reorder function calls.
> > The stonebraker airplane level example was:
> > select ... from ... where f(id) = 3 and expensive_image_function(img)
> The idea, of course, was to weight the expensive function so it would be
> pushed to the end of the execution.
> 
> So there was only a constant cost associated with the function?  No
> estimator function, for example?
> 
> -- 
> Alvaro Herrera (<[EMAIL PROTECTED]>)
> "If you have nothing to say, maybe you need just the right tool to help you
> not say it."   (New York Times, about Microsoft PowerPoint)
> 

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [Fwd: Re: [PERFORM] Functionscan estimates]

2005-04-14 Thread Alvaro Herrera
On Thu, Apr 14, 2005 at 10:39:03AM -0700, elein wrote:

> All functions could have a cost associated with them, set by the writer of
> the function in order for the planner to reorder function calls.
> The stonebraker airplane level example was:
>   select ... from ... where f(id) = 3 and expensive_image_function(img)
> The idea, of course, was to weight the expensive function so it would be
> pushed to the end of the execution.

So there was only a constant cost associated with the function?  No
estimator function, for example?

-- 
Alvaro Herrera (<[EMAIL PROTECTED]>)
"If you have nothing to say, maybe you need just the right tool to help you
not say it."   (New York Times, about Microsoft PowerPoint)



Re: [Fwd: Re: [PERFORM] Functionscan estimates]

2005-04-14 Thread elein
I'm not subscribed to performance at this time. I reviewed the
thread, and I owe everything I know about this to Wei Hong, whose
brilliance exceeds all others :)  All misinterpretations are
mine alone.

I have not yet reviewed Hellerstein's papers posted by Neil, but I
will.

My understanding of this issue is at a very high user level.
In Illustra, SRFs were not necessarily special functions.
All functions could have a cost associated with them, set by the writer of
the function, so that the planner could reorder function calls.
The Stonebraker airplane-level example was:
select ... from ... where f(id) = 3 and expensive_image_function(img)
The idea, of course, was to weight the expensive function so it would be
pushed to the end of the execution.

The only difference I see with SRFs in Postgres is that you may want
the cost represented as two numbers: one for the cost of returning a
single row, and another for the estimated number of rows.  I think this
conclusion has already been drawn.

It seems to make sense, if the optimizer can use this information, to
include wild and/or educated guesses for the costs of the SRF.

I'm sure I haven't contributed anything new here, but perhaps
I've phrased it differently.

Copy me on replies and I'll participate as I can.

--elein

On Thu, Apr 14, 2005 at 08:36:38AM +0100, Simon Riggs wrote:
> Elein,
> 
> Any chance you could join this discussion on PERFORM ?
> 
> I understand you did time with Illustra. I thought they had solved the
> optimizer plug-in issue...how did they do it?
> 
> Best Regards, Simon Riggs
> 
> 
>  Forwarded Message 
> From: Tom Lane <[EMAIL PROTECTED]>
> To: Alvaro Herrera <[EMAIL PROTECTED]>
> Cc: Josh Berkus , Michael Fuhr <[EMAIL PROTECTED]>,
> 
> Subject: Re: [PERFORM] Functionscan estimates
> Date: Sat, 09 Apr 2005 00:00:56 -0400
> Not too many releases ago, there were several columns in pg_proc that
> were intended to support estimation of the runtime cost and number of
> result rows of set-returning functions.  I believe in fact that these
> were the remains of Joe Hellerstein's thesis on expensive-function
> evaluation, and are exactly what he was talking about here:
> http://archives.postgresql.org/pgsql-hackers/2002-06/msg00085.php
> 
> But with all due respect to Joe, I think the reason that stuff got
> trimmed is that it didn't work very well.  In most cases it's
> *hard* to write an estimator for a SRF.  Let's see you produce
> one for dblink() for instance ...
> 
>   regards, tom lane
> 



Re: [PERFORM] Functionscan estimates

2005-04-10 Thread Josh Berkus
People:

(HACKERS: Please read this entire thread at 
http://archives.postgresql.org/pgsql-performance/2005-04/msg00179.php 
Sorry for crossing this over.)

> > The larger point is that writing an estimator for an SRF is frequently a
> > task about as difficult as writing the SRF itself
>
> True, although I think this doesn't necessarily kill the idea. If
> writing an estimator for a given SRF is too difficult, the user is no
> worse off than they are today. Hopefully there would be a fairly large
> class of SRFs for which writing an estimator would be relatively simple,
> and result in improved planner behavior.

For that matter, even supplying an estimate constant would be a vast 
improvement over current functionality.  I would suggest, in fact, that we 
allow the use of either a constant number, or an estimator function, in that 
column.  Among other things, this would allow implementing the constant 
number right now and the use of an estimating function later, in case we can 
do the one but not the other for 8.1.

To be more sophisticated about the estimator function, it could take a subset 
of the main function's arguments, using $1-style positional numbering, for example:
CREATE FUNCTION some_func ( INT, TEXT, TEXT, INT, INT ) ...
ALTER FUNCTION some_func WITH ESTIMATOR some_func_est( $4, $5 )

This would make it easier to write estimators which work for several 
functions.  Estimators would be a special type of function which could take 
any params and RETURN ESTIMATOR, a type which would be implicitly castable 
from some general numeric type (like INT or FLOAT).
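Spelled out a little further, a sketch in entirely invented syntax (neither the ESTIMATOR pseudo-type nor ALTER FUNCTION ... WITH ESTIMATOR exists; limit_est, some_func, and other_func are made-up names):

```sql
-- Hypothetical: one estimator shared by any function whose row count
-- is driven by a single integer argument, wired up per function.
CREATE FUNCTION limit_est ( INT ) RETURNS ESTIMATOR
LANGUAGE sql AS $$ SELECT $1 $$;

ALTER FUNCTION some_func  WITH ESTIMATOR limit_est( $5 );
ALTER FUNCTION other_func WITH ESTIMATOR limit_est( $2 );
```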

> > I don't foresee a whole lot of use of an estimator hook designed as
> > proposed here.  In particular, if the API is such that we can only
> > use the estimator when all the function arguments are plan-time
> > constants, it's not going to be very helpful.

Actually, 95% of the time I use SRFs they are accepting constants and not row 
references.  And I use a lot of SRFs.

>
> Yes :( One approach might be to break the function's domain into pieces
> and have the estimator function calculate the estimated result set size
> for each piece. So, given a trivial function like:
>
> foo(int):
> if $1 < 10 then produce 100 rows
> else produce 1 row
>
> If the planner has encoded the distribution of input tuples to the
> function as a histogram, it could invoke the SRF's estimator function
> for the boundary values of each histogram bucket, and use that to get an
> idea of the function's likely result set size at runtime.
>
> And yes, the idea as sketched is totally unworkable :) For one thing,
> the difficulty of doing this grows rapidly as the number of arguments to
> the function increases. But perhaps there is some variant of this idea
> that might work...
>
> Another thought is that the estimator could provide information on the
> cost of evaluating the function, the number of tuples produced by the
> function, and even the distribution of those tuples.

Another possibility would be to support default values for all estimator 
functions, and to have functions called in row context passed DEFAULT, thus 
leaving it up to the estimator writer to supply median values for those 
cases.  Or to simply take the "first" values and use those. 

While none of these possibilities is ideal, they are all an improvement over 
the current "flat 1000" estimate.   As I said, even the ability to set a 
per-function flat constant estimate would be an improvement.

> BTW, why is this on -performance? It should be on -hackers.

'cause I spend more time reading -performance, and I started the thread.  
Crossed over now.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Functionscan estimates

2005-04-09 Thread Neil Conway
Tom Lane wrote:
> The larger point is that writing an estimator for an SRF is frequently a
> task about as difficult as writing the SRF itself
True, although I think this doesn't necessarily kill the idea. If 
writing an estimator for a given SRF is too difficult, the user is no 
worse off than they are today. Hopefully there would be a fairly large 
class of SRFs for which writing an estimator would be relatively simple, 
and result in improved planner behavior.

> I don't foresee a whole lot of use of an estimator hook designed as
> proposed here.  In particular, if the API is such that we can only
> use the estimator when all the function arguments are plan-time
> constants, it's not going to be very helpful.
Yes :( One approach might be to break the function's domain into pieces 
and have the estimator function calculate the estimated result set size 
for each piece. So, given a trivial function like:

foo(int):
   if $1 < 10 then produce 100 rows
   else produce 1 row
If the planner has encoded the distribution of input tuples to the 
function as a histogram, it could invoke the SRF's estimator function 
for the boundary values of each histogram bucket, and use that to get an 
idea of the function's likely result set size at runtime.
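	A rough sketch of that bucket walk, assuming the estimator is an ordinary one-argument function (est_foo below is hypothetical) and reading the boundary values from pg_stats.histogram_bounds; the equal-weight averaging over buckets is a simplification:

```sql
-- Average est_foo() over the histogram bucket boundaries of t.x to
-- guess foo(x)'s typical result size.  Hypothetical helper function.
CREATE FUNCTION est_foo_over_histogram() RETURNS numeric
LANGUAGE plpgsql AS $$
DECLARE
    _bound text;
    _total numeric := 0;
    _nb    integer := 0;
BEGIN
    FOR _bound IN
        SELECT unnest(histogram_bounds::text::text[])
        FROM pg_stats
        WHERE tablename = 't' AND attname = 'x'
    LOOP
        _total := _total + est_foo(_bound::integer);
        _nb := _nb + 1;
    END LOOP;
    -- No statistics available: fall back to the flat default.
    RETURN CASE WHEN _nb = 0 THEN 1000 ELSE _total / _nb END;
END;
$$;
```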

And yes, the idea as sketched is totally unworkable :) For one thing, 
the difficulty of doing this grows rapidly as the number of arguments to 
the function increases. But perhaps there is some variant of this idea 
that might work...

Another thought is that the estimator could provide information on the 
cost of evaluating the function, the number of tuples produced by the 
function, and even the distribution of those tuples.

BTW, why is this on -performance? It should be on -hackers.
-Neil


Re: [PERFORM] Functionscan estimates

2005-04-09 Thread Neil Conway
Tom Lane wrote:
> Not too many releases ago, there were several columns in pg_proc that
> were intended to support estimation of the runtime cost and number of
> result rows of set-returning functions.  I believe in fact that these
> were the remains of Joe Hellerstein's thesis on expensive-function
> evaluation
FYI, Hellerstein's thesis on xfunc optimization is available here:
ftp://ftp.cs.wisc.edu/pub/tech-reports/reports/1996/tr1304.ps.Z
There's also a paper on this subject by Hellerstein that was published 
in Transactions on Database Systems:

http://www.cs.berkeley.edu/~jmh/miscpapers/todsxfunc.pdf
I haven't had a chance to digest either one yet, but it might be worth a 
look.

-Neil


Re: [PERFORM] Functionscan estimates

2005-04-09 Thread Tom Lane
"Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> On Sat, Apr 09, 2005 at 12:00:56AM -0400, Tom Lane wrote:
>> But with all due respect to Joe, I think the reason that stuff got
>> trimmed is that it didn't work very well.  In most cases it's
>> *hard* to write an estimator for a SRF.  Let's see you produce
>> one for dblink() for instance ...

> Actually, if the remote database supported a way to get a rows estimate
> from the query passed to dblink, it would be trivial, since you'd just
> pass that back.

This assumes that (1) you have the complete query argument at the time
of estimation, and (2) it's OK to contact the remote database and do an
EXPLAIN at that time.  Both of these seem pretty shaky assumptions.

The larger point is that writing an estimator for an SRF is frequently a
task about as difficult as writing the SRF itself, and sometimes much
*more* difficult due to lack of information.  I don't foresee a whole
lot of use of an estimator hook designed as proposed here.  In
particular, if the API is such that we can only use the estimator when
all the function arguments are plan-time constants, it's not going to be
very helpful.

regards, tom lane



Re: [PERFORM] Functionscan estimates

2005-04-09 Thread Jim C. Nasby
On Sat, Apr 09, 2005 at 12:00:56AM -0400, Tom Lane wrote:
> Not too many releases ago, there were several columns in pg_proc that
> were intended to support estimation of the runtime cost and number of
> result rows of set-returning functions.  I believe in fact that these
> were the remains of Joe Hellerstein's thesis on expensive-function
> evaluation, and are exactly what he was talking about here:
> http://archives.postgresql.org/pgsql-hackers/2002-06/msg00085.php
> 
> But with all due respect to Joe, I think the reason that stuff got
> trimmed is that it didn't work very well.  In most cases it's
> *hard* to write an estimator for a SRF.  Let's see you produce
> one for dblink() for instance ...

Actually, if the remote database supported a way to get a rows estimate
from the query passed to dblink, it would be trivial, since you'd just
pass that back.

In fact, having such a function (estimate_rows_for_sql(text)) would
probably be very useful to functions that wanted to support returning a
rows estimate.
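A local-side sketch of that helper, scraping the planner's row estimate out of plain-text EXPLAIN output (the 'rows=' pattern is an assumption about the plan format; a dblink-aware version would run the EXPLAIN on the remote side instead):

```sql
CREATE FUNCTION estimate_rows_for_sql(query text) RETURNS bigint
LANGUAGE plpgsql AS $$
DECLARE
    _line text;
BEGIN
    -- The first line of EXPLAIN output carries the planner's estimate
    -- for the whole query, e.g. "(cost=0.00..1.00 rows=42 width=4)".
    FOR _line IN EXECUTE 'EXPLAIN ' || query LOOP
        RETURN (substring(_line from 'rows=([0-9]+)'))::bigint;
    END LOOP;
    RETURN NULL;
END;
$$;
```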
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"



Re: [PERFORM] Functionscan estimates

2005-04-09 Thread PFC

> My solution would be a lot simpler, since we could simply populate
> pg_proc.proestrows with "1000" by default if not changed by the DBA.  In an
> even better world, we could tie it to a table, saying that, for example,
> proestrows = my_table*0.02.
	What if the estimated row count is a function of a parameter?
	Say a function takes as a parameter:
	- a number to use in a LIMIT, or
	- a predetermined set from which it generates a certain number of
values (like an array -> set-returning function).

	In all those cases it's no use to have just a fixed number.
	I'd suggest two solutions:
	- The ideal solution, which is impossible to do: the function tells
the planner about its stats, looking at its parameters.
	- A solution that would be possible to do: pg_proc.proestrows is...
the name of another function, defined by the user, which takes the exact
same parameters as the set-returning function we're talking about, and
which returns estimates.

For instance, in pseudo-SQL (icount() comes from the intarray contrib
module):

CREATE FUNCTION int_array_srf( INTEGER[] ) RETURNS SETOF INTEGER LANGUAGE
plpgsql AS $$
DECLARE
	_i INTEGER;
BEGIN
	FOR _i IN 1..icount($1) LOOP
		RETURN NEXT $1[_i];
	END LOOP;
END;
$$;

	In the two cases above, this would give:

CREATE FUNCTION int_array_srf_estimator( INTEGER[] ) RETURNS INTEGER LANGUAGE
plpgsql AS $$
BEGIN
	RETURN icount( $1 );
END;
$$;

ALTER FUNCTION int_array_srf SET ESTIMATOR int_array_srf_estimator;
	Another interesting case would be the famous "Top 5 by category" case,
where we use an SRF to emulate an index skip scan.  Say we have a table
Categories and a table Users, each User having columns "category_id" and
"score", and we want the N users with the best score in each category:

CREATE FUNCTION top_n_by_category( INTEGER ) RETURNS SETOF users
LANGUAGE plpgsql AS $$
DECLARE
	_cat_id	INTEGER;
	_n ALIAS FOR $1;
	_user	users%ROWTYPE;
BEGIN
	FOR _cat_id IN SELECT category_id FROM categories LOOP
		FOR _user IN SELECT * FROM users WHERE category_id = _cat_id
		ORDER BY score DESC LIMIT _n LOOP
			RETURN NEXT _user;
		END LOOP;
	END LOOP;
END;
$$;
	
CREATE FUNCTION top_n_by_category_estimator( INTEGER ) RETURNS INTEGER
LANGUAGE plpgsql AS $$
BEGIN
	RETURN $1 * (the estimated number of rows of the categories table,
	taken from the table statistics);
END;
$$;

ALTER FUNCTION top_n_by_category SET ESTIMATOR top_n_by_category_estimator;
Got it?
	The array_srf case would be extremely useful, as this type of function
is generally used to join against other tables, and having estimates is
useful for that.
	The top_n case would be useless if we're just returning the rows from
the function directly, but would be useful if we then join them to other
tables.

	This sounds pretty simple, powerful, and versatile.
	Additionally, in some cases (most notably the array case) it's easy to
estimate the statistics on the returned values, because they're all in the
array already, so the mechanism could be extended to have a way of
returning a pseudo-pg_stats for a set-returning function.

	For instance, say you have an SRF which returns N random rows from a
table.  It could have an estimator which would return a rowcount of N, and
a statistics estimator which would return the stats rows for the source
table, appropriately modified.

	This sounds harder to do.
	What do you think?


Re: [PERFORM] Functionscan estimates

2005-04-09 Thread PFC

> But with all due respect to Joe, I think the reason that stuff got
> trimmed is that it didn't work very well.  In most cases it's
> *hard* to write an estimator for a SRF.  Let's see you produce
> one for dblink() for instance ...
	Good one...
	Well, in some cases it'll be impossible, but suppose I have a function
get_id_for_something() which just grabs an ID using dblink; then I know it
returns one row, and pg would be interested in that information too!
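Under the estimator proposals upthread (syntax still hypothetical), declaring that one-row fact would be trivial:

```sql
-- Constant estimator for a dblink wrapper known to return one row.
CREATE FUNCTION get_id_for_something_est() RETURNS INTEGER
LANGUAGE sql AS $$ SELECT 1 $$;

-- Hypothetical wiring: bind the constant estimator to the wrapper.
ALTER FUNCTION get_id_for_something SET ESTIMATOR get_id_for_something_est;
```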



Re: [PERFORM] Functionscan estimates

2005-04-08 Thread Tom Lane
Not too many releases ago, there were several columns in pg_proc that
were intended to support estimation of the runtime cost and number of
result rows of set-returning functions.  I believe in fact that these
were the remains of Joe Hellerstein's thesis on expensive-function
evaluation, and are exactly what he was talking about here:
http://archives.postgresql.org/pgsql-hackers/2002-06/msg00085.php

But with all due respect to Joe, I think the reason that stuff got
trimmed is that it didn't work very well.  In most cases it's
*hard* to write an estimator for a SRF.  Let's see you produce
one for dblink() for instance ...

regards, tom lane



Re: [PERFORM] Functionscan estimates

2005-04-08 Thread Alvaro Herrera
On Fri, Apr 08, 2005 at 04:04:27PM -0700, Josh Berkus wrote:

> My solution would be a lot simpler, since we could simply populate 
> pg_proc.proestrows with "1000" by default if not changed by the DBA.  In an 
> even better world, we could tie it to a table, saying that, for example, 
> proestrows = my_table*0.02.

The problem with that approach is that it can't differ depending on the
arguments to the function, so it too seems limited to me.

Ideally an estimator would be able to peek at other table statistics and
do some computation with them, just like other nodes are able to.

Another idea would be have an estimator function (pg_proc.proestimator)
for each regular function.  The estimator would be a very cheap function
to be called with the same arguments, and it would return the estimated
number of tuples the other function would return.  The default estimator
could be "return 1000".
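A sketch of such a pair (proestimator is a hypothetical pg_proc column; the function names are made up, and regexp_split_to_table is used purely for illustration):

```sql
-- The real (potentially expensive) set-returning function.
CREATE FUNCTION words_in(doc TEXT) RETURNS SETOF TEXT
LANGUAGE sql AS $$ SELECT regexp_split_to_table($1, '\s+') $$;

-- Its cheap companion: same arguments, returns only a row-count guess.
CREATE FUNCTION words_in_est(doc TEXT) RETURNS INTEGER
LANGUAGE sql AS $$ SELECT length($1) / 6 $$;

-- Hypothetical wiring; with proestimator unset, the planner would fall
-- back to the equivalent of "SELECT 1000".
UPDATE pg_proc SET proestimator = 'words_in_est'::regproc
WHERE proname = 'words_in';
```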

-- 
Alvaro Herrera (<[EMAIL PROTECTED]>)
"A wizard is never late, Frodo Baggins, nor is he early.
 He arrives precisely when he means to."  (Gandalf, en LoTR FoTR)



Re: [PERFORM] Functionscan estimates

2005-04-08 Thread Josh Berkus
Alvaro, Michael,

> > About a month ago I mentioned that I'd find that useful.  In a
> > followup, Christopher Kings-Lynne brought up the idea of a GUC
> > variable that could give hints about the expected row count.
>
> That seems pretty limited ... what happens if the query contains more
> than one SRF?

Yeah, I'd see that as a pretty bad idea too.  I don't want to tell the planner 
how many rows I expect "all functions" to return, I want to tell it how many 
*one particular* function will return.

> Maybe issuing some sort of special call to the function (say, with
> some boolean in the call info struct) on which it returns planning data;
> thus the planner can call the function itself.  The hard part would be
> figuring out how to do it without breaking backwards compatibility with
> functions that don't know how to handle that.  (And how to do it in
> plpgsql).

Or in pl/perl, or pl/python, or plsh ... that doesn't sound feasible.

My solution would be a lot simpler, since we could simply populate 
pg_proc.proestrows with "1000" by default if not changed by the DBA.  In an 
even better world, we could tie it to a table, saying that, for example, 
proestrows = my_table*0.02.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Functionscan estimates

2005-04-08 Thread Alvaro Herrera
On Fri, Apr 08, 2005 at 04:38:20PM -0600, Michael Fuhr wrote:
> On Fri, Apr 08, 2005 at 03:15:50PM -0700, Josh Berkus wrote:
> > 
> > I'm wondering if it might be useful to be able to add estimated selectivity to
> > a function definition for purposes of query estimation.  Currently function
> > scans automatically return a flat default 1000 estimated rows.  It seems
> > like the DBA ought to be able to ALTER FUNCTION and give it a row estimate
> > for planning purposes.
> 
> About a month ago I mentioned that I'd find that useful.  In a
> followup, Christopher Kings-Lynne brought up the idea of a GUC
> variable that could give hints about the expected row count.

That seems pretty limited ... what happens if the query contains more
than one SRF?

Maybe issuing some sort of special call to the function (say, with
some boolean in the call info struct) on which it returns planning data;
thus the planner can call the function itself.  The hard part would be
figuring out how to do it without breaking backwards compatibility with
functions that don't know how to handle that.  (And how to do it in
plpgsql).

-- 
Alvaro Herrera (<[EMAIL PROTECTED]>)
"La principal característica humana es la tontería"
(Augusto Monterroso)



Re: [PERFORM] Functionscan estimates

2005-04-08 Thread Michael Fuhr
On Fri, Apr 08, 2005 at 03:15:50PM -0700, Josh Berkus wrote:
> 
> I'm wondering if it might be useful to be able to add estimated selectivity to
> a function definition for purposes of query estimation.  Currently function
> scans automatically return a flat default 1000 estimated rows.  It seems
> like the DBA ought to be able to ALTER FUNCTION and give it a row estimate
> for planning purposes.

About a month ago I mentioned that I'd find that useful.  In a
followup, Christopher Kings-Lynne brought up the idea of a GUC
variable that could give hints about the expected row count.

http://archives.postgresql.org/pgsql-hackers/2005-03/msg00146.php
http://archives.postgresql.org/pgsql-hackers/2005-03/msg00153.php

-- 
Michael Fuhr
http://www.fuhr.org/~mfuhr/



[PERFORM] Functionscan estimates

2005-04-08 Thread Josh Berkus
Folks,

I'm wondering if it might be useful to be able to add estimated selectivity to 
a function definition for purposes of query estimation.  Currently function 
scans automatically return a flat default 1000 estimated rows.   It seems 
like the DBA ought to be able to ALTER FUNCTION and give it a row estimate 
for planning purposes.   
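As a concrete sketch of that knob, in invented syntax (a ROWS clause on ALTER FUNCTION is proposed here, not existing; the function name is made up):

```sql
-- Tell the planner this SRF typically returns ~50 rows instead of
-- the flat default of 1000.
ALTER FUNCTION my_report_srf(date, date) ROWS 50;
```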

Thoughts?

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
