Re: [PERFORM] query slows down with more accurate stats

2004-04-19 Thread Tom Lane
Manfred Koizar <[EMAIL PROTECTED]> writes:
> Random sampling is more like "every possible sample is equally likely to
> be collected", and two-stage sampling doesn't satisfy this condition.

Okay, I finally see the point here: in the limit as the number of pages
B goes to infinity, you'd expect the probability that each tuple in your
sample came from a different page to go to 1.  But this doesn't happen
in the two-stage sampling method: the probability doesn't increase
beyond the value it would have for B=n.  On the average each sample page
would supply one tuple, but the odds that this holds *exactly* would be
pretty low.

However the existing sampling method has glaring flaws of its own,
in particular having to do with the fact that a tuple whose slot is
preceded by N empty slots is N times more likely to be picked than one
that has no empty-slot predecessors.  The fact that the two-stage
method artificially constrains the sample to come from only n pages
seems like a minor problem by comparison; I'd happily accept it to get
rid of the empty-slot bias.

A possible compromise is to limit the number of pages sampled to
something a bit larger than n, perhaps 2n or 3n.  I don't have a feeling
for the shape of the different-pages probability function; would this
make a significant difference, or would it just waste cycles?
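For a rough feel, here is a small back-of-the-envelope program (a sketch
only: it assumes exactly c tuples on every page, simply substitutes the
number of candidate pages b for B in Manfred's product formula, and uses
purely hypothetical values for c, n, and B):

#include <stdio.h>

/* Probability that all n sampled tuples come from different pages when
 * the sample is confined to b candidate pages of c tuples each
 * (Manfred's formula with b substituted for B). */
static double
diff_page_prob(int b, int c, int n)
{
    double  p = 1.0;
    int     i;

    for (i = 0; i < n; i++)
        p *= (double) c * (b - i) / ((double) c * b - i);
    return p;
}

int
main(void)
{
    int c = 50;         /* hypothetical tuples per page */
    int n = 3000;       /* hypothetical sample size */
    int B = 100000;     /* hypothetical table size in pages */

    printf("b =  n: p = %g\n", diff_page_prob(n, c, n));
    printf("b = 2n: p = %g\n", diff_page_prob(2 * n, c, n));
    printf("b = 3n: p = %g\n", diff_page_prob(3 * n, c, n));
    printf("b =  B: p = %g\n", diff_page_prob(B, c, n));
    return 0;
}

Plugging in table and sample sizes of interest should show whether widening
the page pool from n to 2n or 3n moves that probability appreciably.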

regards, tom lane



Re: [PERFORM] query slows down with more accurate stats

2004-04-16 Thread Manfred Koizar
On Fri, 16 Apr 2004 10:34:49 -0400, Tom Lane <[EMAIL PROTECTED]> wrote:
>>  p = prod from{i = 0} to{n - 1} {{c(B - i)}  over {cB - i}}
>
>So?  You haven't proven that either sampling method fails to do the
>same.

On the contrary, I believe that the above formula is more or less valid for
both methods.  The point is in what I said next:
| This probability grows with increasing B.

For the one-stage sampling method B is the number of pages of the whole
table.  With two-stage sampling we have to use n instead of B and get a
smaller probability (for n < B, of course).  So this merely shows that
the two sampling methods are not equivalent.

>The desired property can also be phrased as "every tuple should be
>equally likely to be included in the final sample".

Only at first glance.  You really expect more from random sampling.
Otherwise I'd just put one random tuple and its n - 1 successors (modulo
N) into the sample.  This satisfies your condition but you wouldn't call
it a random sample.
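
A minimal sketch of that degenerate scheme, just to make the point concrete
(hypothetical code, not anything proposed for ANALYZE):

#include <stdlib.h>

/* Degenerate "sampler": one random start position plus its n - 1
 * successors modulo N.  Every one of the N tuples is included with
 * probability n/N, yet only N distinct samples can ever occur, so
 * this is clearly not a simple random sample. */
static void
successor_sample(int N, int n, int *sample)
{
    int start = rand() % N;
    int i;

    for (i = 0; i < n; i++)
        sample[i] = (start + i) % N;
}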

Random sampling is more like "every possible sample is equally likely to
be collected", and two-stage sampling doesn't satisfy this condition.

But if in your opinion the difference is not significant, I'll stop
complaining about my own idea.  Is there anybody else who cares?

>You could argue that a tuple on a heavily populated page is
>statistically likely to see a higher T when it's part of the page sample
>pool than a tuple on a near-empty page is likely to see, and therefore
>there is some bias against selection of the former tuple.  But given a
>sample over a reasonably large number of pages, the contribution of any
>one page to T should be fairly small and so this effect ought to be
>small.

It is even better:  Storing a certain number of tuples on heavily
populated pages takes fewer pages than storing them on sparsely
populated pages (due to tuple size or to dead tuples).  So heavily
populated pages are less likely to be selected in stage one, and this
exactly offsets the effect of increasing T.
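
For what it's worth, here is a rough Monte Carlo sketch of that comparison
(toy numbers, with-replacement page picks, and a simplified model of the
two-stage method rather than the actual ANALYZE code):

#include <stdio.h>
#include <stdlib.h>

#define B       1000        /* pages in the toy table */
#define N       300         /* pages sampled = tuples sampled */
#define DENSE   50          /* tuples on a densely populated page */
#define SPARSE  5           /* tuples on a sparsely populated page */
#define TRIALS  100000

int
main(void)
{
    double  dense_hits = 0.0, sparse_hits = 0.0;
    double  dense_tuples = (B / 2) * DENSE;
    double  sparse_tuples = (B / 2) * SPARSE;
    int     t, i;

    srand(1);
    for (t = 0; t < TRIALS; t++)
    {
        long    dense_in_pool = 0, sparse_in_pool = 0, T;

        /* stage one: N uniform page picks (with replacement, for
         * simplicity); half of all pages are dense, half sparse */
        for (i = 0; i < N; i++)
        {
            if (rand() % B < B / 2)
                dense_in_pool += DENSE;
            else
                sparse_in_pool += SPARSE;
        }
        T = dense_in_pool + sparse_in_pool;

        /* stage two draws N of the T pooled tuples uniformly; we
         * accumulate the expected number of hits per page class */
        dense_hits += (double) N * dense_in_pool / T;
        sparse_hits += (double) N * sparse_in_pool / T;
    }

    printf("inclusion probability, dense-page tuple:  %.6f\n",
           dense_hits / TRIALS / dense_tuples);
    printf("inclusion probability, sparse-page tuple: %.6f\n",
           sparse_hits / TRIALS / sparse_tuples);
    return 0;
}

Comparing the two printed figures gives a feel for how large the residual
tuple-level bias actually is.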

>So I think this method is effectively unbiased at the tuple level.

Servus
 Manfred



Re: [PERFORM] query slows down with more accurate stats

2004-04-16 Thread Robert Treat
On Tue, 2004-04-13 at 15:18, Tom Lane wrote:
> Robert Treat <[EMAIL PROTECTED]> writes:
> Well, the first problem is why is ANALYZE's estimate of the total row
> count so bad :-( ?  I suspect you are running into the situation where
> the initial pages of the table are thinly populated and ANALYZE
> mistakenly assumes the rest are too. 

That was my thinking, which is somewhat confirmed after a vacuum full on
the table; now analyze gives pretty accurate stats.  Of course the
downside is that now the query is consistently slower. 

> > So I guess I am wondering if there is something I should be doing to
> > help get the better plan at the more accurate stats levels and/or why it
> > doesn't stick with the original plan (I noticed disabling merge joins
> > does seem to push it back to the original plan).
> 
> With the larger number of estimated rows it's figuring the nestloop will
> be too expensive.  The row estimate for the cl scan went up from 1248
> to 10546, so the estimated cost for the nestloop plan would go to about
> 24 units vs 8 for the mergejoin plan.  This is obviously off
> rather badly when the true runtimes are 1.7 vs 8.1 seconds :-(.
> 
> I think this is an example of a case where we really need better
> estimation of nestloop costs --- it's drastically overestimating the
> relative cost of the nestloop because it's not accounting for the cache
> benefits of the repeated index searches.  You could probably force the
> nestloop to be chosen by lowering random_page_cost, but that's just a
> kluge solution ... the real problem is the model is wrong.
> 

Unfortunately playing with random_page_cost doesn't seem to be enough to
get it to favor the nested loop, though setting it down to 2 does help
overall.  I played with cpu_index_tuple_cost a bit, but that seemed even
less useful.  It's aggravating when you know there is a better plan it
could pick but there's no (clean) way to get it to do so...

> I have a to-do item to work on this, and will try to bump up its
> priority a bit.
> 

I'll keep an eye out, thanks Tom.


Robert Treat
-- 
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL




Re: [PERFORM] query slows down with more accurate stats

2004-04-16 Thread Tom Lane
Manfred Koizar <[EMAIL PROTECTED]> writes:
> If the number of pages is B and the sample size is n, a perfect sampling
> method collects a sample where all tuples come from different pages with
> probability (in OpenOffice.org syntax):
>   p = prod from{i = 0} to{n - 1} {{c(B - i)}  over {cB - i}}

So?  You haven't proven that either sampling method fails to do the
same.

The desired property can also be phrased as "every tuple should be
equally likely to be included in the final sample".  What we actually
have in the case of your revised algorithm is "every page is equally
likely to be sampled, and of the pages included in the sample, every
tuple is equally likely to be chosen".  Given that there are B total
pages of which we sample b pages that happen to contain T tuples (in any
distribution), the probability that a particular tuple gets chosen is
(b/B) * (n/T)
assuming that the two selection steps are independent and unbiased.
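(For concreteness, with purely hypothetical numbers: B = 10,000 pages,
b = 300 pages sampled, n = 300 sample tuples, and T = 15,000 tuples found
on the sampled pages, every tuple is selected with probability
(300/10000) * (300/15000) = 0.03 * 0.02 = 0.0006, no matter which page it
sits on.)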

Now b, B, and n are not dependent on which tuple we are talking about.
You could argue that a tuple on a heavily populated page is
statistically likely to see a higher T when it's part of the page sample
pool than a tuple on a near-empty page is likely to see, and therefore
there is some bias against selection of the former tuple.  But given a
sample over a reasonably large number of pages, the contribution of any
one page to T should be fairly small and so this effect ought to be
small.  In fact, because T directly determines our estimate of the total
number of tuples in the relation, your experiments showing that the new
method gives a reliable tuple count estimate directly prove that T is
pretty stable regardless of exactly which pages get included in the
sample.  So I think this method is effectively unbiased at the tuple
level.  The variation in probability of selection of individual tuples
can be no worse than the variation in the overall tuple count estimate.

regards, tom lane



Re: [PERFORM] query slows down with more accurate stats

2004-04-16 Thread Manfred Koizar
On Thu, 15 Apr 2004 20:18:49 -0400, Tom Lane <[EMAIL PROTECTED]> wrote:
>> getting several tuples from the same page is more likely
>> than with the old method.
>
>Hm, are you sure?

Almost sure.  Let's look at a corner case:  What is the probability of
getting a sample with no two tuples from the same page?  To simplify the
problem assume that each page contains the same number of tuples c.

If the number of pages is B and the sample size is n, a perfect sampling
method collects a sample where all tuples come from different pages with
probability (in OpenOffice.org syntax):

p = prod from{i = 0} to{n - 1} {{c(B - i)}  over {cB - i}}

or in C:

p = 1.0;
for (i = 0; i < n; ++i)
    p *= (double) c * (B - i) / ((double) c * B - i);  /* cast avoids integer division */

This probability grows with increasing B.

>Also, I'm not at all sure that the old method satisfies that constraint
>completely in the presence of nonuniform numbers of tuples per page,
>so we'd not necessarily be going backwards anyhow ...

Yes, it boils down to a decision whether we want to replace one not
quite perfect sampling method with another not quite perfect method.
I'm still working on putting together the pros and cons ...

Servus
 Manfred



Re: [PERFORM] query slows down with more accurate stats

2004-04-15 Thread Tom Lane
Manfred Koizar <[EMAIL PROTECTED]> writes:
> My biggest concern at the moment is that the new sampling method
> violates the contract of returning each possible sample with the same
> probability:  getting several tuples from the same page is more likely
> than with the old method.

Hm, are you sure?  I recall objecting to your original proposal because
I thought that would happen, but after further thought it seemed not.

Also, I'm not at all sure that the old method satisfies that constraint
completely in the presence of nonuniform numbers of tuples per page,
so we'd not necessarily be going backwards anyhow ...

regards, tom lane



Re: [PERFORM] query slows down with more accurate stats

2004-04-15 Thread Manfred Koizar
[Just a quick note here;  a more thorough discussion of my test results
will be posted to -hackers]

On Tue, 13 Apr 2004 15:18:42 -0400, Tom Lane <[EMAIL PROTECTED]> wrote:
>Well, the first problem is why is ANALYZE's estimate of the total row
>count so bad :-( ?  I suspect you are running into the situation where
>the initial pages of the table are thinly populated and ANALYZE
>mistakenly assumes the rest are too.  Manfred is working on a revised
>sampling method for ANALYZE that should fix this problem

The new method looks very promising with respect to row count
estimation:  I got estimation errors of +/- 1% where the old method was
off by up to 60%.  (My test methods might be a bit biased though :-))

My biggest concern at the moment is that the new sampling method
violates the contract of returning each possible sample with the same
probability:  getting several tuples from the same page is more likely
than with the old method.

Servus
 Manfred



Re: [PERFORM] query slows down with more accurate stats

2004-04-13 Thread Tom Lane
Robert Treat <[EMAIL PROTECTED]> writes:
> live=# analyze cl;
> ANALYZE
> live=# select reltuples from pg_class where relname = 'cl';
>  reltuples 
> -----------
>  53580
> (1 row)
> live=# vacuum cl;
> VACUUM
> live=# select reltuples from pg_class where relname = 'cl';
>   reltuples  
> -------------
>  1.14017e+06
> (1 row)

Well, the first problem is why is ANALYZE's estimate of the total row
count so bad :-( ?  I suspect you are running into the situation where
the initial pages of the table are thinly populated and ANALYZE
mistakenly assumes the rest are too.  Manfred is working on a revised
sampling method for ANALYZE that should fix this problem in 7.5 and
beyond, but for now it seems like a VACUUM FULL might be in order.

> So I guess I am wondering if there is something I should be doing to
> help get the better plan at the more accurate stats levels and/or why it
> doesn't stick with the original plan (I noticed disabling merge joins
> does seem to push it back to the original plan).

With the larger number of estimated rows it's figuring the nestloop will
be too expensive.  The row estimate for the cl scan went up from 1248
to 10546, so the estimated cost for the nestloop plan would go to about
24 units vs 8 for the mergejoin plan.  This is obviously off
rather badly when the true runtimes are 1.7 vs 8.1 seconds :-(.

I think this is an example of a case where we really need better
estimation of nestloop costs --- it's drastically overestimating the
relative cost of the nestloop because it's not accounting for the cache
benefits of the repeated index searches.  You could probably force the
nestloop to be chosen by lowering random_page_cost, but that's just a
kluge solution ... the real problem is the model is wrong.

I have a to-do item to work on this, and will try to bump up its
priority a bit.

regards, tom lane



[PERFORM] query slows down with more accurate stats

2004-04-13 Thread Robert Treat
In the process of optimizing some queries, I have found that the following
query seems to degrade in performance the more accurate I make the
statistics on the table, whether by increasing the statistics target via
alter table ... set statistics or by running vacuum.

SELECT 
count( cl.caller_id ), 
npanxx.city, 
npanxx.state 
FROM 
cl 
LEFT OUTER JOIN npanxx 
  on substr( cl.caller_id, 1, 3 ) = npanxx.npa 
  and substr( cl.caller_id, 4, 3 ) = npanxx.nxx 
LEFT OUTER JOIN cp 
  ON cl.caller_id = cp.caller_id 
WHERE 
cl.ivr_system_id = 130 
AND 
cl.call_time > '2004-03-01 06:00:00.0 CST' 
AND 
cl.call_time < '2004-04-01 06:00:00.0 CST' 
AND 
cp.age >= 18 
AND 
cp.age <= 24 
AND 
cp.gender = 'm' 
GROUP BY 
npanxx.city, 
npanxx.state


live=# analyze cl;
ANALYZE
live=# select reltuples from pg_class where relname = 'cl';
 reltuples 
-----------
 53580
(1 row)

live=# select count(*) from cl;
  count  
---------
 1140166
(1 row)

The plan I get under these conditions is actually pretty good...

 HashAggregate  (cost=28367.22..28367.66 rows=174 width=32) (actual time=1722.060..1722.176 rows=29 loops=1)
   ->  Nested Loop  (cost=0.00..28365.92 rows=174 width=32) (actual time=518.592..1716.254 rows=558 loops=1)
         ->  Nested Loop Left Join  (cost=0.00..20837.33 rows=1248 width=32) (actual time=509.991..1286.755 rows=13739 loops=1)
               ->  Index Scan using cl_ivr_system_id on cl  (cost=0.00..13301.15 rows=1248 width=14) (actual time=509.644..767.421 rows=13739 loops=1)
                     Index Cond: (ivr_system_id = 130)
                     Filter: ((call_time > '2004-03-01 07:00:00-05'::timestamp with time zone) AND (call_time < '2004-04-01 07:00:00-05'::timestamp with time zone))
               ->  Index Scan using npanxx_pkey on npanxx  (cost=0.00..6.02 rows=1 width=32) (actual time=0.025..0.027 rows=1 loops=13739)
                     Index Cond: ((substr(("outer".caller_id)::text, 1, 3) = (npanxx.npa)::text) AND (substr(("outer".caller_id)::text, 4, 3) = (npanxx.nxx)::text))
         ->  Index Scan using cp_pkey on cp  (cost=0.00..6.02 rows=1 width=14) (actual time=0.027..0.027 rows=0 loops=13739)
               Index Cond: (("outer".caller_id)::text = (cp.caller_id)::text)
               Filter: ((age >= 18) AND (age <= 24) AND (gender = 'm'::bpchar))
 Total runtime: 1722.489 ms
(12 rows)


but when I do

live=# vacuum cl;
VACUUM
live=# select reltuples from pg_class where relname = 'cl';
  reltuples  
-------------
 1.14017e+06
(1 row)

(or alternatively increase the stats target on the table)

I now get the following plan:

 HashAggregate  (cost=80478.74..80482.41 rows=1471 width=32) (actual time=8132.261..8132.422 rows=29 loops=1)
   ->  Merge Join  (cost=79951.95..80467.70 rows=1471 width=32) (actual time=7794.338..8130.041 rows=558 loops=1)
         Merge Cond: ("outer"."?column4?" = "inner"."?column2?")
         ->  Sort  (cost=55719.06..55745.42 rows=10546 width=32) (actual time=4031.827..4052.526 rows=13739 loops=1)
               Sort Key: (cl.caller_id)::text
               ->  Merge Right Join  (cost=45458.30..55014.35 rows=10546 width=32) (actual time=2944.441..3796.787 rows=13739 loops=1)
                     Merge Cond: ((("outer".npa)::text = "inner"."?column2?") AND (("outer".nxx)::text = "inner"."?column3?"))
                     ->  Index Scan using npanxx_pkey on npanxx  (cost=0.00..8032.99 rows=132866 width=32) (actual time=0.200..461.767 rows=130262 loops=1)
                     ->  Sort  (cost=45458.30..45484.67 rows=10546 width=14) (actual time=2942.994..2967.935 rows=13739 loops=1)
                           Sort Key: substr((cl.caller_id)::text, 1, 3), substr((cl.caller_id)::text, 4, 3)
                           ->  Seq Scan on cl  (cost=0.00..44753.60 rows=10546 width=14) (actual time=1162.423..2619.662 rows=13739 loops=1)
                                 Filter: ((ivr_system_id = 130) AND (call_time > '2004-03-01 07:00:00-05'::timestamp with time zone) AND (call_time < '2004-04-01 07:00:00-05'::timestamp with time zone))
         ->  Sort  (cost=24232.89..24457.06 rows=89667 width=14) (actual time=3761.703..3900.340 rows=98010 loops=1)
               Sort Key: (cp.caller_id)::text
               ->  Seq Scan on cp  (cost=0.00..15979.91 rows=89667 width=14) (actual time=0.128..1772.215 rows=100302 loops=1)
                     Filter: ((age >= 18) AND (age <= 24) AND (gender = 'm'::bpchar))
 Total runtime: 8138.607 ms
(17 rows)


So I guess I am wondering if there is something I should be doing to
help get the better plan at the more accurate stats levels and/or why it
doesn't stick with the original plan (I noticed disabling merge joins
does seem to push it back to the original plan).

Alternatively, if anyone has any general suggestions on speeding up the
query, I'd be open to that too :-