Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-06 Thread Ron Mayer

Tom Lane wrote:


One objection to this is that after moving off the gold standard of
1.0 = one page fetch, there is no longer any clear meaning to the
cost estimate units; you're faced with the fact that they're just an
arbitrary scale.  I'm not sure that's such a bad thing, though.


It seems to me the appropriate gold standard is Time, in microseconds
or milliseconds.

The default postgresql.conf can come with a set of hardcoded
values that reasonably approximate some real-world system; and
if that's documented in the file someone reading it can say
 "hey, my CPU's about the same but my disk subsystem is much
 faster, so I know in which direction to change things."
And another person may say "ooh, now I know that my 4GHz
machines should have about twice the number here as my 2GHz
box."

For people who *really* care a lot (HW vendors?), they could
eventually make measurements on their systems.
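
To make that concrete, here's a rough Python sketch of the kind of
derivation I mean -- the per-operation timings below are invented
placeholders, not measurements from any real system:

# Hypothetical measured times, in milliseconds, for the operations the
# planner charges for.  A HW vendor (or a curious DBA) would fill these
# in from real measurements.
measured_ms = {
    "seq_page_cost":        0.05,     # sequential 8k page read
    "random_page_cost":     8.0,      # random 8k page read (seek + rotation)
    "cpu_tuple_cost":       0.001,    # process one heap tuple
    "cpu_index_tuple_cost": 0.0005,   # process one index entry
    "cpu_operator_cost":    0.00025,  # evaluate one operator/function
}

# If Time is the gold standard, the settings *are* the milliseconds:
for guc, ms in measured_ms.items():
    print(f"{guc} = {ms}")

# Or, keeping today's convention of seq_page_cost = 1.0, divide through
# by the sequential-page time so only the ratios change:
scale = measured_ms["seq_page_cost"]
for guc, ms in measured_ms.items():
    print(f"{guc} = {ms / scale:g}")

Someone with a 2x faster CPU would then halve just the cpu_* entries;
someone with a faster disk subsystem would shrink the page costs.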



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-05 Thread Hannu Krosing
On Sun, 2006-06-04 at 18:09, Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
  Hannu Krosing [EMAIL PROTECTED] writes:
  On Sat, 2006-06-03 at 10:43, Jim Nasby wrote:
  Might also be worth adding analyze delay settings, ala  
  vacuum_cost_delay.
 
 ANALYZE already respects the vacuum delay settings.
 
  Actually we should have delay settings for all potential
  (almost-)full-scan service ops, - VACUUM, ANALYSE, CREATE INDEX, ADD
  CONSTRAINT, maybe more - so that there would be better chances of
  running those on busy databases without disastrous effects.
 
  What about UPDATE and DELETE and for that matter SELECT?
 
 This seems pretty silly.  The point of the delay stuff is to prevent
 background maintenance operations from eating an unreasonable share
 of resources compared to foreground queries.  I don't see why you'd
 put delays into queries --- if your machine is loaded, it's loaded.
 
 I think the existing features are sufficient in this line and that
 doing more is just adding complexity for complexity's sake.

Making CREATE INDEX respect delay settings will be reasonable once we
get it to run without locking the table. 

And if non-locking is doable for ADD/ALTER CONSTRAINT, then it makes
sense there too.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-05 Thread Jim Nasby

On Jun 4, 2006, at 5:09 PM, Tom Lane wrote:


Greg Stark [EMAIL PROTECTED] writes:

Hannu Krosing [EMAIL PROTECTED] writes:

On Sat, 2006-06-03 at 10:43, Jim Nasby wrote:

Might also be worth adding analyze delay settings, ala
vacuum_cost_delay.


ANALYZE already respects the vacuum delay settings.


Actually we should have delay settings for all potential
(almost-)full-scan service ops, - VACUUM, ANALYSE, CREATE INDEX, ADD
CONSTRAINT, maybe more - so that there would be better chances of
running those on busy databases without disastrous effects.



What about UPDATE and DELETE and for that matter SELECT?


This seems pretty silly.  The point of the delay stuff is to prevent
background maintenance operations from eating an unreasonable share
of resources compared to foreground queries.  I don't see why you'd
put delays into queries --- if your machine is loaded, it's loaded.


'maintenance operations' often also mean running large updates. Being  
able to run those at a reduced priority would certainly be helpful in  
many cases. Though, a better way to accomplish this would be to have  
the OS handle prioritized IO scheduling, but since pretty much none  
of them seem to do that...

--
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461






Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-04 Thread Greg Stark

Greg Stark [EMAIL PROTECTED] writes:

 Perhaps what this indicates is that the real meat is in track sampling, not
 block sampling.

Fwiw, I've done a little benchmarking and I'm starting to think this isn't a
bad idea. I see a dramatic speed improvement for samples of 1-10% as the block
size increases. Presumably this is as Hannu said, reducing the number of
tracks necessary to cover the sample.

I see improvements up to around 256M blocks or so, but my data is pretty
questionable since I'm busy watching tv in Mythtv in another window. It's on
another drive but it still seems to be making the numbers jump around a bit.

I expect there's a trade-off between keeping enough blocks for the sample of
blocks to be representative on the one hand and large blocks being much faster
to read in on the other.

I would suggest something like setting the block size in the block sampling
algorithm to something like max(8k, sqrt(table size)). That gives 8k blocks for
anything up to 255M but takes better advantage of the speed increase available
from sequential i/o for larger tables; from my experiments, about a 50%
increase in speed.

Actually maybe something even more aggressive would be better, say
(table size)^0.75, so it kicks in sooner than on 256M tables and gets to larger
block sizes on reasonably sized tables.

Note, this doesn't mean anything like changing page sizes, just selecting more
blocks that hopefully lie on the same track when possible.
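
For concreteness, a little Python sketch of the two sizing rules I'm
suggesting (the cutover points are just what the formulas give, nothing
tuned):

PAGE = 8192  # bytes

def chunk_size_sqrt(table_bytes):
    # max(8k, sqrt(table size)): 8k chunks for small tables, growing slowly
    return max(PAGE, int(table_bytes ** 0.5))

def chunk_size_aggressive(table_bytes):
    # (table size)^0.75: kicks in sooner and grows much faster
    return max(PAGE, int(table_bytes ** 0.75))

for size in (64 * 2**20, 1 * 2**30, 100 * 2**30):   # 64MB, 1GB, 100GB tables
    print(size, chunk_size_sqrt(size), chunk_size_aggressive(size))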

-- 
greg




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-04 Thread Greg Stark
Greg Stark [EMAIL PROTECTED] writes:

 I see improvements up to around 256M blocks or so, but my data is pretty
 questionable since I'm busy watching tv in Mythtv in another window. It's on
 another drive but it still seems to be making the numbers jump around a bit.

Oops, I meant blocks of 256k there. Sorry.

-- 
greg




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-04 Thread Greg Stark

Hannu Krosing [EMAIL PROTECTED] writes:

 On Sat, 2006-06-03 at 10:43, Jim Nasby wrote:
 
  Might also be worth adding analyze delay settings, ala  
  vacuum_cost_delay.
 
 Actually we should have delay settings for all potential
 (almost-)full-scan service ops, - VACUUM, ANALYSE, CREATE INDEX, ADD
 CONSTRAINT, maybe more - so that there would be better chances of
 running those on busy databases without disastrous effects.

What about UPDATE and DELETE and for that matter SELECT?


-- 
greg




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-04 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Hannu Krosing [EMAIL PROTECTED] writes:
 On Sat, 2006-06-03 at 10:43, Jim Nasby wrote:
 Might also be worth adding analyze delay settings, ala  
 vacuum_cost_delay.

ANALYZE already respects the vacuum delay settings.

 Actually we should have delay settings for all potential
 (almost-)full-scan service ops, - VACUUM, ANALYSE, CREATE INDEX, ADD
 CONSTRAINT, maybe more - so that there would be better chances of
 running those on busy databases without disastrous effects.

 What about UPDATE and DELETE and for that matter SELECT?

This seems pretty silly.  The point of the delay stuff is to prevent
background maintenance operations from eating an unreasonable share
of resources compared to foreground queries.  I don't see why you'd
put delays into queries --- if your machine is loaded, it's loaded.

I think the existing features are sufficient in this line and that
doing more is just adding complexity for complexity's sake.

regards, tom lane



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-03 Thread Jim Nasby

On Jun 2, 2006, at 5:24 PM, Todd A. Cook wrote:

Josh Berkus wrote:

Greg, Tom,

But for most users analyze doesn't really have to run as often as
vacuum. One sequential scan per night doesn't seem like that big  
a deal

to me.

Clearly you don't have any 0.5 TB databases.


Perhaps something like ANALYZE FULL?  Then only those who need the
more precise statistics would pay the cost for a full scan.


Might also be worth adding analyze delay settings, ala  
vacuum_cost_delay.




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-03 Thread Hannu Krosing
On Sat, 2006-06-03 at 10:43, Jim Nasby wrote:

 Might also be worth adding analyze delay settings, ala  
 vacuum_cost_delay.

Actually we should have delay settings for all potential
(almost-)full-scan service ops, - VACUUM, ANALYSE, CREATE INDEX, ADD
CONSTRAINT, maybe more - so that there would be better chances of
running those on busy databases without disastrous effects.

Probably we should use the same settings for all these, not invent a new
set for each.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-03 Thread Hannu Krosing
On Fri, 2006-06-02 at 16:23, Greg Stark wrote:

 And a 5% sample is pretty big. In fact my tests earlier showed the i/o from
 5% block sampling took just as long as reading all the blocks. Even if we
 figure out what's causing that (IMHO surprising) result and improve matters I
 would only expect it to be 3-4x faster than a full scan.

You should not be surprised by this once you visualise what happens at
the disk level with all those platters spinning and heads moving :) 

Disks can read at full rotation speed, so skipping (not reading) some
blocks will not make reading the remaining blocks from the same track
faster. And if there are more than 20 8k pages per track, you still have
a very high probability you need to read all tracks.

You may be able to move to the next track a little earlier compared to
reading all blocks, but then you are likely to miss the block from next
track and have to wait a full rotation.

You will get some win from skipping pages only once your % falls so low
that you can also skip a significant number of tracks.
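
A quick back-of-the-envelope way to see it (my own toy calculation, not a
measurement):

# If a fraction f of all pages is sampled uniformly and a track holds k
# pages, the chance that a given track contains at least one sampled page
# (and therefore has to be visited at all) is 1 - (1-f)^k.
def frac_tracks_visited(f, pages_per_track):
    return 1 - (1 - f) ** pages_per_track

for f in (0.01, 0.05, 0.10):
    print(f, round(frac_tracks_visited(f, 20), 2))   # say 20 x 8k pages/track

# With 5% page sampling and 20 pages per track you still visit ~64% of the
# tracks, which is why the i/o time looks so close to a full scan.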

 http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php

Your test program could have got a little better results, if you had
somehow managed to tell the system all the block numbers to read in one
go, not each time the next one after getting the previous one. In the
current version it is quite likely that it had to wait several disk
rotations for even the sectors from the same track, as for small steps
it may have missed the next sector. It does not apply for disks which
always read a full track into RAM cache, but even there all tracks are
actually read.

The fact that 5% was not slower than seqscan seems to indicate that
actually all track reads were cached inside the disk or controller.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-03 Thread Mike Benoit
pgbench appears to already support arbitrary SQL queries with the -f
switch, so why couldn't we just make it a little smarter and have people
enable SQL query logging for a day or two, then pass the log off to
pgbench:

pgbench -f <log file>

Seems to me like that wouldn't be too difficult to do, and would give
much closer real-world results than pgbench's built-in benchmark. 

On top of that the community could start offering up template
benchmarks like "busy website", "data warehouse", "forums", "financial",
and distribute them with pgbench:

pgbench -f templates/data_warehouse.pgbench
pgbench -f templates/forums.pgbench
...

From that point a brute force auto-tune utility would be pretty
straightforward to write.

pgautotune -f templates/data_warehouse.bench,myapp.sqllog

Or if one server runs multiple custom apps that you want to tune for:

pgautotune -f myapp1.sqllog,myapp2.sqllog,myapp3.sqllog

Even if it took 48hrs to run, it would be a good burn-in test for a
brand new server. ;)
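
Something along these lines, say (a very rough sketch using psycopg2;
pgautotune and the file names are hypothetical, and a real tool would
obviously need to be smarter about write statements, warm-up, repetitions,
and so on):

import itertools, time
import psycopg2

# Replay a logged workload under each combination of settings and keep
# whichever combination runs it fastest.
queries = [q.strip() for q in open("myapp.sqllog") if q.strip()]
grid = {
    "random_page_cost": [1.0, 2.0, 4.0],
    "cpu_tuple_cost":   [0.01, 0.05],
}

conn = psycopg2.connect("dbname=myapp")
best = None
for combo in itertools.product(*grid.values()):
    cur = conn.cursor()
    for guc, val in zip(grid, combo):
        cur.execute("SELECT set_config(%s, %s, false)", (guc, str(val)))
    start = time.perf_counter()
    for q in queries:
        cur.execute(q)
        if cur.description is not None:   # only statements returning rows
            cur.fetchall()
    elapsed = time.perf_counter() - start
    conn.rollback()                        # undo any side effects of the log
    if best is None or elapsed < best[0]:
        best = (elapsed, dict(zip(grid, combo)))

print("fastest settings:", best[1], "in", round(best[0], 2), "seconds")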


On Fri, 2006-06-02 at 19:38 -0400, Tom Lane wrote:
 Rod Taylor [EMAIL PROTECTED] writes:
  One objection to this is that after moving off the gold standard of
  1.0 = one page fetch, there is no longer any clear meaning to the
  cost estimate units; you're faced with the fact that they're just an
  arbitrary scale.  I'm not sure that's such a bad thing, though.  For
  instance, some people might want to try to tune their settings so that
  the estimates are actually comparable to milliseconds of real time.
 
  Any chance that the correspondence to time could be made a part of the
  design on purpose and generally advise people to follow that rule?
 
 We might eventually get to that point, but I'm hesitant to try to do it
 immediately.  For one thing, I really *don't* want to get bug reports
 from newbies complaining that the cost estimates are always off by a
 factor of X.  (Not but what we haven't gotten some of those anyway :-()
 In the short term I see us sticking to the convention that seq_page_cost
 is 1.0 in a typical database, while anyone who's really hot to try to
 make the other happen is free to experiment.
 
  If we could tell people to run *benchmark* and use those numbers
  directly as a first approximation tuning, it could help quite a bit
  for people new to PostgreSQL experiencing poor performance.
 
 We don't have such a benchmark ... if we did, we could have told
 people how to use it to set the variables already.  I'm very very
 suspicious of any suggestion that it's easy to derive appropriate
 numbers for these settings from one magic benchmark.
 
   regards, tom lane
 
-- 
Mike Benoit [EMAIL PROTECTED]




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-03 Thread Greg Stark

Hannu Krosing [EMAIL PROTECTED] writes:

 Disks can read at full rotation speed, so skipping (not reading) some
 blocks will not make reading the remaining blocks from the same track
 faster. And if there are more than 20 8k pages per track, you still have
 a very high probability you need to read all tracks.

Well, if there are exactly 20 you would expect a 50% chance of having to read
a track so you would expect double the effective bandwidth. It would have to
be substantially bigger to not have any noticeable effect.

 You may be able to move to the next track a little earlier compared to
 reading all blocks, but then you are likely to miss the block from next
 track and have to wait a full rotation.

No, I don't think that works out. You always have a chance of missing the
block from the next track and having to wait a full rotation and your chance
isn't increased or decreased by seeking earlier in the rotation. So you would
expect each track to take 20 block reads less time on average.


 Your test program could have got a little better results, if you had
 somehow managed to tell the system all the block numbers to read in one
 go, not each time the next one after getting the previous one. 

I was trying to simulate the kind of read pattern that postgres generates
which I believe looks like that.

 The fact that 5% was not slower than seqscan seems to indicate that
 actually all track reads were cached inside the disk or controller.

I dunno, your first explanation seemed pretty convincing and doesn't depend on
specific assumptions about the caching. Moreover this doesn't explain why you
*do* get a speedup when reading less than 5%.


Perhaps what this indicates is that the real meat is in track sampling, not
block sampling.


-- 
greg




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Nicolai Petri
Hi All,

Just a small comment from a mortal user.

On Thursday 01 June 2006 19:28, Josh Berkus wrote:
 5. random_page_cost (as previously discussed) is actually a function of
 relatively immutable hardware statistics, and as such should not need to
 exist as a GUC once the cost model is fixed.
It's correct that the hardware statistics are quite immutable - per
tablespace. This suggests to me that the GUC should be a default value only and
then overridable per tablespace. It's quite normal to have primary (current
data) and secondary (historical data) storage for larger databases.
This could also fit nicely with the pg_tmp storage if we tie support for
multiple tmp folders together with tablespaces.


 6. We haven't added any way to estimate rows returned from SRFs.
This would also be very cool - currently the planner can really get annoyed 
when joining SRF functions with tables.

---
Nicolai



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Kenneth Marshall
Josh, Greg, and Tom,

I do not know how sensitive the plans will be to the correlation,
but one thought might be to map the histogram X histogram correlation
to a square grid of values. Then you can map them to an integer which
would give you 8 x 8 with binary values, a 5 x 5 with 4 values per
point, or a 4 x 4 with 8 values per point. If close is good enough,
that would be a compact way to store some histogram cross correlation
information.
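
Something like this toy sketch of the 8 x 8 binary case, just to illustrate
the packing (entirely made up, not from any paper):

# Pack an 8x8 grid of 0/1 cells ("do bucket i of column A and bucket j of
# column B co-occur?") into one 64-bit integer, and read a cell back out.
def pack_grid(grid):                      # grid: 8x8 lists of 0/1
    bits = 0
    for i, row in enumerate(grid):
        for j, cell_value in enumerate(row):
            if cell_value:
                bits |= 1 << (i * 8 + j)
    return bits

def cell(bits, i, j):
    return (bits >> (i * 8 + j)) & 1

diag = [[1 if i == j else 0 for j in range(8)] for i in range(8)]  # perfectly correlated
packed = pack_grid(diag)
print(hex(packed), cell(packed, 3, 3), cell(packed, 3, 4))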

Ken

On Thu, Jun 01, 2006 at 01:50:26PM -0700, Josh Berkus wrote:
 Greg, Tom,
 
 ...
  2) It isn't even clear what data you're exactly looking for. Certainly
 correlation is just shorthand here and isn't what you're actually
  looking for. 
 
 Actually, I'd think that a correlation number estimate (0 = completely 
 uncorrelated, 1 = completely correlated) would be sufficient to improve 
 row count estimation significantly, without incurring the vast overhead of 
 histogram-by-histogram manipulation.
 ...



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Greg Stark

David Fetter [EMAIL PROTECTED] writes:

  In the prior discussions someone posted the paper with the algorithm
  I mentioned.  That paper mentions that previous work showed poor
  results at estimating n_distinct even with sample sizes as large as
  50% or more.
 
 Which paper?  People have referenced several different ones.

Phillip B Gibbons. Distinct Sampling for Highly-Accurate Answers to Distinct
Values Queries and Event Reports. In Proceedings of the 27th VLDB Conference,
Roma, Italy, 2001.

Which says (emphasis in original as italics):

The most well-studied approach for distinct-values estimation is to
collect a uniform random sample S of the data, store S in a database, and
then use S at query time to provide fast, approximate answers to distinct
values queries [22, 23, 27, 21, 29, 5, 28, 18, 19, 9, 7]. However,
previous work [28, 18, 9, 7] has shown powerful negative results on the
quality of distinct-values estimates based on sampling (or other
techniques that examine only part of the input data), even for the simple
case of counting the number of distinct values in a column. The strongest
negative result is due to Charikar et al. [7], who proved that estimating
the number of distinct values in a column to within a small constant
factor (with probability > 1/2) requires that *nearly* *the* *entire*
*data* *set* *be* *sampled*. Moreover, all known sampling-based estimators
provide unsatisfactory results on data sets of interest [7], even for this
simple case.

And later:

Using a variety of synthetic and real-world data sets, we show that
distinct sampling gives estimates for distinct values queries that are
within 0%-10%, whereas previous methods were typically 50%-250% off,
across the spectrum of data sets and queries studied.

Here "distinct sampling" is the algorithm being proposed, which requires
looking at every record and keeping a sample *of the distinct values*. The
"previous methods" are methods based on sampling the records.

I haven't read the citation [7] that proves the strong negative result for any
estimator of distinct values based on sampling. It's:

  M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation
  error guarantees for distinct values. In Proc. 19th ACM Symp. on Principles
  of Database Systems, pages 268-279, May 2000.


  Hopefully you're right that you don't need complete histograms.
  Perhaps there's some statistics concept they don't teach in stats
  101 that would cover this need.  What we're looking for is a
  function f(a,b,x) that gives the net selectivity given a and b, the
  selectivity of two clauses based on two columns, and x some simple
  value that can be easily calculated by looking at the data in
  advance.
 
 That would be neat :)

Doing a bit of basic searching around I think the tool we're looking for here
is called a chi-squared test for independence.

I haven't read up on how it works so I'm unclear if it can be calculated using a
simple O(n) scan or if it would require some non-linear post-processing after
the analyze pass, which would be unfortunate.

And I haven't found anything that describes how to make use of the resulting
number. Does it actually give a formula f(a,b,x) that gives a nice convenient
expected selectivity for the clause?

Oh, and incidentally, at first glance it seems like calculating it depends on
having good distinct value sampling.
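
For what it's worth, the statistic itself is easy to compute from a
two-column contingency table in one pass over the counts; a toy sketch (the
counts are invented):

# Pearson's chi-squared test of independence on a contingency table whose
# cell [i][j] counts rows where column A had its i-th value and column B
# its j-th value.
observed = [
    [90,  5,  5],
    [10, 80, 10],
    [ 5, 10, 85],
]
n = sum(sum(row) for row in observed)
row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_tot[i] * col_tot[j] / n      # expected count if independent
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)
print("chi2 =", round(chi2, 1), "dof =", dof)
# Compare chi2 against the chi-squared distribution with dof degrees of
# freedom to get a p-value; a big statistic means "not independent".  But
# note it answers yes/no -- it doesn't directly hand back the f(a,b,x)
# selectivity function asked about above.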

-- 
greg




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Josh Berkus
Greg,

 Using a variety of synthetic and real-world data sets, we show that
 distinct sampling gives estimates for distinct values queries that
 are within 0%-10%, whereas previous methods were typically 50%-250% off,
 across the spectrum of data sets and queries studied.

Aha.  It's a question of the level of error permissible.   For our 
estimates, being 100% off is actually OK.  That's why I was looking at 5% 
block sampling; it stays within the range of +/- 50% n-distinct in 95% of 
cases.

 Doing a bit of basic searching around I think the tool we're looking for
 here is called a chi-squared test for independence.

Augh.  I wrote a program (in Pascal) to do this back in 1988.   Now I can't 
remember the math.  For a two-column test it's relatively 
computation-light, though, as I recall ... but I don't remember whether 
standard chi-square works with a random sample.


-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

  Using a variety of synthetic and real-world data sets, we show that
  distinct sampling gives estimates for distinct values queries that
  are within 0%-10%, whereas previous methods were typically 50%-250% off,
  across the spectrum of data sets and queries studied.
 
 Aha.  It's a question of the level of error permissible.   For our 
 estimates, being 100% off is actually OK.  That's why I was looking at 5% 
 block sampling; it stays within the range of +/- 50% n-distinct in 95% of 
 cases.

Well, let's see. Say for example we're scanning 50,000 records out of 1M
records. Using the Theorem from Charikar et al quoted as Theorem 1 in section
2 we get that there is at least a 5% chance of getting a result with a ratio
error of at least sqrt((1M-50k)/(2*50k) * ln(20)), or about 5.33.

So no, even assuming you have a good unbiased sample, a 5% sample is only
going to get you to a result within 19%-533% of the real values 95% of the
time. Not 50-250%. And that's a negative result, so it's *at best* 19-533%. On
some distributions it may be outside that range even more than 95% of the
time.
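
(Checking my arithmetic in Python, with the bound written out as I read it
from the paper:)

from math import sqrt, log

# Theorem 1 (negative result), as I read it: sampling r of n rows, with
# probability at least delta the ratio error is at least
# sqrt((n - r) / (2 * r) * ln(1 / delta)).
n, r, delta = 1_000_000, 50_000, 0.05
print(sqrt((n - r) / (2 * r) * log(1 / delta)))   # ~5.33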

This is entirely unlike the histograms where we have a solid foundation for a
positive result. We can guarantee that the result will be outside +/- x% *at
most* 5% of the time. For most distributions it'll be outside that range even
less.

And a 5% sample is pretty big. In fact my tests earlier showed the i/o from
5% block sampling took just as long as reading all the blocks. Even if we
figure out what's causing that (IMHO surprising) result and improve matters I
would only expect it to be 3-4x faster than a full scan.

http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php

-- 
greg




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Michael Dean

Greg Stark wrote:


Josh Berkus josh@agliodbs.com writes:

 


   Using a variety of synthetic and real-world data sets, we show that
   distinct sampling gives estimates for distinct values queries that
are within 0%-10%, whereas previous methods were typically 50%-250% off,
across the spectrum of data sets and queries studied.
 

Aha.  It's a question of the level of error permissable.   For our 
estimates, being 100% off is actually OK.  That's why I was looking at 5% 
block sampling; it stays within the range of +/- 50% n-distinct in 95% of 
cases.
   



Well, let's see. Say for example we're scanning 50,000 records out of 1M
records. Using the Theorem from Charikar et al quoted as Theorem 1 in section
2 we get that there is at least a 5% chance of getting a result with a ratio
error at least sqrt((1M-50k)/100k ln(20)) or 5.33.

So no, even assuming you have a good unbiased sample, a 5% sample is only
going to get you to a result within 19%-533% of the real values 95% of the
time. Not 50-250%. And that's a negative result, so it's *at best* 19-533%. On
some distributions it may be outside that range even more than 95% of the
time.

This is entirely unlike the histograms where we have a solid foundation for a
positive result. We can guarantee that the result will be outside +/- x% *at
most* 5% of the time. For most distributions it'll be outside that range even
less.

And a 5% sample is a pretty big. In fact my tests earlier showed the i/o from
5% block sampling took just as long as reading all the blocks. Even if we
figure out what's causing that (IMHO surprising) result and improve matters I
would only expect it to be 3-4x faster than a full scan.

http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php

 

I'm sorry to interrupt your esoteric (to me) discussion, but I have a 
very simple question:  would you define a "good unbiased sample"?  My 
statistics professor Dan Price (rest his soul) would tell me there are 
only random samples of some sort, and "other", which are always biased, 
and never good.  Excuse my absolutes, I didn't mean Good or Evil.  And 
how do you calculate a level of error without randomization? And are you 
talking of type I or type II error?

Michael



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 And a 5% sample is a pretty big. In fact my tests earlier showed the i/o from
 5% block sampling took just as long as reading all the blocks. Even if we
 figure out what's causing that (IMHO surprising) result and improve matters I
 would only expect it to be 3-4x faster than a full scan.

One way to reduce the I/O pain from extensive sampling would be to turn
VACUUM ANALYZE into a genuine combined operation instead of a mere
notational shorthand for two separate scans.

I'd still be worried about the CPU pain though.  ANALYZE can afford to
expend a pretty fair number of cycles per sampled tuple, but with a
whole-table sample that's going to add up.

regards, tom lane



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread David Fetter
On Fri, Jun 02, 2006 at 01:39:32PM -0700, Michael Dean wrote:

 I'm sorry to interrupt your esoteric (to me) discussion, but I have
 a very simple question:  would you define a good unbiased sample?
 My statistics professor Dan Price (rest his soul) would tell me
 there are only random samples of some sort, and other, which are
 always biased, and never good.

What's at issue here is a biased estimator, not a biased sample.  If
you know an unbiased estimator of multiplicity based on a random
sample, that would be great :)

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778AIM: dfetter666
  Skype: davidfetter

Remember to vote!



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 Greg Stark [EMAIL PROTECTED] writes:
  And a 5% sample is pretty big. In fact my tests earlier showed the i/o from
  5% block sampling took just as long as reading all the blocks. Even if we
  figure out what's causing that (IMHO surprising) result and improve matters I
  would only expect it to be 3-4x faster than a full scan.
 
 One way to reduce the I/O pain from extensive sampling would be to turn
 VACUUM ANALYZE into a genuine combined operation instead of a mere
 notational shorthand for two separate scans.

Except that we're also looking for every way we can to avoid having vacuum
have to touch every page, so that large tables with narrow hot spots can still
afford to be vacuumed regularly during peak hours.

But for most users analyze doesn't really have to run as often as vacuum. One
sequential scan per night doesn't seem like that big a deal to me.

 I'd still be worried about the CPU pain though.  ANALYZE can afford to
 expend a pretty fair number of cycles per sampled tuple, but with a
 whole-table sample that's going to add up.

That is a concern. Though the algorithm is pretty efficient, it basically
amounts to hashing all the tuples, keeping only a limited number of hash
buckets, and just throwing away the rest of the data. 
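
Roughly this flavour, if it helps to see it (a toy version of hash-based
distinct sampling as I understand it; the real algorithm in the paper keeps
more bookkeeping per kept value):

import hashlib, random

def h64(v):
    return int.from_bytes(hashlib.blake2b(str(v).encode(), digest_size=8).digest(), "big")

def trailing_zeros(x):
    return (x & -x).bit_length() - 1 if x else 64

def estimate_ndistinct(values, cap=1024):
    # Keep a value iff its hash has at least `level` trailing zero bits,
    # i.e. with probability 2**-level; when the kept set overflows, raise
    # the level and thin it out.  Each survivor then stands for 2**level
    # distinct values.
    level, kept = 0, set()
    for v in values:
        if trailing_zeros(h64(v)) >= level:
            kept.add(v)
            if len(kept) > cap:
                level += 1
                kept = {k for k in kept if trailing_zeros(h64(k)) >= level}
    return len(kept) * 2 ** level

data = [random.randint(0, 50_000) for _ in range(300_000)]
print(estimate_ndistinct(data), "estimated vs", len(set(data)), "actual")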

Things could get more complicated if we get to more complex stats than simple
n_distinct estimates and performing dependence tests on the data for
multi-column statistics.

-- 
greg




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Josh Berkus
Greg, Tom,

 But for most users analyze doesn't really have to run as often as
 vacuum. One sequential scan per night doesn't seem like that big a deal
 to me.

Clearly you don't have any 0.5 TB databases.  

  I'd still be worried about the CPU pain though.  ANALYZE can afford to
  expend a pretty fair number of cycles per sampled tuple, but with a
  whole-table sample that's going to add up.

Agreed.  Despite conventional wisdom, most PostgreSQL databases ... even 
those with high level OLTP or very large DW ... are CPU-bound.  We 
really don't want an ANALYZE which is an order-of-magnitude increase in 
CPU activity.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Todd A. Cook

Josh Berkus wrote:

Greg, Tom,


But for most users analyze doesn't really have to run as often as
vacuum. One sequential scan per night doesn't seem like that big a deal
to me.


Clearly you don't have any 0.5 TB databases.  


Perhaps something like ANALYZE FULL?  Then only those who need the
more precise statistics would pay the cost for a full scan.

-- todd



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Tom Lane
I wrote:
 In general it seems to me that for CPU-bound databases, the default values
 of the cpu_xxx_cost variables are too low.  ... rather than telling people
 to manipulate all three of these variables individually, I think it might
 also be a good idea to provide a new GUC variable named something like
 cpu_speed_scale that would just be a multiplier for the other variables.
 It would default to 1.0 and our standard advice for CPU-bound databases
 would be decrease random_page_cost to 1.0 and raise cpu_speed_scale to
 10.0 or so.  Seems cleaner than telling people to muck with three or so
 individual values.

Nicolai Petri's comment about per-tablespace access costs caused me to
rethink the above proposal.  Instead of inventing cpu_speed_scale,
which seems rather baroque after thinking about it more, what I now
think we should do is invent a seq_page_cost GUC to replace the
traditionally hardwired value of 1.0 cost unit per sequential page
fetch.  Then, if you've got different tablespaces with different disk
speeds, you could imagine having per-tablespace values of seq_page_cost
and random_page_cost, whereas you probably want the CPU cost numbers
to remain the same across all tables.

I don't really want to get into inventing per-tablespace settings right
now, because the need hasn't been demonstrated; but if we ever do want
to do it, this approach will be a whole lot less confusing than
something involving a cpu_speed_scale knob.

This still leaves you twiddling two knobs (now random_page_cost and
seq_page_cost) if you want to set up the planner for an all-in-memory
database.  So it's not any more complicated for that purpose.

One objection to this is that after moving off the gold standard of
1.0 = one page fetch, there is no longer any clear meaning to the
cost estimate units; you're faced with the fact that they're just an
arbitrary scale.  I'm not sure that's such a bad thing, though.  For
instance, some people might want to try to tune their settings so that
the estimates are actually comparable to milliseconds of real time.

Comments?

regards, tom lane



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

 Greg, Tom,
 
  But for most users analyze doesn't really have to run as often as
  vacuum. One sequential scan per night doesn't seem like that big a deal
  to me.
 
 Clearly you don't have any 0.5 TB databases.  

Actually I did, not so long ago. 

Sequential scans in an OLTP query would be a disaster. But a single sequential
scan run at a controlled time wouldn't concern me as long as *all* of the
following constraints are met:

 a) You can run them at your leisure at off-peak times when your i/o bandwidth
    isn't in short supply.

 b) You don't need the results urgently so you don't care if it takes a while
    to run.

 c) You don't need many of them at the same time.

Even on your production system surely you occasionally, say, take a full
backup or run select count(*) or other sanity checks on the data?

   I'd still be worried about the CPU pain though.  ANALYZE can afford to
   expend a pretty fair number of cycles per sampled tuple, but with a
   whole-table sample that's going to add up.
 
 Agreed.  Despite conventional wisdom, most PostgreSQL databases ... even 
 those with high level OLTP or very large DW ... are CPU-bound.  We 
 really don't want an ANALYZE which is an order-of-magnitude increase in 
 CPU activity.

I don't think Tom was actually expressing concern about ANALYZE becoming more
expensive, but about tying ANALYZE and VACUUM together and making VACUUM more
expensive. VACUUM is something we want to encourage people to think they can
run all day long, not just occasionally.

-- 
greg




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Rod Taylor
 One objection to this is that after moving off the gold standard of
 1.0 = one page fetch, there is no longer any clear meaning to the
 cost estimate units; you're faced with the fact that they're just an
 arbitrary scale.  I'm not sure that's such a bad thing, though.  For
 instance, some people might want to try to tune their settings so that
 the estimates are actually comparable to milliseconds of real time.

Any chance that the correspondence to time could be made a part of the
design on purpose and generally advise people to follow that rule? If we
could tell people to run *benchmark* and use those numbers directly as a
first approximation tuning, it could help quite a bit for people new to
PostgreSQL experiencing poor performance.

effective_cache_size then becomes essentially the last hand-set variable
for medium sized installations.
-- 




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Tom Lane
Rod Taylor [EMAIL PROTECTED] writes:
 One objection to this is that after moving off the gold standard of
 1.0 = one page fetch, there is no longer any clear meaning to the
 cost estimate units; you're faced with the fact that they're just an
 arbitrary scale.  I'm not sure that's such a bad thing, though.  For
 instance, some people might want to try to tune their settings so that
 the estimates are actually comparable to milliseconds of real time.

 Any chance that the correspondence to time could be made a part of the
 design on purpose and generally advise people to follow that rule?

We might eventually get to that point, but I'm hesitant to try to do it
immediately.  For one thing, I really *don't* want to get bug reports
from newbies complaining that the cost estimates are always off by a
factor of X.  (Not but what we haven't gotten some of those anyway :-()
In the short term I see us sticking to the convention that seq_page_cost
is 1.0 in a typical database, while anyone who's really hot to try to
make the other happen is free to experiment.

 If we could tell people to run *benchmark* and use those numbers
 directly as a first approximation tuning, it could help quite a bit
 for people new to PostgreSQL experiencing poor performance.

We don't have such a benchmark ... if we did, we could have told
people how to use it to set the variables already.  I'm very very
suspicious of any suggestion that it's easy to derive appropriate
numbers for these settings from one magic benchmark.

regards, tom lane



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Josh Berkus

Tom,

As you know, this is something I think about a bit too, though not 
nearly as deeply as you.



In general it seems to me that for CPU-bound databases, the default values
of the cpu_xxx_cost variables are too low.  I am tempted to raise the
default value of cpu_index_tuple_cost to 0.005, which would be half of
cpu_tuple_cost and twice cpu_operator_cost; that seems more in line with
my feel for the relative costs of things.  For a CPU-bound database all
three would need to go up more than that.  But rather than telling people
to manipulate all three of these variables individually, I think it might
also be a good idea to provide a new GUC variable named something like
cpu_speed_scale that would just be a multiplier for the other variables.


Yes, please!  I've done a bit of playing around with the cpu_* 
variables, and frankly have always manipulated them by the same multiple.


Also, my testing has shown that on *large* databases (20x RAM +) with 
fast processors (Opteron) the cpu_* values are too high.  But a lot of 
that is offsetting estimates which are too high elsewhere ... as is 
random_page_cost.




I've desisted from touching this mainly because the obvious sanity
adjustments, such as adding something for tree descent and charging
random_page_cost not 1.0 for leaf page touches, would increase the
estimated cost of index usage, which seemed like not the direction we
wanted to go in.  So I figured that the errors were more or less
cancelling each other.  But the addition of bitmap index scans is now
exposing the weaknesses, so we need to face up to the problem.


Yeah.  I've refrained from proposing changes because it's a 
pick-up-sticks.  If we start modifying the model, we need to fix 
*everything*, not just one item.  And then educate our users that they 
need to use the GUC variables in a different way.  Here's the issues I see:


1. n-distinct estimation is bad, as previously discussed;

2. our current heuristic sampling methods prevent us from sampling more 
than 0.5% of any reasonably large table, causing all statistics on those 
tables to be off for any table with irregular distribution of values;


3. We don't have any method to analyze inter-column correlation within a 
table;


4. We don't keep statistics on foreign key correlation;

5. random_page_cost (as previously discussed) is actually a function of 
relatively immutable hardware statistics, and as such should not need to 
exist as a GUC once the cost model is fixed.


6. We haven't added any way to estimate rows returned from SRFs.


I am inclined to think that a reasonable model is to continue to estimate
the number of index leaf pages touched as above (pro-rating against the
total number of index entries), to charge random_page_cost per leaf page
touched, and not to count anything for metapage or tree descent.  I
justify the latter on the grounds that upper tree levels will tend to stay
in cache because they're touched so often.  random_page_cost per leaf page
may be an overestimate, since sometimes logically adjacent leaf pages will
be physically adjacent too, but not charging for tree descent should help
to cancel that out.  With this model, the disk cost to fetch a single
index entry will be estimated as random_page_cost (default 4.0) rather
than the current fixed 2.0.  This shouldn't hurt things too much for
simple indexscans --- especially since anyone running with a reduced
random_page_cost won't see as much change.  And it will increase the costs
for bitmap scans that cross many index pages, which is what we need in
light of Philippe's numbers.


Per above, I think anything involving random_page_cost is the wrong way 
to go.  RPC is a GUC assigned by the DBA on pure guesswork.  We should 
be decreasing its use or eliminating it entirely, not increasing the 
amount we use it.


For btrees, we should be able to accurately estimate the cost of the 
index access given:

a) the depth of the tree;
b) the likelihood that requested values are adjacent;
c) whether the index and tables are already in shared_mem, or else (d) 
and (e) below:
d) the probability that the index pages are cached in memory, which is 
composed of:

(1) the frequency with which that index is accessed, modified by
(2) whether the index is a fraction of available ram, or larger than ram
e) the probability that the relevant table pages are cached in memory, 
determined by the same two factors.
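
Purely as a strawman of how those inputs might combine (invented numbers,
nothing like a real planner formula; the adjacency factor (b) is ignored
here for brevity):

def btree_probe_ms(depth, leaf_pages, heap_pages,
                   p_index_cached, p_heap_cached,
                   seek_ms=8.0, cached_ms=0.01):
    # Expected time for the page fetches of one index probe: a cached page
    # is nearly free, an uncached one costs a seek; upper tree levels are
    # assumed to be almost always cached, per (a) and (d).
    def pages(n, p_cached):
        return n * (p_cached * cached_ms + (1 - p_cached) * seek_ms)
    return (pages(depth - 1, 0.99)            # inner/root pages
            + pages(leaf_pages, p_index_cached)
            + pages(heap_pages, p_heap_cached))

# A 3-level index, 5 leaf pages and 40 heap pages touched, index mostly
# cached, heap mostly not:
print(round(btree_probe_ms(3, 5, 40, p_index_cached=0.9, p_heap_cached=0.2), 1))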


*None* of this should involve a userset parameter (random_page_cost) 
since as you pointed out months ago, the ratio of seq/seek access has 
remained relatively constant over the past 6 years of HDD and I/O 
engineering.


Sadly, I don't have the math for all of the above but I'm hoping someone 
(either here or on ACM) does.


Also, this would require us to have different *formulas* and not merely 
different cost factors for different kinds of indexes.  I believe that 
this is something that Jie is already struggling with.  Jie?



The big difficulty in modeling cache effects from multiple scans is that
we usually don't know how many times the index will be scanned.


I think we can work around this by figuring out whether the index has 
already been scanned in the plan, something we *can* know.  So the first 
scan will be full cost and remaining scans will be fractional cost. 

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

 1. n-distinct estimation is bad, as previously discussed;
 
 2. our current heuristic sampling methods prevent us from sampling more than
 0.5% of any reasonably large table, causing all statistics on those tables to
 be off for any table with irregular distribution of values;

I'm convinced these two are more connected than you believe. For distributions
of values in a range, using small samples is on a solid statistical basis. It's
the same reason poll results based on only a few hundred people can accurately
predict elections in a country with hundreds of millions of voters.

However for things like estimating discrete values there's absolutely no solid
foundation for these small samples. Your accuracy is going to be simply
proportional to the sample size, meaning you can't get anything trustworthy
without sampling large enough portions of the table that a full sequential
scan would be just as fast.

I might be interested in implementing that algorithm that was posted a while
back that involved generating good unbiased samples of discrete values. The
algorithm was quite clever and well described and paper claimed impressively
good results.

However it will only make sense if people are willing to accept that analyze
will need a full table scan -- at least for tables where the DBA knows that
good n_distinct estimates are necessary.

 3. We don't have any method to analyze inter-column correlation within a 
 table;
 
 4. We don't keep statistics on foreign key correlation;

Gosh these would be nice but they sound like hard problems. Has anybody even
made any headway in brainstorming how to tackle them?

 5. random_page_cost (as previously discussed) is actually a function of
 relatively immutable hardware statistics, and as such should not need to exist
 as a GUC once the cost model is fixed.

I don't think that's true at all. Not all hardware is the same.

Certainly the need to twiddle this GUC should be drastically reduced if the
cache effects are modelled properly and the only excuses left are legitimate
hardware differences.

-- 
greg




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Josh Berkus
Greg,

 I'm convinced these two are more connected than you believe.

Actually, I think they are inseparable.

 I might be interested in implementing that algorithm that was posted a
 while back that involved generating good unbiased samples of discrete
 values. The algorithm was quite clever and well described and paper
 claimed impressively good results.

 However it will only make sense if people are willing to accept that
 analyze will need a full table scan -- at least for tables where the DBA
 knows that good n_distinct estimates are necessary.

What about block-based sampling?   Sampling 1 in 20 disk pages, rather than 
1 in 20 rows, should require significantly less scanning, and yet give us 
a large enough sample for reasonable accuracy.

  3. We don't have any method to analyze inter-column correlation within
  a table;
 
  4. We don't keep statistics on foreign key correlation;

 Gosh these would be nice but they sound like hard problems. Has anybody
 even made any headway in brainstorming how to tackle them?

There's no time like the present!

Actually, these  both seem like fairly straightforward problems 
storage-wise.  The issue is deriving the statistics, for tables with many 
columns or FKs.  

  5. random_page_cost (as previously discussed) is actually a function
  of relatively immutable hardware statistics, and as such should not
  need to exist as a GUC once the cost model is fixed.

 I don't think that's true at all. Not all hardware is the same.

 Certainly the need to twiddle this GUC should be drastically reduced if
 the cache effects are modelled properly and the only excuses left are
 legitimate hardware differences.

OK ... but still, it should become a little knob rather than the big 
knob it is currently.


-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Tom Lane
Josh Berkus josh@agliodbs.com writes:
 Yeah.  I've refrained from proposing changes because it's a 
 pick-up-sticks.  If we start modifying the model, we need to fix 
 *everything*, not just one item.  And then educate our users that they 
 need to use the GUC variables in a different way.  Here's the issues I see:

 [ various deficiencies in our statistics gathering ]

Those are all valid points, but pretty much orthogonal to what I'm on
about today.  The problems I'm thinking about are seen even when the
planner's rowcount estimates are dead on (as indeed they mostly were
in Philippe's example).

 5. random_page_cost (as previously discussed) is actually a function of 
 relatively immutable hardware statistics, and as such should not need to 
 exist as a GUC once the cost model is fixed.

Well, it'll still exist as a GUC for the same reasons the other ones are
GUCs.  But I think the main reasons people have needed to twiddle it are
(1) their database is completely RAM-resident (where RPC
*should* logically be 1.0), or
(2) they're trying to compensate for the overestimation of
nestloop indexscans.
The changes I'm proposing should help with (2).

 For btrees, we should be able to accurately estimate the cost of the 
 index access given:
 a) the depth of the tree;

Although logically the tree descent *ought* to be a factor, I haven't
seen any evidence that we need to take it into account; in fact all the
evidence points to the conclusion that we're better off ignoring it.
My theory about why is that the upper levels of the tree stay in cache.
I have to admit that's just handwaving, but counting additional disk
hits to fetch the first index tuple is clearly not the direction the
cost model needs to go in.  Look again at the examples in Philippe's
output: an indexscan fetching 1700 tuples (about 5 leaf pages probably)
took 32x longer than one fetching 7 tuples (presumably all on the same
leaf page).  There's no way that a model that counts an additional
couple of page fetches for tree descent is going to arrive at those
numbers.  And we see this over and over again: incredibly small times
to fetch a single index tuple, at least on the average when the index
is being hit many times in one query.

 c) whether the index and tables are already in shared_mem, or else (d) 
 and (e) below:
 d) the probability that the index pages are cached in memory, which is 
 composed of:
   (1) the frequency with which that index is accessed, modified by
   (2) whether the index is a fraction of available ram, or larger than ram
 e) the probability that the relevant table pages are cached in memory, 
 determined by the same two factors.

These would all be nice things to know, but I'm afraid it's pie in the
sky.  We have no reasonable way to get those numbers.  (And if we could
get them, there would be another set of problems, namely plan instability:
the planner's choices would become very difficult to reproduce.)

 The big difficulty in modeling cache effects from multiple scans is that
 we usually don't know how many times the index will be scanned.

 I think we can work around this by figuring out whether the index has 
 already been scanned in the plan, something we *can* know.  So the first 
 scan will be full cost and remaining scans will be fractional cost. 

Uh, no, because what we're normally thinking about is independent probes
into the index with different keys.  For a small number of probes you
have to figure that those all hit different leaf pages and aren't going
to amortize well.  As the number of probes goes up, you can start
charging fractional costs because it's more likely you're hitting a leaf
page you hit recently.  The Mackert/Lohman equations do this reasonably
well --- it's possible we can find something better, but those seem like
a good place to start.  The immediate problem is to get an
infrastructure in place that gives us some numbers to apply equations to.
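
For reference, the Mackert/Lohman approximation for total pages fetched is
simple enough to play with directly (my transcription from memory of the
published formula; double-check against the paper before relying on the
exact form):

def pages_fetched(T, N, s, b):
    # T = pages in table, N = tuples in table, s = fraction of tuples
    # fetched (via the index), b = buffer pages available.
    Ns = N * s
    if T <= b:
        return min(2 * T * Ns / (2 * T + Ns), T)
    lim = 2 * T * b / (2 * T - b)
    if Ns <= lim:
        return 2 * T * Ns / (2 * T + Ns)
    return b + (Ns - lim) * (T - b) / T

# 100,000 probes into a 10,000-page table: with a tiny cache nearly every
# probe costs a physical read; once the whole table fits, each page is
# read at most once.
for b in (1_000, 5_000, 10_000):
    print(b, round(pages_fetched(T=10_000, N=1_000_000, s=0.1, b=b)))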

regards, tom lane



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

  However it will only make sense if people are willing to accept that
  analyze will need a full table scan -- at least for tables where the DBA
  knows that good n_distinct estimates are necessary.
 
 What about block-based sampling?   Sampling 1 in 20 disk pages, rather than 
  1 in 20 rows, should require significantly less scanning, and yet give us 
 a large enough sample for reasonable accuracy.

a) We already use block based sampling to reduce overhead. If you're talking
   about using the entire block and not just randomly sampled tuples from
   within those blocks then your sample will be biased.

b) It *still* wouldn't give you reasonable accuracy for n_distinct estimates.
   Your probability of being accurate for discrete processes like n_distinct
   is going to be more or less proportional to your sample. So sampling 5% of
   the table gives you only a 5% chance of an accurate prediction. More
   or less.

 Actually, these  both seem like fairly straightforward problems 
 storage-wise.  The issue is deriving the statistics, for tables with many 
 columns or FKs.  

Well there are problems on many levels:

1) You have n^2 possible two-column combinations. That's a lot of processing
   and storage.

2) It isn't even clear what data you're exactly looking for. Certainly
   correlation is just shorthand here and isn't what you're actually looking
   for. You need something like a complete histogram for the dependent column
   for every possible value of the forcing column. That would be an enormous
   amount of storage.

 OK ... but still, it should become a little knob rather than the big 
 knob it is currently.

Sure, the more accurately you model the cache the less need there would be to
adjust random_page_cost. I'm not sure how accurate Postgres can really get
unless it switches philosophies towards using O_DIRECT and doing its own
caching and scheduling. But it could certainly be much better than today.
Tom's comment makes it look like there's hope for pretty accurate cache
modelling for nested loop plans.

-- 
greg




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Josh Berkus
Greg, Tom,

 a) We already use block based sampling to reduce overhead. If you're
 talking about using the entire block and not just randomly sampled
 tuples from within those blocks then your sample will be biased.

There are actually some really good equations to work with this, estimating 
both row correlation and n-distinct based on sampling complete blocks.  
See prior discussions on this list on N-distinct.

 1) You have n^2 possible two-column combinations. That's a lot of
 processing and storage.

Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

 2) It isn't even clear what data you're exactly looking for. Certainly
correlation is just shorthand here and isn't what you're actually
 looking for. 

Actually, I'd think that a correlation number estimate (0 = completely
uncorrelated, 1 = completely correlated) would be sufficient to improve
row count estimation significantly, without incurring the vast overhead of
histogram × histogram manipulation.

 Those are all valid points, but pretty much orthogonal to what I'm on
 about today.  The problems I'm thinking about are seen even when the
 planner's rowcount estimates are dead on (as indeed they mostly were
 in Philippe's example).

OK, I was afraid they were interdependent.  You would know better than me.

 Well, it'll still exist as a GUC for the same reasons the other ones are
 GUCs.  But I think the main reasons people have needed to twiddle it are
   (1) their database is completely RAM-resident (where RPC
   *should* logically be 1.0), or
   (2) they're trying to compensate for the overestimation of
   nestloop indexscans.
 The changes I'm proposing should help with (2).

Right.  What I'm saying is that (1) should be derived from 
estimated_cache_size and DBSIZE, not by setting an additional GUC.

  (1) the frequency with which that index is accessed, modified by
  (2) whether the index is a fraction of available ram, or larger than ram
  e) the probability that the relevant table pages are cached in
  memory, determined by the same two factors.

 These would all be nice things to know, but I'm afraid it's pie in the
 sky.  We have no reasonable way to get those numbers.  (And if we could
 get them, there would be another set of problems, namely plan
 instability: the planner's choices would become very difficult to
 reproduce.)

Hmmm ... I think those numbers are possible to infer, at least on a gross 
level.  Whether the cost of calculating them outweighs the benefit of 
having them is another question, resolvable only through some 
experimentation.

 a good place to start.  The immediate problem is to get an
 infrastructure in place that gives us some numbers to apply equations
 to.

No arguments, of course.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Josh Berkus
Greg,

  1) You have n^2 possible two-column combinations. That's a lot of
  processing and storage.

 Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

Ooops, bad math.  Andrew pointed out it's actually n*(n-1)/2, not n!.
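
Spelled out: the number of unordered column pairs is

\[
\binom{n}{2} = \frac{n(n-1)}{2},
\]

so a 50-column table, for example, has 50*49/2 = 1225 pairs to consider.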

Also, we could omit columns unlikely to correlate, such as large text
columns, bytea and numerics with high precision.  Also, we probably don't
need to correlate UNIQUE columns inside ... I think.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Jim C. Nasby
On Thu, Jun 01, 2006 at 02:25:56PM -0400, Greg Stark wrote:
 
 Josh Berkus josh@agliodbs.com writes:
 
  1. n-distinct estimation is bad, as previously discussed;
  
  2. our current heuristics sampling methods prevent us from sampling more 
  than
  0.5% of any reasonably large table, causing all statistics on those tables 
  to
  be off for any table with irregular distribution of values;
 
 I'm convinced these two are more connected than you believe. For distributions
 of values in a range, using small samples is on a solid statistical basis. It's
 the same reason poll results based on only a few hundred people can accurately
 predict elections in a country with hundreds of millions of voters.
 
 However for things like estimating discrete values there's absolutely no solid
 foundation for these small samples. Your accuracy is going to be simply
 proportional to the sample size, meaning you can't get anything trustworthy
 without sampling large enough portions of the table that a full sequential
 scan would be just as fast.
 
 I might be interested in implementing that algorithm that was posted a while
 back that involved generating good unbiased samples of discrete values. The
 algorithm was quite clever and well described, and the paper claimed impressively
 good results.
 
 However it will only make sense if people are willing to accept that analyze
 will need a full table scan -- at least for tables where the DBA knows that
 good n_distinct estimates are necessary.
 
There *might* be an alternative...

Suppose that the executor could keep track of what values it's seeing
from a table as it's actually running a query. These could be used to
build statistical information without actually running analyze.

Of course, the problem is that you could end up with some seriously
skewed statistics, depending on what your query patterns are. But in
some ways that's not bad; you're optimizing for the queries you most
often run.

One possible way to handle this would be to keep track of how many times
each value has been seen, as well as when it was last seen, so you have
some idea of the quality of the data. Another idea is to somehow blend
these stats with the traditional method.
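
A minimal sketch of the bookkeeping this would need, purely hypothetical
(neither the structure nor the function exists in the backend; the names are
invented):

#include <stdint.h>
#include <time.h>

/*
 * Hypothetical sketch only: a per-column tally the executor could bump as it
 * emits tuples, so that later statistics code could blend these observed
 * counts with conventionally sampled statistics, weighting by recency.
 */
typedef struct ExecObservedValue
{
    uint64_t    value_hash;     /* hash of the observed column value */
    uint64_t    times_seen;     /* how many times the executor has seen it */
    time_t      last_seen;      /* recency, a crude quality indicator */
} ExecObservedValue;

/* Record one more sighting of a value (hash-table lookup/insert omitted). */
static void
exec_observe_value(ExecObservedValue *entry)
{
    entry->times_seen += 1;
    entry->last_seen = time(NULL);
}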

  3. We don't have any method to analyze inter-column correlation within a 
  table;
  
  4. We don't keep statistics on foreign key correlation;
 
 Gosh these would be nice but they sound like hard problems. Has anybody even
 made any headway in brainstorming how to tackle them?
 
A *huge* improvement would be gathering statistics on all multi-column
indexes. Some of the stats might not make sense in this context, but
others (such as correlation) would.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Jim C. Nasby
On Thu, Jun 01, 2006 at 03:15:09PM -0400, Tom Lane wrote:
 These would all be nice things to know, but I'm afraid it's pie in the
 sky.  We have no reasonable way to get those numbers.  (And if we could
 get them, there would be another set of problems, namely plan instability:
 the planner's choices would become very difficult to reproduce.)

Speaking of plan instability, something that's badly needed is the
ability to steer away from query plans that *might* be the most optimal,
but also will fail horribly should the cost estimates be wrong. People
generally don't care about getting the absolutely most optimal plan;
they do care about NOT getting a plan that's horribly bad. One possible
way to do this would be to have the estimator calculate a worst-case
cost to go along with the best case cost, or maybe even 3 numbers:
ideal, what we think will happen, and worst-case. I know that index
scans already have that information, at least in the form of cost for a
plan with 0 correlation and one with perfect correlation.
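
A hypothetical illustration of the three-number idea (the type, the fields,
and the 10% slop factor below are invented for this sketch, not anything the
planner actually has):

/*
 * Hypothetical sketch: carry a cost range instead of a single number, so a
 * plan whose worst case is catastrophic can lose to a plan with a slightly
 * higher expected cost but a much safer worst case.
 */
typedef struct CostRange
{
    double      best_case;      /* e.g. perfectly correlated index scan */
    double      expected;       /* what the planner estimates today */
    double      worst_case;     /* e.g. zero correlation, nothing cached */
} CostRange;

/* Return the preferred plan: risk-averse within a 10% expected-cost band. */
static const CostRange *
cost_range_prefer(const CostRange *a, const CostRange *b)
{
    if (a->expected <= b->expected * 1.10 && a->worst_case < b->worst_case)
        return a;
    if (b->expected <= a->expected * 1.10 && b->worst_case < a->worst_case)
        return b;
    return (a->expected <= b->expected) ? a : b;    /* fall back to expected */
}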

Does anyone have any experience developing genetic algorithms? I think
coming up with cost estimators might be an ideal use for that
technology...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

 Greg, Tom,
 
  a) We already use block based sampling to reduce overhead. If you're
  talking about using the entire block and not just randomly sampled
  tuples from within those blocks then your sample will be biased.
 
 There are actually some really good equations to work with this, estimating 
 both row correlation and n-distinct based on sampling complete blocks.  
 See prior discussions on this list on N-distinct.

In the prior discussions someone posted the paper with the algorithm I
mentioned. That paper mentions that previous work showed poor results at
estimating n_distinct even with sample sizes as large as 50% or more.

  1) You have n^2 possible two-column combinations. That's a lot of
  processing and storage.
 
 Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

You have O(n^2) possible *two-column* combinations and O(n!) n-column
combinations. I was hoping two-column combinations would be enough information
to extrapolate from for larger combinations.

  2) It isn't even clear what data you're exactly looking for. Certainly
 correlation is just shorthand here and isn't what you're actually
  looking for. 
 
 Actually, I'd think that a correlation number estimate (0 = completely
 uncorrelated, 1 = completely correlated) would be sufficient to improve
 row count estimation significantly, without incurring the vast overhead of
 histogram × histogram manipulation.

No, correlation is actually quite uninteresting here. Consider for example two
columns "state" and "area code". The area code is pretty much 100% dependent
on state, but will show very little correlation. Similarly things like
"product name" and "catalog number" or even "author" and "genre" would be
problem cases but have little correlation. And you can easily imagine queries
that specify restrictive clauses on two such columns for which the existing
statistics will overestimate the selectivity because it assumes no
interdependency.

Hopefully you're right that you don't need complete histograms. Perhaps
there's some statistics concept they don't teach in stats 101 that would cover
this need. What we're looking for is a function f(a,b,x) that gives the net
selectivity given a and b, the selectivity of two clauses based on two
columns, and x some simple value that can be easily calculated by looking at
the data in advance.
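
One illustrative functional form -- not something proposed in this thread,
just the simplest shape with the right endpoints -- would interpolate between
the independence assumption and full dependence:

\[
f(a, b, x) = (1 - x)\,ab + x\,\min(a, b), \qquad 0 \le x \le 1,
\]

where x = 0 reproduces today's independent-clauses estimate a*b and x = 1 says
the more selective clause subsumes the other. The hard part remains finding an
x that is cheap to compute in advance.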

-- 
greg




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread David Fetter
On Thu, Jun 01, 2006 at 08:36:16PM -0400, Greg Stark wrote:
 
 Josh Berkus josh@agliodbs.com writes:
 
  Greg, Tom,
  
   a) We already use block based sampling to reduce overhead.  If
   you're talking about using the entire block and not just
   randomly sampled tuples from within those blocks then your
   sample will be biased.
  
  There are actually some really good equations to work with this,
  estimating both row correlation and n-distinct based on sampling
  complete blocks.  See prior discussions on this list on
  N-distinct.
 
 In the prior discussions someone posted the paper with the algorithm
 I mentioned.  That paper mentions that previous work showed poor
 results at estimating n_distinct even with sample sizes as large as
 50% or more.

Which paper?  People have referenced several different ones.

   1) You have n^2 possible two-column combinations.  That's a lot
   of processing and storage.
  
  Yes, that's the hard problem to solve.  Actually, btw, it's n!,
  not n^2.
 
 You have O(n^2) possible *two-column* combinations and O(n!)
 n-column combinations.  I was hoping two-column combinations would
 be enough information to extrapolate from for larger combinations.

The math nerd in me says that this is bad math, but it might work
well enough by ass-u-ming a lack of higher-order correlations.

   2) It isn't even clear what data you're exactly looking for.
   Certainly correlation is just shorthand here and isn't what
   you're actually looking for. 
  
  Actually, I'd think that a correlation number estimate (0 =
  completely uncorrelated, 1 = completely correlated) would be
  sufficient to improve row count estimation significantly, without
  incurring the vast overhead of histogram × histogram manipulation.
 
 No, correlation is actually quite uninteresting here.  Consider for
 example two columns "state" and "area code".  The area code is
 pretty much 100% dependent on state, but will show very little
 correlation.

There are quantitative tests of independence available.  I'm not sure
whether any of these have been applied even theoretically in
DBMS-land.
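
Pearson's chi-squared test of independence is the textbook example: build a
contingency table of the two columns, with observed counts O_ij and expected
counts E_ij = (row_i total)(column_j total)/N, and compute

\[
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}.
\]

A large value relative to a chi-squared distribution with (r-1)(c-1) degrees
of freedom says the columns are not independent; whether that one number is
enough to correct selectivity estimates is, as noted above, another matter.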

 Hopefully you're right that you don't need complete histograms.
 Perhaps there's some statistics concept they don't teach in stats
 101 that would cover this need.  What we're looking for is a
 function f(a,b,x) that gives the net selectivity given a and b, the
 selectivity of two clauses based on two columns, and x some simple
 value that can be easily calculated by looking at the data in
 advance.

That would be neat :)

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778    AIM: dfetter666
  Skype: davidfetter

Remember to vote!



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Mark Kirkwood

Tom Lane wrote:


Another thing that's bothering me is that the index access cost computation
(in genericcostestimate) is looking sillier and sillier:

/*
 * Estimate the number of index pages that will be retrieved.
 *
 * For all currently-supported index types, the first page of the index is
 * a metadata page, and we should figure on fetching that plus a pro-rated
 * fraction of the remaining pages.
 */
if (index->pages > 1 && index->tuples > 0)
{
    numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
    numIndexPages += 1;        /* count the metapage too */
    numIndexPages = ceil(numIndexPages);
}
else
    numIndexPages = 1.0;

/*
 * Compute the index access cost.
 *
 * Disk cost: our generic assumption is that the index pages will be read
 * sequentially, so they have cost 1.0 each, not random_page_cost.
 */
*indexTotalCost = numIndexPages;

There are several things wrong with this, at least in its application to
btrees.  It's not counting descent of the btree (ie, touches of the root
page and intermediate-level pages).  On the other hand it's surely
overcharging for metapage touches.  As of CVS tip we cache the metapage in
the relcache, so it's probably fair to disregard that cost altogether.
And on the third hand, when we need to retrieve multiple leaf pages it's
over-optimistic to assume that those'll be purely sequential fetches.
(That would be approximately true in a freshly-built btree, but not in one
that's suffered any amount of page splitting or recycling.)


I am inclined to think that a reasonable model is to continue to estimate
the number of index leaf pages touched as above (pro-rating against the
total number of index entries), to charge random_page_cost per leaf page
touched, and not to count anything for metapage or tree descent.  I
justify the latter on the grounds that upper tree levels will tend to stay
in cache because they're touched so often.  random_page_cost per leaf page
may be an overestimate, since sometimes logically adjacent leaf pages will
be physically adjacent too, but not charging for tree descent should help
to cancel that out.  With this model, the disk cost to fetch a single
index entry will be estimated as random_page_cost (default 4.0) rather
than the current fixed 2.0.  This shouldn't hurt things too much for
simple indexscans --- especially since anyone running with a reduced
random_page_cost won't see as much change.  And it will increase the costs
for bitmap scans that cross many index pages, which is what we need in
light of Philippe's numbers.
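
A minimal sketch of the disk-cost model described above, reusing the variables
from the genericcostestimate excerpt (an illustration of the proposal, not
actual planner code):

/*
 * Pro-rate leaf pages against the total number of index entries; ignore the
 * metapage and tree descent; charge random_page_cost per leaf page touched.
 */
if (index->pages > 1 && index->tuples > 0)
    numIndexPages = ceil((numIndexTuples / index->tuples) *
                         (index->pages - 1));
else
    numIndexPages = 1.0;

*indexTotalCost = numIndexPages * random_page_cost;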



This sounds good to me, although the 2.0 -> 4.0 cost jump may cause
(more) cases of people seeing their index scans in pre-8.2 versions 
becoming some other type of access in 8.2. I guess a comment about 
testing existing applications could be placed in the 8.2 release notes?



Now we have seen a lot of cases in which indexscans were being drastically
overestimated, so increasing the cost estimate even more may seem like a
horrid idea.  But I believe that most or all of these cases were ones
where the same index was being visited many times, and so the real
estimation failure is not accounting for cache effects across multiple
indexscans.  Rather than being afraid to raise the cost estimate for an
indexscan in isolation, I think it's time to bite the bullet and do
something about that.

The big difficulty in modeling cache effects from multiple scans is that
we usually don't know how many times the index will be scanned.  If we did
have that number, I think it'd be reasonable to use the Mackert and Lohman
approximation already used in cost_index to estimate the total number of
index leaf pages fetched over N scans, and then divide by N to get our
per-scan estimated cost.  But because the planner builds cost estimates
bottom-up, in most scenarios we don't have any idea whether an indexscan
plan node will be iterated once or many times in the finished plan.
However there is one case where we do have some information: when
computing a best_inner_indexscan() for a nestloop join, it'd be reasonable
to use the estimated size of the outer relation as the loop count.
And this is exactly the case where we are falling down in practice:
the complaints we've seen are mostly about overestimating the cost of a
nestloop-with-inner-index-scan.  So I think handling just this case would
be a big step forward, even if it's not a theoretically pure general
solution.
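
In rough symbols, with N the estimated number of outer rows, t the tuples
fetched per probe, and PF the Mackert-Lohman page-fetch estimate quoted
earlier in the thread, the per-scan disk cost would be something like

\[
\text{cost}_{\text{per scan}} \approx \frac{\text{random\_page\_cost} \times PF(N \cdot t)}{N}.
\]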




If I understand correctly, this is the case where we normally see folks
needing add several 'set enable_xxx=off' statements to get the nested 
loop plan that *actually* works best :-). So, also looks good!


Cheers

Mark




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Tom Lane
Mark Kirkwood [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 With this model, the disk cost to fetch a single
 index entry will be estimated as random_page_cost (default 4.0) rather
 than the current fixed 2.0.  This shouldn't hurt things too much for
 simple indexscans --- especially since anyone running with a reduced
 random_page_cost won't see as much change.  And it will increase the costs
 for bitmap scans that cross many index pages, which is what we need in
 light of Philippe's numbers.

 This sounds good to me, although the 2.0 -> 4.0 cost jump may cause
 (more) cases of people seeing their index scans in pre-8.2 versions 
 becoming some other type of access in 8.2. I guess a comment about 
 testing existing applications could be placed in the 8.2 release notes?

Yeah, that comes with the territory.  One point to note is that with
this model, setting random_page_cost below 2.0 will actually make small
indexscans look *cheaper* than they do now.  So it'll certainly be
possible to make the thing jump in that direction if you need to.

regards, tom lane



[HACKERS] More thoughts about planner's cost estimates

2006-05-31 Thread Tom Lane
I've been thinking about the planner's costing problems again,
particularly in connection with Philippe Lang's complaint here:
http://archives.postgresql.org/pgsql-general/2006-05/msg01526.php
Investigation showed that the planner is much too optimistic about the
usefulness of a multi-index BitmapAnd plan, and since that comparison is
just a cost-estimate comparison, it implies we are underestimating the
cost of an index scan.  A typical example from his results is

->  BitmapAnd  (cost=12.30..12.30 rows=1 width=0) (actual time=0.306..0.306 rows=0 loops=13628)
      ->  Bitmap Index Scan on lw_id_workflow  (cost=0.00..2.02 rows=7 width=0) (actual time=0.009..0.009 rows=7 loops=13628)
            Index Cond: (lw.id_workflow = "outer".id)
      ->  Bitmap Index Scan on lw_ordre  (cost=0.00..10.03 rows=1437 width=0) (actual time=0.293..0.293 rows=1714 loops=13628)
            Index Cond: (ordre = $4)

There are two variables involved here: the cost of touching an index page
and the cost of processing an index tuple.  Given the two comparable data
points above, we can solve for those numbers, and it turns out that the
real cost ratio on Philippe's machine is about 50 to 1.  Which says that
for him, cpu_index_tuple_cost plus cpu_operator_cost should be around
0.02, nearly an order of magnitude higher than their current default
values (0.001 and 0.0025 respectively).

In general it seems to me that for CPU-bound databases, the default values
of the cpu_xxx_cost variables are too low.  I am tempted to raise the
default value of cpu_index_tuple_cost to 0.005, which would be half of
cpu_tuple_cost and twice cpu_operator_cost; that seems more in line with
my feel for the relative costs of things.  For a CPU-bound database all
three would need to go up more than that.  But rather than telling people
to manipulate all three of these variables individually, I think it might
also be a good idea to provide a new GUC variable named something like
cpu_speed_scale that would just be a multiplier for the other variables.
It would default to 1.0 and our standard advice for CPU-bound databases
would be decrease random_page_cost to 1.0 and raise cpu_speed_scale to
10.0 or so.  Seems cleaner than telling people to muck with three or so
individual values.

Another thing that's bothering me is that the index access cost computation
(in genericcostestimate) is looking sillier and sillier:

/*
 * Estimate the number of index pages that will be retrieved.
 *
 * For all currently-supported index types, the first page of the index is
 * a metadata page, and we should figure on fetching that plus a pro-rated
 * fraction of the remaining pages.
 */
if (index->pages > 1 && index->tuples > 0)
{
    numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
    numIndexPages += 1;        /* count the metapage too */
    numIndexPages = ceil(numIndexPages);
}
else
    numIndexPages = 1.0;

/*
 * Compute the index access cost.
 *
 * Disk cost: our generic assumption is that the index pages will be read
 * sequentially, so they have cost 1.0 each, not random_page_cost.
 */
*indexTotalCost = numIndexPages;

There are several things wrong with this, at least in its application to
btrees.  It's not counting descent of the btree (ie, touches of the root
page and intermediate-level pages).  On the other hand it's surely
overcharging for metapage touches.  As of CVS tip we cache the metapage in
the relcache, so it's probably fair to disregard that cost altogether.
And on the third hand, when we need to retrieve multiple leaf pages it's
over-optimistic to assume that those'll be purely sequential fetches.
(That would be approximately true in a freshly-built btree, but not in one
that's suffered any amount of page splitting or recycling.)

I've desisted from touching this mainly because the obvious sanity
adjustments, such as adding something for tree descent and charging
random_page_cost not 1.0 for leaf page touches, would increase the
estimated cost of index usage, which seemed like not the direction we
wanted to go in.  So I figured that the errors were more or less
cancelling each other.  But the addition of bitmap index scans is now
exposing the weaknesses, so we need to face up to the problem.

I am inclined to think that a reasonable model is to continue to estimate
the number of index leaf pages touched as above (pro-rating against the
total number of index entries), to charge random_page_cost per leaf page
touched, and not to count anything for metapage or tree descent.  I
justify the latter on the grounds that upper tree levels will tend to stay
in cache because they're touched so often.  random_page_cost per leaf page
may be an overestimate, since sometimes logically adjacent leaf pages will
be physically adjacent too, but not charging for tree descent should help
to cancel that out.  With this model, the disk cost to fetch a