Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-16 Thread Jim C. Nasby
On Fri, Jan 13, 2006 at 11:37:38PM -0500, Tom Lane wrote:
 Josh Berkus josh@agliodbs.com writes:
  It's also worth mentioning that for datatypes that only have an =
  operator the performance of compute_minimal_stats is O(N^2) when values
  are unique, so increasing sample size is a very bad idea in that case.
 
  Hmmm ... does ANALYZE check for UNIQUE constraints?
 
 Our only implementation of UNIQUE constraints is btree indexes, which
 require more than an = operator, so this seems irrelevant.

IIRC, the point was that if we know a field has to be unique, there's no
sense in doing that part of the analysis on it; you'd only care about
correlation.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-16 Thread Simon Riggs
On Mon, 2006-01-16 at 12:26 -0600, Jim C. Nasby wrote:
 On Fri, Jan 13, 2006 at 11:37:38PM -0500, Tom Lane wrote:
  Josh Berkus josh@agliodbs.com writes:
   It's also worth mentioning that for datatypes that only have an =
   operator the performance of compute_minimal_stats is O(N^2) when values
   are unique, so increasing sample size is a very bad idea in that case.
  
   Hmmm ... does ANALYZE check for UNIQUE constraints?
  
  Our only implementation of UNIQUE constraints is btree indexes, which
  require more than an = operator, so this seems irrelevant.
 
 IIRC, the point was that if we know a field has to be unique, there's no
 sense in doing that part of the analysis on it; you'd only care about
 correlation.

An interesting point: if we know the cardinality by definition there is
no need to ANALYZE for that column. An argument in favour of the
definitional rather than the discovery approach. As a designer I very
often already know the cardinality by definition, so I would appreciate
the opportunity to specify that to the database.

Tom has not spoken against checking for UNIQUE constraints: he is just
pointing out that there never could be a constraint in the case I was
identifying. For other datatypes we could check for them rather than
analyzing the sample, but the effect would be the same either way.

My original point was that the behaviour is O(N^2) when values are unique; it
is still O(N^2) when they are nearly unique, albeit O(N^2) - O(N). Reducing
stats_target does not help there - the effect varies with the number of rows
in the sample, not the number of slots in the MCV list.
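
For anyone who wants to see the shape of the problem, here is a small
stand-alone C sketch (an illustration only, not the actual
compute_minimal_stats() code) of an equality-only duplicate scan; with an
all-unique sample the inner loop never finds a match, so the work grows as
roughly n*(n-1)/2 comparisons:

#include <stdio.h>
#include <stdlib.h>

static long
count_distinct_equality_only(const int *sample, int n)
{
    int  *track = malloc(n * sizeof(int));   /* values seen so far */
    int   ntrack = 0;
    long  comparisons = 0;

    for (int i = 0; i < n; i++)
    {
        int j;

        for (j = 0; j < ntrack; j++)         /* only '=' is available */
        {
            comparisons++;
            if (track[j] == sample[i])
                break;                       /* duplicate: stop early */
        }
        if (j == ntrack)
            track[ntrack++] = sample[i];     /* new value: track it */
    }
    printf("n=%d distinct=%d comparisons=%ld\n", n, ntrack, comparisons);
    free(track);
    return comparisons;
}

int
main(void)
{
    int  n = 30000;                          /* e.g. an enlarged sample */
    int *sample = malloc(n * sizeof(int));

    for (int i = 0; i < n; i++)
        sample[i] = i;                       /* all-unique worst case */
    count_distinct_equality_only(sample, n); /* ~ n*(n-1)/2 comparisons */
    free(sample);
    return 0;
}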

Best Regards, Simon Riggs






Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-16 Thread Manfred Koizar
On Fri, 13 Jan 2006 19:18:29 +, Simon Riggs
[EMAIL PROTECTED] wrote:
I enclose a patch for checking out block sampling.

Can't comment on the merits of block sampling and your implementation
thereof.  Just some nitpicking:

|!  * Row Sampling: As of May 2004, we use the Vitter algorithm to create

Linking the use of the Vitter algorithm to May 2004 is not quite
appropriate.  We introduced two stage sampling at that time.

|   * a random sample of targrows rows (or less, if there are less in the
|!  * sample of blocks). In this case, targblocks is always the same as
|!  * targrows, so we always read one row per block.

This is just wrong, unless you add "on average".  Even then it is a
bit misleading, because in most cases we *read* more tuples than we
use.

|   * Although every row has an equal chance of ending up in the final
|   * sample, this sampling method is not perfect: not every possible
|   * sample has an equal chance of being selected.  For large relations
|   * the number of different blocks represented by the sample tends to be
|!  * too small.  In that case, block sampling should be used.

Is the last sentence a fact or personal opinion?

|   * block.  The previous sampling method put too much credence in the row
|   * density near the start of the table.

FYI, "previous" refers to the date mentioned above:
previous == before May 2004 == before two stage sampling.

|+ /* Assume that we will have rows at least 64 bytes wide 
|+  * Currently we're very unlikely to overflow availMem here 
|+  */
|+ if ((allocrows * sizeof(HeapTuple)) > (allowedMem >> 4))

This is a funny way of saying
if (allocrows * (sizeof(HeapTuple) + 60) > allowedMem)

It doesn't match the comment above; and it is something different if
the size of a pointer is not four bytes.
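
A sketch of the form being suggested, written so it follows the
64-bytes-per-row comment instead of relying on sizeof(HeapTuple) being 4
(the names and the clamping action here are illustrative, not taken from
the patch):

#include <stddef.h>

typedef void *HeapTuple;            /* stand-in for the real typedef */

#define ASSUMED_TUPLE_WIDTH 64      /* bytes per sampled row, per the comment */

/* Clamp the planned row allocation so the sample fits in allowedMem,
 * charging one pointer slot plus an assumed 64-byte tuple per row. */
static int
clamp_allocrows(int allocrows, long allowedMem)
{
    size_t per_row = sizeof(HeapTuple) + ASSUMED_TUPLE_WIDTH;

    if ((double) allocrows * per_row > (double) allowedMem)
        allocrows = (int) (allowedMem / per_row);
    return allocrows;
}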

|-  if (bs.m > 0)
|+  if (bs.m > 0 )

Oops.

|+  ereport(DEBUG2,
|+ (errmsg("ANALYZE attr %u sample: n=%u nmultiple=%u f1=%u d=%u",
|+ stats->tupattnum,samplerows, nmultiple, f1, d)));

Missing space after the comma following stats->tupattnum, and in some more
places.

I haven't been following the discussion too closely but didn't you say
that a block sampling algorithm somehow compensates for the fact that
the sample is not random?
Servus
 Manfred



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-16 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Tom has not spoken against checking for UNIQUE constraints: he is just
 pointing out that there never could be a constraint in the case I was
 identifying.

More generally, arguing for or against any system-wide change on the
basis of performance of compute_minimal_stats() is folly, because that
function is not used for any common datatypes.  (Ideally it'd never be
used at all.)

regards, tom lane



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-16 Thread Simon Riggs
On Mon, 2006-01-16 at 21:24 +0100, Manfred Koizar wrote:
 On Fri, 13 Jan 2006 19:18:29 +, Simon Riggs
 [EMAIL PROTECTED] wrote:
 I enclose a patch for checking out block sampling.
 
 Can't comment on the merits of block sampling and your implementation
 thereof.  

But your thoughts are welcome...

 |!  * Row Sampling: As of May 2004, we use the Vitter algorithm to create
 
 Linking the use of the Vitter algorithm to May 2004 is not quite
 appropriate.  We introduced two stage sampling at that time.

I'll redo some of the comments as you suggest and review coding points.

 |   * a random sample of targrows rows (or less, if there are less in the
 |!  * sample of blocks). In this case, targblocks is always the same as
 |!  * targrows, so we always read one row per block.
 
 This is just wrong, unless you add on average.  Even then it is a
 bit misleading, because in most cases we *read* more tuples than we
 use.

The current code reads a number of blocks == targrows, hence one per
block, but I'll change the comment.

 |   * Although every row has an equal chance of ending up in the final
 |   * sample, this sampling method is not perfect: not every possible
 |   * sample has an equal chance of being selected.  For large relations
 |   * the number of different blocks represented by the sample tends to be
 |!  * too small.  In that case, block sampling should be used.
 
 Is the last sentence a fact or personal opinion?

That is my contention, based upon research. The existing code clearly
identifies that case as requiring a solution. We use Chaudhuri et al's
1998 result for the sufficiency of sample size, yet overlook his
estimators and his ideas for random block selection.

My first point is that we need proportional sample size, not fixed size.
[I show that for large tables we currently use a sample that is 2-5
times too small, even based on Chaudhuri's work.]

ANALYZE is very I/O bound: random row sampling would need to scan the
whole table to take a 2-3% sample in most cases. Block sampling is a
realistic way of getting a larger sample size without loss of
performance, and also a way of getting a reasonable sample when
accessing very few blocks.

We would keep random row sampling for smaller relations where the
performance effects of sample collection are not noticeable.

 I haven't been following the discussion too closely but didn't you say
 that a block sampling algorithm somehow compensates for the fact that
 the sample is not random?

I haven't said that block sampling compensates for the sample being
non-random. There is a danger of bias in the sample and I made that
clear. I've tried to mitigate that by the use of random blocks, reusing
your code and taking note of the earlier problems you solved. However,
block sampling allows us to spot cases where rows are naturally grouped
together, which row sampling doesn't spot until very high sample sizes.
I explored in detail a common case where this causes the current method
to fall down very badly. 

Brutlag & Richardson [2000] give a good rundown on block sampling and
I'm looking to implement some of their block estimators to complement the
current patch. (Credit to Andrew Dunstan for locating the paper...)
http://www.stat.washington.edu/www/research/reports/1999/tr355.ps

[I did briefly explore the possibility of hybrid row/block sampling,
using row sampling as well some block sampling to identify grouping,
with an attempt to avoid swamping the sample with biased rows; but I
don't have a strong basis for that, so I've ignored that for now.]

This patch allows us to explore the possible bias that block sampling
gives, allowing us to make a later decision to include it, or not.

Best Regards, Simon Riggs




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-13 Thread Simon Riggs
On Fri, 2006-01-06 at 15:26 -0800, Josh Berkus wrote:

 Anyway, since the proof is in the pudding, Simon and I will be working on 
 some demo code for different sampling methods so that we can debate 
 results rather than theory.

I enclose a patch for checking out block sampling. This is not
production ready, yet, but works bug-free on cvstip. Code comments have
been fully updated to explain what's going on inside.

All you need to do is set analyze_blocks=b and ANALYZE will switch
over to using the block sampling method and will read all the rows in b
blocks. The sample size will also be limited by maintenance_work_mem.
(Memory limitations could be smarter). This de-couples the specification
of the sample size from the specification of the MCV/histogram size
(stats_target).
[Right now, I'm not suggesting that we have a GUC named this - it just
exists for testing. If/when we agree to allow block sampling, then we
can discuss how to invoke/specify it]

The stats calculations aren't touched - it still uses Haas-Stokes.
If you set log_min_messages=DEBUG2 you'll also get more useful
explanations of what the variables are and what decisions it makes about
D for each column being analyzed.

This patch has two main effects:
- ANALYZE runs more than x10 faster to retrieve the same size sample
- you can specify much larger samples for bigger tables, without
increasing the stats targets

Generally, the larger samples will give better results for the
estimation. However, what is immediately apparent is that the
Haas-Stokes estimator actually gets even worse with block sampling in
the particular case I raised on-list. (Which is understandable, but not
desirable). ISTM this is a strike against Haas-Stokes, rather than a
strike against block sampling. So I'm looking at implementing the
Brutlag & Richardson estimator(s) that cope with the number of values
appearing in only one block. Not surprisingly that
means some non-trivial additional code to retrieve blockids for each
tuple and make decisions accordingly. I plan to use a similar technique
to the existing TupnoLink array to match blockids.

The BR estimators should allow a fairly fast, small sample to be taken,
making this more usable for dynamic sampling during query planning (when
desirable, see recent -perform thread).

It's also worth mentioning that for datatypes that only have an =
operator the performance of compute_minimal_stats is O(N^2) when values
are unique, so increasing sample size is a very bad idea in that case.
It may be possible to re-sample the sample, so that we get only one row
per block as with the current row sampling method. Another idea might be
just to abort the analysis when it looks fairly unique, rather than
churn through the whole sample.
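
As a rough illustration of the abort-early idea (the names and the threshold
are placeholders, not code from the patch): after a fixed number of rows the
scan checks whether every value so far was distinct and, if so, gives up and
lets the caller treat the column as unique:

#include <stdbool.h>

/* Illustration only: decide whether to abandon the O(N^2) duplicate scan.
 * rows_examined  = sample rows looked at so far
 * values_tracked = distinct values found among them */
static bool
should_abort_as_unique(int rows_examined, int values_tracked)
{
    const int check_after = 1000;   /* arbitrary threshold for the sketch */

    /* Every row so far was distinct: assume the column is unique and let
     * the caller set D = N (scaled), skipping the rest of the sample. */
    return rows_examined >= check_after && values_tracked == rows_examined;
}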

Best Regards, Simon Riggs

Index: src/backend/commands/analyze.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/commands/analyze.c,v
retrieving revision 1.90
diff -c -r1.90 analyze.c
*** src/backend/commands/analyze.c	22 Nov 2005 18:17:08 -	1.90
--- src/backend/commands/analyze.c	13 Jan 2006 18:54:26 -
***
*** 62,67 
--- 62,68 
  
  /* Default statistics target (GUC parameter) */
  int			default_statistics_target = 10;
+ int analyze_blocks = 0;
  
  static int	elevel = -1;
  
***
*** 77,84 
  	HeapTuple *rows, int numrows,
  	MemoryContext col_context);
  static VacAttrStats *examine_attribute(Relation onerel, int attnum);
! static int acquire_sample_rows(Relation onerel, HeapTuple *rows,
! 	int targrows, double *totalrows, double *totaldeadrows);
  static double random_fract(void);
  static double init_selection_state(int n);
  static double get_next_S(double t, int n, double *stateptr);
--- 78,86 
  	HeapTuple *rows, int numrows,
  	MemoryContext col_context);
  static VacAttrStats *examine_attribute(Relation onerel, int attnum);
! static HeapTuple *acquire_sample_rows(Relation onerel, 
! int targblocks, int targrows, long allowedMem,
! int *samplerows, double *totalrows, double *totaldeadrows);
  static double random_fract(void);
  static double init_selection_state(int n);
  static double get_next_S(double t, int n, double *stateptr);
***
*** 108,117 
--- 110,122 
  	VacAttrStats **vacattrstats;
  	AnlIndexData *indexdata;
  	int			targrows,
+ targblocks,
  numrows;
  	double		totalrows,
  totaldeadrows;
  	HeapTuple  *rows;
+ 	long		allowedMem = 0;		/* total memory allowed, in bytes */
+ 
  
  	if (vacstmt->verbose)
  		elevel = INFO;
***
*** 348,357 
  	/*
  	 * Acquire the sample rows
  	 */
- 	rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
- 	numrows = acquire_sample_rows(onerel, rows, targrows,
-   &totalrows, &totaldeadrows);
  
  	/*
  	 * Compute the statistics.	Temporary results during the calculations for
  	 * each column are stored in a 

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-13 Thread Josh Berkus
Simon,

 It's also worth mentioning that for datatypes that only have an =
 operator the performance of compute_minimal_stats is O(N^2) when values
 are unique, so increasing sample size is a very bad idea in that case.
 It may be possible to re-sample the sample, so that we get only one row
 per block as with the current row sampling method. Another idea might be
 just to abort the analysis when it looks fairly unique, rather than
 churn through the whole sample.

I'd tend to do the latter.   If we haven't had a value repeat in 25 blocks, 
how likely is one to appear later?

Hmmm ... does ANALYZE check for UNIQUE constraints?   Most unique values 
are going to have a constraint, in which case we don't need to sample them 
at all for N-distinct.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-13 Thread Tom Lane
Josh Berkus josh@agliodbs.com writes:
 It's also worth mentioning that for datatypes that only have an =
 operator the performance of compute_minimal_stats is O(N^2) when values
 are unique, so increasing sample size is a very bad idea in that case.

 Hmmm ... does ANALYZE check for UNIQUE constraints?

Our only implementation of UNIQUE constraints is btree indexes, which
require more than an = operator, so this seems irrelevant.

regards, tom lane



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-10 Thread Simon Riggs
On Mon, 2006-01-09 at 22:08 -0500, Greg Stark wrote:

 So it's not the 8k block reading that's fooling Linux into reading ahead 32k.
 It seems 32k readahead is the default for Linux, or perhaps it's the
 sequential access pattern that's triggering it.

Nah, Linux 2.6 uses flexible readahead logic. It increases slowly when
you read sequentially, but halves the readahead if you do another access
type. Can't see that would give an average readahead size of 32k.

Anyway this is one just reason for change...

Best Regards, Simon Riggs




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-10 Thread Greg Stark

Simon Riggs [EMAIL PROTECTED] writes:

 On Mon, 2006-01-09 at 22:08 -0500, Greg Stark wrote:
 
  So it's not the 8k block reading that's fooling Linux into reading ahead 
  32k.
  It seems 32k readahead is the default for Linux, or perhaps it's the
  sequential access pattern that's triggering it.
 
 Nah, Linux 2.6 uses flexible readahead logic. It increases slowly when
 you read sequentially, but halves the readahead if you do another access
 type. Can't see that would give an average readahead size of 32k.

I've actually read this code at one point in the past. IIRC the readahead is
capped at 32k, which I find interesting given the results. Since this is
testing sequential access patterns perhaps what's happening is the test for
readahead is too liberal.  

All the numbers I'm getting are consistent with a 32k readahead. Even if I run my
program with a 4k block size I get performance equivalent to a full table scan
very quickly. If I use a 32k block size then the breakeven point is just over
20%.

I suppose what I really ought to do is make some pretty graphs.

-- 
greg




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-10 Thread Jim C. Nasby
FWIW, results from a FBSD6.0 box. Between every run I ran code to zero
800MB (out of 1G) of memory, which should have effectively flushed the
OS cache (it would just put the OS into swapping).

Reading 1% (1280/128000 blocks 1048576000 bytes) total time 6870595us
MB/s 1.53 effective MB/s 152.62
Reading 2% (2560/128000 blocks 1048576000 bytes) total time 13397833us
MB/s 1.57 effective MB/s 78.26
Reading 3% (3840/128000 blocks 1048576000 bytes) total time 16288218us
MB/s 1.93 effective MB/s 64.38
Reading 4% (5120/128000 blocks 1048576000 bytes) total time 17998334us
MB/s 2.33 effective MB/s 58.26
Reading 5% (6400/128000 blocks 1048576000 bytes) total time 22958791us
MB/s 2.28 effective MB/s 45.67
Reading 6% (7680/128000 blocks 1048576000 bytes) total time 21272511us
MB/s 2.96 effective MB/s 49.29
Reading 7% (8960/128000 blocks 1048576000 bytes) total time 21708130us
MB/s 3.38 effective MB/s 48.30
Reading 8% (10240/128000 blocks 1048576000 bytes) total time 23213377us
MB/s 3.61 effective MB/s 45.17
Reading 9% (11520/128000 blocks 1048576000 bytes) total time 33641027us
MB/s 2.81 effective MB/s 31.17
Reading 10% (12800/128000 blocks 1048576000 bytes) total time 22484234us
MB/s 4.66 effective MB/s 46.64
Reading 15% (19200/128000 blocks 1048576000 bytes) total time 23985072us
MB/s 6.56 effective MB/s 43.72
Reading 20% (25600/128000 blocks 1048576000 bytes) total time 26332888us
MB/s 7.96 effective MB/s 39.82
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-09 Thread Simon Riggs
On Fri, 2006-01-06 at 16:13 -0500, Greg Stark wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
 
  Before we start debating merits of proposals based on random reads, can
  someone confirm that the sampling code actually does read randomly? I
  looked at it yesterday; there is a comment that states that blocks to be
  scanned are passed to the analyze function in physical order, and AFAICT
  the function that chooses blocks does so based strictly on applying a
  probability function to block numbers as it increments a counter. It
  seems that any reading is actually sequential and not random, which
  makes all the random_page_cost hand-waving null and void.
 
 Hm. I'm curious just how much that behaves like a sequential scan actually. I
 think I'll do some experiments. 
 
 Reading 1% (1267 read, 126733 skipped):7748264us
 Reading 2% (2609 read, 125391 skipped):   12672025us
 Reading 5% (6502 read, 121498 skipped):   19005678us
 Reading 5% (6246 read, 121754 skipped):   18509770us
 Reading 10% (12975 read, 115025 skipped): 19305446us
 Reading 20% (25716 read, 102284 skipped): 18147151us
 Reading 50% (63656 read, 64344 skipped):  18089229us
 Reading 100% (128000 read, 0 skipped):18173003us
 
 These numbers don't make much sense to me. It seems like 5% is about as slow
 as reading the whole file which is even worse than I expected. I thought I was
 being a bit pessimistic to think reading 5% would be as slow as reading 20% of
 the table.

Just to put a few rumours to bed:

- the current code does *not* use block sampling, it uses random row
sampling. (There is a part of the code that selects the next block but
that should not confuse you into thinking that the whole block is
sampled).

- yes, the random sampling is random - please read the code and comments

- yes, I would expect the results you get. If you sample 5% of rows and
each block has on average at least 20 rows, then we should expect the
majority of blocks to be hit.
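
A quick back-of-the-envelope check of that last point (this assumes
independent row picks, which only approximates the real sampler):

#include <stdio.h>
#include <math.h>

int
main(void)
{
    double fraction = 0.05;        /* row sampling fraction  */
    int    rows_per_block = 20;    /* assumed rows per block */
    double p_untouched = pow(1.0 - fraction, rows_per_block);

    /* prints ~0.358 untouched, i.e. ~64% of blocks get read anyway */
    printf("P(block contributes no sampled row) = %.3f\n", p_untouched);
    printf("P(block is read at least once)      = %.3f\n", 1.0 - p_untouched);
    return 0;
}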

Best Regards, Simon Riggs




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-09 Thread Lukas Smith

Simon Riggs wrote:


- yes, the random sampling is random - please read the code and comments

- yes, I would expect the results you get. If you sample 5% of rows and
each block has on average at least 20 rows, then we should expect the
majority of blocks to be hit.


and it seems from the benchmark posted to this list that random is 
_very_ expensive (probably because the random reads are spread out so 
well, that we do a lot of I/O instead of just logical I/O from some cache)


regards,
Lukas



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-09 Thread Greg Stark
Simon Riggs [EMAIL PROTECTED] writes:

 - yes, I would expect the results you get. If you sample 5% of rows and
 each block has on average at least 20 rows, then we should expect the
 majority of blocks to be hit.

These results are from my test program. 5% means 5% of 8k blocks from the test
file. In other words, reading a random 5% of the blocks from the test file in
sequential order but seeking over the skipped blocks is just as slow as
reading the entire file.

I feel like that can't be right but I can't find anything wrong with the
methodology.

An updated program is attached with which I got these results:

bash-3.00# for i in `seq 1 100` ; do umount /u6; mount /dev/sda1 /u6; 
~stark/src/pg/a.out /u6/temp/small $i ; done
Reading 1% (1280/128000 blocks 1048576000 bytes) total time 7662706us MB/s 1.37 
effective MB/s 136.84
Reading 2% (2560/128000 blocks 1048576000 bytes) total time 12495106us MB/s 
1.68 effective MB/s 83.92
Reading 3% (3840/128000 blocks 1048576000 bytes) total time 15847342us MB/s 
1.99 effective MB/s 66.17
Reading 4% (5120/128000 blocks 1048576000 bytes) total time 18281244us MB/s 
2.29 effective MB/s 57.36
Reading 5% (6400/128000 blocks 1048576000 bytes) total time 18988843us MB/s 
2.76 effective MB/s 55.22
Reading 6% (7680/128000 blocks 1048576000 bytes) total time 19225394us MB/s 
3.27 effective MB/s 54.54
Reading 7% (8960/128000 blocks 1048576000 bytes) total time 19462241us MB/s 
3.77 effective MB/s 53.88
Reading 8% (10240/128000 blocks 1048576000 bytes) total time 19747881us MB/s 
4.25 effective MB/s 53.10
Reading 9% (11520/128000 blocks 1048576000 bytes) total time 19451411us MB/s 
4.85 effective MB/s 53.91
Reading 10% (12800/128000 blocks 1048576000 bytes) total time 19546511us MB/s 
5.36 effective MB/s 53.65
Reading 11% (14080/128000 blocks 1048576000 bytes) total time 18989375us MB/s 
6.07 effective MB/s 55.22
Reading 12% (15360/128000 blocks 1048576000 bytes) total time 18722848us MB/s 
6.72 effective MB/s 56.01
Reading 13% (16640/128000 blocks 1048576000 bytes) total time 18621588us MB/s 
7.32 effective MB/s 56.31
Reading 14% (17920/128000 blocks 1048576000 bytes) total time 18581751us MB/s 
7.90 effective MB/s 56.43
Reading 15% (19200/128000 blocks 1048576000 bytes) total time 18422160us MB/s 
8.54 effective MB/s 56.92
Reading 16% (20480/128000 blocks 1048576000 bytes) total time 18148012us MB/s 
9.24 effective MB/s 57.78
Reading 17% (21760/128000 blocks 1048576000 bytes) total time 18147779us MB/s 
9.82 effective MB/s 57.78
Reading 18% (23040/128000 blocks 1048576000 bytes) total time 18023256us MB/s 
10.47 effective MB/s 58.18
Reading 19% (24320/128000 blocks 1048576000 bytes) total time 18039846us MB/s 
11.04 effective MB/s 58.13
Reading 20% (25600/128000 blocks 1048576000 bytes) total time 18081214us MB/s 
11.60 effective MB/s 57.99
...

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>

#include <stdio.h>
#include <stdlib.h>

#define BLOCKSIZE 8192

int main(int argc, char *argv[], char *arge[]) 
{
  char *fn;
  int fd;
  int perc;
  struct stat statbuf;
  struct timeval tv1,tv2;
  off_t size, offset;
  char buf[BLOCKSIZE];
  int b_toread, b_toskip, b_read=0, b_skipped=0;
  long us;

  fn = argv[1];
  perc = atoi(argv[2]);

  fd = open(fn, O_RDONLY);
  fstat(fd, &statbuf);
  size = statbuf.st_size;
  
  size = size/BLOCKSIZE*BLOCKSIZE;
  
  gettimeofday(&tv1, NULL);

  srandom(getpid()^tv1.tv_sec^tv1.tv_usec);

  b_toread = size/BLOCKSIZE*perc/100;
  b_toskip = size/BLOCKSIZE-b_toread;

  for(offset=0;offset<size;offset+=BLOCKSIZE) {
    if (random()%(b_toread+b_toskip) < b_toread) {
  lseek(fd, offset, SEEK_SET);
  read(fd, buf, BLOCKSIZE);
  b_toread--;
  b_read++;
} else {
  b_toskip--;
  b_skipped++;
}
  }
  
  gettimeofday(&tv2, NULL);
  
  us = (tv2.tv_sec-tv1.tv_sec)*1000000 + (tv2.tv_usec-tv1.tv_usec);
  
  fprintf(stderr,
	  "Reading %d%% (%d/%d blocks %ld bytes) total time %ldus MB/s %.2f effective MB/s %.2f\n",
	  perc,
	  b_read, b_read+b_skipped, size,
	  us,
	  (double)b_read*BLOCKSIZE/us,
	  (double)size/us
	  );
  exit(0);
}



-- 
greg



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-09 Thread Greg Stark

  These numbers don't make much sense to me. It seems like 5% is about as slow
  as reading the whole file which is even worse than I expected. I thought I
  was being a bit pessimistic to think reading 5% would be as slow as reading
  20% of the table.

I have a theory. My test program, like Postgres, is reading in 8k chunks.
Perhaps that's fooling Linux into thinking it's a sequential read and reading
in 32k chunks internally. That would effectively make a 25% scan a full table
scan. And a 5% scan would be a 20% scan which is about where I would have
expected the breakeven point to be.

-- 
greg




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-09 Thread Greg Stark

Greg Stark [EMAIL PROTECTED] writes:

   These numbers don't make much sense to me. It seems like 5% is about as
   slow as reading the whole file which is even worse than I expected. I
   thought I was being a bit pessimistic to think reading 5% would be as slow
   as reading 20% of the table.
 
 I have a theory. My test program, like Postgres, is reading in 8k chunks.
 Perhaps that's fooling Linux into thinking it's a sequential read and reading
 in 32k chunks internally. That would effectively make a 25% scan a full table
 scan. And a 5% scan would be a 20% scan which is about where I would have
 expected the breakeven point to be.

Well my theory was sort of half right. It has nothing to do with fooling Linux
into thinking it's a sequential read. Apparently this filesystem was created
with 32k blocks. I don't remember if that was intentional or if ext2/3 did it
automatically based on the size of the filesystem.

So it doesn't have wide-ranging implications for Postgres's default 8k block
size. But it is a good lesson about the importance of not using a larger
filesystem block than Postgres's block size. The net effect is that if the
filesystem block is N*8k then your random_page_cost goes up by a factor of N.
That could be devastating for OLTP performance.

-- 
greg




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-09 Thread Greg Stark

Greg Stark [EMAIL PROTECTED] writes:

 Well my theory was sort of half right. It has nothing to do with fooling Linux
 into thinking it's a sequential read. Apparently this filesystem was created
 with 32k blocks. I don't remember if that was intentional or if ext2/3 did it
 automatically based on the size of the filesystem.
 
 So it doesn't have wide-ranging implications for Postgres's default 8k block
 size. But it is a good lesson about the importance of not using a larger
 filesystem block than Postgres's block size. The net effect is that if the
 filesystem block is N*8k then your random_page_cost goes up by a factor of N.
 That could be devastating for OLTP performance.

Hm, apparently I spoke too soon. tune2fs says the block size is in fact 4k.
Yet the performance of the block reading test program with 4k or 8k blocks
behaves as if Linux is reading 32k blocks. And in fact when I run it with 32k
blocks I get kind of behaviour we were expecting where the breakeven point is
around 20%.

So it's not the 8k block reading that's fooling Linux into reading ahead 32k.
It seems 32k readahead is the default for Linux, or perhaps it's the
sequential access pattern that's triggering it.

I'm trying to test it with posix_fadvise() set to random access but I'm having
trouble compiling. Do I need a special #define to get posix_fadvise from
glibc?
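
For reference, glibc only declares posix_fadvise() when the right
feature-test macro is defined before the system headers (_XOPEN_SOURCE >= 600,
or _POSIX_C_SOURCE >= 200112L). A minimal sketch, separate from the benchmark
program above:

#define _XOPEN_SOURCE 600   /* needed before any #include for posix_fadvise */

#include <fcntl.h>
#include <stdio.h>

int
main(int argc, char *argv[])
{
    int fd, rc;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    /* Hint that the access pattern is random, to discourage readahead.
     * Note posix_fadvise returns the error number directly. */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: error %d\n", rc);
    return 0;
}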

-- 
greg




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-08 Thread Kenneth Marshall
On Fri, Jan 06, 2006 at 06:36:52PM -0500, Greg Stark wrote:
 Josh Berkus josh@agliodbs.com writes:
 
   These numbers don't make much sense to me. It seems like 5% is about as
   slow as reading the whole file which is even worse than I expected. I
   thought I was being a bit pessimistic to think reading 5% would be as
   slow as reading 20% of the table.
  
  It's about what *I* expected.  Disk seeking is the bane of many access 
  methods.
 
 Sure, but that bad? That means realistic random_page_cost values should be
 something more like 20 rather than 4. And that's with seeks only going to
 subsequent blocks in a single file, which one would expect to average less
 than the half rotation that a random seek would average. That seems worse than
 anyone expects.
 
  Anyway, since the proof is in the pudding, Simon and I will be working on 
  some demo code for different sampling methods so that we can debate 
  results rather than theory.
 
 Note that if these numbers are realistic then there's no i/o benefit to any
 sampling method that requires anything like 5% of the entire table and is
 still unreliable. Instead it makes more sense to implement an algorithm that
 requires a full table scan and can produce good results more reliably.
 
 -- 
 greg
 

I have not looked closely at the sampling process in PostgreSQL, but given
that getting an accurate answer is expensive in terms of read/seeks, would
it be possible to iteratively refine the estimate over time? Ideally we could
use the same disk blocks that are already being read to satisfy a query.
The initial estimate could be done with a minimal impact and then we piggy-
back the I/O needed to refine the estimate on the normal query I/O. The
obvious limit would be if the entire table were scanned we would get a 100%
exact estimate of the distinct values at that time, and otherwise a progressively
more accurate approximation as queries are run. This would allow us to
hide the I/O in other normal block accesses.

Ken



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-06 Thread Jim C. Nasby
On Fri, Jan 06, 2006 at 01:24:41AM -0500, Greg Stark wrote:
  5% based on block-based sampling is reasonable; that means a straight 5% of 
  the on-disk size of the table, so 5gb for a 100gb table.  With random-row 
  sampling, that would require as much as 25% of the table, making it easier 
  to just scan the whole thing.
 
 Postgres's current sample sizes are clearly geared towards the histograms
 where they are entirely realistic. All of the distinct estimates are clearly
 just ad hoc attempts based on the existing sampling. 
 
 Is a mechanism that is only 5x faster than reading the whole table (assuming
 random_page_cost of 4) and is off by more than a factor of three 10% of the
 time really worth it?

Before we start debating merits of proposals based on random reads, can
someone confirm that the sampling code actually does read randomly? I
looked at it yesterday; there is a comment that states that blocks to be
scanned are passed to the analyze function in physical order, and AFAICT
the function that chooses blocks does so based strictly on applying a
probability function to block numbers as it increments a counter. It
seems that any reading is actually sequential and not random, which
makes all the random_page_cost hand-waving null and void.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-06 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 Before we start debating merits of proposals based on random reads, can
 someone confirm that the sampling code actually does read randomly?

Well, it's not so much that it's not random, as that it's not
sequential --- it skips blocks, and therefore you'd expect that
kernel-level read-ahead would not kick in, or at least not be very
effective.

If there weren't much else going on, you could still assume that
you'd be paying less seek cost than in a genuinely random-order
fetching of the same number of blocks.

Not sure how these effects would add up.  I agree that some
investigation would be wise before making any claims about how
expensive the current method actually is.

regards, tom lane



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-06 Thread Greg Stark

Jim C. Nasby [EMAIL PROTECTED] writes:

 Before we start debating merits of proposals based on random reads, can
 someone confirm that the sampling code actually does read randomly? I
 looked at it yesterday; there is a comment that states that blocks to be
 scanned are passed to the analyze function in physical order, and AFAICT
 the function that chooses blocks does so based strictly on applying a
 probability function to block numbers as it increments a counter. It
 seems that any reading is actually sequential and not random, which
 makes all the random_page_cost hand-waving null and void.

Hm. I'm curious just how much that behaves like a sequential scan actually. I
think I'll do some experiments. 

Reading 1% (1267 read, 126733 skipped):  7748264us
Reading 2% (2609 read, 125391 skipped): 12672025us
Reading 5% (6502 read, 121498 skipped): 19005678us
Reading 5% (6246 read, 121754 skipped): 18509770us
Reading 10% (12975 read, 115025 skipped):   19305446us
Reading 20% (25716 read, 102284 skipped):   18147151us
Reading 50% (63656 read, 64344 skipped):18089229us
Reading 100% (128000 read, 0 skipped):  18173003us

These numbers don't make much sense to me. It seems like 5% is about as slow
as reading the whole file which is even worse than I expected. I thought I was
being a bit pessimistic to think reading 5% would be as slow as reading 20% of
the table.

Anyone see anything wrong with my methodology?

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>

#include <stdio.h>
#include <stdlib.h>

#define BLOCKSIZE 8192

int main(int argc, char *argv[], char *arge[]) 
{
  char *fn;
  int fd;
  int perc;
  struct stat statbuf;
  struct timeval tv1,tv2;
  off_t size, offset;
  char buf[BLOCKSIZE];
  int b_read=0, b_skipped=0;

  fn = argv[1];
  perc = atoi(argv[2]);

  fd = open(fn, O_RDONLY);
  fstat(fd, &statbuf);
  size = statbuf.st_size;
  
  size = size/BLOCKSIZE*BLOCKSIZE;
  
  gettimeofday(&tv1, NULL);

  srandom(getpid()^tv1.tv_sec^tv1.tv_usec);

  for(offset=0;offset<size;offset+=BLOCKSIZE) {
    if (random()%100 < perc) {
  lseek(fd, offset, SEEK_SET);
  read(fd, buf, BLOCKSIZE);
  b_read++;
} else {
  b_skipped++;
}
  }
  
  gettimeofday(&tv2, NULL);
  
  fprintf(stderr,
	  "Reading %d%% (%d read, %d skipped): %ldus\n",
	 (int)perc, b_read, b_skipped,
	 (tv2.tv_sec-tv1.tv_sec)*1000000 + (tv2.tv_usec-tv1.tv_usec)
	 );
  exit(0);
}



-- 
greg



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-06 Thread Josh Berkus
Greg,

 These numbers don't make much sense to me. It seems like 5% is about as
 slow as reading the whole file which is even worse than I expected. I
 thought I was being a bit pessimistic to think reading 5% would be as
 slow as reading 20% of the table.

It's about what *I* expected.  Disk seeking is the bane of many access 
methods.

Anyway, since the proof is in the pudding, Simon and I will be working on 
some demo code for different sampling methods so that we can debate 
results rather than theory.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-06 Thread Greg Stark
Josh Berkus josh@agliodbs.com writes:

  These numbers don't make much sense to me. It seems like 5% is about as
  slow as reading the whole file which is even worse than I expected. I
  thought I was being a bit pessimistic to think reading 5% would be as
  slow as reading 20% of the table.
 
 It's about what *I* expected.  Disk seeking is the bane of many access 
 methods.

Sure, but that bad? That means realistic random_page_cost values should be
something more like 20 rather than 4. And that's with seeks only going to
subsequent blocks in a single file, which one would expect to average less
than the half rotation that a random seek would average. That seems worse than
anyone expects.

 Anyway, since the proof is in the pudding, Simon and I will be working on 
 some demo code for different sampling methods so that we can debate 
 results rather than theory.

Note that if these numbers are realistic then there's no i/o benefit to any
sampling method that requires anything like 5% of the entire table and is
still unreliable. Instead it makes more sense to implement an algorithm that
requires a full table scan and can produce good results more reliably.

-- 
greg




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Simon Riggs
On Thu, 2006-01-05 at 00:33 -0500, Greg Stark wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 
  The approach I suggested uses the existing technique for selecting
  random blocks, then either an exhaustive check on all of the rows in a
  block or the existing random row approach, depending upon available
  memory. We need to check all of the rows in a reasonable sample of
  blocks otherwise we might miss clusters of rows in large tables - which
  is the source of the problems identified.
  
  The other reason was to increase the sample size, which is a win in any
  form of statistics.
 
 Only if your sample is random and independent. The existing mechanism tries
 fairly hard to ensure that every record has an equal chance of being selected.
 If you read the entire block and not appropriate samples then you'll introduce
 systematic sampling errors. For example, if you read an entire block you'll be
 biasing towards smaller records.

Yes, I discussed that, following Brutlag & Richardson [2000]. The bottom
line is if there is no clustering, block sampling is random, which is
good; if there is clustering, then you spot it, which is good.

 I think it would be useful to have a knob to increase the sample size
 separately from the knob for the amount of data retained in the statistics
 tables. Though I think you'll be disappointed and find you have to read an
 unreasonably large sample out of the table before you get more useful distinct
 estimates.

OK, I'll look at doing that.

Best Regards, Simon Riggs




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Rod Taylor
  Do you *really* want the median estimate in these cases?  Are you certain
  you do not want something with the opposite behavior of Chaudhuri's estimate
  so that for small sample sizes the bias is toward a high estimate of D?
  (Converges on D from the right instead of the left.)
 
  Chaudhuri's ------------D------------  needed
  estimate                               estimate
 
 Hmmm.  Yeah, I see what you mean.  True, the ideal approach would be to
 determine for each query operation whether a too-low D or a too-high D
 was more risky, and then use the more conservative number.   However, 
 that would complicate the query planner enough that I think Tom would 
 leave us. :-p

You could have some specific functions vote themselves out if their cost
is shaky. We know that the cost of a miscalculated nestloop is huge, so
after calculating the common case it might apply a multiplier for the
risk involved.

There have been lots of requests for a way to achieve more consistent
plans that have a determined worst case performance, even if they never
perform as well in the best case as another algorithm might. Perhaps
this could be a GUC.

PlanCost + PlanCost * Risk * RiskGUC

Risk is a number that indicates how badly things can go wrong.

RiskGUC is an integer multiplier. Someone who is risk averse (wants a
predictable execution time rather than the best possible time) would set
this value high. Others who want the best possible plan in most cases
even if it performs poorly once in a while will set the value very low,
possibly 0.
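
A toy sketch of how that comparison might look; the plan names, costs and
risk scores here are invented, only the cost + cost * risk * risk_guc formula
comes from the text above:

#include <stdio.h>

typedef struct
{
    const char *name;
    double      cost;   /* planner's estimated cost        */
    double      risk;   /* how badly this plan can blow up */
} PlanChoice;

static double
effective_cost(PlanChoice p, double risk_guc)
{
    return p.cost + p.cost * p.risk * risk_guc;
}

int
main(void)
{
    PlanChoice nestloop = {"nestloop", 100.0, 5.0};  /* cheap but fragile   */
    PlanChoice hashjoin = {"hashjoin", 180.0, 0.5};  /* dearer but stable   */
    double     risk_guc = 1.0;                       /* risk-averse setting */

    PlanChoice best = (effective_cost(nestloop, risk_guc) <
                       effective_cost(hashjoin, risk_guc)) ? nestloop : hashjoin;

    /* with risk_guc = 0 the cheaper nestloop wins; at 1.0 the stable plan does */
    printf("nestloop: %.1f  hashjoin: %.1f  -> choose %s\n",
           effective_cost(nestloop, risk_guc),
           effective_cost(hashjoin, risk_guc),
           best.name);
    return 0;
}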

-- 




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

  Only if your sample is random and independent. The existing mechanism tries
  fairly hard to ensure that every record has an equal chance of being 
  selected.
  If you read the entire block and not appropriate samples then you'll 
  introduce
  systematic sampling errors. For example, if you read an entire block you'll 
  be
  biasing towards smaller records.
 
 Did you read any of the papers on block-based sampling?   These sorts of 
 issues
 are specifically addressed in the algorithms.

We *currently* use a block based sampling algorithm that addresses this issue
by taking care to select rows within the selected blocks in an unbiased way.
You were proposing reading *all* the records from the selected blocks, which
throws away that feature.

-- 
greg




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

  These statements are at odds with my admittedly basic understanding of
  statistics.  Isn't the power of a sample more related to the absolute size
  of the sample than the sample as a fraction of the population?  Why not just
  pick a smallish sample size, say about 3000, and apply it to all the tables,
  even the ones with just a single row (modify appropriately from block
  sampling).
 
 Nope, it's definitely proportional.   As a simple example, a sample of 500
 rows in a table of 1000 rows should yield stats estimates with 90%+ accuracy.
 But a sample of 500 rows in a 600,000,000 row table is so small as to be
 nearly useless; it's quite possible to get all the same value in a random
 sample of 0.1% even on a column with a D/N of 0.001.   If you look at the
 papers cited, almost all researchers more recent than Chaudhuri use a
 proportional sample size.

To be clear Josh is talking specifically about the estimate of how many
distinct values a query will see. Not the more usual estimates of how many
records the query will see.

For estimating how many records a query like 

  SELECT * FROM tab WHERE x BETWEEN ? AND ?

the fixed size sample is on fairly solid ground. A sample of 600 gives (iirc)
+/- 2% 19 times out of 20. That's the same sample size most major opinion
polls use...

However this same logic doesn't work for estimating distinct values. Since a
single occurrence of a distinct value is just as important as hundreds of
occurrences, and your chances of finding the single occurrence is proportional
to what percentage of the overall table you sample, to maintain a given
accuracy you're going to have to maintain a sample that is a fixed percentage
of the overall table.
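
To put rough numbers on that (an independent-draw approximation, with the
sample fixed at 3000 rows, i.e. 300 times the default stats target of 10):
the chance of ever seeing a value that occurs k times in an N-row table is
about 1 - (1 - n/N)^k, which for k = 1 is simply n/N:

#include <stdio.h>
#include <math.h>

int
main(void)
{
    double n = 3000.0;                    /* fixed sample size              */
    double tables[] = {1e6, 1e7, 1e8, 1e9};
    int    k = 1;                         /* value occurs once in the table */

    for (int i = 0; i < 4; i++)
    {
        double N = tables[i];
        double p_seen = 1.0 - pow(1.0 - n / N, k);

        /* falls from 0.3% at 1M rows to 0.0003% at 1B rows */
        printf("N = %.0e: P(single-occurrence value is sampled) = %.6f\n",
               N, p_seen);
    }
    return 0;
}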

Worse, my recollection from the paper I mentioned earlier was that sampling
small percentages like 3-5% didn't get you an acceptable accuracy. Before you
got anything reliable you found you were sampling very large percentages of
the table. And note that if you have to sample anything over 10-20% you may as
well just read the whole table. Random access reads are that much slower.

-- 
greg




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Josh Berkus
Greg,

 We *currently* use a block based sampling algorithm that addresses this
 issue by taking care to select rows within the selected blocks in an
 unbiased way. You were proposing reading *all* the records from the
 selected blocks, which throws away that feature.

The block-based algorithms have specific math to cope with this.   Stuff 
which is better grounded in statistical analysis than our code.   Please 
read the papers before you judge the solution.

 Worse, my recollection from the paper I mentioned earlier was that
 sampling small percentages like 3-5% didn't get you an acceptable
 accuracy. Before you got anything reliable you found you were sampling
 very large percentages of the table. And note that if you have to sample
 anything over 10-20% you may as well just read the whole table. Random
 access reads are that much slower.

Right, which is why researchers are currently looking for something better.  
Brutlag & Richardson claim to be able to produce estimates which are
within +/- 3x 90% of the time using a 5% sample, which is far better than
our current accuracy.  Nobody claims to be able to estimate based on <
0.1% of the table, which is what Postgres tries to do for large tables.

5% based on block-based sampling is reasonable; that means a straight 5% of 
the on-disk size of the table, so 5gb for a 100gb table.  With random-row 
sampling, that would require as much as 25% of the table, making it easier 
to just scan the whole thing.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Jim C. Nasby
On Thu, Jan 05, 2006 at 10:12:29AM -0500, Greg Stark wrote:
 Worse, my recollection from the paper I mentioned earlier was that sampling
 small percentages like 3-5% didn't get you an acceptable accuracy. Before you
 got anything reliable you found you were sampling very large percentages of
 the table. And note that if you have to sample anything over 10-20% you may as
 well just read the whole table. Random access reads are that much slower.

If I'm reading backend/commands/analyze.c right, the heap is accessed
linearly, only reading blocks that get selected but reading them in heap
order, which shouldn't be anywhere near as bad as random access.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

 Greg,
 
  We *currently* use a block based sampling algorithm that addresses this
  issue by taking care to select rows within the selected blocks in an
  unbiased way. You were proposing reading *all* the records from the
  selected blocks, which throws away that feature.
 
 The block-based algorithms have specific math to cope with this.   Stuff 
 which is better grounded in statistical analysis than our code.   Please 
 read the papers before you judge the solution.

I'm confused since Postgres's current setup *was* based on papers on just this
topic. I haven't read any of the papers myself, I'm just repeating what was
previously discussed when the current block sampling code went in. There was
extensive discussion of the pros and cons of different algorithms taken from
various papers and how they related to Postgres. I don't recall any of them
suggesting that there was any way to take a sample which included every row
from a sampled block and then somehow unbias the sample after the fact.

 Right, which is why researchers are currently looking for something better.  
  Brutlag & Richardson claim to be able to produce estimates which are
  within +/- 3x 90% of the time using a 5% sample, which is far better than
  our current accuracy.  Nobody claims to be able to estimate based on <
  0.1% of the table, which is what Postgres tries to do for large tables.
 
 5% based on block-based sampling is reasonable; that means a straight 5% of 
 the on-disk size of the table, so 5gb for a 100gb table.  With random-row 
 sampling, that would require as much as 25% of the table, making it easier 
 to just scan the whole thing.

Postgres's current sample sizes are clearly geared towards the histograms
where they are entirely realistic. All of the distinct estimates are clearly
just ad hoc attempts based on the existing sampling. 

Is a mechanism that is only 5x faster than reading the whole table (assuming
random_page_cost of 4) and is off by more than a factor of three 10% of the
time really worth it?

-- 
greg




[HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Simon Riggs
Improving N-Distinct estimation
===
v1.1

OBJECTIVES

Answer these points...

- Are there any performance issues that can be directly attributed to
mis-estimation of N-Distinct (D) by the ANALYZE command? 

- If so, can we do better than we currently achieve? How?

- Would altering the estimate of D cause problems in other places?

Comments are sought on the problem and the possible solutions. Please
get involved if you can help with detailed analyses on this topic.

SUMMARY

The estimation of D is difficult and imprecise. The current method works
well in many cases, yet breaks down *badly* in one particular very
common use case of database design: a large dependent table with a
multi-column Primary Key. In some cases the estimate of D *decreases* as
the size of the table increases, though the estimate of D is an
underestimate in almost all cases, whatever the table design.

PostgreSQL cvstip currently seriously underestimates D for very large
dependent tables with rows that are clustered around one of the columns
of a multi-column Primary Key. The mis-estimation identified can lead to
poor system performance should the mis-estimation lead to the use of
HashAggregate or general mis-selection of optimal join plans.

An example of this is the orders and lineitem tables from DBT-3/TPC-H,
which have a 1:M relationship. There are M lineitems associated with
each order and all are inserted in one transaction into lineitem. If M
is relatively large, then problems may ensue.
A problem SQL statement may be something like:
SELECT l_orderkey, sum(l_extendedprice) FROM lineitem GROUP BY l_orderkey;

This issue could have an impact on any large table in a 1:M relationship
where the actual number of distinct values, D, is much larger than the
sample size, n (D >> n). This can also affect associative tables where
more than one 1:M relationship exists to separate tables, such as Fact
tables in a star schema.

The issue is alleviated by setting the column statistics target higher,
though this merely increases the size range of the table over which
problems may occur.

There are a number of ways we can improve on the current estimates,
using techniques suggested in later statistical research.

Some proposals are made and comments are sought on the problem and the
possible solutions.

It is possible that we need more than one estimate of D for various
purposes. We might potentially need a low estimate of D for use in join
planning, and a higher one to reduce the risk of hash table
operations. This approach might be taken initially to allow us to
implement improved estimators without throwing out many previously good
plans.

WHAT WE CURRENTLY DO WITH ANALYZE

Notation
D = estimate of the number of distinct values (aka n_distinct)
N = number of rows in table
n = number of rows in sample

Sampling method
* Fixed sample size, no matter how big the table, following Chaudhuri et
al's 1998 paper on sample size sufficiency for statistics histograms.

* Sample blocks = sample rows = 300 * col stats target 

Results
* Count rows/value for all values observed in sample
  f1 = number of values seen exactly once in the sample
  d  = number of distinct values seen in the sample

* If f1 == n  => assume unique: D = N, and scale with N
  else if f1 == 0  => assume D = d
  else  => apply the Haas-Stokes [1998] estimator (sketched below)

* If D > 10% of N  => scale with N

[There are a variety of techniques selected from Haas-Stokes [1998],
Chaudhuri et al [1998], Vitter and Knuth. Sometimes these authors have
discussed the same subject and come up with different answers, so you
need to be careful to say which reference you mean when discussing
these.]
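
A sketch of the Haas-Stokes estimator applied in the "else" branch above,
D = n*d / (n - f1 + f1*n/N); the clamping and the negative-n_distinct
"scale with N" convention are left out, so treat this as illustrative rather
than a copy of analyze.c:

#include <stdio.h>

static double
haas_stokes_D(double n,   /* rows in sample                 */
              double N,   /* rows in table                  */
              double d,   /* distinct values in the sample  */
              double f1)  /* values seen exactly once       */
{
    double denom = (n - f1) + f1 * n / N;
    double D = n * d / denom;

    if (D > N)
        D = N;            /* can't have more distinct values than rows */
    return D;
}

int
main(void)
{
    /* e.g. a 3000-row sample of a 10M-row table: 2900 distinct values,
     * 2850 of them seen only once -> estimate of roughly 58,000 */
    printf("D estimate = %.0f\n", haas_stokes_D(3000, 1e7, 2900, 2850));
    return 0;
}

Note how, for large N, the estimate is driven almost entirely by the handful
of duplicates the sample happens to contain, which is the sensitivity this
thread keeps running into.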

ISSUES

1. Estimation of D; mentioned above and covered in more detail below.
(see ESTIMATES OF D FOR DEPENDENT TABLES)

2. The sample size calculation correctly follows Chaudhuri et al [1998]
when the number of rows in the table is 1 million. However, the required
sample size is overestimated for smaller tables and underestimated for
larger ones. The sample size should be multiplied by 2.3 (i.e. ln(10))
for every x10 increase in table size; e.g. a 100 million row table
requires a sample 4.6 times larger to achieve the same accuracy of
histogram selection.
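
As a hedged sketch of one reading of that scaling rule (the clamp to 1.0
for tables of about a million rows or fewer, and the function name, are
assumptions made for illustration):

#include <math.h>
#include <stdio.h>

/*
 * Sketch of the scaling suggested in issue 2: take the Chaudhuri et al
 * sample size as correct at N = 1 million rows and grow it by ln(10) ~ 2.3
 * for each further factor-of-10 increase in table size. Clamping the
 * factor at 1.0 for smaller tables is an assumption for illustration.
 */
static double
scaled_sample_rows(double base_rows, double table_rows)
{
    double      factor = log(table_rows / 1e6);

    if (factor < 1.0)
        factor = 1.0;
    return base_rows * factor;
}

int
main(void)
{
    double      base = 300.0 * 10;      /* 300 * default stats target of 10 */

    printf("N = 1e6: %.0f sample rows\n", scaled_sample_rows(base, 1e6));
    printf("N = 1e7: %.0f sample rows\n", scaled_sample_rows(base, 1e7));
    printf("N = 1e8: %.0f sample rows\n", scaled_sample_rows(base, 1e8));
    return 0;
}

This prints roughly 3000, 6900 and 13800 sample rows, i.e. the 2.3x and
4.6x factors described above.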


OBSERVATIONS

1. *All* methods of statistical analysis are improved by larger sample
fractions. The D estimator method currently in use shows an optimum of
accuracy and sample fraction at around 5% of a table, as shown in the
author's original paper [Haas Stokes (1998)]. The current
implementation's error rates climb higher as table size increases.

2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a
row sampling technique rather than a block sampling technique. This
would translate directly into a performance drop from large sample
ratios, but since we currently use a fixed sample size this problem is
not yet visible for larger tables. With a 2GB table, we would typically
sample 1% of the blocks, yet around 0.025 - 0.05% of the rows.
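
A quick back-of-envelope check of that 2GB example; the 8KB block size and
the 200 to 400 byte average row widths below are assumptions, chosen to
land near the 0.025 - 0.05% figure quoted above:

#include <stdio.h>

/*
 * Rough check of the 2GB example: with a default 3000-row sample we can
 * touch up to 3000 blocks, which is a much larger fraction of the blocks
 * than of the rows.
 */
int
main(void)
{
    double      table_bytes = 2.0 * 1024 * 1024 * 1024;
    double      blocks = table_bytes / 8192;            /* assumed BLCKSZ   */
    double      sample_rows = 300.0 * 10;               /* default target   */
    double      rows_if_200b = table_bytes / 200;       /* assumed row size */
    double      rows_if_400b = table_bytes / 400;

    printf("blocks sampled: up to %.2f%%\n", 100.0 * sample_rows / blocks);
    printf("rows sampled:   %.3f%% - %.3f%%\n",
           100.0 * sample_rows / rows_if_200b,
           100.0 * sample_rows / rows_if_400b);
    return 0;
}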

3. Large values of statistics target (up to 1000) could cause a number
of problems with statistics catalog table growth (mentioned on
-perform). Setting these values appropriately can take significant
effort. Automatic scaling of such parameters is desirable.

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 [ ... a large amount of analysis based on exactly one test case ... ]

I think you are putting too much emphasis on fixing one case and not
enough on considering what may happen in other cases ...

In general, estimating n-distinct from a sample is just plain a hard
problem, and it's probably foolish to suppose we'll ever be able to
do it robustly.  What we need is to minimize the impact when we get
it wrong.  So I agree with the comment that we need to finish the
unfinished project of making HashAggregate tables expansible, but
I'm dubious about the rest of this.

regards, tom lane



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Josh Berkus
Tom,

 In general, estimating n-distinct from a sample is just plain a hard
 problem, and it's probably foolish to suppose we'll ever be able to
 do it robustly.  What we need is to minimize the impact when we get
 it wrong.  

Well, I think it's pretty well proven that to be accurate at all you need 
to be able to sample at least 5%, even if some users choose to sample 
less.   Also I don't think anyone on this list disputes that the current 
algorithm is very inaccurate for large tables.  Or do they?

While I don't think that we can estimate N-distinct completely accurately, 
I do think that we can get within +/- 5x for 80-90% of all cases, instead 
of 40-50% of cases like now.  We can't be perfectly accurate, but we can 
be *more* accurate.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Jim C. Nasby
On Wed, Jan 04, 2006 at 07:10:29PM +0000, Simon Riggs wrote:
 3. We should also apply multi-column heuristics to the estimation of D,
 once we have estimated all columns. For column groups (pairs, triples
 etc) that form part of a PK, we know that it must be true that D1 *
 D2 * ... * Dk >= N. In many cases we will be very confident of our estimate
 of D when we decide D = d. i.e. When we have two columns, we can use this
 to infer that D1 >= N/d when D2 = d. So we can do this in any case where
 we have confident estimates of all but one column; the required
 information is available at that time.
 e.g. if line_item primary key ( l_orderkey, l_linenumber ) and we know
 that there are at most 10 l_linenumber values in the table, then there
 should be at least N/10 values for l_orderkey, so set it to that if it is
 lower (only).

Sorry if I'm pointing out the obvious, but I would do this for any 
unique index, not just a PK. (It should still hold for any unique index,
right?)
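
As a hedged illustration of that lower-bound inference (the helper name
and every number below are invented for the sketch):

#include <stdio.h>

/*
 * Sketch of the lower-bound inference described above, for a two-column
 * unique key (l_orderkey, l_linenumber): because each key pair is unique,
 * D1 * D2 >= N, so a confident estimate d for one column gives a floor of
 * N / d for the other. All numbers below are hypothetical.
 */
static double
apply_unique_key_floor(double current_estimate, double N, double other_d)
{
    double      floor_est = N / other_d;

    return (current_estimate < floor_est) ? floor_est : current_estimate;
}

int
main(void)
{
    double      N = 60e6;           /* rows in lineitem */
    double      d_linenumber = 7;   /* confidently estimated: at most 7   */
    double      d_orderkey = 50000; /* badly underestimated from a sample */

    printf("raised l_orderkey estimate: %.0f\n",
           apply_unique_key_floor(d_orderkey, N, d_linenumber));
    return 0;
}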

Also, was an approach of sampling random rows within random blocks
considered? Something like:

until row sample size reached
read random block
sample x% of rows in that block randomly
done
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Josh Berkus
Simon,

 - Are there any performance issues that can be directly attributed to
 mis-estimation of N-Distinct (D) by the ANALYZE command?

Yes.   There's at least one query (maybe two) from TPC-H which bombs 
because of bad N-distinct estimation, even with stats_target =1000.  Based 
on my experience with data warehouses, this also occurs in the field.

 - If so, can we do better than we currently achieve? How?

Replace the current algorithm and broaden the sample.

 - Would altering the estimate of D cause problems in other places?

Unlike Index Cost Estimation, I wouldn't expect it to.  We make pretty 
proper use of D right now, it's just that for some common cases our 
estimates of D are bad.

 The estimation of D is difficult and imprecise. The current method works
 well in many cases, yet breaks down *badly* in one particular very
 common use case of database design: a large dependent table with a
 multi-column Primary Key. In some cases the estimate of D *decreases* as
 the size of the table increases, though the estimate of D is an
 underestimate in almost all cases, whatever the table design.

Actually, the current estimator underestimates D for *any* large table's 
high-cardinality columns, primary key, multi-column, or not. 
Chaudhuri's calculation seems to be designed to yield the smallest 
number of cardinal values that could reasonably be expected to yield the 
provided sample.   That is, if the estimate range within a stdev of 2.0 
gives from 50,000 to 500,000 distinct values, Chaudhuri picks 50,000.

This conservative approach makes sense when you're only considering join 
strategies.  That is, given an unreliable estimate you want to estimate 
D low so that you don't wrongly choose a nested loop, since the cost of 
that mistake is much higher than the cost of performing an unnecessary 
hash join.   It's conservative in that sense.

However, PostgreSQL now has a whole set of hash operations and other 
query types for which a too-low estimate of D causes query lockup. So for 
these operations, Chaudhuri ceases to be conservative and becomes 
high-risk. FWIW, my testing with TPCH showed that estimate error is 
usually OK within +/- 5x. Beyond that you start to get bad query plans.

(yes, I know all of the above begs examples.  I'm swamped.   I believe I 
posted examples when I first started talking about n-distinct estimation a 
year ago)

So I think it's vital that we look at algorithms designed to deliver us 
the median estimated D, not the lowest reasonable, in addition to 
increasing sample size.  The block-based estimator functions which 
Andrew and I looked at seem designed to do that provided a sample of 
between 1% and 10%.

 1. *All* methods of statistical analysis are improved by larger sample
 fractions. The D estimator method currently in use shows an optimum of
 accuracy and sample fraction at around 5% of a table, as shown in the
 author's original paper [Haas Stokes (1998)]. The current
 implementation's error rates climb higher as table size increases.

I read 5 different papers on ACM about sampling D.  All of them were 
united in saying that you couldn't get even slightly accurate estimates 
with less than 3% sampling.

 2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a
 row sampling technique rather than a block sampling technique. This
 would translate directly into a performance drop from large sample
 ratios, but since we currently use a fixed sample size this problem is
 not yet visible for larger tables. With a 2GB table, we would typically
 sample 1% of the blocks, yet around 0.025 - 0.05% of the rows.

This would be a reason to use block-sampling ONLY, rather than hybrid 
sampling.

 3. Large values of statistics target (up to 1000) could cause a number
 of problems with statistics catalog table growth (mentioned on
 -perform). Setting these values appropriately can take significant
 effort. Automatic scaling of such parameters is desirable.

Well, decoupling the n-distinct sample size from the # of heuristics rows 
would fix that.

 4. ANALYZE doesn't use more memory if maintenance_work_mem is set high,
 nor does it use less if it is set low. Actual memory usage isn't
 measured at all. With very long rows this is max 24 MB with default
 stats target settings and BLCKSZ, or 2.4 GB with highest stats target
 (1000). This probably is of lower importance since stats targets are
 only usually set higher on larger databases, which typically have larger
 memory configurations anyway - and very long rows are uncommon because
 of TOAST. Typical memory usage by ANALYZE would be < 1 MB with default
 settings i.e. maybe as low as a 0.01% sample for a very large table.

Yeah, I think I pointed out that ANALYZE doesn't seem to be using 
maintenance_mem a year ago.   It should, although there should be options 
to use less than maintenance_mem if the user wants low-impact analyze.

 There is a further effect of concern here. We currently apply a lift
 

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Simon Riggs
On Wed, 2006-01-04 at 14:49 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  [ ... a large amount of analysis based on exactly one test case ... ]

[Hmmm, those are your opinions, not my words. Funny guy ;-) ]

The one test case just happens to be a very common 1:M relationship, an
example of which occurs within the TPC-H/DBT-3 test database. I picked
that so it was an obvious publicly accessible test case, rather than an
isolated customer issue that couldn't be aired fully. I was hoping to
allow the problem to be explored and improved.

 I think you are putting too much emphasis on fixing one case and not
 enough on considering what may happen in other cases ...

It's not just one problem, but knowing that you would accept only a
detailed analysis, I researched that one case so that it was
indisputable. 

 I'm dubious about the rest of this.

Excellent. Much better than "It's Wrong". I'll write some code and run some 
tests.

Thanks for reading it all; sorry it had to be so long.

Best Regards, Simon Riggs





Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Greg Stark
Josh Berkus josh@agliodbs.com writes:

 Tom,
 
  In general, estimating n-distinct from a sample is just plain a hard
  problem, and it's probably foolish to suppose we'll ever be able to
  do it robustly.  What we need is to minimize the impact when we get
  it wrong.  
 
 Well, I think it's pretty well proven that to be accurate at all you need 
 to be able to sample at least 5%, even if some users choose to sample 
 less.   Also I don't think anyone on this list disputes that the current 
 algorithm is very inaccurate for large tables.  Or do they?

I think it's worse than that. It's proven that to get any kind of accuracy in
this kind of estimate you basically have to look at the entire table.

Someone posted a URL to a paper that had a very clever way of estimating
distinct values. It required scanning the entire table but only kept a sample
of the values found using a method that guaranteed the sample was
representative not of the entire table but of the distinct values.

I think you're right that a reasonable sample size for this kind of estimate
is going to be proportional to the table size, not the constant sized sample
that regular statistics need. However that same paper has some preliminary
verbiage about how previous papers had found that the sample sizes needed were
unacceptably large. Something like 90%.

Unfortunately I can't find the reference to the paper this moment. I'll look
again later.

I think it would be interesting to get the algorithm for sampling from that
paper in. It would only be able to be used on a VACUUM ANALYZE but currently
VACUUMs are necessary anyways. I do wonder what would happen when we get the
incremental VACUUMs and the bitmaps to avoid vacuuming pages unnecessarily
though. Then it would be less useful.

-- 
greg




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Simon Riggs
On Wed, 2006-01-04 at 17:57 -0600, Jim C. Nasby wrote:
 On Wed, Jan 04, 2006 at 07:10:29PM +0000, Simon Riggs wrote:
  3. We should also apply multi-column heuristics to the estimation of D,
  once we have estimated all columns. For column groups (pairs, triples
  etc) that form part of a PK, we know that it must be true that D1 *
  D2 ... Dk = N. In many cases we will be very confident of our estimate
  of D when we decide = d. i.e. When we have two columns, we can use this
  to infer that D1 = N/d when D2 = d. So we can do this in any case where
  we have confident estimates of all but one column; the required
  information is available at that time.
  e.g. if line_item primary key ( l_orderkey, l_linenumber ) and we know
  that there are at most 10 l_linenumber values in the table, then there
  should be N/10 values for l_orderkey, so set it to that if it is lower
  (only).
 
 Sorry if I'm pointing out the obvious, but I would do this for any 
 unique index, not just a PK. (It should still hold for any unique index,
 right?)

Yes. It's just a less common case to have > 1 unique index.

 Also, was an approach of sampling random rows within random blocks
 considered? Something like:
 
 until row sample size reached
 read random block
 sample x% of rows in that block randomly
 done

Yes. 

I was trying to maintain the existing approach as much as possible,
rather than ripping everything out and starting again which could cause
just as many problems as it solves. So evolution, rather than
revolution.

The approach I suggested uses the existing technique for selecting
random blocks, then either an exhaustive check on all of the rows in a
block or the existing random row approach, depending upon available
memory. We need to check all of the rows in a reasonable sample of
blocks otherwise we might miss clusters of rows in large tables - which
is the source of the problems identified.

The other reason was to increase the sample size, which is a win in any
form of statistics.
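
For what it's worth, a toy sketch of that either/or control flow follows;
the block count, row counts and memory budget are all invented, and the
real block selection in ANALYZE is Vitter-style rather than a plain
random() draw:

#include <stdio.h>
#include <stdlib.h>

/*
 * Toy sketch of the hybrid approach described above: select a random set
 * of blocks, then either examine every row in each block (possible when
 * the memory budget covers whole blocks, and needed to spot clusters of
 * equal values) or fall back to a fixed number of random rows per block.
 * Only the control flow is shown; the table here is imaginary.
 */
#define NBLOCKS        1000
#define ROWS_PER_BLOCK 100

int
main(void)
{
    int         blocks_to_sample = 30;
    int         row_budget = 2000;      /* rows we can hold in memory */
    int         whole_blocks = (row_budget >=
                                blocks_to_sample * ROWS_PER_BLOCK);
    int         per_block = whole_blocks ? ROWS_PER_BLOCK
                                         : row_budget / blocks_to_sample;
    int         rows_kept = 0;

    srandom(54823);

    for (int i = 0; i < blocks_to_sample; i++)
    {
        int         blkno = (int) (random() % NBLOCKS);

        /* a real implementation would read block blkno here and collect
         * either all of its rows or per_block randomly chosen rows */
        rows_kept += per_block;
        (void) blkno;
    }

    printf("%s, keeping %d rows in total\n",
           whole_blocks ? "scanning whole blocks" : "sampling rows per block",
           rows_kept);
    return 0;
}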

Best Regards, Simon Riggs




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Simon Riggs
On Wed, 2006-01-04 at 19:22 -0500, Greg Stark wrote:
 I think you're right that a reasonable sample size for this kind of
 estimate
 is going to be proportional to the table size, not the constant sized
 sample
 that regular statistics need. 

Agreed [I said exactly that in April]; the counter argument at that time
was that proportional samples on large tables lead to external sorts in
many cases which would lead to unacceptably long run times - since we
need to sort the values for each attribute in turn.

I've proposed limiting ourselves to maintenance_work_mem (I credit Josh
with that idea from April). If you can allocate 1 GB of memory to an
ANALYZE then this would be a very large proportion of a medium sized
table (or partition).

Considering how few rows we sample at the moment, increasing the actual
sample size by many thousands of rows would have a very beneficial effect.

 On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
  This one looks *really* good. 
  
   http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
  
  It does require a single full table scan 

The Distinct Sampling approach you mention would scan the whole table
and also use large in-memory hash tables. So it uses more I/O, the same
memory and probably less CPU - no sorting required. The technique is the
same implementation as a HashAgg, just one that loses rows in a
predictable manner when it spills out of memory. It doesn't identify
columns that scale with N, nor does it calculate correlation.

That's the same as re-writing Count(Distinct) to use hashing, which is a
TODO item. So perhaps you could plan the code to do the Distinct
Sampling approach at the same time. Hmmm. I'll think about that.
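
As a hedged illustration only, a toy sketch of that
hash-table-that-loses-rows-predictably idea is below (hard-coded data, an
invented hash function and a deliberately absurd memory budget; it shows
the principle, not the algorithm from the paper):

#include <stdio.h>

/*
 * Rough sketch of hash-based "distinct sampling": scan every row, hash each
 * value, and keep only values whose hash falls into a 1/2^level slice of
 * the hash space. When the in-memory set overflows, raise the level and
 * discard kept values that no longer qualify (the predictable loss of rows
 * mentioned above), then continue. The number of kept distinct values
 * times 2^level then estimates D.
 */
#define MAX_KEPT 4              /* absurdly small budget, to force spills */

static unsigned int
hash_u32(unsigned int x)
{
    x ^= x >> 16;
    x *= 0x7feb352dU;
    x ^= x >> 15;
    x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

int
main(void)
{
    unsigned int values[] = {1, 2, 3, 4, 5, 6, 7, 8, 2, 3, 9, 10, 11, 12, 1};
    int          nvalues = (int) (sizeof(values) / sizeof(values[0]));
    unsigned int kept[16];
    int          nkept = 0;
    int          level = 0;

    for (int i = 0; i < nvalues; i++)
    {
        unsigned int h = hash_u32(values[i]);
        int          already = 0;

        /* keep only values whose hash lands in the current slice */
        if ((h & ((1U << level) - 1)) != 0)
            continue;

        for (int j = 0; j < nkept; j++)
            if (kept[j] == values[i])
                already = 1;
        if (already)
            continue;

        kept[nkept++] = values[i];

        if (nkept > MAX_KEPT)
        {
            int         k = 0;

            level++;            /* spill: halve the slice of kept hashes */
            for (int j = 0; j < nkept; j++)
                if ((hash_u32(kept[j]) & ((1U << level) - 1)) == 0)
                    kept[k++] = kept[j];
            nkept = k;
        }
    }

    printf("estimated D ~= %d (kept %d values at level %d)\n",
           nkept << level, nkept, level);
    return 0;
}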

Best Regards, Simon Riggs




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Trent Shipley
Sorry to interrupt.  The discussion is interesting, but I need some help to 
follow along.

On Wednesday 2006-01-04 17:07, Josh Berkus wrote:
 Simon,

  - Are there any performance issues that can be directly attributed to
  mis-estimation of N-Distinct (D) by the ANALYZE command?

 Yes.   There's at least one query (maybe two) from TPC-H which bombs
 because of bad N-distinct estimation, even with stats_target =1000.  Based
 on my experience with data warehouses, this also occurs in the field.

  - If so, can we do better than we currently achieve? How?

 Replace the current algorithm and broaden the sample.

Is "replace the algorithm" the same as saying "contextually use some estimate 
of D that is not Chaudhuri"?

  - Would altering the estimate of D cause problems in other places?

 Unlike Index Cost Estimation, I wouldn't expect it to.  We make pretty
 proper use of D right now, it's just that for some common cases our
 estimates of D are bad.

  The estimation of D is difficult and imprecise. The current method works
  well in many cases, yet breaks down *badly* in one particular very
  common use case of database design: a large dependent table with a
  multi-column Primary Key. In some cases the estimate of D *decreases* as
  the size of the table increases, though the estimate of D is an
  underestimate in almost all cases, whatever the table design.

 Actually, the current estimator underestimates D for *any* large table's
 high-cardinality columns, primary key, multi-column, or not.
 Chaudhuri's calculation seems to be designed to yield the smallest
 number of cardinal values that could reasonably be expected to yield the
 provided sample.   That is, if the estimate range within a stdev of 2.0
 gives from 50,000 to 500,000 distinct values, Chaudhuri picks 50,000.

 This conservative approach makes sense when you're only considering join
 strategies.  That is, given an unreliable estimate you want to estimate
 D low so that you don't wrongly choose a nested loop, since the cost of
 that mistake is much higher than the cost of performing an unnecessary
 hash join.   It's conservative in that sense.

So Chaudhuri's estimate of D is appropriate (and is working) when making 
decisions about joins.

 However, PostgreSQL now has a whole set of hash operations and other
 query types for which a too-low estimate of D causes query lockup.

Why?  

 So for these operations, 
 Chaudhuri ceases to be conservative and becomes high-risk.   FWIW, my
 testing with TPCH showed that estimate error is usually OK within +/-
 5x.  Beyond that you start to get bad query plans.

 (yes, I know all of the above begs examples.  I'm swamped.   I believe I
 posted examples when I first started talking about n-distinct estimation a
 year ago)

 So I think it's vital that we look at algorithms designed to deliver us
 the median estimated D, not the lowest reasonable, in addition to
 increasing sample size.  The block-based estimator functions which
 Andrew and I looked at seem designed to do that provided a sample of
 between 1% and 10%.

Do you *really* want the median estimate in these case?  Are you certain you 
do not want something with the opposite behavior of Chaudhuri's estimate so 
that for small sample sizes the bias is toward a high estimate of D? 
(Converges on D from the right instead of the left.)

Chaudhuri's ---> D <--- needed
Estimate                estimate

  1. *All* methods of statistical analysis are improved by larger sample
  fractions. The D estimator method currently in use shows an optimum of
  accuracy and sample fraction at around 5% of a table, as shown in the
  author's original paper [Haas Stokes (1998)]. The current
  implementation's error rates climb higher as table size increases.

 I read 5 different papers on ACM about sampling D.  All of them were
 united in saying that you couldn't get even slightly accurate estimates
 with less than 3% sampling.

These statements are at odds with my admittedly basic understanding of 
statistics.  Isn't the power of a sample more related to the absolute size of 
the sample than the sample as fraction of the population?  Why not just pick 
a smallish sample size, say about 3000, and apply it to all the tables, even 
the ones with just a single row (modify appropriately from block sampling).

  2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a
  row sampling technique rather than a block sampling technique. This
  would translate directly into a performance drop from large sample
  ratios, but since we currently use a fixed sample size this problem is
  not yet visible for larger tables. With a 2GB table, we would typically
  sample 1% of the blocks, yet around 0.025 - 0.05% of the rows.

 This would be a reason to use block-sampling ONLY, rather than hybrid
 sampling.

  3. Large values of statistics target (up to 1000) could cause a number
  of problems with statistics catalog table growth (mentioned on
  -perform). Setting these values appropriately can take significant
  effort. Automatic scaling of such parameters is desirable.

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Greg Stark

Simon Riggs [EMAIL PROTECTED] writes:

 The approach I suggested uses the existing technique for selecting
 random blocks, then either an exhaustive check on all of the rows in a
 block or the existing random row approach, depending upon available
 memory. We need to check all of the rows in a reasonable sample of
 blocks otherwise we might miss clusters of rows in large tables - which
 is the source of the problems identified.
 
 The other reason was to increase the sample size, which is a win in any
 form of statistics.

Only if your sample is random and independent. The existing mechanism tries
fairly hard to ensure that every record has an equal chance of being selected.
If you read the entire block and not appropriate samples then you'll introduce
systematic sampling errors. For example, if you read an entire block you'll be
biasing towards smaller records.

I think it would be useful to have a knob to increase the sample size
separately from the knob for the amount of data retained in the statistics
tables. Though I think you'll be disappointed and find you have to read an
unreasonably large sample out of the table before you get more useful distinct
estimates.

Certainly it's worth testing this in a low impact way like just keeping the
existing sample method and dialing up the sample sizes before you try anything
that would sacrifice the statistical validity of the more solid estimates.

-- 
greg




Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Josh Berkus

Trent,

Sorry to interrupt.  The discussion is interesting, but I need some help to 
follow along.


Thought-out commentary is welcome.


Is "replace the algorithm" the same as saying "contextually use some estimate 
of D that is not Chaudhuri"?


Yes.  I favor a block-based approach like Brutlag, largely because it 
allows us to increase the sample size without dramatically increasing I/O.


So Chaudhuri's estimate of D is appropriate (and is working) when making 
decisions about joins.


Some kinds of joins.   It avoids, for example, risky use of nested loops.


However, PostgreSQL now has a whole set of hash operations and other
query types for which a too-low estimate of D causes query lockup.



Why?  


Two specific examples, both of which I've encountered in the field:

1) too-low D will cause an aggregate query to use a hashagg which is 
larger than memory resulting in swapping (or disk spill when it's fixed) 
which makes the query very much slower than if the hashagg was not used.


2) much too-low D will cause the planner to pick a seq scan when it's 
not necessary, resulting in increased query times.



Do you *really* want the median estimate in these case?  Are you certain you 
do not want something with the opposite behavior of Chaudhuri's estimate so 
that for small sample sizes the bias is toward a high estimate of D? 
(Converges on D from the right instead of the left.)


Chaudhuri's ---> D <--- needed
Estimate                estimate


Hmmm.  Yeah, I see what you mean.  True, the ideal approach would be to 
determine for each query operation whether a too-low D or a too-high D 
was more risky, and then use the more conservative number.   However, 
that would complicate the query planner enough that I think Tom would 
leave us. :-p



These statements are at odds with my admittedly basic understanding of 
statistics.  Isn't the power of a sample more related to the absolute size of 
the sample than the sample as fraction of the population?  Why not just pick 
a smallish sample size, say about 3000, and apply it to all the tables, even 
the ones with just a single row (modify appropriately from block sampling).


Nope, it's definitely proportional.  As a simple example, a sample of 
500 rows in a table of 1000 rows should yield stats estimates with 90%+ 
accuracy.  But a sample of 500 rows in a 600,000,000 row table is so 
small as to be nearly useless; it's quite possible to get all the same 
value in a random sample of < 0.1% even on a column with a D/N of 0.001. 
If you look at the papers cited, almost all researchers more recent 
than Chaudhuri use a proportional sample size.


And we're taking the fixed-sample-size approach now, and it's not 
working very well for large tables.
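
As a small illustration of that point, a toy simulation (uniformly
distributed values, arbitrarily chosen table sizes and a fixed D/N of 0.1):

#include <stdio.h>
#include <stdlib.h>

/*
 * Toy simulation: draw a fixed 3000-row sample from tables of growing size
 * with D/N = 0.1 and count the distinct values the sample actually sees.
 * Once D greatly exceeds the sample size, nearly every sampled value is a
 * singleton, so the sample looks the same whatever the true D is.
 */
int
main(void)
{
    srandom(12345);

    for (long N = 100000; N <= 100000000; N *= 10)
    {
        long        D = N / 10;
        int         sample = 3000;
        char       *seen = calloc((size_t) D, 1);
        long        d_seen = 0;

        for (int i = 0; i < sample; i++)
        {
            long        v = random() % D;

            if (!seen[v])
            {
                seen[v] = 1;
                d_seen++;
            }
        }
        printf("N = %9ld  D = %8ld  distinct seen in 3000-row sample: %ld\n",
               N, D, d_seen);
        free(seen);
    }
    return 0;
}

For the larger tables nearly every sampled value is a singleton, so the
sample carries almost no information about whether D is one million or
ten million.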


--Josh Berkus



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Josh Berkus

Greg,


Only if your sample is random and independent. The existing mechanism tries
fairly hard to ensure that every record has an equal chance of being selected.
If you read the entire block and not appropriate samples then you'll introduce
systematic sampling errors. For example, if you read an entire block you'll be
biasing towards smaller records.


Did you read any of the papers on block-based sampling?   These sorts of 
issues are specifically addressed in the algorithms.


--Josh



Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Josh Berkus

Folks,

Nope, it's definitely proportional.   As a simple example, a sample of 
500 rows in a table of 1000 rows should yield stats estimates with 90%+ 
accuracy.  But a sample of 500 rows in a 600,000,000 row table is so 
small as to be nearly useless; it's quite possible to get all the same 
value in a random sample of < 0.1% even on a column with a D/N of 0.001. 


I meant a D/N of 0.1.  Sorry.

--Josh
