Re: [HACKERS] Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

2015-08-07 Thread Josh Berkus
Petr,

Just user-tested SYSTEM_ROWS and SYSTEM_TIME.  They work as expected.
Useful!

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

2015-08-06 Thread Josh Berkus
On 08/06/2015 01:10 PM, Simon Riggs wrote:
 Given, user-stated probability of accessing a block of P and N total
 blocks, there are a few ways to implement block sampling.
 
 1. Test P for each block individually. This gives a range of possible
 results, with 0 blocks being possible outcome, though decreasing in
 probability as P increases for fixed N. This is the same way BERNOULLI
 works, we just do it for blocks rather than rows.
 
 2. We calculate P/N at start of scan and deliver this number blocks by
 random selection from N available blocks.
 
 At present we do (1), exactly as documented. (2) is slightly harder
 since we'd need to track which blocks have been selected already so we
 can use a random selection with no replacement algorithm. On a table
 with uneven distribution of rows this would still return a variable
 sample size, so it didn't seem worth changing.

Aha, thanks!

So, seems like this is just a doc issue? That is, we just need to
document that using SYSTEM on very small sample sizes may return
unexpected numbers of results ... and maybe also how the algorithm
actually works.

I agree that implementing (2) makes more sense as an additional
algorithm for someone to write in the future.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

2015-08-06 Thread Alvaro Herrera
Petr Jelinek wrote:
 On 2015-08-06 22:25, Josh Berkus wrote:

 If there is no appropriate place, I'll just write a blog.
 
 There is a blog post on 2ndQ blog page which tries to describe the sampling
 methods visually, not sure if it's more obvious from that or not. It's
 somewhat broken on planet though (only title there).

http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5/

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

2015-08-06 Thread Simon Riggs
On 6 August 2015 at 20:14, Josh Berkus j...@agliodbs.com wrote:


 This table has around 185 rows per page.  As the sample size goes up,
 the number times I get zero rows goes down, but those results seem to
 still include data pages with zero rows.  For example, here's a series
 of results from a 0.04 sample against the million-row table.

 370
 370
 370
 555
 555
 185
 0
 925

 Since this is a synthetic table I just generated, it contains almost
 exactly 185 rows per data page for every data page.  So on a 0.04%
 sample, the variation between 370 rows and 555 rows (whether we have 2
 or 3 data pages) is expected, since 0.04% of 5406 data pages is 2.16 pages.

 The results of 0, 185 and 925 are not.  It really seems like SYSTEM is
 treating 0.04% as a maximum, but taking a random number of data pages
 somewhere around that maximum, using math which can choose numbers of
 pages far outside of the % requested by the user, and which includes 0.

 Speaking from a user perspective, SYSTEM seems broken to me.  I can't
 imagine using it for anything with a that degree of variation in the
 number of results returned, especially if it's possible to return zero
 rows from a populated table.


The docs say...

The SYSTEM method does block-level sampling with each block having the
specified chance of being selected; all rows in each selected block are
returned.

SQLStandard suggests the parameter value should be an integer, which would
limit it to 1% sample. Allowing a non-integer sample size is a PostgreSQL
extension. There is no guidance on how to implement SYSTEM in the standard.

Given, user-stated probability of accessing a block of P and N total
blocks, there are a few ways to implement block sampling.

1. Test P for each block individually. This gives a range of possible
results, with 0 blocks being possible outcome, though decreasing in
probability as P increases for fixed N. This is the same way BERNOULLI
works, we just do it for blocks rather than rows.

2. We calculate P/N at start of scan and deliver this number blocks by
random selection from N available blocks.

At present we do (1), exactly as documented. (2) is slightly harder since
we'd need to track which blocks have been selected already so we can use a
random selection with no replacement algorithm. On a table with uneven
distribution of rows this would still return a variable sample size, so it
didn't seem worth changing.

This is exactly why TABLESAMPLE was designed to allow you to write your own
sampling algorithm, so you can do it anyway you want.

The context here is that SELECT count(*) from the test table takes 10ms.
The intended use case for this feature is against tables that would give
some kind of performance problem, which isn't the case here. On a table
100x larger, the sample is 1000 +/- 40, which seems more acceptable. The
variability of Bernoulli is much less, but then it is roughly 100x slower.
On a 1TB table, SYSTEM is going to be your only choice.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
http://www.2ndquadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


Re: [HACKERS] Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

2015-08-06 Thread Josh Berkus
On 08/06/2015 01:14 PM, Josh Berkus wrote:
 On 08/06/2015 01:10 PM, Simon Riggs wrote:
 Given, user-stated probability of accessing a block of P and N total
 blocks, there are a few ways to implement block sampling.

 1. Test P for each block individually. This gives a range of possible
 results, with 0 blocks being possible outcome, though decreasing in
 probability as P increases for fixed N. This is the same way BERNOULLI
 works, we just do it for blocks rather than rows.

 2. We calculate P/N at start of scan and deliver this number blocks by
 random selection from N available blocks.

 At present we do (1), exactly as documented. (2) is slightly harder
 since we'd need to track which blocks have been selected already so we
 can use a random selection with no replacement algorithm. On a table
 with uneven distribution of rows this would still return a variable
 sample size, so it didn't seem worth changing.
 
 Aha, thanks!
 
 So, seems like this is just a doc issue? That is, we just need to
 document that using SYSTEM on very small sample sizes may return
 unexpected numbers of results ... and maybe also how the algorithm
 actually works.

Following up on this ... where is TABLESAMPLE documented other than in
the SELECT command?  Doc search on the website is having issues right
now.  I'm happy to write a doc patch.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

2015-08-06 Thread Simon Riggs
On 6 August 2015 at 21:14, Josh Berkus j...@agliodbs.com wrote:

 On 08/06/2015 01:10 PM, Simon Riggs wrote:
  Given, user-stated probability of accessing a block of P and N total
  blocks, there are a few ways to implement block sampling.
 
  1. Test P for each block individually. This gives a range of possible
  results, with 0 blocks being possible outcome, though decreasing in
  probability as P increases for fixed N. This is the same way BERNOULLI
  works, we just do it for blocks rather than rows.
 
  2. We calculate P/N at start of scan and deliver this number blocks by
  random selection from N available blocks.


(My mistake, that would be P*N)


  At present we do (1), exactly as documented. (2) is slightly harder
  since we'd need to track which blocks have been selected already so we
  can use a random selection with no replacement algorithm. On a table
  with uneven distribution of rows this would still return a variable
  sample size, so it didn't seem worth changing.

 Aha, thanks!

 So, seems like this is just a doc issue? That is, we just need to
 document that using SYSTEM on very small sample sizes may return
 unexpected numbers of results ... and maybe also how the algorithm
 actually works.


For me, the docs seem exactly correct. The mathematical implications of
that just aren't recorded explicitly.

I will try to reword or add something to make it clear that this can return
a variable number of blocks and thus produces a result with greater
variability in the number of rows returned.

It's documented on the SELECT page only; plus there is a whole new section
on writing tablesample functions.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
http://www.2ndquadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


Re: [HACKERS] Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

2015-08-06 Thread Petr Jelinek

On 2015-08-06 22:25, Josh Berkus wrote:

On 08/06/2015 01:19 PM, Simon Riggs wrote:

For me, the docs seem exactly correct. The mathematical implications of
that just aren't recorded explicitly.


Well, for the SELECT page, all we need is the following (one changed
sentence):

The SYSTEM method is significantly faster than the BERNOULLI method when
small sampling percentages are specified, but it may return a
less-random sample of the table as a result of clustering effects, and
may return a highly variable number of results for very small sample sizes.



BTW this was one of the motivations for making tsm_system_rows contrib 
module, that one will give you exact number of tuples while still doing 
page level sampling. But since it does linear probing it's only useful 
if you want those really small amounts of tuples because it will always 
do random I/O even if you are scanning large part of the table.




I will try to reword or add something to make it clear that this can
return a variable number of blocks and thus produces a result with
greater variability in the number of rows returned.

It's documented on the SELECT page only; plus there is a whole new
section on writing tablesample functions.


Seems like it would be nice to have more detailed user docs somewhere
which explain the sampling algos we have, especially if we get more in
the future.  Not sure where would be appropriate for that, though.

If there is no appropriate place, I'll just write a blog.



There is a blog post on 2ndQ blog page which tries to describe the 
sampling methods visually, not sure if it's more obvious from that or 
not. It's somewhat broken on planet though (only title there).


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

2015-08-06 Thread Josh Berkus
On 08/06/2015 12:45 PM, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On 6 August 2015 at 20:14, Josh Berkus j...@agliodbs.com wrote:
 Speaking from a user perspective, SYSTEM seems broken to me.  I can't
 imagine using it for anything with a that degree of variation in the
 number of results returned, especially if it's possible to return zero
 rows from a populated table.
 
 Please bear in mind you have requested a very small random sample of blocks.
 
 Indeed.  My expectation about it is that you'd get the requested number of
 rows *on average* over many tries (which is pretty much what Josh's
 results show).  Since what SYSTEM actually returns must be a multiple of
 the number of rows per page, if you make a request that's less than that
 number of rows, you must get zero rows some of the time.  Otherwise the
 sampling logic is cheating.

Except that I'm getting zero results when requesting rows which equate
to 2-3 pages, not just less than one page.  Please do read the full bug
report.

And why should the *number* of pages sampled be random?  That's not what
the documentation says that SYSTEM does.  But it's pretty clearly what's
happening; this table has exactly 185 rows per page on every page.  So
the only reason for there to be a variation in the number of rows
returned is that SYSTEM is varying the number of pages sampled.

On testing, the zero results don't go away until the number of pages
requested goes up to 5 (0.095).  So that suggests that SYSTEM is varying
the number of pages retrieved by +/- 4.  Why?  What purpose does that serve?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

2015-08-06 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 6 August 2015 at 20:14, Josh Berkus j...@agliodbs.com wrote:
 Speaking from a user perspective, SYSTEM seems broken to me.  I can't
 imagine using it for anything with a that degree of variation in the
 number of results returned, especially if it's possible to return zero
 rows from a populated table.

 Please bear in mind you have requested a very small random sample of blocks.

Indeed.  My expectation about it is that you'd get the requested number of
rows *on average* over many tries (which is pretty much what Josh's
results show).  Since what SYSTEM actually returns must be a multiple of
the number of rows per page, if you make a request that's less than that
number of rows, you must get zero rows some of the time.  Otherwise the
sampling logic is cheating.

I do *not* think that we should force the sample to contain at least one
page, which is the only way that we could satisfy the complaint as stated.

Perhaps we need to adjust the documentation to make it clearer that
block-level sampling is not the thing to use if you want a sample that
doesn't amount to a reasonable number of blocks.  But I see absolutely
no evidence here that the sampling isn't behaving exactly as expected.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

2015-08-06 Thread Petr Jelinek

On 2015-08-06 22:17, Josh Berkus wrote:

On 08/06/2015 01:14 PM, Josh Berkus wrote:

On 08/06/2015 01:10 PM, Simon Riggs wrote:

Given, user-stated probability of accessing a block of P and N total
blocks, there are a few ways to implement block sampling.

1. Test P for each block individually. This gives a range of possible
results, with 0 blocks being possible outcome, though decreasing in
probability as P increases for fixed N. This is the same way BERNOULLI
works, we just do it for blocks rather than rows.

2. We calculate P/N at start of scan and deliver this number blocks by
random selection from N available blocks.

At present we do (1), exactly as documented. (2) is slightly harder
since we'd need to track which blocks have been selected already so we
can use a random selection with no replacement algorithm. On a table
with uneven distribution of rows this would still return a variable
sample size, so it didn't seem worth changing.


Aha, thanks!

So, seems like this is just a doc issue? That is, we just need to
document that using SYSTEM on very small sample sizes may return
unexpected numbers of results ... and maybe also how the algorithm
actually works.




Yes, it's expected behavior on very small sample size so doc patch seems 
best fix.



Following up on this ... where is TABLESAMPLE documented other than in
the SELECT command?  Doc search on the website is having issues right
now.  I'm happy to write a doc patch.



The user documentation is only in SELECT page, the rest is API docs.

--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

2015-08-06 Thread Josh Berkus
On 08/06/2015 01:19 PM, Simon Riggs wrote:
 On 6 August 2015 at 21:14, Josh Berkus j...@agliodbs.com
 mailto:j...@agliodbs.com wrote:
 
 On 08/06/2015 01:10 PM, Simon Riggs wrote:
  Given, user-stated probability of accessing a block of P and N total
  blocks, there are a few ways to implement block sampling.
 
  1. Test P for each block individually. This gives a range of possible
  results, with 0 blocks being possible outcome, though decreasing in
  probability as P increases for fixed N. This is the same way BERNOULLI
  works, we just do it for blocks rather than rows.
 
  2. We calculate P/N at start of scan and deliver this number blocks by
  random selection from N available blocks.
 
 
 (My mistake, that would be P*N) 
  
 
  At present we do (1), exactly as documented. (2) is slightly harder
  since we'd need to track which blocks have been selected already so we
  can use a random selection with no replacement algorithm. On a table
  with uneven distribution of rows this would still return a variable
  sample size, so it didn't seem worth changing.
 
 Aha, thanks!
 
 So, seems like this is just a doc issue? That is, we just need to
 document that using SYSTEM on very small sample sizes may return
 unexpected numbers of results ... and maybe also how the algorithm
 actually works.
 
 
 For me, the docs seem exactly correct. The mathematical implications of
 that just aren't recorded explicitly.

Well, for the SELECT page, all we need is the following (one changed
sentence):

The SYSTEM method is significantly faster than the BERNOULLI method when
small sampling percentages are specified, but it may return a
less-random sample of the table as a result of clustering effects, and
may return a highly variable number of results for very small sample sizes.

 
 I will try to reword or add something to make it clear that this can
 return a variable number of blocks and thus produces a result with
 greater variability in the number of rows returned.
 
 It's documented on the SELECT page only; plus there is a whole new
 section on writing tablesample functions.

Seems like it would be nice to have more detailed user docs somewhere
which explain the sampling algos we have, especially if we get more in
the future.  Not sure where would be appropriate for that, though.

If there is no appropriate place, I'll just write a blog.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

2015-08-06 Thread Simon Riggs
On 6 August 2015 at 20:14, Josh Berkus j...@agliodbs.com wrote:


 The results of 0, 185 and 925 are not.  It really seems like SYSTEM is
 treating 0.04% as a maximum, but taking a random number of data pages
 somewhere around that maximum, using math which can choose numbers of
 pages far outside of the % requested by the user, and which includes 0.


Thanks for the report, I'll look at this now.


 Speaking from a user perspective, SYSTEM seems broken to me.  I can't
 imagine using it for anything with a that degree of variation in the
 number of results returned, especially if it's possible to return zero
 rows from a populated table.


Please bear in mind you have requested a very small random sample of blocks.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
http://www.2ndquadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


[HACKERS] Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

2015-08-06 Thread Josh Berkus
Version: 9.5alpha2

Issue: when requesting small samples, SYSTEM often returns zero rows,
and sometimes returns unexpected numbers of rows.

Example:

create table thous ( id int, val text );
insert into thous select i, i::text || '-val'  from
generate_series(1,10) as gs(i);
analyze;

This is a 100,000 row table, so a 0.01% sample should be 10 rows.  Since
10 rows is far less than what's on one data page, the documentation
suggests that I should get 1 data page worth of rows.   However:

postgres=# select * from thous tablesample system ( 0.01 );
 id | val
+-
(0 rows)

Time: 0.636 ms

... this query consistently returns 0 rows, in 20 runs.

Ok,let's try a million-row table, which is the example we have in the docs.

postgres=# select * from mil tablesample system ( 0.01 );
 id | val
+-
(0 rows)

Hmmm?

On testing, the query against the million-row table returns 0 rows
around 50% of the time.

This table has around 185 rows per page.  As the sample size goes up,
the number times I get zero rows goes down, but those results seem to
still include data pages with zero rows.  For example, here's a series
of results from a 0.04 sample against the million-row table.

370
370
370
555
555
185
0
925

Since this is a synthetic table I just generated, it contains almost
exactly 185 rows per data page for every data page.  So on a 0.04%
sample, the variation between 370 rows and 555 rows (whether we have 2
or 3 data pages) is expected, since 0.04% of 5406 data pages is 2.16 pages.

The results of 0, 185 and 925 are not.  It really seems like SYSTEM is
treating 0.04% as a maximum, but taking a random number of data pages
somewhere around that maximum, using math which can choose numbers of
pages far outside of the % requested by the user, and which includes 0.

Speaking from a user perspective, SYSTEM seems broken to me.  I can't
imagine using it for anything with a that degree of variation in the
number of results returned, especially if it's possible to return zero
rows from a populated table.

BERNOULLI works as expected.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers