Re: [HACKERS] Gsoc2012 idea, tablesample
* Qi Huang (huangq...@hotmail.com) wrote:
> Thanks, guys, for your hot discussion. I'll explore the ANALYZE command first and try to make SYSTEM sampling work. Other parts, I'll look at them later.

That sounds like the right first steps to me.

Reviewing this discussion, it was my thought to create a new node, a la seqscan, which implemented analyze's algorithm for scanning the table, the eventual idea being that analyze would actually use it in the future. There was mention up-thread about just calling the analyze code. Personally, I'd rather we make analyze more SQL-like (once we have this functionality implemented) than make things which are supposed to be SQL call out into analyze bits.

Thoughts? Other opinions?

Thanks, Stephen
Re: [HACKERS] Gsoc2012 idea, tablesample
> Date: Fri, 11 May 2012 15:59:51 +0200
> From: s...@keybit.net
> Subject: Re: [HACKERS] Gsoc2012 idea, tablesample
>
> On Thu, May 10, 2012 at 02:30:35PM -0400, Robert Haas wrote:
>> On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
>>> Ants Aasma <a...@cybertec.at> wrote:
>>>> It seems to me that the simplest thing to do would be to lift the sampling done in analyze.c (acquire_sample_rows) and use that to implement the SYSTEM sampling method.
>>> Definitely. I thought we had all agreed on that ages ago.
>> Right, and I don't think we should be considering any of this other stuff until that basic thing is implemented and working.
>
> Agreed. That's what I'd love to see as well, for the GIS part.
>
> --strk;

Thanks, guys, for your hot discussion. I'll explore the ANALYZE command first and try to make SYSTEM sampling work. Other parts, I'll look at them later.

Best Regards and Thanks
Huang Qi Victor
Computer Science of National University of Singapore
Re: [HACKERS] Gsoc2012 idea, tablesample
> Florian Pflug wrote:
>> On May10, 2012, at 18:36 , Kevin Grittner wrote:
>>> Robert Haas wrote:
>>>> I wonder if you could do this with something akin to the Bitmap Heap Scan machinery. Populate a TID bitmap with a bunch of randomly chosen TIDs, fetch them all in physical order and if you don't get as many rows as you need, rinse and repeat until you do.
>>> If you get too many, it is important that you read all the way to the end and then randomly omit some of them. While a bit of a bother, that's pretty straightforward and should be pretty fast, assuming you're not, like, an order of magnitude high.
>> Why is that? From a statistical point of view it shouldn't matter whether you pick N random samples, or pick M >= N random samples and then randomly pick N from M. (random implying uniformly distributed here.)
> That sounds to me like exactly what Robert and I both said. While passing the heap with the bitmap, if you get to the number you want you don't stop there -- you read all of them (M in your parlance) and randomly drop M minus N of them. Or, if you prefer, I guess you could *pick* N of them. I don't see a logical difference.

What I meant to say was "and drop the last M minus N", not "and randomly drop M minus N". Which, of course, makes my statement utterly wrong, since the tuples are sorted by TID, so you'd penalize tuples with a larger TID. Duh! Sorry for the noise, guys...

>>> But falling short is tougher; making up the difference could be an iterative process, which could always wind up with having you read all tuples in the table without filling your sample.
>> But the likelihood of that happening is extremely low, no?
> That depends. What if someone just did a mass delete and your statistics aren't yet up to date when they ask to pick a relatively large percentage of the rows?
>> Unless the sampling percentage is very high
> Or the statistics are not current. I agree, this shouldn't happen often, but we can never know, going in, whether it *is* the case. You *could* always wind up needing to read the entire table, and still not hit the initially-targeted number of rows. Now, arguably you could use data gleaned from each pass to adjust the target or inform the size of the next pass. My point is that the "we selected too few" case is a lot more complicated than the "we selected too many" case.
>> but that case isn't of much practical importance anyway.
> It's important that it's handled in some sane way when it happens. And it will happen.

Hm. Maybe one can get rid of these sorts of problems by factoring in the expected density of the table beforehand, and simply accepting that the results will be inaccurate if the statistics are outdated? One could, for example, simply pick

  N := SamplingPercentage * MaxTuplesPerPage / AvgLiveTuplesPerPage

where

  AvgLiveTuplesPerPage := #Tuples / #Pages

random TIDs, fetch the live ones, and return them. I'm not totally sure whether this approach is sensitive to non-uniformity in the tuple to line-pointer assignment, though.

>> But something else comes to mind. Does the standard permit samples taken with the BERNOULLI method to contain the same tuple multiple times?
> I'm pretty sure not. That would be nonsensical.
>> If not, any kind of TID-based approach will have to track all previously fetched TIDs, which seems doable but unfortunate...
> Right. You would always need to ignore a duplicate random choice in any one cycle of generating ctid values; and if you are iterating because you fell short, you would also need to ignore values from previous iterations. And OR your new bitmap against the previous ones to save for the next iteration, if needed. I never said it couldn't be done; it's just very fussy, and you want to avoid a very large number of iterations in the case that someone deleted 99.99% of your rows right before you ran your sample and before autovacuum caught up.

Actually, thinking more about this, if the approach sketched above turns out to work, then one might be able to get away without remembering previously computed TIDs, thus removing a lot of complexity. Say, for example, you simply start out with a single random TID tid[0]. Then you produce the sequence of random TIDs by setting

  tid[n+1] = tid[n] + random(1 <= x <= 2*D-1)

where D is the expected distance between one TID from the sample set and the next higher one, i.e. D = 2 * #TIDs / N. (You'd simply stop once tid[n] > #TIDs.) This will give you approximately uniformly distributed TIDs, I think, and will even return them in physical order. The $100 question is, by how much does this violate the independence requirement, i.e. how far are P(tuple X picked) and P(tuple X picked | tuple Y picked) apart? Some search through the literature should be able to answer that. Should the dependency turn out to be too high, one might be able to lower it by scaling up N by a factor q, and then discarding each generated TID with probability 1/q. This amounts to a gradual
Re: [HACKERS] Gsoc2012 idea, tablesample
[ Sorry for the self-reply ]

On May11, 2012, at 13:17 , Florian Pflug wrote:
> Actually, thinking more about this, if the approach sketched above turns out to work, then one might be able to get away without remembering previously computed TIDs, thus removing a lot of complexity. Say, for example, you simply start out with a single random TID tid[0]. Then you produce the sequence of random TIDs by setting tid[n+1] = tid[n] + random(1 <= x <= 2*D-1), where D is the expected distance between one TID from the sample set and the next higher one, i.e. D = 2 * #TIDs / N. (You'd simply stop once tid[n] > #TIDs.) This will give you approximately uniformly distributed TIDs, I think, and will even return them in physical order. The $100 question is, by how much does this violate the independence requirement, i.e. how far are P(tuple X picked) and P(tuple X picked | tuple Y picked) apart? Some search through the literature should be able to answer that.

Actually, one can easily do better than that. Say you've determined that to pick each tuple with probability p_tup, you need to pick each TID with probability p_tid (where p_tup/p_tid is the TID density, i.e. the probability of a single TID being live). Now, looping over all TIDs and picking each with probability p_tid is, of course, just another (more error-prone) way of iterating over all tuples and picking each with probability p_tup. But instead of looping, you can jump from one picked TID to the next. Observe that, after having picked tid X, the probability of picking X+i as the next tid is (1 - p_tid)^(i-1) * p_tid, because to pick X+i next you have to skip (i-1) TIDs and then pick the i-th one. Thus, the following algorithm returns a sorted list of TIDs which should be uniformly distributed and independent (or so I think, at least):

1) Starting with a random tid tid[0], chosen such that P(tid[0] = i) = (1 - p_tid)^i * p_tid.
2) Setting tid[n+1] = tid[n] + d, with d > 0 chosen such that P(d = i) = (1 - p_tid)^(i-1) * p_tid.
3) Abort once max(tid) >= #TIDs.

This all hinges on the ability to produce a sufficiently accurate estimate of the TID density p_tup/p_tid, of course.

best regards, Florian Pflug
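Step 2 draws the gap d from a geometric distribution, which the standard inverse-CDF trick (d = floor(ln(U)/ln(1 - p_tid)) + 1 for uniform U in (0,1]) produces directly. A minimal SQL sketch of the scheme, treating the TID space as a flat 1-based sequence and assuming p_tid = 0.01 and #TIDs = 100000 purely for illustration:

    WITH RECURSIVE picks(tid) AS (
      -- first pick; 1 - random() lies in (0,1], so ln() is always defined
      SELECT floor(ln(1 - random()) / ln(1 - 0.01))::bigint + 1
      UNION ALL
      -- jump ahead by a geometric gap d >= 1 (step 2 above)
      SELECT tid + floor(ln(1 - random()) / ln(1 - 0.01))::bigint + 1
      FROM picks
      WHERE tid < 100000               -- step 3: stop once past #TIDs
    )
    SELECT tid FROM picks WHERE tid <= 100000;
    -- a real implementation would map each flat index back to a
    -- (block, offset) ctid pair
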
Re: [HACKERS] Gsoc2012 idea, tablesample
On Thu, May 10, 2012 at 02:30:35PM -0400, Robert Haas wrote:
> On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
>> Ants Aasma <a...@cybertec.at> wrote:
>>> It seems to me that the simplest thing to do would be to lift the sampling done in analyze.c (acquire_sample_rows) and use that to implement the SYSTEM sampling method.
>> Definitely. I thought we had all agreed on that ages ago.
> Right, and I don't think we should be considering any of this other stuff until that basic thing is implemented and working.

Agreed. That's what I'd love to see as well, for the GIS part.

--strk;
Re: [HACKERS] Gsoc2012 idea, tablesample
Florian Pflug <f...@phlo.org> wrote:
> Maybe one can get rid of these sorts of problems by factoring in the expected density of the table beforehand and simply accepting that the results will be inaccurate if the statistics are outdated? One could, for example, simply pick N := SamplingPercentage * MaxTuplesPerPage / AvgLiveTuplesPerPage, where AvgLiveTuplesPerPage := #Tuples / #Pages, random TIDs, fetch the live ones, and return them.

To clarify, I read this as using reltuples and relpages for the table, and returning only tuples which are visible according to the query's snapshot. (i.e., I think you used "live" to mean two different things there.) Unless I'm missing something, I think that works for percentage selection, which is what the standard talks about, without any need to iterate through additional samples. Good idea! We don't need to do any second pass to pare down initial results, either. This greatly simplifies coding while providing exactly what the standard requires.

> I'm not totally sure whether this approach is sensitive to non-uniformity in the tuple to line-pointer assignment, though.

I think we can solve that by going high enough with tuple numbers to reach the highest tuple ID that might be in use in the table, and *not* following HOT chains. (If we follow HOT chains, we could have several distinct ctid values which returned the same tuple.) Or am I thinking about this incorrectly?

> [more complex alternatives]

I really think your first suggestion covers it perfectly; these more complex techniques don't seem necessary to me.

-Kevin
Re: [HACKERS] Gsoc2012 idea, tablesample
Florian Pflug <f...@phlo.org> writes:
> This all hinges on the ability to produce a sufficiently accurate estimate of the TID density p_tup/p_tid, of course.

I think that's the least of its problems. AFAICS this analysis ignores (1) the problem that the TID space is nonuniform, ie we don't know how many tuples are in each page until we look; and (2) the problem that we don't know the overall number of tuples beforehand. I'm not sure that there is any way to deal with (1) fully without examining every single page, but algorithms that assume that the TIDs are numbered linearly are broken before they start.

regards, tom lane
Re: [HACKERS] Gsoc2012 idea, tablesample
Kevin Grittner <kevin.gritt...@wicourts.gov> writes:
> Florian Pflug <f...@phlo.org> wrote:
>> Maybe one can get rid of these sorts of problems by factoring in the expected density of the table beforehand and simply accepting that the results will be inaccurate if the statistics are outdated?
> Unless I'm missing something, I think that works for percentage selection, which is what the standard talks about, without any need to iterate through additional samples. Good idea! We don't need to do any second pass to pare down initial results, either. This greatly simplifies coding while providing exactly what the standard requires.
>> I'm not totally sure whether this approach is sensitive to non-uniformity in the tuple to line-pointer assignment, though.

If you're willing to accept that the quality of the results depends on having up-to-date stats, then I'd suggest (1) use the planner's existing technology to estimate the number of rows in the table; (2) multiply by the sampling factor you want to get a desired number of sample rows; (3) use ANALYZE's existing technology to acquire that many sample rows. While the ANALYZE code isn't perfect with respect to the problem of nonuniform TID density, it certainly will be a lot better than pretending that that problem doesn't exist.

regards, tom lane
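Steps (1) and (2) are just arithmetic over the planner's statistics; a sketch for a 1% sample of a hypothetical table some_table (reltuples may be stale, which is exactly the accepted trade-off):

    -- planner's row estimate, scaled by the sampling fraction
    SELECT ceil(reltuples * 0.01) AS target_sample_rows
    FROM pg_class
    WHERE oid = 'some_table'::regclass;
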
Re: [HACKERS] Gsoc2012 idea, tablesample
Tom Lane <t...@sss.pgh.pa.us> wrote:
> If you're willing to accept that the quality of the results depends on having up-to-date stats, then I'd suggest (1) use the planner's existing technology to estimate the number of rows in the table; (2) multiply by the sampling factor you want to get a desired number of sample rows; (3) use ANALYZE's existing technology to acquire that many sample rows. While the ANALYZE code isn't perfect with respect to the problem of nonuniform TID density, it certainly will be a lot better than pretending that that problem doesn't exist.

That is basically what we've been talking about for SYSTEM samples, which I think we've agreed should be finished and committed before programming on the BERNOULLI option. Nobody is pretending that the problem doesn't exist.

Let me restate the option I'm thinking solves the problem well, since it would be easy to have missed what I was responding to with all the various suggestions on the thread.

The user has asked for a BERNOULLI sample S of some percent of the table. We determine the maximum tuple index M which could be used for any ctid in the table. We get an estimate of the number of pages P in the table using whatever is the best available technique. If you assume that the number of tuples (visible or not) is T, the approximate number of tuples we want to randomly select is S*T.

We will be probing random page numbers 1..P for random tuple indexes 1..M. So how many random probes by ctid does that require? The chance of a hit on each probe is ((T/P)/M) -- the average number of tuples per page divided by the number of tuple index values allowed. So we need (S*T)/((T/P)/M) probes. Simplifying, that winds up being S*M*P: the product of the sample size as a percentage, the maximum tuple index on a page, and the number of pages. (A calculation some may have jumped to as intuitively obvious.)

So let's call the number of probes N. We randomly generate N distinct ctid values, where each is a random page number 1..P and a random index 1..M. We attempt to read each of these in block number order, not following HOT chains. For each, if the tuple exists and is visible, it is part of our result set.

Since T cancels out of that equation, we don't need to worry about estimating it. Results will be correct for any value of M which is *at least* as large as the maximum tuple index in the table, although values of M larger than that will degrade performance. The same holds true for the number of pages.

Shouldn't that get us the randomly chosen sample we're looking for? Is there a problem you think this ignores?

Credit where credit is due: I *think* this is merely my restatement of something Florian was suggesting, building on a suggestion from Robert.

-Kevin
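For concreteness, the probe count N = S*M*P can be computed from the catalog; a sketch assuming a 1% sample and, for M, the 291 figure for 8K pages that comes up elsewhere in the thread (some_table is hypothetical):

    -- N = S * M * P: sample fraction times max tuple index times page count
    SELECT ceil(0.01 * 291 * relpages) AS n_probes
    FROM pg_class
    WHERE oid = 'some_table'::regclass;
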
Re: [HACKERS] Gsoc2012 idea, tablesample
On Fri, May 11, 2012 at 11:04 AM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
> We will be probing random page numbers 1..P for random tuple indexes 1..M. So how many random probes by ctid does that require? The chance of a hit on each probe is ((T/P)/M) -- the average number of tuples per page divided by the number of tuple index values allowed. So we need (S*T)/((T/P)/M) probes. Simplifying, that winds up being S*M*P: the product of the sample size as a percentage, the maximum tuple index on a page, and the number of pages. (A calculation some may have jumped to as intuitively obvious.)
> So let's call the number of probes N. We randomly generate N distinct ctid values, where each is a random page number 1..P and a random index 1..M. We attempt to read each of these in block number order, not following HOT chains. For each, if the tuple exists and is visible, it is part of our result set.
> Since T cancels out of that equation, we don't need to worry about estimating it. Results will be correct for any value of M which is *at least* as large as the maximum tuple index in the table, although values of M larger than that will degrade performance. The same holds true for the number of pages.

The trouble is, AFAICS, that you can't bound M very well without scanning the whole table. I mean, it's bounded by the theoretical limit, but that's it.

-- Robert Haas
Re: [HACKERS] Gsoc2012 idea, tablesample
Kevin Grittner <kevin.gritt...@wicourts.gov> writes:
> [ uniformly sample the TID space defined as (1..P, 1..M) ]
> Shouldn't that get us the randomly chosen sample we're looking for? Is there a problem you think this ignores?

Not sure. The issue that I'm wondering about is that the line number part of the space is not uniformly populated, ie, small line numbers are much more likely to exist than large ones. (In the limit that density goes to zero, when you pick M much too large.) It's not clear to me whether this gives an unbiased probability of picking real tuples, as opposed to hypothetical TIDs.

Another issue is efficiency. In practical cases you'll have to greatly overestimate M compared to the typical actual-number-of-tuples-per-page, which will lead to a number of target TIDs N that's much larger than necessary, which will make the scan slow --- I think in practice you'll end up doing a seqscan or something that might as well be one, because unless S is *really* tiny it'll hit just about every page. We can have that today without months' worth of development effort, using the "WHERE random() < S" technique.

regards, tom lane
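That technique is just a filter; for a 1% sample of a hypothetical table it amounts to:

    -- one full sequential scan; each visible row is kept independently
    -- with probability 0.01
    SELECT * FROM some_table WHERE random() < 0.01;
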
Re: [HACKERS] Gsoc2012 idea, tablesample
Robert Haas <robertmh...@gmail.com> wrote:
> The trouble is, AFAICS, that you can't bound M very well without scanning the whole table. I mean, it's bounded by the theoretical limit, but that's it.

What would the theoretical limit be? (block size - page header size - minimum size of one tuple) / item pointer size? So, on an 8KB page, somewhere in the neighborhood of 1350? Hmm. If that's right, that would mean a 1% random sample would need 13.5 probes per page, meaning there wouldn't tend to be a lot of pages missed. Still, the technique for getting a random sample seems sound, unless someone suggests something better. Maybe we just want to go straight to a seqscan to get to the pages we want to probe rather than reading just the ones on the probe list in physical order?

-Kevin
Re: [HACKERS] Gsoc2012 idea, tablesample
Kevin Grittner <kevin.gritt...@wicourts.gov> writes:
> Robert Haas <robertmh...@gmail.com> wrote:
>> The trouble is, AFAICS, that you can't bound M very well without scanning the whole table. I mean, it's bounded by the theoretical limit, but that's it.
> What would the theoretical limit be? (block size - page header size - minimum size of one tuple) / item pointer size? So, on an 8KB page, somewhere in the neighborhood of 1350?

Your math is off --- I get something less than 300, even if the tuples are assumed to be empty of data. (Header size 24 bytes, plus 4-byte line pointer, so at least 28 bytes per tuple, so at most 292 on an 8K page.) But you still end up probing just about every page for a 1% sample.

regards, tom lane
Re: [HACKERS] Gsoc2012 idea, tablesample
On May11, 2012, at 17:20 , Robert Haas wrote:
> On Fri, May 11, 2012 at 11:04 AM, Kevin Grittner wrote:
>> Since T cancels out of that equation, we don't need to worry about estimating it. Results will be correct for any value of M which is *at least* as large as the maximum tuple index in the table, although values of M larger than that will degrade performance. The same holds true for the number of pages.
> The trouble is, AFAICS, that you can't bound M very well without scanning the whole table. I mean, it's bounded by the theoretical limit, but that's it.

Hm, but maybe Kevin's observation that the actual value of M shouldn't matter as long as it's large enough helps here. What if you start out with M=1 and generate your first TID. After reading the page, but before returning a tuple, you check if M is indeed an upper bound on the tuple indices. If it isn't, you increase M, recompute N (the number of probes), determine a new random tuple index, and return the tuple (if it is live).

Would that introduce bias? I'd think not, because scaling up N shouldn't, per Kevin's argument, change the probability of a TID being picked. So increasing N in the middle of a scan should neither penalize nor favour tuples which were already returned compared to those which will follow, no?

(And yes, even if there don't turn out to be any obvious holes in this argument, it requires more formal proof than I was able to give here before being turned into code. Or at the very least extensive testing with all kinds of data.)

best regards, Florian Pflug
Re: [HACKERS] Gsoc2012 idea, tablesample
Tom Lane <t...@sss.pgh.pa.us> wrote:
> Kevin Grittner <kevin.gritt...@wicourts.gov> writes:
>> [ uniformly sample the TID space defined as (1..P, 1..M) ]
>> Shouldn't that get us the randomly chosen sample we're looking for? Is there a problem you think this ignores?
> Not sure. The issue that I'm wondering about is that the line number part of the space is not uniformly populated, ie, small line numbers are much more likely to exist than large ones. (In the limit that density goes to zero, when you pick M much too large.) It's not clear to me whether this gives an unbiased probability of picking real tuples, as opposed to hypothetical TIDs.

I'm convinced it generates a random sample, since every tuple in the *possible* space has an equal chance of selection, and we ignore the ones which don't exist or aren't visible; each of the remaining (existing, visible) tuples is as likely to be chosen as any other. I know that's not a formal proof, but I'm sure I could work one up.

> Another issue is efficiency. In practical cases you'll have to greatly overestimate M compared to the typical actual-number-of-tuples-per-page, which will lead to a number of target TIDs N that's much larger than necessary, which will make the scan slow --- I think in practice you'll end up doing a seqscan or something that might as well be one, because unless S is *really* tiny it'll hit just about every page. We can have that today without months' worth of development effort, using the "WHERE random() < S" technique.

I think you've probably got me there. The technique you describe should be equally random, and thinking through the work involved in each, the random() < S approach seems almost certain to be cheaper. Is it worth supporting the TABLESAMPLE BERNOULLI option as syntactic sugar for that?

-Kevin
Re: [HACKERS] Gsoc2012 idea, tablesample
On May11, 2012, at 16:03 , Kevin Grittner wrote:
>> [more complex alternatives]
> I really think your first suggestion covers it perfectly; these more complex techniques don't seem necessary to me.

The point of the more complex techniques (especially the algorithm in my second mail, the reply to self) was simply to optimize the generation of a random, uniformly distributed, unique and sorted list of TIDs. The basic idea is to make sure we generate the TIDs in physical order, and thus automatically ensure that they are unique. That reduces the memory (or disk) requirement to O(1) instead of O(n), and (more importantly, actually) makes the actual implementation much simpler.

best regards, Florian Pflug
Re: [HACKERS] Gsoc2012 idea, tablesample
On Fri, May 11, 2012 at 11:36 AM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
> Robert Haas <robertmh...@gmail.com> wrote:
>> The trouble is, AFAICS, that you can't bound M very well without scanning the whole table. I mean, it's bounded by the theoretical limit, but that's it.
> What would the theoretical limit be?

MaxHeapTuplesPerPage?

-- Robert Haas
Re: [HACKERS] Gsoc2012 idea, tablesample
Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, May 11, 2012 at 11:36 AM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
>> Robert Haas <robertmh...@gmail.com> wrote:
>>> The trouble is, AFAICS, that you can't bound M very well without scanning the whole table. I mean, it's bounded by the theoretical limit, but that's it.
>> What would the theoretical limit be?
> MaxHeapTuplesPerPage?

What about dead line pointers without corresponding tuples?

-Kevin
Re: [HACKERS] Gsoc2012 idea, tablesample
On Fri, May 11, 2012 at 6:16 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
>> MaxHeapTuplesPerPage?
> What about dead line pointers without corresponding tuples?

Actually we don't allow there to be more than MaxHeapTuplesPerPage line pointers, even if some of them are dead line pointers. I think the argument then was that there could be bugs in code that isn't expecting more, so perhaps that doesn't justify baking that into more places when we could instead be removing that assumption.

Also, if I absorbed enough of this conversation skimming backwards, it seems the algorithm would be very inefficient with such a conservative estimate. On a table with normal-width rows it would mean usually picking tuples that don't exist and having to try again. As Tom pointed out, that would mean even a small sample would have to read nearly the whole table to find the sample.

-- greg
Re: [HACKERS] Gsoc2012 idea, tablesample
Hi, All

Thanks for your ideas on the implementation of TABLESAMPLE. I have a summary below of the high-level requirements from the -hackers thread till now. Please give further comment, and if I missed any point, please feel free to add.

1. Build a new type of node, as I should not use the SEQSCAN node, which will be scanning the whole table. One of the aims of TABLESAMPLE is to avoid scanning every row and thus save time.

2. Use TIDSCAN to directly access tuples. The below way of using ctid proposed by Kevin looks good.

> One technique which might be suitably random without reading the whole table would be to figure out a maximum block number and tuple ID for the table, and generate a series of random ctid values to read. If the tuple doesn't exist or is not visible to the snapshot, you ignore it and continue, until you have read the requisite number of rows. You could try to generate them in advance and sort them by block number, but then you need to solve the problems of what to do if that set of ctids yields too many rows or too few rows, both of which have sticky issues.
> -Kevin

I think this technique could be considered as an implementation algorithm for the BERNOULLI method. It looks like it could still reduce a lot of cost compared to just assigning a random number to every tuple and then retrieving.

3. Adding specific sampling, like dollar unit sampling, or geographic index support, which meets specific requirements but costs more.

> I have a gut feeling that Dollar Unit Sampling and other weighted samples can be done with a similar approach of uniformly sampling blocks and then running weighted reservoir sampling [1] over the result.
> [1]: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
> Cheers, Ants Aasma

Ants Aasma proposed the above method for doing these problems, using weighted reservoir random sampling.

4. Implement the way ANALYZE is doing sampling.

> I'm looking for a way to fetch random samples these days, so I confirm the need for a quick way to fetch the same sample that the analyze command fetches, but at SQL level.
> --strk;

Proposed by Sandro Santilli, we can look at the way ANALYZE is doing the sampling and try to implement it in TABLESAMPLE.

The above are the four points I summarized from the -hackers discussion. I put a brief introduction to this gSoC project on the Postgres wiki. You are welcome to check and modify to enhance it. http://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation

I will add the ideas formalized in this mailing thread to the wiki page once our discussion has been completed. Thanks.

Best Regards
Huang Qi Victor
Computer Science of National University of Singapore
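For reference, the SQL:2003 syntax the project targets, covering the SYSTEM and BERNOULLI methods and the optional REPEATABLE clause (table name and percentages below are only illustrative):

    SELECT * FROM some_table TABLESAMPLE SYSTEM (1);
    SELECT * FROM some_table TABLESAMPLE BERNOULLI (1) REPEATABLE (42);
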
Re: [HACKERS] Gsoc2012 idea, tablesample
On May10, 2012, at 10:43 , Qi Huang wrote:
> 2. Use TIDSCAN to directly access tuples. The below way of using ctid proposed by Kevin looks good.
>> One technique which might be suitably random without reading the whole table would be to figure out a maximum block number and tuple ID for the table, and generate a series of random ctid values to read. If the tuple doesn't exist or is not visible to the snapshot, you ignore it and continue, until you have read the requisite number of rows. You could try to generate them in advance and sort them by block number, but then you need to solve the problems of what to do if that set of ctids yields too many rows or too few rows, both of which have sticky issues.
> I think this technique could be considered as an implementation algorithm for the BERNOULLI method. It looks like it could still reduce a lot of cost compared to just assigning a random number to every tuple and then retrieving.

One problem I see with this approach is that its efficiency depends on the average tuple length, at least with a naive approach to the random ctid generator. The simplest way to generate those randomly without introducing bias is to generate a random page index between 0 and the relation's size in pages, and then a random tuple index between 0 and MaxHeapTuplesPerPage, which is 291 on x86-64 assuming the standard page size of 8k.

The current toasting threshold (TOAST_TUPLE_THRESHOLD) is approximately 2k, so having tables with an average heap tuple size of a few hundred bytes doesn't seem unlikely. Now, assume the average tuple length is 128 bytes, i.e. on average you'll have ~8k/128 = 64 live tuples / page if the fill factor is 100% and all tuples are live. To account for lower fill factors and dead tuples, let's thus say there are 50 live tuples / page. Then, on average, only every 6th randomly generated ctid will point to a live tuple. But whether or not it does can only be decided after reading the page from disk, so you end up with a rate of 6 random-access reads per returned tuple.

IIRC, the cutoff point where an index scan loses compared to a sequential scan is somewhere around 10% of the table read, i.e. if a predicate selects more than 10% of the available rows, a sequential scan is more efficient than an index scan. Scaling that with the 1/6th success rate from above means that Kevin's approach would only beat a sequential scan if the sampling percentage isn't much larger than 1%, assuming an average row size of 128 bytes. The algorithm still seems like a good choice for very small sampling percentages, though.

best regards, Florian Pflug
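The hit rate in this analysis is just average live tuples per page divided by MaxHeapTuplesPerPage, so a rough estimate can be pulled from the catalog; a sketch using the assumed 291 figure and a hypothetical table:

    -- expected fraction of random ctid probes that land on a live tuple
    SELECT reltuples / nullif(relpages, 0) AS live_tuples_per_page,
           reltuples / nullif(relpages, 0) / 291 AS probe_hit_rate
    FROM pg_class
    WHERE oid = 'some_table'::regclass;
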
Re: [HACKERS] Gsoc2012 idea, tablesample
Florian Pflug <f...@phlo.org> wrote:
> One problem I see with this approach is that its efficiency depends on the average tuple length, at least with a naive approach to the random ctid generator. The simplest way to generate those randomly without introducing bias is to generate a random page index between 0 and the relation's size in pages,

Right.

> and then a random tuple index between 0 and MaxHeapTuplesPerPage, which is 291 on x86-64 assuming the standard page size of 8k.

I think we can do better than that without moving too far from the simplest way. It seems like we should be able to get a more accurate calculation of a minimum base tuple size based on the table definition, and calculate the maximum number of those which could fit on a page. On the other hand, ctid uses a line pointer, doesn't it? Do we need to worry about dead line pointers allowing higher tuple indexes than the calculated maximum number of tuples per page?

> The current toasting threshold (TOAST_TUPLE_THRESHOLD) is approximately 2k, so having tables with an average heap tuple size of a few hundred bytes doesn't seem unlikely. Now, assume the average tuple length is 128 bytes, i.e. on average you'll have ~8k/128 = 64 live tuples / page if the fill factor is 100% and all tuples are live. To account for lower fill factors and dead tuples, let's thus say there are 50 live tuples / page. Then, on average, only every 6th randomly generated ctid will point to a live tuple. But whether or not it does can only be decided after reading the page from disk, so you end up with a rate of 6 random-access reads per returned tuple.
> IIRC, the cutoff point where an index scan loses compared to a sequential scan is somewhere around 10% of the table read, i.e. if a predicate selects more than 10% of the available rows, a sequential scan is more efficient than an index scan.

That ratio *might* be better for a ctid scan, since you don't have the index access in the mix.

> Scaling that with the 1/6th success rate from above means that Kevin's approach would only beat a sequential scan if the sampling percentage isn't much larger than 1%, assuming an average row size of 128 bytes. The algorithm still seems like a good choice for very small sampling percentages, though.

Yeah, even with a maximum tuple count calculated by page, there are certainly going to be cases where another approach will be faster, especially where the sample is a relatively high percentage of the table. It would be good to have multiple plans compete on costs, if possible. It would not surprise me at all if the typical break-even point between the two techniques was somewhere on the order of a 1% sample size, but one would hope we could get there on the basis of estimated costs rather than using arbitrary rules or forcing the user to guess and code a choice explicitly.

-Kevin
Re: [HACKERS] Gsoc2012 idea, tablesample
On Thu, May 10, 2012 at 10:28 AM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
>> One problem I see with this approach is that its efficiency depends on the average tuple length, at least with a naive approach to the random ctid generator. The simplest way to generate those randomly without introducing bias is to generate a random page index between 0 and the relation's size in pages,
> Right.
>> and then a random tuple index between 0 and MaxHeapTuplesPerPage, which is 291 on x86-64 assuming the standard page size of 8k.
> I think we can do better than that without moving too far from the simplest way. It seems like we should be able to get a more accurate calculation of a minimum base tuple size based on the table definition, and calculate the maximum number of those which could fit on a page. On the other hand, ctid uses a line pointer, doesn't it? Do we need to worry about dead line pointers allowing higher tuple indexes than the calculated maximum number of tuples per page?

I wonder if you could do this with something akin to the Bitmap Heap Scan machinery. Populate a TID bitmap with a bunch of randomly chosen TIDs, fetch them all in physical order, and if you don't get as many rows as you need, rinse and repeat until you do.

I'm worried this project is getting so complicated that it will be beyond the ability of a new hacker to get anything useful done. Can we simplify the requirements here to something that is reasonable for a beginner?

-- Robert Haas
Re: [HACKERS] Gsoc2012 idea, tablesample
Robert Haas <robertmh...@gmail.com> wrote:
> I wonder if you could do this with something akin to the Bitmap Heap Scan machinery. Populate a TID bitmap with a bunch of randomly chosen TIDs, fetch them all in physical order

It would be pretty hard for any other plan to beat that by very much, so it seems like a good approach which helps keep things simple.

> and if you don't get as many rows as you need, rinse and repeat until you do.

Ay, there's the rub. If you get too many, it is important that you read all the way to the end and then randomly omit some of them. While a bit of a bother, that's pretty straightforward and should be pretty fast, assuming you're not, like, an order of magnitude high. But falling short is tougher; making up the difference could be an iterative process, which could always wind up with having you read all tuples in the table without filling your sample. Still, this approach seems like it would perform better than generating random ctid values and randomly fetching until you've tried them all.

> I'm worried this project is getting so complicated that it will be beyond the ability of a new hacker to get anything useful done. Can we simplify the requirements here to something that is reasonable for a beginner?

I would be inclined to omit monetary unit sampling from the first commit. Do the parts specified in the standard first and get it committed. Useful as unit sampling is, it seems like the hardest to do, and should probably be done if time permits or left as a future enhancement. It's probably enough to just remember that it's there and make a best-effort attempt not to paint ourselves into a corner which precludes its development.

-Kevin
Re: [HACKERS] Gsoc2012 idea, tablesample
On Thu, May 10, 2012 at 6:33 PM, Robert Haas <robertmh...@gmail.com> wrote:
> I'm worried this project is getting so complicated that it will be beyond the ability of a new hacker to get anything useful done. Can we simplify the requirements here to something that is reasonable for a beginner?

It seems to me that the simplest thing to do would be to lift the sampling done in analyze.c (acquire_sample_rows) and use that to implement the SYSTEM sampling method. The language in the standard leads me to believe that Vitter's algorithm used in analyze.c is exactly what was intended by the authors. The difference between Vitter's algorithm and a pure Bernoulli process is precisely that Vitter's method increases the chances of seeing multiple rows picked from a single page.

One tricky issue is that tablesample is defined in terms of a percentage of the underlying table, while Vitter's algorithm needs a fixed number of rows. The standard does state that the result needs to contain approximately the stated percentage of rows. I'm not sure if calculating the number of rows to return from reltuples would fit that definition of approximate. If not, would re-estimating the number of reltuples after sampling and taking appropriate corrective action make it better, or would an accurate number be necessary? Getting an accurate number efficiently would require solving the COUNT(*) issue.

For the Bernoulli case I can't think of anything simple that would be better than just scanning the table or poking it with random TIDs. (The latter has the same problem of estimating the desired result-set size.) It seems to me that Vitter's approach could be amended to produce independent samples by selecting slightly more pages than result tuples and tweaking the acceptance levels to cancel out the bias. But that definitely isn't in the territory of simple, and would require rigorous statistical analysis.

And as for the monetary unit sampling, I agree that this is better left as an optional extra.

Ants Aasma
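At the SQL level, the end result of a fixed-size uniform sample (what the reservoir algorithm produces in one bounded-memory pass) can be approximated for a hypothetical table as follows, at the cost of a full scan plus a top-N sort:

    -- distribution-equivalent to a 100-row reservoir sample
    SELECT * FROM some_table ORDER BY random() LIMIT 100;
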
Re: [HACKERS] Gsoc2012 idea, tablesample
Ants Aasma <a...@cybertec.at> wrote:
> It seems to me that the simplest thing to do would be to lift the sampling done in analyze.c (acquire_sample_rows) and use that to implement the SYSTEM sampling method.

Definitely. I thought we had all agreed on that ages ago.

-Kevin
Re: [HACKERS] Gsoc2012 idea, tablesample
On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
> Ants Aasma <a...@cybertec.at> wrote:
>> It seems to me that the simplest thing to do would be to lift the sampling done in analyze.c (acquire_sample_rows) and use that to implement the SYSTEM sampling method.
> Definitely. I thought we had all agreed on that ages ago.

Right, and I don't think we should be considering any of this other stuff until that basic thing is implemented and working.

-- Robert Haas
Re: [HACKERS] Gsoc2012 idea, tablesample
On May10, 2012, at 18:36 , Kevin Grittner wrote:
> Robert Haas <robertmh...@gmail.com> wrote:
>> I wonder if you could do this with something akin to the Bitmap Heap Scan machinery. Populate a TID bitmap with a bunch of randomly chosen TIDs, fetch them all in physical order
>> and if you don't get as many rows as you need, rinse and repeat until you do.
> Ay, there's the rub. If you get too many, it is important that you read all the way to the end and then randomly omit some of them.

Why is that? From a statistical point of view it shouldn't matter whether you pick N random samples, or pick M >= N random samples and then randomly pick N from M. (random implying uniformly distributed here.)

> While a bit of a bother, that's pretty straightforward and should be pretty fast, assuming you're not, like, an order of magnitude high. But falling short is tougher; making up the difference could be an iterative process, which could always wind up with having you read all tuples in the table without filling your sample.

But the likelihood of that happening is extremely low, no? Unless the sampling percentage is very high, that is, but that case isn't of much practical importance anyway.

But something else comes to mind. Does the standard permit samples taken with the BERNOULLI method to contain the same tuple multiple times? If not, any kind of TID-based approach will have to track all previously fetched TIDs, which seems doable but unfortunate...

best regards, Florian Pflug
Re: [HACKERS] Gsoc2012 idea, tablesample
[point splintered in quoting re-joined]

Florian Pflug wrote:
> On May10, 2012, at 18:36 , Kevin Grittner wrote:
>> Robert Haas wrote:
>>> I wonder if you could do this with something akin to the Bitmap Heap Scan machinery. Populate a TID bitmap with a bunch of randomly chosen TIDs, fetch them all in physical order and if you don't get as many rows as you need, rinse and repeat until you do.
>> If you get too many, it is important that you read all the way to the end and then randomly omit some of them. While a bit of a bother, that's pretty straightforward and should be pretty fast, assuming you're not, like, an order of magnitude high.
> Why is that? From a statistical point of view it shouldn't matter whether you pick N random samples, or pick M >= N random samples and then randomly pick N from M. (random implying uniformly distributed here.)

That sounds to me like exactly what Robert and I both said. While passing the heap with the bitmap, if you get to the number you want you don't stop there -- you read all of them (M in your parlance) and randomly drop M minus N of them. Or, if you prefer, I guess you could *pick* N of them. I don't see a logical difference.

>> But falling short is tougher; making up the difference could be an iterative process, which could always wind up with having you read all tuples in the table without filling your sample.
> But the likelihood of that happening is extremely low, no?

That depends. What if someone just did a mass delete and your statistics aren't yet up to date when they ask to pick a relatively large percentage of the rows?

> Unless the sampling percentage is very high

Or the statistics are not current. I agree, this shouldn't happen often, but we can never know, going in, whether it *is* the case. You *could* always wind up needing to read the entire table, and still not hit the initially-targeted number of rows. Now, arguably you could use data gleaned from each pass to adjust the target or inform the size of the next pass. My point is that the "we selected too few" case is a lot more complicated than the "we selected too many" case.

> but that case isn't of much practical importance anyway.

It's important that it's handled in some sane way when it happens. And it will happen.

> But something else comes to mind. Does the standard permit samples taken with the BERNOULLI method to contain the same tuple multiple times?

I'm pretty sure not. That would be nonsensical.

> If not, any kind of TID-based approach will have to track all previously fetched TIDs, which seems doable but unfortunate...

Right. You would always need to ignore a duplicate random choice in any one cycle of generating ctid values; and if you are iterating because you fell short, you would also need to ignore values from previous iterations. And OR your new bitmap against the previous ones to save for the next iteration, if needed. I never said it couldn't be done; it's just very fussy, and you want to avoid a very large number of iterations in the case that someone deleted 99.99% of your rows right before you ran your sample and before autovacuum caught up.

-Kevin
Re: [HACKERS] Gsoc2012 idea, tablesample
On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:
> On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli <s...@keybit.net> wrote:
>> I'd love to see enhanced CTID operators, to fetch all visible tuples in a page using a tidscan. Something like: WHERE ctid =~ '(501,*)' or a ctidrange.
> Among other things, this would enable user-space implementation of tablesample. Given the operator =~(tid, int) that matches the page number, and planner/executor integration so that it results in a TID scan, you would need the following functions:
>   random_pages(tbl regclass, samples int) returns int[]
>   aggregate function: reservoir_sample(item anyelement, samples int) returns anyarray
> Implementations for both of the functions could be adapted from analyze.c. Then tablesample could be implemented with the following query:
>   SELECT (SELECT reservoir_sample(some_table, 50) AS samples
>           FROM some_table
>           WHERE ctid =~ ANY (rnd_pgtids))
>   FROM random_pages('some_table', 50) AS rnd_pgtids;
> Actually, now that I think about it, it could actually be implemented without any modifications to core at some cost to efficiency. random_pages would have to return a tid[] that contains, for each generated page number, all possible tids on that page.

This is exactly what I'm after. I've actually started crafting such a TableSample function, and I'm in the process of refining the signature, so your suggested interface above is very useful, thanks!

But I don't understand the reservoir_sample call; what is it supposed to do? And how flexible would the anyarray return be? Could you return arbitrarily typed rowtypes from it?

> By making the building blocks available users get more flexibility. The downside would be that we can't automatically make better sampling methods available.

One approach doesn't preclude the other. TABLESAMPLE will still be useful, also for SQL compliance.

--strk;
Re: [HACKERS] Gsoc2012 idea, tablesample
On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote:
> On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:
>>   SELECT (SELECT reservoir_sample(some_table, 50) AS samples
>>           FROM some_table
>>           WHERE ctid =~ ANY (rnd_pgtids))
>>   FROM random_pages('some_table', 50) AS rnd_pgtids;
> But I don't understand the reservoir_sample call; what is it supposed to do?

Ok, got it. That was probably to avoid:

  ERROR:  more than one row returned by a subquery used as an expression

But this also works nicely:

  SELECT * FROM lots_of_points
  WHERE ctid = ANY ( ARRAY[(SELECT random_tids('lots_of_points', 10))] );

and still uses a tidscan. The advanced TID operator would be for random_tids to only return pages rather than full tids...

--strk;
Re: [HACKERS] Gsoc2012 idea, tablesample
On Tue, Apr 24, 2012 at 10:31 AM, Sandro Santilli <s...@keybit.net> wrote:
> Ok, got it. That was probably to avoid:
>   ERROR: more than one row returned by a subquery used as an expression

No, it's to avoid bias towards tuples on more sparsely populated pages. See http://en.wikipedia.org/wiki/Reservoir_sampling or http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/analyze.c;h=ff271644e0f93ee99bfe9c1f536f3dd48455d8d2;hb=HEAD#l1027

> The advanced TID operator would be for random_tids to only return pages rather than full tids...

Exactly. But when mainly IO bound (i.e. sampling from a large table on spinning rust) the overhead of probing with a TID scan, as opposed to sequentially scanning the pages, should be small enough. When CPU bound, I suspect that the function-call machinery overhead for reservoir_sample is going to become a large issue, so a built-in tablesample also has an edge there.

Ants Aasma
Re: [HACKERS] Gsoc2012 idea, tablesample
On Sat, Apr 21, 2012 at 02:28:52PM +0800, Qi Huang wrote:
> Hi, Heikki
> ...
>> Another idea that Robert Haas suggested was to add support doing a TID scan for a query like WHERE ctid < '(501,1)'. That's not enough work for a GSoC project on its own, but could certainly be a part of it.
> the first one and the last one are still not clear.

The last one was the TID scan on filters like ctid < '(501,1)'. TID scans are the fastest access method, as they directly access explicitly referenced addresses. Starting from this observation, a sampling function may select random pages and tuples within pages and directly access them, optimizing accesses by grouping tuples within the same page so as to fetch them all together. This is what the ANALYZE command already does when providing samples for the type analyzers.

Unfortunately it looks like at SQL level only the equality operator triggers a TID scan, so things like WHERE ctid < '(501,1)' won't be as fast as fetching all visible tuples in the first 501 pages. I think that's what Heikki was referring to.

I'd love to see enhanced CTID operators, to fetch all visible tuples in a page using a tidscan. Something like: WHERE ctid =~ '(501,*)' or a ctidrange.

--strk;
Re: [HACKERS] Gsoc2012 idea, tablesample
On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli <s...@keybit.net> wrote:
> I'd love to see enhanced CTID operators, to fetch all visible tuples in a page using a tidscan. Something like: WHERE ctid =~ '(501,*)' or a ctidrange.

Among other things, this would enable user-space implementation of tablesample. Given the operator =~(tid, int) that matches the page number, and planner/executor integration so that it results in a TID scan, you would need the following functions:

  random_pages(tbl regclass, samples int) returns int[]
  aggregate function: reservoir_sample(item anyelement, samples int) returns anyarray

Implementations for both of the functions could be adapted from analyze.c. Then tablesample could be implemented with the following query:

  SELECT (SELECT reservoir_sample(some_table, 50) AS samples
          FROM some_table
          WHERE ctid =~ ANY (rnd_pgtids))
  FROM random_pages('some_table', 50) AS rnd_pgtids;

Actually, now that I think about it, it could actually be implemented without any modifications to core at some cost to efficiency: random_pages would have to return a tid[] that contains, for each generated page number, all possible tids on that page.

By making the building blocks available users get more flexibility. The downside would be that we can't automatically make better sampling methods available.

Ants Aasma
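A sketch of that no-core-changes variant, under stated assumptions: the function name is hypothetical, the page size is the default 8K (hence pg_relation_size/8192 pages), 291 stands in for MaxHeapTuplesPerPage, and duplicate page picks are not filtered out:

    -- emit every possible ctid on each of k randomly chosen pages of tbl,
    -- so the result can drive a plain "ctid = ANY (...)" TID scan
    CREATE FUNCTION random_page_tids(tbl regclass, k int) RETURNS tid[]
    LANGUAGE sql VOLATILE AS $$
      SELECT array_agg(('(' || p.page || ',' || off || ')')::tid)
      FROM (SELECT floor(random() * (pg_relation_size($1) / 8192))::int AS page
            FROM generate_series(1, $2)) p,
           generate_series(1, 291) off
    $$;

Usage, following the query shape strk demonstrates later in the thread:

    SELECT * FROM lots_of_points
    WHERE ctid = ANY (random_page_tids('lots_of_points', 10));
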
Re: [HACKERS] Gsoc2012 idea, tablesample
Hi, Heikki

1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax.

2. It's not very useful if it's just a dummy replacement for WHERE random() < ?. It has to be more advanced than that. Quality of the sample is important, as is performance. There was also an interesting idea on implementing monetary unit sampling.

Another idea that Robert Haas suggested was to add support doing a TID scan for a query like WHERE ctid < '(501,1)'. That's not enough work for a GSoC project on its own, but could certainly be a part of it.

Based on the discussion these days and my understanding, I don't see much change to be made in my proposal. Of the 3 points you raised, the first one and the last one are still not clear. Especially on the first point, I see that some people think making the SQL syntax part of the grammar is the first choice. For the second point, the SQL 2003 standard defines two methods for sampling, SYSTEM and BERNOULLI. I think there might be possible quality refinements for them. For the optimizer statistics, I have an idea of using them to assign different sampling percentages to different pages, but I'm not sure about the details yet; I need to look into and learn the optimizer statistics (this part is mentioned by Neil in his idea, so I think there should be a way of using it). There might also be enhancements for specific kinds of sampling, like monetary unit sampling or geographic index sampling. I can do this part (sampling quality improvement) as a research-based project. We can still discuss deeper to see whether these can be done and how we can do them. I post my current amended proposal below.

---
Project Details: Neil Conway came up with an implementation in 2007 while giving a talk introducing hacking in PostgreSQL. The code was just for demo purposes and was incomplete; it was never integrated into PostgreSQL as a patch. PostgreSQL now is quite different from 2007. To implement this query, I need to understand Neil's implementation and use the general idea to implement it in the most up-to-date PostgreSQL release. In the end, I need to test the implementation until it can be released in the next version. I will also explore possible ways of further enhancing the sampling quality, like using optimizer statistics to produce a more accurate sample. There are also suggestions that I can integrate different sampling types into this query: for accounting data, I can use monetary unit sampling to get the result, or implement geographic index sampling. These specific types of sampling might require at least a sequential scan of the data, which slows down the sampling, but the sampling quality will improve greatly and be very useful in some specific fields.

List of features:
1. TABLESAMPLE using select, delete, and update
2. SYSTEM method
3. REPEATABLE support
4. BERNOULLI method
5. Use optimizer statistics to produce a more accurate sample
6. Non-integer sample percentage and repeat seed
7. Sampling quality enhancement

4, 5 and 6 are not included in Neil's implementation. For 5, we can use optimizer statistics to refine the algorithm for the random selection of pages or rows. The sample produced shall be more accurate.

Inch-stones:
1. Conduct the basic features' implementation: able to query with the TABLESAMPLE clause using select and SYSTEM, in different combinations of SQL queries.
2. Implementation of the other basic features, REPEATABLE and BERNOULLI.
3. Improvement implementation: support for using optimizer statistics to produce a more accurate sample, non-integer sample percentage and repeat seed, and sampling quality improvement.

Project Schedule:
1. April 23rd - May 10th: learning and understanding.
2. Mid May - Mid June: implement the simple TABLESAMPLE clause, with the SYSTEM method and no REPEATABLE support, and do testing.
3. Mid June - Mid July: implement the other features, like the REPEATABLE clause and the BERNOULLI method, and do testing. Improvements 5 and 6 are also implemented now.
4. Mid July - Mid Aug: exploration of ways to improve sampling quality will have been done during periods 2 and 3; this period will be used to implement those ideas.
---

Best Regards and Thanks
Huang Qi Victor
Computer Science of National University of Singapore
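For reference, the SQL:2003 surface this feature list targets looks like the following (table name, percentages, and seed are arbitrary examples):

    SELECT * FROM orders TABLESAMPLE SYSTEM (10);                      -- page-level, ~10% of the table
    SELECT * FROM orders TABLESAMPLE BERNOULLI (1.5) REPEATABLE (42);  -- row-level, ~1.5% of rows, seeded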
Re: [HACKERS] Gsoc2012 idea, tablesample
On Tue, Apr 17, 2012 at 04:29:52PM -0400, Stephen Frost wrote:

Josh, * Josh Berkus (j...@agliodbs.com) wrote: FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which worked with geographic indexes. This would be tremendously useful for constructing low-resolution zoom out tiles on maps and similar.

I'm familiar with the concept of 'zoom out' tiles and PostGIS, but I don't actually see the connection between that and TABLESAMPLE. Perhaps I'm missing something, but I've never seen a case where you create 'zoom out' tiles by just grabbing some portion of the data at random, as that would end up creating empty spots or missing pieces.

Actually, a random sample would really be representative of the data distribution. What the type analyzer gets is a sample, and that sample is what the estimator looks at to answer the question: how many rows fall in this rectangle? You can see how well it works by passing your queries using the && operator to EXPLAIN ANALYZE and comparing estimated vs. real.

I'm looking for a way to fetch random samples these days, so I confirm the need for a quick way to fetch, at the SQL level, the same sample that the ANALYZE command fetches.

--strk;
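Concretely, the check Sandro describes looks like this (the table and index names are made up, and the plan output is abbreviated): compare the planner's estimated row count against the actual one.

    EXPLAIN ANALYZE
    SELECT * FROM roads WHERE geom && ST_MakeEnvelope(0, 0, 10, 10, 4326);
    --  Index Scan using roads_geom_idx on roads
    --    (cost=... rows=1234 ...) (actual time=... rows=1187 ...)

If the estimate tracks the actual count well, the sample the analyzer took was representative.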
Re: [HACKERS] Gsoc2012 idea, tablesample
On Mon, 16 Apr 2012 23:17:25 -0700, Heikki Linnakangas wrote: 1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax.

I can't find the discussion about this; have any pointer?

I've found a patch from 2007 by Gavin Sherry implementing the SQL 2003 TABLESAMPLE syntax. Maybe a good starting point? http://www.neilconway.org/talks/hacking/

--strk;
Re: [HACKERS] Gsoc2012 idea, tablesample
* Sandro Santilli (s...@keybit.net) wrote: Actually, a random sample would really be representative of the data distribution. What the type analyzer gets is a sample, and that sample is what the estimator looks at to answer the question:

That might work if all you have is point data, but lines, polygons, etc., you're typically going to want to see, just not at the same resolution. At least, when you're talking about 'zoom-out' tiles, which is what this was about upthread.

I'm looking for a way to fetch random samples these days, so I confirm the need for a quick way to fetch the same sample that the ANALYZE command fetches, but at the SQL level.

I'm all for supporting that and implementing this feature; I just don't think it's going to be all that useful for zoom-out tiles when complex geometries are involved.

Thanks, Stephen
Re: [HACKERS] Gsoc2012 idea, tablesample
On Thu, Apr 19, 2012 at 08:47:51AM -0400, Stephen Frost wrote:

* Sandro Santilli (s...@keybit.net) wrote: Actually, a random sample would really be representative of the data distribution. What the type analyzer gets is a sample, and that sample is what the estimator looks at to answer the question:

That might work if all you have is point data, but lines, polygons, etc., you're typically going to want to see, just not at the same resolution. At least, when you're talking about 'zoom-out' tiles, which is what this was about upthread.

I'm looking for a way to fetch random samples these days, so I confirm the need for a quick way to fetch the same sample that the ANALYZE command fetches, but at the SQL level.

I'm all for supporting that and implementing this feature; I just don't think it's going to be all that useful for zoom-out tiles when complex geometries are involved.

Me neither. But for points it sounds very useful. And we know it is useful for lines and polygons as well when it comes to estimating overlaps (since the estimator does a good job even for lines and polygons).

I really hope Neil Conway's work of 2007 can make it into PostgreSQL. Look, the same work was the topic of a homework assignment at Berkeley in 2005: http://inst.eecs.berkeley.edu/~cs186/fa05/hw/hw2/hw2.html And the whole thing is in the SQL 2003 standard.

--strk;
Re: [HACKERS] Gsoc2012 idea, tablesample
Hi, Heikki

Thanks for your advice. I will change my plan accordingly. But I have a few questions.

1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax.

1. "This should be written as an extension, using custom functions as the API." Could you explain a bit more what this means?

2. It's not very useful if it's just a dummy replacement for WHERE random() < ?. It has to be more advanced than that. Quality of the sample is important, as is performance. There was also an interesting idea on implementing monetary unit sampling.

2. In the plan, I mentioned using optimizer statistics to improve the quality of sampling. I may emphasize that point. I will read about monetary unit sampling and add to the plan the possibility of implementing this idea.

Another idea that Robert Haas suggested was to add support doing a TID scan for a query like WHERE ctid < '(501,1)'. That's not enough work for a GSoC project on its own, but could certainly be a part of it.

3. I read the replies on using ctid. But I don't quite understand how that might help. ctid is just the physical location of a row version within its table. If I do WHERE ctid < '(501,1)', what actually happens? Can I add this as an optional implementation? I think I can check how to do this if I have enough time in this project.

Best Regards and Thanks
Huang Qi Victor
Computer Science Department, National University of Singapore

Date: Tue, 17 Apr 2012 09:16:29 +0300 From: heikki.linnakan...@enterprisedb.com To: j...@agliodbs.com CC: huangq...@hotmail.com; pgsql-hackers@postgresql.org; and...@anarazel.de; alvhe...@commandprompt.com; neil.con...@gmail.com; dan...@heroku.com; cbbro...@gmail.com; kevin.gritt...@wicourts.gov Subject: [HACKERS] Gsoc2012 idea, tablesample

On 24.03.2012 22:12, Joshua Berkus wrote: Qi, Yeah, I can see that. That's a sign that you had a good idea for a project, actually: your idea is interesting enough that people want to debate it. Make a proposal on Monday and our potential mentors will help you refine the idea.

Yep. The discussion withered, so let me try to summarize:

1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax.

2. It's not very useful if it's just a dummy replacement for WHERE random() < ?. It has to be more advanced than that. Quality of the sample is important, as is performance. There was also an interesting idea on implementing monetary unit sampling.

I think this would be a useful project if those two points are taken care of. Another idea that Robert Haas suggested was to add support doing a TID scan for a query like WHERE ctid < '(501,1)'. That's not enough work for a GSoC project on its own, but could certainly be a part of it.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Gsoc2012 idea, tablesample
Besides, I saw that the GSoC site editing has been closed. Should I just submit through this mailing list with an attachment?

Best Regards and Thanks
Huang Qi Victor
Computer Science of National University of Singapore
Re: [HACKERS] Gsoc2012 idea, tablesample
On 17.04.2012 14:55, Qi Huang wrote: Hi, Heikki Thanks for your advice. I will change my plan accordingly. But I have a few questions.

1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax.

1. "This should be written as an extension, using custom functions as the API." Could you explain a bit more what this means?

I mean, it won't be integrated into the PostgreSQL server code. Rather, it will be a standalone module that can be distributed as a separate .tar.gz file and installed on a server. PostgreSQL has some facilities to help you package code as extensions that can be easily distributed and installed.

2. It's not very useful if it's just a dummy replacement for WHERE random() < ?. It has to be more advanced than that. Quality of the sample is important, as is performance. There was also an interesting idea on implementing monetary unit sampling.

2. In the plan, I mentioned using optimizer statistics to improve the quality of sampling.

Yeah, that's one approach. Would be nice to hear more about that, how exactly you can use optimizer statistics to help the sampling.

I may emphasize that point. I will read about monetary unit sampling and add to the plan the possibility of implementing this idea.

Ok, sounds good.

Another idea that Robert Haas suggested was to add support doing a TID scan for a query like WHERE ctid < '(501,1)'. That's not enough work for a GSoC project on its own, but could certainly be a part of it.

3. I read the replies on using ctid. But I don't quite understand how that might help. ctid is just the physical location of a row version within its table. If I do WHERE ctid < '(501,1)', what actually happens?

At the moment, if you do WHERE ctid = '(501,1)', you get an access plan with a TidScan, which quickly fetches the row from that exact physical location. But if you do WHERE ctid < '(501,1)', you get a SeqScan, which scans the whole table. That's clearly wasteful: you know the physical range of pages you need to scan, everything up to page 501, but the SeqScan will scan the pages beyond 501 too. The idea is to improve that so that you'd only scan the pages up to page 501.

Can I add this as an optional implementation? I think I can check how to do this if I have enough time in this project.

Yeah, that sounds reasonable.

Besides, I saw that the GSoC site editing has been closed. Should I just submit through this mailing list with an attachment?

Just post the updated details to this mailing list. Preferably inline, not as an attachment. You don't need to post the contact details, biography, etc., just the updated inch-stones and project details parts.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
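A short illustration of the plan difference Heikki describes (hypothetical table, plan output abbreviated):

    EXPLAIN SELECT * FROM some_table WHERE ctid = '(501,1)';
    --  Tid Scan on some_table
    --    TID Cond: (ctid = '(501,1)'::tid)

    EXPLAIN SELECT * FROM some_table WHERE ctid < '(501,1)';
    --  Seq Scan on some_table
    --    Filter: (ctid < '(501,1)'::tid)

The second plan reads every page of the table and merely filters the result afterwards; the proposed improvement would stop the scan at page 501.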
Re: [HACKERS] Gsoc2012 idea, tablesample
* Heikki Linnakangas (heikki.linnakan...@enterprisedb.com) wrote: 1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax.

Err, I missed that, and don't particularly agree with it. Is there a serious issue with the grammar defined in the SQL standard? The other DBs which provide this, do they use the SQL grammar or something else? I'm not sure that I particularly *like* the SQL grammar, but if we're going to implement this, we should really do it 'right'.

2. It's not very useful if it's just a dummy replacement for WHERE random() < ?. It has to be more advanced than that. Quality of the sample is important, as is performance. There was also an interesting idea on implementing monetary unit sampling.

In reviewing this, I got the impression (perhaps mistaken..) that different sampling methods are defined by the SQL standard and that it would simply be up to us to implement them according to what the standard requires.

I think this would be a useful project if those two points are taken care of.

Doing it 'right' certainly isn't going to be simply taking what Neil did and updating it, and I understand Tom's concerns about having this be more than a hack on seqscan, so I'm a bit nervous that this would turn into something bigger than a GSoC project.

Another idea that Robert Haas suggested was to add support doing a TID scan for a query like WHERE ctid < '(501,1)'. That's not enough work for a GSoC project on its own, but could certainly be a part of it.

I don't think Robert's suggestion would be part of a 'tablesample' patch. Perhaps a completely different project which was geared towards allowing hidden columns to be used in various ways in a WHERE clause. Of course, we'd need someone to actually define that; I don't think someone relatively new to the project is going to know what experienced hackers want to do with system columns.

Thanks, Stephen
Re: [HACKERS] Gsoc2012 idea, tablesample
Stephen Frost sfr...@snowman.net writes: * Heikki Linnakangas (heikki.linnakan...@enterprisedb.com) wrote: Another idea that Robert Haas suggested was to add support doing a TID scan for a query like WHERE ctid < '(501,1)'. That's not enough work for a GSoC project on its own, but could certainly be a part of it.

I don't think Robert's suggestion would be part of a 'tablesample' patch.

Yeah, I don't see the connection either. It seems more like this would be a localized hack in tidpath.c and nodeTidscan.c. I think it'd be a neat beginning project for somebody, but it's not really related to the GSoC project as proposed.

regards, tom lane
Re: [HACKERS] Gsoc2012 idea, tablesample
On Tue, Apr 17, 2012 at 2:49 PM, Stephen Frost sfr...@snowman.net wrote: * Heikki Linnakangas (heikki.linnakan...@enterprisedb.com) wrote: 1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax.

Err, I missed that, and don't particularly agree with it. Is there a serious issue with the grammar defined in the SQL standard? The other DBs which provide this, do they use the SQL grammar or something else? I'm not sure that I particularly *like* the SQL grammar, but if we're going to implement this, we should really do it 'right'.

I think the danger is that fiddling with the grammar can be a rat's nest of learning how to deal with bison grammars, learning how to interpret the ANSI standard, and bikeshedding. These are all parts of the open source world, so maybe an argument could be made that they should be part of a GSoC project, but I fear they would swallow the whole project.

But I think I agree that doing it as an external module would be strange and not very useful. I picture it instead as a new scan type which is basically a copy of heapscan or tidscan and uses various algorithms to decide which tuples to return. For a first-cut proof of concept I would leave out the grammar and just have a GUC that enabled replacing the heap scan with a sample scan everywhere. But that would have to be done as a patch to Postgres to add the new scan type; I don't think it would be made much easier by having a hook that replaced one scan type with another.

-- greg
Re: [HACKERS] Gsoc2012 idea, tablesample
2. It's not very useful if it's just a dummy replacement for WHERE random() < ?. It has to be more advanced than that. Quality of the sample is important, as is performance. There was also an interesting idea on implementing monetary unit sampling.

In reviewing this, I got the impression (perhaps mistaken..) that different sampling methods are defined by the SQL standard and that it would simply be up to us to implement them according to what the standard requires.

I think this would be a useful project if those two points are taken care of.

Doing it 'right' certainly isn't going to be simply taking what Neil did and updating it, and I understand Tom's concerns about having this be more than a hack on seqscan, so I'm a bit nervous that this would turn into something bigger than a GSoC project.

As Christopher Browne mentioned, this sampling method is not possible without scanning the whole data set. It improves the sampling quality but increases the sampling cost. I think it should also be used only for some special sampling types, not in general. The general sampling methods, as in the SQL standard, should have only the SYSTEM and BERNOULLI methods.

Best Regards and Thanks
Huang Qi Victor
Computer Science of National University of Singapore
Re: [HACKERS] Gsoc2012 idea, tablesample
Qi, * Qi Huang (huangq...@hotmail.com) wrote: Doing it 'right' certainly isn't going to be simply taking what Neil did and updating it, and I understand Tom's concerns about having this be more than a hack on seqscan, so I'm a bit nervous that this would turn into something bigger than a GSoC project.

As Christopher Browne mentioned, this sampling method is not possible without scanning the whole data set. It improves the sampling quality but increases the sampling cost. I think it should also be used only for some special sampling types, not in general. The general sampling methods, as in the SQL standard, should have only the SYSTEM and BERNOULLI methods.

I'm not sure what sampling method you're referring to here. I agree that we need to be looking at implementing the specific sampling methods listed in the SQL standard. How much information is provided in the standard about the requirements placed on these sampling methods? Does the SQL standard only define SYSTEM and BERNOULLI? What do the other databases support? What does SQL say the requirements are for 'SYSTEM'?

Thanks, Stephen
Re: [HACKERS] Gsoc2012 idea, tablesample
On Tue, Apr 17, 2012 at 11:27 AM, Stephen Frost sfr...@snowman.net wrote: Qi, * Qi Huang (huangq...@hotmail.com) wrote: Doing it 'right' certainly isn't going to be simply taking what Neil did and updating it, and I understand Tom's concerns about having this be more than a hack on seqscan, so I'm a bit nervous that this would turn into something bigger than a GSoC project.

As Christopher Browne mentioned, this sampling method is not possible without scanning the whole data set. It improves the sampling quality but increases the sampling cost. I think it should also be used only for some special sampling types, not in general. The general sampling methods, as in the SQL standard, should have only the SYSTEM and BERNOULLI methods.

I'm not sure what sampling method you're referring to here. I agree that we need to be looking at implementing the specific sampling methods listed in the SQL standard. How much information is provided in the standard about the requirements placed on these sampling methods? Does the SQL standard only define SYSTEM and BERNOULLI? What do the other databases support? What does SQL say the requirements are for 'SYSTEM'?

Well, there may be cases where the quality of the sample isn't terribly important; it just needs to be reasonable. I browsed an article on the SYSTEM/BERNOULLI representations; they both amount to simple picks of tuples.

- BERNOULLI implies picking tuples with a specified probability.
- SYSTEM implies picking pages with a specified probability. (I think we mess with this in ways that'll be fairly biased, in view that tuples mayn't be of uniform size, particularly if Slightly Smaller strings stay in the main pages whilst Slightly Larger strings get TOASTed...)

I get the feeling that this is a somewhat-magical feature (in that users haven't much hope of understanding in what ways the results are deterministic) that is sufficiently magical that anyone serious about their result sets is likely to be unhappy to use either SYSTEM or BERNOULLI.

Possibly the forms of sampling that people *actually* need, most of the time, are more like Dollar Unit Sampling, which are pretty deterministic, in ways that mandate that they be rather expensive (e.g. guaranteeing a seq scan).

-- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
Re: [HACKERS] Gsoc2012 idea, tablesample
On Tue, Apr 17, 2012 at 5:33 PM, Christopher Browne cbbro...@gmail.com wrote: I get the feeling that this is a somewhat-magical feature (in that users haven't much hope of understanding in what ways the results are deterministic) that is sufficiently magical that anyone serious about their result sets is likely to be unhappy to use either SYSTEM or BERNOULLI.

These both sound pretty useful. BERNOULLI is fine for cases where you aren't worried about time dependency in your data, if you're looking for the average or total value of some column, for example. SYSTEM just means "I'm willing to trade some unspecified amount of accuracy for some unspecified amount of speed", which presumably is only good if you trust the database designers to make a reasonable trade-off for cases where speed matters and the accuracy requirements aren't very strict.

Possibly the forms of sampling that people *actually* need, most of the time, are more like Dollar Unit Sampling, which are pretty deterministic, in ways that mandate that they be rather expensive (e.g. guaranteeing a seq scan).

I don't know about that, but the cases I would expect to need other distributions are ones where you're looking at the tuples in a non-linear way. Things like "what's the average gap between events" or "what's the average number of instances per value". These might require a full table scan, but might still be useful if the data is going to be subsequently aggregated or joined in ways that would be too expensive on the full data set. But we shouldn't let the best be the enemy of the good here. Having SYSTEM and BERNOULLI would solve most use cases, and having those would make it easier to add more later.

-- greg
Re: [HACKERS] Gsoc2012 idea, tablesample
Qi, Hackers: FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which worked with geographic indexes. This would be tremendously useful for constructing low-resolution zoom out tiles on maps and similar. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Gsoc2012 idea, tablesample
Josh, * Josh Berkus (j...@agliodbs.com) wrote: FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which worked with geographic indexes. This would be tremendously useful for constructing low-resolution zoom out tiles on maps and similar.

I'm familiar with the concept of 'zoom out' tiles and PostGIS, but I don't actually see the connection between that and TABLESAMPLE. Perhaps I'm missing something, but I've never seen a case where you create 'zoom out' tiles by just grabbing some portion of the data at random, as that would end up creating empty spots or missing pieces.

My experience has been with running an algorithm on each record in the data set to reduce the number of points in a given record (which ends up reducing the size of the record, etc.). You could also filter out records which wouldn't be visible due to the small 'size' of the record. Again, neither of those would benefit from a TABLESAMPLE command. Perhaps they're thinking it's going to use a GiST index to only pull out records matching a certain condition..? Except we have WHERE for that.

Thanks, Stephen
Re: [HACKERS] Gsoc2012 idea, tablesample
On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne cbbro...@gmail.com wrote: Well, there may be cases where the quality of the sample isn't terribly important, it just needs to be reasonable. I browsed an article on the SYSTEM/BERNOULLI representations; they both amount to simple picks of tuples. - BERNOULLI implies picking tuples with a specified probability. - SYSTEM implies picking pages with a specified probability. (I think we mess with this in ways that'll be fairly biased in view that tuples mayn't be of uniform size, particularly if Slightly Smaller strings stay in the main pages, whilst Slightly Larger strings get TOASTed...)

Dealing with non-uniform sizes isn't too hard; analyze.c already does that. Given a table with B blocks, it takes a uniform sample of b blocks and runs Vitter's reservoir sampling algorithm over the resulting blocks to get s random tuples in a single pass. It's quite easy to prove that this results in each tuple having an equal probability of appearing in the final table. However, it isn't fully independent sampling: depending on the value of b compared to s and B, there is a slightly higher probability of seeing multiple tuples picked from one page. I'm too lazy to do the math, but for the analyze case of b = s it probably isn't significant for most practical purposes, even if B is really large. And it seems to me that when b = s the reservoir sampling thresholds could be tweaked at block boundaries to even out the dependencies. The ratio of b to s could be tweaked to get lower quality sampling (samples are more spatially clumped) in exchange for less random I/O.

Possibly the forms of sampling that people *actually* need, most of the time, are more like Dollar Unit Sampling, which are pretty deterministic, in ways that mandate that they be rather expensive (e.g. guaranteeing a seq scan).

I have a gut feeling that Dollar Unit Sampling and other weighted samples can be done with a similar approach of uniformly sampling blocks and then running weighted reservoir sampling [1] over the result.

[1]: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf

Cheers, Ants Aasma
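For intuition, the weighted scheme cited as [1] can be mimicked in plain SQL: give each row the key random()^(1/weight) and keep the n rows with the largest keys. The table and its positive "amount" weight column below are made-up examples, and this form does a full scan rather than the block-sampling hybrid Ants suggests.

    -- One-pass weighted sample (A-Res style): inclusion probability grows
    -- with the weight, dollar-unit style on a hypothetical "amount" column.
    SELECT *
    FROM invoices
    ORDER BY random() ^ (1.0 / amount::float8) DESC
    LIMIT 50;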
Re: [HACKERS] Gsoc2012 idea, tablesample
Date: Wed, 18 Apr 2012 02:45:09 +0300 Subject: Re: [HACKERS] Gsoc2012 idea, tablesample From: a...@cybertec.at To: cbbro...@gmail.com CC: sfr...@snowman.net; pgsql-hackers@postgresql.org

On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne cbbro...@gmail.com wrote: Well, there may be cases where the quality of the sample isn't terribly important, it just needs to be reasonable. I browsed an article on the SYSTEM/BERNOULLI representations; they both amount to simple picks of tuples. - BERNOULLI implies picking tuples with a specified probability. - SYSTEM implies picking pages with a specified probability. (I think we mess with this in ways that'll be fairly biased in view that tuples mayn't be of uniform size, particularly if Slightly Smaller strings stay in the main pages, whilst Slightly Larger strings get TOASTed...)

Looking at the definition of the BERNOULLI method, which means scanning all the tuples, I have always had a question: what is the difference between using the BERNOULLI method and using SELECT * ... WHERE random() < 0.1? They will both go through all the tuples at the cost of a seq scan.

If the answer to the above question is "no difference", I have a proposal for another BERNOULLI method. For a relation, we can have all its tuples assigned a unique and continuous ID (we may use ctid or something else). Then for each number in the set of IDs, we assign a random number and check whether it is smaller than the sampling percentage. If it is smaller, we retrieve the tuple corresponding to that ID. This method will not seq-scan all the tuples, but it can still sample by picking individual tuples. Thanks.

Best Regards and Thanks
Huang Qi Victor
Computer Science of National University of Singapore
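Side by side, the two forms Qi is comparing (the TABLESAMPLE line shows the standard syntax as it would look once implemented; table name is arbitrary):

    SELECT * FROM t TABLESAMPLE BERNOULLI (10);   -- standard syntax, ~10% of rows
    SELECT * FROM t WHERE random() < 0.10;        -- ad-hoc equivalent, also a seq scan

One practical difference is that the standard form can be made reproducible with REPEATABLE (seed), whereas the random() filter gives a different sample on every run unless setseed() is called first.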