* Qi Huang (huangq...@hotmail.com) wrote:
Thanks, guys, for the lively discussion. I'll explore the ANALYZE command first
and try to make SYSTEM sampling work. I'll look at the other parts later.
Those sound like the right first steps to me. Reviewing this
discussion, it was my thought to
Subject: Re: [HACKERS] Gsoc2012 idea, tablesample
On Thu, May 10, 2012 at 02:30:35PM -0400, Robert Haas wrote:
On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner
kevin.gritt...@wicourts.gov
Florian Pflug wrote:
On May10, 2012, at 18:36 , Kevin Grittner wrote:
Robert Haas wrote:
I wonder if you could do this with something akin to the Bitmap
Heap Scan machinery. Populate a TID bitmap with a bunch of
randomly chosen TIDs, fetch them all in physical order
and if you don't get
[ Sorry for the self-reply ]
On May11, 2012, at 13:17 , Florian Pflug wrote:
Actually, thinking more about this, if the approach sketched above
turns out to work, then one might be able to get away without remembering
previously computed TIDs, thus removing a lot of complexity.
Say, for
On Thu, May 10, 2012 at 02:30:35PM -0400, Robert Haas wrote:
On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
Ants Aasma a...@cybertec.at wrote:
It seems to me that the simplest thing to do would be to lift the
sampling done in analyze.c
Florian Pflug f...@phlo.org wrote:
Maybe one can get rid of these sorts of problems by factoring in
the expected density of the table beforehand and simply accepting
that the results will be inaccurate if the statistics are
outdated?
One could, for example, simply pick
N :=
Florian Pflug f...@phlo.org writes:
This all hinges on the ability to produce a sufficiently accurate estimate
of the TID density p_tup/p_tid, of course.
I think that's the least of its problems. AFAICS this analysis ignores
(1) the problem that the TID space is nonuniform, ie we don't know
Kevin Grittner kevin.gritt...@wicourts.gov writes:
Florian Pflug f...@phlo.org wrote:
Maybe one can get rid of these sorts of problems by factoring in
the expected density of the table beforehand and simply accepting
that the results will be inaccurate if the statistics are
outdated?
Tom Lane t...@sss.pgh.pa.us wrote:
If you're willing to accept that the quality of the results
depends on having up-to-date stats, then I'd suggest (1) use the
planner's existing technology to estimate the number of rows in
the table; (2) multiply by sampling factor you want to get a
On Fri, May 11, 2012 at 11:04 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
We
will be probing random page numbers 1..P for random tuple indexes
1..M. So how many random probes by ctid does that require? The
chance of a hit on each probe is ((T/P)/M) -- the average number of
tuples
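The probe math above can be sketched in code. This is a toy simulation, not PostgreSQL internals: `sample_tids` and the dict-of-sets "heap" are illustrative stand-ins for probing the heap by ctid, chosen only to show why rejection-sampling the TID space (1..P, 1..M) yields a uniform sample with a per-probe hit chance of (T/P)/M.

```python
import random

def sample_tids(pages, sample_size, max_index):
    """Rejection-sample live tuples uniformly from the TID space (1..P, 1..M).

    'pages' maps page number -> set of live tuple indexes on that page
    (a stand-in for probing the heap by ctid)."""
    P = len(pages)
    chosen = set()
    probes = 0
    while len(chosen) < sample_size:
        probes += 1
        page = random.randint(1, P)
        line = random.randint(1, max_index)   # random index in 1..M
        if line in pages[page]:               # probe hit: a live tuple exists there
            chosen.add((page, line))
    return chosen, probes

# Toy heap: P = 100 pages, each with 60 live tuples out of M = 100 possible slots,
# so T = 6000 and the hit chance per probe is (T/P)/M = 0.6.
random.seed(42)
pages = {p: set(random.sample(range(1, 101), 60)) for p in range(1, 101)}
sample, probes = sample_tids(pages, sample_size=50, max_index=100)
```

Every live tuple has the same 1/(P*M) chance of being hit on each probe, which is the uniformity property the thread is after; the cost is the extra probes wasted on empty slots.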
Kevin Grittner kevin.gritt...@wicourts.gov writes:
[ uniformly sample the TID space defined as (1..P, 1..M) ]
Shouldn't that get us the randomly chosen sample we're looking for?
Is there a problem you think this ignores?
Not sure. The issue that I'm wondering about is that the line number
Robert Haas robertmh...@gmail.com wrote:
The trouble is, AFAICS, that you can't bound M very well without
scanning the whole table. I mean, it's bounded by theoretical
limit, but that's it.
What would the theoretical limit be? (block size - page header size
- minimum size of one tuple) /
Kevin Grittner kevin.gritt...@wicourts.gov writes:
Robert Haas robertmh...@gmail.com wrote:
The trouble is, AFAICS, that you can't bound M very well without
scanning the whole table. I mean, it's bounded by theoretical
limit, but that's it.
What would the theoretical limit be? (block size
On May11, 2012, at 17:20 , Robert Haas wrote:
On Fri, May 11, 2012 at 11:04 AM, Kevin Grittner
Since T cancels out of that equation, we don't need to worry about
estimating it. Results will be correct for any value of M which is
*at least* as large as the maximum tuple index in the table,
Tom Lane t...@sss.pgh.pa.us wrote:
Kevin Grittner kevin.gritt...@wicourts.gov writes:
[ uniformly sample the TID space defined as (1..P, 1..M) ]
Shouldn't that get us the randomly chosen sample we're looking
for? Is there a problem you think this ignores?
Not sure. The issue that I'm
On May11, 2012, at 16:03 , Kevin Grittner wrote:
[more complex alternatives]
I really think your first suggestion covers it perfectly; these more
complex techniques don't seem necessary to me.
The point of the more complex techniques (especially the algorithm in
my second mail, the reply to
On Fri, May 11, 2012 at 11:36 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
Robert Haas robertmh...@gmail.com wrote:
The trouble is, AFAICS, that you can't bound M very well without
scanning the whole table. I mean, it's bounded by theoretical
limit, but that's it.
What would the
Robert Haas robertmh...@gmail.com wrote:
On Fri, May 11, 2012 at 11:36 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
Robert Haas robertmh...@gmail.com wrote:
The trouble is, AFAICS, that you can't bound M very well without
scanning the whole table. I mean, it's bounded by theoretical
On Fri, May 11, 2012 at 6:16 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
MaxHeapTuplesPerPage?
What about dead line pointers without corresponding tuples?
Actually we don't allow there to be more than MaxHeapTuplesPerPage
line pointers even if some of them are dead line pointers.
I
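The MaxHeapTuplesPerPage bound mentioned above can be computed directly. A sketch mirroring the formula PostgreSQL uses (line pointers that fit if every tuple had an empty payload); the constants below assume a stock 8 kB-page build and are hard-coded here rather than read from a running server:

```python
# Assumed defaults for an 8 kB-page build:
BLCKSZ = 8192                  # page size
SIZE_OF_PAGE_HEADER = 24       # SizeOfPageHeaderData
SIZEOF_HEAP_TUPLE_HEADER = 23  # SizeofHeapTupleHeader (MAXALIGNed below)
SIZEOF_ITEM_ID = 4             # sizeof(ItemIdData), one line pointer

def maxalign(n, align=8):
    """Round n up to the next multiple of the MAXALIGN boundary."""
    return (n + align - 1) & ~(align - 1)

# MaxHeapTuplesPerPage: how many (line pointer + empty tuple header) pairs
# fit in the space left after the page header.
max_heap_tuples_per_page = (BLCKSZ - SIZE_OF_PAGE_HEADER) // (
    maxalign(SIZEOF_HEAP_TUPLE_HEADER) + SIZEOF_ITEM_ID)

print(max_heap_tuples_per_page)  # 291 on default builds
```

So with default build options M is bounded by 291, which is the "theoretical limit" asked about earlier in the thread.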
Hi, All
Thanks for your ideas on the implementation of TABLESAMPLE. I have a
summary below of the high-level requirements from the -hackers thread till now.
Please give further comments, and if I missed any point, please feel free to add.
1. Build a new type of node, as I should not use
On May10, 2012, at 10:43 , Qi Huang wrote:
2. use TIDSCAN to directly access tuples. The below way of using ctid
proposed by Kevin looks good.
One technique which might be suitably random without reading the
whole table would be to figure out a maximum block number and tuple
ID for the
Florian Pflug f...@phlo.org wrote:
One problem I see with this approach is that its efficiency
depends on the average tuple length, at least with a naive
approach to random ctid generator. The simplest way to generate
those randomly without introducing bias is to generate a random
page
On Thu, May 10, 2012 at 10:28 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
One problem I see with this approach is that its efficiency
depends on the average tuple length, at least with a naive
approach to random ctid generator. The simplest way to generate
those randomly without
Robert Haas robertmh...@gmail.com wrote:
I wonder if you could do this with something akin to the Bitmap
Heap Scan machinery. Populate a TID bitmap with a bunch of
randomly chosen TIDs, fetch them all in physical order
It would be pretty hard for any other plan to beat that by very
much,
On Thu, May 10, 2012 at 6:33 PM, Robert Haas robertmh...@gmail.com wrote:
I'm worried this project is getting so complicated that it will be
beyond the ability of a new hacker to get anything useful done. Can
we simplify the requirements here to something that is reasonable for
a beginner?
Ants Aasma a...@cybertec.at wrote:
It seems to me that the simplest thing to do would be to lift the
sampling done in analyze.c (acquire_sample_rows) and use that to
implement the SYSTEM sampling method.
Definitely. I thought we had all agreed on that ages ago.
-Kevin
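The analyze.c-style approach agreed on here picks whole pages at random and returns every visible tuple on them. A minimal sketch of that block-level (SYSTEM) idea; the function name and list-of-lists "heap" are illustrative, not the acquire_sample_rows API:

```python
import random

def system_sample(heap, fraction, seed=None):
    """Block-level (SYSTEM) sampling: choose each page with probability
    'fraction' and emit every tuple on a chosen page.

    'heap' is a list of pages, each page a list of tuples."""
    rng = random.Random(seed)
    for page in heap:
        if rng.random() < fraction:
            yield from page

# Toy heap: 1000 pages of 50 rows each; fraction=0.1 keeps roughly 100 pages.
heap = [[(p, i) for i in range(50)] for p in range(1000)]
rows = list(system_sample(heap, fraction=0.1, seed=1))
```

This is cheap (one sequential decision per page, no per-tuple randomness), but rows sharing a page are selected together, so the sample is clustered rather than row-uniform; that trade-off is exactly why the standard also defines BERNOULLI.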
On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
Ants Aasma a...@cybertec.at wrote:
It seems to me that the simplest thing to do would be to lift the
sampling done in analyze.c (acquire_sample_rows) and use that to
implement the SYSTEM sampling method.
On May10, 2012, at 18:36 , Kevin Grittner wrote:
Robert Haas robertmh...@gmail.com wrote:
I wonder if you could do this with something akin to the Bitmap
Heap Scan machinery. Populate a TID bitmap with a bunch of
randomly chosen TIDs, fetch them all in physical order
and if you don't get
[point splintered in quoting re-joined]
Florian Pflug wrote:
On May10, 2012, at 18:36 , Kevin Grittner wrote:
Robert Haas wrote:
I wonder if you could do this with something akin to the Bitmap
Heap Scan machinery. Populate a TID bitmap with a bunch of
randomly chosen TIDs, fetch them all
On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:
On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli s...@keybit.net wrote:
I'd love to see enhanced CTID operators, to fetch all visible tuples in a
page
using a tidscan. Something like: WHERE ctid =~ '(501,*)' or a ctidrange.
On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote:
On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:
SELECT (SELECT reservoir_sample(some_table, 50) AS samples
FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
FROM random_pages('some_table', 50) AS rnd_pgtids;
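The `reservoir_sample` aggregate in the snippet above is hypothetical; the standard one-pass technique behind such a function is Vitter's Algorithm R, which maintains a fixed-size uniform sample of a stream without knowing its length in advance. A sketch, assuming a plain iterable of rows:

```python
import random

def reservoir_sample(rows, k, seed=None):
    """One-pass uniform sample of k items from a stream (Vitter's Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for n, row in enumerate(rows):
        if n < k:
            reservoir.append(row)         # fill the reservoir with the first k rows
        else:
            j = rng.randint(0, n)         # uniform in 0..n inclusive
            if j < k:
                reservoir[j] = row        # row kept with probability k/(n+1)
    return reservoir

sample = reservoir_sample(range(10_000), k=50, seed=7)
```

This is the same algorithm analyze.c's row-stage sampling is built around, which is why lifting that code was suggested as the simplest starting point.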
On Tue, Apr 24, 2012 at 10:31 AM, Sandro Santilli s...@keybit.net wrote:
On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote:
On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:
SELECT (SELECT reservoir_sample(some_table, 50) AS samples
FROM some_table WHERE ctid =~
On Sat, Apr 21, 2012 at 02:28:52PM +0800, Qi Huang wrote:
Hi, Heikki
...
Another idea that Robert Haas suggested was to add support for doing a TID
scan for a query like WHERE ctid > '(501,1)'. That's not enough work
for a GSoC project on its own, but could certainly be a part of it.
the
On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli s...@keybit.net wrote:
I'd love to see enhanced CTID operators, to fetch all visible tuples in a page
using a tidscan. Something like: WHERE ctid =~ '(501,*)' or a ctidrange.
Among other things, this would enable user-space implementation of
Hi, Heikki
1. We probably don't want the SQL syntax to be added to the grammar.
This should be written as an extension, using custom functions as the
API, instead of extra SQL syntax.
2. It's not very useful if it's just a dummy replacement for WHERE
random() < ?. It has to be more
On Tue, Apr 17, 2012 at 04:29:52PM -0400, Stephen Frost wrote:
Josh,
* Josh Berkus (j...@agliodbs.com) wrote:
FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which
worked with geographic indexes. This would be tremendously useful for
constructing low-resolution zoom
On Mon, 16 Apr 2012 23:17:25 -0700, Heikki Linnakangas wrote:
1. We probably don't want the SQL syntax to be added to the
grammar. This should be written as an extension, using custom
functions as the API, instead of extra SQL syntax.
I can't find the discussion about this, have any
* Sandro Santilli (s...@keybit.net) wrote:
Actually a random sample would really be representative of the data
distribution. What the type analyzer gets is a sample and that sample
is what the estimator looks at to answer the question:
That might work if all you have is point data, but lines,
On Thu, Apr 19, 2012 at 08:47:51AM -0400, Stephen Frost wrote:
* Sandro Santilli (s...@keybit.net) wrote:
Actually a random sample would really be representative of the data
distribution. What the type analyzer gets is a sample and that sample
is what the estimator looks at to answer the
On 24.03.2012 22:12, Joshua Berkus wrote:
Qi,
Yeah, I can see that. That's a sign that you had a good idea for a project,
actually: your idea is interesting enough that people want to debate it. Make
a proposal on Monday and our potential mentors will help you refine the idea.
Yep. The
On 24.03.2012 22:12, Joshua Berkus wrote:
Qi,
Yeah, I can see that. That's a sign that you had a good idea for a
project, actually: your idea is interesting
On 24.03.2012 22:12, Joshua
On 17.04.2012 14:55, Qi Huang wrote:
Hi, Heikki
Thanks for your advice. I will change my plan accordingly. But
I have a few questions.
1. We probably don't want the SQL syntax to be added to the grammar.
This should be written as an extension, using custom functions as the
API, instead of
* Heikki Linnakangas (heikki.linnakan...@enterprisedb.com) wrote:
1. We probably don't want the SQL syntax to be added to the grammar.
This should be written as an extension, using custom functions as
the API, instead of extra SQL syntax.
Err, I missed that, and don't particularly agree with
Stephen Frost sfr...@snowman.net writes:
* Heikki Linnakangas (heikki.linnakan...@enterprisedb.com) wrote:
Another idea that Robert Haas suggested was to add support for doing a
TID scan for a query like WHERE ctid > '(501,1)'. That's not
enough work for a GSoC project on its own, but could certainly
On Tue, Apr 17, 2012 at 2:49 PM, Stephen Frost sfr...@snowman.net wrote:
* Heikki Linnakangas (heikki.linnakan...@enterprisedb.com) wrote:
1. We probably don't want the SQL syntax to be added to the grammar.
This should be written as an extension, using custom functions as
the API, instead of
2. It's not very useful if it's just a dummy replacement for WHERE
random() < ?. It has to be more advanced than that. Quality of the
sample is important, as is performance. There was also an
interesting idea on implementing monetary unit sampling.
In reviewing this, I got the
Qi,
* Qi Huang (huangq...@hotmail.com) wrote:
Doing it 'right' certainly isn't going to be simply taking what Neil did
and updating it, and I understand Tom's concerns about having this be
more than a hack on seqscan, so I'm a bit nervous that this would turn
into something bigger than a
On Tue, Apr 17, 2012 at 11:27 AM, Stephen Frost sfr...@snowman.net wrote:
Qi,
* Qi Huang (huangq...@hotmail.com) wrote:
Doing it 'right' certainly isn't going to be simply taking what Neil did
and updating it, and I understand Tom's concerns about having this be
more than a hack on
On Tue, Apr 17, 2012 at 5:33 PM, Christopher Browne cbbro...@gmail.com wrote:
I get the feeling that this is a somewhat-magical feature (in that
users haven't much hope of understanding in what ways the results are
deterministic) that is sufficiently magical that anyone serious
about their
Qi, Hackers:
FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which
worked with geographic indexes. This would be tremendously useful for
constructing low-resolution zoom out tiles on maps and similar.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh,
* Josh Berkus (j...@agliodbs.com) wrote:
FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which
worked with geographic indexes. This would be tremendously useful for
constructing low-resolution zoom out tiles on maps and similar.
I'm familiar with the concept of 'zoom
On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne cbbro...@gmail.com wrote:
Well, there may be cases where the quality of the sample isn't
terribly important, it just needs to be reasonable.
I browsed an article on the SYSTEM/BERNOULLI representations; they
both amount to simple picks of
Date: Wed, 18 Apr 2012 02:45:09 +0300
On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne cbbro...@gmail.com
wrote:
Well, there may