Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-14 Thread Stephen Frost
* Qi Huang (huangq...@hotmail.com) wrote: Thanks, guys, for your hot discussion. I'll explore the ANALYZE command first and try make SYSTEM sampling work. Other parts, I'll look at them later. That sounds like the right first steps to me. Reviewing this discussion, it was my thought to

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-13 Thread Qi Huang
...@heroku.com; huangq...@hotmail.com; f...@phlo.org; pgsql-hackers@postgresql.org; sfr...@snowman.net Subject: Re: [HACKERS] Gsoc2012 idea, tablesample On Thu, May 10, 2012 at 02:30:35PM -0400, Robert Haas wrote: On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner kevin.gritt...@wicourts.gov

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Florian Pflug
Florian Pflug wrote: On May10, 2012, at 18:36 , Kevin Grittner wrote: Robert Haas wrote: I wonder if you could do this with something akin to the Bitmap Heap Scan machinery. Populate a TID bitmap with a bunch of randomly chosen TIDs, fetch them all in physical order and if you don't get

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Florian Pflug
[ Sorry for the self-reply ] On May11, 2012, at 13:17 , Florian Pflug wrote: Actually, thinking more about this, if the approach sketched above turns out to work, then one might be able to get away without remembering previously computed TIDs, thus removing a lot of complexity. Say, for

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Sandro Santilli
On Thu, May 10, 2012 at 02:30:35PM -0400, Robert Haas wrote: On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Ants Aasma a...@cybertec.at wrote: It seems to me that the simplest thing to do would be to lift the sampling done in analyze.c

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Kevin Grittner
Florian Pflug f...@phlo.org wrote: Maybe one can get rid of these sorts of problems by factoring in the expected density of the table beforehand and simply accepting that the results will be inaccurate if the statistics are outdated? One could, for example, simply pick N :=

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Tom Lane
Florian Pflug f...@phlo.org writes: This all hinges on the ability to produce a sufficient accurate estimate of the TID density p_tup/p_tid, of course. I think that's the least of its problems. AFAICS this analysis ignores (1) the problem that the TID space is nonuniform, ie we don't know

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes: Florian Pflug f...@phlo.org wrote: Maybe one can get rid of these sorts of problems by factoring in the expected density of the table beforehand and simply accepting that the results will be inaccurate if the statistics are outdated?

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote: If you're willing to accept that the quality of the results depends on having up-to-date stats, then I'd suggest (1) use the planner's existing technology to estimate the number of rows in the table; (2) multiply by sampling factor you want to get a

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Robert Haas
On Fri, May 11, 2012 at 11:04 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: We will be probing random page numbers 1..P for random tuple indexes 1..M.  So how many random probes by ctid does that require? The chance of a hit on each probe is ((T/P)/M) -- the average number of tuples

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes: [ uniformly sample the TID space defined as (1..P, 1..M) ] Shouldn't that get us the randomly chosen sample we're looking for? Is there a problem you think this ignores? Not sure. The issue that I'm wondering about is that the line number

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote: The trouble is, AFAICS, that you can't bound M very well without scanning the whole table. I mean, it's bounded by theoretical limit, but that's it. What would the theoretical limit be? (black size - page header size - minimum size of one tuple) /

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes: Robert Haas robertmh...@gmail.com wrote: The trouble is, AFAICS, that you can't bound M very well without scanning the whole table. I mean, it's bounded by theoretical limit, but that's it. What would the theoretical limit be? (black size

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Florian Pflug
On May11, 2012, at 17:20 , Robert Haas wrote: On Fri, May 11, 2012 at 11:04 AM, Kevin Grittner Since T cancels out of that equation, we don't need to worry about estimating it. Results will be correct for any value of M which is *at least* as large as the maximum tuple index in the table,

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote: Kevin Grittner kevin.gritt...@wicourts.gov writes: [ uniformly sample the TID space defined as (1..P, 1..M) ] Shouldn't that get us the randomly chosen sample we're looking for? Is there a problem you think this ignores? Not sure. The issue that I'm

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Florian Pflug
On May11, 2012, at 16:03 , Kevin Grittner wrote: [more complex alternatives] I really think your first suggestion covers it perfectly; these more complex techniques don't seem necessary to me. The point of the more complex techniques (especially the algorithm in my second mail, the reply to

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Robert Haas
On Fri, May 11, 2012 at 11:36 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Robert Haas robertmh...@gmail.com wrote: The trouble is, AFAICS, that you can't bound M very well without scanning the whole table.  I mean, it's bounded by theoretical limit, but that's it. What would the

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote: On Fri, May 11, 2012 at 11:36 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Robert Haas robertmh...@gmail.com wrote: The trouble is, AFAICS, that you can't bound M very well without scanning the whole table. I mean, it's bounded by theoretical

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Greg Stark
On Fri, May 11, 2012 at 6:16 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: MaxHeapTuplesPerPage? What about dead line pointers without corresponding tuples? Actually we don't allow there to be more than MaxHeapTuplesPerPage line pointers even if some of them are dead line pointers. I

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Qi Huang
Hi, AllThanks for your ideas on the implementation of TABLESAMPLE. I have a summary below of the high level requirements from the -hacker thread till now. Please give further comment and if I missed any point, please fell free to add. 1. Build a new type of node, as I should not use

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Florian Pflug
On May10, 2012, at 10:43 , Qi Huang wrote: 2. use TIDSCAN to directly access tuples. The below way of using ctid proposed by Kevin looks good. -One technique which might be suitably random without reading the -whole table would be to figure out a maximum block number and tuple -ID for the

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Kevin Grittner
Florian Pflug f...@phlo.org wrote: One problem I see with this approach is that its efficiency depends on the average tuple length, at least with a naive approach to random ctid generator. The simplest way to generate those randomly without introducing bias is to generate a random page

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Robert Haas
On Thu, May 10, 2012 at 10:28 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: One problem I see with this approach is that its efficiency depends on the average tuple length, at least with a naive approach to random ctid generator. The simplest way to generate those randomly without

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote: I wonder if you could do this with something akin to the Bitmap Heap Scan machinery. Populate a TID bitmap with a bunch of randomly chosen TIDs, fetch them all in physical order It would be pretty hard for any other plan to beat that by very much,

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Ants Aasma
On Thu, May 10, 2012 at 6:33 PM, Robert Haas robertmh...@gmail.com wrote: I'm worried this project is getting so complicated that it will be beyond the ability of a new hacker to get anything useful done.  Can we simplify the requirements here to something that is reasonable for a beginner?

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Kevin Grittner
Ants Aasma a...@cybertec.at wrote: It seems to me that the simplest thing to do would be to lift the sampling done in analyze.c (acquire_sample_rows) and use that to implement the SYSTEM sampling method. Definitely. I thought we had all agreed on that ages ago. -Kevin -- Sent via

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Robert Haas
On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Ants Aasma a...@cybertec.at wrote: It seems to me that the simplest thing to do would be to lift the sampling done in analyze.c (acquire_sample_rows) and use that to implement the SYSTEM sampling method.

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Florian Pflug
On May10, 2012, at 18:36 , Kevin Grittner wrote: Robert Haas robertmh...@gmail.com wrote: I wonder if you could do this with something akin to the Bitmap Heap Scan machinery. Populate a TID bitmap with a bunch of randomly chosen TIDs, fetch them all in physical order and if you don't get

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Kevin Grittner
[point splintered in quoting re-joined] Florian Pflug wrote: On May10, 2012, at 18:36 , Kevin Grittner wrote: Robert Haas wrote: I wonder if you could do this with something akin to the Bitmap Heap Scan machinery. Populate a TID bitmap with a bunch of randomly chosen TIDs, fetch them all

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-24 Thread Sandro Santilli
On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote: On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli s...@keybit.net wrote: I'd love to see enhanced CTID operators, to fetch all visible tuples in a page using a tidscan.  Something like: WHERE ctid =~ '(501,*)' or a ctidrange.

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-24 Thread Sandro Santilli
On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote: On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote: SELECT (SELECT reservoir_sample(some_table, 50) AS samples FROM some_table WHERE ctid =~ ANY (rnd_pgtids)) FROM random_pages('some_table', 50) AS rnd_pgtids;

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-24 Thread Ants Aasma
On Tue, Apr 24, 2012 at 10:31 AM, Sandro Santilli s...@keybit.net wrote: On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote: On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote: SELECT (SELECT reservoir_sample(some_table, 50) AS samples    FROM some_table WHERE ctid =~

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-23 Thread Sandro Santilli
On Sat, Apr 21, 2012 at 02:28:52PM +0800, Qi Huang wrote: Hi, Heikki ... Another idea that Robert Haas suggested was to add support doing a TID scan for a query like WHERE ctid '(501,1)'. That's not enough work for GSoC project on its own, but could certainly be a part of it. the

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-23 Thread Ants Aasma
On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli s...@keybit.net wrote: I'd love to see enhanced CTID operators, to fetch all visible tuples in a page using a tidscan.  Something like: WHERE ctid =~ '(501,*)' or a ctidrange. Among other things, this would enable user-space implementation of

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-21 Thread Qi Huang
Hi, Heikki 1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax. 2. It's not very useful if it's just a dummy replacement for WHERE random() ?. It has to be more

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-19 Thread Sandro Santilli
On Tue, Apr 17, 2012 at 04:29:52PM -0400, Stephen Frost wrote: Josh, * Josh Berkus (j...@agliodbs.com) wrote: FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which worked with geographic indexes. This would be tremendously useful for constructing low-resolution zoom

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-19 Thread Sandro Santilli
On Mon, 16 Apr 2012 23:17:25 -0700, Heikki Linnakangas wrote: 1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax. I can't find the discussion about this, have any

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-19 Thread Stephen Frost
* Sandro Santilli (s...@keybit.net) wrote: Actually a random sample would really be representative of the data distribution. What the type analyzer gets is a sample and that sample is what the estimator looks at to answer the question: That might work if all you have is point data, but lines,

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-19 Thread Sandro Santilli
On Thu, Apr 19, 2012 at 08:47:51AM -0400, Stephen Frost wrote: * Sandro Santilli (s...@keybit.net) wrote: Actually a random sample would really be representative of the data distribution. What the type analyzer gets is a sample and that sample is what the estimator looks at to answer the

[HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Heikki Linnakangas
On 24.03.2012 22:12, Joshua Berkus wrote: Qi, Yeah, I can see that. That's a sign that you had a good idea for a project, actually: your idea is interesting enough that people want to debate it. Make a proposal on Monday and our potential mentors will help you refine the idea. Yep. The

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Qi Huang
...@gmail.com; dan...@heroku.com; cbbro...@gmail.com; kevin.gritt...@wicourts.gov Subject: [HACKERS] Gsoc2012 idea, tablesample On 24.03.2012 22:12, Joshua Berkus wrote: Qi, Yeah, I can see that. That's a sign that you had a good idea for a project, actually: your idea is interesting

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Qi Huang
To: j...@agliodbs.com CC: huangq...@hotmail.com; pgsql-hackers@postgresql.org; and...@anarazel.de; alvhe...@commandprompt.com; neil.con...@gmail.com; dan...@heroku.com; cbbro...@gmail.com; kevin.gritt...@wicourts.gov Subject: [HACKERS] Gsoc2012 idea, tablesample On 24.03.2012 22:12, Joshua

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Heikki Linnakangas
On 17.04.2012 14:55, Qi Huang wrote: Hi, Heikki Thanks for your advice.I will change my plan accordingly. But I have a few questions. 1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Stephen Frost
* Heikki Linnakangas (heikki.linnakan...@enterprisedb.com) wrote: 1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax. Err, I missed that, and don't particularly agree with

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes: * Heikki Linnakangas (heikki.linnakan...@enterprisedb.com) wrote: Another idea that Robert Haas suggested was to add support doing a TID scan for a query like WHERE ctid '(501,1)'. That's not enough work for GSoC project on its own, but could certainly

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Greg Stark
On Tue, Apr 17, 2012 at 2:49 PM, Stephen Frost sfr...@snowman.net wrote: * Heikki Linnakangas (heikki.linnakan...@enterprisedb.com) wrote: 1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Qi Huang
2. It's not very useful if it's just a dummy replacement for WHERE random() ?. It has to be more advanced than that. Quality of the sample is important, as is performance. There was also an interesting idea of on implementing monetary unit sampling. In reviewing this, I got the

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Stephen Frost
Qi, * Qi Huang (huangq...@hotmail.com) wrote: Doing it 'right' certainly isn't going to be simply taking what Neil did and updating it, and I understand Tom's concerns about having this be more than a hack on seqscan, so I'm a bit nervous that this would turn into something bigger than a

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Christopher Browne
On Tue, Apr 17, 2012 at 11:27 AM, Stephen Frost sfr...@snowman.net wrote: Qi, * Qi Huang (huangq...@hotmail.com) wrote: Doing it 'right' certainly isn't going to be simply taking what Neil did and updating it, and I understand Tom's concerns about having this be more than a hack on

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Greg Stark
On Tue, Apr 17, 2012 at 5:33 PM, Christopher Browne cbbro...@gmail.com wrote: I get the feeling that this is a somewhat-magical feature (in that users haven't much hope of understanding in what ways the results are deterministic) that is sufficiently magical that anyone serious about their

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Josh Berkus
Qi, Hackers: FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which worked with geographic indexes. This would be tremendously useful for constructing low-resolution zoom out tiles on maps and similar. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Stephen Frost
Josh, * Josh Berkus (j...@agliodbs.com) wrote: FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which worked with geographic indexes. This would be tremendously useful for constructing low-resolution zoom out tiles on maps and similar. I'm familiar with the concept of 'zoom

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Ants Aasma
On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne cbbro...@gmail.com wrote: Well, there may be cases where the quality of the sample isn't terribly important, it just needs to be reasonable. I browsed an article on the SYSTEM/BERNOULLI representations; they both amount to simple picks of

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Qi Huang
Date: Wed, 18 Apr 2012 02:45:09 +0300 Subject: Re: [HACKERS] Gsoc2012 idea, tablesample From: a...@cybertec.at To: cbbro...@gmail.com CC: sfr...@snowman.net; pgsql-hackers@postgresql.org On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne cbbro...@gmail.com wrote: Well, there may