Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-25 Thread Robert Haas
On Sun, Mar 25, 2012 at 6:11 AM, Marc Mamin  wrote:
> Hello,
>
> Here is something we'd like to have:
>
> http://archives.postgresql.org/pgsql-hackers/2012-01/msg00650.php
>
> As we are quite busy and this issue hasn't a high priority, we haven't
> followed it until now :-(
>
> I'm only a Postgres user, not a hacker, so I don't have the knowledge to
> help on this nor to evaluate if this is might be a good Gssoc project.
>
> Just an idea for the case you are looking for another topic.

Good idea.  If anyone want so pursue it, I'd strongly suggest building
it as a contrib module rather than dedicated syntax, because I'm not
sure there'd be any consensus on adding syntax for it to core.

Actually, though, I wonder how much faster it would be than CREATE
TABLE AS?  Block-level copy should be faster than tuple-level copy,
but I'm not sure whether it would be a lot faster or only slightly
faster.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-25 Thread Marc Mamin
Hello,

 

Here is something we'd like to have:

 

http://archives.postgresql.org/pgsql-hackers/2012-01/msg00650.php

 

As we are quite busy and this issue hasn't a high priority, we haven't followed 
it until now :-(

 

I'm only a Postgres user, not a hacker, so I don't have the knowledge to help 
on this nor to evaluate if this is might be a good Gssoc project.

 

Just an idea for the case you are looking for another topic.

 

best regards,

 

Marc Mamin

 

From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Qi Huang
Sent: Samstag, 24. März 2012 05:20
To: cbbro...@gmail.com; kevin.gritt...@wicourts.gov
Cc: pgsql-hackers@postgresql.org; and...@anarazel.de; 
alvhe...@commandprompt.com; neil.con...@gmail.com; dan...@heroku.com; 
j...@agliodbs.com
Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

 

> Date: Thu, 22 Mar 2012 13:17:01 -0400
> Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
> From: cbbro...@gmail.com
> To: kevin.gritt...@wicourts.gov
> CC: pgsql-hackers@postgresql.org
> 
> On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
>  wrote:
> > Tom Lane  wrote:
> >> Robert Haas  writes:
> >>> Well, the standard syntax apparently aims to reduce the number of
> >>> returned rows, which ORDER BY does not.  Maybe you could do it
> >>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
> >>> like to sample the table without reading all of it first, so that
> >>> seems to miss the point.
> >>
> >> I think actually the traditional locution is more like
> >>   WHERE random() < constant
> >> where the constant is the fraction of the table you want.  And
> >> yeah, the presumption is that you'd like it to not actually read
> >> every row.  (Though unless the sampling density is quite a bit
> >> less than 1 row per page, it's not clear how much you're really
> >> going to win.)
> >
> > It's all going to depend on the use cases, which I don't think I've
> > heard described very well yet.
> >
> > I've had to pick random rows from, for example, a table of
> > disbursements to support a financial audit.  In those cases it has
> > been the sample size that mattered, and order didn't.  One
> > interesting twist there is that for some of these financial audits
> > they wanted the probability of a row being selected to be
> > proportional to the dollar amount of the disbursement.  I don't
> > think you can do this without a first pass across the whole data
> > set.
> 
> This one was commonly called "Dollar Unit Sampling," though the
> terminology has gradually gotten internationalized.
> http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
> 
> What the article doesn't mention is that some particularly large items
> might wind up covering multiple samples. In the example, they're
> looking for a sample every $3125 down the list. If there was a single
> transaction valued at $3, that (roughly) covers 10 of the desired
> samples.
> 
> It isn't possible to do this without scanning across the entire table.
> 
> If you want repeatability, you probably want to instantiate a copy of
> enough information to indicate the ordering chosen. That's probably
> something that needs to be captured as part of the work of the audit,
> so not only does it need to involve a pass across the data, it
> probably requires capturing a fair bit of data for posterity.
> -- 
> When confronted by a difficult problem, solve it by reducing it to the
> question, "How would the Lone Ranger handle this?"

 

 

The discussion till now has gone far beyond my understanding.

Could anyone explain briefly what is the idea for now? 

The designing detail for me is still unfamiliar. I can only take time to 
understand while possible after being selected and put time on it to read 
relevant material. 

For now, I'm still curious why Neil's implementation is no longer working? The 
Postgres has been patched a lot, but the general idea behind Neil's 
implementation should still work, isn't it? 

Besides, whether this query is needed is still not decided . Seems this is 
another hard to decide point.  Is it that this topic is still not so prepared 
for the Gsoc yet? If really so, I think I still have time to switch to other 
topics. Any suggestion?

 

Thanks.

Best Regards and Thanks

Huang Qi Victor

Computer Science of National University of Singapore



Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-24 Thread Joshua Berkus
Qi,

Yeah, I can see that.  That's a sign that you had a good idea for a project, 
actually: your idea is interesting enough that people want to debate it.  Make 
a proposal on Monday and our potential mentors will help you refine the idea.

- Original Message -
> 
> 
> 
> 
> > Date: Thu, 22 Mar 2012 13:17:01 -0400
> > Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database
> > schema
> > From: cbbro...@gmail.com
> > To: kevin.gritt...@wicourts.gov
> > CC: pgsql-hackers@postgresql.org
> > 
> > On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
> >  wrote:
> > > Tom Lane  wrote:
> > >> Robert Haas  writes:
> > >>> Well, the standard syntax apparently aims to reduce the number
> > >>> of
> > >>> returned rows, which ORDER BY does not. Maybe you could do it
> > >>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
> > >>> like to sample the table without reading all of it first, so
> > >>> that
> > >>> seems to miss the point.
> > >> 
> > >> I think actually the traditional locution is more like
> >! ; >> WHERE random() < constant
> > >> where the constant is the fraction of the table you want. And
> > >> yeah, the presumption is that you'd like it to not actually read
> > >> every row. (Though unless the sampling density is quite a bit
> > >> less than 1 row per page, it's not clear how much you're really
> > >> going to win.)
> > > 
> > > It's all going to depend on the use cases, which I don't think
> > > I've
> > > heard described very well yet.
> > > 
> > > I've had to pick random rows from, for example, a table of
> > > disbursements to support a financial audit. In those cases it has
> > > been the sample size that mattered, and order didn't. One
> > > interesting twist there is that for some of these financial
> > > audits
> > > they wanted the probability of a row being selected to be
> > > proportional ! to the dollar amount of the disbursement. I don't
> > > t hink you can do this without a first pass across the whole data
> > > set.
> > 
> > This one was commonly called "Dollar Unit Sampling," though the
> > terminology has gradually gotten internationalized.
> > http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
> > 
> > What the article doesn't mention is that some particularly large
> > items
> > might wind up covering multiple samples. In the example, they're
> > looking for a sample every $3125 down the list. If there was a
> > single
> > transaction valued at $3, that (roughly) covers 10 of the
> > desired
> > samples.
> > 
> > It isn't possible to do this without scanning across the entire
> > table.
> > 
> > If you want repeatability, you probably want to instantiate a copy
> > of
> > enough information to indicate the ordering chosen. That's probably
> > something that needs to be captured as part of the work of the
> > audit,
> > so n! ot only does it need to involve a pass across the data, it
> > probably requires capturing a fair bit of data for posterity.
> > --
> > When confronted by a difficult problem, solve it by reducing it to
> > the
> > question, "How would the Lone Ranger handle this?"
> 
> 
> 
> 
> 
> 
> The discussion till now has gone far beyond my understanding.
> Could anyone explain briefly what is the idea for now?
> The designing detail for me is still unfamiliar. I can only take time
> to understand while possible after being selected and put time on it
> to read relevant material.
> For now, I'm still curious why Neil's implementation is no longer
> working? The Postgres has been patched a lot, but the general idea
> behind Neil's implementation should still work, isn't it?
> Besides, whether this query is needed is still not decided. Seems
> this is another hard to decide point. Is it that this topic is still
> not so prepared for th e Gsoc yet? If really so, I think I still
> have time to switch to other topics. Any suggestion?
> 
> 
> Thanks.
> 
> Best Regards and Thanks
> Huang Qi Victor
> Computer Science of National University of Singapore

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-23 Thread Qi Huang




> Date: Thu, 22 Mar 2012 13:17:01 -0400
> Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
> From: cbbro...@gmail.com
> To: kevin.gritt...@wicourts.gov
> CC: pgsql-hackers@postgresql.org
> 
> On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
>  wrote:
> > Tom Lane  wrote:
> >> Robert Haas  writes:
> >>> Well, the standard syntax apparently aims to reduce the number of
> >>> returned rows, which ORDER BY does not.  Maybe you could do it
> >>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
> >>> like to sample the table without reading all of it first, so that
> >>> seems to miss the point.
> >>
> >> I think actually the traditional locution is more like
> >>   WHERE random() < constant
> >> where the constant is the fraction of the table you want.  And
> >> yeah, the presumption is that you'd like it to not actually read
> >> every row.  (Though unless the sampling density is quite a bit
> >> less than 1 row per page, it's not clear how much you're really
> >> going to win.)
> >
> > It's all going to depend on the use cases, which I don't think I've
> > heard described very well yet.
> >
> > I've had to pick random rows from, for example, a table of
> > disbursements to support a financial audit.  In those cases it has
> > been the sample size that mattered, and order didn't.  One
> > interesting twist there is that for some of these financial audits
> > they wanted the probability of a row being selected to be
> > proportional to the dollar amount of the disbursement.  I don't
> > think you can do this without a first pass across the whole data
> > set.
> 
> This one was commonly called "Dollar Unit Sampling," though the
> terminology has gradually gotten internationalized.
> http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
> 
> What the article doesn't mention is that some particularly large items
> might wind up covering multiple samples.  In the example, they're
> looking for a sample every $3125 down the list.  If there was a single
> transaction valued at $3, that (roughly) covers 10 of the desired
> samples.
> 
> It isn't possible to do this without scanning across the entire table.
> 
> If you want repeatability, you probably want to instantiate a copy of
> enough information to indicate the ordering chosen.  That's probably
> something that needs to be captured as part of the work of the audit,
> so not only does it need to involve a pass across the data, it
> probably requires capturing a fair bit of data for posterity.
> -- 
> When confronted by a difficult problem, solve it by reducing it to the
> question, "How would the Lone Ranger handle this?"



The discussion till now has gone far beyond my understanding.Could anyone 
explain briefly what is the idea for now? The designing detail for me is still 
unfamiliar. I can only take time to understand while possible after being 
selected and put time on it to read relevant material. For now, I'm still 
curious why Neil's implementation is no longer working? The Postgres has been 
patched a lot, but the general idea behind Neil's implementation should still 
work, isn't it? Besides, whether this query is needed is still not decided. 
Seems this is another hard to decide point.  Is it that this topic is still not 
so prepared for the Gsoc yet? If really so, I think I still have time to switch 
to other topics. Any suggestion?
Thanks.

Best Regards and ThanksHuang Qi VictorComputer Science of National University 
of Singapore
  

Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-22 Thread Christopher Browne
On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
 wrote:
> Tom Lane  wrote:
>> Robert Haas  writes:
>>> Well, the standard syntax apparently aims to reduce the number of
>>> returned rows, which ORDER BY does not.  Maybe you could do it
>>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
>>> like to sample the table without reading all of it first, so that
>>> seems to miss the point.
>>
>> I think actually the traditional locution is more like
>>       WHERE random() < constant
>> where the constant is the fraction of the table you want.  And
>> yeah, the presumption is that you'd like it to not actually read
>> every row.  (Though unless the sampling density is quite a bit
>> less than 1 row per page, it's not clear how much you're really
>> going to win.)
>
> It's all going to depend on the use cases, which I don't think I've
> heard described very well yet.
>
> I've had to pick random rows from, for example, a table of
> disbursements to support a financial audit.  In those cases it has
> been the sample size that mattered, and order didn't.  One
> interesting twist there is that for some of these financial audits
> they wanted the probability of a row being selected to be
> proportional to the dollar amount of the disbursement.  I don't
> think you can do this without a first pass across the whole data
> set.

This one was commonly called "Dollar Unit Sampling," though the
terminology has gradually gotten internationalized.
http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html

What the article doesn't mention is that some particularly large items
might wind up covering multiple samples.  In the example, they're
looking for a sample every $3125 down the list.  If there was a single
transaction valued at $3, that (roughly) covers 10 of the desired
samples.

It isn't possible to do this without scanning across the entire table.

If you want repeatability, you probably want to instantiate a copy of
enough information to indicate the ordering chosen.  That's probably
something that needs to be captured as part of the work of the audit,
so not only does it need to involve a pass across the data, it
probably requires capturing a fair bit of data for posterity.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-22 Thread Kevin Grittner
Tom Lane  wrote:
> Robert Haas  writes:
>> Well, the standard syntax apparently aims to reduce the number of
>> returned rows, which ORDER BY does not.  Maybe you could do it
>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
>> like to sample the table without reading all of it first, so that
>> seems to miss the point.
> 
> I think actually the traditional locution is more like
>   WHERE random() < constant
> where the constant is the fraction of the table you want.  And
> yeah, the presumption is that you'd like it to not actually read
> every row.  (Though unless the sampling density is quite a bit
> less than 1 row per page, it's not clear how much you're really
> going to win.)
 
It's all going to depend on the use cases, which I don't think I've
heard described very well yet.
 
I've had to pick random rows from, for example, a table of
disbursements to support a financial audit.  In those cases it has
been the sample size that mattered, and order didn't.  One
interesting twist there is that for some of these financial audits
they wanted the probability of a row being selected to be
proportional to the dollar amount of the disbursement.  I don't
think you can do this without a first pass across the whole data
set.
 
I've also been involved in developing software to pick random people
for jury selection processes.  In some of these cases, you don't
want a certain percentage, you want a particular number of people,
and you want the list to be ordered so that an initial set of
potential jurors can be seated from the top of the list and then as
the voir dire[1] process progresses you can replace excused jurors
from progressive positions on the randomized list.
 
In both cases you need to be able to explain the technique used in
lay terms and show why it is very hard for anyone to predict or
control which rows are chosen, but also use statistical analysis to
prove that there is no significant correlation between the rows
chosen and identifiable characteristics of the rows.  For example,
selecting all the rows from random blocks would be right out for
juror selection because a list from the state DOT of people with
driver's licenses and state ID cards would probably be in license
number order when loaded, and since the start of the driver's
license number is a soundex of the last name, there is a strong
correlation between blocks of the table and ethnicity.
 
One technique which might be suitably random without reading the
whole table would be to figure out a maximum block number and tuple
ID for the table, and generate a series of random ctid values to
read.  If the tuple doesn't exist or is not visible to the snapshot,
you ignore it and continue, until you have read the requisite number
of rows.  You could try to generate them in advance and sort them by
block number, but then you need to solve the problems of what to do
if that set of ctids yields too many rows or too few rows, both of
which have sticky issues.
 
-Kevin
 
[1] http://en.wikipedia.org/wiki/Voir_dire#Use_in_the_United_States


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-21 Thread Tom Lane
Robert Haas  writes:
> Well, there's something mighty tempting about having a way to say
> "just give me a random sample of the blocks and I'll worry about
> whether that represents a random sample of the rows".

> It's occurred to me a few times that it's pretty unfortunate you can't
> do that with a TID condition.

> rhaas=# explain select * from randomtext where ctid >= '(500,1)' and
> ctid < '(501,1)';

Yeah, as you say that's come up more than once in data-recovery
situations.  It seems like it'd be just a SMOP to extend the tidscan
stuff to handle ranges.

Another thing that people sometimes wish for is joins using TIDs.
I think the latter would actually be pretty trivial to do now given the
parameterized-plan infrastructure; I'd hoped to look into it for 9.2 but
ran out of time...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-21 Thread Andrew Dunstan



On 03/21/2012 11:49 AM, Robert Haas wrote:

On Wed, Mar 21, 2012 at 11:34 AM, Tom Lane  wrote:

Robert Haas  writes:

Well, the standard syntax apparently aims to reduce the number of
returned rows, which ORDER BY does not.  Maybe you could do it with
ORDER BY .. LIMIT, but the idea here I think is that we'd like to
sample the table without reading all of it first, so that seems to
miss the point.

I think actually the traditional locution is more like
WHERE random()<  constant
where the constant is the fraction of the table you want.  And yeah,
the presumption is that you'd like it to not actually read every row.
(Though unless the sampling density is quite a bit less than 1 row
per page, it's not clear how much you're really going to win.)

Well, there's something mighty tempting about having a way to say
"just give me a random sample of the blocks and I'll worry about
whether that represents a random sample of the rows".

It's occurred to me a few times that it's pretty unfortunate you can't
do that with a TID condition.

rhaas=# explain select * from randomtext where ctid>= '(500,1)' and
ctid<  '(501,1)';
  QUERY PLAN

  Seq Scan on randomtext  (cost=0.00..111764.90 rows=25000 width=31)
Filter: ((ctid>= '(500,1)'::tid) AND (ctid<  '(501,1)'::tid))
(2 rows)

The last time this came up for me was when I was trying to find which
row in a large table as making the SELECT blow up; but it seems like
it could be used to implement a poor man's sampling method, too... it
would be nicer, in either case, to be able to specify the block
numbers you'd like to be able to read, rather than bounding the CTID
from both ends as in the above example.



That would rapidly get unmanageable when you wanted lots of pages.


Maybe we could do something like a pagenum pseudovar, or a wildcard 
match for ctid against '(123,*)'.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-21 Thread Robert Haas
On Wed, Mar 21, 2012 at 11:34 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> Well, the standard syntax apparently aims to reduce the number of
>> returned rows, which ORDER BY does not.  Maybe you could do it with
>> ORDER BY .. LIMIT, but the idea here I think is that we'd like to
>> sample the table without reading all of it first, so that seems to
>> miss the point.
>
> I think actually the traditional locution is more like
>        WHERE random() < constant
> where the constant is the fraction of the table you want.  And yeah,
> the presumption is that you'd like it to not actually read every row.
> (Though unless the sampling density is quite a bit less than 1 row
> per page, it's not clear how much you're really going to win.)

Well, there's something mighty tempting about having a way to say
"just give me a random sample of the blocks and I'll worry about
whether that represents a random sample of the rows".

It's occurred to me a few times that it's pretty unfortunate you can't
do that with a TID condition.

rhaas=# explain select * from randomtext where ctid >= '(500,1)' and
ctid < '(501,1)';
 QUERY PLAN

 Seq Scan on randomtext  (cost=0.00..111764.90 rows=25000 width=31)
   Filter: ((ctid >= '(500,1)'::tid) AND (ctid < '(501,1)'::tid))
(2 rows)

The last time this came up for me was when I was trying to find which
row in a large table as making the SELECT blow up; but it seems like
it could be used to implement a poor man's sampling method, too... it
would be nicer, in either case, to be able to specify the block
numbers you'd like to be able to read, rather than bounding the CTID
from both ends as in the above example.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-21 Thread Qi Huang


> Date: Wed, 21 Mar 2012 11:00:59 -0400
> From: and...@dunslane.net
> To: alvhe...@commandprompt.com
> CC: t...@sss.pgh.pa.us; robertmh...@gmail.com; huangq...@hotmail.com; 
> neil.con...@gmail.com; dan...@heroku.com; j...@agliodbs.com; 
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
> 
> 
> 
> On 03/21/2012 10:47 AM, Alvaro Herrera wrote:
> > Excerpts from Tom Lane's message of mié mar 21 11:35:54 -0300 2012:
> >
> >> Now that would all be fine if this were a widely-desired feature, but
> >> AFAIR the user demand for it has been about nil.  So I'm leaning to
> >> the position that we don't want it.
> > I disagree with there being zero interest ... the "order by random()"
> > stuff does come up occasionally.
> >
> 
> Presumably the reason that's not good enough is that is scans the whole 
> table (as well as being non-portable)? Maybe we could find some less 
> invasive way of avoiding that.
> 
> cheers
> 
> andrew


Thanks for your discussion and ideas. As I checked, MS SQL Server and DB2 
implemented tablesample for now. At least, it is useful for QUICK sample 
retrieval for large dataset. I suppose this clause itself will be much faster 
for using random().About implementation, will the code change be really very 
large? But the general structure should still be about the same, right? 
Best Regards and ThanksHuang Qi VictorComputer Science of National University 
of Singapore

Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-21 Thread Tom Lane
Robert Haas  writes:
> Well, the standard syntax apparently aims to reduce the number of
> returned rows, which ORDER BY does not.  Maybe you could do it with
> ORDER BY .. LIMIT, but the idea here I think is that we'd like to
> sample the table without reading all of it first, so that seems to
> miss the point.

I think actually the traditional locution is more like
WHERE random() < constant
where the constant is the fraction of the table you want.  And yeah,
the presumption is that you'd like it to not actually read every row.
(Though unless the sampling density is quite a bit less than 1 row
per page, it's not clear how much you're really going to win.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-21 Thread Tom Lane
Andrew Dunstan  writes:
> On 03/21/2012 10:47 AM, Alvaro Herrera wrote:
>> I disagree with there being zero interest ... the "order by random()"
>> stuff does come up occasionally.

> Presumably the reason that's not good enough is that is scans the whole 
> table (as well as being non-portable)?

The reason I'm concerned about the implementation effort is precisely
that I'm afraid people will have high expectations for the intelligence
of the feature.  If it's not materially better than you can get today
with "order by random()", it's not worth doing.  That will mean for
example that it can't just be something we bolt onto seqscans and be
done with --- it'll need to interact with indexscans, maybe joins, etc
etc.  And no shortcuts on the quality of the sampling, either.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-21 Thread Robert Haas
On Wed, Mar 21, 2012 at 10:57 AM, Andres Freund  wrote:
> On Wednesday, March 21, 2012 03:47:23 PM Alvaro Herrera wrote:
>> Excerpts from Tom Lane's message of mié mar 21 11:35:54 -0300 2012:
>> > Now that would all be fine if this were a widely-desired feature, but
>> > AFAIR the user demand for it has been about nil.  So I'm leaning to
>> > the position that we don't want it.
>>
>> I disagree with there being zero interest ... the "order by random()"
>> stuff does come up occasionally.
> Yes.
>
> I wonder if could be hacked ontop of a plain seqscan node instead of building
> a completely separate infrastructure. The standards syntax would then simply
> be transformed into a select with some special ORDER BY

Well, the standard syntax apparently aims to reduce the number of
returned rows, which ORDER BY does not.  Maybe you could do it with
ORDER BY .. LIMIT, but the idea here I think is that we'd like to
sample the table without reading all of it first, so that seems to
miss the point.

I have to admit I'm not very impressed by the argument that we
shouldn't do this because we'll need a new executor node.  Creating a
new executor node is not really that big of a deal; and in any case I
don't think Tom will like hacking another bit of functionality into
seq-scan any better, since he refactored both the Merge Append and
Index-Only Scan patches to avoid doing exactly that, and those were
more similar than this probably would be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-21 Thread Andrew Dunstan



On 03/21/2012 10:47 AM, Alvaro Herrera wrote:

Excerpts from Tom Lane's message of mié mar 21 11:35:54 -0300 2012:


Now that would all be fine if this were a widely-desired feature, but
AFAIR the user demand for it has been about nil.  So I'm leaning to
the position that we don't want it.

I disagree with there being zero interest ... the "order by random()"
stuff does come up occasionally.



Presumably the reason that's not good enough is that is scans the whole 
table (as well as being non-portable)? Maybe we could find some less 
invasive way of avoiding that.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-21 Thread Andres Freund
On Wednesday, March 21, 2012 03:47:23 PM Alvaro Herrera wrote:
> Excerpts from Tom Lane's message of mié mar 21 11:35:54 -0300 2012:
> > Now that would all be fine if this were a widely-desired feature, but
> > AFAIR the user demand for it has been about nil.  So I'm leaning to
> > the position that we don't want it.
> 
> I disagree with there being zero interest ... the "order by random()"
> stuff does come up occasionally.
Yes.

I wonder if could be hacked ontop of a plain seqscan node instead of building 
a completely separate infrastructure. The standards syntax would then simply 
be transformed into a select with some special ORDER BY

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-21 Thread Alvaro Herrera

Excerpts from Tom Lane's message of mié mar 21 11:35:54 -0300 2012:

> Now that would all be fine if this were a widely-desired feature, but
> AFAIR the user demand for it has been about nil.  So I'm leaning to
> the position that we don't want it.

I disagree with there being zero interest ... the "order by random()"
stuff does come up occasionally.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-21 Thread Tom Lane
Robert Haas  writes:
> One thing we should probably try to establish before you get started
> working on this is whether people want the feature, which is basically
> the ability to write something like this in the FROM clause of a
> query:

> table_name TABLESAMPLE { BERNOULLI | SYSTEM } ( sample_percent ) [
> REPEATABLE ( repeat_seed ) ] ]

> I have at present no position on whether we want that or not, but
> maybe someone else does.  The upside is that would be a more efficient
> replacement for the ORDER BY random() trick that is often used today;
> the downside is that it requires dedicated syntax and a whole new
> executor node for something that, realistically, isn't going to come
> up very often.

Yeah --- you're talking about chunks of new code in both planner and
executor.  A very rough estimate is that this might be about as
complicated to do properly as MergeAppend was (and we're still shaking
out the bugs in that :-().

Now that would all be fine if this were a widely-desired feature, but
AFAIR the user demand for it has been about nil.  So I'm leaning to
the position that we don't want it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-21 Thread Robert Haas
On Tue, Mar 20, 2012 at 10:22 PM, Qi Huang  wrote:
> Thanks so much, Neil.
> I think I kind of understand the situation for now. The implementation
> posted by Neil was for the purpose of the talk, thus rushed and may not be
> up to st andard of Postgres Community. Also Neil mentioned the PRNG state in
> the patch is buggy, and maybe also some others. Thus, in the Gsoc project, I
> could understand the details of Neil's implementation, fix the bugs, make
> the code fit for the community standard, and test.
> Is there any comment on this?

In addition to that, you'll probably find that the old patch doesn't
apply any more, and you'll need to fix a lot of things to get it
working again.  The code has changed a lot in the meantime.

One thing we should probably try to establish before you get started
working on this is whether people want the feature, which is basically
the ability to write something like this in the FROM clause of a
query:

table_name TABLESAMPLE { BERNOULLI | SYSTEM } ( sample_percent ) [
REPEATABLE ( repeat_seed ) ] ]

I have at present no position on whether we want that or not, but
maybe someone else does.  The upside is that would be a more efficient
replacement for the ORDER BY random() trick that is often used today;
the downside is that it requires dedicated syntax and a whole new
executor node for something that, realistically, isn't going to come
up very often.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-20 Thread Qi Huang







> Date: Tue, 20 Mar 2012 14:12:45 -0700> Subject: Re: [HACKERS] Gsoc2012 Idea 
> --- Social Network database schema
> From: neil.con...@gmail.com
> To: huangq...@hotmail.com
> CC: dan...@heroku.com; j...@agliodbs.com; pgsql-hackers@postgresql.org
> 
> 2012/3/19 Qi Huang :
> >> I actually tried to find out, personally...not sure if I was searching
> >> wrongly, but searching for TABLESAMPLE did not yield a cornucopia of
> >> useful conversations at the right time in history (~2007), even when
> >> the search is given a broad date-horizon (all), so I, too, an
> >> uninformed as to the specific objections.
> >>
> >> http://www.postgresql.org/search/?m=1&q=TABLESAMPLE&l=&d=-1&s=d
> >
> > I sent a mail to Nail Conway asking him about this. Hope he could give a
> > good answer.
> 
> I never tried to get TABLESAMPLE support into the main PostgreSQL tree
> -- I just developed the original code as an exercise for the purposes
> of the talk. Implementing TABLESAMPLE would probably be a reasonable
> GSoc project.
> 
> My memory of the details is fuzzy, but one thing to check is whether
> the approach taken by my patch (randomly choose heap pages and then
> return all the live tuples in a chosen page) actually meets the
> standard's requirements -- obviously it is not true that each heap
> page has the same number of live tuples, so you aren't getting a truly
> random sample.
> 
> Neil
> 


Thanks so much, Neil. I think I kind of understand the situation for now. The 
implementation posted by Neil was for the purpose of the talk, thus rushed and 
may not be up to standard of Postgres Community. Also Neil mentioned the PRNG 
state in the patch is buggy, and maybe also some others. Thus, in the Gsoc 
project, I could understand the details of Neil's implementation, fix the bugs, 
make the code fit for the community standard, and test. Is there any comment on 
this? 


Best Regards and ThanksHuang Qi VictorComputer Science of National University 
of Singapore

  

Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-20 Thread Neil Conway
2012/3/19 Qi Huang :
>> I actually tried to find out, personally...not sure if I was searching
>> wrongly, but searching for TABLESAMPLE did not yield a cornucopia of
>> useful conversations at the right time in history (~2007), even when
>> the search is given a broad date-horizon (all), so I, too, an
>> uninformed as to the specific objections.
>>
>> http://www.postgresql.org/search/?m=1&q=TABLESAMPLE&l=&d=-1&s=d
>
> I sent a mail to Nail Conway asking him about this. Hope he could give a
> good answer.

I never tried to get TABLESAMPLE support into the main PostgreSQL tree
-- I just developed the original code as an exercise for the purposes
of the talk. Implementing TABLESAMPLE would probably be a reasonable
GSoc project.

My memory of the details is fuzzy, but one thing to check is whether
the approach taken by my patch (randomly choose heap pages and then
return all the live tuples in a chosen page) actually meets the
standard's requirements -- obviously it is not true that each heap
page has the same number of live tuples, so you aren't getting a truly
random sample.

Neil

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-19 Thread Qi Huang

 > On Mon, Mar 19, 2012 at 6:17 PM, Josh Berkus  wrote:
> > On 3/18/12 8:11 PM, HuangQi wrote:
> >> The implementation seems to be done quite fully. There is even a patch
> >> file. Why is the implementation not added into the release of Postgres? As
> >> so much has already being done, what could I do in this case for the Gsoc?
> >
> > That would be good for you to research.  archives.postgresql.org will
> > help you find the discussions around that.
> 
> I actually tried to find out, personally...not sure if I was searching
> wrongly, but searching for TABLESAMPLE did not yield a cornucopia of
> useful conversations at the right time in history (~2007), even when
> the search is given a broad date-horizon (all), so I, too, an
> uninformed as to the specific objections.
> 
> http://www.postgresql.org/search/?m=1&q=TABLESAMPLE&l=&d=-1&s=d
> 


I sent a mail to Nail Conway asking him about this. Hope he could give a good 
answer. While waiting for the response, how about the skip scan? Daniel 
mentioned there is still some unknown.I searched this mail thread suggesting 
the skip scan to TODO list. 
http://archives.postgresql.org/pgsql-bugs/2010-03/msg00144.phpAlso this thread 
talking about http://archives.postgresql.org/pgsql-hackers/2010-03/msg00328.php
Not sure whether this is feasible for Gsoc

Best RegardsHuang Qi Victor   

Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-19 Thread Daniel Farina
On Mon, Mar 19, 2012 at 6:17 PM, Josh Berkus  wrote:
> On 3/18/12 8:11 PM, HuangQi wrote:
>> The implementation seems to be done quite fully. There is even a patch
>> file. Why is the implementation not added into the release of Postgres? As
>> so much has already being done, what could I do in this case for the Gsoc?
>
> That would be good for you to research.  archives.postgresql.org will
> help you find the discussions around that.

I actually tried to find out, personally...not sure if I was searching
wrongly, but searching for TABLESAMPLE did not yield a cornucopia of
useful conversations at the right time in history (~2007), even when
the search is given a broad date-horizon (all), so I, too, an
uninformed as to the specific objections.

http://www.postgresql.org/search/?m=1&q=TABLESAMPLE&l=&d=-1&s=d

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-19 Thread Josh Berkus
On 3/18/12 8:11 PM, HuangQi wrote:
> The implementation seems to be done quite fully. There is even a patch
> file. Why is the implementation not added into the release of Postgres? As
> so much has already being done, what could I do in this case for the Gsoc?

That would be good for you to research.  archives.postgresql.org will
help you find the discussions around that.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-18 Thread HuangQi
The implementation seems to be done quite fully. There is even a patch
file. Why is the implementation not added into the release of Postgres? As
so much has already being done, what could I do in this case for the Gsoc?

On Sun, Mar 18, 2012 at 6:38 PM, Daniel Farina  wrote:

> On Sat, Mar 17, 2012 at 8:48 PM, HuangQi  wrote:
> > About the second topic, so currently TABLESAMPLE is not implemented
> > inside Postgres? I didn't see this query before, but I googled it just
> now
> > and the query seems very weird and
> > interesting.
> http://www.fotia.co.uk/fotia/DY.18.TheTableSampleClause.aspx
> > Still, do you have any mail thread talking about this?
>
> I think there may be a few, but there's a nice implementation plan
> discussed by Neil Conway and written into slides from a few years ago:
>
> http://www.pgcon.org/2007/schedule/attachments/9-Introduction_to_Hacking_PostgreSQL_Neil_Conway.pdf
>
> He also had his implementation, although at this point some of the
> bitrot will be intense:
>
> http://www.neilconway.org/talks/hacking/
>
> I also seem to remember writing this (to some degree) as a student as
> part of a class project, so a full-blown production implementation in
> a summer sounds reasonable, unless someone has thought more about this
> and ran into some icebergs.  I'm not sure exactly what the blockers
> were to this being committed back in 2007 (not to suggest there
> weren't any).
>
> I haven't thought enough about skipscan, but there a number more
> unknowns there to me...
>
> --
> fdr
>



-- 
Best Regards
Huang Qi Victor


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-18 Thread Daniel Farina
On Sat, Mar 17, 2012 at 8:48 PM, HuangQi  wrote:
>     About the second topic, so currently TABLESAMPLE is not implemented
> inside Postgres? I didn't see this query before, but I googled it just now
> and the query seems very weird and
> interesting. http://www.fotia.co.uk/fotia/DY.18.TheTableSampleClause.aspx
>     Still, do you have any mail thread talking about this?

I think there may be a few, but there's a nice implementation plan
discussed by Neil Conway and written into slides from a few years ago:
http://www.pgcon.org/2007/schedule/attachments/9-Introduction_to_Hacking_PostgreSQL_Neil_Conway.pdf

He also had his implementation, although at this point some of the
bitrot will be intense:

http://www.neilconway.org/talks/hacking/

I also seem to remember writing this (to some degree) as a student as
part of a class project, so a full-blown production implementation in
a summer sounds reasonable, unless someone has thought more about this
and ran into some icebergs.  I'm not sure exactly what the blockers
were to this being committed back in 2007 (not to suggest there
weren't any).

I haven't thought enough about skipscan, but there a number more
unknowns there to me...

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-17 Thread HuangQi
(Sorry, Daniel. Forgot to cc pgsql-hackers.)
Hi, Daniel
Thanks a lot for your response.
As I can see for now, in my FYP, as the acyclic schema has the property
that it has a join tree. I will check how many join trees it has and
investigate any best option for the RSN schema. If it does have, I will
modify the planner to just use this join tree if it detects a RSN database,
then no need to use System R to search through all possible join trees. The
implementation is narrow to only RSN schema. It might be too constraint for
Postgres as a generic database system.
For 'loose index join', I googled it and find the two sites useful to
understand.
http://dev.mysql.com/doc/refman/5.5/en/loose-index-scan.html
http://wiki.postgresql.org/wiki/Loose_indexscan
This is implemented in MySQL already, while Postgres only emulates the
access method. Do you have any mail thread talking about the current design
and progress?
About the second topic, so currently TABLESAMPLE is not implemented
inside Postgres? I didn't see this query before, but I googled it just now
and the query seems very weird and interesting.
http://www.fotia.co.uk/fotia/DY.18.TheTableSampleClause.aspx
Still, do you have any mail thread talking about this?
Thanks.



-- 
Best Regards
Huang Qi Victor


Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-17 Thread Daniel Farina
On Sat, Mar 17, 2012 at 8:50 AM, HuangQi  wrote:
>     I'm quite glad if you could offer me some advices. Thanks a lot for your
> help!

Thank you for your interest! However, I am a little confused precisely
what you are thinking about implementing.  Are there particular access
methods or operators that you think are useful in this problem space,
or changes to the planner?

As long as you are soliciting for suggestions, I'll make one...

One that bites me (and my organization) all the time is the lack of
the access method skip scan (also called "loose index scan").  It's a
killer for append-mostly tables that track a much smaller number of
entities than the number of records in the table, and we have a
grotesque hack to do it right now.  In the more "social" space the
problem reappears in the form of newsfeeds, so I think that work would
have good impact across a nice spectrum of users.

Another skip-related feature that would be very nice is the
SQL-standard TABLESAMPLE feature.  I wonder if the notion of a
"SkipKind" could be taught to the executor that would provide cohesion
of implementation for most feature that involve skipping a lot of the
rows in a table while continuing a scan.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers