Re: [HACKERS] [PATCH]Tablesample Submission

2012-11-04 Thread Qi Huang
Dear hackers,
Sorry for not replying to the patch review. I didn't see the review until recently, as my mailbox is full of Postgres mail and I didn't notice the one addressed to me (a mailbox configuration problem on my side). I am still busy with my university final-year project, so I won't have time to work on updating the patch until this semester finishes in December. I will work on it then.
Thanks.

Best Regards,
Huang Qi Victor
Computer Science, National University of Singapore

From: huangq...@outlook.com
To: robertmh...@gmail.com
CC: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PATCH]Tablesample Submission
Date: Tue, 21 Aug 2012 23:08:41 +0800




 Please add your patch here:
 
 https://commitfest.postgresql.org/action/commitfest_view/open
 
 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company

Hi, Robert
I added it under Miscellaneous.
https://commitfest.postgresql.org/action/patch_view?id=918

Best Regards,
Huang Qi Victor
Computer Science, National University of Singapore

  

[HACKERS] Estimation of HashJoin Cost

2012-11-01 Thread Qi Huang
Hi, dear hackers,
I need to estimate the hash join cost in my research. As the textbook shows, it is 3(R+S), where R and S are the sizes (in pages) of the two tables; realistically this only considers the cost of I/O, which is obviously too theoretical. What is the correct way to estimate the cost of a hash join? I looked at the code of final_cost_hashjoin(), but it is not clear which factors it considers. So, apart from the I/O to and from disk, what other factors affect the cost of a hash join? Also, is there any way to force Postgres to abide by the 3(R+S) estimate of hash join cost, i.e., to make the hash join cost consist mainly of I/O?
Thanks
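For context, the textbook formula and a planner-style estimate can be contrasted in a small sketch. This is illustrative only; the coefficients and the planner_style_cost function here are assumptions, not PostgreSQL's actual final_cost_hashjoin() model:

```python
# Textbook Grace hash join: write and re-read both partitioned inputs,
# so 3(R + S) page I/Os when neither input fits in memory.
def textbook_hashjoin_io(r_pages, s_pages):
    return 3 * (r_pages + s_pages)

# Hypothetical planner-style estimate: besides I/O, charge CPU for
# processing and hashing every tuple (coefficients are made up here,
# loosely modeled on seq_page_cost/cpu_tuple_cost-style parameters).
def planner_style_cost(r_pages, s_pages, r_tuples, s_tuples,
                       seq_page_cost=1.0, cpu_tuple_cost=0.01,
                       cpu_operator_cost=0.0025):
    io_cost = seq_page_cost * (r_pages + s_pages)   # one scan of each input
    cpu_cost = (cpu_tuple_cost + cpu_operator_cost) * (r_tuples + s_tuples)
    return io_cost + cpu_cost

print(textbook_hashjoin_io(1000, 500))              # 4500
print(round(planner_style_cost(1000, 500, 100000, 50000), 2))  # 3375.0
```

As for forcing a hash join, the usual practice is not to change the cost model but to disable the alternatives for a session (e.g. SET enable_mergejoin = off; SET enable_nestloop = off;), which makes the planner strongly prefer hash joins.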

Best Regards,
Huang Qi Victor
Computer Science, National University of Singapore
  

Re: [HACKERS] [PATCH]Tablesample Submission

2012-08-21 Thread Qi Huang
 Please add your patch here:
 
 https://commitfest.postgresql.org/action/commitfest_view/open
 
 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company

Hi, Robert
I added it under Miscellaneous.
https://commitfest.postgresql.org/action/patch_view?id=918

Best Regards,
Huang Qi Victor
Computer Science, National University of Singapore
  

[HACKERS] Git diff patch in context diff format

2012-08-08 Thread Qi Huang
Hi, hackers,
I was exporting my project to a patch file. As the patch review process requires, the patch needs to be in context diff format (http://wiki.postgresql.org/wiki/Reviewing_a_Patch). But git diff exports in a format similar to unified format. What is everyone doing for patches now? Is there any standard way?
Thanks.

Best Regards,
Huang Qi Victor
Computer Science, National University of Singapore
  

Re: [HACKERS] Git diff patch in context diff format

2012-08-08 Thread Qi Huang
 Date: Wed, 8 Aug 2012 15:05:06 -0400
 From: and...@dunslane.net
 To: br...@momjian.us
 CC: huangq...@outlook.com; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Git diff patch in context diff format
 
 
 On 08/08/2012 01:29 PM, Bruce Momjian wrote:
  On Thu, Aug  2, 2012 at 05:03:04PM +0800, Qi Huang wrote:
  Hi, hackers
   I was exporting my project to a patch file. As the patch review 
  requires,
  the patch needs to be in context diff format 
  (http://wiki.postgresql.org/wiki/
  Reviewing_a_Patch). But the git diff exports in a format similar to unified
  format. What is everyone doing with patching now? Is there any standard 
  way?
  Have you read our wiki about git and diffs?
 
  http://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git
 
 
 
 I must confess that, like Robert, I just use:
 
 git diff | filterdiff --format=context
 
 I tried the git config stuff mentioned on the wiki, and it bit me a few 
 times, I forget exactly how, and this just works without hassle.
 
 cheers
 
 andrew
 
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers

I'm sorry, but this issue has actually already been resolved. I don't know why the system sent another mail. Thanks for your response.

Best Regards,
Huang Qi Victor
Computer Science, National University of Singapore
  

[HACKERS] Git diff patch in context diff format

2012-08-03 Thread Qi Huang



Hi, hackers,
I was exporting my project to a patch file. As the patch review process requires, the patch needs to be in context diff format (http://wiki.postgresql.org/wiki/Reviewing_a_Patch). But git diff exports in a format similar to unified format. What is everyone doing for patches currently? Is there any standard way?
Thanks.

Best Regards,
Huang Qi Victor
Computer Science, National University of Singapore
  
  

Re: [HACKERS] Git diff patch in context diff format

2012-08-03 Thread Qi Huang
Thanks for your suggestion, Robert.
I just found this page (http://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git), which is quite useful and powerful.
Best Regards,
Huang Qi Victor
Computer Science, National University of Singapore

 Date: Fri, 3 Aug 2012 04:28:42 -0400
 Subject: Re: [HACKERS] Git diff patch in context diff format
 From: robertmh...@gmail.com
 To: huangq...@outlook.com
 CC: pgsql-hackers@postgresql.org
 
 On Fri, Aug 3, 2012 at 2:56 AM, Qi Huang huangq...@outlook.com wrote:
  Hi, hackers
  I was exporting my project to a patch file. As the patch review
  requires, the patch needs to be in context diff format
  (http://wiki.postgresql.org/wiki/Reviewing_a_Patch). But the git diff
  exports in a format similar to unified format. What is everyone doing with
  patching currently? Is there any standard way?
 
 When I want a context diff, I just do:
 
 git diff | filterdiff --format=context
 
 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
 
  

[HACKERS] [patch] pg_stat_lwlocks view - lwlocks statistics

2012-07-30 Thread Qi Huang

Hi,
I was doing a patch review for the patch "pg_stat_lwlocks view - lwlocks statistics".
https://commitfest.postgresql.org/action/patch_view?id=885
The mail thread for the patch is at:
http://archives.postgresql.org/pgsql-hackers/2012-06/msg01518.php
Following the steps on http://wiki.postgresql.org/wiki/Reviewing_a_Patch, the review is below:

1. Submission review
The patch is in standard git diff format. It applies to the git repo master branch at commit 8a504a363925fc5c7af48cd723da3f7e4d7ba9e2. I applied the patch with the command "patch -p1 < pg_stat_lwlocks_20120626.diff", and it applies cleanly. There is no test file or doc patch, as far as I can find.

2. Usability review
I ran queries selecting from pg_stat_lwlocks and from the function pg_stat_get_lwlocks(), and they return the same results. After running a select from pg_stat_reset_lwlocks(), all the data seems to reset. So the stats relation and functions are working.

3. Code format
Looking at the patch file, the code format looks nice, with some basic comments inside.

As for code quality, the performance review, and further debugging, I'm not sure how I should do the testing. Any suggestions? Any comments?
Thanks.

Best Regards,
Huang Qi Victor
Computer Science, National University of Singapore



  

pg_stat_lwlocks_20120626.diff
Description: Binary data



[HACKERS] Optimizer Path Candidates difference in 9.1.3 and 9.2 beta1

2012-06-26 Thread Qi Huang

Hi, hackers,
I modified the code in add_path() a bit so that path candidates in the pathlist are never removed and every new path is added to the pathlist; thus all path candidates are kept. I then tested a four-relation query. In 9.1.3, I can see thousands of candidates in the final RelOptInfo, and some of them are even bushy trees. But in 9.2 beta1, which I forked from GitHub, there are no such bushy trees and only about 50 join paths in total, which should match the requirements of the System R algorithm. Was there any modification to the System R algorithm in the new release? And was something wrong with the algorithm in 9.1.3?
Thanks
   

Best Regards,
Huang Qi Victor
Computer Science, National University of Singapore
  

Re: [HACKERS] Optimizer Path Candidates difference in 9.1.3 and 9.2 beta1

2012-06-26 Thread Qi Huang


 [ shrug... ] When you're not showing us exactly what you did, it's hard
 to answer that for sure. But there is some prefiltering logic in
 joinpath.c that you might have to lobotomize too if you want to keep
 known-inferior join paths.
 
 regards, tom lane

Thanks, Tom.
Below is what I did to the code in add_path(): I commented out the remove_old section and made sure the new path is always accepted. Then in make_one_rel(), after the call rel = make_rel_from_joinlist(root, joinlist);, I added pprint(rel) on the next line. Checking the RelOptInfo printed in the logfile, I can see some bushy trees in 9.1.3.

    /*
     * Loop to check proposed new path against old paths.  Note it is possible
     * for more than one old path to be tossed out because new_path dominates
     * it.
     *
     * We can't use foreach here because the loop body may delete the current
     * list cell.
     */
    p1_prev = NULL;
    for (p1 = list_head(parent_rel->pathlist); p1 != NULL; p1 = p1_next)
    {
        ...

        /*
         * Remove current element from pathlist if dominated by new.
         */
//      if (remove_old)
//      {
//          parent_rel->pathlist = list_delete_cell(parent_rel->pathlist,
//                                                  p1, p1_prev);

            /*
             * Delete the data pointed-to by the deleted cell, if possible
             */
//          if (!IsA(old_path, IndexPath))
//              pfree(old_path);
            /* p1_prev does not advance */
//      }
//      else
//      {
            /* new belongs after this old path if it has cost >= old's */
            if (costcmp >= 0)
                insert_after = p1;
            /* p1_prev advances */
            p1_prev = p1;
//      }

        /*
         * If we found an old path that dominates new_path, we can quit
         * scanning the pathlist; we will not add new_path, and we assume
         * new_path cannot dominate any other elements of the pathlist.
         */
        if (!accept_new)
            break;
    }

//  if (accept_new)
//  {
        /* Accept the new path: insert it at proper place in pathlist */
        if (insert_after)
            lappend_cell(parent_rel->pathlist, insert_after, new_path);
        else
            parent_rel->pathlist = lcons(new_path, parent_rel->pathlist);
//  }
//  else
//  {
//      /* Reject and recycle the new path */
//      if (!IsA(new_path, IndexPath))
//          pfree(new_path);
//  }


Best Regards,
Huang Qi Victor
Computer Science, National University of Singapore
  

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-13 Thread Qi Huang

 Date: Fri, 11 May 2012 15:59:51 +0200
 From: s...@keybit.net
 To: robertmh...@gmail.com
 CC: kevin.gritt...@wicourts.gov; a...@cybertec.at; j...@agliodbs.com; 
 and...@anarazel.de; alvhe...@commandprompt.com; 
 heikki.linnakan...@enterprisedb.com; cbbro...@gmail.com; 
 neil.con...@gmail.com; dan...@heroku.com; huangq...@hotmail.com; 
 f...@phlo.org; pgsql-hackers@postgresql.org; sfr...@snowman.net
 Subject: Re: [HACKERS] Gsoc2012 idea, tablesample
 
 On Thu, May 10, 2012 at 02:30:35PM -0400, Robert Haas wrote:
  On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner
  kevin.gritt...@wicourts.gov wrote:
   Ants Aasma a...@cybertec.at wrote:
   It seems to me that the simplest thing to do would be to lift the
   sampling done in analyze.c (acquire_sample_rows) and use that to
   implement the SYSTEM sampling method.
  
   Definitely.  I thought we had all agreed on that ages ago.
  
  Right, and I don't think we should be considering any of this other
  stuff until that basic thing is implemented and working.
 
 Agreed. That's what I'd love to see as well, for the GIS part.
 
 --strk; 

Thanks, guys, for your lively discussion. I'll explore the ANALYZE command first and try to make SYSTEM sampling work. I'll look at the other parts later.


Best Regards and Thanks,
Huang Qi Victor
Computer Science, National University of Singapore


  

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Qi Huang

Hi, all,
Thanks for your ideas on the implementation of TABLESAMPLE. Below is a summary of the high-level requirements from the -hackers thread so far. Please comment further, and if I missed any point, please feel free to add it.

1. Build a new type of node, as I should not use a SEQSCAN node, which scans the whole table. One of the aims of TABLESAMPLE is to avoid scanning every row and thus save time.
2. Use TIDSCAN to directly access tuples. The approach of using ctids proposed by Kevin below looks good.

 One technique which might be suitably random without reading the
 whole table would be to figure out a maximum block number and tuple
 ID for the table, and generate a series of random ctid values to
 read. If the tuple doesn't exist or is not visible to the snapshot,
 you ignore it and continue, until you have read the requisite number
 of rows. You could try to generate them in advance and sort them by
 block number, but then you need to solve the problems of what to do
 if that set of ctids yields too many rows or too few rows, both of
 which have sticky issues.
 -- Kevin

I think this technique could be considered as an implementation algorithm for the BERNOULLI method. It looks like it could still reduce a lot of cost compared to assigning a random number to every tuple and then retrieving.
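A toy simulation of Kevin's technique (pure Python, with a dict standing in for the heap file; the names and layout here are made up for illustration, not PostgreSQL internals):

```python
import random

# Simulate the random-ctid technique: draw random ctids (block, offset),
# ignore slots that are empty or invisible, and stop once k rows are found.
def sample_by_random_ctid(table, max_block, max_offset, k, seed=1):
    rng = random.Random(seed)
    picked = []
    tried = set()
    while len(picked) < k:
        ctid = (rng.randrange(max_block), rng.randrange(1, max_offset + 1))
        if ctid in tried:
            continue
        tried.add(ctid)
        row = table.get(ctid)   # None models a dead/missing tuple
        if row is not None:
            picked.append(row)
    return picked

# Fake heap: 10 blocks x 50 line pointers, roughly 80% of slots live.
layout = random.Random(7)
heap = {(b, o): ("row", b, o)
        for b in range(10) for o in range(1, 51)
        if layout.random() < 0.8}
sample = sample_by_random_ctid(heap, 10, 50, 20)
print(len(sample))  # 20
```

The "too many / too few rows" problem from the quote shows up here as the retry loop: misses on dead slots simply cost extra probes.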
3. Add specific sampling methods, like dollar unit sampling or geographic index support, which meet specific requirements but cost more.

 I have a gut feeling that Dollar Unit Sampling and other weighted
 samples can be done with a similar approach of uniformly sampling
 blocks and then running weighted reservoir sampling [1] over the
 result.
 [1]: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
 Cheers,
 -- Ants Aasma

Ants Aasma proposed the above method for these problems, using weighted reservoir sampling.
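The weighted reservoir scheme from the cited reference can be sketched as follows (a simplified one-pass illustration, not production code): each item gets key u^(1/w) for uniform u, and the k items with the largest keys form a sample in which heavier items (e.g. larger dollar amounts) are proportionally more likely to be chosen.

```python
import heapq
import random

def weighted_reservoir_sample(weighted_items, k, seed=3):
    """One-pass weighted sampling: keep the k largest keys u ** (1/w)."""
    rng = random.Random(seed)
    heap = []  # min-heap of (key, item); the root is the weakest survivor
    for item, weight in weighted_items:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for key, item in heap]

# Hypothetical transactions with growing dollar amounts as weights.
transactions = [("tx%03d" % i, float(10 * i + 1)) for i in range(200)]
sample = weighted_reservoir_sample(transactions, 10)
print(len(sample))  # 10
```

Because it is one pass with O(k) memory, it composes naturally with the block-sampling step Ants describes.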
4. Implement the kind of sampling ANALYZE does.

 I'm looking for a way to fetch random samples these days, so I confirm
 the need for a quick way to fetch the same sample that the ANALYZE
 command fetches, but at the SQL level.
 --strk;

As proposed by Sandro Santilli, we can look at the way ANALYZE does its sampling and try to implement it in TABLESAMPLE.


The above are the four points I summarized from the -hackers discussion. I put a brief introduction to this GSoC project on the Postgres wiki. You are welcome to check it and modify it to improve it.
http://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation

I will add the ideas formalized in this mailing thread to the wiki page once our discussion is complete.

Thanks.

Best Regards,
Huang Qi Victor
Computer Science, National University of Singapore
  

Re: [HACKERS] Welcome 2012 GSOC students

2012-04-24 Thread Qi Huang
Thanks for the hackers' support. The discussion on the mailing list is quite encouraging, and I expect to enjoy three months of fun with Postgres. I'm still in the final exam period at my university and will participate more after the exams finish.
Thanks!

Sent from my Windows Phone

From: Josh Berkus
Sent: 25/4/2012 1:06 AM
To: PostgreSQL-development
Subject: [HACKERS] Welcome 2012 GSOC students

Hackers,

We've chosen the 5 GSOC projects for this year:

* JDBC Foreign Data Wrapper, by Atri, mentored by Merlin Moncure
* Document Collection Foreign Data Wrapper, by Zheng Yang (a returning
student), mentored by Satoshi Nagayasu
* Implementing TABLESAMPLE, by Qi, mentored by Stephen Frost
* Better Indexing for Ranges, by Alexander (returning), mentored by
Heikki Linnakangas
* xReader Streaming xLog Reader, by Aakash, mentored by Kevin Grittner

Congratulations to the selected students, and expect to see them here on
-hackers working on their projects.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-21 Thread Qi Huang

Hi, Heikki
 1. We probably don't want the SQL syntax to be added to the grammar. 
 This should be written as an extension, using custom functions as the 
 API, instead of extra SQL syntax.
 
 2. It's not very useful if it's just a dummy replacement for WHERE 
 random() < ?. It has to be more advanced than that. Quality of the 
 sample is important, as is performance. There was also an interesting 
 idea of implementing monetary unit sampling.
 
 
 Another idea that Robert Haas suggested was to add support for doing a TID 
 scan for a query like WHERE ctid > '(501,1)'. That's not enough work 
 for a GSoC project on its own, but could certainly be a part of it.
 


Based on the discussion these days and my understanding, I don't see much change to be made in my proposal. Of the points you raised, the first and the last are still not fully settled; for the first in particular, I see that some people think putting the SQL syntax into the grammar is the first choice. For the second point, the SQL:2003 standard defines two sampling methods, SYSTEM and BERNOULLI, and I think there is room to refine the quality of both. As for the optimizer statistics, I have an idea of using them to assign different sampling percentages to different pages, but I'm not sure about the details yet; I need to look into and learn the optimizer statistics (this part is mentioned in Neil's idea, so I think there should be a way to use it). There might also be enhancements for specific sampling types, like monetary unit sampling or geographic index sampling. I can do this part (sampling quality improvement) as a research-based project. We can discuss further to see whether these can be done and how to do them.
I post my current amended proposal below. The changes are in red color.
---
Project Details:

Neil Conway came up with an implementation in 2007 while giving a talk introducing hacking in PostgreSQL. The code was for demo purposes only and was incomplete, and it was never integrated as a PostgreSQL patch. PostgreSQL today is quite different from 2007. To implement this query, I need to understand Neil's implementation and use its general idea to implement the feature on the most up-to-date PostgreSQL release. In the end, I need to test the implementation until it can be released as a patch. I will also explore possible ways of further enhancing the sampling quality, like using optimizer statistics to produce a more accurate sample. There are also suggestions to integrate different sampling types into this query: for accounting data, for example, I could use monetary unit sampling, or I could implement geographic index sampling. These specific types of sampling might require at least a sequential scan of the data, which slows sampling down, but the sampling quality would be greatly enhanced and very useful in some specific fields.


List of features:

1. TABLESAMPLE using select, delete, and update
2. SYSTEM method
3. REPEATABLE support
4. BERNOULLI method
5. Use optimizer statistics to produce a more accurate sample
6. Non-integer sample percentage and repeat seed
7. Sampling quality enhancement

4, 5 and 6 are not included in Neil's implementation.

For 5, we can use optimizer statistics to refine the algorithm for the random selection of pages or rows. The sample produced should be more accurate.



Inch-stones:

1. Implement the basic features: the TABLESAMPLE clause usable in SELECT with the SYSTEM method, in different combinations of SQL queries.
2. Implement the other basic features: REPEATABLE support and the BERNOULLI method.
3. Implement the improvements: support for using optimizer statistics to produce a more accurate sample, non-integer sample percentage and repeat seed, and sampling quality improvement.

 

Project Schedule:

1. April 23rd - May 10th: learning and understanding.
2. Mid May - Mid June: implement a simple TABLESAMPLE clause with the SYSTEM method and no REPEATABLE support, and do testing.
3. Mid June - Mid July: implement the other supports, like the REPEATABLE clause and the BERNOULLI method, and do testing. Improvements 5 and 6 are also implemented in this period.
4. Mid July - Mid Aug: exploring ways of improving sampling quality should be done during periods 2 and 3; this period will be used to implement those ideas.
---
 




Best Regards and Thanks,
Huang Qi Victor
Computer Science, National University of Singapore

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Qi Huang




Hi, Heikki,
Thanks for your advice. I will change my plan accordingly, but I have a few questions.
 1. We probably don't want the SQL syntax to be added to the grammar. 
 This should be written as an extension, using custom functions as the 
 API, instead of extra SQL syntax.
 

1. "This should be written as an extension, using custom functions as the API." Could you explain a bit more what this means?
 2. It's not very useful if it's just a dummy replacement for WHERE 
 random() < ?. It has to be more advanced than that. Quality of the 
 sample is important, as is performance. There was also an interesting 
 idea of implementing monetary unit sampling.

2. In the plan, I mentioned using optimizer statistics to improve the quality of sampling; I will emphasize that point. I will read about monetary unit sampling and add the possibility of implementing this idea to the plan.
 Another idea that Robert Haas suggested was to add support for doing a TID 
 scan for a query like WHERE ctid > '(501,1)'. That's not enough work 
 for a GSoC project on its own, but could certainly be a part of it.

3. I read the replies on using ctid, but I don't quite understand how that might help. ctid is just the physical location of a row version within the table. If I do WHERE ctid > '(501,1)', what actually happens? Can I add this as an optional implementation? I think I can check how to do this if I have enough time in the project.

Best Regards and Thanks,
Huang Qi Victor
Computer Science Department, National University of Singapore

 Date: Tue, 17 Apr 2012 09:16:29 +0300
 From: heikki.linnakan...@enterprisedb.com
 To: j...@agliodbs.com
 CC: huangq...@hotmail.com; pgsql-hackers@postgresql.org; and...@anarazel.de; 
 alvhe...@commandprompt.com; neil.con...@gmail.com; dan...@heroku.com; 
 cbbro...@gmail.com; kevin.gritt...@wicourts.gov
 Subject: [HACKERS] Gsoc2012 idea, tablesample
 
 On 24.03.2012 22:12, Joshua Berkus wrote:
  Qi,
 
  Yeah, I can see that.  That's a sign that you had a good idea for a 
  project, actually: your idea is interesting enough that people want to 
  debate it.  Make a proposal on Monday and our potential mentors will help 
  you refine the idea.
 
 Yep. The discussion withered, so let me try to summarize:
 
 1. We probably don't want the SQL syntax to be added to the grammar. 
 This should be written as an extension, using custom functions as the 
 API, instead of extra SQL syntax.
 
 2. It's not very useful if it's just a dummy replacement for WHERE 
 random() < ?. It has to be more advanced than that. Quality of the 
 sample is important, as is performance. There was also an interesting 
 idea of implementing monetary unit sampling.
 
 I think this would be a useful project if those two points are taken 
 care of.
 
 Another idea that Robert Haas suggested was to add support for doing a TID 
 scan for a query like WHERE ctid > '(501,1)'. That's not enough work 
 for a GSoC project on its own, but could certainly be a part of it.
 
 -- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
 

  

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Qi Huang

Besides, I saw that editing on the GSoC site has been closed. Should I just submit through this mailing list with an attachment?

Best Regards and Thanks,
Huang Qi Victor
Computer Science, National University of Singapore

 Date: Tue, 17 Apr 2012 09:16:29 +0300
 From: heikki.linnakan...@enterprisedb.com
 To: j...@agliodbs.com
 CC: huangq...@hotmail.com; pgsql-hackers@postgresql.org; and...@anarazel.de; 
 alvhe...@commandprompt.com; neil.con...@gmail.com; dan...@heroku.com; 
 cbbro...@gmail.com; kevin.gritt...@wicourts.gov
 Subject: [HACKERS] Gsoc2012 idea, tablesample
 
 On 24.03.2012 22:12, Joshua Berkus wrote:
  Qi,
 
  Yeah, I can see that.  That's a sign that you had a good idea for a 
  project, actually: your idea is interesting enough that people want to 
  debate it.  Make a proposal on Monday and our potential mentors will help 
  you refine the idea.
 
 Yep. The discussion withered, so let me try to summarize:
 
 1. We probably don't want the SQL syntax to be added to the grammar. 
 This should be written as an extension, using custom functions as the 
 API, instead of extra SQL syntax.
 
 2. It's not very useful if it's just a dummy replacement for WHERE 
 random() < ?. It has to be more advanced than that. Quality of the 
 sample is important, as is performance. There was also an interesting 
 idea of implementing monetary unit sampling.
 
 I think this would be a useful project if those two points are taken 
 care of.
 
 Another idea that Robert Haas suggested was to add support for doing a TID 
 scan for a query like WHERE ctid > '(501,1)'. That's not enough work 
 for a GSoC project on its own, but could certainly be a part of it.
 
 -- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
 
  

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Qi Huang

  2. It's not very useful if it's just a dummy replacement for WHERE
  random() < ?. It has to be more advanced than that. Quality of the
  sample is important, as is performance. There was also an
  interesting idea of implementing monetary unit sampling.
 
 In reviewing this, I got the impression (perhaps mistaken..), that
 different sampling methods are defined by the SQL standard and that it
 would simply be us to implement them according to what the standard
 requires.
 
  I think this would be a useful project if those two points are taken
  care of.
 
 Doing it 'right' certainly isn't going to be simply taking what Neil did
 and updating it, and I understand Tom's concerns about having this be
 more than a hack on seqscan, so I'm a bit nervous that this would turn
 into something bigger than a GSoC project.
 

As Christopher Browne mentioned, this kind of sampling method is not possible without scanning the whole data set. It improves the sampling quality but increases the sampling cost. I think it should also be used only for some special sampling types, not in general. The general sampling methods, as in the SQL standard, should be only the SYSTEM and BERNOULLI methods.

Best Regards and Thanks,
Huang Qi Victor
Computer Science, National University of Singapore

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Qi Huang




 Date: Wed, 18 Apr 2012 02:45:09 +0300
 Subject: Re: [HACKERS] Gsoc2012 idea, tablesample
 From: a...@cybertec.at
 To: cbbro...@gmail.com
 CC: sfr...@snowman.net; pgsql-hackers@postgresql.org
 
 On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne cbbro...@gmail.com 
 wrote:
  Well, there may be cases where the quality of the sample isn't
  terribly important, it just needs to be reasonable.
 
  I browsed an article on the SYSTEM/BERNOULLI representations; they
  both amount to simple picks of tuples.
 
  - BERNOULLI implies picking tuples with a specified probability.
 
  - SYSTEM implies picking pages with a specified probability.  (I think
  we mess with this in ways that'll be fairly biased in view that tuples
  mayn't be of uniform size, particularly if Slightly Smaller strings
  stay in the main pages, whilst Slightly Larger strings get TOASTed...) 

Looking at the definition of the BERNOULLI method, which scans all the tuples, I have a question: what is the difference between using the BERNOULLI method and using select * ... where random() < 0.1? They will both go through all the tuples and cost a seq scan. If the answer is that there is no difference, I have a proposal for another way of implementing BERNOULLI. For a relation, we assign all tuples a unique and continuous ID (we may use ctid or something else). Then for each number in the set of IDs, we generate a random number and check whether it is smaller than the sampling percentage. If it is smaller, we retrieve the tuple corresponding to that ID. This method does not seq scan all the tuples, but it can still sample by picking individual tuples.
Thanks
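As a statistical matter, the per-tuple coin flip can be sketched like this (illustrative Python, not the proposed C implementation):

```python
import random

# BERNOULLI-style sampling: an independent keep/skip decision per tuple,
# statistically the same filter as WHERE random() < p during a seq scan.
def bernoulli_sample(rows, p, seed=5):
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]

rows = list(range(100000))
sample = bernoulli_sample(rows, 0.1)
fraction = len(sample) / len(rows)
print(abs(fraction - 0.1) < 0.01)  # True: close to the 10% target
```

The ID-picking variant proposed above changes the access pattern (fetch only the chosen tuples) but not each tuple's marginal inclusion probability, which stays p.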

Best Regards and Thanks,
Huang Qi Victor
Computer Science, National University of Singapore




  

Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-23 Thread Qi Huang




 Date: Thu, 22 Mar 2012 13:17:01 -0400
 Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
 From: cbbro...@gmail.com
 To: kevin.gritt...@wicourts.gov
 CC: pgsql-hackers@postgresql.org
 
 On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:
  Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  Well, the standard syntax apparently aims to reduce the number of
  returned rows, which ORDER BY does not.  Maybe you could do it
  with ORDER BY .. LIMIT, but the idea here I think is that we'd
  like to sample the table without reading all of it first, so that
  seems to miss the point.
 
  I think actually the traditional locution is more like
 WHERE random() < constant
  where the constant is the fraction of the table you want.  And
  yeah, the presumption is that you'd like it to not actually read
  every row.  (Though unless the sampling density is quite a bit
  less than 1 row per page, it's not clear how much you're really
  going to win.)
 
  It's all going to depend on the use cases, which I don't think I've
  heard described very well yet.
 
  I've had to pick random rows from, for example, a table of
  disbursements to support a financial audit.  In those cases it has
  been the sample size that mattered, and order didn't.  One
  interesting twist there is that for some of these financial audits
  they wanted the probability of a row being selected to be
  proportional to the dollar amount of the disbursement.  I don't
  think you can do this without a first pass across the whole data
  set.
 
 This one was commonly called Dollar Unit Sampling, though the
 terminology has gradually gotten internationalized.
 http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
 
 What the article doesn't mention is that some particularly large items
 might wind up covering multiple samples.  In the example, they're
 looking for a sample every $3125 down the list.  If there was a single
 transaction valued at $3, that (roughly) covers 10 of the desired
 samples.
 
 It isn't possible to do this without scanning across the entire table.
 
 If you want repeatability, you probably want to instantiate a copy of
 enough information to indicate the ordering chosen.  That's probably
 something that needs to be captured as part of the work of the audit,
 so not only does it need to involve a pass across the data, it
 probably requires capturing a fair bit of data for posterity.
 -- 
 When confronted by a difficult problem, solve it by reducing it to the
 question, "How would the Lone Ranger handle this?"



The discussion so far has gone beyond my current understanding. Could anyone 
briefly summarize where the idea stands now? The design details are still 
unfamiliar to me; I can only take the time to study the relevant material 
properly after being selected and putting time into it. For now, I'm still 
curious why Neil's implementation no longer works: Postgres has been patched a 
lot since then, but the general idea behind his implementation should still 
hold, shouldn't it? Besides, whether this feature is wanted at all is still 
undecided, which seems to be another difficult point. Is this topic perhaps 
not yet well enough prepared for GSoC? If so, I think I still have time to 
switch to another topic. Any suggestions?
Thanks.

Best Regards and Thanks
Huang Qi Victor
Computer Science of National University of Singapore
  

Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-21 Thread Qi Huang


 Date: Wed, 21 Mar 2012 11:00:59 -0400
 From: and...@dunslane.net
 To: alvhe...@commandprompt.com
 CC: t...@sss.pgh.pa.us; robertmh...@gmail.com; huangq...@hotmail.com; 
 neil.con...@gmail.com; dan...@heroku.com; j...@agliodbs.com; 
 pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
 
 
 
 On 03/21/2012 10:47 AM, Alvaro Herrera wrote:
  Excerpts from Tom Lane's message of mié mar 21 11:35:54 -0300 2012:
 
  Now that would all be fine if this were a widely-desired feature, but
  AFAIR the user demand for it has been about nil.  So I'm leaning to
  the position that we don't want it.
  I disagree with there being zero interest ... the order by random()
  stuff does come up occasionally.
 
 
 Presumably the reason that's not good enough is that is scans the whole 
 table (as well as being non-portable)? Maybe we could find some less 
 invasive way of avoiding that.
 
 cheers
 
 andrew


Thanks for your discussion and ideas. As far as I checked, MS SQL Server and 
DB2 have implemented TABLESAMPLE by now. At the least, it is useful for quick 
sample retrieval from a large dataset, and I suppose the clause itself should 
be much faster than using random(). About the implementation: will the code 
change really be very large? The general structure should still be about the 
same, right?
Best Regards and Thanks
Huang Qi Victor
Computer Science of National University of Singapore

Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-20 Thread Qi Huang







 Date: Tue, 20 Mar 2012 14:12:45 -0700
 Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
 From: neil.con...@gmail.com
 To: huangq...@hotmail.com
 CC: dan...@heroku.com; j...@agliodbs.com; pgsql-hackers@postgresql.org
 
 2012/3/19 Qi Huang huangq...@hotmail.com:
  I actually tried to find out, personally...not sure if I was searching
  wrongly, but searching for TABLESAMPLE did not yield a cornucopia of
  useful conversations at the right time in history (~2007), even when
  the search is given a broad date-horizon (all), so I, too, am
  uninformed as to the specific objections.
 
  http://www.postgresql.org/search/?m=1&q=TABLESAMPLE&l=&d=-1&s=d
 
  I sent a mail to Neil Conway asking him about this. Hope he could give a
  good answer.
 
 I never tried to get TABLESAMPLE support into the main PostgreSQL tree
 -- I just developed the original code as an exercise for the purposes
 of the talk. Implementing TABLESAMPLE would probably be a reasonable
 GSoc project.
 
 My memory of the details is fuzzy, but one thing to check is whether
 the approach taken by my patch (randomly choose heap pages and then
 return all the live tuples in a chosen page) actually meets the
 standard's requirements -- obviously it is not true that each heap
 page has the same number of live tuples, so you aren't getting a truly
 random sample.
 
 Neil
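Neil's caveat above can be illustrated with a small simulation (a sketch under assumed page layouts; the function name and numbers are hypothetical, and real heap access works through the buffer manager, not lists): choosing pages uniformly and returning every live tuple on each chosen page makes the sample arrive in whole-page clumps, so the sample size depends heavily on which pages are hit when pages hold different numbers of live tuples.

```python
import random

def page_sample(pages, n_pages_wanted, seed=42):
    """Page-level sampling in the style described above: pick heap pages
    uniformly at random and return every live tuple on each chosen page.
    Tuples are represented as (page_no, tuple_id) pairs."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(pages)), n_pages_wanted)
    return [(p, t) for p in chosen for t in pages[p]]

# An uneven heap: page 0 holds 100 live tuples, pages 1-9 hold 10 each
# (190 tuples total).
heap = [list(range(100))] + [list(range(10)) for _ in range(9)]

# Sampling 1 page out of 10 "targets" ~10% of the tuples (~19), but any
# single draw returns either 100 tuples or 10 — never anything near 19.
sizes = {len(page_sample(heap, 1, seed=s)) for s in range(50)}
print(sizes)
```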
 


Thanks so much, Neil. I think I now roughly understand the situation. The 
implementation Neil posted was written for the purposes of the talk, so it was 
rushed and may not be up to the Postgres community's standards. Neil also 
mentioned that the PRNG state in the patch is buggy, and there may be other 
issues. Thus, for the GSoC project, I could work through the details of Neil's 
implementation, fix the bugs, bring the code up to community standards, and 
test it. Are there any comments on this?


Best Regards and Thanks
Huang Qi Victor
Computer Science of National University of Singapore

  

Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

2012-03-19 Thread Qi Huang

  On Mon, Mar 19, 2012 at 6:17 PM, Josh Berkus j...@agliodbs.com wrote:
  On 3/18/12 8:11 PM, HuangQi wrote:
  The implementation seems to be done quite fully. There is even a patch
  file. Why is the implementation not added into the release of Postgres? As
  so much has already been done, what could I do in this case for the GSoC?
 
  That would be good for you to research.  archives.postgresql.org will
  help you find the discussions around that.
 
 I actually tried to find out, personally...not sure if I was searching
 wrongly, but searching for TABLESAMPLE did not yield a cornucopia of
 useful conversations at the right time in history (~2007), even when
 the search is given a broad date-horizon (all), so I, too, am
 uninformed as to the specific objections.
 
 http://www.postgresql.org/search/?m=1&q=TABLESAMPLE&l=&d=-1&s=d
 


I sent a mail to Neil Conway asking him about this. Hope he can give a good 
answer. While waiting for the response, how about the skip scan? Daniel 
mentioned there are still some unknowns. I found this mail thread suggesting 
adding the skip scan to the TODO list:
http://archives.postgresql.org/pgsql-bugs/2010-03/msg00144.php
and this thread discussing it:
http://archives.postgresql.org/pgsql-hackers/2010-03/msg00328.php
Not sure whether this is feasible for GSoC.

Best Regards
Huang Qi Victor