Re: [HACKERS] [PATCH]Tablesample Submission
Dear hackers,

Sorry for not replying to the patch review. I didn't see it until recently: my mailbox is full of Postgres mail and, due to a configuration problem on my side, I didn't notice the message addressed to me. I am still busy with my university final-year project and will not have time to update the patch until this semester finishes in December; I will work on it then. Thanks.

Best Regards
Huang Qi Victor
Computer Science, National University of Singapore

From: huangq...@outlook.com
To: robertmh...@gmail.com
CC: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PATCH]Tablesample Submission
Date: Tue, 21 Aug 2012 23:08:41 +0800

> Please add your patch here:
> https://commitfest.postgresql.org/action/commitfest_view/open
> -- Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Hi, Robert. I added it under Miscellaneous.
https://commitfest.postgresql.org/action/patch_view?id=918

Best Regards
Huang Qi Victor
Computer Science, National University of Singapore
[HACKERS] Estimation of HashJoin Cost
Hi, dear hackers,

I need to estimate the hash-join cost in my research. As the textbook shows, it is 3(R+S), where R and S are the sizes (in pages) of the two tables; this realistically considers only the cost of I/O. But that is obviously too theoretical. What is the correct way to estimate the cost of a hash join? I looked at the code of final_cost_hashjoin(), and it is not clear what factors it considers. So, apart from the I/O to and from disk, what other factors affect the cost of a hash join? Also, is there any way to force Postgres to abide by the estimate of hash-join cost as 3(R+S), which is to say, to make the hash-join cost mainly I/O? Thanks.

Best Regards
Huang Qi Victor
Computer Science, National University of Singapore
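For reference, the 3(R+S) figure quoted above is the classic Grace hash-join I/O count: partition both inputs (read and write each once), then read every partition back in to probe. A minimal sketch of that arithmetic, purely illustrative and unrelated to what final_cost_hashjoin() actually computes:

```python
def textbook_hashjoin_io_cost(pages_r: int, pages_s: int) -> int:
    """Classic Grace hash-join I/O estimate, counting page reads/writes.

    Partition phase: read both relations once and write the partitions
    back out, costing 2 * (R + S) page I/Os.  Probe phase: read every
    partition back in once more, costing R + S.  Total: 3 * (R + S).
    """
    partition_phase = 2 * (pages_r + pages_s)  # read + write both inputs
    probe_phase = pages_r + pages_s            # re-read all partitions
    return partition_phase + probe_phase

print(textbook_hashjoin_io_cost(1000, 500))  # 4500
```

PostgreSQL's estimate additionally charges CPU costs (hash computation, tuple comparisons, bucket-skew effects), which is why it cannot be reduced to this pure I/O formula.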
Re: [HACKERS] [PATCH]Tablesample Submission
> Please add your patch here:
> https://commitfest.postgresql.org/action/commitfest_view/open
> -- Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Hi, Robert. I added it under Miscellaneous.
https://commitfest.postgresql.org/action/patch_view?id=918

Best Regards
Huang Qi Victor
Computer Science, National University of Singapore
[HACKERS] Git diff patch in context diff format
Hi, hackers,

I was exporting my project to a patch file. As the patch review requires, the patch needs to be in context diff format (http://wiki.postgresql.org/wiki/Reviewing_a_Patch), but git diff exports in a format similar to unified format. What is everyone doing for patches now? Is there a standard way? Thanks.

Best Regards
Huang Qi Victor
Computer Science, National University of Singapore
Re: [HACKERS] Git diff patch in context diff format
Date: Wed, 8 Aug 2012 15:05:06 -0400
From: and...@dunslane.net
To: br...@momjian.us
CC: huangq...@outlook.com; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Git diff patch in context diff format

> On 08/08/2012 01:29 PM, Bruce Momjian wrote:
> > On Thu, Aug 2, 2012 at 05:03:04PM +0800, Qi Huang wrote:
> > > Hi, hackers
> > > I was exporting my project to a patch file. As the patch review requires, the patch needs to be in context diff format (http://wiki.postgresql.org/wiki/Reviewing_a_Patch), but git diff exports in a format similar to unified format. What is everyone doing for patches now? Is there a standard way?
> > Have you read our wiki about git and diffs?
> > http://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git
> I must confess that, like Robert, I just use:
>     git diff | filterdiff --format=context
> I tried the git config stuff mentioned on the wiki, and it bit me a few times, I forget exactly how, and this just works without hassle.
> cheers
> andrew
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

I'm sorry, but this issue has actually already been resolved. I don't know why the system sent another mail. Thanks for your response.

Best Regards
Huang Qi Victor
Computer Science, National University of Singapore
Re: [HACKERS] Git diff patch in context diff format
Thanks for your suggestion, Robert. I just found this page (http://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git); it is quite useful and powerful.

Best Regards
Huang Qi Victor
Computer Science, National University of Singapore

Date: Fri, 3 Aug 2012 04:28:42 -0400
Subject: Re: [HACKERS] Git diff patch in context diff format
From: robertmh...@gmail.com
To: huangq...@outlook.com
CC: pgsql-hackers@postgresql.org

> On Fri, Aug 3, 2012 at 2:56 AM, Qi Huang <huangq...@outlook.com> wrote:
> > Hi, hackers
> > I was exporting my project to a patch file. As the patch review requires, the patch needs to be in context diff format (http://wiki.postgresql.org/wiki/Reviewing_a_Patch), but git diff exports in a format similar to unified format. What is everyone doing for patches now? Is there a standard way?
> When I want a context diff, I just do:
>     git diff | filterdiff --format=context
> -- Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
[HACKERS] [patch] pg_stat_lwlocks view - lwlocks statistics
Hi,

I was doing a patch review for the "pg_stat_lwlocks view - lwlocks statistics" patch:
https://commitfest.postgresql.org/action/patch_view?id=885
The mail thread for the patch is at:
http://archives.postgresql.org/pgsql-hackers/2012-06/msg01518.php

Following the steps on http://wiki.postgresql.org/wiki/Reviewing_a_Patch, the review is below.

1. Submission review. The patch is in standard git diff format. It applies cleanly to the git master branch at commit 8a504a363925fc5c7af48cd723da3f7e4d7ba9e2; I applied it with the command: patch -p1 < pg_stat_lwlocks_20120626.diff. There is no test file or doc patch, as far as I could find.

2. Usability review. I ran queries selecting from pg_stat_lwlocks and from the function pg_stat_get_lwlocks(), and they return the same results. After selecting from pg_stat_reset_lwlocks(), all the data appears to reset. So the stats relation and functions are working.

3. Code format. Looking at the patch file, the code format looks good, with some basic comments inside.

As for the code quality, performance review, and further debugging, I am not sure how I should do the testing. Any suggestions? Any comments? Thanks.

Best Regards
Huang Qi Victor
Computer Science, National University of Singapore

pg_stat_lwlocks_20120626.diff
Description: Binary data
[HACKERS] Optimizer Path Candidates difference in 9.1.3 and 9.2 beta1
Hi, hackers,

I modified the code in add_path() a bit so that existing path candidates in the pathlist are never removed and every new path is added, so all path candidates are kept in the pathlist. I then tested a four-relation query. In 9.1.3, I can see thousands of candidates in the final RelOptInfo, and some of them are even bushy trees. But in 9.2 beta1, which I forked from GitHub, there are no such bushy trees and only about 50 join paths in total, which would match the behavior of the System R algorithm. Was there any modification to the System R algorithm in the new release? And is something wrong with the algorithm in 9.1.3? Thanks.

Best Regards
Huang Qi Victor
Computer Science, National University of Singapore
Re: [HACKERS] Optimizer Path Candidates difference in 9.1.3 and 9.2 beta1
> [ shrug... ] When you're not showing us exactly what you did, it's hard to answer that for sure. But there is some prefiltering logic in joinpath.c that you might have to lobotomize too if you want to keep known-inferior join paths.
> 			regards, tom lane

Thanks, Tom. Below is what I did to the code in add_path(). I commented out the remove_old section and made sure accept_new always holds. Then in make_one_rel(), after the call to rel = make_rel_from_joinlist(root, joinlist); I added pprint(rel) on the next line. Checking the RelOptInfo printed in the logfile, I can see some bushy trees in 9.1.3.

```c
	/*
	 * Loop to check proposed new path against old paths.  Note it is possible
	 * for more than one old path to be tossed out because new_path dominates
	 * it.
	 *
	 * We can't use foreach here because the loop body may delete the current
	 * list cell.
	 */
	p1_prev = NULL;
	for (p1 = list_head(parent_rel->pathlist); p1 != NULL; p1 = p1_next)
	{
		...

		/*
		 * Remove current element from pathlist if dominated by new.
		 */
//		if (remove_old)
//		{
//			parent_rel->pathlist = list_delete_cell(parent_rel->pathlist,
//													p1, p1_prev);
//
//			/*
//			 * Delete the data pointed-to by the deleted cell, if possible
//			 */
//			if (!IsA(old_path, IndexPath))
//				pfree(old_path);
//			/* p1_prev does not advance */
//		}
//		else
//		{
			/* new belongs after this old path if it has cost >= old's */
			if (costcmp >= 0)
				insert_after = p1;
			/* p1_prev advances */
			p1_prev = p1;
//		}

		/*
		 * If we found an old path that dominates new_path, we can quit
		 * scanning the pathlist; we will not add new_path, and we assume
		 * new_path cannot dominate any other elements of the pathlist.
		 */
		if (!accept_new)
			break;
	}

//	if (accept_new)
//	{
		/* Accept the new path: insert it at proper place in pathlist */
		if (insert_after)
			lappend_cell(parent_rel->pathlist, insert_after, new_path);
		else
			parent_rel->pathlist = lcons(new_path, parent_rel->pathlist);
//	}
//	else
//	{
//		/* Reject and recycle the new path */
//		if (!IsA(new_path, IndexPath))
//			pfree(new_path);
//	}
```

Best Regards
Huang Qi Victor
Computer Science, National University of Singapore
Re: [HACKERS] Gsoc2012 idea, tablesample
Date: Fri, 11 May 2012 15:59:51 +0200
From: s...@keybit.net
To: robertmh...@gmail.com
CC: kevin.gritt...@wicourts.gov; a...@cybertec.at; j...@agliodbs.com; and...@anarazel.de; alvhe...@commandprompt.com; heikki.linnakan...@enterprisedb.com; cbbro...@gmail.com; neil.con...@gmail.com; dan...@heroku.com; huangq...@hotmail.com; f...@phlo.org; pgsql-hackers@postgresql.org; sfr...@snowman.net
Subject: Re: [HACKERS] Gsoc2012 idea, tablesample

> On Thu, May 10, 2012 at 02:30:35PM -0400, Robert Haas wrote:
> > On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
> > > Ants Aasma <a...@cybertec.at> wrote:
> > > > It seems to me that the simplest thing to do would be to lift the sampling done in analyze.c (acquire_sample_rows) and use that to implement the SYSTEM sampling method.
> > > Definitely. I thought we had all agreed on that ages ago.
> > Right, and I don't think we should be considering any of this other stuff until that basic thing is implemented and working.
> Agreed. That's what I'd love to see as well, for the GIS part.
> --strk;

Thanks, guys, for the lively discussion. I'll explore the ANALYZE command first and try to make SYSTEM sampling work. I'll look at the other parts later.

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore
Re: [HACKERS] Gsoc2012 idea, tablesample
Hi, all,

Thanks for your ideas on the implementation of TABLESAMPLE. Below is a summary of the high-level requirements from the -hackers thread so far. Please comment further, and if I missed any point, please feel free to add it.

1. Build a new type of node, as I should not use the SEQSCAN node, which scans the whole table. One of the aims of TABLESAMPLE is to avoid scanning every row and thus save time.

2. Use TIDSCAN to access tuples directly. The following approach using ctid, proposed by Kevin, looks good:

> One technique which might be suitably random without reading the whole table would be to figure out a maximum block number and tuple ID for the table, and generate a series of random ctid values to read. If the tuple doesn't exist or is not visible to the snapshot, you ignore it and continue, until you have read the requisite number of rows. You could try to generate them in advance and sort them by block number, but then you need to solve the problems of what to do if that set of ctids yields too many rows or too few rows, both of which have sticky issues.
> --Kevin

I think this technique could serve as an implementation algorithm for the BERNOULLI method. It looks like it could still cost much less than assigning a random number to every tuple and then retrieving.

3. Add specific sampling methods, like dollar unit sampling or geographic index support, which meet specific requirements but cost more.

> I have a gut feeling that Dollar Unit Sampling and other weighted samples can be done with a similar approach of uniformly sampling blocks and then running weighted reservoir sampling [1] over the result.
> [1]: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
> Cheers,
> --Ants Aasma

Ants Aasma proposed the above method, using weighted reservoir sampling, for these problems.

4. Implement the sampling the way ANALYZE does it.

> I'm looking for a way to fetch random samples these days so I confirm the need for a quick way to fetch the same sample that the analyze command fetches, but at SQL level.
> --strk;

As proposed by Sandro Santilli, we can look at the way ANALYZE does the sampling and try to implement it in TABLESAMPLE.

The above are the four points I summarized from the -hackers discussion. I put a brief introduction to this GSoC project on the Postgres wiki; you are welcome to check it and modify it to enhance it: http://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation
I will add the ideas formalized in this mailing thread to the wiki page once our discussion has completed. Thanks.

Best Regards
Huang Qi Victor
Computer Science, National University of Singapore
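Kevin's ctid technique in point 2 above can be sketched as follows. This is an illustrative Python sketch only (the helper name and parameters are hypothetical); real code would live in the executor and fetch each candidate with a TID scan:

```python
import random

def random_ctids(max_block, max_offset, n, seed=0):
    """Generate candidate (block, offset) pairs to probe, de-duplicated
    and sorted by block number so each heap page is visited at most once
    per batch."""
    rng = random.Random(seed)
    ctids = {(rng.randrange(max_block), rng.randrange(1, max_offset + 1))
             for _ in range(n)}
    return sorted(ctids)

# Each candidate would then be fetched via a TID scan; non-existent or
# invisible tuples are skipped, and more candidates are drawn until the
# requested sample size is reached (the "too few rows" case Kevin notes).
print(random_ctids(100, 50, 5, seed=42))
```

The sticky part Kevin mentions is exactly the loop around this generator: visibility checks can shrink the batch, so the caller must keep drawing fresh batches until enough rows are accumulated.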
Re: [HACKERS] Welcome 2012 GSOC students
Thanks for the hackers' support; the discussion on the mailing list is quite encouraging. I expect to enjoy the three months of fun with Postgres. I'm still in the final-exam period at my university and will participate more after the exams finish. Thanks!

Sent from my Windows Phone

From: Josh Berkus
Sent: 25/4/2012 1:06 AM
To: PostgreSQL-development
Subject: [HACKERS] Welcome 2012 GSOC students

> Hackers,
> We've chosen the 5 GSOC projects for this year:
> * JDBC Foreign Data Wrapper, by Atri, mentored by Merlin Moncure
> * Document Collection Foreign Data Wrapper, by Zheng Yang (a returning student), mentored by Satoshi Nagayasu
> * Implementing TABLESAMPLE, by Qi, mentored by Stephen Frost
> * Better Indexing for Ranges, by Alexander (returning), mentored by Heikki Linnakangas
> * xReader Streaming xLog Reader, by Aakash, mentored by Kevin Grittner
> Congratulations to the selected students, and expect to see them here on -hackers working on their projects.
> -- Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com
Re: [HACKERS] Gsoc2012 idea, tablesample
Hi, Heikki,

> 1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax.
> 2. It's not very useful if it's just a dummy replacement for "WHERE random() < ?". It has to be more advanced than that. Quality of the sample is important, as is performance. There was also an interesting idea of implementing monetary unit sampling.
> Another idea that Robert Haas suggested was to add support for doing a TID scan for a query like "WHERE ctid > '(501,1)'". That's not enough work for a GSoC project on its own, but could certainly be a part of it.

Based on the discussion these days and my understanding, I don't see much change to be made to my proposal. Of the points you raised, the first and the last are still not clear. Especially for the first point, I see that some people think making the SQL syntax part of the grammar is the first choice. For the second point, the SQL:2003 standard defines two sampling methods, SYSTEM and BERNOULLI, and I think there is room for quality refinement of both. For the optimizer statistics, I have an idea of using them to assign different sampling percentages to different pages, but I'm not sure about the details yet; I need to look into and learn the optimizer statistics (this part is mentioned by Neil in his idea, so I think there should be a way of using it). There might also be enhancements for specific sampling methods, like monetary unit sampling or geographic index sampling. I can do this part (sampling quality improvement) as a research-based project. We can still discuss further to see whether and how these can be done. My current amended proposal is below. (The changes were marked in red in the original mail.)

---
Project Details:

Neil Conway came up with an implementation in 2007 while giving a talk introducing hacking in PostgreSQL. The code was for demo purposes and was incomplete; it was not integrated as a PostgreSQL patch. PostgreSQL now is quite different from 2007. To implement this query, I need to understand Neil's implementation and use its general idea to implement it in the most up-to-date PostgreSQL release. In the end, I need to test the implementation until it can be released as a patch. I will also explore possible ways of further enhancing the sampling quality, like using optimizer statistics to produce a more accurate sample. There are also suggestions that I can integrate different sampling types into this query: for accounting data, I can use monetary unit sampling, or I can implement geographic index sampling. These specific types of sampling might require at least a sequential scan of the data, which slows down the sampling, but the sampling quality will be greatly enhanced and very useful in some specific fields.

List of features:
1. TABLESAMPLE using select, delete, and update
2. SYSTEM method
3. REPEATABLE support
4. BERNOULLI method
5. Use optimizer statistics to produce a more accurate sample
6. Non-integer sample percentage and repeat seed
7. Sampling quality enhancement

4, 5, and 6 are not included in Neil's implementation. For 5, we can use optimizer statistics to refine the algorithm for the random selection of pages or rows; the sample produced should be more accurate.

Inch-stones:
1. Implement the basic features: the TABLESAMPLE clause usable from select, with the SYSTEM method, in different combinations of SQL queries.
2. Implement the other basic features, REPEATABLE and BERNOULLI.
3. Implement the improvements: support for using optimizer statistics to produce a more accurate sample, non-integer sample percentage and repeat seed, and sampling quality improvement.

Project Schedule:
1. April 23rd - May 10th: learning and understanding.
2. Mid May - mid June: implement a simple TABLESAMPLE clause, with the SYSTEM method and no REPEATABLE support, and do testing.
3. Mid June - mid July: implement the other features, like the REPEATABLE clause and the BERNOULLI method, and do testing. Improvements 5 and 6 are also implemented in this period.
4. Mid July - mid August: exploring ways of improving sampling quality should be done during periods 2 and 3; this period will be used to implement those ideas.
---

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore
Re: [HACKERS] Gsoc2012 idea, tablesample
Hi, Heikki,

Thanks for your advice; I will change my plan accordingly. But I have a few questions.

> 1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax.

1. "This should be written as an extension, using custom functions as the API." Could you explain a bit more what this means?

> 2. It's not very useful if it's just a dummy replacement for "WHERE random() < ?". It has to be more advanced than that. Quality of the sample is important, as is performance. There was also an interesting idea of implementing monetary unit sampling.

2. In the plan, I mentioned using optimizer statistics to improve the quality of sampling; I may emphasize that point. I will read about monetary unit sampling and add the possibility of implementing this idea to the plan.

> Another idea that Robert Haas suggested was to add support for doing a TID scan for a query like "WHERE ctid > '(501,1)'". That's not enough work for a GSoC project on its own, but could certainly be a part of it.

3. I read the replies on using ctid, but I don't quite understand how that might help. ctid is just the physical location of a row version within the table. If I do WHERE ctid > '(501,1)', what actually happens? Can I add this in as an optional implementation? I think I can check how to do this if I have enough time in this project.

Best Regards and Thanks
Huang Qi Victor
Computer Science Department, National University of Singapore

Date: Tue, 17 Apr 2012 09:16:29 +0300
From: heikki.linnakan...@enterprisedb.com
To: j...@agliodbs.com
CC: huangq...@hotmail.com; pgsql-hackers@postgresql.org; and...@anarazel.de; alvhe...@commandprompt.com; neil.con...@gmail.com; dan...@heroku.com; cbbro...@gmail.com; kevin.gritt...@wicourts.gov
Subject: [HACKERS] Gsoc2012 idea, tablesample

> On 24.03.2012 22:12, Joshua Berkus wrote:
> > Qi,
> > Yeah, I can see that. That's a sign that you had a good idea for a project, actually: your idea is interesting enough that people want to debate it. Make a proposal on Monday and our potential mentors will help you refine the idea.
> Yep. The discussion withered, so let me try to summarize:
> 1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax.
> 2. It's not very useful if it's just a dummy replacement for "WHERE random() < ?". It has to be more advanced than that. Quality of the sample is important, as is performance. There was also an interesting idea of implementing monetary unit sampling.
> I think this would be a useful project if those two points are taken care of. Another idea that Robert Haas suggested was to add support for doing a TID scan for a query like "WHERE ctid > '(501,1)'". That's not enough work for a GSoC project on its own, but could certainly be a part of it.
> -- Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Gsoc2012 idea, tablesample
Besides, I saw that editing on the GSoC site has been closed. Should I just submit through this mailing list with an attachment?

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore

Date: Tue, 17 Apr 2012 09:16:29 +0300
From: heikki.linnakan...@enterprisedb.com
To: j...@agliodbs.com
CC: huangq...@hotmail.com; pgsql-hackers@postgresql.org; and...@anarazel.de; alvhe...@commandprompt.com; neil.con...@gmail.com; dan...@heroku.com; cbbro...@gmail.com; kevin.gritt...@wicourts.gov
Subject: [HACKERS] Gsoc2012 idea, tablesample

> On 24.03.2012 22:12, Joshua Berkus wrote:
> > Qi,
> > Yeah, I can see that. That's a sign that you had a good idea for a project, actually: your idea is interesting enough that people want to debate it. Make a proposal on Monday and our potential mentors will help you refine the idea.
> Yep. The discussion withered, so let me try to summarize:
> 1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of extra SQL syntax.
> 2. It's not very useful if it's just a dummy replacement for "WHERE random() < ?". It has to be more advanced than that. Quality of the sample is important, as is performance. There was also an interesting idea of implementing monetary unit sampling.
> I think this would be a useful project if those two points are taken care of. Another idea that Robert Haas suggested was to add support for doing a TID scan for a query like "WHERE ctid > '(501,1)'". That's not enough work for a GSoC project on its own, but could certainly be a part of it.
> -- Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Gsoc2012 idea, tablesample
> > 2. It's not very useful if it's just a dummy replacement for "WHERE random() < ?". It has to be more advanced than that. Quality of the sample is important, as is performance. There was also an interesting idea of implementing monetary unit sampling.
> In reviewing this, I got the impression (perhaps mistaken..) that the different sampling methods are defined by the SQL standard and that it would simply be up to us to implement them according to what the standard requires.
> > I think this would be a useful project if those two points are taken care of.
> Doing it 'right' certainly isn't going to be simply taking what Neil did and updating it, and I understand Tom's concerns about having this be more than a hack on seqscan, so I'm a bit nervous that this would turn into something bigger than a GSoC project.

As Christopher Browne mentioned, this sampling method is not possible without scanning the whole data set. It improves the sampling quality but increases the sampling cost. I think it should also be used only for some special sampling types, not in general. The general sampling methods, as in the SQL standard, should be only the SYSTEM and BERNOULLI methods.

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore
Re: [HACKERS] Gsoc2012 idea, tablesample
Date: Wed, 18 Apr 2012 02:45:09 +0300
Subject: Re: [HACKERS] Gsoc2012 idea, tablesample
From: a...@cybertec.at
To: cbbro...@gmail.com
CC: sfr...@snowman.net; pgsql-hackers@postgresql.org

> On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne <cbbro...@gmail.com> wrote:
> > Well, there may be cases where the quality of the sample isn't terribly important, it just needs to be reasonable. I browsed an article on the SYSTEM/BERNOULLI representations; they both amount to simple picks of tuples.
> > - BERNOULLI implies picking tuples with a specified probability.
> > - SYSTEM implies picking pages with a specified probability. (I think we mess with this in ways that'll be fairly biased in view that tuples mayn't be of uniform size, particularly if Slightly Smaller strings stay in the main pages, whilst Slightly Larger strings get TOASTed...)

Looking at the definition of the BERNOULLI method, which scans all the tuples, I keep having one question: what is the difference between using the BERNOULLI method and using "SELECT * ... WHERE random() < 0.1"? Both go through all the tuples and cost a sequential scan.

If the answer to the above question is "no difference", I have a proposal for another BERNOULLI implementation. For a relation, assign all tuples a unique and continuous ID (we may use ctid or something else). Then for each number in the set of IDs, draw a random number and check whether it is smaller than the sampling percentage; if it is, retrieve the tuple corresponding to that ID. This method does not seq-scan all the tuples, yet it still samples by picking individual tuples. Thanks.

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore
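To make the question concrete, here is an illustrative sketch (not PostgreSQL code; names are hypothetical) of the two strategies being compared. With the same pseudo-random sequence they select exactly the same rows; the real difference is only which tuples must be physically visited:

```python
import random

def bernoulli_filter(rows, pct, seed=0):
    """Per-row coin flip over a full scan: behaviourally what a
    'WHERE random() < pct/100' sequential scan does."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < pct / 100.0]

def sample_by_ids(rows, pct, seed=0):
    """The ID-based proposal: flip the coin per row ID first, then fetch
    only the chosen IDs, so unchosen tuples are never read."""
    rng = random.Random(seed)
    chosen = [i for i in range(len(rows)) if rng.random() < pct / 100.0]
    return [rows[i] for i in chosen]
```

In a real table, of course, "fetch only the chosen IDs" still touches every heap page that holds a chosen tuple, so the I/O saving over a seqscan only materializes at low sampling percentages.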
Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
Date: Thu, 22 Mar 2012 13:17:01 -0400
Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
From: cbbro...@gmail.com
To: kevin.gritt...@wicourts.gov
CC: pgsql-hackers@postgresql.org

> On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
> > Tom Lane <t...@sss.pgh.pa.us> wrote:
> > > Robert Haas <robertmh...@gmail.com> writes:
> > > > Well, the standard syntax apparently aims to reduce the number of returned rows, which ORDER BY does not. Maybe you could do it with ORDER BY .. LIMIT, but the idea here I think is that we'd like to sample the table without reading all of it first, so that seems to miss the point.
> > > I think actually the traditional locution is more like "WHERE random() < constant", where the constant is the fraction of the table you want. And yeah, the presumption is that you'd like it to not actually read every row. (Though unless the sampling density is quite a bit less than 1 row per page, it's not clear how much you're really going to win.)
> > It's all going to depend on the use cases, which I don't think I've heard described very well yet. I've had to pick random rows from, for example, a table of disbursements to support a financial audit. In those cases it has been the sample size that mattered, and order didn't. One interesting twist there is that for some of these financial audits they wanted the probability of a row being selected to be proportional to the dollar amount of the disbursement. I don't think you can do this without a first pass across the whole data set.
> This one was commonly called "Dollar Unit Sampling," though the terminology has gradually gotten internationalized.
> http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
> What the article doesn't mention is that some particularly large items might wind up covering multiple samples. In the example, they're looking for a sample every $3125 down the list. If there was a single transaction valued at $31,250, that (roughly) covers 10 of the desired samples.
> It isn't possible to do this without scanning across the entire table. If you want repeatability, you probably want to instantiate a copy of enough information to indicate the ordering chosen. That's probably something that needs to be captured as part of the work of the audit, so not only does it need to involve a pass across the data, it probably requires capturing a fair bit of data for posterity.
> -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"

The discussion till now has gone far beyond my understanding. Could anyone explain briefly what the idea is now? The design details are still unfamiliar to me; I can only take the time to understand them, if possible, after being selected and putting time into reading the relevant material. For now, I'm still curious why Neil's implementation no longer works. Postgres has been patched a lot, but the general idea behind Neil's implementation should still work, shouldn't it? Besides, whether this query is needed is still not decided, which seems to be another hard-to-decide point. Is it that this topic is not yet well prepared for GSoC? If so, I think I still have time to switch to other topics. Any suggestions? Thanks.

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore
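Browne's fixed-interval example can be sketched as follows (a hypothetical helper, not from any patch). Walking the cumulative dollar total and picking the transaction that contains every interval-th dollar makes selection probability proportional to amount, and shows how one large item covers several sampling points:

```python
def dollar_unit_sample(amounts, interval, start=0.0):
    """Monetary-unit (Dollar Unit) sampling: pick the transaction that
    contains each successive `interval`-th dollar of the running total.
    A single large item can span several sampling points, so its index
    may appear repeatedly.  Requires one full pass over the data."""
    picks = []
    point = start + interval  # next dollar position to sample
    total = 0.0
    for i, amt in enumerate(amounts):
        total += amt
        while point <= total:  # this item covers the sampling point
            picks.append(i)
            point += interval
    return picks

# One $31,250 transaction covers ten $3,125 sampling points:
print(dollar_unit_sample([100.0, 31250.0, 500.0], 3125.0))  # [1]*10
```

This also makes the repeatability remark concrete: re-running the audit requires fixing both the item ordering and the starting offset.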
Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
Date: Wed, 21 Mar 2012 11:00:59 -0400
From: and...@dunslane.net
To: alvhe...@commandprompt.com
CC: t...@sss.pgh.pa.us; robertmh...@gmail.com; huangq...@hotmail.com; neil.con...@gmail.com; dan...@heroku.com; j...@agliodbs.com; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

> On 03/21/2012 10:47 AM, Alvaro Herrera wrote:
> > Excerpts from Tom Lane's message of mié mar 21 11:35:54 -0300 2012:
> > > Now that would all be fine if this were a widely-desired feature, but AFAIR the user demand for it has been about nil. So I'm leaning to the position that we don't want it.
> > I disagree with there being zero interest ... the "order by random()" stuff does come up occasionally.
> Presumably the reason that's not good enough is that it scans the whole table (as well as being non-portable)? Maybe we could find some less invasive way of avoiding that.
> cheers
> andrew

Thanks for your discussion and ideas. As I checked, MS SQL Server and DB2 have implemented TABLESAMPLE by now. At the least, it is useful for quick sample retrieval from a large dataset, and I suppose this clause will be much faster than using random(). As for the implementation, will the code change really be very large? The general structure should still be about the same, right?

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore
Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
Date: Tue, 20 Mar 2012 14:12:45 -0700 Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema From: neil.con...@gmail.com To: huangq...@hotmail.com CC: dan...@heroku.com; j...@agliodbs.com; pgsql-hackers@postgresql.org

2012/3/19 Qi Huang huangq...@hotmail.com: I actually tried to find out, personally... not sure if I was searching wrongly, but searching for TABLESAMPLE did not yield a cornucopia of useful conversations at the right time in history (~2007), even when the search is given a broad date-horizon (all), so I, too, am uninformed as to the specific objections. http://www.postgresql.org/search/?m=1q=TABLESAMPLEl=d=-1s=d I sent a mail to Neil Conway asking him about this. Hope he can give a good answer.

I never tried to get TABLESAMPLE support into the main PostgreSQL tree -- I just developed the original code as an exercise for the purposes of the talk. Implementing TABLESAMPLE would probably be a reasonable GSoC project. My memory of the details is fuzzy, but one thing to check is whether the approach taken by my patch (randomly choose heap pages and then return all the live tuples in a chosen page) actually meets the standard's requirements -- obviously it is not true that each heap page has the same number of live tuples, so you aren't getting a truly random sample. Neil

Thanks so much, Neil. I think I understand the situation now. The implementation Neil posted was for the purposes of his talk, and so was rushed and may not be up to the standards of the Postgres community. Neil also mentioned that the PRNG state handling in the patch is buggy, and there may be other issues. Thus, for the GSoC project, I could work through the details of Neil's implementation, fix the bugs, bring the code up to community standards, and test it. Are there any comments on this?

Best Regards and Thanks
Huang Qi Victor
Computer Science of National University of Singapore
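Neil's caveat (choosing heap pages uniformly and returning all live tuples on each chosen page does not give a uniform sample of rows) can be illustrated with a small sketch. The page layout and tuple counts here are invented for illustration; the point is that a tuple on a sparse page is far more likely to be returned than a tuple on a dense one:

```python
import random

def page_sample(pages, n_pages, seed=None):
    """Mimic the patch's approach: pick heap pages at random and
    return every live tuple on each chosen page."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(pages)), n_pages)
    tuples = []
    for p in chosen:
        tuples.extend(pages[p])
    return tuples

# Pages with very different live-tuple counts. Each page has a 2/3
# chance of being chosen, so the lone tuple on the middle page has a
# 2/3 chance of appearing in the result, while any given tuple on
# the full page also has 2/3 -- but the *expected share* of the
# sample each page contributes is wildly unequal, so the result is
# not a uniform row sample.
pages = [list(range(0, 100)),      # 100 live tuples
         [100],                    # 1 live tuple
         list(range(101, 151))]    # 50 live tuples
sample = page_sample(pages, 2, seed=1)
```

This is why checking the approach against the standard's requirements, as Neil suggests, matters: SYSTEM-style page sampling is cheap but biased relative to true per-row (Bernoulli) sampling.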
Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
On Mon, Mar 19, 2012 at 6:17 PM, Josh Berkus j...@agliodbs.com wrote: On 3/18/12 8:11 PM, HuangQi wrote: The implementation seems to have been done quite fully. There is even a patch file. Why was the implementation not added to a Postgres release? Since so much has already been done, what could I do in this case for GSoC? That would be good for you to research. archives.postgresql.org will help you find the discussions around that.

I actually tried to find out, personally... not sure if I was searching wrongly, but searching for TABLESAMPLE did not yield a cornucopia of useful conversations at the right time in history (~2007), even when the search is given a broad date-horizon (all), so I, too, am uninformed as to the specific objections. http://www.postgresql.org/search/?m=1q=TABLESAMPLEl=d=-1s=d

I sent a mail to Neil Conway asking him about this. Hope he can give a good answer. While waiting for the response, how about the skip scan? Daniel mentioned there are still some unknowns. I found this mail thread suggesting adding skip scan to the TODO list: http://archives.postgresql.org/pgsql-bugs/2010-03/msg00144.php and also this thread discussing it: http://archives.postgresql.org/pgsql-hackers/2010-03/msg00328.php I'm not sure whether this is feasible for GSoC.

Best Regards
Huang Qi Victor
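For reference, the "skip scan" in those threads is an index technique for finding distinct leading values without reading every index entry: after emitting one value, the scan re-descends the btree to the first entry greater than it. A rough sketch over a sorted list standing in for the index, with `bisect` playing the role of the btree re-descent (names and structure are my own, not from the patches in those threads):

```python
import bisect

def skip_scan_distinct(sorted_keys):
    """Return the distinct values of a sorted sequence by repeatedly
    probing for the first key greater than the last value seen,
    instead of scanning every entry."""
    distinct = []
    i = 0
    while i < len(sorted_keys):
        value = sorted_keys[i]
        distinct.append(value)
        # Jump past all duplicates of `value` in one probe, the way
        # a skip scan re-descends the btree past a duplicate run.
        i = bisect.bisect_right(sorted_keys, value, i)
    return distinct

keys = [1, 1, 1, 2, 2, 5, 5, 5, 5, 9]
# skip_scan_distinct(keys) → [1, 2, 5, 9]
```

With long duplicate runs this does one probe per distinct value rather than one read per entry, which is the payoff the TODO-list threads are after.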