Re: [HACKERS] WIP: multivariate statistics / proof of concept
On Tue, Nov 22, 2016 at 2:42 PM, Tomas Vondra wrote:
> On 11/21/2016 11:10 PM, Robert Haas wrote:
>> [ reviving an old multivariate statistics thread ]
>>
>> On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs wrote:
>>> On 12 October 2014 23:00, Tomas Vondra wrote:
>>>
>>>> It however seems to be working sufficiently well at this point,
>>>> enough to get some useful feedback. So here we go.
>>>
>>> This looks interesting and useful.
>>>
>>> What I'd like to check before a detailed review is that this has
>>> sufficient applicability to be useful.
>>>
>>> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
>>> result of multi-column stats errors.
>>>
>>> Could you look at those queries and confirm that this patch can
>>> produce better plans for them?
>>
>> Tomas, did you ever do any testing in this area? One of my colleagues,
>> Rafia Sabih, recently did some testing of TPC-H queries @ 20 GB. Q18
>> actually doesn't complete at all right now because of an issue with
>> the new simplehash implementation. I reported it to Andres and he
>> tracked it down, but hasn't posted the patch yet - see
>> http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxw...@alap3.anarazel.de
>>
>> Of the remaining queries, the slowest are Q9 and Q20, and both of them
>> have serious estimation errors. On Q9, things go wrong here:
>>
>>   ->  Merge Join  (cost=5225092.04..6595105.57 rows=154 width=47)
>>       (actual time=103592.821..149335.010 rows=6503988 loops=1)
>>         Merge Cond: (partsupp.ps_partkey = lineitem.l_partkey)
>>         Join Filter: (lineitem.l_suppkey = partsupp.ps_suppkey)
>>         Rows Removed by Join Filter: 19511964
>>         ->  Index Scan using
>
> [snip]
>
>>         Rows Removed by Filter: 756627
>>
>> The estimate for the index scan on partsupp is essentially perfect,
>> and the lineitem-part join is off by about 3x. However, the merge
>> join is off by about 4000x, which is real bad.
>
> The patch only deals with statistics on base relations, no joins, at
> this point. It's meant to be extended in that direction, so the syntax
> supports it, but at this point that's all. No joins.
>
> That being said, this estimate should be improved in 9.6, when you
> create a foreign key between the tables. In fact, that patch was
> exactly about Q9.
>
> This is how the join estimate looks on scale 1 without the FK between
> the two tables:
>
>                              QUERY PLAN
> ---
>  Merge Join  (cost=19.19..700980.12 rows=2404 width=261)
>    Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
>                 (lineitem.l_suppkey = partsupp.ps_suppkey))
>    ->  Index Scan using idx_lineitem_part_supp on lineitem
>          (cost=0.43..605856.84 rows=6001117 width=117)
>    ->  Index Scan using partsupp_pkey on partsupp
>          (cost=0.42..61141.76 rows=80 width=144)
> (4 rows)
>
> and with the foreign key:
>
>                              QUERY PLAN
> ---
>  Merge Join  (cost=19.19..700980.12 rows=6001117 width=261)
>              (actual rows=6001215 loops=1)
>    Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
>                 (lineitem.l_suppkey = partsupp.ps_suppkey))
>    ->  Index Scan using idx_lineitem_part_supp on lineitem
>          (cost=0.43..605856.84 rows=6001117 width=117)
>          (actual rows=6001215 loops=1)
>    ->  Index Scan using partsupp_pkey on partsupp
>          (cost=0.42..61141.76 rows=80 width=144)
>          (actual rows=6001672 loops=1)
>  Planning time: 3.840 ms
>  Execution time: 21987.913 ms
> (6 rows)
>
>> On Q20, things go wrong here:
>>
>> [snip]
>>
>> The estimate for the GroupAggregate feeding one side of the merge join
>> is quite accurate. The estimate for the part-partsupp join on the
>> other side is off by 8x. Then things get much worse: the estimate for
>> the merge join is off by 400x.
>
> Well, most of the estimation error comes from the join, but sadly the
> aggregate makes using the foreign keys impossible - at least in the
> current version. I don't know if it can be improved, somehow.
>
>> I'm not really sure whether the multivariate statistics stuff will fix
>> this kind of case or not, but if it did it would be awesome.
>
> Join statistics are something I'd like to add eventually, but I don't
> see how it could happen in the first version. Also, the patch received
> no reviews this CF, and making it even larger is unlikely to make it
> more attractive.

Moved to next CF with "needs review" status.

Regards,
Hari Babu
Fujitsu Australia
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On 11/21/2016 11:10 PM, Robert Haas wrote:
> [ reviving an old multivariate statistics thread ]
>
> On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs wrote:
>> On 12 October 2014 23:00, Tomas Vondra wrote:
>>
>>> It however seems to be working sufficiently well at this point,
>>> enough to get some useful feedback. So here we go.
>>
>> This looks interesting and useful.
>>
>> What I'd like to check before a detailed review is that this has
>> sufficient applicability to be useful.
>>
>> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
>> result of multi-column stats errors.
>>
>> Could you look at those queries and confirm that this patch can
>> produce better plans for them?
>
> Tomas, did you ever do any testing in this area? One of my colleagues,
> Rafia Sabih, recently did some testing of TPC-H queries @ 20 GB. Q18
> actually doesn't complete at all right now because of an issue with the
> new simplehash implementation. I reported it to Andres and he tracked
> it down, but hasn't posted the patch yet - see
> http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxw...@alap3.anarazel.de
>
> Of the remaining queries, the slowest are Q9 and Q20, and both of them
> have serious estimation errors. On Q9, things go wrong here:
>
>   ->  Merge Join  (cost=5225092.04..6595105.57 rows=154 width=47)
>       (actual time=103592.821..149335.010 rows=6503988 loops=1)
>         Merge Cond: (partsupp.ps_partkey = lineitem.l_partkey)
>         Join Filter: (lineitem.l_suppkey = partsupp.ps_suppkey)
>         Rows Removed by Join Filter: 19511964
>         ->  Index Scan using
>
> [snip]
>
>         Rows Removed by Filter: 756627
>
> The estimate for the index scan on partsupp is essentially perfect, and
> the lineitem-part join is off by about 3x. However, the merge join is
> off by about 4000x, which is real bad.

The patch only deals with statistics on base relations, no joins, at this
point. It's meant to be extended in that direction, so the syntax supports
it, but at this point that's all. No joins.

That being said, this estimate should be improved in 9.6, when you create
a foreign key between the tables. In fact, that patch was exactly about Q9.

This is how the join estimate looks on scale 1 without the FK between the
two tables:

                             QUERY PLAN
---
 Merge Join  (cost=19.19..700980.12 rows=2404 width=261)
   Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
                (lineitem.l_suppkey = partsupp.ps_suppkey))
   ->  Index Scan using idx_lineitem_part_supp on lineitem
         (cost=0.43..605856.84 rows=6001117 width=117)
   ->  Index Scan using partsupp_pkey on partsupp
         (cost=0.42..61141.76 rows=80 width=144)
(4 rows)

and with the foreign key:

                             QUERY PLAN
---
 Merge Join  (cost=19.19..700980.12 rows=6001117 width=261)
             (actual rows=6001215 loops=1)
   Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
                (lineitem.l_suppkey = partsupp.ps_suppkey))
   ->  Index Scan using idx_lineitem_part_supp on lineitem
         (cost=0.43..605856.84 rows=6001117 width=117)
         (actual rows=6001215 loops=1)
   ->  Index Scan using partsupp_pkey on partsupp
         (cost=0.42..61141.76 rows=80 width=144)
         (actual rows=6001672 loops=1)
 Planning time: 3.840 ms
 Execution time: 21987.913 ms
(6 rows)

> On Q20, things go wrong here:
>
> [snip]
>
> The estimate for the GroupAggregate feeding one side of the merge join
> is quite accurate. The estimate for the part-partsupp join on the other
> side is off by 8x. Then things get much worse: the estimate for the
> merge join is off by 400x.

Well, most of the estimation error comes from the join, but sadly the
aggregate makes using the foreign keys impossible - at least in the
current version. I don't know if it can be improved, somehow.

> I'm not really sure whether the multivariate statistics stuff will fix
> this kind of case or not, but if it did it would be awesome.

Join statistics are something I'd like to add eventually, but I don't see
how it could happen in the first version. Also, the patch received no
reviews this CF, and making it even larger is unlikely to make it more
attractive.
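[Editor's note: the rows=2404 underestimate in the no-FK plan is what you get when the planner multiplies the per-column join selectivities as if the two join conditions were independent. This is a rough back-of-the-envelope sketch, not the planner's actual eqjoinsel logic; the distinct counts are the TPC-H scale-1 values.]

```python
# Illustrative only: approximate the independent-clause estimate for
# lineitem JOIN partsupp ON (partkey, suppkey) at TPC-H scale 1.
lineitem_rows = 6_001_117      # lineitem row estimate from the plan
partsupp_rows = 800_000        # partsupp size at scale 1
nd_partkey = 200_000           # distinct part keys at scale 1
nd_suppkey = 10_000            # distinct supplier keys at scale 1

# Each equality clause contributes roughly 1/ndistinct; multiplying the
# two assumes the clauses are independent, which they are not here
# (every (partkey, suppkey) pair in lineitem exists in partsupp).
independent_est = lineitem_rows * partsupp_rows / (nd_partkey * nd_suppkey)
print(round(independent_est))  # ~2400, close to the plan's rows=2404

# With the FK declared, every lineitem row matches exactly one partsupp
# row, so the join cannot produce fewer rows than lineitem itself.
fk_est = lineitem_rows
```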
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: multivariate statistics / proof of concept
[ reviving an old multivariate statistics thread ]

On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs wrote:
> On 12 October 2014 23:00, Tomas Vondra wrote:
>
>> It however seems to be working sufficiently well at this point, enough
>> to get some useful feedback. So here we go.
>
> This looks interesting and useful.
>
> What I'd like to check before a detailed review is that this has
> sufficient applicability to be useful.
>
> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
> result of multi-column stats errors.
>
> Could you look at those queries and confirm that this patch can
> produce better plans for them?

Tomas, did you ever do any testing in this area? One of my colleagues,
Rafia Sabih, recently did some testing of TPC-H queries @ 20 GB. Q18
actually doesn't complete at all right now because of an issue with the
new simplehash implementation. I reported it to Andres and he tracked it
down, but hasn't posted the patch yet - see
http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxw...@alap3.anarazel.de

Of the remaining queries, the slowest are Q9 and Q20, and both of them
have serious estimation errors.

On Q9, things go wrong here:

  ->  Merge Join  (cost=5225092.04..6595105.57 rows=154 width=47)
      (actual time=103592.821..149335.010 rows=6503988 loops=1)
        Merge Cond: (partsupp.ps_partkey = lineitem.l_partkey)
        Join Filter: (lineitem.l_suppkey = partsupp.ps_suppkey)
        Rows Removed by Join Filter: 19511964
        ->  Index Scan using idx_partsupp_partkey on partsupp
            (cost=0.43..781956.32 rows=15999792 width=22)
            (actual time=0.044..11825.481 rows=15999881 loops=1)
        ->  Sort  (cost=5224967.03..5245348.02 rows=8152396 width=45)
            (actual time=103592.505..112205.444 rows=26015949 loops=1)
              Sort Key: part.p_partkey
              Sort Method: quicksort  Memory: 704733kB
              ->  Hash Join  (cost=127278.36..4289121.18 rows=8152396 width=45)
                  (actual time=1084.370..94732.951 rows=6503988 loops=1)
                    Hash Cond: (lineitem.l_partkey = part.p_partkey)
                    ->  Seq Scan on lineitem
                        (cost=0.00..3630339.08 rows=119994608 width=41)
                        (actual time=0.015..33355.637 rows=119994608 loops=1)
                    ->  Hash  (cost=123743.07..123743.07 rows=282823 width=4)
                        (actual time=1083.686..1083.686 rows=216867 loops=1)
                          Buckets: 524288  Batches: 1  Memory Usage: 11721kB
                          ->  Gather  (cost=1000.00..123743.07 rows=282823 width=4)
                              (actual time=0.418..926.283 rows=216867 loops=1)
                                Workers Planned: 4
                                Workers Launched: 4
                                ->  Parallel Seq Scan on part
                                    (cost=0.00..94460.77 rows=70706 width=4)
                                    (actual time=0.063..962.909 rows=43373 loops=5)
                                      Filter: ((p_name)::text ~~ '%grey%'::text)
                                      Rows Removed by Filter: 756627

The estimate for the index scan on partsupp is essentially perfect, and
the lineitem-part join is off by about 3x. However, the merge join is
off by about 4000x, which is real bad.

On Q20, things go wrong here:

  ->  Merge Join  (cost=5928271.92..6411281.44 rows=278 width=16)
      (actual time=77887.963..136614.284 rows=118124 loops=1)
        Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
                     (lineitem.l_suppkey = partsupp.ps_suppkey))
        Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity
        Rows Removed by Join Filter: 242
        ->  GroupAggregate  (cost=5363980.40..5691151.45 rows=9681876 width=48)
            (actual time=76672.726..131482.677 rows=10890067 loops=1)
              Group Key: lineitem.l_partkey, lineitem.l_suppkey
              ->  Sort  (cost=5363980.40..5409466.13 rows=18194291 width=21)
                  (actual time=76672.661..86405.882 rows=18194084 loops=1)
                    Sort Key: lineitem.l_partkey, lineitem.l_suppkey
                    Sort Method: external merge  Disk: 551376kB
                    ->  Bitmap Heap Scan on lineitem
                        (cost=466716.05..3170023.42 rows=18194291 width=21)
                        (actual time=13735.552..39289.995 rows=18195269 loops=1)
                          Recheck Cond: ((l_shipdate >= '1994-01-01'::date) AND
                              (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On 12/12/14 05:53, Heikki Linnakangas wrote:
> On 10/13/2014 01:00 AM, Tomas Vondra wrote:
>> Hi,
>>
>> attached is a WIP patch implementing multivariate statistics.
>
> Great! Really glad to see you working on this.
>
>> + * FIXME This sample sizing is mostly OK when computing stats for
>> + * individual columns, but when computing multi-variate stats
>> + * for multivariate stats (histograms, mcv, ...) it's rather
>> + * insufficient. For small number of dimensions it works, but
>> + * for complex stats it'd be nice use sample proportional to
>> + * the table (say, 0.5% - 1%) instead of a fixed size.
>
> I don't think a fraction of the table is appropriate. As long as the
> sample is random, the accuracy of a sample doesn't depend much on the
> size of the population. For example, if you sample 1,000 rows from a
> table with 100,000 rows, or 1,000 rows from a table with 100,000,000
> rows, the accuracy is pretty much the same. That doesn't change when
> you go from a single variable to multiple variables.
>
> You do need a bigger sample with multiple variables, however. My gut
> feeling is that if you sample N rows for a single variable, with two
> variables you need to sample N^2 rows to get the same accuracy. But
> it's not proportional to the table size. (I have no proof for that,
> but I'm sure there is literature on this.)

[...]

I did stage III statistics at university many moons ago... The accuracy
of the sample depends only on the value of N, not on the total size of
the population, with the obvious constraint that N <= population size.

The standard error of a random sample is inversely proportional to the
square root of N. So using N = 100 would give a standard error of about
10%, and to reduce it to 5% you would need N = 400. For multiple
variables, it will also be a function of N - I don't recall precisely
how, but I suspect it might be M * N, where M is the number of
parameters (though I'm not certain).

I think M^N might be needed if you want all the possible correlations
between sets of variables to be reasonably significant - but I'm mostly
just guessing here. So using a percentage of the table size is somewhat
silly, looking at the above.

However, if you want to detect frequencies that occur at the 1% level,
then you will need to sample 1% of the table or greater. So which
approach is 'best' depends on what you are trying to determine. The
sample size is more useful when you need to decide between two
different hypotheses.

The sampling methodology is far more important than the ratio of N to
population size - consider the bias imposed by using random telephone
numbers, even before the advent of mobile phones!

Cheers,
Gavin
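[Editor's note: the population-size independence argued above is easy to check numerically. A small sketch using the textbook standard error of a sampled proportion with the finite-population correction; the function name and the 0.5 proportion are illustrative choices, not from the thread.]

```python
import math

def stderr_proportion(p, n, population):
    """Standard error of a proportion estimated from a simple random
    sample of n rows, with the finite-population correction applied."""
    fpc = math.sqrt((population - n) / (population - 1))
    return math.sqrt(p * (1 - p) / n) * fpc

# Same sample size (1,000 rows), very different table sizes:
small_table = stderr_proportion(0.5, 1000, 100_000)
big_table = stderr_proportion(0.5, 1000, 100_000_000)

# Both come out near 0.0157-0.0158: the accuracy is driven by n,
# not by the table size, matching the argument above.
print(small_table, big_table)
```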
Re: [HACKERS] WIP: multivariate statistics / proof of concept
Hi,

On 04/28/15 19:36, Jeff Janes wrote:
> ...
>
> Thanks. I think I tried that, but was still having trouble. But it
> turns out that the trouble was for an unrelated reason, and I got it
> to compile now.

Yeah, a new column was added to pg_proc the day after I submitted the
patch. Will address that in a new version, hopefully in a few days.

> Some of the fdw's need a patch as well in order to compile, see
> attached.

Thanks, I forgot to tweak the clauselist_selectivity() calls in contrib :-(

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On Tue, Apr 28, 2015 at 9:13 AM, Stephen Frost wrote:
> * Jeff Janes (jeff.ja...@gmail.com) wrote:
>> On Mon, Mar 30, 2015 at 5:26 PM, Tomas Vondra
>> <tomas.von...@2ndquadrant.com> wrote:
>>> attached is a new version of the patch series. Aside from fixing
>>> various issues (crashes, memory leaks). The patches are rebased to
>>> current master, and I also attach a few SQL scripts I used for
>>> testing (nothing fancy, just stress-testing all the parts the patch
>>> touches).
>>
>> I get cascading conflicts in pg_proc.h. It looked easy enough to fix,
>> except then I get compiler errors:
>
> Yeah, those are because you didn't address the new column which was
> added to pg_proc. You need to add another _null_ in the pg_proc.h
> lines in the correct place, apparently on four lines.

Thanks. I think I tried that, but was still having trouble. But it turns
out that the trouble was for an unrelated reason, and I got it to
compile now.

Some of the fdw's need a patch as well in order to compile, see attached.

Cheers,

Jeff

multivariate_contrib.patch
Description: Binary data
Re: [HACKERS] WIP: multivariate statistics / proof of concept
* Jeff Janes (jeff.ja...@gmail.com) wrote:
> On Mon, Mar 30, 2015 at 5:26 PM, Tomas Vondra wrote:
>> attached is a new version of the patch series. Aside from fixing
>> various issues (crashes, memory leaks). The patches are rebased to
>> current master, and I also attach a few SQL scripts I used for
>> testing (nothing fancy, just stress-testing all the parts the patch
>> touches).
>
> I get cascading conflicts in pg_proc.h. It looked easy enough to fix,
> except then I get compiler errors:

Yeah, those are because you didn't address the new column which was
added to pg_proc. You need to add another _null_ in the pg_proc.h lines
in the correct place, apparently on four lines.

Thanks!

Stephen
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On Mon, Mar 30, 2015 at 5:26 PM, Tomas Vondra wrote:
> Hello,
>
> attached is a new version of the patch series. Aside from fixing
> various issues (crashes, memory leaks). The patches are rebased to
> current master, and I also attach a few SQL scripts I used for testing
> (nothing fancy, just stress-testing all the parts the patch touches).

Hi Tomas,

I get cascading conflicts in pg_proc.h. It looked easy enough to fix,
except then I get compiler errors:

funcapi.c: In function 'get_func_trftypes':
funcapi.c:890: warning: unused variable 'procStruct'
utils/fmgrtab.o:(.rodata+0x10cf8): undefined reference to `_null_'
utils/fmgrtab.o:(.rodata+0x10d18): undefined reference to `_null_'
utils/fmgrtab.o:(.rodata+0x10d38): undefined reference to `_null_'
utils/fmgrtab.o:(.rodata+0x10d58): undefined reference to `_null_'
collect2: ld returned 1 exit status
make[2]: *** [postgres] Error 1
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2
make: *** Waiting for unfinished jobs
make: *** [temp-install] Error 2

Cheers,

Jeff
Re: [HACKERS] WIP: multivariate statistics / proof of concept
Hello,

On 03/24/15 06:34, Kyotaro HORIGUCHI wrote:
> Sorry, not shown above, the *previous* t1 had been done "alter table
> t1 add statistics (a, b, c)". Removing t1 didn't remove the setting.
> Re-initing the cluster let me do that without error.

OK, thanks. My guess is this issue got already fixed in my working copy,
but I will double-check that.

Admittedly, the management of the stats (e.g. removing stats when the
table is dropped) is one of the incomplete parts. You have to delete the
rows manually from pg_mv_statistic.

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WIP: multivariate statistics / proof of concept
Hello,

>> Patch 0001 needs changes for OIDs since my patch was committed. The
>> attached is compatible with current master.
>
> Thanks. I plan to submit a new version of the patch in a few days,
> with significant progress in various directions. I'll have to rebase
> to current master before submitting the new version anyway (which
> includes fixing duplicate OIDs).
>
>> And I tried this like this, and got the following error on analyze.
>> But unfortunately I don't have enough time to investigate it now.
>>
>> postgres=# create table t1 (a int, b int, c int);
>> insert into t1 (select a/ 1, a / 1, a / 1 from
>>     generate_series(0, 9) a);
>> postgres=# analyze t1;
>> ERROR:  invalid memory alloc request size 1485176862
>
> Interesting - particularly because this does not involve any
> multivariate stats. I can't reproduce it with the current version of
> the patch, so either it's unrelated, or I've fixed it since posting
> the last version.

Sorry, not shown above, the *previous* t1 had been done "alter table t1
add statistics (a, b, c)". Removing t1 didn't remove the setting.
Re-initing the cluster let me do that without error. The steps
throughout were as follows:

===
create table t1 (a int, b int, c int);
alter table t1 add statistics (histogram) on (a, b, c);
drop table t1;   -- This does not remove the above setting.
create table t1 (a int, b int, c int);
insert into t1 (select a/ 1, a / 1, a / 1 from generate_series(0, 9) a);
insert into t1 ...
===

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WIP: multivariate statistics / proof of concept
Hello,

On 20.3.2015 09:33, Kyotaro HORIGUCHI wrote:
> Hello,
>
> Patch 0001 needs changes for OIDs since my patch was committed. The
> attached is compatible with current master.

Thanks. I plan to submit a new version of the patch in a few days, with
significant progress in various directions. I'll have to rebase to
current master before submitting the new version anyway (which includes
fixing duplicate OIDs).

> And I tried this like this, and got the following error on analyze.
> But unfortunately I don't have enough time to investigate it now.
>
> postgres=# create table t1 (a int, b int, c int);
> insert into t1 (select a/ 1, a / 1, a / 1 from
>     generate_series(0, 9) a);
> postgres=# analyze t1;
> ERROR:  invalid memory alloc request size 1485176862

Interesting - particularly because this does not involve any
multivariate stats. I can't reproduce it with the current version of the
patch, so either it's unrelated, or I've fixed it since posting the last
version.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WIP: multivariate statistics / proof of concept
Hello,

Patch 0001 needs changes for OIDs since my patch was committed. The
attached is compatible with current master.

And I tried this like this, and got the following error on analyze. But
unfortunately I don't have enough time to investigate it now.

postgres=# create table t1 (a int, b int, c int);
insert into t1 (select a/ 1, a / 1, a / 1 from
    generate_series(0, 9) a);
postgres=# analyze t1;
ERROR:  invalid memory alloc request size 1485176862

regards,

At Sat, 24 Jan 2015 21:21:39 +0100, Tomas Vondra wrote in
<54c3fed3.1060...@2ndquadrant.com>
> Hi,
>
> attached is an updated version of the multivariate stats patch. This
> is going to be a bit longer mail, so I'll put here a small ToC ;-)
>
> 1) patch split into 4 parts
> 2) where to start / documentation
> 3) state of the code
> 4) main changes/improvements
> 5) remaining limitations
>
> The motivation and design ideas, explained in the first message of
> this thread are still valid. It might be a good idea to read it first:
>
> http://www.postgresql.org/message-id/flat/543afa15.4080...@fuzzy.cz
>
> BTW if you happen to go to FOSDEM [PGDay], I'll gladly give you an
> intro into the patch in person, or discuss the patch in general.
>
>
> 1) Patch split into 4 parts
> ---
> Firstly, the patch got broken into the following four pieces, to make
> the reviews somewhat easier:
>
> 1) 0001-shared-infrastructure-and-functional-dependencies.patch
>
>    - infrastructure, shared by all the kinds of stats added in the
>      following patches (catalog, ALTER TABLE, ANALYZE ...)
>
>    - implementation of a simple statistics, tracking functional
>      dependencies between columns (previously called "associative
>      rules", but that's incorrect for several reasons)
>
>    - this does not modify the optimizer in any way
>
> 2) 0002-clause-reduction-using-functional-dependencies.patch
>
>    - applies the functional dependencies to optimizer (i.e. considers
>      the rules in clauselist_selectivity())
>
> 3) 0003-multivariate-MCV-lists.patch
>
>    - multivariate MCV lists (both ANALYZE and optimizer parts)
>
> 4) 0004-multivariate-histograms.patch
>
>    - multivariate histograms (both ANALYZE and optimizer parts)
>
> You may look at the patches at github here:
>
> https://github.com/tvondra/postgres/tree/multivariate-stats-squashed
>
> The branch is not stable, i.e. I'll rebase / squash / force-push
> changes in the future. (There's also a multivariate-stats development
> branch with unsquashed changes, but you don't want to look at that,
> trust me.)
>
> The patches are not exactly small (being in the 50-100 kB range), but
> that's mostly because of the amount of comments explaining the goals
> and implementation details.
>
>
> 2) Where to start / documentation
> ---
> I strived to document all the pieces properly, mostly in the form of
> comments. There's no sgml documentation at this point, which should
> obviously change in the future.
>
> Anyway, I'd suggest reading the first e-mail in this thread,
> explaining the ideas, and then these comments:
>
> 1) functional dependencies (patch 0001)
>    - src/backend/utils/mvstats/dependencies.c
>
> 2) MCV lists (patch 0003)
>    - src/backend/utils/mvstats/mcv.c
>
> 3) histograms (patch 0004)
>    - src/backend/utils/mvstats/mcv.c
>    - also see clauselist_mv_selectivity_mcvlist() in clausesel.c
>    - also see clauselist_mv_selectivity_histogram() in clausesel.c
>
> 4) selectivity estimation (patches 0002-0004)
>    - all in src/backend/optimizer/path/clausesel.c
>    - clauselist_selectivity() - overview of how the stats are applied
>    - clauselist_apply_dependencies() - functional dependencies
>      reduction
>    - clauselist_mv_selectivity_mcvlist() - MCV list estimation
>    - clauselist_mv_selectivity_histogram() - histogram estimation
>
>
> 3) State of the code
> ---
> I've spent a fair amount of time testing the patches, and while I
> believe there are no segfaults or so, I know parts of the code need a
> bit more love.
>
> The part most in need of improvements / comments is probably the code
> in clausesel.c - that seems a bit quirky. Reviews / comments regarding
> this part of the code are very welcome - I'm sure there are many ways
> to improve this part.
>
> There are a few FIXMEs elsewhere (e.g. about memory allocation in the
> (de)serialization code), but those are mostly well-defined issues that
> I know how to address (at least I believe so).
>
>
> 4) Main changes/improvements
> ---
> There are many significant improvements. The previous patch version
> was in the 'proof of concept' category (missing pieces, knowingly
> broken in some areas), the current patch should 'mostly work'.
>
> The patch fixes two most annoying limitations of the first version:
>
> (a) support for all data types (not just those passed by value)
> (b) handles NUL
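[Editor's note: as a rough illustration of what the 0002 patch's clause reduction does - this is a hand-written sketch of the idea, not the patch's clauselist_apply_dependencies() code, and the function name and column names are made up.]

```python
# When column A functionally determines column B (A -> B), an equality
# clause on B adds almost no extra selectivity once A is already
# constrained, so it can be dropped from the selectivity computation
# instead of being multiplied in as if it were independent.
def reduce_clauses(clauses, dependencies):
    """clauses: set of column names with equality conditions;
    dependencies: list of (determinant, dependent) pairs."""
    reduced = set(clauses)
    for a, b in dependencies:
        if a in reduced and b in reduced:
            reduced.discard(b)  # B's clause is implied by A's
    return reduced

# Classic example: zip code determines city, so "WHERE zip = ... AND
# city = ..." should be estimated using the zip clause alone.
print(reduce_clauses({"zip", "city"}, [("zip", "city")]))  # {'zip'}
```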
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On Mon, Dec 15, 2014 at 11:55 AM, Michael Paquier wrote:
> On Wed, Dec 10, 2014 at 5:15 AM, Tomas Vondra wrote:
>> I agree with moving the patch to the next CF - I'm working on the
>> patch, but I will take a bit more time to submit a new version and I
>> can do that in the next CF.
> OK cool. I just moved it by myself. I didn't see it yet registered in
> 2014-12.

Marked as returned with feedback. No new version showed up in the last
month and this patch was waiting for input from author.

--
Michael
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On Wed, Dec 10, 2014 at 5:15 AM, Tomas Vondra wrote:
> I agree with moving the patch to the next CF - I'm working on the
> patch, but I will take a bit more time to submit a new version and I
> can do that in the next CF.

OK cool. I just moved it by myself. I didn't see it yet registered in
2014-12.

Thanks,

--
Michael
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On 11.12.2014 17:53, Heikki Linnakangas wrote:
> On 10/13/2014 01:00 AM, Tomas Vondra wrote:
>> Hi,
>>
>> attached is a WIP patch implementing multivariate statistics.
>
> Great! Really glad to see you working on this.
>
>> + * FIXME This sample sizing is mostly OK when computing stats for
>> + * individual columns, but when computing multi-variate stats
>> + * for multivariate stats (histograms, mcv, ...) it's rather
>> + * insufficient. For small number of dimensions it works, but
>> + * for complex stats it'd be nice use sample proportional to
>> + * the table (say, 0.5% - 1%) instead of a fixed size.
>
> I don't think a fraction of the table is appropriate. As long as the
> sample is random, the accuracy of a sample doesn't depend much on the
> size of the population. For example, if you sample 1,000 rows from a
> table with 100,000 rows, or 1,000 rows from a table with 100,000,000
> rows, the accuracy is pretty much the same. That doesn't change when
> you go from a single variable to multiple variables.

I might be wrong, but I doubt that. First, I read a number of papers
while working on this patch, and all of them used samples proportional
to the data set. That's indirect evidence, though.

> You do need a bigger sample with multiple variables, however. My gut
> feeling is that if you sample N rows for a single variable, with two
> variables you need to sample N^2 rows to get the same accuracy. But
> it's not proportional to the table size. (I have no proof for that,
> but I'm sure there is literature on this.)

Maybe. I think it's somehow related to the number of buckets (which
somehow determines the precision of the histogram). If you want 1000
buckets, the number of rows scanned needs to be e.g. 10x that. With
multi-variate histograms, we may shoot for more buckets (say, 100 in
each dimension).

>> + * Multivariate histograms
>> + *
>> + * Histograms are a collection of buckets, represented by
>> + * n-dimensional rectangles. Each rectangle is delimited by an
>> + * array of lower and upper boundaries, so that for the i-th
>> + * attribute
>> + *
>> + * min[i] <= value[i] <= max[i]
>> + *
>> + * Each bucket tracks frequency (fraction of tuples it contains),
>> + * information about the inequalities, number of distinct values in
>> + * each dimension (which is used when building the histogram) etc.
>> + *
>> + * The boundaries may be either inclusive or exclusive, or the
>> + * whole dimension may be NULL.
>> + *
>> + * The buckets may overlap (assuming the build algorithm keeps the
>> + * frequencies additive) or may not cover the whole space (i.e.
>> + * allow gaps). This entirely depends on the algorithm used to
>> + * build the histogram.
>
> That sounds pretty exotic. These buckets are quite different from the
> single-dimension buckets we currently have.
>
> The paper you reference in partition_bucket() function, M.
> Muralikrishna, David J. DeWitt: Equi-Depth Histograms For Estimating
> Selectivity Factors For Multi-Dimensional Queries. SIGMOD Conference
> 1988: 28-36, actually doesn't mention overlapping buckets at all. I
> haven't read the code in detail, but if it implements the algorithm
> from that paper, there will be no overlap.

The algorithm implemented in partition_bucket() is very simple and
naive, and it mostly resembles the algorithm described in the paper. I'm
sure there are differences, it's not a 1:1 implementation, but you're
right it produces non-overlapping buckets.

The point is that I envision more complex algorithms or different
histogram types, and some of them may produce overlapping buckets. Maybe
that's a premature comment, and it will turn out it's not really
necessary.

regards
Tomas
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On 10/13/2014 01:00 AM, Tomas Vondra wrote: Hi, attached is a WIP patch implementing multivariate statistics. Great! Really glad to see you working on this. +* FIXME This sample sizing is mostly OK when computing stats for +* individual columns, but when computing multivariate stats +* (histograms, mcv, ...) it's rather +* insufficient. For a small number of dimensions it works, but +* for complex stats it'd be nice to use a sample proportional to +* the table (say, 0.5% - 1%) instead of a fixed size. I don't think a fraction of the table is appropriate. As long as the sample is random, the accuracy of a sample doesn't depend much on the size of the population. For example, if you sample 1,000 rows from a table with 100,000 rows, or 1,000 rows from a table with 100,000,000 rows, the accuracy is pretty much the same. That doesn't change when you go from a single variable to multiple variables. You do need a bigger sample with multiple variables, however. My gut feeling is that if you sample N rows for a single variable, with two variables you need to sample N^2 rows to get the same accuracy. But it's not proportional to the table size. (I have no proof for that, but I'm sure there is literature on this.) + * Multivariate histograms + * + * Histograms are a collection of buckets, represented by n-dimensional + * rectangles. Each rectangle is delimited by an array of lower and + * upper boundaries, so that for the i-th attribute + * + * min[i] <= value[i] <= max[i] + * + * Each bucket tracks frequency (fraction of tuples it contains), + * information about the inequalities, number of distinct values in + * each dimension (which is used when building the histogram) etc. + * + * The boundaries may be either inclusive or exclusive, or the whole + * dimension may be NULL. + * + * The buckets may overlap (assuming the build algorithm keeps the + * frequencies additive) or may not cover the whole space (i.e. allow + * gaps). 
This entirely depends on the algorithm used to build the + * histogram. That sounds pretty exotic. These buckets are quite different from the single-dimension buckets we currently have. The paper you reference in the partition_bucket() function, M. Muralikrishna, David J. DeWitt: Equi-Depth Histograms For Estimating Selectivity Factors For Multi-Dimensional Queries. SIGMOD Conference 1988: 28-36, actually doesn't mention overlapping buckets at all. I haven't read the code in detail, but if it implements the algorithm from that paper, there will be no overlap. - Heikki
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On 8.12.2014 02:01, Michael Paquier wrote: > On Sun, Nov 16, 2014 at 3:35 AM, Tomas Vondra wrote: >> Thanks for the link. I've been looking for a good dataset with such >> data, and this one is by far the best one. >> >> The current version of the patch supports only data types passed by >> value (i.e. no varlena types - text, ), which means it's impossible to >> build multivariate stats on some of the interesting columns (state, >> city, ...). >> >> I guess it's time to start working on removing this limitation. > Tomas, what's your status on this patch? Are you planning to make it > more complicated than it is? For now I have switched it to a "Needs > Review" state because even your first version did not get advanced > review (that's quite big btw). I guess that we should switch it to the > next CF. Hello Michael, I agree with moving the patch to the next CF - I'm working on the patch, but I will take a bit more time to submit a new version and I can do that in the next CF. regards Tomas
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On Sun, Nov 16, 2014 at 3:35 AM, Tomas Vondra wrote: > Thanks for the link. I've been looking for a good dataset with such > data, and this one is by far the best one. > > The current version of the patch supports only data types passed by > value (i.e. no varlena types - text, ), which means it's impossible to > build multivariate stats on some of the interesting columns (state, > city, ...). > > I guess it's time to start working on removing this limitation. Tomas, what's your status on this patch? Are you planning to make it more complicated than it is? For now I have switched it to a "Needs Review" state because even your first version did not get advanced review (that's quite big btw). I guess that we should switch it to the next CF. Regards, -- Michael
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On 15.11.2014 18:49, Kevin Grittner wrote: > If you eliminate the quals besides the zipcode column you get 61 > rows and it gets much stranger, with legal municipalities that are > completely surrounded by Madison that the postal service would > rather you didn't use in addressing your envelopes, but they have > to deliver to anyway, and organizations inside Madison receiving > enough mail to (literally) have their own zip code -- where the > postal service allows the organization name as a deliverable > "city". > > If you want to have your own fun with this data, you can download > it here: > > http://federalgovernmentzipcodes.us/free-zipcode-database.csv > ... > > I bet there are all sorts of correlation possibilities with, for > example, latitude and longitude and other variables. With 81831 > rows and so many correlations among the columns, it might be a > useful data set to test with. Thanks for the link. I've been looking for a good dataset with such data, and this one is by far the best one. The current version of the patch supports only data types passed by value (i.e. no varlena types - text, ), which means it's impossible to build multivariate stats on some of the interesting columns (state, city, ...). I guess it's time to start working on removing this limitation. Tomas
Re: [HACKERS] WIP: multivariate statistics / proof of concept
Tomas Vondra wrote: > On 13 November 2014, 16:51, Katharina Büchse wrote: >> On 13.11.2014 14:11, Tomas Vondra wrote: >> >>> The only place where I think this might work is the associative rules. >>> It's simple to specify rules like ("ZIP code" implies "city") and we could >>> even do some simple check against the data to see if it actually makes >>> sense (and 'disable' the rule if not). >> >> and even this simple example has its limits, at least in Germany ZIP >> codes are not unique for rural areas, where several villages have the >> same ZIP code. > as you point out most real-world data either contain bugs > or are inherently imperfect (we have the same kind of ZIP/city > inconsistencies in Czech). You can have lots of fun with U.S. zip codes, too. Just on the nominally "Madison, Wisconsin" zip codes (those starting with 537), there are several exceptions:

select zipcode, city, locationtype
  from zipcode
 where zipcode like '537%'
   and Decommisioned = 'false'
   and zipcodetype = 'STANDARD'
   and locationtype in ('PRIMARY', 'ACCEPTABLE')
 order by zipcode, city;

 zipcode |   city    | locationtype
---------+-----------+--------------
 53703   | MADISON   | PRIMARY
 53704   | MADISON   | PRIMARY
 53705   | MADISON   | PRIMARY
 53706   | MADISON   | PRIMARY
 53711   | FITCHBURG | ACCEPTABLE
 53711   | MADISON   | PRIMARY
 53713   | FITCHBURG | ACCEPTABLE
 53713   | MADISON   | PRIMARY
 53713   | MONONA    | ACCEPTABLE
 53714   | MADISON   | PRIMARY
 53714   | MONONA    | ACCEPTABLE
 53715   | MADISON   | PRIMARY
 53716   | MADISON   | PRIMARY
 53716   | MONONA    | ACCEPTABLE
 53717   | MADISON   | PRIMARY
 53718   | MADISON   | PRIMARY
 53719   | FITCHBURG | ACCEPTABLE
 53719   | MADISON   | PRIMARY
 53725   | MADISON   | PRIMARY
 53726   | MADISON   | PRIMARY
 53744   | MADISON   | PRIMARY
(21 rows)

If you eliminate the quals besides the zipcode column you get 61 rows and it gets much stranger, with legal municipalities that are completely surrounded by Madison that the postal service would rather you didn't use in addressing your envelopes, but they have to deliver to anyway, and organizations
inside Madison receiving enough mail to (literally) have their own zip code -- where the postal service allows the organization name as a deliverable "city". If you want to have your own fun with this data, you can download it here: http://federalgovernmentzipcodes.us/free-zipcode-database.csv I was able to load it into PostgreSQL with this:

create table zipcode (
    recordnumber integer not null,
    zipcode text not null,
    zipcodetype text not null,
    city text not null,
    state text not null,
    locationtype text not null,
    lat double precision,
    long double precision,
    xaxis double precision not null,
    yaxis double precision not null,
    zaxis double precision not null,
    worldregion text not null,
    country text not null,
    locationtext text,
    location text,
    decommisioned text not null,
    taxreturnsfiled bigint,
    estimatedpopulation bigint,
    totalwages bigint,
    notes text
);

comment on column zipcode.zipcode is 'Zipcode or military postal code(FPO/APO)';
comment on column zipcode.zipcodetype is 'Standard, PO BOX Only, Unique, Military(implies APO or FPO)';
comment on column zipcode.city is 'official city name(s)';
comment on column zipcode.state is 'official state, territory, or quasi-state (AA, AE, AP) abbreviation code';
comment on column zipcode.locationtype is 'Primary, Acceptable, Not Acceptable';
comment on column zipcode.lat is 'Decimal Latitude, if available';
comment on column zipcode.long is 'Decimal Longitude, if available';
comment on column zipcode.location is 'Standard Display (eg Phoenix, AZ ; Pago Pago, AS ; Melbourne, AU )';
comment on column zipcode.decommisioned is 'If Primary location, Yes implies historical Zipcode, No Implies current Zipcode; If not Primary, Yes implies Historical Placename';
comment on column zipcode.taxreturnsfiled is 'Number of Individual Tax Returns Filed in 2008';

copy zipcode from 'filepath' with (format csv, header);

alter table zipcode add primary key (recordnumber);
create unique index zipcode_city on zipcode (zipcode, city);

I bet there are all
sorts of correlation possibilities with, for example, latitude and longitude and other variables. With 81831 rows and so many correlations among the columns, it might be a useful data set to test with. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On 13 November 2014, 16:51, Katharina Büchse wrote: > On 13.11.2014 14:11, Tomas Vondra wrote: > >> The only place where I think this might work is the associative rules. >> It's simple to specify rules like ("ZIP code" implies "city") and we >> could >> even do some simple check against the data to see if it actually makes >> sense (and 'disable' the rule if not). > > and even this simple example has its limits, at least in Germany ZIP > codes are not unique for rural areas, where several villages have the > same ZIP code. > > I guess there are just a few examples where columns are completely > functionally dependent without any exceptions. > But of course, if the user gives this information just for optimizing > the statistics, some exceptions don't matter. > If this information should be used for creating different execution > plans (e.g. column A has an index and column B is functionally > dependent, so one could think about using this index on A and the > dependency instead of running through the whole table to find all tuples > that fit the query on column B), exceptions are a very important issue. Yes, exactly. The aim of this patch is "only" improving estimates, not removing conditions from the plan (e.g. checking only the ZIP code and not the city name). That certainly can't be done solely based on approximate statistics, and as you point out most real-world data either contain bugs or are inherently imperfect (we have the same kind of ZIP/city inconsistencies in Czech). That's not a big issue for estimates (assuming only a small fraction of rows violates the rule) though. Tomas
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On 13.11.2014 14:11, Tomas Vondra wrote: Dne 13 Listopad 2014, 12:31, Simon Riggs napsal(a): On 12 October 2014 23:00, Tomas Vondra wrote: It however seems to be working sufficiently well at this point, enough to get some useful feedback. So here we go. This looks interesting and useful. What I'd like to check before a detailed review is that this has sufficient applicability to be useful. My understanding is that Q9 and Q18 of TPC-H have poor plans as a result of multi-column stats errors. Could you look at those queries and confirm that this patch can produce better plans for them? Sure. I planned to do such verification/demonstration anyway, after discussing the overall approach. I planned to give it a try on TPC-DS, but I can start with the TPC-H queries you propose. I'm not sure whether the poor estimates in Q9 & Q18 come from column correlation though - if it's due to some other issues (e.g. conditions that are difficult to estimate), this patch can't do anything with them. But it's a good start. If so, I will work with you to review this patch. Thanks! One aspect of the patch that seems to be missing is a user declaration of correlation, just as we have for setting n_distinct. It seems like an even easier place to start to just let the user specify the stats declaratively. That way we can split the patch into two parts. First, allow multi column stats that are user declared. Then add user stats collected by ANALYZE. The first part is possibly contentious and thus a good initial focus. The second part will have lots of discussion, so good to skip for a first version. I'm not a big fan of this approach, for a number of reasons. Firstly, it only works for "simple" parameters that are trivial to specify (say, Pearson's correlation coefficient), and the patch does not work with those at all - it only works with histograms, MCV lists (and might work with associative rules in the future). 
And we certainly can't ask users to specify multivariate histograms - because it's very difficult to do, and also because complex stats are more susceptible to getting stale after adding new data to the table. Secondly, even if we add such "simple" parameters to the patch, we have to come up with a way to apply those parameters to the estimates. The problem is that as the parameters get simpler, it's less and less useful to compute the stats. Another question is whether it should support more than 2 columns ... The only place where I think this might work is the associative rules. It's simple to specify rules like ("ZIP code" implies "city") and we could even do some simple check against the data to see if it actually makes sense (and 'disable' the rule if not). and even this simple example has its limits, at least in Germany ZIP codes are not unique for rural areas, where several villages have the same ZIP code. I guess there are just a few examples where columns are completely functionally dependent without any exceptions. But of course, if the user gives this information just for optimizing the statistics, some exceptions don't matter. If this information should be used for creating different execution plans (e.g. column A has an index and column B is functionally dependent, so one could think about using this index on A and the dependency instead of running through the whole table to find all tuples that fit the query on column B), exceptions are a very important issue. But maybe I got it wrong and you have something particular in mind? Can you give an example of how it would work? regards Tomas -- Dipl.-Math. 
Katharina Büchse Friedrich-Schiller-Universität Jena Institut für Informatik Lehrstuhl für Datenbanken und Informationssysteme Ernst-Abbe-Platz 2 07743 Jena Telefon 03641/946367 Webseite http://users.minet.uni-jena.de/~re89qen/
Re: [HACKERS] WIP: multivariate statistics / proof of concept
Dne 13 Listopad 2014, 12:31, Simon Riggs napsal(a): > On 12 October 2014 23:00, Tomas Vondra wrote: > >> It however seems to be working sufficiently well at this point, enough >> to get some useful feedback. So here we go. > > This looks interesting and useful. > > What I'd like to check before a detailed review is that this has > sufficient applicability to be useful. > > My understanding is that Q9 and Q18 of TPC-H have poor plans as a > result of multi-column stats errors. > > Could you look at those queries and confirm that this patch can > produce better plans for them? Sure. I planned to do such verification/demonstration anyway, after discussing the overall approach. I planned to give it a try on TPC-DS, but I can start with the TPC-H queries you propose. I'm not sure whether the poor estimates in Q9 & Q18 come from column correlation though - if it's due to some other issues (e.g. conditions that are difficult to estimate), this patch can't do anything with them. But it's a good start. > If so, I will work with you to review this patch. Thanks! > One aspect of the patch that seems to be missing is a user declaration > of correlation, just as we have for setting n_distinct. It seems like > an even easier place to start to just let the user specify the stats > declaratively. That way we can split the patch into two parts. First, > allow multi column stats that are user declared. Then add user stats > collected by ANALYZE. The first part is possibly contentious and thus > a good initial focus. The second part will have lots of discussion, so > good to skip for a first version. I'm not a big fan of this approach, for a number of reasons. Firstly, it only works for "simple" parameters that are trivial to specify (say, Pearson's correlation coefficient), and the patch does not work with those at all - it only works with histograms, MCV lists (and might work with associative rules in the future). 
And we certainly can't ask users to specify multivariate histograms - because it's very difficult to do, and also because complex stats are more susceptible to getting stale after adding new data to the table. Secondly, even if we add such "simple" parameters to the patch, we have to come up with a way to apply those parameters to the estimates. The problem is that as the parameters get simpler, it's less and less useful to compute the stats. Another question is whether it should support more than 2 columns ... The only place where I think this might work is the associative rules. It's simple to specify rules like ("ZIP code" implies "city") and we could even do some simple check against the data to see if it actually makes sense (and 'disable' the rule if not). But maybe I got it wrong and you have something particular in mind? Can you give an example of how it would work? regards Tomas
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On 12 October 2014 23:00, Tomas Vondra wrote: > It however seems to be working sufficiently well at this point, enough > to get some useful feedback. So here we go. This looks interesting and useful. What I'd like to check before a detailed review is that this has sufficient applicability to be useful. My understanding is that Q9 and Q18 of TPC-H have poor plans as a result of multi-column stats errors. Could you look at those queries and confirm that this patch can produce better plans for them? If so, I will work with you to review this patch. One aspect of the patch that seems to be missing is a user declaration of correlation, just as we have for setting n_distinct. It seems like an even easier place to start to just let the user specify the stats declaratively. That way we can split the patch into two parts. First, allow multi column stats that are user declared. Then add user stats collected by ANALYZE. The first part is possibly contentious and thus a good initial focus. The second part will have lots of discussion, so good to skip for a first version. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On 30 October 2014, 10:17, David Rowley wrote: > On Thu, Oct 30, 2014 at 12:48 AM, Tomas Vondra wrote: > >> On 29 October 2014, 12:31, Petr Jelinek wrote: >> >> I've not really gotten around to looking at the patch yet, but I'm >> also >> >> wondering if it would be simple to allow functional >> statistics >> >> too. The pg_mv_statistic name seems to indicate multi columns, but >> how >> >> about stats on date(datetime_column), or perhaps any non-volatile >> >> function. This would help to solve the problem highlighted here >> >> >> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9uf34...@mail.gmail.com >> >> . Without giving it too much thought, perhaps any expression that can >> be >> >> indexed should be allowed to have stats? Would that be really >> difficult >> >> to implement in comparison to what you've already done with the patch >> so >> >> far? >> >> >> > >> > I would not over-complicate requirements for the first version of >> this, >> > I think it's already complicated enough. >> >> My thoughts, exactly. I'm not willing to put more features into the >> initial version of the patch. Actually, I'm thinking about ripping out >> some experimental features (particularly "hashed MCV" and "associative >> rules"). >> >> > That's fair, but I didn't really mean to imply that you should go work on > that too and that it should be part of this patch. > I was thinking more along the lines of that I don't really agree with the > table name for the new stats and that at some later date someone will want > to add expression stats and we'd probably better come up with a design that would > be friendly towards that. At this time I can only think that the name of > the table might not suit expression stats well, I'd hate to see someone > have to invent a 3rd table to support these when we could likely come up > with something that could be extended later and still make sense both > today > and in the future. 
> > I was just looking at how expression indexes are stored in pg_index and I > see that if it's an expression index, the expression is stored in > the indexprs column which is of type pg_node_tree, so quite possibly at > some point in the future the new stats table could just have an extra > column added, and for today, we'd just need to come up with a future-proof > name... Perhaps pg_statistic_ext or pg_statisticx, and name functions and > source files something along those lines instead? Ah, OK. I don't think the catalog name "pg_mv_statistic" is somehow inappropriate for this purpose, though. IMHO the "multivariate" does not mean "only columns" or "no expressions", it simply describes that the approximated density function has multiple input variables, be it attributes or expressions. But maybe there's a better name. Tomas
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On Thu, Oct 30, 2014 at 12:21 AM, Tomas Vondra wrote: > On 29 October 2014, 10:41, David Rowley wrote: > > I'm quite interested in reviewing your work on this, but it appears that > > some of your changes are not C89: > > > > src\backend\commands\analyze.c(3774): error C2057: expected constant > > expression [D:\Postgres\a\postgres.vcxproj] > > src\backend\commands\analyze.c(3774): error C2466: cannot allocate an > > array of constant size 0 [D:\Postgres\a\postgres.vcxproj] > > src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown > > size [D:\Postgres\a\postgres.vcxproj] > > src\backend\commands\analyze.c(4302): error C2057: expected constant > > expression [D:\Postgres\a\postgres.vcxproj] > > src\backend\commands\analyze.c(4302): error C2466: cannot allocate an > > array of constant size 0 [D:\Postgres\a\postgres.vcxproj] > > src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' : > unknown > > size [D:\Postgres\a\postgres.vcxproj] > > src\backend\commands\analyze.c(4775): error C2057: expected constant > > expression [D:\Postgres\a\postgres.vcxproj] > > src\backend\commands\analyze.c(4775): error C2466: cannot allocate an > > array of constant size 0 [D:\Postgres\a\postgres.vcxproj] > > src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size > > [D:\Postgres\a\postgres.vcxproj] > > > > I'll look into that. The thing is I don't have access to MSVC, so it's a > bit > difficult to spot / fix those issues :-( > > It should be a pretty simple fix, just use the files and line numbers from the above. It's just a problem that in those 3 places you're declaring an array of a variable size, which is not allowed in C89. The thing to do instead would just be to palloc() the size you need and then pfree() it when you're done. Regards David Rowley
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On Thu, Oct 30, 2014 at 12:48 AM, Tomas Vondra wrote: > On 29 October 2014, 12:31, Petr Jelinek wrote: > >> I've not really gotten around to looking at the patch yet, but I'm also > >> wondering if it would be simple to allow functional statistics > >> too. The pg_mv_statistic name seems to indicate multi columns, but how > >> about stats on date(datetime_column), or perhaps any non-volatile > >> function. This would help to solve the problem highlighted here > >> > http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9uf34...@mail.gmail.com > >> . Without giving it too much thought, perhaps any expression that can be > >> indexed should be allowed to have stats? Would that be really difficult > >> to implement in comparison to what you've already done with the patch so > >> far? > >> > > > > I would not over-complicate requirements for the first version of this, > > I think it's already complicated enough. > > My thoughts, exactly. I'm not willing to put more features into the > initial version of the patch. Actually, I'm thinking about ripping out > some experimental features (particularly "hashed MCV" and "associative > rules"). > > That's fair, but I didn't really mean to imply that you should go work on that too and that it should be part of this patch. I was thinking more along the lines of that I don't really agree with the table name for the new stats and that at some later date someone will want to add expression stats and we'd probably better come up with a design that would be friendly towards that. At this time I can only think that the name of the table might not suit expression stats well, I'd hate to see someone have to invent a 3rd table to support these when we could likely come up with something that could be extended later and still make sense both today and in the future. 
I was just looking at how expression indexes are stored in pg_index and I see that if it's an expression index that the expression is stored in the indexprs column which is of type pg_node_tree, so quite possibly at some point in the future the new stats table could just have an extra column added, and for today, we'd just need to come up with a future proof name... Perhaps pg_statistic_ext or pg_statisticx, and name functions and source files something along those lines instead? Regards David Rowley
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On 29 October 2014, 12:31, Petr Jelinek wrote: > On 29/10/14 10:41, David Rowley wrote: >> On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra > >> The last point is really just "unfinished implementation" - the >> syntax I >> propose is this: >> >> ALTER TABLE ... ADD STATISTICS (options) ON (columns) >> >> where the options influence the MCV list and histogram size, etc. >> The >> options are recognized and may give you an idea of what it might do, >> but >> it's not really used at the moment (except for storing in the >> pg_mv_statistic catalog). >> >> >> >> I've not really gotten around to looking at the patch yet, but I'm also >> wondering if it would be simple to allow functional statistics >> too. The pg_mv_statistic name seems to indicate multi columns, but how >> about stats on date(datetime_column), or perhaps any non-volatile >> function. This would help to solve the problem highlighted here >> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9uf34...@mail.gmail.com >> . Without giving it too much thought, perhaps any expression that can be >> indexed should be allowed to have stats? Would that be really difficult >> to implement in comparison to what you've already done with the patch so >> far? >> > > I would not over-complicate requirements for the first version of this, > I think it's already complicated enough. My thoughts, exactly. I'm not willing to put more features into the initial version of the patch. Actually, I'm thinking about ripping out some experimental features (particularly "hashed MCV" and "associative rules"). > A quick look at the patch suggests that it mainly needs discussion about > design and particular implementation choices, there is a fair amount of > TODOs and FIXMEs. I'd like to look at it too but I doubt that I'll have > time to do an in-depth review in this CF. Yes. 
I think it's a bit premature to discuss the code thoroughly at this point - I'd like to discuss the general approach to the feature (i.e. minimizing the impact on those not using it, etc.). The most interesting part of the code is probably the comments, explaining the design in more detail, known shortcomings and possible ways to address them. regards Tomas
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On 29/10/14 10:41, David Rowley wrote: On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9uf34...@mail.gmail.com . Without giving it too much thought, perhaps any expression that can be indexed should be allowed to have stats? Would that be really difficult to implement in comparison to what you've already done with the patch so far? I would not over-complicate requirements for the first version of this, I think it's already complicated enough. A quick look at the patch suggests that it mainly needs discussion about design and particular implementation choices, there is a fair amount of TODOs and FIXMEs. I'd like to look at it too but I doubt that I'll have time to do an in-depth review in this CF. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On 29 October 2014 10:41, David Rowley wrote:
>
> I've not really gotten around to looking at the patch yet, but I'm
> also wondering if it would be simple to include allowing functional
> statistics too. The pg_mv_statistic name seems to indicate multiple
> columns, but how about stats on date(datetime_column), or perhaps any
> non-volatile function? This would help to solve the problem
> highlighted here:
> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9uf34...@mail.gmail.com
> Without giving it too much thought, perhaps any expression that can
> be indexed should be allowed to have stats? Would that be really
> difficult to implement in comparison to what you've already done with
> the patch so far?

I don't know, but it seems mostly orthogonal to what the patch aims to
do. If we add collecting statistics on expressions (on a single
column), then I'd expect it to be reasonably simple to add this to the
multi-column case.

There are features, like join stats or range type stats, that are
probably more directly related to the patch (but out of scope for the
initial version).
> I'm quite interested in reviewing your work on this, but it appears
> that some of your changes are not C89:
>
> src\backend\commands\analyze.c(3774): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
> src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
> src\backend\commands\analyze.c(3774): error C2133: 'indexes' :
> unknown size [D:\Postgres\a\postgres.vcxproj]
> src\backend\commands\analyze.c(4302): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
> src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
> src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' :
> unknown size [D:\Postgres\a\postgres.vcxproj]
> src\backend\commands\analyze.c(4775): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
> src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
> src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown
> size [D:\Postgres\a\postgres.vcxproj]
>
> The compiler I'm using is a bit too stupid to understand the C99
> syntax.
>
> I guess you'd need to palloc() these arrays instead in order to
> comply with the project standards:
>
> http://www.postgresql.org/docs/devel/static/install-requirements.html
>
> I'm going to sign myself up to review this, so probably my first
> feedback would be the compiling problem.

I'll look into that. The thing is, I don't have access to MSVC, so
it's a bit difficult to spot / fix those issues :-(

regards
Tomas
Re: [HACKERS] WIP: multivariate statistics / proof of concept
On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra wrote:
> Hi,
>
> attached is a WIP patch implementing multivariate statistics. The
> code certainly is not "ready" - parts of it look as if written by a
> rogue chimp who got bored of attempts to type the complete works of
> William Shakespeare, and decided to try something different.

I'm really glad you're working on this. I had been thinking of looking
into doing this myself.

> The last point is really just "unfinished implementation" - the
> syntax I propose is this:
>
>     ALTER TABLE ... ADD STATISTICS (options) ON (columns)
>
> where the options influence the MCV list and histogram size, etc.
> The options are recognized and may give you an idea of what it might
> do, but it's not really used at the moment (except for storing in
> the pg_mv_statistic catalog).

I've not really gotten around to looking at the patch yet, but I'm
also wondering if it would be simple to include allowing functional
statistics too. The pg_mv_statistic name seems to indicate multiple
columns, but how about stats on date(datetime_column), or perhaps any
non-volatile function? This would help to solve the problem
highlighted here:
http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9uf34...@mail.gmail.com
Without giving it too much thought, perhaps any expression that can be
indexed should be allowed to have stats? Would that be really
difficult to implement in comparison to what you've already done with
the patch so far?
I'm quite interested in reviewing your work on this, but it appears
that some of your changes are not C89:

src\backend\commands\analyze.c(3774): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown
size [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' :
unknown size [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown
size [D:\Postgres\a\postgres.vcxproj]

The compiler I'm using is a bit too stupid to understand the C99
syntax. I guess you'd need to palloc() these arrays instead in order
to comply with the project standards:

http://www.postgresql.org/docs/devel/static/install-requirements.html

I'm going to sign myself up to review this, so probably my first
feedback would be the compiling problem.

Regards
David Rowley
Re: [HACKERS] WIP: multivariate statistics / proof of concept
Hi!

On 13.10.2014 09:36, Albe Laurenz wrote:
> Tomas Vondra wrote:
>> attached is a WIP patch implementing multivariate statistics.
>
> I think that is pretty useful. Oracle has an identical feature called
> "extended statistics".
>
> That's probably an entirely different thing, but it would be very
> nice to have statistics to estimate the correlation between columns
> of different tables, to improve the estimate for the number of rows
> in a join.

I don't have a clear idea of how that should work, but from a quick
look at how join selectivity estimation is implemented, I believe two
things might be possible:

(a) using conditional probabilities

    Say we have a join "ta JOIN tb ON (ta.x = tb.y)".

    Currently, the selectivity is derived from stats on the two keys -
    essentially the probabilities P(x) and P(y), represented by the
    MCV lists. But if there are additional WHERE conditions on the
    tables, and we have suitable multivariate stats, it's possible to
    use conditional probabilities. E.g. if the query actually uses

        ... ta JOIN tb ON (ta.x = tb.y) WHERE ta.z = 10

    and we have stats on (ta.x, ta.z), we can use P(x|z=10) instead.
    If the two columns are correlated, this might be much different.

(b) using this for multi-column conditions

    If the join condition involves multiple columns, e.g.

        ON (ta.x = tb.y AND ta.p = tb.q)

    and we happen to have stats on (ta.x, ta.p) and (tb.y, tb.q), we
    may use those to compute the cardinality (pretty much as we do
    today).

But I haven't really worked on this so far - I suspect there are
various subtle issues, and I certainly don't plan to address this in
the first phase of the patch.

Tomas
Re: [HACKERS] WIP: multivariate statistics / proof of concept
Tomas Vondra wrote:
> attached is a WIP patch implementing multivariate statistics.

I think that is pretty useful. Oracle has an identical feature called
"extended statistics".

That's probably an entirely different thing, but it would be very nice
to have statistics to estimate the correlation between columns of
different tables, to improve the estimate for the number of rows in a
join.

Yours,
Laurenz Albe