Re: [HACKERS] WIP: multivariate statistics / proof of concept

2016-12-02 Thread Haribabu Kommi
On Tue, Nov 22, 2016 at 2:42 PM, Tomas Vondra wrote:

> On 11/21/2016 11:10 PM, Robert Haas wrote:
>
>> [ reviving an old multivariate statistics thread ]
>>
>> On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs wrote:
>>
>>> On 12 October 2014 23:00, Tomas Vondra  wrote:
>>>
>>>> It however seems to be working sufficiently well at this point, enough
>>>> to get some useful feedback. So here we go.
>>>
>>> This looks interesting and useful.
>>>
>>> What I'd like to check before a detailed review is that this has
>>> sufficient applicability to be useful.
>>>
>>> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
>>> result of multi-column stats errors.
>>>
>>> Could you look at those queries and confirm that this patch can
>>> produce better plans for them?
>>>
>>
>> Tomas, did you ever do any testing in this area?  One of my
>> colleagues, Rafia Sabih, recently did some testing of TPC-H queries @
>> 20 GB.  Q18 actually doesn't complete at all right now because of an
>> issue with the new simplehash implementation.  I reported it to Andres
>> and he tracked it down, but hasn't posted the patch yet - see
>> http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxw...@alap3.anarazel.de
>>
>> Of the remaining queries, the slowest are Q9 and Q20, and both of them
>> have serious estimation errors.  On Q9, things go wrong here:
>>
>>  ->  Merge Join  (cost=5225092.04..6595105.57 rows=154 width=47) (actual time=103592.821..149335.010 rows=6503988 loops=1)
>>        Merge Cond: (partsupp.ps_partkey = lineitem.l_partkey)
>>        Join Filter: (lineitem.l_suppkey = partsupp.ps_suppkey)
>>        Rows Removed by Join Filter: 19511964
>>        ->  Index Scan using
>> [snip]
>
>>
>> Rows Removed by Filter: 756627
>>
>> The estimate for the index scan on partsupp is essentially perfect,
>> and the lineitem-part join is off by about 3x.  However, the merge
>> join is off by about 4000x, which is real bad.
>>
>>
> The patch only deals with statistics on base relations, no joins, at this
> point. It's meant to be extended in that direction, so the syntax supports
> it, but at this point that's all. No joins.
>
> That being said, this estimate should be improved in 9.6, when you create
> a foreign key between the tables. In fact, that patch was exactly about Q9.
>
> This is how the join estimate looks on scale 1 without the FK between the
> two tables:
>
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Merge Join  (cost=19.19..700980.12 rows=2404 width=261)
>    Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
>    ->  Index Scan using idx_lineitem_part_supp on lineitem  (cost=0.43..605856.84 rows=6001117 width=117)
>    ->  Index Scan using partsupp_pkey on partsupp  (cost=0.42..61141.76 rows=80 width=144)
> (4 rows)
>
>
> and with the foreign key:
>
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Merge Join  (cost=19.19..700980.12 rows=6001117 width=261) (actual rows=6001215 loops=1)
>    Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
>    ->  Index Scan using idx_lineitem_part_supp on lineitem  (cost=0.43..605856.84 rows=6001117 width=117) (actual rows=6001215 loops=1)
>    ->  Index Scan using partsupp_pkey on partsupp  (cost=0.42..61141.76 rows=80 width=144) (actual rows=6001672 loops=1)
>  Planning time: 3.840 ms
>  Execution time: 21987.913 ms
> (6 rows)
>
>
>> On Q20, things go wrong here:
>>
>> [snip]
>>
>> The estimate for the GroupAggregate feeding one side of the merge join
>> is quite accurate.  The estimate for the part-partsupp join on the
>> other side is off by 8x.  Then things get much worse: the estimate for
>> the merge join is off by 400x.
>>
>>
> Well, most of the estimation error comes from the join, but sadly the
> aggregate makes using the foreign keys impossible - at least in the current
> version. I don't know if it can be improved, somehow.
>
>> I'm not really sure whether the multivariate statistics stuff will fix
>> this kind of case or not, but if it did it would be awesome.
>>
>>
> Join statistics are something I'd like to add eventually, but I don't see
> how it could happen in the first version. Also, the patch received no
> reviews this CF, and making it even larger is unlikely to make it more
> attractive.
>

Moved to next CF with "needs review" status.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] WIP: multivariate statistics / proof of concept

2016-11-21 Thread Tomas Vondra

On 11/21/2016 11:10 PM, Robert Haas wrote:

[ reviving an old multivariate statistics thread ]

On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs  wrote:

On 12 October 2014 23:00, Tomas Vondra  wrote:


It however seems to be working sufficiently well at this point, enough
to get some useful feedback. So here we go.


This looks interesting and useful.

What I'd like to check before a detailed review is that this has
sufficient applicability to be useful.

My understanding is that Q9 and Q18 of TPC-H have poor plans as a
result of multi-column stats errors.

Could you look at those queries and confirm that this patch can
produce better plans for them?


Tomas, did you ever do any testing in this area?  One of my
colleagues, Rafia Sabih, recently did some testing of TPC-H queries @
20 GB.  Q18 actually doesn't complete at all right now because of an
issue with the new simplehash implementation.  I reported it to Andres
and he tracked it down, but hasn't posted the patch yet - see
http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxw...@alap3.anarazel.de

Of the remaining queries, the slowest are Q9 and Q20, and both of them
have serious estimation errors.  On Q9, things go wrong here:

 ->  Merge Join  (cost=5225092.04..6595105.57 rows=154 width=47) (actual time=103592.821..149335.010 rows=6503988 loops=1)
       Merge Cond: (partsupp.ps_partkey = lineitem.l_partkey)
       Join Filter: (lineitem.l_suppkey = partsupp.ps_suppkey)
       Rows Removed by Join Filter: 19511964
       ->  Index Scan using

> [snip]


Rows Removed by Filter: 756627

The estimate for the index scan on partsupp is essentially perfect,
and the lineitem-part join is off by about 3x.  However, the merge
join is off by about 4000x, which is real bad.



The patch only deals with statistics on base relations, no joins, at 
this point. It's meant to be extended in that direction, so the syntax 
supports it, but at this point that's all. No joins.


That being said, this estimate should be improved in 9.6, when you 
create a foreign key between the tables. In fact, that patch was exactly 
about Q9.
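
Concretely, the foreign key in question is something like this (a
sketch, assuming the standard TPC-H column names, with partsupp_pkey
already providing the referenced unique key):

    -- Hypothetical DDL: with this FK in place, the planner knows each
    -- lineitem row has exactly one matching partsupp row, so it can
    -- estimate the join at ~|lineitem| rows instead of multiplying
    -- the selectivities of the two join clauses.
    ALTER TABLE lineitem
        ADD FOREIGN KEY (l_partkey, l_suppkey)
        REFERENCES partsupp (ps_partkey, ps_suppkey);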


This is how the join estimate looks on scale 1 without the FK between 
the two tables:


                               QUERY PLAN
------------------------------------------------------------------------
 Merge Join  (cost=19.19..700980.12 rows=2404 width=261)
   Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
   ->  Index Scan using idx_lineitem_part_supp on lineitem  (cost=0.43..605856.84 rows=6001117 width=117)
   ->  Index Scan using partsupp_pkey on partsupp  (cost=0.42..61141.76 rows=80 width=144)
(4 rows)


and with the foreign key:

                               QUERY PLAN
------------------------------------------------------------------------
 Merge Join  (cost=19.19..700980.12 rows=6001117 width=261) (actual rows=6001215 loops=1)
   Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
   ->  Index Scan using idx_lineitem_part_supp on lineitem  (cost=0.43..605856.84 rows=6001117 width=117) (actual rows=6001215 loops=1)
   ->  Index Scan using partsupp_pkey on partsupp  (cost=0.42..61141.76 rows=80 width=144) (actual rows=6001672 loops=1)
 Planning time: 3.840 ms
 Execution time: 21987.913 ms
(6 rows)



On Q20, things go wrong here:

[snip]

The estimate for the GroupAggregate feeding one side of the merge join
is quite accurate.  The estimate for the part-partsupp join on the
other side is off by 8x.  Then things get much worse: the estimate for
the merge join is off by 400x.



Well, most of the estimation error comes from the join, but sadly the 
aggregate makes using the foreign keys impossible - at least in the 
current version. I don't know if it can be improved, somehow.



I'm not really sure whether the multivariate statistics stuff will fix
this kind of case or not, but if it did it would be awesome.



Join statistics are something I'd like to add eventually, but I don't 
see how it could happen in the first version. Also, the patch received 
no reviews this CF, and making it even larger is unlikely to make it 
more attractive.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2016-11-21 Thread Robert Haas
[ reviving an old multivariate statistics thread ]

On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs  wrote:
> On 12 October 2014 23:00, Tomas Vondra  wrote:
>
>> It however seems to be working sufficiently well at this point, enough
>> to get some useful feedback. So here we go.
>
> This looks interesting and useful.
>
> What I'd like to check before a detailed review is that this has
> sufficient applicability to be useful.
>
> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
> result of multi-column stats errors.
>
> Could you look at those queries and confirm that this patch can
> produce better plans for them?

Tomas, did you ever do any testing in this area?  One of my
colleagues, Rafia Sabih, recently did some testing of TPC-H queries @
20 GB.  Q18 actually doesn't complete at all right now because of an
issue with the new simplehash implementation.  I reported it to Andres
and he tracked it down, but hasn't posted the patch yet - see
http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxw...@alap3.anarazel.de

Of the remaining queries, the slowest are Q9 and Q20, and both of them
have serious estimation errors.  On Q9, things go wrong here:

 ->  Merge Join  (cost=5225092.04..6595105.57 rows=154 width=47) (actual time=103592.821..149335.010 rows=6503988 loops=1)
       Merge Cond: (partsupp.ps_partkey = lineitem.l_partkey)
       Join Filter: (lineitem.l_suppkey = partsupp.ps_suppkey)
       Rows Removed by Join Filter: 19511964
       ->  Index Scan using idx_partsupp_partkey on partsupp  (cost=0.43..781956.32 rows=15999792 width=22) (actual time=0.044..11825.481 rows=15999881 loops=1)
       ->  Sort  (cost=5224967.03..5245348.02 rows=8152396 width=45) (actual time=103592.505..112205.444 rows=26015949 loops=1)
             Sort Key: part.p_partkey
             Sort Method: quicksort  Memory: 704733kB
             ->  Hash Join  (cost=127278.36..4289121.18 rows=8152396 width=45) (actual time=1084.370..94732.951 rows=6503988 loops=1)
                   Hash Cond: (lineitem.l_partkey = part.p_partkey)
                   ->  Seq Scan on lineitem  (cost=0.00..3630339.08 rows=119994608 width=41) (actual time=0.015..33355.637 rows=119994608 loops=1)
                   ->  Hash  (cost=123743.07..123743.07 rows=282823 width=4) (actual time=1083.686..1083.686 rows=216867 loops=1)
                         Buckets: 524288  Batches: 1  Memory Usage: 11721kB
                         ->  Gather  (cost=1000.00..123743.07 rows=282823 width=4) (actual time=0.418..926.283 rows=216867 loops=1)
                               Workers Planned: 4
                               Workers Launched: 4
                               ->  Parallel Seq Scan on part  (cost=0.00..94460.77 rows=70706 width=4) (actual time=0.063..962.909 rows=43373 loops=5)
                                     Filter: ((p_name)::text ~~ '%grey%'::text)
                                     Rows Removed by Filter: 756627

The estimate for the index scan on partsupp is essentially perfect,
and the lineitem-part join is off by about 3x.  However, the merge
join is off by about 4000x, which is real bad.

On Q20, things go wrong here:

 ->  Merge Join  (cost=5928271.92..6411281.44 rows=278 width=16) (actual time=77887.963..136614.284 rows=118124 loops=1)
       Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
       Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity
       Rows Removed by Join Filter: 242
       ->  GroupAggregate  (cost=5363980.40..5691151.45 rows=9681876 width=48) (actual time=76672.726..131482.677 rows=10890067 loops=1)
             Group Key: lineitem.l_partkey, lineitem.l_suppkey
             ->  Sort  (cost=5363980.40..5409466.13 rows=18194291 width=21) (actual time=76672.661..86405.882 rows=18194084 loops=1)
                   Sort Key: lineitem.l_partkey, lineitem.l_suppkey
                   Sort Method: external merge  Disk: 551376kB
                   ->  Bitmap Heap Scan on lineitem  (cost=466716.05..3170023.42 rows=18194291 width=21) (actual time=13735.552..39289.995 rows=18195269 loops=1)
                         Recheck Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))

Re: [HACKERS] WIP: multivariate statistics / proof of concept

2016-01-19 Thread Gavin Flower

On 12/12/14 05:53, Heikki Linnakangas wrote:

On 10/13/2014 01:00 AM, Tomas Vondra wrote:

Hi,

attached is a WIP patch implementing multivariate statistics.


Great! Really glad to see you working on this.


+ * FIXME This sample sizing is mostly OK when computing stats for
+ *   individual columns, but when computing multi-variate stats
+ *   for multivariate stats (histograms, mcv, ...) it's rather
+ *   insufficient. For small number of dimensions it works, but
+ *   for complex stats it'd be nice to use a sample proportional to
+ *   the table (say, 0.5% - 1%) instead of a fixed size.


I don't think a fraction of the table is appropriate. As long as the 
sample is random, the accuracy of a sample doesn't depend much on the 
size of the population. For example, if you sample 1,000 rows from a 
table with 100,000 rows, or 1000 rows from a table with 100,000,000 
rows, the accuracy is pretty much the same. That doesn't change when 
you go from a single variable to multiple variables.


You do need a bigger sample with multiple variables, however. My gut 
feeling is that if you sample N rows for a single variable, with two 
variables you need to sample N^2 rows to get the same accuracy. But 
it's not proportional to the table size. (I have no proof for that, 
but I'm sure there is literature on this.)

[...]

I did stage III statistics at University many moons ago...

The accuracy of the sample only depends on the value of N, not the total 
size of the population, with the obvious constraint that N <= population 
size.


The standard error of a random sample is inversely proportional to the 
square root of N.  So using N = 100 would give a relative error of about 
10%; to reduce it to 5% you would need N = 400.
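
A worked check of those numbers, using the standard error of a sampled 
proportion, sqrt(p * (1 - p) / N), whose relative size scales as 
1 / sqrt(N):

    N = 100  ->  relative error ~ 1 / sqrt(100) = 10%
    N = 400  ->  relative error ~ 1 / sqrt(400) =  5%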


For multiple variables, it will also be a function of N - I don't recall 
precisely how, though I suspect it might be M * N, where M is the number 
of parameters (but I'm not as certain).  I think M^N might be needed if 
you want all the possible correlations between sets of variables to be 
reasonably significant - but I'm mostly just guessing here.


So using a % of table size is somewhat silly, looking at the above. 
However, if you want to detect frequencies that occur at the 1% level, 
then you will need to sample 1% of the table or greater.  So which 
approach is 'best' depends on what you are trying to determine. The 
sample size is more useful when you need to decide between 2 different 
hypotheses.


The sampling methodology is far more important than the ratio of N to 
population size - consider the bias imposed by using random telephone 
numbers, even before the advent of mobile phones!



Cheers,
Gavin




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2015-04-28 Thread Tomas Vondra

Hi,

On 04/28/15 19:36, Jeff Janes wrote:
>
...


Thanks. I think I tried that, but was still having trouble. But it
turns out that the trouble was for an unrelated reason, and I got it
to compile now.


Yeah, a new column was added to pg_proc the day after I submitted the 
patch. Will address that in a new version, hopefully in a few days.




Some of the fdw's need a patch as well in order to compile, see
attached.


Thanks, I forgot to tweak the clauselist_selectivity() calls in contrib :-(


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2015-04-28 Thread Jeff Janes
On Tue, Apr 28, 2015 at 9:13 AM, Stephen Frost  wrote:

> * Jeff Janes (jeff.ja...@gmail.com) wrote:
> > On Mon, Mar 30, 2015 at 5:26 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> > > attached is a new version of the patch series. Aside from fixing
> various
> > > issues (crashes, memory leaks). The patches are rebased to current
> > > master, and I also attach a few SQL scripts I used for testing (nothing
> > > fancy, just stress-testing all the parts the patch touches).
> >
> > I get cascading conflicts in pg_proc.h.  It looked easy enough to fix,
> > except then I get compiler errors:
>
> Yeah, those are because you didn't address the new column which was
> added to pg_proc.  You need to add another _null_ in the pg_proc.h lines
> in the correct place, apparently on four lines.
>

Thanks.  I think I tried that, but was still having trouble.  But it turns
out that the trouble was for an unrelated reason, and I got it to compile
now.

Some of the fdw's need a patch as well in order to compile, see attached.

Cheers,

Jeff


multivariate_contrib.patch
Description: Binary data



Re: [HACKERS] WIP: multivariate statistics / proof of concept

2015-04-28 Thread Stephen Frost
* Jeff Janes (jeff.ja...@gmail.com) wrote:
> On Mon, Mar 30, 2015 at 5:26 PM, Tomas Vondra wrote:
> > attached is a new version of the patch series. Aside from fixing various
> > issues (crashes, memory leaks). The patches are rebased to current
> > master, and I also attach a few SQL scripts I used for testing (nothing
> > fancy, just stress-testing all the parts the patch touches).
> 
> I get cascading conflicts in pg_proc.h.  It looked easy enough to fix,
> except then I get compiler errors:

Yeah, those are because you didn't address the new column which was
added to pg_proc.  You need to add another _null_ in the pg_proc.h lines
in the correct place, apparently on four lines.

Thanks!

Stephen




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2015-04-28 Thread Jeff Janes
On Mon, Mar 30, 2015 at 5:26 PM, Tomas Vondra wrote:

> Hello,
>
> attached is a new version of the patch series. Aside from fixing various
> issues (crashes, memory leaks). The patches are rebased to current
> master, and I also attach a few SQL scripts I used for testing (nothing
> fancy, just stress-testing all the parts the patch touches).
>

Hi Tomas,

I get cascading conflicts in pg_proc.h.  It looked easy enough to fix,
except then I get compiler errors:

funcapi.c: In function 'get_func_trftypes':
funcapi.c:890: warning: unused variable 'procStruct'
utils/fmgrtab.o:(.rodata+0x10cf8): undefined reference to `_null_'
utils/fmgrtab.o:(.rodata+0x10d18): undefined reference to `_null_'
utils/fmgrtab.o:(.rodata+0x10d38): undefined reference to `_null_'
utils/fmgrtab.o:(.rodata+0x10d58): undefined reference to `_null_'
collect2: ld returned 1 exit status
make[2]: *** [postgres] Error 1
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2
make: *** Waiting for unfinished jobs
make: *** [temp-install] Error 2


Cheers,

Jeff


Re: [HACKERS] WIP: multivariate statistics / proof of concept

2015-03-24 Thread Tomas Vondra

Hello,

On 03/24/15 06:34, Kyotaro HORIGUCHI wrote:


Sorry, not shown above, the *previous* t1 had been done "alter table
t1 add statistics (a, b, c)". Removing t1 didn't remove the setting.
reiniting cluster let me do that without error.


OK, thanks. My guess is this issue got already fixed in my working copy, 
but I will double-check that.


Admittedly, the management of the stats (e.g. removing stats when the 
table is dropped) is one of the incomplete parts. You have to delete the 
rows manually from pg_mv_statistic.
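
A sketch of that manual cleanup (hypothetical - the exact name of the 
pg_mv_statistic column holding the relation OID may differ in the patch):

    -- Delete orphaned multivariate stats whose table no longer exists;
    -- DROP TABLE does not yet remove them automatically in this version.
    DELETE FROM pg_mv_statistic s
     WHERE NOT EXISTS (SELECT 1 FROM pg_class c WHERE c.oid = s.starelid);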


--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2015-03-23 Thread Kyotaro HORIGUCHI
Hello,

> > Patch 0001 needs changes for OIDs since my patch was
> > committed. The attached is compatible with current master.
> 
> Thanks. I plan to submit a new version of the patch in a few days, with
> significant progress in various directions. I'll have to rebase to
> current master before submitting the new version anyway (which includes
> fixing duplicate OIDs).
> 
> > And I tried this like this, and got the following error on
> > analyze. But unfortunately I don't have enough time to
> > investigate it now.
> > 
> > postgres=# create table t1 (a int, b int, c int);
> > insert into t1 (select a/ 1, a / 1, a / 1 from
> > generate_series(0, 9) a);
> > postgres=# analyze t1;
> > ERROR:  invalid memory alloc request size 1485176862
> 
> Interesting - particularly because this does not involve any
> multivariate stats. I can't reproduce it with the current version of the
> patch, so either it's unrelated, or I've fixed it since posting the last
> version.

Sorry, not shown above, the *previous* t1 had been done "alter
table t1 add statistics (a, b, c)". Removing t1 didn't remove the
setting. reiniting cluster let me do that without error.

The steps throughout were as follows:
===
create table t1 (a int, b int, c int);
alter table t1 add statistics (histogram) on (a, b, c);
drop table t1;  -- This does not remove the above setting.
create table t1 (a int, b int, c int);
insert into t1 (select a/ 1, a / 1, a / 1 from generate_series(0, 9) a);
regards,
-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] WIP: multivariate statistics / proof of concept

2015-03-20 Thread Tomas Vondra
Hello,

On 20.3.2015 09:33, Kyotaro HORIGUCHI wrote:
> Hello,
> 
> 
> Patch 0001 needs changes for OIDs since my patch was
> committed. The attached is compatible with current master.

Thanks. I plan to submit a new version of the patch in a few days, with
significant progress in various directions. I'll have to rebase to
current master before submitting the new version anyway (which includes
fixing duplicate OIDs).

> And I tried this like this, and got the following error on
> analyze. But unfortunately I don't have enough time to
> investigate it now.
> 
> postgres=# create table t1 (a int, b int, c int);
> insert into t1 (select a/ 1, a / 1, a / 1 from
> generate_series(0, 9) a);
> postgres=# analyze t1;
> ERROR:  invalid memory alloc request size 1485176862

Interesting - particularly because this does not involve any
multivariate stats. I can't reproduce it with the current version of the
patch, so either it's unrelated, or I've fixed it since posting the last
version.

regards

-- 
Tomas Vondra    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2015-03-20 Thread Kyotaro HORIGUCHI
Hello,


Patch 0001 needs changes for OIDs since my patch was
committed. The attached is compatible with current master.

And I tried this like this, and got the following error on
analyze. But unfortunately I don't have enough time to
investigate it now.

postgres=# create table t1 (a int, b int, c int);
insert into t1 (select a/ 1, a / 1, a / 1 from generate_series(0, 
9) a);
postgres=# analyze t1;
ERROR:  invalid memory alloc request size 1485176862

regards,


At Sat, 24 Jan 2015 21:21:39 +0100, Tomas Vondra wrote in <54c3fed3.1060...@2ndquadrant.com>
> Hi,
> 
> attached is an updated version of the multivariate stats patch. This is
> going to be a bit longer mail, so I'll put here a small ToC ;-)
> 
> 1) patch split into 4 parts
> 2) where to start / documentation
> 3) state of the code
> 4) main changes/improvements
> 5) remaining limitations
> 
> The motivation and design ideas, explained in the first message of this
> thread are still valid. It might be a good idea to read it first:
> 
>   http://www.postgresql.org/message-id/flat/543afa15.4080...@fuzzy.cz
> 
> BTW if you happen to go to FOSDEM [PGDay], I'll gladly give you an intro
> into the patch in person, or discuss the patch in general.
> 
> 
> 1) Patch split into 4 parts
> ---
> Firstly, the patch got broken into the following four pieces, to make
> the reviews somewhat easier:
> 
> 1) 0001-shared-infrastructure-and-functional-dependencies.patch
> 
>- infrastructure, shared by all the kinds of stats added
>  in the following patches (catalog, ALTER TABLE, ANALYZE ...)
> 
>- implementation of a simple statistics, tracking functional
>  dependencies between columns (previously called "associative
>  rules", but that's incorrect for several reasons)
> 
>- this does not modify the optimizer in any way
> 2) 0002-clause-reduction-using-functional-dependencies.patch
> 
>- applies the functional dependencies to optimizer (i.e. considers
>  the rules in clauselist_selectivity())
> 
> 3) 0003-multivariate-MCV-lists.patch
> 
>- multivariate MCV lists (both ANALYZE and optimizer parts)
> 
> 4) 0004-multivariate-histograms.patch
> 
>- multivariate histograms (both ANALYZE and optimizer parts)
> 
> 
> You may look at the patches at github here:
> 
>   https://github.com/tvondra/postgres/tree/multivariate-stats-squashed
> 
> The branch is not stable, i.e. I'll rebase / squash / force-push changes
> in the future. (There's also multivariate-stats development branch with
> unsquashed changes, but you don't want to look at that, trust me.)
> 
> The patches are not exactly small (being in the 50-100 kB range), but
> that's mostly because of the amount of comments explaining the goals and
> implementation details.
> 
> 
> 2) Where to start / documentation
> -
> I strived to document all the pieces properly, mostly in the form of
> comments. There's no sgml documentation at this point, which should
> obviously change in the future.
> 
> Anyway, I'd suggest reading the first e-mail in this thread, explaining
> the ideas, and then these comments:
> 
> 1) functional dependencies (patch 0001)
>- src/backend/utils/mvstats/dependencies.c
> 
> 2) MCV lists (patch 0003)
>- src/backend/utils/mvstats/mcv.c
> 
> 3) histograms (patch 0004)
>- src/backend/utils/mvstats/histogram.c
> 
>- also see clauselist_mv_selectivity_mcvlist() in clausesel.c
>- also see clauselist_mv_selectivity_histogram() in clausesel.c
> 
> 4) selectivity estimation (patches 0002-0004)
>- all in src/backend/optimizer/path/clausesel.c
>- clauselist_selectivity() - overview of how the stats are applied
>- clauselist_apply_dependencies() - functional dependencies reduction
>- clauselist_mv_selectivity_mcvlist() - MCV list estimation
>- clauselist_mv_selectivity_histogram() - histogram estimation
> 
> 
> 3) State of the code
> 
> I've spent a fair amount of time testing the patches, and while I
> believe there are no segfaults or so, I know parts of the code need a
> bit more love.
> 
> The part most in need of improvements / comments is probably the code in
> clausesel.c - that seems a bit quirky. Reviews / comments regarding this
> part of the code are very welcome - I'm sure there are many ways to
> improve this part.
> 
> There are a few FIXMEs elsewhere (e.g. about memory allocation in the
> (de)serialization code), but those are mostly well-defined issues that I
> know how to address (at least I believe so).
> 
> 
> 4) Main changes/improvements
> 
> There are many significant improvements. The previous patch version was
> in the 'proof of concept' category (missing pieces, knowingly broken in
> some areas), the current patch should 'mostly work'.
> 
> The patch fixes two most annoying limitations of the first version:
> 
>   (a) support for all data types (not just those passed by value)
>   (b) handles NUL

Re: [HACKERS] WIP: multivariate statistics / proof of concept

2015-01-15 Thread Michael Paquier
On Mon, Dec 15, 2014 at 11:55 AM, Michael Paquier wrote:
> On Wed, Dec 10, 2014 at 5:15 AM, Tomas Vondra  wrote:
>> I agree with moving the patch to the next CF - I'm working on the patch,
>> but I will take a bit more time to submit a new version and I can do
>> that in the next CF.
> OK cool. I just moved it by myself. I didn't see it yet registered in 2014-12.
Marked as returned with feedback. No new version showed up in the last
month and this patch was waiting for input from author.
-- 
Michael




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-12-14 Thread Michael Paquier
On Wed, Dec 10, 2014 at 5:15 AM, Tomas Vondra  wrote:
> I agree with moving the patch to the next CF - I'm working on the patch,
> but I will take a bit more time to submit a new version and I can do
> that in the next CF.
OK cool. I just moved it by myself. I didn't see it yet registered in 2014-12.
Thanks,
-- 
Michael




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-12-11 Thread Tomas Vondra
On 11.12.2014 17:53, Heikki Linnakangas wrote:
> On 10/13/2014 01:00 AM, Tomas Vondra wrote:
>> Hi,
>>
>> attached is a WIP patch implementing multivariate statistics.
> 
> Great! Really glad to see you working on this.
> 
>> + * FIXME This sample sizing is mostly OK when computing stats for
>> + *   individual columns, but when computing multi-variate stats
>> + *   for multivariate stats (histograms, mcv, ...) it's rather
>> + *   insufficient. For small number of dimensions it works, but
>> + *   for complex stats it'd be nice to use a sample proportional to
>> + *   the table (say, 0.5% - 1%) instead of a fixed size.
> 
> I don't think a fraction of the table is appropriate. As long as the 
> sample is random, the accuracy of a sample doesn't depend much on
> the size of the population. For example, if you sample 1,000 rows
> from a table with 100,000 rows, or 1000 rows from a table with
> 100,000,000 rows, the accuracy is pretty much the same. That doesn't
> change when you go from a single variable to multiple variables.

I might be wrong, but I doubt that. First, I read a number of papers
while working on this patch, and all of them used samples proportional
to the data set. That's indirect evidence, though.

> You do need a bigger sample with multiple variables, however. My gut 
> feeling is that if you sample N rows for a single variable, with two 
> variables you need to sample N^2 rows to get the same accuracy. But
> it's not proportional to the table size. (I have no proof for that,
> but I'm sure there is literature on this.)

Maybe. I think it's somehow related to the number of buckets (which
somehow determines the precision of the histogram). If you want 1000
buckets, the number of rows scanned needs to be e.g. 10x that. With
multi-variate histograms, we may shoot for more buckets (say, 100 in
each dimension).
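
(Worked out: at 100 buckets per dimension, a two-dimensional histogram 
already has 100^2 = 10,000 buckets, so the "10x rows per bucket" rule of 
thumb above would call for roughly 100,000 sampled rows - a number that 
grows with the bucket count, not with the table size.)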

> 
>> + * Multivariate histograms
>> + *
>> + * Histograms are a collection of buckets, represented by n-dimensional
>> + * rectangles. Each rectangle is delimited by an array of lower and
>> + * upper boundaries, so that for the i-th attribute
>> + *
>> + * min[i] <= value[i] <= max[i]
>> + *
>> + * Each bucket tracks frequency (fraction of tuples it contains),
>> + * information about the inequalities, number of distinct values in
>> + * each dimension (which is used when building the histogram) etc.
>> + *
>> + * The boundaries may be either inclusive or exclusive, or the whole
>> + * dimension may be NULL.
>> + *
>> + * The buckets may overlap (assuming the build algorithm keeps the
>> + * frequencies additive) or may not cover the whole space (i.e. allow
>> + * gaps). This entirely depends on the algorithm used to build the
>> + * histogram.
> 
> That sounds pretty exotic. These buckets are quite different from
> the single-dimension buckets we currently have.
> 
> The paper you reference in partition_bucket() function, M. 
> Muralikrishna, David J. DeWitt: Equi-Depth Histograms For Estimating 
> Selectivity Factors For Multi-Dimensional Queries. SIGMOD Conference 
> 1988: 28-36, actually doesn't mention overlapping buckets at all. I 
> haven't read the code in detail, but if it implements the algorithm
> from that paper, there will be no overlap.

The algorithm implemented in partition_bucket() is very simple and
naive, and it mostly resembles the algorithm described in the paper. I'm
sure there are differences, it's not a 1:1 implementation, but you're
right it produces non-overlapping buckets.
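
For intuition, the one-dimensional analogue of equi-depth bucketing is 
easy to express in SQL (an illustration of the idea only, not the 
patch's code; t1(a) as in the examples elsewhere in this thread):

    -- Split column a into 4 buckets holding roughly equal numbers of
    -- rows; the bucket boundaries adapt to the data distribution.
    SELECT bucket, min(a) AS lo, max(a) AS hi, count(*) AS freq
      FROM (SELECT a, ntile(4) OVER (ORDER BY a) AS bucket FROM t1) s
     GROUP BY bucket
     ORDER BY bucket;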

The point is that I envision more complex algorithms or different
histogram types, and some of them may produce overlapping buckets. Maybe
that's premature comment, and it will turn out it's not really necessary.

regards
Tomas




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-12-11 Thread Heikki Linnakangas

On 10/13/2014 01:00 AM, Tomas Vondra wrote:

Hi,

attached is a WIP patch implementing multivariate statistics.


Great! Really glad to see you working on this.


+* FIXME This sample sizing is mostly OK when computing stats for
+*   individual columns, but when computing multi-variate stats
+*   for multivariate stats (histograms, mcv, ...) it's rather
+*   insufficient. For small number of dimensions it works, but
+*   for complex stats it'd be nice to use a sample proportional to
+*   the table (say, 0.5% - 1%) instead of a fixed size.


I don't think a fraction of the table is appropriate. As long as the 
sample is random, the accuracy of a sample doesn't depend much on the 
size of the population. For example, if you sample 1,000 rows from a 
table with 100,000 rows, or 1000 rows from a table with 100,000,000 
rows, the accuracy is pretty much the same. That doesn't change when you 
go from a single variable to multiple variables.


You do need a bigger sample with multiple variables, however. My gut 
feeling is that if you sample N rows for a single variable, with two 
variables you need to sample N^2 rows to get the same accuracy. But it's 
not proportional to the table size. (I have no proof for that, but I'm 
sure there is literature on this.)



+ * Multivariate histograms
+ *
+ * Histograms are a collection of buckets, represented by n-dimensional
+ * rectangles. Each rectangle is delimited by an array of lower and
+ * upper boundaries, so that for the i-th attribute
+ *
+ * min[i] <= value[i] <= max[i]
+ *
+ * Each bucket tracks frequency (fraction of tuples it contains),
+ * information about the inequalities, number of distinct values in
+ * each dimension (which is used when building the histogram) etc.
+ *
+ * The boundaries may be either inclusive or exclusive, or the whole
+ * dimension may be NULL.
+ *
+ * The buckets may overlap (assuming the build algorithm keeps the
+ * frequencies additive) or may not cover the whole space (i.e. allow
+ * gaps). This entirely depends on the algorithm used to build the
+ * histogram.


That sounds pretty exotic. These buckets are quite different from the 
single-dimension buckets we currently have.


The paper you reference in partition_bucket() function, M. 
Muralikrishna, David J. DeWitt: Equi-Depth Histograms For Estimating 
Selectivity Factors For Multi-Dimensional Queries. SIGMOD Conference 
1988: 28-36, actually doesn't mention overlapping buckets at all. I 
haven't read the code in detail, but if it implements the algorithm from 
that paper, there will be no overlap.


- Heikki




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-12-09 Thread Tomas Vondra
On 8.12.2014 02:01, Michael Paquier wrote:
> On Sun, Nov 16, 2014 at 3:35 AM, Tomas Vondra  wrote:
>> Thanks for the link. I've been looking for a good dataset with such
>> data, and this one is by far the best one.
>>
>> The current version of the patch supports only data types passed by
>> value (i.e. no varlena types - text, ), which means it's impossible to
>> build multivariate stats on some of the interesting columns (state,
>> city, ...).
>>
>> I guess it's time to start working on removing this limitation.
> Tomas, what's your status on this patch? Are you planning to make it
> more complicated than it is? For now I have switched it to a "Needs
> Review" state because even your first version did not get advanced
> review (that's quite big btw). I guess that we should switch it to the
> next CF.

Hello Michael,

I agree with moving the patch to the next CF - I'm working on the patch,
but I will take a bit more time to submit a new version and I can do
that in the next CF.

regards
Tomas




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-12-07 Thread Michael Paquier
On Sun, Nov 16, 2014 at 3:35 AM, Tomas Vondra  wrote:
> Thanks for the link. I've been looking for a good dataset with such
> data, and this one is by far the best one.
>
> The current version of the patch supports only data types passed by
> value (i.e. no varlena types - text, ), which means it's impossible to
> build multivariate stats on some of the interesting columns (state,
> city, ...).
>
> I guess it's time to start working on removing this limitation.
Tomas, what's your status on this patch? Are you planning to make it
more complicated than it is? For now I have switched it to a "Needs
Review" state because even your first version did not get advanced
review (that's quite big btw). I guess that we should switch it to the
next CF.
Regards,
-- 
Michael




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-11-15 Thread Tomas Vondra
On 15.11.2014 18:49, Kevin Grittner wrote:
> If you eliminate the quals besides the zipcode column you get 61
> rows and it gets much stranger, with legal municipalities that are
> completely surrounded by Madison that the postal service would
> rather you didn't use in addressing your envelopes, but they have
> to deliver to anyway, and organizations inside Madison receiving
> enough mail to (literally) have their own zip code -- where the
> postal service allows the organization name as a deliverable
> "city".
> 
> If you want to have your own fun with this data, you can download
> it here:
> 
> http://federalgovernmentzipcodes.us/free-zipcode-database.csv
>
...
> 
> I bet there are all sorts of correlation possibilities with, for
> example, latitude and longitude and other variables.  With 81831
> rows and so many correlations among the columns, it might be a
> useful data set to test with.

Thanks for the link. I've been looking for a good dataset with such
data, and this one is by far the best one.

The current version of the patch supports only data types passed by
value (i.e. no varlena types - text, ), which means it's impossible to
build multivariate stats on some of the interesting columns (state,
city, ...).

I guess it's time to start working on removing this limitation.

Tomas




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-11-15 Thread Kevin Grittner
Tomas Vondra  wrote:
> On 13 November 2014 16:51, Katharina Büchse wrote:
>> On 13.11.2014 14:11, Tomas Vondra wrote:
>>
>>> The only place where I think this might work are the associative rules.
>>> It's simple to specify rules like ("ZIP code" implies "city") and we could
>>> even do some simple check against the data to see if it actually makes
>>> sense (and 'disable' the rule if not).
>>
>> and even this simple example has its limits, at least in Germany ZIP
>> codes are not unique for rural areas, where several villages have the
>> same ZIP code.

> as you point out most real-world data either contain bugs
> or are inherently imperfect (we have the same kind of ZIP/city
> inconsistencies in Czech).

You can have lots of fun with U.S. zip codes, too. Just on the
nominally "Madison, Wisconsin" zip codes (those starting with 537),
there are several exceptions:

select zipcode, city, locationtype
from zipcode
where zipcode like '537%'
and Decommisioned = 'false'
and zipcodetype = 'STANDARD'
and locationtype in ('PRIMARY', 'ACCEPTABLE')
order by zipcode, city;

 zipcode |   city    | locationtype
---------+-----------+--------------
 53703   | MADISON   | PRIMARY
 53704   | MADISON   | PRIMARY
 53705   | MADISON   | PRIMARY
 53706   | MADISON   | PRIMARY
 53711   | FITCHBURG | ACCEPTABLE
 53711   | MADISON   | PRIMARY
 53713   | FITCHBURG | ACCEPTABLE
 53713   | MADISON   | PRIMARY
 53713   | MONONA    | ACCEPTABLE
 53714   | MADISON   | PRIMARY
 53714   | MONONA    | ACCEPTABLE
 53715   | MADISON   | PRIMARY
 53716   | MADISON   | PRIMARY
 53716   | MONONA    | ACCEPTABLE
 53717   | MADISON   | PRIMARY
 53718   | MADISON   | PRIMARY
 53719   | FITCHBURG | ACCEPTABLE
 53719   | MADISON   | PRIMARY
 53725   | MADISON   | PRIMARY
 53726   | MADISON   | PRIMARY
 53744   | MADISON   | PRIMARY
(21 rows)

If you eliminate the quals besides the zipcode column you get 61
rows and it gets much stranger, with legal municipalities that are
completely surrounded by Madison that the postal service would
rather you didn't use in addressing your envelopes, but they have
to deliver to anyway, and organizations inside Madison receiving
enough mail to (literally) have their own zip code -- where the
postal service allows the organization name as a deliverable
"city".

If you want to have your own fun with this data, you can download
it here:

http://federalgovernmentzipcodes.us/free-zipcode-database.csv

I was able to load it into PostgreSQL with this:

create table zipcode
(
recordnumber integer not null,
zipcode text not null,
zipcodetype text not null,
city text not null,
state text not null,
locationtype text not null,
lat double precision,
long double precision,
xaxis double precision not null,
yaxis double precision not null,
zaxis double precision not null,
worldregion text not null,
country text not null,
locationtext text,
location text,
decommisioned text not null,
taxreturnsfiled bigint,
estimatedpopulation bigint,
totalwages bigint,
notes text
);
comment on column zipcode.zipcode is 'Zipcode or military postal code(FPO/APO)';
comment on column zipcode.zipcodetype is 'Standard, PO BOX Only, Unique, 
Military(implies APO or FPO)';
comment on column zipcode.city is 'official city name(s)';
comment on column zipcode.state is 'official state, territory, or quasi-state 
(AA, AE, AP) abbreviation code';
comment on column zipcode.locationtype is 'Primary, Acceptable,Not Acceptable';
comment on column zipcode.lat is 'Decimal Latitude, if available';
comment on column zipcode.long is 'Decimal Longitude, if available';
comment on column zipcode.location is 'Standard Display (eg Phoenix, AZ ; Pago 
Pago, AS ; Melbourne, AU )';
comment on column zipcode.decommisioned is 'If Primary location, Yes implies 
historical Zipcode, No Implies current Zipcode; If not Primary, Yes implies 
Historical Placename';
comment on column zipcode.taxreturnsfiled is 'Number of Individual Tax Returns 
Filed in 2008';
copy zipcode from 'filepath' with (format csv, header);
alter table zipcode add primary key (recordnumber);
create unique index zipcode_city on zipcode (zipcode, city);

I bet there are all sorts of correlation possibilities with, for
example, latitude and longitude and other variables.  With 81831
rows and so many correlations among the columns, it might be a
useful data set to test with.
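
For example, a query combining two strongly correlated columns (names as 
in the table definition above) is exactly the case where the planner's 
independence assumption underestimates row counts:

    -- Nearly every MADISON row has state = 'WI', so the true
    -- selectivity is close to that of the city clause alone, while
    -- independent single-column stats multiply the two selectivities.
    SELECT count(*)
      FROM zipcode
     WHERE city = 'MADISON'
       AND state = 'WI';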

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-11-13 Thread Tomas Vondra
On 13 November 2014 16:51, Katharina Büchse wrote:
> On 13.11.2014 14:11, Tomas Vondra wrote:
>
>> The only place where I think this might work are the associative rules.
>> It's simple to specify rules like ("ZIP code" implies "city") and we
>> could
>> even do some simple check against the data to see if it actually makes
>> sense (and 'disable' the rule if not).
>
> and even this simple example has its limits, at least in Germany ZIP
> codes are not unique for rural areas, where several villages have the
> same ZIP code.
>
> I guess there are just a few examples where columns are completely
> functionally dependent without any exceptions.
> But of course, if the user gives this information just for optimizing
> the statistics, some exceptions don't matter.
> If this information should be used for creating different execution
> plans (e.g. on column A is an index and column B is functional
> dependent, one could think about using this index on A and the
> dependency instead of running through the whole table to find all tuples
> that fit the query on column B), exceptions are a very important issue.

Yes, exactly. The aim of this patch is "only" improving estimates, not
removing conditions from the plan (e.g. checking only the ZIP code and not
the city name). That certainly can't be done solely based on approximate
statistics, and as you point out most real-world data either contain bugs
or are inherently imperfect (we have the same kind of ZIP/city
inconsistencies in Czech). That's not a big issue for estimates (assuming
only a small fraction of rows violates the rule) though.
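
In the ALTER TABLE syntax used elsewhere in this thread, declaring and 
validating such a rule might look like this (a sketch - the table name 
and the exact option name for functional dependencies are assumptions):

    -- Hypothetical: track the approximate rule "zip determines city".
    -- ANALYZE checks it against a sample; the planner then uses it only
    -- to improve estimates, never to skip the city predicate itself.
    ALTER TABLE addresses ADD STATISTICS (dependencies) ON (zip, city);
    ANALYZE addresses;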

Tomas





Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-11-13 Thread Katharina Büchse

On 13.11.2014 14:11, Tomas Vondra wrote:

On 13 November 2014 12:31, Simon Riggs wrote:

On 12 October 2014 23:00, Tomas Vondra  wrote:


It however seems to be working sufficiently well at this point, enough
to get some useful feedback. So here we go.

This looks interesting and useful.

What I'd like to check before a detailed review is that this has
sufficient applicability to be useful.

My understanding is that Q9 and Q18 of TPC-H have poor plans as a
result of multi-column stats errors.

Could you look at those queries and confirm that this patch can
produce better plans for them?

Sure. I planned to do such verification/demonstration anyway, after
discussing the overall approach.

I planned to give it a try on TPC-DS, but I can start with the TPC-H
queries you propose. I'm not sure whether the poor estimates in Q9 & Q18
come from column correlation though - if it's due to some other issues
(e.g. conditions that are difficult to estimate), this patch can't do
anything with them. But it's a good start.


If so, I will work with you to review this patch.

Thanks!


One aspect of the patch that seems to be missing is a user declaration
of correlation, just as we have for setting n_distinct. It seems like
an even easier place to start to just let the user specify the stats
declaratively. That way we can split the patch into two parts. First,
allow multi column stats that are user declared. Then add user stats
collected by ANALYZE. The first part is possibly contentious and thus
a good initial focus. The second part will have lots of discussion, so
good to skip for a first version.

I'm not a big fan of this approach, for a number of reasons.

Firstly, it only works for "simple" parameters that are trivial to specify
(say, Pearson's correlation coefficient), and the patch does not work with
those at all - it only works with histograms, MCV lists (and might work
with associative rules in the future). And we certainly can't ask users to
specify multivariate histograms - because it's very difficult to do, and
also because complex stats are more susceptible to get stale after adding
new data to the table.

Secondly, even if we add such "simple" parameters to the patch, we have to
come up with a  way to apply those parameters to the estimates. The
problem is that as the parameters get simpler, it's less and less useful
to compute the stats.

Another question is whether it should support more than 2 columns ...

The only place where I think this might work are the associative rules.
It's simple to specify rules like ("ZIP code" implies "city") and we could
even do some simple check against the data to see if it actually makes
sense (and 'disable' the rule if not).
and even this simple example has its limits, at least in Germany ZIP 
codes are not unique for rural areas, where several villages have the 
same ZIP code.


I guess there are just a few examples where columns are completely 
functionally dependent without any exceptions.
But of course, if the user gives this information just for optimizing 
the statistics, some exceptions don't matter.
If this information should be used for creating different execution 
plans (e.g. on column A is an index and column B is functional 
dependent, one could think about using this index on A and the 
dependency instead of running through the whole table to find all tuples 
that fit the query on column B), exceptions are a very important issue.


But maybe I got it wrong and you have something particular in mind? Can
you give an example of how it would work?

regards
Tomas






--
Dipl.-Math. Katharina Büchse
Friedrich-Schiller-Universität Jena
Institut für Informatik
Lehrstuhl für Datenbanken und Informationssysteme
Ernst-Abbe-Platz 2
07743 Jena
Telefon 03641/946367
Webseite http://users.minet.uni-jena.de/~re89qen/





Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-11-13 Thread Tomas Vondra
On 13 November 2014 12:31, Simon Riggs wrote:
> On 12 October 2014 23:00, Tomas Vondra  wrote:
>
>> It however seems to be working sufficiently well at this point, enough
>> to get some useful feedback. So here we go.
>
> This looks interesting and useful.
>
> What I'd like to check before a detailed review is that this has
> sufficient applicability to be useful.
>
> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
> result of multi-column stats errors.
>
> Could you look at those queries and confirm that this patch can
> produce better plans for them?

Sure. I planned to do such verification/demonstration anyway, after
discussing the overall approach.

I planned to give it a try on TPC-DS, but I can start with the TPC-H
queries you propose. I'm not sure whether the poor estimates in Q9 & Q18
come from column correlation though - if it's due to some other issues
(e.g. conditions that are difficult to estimate), this patch can't do
anything with them. But it's a good start.

> If so, I will work with you to review this patch.

Thanks!

> One aspect of the patch that seems to be missing is a user declaration
> of correlation, just as we have for setting n_distinct. It seems like
> an even easier place to start to just let the user specify the stats
> declaratively. That way we can split the patch into two parts. First,
> allow multi column stats that are user declared. Then add user stats
> collected by ANALYZE. The first part is possibly contentious and thus
> a good initial focus. The second part will have lots of discussion, so
> good to skip for a first version.

I'm not a big fan of this approach, for a number of reasons.

Firstly, it only works for "simple" parameters that are trivial to specify
(say, Pearson's correlation coefficient), and the patch does not work with
those at all - it only works with histograms, MCV lists (and might work
with associative rules in the future). And we certainly can't ask users to
specify multivariate histograms - because it's very difficult to do, and
also because complex stats are more susceptible to get stale after adding
new data to the table.

Secondly, even if we add such "simple" parameters to the patch, we have to
come up with a way to apply those parameters to the estimates. The
problem is that as the parameters get simpler, it's less and less useful
to compute the stats.

Another question is whether it should support more than 2 columns ...

The only place where I think this might work are the associative rules.
It's simple to specify rules like ("ZIP code" implies "city") and we could
even do some simple check against the data to see if it actually makes
sense (and 'disable' the rule if not).

But maybe I got it wrong and you have something particular in mind? Can
you give an example of how it would work?

regards
Tomas





Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-11-13 Thread Simon Riggs
On 12 October 2014 23:00, Tomas Vondra  wrote:

> It however seems to be working sufficiently well at this point, enough
> to get some useful feedback. So here we go.

This looks interesting and useful.

What I'd like to check before a detailed review is that this has
sufficient applicability to be useful.

My understanding is that Q9 and Q18 of TPC-H have poor plans as a
result of multi-column stats errors.

Could you look at those queries and confirm that this patch can
produce better plans for them?

If so, I will work with you to review this patch.

One aspect of the patch that seems to be missing is a user declaration
of correlation, just as we have for setting n_distinct. It seems like
an even easier place to start to just let the user specify the stats
declaratively. That way we can split the patch into two parts. First,
allow multi column stats that are user declared. Then add user stats
collected by ANALYZE. The first part is possibly contentious and thus
a good initial focus. The second part will have lots of discussion, so
good to skip for a first version.
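
For reference, the existing per-column override mentioned above looks 
like this; a declarative multi-column version would presumably extend 
the same pattern:

    -- Current single-column knob: tell ANALYZE to use a fixed
    -- n_distinct for t1.a instead of estimating it from the sample.
    ALTER TABLE t1 ALTER COLUMN a SET (n_distinct = 1000);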

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-10-30 Thread Tomas Vondra
On 30 October 2014 10:17, David Rowley wrote:
> On Thu, Oct 30, 2014 at 12:48 AM, Tomas Vondra  wrote:
>
>> On 29 October 2014 12:31, Petr Jelinek wrote:
>> >> I've not really gotten around to looking at the patch yet, but I'm
>> also
>> >> wondering if it would be simple include allowing functional
>> statistics
>> >> too. The pg_mv_statistic name seems to indicate multi columns, but
>> how
>> >> about stats on date(datetime_column), or perhaps any non-volatile
>> >> function. This would help to solve the problem highlighted here
>> >> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9uf34...@mail.gmail.com .
>> >> Without giving it too much thought, perhaps any expression that can be
>> >> indexed should be allowed to have stats? Would that be really difficult
>> >> to implement in comparison to what you've already done with the patch so
>> >> far?
>> >>
>> >
>> > I would not over-complicate requirements for the first version of
>> this,
>> > I think it's already complicated enough.
>>
>> My thoughts, exactly. I'm not willing to put more features into the
>> initial version of the patch. Actually, I'm thinking about ripping out
>> some experimental features (particularly "hashed MCV" and "associative
>> rules").
>>
>>
> That's fair, but I didn't really mean to imply that you should go work on
> that too and that it should be part of this patch..
> I was thinking more along the lines of that I don't really agree with the
> table name for the new stats and that at some later date someone will want
> to add expression stats and we'd probably better come up design that would
> be friendly towards that. At this time I can only think that the name of
> the table might not suit well to expression stats, I'd hate to see someone
> have to invent a 3rd table to support these when we could likely come up
> with something that could be extended later and still make sense both
> today
> and in the future.
>
> I was just looking at how expression indexes are stored in pg_index and I
> see that if it's an expression index that the expression is stored in
> the indexprs column which is of type pg_node_tree, so quite possibly at
> some point in the future the new stats table could just have an extra
> column added, and for today, we'd just need to come up with a future proof
> name... Perhaps pg_statistic_ext or pg_statisticx, and name functions and
> source files something along those lines instead?

Ah, OK. I don't think the catalog name "pg_mv_statistic" is inappropriate
for this purpose, though. IMHO "multivariate" does not mean "only columns"
or "no expressions" - it simply says that the approximated density
function has multiple input variables, be they attributes or expressions.
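
To make that reading concrete (my notation, nothing from the patch):
per-column statistics approximate only the marginal densities, and
combining them amounts to an independence assumption,

  f(x_1, ..., x_n) ~= f_1(x_1) * f_2(x_2) * ... * f_n(x_n)

whereas multivariate statistics approximate the joint density f itself,
regardless of whether each x_i is a plain attribute or an expression.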

But maybe there's a better name.

Tomas





Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-10-30 Thread David Rowley
On Thu, Oct 30, 2014 at 12:21 AM, Tomas Vondra wrote:

> On 29 October 2014, 10:41, David Rowley wrote:
> > I'm quite interested in reviewing your work on this, but it appears that
> > some of your changes are not C89:
> >
> >  src\backend\commands\analyze.c(3774): error C2057: expected constant
> > expression [D:\Postgres\a\postgres.vcxproj]
> >  src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
> > array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
> >  src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown
> > size [D:\Postgres\a\postgres.vcxproj]
> >  src\backend\commands\analyze.c(4302): error C2057: expected constant
> > expression [D:\Postgres\a\postgres.vcxproj]
> >  src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
> > array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
> >  src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' :
> unknown
> > size [D:\Postgres\a\postgres.vcxproj]
> >  src\backend\commands\analyze.c(4775): error C2057: expected constant
> > expression [D:\Postgres\a\postgres.vcxproj]
> >  src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
> > array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
> >  src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size
> > [D:\Postgres\a\postgres.vcxproj]
> >
>
> I'll look into that. The thing is I don't have access to MSVC, so it's
> a bit difficult to spot / fix those issues :-(
>
>
It should be a pretty simple fix - just use the files and line numbers
from the errors above. The problem is that in those 3 places you're
declaring an array of a variable size, which is not allowed in C89. The
thing to do instead is just to palloc() the size you need and then
pfree() it when you're done.
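
A minimal sketch of that change, reusing one of the array names from the
compiler output (the length variable is made up, since I don't have the
patch open in front of me):

  /*
   * Before: a C99 variable-length array, rejected by strict C89 compilers:
   *     int indexes[natts];
   * After: allocate explicitly from the current memory context.
   */
  int *indexes = (int *) palloc(natts * sizeof(int));

  /* ... use indexes[0] .. indexes[natts - 1] exactly as before ... */

  pfree(indexes);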

Regards

David Rowley


Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-10-30 Thread David Rowley
On Thu, Oct 30, 2014 at 12:48 AM, Tomas Vondra wrote:

> On 29 October 2014, 12:31, Petr Jelinek wrote:
> >> I've not really gotten around to looking at the patch yet, but I'm also
> >> wondering if it would be simple to allow functional statistics too.
> >> The pg_mv_statistic name seems to indicate multi columns, but how
> >> about stats on date(datetime_column), or perhaps any non-volatile
> >> function. This would help to solve the problem highlighted here:
> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9uf34...@mail.gmail.com
> >> Without giving it too much thought, perhaps any expression that can be
> >> indexed should be allowed to have stats? Would that be really difficult
> >> to implement in comparison to what you've already done with the patch so
> >> far?
> >>
> >
> > I would not over-complicate requirements for the first version of this,
> > I think it's already complicated enough.
>
> My thoughts exactly. I'm not willing to put more features into the
> initial version of the patch. Actually, I'm thinking about ripping out
> some experimental features (particularly "hashed MCV" and "associative
> rules").
>
>
That's fair, but I didn't really mean to imply that you should go work on
that too and that it should be part of this patch.
I was thinking more along the lines that I don't really agree with the
table name for the new stats, and that at some later date someone will
want to add expression stats, so we'd better come up with a design that
is friendly towards that. At this time I can only think that the name of
the table might not suit expression stats well; I'd hate to see someone
have to invent a 3rd table to support these when we could likely come up
with something that could be extended later and still make sense both
today and in the future.

I was just looking at how expression indexes are stored in pg_index, and
I see that for an expression index the expression is stored in the
indexprs column, which is of type pg_node_tree. So quite possibly at some
point in the future the new stats table could just have an extra column
added, and for today we'd just need to come up with a future-proof
name... Perhaps pg_statistic_ext or pg_statisticx, and name functions and
source files something along those lines instead?
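
For reference, that representation is easy to inspect from psql (a
standard catalog query, nothing patch-specific):

  -- List expression indexes and their stored expression trees,
  -- deparsed back into readable SQL by pg_get_expr().
  SELECT indexrelid::regclass AS index_name,
         pg_get_expr(indexprs, indrelid) AS expressions
    FROM pg_index
   WHERE indexprs IS NOT NULL;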

Regards

David Rowley


Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-10-29 Thread Tomas Vondra
On 29 October 2014, 12:31, Petr Jelinek wrote:
> On 29/10/14 10:41, David Rowley wrote:
>> On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra wrote:
>>> The last point is really just "unfinished implementation" - the
>>> syntax I propose is this:
>>>
>>> ALTER TABLE ... ADD STATISTICS (options) ON (columns)
>>>
>>> where the options influence the MCV list and histogram size, etc.
>>> The options are recognized and may give you an idea of what it might
>>> do, but it's not really used at the moment (except for storing in the
>>> pg_mv_statistic catalog).
>>
>> I've not really gotten around to looking at the patch yet, but I'm also
>> wondering if it would be simple to allow functional statistics
>> too. The pg_mv_statistic name seems to indicate multi columns, but how
>> about stats on date(datetime_column), or perhaps any non-volatile
>> function. This would help to solve the problem highlighted here:
>> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9uf34...@mail.gmail.com
>> Without giving it too much thought, perhaps any expression that can be
>> indexed should be allowed to have stats? Would that be really difficult
>> to implement in comparison to what you've already done with the patch so
>> far?
>>
>
> I would not over-complicate requirements for the first version of this,
> I think it's already complicated enough.

My thoughts exactly. I'm not willing to put more features into the
initial version of the patch. Actually, I'm thinking about ripping out
some experimental features (particularly "hashed MCV" and "associative
rules").

> Quick look at the patch suggests that it mainly needs discussion about
> design and particular implementation choices; there is a fair amount of
> TODOs and FIXMEs. I'd like to look at it too, but I doubt that I'll have
> time to do an in-depth review in this CF.

Yes. I think it's a bit premature to discuss the code thoroughly at this
point - I'd like to discuss the general approach to the feature (i.e.
minimizing the impact on those not using it, etc.).

The most interesting part of the code is probably the comments, which
explain the design in more detail, the known shortcomings, and possible
ways to address them.

regards
Tomas






Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-10-29 Thread Petr Jelinek

On 29/10/14 10:41, David Rowley wrote:

> On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra wrote:
> [...]
>
> I've not really gotten around to looking at the patch yet, but I'm also
> wondering if it would be simple to allow functional statistics too. The
> pg_mv_statistic name seems to indicate multi columns, but how about
> stats on date(datetime_column), or perhaps any non-volatile function.
> This would help to solve the problem highlighted here:
> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9uf34...@mail.gmail.com
> Without giving it too much thought, perhaps any expression that can be
> indexed should be allowed to have stats? Would that be really difficult
> to implement in comparison to what you've already done with the patch
> so far?



I would not over-complicate requirements for the first version of this, 
I think it's already complicated enough.


Quick look at the patch suggests that it mainly needs discussion about
design and particular implementation choices; there is a fair amount of
TODOs and FIXMEs. I'd like to look at it too, but I doubt that I'll have
time to do an in-depth review in this CF.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-10-29 Thread Tomas Vondra
On 29 October 2014, 10:41, David Rowley wrote:
>
> I've not really gotten around to looking at the patch yet, but I'm also
> wondering if it would be simple to allow functional statistics
> too.
> The pg_mv_statistic name seems to indicate multi columns, but how about
> stats on date(datetime_column), or perhaps any non-volatile function. This
> would help to solve the problem highlighted here:
> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9uf34...@mail.gmail.com
> Without giving it too much thought, perhaps any expression that can be
> indexed should be allowed to have stats? Would that be really difficult to
> implement in comparison to what you've already done with the patch so far?

I don't know, but it seems mostly orthogonal to what the patch aims to do.
If we add collecting statistics on expressions (on a single column), then I'd
expect it to be reasonably simple to add this to the multi-column case.

There are features like join stats or range type stats, that are probably
more directly related to the patch (but out of scope for the initial
version).

> I'm quite interested in reviewing your work on this, but it appears that
> some of your changes are not C89:
>
>  src\backend\commands\analyze.c(3774): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown
> size [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4302): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' : unknown
> size [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4775): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size
> [D:\Postgres\a\postgres.vcxproj]
>
> The compiler I'm using is a bit too stupid to understand the C99 syntax.
>
> I guess you'd need to palloc() these arrays instead in order to comply
> with
> the project standards.
>
> http://www.postgresql.org/docs/devel/static/install-requirements.html
>
> I'm going to sign myself up to review this, so probably my first feedback
> would be the compiling problem.

I'll look into that. The thing is I don't have access to MSVC, so it's a bit
difficult to spot / fix those issues :-(

regards
Tomas





Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-10-29 Thread David Rowley
On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra wrote:

> Hi,
>
> attached is a WIP patch implementing multivariate statistics. The code
> certainly is not "ready" - parts of it look as if written by a rogue
> chimp who got bored of attempts to type the complete works of William
> Shakespeare, and decided to try something different.
>
>
I'm really glad you're working on this. I had been thinking of looking into
doing this myself.


> The last point is really just "unfinished implementation" - the syntax I
> propose is this:
>
>ALTER TABLE ... ADD STATISTICS (options) ON (columns)
>
> where the options influence the MCV list and histogram size, etc. The
> options are recognized and may give you an idea of what it might do, but
> it's not really used at the moment (except for storing in the
> pg_mv_statistic catalog).
>
>
>
I've not really gotten around to looking at the patch yet, but I'm also
wondering if it would be simple to allow functional statistics too.
The pg_mv_statistic name seems to indicate multi columns, but how about
stats on date(datetime_column), or perhaps any non-volatile function. This
would help to solve the problem highlighted here:
http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9uf34...@mail.gmail.com
Without giving it too much thought, perhaps any expression that can be
indexed should be allowed to have stats? Would that be really difficult to
implement in comparison to what you've already done with the patch so far?
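
To make the idea concrete, I'm imagining something along these lines -
entirely hypothetical syntax, just extrapolating the ALTER TABLE command
proposed above to the expression case (names made up):

  -- Hypothetical only; neither the patch nor core implements this:
  ALTER TABLE events ADD STATISTICS (options) ON (date(created_at), region);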


I'm quite interested in reviewing your work on this, but it appears that
some of your changes are not C89:

 src\backend\commands\analyze.c(3774): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown
size [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(4302): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' : unknown
size [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(4775): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size
[D:\Postgres\a\postgres.vcxproj]

The compiler I'm using is a bit too stupid to understand the C99 syntax.

I guess you'd need to palloc() these arrays instead in order to comply with
the project standards.

http://www.postgresql.org/docs/devel/static/install-requirements.html

I'm going to sign myself up to review this, so probably my first feedback
would be the compiling problem.

Regards

David Rowley


Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-10-13 Thread Tomas Vondra
Hi!

On 13.10.2014 09:36, Albe Laurenz wrote:
> Tomas Vondra wrote:
>> attached is a WIP patch implementing multivariate statistics.
> 
> I think that is pretty useful.
> Oracle has an identical feature called "extended statistics".
> 
> That's probably an entirely different thing, but it would be very 
> nice to have statistics to estimate the correlation between columns 
> of different tables, to improve the estimate for the number of rows 
> in a join.

I don't have a clear idea of how that should work, but from a quick look
at how join selectivity estimation is implemented, I believe two things
might be possible:

 (a) using conditional probabilities

 Say we have a join "ta JOIN tb ON (ta.x = tb.y)"

 Currently, the selectivity is derived from stats on the two keys.
 Essentially probabilities P(x), P(y), represented by the MCV lists.
 But if there are additional WHERE conditions on the tables, and we
 have suitable multivariate stats, it's possible to use conditional
 probabilities.

 E.g. if the query actually uses

 ... ta JOIN tb ON (ta.x = tb.y) WHERE ta.z = 10

 and we have stats on (ta.x, ta.z), we can use P(x|z=10) instead.
 If the two columns are correlated, this might be much different.

 (b) using this for multi-column conditions

 If the join condition involves multiple columns, e.g.

 ON (ta.x = tb.y AND ta.p = tb.q)

 and we happen to have stats on (ta.x,ta.p) and (tb.y,tb.q), we may
 use this to compute the cardinality (pretty much as we do today).
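
To make (a) concrete, here's a toy example of how a correlated filter
can break the current estimate (schema and data entirely made up, and
the exact estimates will vary by version and statistics target):

  -- ta.x and ta.z are perfectly correlated; tb contains only y = 10.
  CREATE TABLE ta AS
    SELECT i % 100 AS x, i % 100 AS z
      FROM generate_series(1, 100000) i;
  CREATE TABLE tb AS
    SELECT 10 AS y FROM generate_series(1, 1000);
  ANALYZE ta;
  ANALYZE tb;

  -- The filter z = 10 pins x to the single value 10, so all ~1000
  -- surviving ta rows join to all 1000 tb rows (~1,000,000 rows).
  -- Multiplying independent estimates (filter selectivity times
  -- P(x = 10) = 1/100) underestimates the join by roughly 100x;
  -- stats on (ta.x, ta.z) would give P(x = 10 | z = 10) = 1.
  EXPLAIN ANALYZE
  SELECT * FROM ta JOIN tb ON ta.x = tb.y WHERE ta.z = 10;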

But I haven't really worked on this so far; I suspect there are various
subtle issues, and I certainly don't plan to address them in the first
phase of the patch.

Tomas




Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-10-13 Thread Albe Laurenz
Tomas Vondra wrote:
> attached is a WIP patch implementing multivariate statistics.

I think that is pretty useful.
Oracle has an identical feature called "extended statistics".

That's probably an entirely different thing, but it would be very
nice to have statistics to estimate the correlation between columns
of different tables, to improve the estimate for the number of rows
in a join.

Yours,
Laurenz Albe
